A limitation of Kaiming He's MAE has been broken: combined with Swin Transformer, training is much faster
Ever since Kaiming He's MAE burst onto the scene, MIM (Masked Image Modeling) has attracted growing attention as a self-supervised pre-training approach. At the same time, researchers have had to confront its limitations. The MAE paper only experimented with the vanilla ViT architecture as the encoder; the better-performing hierarchical designs (represented by Swin Transformer) cannot directly use the MAE recipe. A wave of work on integrating the two soon followed. One representative effort is SimMIM, proposed by Tsinghua, Microsoft Research Asia, and Xi'an Jiaotong University, which explored applying Swin Transformer to MIM. Compared with MAE, however, it operates on both visible and masked patches, and its computational cost is excessive. Researchers found that even the base-size SimMIM model cannot be trained on a machine with eight 32GB GPUs.
Against this background, researchers from the University of Tokyo, SenseTime, and the University of Sydney offer a new approach:

Green Hierarchical Vision Transformer for Masked Image Modeling
Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Toshihiko Yamasaki
The University of Tokyo; SenseTime Research; The University of Sydney
The work not only integrates Swin Transformer into the MAE framework with task performance on par with SimMIM, but also keeps the computation efficient: it speeds up the training of hierarchical ViTs by 2.7x and cuts GPU memory usage by 70%. Let's take a look at what this research is about.

When hierarchical designs meet MAE

The paper proposes a green hierarchical vision Transformer for MIM; that is, it allows a hierarchical ViT to discard the masked patches and operate only on the visible ones.

[Figure: Method overview. Encoder: a green hierarchical ViT with group window attention (Stage 1 to Stage 4); decoder: an isotropic ViT trained with MSE reconstruction losses.]
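The masking step itself follows the MAE recipe: a large fraction of patches is dropped at random and the encoder only ever sees the remaining visible ones. A minimal PyTorch-style sketch of this visible-patch selection (an illustration under assumed shapes and names, not the authors' code):

import torch

def random_masking(x, mask_ratio=0.75):
    # x: (B, N, D) patch embeddings. Keep a random subset of patches and drop
    # the rest, so downstream blocks only process visible tokens.
    B, N, D = x.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                         # one random score per patch
    ids_keep = torch.argsort(noise, dim=1)[:, :num_keep]
    # Gather only the visible patches; masked patches are simply discarded.
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    # Binary mask (0 = visible, 1 = masked), used later for the reconstruction loss.
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0)
    return x_visible, mask, ids_keep

x = torch.randn(2, 196, 96)                          # e.g. 196 patches of dim 96
x_vis, mask, ids_keep = random_masking(x)            # x_vis: (2, 49, 96) at 75% masking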
Concretely, the implementation consists of two key components. First, the authors design a group window attention scheme based on a divide-and-conquer strategy: local windows containing different numbers of visible patches are gathered into several equal-sized groups, and masked self-attention is then performed within each group.

[Figure: Group attention scheme, with masked attention applied inside each group.]
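To make the idea concrete, here is a rough sketch of masked self-attention inside one such group, assuming PyTorch and assuming the mask simply restricts attention to patches that came from the same local window (the function name and shapes are illustrative, not the released implementation):

import torch
import torch.nn.functional as F

def grouped_masked_attention(q, k, v, window_ids):
    # q, k, v:    (B, G, D) -- visible patches from several local windows packed into one group
    # window_ids: (B, G)    -- index of the local window each packed patch came from
    d = q.shape[-1]
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5              # (B, G, G)
    # Mask out attention between patches that belong to different local windows.
    same_window = window_ids.unsqueeze(2) == window_ids.unsqueeze(1)
    attn = attn.masked_fill(~same_window, float('-inf'))
    return F.softmax(attn, dim=-1) @ v                       # (B, G, D)

# Example: a group of size 6 packed from windows with 3, 2 and 1 visible patches.
q = k = v = torch.randn(1, 6, 32)
window_ids = torch.tensor([[0, 0, 0, 1, 1, 2]])
out = grouped_masked_attention(q, k, v, window_ids)          # (1, 6, 32)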
Second, the grouping task above is cast as a constrained dynamic-programming problem, and, inspired by the greedy algorithm, a grouping algorithm is proposed.

Algorithm 1: Optimal Grouping
Require: the number of visible patches within each local window, {p_i}
1:  minimum computational cost c* ← +∞
2:  for g_s = max_i p_i to Σ_i p_i do
3:      remaining windows Ω ← {p_i}; partition Π ← ∅; number of groups n_g ← 0
4:      repeat
5:          π ← Knapsack(g_s, Ω), as in Equation (7)
6:          Π ← Π ∪ {π}; Ω ← Ω \ π
7:          n_g ← n_g + 1
8:      until Ω = ∅
9:      c ← Cost(g_s, Π), as in Equation (8)
10:     if c < c* then
11:         c* ← c; Π* ← Π; g_s* ← g_s
12:     end if
13: end for
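A simplified pure-Python sketch of this search loop follows; the greedy packing routine and the quadratic cost below are crude stand-ins for the paper's Knapsack step (Equation (7)) and cost model (Equation (8)), which are not reproduced in this article:

def greedy_pack(group_size, counts):
    # Greedily pick windows whose visible-patch counts fit into one group of
    # capacity group_size (a crude stand-in for the Knapsack(...) step).
    picked, used = [], 0
    for i, c in sorted(enumerate(counts), key=lambda t: -t[1]):
        if used + c <= group_size:
            picked.append(i)
            used += c
    return picked

def optimal_grouping(windows):
    # windows: number of visible patches in each local window.
    best = (float('inf'), None, None)            # (cost, group_size, partition)
    for group_size in range(max(windows), sum(windows) + 1):
        remaining = dict(enumerate(windows))
        partition = []
        while remaining:
            keys = list(remaining.keys())
            counts = list(remaining.values())
            group = [keys[j] for j in greedy_pack(group_size, counts)]
            partition.append(group)
            for j in group:
                del remaining[j]
        # Stand-in cost model: each group runs attention over group_size tokens.
        cost = len(partition) * group_size ** 2
        if cost < best[0]:
            best = (cost, group_size, partition)
    return best

print(optimal_grouping([3, 2, 1, 4]))            # e.g. (48, 4, [[3], [0, 2], [1]])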
Pre-training cost comparison on ImageNet-1K:

Method    Backbone  PT Epochs  PT Hours
MoCo v3   ViT-B     800        -
BEiT      ViT-B     800        -
MAE       ViT-B     1600       2069
SimMIM    Swin-B    800        1609
Ours      Swin-B    800        887

COCO detection and instance segmentation:

Method             Backbone  PT Ep.  PT Hours  FT Epochs  AP_b   AP_b50  AP_b75  AP_m   AP_m50  AP_m75
Training from scratch
Benchmarking [39]  ViT-B     0       0         400        48.9   -       -       43.6   -       -
Supervised pre-training
Benchmarking [39]  ViT-B     300     992       100        47.9   -       -       42.9   -       -
PVT [60]           PVT-L     300     -         36         44.5   66.0    48.3    40.7   63.4    43.7
Swin               Swin-B    300     840       36         48.5   69.8    53.2    43.2   66.9    46.7

Compared with SimMIM, this method needs far less training time and consumes much less GPU memory. Concretely, for the same number of pre-training epochs, it is about 2x faster and uses 60% less memory on Swin-B.
[Figure 4: The optimal group size at each stage. The figure for the fourth stage is omitted because there is only one local window in that stage, so grouping is not necessary. The simulation is repeated 100 times, of which the mean and standard deviation (the shaded regions) are reported.]

[Figure 1: Comparison with SimMIM in terms of efficiency (GPU hours per epoch and GPU memory in GB; legend: Ours (224), SimMIM (192), SimMIM (224)). All methods use a Swin-B/Swin-L backbone and a batch size of 2,048. The experiments of our method are conducted on a single machine with ...]