Combining Self-Attention and CNNs: ACmix, On the Integration of Self-Attention and Convolution

Updated: 2023-07-08 17:32:17

Figure 1. A sketch of ACmix. We explore a closer relationship between convolution and self-attention in the sense of sharing the same computation overhead (1×1 convolutions), and combining with the remaining lightweight aggregation operations. We show the computation complexity of each block with regard to the feature channel.
This paper shows that there is a strong underlying relation between the two paradigms: most of the computation in both is in fact done with the same operations.
Specifically, the paper first shows that a traditional convolution with kernel size k×k can be decomposed into k² individual 1×1 convolutions followed by shift and summation operations. The projections of query, key and value in a self-attention module can likewise be interpreted as multiple 1×1 convolutions, followed by computing the attention weights and aggregating the values. The first stage of both modules therefore consists of similar operations. More importantly, compared with the second stage, the first stage dominates the computation cost (quadratic in the channel size). This observation naturally leads to an elegant integration of the two seemingly distinct paradigms: the hybrid model ACmix enjoys the benefits of both self-attention and convolution, with minimal computational overhead compared with its pure convolution or pure self-attention counterparts.
Extensive experiments show that the model achieves consistently improved results over competitive baselines on image recognition and downstream tasks.
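The complexity claim above can be illustrated with a back-of-the-envelope count of multiply-adds. This is purely illustrative (the exact accounting in the paper differs, and the function name is ours):

```python
def stage_flops(c_in, c_out, h, w, k):
    """Rough multiply-add counts for a k x k convolution written in the
    decomposed form: Stage I runs k^2 1x1 convolutions (quadratic in the
    channel sizes), Stage II only shifts and sums (linear in channels)."""
    stage1 = k * k * h * w * c_in * c_out   # 1x1 convolution projections
    stage2 = k * k * h * w * c_out          # shift-and-sum aggregation
    return stage1, stage2

s1, s2 = stage_flops(c_in=256, c_out=256, h=56, w=56, k=3)
print(s1 // s2)  # Stage I is larger by a factor of c_in = 256
```

For a typical ResNet-scale layer the projection stage is hundreds of times more expensive than the aggregation stage, which is why sharing it between the two paradigms is nearly free.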
1. Introduction
Recent years have witnessed the vast development of convolution and self-attention in computer vision. Convolutional neural networks (CNNs) are widely adopted on image recognition [20, 24], semantic segmentation [9] and object detection [39], and achieve state-of-the-art performances on various benchmarks. On the other hand, self-attention is first introduced in natural language processing [1, 43], and also shows great potential in the fields of image generation and super-resolution [10, 35]. More recently, with the advent of vision transformers [7, 16, 38], attention-based modules have achieved comparable or even better performances than their CNN counterparts on many vision tasks.
A quick note on the development of the two core techniques: convolution dominates recognition, segmentation and detection benchmarks, while self-attention, imported from NLP, has shown great potential in image generation and super-resolution and, through vision transformers, now rivals or surpasses CNNs on many vision tasks.
Despite the great success that both approaches have achieved, convolution and self-attention modules usually follow different design paradigms. Traditional convolution leverages an aggregation function over a localized receptive field according to the convolution filter weights, which are shared in the whole feature map. The intrinsic characteristics impose crucial inductive bias for image processing. Comparably, the self-attention module applies a weighted average operation based on the context of input features, where the attention weights are computed dynamically via a similarity function between related pixel pairs. The flexibility enables the attention module to focus on different regions adaptively and capture more informative features.
The essential characteristics of the two techniques, in brief: convolution aggregates over a localized receptive field with filter weights shared across the whole feature map, an intrinsic inductive bias that benefits image processing; self-attention instead computes a weighted average whose weights are derived dynamically from pairwise similarity, letting it adaptively focus on different regions and capture more informative features.
Considering the different and complementary properties of convolution and self-attention, there exists a potential possibility to benefit from both paradigms by integrating these modules. Previous work has explored the combination of self-attention and convolution from several different perspectives.
Research from early stages, e.g., SENet [23], CBAM [47], shows that the self-attention mechanism can serve as an augmentation for convolution modules.
More recently, self-attention modules are proposed as individual blocks to substitute traditional convolutions in CNN models, e.g., SAN [54], BoTNet [41].
Another line of research focuses on combining self-attention and convolution in a single block, e.g., AA-ResNet [3], Container [17], while the architecture is limited in designing independent paths for each module.
Therefore, existing approaches still treat self-attention and convolution as distinct parts, and the underlying relations between them have not been fully exploited.
To summarize, prior work on combining the two falls into three lines:

1. Self-attention as an augmentation of convolution modules, e.g., SENet, CBAM;
2. Self-attention modules as standalone blocks substituting traditional convolutions in CNN models, e.g., SAN, BoTNet;
3. Self-attention and convolution combined in a single block, e.g., AA-ResNet, Container, where the architecture is limited to designing independent paths for each module.

SAN: Exploring self-attention for image recognition. CVPR, 2020.
BoTNet: Bottleneck transformers for visual recognition. CVPR, 2021.
AA-ResNet: Attention augmented convolutional networks. CVPR, 2019.
Container: Context aggregation network. 2021.

In short, existing approaches still treat self-attention and convolution as two distinct parts, and the intrinsic relation between them has not been fully exploited.
In this paper, we seek to unearth a closer relationship between self-attention and convolution. By decomposing the operations of the two modules, we show that they heavily rely on the same 1×1 convolution operations. Based on this observation, we develop a mixed model, named ACmix, and integrate self-attention and convolution elegantly with minimum computational overhead. Specifically, we first project the input feature maps with 1×1 convolutions and obtain a rich set of intermediate features. Then, the intermediate features are reused and aggregated following different paradigms, i.e., in self-attention and convolution manners respectively. In this way, ACmix enjoys the benefit of both modules, and effectively avoids conducting expensive projection operations twice.
In short: the paper reveals a closer relationship between self-attention and convolution.
Decomposing the operations of the two modules shows that both rely heavily on the same 1×1 convolutions.
Based on this observation, the authors build a hybrid model, ACmix, which integrates self-attention and convolution elegantly at minimal computational overhead.
Concretely, the input feature maps are first projected with 1×1 convolutions to obtain a rich set of intermediate features; these are then reused and aggregated following the two different paradigms, self-attention and convolution.
ACmix thus enjoys the benefits of both modules while avoiding performing the expensive projection operations twice.
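The two-path structure can be sketched in NumPy. This is a deliberately simplified, single-head illustration, not the paper's implementation: the paper mixes the projections with a light fully connected layer before the convolution path, whereas here each kernel offset gets one learned scalar weight; all names (`acmix_block`, `w_off`) are ours:

```python
import numpy as np

def shift(x, dx, dy):
    """x[..., i, j] -> x[..., i + dx, j + dy], zeros outside the map."""
    out = np.zeros_like(x)
    H, W = x.shape[-2:]
    i0, i1 = max(-dx, 0), min(H - dx, H)
    j0, j1 = max(-dy, 0), min(W - dy, W)
    out[..., i0:i1, j0:j1] = x[..., i0 + dx:i1 + dx, j0 + dy:j1 + dy]
    return out

def acmix_block(F, Wq, Wk, Wv, w_off, alpha=1.0, beta=1.0, win=3):
    """Structural sketch of ACmix: Stage I projects F once with 1x1
    convolutions; Stage II reuses the projections in an attention path
    and a convolution path, mixed by learned scalars alpha and beta."""
    C, H, W = F.shape
    d = Wq.shape[0]
    pad = win // 2
    # Stage I: shared 1x1 projections, the dominant (quadratic) cost.
    q = np.einsum('dc,chw->dhw', Wq, F)
    k = np.einsum('dc,chw->dhw', Wk, F)
    v = np.einsum('dc,chw->dhw', Wv, F)
    # Stage II, attention path: local window attention (lightweight).
    kp = np.pad(k, ((0, 0), (pad, pad), (pad, pad)))
    vp = np.pad(v, ((0, 0), (pad, pad), (pad, pad)))
    att = np.zeros_like(v)
    for i in range(H):
        for j in range(W):
            keys = kp[:, i:i + win, j:j + win].reshape(d, -1)
            vals = vp[:, i:i + win, j:j + win].reshape(d, -1)
            logits = q[:, i, j] @ keys / np.sqrt(d)
            wgt = np.exp(logits - logits.max())
            att[:, i, j] = vals @ (wgt / wgt.sum())
    # Stage II, convolution path: shift-and-sum of the value features,
    # one scalar weight per kernel offset (a stand-in for the paper's FC).
    conv = np.zeros_like(v)
    for idx in range(win * win):
        p, r = divmod(idx, win)
        conv += w_off[idx] * shift(v, p - pad, r - pad)
    return alpha * att + beta * conv
```

Both paths consume the same `q`, `k`, `v` tensors, so the expensive projections are computed exactly once.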
Consider a standard convolution with kernel K ∈ R^(C_out×C_in×k×k), where k is the kernel size and C_in, C_out are the input and output channel sizes. Given tensors F ∈ R^(C_in×H×W) and G ∈ R^(C_out×H×W) as the input and output feature maps, where H, W denote the height and width, we denote f_ij ∈ R^(C_in) and g_ij ∈ R^(C_out) as the feature tensors of pixel (i, j) in F and G respectively. The standard convolution can then be written as

g_ij = Σ_{p,q} K_{p,q} f_{i+p−⌊k/2⌋, j+q−⌊k/2⌋},    (1)

where K_{p,q} ∈ R^(C_out×C_in), p, q ∈ {0, …, k−1}, represents the kernel weights with regard to the indices of the kernel position (p, q).
To further simplify the formulation, we define the Shift operation, f̃ = Shift(f, Δx, Δy), as f̃_{i,j} = f_{i+Δx, j+Δy} for every position (i, j), where Δx, Δy are the horizontal and vertical displacements. Eq. (1) can then be rewritten in two stages: Stage I computes k² individual 1×1 convolutions, g̃^(p,q)_ij = K_{p,q} f_ij; Stage II shifts and sums them, g_ij = Σ_{p,q} Shift(g̃^(p,q), p−⌊k/2⌋, q−⌊k/2⌋)_{ij}.
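The Shift-based decomposition can be checked numerically. A minimal NumPy sketch (all function names are ours, not the paper's) comparing a direct k×k convolution against k² 1×1 convolutions followed by shift-and-sum, both with zero padding:

```python
import numpy as np

def shift(x, dx, dy):
    """Shift(f, dx, dy)[i, j] = f[i + dx, j + dy], zeros outside the map."""
    out = np.zeros_like(x)
    H, W = x.shape[-2:]
    i0, i1 = max(-dx, 0), min(H - dx, H)   # rows i with 0 <= i + dx < H
    j0, j1 = max(-dy, 0), min(W - dy, W)
    out[..., i0:i1, j0:j1] = x[..., i0 + dx:i1 + dx, j0 + dy:j1 + dy]
    return out

def conv_direct(F, K):
    """Naive k x k convolution, Eq. (1): F (C_in, H, W), K (C_out, C_in, k, k)."""
    C_out, C_in, k, _ = K.shape
    _, H, W = F.shape
    pad = k // 2
    Fp = np.pad(F, ((0, 0), (pad, pad), (pad, pad)))
    G = np.zeros((C_out, H, W))
    for i in range(H):
        for j in range(W):
            patch = Fp[:, i:i + k, j:j + k]
            G[:, i, j] = np.tensordot(K, patch, axes=([1, 2, 3], [0, 1, 2]))
    return G

def conv_decomposed(F, K):
    """Stage I: k^2 individual 1x1 convolutions; Stage II: shift and sum."""
    C_out, C_in, k, _ = K.shape
    pad = k // 2
    G = np.zeros((C_out,) + F.shape[1:])
    for p in range(k):
        for q in range(k):
            g_pq = np.einsum('oc,chw->ohw', K[:, :, p, q], F)  # 1x1 conv
            G += shift(g_pq, p - pad, q - pad)
    return G
```

The two functions agree exactly for any kernel size, which is the identity the two-stage rewrite relies on.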
Consider a standard self-attention module with N heads. Let F ∈ R^(C_in×H×W) and G ∈ R^(C_out×H×W) denote the input and output features, and let f_ij, g_ij denote the corresponding tensors of pixel (i, j). Then, the output of the attention module is computed as

g_ij = ||_{l=1..N} Σ_{a,b ∈ N_k(i,j)} A(W_q^(l) f_ij, W_k^(l) f_ab) W_v^(l) f_ab,

where || is the concatenation of the outputs of the N heads, W_q^(l), W_k^(l), W_v^(l) are the projection matrices of queries, keys and values, N_k(i,j) denotes a local k×k pixel region centered at (i, j), and A(q_ij, k_ab) = softmax_{N_k(i,j)}(q_ij^T k_ab / √d) is the attention weight, with d the feature dimension of q_ij.
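This local multi-head attention can be sketched directly in NumPy. The function name and the zero-padded border handling are our simplifications, not the paper's exact formulation:

```python
import numpy as np

def multihead_local_attention(F, Wq, Wk, Wv, k=3):
    """Local self-attention: each pixel (i, j) attends over its k x k
    neighborhood N_k(i, j); the N heads are concatenated channel-wise.
    F: (C_in, H, W); Wq, Wk, Wv: (N, d, C_in). Output: (N*d, H, W)."""
    C_in, H, W = F.shape
    N, d, _ = Wq.shape
    pad = k // 2
    heads = []
    for l in range(N):
        q = np.einsum('dc,chw->dhw', Wq[l], F)   # queries: 1x1 conv
        kf = np.einsum('dc,chw->dhw', Wk[l], F)  # keys:    1x1 conv
        v = np.einsum('dc,chw->dhw', Wv[l], F)   # values:  1x1 conv
        kp = np.pad(kf, ((0, 0), (pad, pad), (pad, pad)))
        vp = np.pad(v, ((0, 0), (pad, pad), (pad, pad)))
        out = np.zeros_like(v)
        for i in range(H):
            for j in range(W):
                keys = kp[:, i:i + k, j:j + k].reshape(d, -1)
                vals = vp[:, i:i + k, j:j + k].reshape(d, -1)
                logits = q[:, i, j] @ keys / np.sqrt(d)
                w = np.exp(logits - logits.max())  # stable softmax
                out[:, i, j] = vals @ (w / w.sum())
        heads.append(out)
    return np.concatenate(heads, axis=0)  # || over the N heads
```

Note that the projections are exactly the 1×1 convolutions of the first stage; only the per-window softmax and aggregation differ from the convolution case.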

Permalink: https://www.wtabcd.cn/fanwen/fan/89/1073267.html
