Vision Transformer (PyTorch) Code Reading Notes
Preface
Because Google Research's official Vision Transformer source code is written in TensorFlow, and I usually work with PyTorch, I used the PyTorch implementation by rwightman found on GitHub:
Vision Transformer introduction blog:
The code walkthrough below uses vit_base_patch16_224 (ViT-B/16: patch_size=16, img_size=224) as the example.
ViT Model
In the original paper, the model consists of three modules:
· Linear Projection of Flattened Patches
· Transformer Encoder
· MLP Head
These correspond to three modules in the code:
· patch embedding layer
· Block
· Representation layer + Classifier head
Linear Projection of Flattened Patches
As shown in the figure, the Linear Projection of Flattened Patches is implemented with a convolution whose kernel_size = stride = 16, followed by a flatten. It converts the 224×224×3 2D image into a 196×768 patch embedding. The code and comments are as follows:
"""
2D Image to Patch Embedding
"""
def__init__(lf, img_size=224, patch_size=16, in_c=3, embed_dim=768, norm_layer=None): super().__init__()
'''
image_size = (244,244)
patch_size = (16,16)
gird_size = (244/16,244/16)=(14,14)
num_patches = 14 * 14 = 196
'''
img_size =(img_size, img_size)
patch_size =(patch_size, patch_size)
lf.img_size = img_size
lf.patch_size = patch_size
lf.num_patches = lf.grid_size[0]* lf.grid_size[1]
'''
使⽤⼤⼩为16,stride为16的卷积核实现embeding,
输出14*14⼤⼩,通道为768(768 = 16*16*3,相当于将每个patch部分转换为1维向量)的patch '''
lf.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size) '''
如果norm_layer为true则使⽤layerNorm,这⾥作者没有使⽤,
所以lf.norm = nn.Identity(),对输⼊不做任何改变直接输出
word上标'''
< = norm_layer(embed_dim)if norm_layer el nn.Identity()
def forward(lf, x):
B, C, H, W = x.shape
asrt H == lf.img_size[0]and W == lf.img_size[1], \
f"Input image size ({H}*{W}) doesn't match model ({lf.img_size[0]}*{lf.img_size[1]})."
'''
lf.proj(x):[B,3,244,244]->[B,768,14,14]
flatten(2):[B,768,14,14]->[B,768,14*14]=[B,768,196]
transpo(1, 2):[B,768,196]->[B,196,768]
<(x)不对输⼊做处理直接输出
'''
x = lf.proj(x).flat1ten(2).transpo(1,2)
什么样的大树x = lf.norm(x)
return x
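To make the shape bookkeeping concrete, here is a small sketch (not part of the original code) that builds the same kernel_size = stride = 16 convolution directly and traces the [B,3,224,224] -> [B,196,768] path, assuming ViT-B/16 hyper-parameters:

import torch
import torch.nn as nn

# Minimal sketch, assuming ViT-B/16 settings (in_c=3, embed_dim=768, img_size=224).
proj = nn.Conv2d(3, 768, kernel_size=16, stride=16)

x = torch.randn(2, 3, 224, 224)        # dummy batch of 2 images
y = proj(x)                            # [2, 768, 14, 14]
y = y.flatten(2).transpose(1, 2)       # [2, 196, 768]
print(y.shape)                         # torch.Size([2, 196, 768])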
Transformer Encoder
The Transformer Encoder is composed of the Attention, MLP and DropPath code; its structure diagram is shown below:
Multi-Head Attention
For the structure diagram and a detailed introduction to Multi-Head Attention, see the blog post.
The Attention code and comments are as follows:
class Attention(nn.Module):
    def __init__(self,
                 dim,                  # dim of the input tokens, 768
                 num_heads=8,
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop_ratio=0.,
                 proj_drop_ratio=0.):
        super(Attention, self).__init__()
        '''
        num_heads = 12
        head_dim = 768 // 12 = 64  (the dk = dv = dmodel/h from "Attention Is All You Need")
        scale = 64 ** -0.5 = 1/8   (the 1/sqrt(dk) factor of Scaled Dot-Product Attention
                                    in the Attention(Q,K,V) formula from "Attention Is All You Need")
        qkv  : linearly maps the input to q, k, v
        proj : the final Linear of Multi-Head Attention in "Attention Is All You Need"
        '''
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop_ratio)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop_ratio)

    def forward(self, x):
        '''
        B = batch_size
        N = 197
        C = 768
        '''
        B, N, C = x.shape
        '''
        qkv(x)  : [B,197,768]   -> [B,197,768*3]
        reshape : [B,197,768*3] -> [B,197,3,12,64]  (3 stands for q/k/v, 12 heads, 64 dims per head)
        permute : [B,197,3,12,64] -> [3,B,12,197,64]
        '''
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        '''
        q, k, v = [B,12,197,64]
        '''
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)
        '''
        q @ k.transpose(-2, -1) : [B,12,197,64] @ [B,12,64,197] = [B,12,197,197]
        attn : [B,12,197,197]
        attn.softmax(dim=-1) applies softmax over the last dimension (i.e. each row)
        '''
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        '''
        attn @ v        : [B,12,197,197] @ [B,12,197,64] = [B,12,197,64]
        transpose(1, 2) : [B,197,12,64]
        reshape         : [B,197,768]
        '''
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
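As a quick sanity check (not from the original post), the Attention module above can be run on a dummy token sequence with the ViT-B/16 settings dim=768 and num_heads=12:

import torch

# Minimal sketch, assuming the Attention class defined above and ViT-B/16 settings.
attn = Attention(dim=768, num_heads=12, qkv_bias=True)
attn.eval()                          # disable dropout for a deterministic check

tokens = torch.randn(2, 197, 768)    # [B, N, C]: batch of 2, 196 patches + 1 class token
out = attn(tokens)
print(out.shape)                     # torch.Size([2, 197, 768]) -- the token shape is preserved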
MLP
The MLP structure and code are simple: fully connected layers plus an activation and dropout. The activation used here is GELU:
GELU(x) = 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])
The MLP module code is as follows:
class Mlp(nn.Module):
    """
    MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
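As a quick numerical check (not from the original post), the tanh formula for GELU given above closely matches PyTorch's exact GELU:

import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
exact = F.gelu(x)                    # erf-based GELU, as computed by nn.GELU() by default
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
print(torch.allclose(exact, approx, atol=1e-2))   # True -- the two agree closely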
DropPath
In the Transformer Encoder, the code uses DropPath in place of the Dropout described in the paper. The code and comments are as follows:
def drop_path(x, drop_prob: float = 0., training: bool = False):
    '''
    x.shape : [B,197,768]
    '''
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    '''
    shape = [B,1,1]
    i.e. keep the first dimension of x and set all remaining dimensions to 1
    '''
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    '''
    generate a random tensor of this shape and add keep_prob to it
    '''
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    '''
    floor the random tensor: some entries become 0, the rest become 1
    '''
    random_tensor.floor_()  # binarize
    '''
    divide x by keep_prob and multiply by the random tensor:
    some samples become 0, the rest are kept (and rescaled)
    '''
    output = x.div(keep_prob) * random_tensor
    return output
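Before it is wrapped into the DropPath module below, a quick illustration (not from the original post) of what drop_path does: each sample in the batch is either zeroed out entirely or kept and rescaled by 1/keep_prob, so the expected value stays the same:

import torch

# Minimal sketch, assuming the drop_path function defined above.
torch.manual_seed(0)
x = torch.ones(8, 197, 768)              # dummy batch of 8 token sequences
out = drop_path(x, drop_prob=0.25, training=True)

print(out[:, 0, 0])                      # typically a mix of 0.0000 (dropped) and 1.3333 (= 1/0.75, kept)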
class DropPath(nn.Module):
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
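The post does not show the Block class that ties these pieces together, but based on the structure described above (LayerNorm, Attention, MLP and DropPath around two residual connections), a minimal sketch of one encoder block could look like this; the argument names and defaults are illustrative assumptions, not the exact timm code:

import torch.nn as nn

class Block(nn.Module):
    # Minimal sketch of one Transformer Encoder block, assuming the Attention, Mlp and
    # DropPath modules defined above; argument names and defaults are illustrative.
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4., drop_path_ratio=0.):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = Attention(dim, num_heads=num_heads, qkv_bias=True)
        self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio))

    def forward(self, x):
        # pre-norm residual branches: LayerNorm -> Attention / MLP -> DropPath -> add
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x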
MLP Head
The MLP Head code from the original source:
# Representation layer
if representation_size and not distilled:
    self.has_logits = True
    self.num_features = representation_size
    self.pre_logits = nn.Sequential(OrderedDict([
        ("fc", nn.Linear(embed_dim, representation_size)),
        ("act", nn.Tanh())
    ]))
else:
    self.has_logits = False
    self.pre_logits = nn.Identity()

# Classifier head(s)
self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
self.head_dist = None
if distilled:
    self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if num_classes > 0 else nn.Identity()
This code is also straightforward, so it is not annotated in detail. In the code distilled = False, so:
self.pre_logits = nn.Sequential(nn.Linear(embed_dim, representation_size), nn.Tanh())
self.head = nn.Linear(self.num_features, num_classes)
MLPHead(x) = self.head(self.pre_logits(x[:, 0]))
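For illustration (not from the original post), a minimal sketch of this head applied to the encoder output, assuming ViT-B/16 dimensions (embed_dim = representation_size = 768) and a hypothetical num_classes = 1000:

import torch
import torch.nn as nn

# Minimal sketch; num_classes=1000 is an assumption for illustration.
embed_dim, representation_size, num_classes = 768, 768, 1000
pre_logits = nn.Sequential(nn.Linear(embed_dim, representation_size), nn.Tanh())
head = nn.Linear(representation_size, num_classes)

tokens = torch.randn(2, 197, embed_dim)   # encoder output: [B, 1 class token + 196 patches, C]
cls_token = tokens[:, 0]                  # only the class token is fed to the classifier
logits = head(pre_logits(cls_token))
print(logits.shape)                       # torch.Size([2, 1000])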