Vision Transformer (PyTorch) Code Reading Notes
Preface
Because Google Research's official Vision Transformer source code is written in TensorFlow, and I usually work with PyTorch, I used the PyTorch implementation by rwightman found on GitHub:
Vision Transformer introduction blog:
The code walkthrough below uses vit_base_patch16_224 (ViT-B/16: patch_size=16, img_size=224) as the example.
ViT Model
In the original paper, the model consists of three modules:
· Linear Projection of Flattened Patches
· Transformer Encoder
· MLP Head
These correspond to three modules in the code:
· patch embedding layer
· Block
· Representation layer + Classifier head
Linear Projection of Flattened Patches
As shown in the figure, the Linear Projection of Flattened Patches is implemented with a convolution whose kernel_size = stride = 16, followed by a flatten. It converts the 224×224×3 2D image into a 196×768 patch embedding. The code and comments are as follows:
"""
2D Image to Patch Embedding
"""
def__init__(lf, img_size=224, patch_size=16, in_c=3, embed_dim=768, norm_layer=None): super().__init__()
'''
image_size = (244,244)
patch_size = (16,16)
gird_size = (244/16,244/16)=(14,14)
num_patches = 14 * 14 = 196
'''
img_size =(img_size, img_size)
patch_size =(patch_size, patch_size)
lf.img_size = img_size
lf.patch_size = patch_size
lf.num_patches = lf.grid_size[0]* lf.grid_size[1]
'''
使⽤⼤⼩为16,stride为16的卷积核实现embeding,
输出14*14⼤⼩,通道为768(768 = 16*16*3,相当于将每个patch部分转换为1维向量)的patch '''
lf.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size) '''
如果norm_layer为true则使⽤layerNorm,这⾥作者没有使⽤,
所以lf.norm = nn.Identity(),对输⼊不做任何改变直接输出
word上标'''
< = norm_layer(embed_dim)if norm_layer el nn.Identity()
def forward(lf, x):
B, C, H, W = x.shape
asrt H == lf.img_size[0]and W == lf.img_size[1], \
f"Input image size ({H}*{W}) doesn't match model ({lf.img_size[0]}*{lf.img_size[1]})."
'''
lf.proj(x):[B,3,244,244]->[B,768,14,14]
flatten(2):[B,768,14,14]->[B,768,14*14]=[B,768,196]
transpo(1, 2):[B,768,196]->[B,196,768]
<(x)不对输⼊做处理直接输出
'''
x = lf.proj(x).flat1ten(2).transpo(1,2)
什么样的大树x = lf.norm(x)
return x
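To make the shape bookkeeping concrete, here is a small sketch (not part of the original code) that builds the same kernel_size = stride = 16 convolution directly and traces the [B,3,224,224] -> [B,196,768] path, assuming ViT-B/16 hyper-parameters:

import torch
import torch.nn as nn

# Minimal sketch, assuming ViT-B/16 settings (in_c=3, embed_dim=768, img_size=224).
proj = nn.Conv2d(3, 768, kernel_size=16, stride=16)

x = torch.randn(2, 3, 224, 224)        # dummy batch of 2 images
y = proj(x)                            # [2, 768, 14, 14]
y = y.flatten(2).transpose(1, 2)       # [2, 196, 768]
print(y.shape)                         # torch.Size([2, 196, 768])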
Transformer Encoder
The Transformer Encoder is composed of the Attention, MLP and DropPath code; its structure diagram is shown below:
Multi-Head Attention
For the structure diagram and a detailed introduction to Multi-Head Attention, see the blog post.
The Attention code and comments are as follows:
class Attention(nn.Module):
    def __init__(self,
                 dim,                  # dim of the input tokens, 768
                 num_heads=8,
                 qkv_bias=False,
                 qk_scale=None,
                 attn_drop_ratio=0.,
                 proj_drop_ratio=0.):
        super(Attention, self).__init__()
        '''
        num_heads = 12
        head_dim = 768 // 12 = 64  (the dk = dv = dmodel/h from "Attention Is All You Need")
        scale = 64 ** -0.5 = 1/8   (the 1/sqrt(dk) factor of Scaled Dot-Product Attention
                                    in the Attention(Q,K,V) formula from "Attention Is All You Need")
        qkv  : linearly maps the input to q, k, v
        proj : the final Linear of Multi-Head Attention in "Attention Is All You Need"
        '''
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop_ratio)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop_ratio)

    def forward(self, x):
        '''
        B = batch_size
        N = 197
        C = 768
        '''
        B, N, C = x.shape
        '''
        qkv(x)  : [B,197,768]   -> [B,197,768*3]
        reshape : [B,197,768*3] -> [B,197,3,12,64]  (3 stands for q/k/v, 12 heads, 64 dims per head)
        permute : [B,197,3,12,64] -> [3,B,12,197,64]
        '''
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        '''
        q, k, v = [B,12,197,64]
        '''
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)
        '''
        q @ k.transpose(-2, -1) : [B,12,197,64] @ [B,12,64,197] = [B,12,197,197]
        attn : [B,12,197,197]
        attn.softmax(dim=-1) applies softmax over the last dimension (i.e. each row)
        '''
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        '''
        attn @ v        : [B,12,197,197] @ [B,12,197,64] = [B,12,197,64]
        transpose(1, 2) : [B,197,12,64]
        reshape         : [B,197,768]
        '''
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
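As a quick sanity check (not from the original post), the Attention module above can be run on a dummy token sequence with the ViT-B/16 settings dim=768 and num_heads=12:

import torch

# Minimal sketch, assuming the Attention class defined above and ViT-B/16 settings.
attn = Attention(dim=768, num_heads=12, qkv_bias=True)
attn.eval()                          # disable dropout for a deterministic check

tokens = torch.randn(2, 197, 768)    # [B, N, C]: batch of 2, 196 patches + 1 class token
out = attn(tokens)
print(out.shape)                     # torch.Size([2, 197, 768]) -- the token shape is preserved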
MLP
The MLP structure and code are simple: fully connected layers plus an activation and dropout. The activation used here is GELU:
GELU(x) = 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])
The MLP module code is as follows:
class Mlp(nn.Module):
    """
    MLP as used in Vision Transformer, MLP-Mixer and related networks
    """
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
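As a quick numerical check (not from the original post), the tanh formula for GELU given above closely matches PyTorch's exact GELU:

import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
exact = F.gelu(x)                    # erf-based GELU, as computed by nn.GELU() by default
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
print(torch.allclose(exact, approx, atol=1e-2))   # True -- the two agree closely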
DropPath
In the Transformer Encoder, the code uses DropPath in place of the Dropout described in the paper. The code and comments are as follows:
def drop_path(x, drop_prob: float = 0., training: bool = False):
    '''
    x.shape : [B,197,768]
    '''
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    '''
    shape = [B,1,1]
    i.e. keep the first dimension of x and set all remaining dimensions to 1
    '''
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    '''
    generate a random tensor of this shape and add keep_prob to it
    '''
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    '''
    floor the random tensor: some entries become 0, the rest become 1
    '''
    random_tensor.floor_()  # binarize
    '''
    divide x by keep_prob and multiply by the random tensor:
    some samples become 0, the rest are kept (and rescaled)
    '''
    output = x.div(keep_prob) * random_tensor
    return output
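Before it is wrapped into the DropPath module below, a quick illustration (not from the original post) of what drop_path does: each sample in the batch is either zeroed out entirely or kept and rescaled by 1/keep_prob, so the expected value stays the same:

import torch

# Minimal sketch, assuming the drop_path function defined above.
torch.manual_seed(0)
x = torch.ones(8, 197, 768)              # dummy batch of 8 token sequences
out = drop_path(x, drop_prob=0.25, training=True)

print(out[:, 0, 0])                      # typically a mix of 0.0000 (dropped) and 1.3333 (= 1/0.75, kept)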
class DropPath(nn.Module):
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
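The post does not show the Block class that ties these pieces together, but based on the structure described above (LayerNorm, Attention, MLP and DropPath around two residual connections), a minimal sketch of one encoder block could look like this; the argument names and defaults are illustrative assumptions, not the exact timm code:

import torch.nn as nn

class Block(nn.Module):
    # Minimal sketch of one Transformer Encoder block, assuming the Attention, Mlp and
    # DropPath modules defined above; argument names and defaults are illustrative.
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4., drop_path_ratio=0.):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = Attention(dim, num_heads=num_heads, qkv_bias=True)
        self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio))

    def forward(self, x):
        # pre-norm residual branches: LayerNorm -> Attention / MLP -> DropPath -> add
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x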
MLP Head
The MLP Head code from the original source:
# Representation layer
if representation_size and not distilled:
    self.has_logits = True
    self.num_features = representation_size
    self.pre_logits = nn.Sequential(OrderedDict([
        ("fc", nn.Linear(embed_dim, representation_size)),
        ("act", nn.Tanh())
    ]))
else:
    self.has_logits = False
    self.pre_logits = nn.Identity()

# Classifier head(s)
self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
self.head_dist = None
if distilled:
    self.head_dist = nn.Linear(self.embed_dim, self.num_classes) if num_classes > 0 else nn.Identity()
This code is also straightforward, so it is not annotated in detail. In the code distilled = False, so:
self.pre_logits = nn.Sequential(nn.Linear(embed_dim, representation_size), nn.Tanh())
self.head = nn.Linear(self.num_features, num_classes)
MLPHead(x) = self.head(self.pre_logits(x[:, 0]))
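For illustration (not from the original post), a minimal sketch of this head applied to the encoder output, assuming ViT-B/16 dimensions (embed_dim = representation_size = 768) and a hypothetical num_classes = 1000:

import torch
import torch.nn as nn

# Minimal sketch; num_classes=1000 is an assumption for illustration.
embed_dim, representation_size, num_classes = 768, 768, 1000
pre_logits = nn.Sequential(nn.Linear(embed_dim, representation_size), nn.Tanh())
head = nn.Linear(representation_size, num_classes)

tokens = torch.randn(2, 197, embed_dim)   # encoder output: [B, 1 class token + 196 patches, C]
cls_token = tokens[:, 0]                  # only the class token is fed to the classifier
logits = head(pre_logits(cls_token))
print(logits.shape)                       # torch.Size([2, 1000])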