word2vec Code Walkthrough (2): Building the Dictionary and Generating Training Samples
Step 2: build the dictionary. Keep the 50,000 most frequent words, storing them in the dictionary in descending order of frequency: 'UNK' takes code 0, and the remaining 49,999 words take codes 1 through 49,999. Every other word is treated as unknown and mapped to 'UNK'.
# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)  # look up the word's code, 0 if unknown
        if index == 0:  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary
# Filling 4 global variables:
# data - list of codes (integers from 0 to vocabulary_size - 1).
#   This is the original text but words are replaced by their codes.
# count - list of (word, occurrence count) pairs for the most common words
# dictionary - map of words (strings) to their codes (integers)
# reversed_dictionary - maps codes (integers) to words (strings)
data, count, dictionary, reversed_dictionary = build_dataset(vocabulary, vocabulary_size)
del vocabulary  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reversed_dictionary[i] for i in data[:10]])
# The codes in data appear in the same order as the words in the original file, one-to-one.
Output:
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
Here, data holds the original file converted into a list of word codes;
count holds the 50,000 most frequent words and their frequencies, with count[0] being 'UNK' (all other words) and its frequency.
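To make the mapping concrete, here is a minimal sketch (the toy corpus is my own, not from the tutorial) that runs build_dataset with a vocabulary of size 3: the two most frequent words keep their own codes, and everything else collapses to 'UNK' (code 0).
toy_words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
toy_data, toy_count, toy_dict, toy_rev = build_dataset(toy_words, 3)
print(toy_count)  # [['UNK', 3], ('the', 3), ('cat', 2)]
print(toy_dict)   # {'UNK': 0, 'the': 1, 'cat': 2}
print(toy_data)   # [1, 2, 0, 0, 1, 0, 1, 2]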
Step 3: generate training samples using the Skip-Gram model (predicting the context from the target word). Here, batch_size is the size of each batch, num_skips is the number of samples generated per target word, and skip_window is the farthest distance from the target word that a context word may be drawn.
In the code, buffer is a queue with capacity span; each time, span words starting at data_index are read from data into buffer as its initial contents.
The outer loop generates the samples for one target word; buffer holds the target word together with all of its context words.
The inner loop writes that target word's samples into batch and labels, after which the sliding window moves one word to the right (see the small deque illustration below).
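Before the full function, a tiny illustration of that sliding-window behavior (my own example, not tutorial code): a deque with maxlen=span drops its oldest element whenever a new one is appended, which is exactly how the window advances one word at a time.
import collections
window = collections.deque(maxlen=3)  # span = 3 when skip_window = 1
window.extend([10, 20, 30])
window.append(40)                     # 10 is pushed out on the left
print(list(window))                   # [20, 30, 40]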
import random
import numpy as np

data_index = 0

# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)  # pylint: disable=redefined-builtin
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):  # one iteration per target word in the batch
        context_words = [w for w in range(span) if w != skip_window]
        # randomly draw num_skips context words from around the target word
        words_to_use = random.sample(context_words, num_skips)
        for j, context_word in enumerate(words_to_use):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        if data_index == len(data):
            buffer.extend(data[0:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reversed_dictionary[batch[i]], '->', labels[i, 0],
          reversed_dictionary[labels[i, 0]])
Output:
3081 originated -> 5234 anarchism
3081 originated -> 12 as
12 as -> 3081 originated
12 as -> 6 a
6 a -> 195 term
6 a -> 12 as
195 term -> 2 of
195 term -> 6 a
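As a quick follow-up sketch (my own addition, not from the tutorial), widening the window shows how each target word pairs with more distant neighbors; the exact pairs vary between runs because words_to_use is sampled randomly.
batch, labels = generate_batch(batch_size=8, num_skips=4, skip_window=2)
for i in range(8):
    print(batch[i], reversed_dictionary[batch[i]], '->', labels[i, 0],
          reversed_dictionary[labels[i, 0]])
# With num_skips=4, each target word appears 4 times in batch, paired with
# all 2 * skip_window = 4 surrounding context positions.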