word2vec Code Walkthrough (2): Building the Dictionary and Generating Training Samples
Step 2: build the dictionary. Keep the 50,000 most frequent words, storing them in the dictionary in descending order of frequency: 'UNK' takes code 0, and the remaining 49,999 words take codes 1 through 49,999. Every other word is treated as unknown and mapped to 'UNK'.
# Step 2: Build the dictionary and replace rare words with UNK token.
import collections

vocabulary_size = 50000

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)  # look up the word's code, 0 if unknown
        if index == 0:  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary
# Filling 4 global variables:
# data - list of codes (integers from 0 to vocabulary_size - 1).
#   This is the original text but words are replaced by their codes.
# count - list of (word, occurrence count) pairs for the most common words
# dictionary - map of words (strings) to their codes (integers)
# reversed_dictionary - maps codes (integers) to words (strings)
data, count, dictionary, reversed_dictionary = build_dataset(vocabulary, vocabulary_size)
del vocabulary  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reversed_dictionary[i] for i in data[:10]])
# The codes in data appear in the same order as the words in the original file, one-to-one.
Output:
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
Here, data holds the original file converted into a list of word codes;
count holds the 50,000 most frequent words and their frequencies, with count[0] being 'UNK' (all other words) and its frequency.
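To make the mapping concrete, here is a minimal sketch (the toy corpus is my own, not from the tutorial) that runs build_dataset with a vocabulary of size 3: the two most frequent words keep their own codes, and everything else collapses to 'UNK' (code 0).
toy_words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
toy_data, toy_count, toy_dict, toy_rev = build_dataset(toy_words, 3)
print(toy_count)  # [['UNK', 3], ('the', 3), ('cat', 2)]
print(toy_dict)   # {'UNK': 0, 'the': 1, 'cat': 2}
print(toy_data)   # [1, 2, 0, 0, 1, 0, 1, 2]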
Step 3: generate training samples using the Skip-Gram model (predicting the context from the target word). Here, batch_size is the size of each batch, num_skips is the number of samples generated per target word, and skip_window is the farthest distance from the target word that a context word may be drawn.
In the code, buffer is a queue with capacity span; each time, span words starting at data_index are read from data into buffer as its initial contents.
The outer loop generates the samples for one target word; buffer holds the target word together with all of its context words.
The inner loop writes that target word's samples into batch and labels, after which the sliding window moves one word to the right (see the small deque illustration below).
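Before the full function, a tiny illustration of that sliding-window behavior (my own example, not tutorial code): a deque with maxlen=span drops its oldest element whenever a new one is appended, which is exactly how the window advances one word at a time.
import collections
window = collections.deque(maxlen=3)  # span = 3 when skip_window = 1
window.extend([10, 20, 30])
window.append(40)                     # 10 is pushed out on the left
print(list(window))                   # [20, 30, 40]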
import random
import numpy as np

data_index = 0

# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)  # pylint: disable=redefined-builtin
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):  # one iteration per target word in the batch
        context_words = [w for w in range(span) if w != skip_window]
        # randomly draw num_skips context words from around the target word
        words_to_use = random.sample(context_words, num_skips)
        for j, context_word in enumerate(words_to_use):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        if data_index == len(data):
            buffer.extend(data[0:span])
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reversed_dictionary[batch[i]], '->', labels[i, 0],
          reversed_dictionary[labels[i, 0]])
Output:
3081 originated -> 5234 anarchism
3081 originated -> 12 as
12 as -> 3081 originated
12 as -> 6 a
6 a -> 195 term
6 a -> 12 as
195 term -> 2 of
195 term -> 6 a
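As a quick follow-up sketch (my own addition, not from the tutorial), widening the window shows how each target word pairs with more distant neighbors; the exact pairs vary between runs because words_to_use is sampled randomly.
batch, labels = generate_batch(batch_size=8, num_skips=4, skip_window=2)
for i in range(8):
    print(batch[i], reversed_dictionary[batch[i]], '->', labels[i, 0],
          reversed_dictionary[labels[i, 0]])
# With num_skips=4, each target word appears 4 times in batch, paired with
# all 2 * skip_window = 4 surrounding context positions.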