赞
踩
目录
2013 年,Google 团队发表了 word2vec 工具。word2vec 工具主要包含两个模型:跳字模型(skip-gram)和连续词模型(continuous bag of words,简称 CBOW),以及两种高效训练的方法:负采样(negative sampling)和层序 softmax(hierarchical softmax)。
类似于f(x)->y,Word2vec 的最终目的,不是要把 f 训练得多么完美,而是只关心模型训练完后的副产物——模型参数(这里特指神经网络的权重),并将这些参数,作为输入 x 的某种向量化的表示,这个向量便叫做——词向量。
word2vec 词向量可以较好地表达不同词之间的相似度和类比关系。
处理前:
处理后:
代码如下:
- import re
- import jieba
-
- stopwords = {}
- fstop = open('stop_words.txt', 'r', encoding='utf-8', errors='ingnore')
- for eachWord in fstop:
- stopwords[eachWord.strip()] = eachWord.strip() # 创建停用词典
- fstop.close()
-
- f1 = open('红楼梦.txt', 'r', encoding='utf-8', errors='ignore')
- f2 = open('红楼梦_p.txt', 'w', encoding='utf-8')
-
- line = f1.readline()
- while line:
- line = line.strip() # 去前后的空格
- if line.isspace(): # 跳过空行
- line = f1.readline()
-
- line = re.findall('[\u4e00-\u9fa5]+', line) # 去除标点符号
- line = "".join(line)
-
- seg_list = jieba.cut(line, cut_all=False) # 结巴分词
-
- outStr = ""
- for word in seg_list:
- if word not in stopwords: # 去除停用词
- outStr += word
- outStr += " "
-
- if outStr: # 不为空添加换行符
- outStr = outStr.strip() + '\n'
-
- f2.writelines(outStr)
- line = f1.readline()
-
- f1.close()
- f2.close()

测试结果:
这里 min_count=1 也就是不剔除低频词,窗口大小设定为2,负样本数量 k 设定为3。
代码如下:
- import math
- import numpy
- from collections import deque
- from numpy import random
-
- numpy.random.seed(6)
-
-
- class InputData:
-
- def __init__(self, file_name, min_count):
- self.input_file_name = file_name
- self.get_words(min_count)
- self.word_pair_catch = deque() # deque为队列,用来读取数据
- self.init_sample_table() # 采样表
- print('Word Count: %d' % len(self.word2id))
- print("Sentence_Count:", self.sentence_count)
- print("Sentence_Length:", self.sentence_length)
-
- def get_words(self, min_count): # 剔除低频词,生成id到word、word到id的映射
- self.input_file = open(self.input_file_name, encoding="utf-8")
- self.sentence_length = 0
- self.sentence_count = 0
- word_frequency = dict()
- for line in self.input_file:
- self.sentence_count += 1
- line = line.strip().split(' ') # strip()去除首尾空格,split(' ')按空格划分词
- self.sentence_length += len(line)
- for w in line:
- try:
- word_frequency[w] += 1
- except:
- word_frequency[w] = 1
- self.word2id = dict()
- self.id2word = dict()
- wid = 0
- self.word_frequency = dict()
- for w, c in word_fre

Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。