[AI Learning] [Part 4] Text Preprocessing

Author: 酷酷是懒虫 | 2024-07-13 18:55:50


Text Preprocessing

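Text is sequential data: an article can be viewed as a sequence of words or characters. Unlike images, whose input has a fixed pixel size, text varies in length, so preprocessing is a necessary step before any text processing. Preprocessing usually consists of the following steps: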

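Read the text
Split the text into tokens
Build a dictionary that maps each word to an index
Convert each text sequence into a sequence of indices (at this point the text has become numeric data, which can then be padded or truncated to a fixed length)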

This is only a rough overall outline; in practice the processing has to be adapted to the task at hand.

Reading the Text

The simplest example
Step 1: Read the text (lowercasing it and filtering out every character other than a-z along the way)

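import collections
import re

# Read the file. While reading, lowercase the English text and use a regular
# expression to replace every run of characters other than a-z with a space.
# Returns a one-dimensional list holding one cleaned string per line of the file.
def readtext():
    with open('/home/usr/novel.txt') as f:
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines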

Print the first element of lines:

lines = readtext()
print(lines[0])
the time machine by h g wells
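
To see what the regular expression does to a single line, here is a small sketch. The sample string is an assumption chosen for illustration (it is not read from the novel file), and `clean_line` is a hypothetical helper name.

```python
import re

def clean_line(line):
    # Lowercase the line, then replace every run of characters
    # outside a-z with a single space (hypothetical helper).
    return re.sub('[^a-z]+', ' ', line.strip().lower())

sample = "The Time Machine, by H. G. Wells [1898]"  # assumed sample input
cleaned = clean_line(sample)
print(repr(cleaned))  # 'the time machine by h g wells '
```

Note the trailing space: the final run of non-letter characters is also replaced by a space, which is where the empty strings seen after splitting come from.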

Step 2: Tokenization


# Split every line on single spaces; returns one list of tokens per line
def tokenize(lines):
    return [line.split(' ') for line in lines]

Print the return value:

tokens = tokenize(lines)
print(tokens[0:3])
[['the', 'time', 'machine', 'by', 'h', 'g', 'wells', ''],
['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him', ''],
['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']]

The return value is a two-dimensional list; each row holds the tokens of one original line.
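
The empty strings in the output come from `split(' ')`, which treats every single space as a separator. One possible alternative, shown here as a sketch rather than as the method used above, is to call `split()` with no argument, which splits on runs of whitespace and never yields empty tokens. The name `tokenize_ws` and the sample lines are assumptions.

```python
def tokenize_ws(lines):
    # str.split() with no argument splits on any run of whitespace
    # and drops leading/trailing whitespace, so no '' tokens appear.
    return [line.split() for line in lines]

lines = ['the time machine by h g wells ', '']  # assumed sample lines
with_empty = [line.split(' ') for line in lines]
without_empty = tokenize_ws(lines)
print(with_empty)     # [['the', 'time', 'machine', 'by', 'h', 'g', 'wells', ''], ['']]
print(without_empty)  # [['the', 'time', 'machine', 'by', 'h', 'g', 'wells'], []]
```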

Step 3: Build the dictionary

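def count_corpus(sentences):
    # Flatten the list of token lists into a single list of tokens
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # a dict-like object recording how often each token occurs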


Print the result:

print(count_corpus(tokens))
Counter({'the': 2261, '': 1282, 'i': 1267, 'and': 1245, 'of': 1155, 'a': 816, 'to': 695, 'was': 552, 'in': 541, 'that': 443, 'my': 440, 'it': 437, 'had': 354, 'me': 281, 'as': 270, 'at': 243, 'for': 221, 'with': 216, 'but': 204, 'time': 200, 'were': 158, 'this': 152, 'you': 137, 'on': 137, 'then': 134, 'his': 129, 'there': 127, 'he': 123, 'have': 122, 'they': 122, 'from': 122, 'one': 120, 'all': 118, 'not': 114, 'into': 114, 'upon': 113, 'little': 113, 'so': 112, 'is': 106, 'came': 105, 'by': 103, 'some': 94, 'be': 93, 'no': 92, ... (output truncated)

After this step we know how many times each word occurs in the text.
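
Because `Counter` is a dictionary subclass, it can also report the most frequent tokens directly through `most_common`. The sketch below uses a tiny made-up corpus rather than the novel, so the counts are illustrative only.

```python
import collections

def count_corpus(sentences):
    # Flatten the list of token lists and count each token.
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)

# Assumed toy data, not taken from the novel
sentences = [['the', 'time', 'machine'], ['the', 'time', 'traveller'], ['the']]
counter = count_corpus(sentences)
top_two = counter.most_common(2)
print(top_two)  # [('the', 3), ('time', 2)]
```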

dic = count_corpus(tokens)
# Convert the Counter into a list of (token, count) pairs to get a word-frequency list
word_fre = list(dic.items())
print(word_fre)

[('the', 2261), ('time', 200), ('machine', 85), ('by', 103), ('h', 1), ('g', 1), ('wells', 9), ('', 1282), ('i', 1267), ('traveller', 61), ('for', 221), ('so', 112), ('it', 437), ('will', 37), ('be', 93), ('convenient', 5), ('to', 695), ('speak', 6), ('of', 1155), ('him', 40), ('was', 552), ('expounding', 2), ('a', 816), ('recondite', 1), ('matter', 6), ('us', 35), ('his', 129), ('grey', 11), ('eyes', 35), ('shone', 8), ('and', 1245), ('twinkled', 1), ('usually', 3), ('pale', 10), ('face', 38), ('flushed', 2), ('animated', 3), ('fire', 30), ('burned', 6), ('brightly', 4), ('soft', 16), ('radiance', 1), ('incandescent', 1), ('lights', 1), ('in', 541), ('lilies', 1), ('silver', 6), ('caught', 10), ('bubbles', 1), ('that', 443), ('flashed', 4), ('passed', 13), ... (output truncated)

# Assign each token an integer index in order of first appearance
token_to_idx = {}
for idx, (token, freq) in enumerate(word_fre):
    token_to_idx[token] = idx

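This step assigns an index to every word. The contents of token_to_idx look like this (the indices in this output start at 3, which suggests that indices 0 to 2 were reserved, presumably for special tokens, in the run that produced it):

{'': 3, 'the': 4, 'time': 5, 'machine': 6, 'by': 7, 'h': 8, 'g': 9, 'wells': 10, 'i': 11, 'traveller': 12, 'for': 13, 'so': 14, 'it': 15, 'will': 16, 'be': 17, 'convenient': 18, 'to': 19, 'speak': 20, 'of': 21, 'him': 22, 'was': 23, 'expounding': 24, 'a': 25, 'recondite': 26, 'matter': 27, 'us': 28, 'his': 29, 'grey': 30, 'eyes': 31, 'shone': 32, 'and': 33,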
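
With a mapping like this, each tokenized line can be turned into a list of indices and padded to a common length. The following is a minimal sketch under stated assumptions: the special-token names `<pad>` and `<unk>`, the helper names `build_vocab`, `encode`, and `pad`, and the toy data are all hypothetical and are not taken from the code that produced the output above.

```python
import collections

def build_vocab(sentences, specials=('<pad>', '<unk>')):
    # Reserve the first indices for special tokens (assumed names),
    # then number the remaining tokens in order of first appearance.
    counter = collections.Counter(tk for st in sentences for tk in st)
    idx_to_token = list(specials) + [tk for tk in counter if tk not in specials]
    return {tk: idx for idx, tk in enumerate(idx_to_token)}

def encode(sentence, token_to_idx):
    # Tokens missing from the vocabulary fall back to the '<unk>' index.
    unk = token_to_idx['<unk>']
    return [token_to_idx.get(tk, unk) for tk in sentence]

def pad(indices, length, pad_idx):
    # Truncate or right-pad so every sequence has the same length.
    return indices[:length] + [pad_idx] * max(0, length - len(indices))

sentences = [['the', 'time', 'machine'], ['the', 'time', 'traveller']]  # toy data
vocab = build_vocab(sentences)
batch = sentences + [['a', 'machine']]  # 'a' is not in the vocabulary
encoded = [pad(encode(s, vocab), 4, vocab['<pad>']) for s in batch]
print(encoded)  # [[2, 3, 4, 0], [2, 3, 5, 0], [1, 4, 0, 0]]
```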

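Next, loop over the tokenized lines and look up each token's index in the dictionary above. This also involves details such as handling empty lines and padding sequences so they align. At this point we have a very, very simple (crude) tokenizer. You may already have noticed some problems along the way; let's spell them out: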

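Punctuation often carries semantic information, but our method simply discards it
Words such as "shouldn't" and "doesn't" are handled incorrectly
Words such as "Mr." and "Dr." are handled incorrectly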
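
These problems are easy to reproduce. The sketch below applies the same cleaning and splitting as the code above to one sentence; the sample sentence and the helper name `naive_tokenize` are assumptions for illustration.

```python
import re

def naive_tokenize(text):
    # Same approach as above: lowercase, keep only a-z, split on spaces.
    cleaned = re.sub('[^a-z]+', ' ', text.strip().lower())
    return cleaned.split(' ')

naive_tokens = naive_tokenize("Mr. Chen doesn't agree with my suggestion.")
print(naive_tokens)
# ['mr', 'chen', 'doesn', 't', 'agree', 'with', 'my', 'suggestion', '']
```

The title "Mr." loses its period, "doesn't" is broken into "doesn" and "t", and the sentence-final period disappears, leaving only an empty string behind.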

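What we need to do is improve tokenization for these cases. Now that the principles of preprocessing are clear, here are two commonly used tokenization libraries: spaCy and NLTK.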

spaCy

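import spacy
# Sample sentence, inferred from the printed tokens
text = "Mr. Chen doesn't agree with my suggestion."
# Load spaCy's small English pipeline and tokenize the sentence
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])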

['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']

NLTK

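from nltk.tokenize import word_tokenize
from nltk import data
# Point NLTK at a local copy of its data files (path specific to the author's environment)
data.path.append('/home/kesci/input/nltk_data3784/nltk_data')
# text is the same sample sentence as in the spaCy example
print(word_tokenize(text))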

['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']

For this sentence, both libraries produce the same tokens.
