Getting Started with NLTK

from matplotlib import pyplot as plt
from nltk import book
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
book.text1
<Text: Moby Dick by Herman Melville 1851>
# Show every occurrence of a word together with its surrounding context
book.text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
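The idea behind a concordance can be sketched in a few lines of plain Python. This is only an illustration, not NLTK's implementation: the function name `concordance`, the `width` parameter and the toy token list are all assumptions made up for this example. The match ignores case, which is consistent with the output above, where "Monstrous Pictures" is listed for the query "monstrous".

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on both sides."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines


# Toy token list, made up for illustration (not taken from the corpus)
tokens = "the whale was of a monstrous size and a Monstrous shape".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

NLTK measures the context window in characters rather than tokens, which is why its output lines all have the same width.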
# Find words that occur in the same contexts as the given word, e.g. "the ___ pictures" and "the ___ size"
book.text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate
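# common_contexts finds the contexts shared by two or more words.
# Each context is printed as left_right, where _ marks the position of the target word.
book.text2.common_contexts(["monstrous", "very"])
print('-' * 100)
book.text2.common_contexts(["monstrous"])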
a_pretty is_pretty a_lucky am_glad be_glad
----------------------------------------------------------------------------------------------------
a_pretty was_happy is_fond a_lucky a_deal am_glad is_pretty be_glad
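To make the left_right notation concrete, here is a minimal sketch of the idea in plain Python. It is a simplification under stated assumptions, not NLTK's code: the helper names `contexts` and `common_contexts` and the toy sentence are invented for this example, and the first and last tokens are simply skipped because they have no neighbour on one side.

```python
def contexts(tokens, word):
    """Set of (previous word, next word) pairs around each occurrence of `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


def common_contexts(tokens, words):
    """Contexts shared by every word in `words`, formatted as left_right."""
    shared = set.intersection(*(contexts(tokens, w) for w in words))
    return sorted(f"{left}_{right}" for left, right in shared)


# Toy sentence, made up for illustration
tokens = ("she is very pretty and he is monstrous pretty but "
          "a very lucky man is a monstrous lucky man").split()
print(common_contexts(tokens, ["monstrous", "very"]))
```

With a single word in the list, the intersection is just that word's own set of contexts, which matches the second call above returning a longer list than the first.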

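Plot where each of the given words occurs in the text. The x-axis is the word's position (offset) in the text, so the plot shows how each word is distributed across the document.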

book.text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
plt.show()

[Figure: dispersion plot of "citizens", "democracy", "freedom", "duties" and "America" in text4]
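The data behind a dispersion plot is just a list of token positions per word. The following sketch shows that computation only, without the plotting step; `word_offsets` and the toy sentence are hypothetical names and data chosen for this example.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets


# Toy sentence, made up for illustration
tokens = ("freedom and democracy for all citizens because "
          "freedom matters to citizens").split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])
for word, positions in offsets.items():
    print(word, positions)
```

Each position becomes one tick on the x-axis, with one row per word. A word that never occurs, such as "duties" here, gets an empty row.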

Count how many times a word occurs

book.text3.count("smote")
5

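Frequency counts. FreqDist behaves like a dictionary that maps each token to its count, and its keys are not sorted by frequency. To get the most frequent tokens, call most_common.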

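from nltk import probability

# Build a frequency distribution over every token in Moby Dick
fd = probability.FreqDist(book.text1)

# keys() returns a view in Python 3, so copy it into a list before slicing.
# Python 3 keeps keys in first-occurrence order, so these 50 keys will differ
# from the Python 2 output shown below.
words = list(fd.keys())
print(words[0:50])

# Counts of the first 10 keys, separated by spaces
print(' '.join(str(fd[w]) for w in words[0:10]))
# Number of occurrences of the token 'whale'
print(fd['whale'])
# The 50 most frequent tokens with their counts
fd.most_common(50)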
[u'funereal', u'unscientific', u'divinely', u'foul', u'four', u'gag', u'prefix', u'woods', u'clotted', u'Duck', u'hanging', u'plaudits', u'woody', u'Until', u'marching', u'disobeying', u'canes', u'granting', u'advantage', u'Westers', u'insertion', u'DRYDEN', u'formless', u'Untried', u'superficially', u'Western', u'portentous', u'beacon', u'meadows', u'sinking', u'Ding', u'Spurn', u'treasuries', u'churned', u'oceans', u'powders', u'tinkerings', u'tantalizing', u'yellow', u'bolting', u'uncertain', u'stabbed', u'bringing', u'elevations', u'ferreting', u'believers', u'wooded', u'songster', u'uttering', u'scholar']
1 1 2 11 74 2 1 9 2 2 
906





[(u',', 18713),
 (u'the', 13721),
 (u'.', 6862),
 (u'of', 6536),
 (u'and', 6024),
 (u'a', 4569),
 (u'to', 4542),
 (u';', 4072),
 (u'in', 3916),
 (u'that', 2982),
 (u"'", 2684),
 (u'-', 2552),
 (u'his', 2459),
 (u'it', 2209),
 (u'I', 2124),
 (u's', 1739),
 (u'is', 1695),
 (u'he', 1661),
 (u'with', 1659),
 (u'was', 1632),
 (u'as', 1620),
 (u'"', 1478),
 (u'all', 1462),
 (u'for', 1414),
 (u'this', 1280),
 (u'!', 1269),
 (u'at', 1231),
 (u'by', 1137),
 (u'but', 1113),
 (u'not', 1103),
 (u'--', 1070),
 (u'him', 1058),
 (u'from', 1052),
 (u'be', 1030),
 (u'on', 1005),
 (u'so', 918),
 (u'whale', 906),
 (u'one', 889),
 (u'you', 841),
 (u'had', 767),
 (u'have', 760),
 (u'there', 715),
 (u'But', 705),
 (u'or', 697),
 (u'were', 680),
 (u'now', 646),
 (u'which', 640),
 (u'?', 637),
 (u'me', 627),
 (u'like', 624)]
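The list above is dominated by punctuation and function words. As a rough sketch of one way to handle that, the example below counts tokens with `collections.Counter` (which `FreqDist` subclasses in NLTK 3) and then repeats the count on alphabetic tokens only. The toy token list and the variable names are assumptions made up for this example; the same filter could be applied to `book.text1`.

```python
from collections import Counter

# Toy token list, made up for illustration
tokens = "the whale , the sea , and the ship".split()

# Raw counts: punctuation is counted like any other token
counts = Counter(tokens)
print(counts["the"])
print(counts["harpoon"])      # unseen tokens count as 0 instead of raising KeyError
print(counts.most_common(2))

# Counts over alphabetic tokens only, lowercased
word_counts = Counter(t.lower() for t in tokens if t.isalpha())
print(word_counts.most_common(2))
```

Whether to lowercase is a design choice: the output above lists 'but' and 'But' separately, and lowercasing merges them.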
fd.plot(50)
plt.show()

[Figure: frequency plot of the 50 most common tokens in text1]

fd.plot(50, cumulative=True)
plt.show()

[Figure: cumulative frequency plot of the 50 most common tokens in text1]
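The cumulative plot answers the question "how much of the text do the top N tokens cover?". The numbers behind such a curve can be sketched as a running sum over the ranked counts. This is an illustration with made-up data, not what `fd.plot` does internally; `running` and `share` are names chosen for this example.

```python
from collections import Counter
from itertools import accumulate

# Toy token list, made up for illustration
tokens = "the whale , the sea , and the ship".split()

# Counts of the three most common tokens, highest first
ranked = Counter(tokens).most_common(3)
counts = [count for _, count in ranked]

# Running total of counts, and the fraction of all tokens covered so far
running = list(accumulate(counts))
share = [total / len(tokens) for total in running]

for (token, _), total, fraction in zip(ranked, running, share):
    print(token, total, round(fraction, 2))
```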

n-gram

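Use collocations to list word pairs (bigrams) that frequently occur together. The call below uses window_size=2, the default, which considers only adjacent words.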

book.text4.collocations(window_size=2)
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
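As a starting point for understanding this output, here is a minimal sketch that counts adjacent word pairs with plain Python. It is a deliberately naive version, not what `collocations` does: `adjacent_pairs`, `min_count` and the toy sentence are assumptions for this example, and it ranks by raw count only.

```python
from collections import Counter


def adjacent_pairs(tokens, min_count=2):
    """Count adjacent word pairs and keep those seen at least `min_count` times."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


# Toy sentence, made up for illustration
tokens = "the United States and the people of the United States".split()
for (first, second), n in adjacent_pairs(tokens):
    print(first, second, n)
```

Raw counts rank "the United" as high as "United States". As far as I know, NLTK avoids this by filtering out stopwords and very short words and by ranking the remaining pairs with a likelihood-ratio association score, which is why the list above contains no pairs with "the" or "of".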