Using the vocabulary and tokenizer from Hugging Face transformers
Step 1: install transformers
pip install transformers
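A quick way to confirm the package installed correctly (an optional check, not part of the original steps):

import transformers

# if the import succeeds, print the installed version
print(transformers.__version__)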
Step 2: load the tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    # name of the pretrained model whose vocabulary to load
    pretrained_model_name_or_path='bert-base-chinese',
    cache_dir=None,
    force_download=False,
)

sents = [
    '今天天气真好,我很开心!',
    '遇见你三生有幸,我一生最美好的际遇',
    '他很无聊,可是我还是很耐心地陪他玩耍',
    '公园里有个影子,我在寻找它的时候不小心摔倒了',
]
out = tokenizer.encode(
    # first sentence
    text=sents[0],
    # second sentence; when omitted, only a single sentence is encoded
    text_pair=sents[1],
    # truncate when the result exceeds max_length
    truncation=True,
    # pad up to max_length when the result is shorter
    padding='max_length',
    # whether to add special tokens such as [CLS] and [SEP]
    add_special_tokens=True,
    max_length=30,
    # None returns a Python list; 'tf' -> TensorFlow, 'pt' -> PyTorch, 'np' -> NumPy
    return_tensors=None,
)
print(out)
tokenizer.decode(out)
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
[101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 2769, 2523, 2458, 2552, 8013, 102, 6878, 6224, 872, 676, 4495, 3300, 2401, 8024, 2769, 671, 4495, 3297, 5401, 1962, 4638, 102]
[CLS] 今 天 天 气 真 好 , 我 很 开 心 ! [SEP] 遇 见 你 三 生 有 幸 , 我 一 生 最 美 好 的 [SEP]
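To see how each id maps back to an individual WordPiece token rather than a joined string, convert_ids_to_tokens can be used; a small illustration based on the output above:

# map each id in the encoded output back to its token string
tokens = tokenizer.convert_ids_to_tokens(out)
print(tokens[:5])  # ['[CLS]', '今', '天', '天', '气']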
out = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],
    truncation=True,
    padding='max_length',
    max_length=30,
    add_special_tokens=True,
    return_tensors=None,
    # also return token_type_ids, the special-tokens mask, the attention mask and the length
    return_token_type_ids=True,
    return_special_tokens_mask=True,
    return_attention_mask=True,
    return_length=True,
)

for k, v in out.items():
    print(k, ':', v)

tokenizer.decode(out['input_ids'])
input_ids : [101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 2769, 2523, 2458, 2552, 8013, 102, 6878, 6224, 872, 676, 4495, 3300, 2401, 8024, 2769, 671, 4495, 3297, 5401, 1962, 4638, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
special_tokens_mask : [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
length : 30
[CLS] 今 天 天 气 真 好 , 我 很 开 心 ! [SEP] 遇 见 你 三 生 有 幸 , 我 一 生 最 美 好 的 [SEP]
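The dictionary returned by encode_plus can be fed straight into a BERT model once it is returned as tensors. A minimal sketch, assuming PyTorch and the same 'bert-base-chinese' checkpoint (this step is not part of the original walkthrough):

import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-chinese')

# re-encode with PyTorch tensors so the dict can be unpacked into the model
encoded = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],
    truncation=True,
    padding='max_length',
    max_length=30,
    return_tensors='pt',
)

with torch.no_grad():
    output = model(**encoded)

# for bert-base models the last hidden state has shape [1, 30, 768]
print(output.last_hidden_state.shape)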
out = tokenizer.batch_encode_plus(
    # a list of single sentences; pass (sentence, sentence) tuples to encode pairs
    batch_text_or_text_pairs=[sents[0], sents[1], sents[2]],
    add_special_tokens=True,
    truncation=True,
    padding='max_length',
    max_length=15,
    return_tensors=None,
    return_attention_mask=True,
    return_special_tokens_mask=True,
    return_length=True,
)
for k,v in out.items():
print(k,':',v)
tokenizer.decode(out['input_ids'][0])
tokenizer.decode(out['input_ids'][1])
input_ids : [[101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 2769, 2523, 2458, 2552, 8013, 102, 0], [101, 6878, 6224, 872, 676, 4495, 3300, 2401, 8024, 2769, 671, 4495, 3297, 5401, 102], [101, 800, 2523, 3187, 5464, 8024, 1377, 3221, 2769, 6820, 3221, 2523, 5447, 2552, 102]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
length : [14, 15, 15]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
[CLS] 遇 见 你 三 生 有 幸 , 我 一 生 最 美 [SEP]
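In recent versions of transformers the same batch encoding can also be produced by calling the tokenizer object directly, which is the commonly recommended entry point; a sketch equivalent to the batch_encode_plus call above:

# tokenizer.__call__ accepts a list of sentences and the same keyword arguments
out2 = tokenizer(
    [sents[0], sents[1], sents[2]],
    add_special_tokens=True,
    truncation=True,
    padding='max_length',
    max_length=15,
    return_tensors=None,
)
print(out2['input_ids'][0])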
out = tokenizer.batch_encode_plus(
    # a batch of sentence pairs
    batch_text_or_text_pairs=[
        (sents[0], sents[1]),
        (sents[2], sents[3]),
    ],
    add_special_tokens=True,
    truncation=True,
    padding='max_length',
    max_length=30,
    return_tensors=None,
    return_attention_mask=True,
    return_special_tokens_mask=True,
    return_length=True,
)

for k, v in out.items():
    print(k, ':', v)

tokenizer.decode(out['input_ids'][0])
tokenizer.decode(out['input_ids'][1])
input_ids : [[101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 2769, 2523, 2458, 2552, 8013, 102, 6878, 6224, 872, 676, 4495, 3300, 2401, 8024, 2769, 671, 4495, 3297, 5401, 1962, 4638, 102], [101, 800, 2523, 3187, 5464, 8024, 1377, 3221, 2769, 6820, 3221, 2523, 5447, 2552, 1765, 102, 1062, 1736, 7027, 3300, 702, 2512, 2094, 8024, 2769, 1762, 2192, 2823, 2124, 102]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
length : [30, 30]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
[CLS] 他 很 无 聊 , 可 是 我 还 是 很 耐 心 地 [SEP] 公 园 里 有 个 影 子 , 我 在 寻 找 它 [SEP]
# vocabulary operations: get the tokenizer's vocabulary as a dict of token -> id
zidian = tokenizer.get_vocab()
type(zidian), len(zidian), '阅读' in zidian
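The output below comes from extending the vocabulary with new words and a new special token and then encoding a sentence that uses them. The code that produced it is missing from this extract; the following is a reconstruction consistent with the printed ids, assuming the new words '北京' and '每天' and the special token '[EOS]', which land at ids 21128-21130:

# add new words and a new special token to the vocabulary
tokenizer.add_tokens(new_tokens=['北京', '每天'])
tokenizer.add_special_tokens({'eos_token': '[EOS]'})

out = tokenizer.encode(
    text='祝福北京每天欣欣向荣![EOS]',
    text_pair=None,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    max_length=15,
    return_tensors=None,
)
print(out)
tokenizer.decode(out)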
[101, 4867, 4886, 21128, 21129, 3615, 3615, 1403, 5783, 8013, 21130, 102, 0, 0, 0]
[CLS] 祝 福 北京 每天 欣 欣 向 荣 ! [EOS] [SEP] [PAD] [PAD] [PAD]
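To double-check that the vocabulary actually grew, the new entries can be looked up directly (a small sanity check, not from the original post):

# the three new entries sit at the end of the extended vocabulary
print(tokenizer.convert_tokens_to_ids(['北京', '每天', '[EOS]']))  # [21128, 21129, 21130]
print(len(tokenizer))  # the original entries plus the 3 added ones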