Using the vocabulary and tokenizer from Hugging Face transformers
Step 1: install transformers
pip install transformers
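A quick way to confirm the package installed correctly (an optional check, not part of the original steps):

import transformers

# if the import succeeds, print the installed version
print(transformers.__version__)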
Step 2: load the tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    # name of the pretrained model whose vocabulary to load
    pretrained_model_name_or_path='bert-base-chinese',
    cache_dir=None,
    force_download=False,
)

sents = [
    '今天天气真好,我很开心!',
    '遇见你三生有幸,我一生最美好的际遇',
    '他很无聊,可是我还是很耐心地陪他玩耍',
    '公园里有个影子,我在寻找它的时候不小心摔倒了',
]
out = tokenizer.encode(
    # first sentence
    text=sents[0],
    # second sentence; when omitted, only a single sentence is encoded
    text_pair=sents[1],
    # truncate when the result exceeds max_length
    truncation=True,
    # pad up to max_length when the result is shorter
    padding='max_length',
    # whether to add special tokens such as [CLS] and [SEP]
    add_special_tokens=True,
    max_length=30,
    # None returns a Python list; 'tf' -> TensorFlow, 'pt' -> PyTorch, 'np' -> NumPy
    return_tensors=None,
)
print(out)
tokenizer.decode(out)
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
[101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 2769, 2523, 2458, 2552, 8013, 102, 6878, 6224, 872, 676, 4495, 3300, 2401, 8024, 2769, 671, 4495, 3297, 5401, 1962, 4638, 102]
[CLS] 今 天 天 气 真 好 , 我 很 开 心 ! [SEP] 遇 见 你 三 生 有 幸 , 我 一 生 最 美 好 的 [SEP]
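To see how each id maps back to an individual WordPiece token rather than a joined string, convert_ids_to_tokens can be used; a small illustration based on the output above:

# map each id in the encoded output back to its token string
tokens = tokenizer.convert_ids_to_tokens(out)
print(tokens[:5])  # ['[CLS]', '今', '天', '天', '气']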
out = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],
    truncation=True,
    padding='max_length',
    max_length=30,
    add_special_tokens=True,
    return_tensors=None,
    # also return token_type_ids, the special-tokens mask, the attention mask and the length
    return_token_type_ids=True,
    return_special_tokens_mask=True,
    return_attention_mask=True,
    return_length=True,
)

for k, v in out.items():
    print(k, ':', v)

tokenizer.decode(out['input_ids'])
input_ids : [101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 2769, 2523, 2458, 2552, 8013, 102, 6878, 6224, 872, 676, 4495, 3300, 2401, 8024, 2769, 671, 4495, 3297, 5401, 1962, 4638, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
special_tokens_mask : [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
length : 30
[CLS] 今 天 天 气 真 好 , 我 很 开 心 ! [SEP] 遇 见 你 三 生 有 幸 , 我 一 生 最 美 好 的 [SEP]
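The dictionary returned by encode_plus can be fed straight into a BERT model once it is returned as tensors. A minimal sketch, assuming PyTorch and the same 'bert-base-chinese' checkpoint (this step is not part of the original walkthrough):

import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-chinese')

# re-encode with PyTorch tensors so the dict can be unpacked into the model
encoded = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],
    truncation=True,
    padding='max_length',
    max_length=30,
    return_tensors='pt',
)

with torch.no_grad():
    output = model(**encoded)

# for bert-base models the last hidden state has shape [1, 30, 768]
print(output.last_hidden_state.shape)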
out = tokenizer.batch_encode_plus(
    # a list of single sentences; pass (sentence, sentence) tuples to encode pairs
    batch_text_or_text_pairs=[sents[0], sents[1], sents[2]],
    add_special_tokens=True,
    truncation=True,
    padding='max_length',
    max_length=15,
    return_tensors=None,
    return_attention_mask=True,
    return_special_tokens_mask=True,
    return_length=True,
)
for k,v in out.items():
print(k,':',v)
tokenizer.decode(out['input_ids'][0])
tokenizer.decode(out['input_ids'][1])
input_ids : [[101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 2769, 2523, 2458, 2552, 8013, 102, 0], [101, 6878, 6224, 872, 676, 4495, 3300, 2401, 8024, 2769, 671, 4495, 3297, 5401, 102], [101, 800, 2523, 3187, 5464, 8024, 1377, 3221, 2769, 6820, 3221, 2523, 5447, 2552, 102]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
length : [14, 15, 15]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
[CLS] 遇 见 你 三 生 有 幸 , 我 一 生 最 美 [SEP]
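In recent versions of transformers the same batch encoding can also be produced by calling the tokenizer object directly, which is the commonly recommended entry point; a sketch equivalent to the batch_encode_plus call above:

# tokenizer.__call__ accepts a list of sentences and the same keyword arguments
out2 = tokenizer(
    [sents[0], sents[1], sents[2]],
    add_special_tokens=True,
    truncation=True,
    padding='max_length',
    max_length=15,
    return_tensors=None,
)
print(out2['input_ids'][0])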
out = tokenizer.batch_encode_plus(
    # a batch of sentence pairs
    batch_text_or_text_pairs=[
        (sents[0], sents[1]),
        (sents[2], sents[3]),
    ],
    add_special_tokens=True,
    truncation=True,
    padding='max_length',
    max_length=30,
    return_tensors=None,
    return_attention_mask=True,
    return_special_tokens_mask=True,
    return_length=True,
)

for k, v in out.items():
    print(k, ':', v)

tokenizer.decode(out['input_ids'][0])
tokenizer.decode(out['input_ids'][1])
input_ids : [[101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 2769, 2523, 2458, 2552, 8013, 102, 6878, 6224, 872, 676, 4495, 3300, 2401, 8024, 2769, 671, 4495, 3297, 5401, 1962, 4638, 102], [101, 800, 2523, 3187, 5464, 8024, 1377, 3221, 2769, 6820, 3221, 2523, 5447, 2552, 1765, 102, 1062, 1736, 7027, 3300, 702, 2512, 2094, 8024, 2769, 1762, 2192, 2823, 2124, 102]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
length : [30, 30]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
[CLS] 他 很 无 聊 , 可 是 我 还 是 很 耐 心 地 [SEP] 公 园 里 有 个 影 子 , 我 在 寻 找 它 [SEP]
# vocabulary operations: get the tokenizer's vocabulary as a dict of token -> id
zidian = tokenizer.get_vocab()
type(zidian), len(zidian), '阅读' in zidian
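The output below comes from extending the vocabulary with new words and a new special token and then encoding a sentence that uses them. The code that produced it is missing from this extract; the following is a reconstruction consistent with the printed ids, assuming the new words '北京' and '每天' and the special token '[EOS]', which land at ids 21128-21130:

# add new words and a new special token to the vocabulary
tokenizer.add_tokens(new_tokens=['北京', '每天'])
tokenizer.add_special_tokens({'eos_token': '[EOS]'})

out = tokenizer.encode(
    text='祝福北京每天欣欣向荣![EOS]',
    text_pair=None,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    max_length=15,
    return_tensors=None,
)
print(out)
tokenizer.decode(out)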
[101, 4867, 4886, 21128, 21129, 3615, 3615, 1403, 5783, 8013, 21130, 102, 0, 0, 0]
[CLS] 祝 福 北京 每天 欣 欣 向 荣 ! [EOS] [SEP] [PAD] [PAD] [PAD]
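To double-check that the vocabulary actually grew, the new entries can be looked up directly (a small sanity check, not from the original post):

# the three new entries sit at the end of the extended vocabulary
print(tokenizer.convert_tokens_to_ids(['北京', '每天', '[EOS]']))  # [21128, 21129, 21130]
print(len(tokenizer))  # the original entries plus the 3 added ones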