
Tokenization and encoding/decoding with transformers models from Hugging Face (for complete beginners, even without Python)
  • Create a virtual environment in Anaconda
conda env list    # list existing environments
  • Create a new environment named transformers in Anaconda
conda create -n transformers python=3.6
  • Activate the new environment
conda activate transformers
  • Install the required libraries
pip install torch
pip install transformers
  • Import the required tools
import torch
from transformers import AutoTokenizer
  • Basic operations on a single English sentence (for Chinese tokenization, you can look for the Harbin Institute of Technology models on GitHub)
# Use bert-base-uncased: tokenize + convert to ids + encode (prepare for model)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.tokenize("Sometime it last in love, sometime it hurts instead.")  # tokenize
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # convert the tokens to ids

final_inputs = tokenizer.prepare_for_model(input_ids)  # turn the token ids into model-ready inputs (a Transformer also needs an attention mask)
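To make the three steps above (tokenize, convert to ids, prepare for the model) concrete without downloading a checkpoint, here is a minimal pure-Python sketch of the same pipeline. The vocabulary, ids, and helper names (`toy_tokenize` and friends) are invented for illustration only; real BERT uses a learned 30k-entry WordPiece vocabulary, and its subword splitting is more involved than whitespace splitting.

```python
import re

# Invented toy vocabulary; real ids come from the model's vocab file.
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "sometime": 1, "it": 2, "last": 3, "in": 4, "love": 5,
         ",": 6, "hurts": 7, "instead": 8, ".": 9}

def toy_tokenize(text):
    """Lowercase and split into words/punctuation (stand-in for WordPiece)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_convert_tokens_to_ids(tokens):
    """Look each token up in the vocab, falling back to [UNK]."""
    return [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]

def toy_prepare_for_model(ids):
    """Add [CLS]/[SEP] and build the attention mask, as prepare_for_model does."""
    input_ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

tokens = toy_tokenize("Sometime it last in love, sometime it hurts instead.")
ids = toy_convert_tokens_to_ids(tokens)
prepared = toy_prepare_for_model(ids)
print(tokens)                   # ['sometime', 'it', 'last', ..., 'instead', '.']
print(prepared["input_ids"])    # [101, 1, 2, 3, 4, 5, 6, 1, 2, 7, 8, 9, 102]
```

The key takeaway is that `prepare_for_model` is what adds the special tokens ([CLS]/[SEP] for BERT) and the attention mask on top of the raw ids.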
  • Decoding
# Decode with bert-base-uncased -> [CLS] should i give up, or should i just keep chasing pavement, even if it leads nowhere. [SEP]
inputs = tokenizer("Should i give up, or should i just keep chasing pavement, even if it leads nowhere.")
sentence = tokenizer.decode(inputs["input_ids"])  # the decoded sentence; the tokenizer output contains input_ids, token_type_ids, and attention_mask

# Decode with roberta-base -> <s>Should i give up, or should i just keep chasing pavement, even if it leads nowhere.</s>
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
inputs = tokenizer("Should i give up, or should i just keep chasing pavement, even if it leads nowhere.")
sentence2 = tokenizer.decode(inputs["input_ids"])
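Decoding is the inverse mapping: ids back to tokens, joined into text, with the special tokens still present (which is why [CLS]/[SEP] or &lt;s&gt;/&lt;/s&gt; show up in the output above). A toy sketch, again with an invented vocabulary and a made-up `toy_decode` helper; real tokenizers also merge subword pieces when decoding:

```python
# Invented id-to-token table for demonstration only.
ID_TO_TOKEN = {101: "[CLS]", 102: "[SEP]", 1: "should", 2: "i",
               3: "give", 4: "up", 5: "."}

def toy_decode(input_ids, skip_special_tokens=False):
    """Map ids back to tokens and join; optionally drop special tokens."""
    specials = {"[CLS]", "[SEP]"}
    tokens = [ID_TO_TOKEN[i] for i in input_ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in specials]
    return " ".join(tokens)

ids = [101, 1, 2, 3, 4, 5, 102]
print(toy_decode(ids))                            # [CLS] should i give up . [SEP]
print(toy_decode(ids, skip_special_tokens=True))  # should i give up .
```

The real `tokenizer.decode` accepts the same `skip_special_tokens=True` flag if you want the plain sentence without [CLS]/[SEP].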

The content above is based on https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation
