Related posts:
PaddleNLP UIE model in practice: entity extraction tasks [taxi data, express waybills]
PaddleNLP UIE classification model [sentiment analysis and news classification as examples] (with a smart-annotation workflow)
Applied practice: an ensemble of classification models [PaddleHub, Finetune, prompt]
Information extraction is the task of automatically extracting structured information from unstructured or semi-structured text. It mainly covers entity recognition, relation extraction, event extraction, sentiment analysis, and opinion extraction. Information extraction is needed across a very wide range of domains, and the technical bar is high; some concrete examples are shown below.
To address these challenges, the Institute of Software (Chinese Academy of Sciences) and Baidu jointly proposed UIE (Unified Structure Generation for Universal Information Extraction), a technique that unifies many information extraction tasks, published at ACL 2022. UIE achieved SOTA performance on four information extraction tasks (entity, relation, event, and sentiment) across 13 datasets under fully supervised, low-resource, and few-shot settings.
PaddleNLP combines UIE with ERNIE 3.0, the knowledge-enhanced NLP model from the Wenxin model family, to unlock UIE's potential on Chinese tasks, and has open-sourced the first industrial-grade solution for universal information extraction: all kinds of information extraction tasks can be completed quickly with no annotated data, or only a small amount.
Link: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie
1.1 Install PaddleNLP
! pip install --upgrade paddlenlp
! pip show paddlenlp
HR onboarding certificate information extraction
from paddlenlp import Taskflow
schema = ['姓名', '毕业院校', '职位', '月收入', '身体状况']
ie = Taskflow('information_extraction', schema=schema)
[2022-06-02 16:49:41,477] [ INFO] - Downloading model_state.pdparams from https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base_v0.1/model_state.pdparams
100%|██████████| 450M/450M [00:06<00:00, 73.2MB/s]
[2022-06-02 16:49:49,053] [ INFO] - Downloading model_config.json from https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_base/model_config.json
100%|██████████| 377/377 [00:00<00:00, 329kB/s]
[2022-06-02 16:49:49,089] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-base-zh'.
[2022-06-02 16:49:49,091] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh
[2022-06-02 16:49:49,093] [ INFO] - Downloading ernie_3.0_base_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt
100%|██████████| 182k/182k [00:00<00:00, 20.6MB/s]
W0602 16:49:49.182585 197 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0602 16:49:49.186472 197 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2022-06-02 16:49:54,484] [ INFO] - Converting to the inference model cost a little time.
[2022-06-02 16:50:05,041] [ INFO] - The inference model save in the path:/home/aistudio/.paddlenlp/taskflow/information_extraction/uie-base/static/inference
schema = ['姓名', '毕业院校', '职位', '月收入', '身体状况']
ie.set_schema(schema)
ie('兹证明凌霄为本单位职工,已连续在我单位工作5 年。学历为嘉利顿大学毕业,目前在我单位担任总经理助理 职位。近一年内该员工在我单位平均月收入(税后)为 12000 元。该职工身体状况良好。本单位仅此承诺上述表述是正确的,真实的。')
[{'姓名': [{'text': '凌霄', 'start': 3, 'end': 5, 'probability': 0.9042383385504706}], '毕业院校': [{'text': '嘉利顿大学', 'start': 28, 'end': 33, 'probability': 0.9927952662605009}], '职位': [{'text': '总经理助理', 'start': 44, 'end': 49, 'probability': 0.9922470268350594}], '月收入': [{'text': '12000 元', 'start': 77, 'end': 84, 'probability': 0.9788556518998917}], '身体状况': [{'text': '良好', 'start': 92, 'end': 94, 'probability': 0.9939678710475306}]}]
# Jupyter Notebook pretty-prints output by default; in other code editors, use Python's built-in pprint package for formatted output
from pprint import pprint
pprint(ie('兹证明凌霄为本单位职工,已连续在我单位工作5 年。学历为嘉利顿大学毕业,目前在我单位担任总经理助理 职位。近一年内该员工在我单位平均月收入(税后)为 12000 元。该职工身体状况良好。本单位仅此承诺上述表述是正确的,真实的。'))
[{'姓名': [{'end': 5, 'probability': 0.9042383385504706, 'start': 3, 'text': '凌霄'}], '月收入': [{'end': 84, 'probability': 0.9788556518998917, 'start': 77, 'text': '12000 元'}], '毕业院校': [{'end': 33, 'probability': 0.9927952662605009, 'start': 28, 'text': '嘉利顿大学'}], '职位': [{'end': 49, 'probability': 0.9922470268350594, 'start': 44, 'text': '总经理助理'}], '身体状况': [{'end': 94, 'probability': 0.9939678710475306, 'start': 92, 'text': '良好'}]}]
Medical pathology analysis
schema = ['肿瘤部位', '肿瘤大小']
ie.set_schema(schema)
ie('胃印戒细胞癌,肿瘤主要位于胃窦体部,大小6*2cm,癌组织侵及胃壁浆膜层,并侵犯血管和神经。')
[{'肿瘤部位': [{'text': '胃窦体部',
'start': 13,
'end': 17,
'probability': 0.9601818899487213}],
'肿瘤大小': [{'text': '6*2cm',
'start': 20,
'end': 25,
'probability': 0.9670914301489972}]}]
# Entity extraction
schema = ['时间', '赛手', '赛事名称']
ie.set_schema(schema)
ie('2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!')
[{'时间': [{'text': '2月8日上午',
'start': 0,
'end': 6,
'probability': 0.9857379716035553}],
'赛手': [{'text': '中国选手谷爱凌',
'start': 24,
'end': 31,
'probability': 0.7232891682586384}],
'赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛',
'start': 6,
'end': 23,
'probability': 0.8503080086948529}]}]
# Relation extraction
schema = {'歌曲名称': ['歌手', '所属专辑']}
ie.set_schema(schema)
ie('《告别了》是孙耀威在专辑爱的故事里面的歌曲')
[{'歌曲名称': [{'text': '告别了', 'start': 1, 'end': 4, 'probability': 0.629614912348881, 'relations': {'歌手': [{'text': '孙耀威', 'start': 6, 'end': 9, 'probability': 0.9988381005599081}], '所属专辑': [{'text': '爱的故事', 'start': 12, 'end': 16, 'probability': 0.9968462078543183}]}}, {'text': '爱的故事', 'start': 12, 'end': 16, 'probability': 0.28168707817316374, 'relations': {'歌手': [{'text': '孙耀威', 'start': 6, 'end': 9, 'probability': 0.9951415104192272}]}}]}]
# Event extraction
schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # events are anchored by a trigger: use an 'XXX触发词' (trigger word) key to select the trigger
ie.set_schema(schema)
ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。')
[{'地震触发词': [{'text': '地震', 'start': 56, 'end': 58, 'probability': 0.9977425555988333, 'relations': {'地震强度': [{'text': '3.5级', 'start': 52, 'end': 56, 'probability': 0.998080217831891}], '时间': [{'text': '5月16日06时08分', 'start': 11, 'end': 22, 'probability': 0.9853299772936026}], '震中位置': [{'text': '云南临沧市凤庆县(北纬24.34度,东经99.98度)', 'start': 23, 'end': 50, 'probability': 0.7874014521275967}], '震源深度': [{'text': '10千米', 'start': 63, 'end': 67, 'probability': 0.9937974422968665}]}}]}]
# Sentiment classification
schema = '情感倾向[正向,负向]' # classification tasks use [...] to declare the candidate labels
ie.set_schema(schema)
ie('这个产品用起来真的很流畅,我非常喜欢')
[{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.9990024058203417}]}]
# Opinion extraction
schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # the schema for opinion extraction is fixed; opinions are extracted directly with this schema
ie.set_schema(schema) # Reset schema
ie('地址不错,服务一般,设施陈旧')
[{'评价维度': [{'text': '地址', 'start': 0, 'end': 2, 'probability': 0.9888139270606509, 'relations': {'观点词': [{'text': '不错', 'start': 2, 'end': 4, 'probability': 0.9927845886615216}], '情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.998228967796706}]}}, {'text': '设施', 'start': 10, 'end': 12, 'probability': 0.9588298547520608, 'relations': {'观点词': [{'text': '陈旧', 'start': 12, 'end': 14, 'probability': 0.928675281256794}], '情感倾向[正向,负向]': [{'text': '负向', 'probability': 0.9949388606013692}]}}, {'text': '服务', 'start': 5, 'end': 7, 'probability': 0.9592857070501211, 'relations': {'观点词': [{'text': '一般', 'start': 7, 'end': 9, 'probability': 0.9949359182521675}], '情感倾向[正向,负向]': [{'text': '负向', 'probability': 0.9952498258302498}]}}]}]
# Cross-task, cross-domain extraction
schema = ['寺庙', {'丈夫': '妻子'}] # this schema mixes entity extraction and relation extraction
ie.set_schema(schema)
ie('李治即位后,让身在感业寺的武则天续起头发,重新纳入后宫。')
[{'寺庙': [{'text': '感业寺',
'start': 9,
'end': 12,
'probability': 0.9888581774497425}],
'丈夫': [{'text': '李治',
'start': 0,
'end': 2,
'probability': 0.989690572797457,
'relations': {'妻子': [{'text': '武则天',
'start': 13,
'end': 16,
'probability': 0.9987625986790256}]}}]}]
schema = ['才人'] # the extraction prompt must not be too obscure!
ie.set_schema(schema)
ie('李治即位后,让身在感业寺的武则天续起头发,重新纳入后宫。')
[{}]
schema = ['妃子']
ie.set_schema(schema)
ie('李治即位后,让身在感业寺的武则天续起头发,重新纳入后宫。')
[{'妃子': [{'text': '武则天',
'start': 13,
'end': 16,
'probability': 0.9976319401117237}]}]
from paddlenlp import Taskflow
schema = ['费用']
ie = Taskflow('information_extraction', schema=schema, batch_size=2) # with limited resources, use a smaller batch_size to improve utilization
ie(['二十号21点49分打车回家46块钱', '8月3号往返机场交通费110元', '2019年10月17日22点18分回家打车46元', '三月三0号23点10分加班打车21元'])
[2022-06-02 16:50:08,273] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-base-zh'.
[2022-06-02 16:50:08,275] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh/ernie_3.0_base_zh_vocab.txt
[{'费用': [{'text': '46块钱', 'start': 13, 'end': 17, 'probability': 0.9781786110574338}]}, {'费用': [{'text': '110元', 'start': 11, 'end': 15, 'probability': 0.9504088995163151}]}, {'费用': [{'text': '46元', 'start': 21, 'end': 24, 'probability': 0.9753814247531167}]}, {'费用': [{'text': '21元', 'start': 15, 'end': 18, 'probability': 0.9761294626311425}]}]
from paddlenlp import Taskflow
schema = ['费用']
ie = Taskflow('information_extraction', schema=schema, batch_size=2, model='uie-tiny') # switch to the lighter uie-tiny model
ie(['二十号21点49分打车回家46块钱', '8月3号往返机场交通费110元', '2019年10月17日22点18分回家打车46元', '三月三0号23点10分加班打车21元'])
[2022-06-02 16:50:09,568] [ INFO] - Downloading model_state.pdparams from https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_tiny_v0.1/model_state.pdparams
100%|██████████| 288M/288M [00:04<00:00, 64.1MB/s]
[2022-06-02 16:50:15,051] [ INFO] - Downloading model_config.json from https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_tiny/model_config.json
100%|██████████| 404/404 [00:00<00:00, 189kB/s]
[2022-06-02 16:50:15,100] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-medium-zh'.
[2022-06-02 16:50:15,102] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-medium-zh
[2022-06-02 16:50:15,104] [ INFO] - Downloading ernie_3.0_medium_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh_vocab.txt
100%|██████████| 182k/182k [00:00<00:00, 25.0MB/s]
[2022-06-02 16:50:16,397] [ INFO] - Converting to the inference model cost a little time.
[2022-06-02 16:50:23,128] [ INFO] - The inference model save in the path:/home/aistudio/.paddlenlp/taskflow/information_extraction/uie-tiny/static/inference
[{'费用': [{'text': '46块钱', 'start': 13, 'end': 17, 'probability': 0.8945340489542026}]}, {'费用': [{'text': '110元', 'start': 11, 'end': 15, 'probability': 0.9757676375014448}]}, {'费用': [{'text': '46元', 'start': 21, 'end': 24, 'probability': 0.860397941604333}]}, {'费用': [{'text': '21元', 'start': 15, 'end': 18, 'probability': 0.8595131018474689}]}]
The baseline UIE model in Taskflow was trained on a large amount of labeled data, but its out-of-the-box extraction quality is still unsatisfactory in some sub-domains. Fortunately, UIE can be improved quickly with only a handful of examples. Why does UIE benefit so much from few-shot data? UIE is built on prompt-based modeling, and prompts fine-tune very effectively on small samples. Below we demonstrate the effect of fine-tuning UIE on a concrete case.
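To make the prompt-plus-span view concrete, here is a minimal, self-contained sketch of the decoding idea, assuming fabricated start/end probabilities; decode_span is our own illustrative helper, not PaddleNLP internals. Each schema key acts as a prompt, and the model only has to score where the answer span starts and ends inside the text:

# Illustrative sketch of prompt-conditioned span decoding (not PaddleNLP code).
def decode_span(text, start_probs, end_probs, threshold=0.5):
    # collect positions whose start/end probability clears the threshold
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        for e in ends:
            if e >= s:  # pair each start with the first valid end
                spans.append({'text': text[s:e + 1], 'start': s, 'end': e + 1,
                              'probability': round(start_probs[s] * end_probs[e], 4)})
                break
    return spans

text = "2月8日上午北京冬奥会"
start_probs = [0.9] + [0.0] * (len(text) - 1)            # fabricated scores for the prompt '时间'
end_probs = [0.0] * 5 + [0.8] + [0.0] * (len(text) - 6)
print(decode_span(text, start_probs, end_probs))          # [{'text': '2月8日上午', ...}]

Because only this span scorer has to adapt to a new domain, a few dozen labeled examples are often enough to move the needle.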
Background
At a certain company, employees can claim taxi expenses by voice: an ASR model transcribes the speech to text, and information extraction then pulls four fields from the text, namely time, departure, destination, and cost. Once these four fields are extracted, a reimbursement work order can be filled in automatically.
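As an illustration of that last step, here is a minimal sketch of turning a UIE extraction result into a work-order record; the fill_ticket helper and the record layout are hypothetical, not part of PaddleNLP:

# Illustrative sketch: pick the top candidate per field to fill the work order.
def fill_ticket(uie_result):
    record = {}
    for field in ['时间', '出发地', '目的地', '费用']:
        matches = uie_result.get(field, [])
        # keep the highest-probability candidate, or None if nothing was found
        record[field] = max(matches, key=lambda m: m['probability'])['text'] if matches else None
    return record

result = {'时间': [{'text': '10月16日', 'probability': 0.96}],
          '费用': [{'text': '48元', 'probability': 0.89}]}
print(fill_ticket(result))
# {'时间': '10月16日', '出发地': None, '目的地': None, '费用': '48元'}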
Challenge
The off-the-shelf Taskflow UIE model does not fully reach industrial quality on such a narrow vertical task, so some fine-tuning is needed to lift its performance. Below are a few illustrative cases.
ie.set_schema(['时间', '出发地', '目的地', '费用'])
ie('10月16日高铁从杭州到上海南站车次d5414共48元') # departure and destination are not extracted accurately
[{'时间': [{'text': '10月16日',
'start': 0,
'end': 6,
'probability': 0.9552445817793149}],
'出发地': [{'text': '杭州',
'start': 9,
'end': 11,
'probability': 0.5713024802221334}],
'费用': [{'text': '48元',
'start': 24,
'end': 27,
'probability': 0.8932524634666485}]}]
We recommend the annotation platform doccano for data labeling. This project also wires annotation straight into training: after exporting from doccano, the doccano.py script converts the data into the format the model expects, so the two stages connect seamlessly. To make this work, annotate your data on the doccano platform according to the following rules:
Step 1. Install doccano locally (do not run this inside AI Studio; tested locally with python=3.8)
$ pip install doccano
Step 2. Initialize the database and an account (replace the username and password with your own values)
$ doccano init
$ doccano createuser --username my_admin_name --password my_password
Step 3. Start doccano
$ doccano webserver --port 8000
$ doccano task
Step 4. Annotate entities and relations in doccano
Open http://127.0.0.1:8000/ in a browser to reach the landing page. Click LOGIN and sign in with the username and password set in Step 2. Click CREATE to start a new project, choose Sequence Labeling, fill in the required fields such as Project name, and tick Allow overlapping entity and Use relation labeling as needed.
Set up labels. In the Labels tab click Actions, then either Create Label to define labels manually or Import Labels to load them from a file.
Import data. In the Datasets tab click Actions, then Import Dataset to load text data from a file.
Annotate data. Click the Annotate button at the far right of each example to start labeling. The Label Types switch on the right of the annotation page toggles between entity labels and relation labels.
Export data. In the Datasets tab click Actions, then Export Dataset to export the annotated data.
Convert the annotated data into the format required for UIE training
Place the exported file in the ./data/ directory. For this speech expense-reimbursement scenario, you can download data that has already been annotated. Annotation guides for the individual task types:
https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/uie/doccano.md
! wget https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl
! mv audio-expense-account.jsonl ./data/
--2022-06-02 16:50:24-- https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl
Resolving paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)... 100.67.200.6
Connecting to paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)|100.67.200.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16241 (16K) [application/octet-stream]
Saving to: ‘audio-expense-account.jsonl’
audio-expense-accou 100%[===================>] 15.86K --.-KB/s in 0s
2022-06-02 16:50:24 (285 MB/s) - ‘audio-expense-account.jsonl’ saved [16241/16241]
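Each line of the downloaded JSONL file holds one annotated example. For this entity-level task a line looks roughly like the following (the ids and character offsets here are illustrative; see the doccano.md guide linked above for the authoritative format):

{"id": 1, "text": "10月16日高铁从杭州到上海南站车次d5414共48元", "entities": [{"id": 0, "start_offset": 0, "end_offset": 6, "label": "时间"}, {"id": 1, "start_offset": 9, "end_offset": 11, "label": "出发地"}], "relations": []}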
In the conversion command below, --splits 0.2 0.8 0.0 assigns 20% of the data to the training set, 80% to the dev set, and 0% to the test set.
Configurable parameters
doccano_file: the data annotation file exported from doccano.
save_dir: directory where the training data is saved; defaults to the data directory.
negative_ratio: maximum ratio of negative examples; only effective for extraction tasks, where constructing a moderate number of negatives improves the model. The number of negatives is tied to the number of labels: max negatives = negative_ratio * number of positives. Applies to the training set only; defaults to 5. To keep evaluation metrics accurate, the dev and test sets are built with all negatives by default.
splits: proportions of the training and dev sets when splitting the data. Defaults to [0.8, 0.1, 0.1], i.e. an 8:1:1 split into training, dev, and test sets.
task_type: task type; extraction and classification are supported.
options: category labels for a classification task; only effective for classification tasks.
prompt_prefix: prompt prefix declared for a classification task; only effective for classification tasks.
is_shuffle: whether to shuffle the dataset; defaults to True.
seed: random seed; defaults to 1000.
! python preprocess.py --input_file ./data/audio-expense-account.jsonl --save_dir ./data/ --negative_ratio 5 --splits 0.2 0.8 0.0 --seed 1000
Converting doccano data... 100%|████████████████████████████████████████| 10/10 [00:00<00:00, 31091.95it/s]
Adding negative samples for first stage prompt... 100%|███████████████████████████████████████| 10/10 [00:00<00:00, 137518.16it/s]
Converting doccano data... 100%|████████████████████████████████████████| 40/40 [00:00<00:00, 61658.27it/s]
Adding negative samples for first stage prompt... 100%|███████████████████████████████████████| 40/40 [00:00<00:00, 127292.99it/s]
Converting doccano data... 0it [00:00, ?it/s]
Adding negative samples for first stage prompt... 0it [00:00, ?it/s]
Save 40 examples to ./data/train.txt.
Save 129 examples to ./data/dev.txt.
Save 0 examples to ./data/test.txt.
Finished! It takes 0.01 seconds
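A quick sanity check of those numbers, assuming the downloaded file contains 50 annotated examples (consistent with the 10/40/0 conversion counts in the log above):

# Back-of-the-envelope check of the split sizes (assumption: 50 raw examples).
total = 50
splits = [0.2, 0.8, 0.0]                    # train / dev / test proportions
print([int(total * s) for s in splits])     # -> [10, 40, 0]
# The saved files are larger (40 train / 129 dev) because negatives are added:
# at most negative_ratio * positives for train, and all negatives for dev/test.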
Train the UIE model
Model checkpoints are saved in the ./checkpoint/ directory. Tip: a GPU environment is recommended; on CPU you may run out of memory, in which case switch the model to uie-tiny and lower the batch_size appropriately. To push accuracy higher, set --num_epochs larger and train longer.
Configurable parameters:
train_path: path to the training set file.
dev_path: path to the dev set file.
save_dir: directory where the model is saved; defaults to ./checkpoint.
learning_rate: learning rate; defaults to 1e-5.
batch_size: batch size; tune it to your GPU memory and lower it if you hit out-of-memory errors; defaults to 16.
max_seq_len: maximum text length; inputs longer than this are split automatically; defaults to 512.
num_epochs: number of training epochs; defaults to 100.
model: the model to fine-tune; uie-base or uie-tiny; defaults to uie-base.
seed: random seed; defaults to 1000.
logging_steps: number of steps between log prints; defaults to 10.
valid_steps: number of steps between evaluations; defaults to 100.
device: device to train on; cpu or gpu.
! python finetune.py --train_path ./data/train.txt --dev_path ./data/dev.txt --save_dir ./checkpoint --model uie-tiny --learning_rate 1e-5 --batch_size 2 --max_seq_len 512 --num_epochs 50 --seed 1000 --logging_steps 10 --valid_steps 10
[2022-06-02 16:50:29,694] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh/ernie_3.0_base_zh_vocab.txt [2022-06-02 16:50:29,721] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieModel'> to load 'ernie-3.0-base-zh'. [2022-06-02 16:50:29,722] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh [2022-06-02 16:50:29,722] [ INFO] - Downloading ernie_3.0_base_zh.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams 100%|████████████████████████████████████████| 452M/452M [00:06<00:00, 75.2MB/s] W0602 16:50:36.096419 501 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1 W0602 16:50:36.101285 501 device_context.cc:465] device: 0, cuDNN Version: 7.6. [2022-06-02 16:50:40,289] [ INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.transform.weight', 'cls.predictions.layer_norm.weight', 'cls.predictions.transform.bias', 'cls.predictions.layer_norm.bias', 'cls.predictions.decoder_bias'] Init from: uie-tiny/model_state.pdparams global step 10, epoch: 1, loss: 0.00316, speed: 10.32 step/s Evaluation precision: 0.50781, recall: 0.50388, F1: 0.50584 best F1 performence has been updated: 0.00000 --> 0.50584 global step 20, epoch: 1, loss: 0.00276, speed: 13.50 step/s Evaluation precision: 0.57576, recall: 0.58915, F1: 0.58238 best F1 performence has been updated: 0.50584 --> 0.58238 global step 30, epoch: 2, loss: 0.00231, speed: 12.67 step/s Evaluation precision: 0.63492, recall: 0.62016, F1: 0.62745 best F1 performence has been updated: 0.58238 --> 0.62745 global step 40, epoch: 2, loss: 0.00219, speed: 13.15 step/s Evaluation precision: 0.67521, recall: 0.61240, F1: 0.64228 best F1 performence has been updated: 0.62745 --> 0.64228 global step 50, epoch: 3, loss: 0.00201, speed: 13.03 step/s Evaluation precision: 0.77193, recall: 0.68217, F1: 0.72428 best F1 performence has been updated: 0.64228 --> 0.72428 global step 60, epoch: 3, loss: 0.00179, speed: 13.28 step/s Evaluation precision: 0.77686, recall: 0.72868, F1: 0.75200 best F1 performence has been updated: 0.72428 --> 0.75200 global step 70, epoch: 4, loss: 0.00166, speed: 13.15 step/s Evaluation precision: 0.81967, recall: 0.77519, F1: 0.79681 best F1 performence has been updated: 0.75200 --> 0.79681 global step 80, epoch: 4, loss: 0.00150, speed: 13.50 step/s Evaluation precision: 0.83607, recall: 0.79070, F1: 0.81275 best F1 performence has been updated: 0.79681 --> 0.81275 global step 90, epoch: 5, loss: 0.00140, speed: 12.64 step/s Evaluation precision: 0.86066, recall: 0.81395, F1: 0.83665 best F1 performence has been updated: 0.81275 --> 0.83665 global step 100, epoch: 5, loss: 0.00128, speed: 13.48 step/s Evaluation precision: 0.88235, recall: 0.81395, F1: 0.84677 best F1 performence has been updated: 0.83665 --> 0.84677 global step 110, epoch: 6, loss: 0.00119, speed: 12.99 step/s Evaluation precision: 0.89076, recall: 0.82171, F1: 0.85484 best F1 performence has been updated: 0.84677 --> 0.85484 global step 120, epoch: 6, loss: 0.00110, speed: 13.44 step/s Evaluation precision: 0.88333, recall: 0.82171, F1: 0.85141 global step 130, epoch: 7, loss: 0.00104, speed: 13.15 step/s Evaluation precision: 0.90909, recall: 0.85271, F1: 0.88000 best F1 performence has been updated: 0.85484 --> 0.88000 global step 140, epoch: 
7, loss: 0.00098, speed: 13.39 step/s Evaluation precision: 0.90083, recall: 0.84496, F1: 0.87200 global step 150, epoch: 8, loss: 0.00092, speed: 13.05 step/s Evaluation precision: 0.90244, recall: 0.86047, F1: 0.88095 best F1 performence has been updated: 0.88000 --> 0.88095 global step 160, epoch: 8, loss: 0.00087, speed: 13.36 step/s Evaluation precision: 0.90323, recall: 0.86822, F1: 0.88538 best F1 performence has been updated: 0.88095 --> 0.88538 global step 170, epoch: 9, loss: 0.00082, speed: 13.05 step/s Evaluation precision: 0.91129, recall: 0.87597, F1: 0.89328 best F1 performence has been updated: 0.88538 --> 0.89328 global step 180, epoch: 9, loss: 0.00078, speed: 13.47 step/s Evaluation precision: 0.90400, recall: 0.87597, F1: 0.88976 global step 190, epoch: 10, loss: 0.00074, speed: 12.99 step/s Evaluation precision: 0.92063, recall: 0.89922, F1: 0.90980 best F1 performence has been updated: 0.89328 --> 0.90980 global step 200, epoch: 10, loss: 0.00070, speed: 13.47 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 best F1 performence has been updated: 0.90980 --> 0.91765 global step 210, epoch: 11, loss: 0.00067, speed: 12.82 step/s Evaluation precision: 0.92063, recall: 0.89922, F1: 0.90980 global step 220, epoch: 11, loss: 0.00065, speed: 13.48 step/s Evaluation precision: 0.92063, recall: 0.89922, F1: 0.90980 global step 230, epoch: 12, loss: 0.00062, speed: 13.17 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 240, epoch: 12, loss: 0.00059, speed: 13.50 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 250, epoch: 13, loss: 0.00057, speed: 13.13 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 260, epoch: 13, loss: 0.00055, speed: 13.43 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 270, epoch: 14, loss: 0.00053, speed: 13.20 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 280, epoch: 14, loss: 0.00051, speed: 13.51 step/s Evaluation precision: 0.92742, recall: 0.89147, F1: 0.90909 global step 290, epoch: 15, loss: 0.00050, speed: 13.25 step/s Evaluation precision: 0.92742, recall: 0.89147, F1: 0.90909 global step 300, epoch: 15, loss: 0.00048, speed: 13.44 step/s Evaluation precision: 0.92742, recall: 0.89147, F1: 0.90909 global step 310, epoch: 16, loss: 0.00046, speed: 13.16 step/s Evaluation precision: 0.92742, recall: 0.89147, F1: 0.90909 global step 320, epoch: 16, loss: 0.00045, speed: 13.57 step/s Evaluation precision: 0.92742, recall: 0.89147, F1: 0.90909 global step 330, epoch: 17, loss: 0.00044, speed: 13.13 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 best F1 performence has been updated: 0.91765 --> 0.92126 global step 340, epoch: 17, loss: 0.00042, speed: 12.89 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 350, epoch: 18, loss: 0.00041, speed: 13.17 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 360, epoch: 18, loss: 0.00040, speed: 13.55 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 370, epoch: 19, loss: 0.00039, speed: 13.11 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 380, epoch: 19, loss: 0.00038, speed: 13.45 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 390, epoch: 20, loss: 0.00037, speed: 13.19 step/s Evaluation precision: 0.92742, recall: 0.89147, F1: 0.90909 global step 400, epoch: 20, loss: 
0.00036, speed: 13.36 step/s Evaluation precision: 0.93548, recall: 0.89922, F1: 0.91700 global step 410, epoch: 21, loss: 0.00035, speed: 13.03 step/s Evaluation precision: 0.94309, recall: 0.89922, F1: 0.92063 global step 420, epoch: 21, loss: 0.00035, speed: 13.34 step/s Evaluation precision: 0.94309, recall: 0.89922, F1: 0.92063 global step 430, epoch: 22, loss: 0.00034, speed: 13.00 step/s Evaluation precision: 0.94309, recall: 0.89922, F1: 0.92063 global step 440, epoch: 22, loss: 0.00033, speed: 13.27 step/s Evaluation precision: 0.94309, recall: 0.89922, F1: 0.92063 global step 450, epoch: 23, loss: 0.00033, speed: 13.00 step/s Evaluation precision: 0.94215, recall: 0.88372, F1: 0.91200 global step 460, epoch: 23, loss: 0.00032, speed: 13.27 step/s Evaluation precision: 0.94872, recall: 0.86047, F1: 0.90244 global step 470, epoch: 24, loss: 0.00031, speed: 12.94 step/s Evaluation precision: 0.93333, recall: 0.86822, F1: 0.89960 global step 480, epoch: 24, loss: 0.00031, speed: 13.27 step/s Evaluation precision: 0.92562, recall: 0.86822, F1: 0.89600 global step 490, epoch: 25, loss: 0.00030, speed: 12.89 step/s Evaluation precision: 0.92562, recall: 0.86822, F1: 0.89600 global step 500, epoch: 25, loss: 0.00030, speed: 13.37 step/s Evaluation precision: 0.93388, recall: 0.87597, F1: 0.90400 global step 510, epoch: 26, loss: 0.00029, speed: 13.14 step/s Evaluation precision: 0.92742, recall: 0.89147, F1: 0.90909 global step 520, epoch: 26, loss: 0.00029, speed: 13.28 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 530, epoch: 27, loss: 0.00028, speed: 12.87 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 540, epoch: 27, loss: 0.00027, speed: 13.27 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 550, epoch: 28, loss: 0.00027, speed: 12.99 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 560, epoch: 28, loss: 0.00027, speed: 13.30 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 570, epoch: 29, loss: 0.00026, speed: 12.99 step/s Evaluation precision: 0.92800, recall: 0.89922, F1: 0.91339 global step 580, epoch: 29, loss: 0.00026, speed: 13.30 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 590, epoch: 30, loss: 0.00025, speed: 12.94 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 600, epoch: 30, loss: 0.00025, speed: 13.35 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 610, epoch: 31, loss: 0.00024, speed: 12.94 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 620, epoch: 31, loss: 0.00024, speed: 13.32 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 630, epoch: 32, loss: 0.00024, speed: 12.41 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 640, epoch: 32, loss: 0.00023, speed: 13.13 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 650, epoch: 33, loss: 0.00023, speed: 13.10 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 660, epoch: 33, loss: 0.00023, speed: 13.43 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 670, epoch: 34, loss: 0.00022, speed: 13.01 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 680, epoch: 34, loss: 0.00022, speed: 13.38 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 690, 
epoch: 35, loss: 0.00022, speed: 13.08 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 700, epoch: 35, loss: 0.00021, speed: 13.43 step/s Evaluation precision: 0.93496, recall: 0.89147, F1: 0.91270 global step 710, epoch: 36, loss: 0.00021, speed: 13.17 step/s Evaluation precision: 0.93496, recall: 0.89147, F1: 0.91270 global step 720, epoch: 36, loss: 0.00021, speed: 13.42 step/s Evaluation precision: 0.93496, recall: 0.89147, F1: 0.91270 global step 730, epoch: 37, loss: 0.00020, speed: 13.02 step/s Evaluation precision: 0.93548, recall: 0.89922, F1: 0.91700 global step 740, epoch: 37, loss: 0.00020, speed: 13.35 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 750, epoch: 38, loss: 0.00020, speed: 13.02 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 760, epoch: 38, loss: 0.00020, speed: 13.40 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 770, epoch: 39, loss: 0.00019, speed: 12.95 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 780, epoch: 39, loss: 0.00019, speed: 13.26 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 790, epoch: 40, loss: 0.00019, speed: 13.01 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 800, epoch: 40, loss: 0.00019, speed: 13.23 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 810, epoch: 41, loss: 0.00018, speed: 13.09 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 820, epoch: 41, loss: 0.00018, speed: 13.49 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 830, epoch: 42, loss: 0.00018, speed: 13.09 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 840, epoch: 42, loss: 0.00018, speed: 13.36 step/s Evaluation precision: 0.92857, recall: 0.90698, F1: 0.91765 global step 850, epoch: 43, loss: 0.00018, speed: 12.96 step/s Evaluation precision: 0.93651, recall: 0.91473, F1: 0.92549 best F1 performence has been updated: 0.92126 --> 0.92549 global step 860, epoch: 43, loss: 0.00017, speed: 12.87 step/s Evaluation precision: 0.93651, recall: 0.91473, F1: 0.92549 global step 870, epoch: 44, loss: 0.00017, speed: 13.02 step/s Evaluation precision: 0.93651, recall: 0.91473, F1: 0.92549 global step 880, epoch: 44, loss: 0.00017, speed: 13.47 step/s Evaluation precision: 0.93701, recall: 0.92248, F1: 0.92969 best F1 performence has been updated: 0.92549 --> 0.92969 global step 890, epoch: 45, loss: 0.00017, speed: 13.03 step/s Evaluation precision: 0.93701, recall: 0.92248, F1: 0.92969 global step 900, epoch: 45, loss: 0.00017, speed: 13.46 step/s Evaluation precision: 0.93701, recall: 0.92248, F1: 0.92969 global step 910, epoch: 46, loss: 0.00017, speed: 13.10 step/s Evaluation precision: 0.93701, recall: 0.92248, F1: 0.92969 global step 920, epoch: 46, loss: 0.00016, speed: 13.33 step/s Evaluation precision: 0.93701, recall: 0.92248, F1: 0.92969 global step 930, epoch: 47, loss: 0.00016, speed: 13.03 step/s Evaluation precision: 0.93701, recall: 0.92248, F1: 0.92969 global step 940, epoch: 47, loss: 0.00016, speed: 13.33 step/s Evaluation precision: 0.93701, recall: 0.92248, F1: 0.92969 global step 950, epoch: 48, loss: 0.00016, speed: 12.98 step/s Evaluation precision: 0.93701, recall: 0.92248, F1: 0.92969 global step 960, epoch: 48, loss: 0.00016, speed: 13.32 step/s Evaluation precision: 0.93651, recall: 0.91473, F1: 0.92549 global step 
970, epoch: 49, loss: 0.00016, speed: 12.89 step/s Evaluation precision: 0.93651, recall: 0.91473, F1: 0.92549 global step 980, epoch: 49, loss: 0.00015, speed: 13.46 step/s Evaluation precision: 0.93600, recall: 0.90698, F1: 0.92126 global step 990, epoch: 50, loss: 0.00015, speed: 13.09 step/s Evaluation precision: 0.94400, recall: 0.91473, F1: 0.92913 global step 1000, epoch: 50, loss: 0.00015, speed: 13.44 step/s Evaluation precision: 0.94400, recall: 0.91473, F1: 0.92913
Note: --max_seq_len 512 should cover the maximum text length in your corpus.
#! python finetune.py --train_path ./data/train.txt --dev_path ./data/dev.txt --save_dir ./checkpoint --model uie-base --learning_rate 1e-5 --batch_size 16 --max_seq_len 512 --num_epochs 50 --seed 1000 --logging_steps 10 --valid_steps 10
[2022-06-02 16:58:58,871] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh/ernie_3.0_base_zh_vocab.txt [2022-06-02 16:58:58,897] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieModel'> to load 'ernie-3.0-base-zh'. [2022-06-02 16:58:58,897] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh/ernie_3.0_base_zh.pdparams W0602 16:58:58.898428 1730 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1 W0602 16:58:58.901824 1730 device_context.cc:465] device: 0, cuDNN Version: 7.6. [2022-06-02 16:59:08,690] [ INFO] - Weights from pretrained model not used in ErnieModel: ['cls.predictions.transform.weight', 'cls.predictions.layer_norm.weight', 'cls.predictions.transform.bias', 'cls.predictions.layer_norm.bias', 'cls.predictions.decoder_bias'] 100%|████████████████████████████████| 460749/460749 [00:06<00:00, 71829.44it/s] Init from: uie-base/model_state.pdparams global step 10, epoch: 4, loss: 0.00284, speed: 2.37 step/s Evaluation precision: 0.60156, recall: 0.59690, F1: 0.59922 best F1 performence has been updated: 0.00000 --> 0.59922 global step 20, epoch: 7, loss: 0.00212, speed: 2.45 step/s Evaluation precision: 0.76786, recall: 0.66667, F1: 0.71369 best F1 performence has been updated: 0.59922 --> 0.71369 global step 30, epoch: 10, loss: 0.00169, speed: 2.60 step/s Evaluation precision: 0.79310, recall: 0.71318, F1: 0.75102 best F1 performence has been updated: 0.71369 --> 0.75102 global step 40, epoch: 14, loss: 0.00137, speed: 2.44 step/s Evaluation precision: 0.85950, recall: 0.80620, F1: 0.83200 best F1 performence has been updated: 0.75102 --> 0.83200 global step 50, epoch: 17, loss: 0.00114, speed: 2.45 step/s Evaluation precision: 0.89344, recall: 0.84496, F1: 0.86853 best F1 performence has been updated: 0.83200 --> 0.86853 global step 60, epoch: 20, loss: 0.00097, speed: 2.58 step/s Evaluation precision: 0.90164, recall: 0.85271, F1: 0.87649 best F1 performence has been updated: 0.86853 --> 0.87649 global step 70, epoch: 24, loss: 0.00084, speed: 2.43 step/s Evaluation precision: 0.91057, recall: 0.86822, F1: 0.88889 best F1 performence has been updated: 0.87649 --> 0.88889 ^C Traceback (most recent call last): File "finetune.py", line 164, in <module> do_train() File "finetune.py", line 105, in do_train optimizer.clear_grad() File "<decorator-gen-217>", line 2, in clear_grad File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__ return wrapped_func(*args, **kwargs) File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py", line 229, in __impl__ return func(*args, **kwargs) File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 1028, in clear_grad p.clear_gradient() KeyboardInterrupt
from paddlenlp import Taskflow
schema = ['时间', '出发地', '目的地', '费用']
few_ie = Taskflow('information_extraction', schema=schema, task_path='./checkpoint/model_best')
few_ie(['10月16日高铁从杭州到上海南站车次d5414共48元',
'10月22日从公司到首都机场38元过路费'])
[2022-06-02 17:01:18,430] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-base-zh'.
[2022-06-02 17:01:18,432] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh/ernie_3.0_base_zh_vocab.txt
[{'时间': [{'text': '10月16日', 'start': 0, 'end': 6, 'probability': 0.9998620769863464}], '出发地': [{'text': '杭州', 'start': 9, 'end': 11, 'probability': 0.997861665709749}], '目的地': [{'text': '上海南站', 'start': 12, 'end': 16, 'probability': 0.9974161074329579}], '费用': [{'text': '48', 'start': 24, 'end': 26, 'probability': 0.950222029031579}]}, {'时间': [{'text': '10月22日', 'start': 0, 'end': 6, 'probability': 0.9995716364718135}], '目的地': [{'text': '首都机场', 'start': 10, 'end': 14, 'probability': 0.9984550308953608}], '费用': [{'text': '38', 'start': 14, 'end': 16, 'probability': 0.9465688451171062}]}]
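For downstream use it can help to drop low-confidence candidates before filling the work order. A small, hypothetical post-processing sketch, reusing the few_ie pipeline defined above (the 0.5 threshold is an arbitrary example value):

# Keep only candidates whose probability clears the threshold.
def filter_by_probability(results, threshold=0.5):
    return [{k: [m for m in v if m['probability'] >= threshold] for k, v in item.items()}
            for item in results]

results = few_ie(['10月22日从公司到首都机场38元过路费'])
print(filter_by_probability(results))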
UIE (Universal Information Extraction): Yaojie Lu et al. proposed the unified framework UIE for universal information extraction at ACL 2022. The framework models entity extraction, relation extraction, event extraction, sentiment analysis, and other tasks in a unified way, giving the different tasks good transfer and generalization between them. Drawing on the method in that paper, PaddleNLP trained and open-sourced UIE, the first Chinese universal information extraction model, on top of the knowledge-enhanced pre-trained model ERNIE 3.0. The model supports key information extraction without restrictions on industry domain or extraction targets, enables zero-shot rapid cold start, and offers strong few-shot fine-tuning to quickly adapt to specific extraction targets.
Advantages of UIE
Simple to use: users define extraction targets in natural language and extract the corresponding information from input text without any training. It works out of the box and covers all kinds of information extraction needs.
Lower cost, higher efficiency: previous information extraction techniques needed large amounts of annotated data to guarantee quality. Open-domain information extraction supports zero-shot and few-shot extraction, drastically reducing the dependence on annotated data and removing much repetitive development work, cutting costs while improving results.
Leading results: open-domain information extraction performs impressively across many scenarios and many tasks.
My blog: https://blog.csdn.net/sinat_39620217?type=blog
This article is a repost; original project: https://aistudio.baidu.com/aistudio/projectdetail/4160689