Model used
AutoModelForCausalLM
[examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling]
[examples/pytorch/language-modeling/run_clm.py]
Example:
[colab.research.google.com/Causal Language modeling]
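A minimal sketch of this causal-LM recipe along the lines of run_clm.py (the checkpoint, data files and column name are assumptions): the only substantive differences from the masked-LM setup below are the model class and mlm=False in the collator.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
checkpoint = "gpt2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(checkpoint)
raw = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])
# mlm=False: labels are the inputs themselves (shifted inside the model for next-token prediction).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="./clm_out", num_train_epochs=3),
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"],
                  data_collator=collator)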
[examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling]
[examples/pytorch/language-modeling/run_mlm.py]
Model used
AutoModelForMaskedLM; concretely, BertForMaskedLM can be used.
Things you may need to change in run_mlm.py:
1 max_seq_length is read with a default of 1024; if that differs from your model's maximum length, it may need to be changed.
2 There is logic that concatenates multiple texts into chunks of max_seq_length (tokenized_datasets = tokenized_datasets.map(group_texts...)); depending on your data, you may want to remove it.
3 AutoModelForMaskedLM/BertForMaskedLM pretrains only the MLM task, without NSP. To also train NSP, use BertForPreTraining instead. The MLM-only model does not have the following parameters, so they cannot be trained either: bert.pooler.dense.weight; bert.pooler.dense.bias; cls.seq_relationship.weight; cls.seq_relationship.bias. Which parameters exist depends on which model class you use to load the checkpoint (see the sketch below).
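A minimal sketch of the difference (the checkpoint name is an assumption): loading the same checkpoint with BertForMaskedLM vs. BertForPreTraining shows which NSP-related parameters are dropped or kept.
from transformers import BertForMaskedLM, BertForPreTraining
ckpt = "hfl/chinese-roberta-wwm-ext"  # assumed checkpoint; substitute your own
mlm_model = BertForMaskedLM.from_pretrained(ckpt)          # MLM head only
pretrain_model = BertForPreTraining.from_pretrained(ckpt)  # MLM + NSP heads
mlm_names = {n for n, _ in mlm_model.named_parameters()}
extra = [n for n, _ in pretrain_model.named_parameters() if n not in mlm_names]
print(extra)  # expect bert.pooler.dense.* and cls.seq_relationship.*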
Data processing
Compared with the whole-word-masking setup below, data processing here is simpler:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model, args=training_args, train_dataset=lm_datasets["train"], eval_dataset=lm_datasets["validation"], data_collator=data_collator,)
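The lm_datasets passed to the Trainer above are not built in this snippet; a minimal sketch of how they might be produced (the data files, column name and checkpoint are assumptions), tokenizing directly without the group_texts concatenation:
from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
raw_datasets = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
def tokenize_function(examples):
    # Truncate to the model's maximum length; DataCollatorForLanguageModeling
    # handles dynamic padding and random masking at batch time.
    return tokenizer(examples["text"], truncation=True, max_length=512)
lm_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])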
Example:
[colab.research.google.com/Masked language modeling]
TrainOutput(global_step=7218, training_loss=2.0377309222603213)
Perplexity: 6.37
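The reported perplexity is simply the exponential of the evaluation loss; a minimal way to reproduce it from trainer.evaluate():
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")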
[examples/pytorch/language-modeling#whole-word-masking]
[examples/research_projects/mlm_wwm]
Model used
Actually the same as above: AutoModelForMaskedLM, concretely BertForMaskedLM.
Data processing
Slightly more involved: data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=data_args.mlm_probability)
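A minimal sketch of plugging the whole-word-masking collator into a Trainer (the checkpoint path and output_dir are assumptions; lm_datasets is the tokenized dataset from the sketch above). Only the data collator changes relative to the plain MLM setup: it masks all sub-tokens of a selected word together instead of masking tokens independently.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForWholeWordMask, Trainer, TrainingArguments)
checkpoint = "hfl/chinese-roberta-wwm-ext"  # assumed checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./wwm_out", num_train_epochs=3),
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)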
In run_mlm.py's __main__ entry, hard-coding the arguments:
pretrained_model = './models/chinese-roberta-wwm-ext'
output_model = './models/chinese-roberta-wwm-ext_new'
dataset_name = 'data/text'
sys.argv = 'run_mlm.py --model_name_or_path {} --dataset_name {} --do_train --do_eval --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --num_train_epochs 3 --output_dir {}'.format(pretrained_model, dataset_name, output_model).split()
main()
In main(), the dataset loading is replaced with:
dataset_dict = load_dataset(data_args.dataset_name, sep='\t', header=0,
                            column_names=['id', 'text', 'label', 'type'],
                            usecols=['text'])  # , 'label'
dataset_dict = dataset_dict.filter(
    lambda line: line['text'] and len(line['text']) > 0 and len(line['text']) <= 500 and not line['text'].isspace())
dataset_dict = dataset_dict['train'].train_test_split(test_size=0.1, seed=123)
dataset_dict['validation'] = dataset_dict['test']
del dataset_dict['test']
print(dataset_dict)
Note:
1 The script is set up so that if a last_checkpoint exists in output_dir, or if model_name_or_path points at a checkpoint directory, training resumes from the already-trained global step: the progress bar jumps straight from 0% to, say, 60% and continues from there. Otherwise training starts from the specified model_name_or_path.
So for incremental training, make sure the new total number of global steps exceeds what was already trained. For example, if the previous run finished at data_size * epochs / batch_size = 8000 steps (iterations, i.e., number of batches), then the current run's data_size * epochs / batch_size must be larger than 8000. With the same batch_size, it is enough to raise the number of epochs; if batch_size was increased or the data shrank, epochs must be raised even more (see the sketch after these notes).
2 When training from a saved checkpoint directory, you will normally see
All model checkpoint weights were used when initializing BertForMaskedLM.
otherwise the log shows
Some weights of the model checkpoint at ./models/chinese-roberta-wwm-ext were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
3 Input sentences must be length-filtered/truncated (the lambda line: line['text'] and len(line['text']) > 0 and len(line['text']) <= 500 and not line['text'].isspace() above), otherwise you get: RuntimeError: The expanded size of the tensor (527) must match the existing size (512) at non-singleton dimension 1. Target sizes: [16, 527]. Tensor sizes: [1, 512]
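A tiny sketch of the step arithmetic from note 1, to sanity-check whether a continuation run will actually train anything (all numbers here are hypothetical):
import math
prev_steps = 8000              # optimization steps completed by the previous run
num_examples = 32_000          # planned continuation run
total_batch_size = 64          # per_device_train_batch_size * n_gpu
epochs = 20
new_total_steps = math.ceil(num_examples / total_batch_size) * epochs  # 500 * 20 = 10000
assert new_total_steps > prev_steps, "raise epochs, otherwise nothing new gets trained"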
__main__ - Process rank: 0, device: cuda:0, n_gpu: 4 distributed training: True, 16-bits training: False
per_device_eval_batch_size=16,
per_device_train_batch_size=16,
Downloading and preparing dataset csv to...Dataset csv downloaded and prepared to... # shown the first time the data loader runs
Found cached dataset csv ...datasets.arrow_dataset - Loading cached processed dataset at... # shown on the second and later runs (cached)
Map: 3%|█ | 3000/94431 [00:00<00:05, 16974.46 examples/s] # presumably the dataset.map preprocessing
[INFO|trainer.py:1786] >> ***** Running training *****
[INFO|trainer.py:1787] >> Num examples = 149,655
[INFO|trainer.py:1788] >> Num Epochs = 3
[INFO|trainer.py:1789] >> Instantaneous batch size per device = 64
[INFO|trainer.py:1790] >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1791] >> Gradient Accumulation steps = 1
[INFO|trainer.py:1792] >> Total optimization steps = 7,017
[INFO|trainer.py:1793] >> Number of trainable parameters = 102,290,312
10%|██ | 667/7017 [04:24<39:48, 2.66it/s]
{'loss': 0.9955, 'learning_rate': 4.643722388485108e-05, 'epoch': 0.21}...
{'loss': 0.7538, 'learning_rate': 1.211343879150634e-07, 'epoch': 2.99}
{'train_runtime': 2735.8275, 'train_samples_per_second': 164.106, 'train_steps_per_second': 2.565, 'train_loss': 0.8171015656944081, 'epoch': 3.0}
- INFO - __main__ - ***** Train results *****
- INFO - __main__ - epoch = 3.0
- INFO - __main__ - train_loss = 0.8171015656944081
- INFO - __main__ - train_runtime = 2735.8275
- INFO - __main__ - train_samples_per_second = 164.106
- INFO - __main__ - train_steps_per_second = 2.565
- INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3200] >> ***** Running Evaluation *****
[INFO|trainer.py:3202] >> Num examples = 16629
[INFO|trainer.py:3205] >> Batch size = 64
- __main__ - ***** Eval results *****
- INFO - __main__ - perplexity = 2.085315833957044
Note:
1 There is, up to rounding, the identity Total optimization steps * Total train batch size ≈ Num examples * Num Epochs, i.e. 7,017 * 64 ≈ 149,655 * 3; exactly, steps per epoch = ceil(149,655 / 64) = 2,339 and 2,339 * 3 = 7,017 (see the check after these notes).
If dataset_dict is passed through group_texts as in run_mlm.py (concatenate all texts from the dataset and generate chunks of max_seq_length), Num examples can be much smaller, because several examples are merged into one.
2 Total train batch size = 64, which is presumably per_device_train_batch_size * n_gpu = 16 * 4 = 64. The iteration/step count is computed against this 64, i.e., the number of iterations equals the number of batches.
3 10%|██ | 667/7017 [04:24<39:48, 2.66it/s]
The progress bar reads: 667 of 7017 iterations done, 04:24 elapsed, an estimated 39:48 remaining, at 2.66 iterations per second.
4 If this is the second or a later continued-training run, the following log is shown before training starts:
[INFO|trainer.py:1813] >> Continuing training from checkpoint, will skip to saved global_step
[INFO|trainer.py:1814] >> Continuing training from epoch 3
[INFO|trainer.py:1815] >> Continuing training from global step 7017
[INFO|trainer.py:1827] >> Will skip the first 3 epochs then the first 0 batches in the first epoch.
0%| | 0/11695 [00:00<?, ?it/s]
60%|████████ | 7018/11695 [00:10<00:07, 660.84it/s]
If batch_size and the data size have changed, the starting position is still the previous run's global step.
Continuing training from global step 11695
Will skip the first 15 epochs then the first 625 batches in the first epoch.
Here, for example, training resumes from step 11695: with the new batch_size and data size, if each epoch is 738 iterations, then 738 * 15 + 625 = 11695 exactly (verified in the sketch below).
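A quick check of the two step identities above, using the numbers from the logs in this post:
import math
assert math.ceil(149_655 / 64) * 3 == 7_017   # total optimization steps of the first run
assert 738 * 15 + 625 == 11_695               # resume position: 15 full epochs + 625 batches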
cd output_model/***
tensorboard --logdir=./run/
http://localhost:6006/ (Press CTRL+C to quit)
[PyTorch: TensorBoard visualization (柚子皮's CSDN blog)]
Note: tensorflow must be installed (it pulls in tensorboard as well); otherwise, at runtime the TrainingArguments parameter report_to=['tensorboard'] silently becomes [], no summaries are written, and the run directory is never created.
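To avoid depending on this auto-detection, the reporting target and log directory can be set explicitly (a sketch; the paths are placeholders matching the layout used above):
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./models/chinese-roberta-wwm-ext_new",
    report_to=["tensorboard"],  # request tensorboard logging explicitly
    logging_dir="./models/chinese-roberta-wwm-ext_new/run",  # where tensorboard --logdir should point
    logging_steps=50,
)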
from:-柚子皮-
ref: official docs on finetuning/pretraining language models [https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling]