
LLM: fine-tuning / continued pre-training of a pretrained language model (chinese-roberta-wwm-ext)

Model training

GPT-2/GPT and causal language modeling

Model used

AutoModelForCausalLM

[examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling]

[examples/pytorch/language-modeling/run_clm.py]

Example:

[colab.research.google.com/Causal Language modeling]
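A minimal sketch of the causal-LM route with the Trainer API (the checkpoint gpt2, the file train.txt and the hyper-parameters are placeholders, not values from the linked example):

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = 'gpt2'                               # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token         # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset('text', data_files={'train': 'train.txt'})    # hypothetical text file
tokenized = raw.map(lambda ex: tokenizer(ex['text'], truncation=True, max_length=512),
                    batched=True, remove_columns=['text'])

# mlm=False turns the collator into a causal-LM collator: labels are the input ids
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='clm_out', num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized['train'],
    data_collator=collator,
)
trainer.train()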

RoBERTa/BERT/DistilBERT and masked language modeling

[examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling]

[examples/pytorch/language-modeling/run_mlm.py]

Model used

AutoModelForMaskedLM; concretely this resolves to BertForMaskedLM.
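For reference, loading this outside the script (hfl/chinese-roberta-wwm-ext is the Hub id of one such checkpoint; a local directory like ./models/chinese-roberta-wwm-ext also works):

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name_or_path = 'hfl/chinese-roberta-wwm-ext'   # or a local path
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForMaskedLM.from_pretrained(model_name_or_path)
print(type(model).__name__)                          # -> BertForMaskedLM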

Things in run_mlm.py that may need changing:
1 max_seq_length is read with a default of 1024; if this differs from the model's maximum length, it may need to be changed.
2 There is logic that concatenates multiple texts into chunks of max_seq_length (tokenized_datasets = tokenized_datasets.map(group_texts...)); depending on your data you may want to remove it (see the sketch after this list).

3 AutoModelForMaskedLM/BertForMaskedLM here pre-trains only the MLM objective, without NSP. To add the NSP objective you need BertForPreTraining. The MLM-only model does not contain, and therefore cannot train, these parameters: bert.pooler.dense.weight, bert.pooler.dense.bias, cls.seq_relationship.weight, cls.seq_relationship.bias. Which parameters are present depends on which model class you use to load the checkpoint.
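For reference, the group_texts step mentioned in point 2 looks roughly like this in run_mlm.py (simplified; max_seq_length and tokenized_datasets come from the script):

from itertools import chain

def group_texts(examples):
    # concatenate all tokenized texts in the batch, then cut into chunks of max_seq_length
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # drop the small remainder at the end (the script can optionally pad instead)
    total_length = (total_length // max_seq_length) * max_seq_length
    return {
        k: [t[i:i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

tokenized_datasets = tokenized_datasets.map(group_texts, batched=True)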

Data processing

Data processing here is simpler than for whole word masking below.

from transformers import DataCollatorForLanguageModeling, Trainer

# mlm defaults to True, so only the masking probability needs to be passed
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

Example:

[colab.research.google.com/Masked language modeling]

TrainOutput(global_step=7218, training_loss=2.0377309222603213)

Perplexity: 6.37
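These two numbers come from trainer.train() and from exponentiating the evaluation loss; a minimal sketch, assuming the trainer defined above:

import math

train_result = trainer.train()                     # produces the TrainOutput shown above
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])   # this is how the example reports "Perplexity"
print(train_result.training_loss, perplexity)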

Whole word masking

[examples/pytorch/language-modeling#whole-word-masking]

[examples/research_projects/mlm_wwm]

Model used

Same as above: AutoModelForMaskedLM, concretely BertForMaskedLM.

Data processing

Slightly more involved:

data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=data_args.mlm_probability)
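A small sketch of what the whole-word-masking collator does (the checkpoint name is just an example; note that for Chinese, the mlm_wwm example additionally builds a chinese_ref field with word boundaries from LTP so that whole words rather than single characters get masked together):

from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained('hfl/chinese-roberta-wwm-ext')
data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

features = [tokenizer('whole word masking masks all sub-tokens of a word together')]
batch = data_collator(features)
print(batch['input_ids'][0])   # some positions replaced by [MASK]
print(batch['labels'][0])      # original ids at masked positions, -100 everywhere else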

In the __main__ block:

  1. pretrained_model = './models/chinese-roberta-wwm-ext'
  2. output_model = './models/chinese-roberta-wwm-ext_new'
  3. dataset_name = 'data/text'
  4. sys.argv = 'run_mlm.py --model_name_or_path {} --dataset_name {} --do_train --do_eval --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --num_train_epochs 3 --output_dir {}'.format(pretrained_model, dataset_name, output_model).split()
  5. main()

In main():

  1. dataset_dict = load_dataset(data_args.dataset_name, sep='\t', header=0,
  2. column_names=['id', 'text', 'label', 'type'],
  3. usecols=['text']) # , 'label'
  4. dataset_dict = dataset_dict.filter(
  5. lambda line: line['text'] and len(line['text']) > 0 and len(line['text']) <= 500 and not line['text'].isspace())
  6. dataset_dict = dataset_dict['train'].train_test_split(test_size=0.1, seed=123)
  7. dataset_dict['validation'] = dataset_dict['test']
  8. del dataset_dict['test']
  9. print(dataset_dict)

Note: 

1 The script's behaviour: if output_dir contains a last_checkpoint, or if model_name_or_path points to a checkpoint directory, training resumes from the already-reached global step, and the progress bar jumps straight from 0% to, say, 60% and continues from there; otherwise training starts fresh from the given model_name_or_path.

So for incremental training, make sure the new run's total number of global steps exceeds the previous run's. For example, if the previous run finished at data_size * epochs / batch_size = 8000 steps (iterations, i.e. number of batches), then the current run's data_size * epochs / batch_size must be larger than 8000. If batch_size is unchanged, simply set epochs higher; if batch_size was increased or the dataset shrank, epochs must be increased even more (see the sketch after these notes).

2 When training starts from a saved checkpoint directory, you normally see
All model checkpoint weights were used when initializing BertForMaskedLM.
Otherwise the warning is
Some weights of the model checkpoint at ./models/chinese-roberta-wwm-ext were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']

3 Input sentences must be truncated (hence the filter lambda line: line['text'] and len(line['text']) > 0 and len(line['text']) <= 500 and not line['text'].isspace()), otherwise you get: RuntimeError: The expanded size of the tensor (527) must match the existing size (512) at non-singleton dimension 1. Target sizes: [16, 527]. Tensor sizes: [1, 512]
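A back-of-the-envelope check for point 1, with made-up numbers (prev_global_step, num_examples and total_batch_size are hypothetical):

import math

prev_global_step = 8000        # where the previous run stopped (example value from the note above)
num_examples = 100_000         # size of the new training set (hypothetical)
total_batch_size = 64          # per_device_train_batch_size * n_gpu (hypothetical)

steps_per_epoch = math.ceil(num_examples / total_batch_size)
# smallest num_train_epochs whose total step count exceeds the old global step;
# anything smaller is skipped entirely and training finishes immediately
num_train_epochs = prev_global_step // steps_per_epoch + 1
print(steps_per_epoch, steps_per_epoch * num_train_epochs)   # 1563, 9378 (> 8000)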

Run log

__main__ - Process rank: 0, device: cuda:0, n_gpu: 4 distributed training: True, 16-bits training: False

per_device_eval_batch_size=16,
per_device_train_batch_size=16,

Downloading and preparing dataset csv to...Dataset csv downloaded and prepared to...  # shown on the first run of the data loader

Found cached dataset csv ...datasets.arrow_dataset - Loading cached processed dataset at...  # shown on the second and later runs of the data loader

Map:   3%|█                | 3000/94431 [00:00<00:05, 16974.46 examples/s]   # presumably the dataset.map processing

[INFO|trainer.py:1786] >> ***** Running training *****
[INFO|trainer.py:1787] >>   Num examples = 149,655
[INFO|trainer.py:1788] >>   Num Epochs = 3
[INFO|trainer.py:1789] >>   Instantaneous batch size per device = 64
[INFO|trainer.py:1790] >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1791] >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1792] >>   Total optimization steps = 7,017
[INFO|trainer.py:1793]  >>   Number of trainable parameters = 102,290,312

10%|██                  | 667/7017 [04:24<39:48,  2.66it/s]

{'loss': 0.9955, 'learning_rate': 4.643722388485108e-05, 'epoch': 0.21}...
{'loss': 0.7538, 'learning_rate': 1.211343879150634e-07, 'epoch': 2.99}
{'train_runtime': 2735.8275, 'train_samples_per_second': 164.106, 'train_steps_per_second': 2.565, 'train_loss': 0.8171015656944081, 'epoch': 3.0}
 - INFO - __main__ - ***** Train results *****
 - INFO - __main__ -   epoch = 3.0
 - INFO - __main__ -   train_loss = 0.8171015656944081
 - INFO - __main__ -   train_runtime = 2735.8275
 - INFO - __main__ -   train_samples_per_second = 164.106
 - INFO - __main__ -   train_steps_per_second = 2.565
 - INFO - __main__ - *** Evaluate ***
[INFO|trainer.py:3200]  >> ***** Running Evaluation *****
[INFO|trainer.py:3202]  >>   Num examples = 16629
[INFO|trainer.py:3205]  >>   Batch size = 64
- __main__ - ***** Eval results *****
 - INFO - __main__ -   perplexity = 2.085315833957044

Note:

1 Note the relationship: Total optimization steps 7,017 * Instantaneous batch size per device 64 ≈ Num examples 149,655 * Num Epochs 3. More precisely, steps = ceil(149,655 / 64) * 3 = 2,339 * 3 = 7,017, since the last, smaller batch of each epoch still counts as one step.

If, as in run_mlm.py, group_texts is applied to dataset_dict (concatenate all texts from the dataset and generate chunks of max_seq_length), Num examples can be much smaller, because several examples are merged into one.

2 Total train batch size = 64, presumably per_device_train_batch_size * n_gpu = 16 * 4 = 64. The step/iteration count is computed with this 64, i.e. the number of iterations equals the number of batches.

3 10%|██                  | 667/7017 [04:24<39:48,  2.66it/s]
This progress line means: 667 of 7017 iterations done, 04:24 elapsed, 39:48 remaining, running at 2.66 iterations per second.

4 If this is the second (or a later) run of continued training, the log shown before training is:
[INFO|trainer.py:1813] >>   Continuing training from checkpoint, will skip to saved global_step
[INFO|trainer.py:1814] >>   Continuing training from epoch 3
[INFO|trainer.py:1815] >>   Continuing training from global step 7017
[INFO|trainer.py:1827] >>   Will skip the first 3 epochs then the first 0 batches in the first epoch.
0%|                                       | 0/11695 [00:00<?, ?it/s]
60%|████████               | 7018/11695 [00:10<00:07, 660.84it/s]

If batch_size and the dataset size were changed, training still resumes from the previous run's global step:
Continuing training from global step 11695
Will skip the first 15 epochs then the first 625 batches in the first epoch.
Here it resumes at step 11695: with the new batch_size and data size, if each epoch has, say, 738 iterations, then 738 * 15 + 625 = 11695 exactly (see the sketch below).
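The skip position can be reproduced directly from the numbers in that log (a minimal sketch):

# numbers taken from the log above: resume at global step 11695, 738 steps per epoch
global_step = 11695
steps_per_epoch = 738

epochs_to_skip = global_step // steps_per_epoch    # 15
batches_to_skip = global_step % steps_per_epoch    # 625
print(epochs_to_skip, batches_to_skip)             # matches "skip the first 15 epochs then the first 625 batches"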

TensorBoard visualization

cd output_model/***
tensorboard --logdir=./run/
http://localhost:6006/ (Press CTRL+C to quit)
[PyTorch: TensorBoard visualization - 柚子皮 - CSDN blog]

Note: TensorFlow must be installed (installing it also pulls in TensorBoard); otherwise the TrainingArguments parameter report_to=['tensorboard'] silently becomes [], no summaries are written, and the run directory is never created.
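To make this explicit rather than relying on what happens to be installed, report_to and logging_dir can be set directly on TrainingArguments; a minimal sketch (the directory values are just examples):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./models/chinese-roberta-wwm-ext_new',
    report_to=['tensorboard'],                                 # request the TensorBoard logger explicitly
    logging_dir='./models/chinese-roberta-wwm-ext_new/runs',   # where the event files are written
    logging_steps=100,                                         # log loss/learning-rate every 100 steps
)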

from:-柚子皮-

ref: official docs on fine-tuning pretrained LMs [https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling]

All possible training arguments: [src/transformers/training_args.py]
