LLaMA-Factory on Huawei Accelerator Cards: Experiment Notes

Author: 从前慢现在也慢 | 2024-08-13 12:31:24

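How to check whether your chosen model is supported
LLaMA-Factory/src/llamafactory/data/template.py
This file in the project holds the prompt templates for the supported models.

I'll use the two models I work with most as examples. The first is Zhipu's glm4-9B: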

_register_template(
    name="glm4",
    format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>"]),
    format_assistant=StringFormatter(slots=["\n{{content}}"]),
    format_system=StringFormatter(slots=["<|system|>\n{{content}}"]),
    format_function=FunctionFormatter(slots=["{{name}}\n{{arguments}}"]),
    format_observation=StringFormatter(slots=["<|observation|>\n{{content}}<|assistant|>"]),
    format_tools=ToolFormatter(tool_format="glm4"),
    format_prefix=EmptyFormatter(slots=["[gMASK]<sop>"]),
    stop_words=["<|user|>", "<|observation|>"],
    efficient_eos=True,
)

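This code registers a prompt template. Here is what each argument does:

`_register_template(...)`

This function registers a template under the name "glm4".

Parameters:

name="glm4":
- The name of the template, "glm4". This is the value the `template` field of the training config refers to.
format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>"]):
- format_user formats the user's input.
- The slot is a string with a {{content}} placeholder. The user's text replaces the placeholder, preceded by the <|user|> token and a newline and followed by <|assistant|>, so the model continues with the assistant's reply.
format_assistant=StringFormatter(slots=["\n{{content}}"]):
- format_assistant formats the assistant's output: a newline followed by the reply text.
format_system=StringFormatter(slots=["<|system|>\n{{content}}"]):
- format_system formats the system prompt, prefixed with the <|system|> token and a newline.
format_function=FunctionFormatter(slots=["{{name}}\n{{arguments}}"]):
- format_function formats a function (tool) call.
- The {{name}} and {{arguments}} placeholders receive the function name and its arguments.
format_observation=StringFormatter(slots=["<|observation|>\n{{content}}<|assistant|>"]):
- format_observation formats an observation, i.e. the result returned by a tool.
- It has the same shape as the user slot, but starts with the <|observation|> token.
format_tools=ToolFormatter(tool_format="glm4"):
- format_tools formats the tool definitions.
- ToolFormatter(tool_format="glm4") means the tool definitions are written in the "glm4" format.
format_prefix=EmptyFormatter(slots=["[gMASK]<sop>"]):
- format_prefix formats the prefix of the sequence.
- EmptyFormatter(slots=["[gMASK]<sop>"]) emits the fixed string "[gMASK]<sop>".
stop_words=["<|user|>", "<|observation|>"]:
- stop_words lists the tokens at which generation should stop, here <|user|> and <|observation|>. The training log further down confirms this with the line "Add <|user|>,<|observation|> to stop words."
efficient_eos=True:
- efficient_eos is a boolean that controls how the EOS (end-of-sequence) token is handled. It is set to True here, which presumably switches to a more economical way of marking the end of each turn.

In short, this code defines a template named "glm4" made up of formatters for the user input, assistant output, system prompt, function calls, observations, tools and prefix, plus a few settings. A template like this gives every kind of input and output one consistent format, which the framework then uses when it builds training and inference sequences.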
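To make the slot mechanism concrete, here is a minimal sketch of how the glm4 slot strings turn a short conversation into one prompt string. It is a simplified stand-in written for this post, not LLaMA-Factory's actual `StringFormatter`; the names `GLM4_SLOTS`, `fill_slot` and `render_glm4` are made up for illustration.

```python
from typing import List, Tuple

# Illustrative only: a simplified stand-in for slot substitution,
# not LLaMA-Factory's real formatter classes.
GLM4_SLOTS = {
    "prefix": "[gMASK]<sop>",
    "system": "<|system|>\n{{content}}",
    "user": "<|user|>\n{{content}}<|assistant|>",
    "assistant": "\n{{content}}",
}


def fill_slot(slot: str, content: str) -> str:
    """Replace the {{content}} placeholder with the given text."""
    return slot.replace("{{content}}", content)


def render_glm4(messages: List[Tuple[str, str]]) -> str:
    """Render (role, text) pairs into one prompt string."""
    parts = [GLM4_SLOTS["prefix"]]
    for role, text in messages:
        parts.append(fill_slot(GLM4_SLOTS[role], text))
    return "".join(parts)


prompt = render_glm4([("user", "hi"), ("assistant", "Hello!")])
print(repr(prompt))
```

The rendered string has the same layout as the decoded `inputs` sample printed in the training log later in this post: prefix, user turn, then the assistant turn.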

_register_template(
    name="qwen",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    format_observation=StringFormatter(slots=["<|im_start|>tool\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_separator=EmptyFormatter(slots=["\n"]),
    default_system="You are a helpful assistant.",
    stop_words=["<|im_end|>"],
    replace_eos=True,
)

As far as I can tell, every qwen model in LLaMA-Factory currently uses this one template.

Reduced to the bare minimum, these are the data structures I currently use in each of the three stages.
Pre-training data structure

{"text":""}

The matching entry to add to dataset_info.json:

"pre_dataset_name": {
  "file_name": "path of the pre-training data file under the data directory",
  "columns": {
    "prompt": "text"
  }
}
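As a rough sketch of how the two pieces fit together, the snippet below writes a few texts as JSONL in the `{"text": ...}` shape and builds the matching dataset entry. The file name `pretrain_demo.jsonl`, the sample texts and both helper functions are hypothetical; only the `file_name` and `columns` keys come from the config above.

```python
import json
import tempfile
from pathlib import Path


def write_pretrain_jsonl(texts, path):
    """Write one {"text": ...} object per line and return the record count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


def pretrain_dataset_entry(name, file_name):
    """Build a dataset entry that maps the "prompt" column to the "text" field."""
    return {name: {"file_name": file_name, "columns": {"prompt": "text"}}}


# Hypothetical file name and sample texts, for illustration only.
data_dir = Path(tempfile.mkdtemp())
n_written = write_pretrain_jsonl(
    ["first document", "第二篇文档"], data_dir / "pretrain_demo.jsonl"
)
entry = pretrain_dataset_entry("pre_dataset_name", "pretrain_demo.jsonl")
print(n_written, json.dumps(entry, ensure_ascii=False))
```

Passing `ensure_ascii=False` keeps Chinese text readable in the output file instead of escaping it.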

Fine-tuning data structure

{"input_colum": "根据:TWY:滑行道;BTN:在……之间;TWY:滑行道;AND:与;TWY:滑行道;AVBL:可供使用;FOR:为了;OPS.:作业、运行、经营、操作、运转;DRG:在……期间;FLW:如下，以下;TWY:滑行道;FOR:为了;ACFT:航空器;ACFT:航空器;IN:在;APN:停机坪;FOR:为了;ACFT:航空器;ONLY.:只能;AND:与;ACFT:航空器;ON:在;RWY:跑道，逐词翻译：PORTIONOFTWYMBTNTWYLINK31ANDTWYLINK32NOTAVBLFOROPS.\nDRGTHISPERIODFLWRESTRICTIONSSHALLAPPLY:\n1.COMPATIBILITYOFTWYKRESTRICTEDFORACFTUPTOWINGSPAN68.40M.\n2.ACFTSTAND265INCARGOAPNDOWNGRADEDFORACFTUPTOWINGSPAN68.40MONLY.\n3.MOVEMENTOFA388ANDAN124ACFTONRWY10/28NOTPERMITED.","output_colum": "<部分:PORTION:0> <的:OF:1> <滑行道:TWY:2> <M:M:3> <在:BTN:4.1> <之间:BTN:4.2> <滑行道:TWY:5> <连接:LINK:6> <31:31:7> <与:AND:8> <滑行道:TWY:9> <连接:LINK:10> <32:32:11> <不可用:NOT AVBL:12> <因为:FOR:13> <运行:OPS:14> <.:.:15> <在……期间:DRG:16> <这个:THIS:17> <时期:PERIOD:18> <如下，以下:FLW:19> <限制:RESTRICTIONS:20> <应该:SHALL:21> <适用:APPLY:22> <::::23> <1:1:24> <.:.:25> <兼容:COMPATIBILITY:26> <的:OF:27> <滑行道:TWY:28> <K:K:29> <被限制:RESTRICTED:30> <对于:FOR:31> <航空器:ACFT:32> <到:UPTO:33> <翼展:WINGSPAN:34> <68.40M:68.40M:35> <.:.:36> <2:2:37> <.:.:38> <航空器:ACFT:39> <停在:STAND:40> <265:265:41> <在:IN:42> <货物:CARGO:43> <停机坪:APN:44> <降级:DOWNGRADED:45> <对于:FOR:46> <航空器的:ACFT:47> <到:UPTO:48> <翼展:WING SPAN:49> <68.40M:68.40M:50> <只能:ONLY:51> <.:.:52> <3:3:53> <.:.:54> <移动:MOVEMENT:55> <的:OF:56> <A388:A388:57> <与:AND:58> <AN124:AN124:59> <航空器:ACFT:60> <在:ON:61> <跑道:RWY:62> <10/28:10/28:63> <不:NOT:64> <被允许:PERMITED:65> <.:.:66> "}

The matching entry in dataset_info.json:

"sft_dataset_name": {
  "file_name": "path of the fine-tuning data file under the data directory",
  "columns": {
    "query": "input_colum",
    "response": "output_colum"
  }
}
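A mismatch between the column mapping and the field names in the data file is easy to introduce, so a quick check before training can save a failed run. The sketch below is a hypothetical helper, not part of LLaMA-Factory; the record is a shortened, made-up stand-in for a real line of the fine-tuning file.

```python
import json


def missing_columns(record: dict, columns: dict) -> list:
    """Return the mapped field names that are absent from a record."""
    return [field for field in columns.values() if field not in record]


# Column mapping copied from the sft_dataset_name entry.
SFT_COLUMNS = {"query": "input_colum", "response": "output_colum"}

# Shortened, made-up record; real records carry the full source and target text.
line = '{"input_colum": "sample source text", "output_colum": "sample target text"}'
record = json.loads(line)
print(missing_columns(record, SFT_COLUMNS))
```

Running the same check over every line of the JSONL file would flag records with a misspelled or missing field.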

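Because the dataset is fairly large I use JSONL; with this much data, a single JSON file caused errors for me.
Compared with older versions of LLaMA-Factory, newer versions can tokenize in parallel (see `preprocessing_num_workers` in the config below), which makes preprocessing faster.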
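If your data starts out as one JSON array, converting it to JSONL takes only the standard library. This is a minimal sketch with a hypothetical helper name and demo files; it still loads the whole array into memory once, so for files too big for that a streaming parser would be needed.

```python
import json
import os
import tempfile


def json_array_to_jsonl(src_path: str, dst_path: str) -> int:
    """Rewrite a JSON array file as JSONL and return the number of records."""
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)  # loads the whole array into memory once
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


# Demo with a throwaway file; the names are made up for illustration.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "demo.json")
dst = os.path.join(tmp_dir, "demo.jsonl")
with open(src, "w", encoding="utf-8") as f:
    json.dump([{"text": "a"}, {"text": "b"}], f)
converted = json_array_to_jsonl(src, dst)
print(converted)
```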

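With the data ready, we move on to the training command. Mind the details here: the pre-training dataset is called pre_dataset_name and the fine-tuning dataset is called sft_dataset_name. My environment is in mainland China, so we need one setting that makes the model download go through the ModelScope community.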

export USE_MODELSCOPE_HUB=1 # on Windows use `set USE_MODELSCOPE_HUB=1`

After enabling ModelScope, remember to install the modelscope package.

pip install modelscope -U

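Here I use a somewhat dated way of driving the Huawei NPU: going through the torch-npu module.
Before training, here is a quick overview of the training stages LLaMA-Factory supports.
The training stages supported by LLaMA-Factory:

1. dpo, preference training - short for Direct Preference Optimization. The model is trained directly on preference pairs (a chosen and a rejected answer to the same prompt), without a separate reward model or a reinforcement learning loop.

2. kto, preference training - short for Kahneman-Tversky Optimization. A preference alignment method that learns from individual answers labelled as desirable or undesirable, rather than from paired comparisons.

3. ppo, reinforcement learning - short for Proximal Policy Optimization. A reinforcement learning algorithm that, in RLHF, optimizes the model's policy against the scores of a reward model.

4. pt, pre-training - short for Pre-training. The model is trained on a large unlabelled corpus so that it learns general language representations, which can then be fine-tuned for downstream tasks. Continued pre-training on domain text falls under this stage as well.

5. rm, reward model training - short for Reward Modeling. A reward model is trained on preference data to score answers; that model then supplies the feedback signal used by ppo.

6. sft, fine-tuning - short for Supervised Fine-Tuning. A pre-trained model is fine-tuned on labelled data for a specific task to improve its performance on that task.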
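To show what "training directly on preference pairs" means for dpo, here is a small sketch of the DPO loss for a single pair, computed from sequence log-probabilities. It only illustrates the formula and is not LLaMA-Factory's implementation; the log-probability values and `beta=0.1` are made-up example numbers.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: the margin is zero.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has moved toward the chosen answer: the loss drops.
print(dpo_loss(-5.0, -10.0, -10.0, -10.0))
```

When the policy equals the reference the loss is log 2, and it shrinks as the policy favours the chosen answer more strongly than the reference does.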

Step 1: set up the config file for pre-training. I recommend glm4 here.

### model
model_name_or_path: ZhipuAI/glm-4-9b

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: pre_dataset_name
template: glm4
cutoff_len: 4096
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/glm-4-9b/full/pt
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500


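Here the training stage is set to pt, i.e. pre-training. On the OpenI platform you can select up to four Ascend 910 cards for training, that is 4 × 32 GB of device memory, which is enough for pre-training.

For better pre-training results, you can tune the following parameters.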

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1

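This part of the config file sets the concrete training parameters:

train

per_device_train_batch_size: the batch size on each training device (for example a GPU or an NPU). It is set to 1 here, so each device processes one sample per training step. A smaller batch size lowers memory use but may need more steps to converge.
gradient_accumulation_steps: the number of steps over which gradients are accumulated before the weights are updated. It is set to 2 here, so the model updates its weights after accumulating two steps of gradients. This simulates a larger batch size without increasing memory use.
learning_rate: the learning rate determines how fast the model parameters are updated. It is set to 1.0e-4 (0.0001) here, a common starting value. The choice matters a lot: too high and training can become unstable, too low and training is slow.
num_train_epochs: the number of full passes over the training data. It is set to 3.0 here, so the model sees the whole training set three times. More epochs can improve performance but can also lead to overfitting.
lr_scheduler_type: the learning rate scheduler adjusts the learning rate during training. It is set to "cosine" here, so after warmup the learning rate follows a cosine curve, staying relatively high early on and decreasing gradually toward the end of training.
warmup_ratio: the fraction of training spent on learning rate warmup. It is set to 0.1 here, so during the first 10% of training the learning rate rises gradually from 0 to the configured learning rate. Warmup helps stabilize learning at the start of training.
Together these parameters determine the efficiency and quality of training. Tuning them can improve the model's performance while keeping training effective and stable.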
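The arithmetic behind these settings can be sketched in a few lines. The functions below are simplified illustrations, not the exact scheduler used by the training library; the device count of 4 is an assumption taken from the four-card setup mentioned earlier, and `total_steps=1000` is a made-up number.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices):
    """Number of samples that contribute to one optimizer step."""
    return per_device * grad_accum * num_devices


def lr_at_step(step, total_steps, peak_lr=1.0e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Batch size 1, accumulation 2, assumed 4 devices.
print(effective_batch_size(1, 2, 4))
# Learning rate at the start, end of warmup, midpoint of decay, and final step.
for step in (0, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

With these values each optimizer step sees 8 samples, so raising `gradient_accumulation_steps` is the cheapest way to get a larger effective batch when device memory is the limit.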
Now we install the LLaMA-Factory training framework for the NPU.
Step 1: install the NPU build of LLaMA-Factory

pip install -e '.[torch-npu,metrics]'

Step 2: install the NPU environment

# Replace the URLs with the ones matching your CANN version and device model
# Install the CANN Toolkit
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install

# Install the CANN Kernels
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install

# Set environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
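Before moving on, it can help to confirm that PyTorch actually sees the NPU. The sketch below assumes that importing `torch_npu` registers the `torch.npu` namespace and that `torch.npu.is_available()` reports device visibility, which is how the plugin is commonly used; treat both as assumptions and check the torch_npu documentation for your version. The function falls back to `False` when anything is missing, so it is safe to run on any machine.

```python
def npu_available() -> bool:
    """Return True only if torch and torch_npu import and an NPU is visible."""
    try:
        import torch
        import torch_npu  # noqa: F401  (assumed to register the NPU backend with torch)
    except Exception:
        # Missing packages, or CANN libraries not found on this machine.
        return False
    try:
        return bool(torch.npu.is_available())
    except Exception:
        return False


print("NPU available:", npu_available())
```

If this prints `False` on the NPU machine, re-check that `set_env.sh` was sourced in the same shell before debugging anything else.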

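Step 3: I ran into a small bug during installation. Because I don't have root access on the cloud platform, I used conda to install the missing dependency.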

conda install -c conda-forge libsndfile

There is one more library that improves performance:

conda install conda-forge::libaio

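On a single machine with multiple cards, DeepSpeed ZeRO-3 gives higher compute efficiency than the native single-machine multi-card setup.
Step 4: install deepspeed.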

pip install deepspeed

Next, run the command to start training

llamafactory-cli train LLaMA-Factory/examples/train_full/glm4_full_pt_ds3.yaml

What the log of a successful start looks like

[2024-07-09 09:10:46,149] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to npu (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
07/09/2024 09:11:02 - INFO - llamafactory.hparams.parser - Process rank: 0, device: npu:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.35k/1.35k [00:00<00:00, 4.44kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36.0/36.0 [00:00<00:00, 86.4B/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.21k/2.21k [00:00<00:00, 5.46kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 205/205 [00:00<00:00, 451B/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.34k/6.34k [00:00<00:00, 19.5kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.81G/1.81G [01:23<00:00, 23.2MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [01:36<00:00, 18.8MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [01:36<00:00, 18.8MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.83G/1.83G [01:18<00:00, 25.1MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.80G/1.80G [01:19<00:00, 24.1MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [01:15<00:00, 24.0MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.83G/1.83G [01:22<00:00, 24.0MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.80G/1.80G [01:12<00:00, 26.4MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [01:03<00:00, 28.6MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.83G/1.83G [01:10<00:00, 27.8MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.54G/1.54G [01:00<00:00, 27.1MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.4k/28.4k [00:00<00:00, 65.7kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57.1k/57.1k [00:00<00:00, 100kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.34k/3.34k [00:00<00:00, 11.8kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.78k/3.78k [00:00<00:00, 12.2kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15.3k/15.3k [00:00<00:00, 28.9kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.50M/2.50M [00:00<00:00, 3.07MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.12k/3.12k [00:00<00:00, 9.51kB/s]
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file tokenizer.json
[WARNING|logging.py:313] 2024-07-09 09:24:49,392 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/09/2024 09:24:49 - INFO - llamafactory.data.template - Add <|user|>,<|observation|> to stop words.
07/09/2024 09:24:49 - INFO - llamafactory.data.loader - Loading dataset identity.json...
Generating train split: 91 examples [00:00, 1770.27 examples/s]
Converting format of dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 91/91 [00:00<00:00, 187.32 examples/s]
07/09/2024 09:25:01 - INFO - llamafactory.data.loader - Loading dataset alpaca_en_demo.json...
Generating train split: 1000 examples [00:00, 19614.77 examples/s]
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2385.01 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1091/1091 [00:43<00:00, 24.98 examples/s]
input_ids:
[151331, 151333, 151336, 198, 6023, 151337, 198, 9703, 0, 358, 1079, 5867, 606, 37953, 458, 15223, 17821, 7881, 553, 5867, 3094, 3417, 13, 2585, 646, 358, 7789, 498, 3351, 30, 151329]
inputs:
[gMASK] <sop> <|user|> 
hi <|assistant|> 
Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today? <|endoftext|>
label_ids:
[-100, -100, -100, -100, -100, -100, 198, 9703, 0, 358, 1079, 5867, 606, 37953, 458, 15223, 17821, 7881, 553, 5867, 3094, 3417, 13, 2585, 646, 358, 7789, 498, 3351, 30, 151329]
labels:

Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today? <|endoftext|>
[INFO|configuration_utils.py:731] 2024-07-09 09:26:01,831 >> loading configuration file /root/.cache/modelscope/hub/ZhipuAI/glm-4-9b/config.json
[INFO|configuration_utils.py:731] 2024-07-09 09:26:01,844 >> loading configuration file /root/.cache/modelscope/hub/ZhipuAI/glm-4-9b/config.json
[INFO|configuration_utils.py:800] 2024-07-09 09:26:01,846 >> Model config ChatGLMConfig {
  "_name_or_path": "/root/.cache/modelscope/hub/ZhipuAI/glm-4-9b",
  "add_bias_linear": false,
  "add_qkv_bias": true,
  "apply_query_key_layer_scaling": true,
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "ChatGLMModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForCausalLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSequenceClassification": "modeling_chatglm.ChatGLMForSequenceClassification"
  },
  "bias_dropout_fusion": true,
  "classifier_dropout": null,
  "eos_token_id": [
    151329,
    151336,
    151338
  ],
  "ffn_hidden_size": 13696,
  "fp32_residual_connection": false,
  "hidden_dropout": 0.0,
  "hidden_size": 4096,
  "kv_channels": 128,
  "layernorm_epsilon": 1.5625e-07,
  "model_type": "chatglm",
  "multi_query_attention": true,
  "multi_query_group_num": 2,
  "num_attention_heads": 32,
  "num_layers": 40,
  "original_rope": true,
  "pad_token_id": 151329,
  "padded_vocab_size": 151552,
  "post_layer_norm": true,
  "rmsnorm": true,
  "rope_ratio": 1,
  "seq_length": 8192,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "vocab_size": 151552
}

[INFO|modeling_utils.py:3553] 2024-07-09 09:26:01,975 >> loading weights file /root/.cache/modelscope/hub/ZhipuAI/glm-4-9b/model.safetensors.index.json
[INFO|modeling_utils.py:3698] 2024-07-09 09:26:01,976 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2024-07-09 09:26:01,979] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-09 09:26:01,979] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...

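Reading the log
The log records the start of a model training run. Going through it part by part:

INFO and WARNING messages:
- Setting ds_accelerator to npu (auto detect): DeepSpeed auto-detected the NPU (neural processing unit) and will use it as the accelerator.
- async_io requires the dev libaio .so object and headers but these were not found.: a warning that the libaio library is missing, which may affect asynchronous I/O performance.
- If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.: if libaio is already installed, setting the CFLAGS and LDFLAGS environment variables may help locate it.
Downloading and processing:
- The many Downloading lines are the model files being fetched; the entries of more than a gigabyte are the weight shards. The Converting format of dataset and Running tokenizer on dataset lines show the dataset being converted and tokenized. Both are routine steps before training.
Model configuration and loading:
- The model config lists the model's settings, such as the number of layers and the hidden size.
- loading weights file /root/.cache/modelscope/hub/ZhipuAI/glm-4-9b/model.safetensors.index.json means the model's weight files are being loaded.
- Detected DeepSpeed ZeRO-3: activating zero.init() for this model means DeepSpeed ZeRO-3 was detected, a technique that reduces the memory used for model training.
MPI environment detection:
- Not using the DeepSpeed or dist launchers, attempting to detect MPI environment... means the run was not started through the DeepSpeed or a distributed launcher, so DeepSpeed is trying to detect an MPI environment.

To sum up, the log records a model training run on the NPU: model download, dataset processing, model configuration and loading, plus some environment warnings and suggestions.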
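The `input_ids` and `label_ids` lines in the log also show how the loss is restricted to the assistant's reply: the prompt positions in the labels are replaced with -100, the value the loss function skips. A minimal sketch of that masking, using the first ten token ids of the logged sample (`mask_prompt` is a hypothetical helper, not LLaMA-Factory's code):

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_prompt(input_ids, prompt_len):
    """Copy input_ids into labels, hiding the first prompt_len tokens."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# First ten token ids of the sample printed in the log.
input_ids = [151331, 151333, 151336, 198, 6023, 151337, 198, 9703, 0, 358]
labels = mask_prompt(input_ids, prompt_len=6)
print(labels)
```

The six masked positions correspond to `[gMASK] <sop> <|user|> \n hi <|assistant|>`, so only the reply tokens contribute to the loss.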
An error was then reported: MPI environment detection failed, so I installed MPI manually.

conda install -c conda-forge mpi4py openmpi

The install printed the following message. Let's go through it.

On Linux, Open MPI is built with UCX support but it is disabled by default.                                                                                                              
To enable it, first install UCX (conda install -c conda-forge ucx).                                                                                                                      
Afterwards, set the environment variables                                                                                                                                                
OMPI_MCA_pml=ucx OMPI_MCA_osc=ucx                                                                                                                                                        
before launching your MPI processes.                                                                                                                                                     
Equivalently, you can set the MCA parameters in the command line:
mpiexec --mca pml ucx --mca osc ucx ...


On Linux, Open MPI is built with CUDA awareness but it is disabled by default.
To enable it, please set the environment variable
OMPI_MCA_opal_cuda_support=true
before launching your MPI processes.
Equivalently, you can set the MCA parameter in the command line:
mpiexec --mca opal_cuda_support 1 ...
Note that you might also need to set UCX_MEMTYPE_CACHE=n for CUDA awareness via
UCX. Please consult UCX documentation for further details.


done

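This message explains how to enable UCX (Unified Communication X) support and CUDA (Compute Unified Device Architecture) awareness for Open MPI on Linux. UCX is a high-performance communication library that supports several transports (InfiniBand, RoCE, TCP/IP and so on), and CUDA is NVIDIA's parallel computing platform and programming model.
Here is what it says:

Enabling UCX support:
- Open MPI on Linux is built with UCX support, but it is disabled by default.
- To enable it, first install UCX. With conda the command is conda install -c conda-forge ucx.
- After installing UCX, set the environment variables OMPI_MCA_pml=ucx and OMPI_MCA_osc=ucx before launching your MPI processes.
- Alternatively, set the MCA parameters on the command line: mpiexec --mca pml ucx --mca osc ucx ...
Enabling CUDA awareness:
- Open MPI on Linux is also built with CUDA awareness, which is likewise disabled by default.
- To enable it, set the environment variable OMPI_MCA_opal_cuda_support=true.
- Again, the MCA parameter can be set on the command line instead: mpiexec --mca opal_cuda_support 1 ...
- To get CUDA awareness through UCX you may also need to set UCX_MEMTYPE_CACHE=n. See the UCX documentation for details.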

I thought I could wrap up early. Heh, a new problem surfaced.

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/MindSpore/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/tmp/code/LLaMA-Factory/src/llamafactory/cli.py", line 110, in main
    run_exp()
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/loader.py", line 151, in load_model
    model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3710, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 928, in __init__
    self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 852, in __init__
    self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, rope_ratio=config.rope_ratio,
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 96, in __init__
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
RuntimeError: call aclnnCast failed, detail:EZ1001: 2024-07-09-09:38:52.309.843 The param dtype not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT16,DT_FLOAT,DT_DOUBLE,DT_INT8,DT_UINT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT16,DT_UINT32,DT_UINT64,DT_BOOL,DT_COMPLEX64,DT_COMPLEX128,].

[ERROR] 2024-07-09-09:38:52 (PID:17196, Device:0, RankID:0) ERR01005 OPS internal error

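First, reading the exception.
This is a Python traceback from model initialization. "MindSpore" appears only as the name of the conda environment in the file paths; the stack itself is PyTorch with torch_npu, transformers and DeepSpeed. DeepSpeed shows up because ZeRO-3 wraps each module's `__init__` (the partition_parameters.py wrapper frames), but it is not what fails.
The failing line is in the model's own code, in the `__init__` of RotaryEmbedding: `torch.arange(0, dim, 2, device=device).to(dtype=dtype)`. The model config declares "torch_dtype": "bfloat16", so this cast targets bfloat16, and the cast operator on the NPU rejects it.
The key lines of the error are:

RuntimeError: call aclnnCast failed, detail:EZ1001: 2024-07-09-09:38:52.309.843 The param dtype not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT16,DT_FLOAT,DT_DOUBLE,DT_INT8,DT_UINT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT16,DT_UINT32,DT_UINT64,DT_BOOL,DT_COMPLEX64,DT_COMPLEX128,].
This says the aclnnCast operator has no implementation for DT_BFLOAT16 on this device. The supported types are the ones in the list, which includes DT_FLOAT16 and DT_FLOAT.
[ERROR] 2024-07-09-09:38:52 (PID:17196, Device:0, RankID:0) ERR01005 OPS internal error
This is the generic operator-level error reported by the NPU software stack after the cast failed.
Possible ways to address it:

Make sure nothing in the run asks for bfloat16: the training config, the DeepSpeed config, and the dtype the model is loaded with.
Use float16 (or float32) instead, both of which are in the supported list.
Check the CANN and torch_npu documentation to see whether your chip and software versions support bfloat16.
If the problem persists, ask the LLaMA-Factory or Ascend communities.
Since this depends on the specific code and library versions, the project's developers or community are the most likely source of a concrete fix or workaround.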
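One way to make sure the DeepSpeed side never requests bfloat16 is to patch the config so that bf16 is off and fp16 is on. The sketch below works on a plain dictionary; the `sample` config is a hypothetical stand-in and the real ds_z3_config.json may contain more keys. Whether this change alone is sufficient is not verified here, since the model's own `torch_dtype` also plays a part.

```python
import copy
import json


def force_fp16(ds_config: dict) -> dict:
    """Return a copy of a DeepSpeed config with bf16 off and fp16 on."""
    patched = copy.deepcopy(ds_config)
    patched.setdefault("bf16", {})["enabled"] = False
    patched.setdefault("fp16", {})["enabled"] = True
    return patched


# Hypothetical starting config, for illustration only.
sample = {
    "bf16": {"enabled": "auto"},
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 3},
}
print(json.dumps(force_fp16(sample), indent=2))
```

The function returns a copy rather than editing in place, so the original file contents can be kept for comparison.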

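Aha: this version of the 910 chip does not support DT_BFLOAT16, so we have to change the deepspeed config file. Then came the most hopeless part: right at the end I ran into a problem I could not get past.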

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/MindSpore/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/tmp/code/LLaMA-Factory/src/llamafactory/cli.py", line 110, in main
    run_exp()
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 88, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 3307, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 3338, in compute_loss
    outputs = model(**inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1221, in forward
    outputs = self.model(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1012, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/model_utils/checkpointing.py", line 65, in custom_gradient_checkpointing_func
    return gradient_checkpointing_func(func, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 763, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 257, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py", line 111, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight, bias)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch_npu/npu/amp/autocast_mode.py", line 113, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py", line 59, in forward
    output += bias
RuntimeError: call aclnnInplaceAdd failed, detail:EZ1001: 2024-07-09-10:40:00.116.800 the size of tensor selfRef [1,120] must match the size of tensor other [0].
        TraceBack (most recent call last):
        120 and 0 cannot broadcast.
        the size of tensor selfRef [1,120] must match the size of tensor other [0].

[ERROR] 2024-07-09-10:40:00 (PID:21727, Device:0, RankID:0) ERR01005 OPS internal error

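After switching to qwen2-7B for fine-tuning, training failed with a tensor size mismatch. Below is a reading of the exception log.
According to the log, the error comes from a dimension mismatch in a tensor operation. Specifically, the message the size of tensor selfRef [1,120] must match the size of tensor other [0] means that during the aclnnInplaceAdd operation one tensor had shape [1,120] and the other had shape [0], so the two could not be broadcast. The last frame of the traceback is output += bias in deepspeed/runtime/zero/linear.py, so the [0]-shaped tensor is the bias being added to the linear layer's output. Errors like this are often caused by an input with the wrong shape or size. A detailed reading and some possible fixes follow: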

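Reading the error log

Main function call:
```
sys.exit(main())
```
The program starts executing from the main function, main.
Running the experiment:
```
run_exp()
```
The experiment is run inside the run_exp function.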

Running SFT (Supervised Fine-Tuning):

run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)

Supervised fine-tuning of the model happens in run_sft.

Training the model:

train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)

trainer.train runs the training, optionally resuming from a checkpoint.

Training step:
```
tr_loss_step = self.training_step(model, inputs)
```
A single training step, training_step, is executed.

Computing the loss:

loss = self.compute_loss(model, inputs)

The model's loss is computed.

Model forward pass:
```
outputs = model(**inputs)
```
The model's forward pass is run.

Internal calls in the deep learning libraries:
Several internal function calls follow, and the failure finally occurs in aclnnInplaceAdd:

RuntimeError: call aclnnInplaceAdd failed, detail:EZ1001: 2024-07-09-10:40:00.116.800 the size of tensor selfRef [1,120] must match the size of tensor other [0].
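
To make the broadcast failure concrete, here is a minimal pure-Python sketch of the usual right-aligned broadcasting rule applied to the two shapes from the log. The helper name can_broadcast is made up for illustration and is not part of torch, torch_npu, or CANN; it only mirrors the rule that paired dimensions must be equal or one of them must be 1.

```python
from itertools import zip_longest


def can_broadcast(shape_a, shape_b):
    """Return True if two shapes are compatible under the usual
    right-aligned broadcasting rule (equal sizes, or one of them is 1)."""
    # Compare dimensions from the trailing end; missing leading dims count as 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a != b and a != 1 and b != 1:
            return False
    return True


# Shapes taken from the error message: the output is [1, 120], the bias is [0].
print(can_broadcast((1, 120), (0,)))    # False: 120 and 0 cannot be paired
# A bias whose length matches the last dimension of the output is fine.
print(can_broadcast((1, 120), (120,)))  # True
```

The check fails on the trailing pair (120, 0), which is the same pair the log complains about with "120 and 0 cannot broadcast".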

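Possible solutions

Check the input data:
- Make sure the shape and size of the input data are correct. In particular, confirm that no data is lost and no shapes get mismatched during preprocessing.
Check the model configuration:
- Check the model configuration, especially whether the input and output dimensions of the linear layers (such as self.q_proj) match the data.
Check custom functions:
- If a custom gradient checkpointing function custom_gradient_checkpointing_func is in use, make sure it is implemented correctly and does not change the shape of its inputs.
Update libraries and frameworks:
- Make sure the libraries in use (transformers, torch, deepspeed, and so on) are up to date, since newer versions may include bug fixes and improvements.
Debugging output:
- Add debugging output at key steps of the forward pass and print tensor shapes to pin down where and why the error occurs.

For this specific error, start by checking the shape of hidden_states, the input to self.q_proj, and print the shapes of the relevant tensors just before the failure to confirm that their dimensions match. If the problem persists, simplify the code further and debug step by step to find the exact cause.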
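
As a sketch of the shape printing suggested above, the helper below formats the shape of any object that exposes a .shape attribute (as torch tensors do) and flags empty ones. The name shape_report and the SimpleNamespace stand-ins are hypothetical and used only for illustration; in a real run the actual hidden_states, weight, and bias tensors would be passed in instead.

```python
from types import SimpleNamespace


def shape_report(named_tensors):
    """Build one line per tensor with its shape, flagging zero-size tensors."""
    lines = []
    for name, tensor in named_tensors.items():
        shape = tuple(tensor.shape)
        # Any dimension of size 0 means the tensor holds no elements.
        marker = "  <-- empty" if 0 in shape else ""
        lines.append(f"{name}: shape={shape}{marker}")
    return lines


# Stand-ins with the shapes reported in the log; real tensors work the same way.
output = SimpleNamespace(shape=(1, 120))
bias = SimpleNamespace(shape=(0,))

for line in shape_report({"output": output, "bias": bias}):
    print(line)
```

Flagging zero-size tensors explicitly makes a shape like [0] stand out in a long log, which is the pattern to look for in this error.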
Next, we removed the deepspeed configuration option.
The following exception occurred:

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/MindSpore/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/tmp/code/LLaMA-Factory/src/llamafactory/cli.py", line 110, in main
    run_exp()
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/loader.py", line 160, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/adapter.py", line 306, in init_adapter
    _setup_full_tuning(model, model_args, finetuning_args, is_trainable, cast_trainable_params_to_fp32)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/adapter.py", line 59, in _setup_full_tuning
    param.data = param.data.to(torch.float32)
RuntimeError: NPU out of memory. Tried to allocate 2.03 GiB (NPU 0; 32.00 GiB total capacity; 29.19 GiB already allocated; 29.19 GiB current active; 412.09 MiB free; 30.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
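
A rough back-of-the-envelope estimate suggests why the float32 cast in _setup_full_tuning can run out of memory on a 32 GiB device. The sketch below assumes a round 7 billion parameters (the exact parameter count of qwen2-7B is not stated in this post) and counts only the weights, ignoring gradients, optimizer state, and activations. weights_gib is an illustrative helper, not part of LLaMA-Factory.

```python
GIB = 1024 ** 3


def weights_gib(num_params, bytes_per_param):
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


NUM_PARAMS = 7_000_000_000  # assumption: a round 7B, not the exact model size

fp16 = weights_gib(NUM_PARAMS, 2)  # 16-bit weights (fp16 / bf16)
fp32 = weights_gib(NUM_PARAMS, 4)  # 32-bit weights, as after the float32 cast

print(f"16-bit weights: {fp16:.1f} GiB")
print(f"32-bit weights: {fp32:.1f} GiB")
```

Under that assumption the float32 weights alone take roughly 26 GiB of the 32 GiB the log reports, which leaves little room for anything else. This is consistent with, though not proof of, the 29.19 GiB already allocated when the cast failed.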

Time is up for now, so I will have to rethink the approach. This code has to run today.
