Recently, Meta open-sourced their LLaMA family of models, with 7B/13B/33B/65B parameter variants. Out of the box, however, the raw models follow instructions poorly: generated results are often off-topic, and generation fails to terminate naturally. Stanford's Alpaca addresses this by instruction-tuning LLaMA-7B on only about 52K training examples, reaching quality comparable to GPT-3.5.
The project provides an inexpensive recipe for fine-tuning LLaMA. The overall approach is as follows:
First, use OpenAI's GPT model API to generate relatively high-quality instruction data (only 52K examples), for example:
{
  "instruction": "Rewrite the following sentence in the third person",
  "input": "I am anxious",
  "output": "She is anxious."
},
{
  "instruction": "What are the three primary colors?",
  "input": "",
  "output": "The three primary colors are red, blue, and yellow."
}
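During fine-tuning, each record is rendered into a single prompt string. Below is a minimal sketch of the prompt template published in the stanford_alpaca repo; the build_prompt helper name is just for illustration and not part of the repo's API.

# Render one instruction record into an Alpaca-style training prompt.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(example: dict) -> str:  # illustrative helper
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example)

print(build_prompt({"instruction": "Rewrite the following sentence in the third person",
                    "input": "I am anxious"}))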
Then, fine-tune the LLaMA-7B model on this instruction data with the Hugging Face Transformers framework.
Below, we attempt to reproduce Alpaca on top of LLaMA-7B.
Environment Setup
The base environment is configured as follows:
- OS: CentOS 7
- CPUs: a single node with Intel CPUs and 1 TB of RAM; 64 physical CPUs with 16 cores each
- GPUs: 8x A800 80GB
- Python: 3.10 (OpenSSL must first be upgraded to 1.1.1t, then Python compiled and installed)
- NVIDIA driver: 470.161.03 (choose the driver matching your GPU model)
- CUDA toolkit: 11.3
- NCCL: 2.9.9-1
- cuDNN: v8.2.0
Installation of the NVIDIA driver, CUDA, Python, and the other tools above is not covered here.
Create and activate a virtual environment named llama-venv-py310.
cd /home/guodong.li/virtual-venv
virtualenv -p /usr/bin/python3.10 llama-venv-py310
source /home/guodong.li/virtual-venv/llama-venv-py310/bin/activate
Install PyTorch offline: download the torch and torchvision wheels matching your CUDA version, then install them.
pip install torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.13.1+cu113-cp310-cp310-linux_x86_64.whl
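A quick optional sanity check that the offline wheels installed correctly and that the GPUs are visible; the expected values follow from the versions installed above and this node's hardware.

# Optional sanity check for the offline PyTorch installation.
import torch
import torchvision

print(torch.__version__)          # expected: 1.12.1+cu113
print(torchvision.__version__)    # expected: 0.13.1+cu113
print(torch.cuda.is_available())  # expected: True
print(torch.cuda.device_count())  # expected: 8 on this node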
Install transformers. At the time of writing, LLaMA support has not shipped in a released version but has already been merged into the main branch, so we need to check out the corresponding commit and install from source.
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .
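Since LLaMA support only exists on this pinned commit, it is worth confirming that the source install exposes the classes used later in this article:

# Confirm that the source-installed transformers provides the LLaMA classes.
import transformers

print(transformers.__version__)
print(transformers.LlamaForCausalLM)
print(transformers.LlamaTokenizer)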
Install apex.
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 22.04-dev
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
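If the --cpp_ext/--cuda_ext build succeeded, the compiled extension modules should import cleanly. A hedged check, assuming the extension module names amp_C and fused_layer_norm_cuda that apex builds for these flags:

# Check that apex compiled its CUDA extensions (module names assumed from apex's build).
import amp_C                   # multi-tensor CUDA kernels
import fused_layer_norm_cuda   # fused LayerNorm CUDA kernels
print("apex CUDA extensions OK")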
Install the other related libraries.
pip install -r requirements.txt
The requirements.txt file contains the following:
numpy
rouge_score
fire
openai
sentencepiece
tokenizers==0.12.1
wandb
deepspeed==0.8.0
accelerate
tensorboardX
Model Format Conversion
Convert the original LLaMA weights into the model format used by the Transformers library.
cd transformers
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /data/nfs/guodong.li/pretrain/llama-model \
--model_size 7B \
--output_dir /data/nfs/guodong.li/pretrain/hf-llama-model
The conversion produces two directories, tokenizer and llama-7b (the model weights). The model and tokenizer can then be loaded as follows:
import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained("/data/nfs/guodong.li/pretrain/hf-llama-model/tokenizer/")
model = transformers.LlamaForCausalLM.from_pretrained("/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b/")
The LLaMA tokenizer is based on the sentencepiece tokenizer. When decoding a sequence, if the first token starts a word (e.g., Banana), the tokenizer does not prepend a prefix space to the string. To make the tokenizer emit the prefix space, set decode_with_prefix_space=True on the LlamaTokenizer object or in the tokenizer configuration.
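A minimal sketch of the setting mentioned above, assuming the pinned transformers commit still accepts this flag (it was removed in later releases):

# Illustrative: request a prefix space when decoding, per the note above.
import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained(
    "/data/nfs/guodong.li/pretrain/hf-llama-model/tokenizer/",
    decode_with_prefix_space=True,  # assumption: supported by the pinned commit
)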
Here, copy the files from the tokenizer directory into the llama-7b directory.
cp tokenizer/* llama-7b/
Note: if you do not want to run the conversion yourself, you can download an already converted model directly from Hugging Face.
Dataset Preparation
The alpaca_data.json file in Stanford Alpaca is the instruction dataset they used for training, and we could fine-tune the model directly on it. However, Alpaca-LoRA notes that this dataset contains some noise, so they cleaned it and released alpaca_data_cleaned.json. Training on the cleaned dataset will most likely give better results.
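A short sketch to inspect the cleaned dataset before training (the path matches the one passed to --data_path in the training commands below):

# Peek at the cleaned instruction dataset used for fine-tuning.
import json

with open("/data/nfs/guodong.li/data/alpaca_data_cleaned.json", encoding="utf-8") as f:
    data = json.load(f)

print(len(data))                      # number of instruction records (roughly 52K)
print(json.dumps(data[0], indent=2))  # each record has instruction / input / output fields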
Model Fine-tuning
This article does not use PyTorch FSDP, because the current environment runs CUDA 11.3 with PyTorch 1.12.1 and FSDP training errors out. With CUDA upgraded to 11.6 or above and PyTorch upgraded to 1.13.1 or above it should work (this was later verified on cuda-11.7 with torch-1.13.1, and it does run fine). The FSDP command would be:
torchrun --nproc_per_node=8 --master_port=25001 train.py \
--model_name_or_path /data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
--data_path /data/nfs/guodong.li/data/alpaca_data_cleaned.json \
--bf16 True \
--output_dir /data/nfs/guodong.li/output/alpaca/sft_7b \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True
Instead, we use the DeepSpeed framework here to reduce GPU memory usage and improve training efficiency.
git clone https://github.com/tatsu-lab/stanford_alpaca.git
cd stanford_alpaca
Modify the train.py file:
# Comment out the original model/tokenizer loading code
"""
model = transformers.AutoModelForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
model_max_length=training_args.model_max_length,
padding_side="right",
use_fast=False,
)
"""
# Load the tokenizer and model via the LLaMA-specific classes
model = transformers.LlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
)
tokenizer = transformers.LlamaTokenizer.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
)
trainer.save_state()
# safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
trainer.save_model()
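The training log below prints "Using pad_token, but it is not set yet." because the LLaMA tokenizer ships without a pad token; stanford_alpaca's train.py then adds one and resizes the embeddings, so the warning is expected. A minimal sketch of that idea (the "[PAD]" token string follows the repo's default; treat this as illustrative rather than a verbatim copy of train.py):

# Illustrative: ensure the tokenizer has a pad token and keep the embedding table in sync.
if tokenizer.pad_token is None:
    num_new_tokens = tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    if num_new_tokens > 0:
        model.resize_token_embeddings(len(tokenizer))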
Launch command:
torchrun --nproc_per_node=8 --master_port=11223 train.py \
--model_name_or_path /data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
--data_path /data/nfs/guodong.li/data/alpaca_data_cleaned.json \
--output_dir /data/nfs/guodong.li/output/alpaca/sft_7b \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--gradient_checkpointing True \
--fp16 True \
--deepspeed ds_config.json
The ds_config.json file is as follows:
{
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_prefetch_bucket_size": 0,
    "stage3_param_persistence_threshold": 1e2,
    "reduce_bucket_size": 1e2,
    "sub_group_size": 1e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
Training run output:
torchrun --nproc_per_node=8 --master_port=11223 train.py \
> --model_name_or_path /data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
> --data_path /data/nfs/guodong.li/data/alpaca_data_cleaned.json \
> --output_dir /data/nfs/guodong.li/output/alpaca/sft_7b \
> --num_train_epochs 1 \
> --per_device_train_batch_size 4 \
> --per_device_eval_batch_size 1 \
> --gradient_accumulation_steps 4 \
> --evaluation_strategy "no" \
> --save_strategy "steps" \
> --save_steps 1000 \
> --save_total_limit 1 \
> --learning_rate 2e-5 \
> --weight_decay 0. \
> --warmup_ratio 0.03 \
> --lr_scheduler_type "cosine" \
> --logging_steps 1 \
> --report_to "tensorboard" \
> --gradient_checkpointing True \
> --fp16 True \
> --deepspeed ds_config.json
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-03-28 11:13:02,320] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-03-28 11:13:20,236] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:41<00:00, 1.26s/it]
...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:41<00:00, 1.26s/it]
Using pad_token, but it is not set yet.
...
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
..
WARNING:root:Tokenizing inputs... This may take some time...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/guodong.li/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
...
Loading extension module utils...
Time to load utils op: 0.10286140441894531 seconds
...
Time to load utils op: 0.20401406288146973 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...Loading extension module utils...
Time to load utils op: 0.0004200935363769531 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Time to load utils op: 0.0003352165222167969 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003571510314941406 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006623268127441406 seconds
Time to load utils op: 0.0005290508270263672 seconds
Time to load utils op: 0.0006077289581298828 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.001024484634399414 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003275871276855469 seconds
{'loss': 1.5163, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 1.5216, 'learning_rate': 0.0, 'epoch': 0.02}
...
{'loss': 1.0547, 'learning_rate': 2.025571894372794e-06, 'epoch': 0.98}
{'loss': 1.0329, 'learning_rate': 1.8343633694278895e-06, 'epoch': 0.99}
{'loss': 1.0613, 'learning_rate': 1.6517194697072903e-06, 'epoch': 1.0}
{'train_runtime': 4605.8781, 'train_samples_per_second': 11.277, 'train_steps_per_second': 0.022, 'train_loss': 1.175760779050317, 'epoch': 1.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 101/101 [1:16:45<00:00, 45.60s/it]
GPU memory usage:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... Off | 00000000:34:00.0 Off | 0 |
| N/A 47C P0 75W / 300W | 66615MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... Off | 00000000:35:00.0 Off | 0 |
| N/A 46C P0 70W / 300W | 31675MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... Off | 00000000:36:00.0 Off | 0 |
| N/A 49C P0 72W / 300W | 35529MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... Off | 00000000:37:00.0 Off | 0 |
| N/A 50C P0 76W / 300W | 54277MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A800 80G... Off | 00000000:9B:00.0 Off | 0 |
| N/A 51C P0 80W / 300W | 44229MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A800 80G... Off | 00000000:9C:00.0 Off | 0 |
| N/A 49C P0 72W / 300W | 59841MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A800 80G... Off | 00000000:9D:00.0 Off | 0 |
| N/A 47C P0 77W / 300W | 65217MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A800 80G... Off | 00000000:9E:00.0 Off | 0 |
| N/A 43C P0 68W / 300W | 30141MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 30534 C ...lama-venv-py310/bin/python 66593MiB |
| 1 N/A N/A 30535 C ...lama-venv-py310/bin/python 31653MiB |
| 2 N/A N/A 30536 C ...lama-venv-py310/bin/python 35507MiB |
| 3 N/A N/A 30537 C ...lama-venv-py310/bin/python 54255MiB |
| 4 N/A N/A 30540 C ...lama-venv-py310/bin/python 44207MiB |
| 5 N/A N/A 30541 C ...lama-venv-py310/bin/python 59819MiB |
| 6 N/A N/A 30542 C ...lama-venv-py310/bin/python 65195MiB |
| 7 N/A N/A 30543 C ...lama-venv-py310/bin/python 30119MiB |
+-----------------------------------------------------------------------------+
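After training, the fp16 weights consolidated into output_dir (thanks to stage3_gather_16bit_weights_on_model_save) load like any Hugging Face checkpoint. A minimal, hedged generation sketch, reusing the prompt template from the beginning of the article; the generation parameters are illustrative only.

# Illustrative: load the fine-tuned checkpoint and generate a response.
import torch
import transformers

ckpt = "/data/nfs/guodong.li/output/alpaca/sft_7b"
tokenizer = transformers.LlamaTokenizer.from_pretrained(ckpt)
model = transformers.LlamaForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).cuda()
model.eval()

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat are the three primary colors?\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))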
Although LLaMA has strong zero-shot and transfer capabilities in English, it saw almost no Chinese corpus during pre-training, so its Chinese ability is weak.
With that, we have reproduced Stanford Alpaca end to end from scratch.
References:
- LLaMA
- Stanford Alpaca
- Alpaca-LoRA
- Alpaca: A Strong, Replicable Instruction-Following Model