
Meta recently open-sourced its LLaMA family of models, with 7B/13B/33B/65B parameter variants. The raw models, however, follow instructions poorly (e.g., generated output is off-topic and generation does not terminate naturally). Stanford's Alpaca model addresses this: starting from LLaMA-7B and applying instruction fine-tuning on only about 52K training examples, it reaches quality comparable to GPT-3.5.

The project provides an inexpensive recipe for fine-tuning LLaMA. The overall approach is as follows:

First, use the GPT model API provided by OpenAI to generate high-quality instruction data (only 52K examples), for example:

[
    {
        "instruction": "Rewrite the following sentence in the third person",
        "input": "I am anxious",
        "output": "She is anxious."
    },
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow."
    }
]

Then, fine-tune LLaMA-7B on this instruction data with the Hugging Face Transformers framework.
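
These JSON records are not fed to the model verbatim; each one is first rendered into a prompt template. The sketch below mirrors the PROMPT_DICT templates in the Stanford Alpaca repository's train.py, and the model is trained to continue the prompt with the record's output field:

# Sketch: how an instruction record becomes a training prompt
# (templates mirror PROMPT_DICT in Stanford Alpaca's train.py).
PROMPT_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(example: dict) -> str:
    # Records with an empty "input" field use the shorter template.
    template = PROMPT_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example)

print(build_prompt({"instruction": "Rewrite the following sentence in the third person",
                    "input": "I am anxious",
                    "output": "She is anxious."}))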

Reproducing Stanford Alpaca 7B from Scratch

Below, we attempt to reproduce Alpaca starting from LLaMA-7B.

Environment Setup

The base environment is configured as follows:

  • Operating system: CentOS 7
  • CPUs: a single node with 1 TB of RAM; 64 physical Intel CPUs, 16 cores each
  • GPUs: 8 × A800 80GB
  • Python: 3.10 (OpenSSL must first be upgraded to 1.1.1t, after which Python is compiled and installed)
  • NVIDIA driver version: 470.161.03 (choose the driver matching your GPU model)
  • CUDA toolkit: 11.3
  • NCCL: 2.9.9-1
  • cuDNN: v8.2.0

Installing the NVIDIA driver, CUDA, Python, and the other tools above is not covered step by step here.

Create and activate the virtual environment llama-venv-py310.

cd /home/guodong.li/virtual-venv
virtualenv -p /usr/bin/python3.10 llama-venv-py310 
source /home/guodong.li/virtual-venv/llama-venv-py310/bin/activate

Install PyTorch offline: download the torch and torchvision wheels matching the CUDA version, then install them with pip.

pip install torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.13.1+cu113-cp310-cp310-linux_x86_64.whl
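
A quick sanity check confirms that the CUDA build of PyTorch can see the GPUs (expected values for this environment are in the comments):

import torch

print(torch.__version__)          # expected: 1.12.1+cu113
print(torch.cuda.is_available())  # expected: True
print(torch.cuda.device_count())  # expected: 8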

Install Transformers. At the time of writing, the LLaMA implementation had been merged into the main branch but not yet released in a tagged version, so we need to check out the corresponding commit and install from source.

git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5 
pip install .
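
A quick import check verifies that the checked-out commit actually contains the LLaMA implementation; if the import below fails, the commit predates the LLaMA merge:

import transformers

# These classes only exist after the LLaMA merge into the main branch.
from transformers import LlamaForCausalLM, LlamaTokenizer
print(transformers.__version__)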

Install apex.

git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 22.04-dev
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Install the remaining dependencies.

pip install -r requirements.txt

The contents of requirements.txt are as follows.

numpy
rouge_score
fire
openai
sentencepiece
tokenizers==0.12.1
wandb
deepspeed==0.8.0
accelerate
tensorboardX

Model Format Conversion

Convert the original LLaMA weight files into the model format used by the Transformers library.

cd transformers
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /data/nfs/guodong.li/pretrain/llama-model \
--model_size 7B \
--output_dir /data/nfs/guodong.li/pretrain/hf-llama-model

The conversion produces two directories: tokenizer and llama-7b (the model weight files). The model and tokenizer can then be loaded as follows:

import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained("/data/nfs/guodong.li/pretrain/hf-llama-model/tokenizer/")
model = transformers.LlamaForCausalLM.from_pretrained("/data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b/")

The LLaMA tokenizer is based on the sentencepiece library. When decoding a sequence, if the first token starts a word (e.g., Banana), the tokenizer does not prepend a prefix space to the string. To make the tokenizer emit the prefix space, set decode_with_prefix_space=True on the LlamaTokenizer object or in the tokenizer configuration.
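
A minimal sketch of this behavior (decode_with_prefix_space follows the documentation quoted above and may be absent from later Transformers releases):

import transformers

tokenizer = transformers.LlamaTokenizer.from_pretrained(
    "/data/nfs/guodong.li/pretrain/hf-llama-model/tokenizer/")

ids = tokenizer.encode("Banana", add_special_tokens=False)
print(tokenizer.decode(ids))  # "Banana" -- no leading space by default
# Per the note above, setting decode_with_prefix_space=True on the tokenizer
# or in its config would restore the leading space (version-dependent).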

Here we copy the files from the tokenizer directory into the llama-7b directory.

cp tokenizer/* llama-7b/

Note: if you would rather skip the conversion, you can download an already-converted model directly from Hugging Face.

Dataset Preparation

The alpaca_data.json file in the Stanford Alpaca repository is the instruction dataset they used for training, and we could fine-tune on it directly. However, the Alpaca-LoRA project noted that this dataset contains some noise, so they cleaned it and published alpaca_data_cleaned.json. Training on the cleaned dataset will most likely give better results.
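
Before launching training, it is worth sanity-checking the dataset file; a minimal sketch, using the same path as the training commands below:

import json

# Load the cleaned instruction dataset and inspect its shape.
with open("/data/nfs/guodong.li/data/alpaca_data_cleaned.json") as f:
    data = json.load(f)

print(len(data))                      # number of instruction records (~52K)
print(json.dumps(data[0], indent=2))  # one {"instruction", "input", "output"} record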

Model Fine-Tuning

This article does not use PyTorch FSDP because the environment runs CUDA 11.3 with PyTorch 1.12.1, which raises errors at runtime. With CUDA upgraded to 11.6 or later and PyTorch to 1.13.1 or later, it should work (this was later verified on cuda-11.7 with torch-1.13.1, and indeed ran without problems). The FSDP command would be:

torchrun --nproc_per_node=8 --master_port=25001 train.py \
    --model_name_or_path  /data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
    --data_path /data/nfs/guodong.li/data/alpaca_data_cleaned.json \
    --bf16 True \
    --output_dir /data/nfs/guodong.li/output/alpaca/sft_7b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
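
With these settings, the effective global batch size would be 8 GPUs × 4 sequences per device × 8 gradient-accumulation steps = 256.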

Here, we use the DeepSpeed framework instead to reduce GPU memory usage and improve training efficiency. First, clone the Stanford Alpaca repository:

git clone https://github.com/tatsu-lab/stanford_alpaca.git
cd stanford_alpaca

Modify train.py:

    # Comment out the original code
    """
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=False,
    )
    """
    # Load the tokenizer and model via the LLaMA-specific classes
    model = transformers.LlamaForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
    )
    tokenizer = transformers.LlamaTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
    )
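    # With ZeRO stage 3, Trainer.save_model() gathers the sharded fp16 weights
    # into a full checkpoint (see stage3_gather_16bit_weights_on_model_save below).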
    trainer.save_state()
    # safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
    trainer.save_model()

Launch command:

torchrun --nproc_per_node=8 --master_port=11223 train.py \
--model_name_or_path /data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
--data_path /data/nfs/guodong.li/data/alpaca_data_cleaned.json \
--output_dir /data/nfs/guodong.li/output/alpaca/sft_7b \
--num_train_epochs 1 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--gradient_checkpointing True \
--fp16 True \
--deepspeed ds_config.json

Here, the ds_config.json file contains the following:

{
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 0,
        "stage3_max_reuse_distance": 0,
        "stage3_prefetch_bucket_size": 0,
        "stage3_param_persistence_threshold": 1e2,
        "reduce_bucket_size": 1e2,
        "sub_group_size": 1e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
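
The "auto" values are resolved by the Hugging Face Trainer from the command-line arguments: train_micro_batch_size_per_gpu becomes per_device_train_batch_size (4), and train_batch_size becomes 4 × 4 gradient-accumulation steps × 8 GPUs = 128.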

Training run log:

torchrun --nproc_per_node=8 --master_port=11223 train.py \
> --model_name_or_path /data/nfs/guodong.li/pretrain/hf-llama-model/llama-7b \
> --data_path /data/nfs/guodong.li/data/alpaca_data_cleaned.json \
> --output_dir /data/nfs/guodong.li/output/alpaca/sft_7b \
> --num_train_epochs 1 \
> --per_device_train_batch_size 4 \
> --per_device_eval_batch_size 1 \
> --gradient_accumulation_steps 4 \
> --evaluation_strategy "no" \
> --save_strategy "steps" \
> --save_steps 1000 \
> --save_total_limit 1 \
> --learning_rate 2e-5 \
> --weight_decay 0. \
> --warmup_ratio 0.03 \
> --lr_scheduler_type "cosine" \
> --logging_steps 1 \
> --report_to "tensorboard" \
> --gradient_checkpointing True \
> --fp16 True \
> --deepspeed ds_config.json
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-03-28 11:13:02,320] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-03-28 11:13:20,236] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:41<00:00,  1.26s/it]
...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:41<00:00,  1.26s/it]
Using pad_token, but it is not set yet.
...
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
...
WARNING:root:Tokenizing inputs... This may take some time...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/guodong.li/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
...
Loading extension module utils...
Time to load utils op: 0.10286140441894531 seconds
...
Time to load utils op: 0.20401406288146973 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...Loading extension module utils...
Time to load utils op: 0.0004200935363769531 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Time to load utils op: 0.0003352165222167969 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003571510314941406 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006623268127441406 seconds
Time to load utils op: 0.0005290508270263672 seconds
Time to load utils op: 0.0006077289581298828 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.001024484634399414 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003275871276855469 seconds
{'loss': 1.5163, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 1.5216, 'learning_rate': 0.0, 'epoch': 0.02}
...
{'loss': 1.0547, 'learning_rate': 2.025571894372794e-06, 'epoch': 0.98}
{'loss': 1.0329, 'learning_rate': 1.8343633694278895e-06, 'epoch': 0.99}
{'loss': 1.0613, 'learning_rate': 1.6517194697072903e-06, 'epoch': 1.0}
{'train_runtime': 4605.8781, 'train_samples_per_second': 11.277, 'train_steps_per_second': 0.022, 'train_loss': 1.175760779050317, 'epoch': 1.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 101/101 [1:16:45<00:00, 45.60s/it]

GPU memory usage during training:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   47C    P0    75W / 300W |  66615MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   46C    P0    70W / 300W |  31675MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80G...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   49C    P0    72W / 300W |  35529MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80G...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   50C    P0    76W / 300W |  54277MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A800 80G...  Off  | 00000000:9B:00.0 Off |                    0 |
| N/A   51C    P0    80W / 300W |  44229MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A800 80G...  Off  | 00000000:9C:00.0 Off |                    0 |
| N/A   49C    P0    72W / 300W |  59841MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A800 80G...  Off  | 00000000:9D:00.0 Off |                    0 |
| N/A   47C    P0    77W / 300W |  65217MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   43C    P0    68W / 300W |  30141MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     30534      C   ...lama-venv-py310/bin/python    66593MiB |
|    1   N/A  N/A     30535      C   ...lama-venv-py310/bin/python    31653MiB |
|    2   N/A  N/A     30536      C   ...lama-venv-py310/bin/python    35507MiB |
|    3   N/A  N/A     30537      C   ...lama-venv-py310/bin/python    54255MiB |
|    4   N/A  N/A     30540      C   ...lama-venv-py310/bin/python    44207MiB |
|    5   N/A  N/A     30541      C   ...lama-venv-py310/bin/python    59819MiB |
|    6   N/A  N/A     30542      C   ...lama-venv-py310/bin/python    65195MiB |
|    7   N/A  N/A     30543      C   ...lama-venv-py310/bin/python    30119MiB |
+-----------------------------------------------------------------------------+
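
Once training finishes, the weights saved in output_dir can be smoke-tested with a quick generation. A minimal sketch, reusing the Alpaca no-input prompt template from earlier (the generation parameters are illustrative, and this assumes the tokenizer was also saved to output_dir; otherwise load it from the llama-7b directory as before):

import transformers

model_dir = "/data/nfs/guodong.li/output/alpaca/sft_7b"
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_dir)
model = transformers.LlamaForCausalLM.from_pretrained(model_dir).half().cuda()

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat are the three primary colors?\n\n### Response:"
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))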

Although LLaMA has strong zero-shot learning and transfer abilities in English, it saw almost no Chinese text during pre-training, so its Chinese ability is weak.

With that, Stanford Alpaca has been fully reproduced from scratch.

References:

  • LLaMA
  • Stanford Alpaca
  • Alpaca-LoRA
  • Alpaca: A Strong, Replicable Instruction-Following Model