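PyTorch 2.0 in Practice: Speeding Up HuggingFace and TIMM Models

With a single line, torch.compile(), PyTorch 2.0 can speed up model training by 30%-200%. This tutorial shows how to reproduce that speedup in practice.

torch.compile() makes it easy to try different compiler backends to speed up PyTorch code. It is a drop-in replacement for torch.jit.script() that works directly on an nn.Module, with no changes to the source code.

In the previous article, we explained that torch.compile supports arbitrary PyTorch code, control flow, and mutation, and has some support for dynamic shapes.

In tests on 163 open-source models, we found that torch.compile() delivers a 30%-200% speedup.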

opt_module = torch.compile(module)

Detailed test results:

github.com/pytorch/tor…

This tutorial shows how to use torch.compile() to speed up model training.

Requirements and Setup

For GPU (the newer the GPU, the larger the performance gain):

pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117

For CPU:

pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Optional: Verify Installation

git clone https://github.com/pytorch/pytorch
cd pytorch/tools/dynamo
python verify_dynamo.py

Optional: Docker Installation

The PyTorch nightly binaries image ships with all the required dependencies and can be downloaded with:

docker pull ghcr.io/pytorch/pytorch-nightly

For ad hoc experiments, just make sure the container has access to all your GPUs:

docker run --gpus all -it ghcr.io/pytorch/pytorch-nightly:latest /bin/bash

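Getting Started

A Simple Example

Let's start with a simple example. Note that the newer your GPU, the more pronounced the speedup.

import torch

def fn(x, y):
    a = torch.sin(x).cuda()
    b = torch.sin(y).cuda()
    return a + b

new_fn = torch.compile(fn, backend="inductor")
input_tensor = torch.randn(10000).to(device="cuda:0")
# fn takes two arguments, so pass the tensor for both x and y
a = new_fn(input_tensor, input_tensor)

This example won't actually run any faster, but it is instructive.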

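In this example, torch.sin() is a pointwise op, as is torch.cos(): both operate element by element on a vector. A better-known pointwise op is torch.relu().

Pointwise ops in eager mode are suboptimal because each operator has to read a tensor from memory, make some changes, and then write those changes back.

The single most important optimization in PyTorch 2.0 is fusion.

So in this example, 2 reads and 2 writes can be turned into 1 read and 1 write. This is crucial for newer GPUs, where the bottleneck is memory bandwidth (how quickly data can be sent to the GPU) rather than compute (how quickly the GPU can perform floating-point operations).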
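The read/write arithmetic can be made concrete with a toy model in plain Python. This is only a sketch: the traffic counter and both function names are hypothetical, Python lists stand in for GPU memory, and chaining two sin calls is an assumption chosen for illustration. It does not reflect how Inductor is implemented.

```python
import math

def eager_sin_sin(x, traffic):
    """Two separate pointwise ops, each making a full round trip through memory."""
    # Op 1: read every element of x, write every element of tmp.
    tmp = [math.sin(v) for v in x]
    traffic["reads"] += len(x)
    traffic["writes"] += len(tmp)
    # Op 2: read tmp back, write the final result.
    out = [math.sin(v) for v in tmp]
    traffic["reads"] += len(tmp)
    traffic["writes"] += len(out)
    return out

def fused_sin_sin(x, traffic):
    """One fused op: the intermediate value is never written back to memory."""
    out = [math.sin(math.sin(v)) for v in x]
    traffic["reads"] += len(x)
    traffic["writes"] += len(out)
    return out

x = [i / 1000 for i in range(10000)]
eager_traffic = {"reads": 0, "writes": 0}
fused_traffic = {"reads": 0, "writes": 0}
eager_out = eager_sin_sin(x, eager_traffic)
fused_out = fused_sin_sin(x, fused_traffic)

print(eager_traffic)  # two reads and two writes per element
print(fused_traffic)  # one read and one write per element
```

Both versions compute the same values; only the number of trips through memory differs, which is the quantity that matters on a bandwidth-bound device.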

The second important optimization in PyTorch 2.0 is CUDA graphs.

CUDA graphs help eliminate the overhead of launching individual kernels from a Python program.
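A back-of-the-envelope model suggests why launch overhead matters when a program issues many small kernels. The function name and every number below are hypothetical placeholders, not measurements. The model is deliberately simplistic: it assumes each individually launched kernel pays a fixed launch cost that is not overlapped with execution, while a captured graph pays that cost once.

```python
def total_time_us(num_kernels, kernel_us, launch_us, use_graph):
    """Toy cost model: per-kernel launch cost vs. a single launch for a whole graph."""
    launches = 1 if use_graph else num_kernels
    return num_kernels * kernel_us + launches * launch_us

# Hypothetical numbers chosen only to illustrate the shape of the trade-off.
eager_us = total_time_us(num_kernels=100, kernel_us=5, launch_us=10, use_graph=False)
graphed_us = total_time_us(num_kernels=100, kernel_us=5, launch_us=10, use_graph=True)
print(eager_us, graphed_us)
```

In this toy setting the launch cost dominates the eager total, so the benefit shrinks as individual kernels get longer.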

torch.compile() supports many different backends. The most notable is Inductor, which can generate Triton kernels.

github.com/openai/trit…

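These kernels are written in Python, yet they outperform the vast majority of handwritten CUDA kernels. Assuming the example above is saved as trig.py, you can inspect the generated Triton kernel code by running: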

TORCHINDUCTOR_TRACE=1 python trig.py

@pointwise(size_hints=[16384], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
@triton.jit
def kernel(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 10000
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.sin(tmp0)
    tmp2 = tl.sin(tmp1)
    tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

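The code above confirms that the two sins were indeed fused: both sin operations occur within a single Triton kernel, and the temporary variables are held in registers, which are very fast to access.

A Real Model Example

Take resnet18 from PyTorch Hub as an example:

import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")
# Call the compiled model, not the original one
opt_model(torch.randn(1,3,64,64))

When you run this, you will notice that the first run is slow because the model is being compiled. Subsequent runs are faster, so it is common practice to warm up the model before benchmarking.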
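A minimal sketch of the warm-up practice follows. The benchmark helper and fake_model are hypothetical names introduced for illustration; fake_model merely imitates a compiled model whose first call is slow, so the sketch runs without PyTorch or a GPU.

```python
import statistics
import time

def benchmark(fn, *args, warmup=3, iters=10, sync=None):
    """Return the median wall-clock seconds per call, after discarding warm-up runs."""
    for _ in range(warmup):
        fn(*args)              # early calls may include compilation; not timed
    if sync is not None:
        sync()                 # make sure warm-up work has finished
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()             # wait for asynchronous work before stopping the clock
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Stand-in for a compiled model: only the first call is artificially slow.
calls = {"n": 0}

def fake_model(x):
    calls["n"] += 1
    if calls["n"] == 1:
        time.sleep(0.05)       # pretend this is compilation time
    return x * 2

median_s = benchmark(fake_model, 21, warmup=3, iters=10)
print(f"median over 10 timed calls: {median_s:.6f}s")
```

With a real model on a GPU, one would likely pass sync=torch.cuda.synchronize, since CUDA kernels run asynchronously and the timer would otherwise stop before the work is done.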

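As you can see, we pass "inductor" as the compiler name here, but it is not the only available backend. Run torch._dynamo.list_backends() in a REPL to see the full list of available backends.

You can also try aot_cudagraphs or nvfuser.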
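Because the set of backends can differ between installations and releases, it may be safer to check the list before choosing one. The sketch below assumes a PyTorch 2.x install; the fallback to the first listed backend is an illustrative choice, not a recommendation.

```python
import torch
import torch._dynamo

# Backend names that torch.compile() accepts in this installation.
available = torch._dynamo.list_backends()
print(available)

# Prefer "inductor"; otherwise fall back to the first backend that is listed.
backend = "inductor" if "inductor" in available else available[0]
print("using backend:", backend)
```

The chosen name can then be passed as torch.compile(model, backend=backend).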

Hugging Face Model Example

The PyTorch community frequently uses pretrained models from transformers or TIMM:

github.com/huggingface…

github.com/rwightman/p…

One design goal of PyTorch 2.0 is that any compilation stack must work out of the box on the vast majority of models people actually run.

Here we download a pretrained model directly from the HuggingFace hub and optimize it:

import torch
from transformers import BertTokenizer, BertModel

# Copy pasted from here https://huggingface.co/bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model) # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
output = model(**encoded_input)

If you remove to(device="cuda:0") from both the model and encoded_input, PyTorch 2.0 will generate C++ kernels optimized for running on the CPU.

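You can inspect the Triton or C++ kernels generated for BERT. They are clearly more complex than the trigonometry example above, but if you are familiar with PyTorch you can skim them.

The same code also works well when used together with the following library and with DDP: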

github.com/huggingface…
DDP

Similarly, try a TIMM example:

import timm
import torch

model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(64,3,7,7))

PyTorch's goal is to build a compiler that works with more models and speeds up the vast majority of open-source models. Visit the HuggingFace Hub now and speed up TIMM models with PyTorch 2.0!

huggingface.co/timm
