Pine, reporting from Aofeisi
QbitAI | WeChat official account: QbitAI

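You can now build GPT from scratch in just 60 lines of code!

By comparison, minGPT and nanoGPT, both by Andrej Karpathy, Tesla's former director of AI, each still take about 300 lines.

The 60-line GPT has a name too: its author calls it PicoGPT.

Unlike the earlier minGPT and nanoGPT tutorials, the tutorial covered here focuses on the implementation itself and uses already-trained weights for the model.

The author explains that the point of the tutorial is to offer a simple, hackable, yet complete technical introduction.

That makes it quite friendly to readers who do not yet understand the concepts behind GPT.

Some readers praised the post as very clear, the first part especially:

"This is a great write-up on GPT models. It is clearer than any introduction I have seen before, at least in the first part, which discusses text generation and sampling."

The project has now passed 100 stars on GitHub, and its Hacker News post is approaching 1,000 clicks.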

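Starting from what GPT is

Before diving in, note that the tutorial is not entirely beginner-level: readers should already be familiar with Python, NumPy, and the basics of training neural networks.

The tutorial concentrates on the technical introduction and is organized into six parts, covered in turn below.

What is GPT?

As usual, GPT gets a basic introduction before the actual build begins. The tutorial explains how GPT works in three parts: input/output, text generation, and training.

In this part the author includes code and even uses a few analogies to help readers understand GPT.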

For example, in the input section the author compares a sentence to a rope, which the tokenizer cuts into small pieces (words) called tokens.
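The rope analogy can be sketched with a toy tokenizer. This is only an illustration under stated assumptions: the seven-word vocabulary and the `encode`/`decode` helpers are hypothetical, and splitting on whitespace is a simplification. The real GPT-2 tokenizer uses byte pair encoding over a vocabulary of 50,257 entries.

```python
# Hypothetical toy vocabulary, chosen so the ids line up with the
# "not all heroes wear capes" example used in the article.
vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]
token_to_id = {token: i for i, token in enumerate(vocab)}


def encode(text):
    """Cut the 'rope' at whitespace and map each piece to its vocabulary index."""
    return [token_to_id[word] for word in text.split()]


def decode(ids):
    """Map ids back to their tokens and join them with spaces."""
    return " ".join(vocab[i] for i in ids)


ids = encode("not all heroes wear")
print(ids)          # [1, 0, 2, 4]
print(decode(ids))  # not all heroes wear
```

A whitespace tokenizer like this fails on any word outside its vocabulary, which is one reason real tokenizers work on subword units instead.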

As another example, when introducing autoregression in the text generation part, the author simply shows the code:

import numpy as np

def generate(inputs, n_tokens_to_generate):
    for _ in range(n_tokens_to_generate):  # auto-regressive decode loop
        output = gpt(inputs)  # model forward pass
        next_id = np.argmax(output[-1])  # greedy sampling
        inputs = np.append(inputs, [next_id])  # append prediction to input
    return list(inputs[len(inputs) - n_tokens_to_generate :])  # only return generated ids

# gpt and vocab stand for the trained model and its vocabulary; they are not defined in this snippet
input_ids = [1, 0]  # "not" "all"
output_ids = generate(input_ids, 3)  # output_ids = [2, 4, 6]
output_tokens = [vocab[i] for i in output_ids]  # "heroes" "wear" "capes"

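On each iteration, the predicted token is appended back to the input. This process of predicting a future value and feeding it back into the input is why GPT is described as autoregressive.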
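To watch the loop run end to end without a trained model, here is a minimal sketch with a stand-in `gpt` function. Everything about the stand-in is an assumption made for illustration: the toy vocabulary, the `next_token` lookup table, and the fake logits are hypothetical and only mimic the shape of a real model's output.

```python
import numpy as np

vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]

# Hypothetical stand-in for learned behavior: which token id follows which.
next_token = {0: 2, 2: 4, 4: 6}


def gpt(inputs):
    """Return fake logits of shape [n_seq, n_vocab]; only the last row matters here."""
    logits = np.zeros((len(inputs), len(vocab)))
    # Put all the weight on the hard-coded successor (or on "." if there is none).
    logits[-1, next_token.get(int(inputs[-1]), 5)] = 1.0
    return logits


def generate(inputs, n_tokens_to_generate):
    inputs = list(inputs)  # copy so the caller's list is not modified
    for _ in range(n_tokens_to_generate):
        logits = gpt(inputs)                  # forward pass over the whole sequence
        next_id = int(np.argmax(logits[-1]))  # greedy: take the most likely token
        inputs.append(next_id)                # feed the prediction back in
    return inputs[len(inputs) - n_tokens_to_generate:]


output_ids = generate([1, 0], 3)
print(output_ids)                      # [2, 4, 6]
print([vocab[i] for i in output_ids])  # ['heroes', 'wear', 'capes']
```

Each pass through the loop sees a sequence one token longer than the last, which is the feedback that the term autoregressive refers to.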

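How do the 60 lines run?

With the basic concepts of GPT covered, the tutorial jumps straight to running PicoGPT on your own computer.

The author first presents the skeleton of his 60-line program: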

import numpy as np


def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):
    pass  # TODO: implement this


def generate(inputs, params, n_head, n_tokens_to_generate):
    from tqdm import tqdm

    for _ in tqdm(range(n_tokens_to_generate), "generating"):  # auto-regressive decode loop
        logits = gpt2(inputs, **params, n_head=n_head)  # model forward pass
        next_id = np.argmax(logits[-1])  # greedy sampling
        inputs = np.append(inputs, [next_id])  # append prediction to input

    return list(inputs[len(inputs) - n_tokens_to_generate :])  # only return generated ids


def main(prompt: str, n_tokens_to_generate: int = 40, model_size: str = "124M", models_dir: str = "models"):
    from utils import load_encoder_hparams_and_params

    # load encoder, hparams, and params from the released open-ai gpt-2 files
    encoder, hparams, params = load_encoder_hparams_and_params(model_size, models_dir)

    # encode the input string using the BPE tokenizer
    input_ids = encoder.encode(prompt)

    # make sure we are not surpassing the max sequence length of our model
    assert len(input_ids) + n_tokens_to_generate < hparams["n_ctx"]

    # generate output ids
    output_ids = generate(input_ids, params, hparams["n_head"], n_tokens_to_generate)

    # decode the ids back into a string
    output_text = encoder.decode(output_ids)

    return output_text


if __name__ == "__main__":
    import fire

    fire.Fire(main)

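He then walks through running GPT on your computer step by step, starting with cloning the repository and installing the dependencies.

There are some thoughtful tips along the way. For example, on an M1 MacBook you need to change tensorflow to tensorflow-macos in requirements.txt before running pip install.

The author also explains each of the four parts of the code in detail: gpt2, generate, main, and fire.Fire(main).

Once the code runs, the next step is a closer look at the encoder, the hyperparameters (hparams), and the parameters (params).

Run the following code in a notebook or a Python session:

from utils import load_encoder_hparams_and_params
encoder, hparams, params = load_encoder_hparams_and_params("124M", "models")

Bingo! The necessary model and tokenizer files are downloaded to models/124M, and the encoder, hparams, and params are loaded directly.

We won't go into more detail here; the tutorial link is at the end of this article.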

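An introduction to some basic neural network layers

The material in this part is more basic. The next part covers the actual GPT architecture, so before that readers need to understand some more basic neural network layers that are not specific to GPT.

The author introduces GELU, softmax, layer normalization, and linear layers.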
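For reference, these four layers can each be written in a few lines of NumPy. The versions below are a sketch of how they are commonly implemented, not a copy of the tutorial's code: the function names, signatures, and the `eps` default are assumptions, and `gelu` uses the tanh approximation that GPT-2 uses.

```python
import numpy as np


def gelu(x):
    """GELU activation, tanh approximation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    """Turn the last axis into probabilities; subtracting the max avoids overflow."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def layer_norm(x, g, b, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance, then scale by g and shift by b."""
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(variance + eps) + b


def linear(x, w, b):
    """Affine projection: [m, n_in] @ [n_in, n_out] + [n_out] -> [m, n_out]."""
    return x @ w + b


x = np.array([[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]])
print(softmax(x).sum(axis=-1))                           # each row sums to 1
print(layer_norm(x, g=np.ones(3), b=np.zeros(3)).mean(axis=-1))  # each row has mean ~0
print(linear(x, np.ones((3, 2)), np.zeros(2)).shape)     # (2, 2)
```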

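GPT architecture

Finally! This part covers the architecture of GPT itself, which the author introduces by way of the transformer architecture.

△ Transformer architecture

The GPT architecture uses only the decoder stack of the transformer (the right half of the diagram), and it drops the cross-attention layers in that stack as well.

△ GPT architecture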
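With cross-attention removed, what remains in each decoder block is causal self-attention, where a position can look only at itself and earlier positions. The following is a minimal single-head sketch under stated assumptions: the function name, the random weights, and the toy sizes are hypothetical, and real GPT-2 blocks use multiple heads plus an output projection.

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention in which position i only attends to positions <= i."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # each [n_seq, d]
    scores = q @ k.T / np.sqrt(q.shape[-1])   # [n_seq, n_seq]
    # Large negative values above the diagonal remove attention to future tokens.
    mask = np.triu(np.ones_like(scores), k=1) * -1e10
    weights = softmax(scores + mask)          # rows are attention distributions
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, 8-dimensional embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)             # (4, 8)
print(np.round(weights, 2))  # entries above the diagonal are 0
```

Queries, keys, and values all come from the same sequence `x` here. Cross-attention would instead take keys and values from an encoder's output, which a decoder-only model does not have.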

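The author then summarizes the GPT architecture in three parts:

Text + positional embeddings
A transformer decoder stack
A next-token prediction head

He also shows the three parts in code, like this: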

def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):  # [n_seq] -> [n_seq, n_vocab]
    # token + positional embeddings
    x = wte[inputs] + wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]

    # forward pass through n_layer transformer blocks
    # (each block is a dict of weights, unpacked as keyword arguments)
    for block in blocks:
        x = transformer_block(x, **block, n_head=n_head)  # [n_seq, n_embd] -> [n_seq, n_embd]

    # projection to vocab
    x = layer_norm(x, **ln_f)  # [n_seq, n_embd] -> [n_seq, n_embd]
    return x @ wte.T  # [n_seq, n_embd] -> [n_seq, n_vocab]

After that come further details on each of these three parts.
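To make the first and last of the three parts concrete, the sketch below runs the embedding lookup and the projection back to the vocabulary on random weights and checks the shapes. It is an illustration only: the sizes and weights are made up, and the transformer blocks and final layer norm are deliberately skipped, so the resulting logits mean nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_ctx, n_embd = 10, 6, 4          # toy sizes, far smaller than GPT-2's

wte = rng.normal(size=(n_vocab, n_embd))   # token embedding matrix
wpe = rng.normal(size=(n_ctx, n_embd))     # positional embedding matrix

inputs = [3, 1, 7]                         # three token ids

# Part 1: look up one row per token and add the row for its position.
x = wte[inputs] + wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]

# Part 2 (the transformer blocks) is skipped in this sketch.

# Part 3: reuse the token embedding matrix to score every vocabulary entry.
logits = x @ wte.T                         # [n_seq, n_embd] -> [n_seq, n_vocab]

print(x.shape, logits.shape)  # (3, 4) (3, 10)
```

The same matrix `wte` is used both to embed tokens and to project back to the vocabulary, which is why the function takes no separate output weight.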

Testing the GPT we built

Putting all the code together gives gpt2.py, which is 120 lines in total, or 60 lines once comments and whitespace are removed.

Now test it!

python gpt2.py \
    "Alan Turing theorized that computers would one day become" \
    --n_tokens_to_generate 8

The result looks like this:

the most powerful machines on the planet.

It works!

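What comes next

In the last part, the author also sums up the weakness of these 60 short lines: they are very inefficient!

He does, however, offer two ways to make GPT more efficient:

Perform the attention computations in parallel rather than sequentially.
Implement a KV cache.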
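The idea behind a KV cache can be sketched for a single attention head: during generation, the keys and values of earlier tokens never change, so they can be stored and only the newest token needs projecting. The code below is a simplified illustration, not the tutorial's implementation. The names, random weights, and single-head setup are all assumptions.

```python
import numpy as np


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def attend(q, k, v):
    """Scaled dot-product attention for queries q over keys k and values v."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
xs = rng.normal(size=(5, d))  # embeddings of 5 tokens, arriving one at a time

# Without a cache: re-project the whole prefix into keys and values at every step.
naive = [
    attend(xs[t:t + 1] @ w_q, xs[:t + 1] @ w_k, xs[:t + 1] @ w_v)
    for t in range(len(xs))
]

# With a cache: project only the newest token and append it to the stored keys/values.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
cached = []
for t in range(len(xs)):
    x_t = xs[t:t + 1]                          # [1, d], the newest token only
    k_cache = np.vstack([k_cache, x_t @ w_k])  # grows by one row per step
    v_cache = np.vstack([v_cache, x_t @ w_v])
    cached.append(attend(x_t @ w_q, k_cache, v_cache))

print(np.allclose(np.vstack(naive), np.vstack(cached)))  # True
```

Both versions produce the same outputs, but the cached one does a constant amount of projection work per new token instead of work that grows with the length of the prefix.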

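The author also recommends methods and tutorials for training the model, evaluating it, and improving the architecture.

If you're interested, follow the links at the end of this article.

About the author

Jay Mody currently works on machine learning at Cohere, a Canadian NLP startup. Before that, he interned as a software engineer at Tesla and at Amazon.

Besides this tutorial, his blog has other posts as well, all with accompanying code.

Code:
github.com/jaymody/pic…
Tutorial:
jaykmody.com/blog/gpt-fr…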

