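CodeShell, a Breakthrough Multilingual Code LLM Base Model: Jointly Built by Peking University and Sichuan Tianfu Bank for a New Era of AI Programming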

1. Introduction to CodeShell

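CodeShell is a multilingual code LLM base model developed by the Knowledge Computing Lab of Peking University together with the AI team of Sichuan Tianfu Bank. It has 7 billion parameters, was trained on 500 billion tokens, and has a context window of 8192 tokens. On the authoritative code evaluation benchmarks HumanEval and MBPP, CodeShell achieves the best performance among models of the same scale. The project provides a powerful tool for multilingual code processing and understanding.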

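Key features
- Strong performance: CodeShell achieves the best performance among 7B code base models on HumanEval and MBPP.
- Complete ecosystem: in addition to the code LLM, the IDE plugins (VS Code and JetBrains) are open-sourced as well, forming an open-source full-stack technology system.
- Lightweight deployment: local C++ deployment is supported, providing a lightweight and fast local software development assistant.
- Comprehensive evaluation: a multi-task evaluation system that supports full project context and covers common software development activities such as code generation, code defect detection and repair, and test case generation (to be open-sourced soon).
- Efficient training: built on an efficient data governance system, CodeShell reached excellent performance after training on only 500 billion tokens from a complete cold start.

Open-source models
- CodeShell Base: the CodeShell base model, with strong foundational coding ability.
- CodeShell Chat: the CodeShell chat model, which performs well on downstream tasks such as code Q&A and code completion.
- CodeShell Chat 4bit: the 4-bit quantized version of the CodeShell chat model, which uses less memory and runs faster while preserving model performance.
- CodeShell CPP: the C++ version of the CodeShell chat model, which lets developers run it on personal computers without a GPU. Note that the CPP version also supports quantization, so CodeShell can run on a personal computer with as little as 8 GB of memory.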

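2. Evaluation

We evaluated the model on the two most popular code evaluation datasets, HumanEval and MBPP. Compared with the two most advanced 7B code LLMs, CodeLlama and Starcoder, CodeShell achieves the best results on both. The table also lists per-language results (the multiple-* rows), where CodeLlama-7b is ahead on some languages. Detailed results are as follows.

Task	CodeShell-7b	CodeLlama-7b	Starcoder-7b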
humaneval	34.32	29.44	27.80
mbpp	38.65	37.60	34.16
multiple-js	33.17	31.30	27.02
multiple-java	30.43	29.24	24.30
multiple-cpp	28.21	27.33	23.04
multiple-swift	24.30	25.32	15.70
multiple-php	30.87	25.96	22.11
multiple-d	8.85	11.60	8.08
multiple-jl	22.08	25.28	22.96
multiple-lua	22.39	30.50	22.92
multiple-r	20.52	18.57	14.29
multiple-rkt	17.20	12.55	10.43
multiple-rs	24.55	25.90	22.82
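
To see at a glance where each model leads, the scores in the table above can be tallied per task. The snippet below is a small sketch that copies the numbers from the table and counts, for each model, the tasks on which it has the highest score; the function and variable names are illustrative only.

```python
# Scores copied from the table above: task -> (CodeShell-7b, CodeLlama-7b, Starcoder-7b)
MODELS = ("CodeShell-7b", "CodeLlama-7b", "Starcoder-7b")
SCORES = {
    "humaneval": (34.32, 29.44, 27.80),
    "mbpp": (38.65, 37.60, 34.16),
    "multiple-js": (33.17, 31.30, 27.02),
    "multiple-java": (30.43, 29.24, 24.30),
    "multiple-cpp": (28.21, 27.33, 23.04),
    "multiple-swift": (24.30, 25.32, 15.70),
    "multiple-php": (30.87, 25.96, 22.11),
    "multiple-d": (8.85, 11.60, 8.08),
    "multiple-jl": (22.08, 25.28, 22.96),
    "multiple-lua": (22.39, 30.50, 22.92),
    "multiple-r": (20.52, 18.57, 14.29),
    "multiple-rkt": (17.20, 12.55, 10.43),
    "multiple-rs": (24.55, 25.90, 22.82),
}


def best_model_per_task(scores):
    """Return {task: name of the model with the highest score on that task}."""
    return {task: MODELS[row.index(max(row))] for task, row in scores.items()}


def count_wins(scores):
    """Count how many tasks each model leads."""
    wins = {name: 0 for name in MODELS}
    for winner in best_model_per_task(scores).values():
        wins[winner] += 1
    return wins


if __name__ == "__main__":
    print(count_wins(SCORES))
```

On these numbers CodeShell-7b leads on 8 of the 13 rows, including HumanEval and MBPP, and CodeLlama-7b leads on the other 5 (the swift, d, jl, lua, and rs rows).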

3. Quick Start

3.1 Environment requirements

- Python 3.8 or above
- PyTorch 2.0 or above (recommended)
- Transformers 4.32 or above
- CUDA 11.8 or above (recommended for GPU users, flash-attention users, etc.)
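
Before installing anything else, it can help to check the local environment against the minimums above. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are made up for this example, and CUDA is not checked because it is not a pip package.

```python
import re
from importlib import metadata

# Minimum versions taken from the list above (pip package name -> minimum version)
MINIMUMS = {"torch": "2.0", "transformers": "4.32"}


def version_tuple(version):
    """Turn '2.0.1+cu118' into (2, 0, 1); non-numeric suffixes are ignored."""
    numbers = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in numbers.group(0).split(".")) if numbers else ()


def meets_minimum(installed, required):
    """Compare dotted versions numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def check_environment(minimums=MINIMUMS):
    """Return {package: (installed version or None, ok flag)}."""
    report = {}
    for package, required in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = (None, False)
        else:
            report[package] = (installed, meets_minimum(installed, required))
    return report


if __name__ == "__main__":
    import sys
    print("python", sys.version_info[:3], "OK" if sys.version_info >= (3, 8) else "too old")
    for package, (installed, ok) in check_environment().items():
        print(package, installed, "OK" if ok else "needs attention")
```

Versions are compared as integer tuples rather than strings, so that for example 4.100 is correctly treated as newer than 4.32.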

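The CodeShell models have been uploaded to Hugging Face, so developers can quickly call CodeShell and CodeShell-Chat through Transformers.

Before you start, make sure the environment is set up correctly, the necessary packages are installed, and the requirements in the previous subsection are met. The dependencies can be installed with the following command.

pip install -r requirements.txt

You can then use CodeShell through Transformers.

3.2 Code Generation

Developers can use CodeShell to generate code quickly and speed up development.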

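import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("WisdomShell/CodeShell-7B")
# trust_remote_code=True lets Transformers load the custom model code from the model repository
model = AutoModelForCausalLM.from_pretrained("WisdomShell/CodeShell-7B", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)

# Tokenize the prompt, generate a continuation, and decode it back to text
inputs = tokenizer('def merge_sort():', return_tensors='pt').to(device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))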

Fill in the Middle

CodeShell supports Fill-in-the-Middle mode, which better supports the software development process.

input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer(input_text, return_tensors='pt').to(device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
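
The prompt layout used above (prefix, then suffix, then a marker asking for the middle) can be wrapped in small helpers. This is a sketch only: the three marker strings come from the example above, while the helper names and the optional `end_token` argument are assumptions made for illustration, as is the expectation that the decoded output echoes the prompt before the generated middle.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix, suffix):
    """Lay out the code before and after the gap in the order shown above."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def extract_middle(decoded, end_token=None):
    """Return the text after <fim_middle>, or '' if the marker is absent.

    end_token is a hypothetical parameter: pass the model's end-of-text
    string if the decoded output contains one and it should be cut off.
    """
    _, _, middle = decoded.partition(FIM_MIDDLE)
    if end_token and end_token in middle:
        middle = middle.split(end_token, 1)[0]
    return middle


def splice(prefix, middle, suffix):
    """Put the generated middle back between the original prefix and suffix."""
    return prefix + middle + suffix


if __name__ == "__main__":
    prefix = "def print_hello_world():\n    "
    suffix = "\n    print('Hello world!')"
    print(build_fim_prompt(prefix, suffix))
```

With a loaded model, the string returned by `build_fim_prompt` would be passed to the tokenizer exactly like `input_text` above, and `extract_middle` applied to the decoded output.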

Code Q&A

CodeShell also open-sources the code assistant model CodeShell-7B-Chat. Developers can interact with the model using the code below.

model = AutoModelForCausalLM.from_pretrained('WisdomShell/CodeShell-7B-Chat', trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained('WisdomShell/CodeShell-7B-Chat')

# history holds the earlier (query, response) pairs of the conversation
history = []
query = '你是谁?'  # "Who are you?"
response = model.chat(query, history, tokenizer)
print(response)
history.append((query, response))

query = '用Python写一个HTTP server'  # "Write an HTTP server in Python"
response = model.chat(query, history, tokenizer)
print(response)
history.append((query, response))
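
The call-then-append pattern above repeats for every turn, so it can be folded into one helper. The sketch below assumes only what the example shows, namely that `model.chat(query, history, tokenizer)` returns the response text; `ask` and the `EchoModel` stand-in are hypothetical names, and the stand-in exists only so the snippet runs without downloading weights.

```python
def ask(model, tokenizer, query, history):
    """Send one query, record the (query, response) pair in history, return the response."""
    response = model.chat(query, history, tokenizer)
    history.append((query, response))
    return response


class EchoModel:
    """Hypothetical stand-in with the same chat() call shape as in the example above."""

    def chat(self, query, history, tokenizer):
        return f"[turn {len(history) + 1}] you said: {query}"


if __name__ == "__main__":
    history = []
    model = EchoModel()  # replace with the real CodeShell-7B-Chat model and tokenizer
    for question in ("Who are you?", "Write an HTTP server in Python"):
        print(ask(model, None, question, history))
```

Keeping the history update inside the helper makes it harder to forget a turn, which would otherwise silently drop context from later answers.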

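Developers can also interact with CodeShell-7B-Chat through the VS Code and JetBrains plugins. For details, see the VS Code plugin repository and the IntelliJ plugin repository.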

Model Quantization

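CodeShell supports 4-bit and 8-bit quantization. After 4-bit quantization the model occupies about 6 GB of GPU memory, so CodeShell can be used on GPUs with limited memory.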

model = AutoModelForCausalLM.from_pretrained('WisdomShell/CodeShell-7B-Chat-int4', trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained('WisdomShell/CodeShell-7B-Chat-int4')
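
As a rough sanity check on the memory figure, the space taken by the weights alone can be computed from the parameter count and the bits per parameter. This is back-of-the-envelope arithmetic under the assumption of exactly 7 billion parameters; it ignores activations, the KV cache, and any tensors kept at higher precision, which is presumably why the quoted figure of about 6 GB is larger than the 4-bit number printed here.

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


if __name__ == "__main__":
    n_params = 7_000_000_000  # "7 billion parameters", rounded as in the introduction
    for label, bits in (("bfloat16", 16), ("8 bit", 8), ("4 bit", 4)):
        print(f"{label:>8}: {weight_memory_gib(n_params, bits):.2f} GiB")
```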

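CodeShell in C/C++

Because most personal computers have no GPU, CodeShell provides C/C++ inference support. Developers can compile and use it according to their local environment; see the CodeShell C/C++ local version for details.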

3.3 Demo

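We provide demos in four forms: Web UI, command line, API, and IDE.

3.3.1 Web UI

Start the web service with the command below. Once the service is running, it can be accessed at https://127.0.0.1:8000.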

python demos/web_demo.py

3.3.2 CLI Demo

We also provide a command-line interactive demo, which can be run with the command below.

python demos/cli_demo.py

3.3.3 API

CodeShell also provides a deployment method based on the OpenAI API.

python demos/openai_api.py

Once it is running, you can interact with CodeShell through HTTP requests.

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CodeShell-7B-Chat",
    "messages": [
      {
        "role": "user",
        "content": "你好"
      }
    ]
  }'
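
The same request can be sent from Python with the standard library alone. The sketch below mirrors the curl call (same URL, header, and JSON body); the function names are illustrative, and the response layout read in `demo` is an assumption based on the statement that the service follows the OpenAI API, so verify it against the actual reply.

```python
import json
import urllib.request


def build_chat_request(content, base_url="http://127.0.0.1:8000", model="CodeShell-7B-Chat"):
    """Build the POST request equivalent to the curl command above."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body (the demo server must be running)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Not called automatically, because it needs the API demo to be running locally."""
    reply = send(build_chat_request("你好"))
    # Assumption: the reply follows the OpenAI chat-completions layout
    print(reply["choices"][0]["message"]["content"])
```

Call `demo()` after starting `python demos/openai_api.py` to try it.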

3.3.4 IDE

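Finally, CodeShell provides an online IDE in which developers can perform code completion, code Q&A, and other operations. The IDE plugins have been released at the same time, and developers can install and use them locally. Plugin-related questions are welcome in the VS Code plugin repository and the IntelliJ plugin repository.

4. Model Details

CodeShell uses GPT-2 as its base architecture and adopts techniques such as Grouped-Query Attention and RoPE relative position encoding.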
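
To illustrate what Grouped-Query Attention changes, the sketch below uses the values listed in section 4.1 (42 layers, 4096 embedding width, 32 heads, 8 query groups, 8192 sequence length). It assumes that `num_query_groups` is the number of shared key/value heads and that consecutive query heads share one key/value head, which is the common convention rather than something the text states; the function names are illustrative.

```python
# Values from the hyper-parameter table in section 4.1
N_LAYER = 42
N_EMBD = 4096
N_HEAD = 32
NUM_QUERY_GROUPS = 8
SEQ_LENGTH = 8192


def kv_group_for_head(head, n_head=N_HEAD, n_groups=NUM_QUERY_GROUPS):
    """Index of the shared key/value head used by a query head (consecutive heads share)."""
    heads_per_group = n_head // n_groups
    return head // heads_per_group


def kv_cache_bytes(n_kv_heads, seq_len=SEQ_LENGTH, bytes_per_value=2):
    """Bytes for the keys and values of one sequence across all layers (2 bytes = bfloat16)."""
    head_dim = N_EMBD // N_HEAD
    return 2 * N_LAYER * seq_len * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    gib = 1024 ** 3
    print("query heads 0..7 use KV heads:", [kv_group_for_head(h) for h in range(8)])
    print(f"KV cache with 8 KV heads:  {kv_cache_bytes(NUM_QUERY_GROUPS) / gib:.4f} GiB")
    print(f"KV cache with 32 KV heads: {kv_cache_bytes(N_HEAD) / gib:.4f} GiB")
```

Under these assumptions, sharing key/value heads across groups of four query heads cuts the KV cache for a full 8192-token sequence to a quarter of what one key/value head per query head would need.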

4.1 Hyper-parameter

Hyper-parameter	Value
n_layer	42
n_embd	4096
n_inner	16384
n_head	32
num_query_groups	8
seq-length	8192
vocab_size	70144
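
The hyper-parameters above can be cross-checked against the stated 7 billion parameter scale. The estimate below is a rough sketch, not the official count: it assumes a two-matrix GPT-2 style MLP, key/value projections sized by `num_query_groups`, a single embedding matrix, and it ignores biases and layer norms.

```python
def estimate_parameters(n_layer=42, n_embd=4096, n_inner=16384, n_head=32,
                        num_query_groups=8, vocab_size=70144):
    """Approximate weight count from the table above (biases and norms ignored)."""
    head_dim = n_embd // n_head
    kv_dim = num_query_groups * head_dim
    # Query and output projections are n_embd x n_embd; key and value are n_embd x kv_dim
    attention = n_embd * (n_embd + 2 * kv_dim + n_embd)
    # Up projection (n_embd x n_inner) plus down projection (n_inner x n_embd)
    mlp = 2 * n_embd * n_inner
    embedding = vocab_size * n_embd
    return n_layer * (attention + mlp) + embedding


if __name__ == "__main__":
    total = estimate_parameters()
    print(f"{total:,} parameters (about {total / 1e9:.2f} billion)")
```

The result is in the 7 billion range, in line with the model's 7B label.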

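4.2 Dataset

CodeShell was trained on GitHub data crawled by the team, the Stack and StarCoder datasets open-sourced by BigCode, and a small amount of high-quality Chinese and English data. On top of the raw datasets, CodeShell deduplicates the data with MinHash and filters it with KenLM and a high-quality data selection model, which yields the final high-quality pre-training dataset.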
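
The general idea of MinHash deduplication can be sketched in a few lines of standard-library Python. This is a generic illustration of the technique, not CodeShell's pipeline: the shingle width, the number of hash functions, the 0.7 threshold, and all function names are assumptions, and a production system would use locality-sensitive hashing instead of the pairwise comparison shown here.

```python
import hashlib


def shingles(text, width=2):
    """Set of overlapping runs of `width` whitespace-separated tokens."""
    tokens = text.split()
    if len(tokens) < width:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + width]) for i in range(len(tokens) - width + 1)}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the smallest hash value over the (non-empty) set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest(), "big")
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of the Jaccard similarity of the two sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(documents, threshold=0.7):
    """Keep a document only if it is not too similar to any document kept so far."""
    kept, signatures = [], []
    for doc in documents:
        items = shingles(doc)
        if not items:
            continue  # skip empty documents
        sig = minhash_signature(items)
        if all(estimated_similarity(sig, other) < threshold for other in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept


if __name__ == "__main__":
    docs = ["def add(a, b): return a + b",
            "def add(a, b): return a + b",
            "class Stack: pass"]
    print(deduplicate(docs))
```

Each signature slot matches with probability equal to the Jaccard similarity of the two shingle sets, so comparing short signatures approximates comparing the full sets.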

4.3 Tokenizer

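CodeShell optimizes the Starcoder vocabulary by removing rarely used tokens and adding some Chinese vocabulary, which markedly improves the compression rate for Chinese and lays the groundwork for training the Chat version.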

Tokenizer	Size	Chinese	English	Code	Total
Starcoder	49152	1.22	3.47	3.30	2.66
CodeShell	70020	1.50	3.47	3.30	2.95

References:

* Hugging Face model: [https://huggingface.co/WisdomShell/CodeShell-7B/tree/main](https://huggingface.co/WisdomShell/CodeShell-7B/tree/main)
* [codeshell](https://github.com/WisdomShell/codeshell)
* https://se.pku.edu.cn/kcl/

