How to Use an LLM Tokenizer Without Installing Transformers

When building LLM applications, you often just want to tokenize text or count tokens, yet end up installing several GB of transformers and torch (transformers depends on torch), which bloats the Docker image considerably. There is little discussion of this problem online, but people do keep asking about it, so I figured I'd write it down.

In fact, by using HuggingFace's tokenizers library (implemented in Rust), you can replicate the functionality of AutoTokenizer (including apply_chat_template) without any dependency on PyTorch. In my tests with the Qwen 2.5 model's tokenizer, on the same ~1k-length prompt, the Rust tokenizer was roughly ten times faster at tokenization than AutoTokenizer from transformers.
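
Only two lightweight dependencies are needed (pip install tokenizers jinja2). For reference, a comparison like the one above can be reproduced with a minimal timing sketch along these lines; transformers is imported here purely for the side-by-side, and the loop count and filler text are arbitrary choices, not the author's exact benchmark:

import time
from tokenizers import Tokenizer
from transformers import AutoTokenizer

model_dir = "models/Qwen2.5-14B-Instruct"
text = "hello world " * 400  # filler, roughly 1k tokens

rust_tok = Tokenizer.from_file(f"{model_dir}/tokenizer.json")
hf_tok = AutoTokenizer.from_pretrained(model_dir)

start = time.perf_counter()
for _ in range(1000):
    rust_tok.encode(text)
print(f"tokenizers:   {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
for _ in range(1000):
    hf_tok(text)  # __call__ builds input_ids and attention_mask
print(f"transformers: {time.perf_counter() - start:.3f}s")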

Using tokenizers with Qwen2.5 as an Example

import json
from tokenizers import Tokenizer
from jinja2 import Template
from pathlib import Path

model_dir = Path("models/Qwen2.5-14B-Instruct/")
tokenizer_json_path = str((model_dir / "tokenizer.json").resolve())
config_json_path = str((model_dir / "tokenizer_config.json").resolve())

tokenizer = Tokenizer.from_file(tokenizer_json_path)

# Load the chat template
# transformers' apply_chat_template really just reads the chat_template
# field from tokenizer_config.json
with open(config_json_path, 'r', encoding='utf-8') as f:
    config = json.load(f)
    chat_template = config.get("chat_template")
    # Grab the special tokens to pass into the template context
    # (these fields can be null in some configs, hence the `or ""`)
    bos_token = config.get("bos_token") or ""
    eos_token = config.get("eos_token") or ""

# Prepare the input data
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Manually render the chat template (replacing tokenizer.apply_chat_template)
# Qwen's template typically expects messages, add_generation_prompt, bos_token, eos_token, etc.
template = Template(chat_template)
text = template.render(
    messages=messages,
    add_generation_prompt=True,
    bos_token=bos_token,
    eos_token=eos_token
)

print("=== Rendered Text ===")
print(text)
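
# Note: recent transformers versions render chat templates in a sandboxed
# Jinja environment, roughly equivalent to:
#   from jinja2.sandbox import ImmutableSandboxedEnvironment
#   env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
#   template = env.from_string(chat_template)
# Building the template this way instead of a plain Template keeps the
# whitespace handling identical to apply_chat_template.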

prompt = "你好,世界"

encoded = tokenizer.encode(prompt)

# Get input_ids and attention_mask
input_ids = encoded.ids
attention_mask = encoded.attention_mask

print(input_ids)

print(f"Total tokens: {len(input_ids)}")
