MiniCPM-V-4.6-Thinking-gguf
This repository hosts the GGUF (llama.cpp) quantized versions of OpenBMB MiniCPM-V 4.6 Thinking, the long chain-of-thought variant of MiniCPM-V 4.6. The model uses a SigLIP2-400M vision encoder, a Qwen3.5-0.8B LLM, and 4x/16x visual token compression. It supports image and video understanding and can be deployed on iOS, Android, and HarmonyOS.
For the original BF16 weights and the full model card, see openbmb/MiniCPM-V-4.6-Thinking on Hugging Face.
A pocket-sized MLLM for efficient image and video understanding on your phone
GitHub | CookBook | Demo | Feishu (Lark)
MiniCPM-V 4.6 Thinking
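MiniCPM-V 4.6 Thinking is the long chain-of-thought reasoning variant of MiniCPM-V 4.6. It generates an explicit reasoning trace before giving its final answer, which significantly improves performance on complex multimodal reasoning, math, and OCR-heavy tasks. It keeps the same edge-friendly architecture (SigLIP2-400M vision encoder + Qwen3.5-0.8B LLM) and the 4x/16x mixed visual token compression of MiniCPM-V 4.6.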
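Evaluation
Overall Performance (Thinking)
High-Concurrency Throughput
Single-Request TTFT (ms)
Examples
Overall
MiniCPM-V 4.6 can be deployed on the three mainstream on-device platforms: iOS, Android, and HarmonyOS. The clips below are raw, unedited screen recordings from phones.
Usage
Inference with Transformers
Installation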
pip install "transformers[torch]>=5.7.0" torchvision torchcodec
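Note on CUDA compatibility:
`torchcodec` (used for video decoding) may be incompatible with some CUDA versions. For example, `torch>=2.11` bundles CUDA 13.1 by default, so environments on CUDA 12.x may hit errors such as `RuntimeError: Could not load libtorchcodec`. Two workarounds:
- Replace `torchcodec` with PyAV, which supports both image and video inference and does not depend on the CUDA version:
pip install "transformers[torch]>=5.7.0" torchvision av
- Pin the CUDA version when installing torch so that it matches your environment (for example, CUDA 12.8):
pip install "transformers>=5.7.0" torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128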
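Before choosing between the two workarounds, it can help to check which video decoding package actually imports in the current environment. The following is a minimal sketch, not part of the MiniCPM-V or Transformers API: `pick_video_backend` is a hypothetical helper name, and it assumes that a CUDA mismatch surfaces as an exception raised while importing `torchcodec`.

```python
import importlib


def pick_video_backend():
    """Return the first video decoding package that imports cleanly, or None.

    Hypothetical helper for illustration only.
    """
    # Candidates in preference order: (label, importable module name).
    candidates = [("torchcodec", "torchcodec"), ("pyav", "av")]
    for label, module_name in candidates:
        try:
            importlib.import_module(module_name)
        except Exception:
            # ImportError if the package is missing; other errors (for example
            # RuntimeError) if it is installed but its native libraries fail to load.
            continue
        return label
    return None


if __name__ == "__main__":
    print("Usable video backend:", pick_video_backend())
```

If this reports `None` or `pyav` on a machine where `torchcodec` is installed, the CUDA mismatch described above is a likely cause.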
Load the model
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "openbmb/MiniCPM-V-4.6-Thinking"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Flash Attention 2 is recommended for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# import torch  # needed for torch.bfloat16 below
# model = AutoModelForImageTextToText.from_pretrained(
#     model_id,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
Image inference
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"},
            {"type": "text", "text": "What causes this phenomenon?"},
        ],
    }
]

downsample_mode = "16x"  # Use `downsample_mode="4x"` for finer detail
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_slice_nums=36,
).to(model.device)

generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
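Because the Thinking variant emits its reasoning trace before the final answer, callers often want to separate the two. The sketch below assumes the reasoning is terminated by a `</think>` tag, as in Qwen3-style chat templates. This delimiter is an assumption that this README does not confirm, and the tag may also be removed by `skip_special_tokens=True`, so inspect the raw decoded text first. `split_reasoning` is a hypothetical helper name.

```python
def split_reasoning(text, close_tag="</think>", open_tag="<think>"):
    """Split decoded output into (reasoning, answer).

    ASSUMPTION: the reasoning trace ends with `close_tag`. If the tag is
    absent, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # Remove the opening tag, if present, and surrounding whitespace.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Hand-written sample string for illustration; not real model output.
sample = "<think>Light bends when it enters water.</think>\nRefraction."
reasoning, answer = split_reasoning(sample)
print(answer)
```

If the answer part comes back empty on real outputs, the reasoning trace may have been cut off by `max_new_tokens`; raising that limit is worth trying before changing the parsing.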
Video inference
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/football.mp4"},
            {"type": "text", "text": "Describe this video in detail. Follow the timeline and focus on on-screen text, interface changes, main actions, and scene changes."},
        ],
    }
]

downsample_mode = "16x"  # Use `downsample_mode="4x"` for finer detail
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_num_frames=128,
    stack_frames=1,
    max_slice_nums=1,
    use_image_id=False,
).to(model.device)

generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=2048)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
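Advanced parameters
You can pass additional arguments to apply_chat_template to customize image/video processing:
| Parameter | Default | Applies to | Description |
|---|---|---|---|
| `downsample_mode` | `"16x"` | Image and video | Visual token downsampling. `"16x"` merges tokens for efficiency; `"4x"` keeps 4× more tokens for finer detail. Must also be passed to `generate()`. |
| `max_slice_nums` | `9` | Image and video | Maximum number of slices when splitting a high-resolution image. Higher values preserve more detail in large images. Recommended: 36 for images, 1 for video. |
| `max_num_frames` | `128` | Video only | Maximum number of main frames sampled from the video. |
| `stack_frames` | `1` | Video only | Total number of sample points per second. 1 = main frames only (no stacking). N (N>1) = 1 main frame + N−1 sub-frames per second; the sub-frames are composited into a grid image and interleaved with the main frames. Recommended: 3 or 5. |
| `use_image_id` | `True` | Image and video | Whether to add an `<image_id>N</image_id>` tag before each image/frame placeholder. Recommended: True for images, False for video. |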
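Note:
`downsample_mode` must be passed to both `apply_chat_template` (so that the number of placeholders is correct) and `generate` (for use by the vision encoder). All other parameters only need to be passed to `apply_chat_template`.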
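One way to keep the two calls in sync is to build the keyword arguments in a single place. This is a minimal sketch that encodes the recommendations from the table; `recommended_template_kwargs` is a hypothetical helper name and is not part of the Transformers or MiniCPM-V API.

```python
def recommended_template_kwargs(media_type, downsample_mode="16x"):
    """Build apply_chat_template kwargs from the table's recommendations.

    Hypothetical helper for illustration only.
    """
    if downsample_mode not in ("4x", "16x"):
        raise ValueError("downsample_mode must be '4x' or '16x'")
    if media_type == "image":
        return {
            "downsample_mode": downsample_mode,
            "max_slice_nums": 36,
            "use_image_id": True,
        }
    if media_type == "video":
        return {
            "downsample_mode": downsample_mode,
            "max_slice_nums": 1,
            "max_num_frames": 128,
            "stack_frames": 1,
            "use_image_id": False,
        }
    raise ValueError(f"unknown media_type: {media_type!r}")


template_kwargs = recommended_template_kwargs("video", downsample_mode="4x")
# Only downsample_mode is forwarded to generate(); the rest stay with the template.
generate_kwargs = {"downsample_mode": template_kwargs["downsample_mode"]}
print(template_kwargs, generate_kwargs)
```

The two dictionaries would then be unpacked as `processor.apply_chat_template(messages, ..., **template_kwargs)` and `model.generate(**inputs, **generate_kwargs, max_new_tokens=...)`.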
Serving with transformers serve
Hugging Face Transformers ships a lightweight OpenAI-compatible server, suitable for quick testing and moderate-load deployments.
pip install "transformers[serving]>=5.7.0"
Start the server:
transformers serve openbmb/MiniCPM-V-4.6-Thinking --port 8000 --host 0.0.0.0 --continuous-batching
Send a request:
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "openbmb/MiniCPM-V-4.6-Thinking",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
        {"type": "text", "text": "What causes this phenomenon?"}
      ]
    }]
  }'
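The same request can be issued from Python using only the standard library. This is a sketch that assumes the server returns the usual OpenAI-style response shape (`choices[0].message.content`); `build_chat_request` and `send_chat_request` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, model, image_url, prompt):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the first choice's message content.

    ASSUMPTION: the response follows the OpenAI chat completions schema.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000",
    "openbmb/MiniCPM-V-4.6-Thinking",
    "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png",
    "What causes this phenomenon?",
)
# Uncomment once the server is running:
# print(send_chat_request(request))
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.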
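Handling escaped newlines in model output
In some cases the model may emit the escaped newline \n as a string literal instead of a real line break. To render the text correctly, especially in a UI layer, you can use the utility function below. It carefully replaces literal \n with real newlines while protecting contexts where \n has a specific meaning.
Utility function: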
import re

# Group 1 matches spans that must be left untouched; the final alternative
# matches a literal \r\n, \n, or \r that is not preceded by another backslash.
_PATTERN = re.compile(
    r'(```[\s\S]*?```'        # fenced code blocks
    r'|`[^`]+`'               # inline code
    r'|\$\$[\s\S]*?\$\$'      # display math
    r'|\$[^$]+\$'             # inline math
    r'|\\\([\s\S]*?\\\)'      # \(...\)
    r'|\\\[[\s\S]*?\\\]'      # \[...\]
    r')'
    r'|(?<!\\)(?:\\r\\n|\\[nr])'
)


def normalize_response_text(text: str) -> str:
    """
    Lightweight post-processing: Converts literal '\\n' to actual newlines,
    while protecting code blocks, inline code, and LaTeX commands.
    """
    if not isinstance(text, str) or "\\" not in text:
        return text
    # Protected spans (group 1) are returned unchanged; anything else becomes '\n'.
    return _PATTERN.sub(lambda m: m.group(1) or '\n', text)
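Deploying MiniCPM-V 4.6 on iOS, Android, and HarmonyOS
We have adapted MiniCPM-V 4.6 for deployment on iOS, Android, and HarmonyOS, and all on-device adaptation code is fully open source. Developers can reproduce the on-device experience in a few steps. Visit our on-device deployment repository for per-platform build guides, or go to the download page to try the prebuilt apps directly.
Using MiniCPM-V 4.6 with other inference and training frameworks
MiniCPM-V 4.6 supports a range of inference and training frameworks. Quick-start commands for each are listed below. For full details, see our Cookbook.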
vllm serve openbmb/MiniCPM-V-4.6-Thinking \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--default-chat-template-kwargs '{"enable_thinking": true}'
Note:
`--enable-auto-tool-choice` and `--tool-call-parser qwen3_coder` enable tool/function calling support. If you do not need tools, omit these flags and run `vllm serve openbmb/MiniCPM-V-4.6-Thinking` directly.
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "openbmb/MiniCPM-V-4.6-Thinking",
"messages": [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
{"type": "text", "text": "What causes this phenomenon?"}
]}]
}'
Tool calling example:
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "openbmb/MiniCPM-V-4.6-Thinking",
  "messages": [{"role": "user", "content": [
    {"type": "text", "text": "What is the weather in Beijing?"}
  ]}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
      }
    }
  }]
}'
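When the model decides to call a tool, an OpenAI-compatible server returns the call in the assistant message's `tool_calls` field, and the client is expected to run the function and send the result back as a `tool` message. The sketch below shows that dispatch step. The `get_weather` body is a stub, `run_tool_calls` is a hypothetical helper name, and the sample assistant message is hand-written for illustration rather than captured from the model.

```python
import json


def get_weather(location):
    """Placeholder tool: a real implementation would call a weather service."""
    return {"location": location, "forecast": "unknown (stub)"}


# Maps the tool names declared in the request to local Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(message):
    """Execute each tool call in an assistant message and build tool replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        handler = TOOLS[function["name"]]
        # In the OpenAI schema, arguments arrive as a JSON-encoded string.
        arguments = json.loads(function["arguments"])
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(handler(**arguments)),
        })
    return replies


# HYPOTHETICAL assistant message, hand-written for illustration; not real output.
example_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\": \"Beijing\"}"},
    }],
}
print(run_tool_calls(example_message))
```

The returned `tool` messages would be appended to `messages`, after the assistant message that requested them, and sent in a follow-up request so the model can produce its final answer.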
python -m sglang.launch_server --model openbmb/MiniCPM-V-4.6-Thinking --port 30000
curl -s http://localhost:30000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "openbmb/MiniCPM-V-4.6-Thinking",
"messages": [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
{"type": "text", "text": "What causes this phenomenon?"}
]}]
}'
llama-server -m MiniCPM-V-4.6-Q4_K_M.gguf --port 8080
curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "MiniCPM-V-4.6",
"messages": [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
{"type": "text", "text": "What causes this phenomenon?"}
]}]
}'
ollama run minicpm-v-4.6-thinking
In the interactive session, paste an image path or URL to chat with the model.
llamafactory-cli train examples/train_lora/minicpmv4_6_lora_sft.yaml
swift sft --model_type minicpm-v-4_6 --dataset <your-dataset>
License
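Model License
- The MiniCPM-o/V model weights and code are open-sourced under the Apache-2.0 license.
Statement
- As MLLMs, MiniCPM-o/V models generate content by learning from large amounts of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgments. Nothing generated by MiniCPM-o/V models represents the views or positions of the model developers.
- We are not liable for any problems arising from the use of MiniCPM-o/V models, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the models being misled, misused, disseminated, or abused.
Technical Reports and Key Technical Papers
👏 Welcome to explore the key techniques behind MiniCPM-o/V and our team's other multimodal projects:
Technical reports: MiniCPM-o 4.5 | MiniCPM-V 4.5 | MiniCPM-o 2.6 | MiniCPM-Llama3-V 2.5 | MiniCPM-V 2.0
Other multimodal projects: VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V
Citation
If you find our model/code/paper helpful, please consider citing our papers 📝 and giving us a star ⭐️!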
@misc{cui2026minicpmo45realtimefullduplex,
title={MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction},
author={Junbo Cui and Bokai Xu and Chongyi Wang and Tianyu Yu and Weiyue Sun and Yingjing Xu and Tianran Wang and Zhihui He and Wenshuo Ma and Tianchi Cai and others},
year={2026},
url={https://arxiv.org/abs/2604.27393},
}
@misc{yu2025minicpmv45cookingefficient,
title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe},
author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and others},
year={2025},
url={https://arxiv.org/abs/2509.18154},
}
@article{yao2024minicpm,
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
journal={arXiv preprint arXiv:2408.01800},
year={2024}
}