MiniCPM-V-4.6-BNB

June 6, 2026

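Summary

This repository provides the bitsandbytes NF4 4-bit quantized version of OpenBMB MiniCPM-V 4.6. The model is built on SigLIP2-400M and Qwen3.5-0.8B, supports image and video understanding with 4x/16x visual token compression, can be deployed on iOS, Android, and HarmonyOS, and works with Transformers, vLLM, SGLang, llama.cpp, Ollama, and fine-tuning frameworks.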

This repository hosts the bitsandbytes (NF4, 4-bit) quantized version of MiniCPM-V 4.6. For the original BF16 weights and the full model card, see openbmb/MiniCPM-V-4.6.
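
For a rough sense of what NF4 quantization saves, the sketch below compares weight memory in BF16 and NF4. It is back-of-the-envelope arithmetic, not a measurement. The 0.4B and 0.8B parameter counts are nominal figures read off the component names (SigLIP2-400M, Qwen3.5-0.8B). The 64-weight block with one FP32 scale per block is assumed to match the usual bitsandbytes default. The sketch also assumes every weight is quantized, which real checkpoints often do not do (embeddings, norms, or the vision encoder may stay in higher precision), so actual savings are likely smaller.

```python
# Back-of-the-envelope weight-memory estimate (assumptions, not measurements).
NOMINAL_PARAMS = {
    "vision_encoder": 0.4e9,  # assumed from the name "SigLIP2-400M"
    "llm": 0.8e9,             # assumed from the name "Qwen3.5-0.8B"
}


def bf16_bytes(n_params: float) -> float:
    """BF16 stores each weight in 2 bytes."""
    return n_params * 2.0


def nf4_bytes(n_params: float, block_size: int = 64, scale_bytes: int = 4) -> float:
    """NF4 stores 4 bits per weight plus one scale value per block of weights."""
    return n_params * 0.5 + (n_params / block_size) * scale_bytes


def to_gib(n_bytes: float) -> float:
    return n_bytes / 2**30


total = sum(NOMINAL_PARAMS.values())
print(f"BF16 weights: {to_gib(bf16_bytes(total)):.2f} GiB")
print(f"NF4 weights : {to_gib(nf4_bytes(total)):.2f} GiB")
print(f"ratio       : {bf16_bytes(total) / nf4_bytes(total):.2f}x")
```

Under these assumptions the per-block scales add 0.5 bits per weight, so the effective cost is about 4.5 bits per weight rather than exactly 4.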

A pocket-sized MLLM for efficient image and video understanding on your phone

GitHub | CookBook | Demo | Feishu (Lark)

MiniCPM-V 4.6

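MiniCPM-V 4.6 is our most edge-deployment-friendly model to date. It is built on SigLIP2-400M and the Qwen3.5-0.8B LLM. It inherits the strong single-image, multi-image, and video understanding capabilities of the MiniCPM-V series while significantly improving computational efficiency. It also introduces mixed 4x/16x visual token compression. Notable features of MiniCPM-V 4.6 include: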

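🔥 Leading foundation capability. MiniCPM-V 4.6 scores 13 on the Artificial Analysis Intelligence Index benchmark, ahead of Qwen3.5-0.8B (10) at 19x lower token cost and ahead of Qwen3.5-0.8B-Thinking (11) at 43x lower token cost. It also outperforms the larger Ministral 3 3B (11).
💪 Strong multimodal capability. MiniCPM-V 4.6 outperforms Qwen3.5-0.8B on most vision-language understanding tasks and reaches Qwen3.5 2B-level capability on several benchmarks, including OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench.
🚀 Efficient architecture. Building on the latest techniques from LLaVA-UHD v4, MiniCPM-V 4.6 cuts the FLOPs of visual encoding by more than 50%. This lets MiniCPM-V 4.6 run more efficiently even than smaller models, reaching about 1.5x the token throughput of Qwen3.5-0.8B. It also supports mixed 4x/16x visual token compression rates, so users can trade accuracy against speed.
📱 Broad mobile platform coverage. MiniCPM-V 4.6 can be deployed on all three mainstream mobile platforms: iOS, Android, and HarmonyOS. All edge adaptation code is open source, so developers can reproduce the on-device experience in a few steps.
🛠️ Developer friendly. MiniCPM-V 4.6 is adapted to inference frameworks such as vLLM, SGLang, llama.cpp, and Ollama, and supports fine-tuning ecosystems such as SWIFT and LLaMA-Factory. Developers can quickly customize the model for new domains and tasks on consumer-grade GPUs. We provide quantized variants in several formats, including GGUF, BNB, AWQ, and GPTQ.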

Evaluation

Overall performance (Instruct)

High-concurrency throughput

Single-request TTFT (ms)

Examples

Overall

MiniCPM-V 4.6 can be deployed on all three mainstream edge platforms: iOS, Android, and HarmonyOS. The clips below are raw, unedited screen recordings from phones.

Usages

Inference with Transformers

Installation

pip install "transformers[torch]>=5.7.0" torchvision torchcodec

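Note on CUDA compatibility: torchcodec (used for video decoding) may be incompatible with some CUDA versions. For example, torch>=2.11 bundles CUDA 13.1 by default, and CUDA 12.x environments may hit errors such as RuntimeError: Could not load libtorchcodec. There are two workarounds:
Replace torchcodec with PyAV, which supports both image and video inference and does not depend on the CUDA version: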
pip install "transformers[torch]>=5.7.0" torchvision av
Pin the CUDA version when installing torch so that it matches your environment (for example, CUDA 12.8):
pip install "transformers>=5.7.0" torchvision torchcodec --extra-index-url https://download.pytorch.org/whl/cu128
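
Before choosing between the two workarounds, it can help to check which decoder actually loads in the current environment. The snippet below is a minimal sketch: `pick_video_backend` is a hypothetical helper, not part of Transformers, and it assumes that a broken torchcodec install fails at import time (as the RuntimeError above suggests). It only confirms that a package imports, not that Transformers will select it.

```python
import importlib


def pick_video_backend(candidates=("torchcodec", "av")):
    """Return the first decoder package that imports cleanly, else None."""
    for name in candidates:
        try:
            importlib.import_module(name)
        except Exception:
            # ImportError if the package is missing; other errors (for example
            # RuntimeError) if its native libraries cannot be loaded.
            continue
        return name
    return None


backend = pick_video_backend()
print("usable video backend:", backend or "none found; install torchcodec or av")
```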

Load Model

from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "openbmb/MiniCPM-V-4.6-BNB"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Flash Attention 2 is recommended for better acceleration and memory saving,
# especially in multi-image and video scenarios (this variant needs `import torch`).
# model = AutoModelForImageTextToText.from_pretrained(
#     model_id,
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

Image Inference

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"},
            {"type": "text", "text": "What causes this phenomenon?"},
        ],
    }
]

downsample_mode = "16x"  # use "4x" for finer detail

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_slice_nums=36,
).to(model.device)

generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

Video Inference

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/football.mp4"},
            {"type": "text", "text": "Describe this video in detail. Follow the timeline and focus on on-screen text, interface changes, main actions, and scene changes."},
        ],
    }
]

downsample_mode = "16x"  # use "4x" for finer detail

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
    downsample_mode=downsample_mode,
    max_num_frames=128,
    stack_frames=1,
    max_slice_nums=1,
    use_image_id=False,
).to(model.device)

generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
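
Both examples above pass downsample_mode twice, once to apply_chat_template and once to generate. One way to keep the two values in sync is a small wrapper, sketched below. `generate_text` is a hypothetical helper name, not part of Transformers or this repository; it only rearranges the calls already shown and forwards any extra keyword arguments to apply_chat_template.

```python
def generate_text(model, processor, messages, downsample_mode="16x",
                  max_new_tokens=512, **template_kwargs):
    """Run one chat turn, passing the same downsample_mode to both calls."""
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt",
        downsample_mode=downsample_mode,
        **template_kwargs,  # e.g. max_slice_nums, max_num_frames, use_image_id
    ).to(model.device)
    generated_ids = model.generate(
        **inputs, downsample_mode=downsample_mode, max_new_tokens=max_new_tokens
    )
    # Drop the prompt tokens so only the newly generated text is decoded.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
```

With the model and processor loaded as above, the image example becomes `generate_text(model, processor, messages, max_slice_nums=36)`, and the video example becomes `generate_text(model, processor, messages, max_new_tokens=2048, max_num_frames=128, stack_frames=1, max_slice_nums=1, use_image_id=False)`.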

Advanced Parameters

You can pass additional arguments to apply_chat_template to customize image and video processing:

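Parameter	Default	Applies to	Description
`downsample_mode`	`"16x"`	Image and video	Visual token downsampling. `"16x"` merges tokens for efficiency; `"4x"` keeps 4× more tokens for finer detail. Must also be passed to `generate()`.
`max_slice_nums`	`9`	Image and video	Maximum number of slices when splitting a high-resolution image. Higher values preserve more detail for large images. Recommended: `36` for images, `1` for video.
`max_num_frames`	`128`	Video only	Maximum number of main frames sampled from the video.
`stack_frames`	`1`	Video only	Total number of sample points per second. `1` = main frames only (no stacking). `N` (N>1) = 1 main frame + N−1 sub-frames per second; the sub-frames are composited into a grid image and interleaved with the main frames. Recommended: `3` or `5`.
`use_image_id`	`True`	Image and video	Whether to add an `<image_id>N</image_id>` tag before each image/frame placeholder. Recommended: `True` for images, `False` for video.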

Note: downsample_mode must be passed to both apply_chat_template (so the placeholder count is correct) and generate (for the vision encoder). All other parameters only need to be passed to apply_chat_template.
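
To see how stack_frames and max_num_frames interact, the sketch below counts sampled frames for a clip of a given length. It is an illustration of the table's description, not the processor's actual sampling code. `frame_budget` is a hypothetical helper, and two points are assumptions: the max_num_frames cap is applied by simple truncation, and each second's sub-frames are merged into exactly one grid image.

```python
import math


def frame_budget(duration_s: float, stack_frames: int = 1, max_num_frames: int = 128):
    """Count sampled frames under the assumptions stated above."""
    if stack_frames < 1:
        raise ValueError("stack_frames must be >= 1")
    # One main frame per second of video, capped at max_num_frames.
    main = min(math.ceil(duration_s), max_num_frames)
    # N > 1 adds N-1 sub-frames for every main frame.
    sub = main * (stack_frames - 1)
    # Assumption: each second's sub-frames are merged into a single grid image.
    grids = main if stack_frames > 1 else 0
    return {"main_frames": main, "sub_frames": sub, "images_to_encode": main + grids}


for n in (1, 3, 5):
    print(n, frame_budget(60, stack_frames=n))
```

Under these assumptions, raising stack_frames from 3 to 5 samples more sub-frames per second without increasing the number of images passed to the encoder, since the extra sub-frames land in the same grid.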

Serving with `transformers serve`

Hugging Face Transformers includes a lightweight, OpenAI-compatible server that is suitable for quick testing and moderate-load deployment.

pip install "transformers[serving]>=5.7.0"

Start the server:

transformers serve openbmb/MiniCPM-V-4.6-BNB --port 8000 --host 0.0.0.0 --continuous-batching

Send a request:

curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "openbmb/MiniCPM-V-4.6-BNB",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
        {"type": "text", "text": "What causes this phenomenon?"}
      ]
    }]
  }'
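
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl payload above. `build_chat_payload` and `post_chat` are hypothetical helper names, and the response access in the trailing comment assumes the server returns the usual OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_payload(model: str, image_url: str, prompt: str) -> dict:
    """Build an OpenAI-style chat request with one image and one text part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


payload = build_chat_payload(
    "openbmb/MiniCPM-V-4.6-BNB",
    "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png",
    "What causes this phenomenon?",
)
print(json.dumps(payload, indent=2))

# With the server from the previous step running:
# reply = post_chat("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```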

Handling escaped newlines in model output

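In some cases the model may output the literal two-character sequence \n (an escaped newline) instead of an actual line break. To render the text correctly, especially in a UI layer, you can use the utility function below. It replaces literal \n with real newlines while leaving alone the contexts where \n carries a specific meaning, such as code and math.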

Utility function:

import re

_PATTERN = re.compile(
    r'(```[\s\S]*?```'       # fenced code blocks
    r'|`[^`]+`'              # inline code
    r'|\$\$[\s\S]*?\$\$'     # display math
    r'|\$[^$]+\$'            # inline math
    r'|\\\([\s\S]*?\\\)'     # \(...\)
    r'|\\\[[\s\S]*?\\\]'     # \[...\]
    r')'
    r'|(?<!\\)(?:\\r\\n|\\[nr])'
)

def normalize_response_text(text: str) -> str:
    """
    Lightweight post-processing: converts literal '\\n', '\\r', and '\\r\\n'
    to actual newlines, while protecting fenced code blocks, inline code,
    and LaTeX math spans.
    """
    if not isinstance(text, str) or "\\" not in text:
        return text
    return _PATTERN.sub(lambda m: m.group(1) or '\n', text)
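
A few worked cases make the behavior concrete. The snippet below repeats the pattern and function so that it runs on its own; the only change is that the three-backtick fence is built from a `FENCE` constant (an illustrative name), which matches the same text. The last case shows a limitation worth knowing: a LaTeX command such as \nu outside a math span is not protected.

```python
import re

FENCE = "`" * 3  # three backticks; equivalent to writing them inline in the pattern
_PATTERN = re.compile(
    "(" + FENCE + r"[\s\S]*?" + FENCE   # fenced code blocks
    + r"|`[^`]+`"                        # inline code
    + r"|\$\$[\s\S]*?\$\$"               # display math
    + r"|\$[^$]+\$"                      # inline math
    + r"|\\\([\s\S]*?\\\)"               # \(...\)
    + r"|\\\[[\s\S]*?\\\]"               # \[...\]
    + r")"
    + r"|(?<!\\)(?:\\r\\n|\\[nr])"
)


def normalize_response_text(text):
    if not isinstance(text, str) or "\\" not in text:
        return text
    return _PATTERN.sub(lambda m: m.group(1) or "\n", text)


# Plain text: backslash + n becomes a real newline.
assert normalize_response_text("Line one\\nLine two") == "Line one\nLine two"
# Inline code is left untouched; the \n after it is still converted.
assert normalize_response_text("Use `a\\nb` here\\nnext") == "Use `a\\nb` here\nnext"
# Inside inline math, commands such as \nabla stay intact.
assert normalize_response_text("$\\nabla f$\\nDone") == "$\\nabla f$\nDone"
# A doubled backslash before n is not treated as an escaped newline.
assert normalize_response_text("path C:\\\\new") == "path C:\\\\new"
# Limitation: outside math and code, \nu is split into a newline plus "u".
assert normalize_response_text("\\nu") == "\nu"
print("all cases behave as described")
```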

Deploying MiniCPM-V 4.6 on iOS, Android, and HarmonyOS

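We have adapted MiniCPM-V 4.6 for deployment on iOS, Android, and HarmonyOS, and all edge adaptation code is fully open source. Developers can reproduce the on-device experience in a few steps. See our edge deployment repository for per-platform build guides, or go to the download page to try the prebuilt apps directly.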

Using MiniCPM-V 4.6 with Other Inference and Training Frameworks

MiniCPM-V 4.6 supports multiple inference and training frameworks. Quick-start commands for each framework are below. For full details, see our Cookbook.

vLLM

vllm serve openbmb/MiniCPM-V-4.6-BNB \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --default-chat-template-kwargs '{"enable_thinking": false}'

Note: --enable-auto-tool-choice and --tool-call-parser qwen3_coder enable tool/function calling support. If you do not need tools, omit these flags and simply run vllm serve openbmb/MiniCPM-V-4.6-BNB.

curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "openbmb/MiniCPM-V-4.6-BNB",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
    {"type": "text", "text": "What causes this phenomenon?"}
  ]}]
}'

Tool calling example:

curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "openbmb/MiniCPM-V-4.6-BNB",
  "messages": [{"role": "user", "content": [
    {"type": "text", "text": "北京的天气"}
  ]}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
      }
    }
  }]
}'
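
When the model decides to call a tool, the response carries the call rather than a final answer, and the client has to run the function and send the result back. The sketch below shows that client-side step under the assumption that the server follows the OpenAI chat-completions response shape. `run_tool_calls`, `TOOLS`, and the placeholder `get_weather` body are illustrative names, and `sample_response` is hand-written for the example, not real model output.

```python
import json


def get_weather(location: str) -> str:
    """Stand-in implementation; a real one would query a weather service."""
    return f"(placeholder) weather report for {location}"


TOOLS = {"get_weather": get_weather}


def run_tool_calls(response: dict) -> list:
    """Execute each requested tool and build the follow-up `tool` messages."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        output = TOOLS[fn["name"]](**args)
        results.append({"role": "tool", "tool_call_id": call["id"], "content": output})
    return results


# Hand-written response in the OpenAI chat-completions shape (not real output).
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"location": "Beijing"}),
                },
            }],
        }
    }]
}

print(run_tool_calls(sample_response))
```

The returned messages would be appended to the conversation, after the assistant message that requested the call, and sent in a second request so the model can produce its final answer.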

SGLang

python -m sglang.launch_server --model openbmb/MiniCPM-V-4.6-BNB --port 30000

curl -s http://localhost:30000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "openbmb/MiniCPM-V-4.6-BNB",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
    {"type": "text", "text": "What causes this phenomenon?"}
  ]}]
}'

llama.cpp

llama-server -m MiniCPM-V-4.6-Q4_K_M.gguf --port 8080

curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "MiniCPM-V-4.6",
  "messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
    {"type": "text", "text": "What causes this phenomenon?"}
  ]}]
}'

Ollama

ollama run minicpm-v-4.6

In the interactive session, paste an image path or URL to chat with the model.

LLaMA-Factory

llamafactory-cli train examples/train_lora/minicpmv4_6_lora_sft.yaml

SWIFT

swift sft --model_type minicpm-v-4_6 --dataset <your-dataset>

License

Model License

The MiniCPM-o/V model weights and code are open-sourced under the Apache-2.0 license.

Statement

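As MLLMs, MiniCPM-o/V models generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgments. Nothing generated by MiniCPM-o/V models represents the views or positions of the model developers.
We will not be liable for any problems arising from the use of MiniCPM-o/V models, including but not limited to data security issues, risks to public opinion, or any risks and problems arising from the misdirection, misuse, dissemination, or abuse of the models.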

Technical Reports and Key Techniques Papers

👏 Welcome to explore the key techniques behind MiniCPM-o/V and other multimodal projects from our team:

Technical Reports: MiniCPM-o 4.5 | MiniCPM-V 4.5 | MiniCPM-o 2.6 | MiniCPM-Llama3-V 2.5 | MiniCPM-V 2.0

Other Multimodal Projects: VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V

Citation

If you find our model/code/paper helpful, please consider citing our papers 📝 and giving us a star ⭐️!

@misc{cui2026minicpmo45realtimefullduplex,
      title={MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction}, 
      author={Junbo Cui and Bokai Xu and Chongyi Wang and Tianyu Yu and Weiyue Sun and Yingjing Xu and Tianran Wang and Zhihui He and Wenshuo Ma and Tianchi Cai and others},
      year={2026},
      url={https://arxiv.org/abs/2604.27393}, 
}

@misc{yu2025minicpmv45cookingefficient,
      title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe}, 
      author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and others},
      year={2025},
      url={https://arxiv.org/abs/2509.18154}, 
}

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}

Translated from OpenBMB (Hugging Face) · recorded June 6, 2026