MiniCPM-V-4_5-GPTQ

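June 6, 2026 · Original in English

Summary

OpenBMB has released MiniCPM-V 4.5, built on Qwen3-8B and SigLIP2-400M with 8B total parameters. The model introduces a unified 3D-Resampler that compresses 6 frames into 64 tokens, supports single-image, multi-image, and video understanding at up to 10 FPS, and offers fast/deep thinking, OCR, PDF parsing, multilingual support, and deployment on iOS, local devices, and the cloud.

GitHub | CookBook | Technical Report | Demo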

MiniCPM-V 4.5

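MiniCPM-V 4.5 is the latest and most capable model in the MiniCPM-V series. It is built on Qwen3-8B and SigLIP2-400M, with 8B parameters in total. It delivers a significant performance improvement over previous MiniCPM-V and MiniCPM-o models and introduces new practical features. Notable features of MiniCPM-V 4.5 include: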

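🔥 State-of-the-art vision-language capability. MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation covering 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-latest and Gemini-2.0 Pro, as well as strong open-source models such as Qwen2.5-VL 72B, in vision-language capability, making it the best-performing MLLM under 30B parameters.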
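🎬 Efficient high-FPS and long video understanding. Powered by a new unified 3D-Resampler for images and videos, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens: six 448x448 video frames are jointly compressed into 64 video tokens (most MLLMs need 1,536 tokens for the same input). The model can therefore perceive significantly more video frames without increasing LLM inference cost. This gives it state-of-the-art high-FPS (up to 10 FPS) and long video understanding on Video-MME, LVBench, MLVU, MotionBench, FavorBench, and other benchmarks, at low inference cost.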
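⚙️ Controllable hybrid fast/deep thinking. MiniCPM-V 4.5 supports both fast thinking, for efficient frequent use with competitive performance, and deep thinking, for more complex problem solving. To cover the efficiency-performance trade-offs of different user scenarios, the two modes can be switched in a highly controllable way.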
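💪 Strong OCR, document parsing, and more. Based on the LLaVA-UHD architecture, MiniCPM-V 4.5 can process high-resolution images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344) while using 4x fewer visual tokens than most MLLMs. It achieves leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5. It also achieves state-of-the-art PDF document parsing among general MLLMs on OmniDocBench. Building on the latest RLAIF-V and VisCPM techniques, it exhibits trustworthy behavior, outperforming GPT-4o-latest on MMHal-Bench, and supports more than 30 languages.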
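💫 Easy to use. MiniCPM-V 4.5 can be used in several ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) quantized models in int4, GGUF, and AWQ formats in 16 sizes, (3) SGLang and vLLM support for high-throughput, memory-efficient inference, (4) fine-tuning on new domains and tasks with Transformers and LLaMA-Factory, (5) a quick local WebUI demo, (6) an optimized local iOS app for iPhone and iPad, and (7) an online web demo on a server. See our Cookbook for full usage!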
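
The 96x figure in the video bullet can be reproduced with a few lines of arithmetic. The sketch below assumes the vision encoder splits each frame into 14x14-pixel patches; that patch size is an assumption and is not stated in this card. Under that assumption, six 448x448 frames yield 6,144 patches, and compressing them into 64 tokens gives 96x. The 1,536-token figure quoted for most MLLMs works out to 256 tokens per frame.

```python
# Back-of-the-envelope check of the video token figures quoted in this card.
# PATCH_SIZE is an assumption (the card does not state it); the other
# numbers are taken directly from the feature list.
PATCH_SIZE = 14
FRAME_SIDE = 448
FRAMES_PER_GROUP = 6
TOKENS_PER_GROUP = 64
TYPICAL_MLLM_TOKENS = 1536

# Patches produced by the vision encoder before any compression.
patches_per_frame = (FRAME_SIDE // PATCH_SIZE) ** 2
patches_per_group = patches_per_frame * FRAMES_PER_GROUP

# Ratio of raw patches to the tokens that reach the LLM.
compression_rate = patches_per_group / TOKENS_PER_GROUP

# How the "most MLLMs" figure compares for the same six frames.
typical_tokens_per_frame = TYPICAL_MLLM_TOKENS // FRAMES_PER_GROUP
reduction_vs_typical = TYPICAL_MLLM_TOKENS / TOKENS_PER_GROUP

print(f"patches per frame: {patches_per_frame}")
print(f"patches per 6-frame group: {patches_per_group}")
print(f"compression rate: {compression_rate:.0f}x")
print(f"typical tokens per frame: {typical_tokens_per_frame}")
print(f"LLM token reduction vs. typical MLLM: {reduction_vs_typical:.0f}x")
```

Read this way, the 96x rate is measured against raw vision patches, while the number of tokens the LLM actually consumes drops by 24x relative to the 1,536-token baseline.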

Key Techniques

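Architecture: unified 3D-Resampler for high-density video compression. MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping up to 6 consecutive video frames and jointly compressing them into just 64 tokens (the same number used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96× compression rate for video tokens. This lets the model process more video frames without additional LLM compute, enabling high-FPS and long video understanding. The architecture encodes images, multi-image inputs, and videos in a unified way, so capabilities and knowledge transfer smoothly across them.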
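Pre-training: unified learning for document OCR and knowledge. Existing MLLMs typically learn OCR capability and document knowledge through separate training approaches. We observe that the essential difference between the two is the visibility of the text in the image. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to switch adaptively and appropriately between accurate text recognition (when the text is visible) and knowledge reasoning from multimodal context (when the text is heavily obscured). This removes the reliance on error-prone document parsers when learning knowledge from documents and prevents the hallucinations caused by over-augmented OCR data, yielding top-tier OCR and multimodal knowledge performance with low engineering overhead.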
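Post-training: hybrid fast/deep thinking with multimodal RL. MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. A new hybrid reinforcement learning method optimizes both modes jointly, substantially improving fast-mode performance without compromising deep-mode capability. Combined with RLPR and RLAIF-V, it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucination.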
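
The pre-training idea of corrupting text regions at varying noise levels can be illustrated with a toy example. The sketch below is not the authors' implementation: the card does not specify the corruption operator, so linear blending with uniform noise is an assumption, and `corrupt_text_regions`, the box format, and the grayscale list-of-rows image are all hypothetical choices made to keep the example dependency-free.

```python
import random


def corrupt_text_regions(image, boxes, noise_level, rng=None):
    """Return a copy of a grayscale image with noise blended into text boxes.

    image: list of rows, each a list of 0-255 ints (hypothetical format).
    boxes: list of (x0, y0, x1, y1) text regions, end-exclusive.
    noise_level: 0.0 keeps the text fully visible, 1.0 replaces it with noise.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    rng = rng if rng is not None else random.Random()
    out = [row[:] for row in image]  # copy so the input stays untouched
    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                noise = rng.uniform(0, 255)
                # Assumed corruption operator: linear blend of pixel and noise.
                blended = (1.0 - noise_level) * image[y][x] + noise_level * noise
                out[y][x] = int(round(min(255.0, max(0.0, blended))))
    return out


rng = random.Random(0)
page = [[255] * 32 for _ in range(16)]  # blank white "document"
text_boxes = [(4, 4, 28, 12)]           # one hypothetical text line

# Sample a fresh noise level per document, so training sees the whole range
# from "text readable" (OCR-like) to "text hidden" (knowledge-like).
noise_level = rng.random()
corrupted = corrupt_text_regions(page, text_boxes, noise_level, rng)
print(f"noise level: {noise_level:.2f}")
```

At low noise levels the reconstruction target can be read directly from the pixels, and at high levels it has to be inferred from the surrounding context, which is the continuum the paragraph above describes.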

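Evaluation

Inference Efficiency

OpenCompass

Video-MME

Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. For a fair comparison, the inference time reported for Video-MME includes the full model-side computation and excludes the external cost of video frame extraction, which depends on the specific frame extraction tool.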

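Examples

We deploy MiniCPM-V 4.5 on an iPad M4 with the iOS demo. The demo video is a raw, unedited screen recording.

Framework Support Matrix

Note: If you would like us to prioritize support for another open-source framework, please let us know via this short form.

Usage

To enable thinking mode, pass enable_thinking=True to the chat function.

Chat with Image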

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(100)

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6

image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

enable_thinking=False # If `enable_thinking=True`, the thinking mode is enabled.
stream=True # If `stream=True`, the answer is returned as a generator of text chunks rather than a single string.

# First round chat 
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=enable_thinking,
    stream=stream
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')

# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [generated_text]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    stream=stream
)

generated_text = ""
for new_text in answer:
    generated_text += new_text
    print(new_text, flush=True, end='')

You will get the following output:

# round1
The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.

This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.

# round2
When traveling to a karst landscape like this, here are some important tips:

1. Wear comfortable shoes: The terrain can be uneven and hilly.
2. Bring water and snacks for energy during hikes or boat rides.
3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
4. Respect local customs and nature regulations by not littering or disturbing wildlife.

By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.

Chat with Video

# The 3D-Resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
# To achieve this, you need to organize your video data into two corresponding sequences:
#   frames: List[Image]
#   temporal_ids: List[List[int]]

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord
from scipy.spatial import cKDTree
import numpy as np
import math

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

MAX_NUM_FRAMES=180 # Indicates the maximum number of frames received after the videos are packed. The actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING.
MAX_NUM_PACKING=3  # indicates the maximum packing number of video frames. valid range: 1-6
TIME_SCALE = 0.1 

def map_to_nearest_scale(values, scale):
    tree = cKDTree(np.asarray(scale)[:, None])
    _, indices = tree.query(np.asarray(values)[:, None])
    return np.asarray(scale)[indices]


def group_array(arr, size):
    return [arr[i:i+size] for i in range(0, len(arr), size)]

def encode_video(video_path, choose_fps=3, force_packing=None):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    fps = vr.get_avg_fps()
    video_duration = len(vr) / fps
        
    if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
        packing_nums = 1
        choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))
        
    else:
        packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(video_duration * choose_fps)
        else:
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING

    frame_idx = [i for i in range(0, len(vr))]      
    frame_idx =  np.array(uniform_sample(frame_idx, choose_frames))

    if force_packing:
        packing_nums = min(force_packing, MAX_NUM_PACKING)
    
    print(video_path, ' duration:', video_duration)
    print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')
    
    frames = vr.get_batch(frame_idx).asnumpy()

    frame_idx_ts = frame_idx / fps
    scale = np.arange(0, video_duration, TIME_SCALE)

    frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE
    frame_ts_id = frame_ts_id.astype(np.int32)

    assert len(frames) == len(frame_ts_id)

    frames = [Image.fromarray(v.astype('uint8')).convert('RGB') for v in frames]
    frame_ts_id_group = group_array(frame_ts_id, packing_nums)
    
    return frames, frame_ts_id_group


video_path="video_test.mp4"
fps = 5 # fps for video
force_packing = None # You can set force_packing to ensure that 3D packing is forcibly enabled; otherwise, encode_video will dynamically set the packing quantity based on the duration.
frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing)

question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]}, 
]


answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    max_slice_nums=1,
    temporal_ids=frame_ts_id_group
)
print(answer)
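
To see how encode_video chooses the frame count and packing factor without decoding a real video, the branch logic can be isolated into a pure function. The sketch below copies the constants and conditions from the example above; `plan_sampling`, `video_tokens`, and `TOKENS_PER_GROUP` are hypothetical names introduced for illustration. The token estimate assumes each packed group costs 64 tokens, as this card describes, and ignores any special tokens the processor may add.

```python
import math

MAX_NUM_FRAMES = 180   # same value as in the example above
MAX_NUM_PACKING = 3    # same value as in the example above
TOKENS_PER_GROUP = 64  # assumption: one packed group -> 64 visual tokens


def plan_sampling(video_duration, native_fps, choose_fps):
    """Mirror encode_video's branches; return (choose_frames, packing_nums)."""
    if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
        # Short video: every sampled frame gets its own group.
        packing_nums = 1
        choose_frames = round(
            min(choose_fps, round(native_fps)) * min(MAX_NUM_FRAMES, video_duration)
        )
    else:
        # Longer video: pack several frames per group to stay within budget.
        packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(video_duration * choose_fps)
        else:
            # Too long even at max packing: cap the number of sampled frames.
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING
    return choose_frames, packing_nums


def group_array(arr, size):
    """Same helper as in the example above: split arr into chunks of `size`."""
    return [arr[i:i + size] for i in range(0, len(arr), size)]


def video_tokens(choose_frames, packing_nums):
    """Estimated visual tokens: one 64-token block per packed group."""
    return math.ceil(choose_frames / packing_nums) * TOKENS_PER_GROUP


for duration in (20, 60, 600):  # seconds, at 30 native FPS, sampled at 5 FPS
    frames, packing = plan_sampling(duration, native_fps=30, choose_fps=5)
    print(duration, frames, packing, video_tokens(frames, packing))
# 20 100 1 6400
# 60 300 2 9600
# 600 540 3 11520

print(group_array(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```

Under these assumptions, a 30x longer video costs less than twice the visual tokens, because packing raises the number of frames per 64-token group before the frame cap takes effect.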

Chat with Multiple Images

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

In-context few-shot learning

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5-GPTQ', trust_remote_code=True)

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
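
When the number of demonstrations varies, the alternating user/assistant structure shown above can be generated from a list of pairs. This is a minimal sketch: `build_few_shot_msgs` is a hypothetical helper, not part of the model's API, and the demo uses placeholder strings where real code would pass PIL images so that it runs without image files.

```python
def build_few_shot_msgs(examples, query_image, question):
    """Build a few-shot message list in the format used above.

    examples: iterable of (image, answer) pairs used as demonstrations.
    query_image: the image to be answered last.
    question: the prompt repeated for every demonstration and the query.
    """
    msgs = []
    for image, answer in examples:
        msgs.append({'role': 'user', 'content': [image, question]})
        msgs.append({'role': 'assistant', 'content': [answer]})
    # The final user turn has no assistant reply; the model fills it in.
    msgs.append({'role': 'user', 'content': [query_image, question]})
    return msgs


# Placeholder strings stand in for PIL images in this demo.
demo_examples = [("<image1>", "2023.08.04"), ("<image2>", "2007.04.24")]
demo_msgs = build_few_shot_msgs(demo_examples, "<image_test>", "production date")

for m in demo_msgs:
    print(m['role'], m['content'])
```

The resulting list can be passed as `msgs` to `model.chat` in the same way as the hand-written version.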

License

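Model License

The MiniCPM-o/V model weights and code are open-sourced under the Apache-2.0 license.
To help us better understand and support our users, we would be grateful if you would consider optionally filling out a brief registration "questionnaire".

Statement

As an LMM, MiniCPM-V 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Nothing generated by MiniCPM-V 4.5 represents the views and positions of the model developers.
We will not be liable for any problems arising from the use of MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the model being misguided, misused, disseminated, or abused.

Key Techniques and Other Multimodal Projects

👏 Welcome to explore the key techniques behind MiniCPM-V 4.5 and other multimodal projects from our team: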

VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V

Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

@misc{yu2025minicpmv45cookingefficient,
      title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe}, 
      author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and Bokai Xu and Junbo Cui and Yingjing Xu and Liqing Ruan and Luoyuan Zhang and Hanyu Liu and Jingkun Tang and Hongyuan Liu and Qining Guo and Wenhao Hu and Bingxiang He and Jie Zhou and Jie Cai and Ji Qi and Zonghao Guo and Chi Chen and Guoyang Zeng and Yuxuan Li and Ganqu Cui and Ning Ding and Xu Han and Yuan Yao and Zhiyuan Liu and Maosong Sun},
      year={2025},
      eprint={2509.18154},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.18154}, 
}

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nature Communications},
  volume={16},
  pages={5509},
  year={2025}
}

Translated from OpenBMB · HF · Recorded on June 6, 2026