moonshotai/Kimi-K2.6

June 6, 2026 · Original English text

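Summary

Moonshot AI has released Kimi K2.6, an open-source, natively multimodal agentic model. It uses a MoE architecture with 1T total parameters, 32B activated parameters, and a 256K context length. The model advances long-horizon coding, coding-driven design, Agent Swarm (scaling to 300 sub-agents and 4,000 coordinated steps), and proactive orchestration. It scores 54.0 on HLE-Full (w/ tools), 80.2 on SWE-Bench Verified, and 96.4 on AIME 2026. The model weights are released under a Modified MIT License.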

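1. Model Introduction

Kimi K2.6 is an open-source, natively multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.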

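Key Features

Long-Horizon Coding: K2.6 shows significant improvements on complex end-to-end coding tasks and generalizes robustly across programming languages (Rust, Go, Python) and domains such as front-end, DevOps, and performance optimization.
Coding-Driven Design: K2.6 turns simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and rich animations with careful aesthetic precision.
Enhanced Agent Swarm: K2.6 scales horizontally to 300 sub-agents executing 4,000 coordinated steps. It dynamically decomposes tasks into parallel, domain-specialized subtasks and delivers documents, websites, and spreadsheets end to end in a single autonomous run.
Proactive & Open Orchestration: For autonomous tasks, K2.6 excels at powering persistent, 24/7 background agents that proactively manage schedules, execute code, and orchestrate cross-platform operations without human supervision.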

2. Model Summary


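Architecture	Mixture-of-Experts (MoE)
Total Parameters	1T
Activated Parameters	32B
Number of Layers (Dense layer included)	61
Number of Dense Layers	1
Attention Hidden Dimension	7168
MoE Hidden Dimension (per Expert)	2048
Number of Attention Heads	64
Number of Experts	384
Selected Experts per Token	8
Number of Shared Experts	1
Vocabulary Size	160K
Context Length	256K
Attention Mechanism	MLA
Activation Function	SwiGLU
Vision Encoder	MoonViT
Parameters of Vision Encoder	400M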
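The table lists 384 routed experts, 8 selected per token, and 1 shared expert. To make that concrete, here is a minimal top-k routing sketch for a single token in plain Python. It is a generic softmax-over-selected-experts router, not Kimi's actual gating function (which this card does not describe), and the function and constant names are hypothetical.

```python
import math
import random
from typing import List, Tuple

NUM_EXPERTS = 384   # "Number of Experts" from the table
TOP_K = 8           # "Selected Experts per Token"
NUM_SHARED = 1      # "Number of Shared Experts" (always active)


def route_token(router_logits: List[float], top_k: int = TOP_K) -> List[Tuple[int, float]]:
    """Pick the top_k experts for one token and normalize their gate weights.

    Generic softmax-over-selected-experts routing: an illustrative assumption,
    not the gating function used by Kimi K2.6.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    # Indices of the top_k largest logits, highest first.
    chosen = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only (subtract the max for numerical stability).
    peak = router_logits[chosen[0]]
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    routes = route_token(logits)
    print("routed experts:", [i for i, _ in routes])
    print(f"{TOP_K} routed + {NUM_SHARED} shared expert(s) run for this token")
```

Only the selected experts (plus the shared expert) run their feed-forward computation for a given token, which is why only a small fraction of the 1T total parameters (32B, per the table) is activated per token.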

3. Evaluation Results

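General Testing Details
- We report results for Kimi K2.6 and Kimi K2.5 (with thinking mode enabled), Claude Opus 4.6 (max effort), GPT-5.4 (xhigh reasoning effort), and Gemini 3.1 Pro (high thinking level).
- Unless otherwise specified, all Kimi K2.6 experiments were run with temperature = 1.0, top-p = 1.0, and a context length of 262,144 tokens.
- For benchmarks without publicly available scores, we re-evaluated the models under the same conditions as Kimi K2.6 and marked those results with an asterisk (*). All results without an asterisk are taken from official reports.
Reasoning Benchmarks
- The IMO-AnswerBench scores for GPT-5.4 and Claude 4.6 are taken from z.ai/blog/glm-5.1.
- For Humanity's Last Exam (HLE) and other reasoning tasks, the maximum generation length is 98,304 tokens. By default, we report results on the full HLE set. On the text-only subset, Kimi K2.6 achieves 36.4% accuracy without tools and 55.5% with tools.
Tool-Augmented / Agentic Tasks
- For HLE with tools, BrowseComp, DeepSearchQA, and WideSearch, Kimi K2.6 was equipped with search, code-interpreter, and web-browsing tools.
- For HLE-Full with tools, the maximum generation length is 262,144 tokens, with a per-step limit of 49,152 tokens. We use a simple context management strategy: once the context window exceeds a threshold, only the most recent round of tool-related messages is retained.
- For BrowseComp, we report scores obtained with the same discard-all context management strategy used for Kimi K2.5 and DeepSeek-V3.2.
- For DeepSearchQA, no context management was applied when testing Kimi K2.6; tasks exceeding the supported context length were counted as failures. The DeepSearchQA scores for Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro are cited from the Claude Opus 4.7 System Card.
- For WideSearch, we report results under the "hide tool result" context management setting: once the context window exceeds a threshold, only the most recent round of tool-related messages is retained.
- The system prompt used for testing is identical to the one in the Kimi K2.5 technical report.
- Claw Eval was run with version 1.1 and max-tokens-per-step = 16384.
- For APEX-Agents, we evaluated 452 of the 480 public tasks, consistent with Artificial Analysis (excluding Investment Banking Worlds 244 and 246, which have external runtime dependencies).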
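Several of the notes above describe the same simple rule: once the context exceeds a threshold, only the most recent round of tool-related messages is kept. A minimal sketch of that rule for OpenAI-style message lists follows. The token counter is a hypothetical character-based estimate, the function names are illustrative, and a "round" is assumed to mean the last assistant message that issued tool calls plus the tool results after it.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]


def approx_tokens(messages: List[Message]) -> int:
    """Hypothetical rough estimate (about 4 characters per token); swap in a real tokenizer."""
    return sum(len(str(m.get("content") or "")) for m in messages) // 4


def is_tool_related(message: Message) -> bool:
    # Tool results, or assistant turns that requested tool calls.
    return message["role"] == "tool" or (
        message["role"] == "assistant" and bool(message.get("tool_calls"))
    )


def compact_context(
    messages: List[Message],
    threshold: int,
    count_tokens: Callable[[List[Message]], int] = approx_tokens,
) -> List[Message]:
    """Keep only the most recent tool round once the context exceeds `threshold` tokens."""
    if count_tokens(messages) <= threshold:
        return list(messages)
    # The latest round starts at the last assistant message that issued tool calls.
    last_round = next(
        (i for i in range(len(messages) - 1, -1, -1)
         if messages[i]["role"] == "assistant" and messages[i].get("tool_calls")),
        None,
    )
    if last_round is None:
        return list(messages)
    # Drop older tool-related messages; keep system/user turns and the whole last round.
    return [m for i, m in enumerate(messages) if i >= last_round or not is_tool_related(m)]
```

Dropping an assistant tool-call message together with its tool results keeps every remaining `tool` message paired with the call that produced it.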
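Coding Tasks
- Terminal-Bench 2.0 scores were obtained with the default agent framework (Terminus-2) and the provided JSON parser, running in preserve thinking mode.
- For the SWE-Bench family of evaluations (Verified, Multilingual, and Pro), we used an in-house evaluation framework adapted from SWE-agent. It provides a minimal tool set: bash tool, createfile tool, insert tool, view tool, strreplace tool, and submit tool.
- All reported coding-task scores are averaged over 10 independent runs.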
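The SWE-Bench harness above exposes a minimal tool set. As an illustration of what a `strreplace`-style editing tool typically does in SWE-agent-like setups, here is a hedged sketch: it replaces a snippet only if it occurs exactly once, so the model's edit is unambiguous. The function name and messages are hypothetical; the in-house framework's actual behavior is not described in this card.

```python
from pathlib import Path


def str_replace(path: str, old: str, new: str) -> str:
    """Replace `old` with `new` in the file at `path`, requiring exactly one match.

    Hypothetical sketch of a SWE-agent-style edit tool; returns a short status message
    that would be fed back to the model as the tool result.
    """
    if not old:
        return "Error: empty snippet"
    file = Path(path)
    text = file.read_text(encoding="utf-8")
    count = text.count(old)
    if count == 0:
        return f"Error: snippet not found in {path}"
    if count > 1:
        return f"Error: snippet occurs {count} times in {path}; make it unique"
    file.write_text(text.replace(old, new, 1), encoding="utf-8")
    return f"Edited {path}"
```

Returning an error string instead of raising lets the agent read the failure and retry with a more specific snippet.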
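Vision Benchmarks
- Max-tokens = 98,304; scores are averaged over three runs (avg@3).
- When the Python tool is used, max-tokens-per-step = 65,536 and max-steps = 50 for multi-step reasoning.
- MMMU-Pro follows the official protocol, preserving the input order and prepending images.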
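Several notes report scores averaged over repeated runs (10 runs for coding tasks, avg@3 for vision). The arithmetic is a plain mean over per-run scores; the minimal sketch below also reports the spread across runs. The helper name is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def avg_at_k(run_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run benchmark scores.

    avg@k is the mean over k independent runs; the spread is 0.0 for a single run.
    """
    if not run_scores:
        raise ValueError("need at least one run")
    spread = stdev(run_scores) if len(run_scores) > 1 else 0.0
    return mean(run_scores), spread
```

With made-up per-run scores of 70.0, 72.0, and 74.0, it returns a mean of 72.0 and a spread of 2.0; these numbers are illustrative, not results from this card.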

4. Native INT4 Quantization

Kimi-K2.6 uses the same native INT4 quantization method as Kimi-K2-Thinking.
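This card only names the method. For readers unfamiliar with weight-only INT4 quantization, the sketch below shows the generic idea: weights are split into small groups, each group gets one floating-point scale, and every weight is stored as a 4-bit signed integer. This is a textbook symmetric group-wise scheme, not Moonshot's recipe; the default group size of 32 and the function names are assumptions.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7


def quantize_int4(weights: List[float], group_size: int = 32) -> Tuple[List[int], List[float]]:
    """Symmetric group-wise INT4 quantization: one scale per group of weights."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7 (symmetric range [-7, 7]).
        scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4(codes: List[int], scales: List[float], group_size: int = 32) -> List[float]:
    """Recover approximate weights: each code times the scale of its group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

Each reconstructed weight lies within half a scale step of the original, and storage drops to 4 bits per weight plus one scale per group.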

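5. Deployment

[!Note] You can access the Kimi-K2.6 API at https://platform.moonshot.ai, where we provide OpenAI- and Anthropic-compatible APIs. To help verify that a deployment is correct, we also provide the Kimi Vendor Verifier. We currently recommend running Kimi-K2.6 on the following inference engines: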

vLLM
SGLang
KTransformers

Kimi-K2.6 shares the same architecture as Kimi-K2.5, so existing Kimi-K2.5 deployment methods can be reused directly.

The required transformers version is >=4.57.1, <5.0.0.
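A quick way to check the installed transformers version against this range is sketched below. It compares only the numeric release segment (so a pre-release such as 4.57.1rc1 is treated as 4.57.1); the `packaging` library's `SpecifierSet` handles the full PEP 440 rules if you need them. The function names are hypothetical.

```python
import re
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Return the numeric release segment, padded to 3 parts: '4.57' -> (4, 57, 0)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group(1).split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported_transformers_version(version: str) -> bool:
    # Range stated in the model card: >=4.57.1, <5.0.0
    return (4, 57, 1) <= parse_release(version) < (5, 0, 0)


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if is_supported_transformers_version(installed) else "unsupported"
        print(f"transformers {installed}: {status}")
```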

Deployment examples can be found in the Model Deployment Guide.

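6. Model Usage

The usage examples below demonstrate how to call our official API.

For third-party APIs deployed with vLLM or SGLang, please note:

[!Note]

Chat with video content is an experimental feature and is currently supported only in our official API.

The recommended temperature is 1.0 in Thinking mode and 0.6 in Instant mode.

The recommended top_p is 0.95.

To use Instant mode, pass {'chat_template_kwargs': {"thinking": False}} in extra_body.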
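The recommendations above can be bundled into keyword arguments for `client.chat.completions.create()`. In this hypothetical helper, the `backend` labels are illustrative; the official-API switch (`{'thinking': {'type': 'disabled'}}`) is the one used in the chat completion examples in this section, and leaving sampling parameters at server defaults for the official API is an assumption.

```python
from typing import Any, Dict

SELF_HOSTED = ("vllm", "sglang")


def build_request_kwargs(instant: bool, backend: str = "vllm") -> Dict[str, Any]:
    """Hypothetical helper: extra kwargs for client.chat.completions.create().

    Self-hosted (vLLM/SGLang) deployments get the recommended sampling values;
    the official API only receives the thinking switch (an assumption).
    """
    if backend in SELF_HOSTED:
        kwargs: Dict[str, Any] = {"temperature": 0.6 if instant else 1.0, "top_p": 0.95}
        if instant:
            # Instant mode is toggled through the chat template on vLLM/SGLang.
            kwargs["extra_body"] = {"chat_template_kwargs": {"thinking": False}}
        return kwargs
    if backend == "official":
        return {"extra_body": {"thinking": {"type": "disabled"}}} if instant else {}
    raise ValueError(f"unknown backend: {backend!r}")
```

Typical use: `client.chat.completions.create(model=model_name, messages=messages, **build_request_kwargs(instant=True))`.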

Chat Completion

This is a simple chat completion script that shows how to call the K2.6 API in both Thinking mode and Instant mode.

import openai


def simple_chat(client: openai.OpenAI, model_name: str):
    messages = [
        {'role': 'system', 'content': 'You are Kimi, an AI assistant created by Moonshot AI.'},
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'which one is bigger, 9.11 or 9.9? think carefully.'}
            ],
        },
    ]
    response = client.chat.completions.create(
        model=model_name, messages=messages, stream=False, max_tokens=4096
    )
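    message = response.choices[0].message
    # Depending on the server, the reasoning text is exposed as 'reasoning_content' or 'reasoning'.
    reasoning = getattr(message, 'reasoning_content', None) or getattr(message, 'reasoning', None)
    print('====== Below is reasoning content in Thinking Mode ======')
    print(f'reasoning content: {reasoning}')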
    print('====== Below is response in Thinking Mode ======')
    print(f'response: {response.choices[0].message.content}')

    # To use Instant mode on the official API, pass extra_body={'thinking': {'type': 'disabled'}}
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        max_tokens=4096,
        extra_body={'thinking': {'type': 'disabled'}},  # this is for official API
        # extra_body= {'chat_template_kwargs': {"thinking": False}}  # this is for vLLM/SGLang
    )
    print('====== Below is response in Instant Mode ======')
    print(f'response: {response.choices[0].message.content}')
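The functions in this section take a ready-made `openai.OpenAI` client. One way to build it for either the official API or a local vLLM/SGLang server is sketched below. The base URLs, default model names, environment variable names, and helper names are all assumptions; check the platform documentation or your server's `--served-model-name` before relying on them.

```python
import os
from typing import Mapping, Tuple

# Hypothetical defaults: these base URLs and model names are assumptions, not values
# from the model card. Override them with KIMI_BASE_URL / KIMI_MODEL.
DEFAULT_ENDPOINTS = {
    "official": ("https://api.moonshot.ai/v1", "kimi-k2.6"),
    "local": ("http://localhost:8000/v1", "moonshotai/Kimi-K2.6"),
}


def resolve_endpoint(backend: str, env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (base_url, model_name), letting environment variables override the defaults."""
    base_url, model_name = DEFAULT_ENDPOINTS[backend]
    return env.get("KIMI_BASE_URL", base_url), env.get("KIMI_MODEL", model_name)


def make_client(backend: str = "local"):
    """Build an OpenAI-compatible client; the openai import is deferred to call time."""
    import openai

    base_url, model_name = resolve_endpoint(backend)
    # Local vLLM/SGLang servers accept any key unless one is configured on the server.
    client = openai.OpenAI(base_url=base_url, api_key=os.environ.get("KIMI_API_KEY", "EMPTY"))
    return client, model_name
```

Usage: `client, model_name = make_client("local")`, then `simple_chat(client, model_name)`.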

Chat Completion with Visual Content

K2.6 supports image and video input.

The following example shows how to call the K2.6 API with an image input:

import openai
import base64
import requests

def chat_with_image(client: openai.OpenAI, model_name: str):
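    url = 'https://huggingface.co/moonshotai/Kimi-K2.6/resolve/main/figures/kimi-logo.png'
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()  # fail early instead of sending an error page as the image
    image_base64 = base64.b64encode(resp.content).decode()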
    messages = [
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'Describe this image in detail.'},
                {
                    'type': 'image_url',
                    'image_url': {'url': f'data:image/png;base64,{image_base64}'},
                },
            ],
        }
    ]

    response = client.chat.completions.create(
        model=model_name, messages=messages, stream=False, max_tokens=8192
    )
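    message = response.choices[0].message
    # Depending on the server, the reasoning text is exposed as 'reasoning_content' or 'reasoning'.
    reasoning = getattr(message, 'reasoning_content', None) or getattr(message, 'reasoning', None)
    print('====== Below is reasoning content in Thinking Mode ======')
    print(f'reasoning content: {reasoning}')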
    print('====== Below is response in Thinking Mode ======')
    print(f'response: {response.choices[0].message.content}')

    # Instant mode is also supported: pass extra_body={'thinking': {'type': 'disabled'}} on the official API
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        max_tokens=4096,
        extra_body={'thinking': {'type': 'disabled'}},  # this is for official API
        # extra_body= {'chat_template_kwargs': {"thinking": False}}  # this is for vLLM/SGLang
    )
    print('====== Below is response in Instant Mode ======')
    print(f'response: {response.choices[0].message.content}')

    return response.choices[0].message.content
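The example above downloads a file and embeds it as a base64 data URL. For local files, a small helper can infer the MIME type and build the same kind of content part. The helper names are hypothetical, and `mimetypes` guesses the type from the file extension only. Note that a data URL must not contain a space after `base64,`.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Any, Dict


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw bytes as a data URL, guessing the MIME type from `filename`."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith(("image/", "video/")):
        raise ValueError(f"cannot infer an image or video MIME type for {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"  # no space after the comma


def media_part(path: str) -> Dict[str, Any]:
    """Build an 'image_url' or 'video_url' content part from a local file."""
    url = to_data_url(Path(path).read_bytes(), path)
    kind = "image_url" if url.startswith("data:image/") else "video_url"
    return {"type": kind, kind: {"url": url}}
```

For example, `{'role': 'user', 'content': [{'type': 'text', 'text': 'Describe this image.'}, media_part('photo.jpg')]}`. Video input is currently supported only on the official API.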

The following example shows how to call the K2.6 API with a video input:

import openai
import base64
import requests

def chat_with_video(client: openai.OpenAI, model_name: str):
    url = 'https://huggingface.co/moonshotai/Kimi-K2.6/resolve/main/figures/demo_video.mp4'
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()
    video_base64 = base64.b64encode(resp.content).decode()
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text","text": "Describe the video in detail."},
                {
                    "type": "video_url",
                    "video_url": {"url": f"data:video/mp4;base64,{video_base64}"},
                },
            ],
        }
    ]

    response = client.chat.completions.create(model=model_name, messages=messages)
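    message = response.choices[0].message
    # Depending on the server, the reasoning text is exposed as 'reasoning_content' or 'reasoning'.
    reasoning = getattr(message, 'reasoning_content', None) or getattr(message, 'reasoning', None)
    print('====== Below is reasoning content in Thinking Mode ======')
    print(f'reasoning content: {reasoning}')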
    print('====== Below is response in Thinking Mode ======')
    print(f'response: {response.choices[0].message.content}')

    # Instant mode is also supported: pass extra_body={'thinking': {'type': 'disabled'}} on the official API
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        max_tokens=4096,
        extra_body={'thinking': {'type': 'disabled'}},  # this is for official API
        # extra_body= {'chat_template_kwargs': {"thinking": False}}  # this is for vLLM/SGLang
    )
    print('====== Below is response in Instant Mode ======')
    print(f'response: {response.choices[0].message.content}')
    return response.choices[0].message.content

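Preserve Thinking

Kimi K2.6 supports a preserve_thinking mode, which retains the full reasoning content across multi-turn interactions and improves performance in coding agent scenarios.

This feature is disabled by default. The following example shows how to call the K2.6 API with preserve_thinking enabled: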

import openai


def chat_with_preserve_thinking(client: openai.OpenAI, model_name: str):
    messages = [
        {
            "role": "user",
            "content": "Tell me three random numbers."
        },
        {
            "role": "assistant",
            "reasoning_content": "I'll start by listing five numbers: 473, 921, 235, 215, 222, and I'll tell you the first three.",
            "content": "473, 921, 235"
        },
        {
            "role": "user",
            "content": "What are the other two numbers you have in mind?"
        }
    ]

    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        max_tokens=4096,
        extra_body={'thinking': {'type': 'enabled', 'keep': 'all'}},  # this is for official API
        # extra_body={"chat_template_kwargs": {"thinking":True, "preserve_thinking": True}},  # this is for vLLM/SGLang
        # We recommend enabling preserve_thinking only in Thinking mode.
    )
    # The reply should mention 215 and 222, which appear only in the prior reasoning content.
    print(f"response: {response.choices[0].message.content}")
    return response.choices[0].message.content
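When you build multi-turn histories yourself, it helps to store each assistant turn's reasoning alongside its content so it can be sent back when preserve_thinking is enabled. The sketch below is a hypothetical client-side helper: it keeps `reasoning_content` on earlier assistant turns only when preserving and strips it otherwise. This card does not say whether the server already ignores past reasoning when the mode is off, so treat the stripping as a defensive assumption.

```python
import copy
from typing import Any, Dict, List, Optional

Message = Dict[str, Any]


def add_assistant_turn(history: List[Message], content: str,
                       reasoning: Optional[str] = None) -> None:
    """Append an assistant turn, keeping its reasoning text if there is any."""
    turn: Message = {"role": "assistant", "content": content}
    if reasoning:
        turn["reasoning_content"] = reasoning
    history.append(turn)


def prepare_history(history: List[Message], preserve_thinking: bool) -> List[Message]:
    """Return a copy of `history` to send; drop past reasoning unless preserving it."""
    prepared = copy.deepcopy(history)
    if not preserve_thinking:
        for message in prepared:
            if message["role"] == "assistant":
                message.pop("reasoning_content", None)
    return prepared
```

Working on a deep copy leaves the stored history intact, so the same conversation can be replayed with the mode switched on or off.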

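Interleaved Thinking and Multi-Step Tool Calling

K2.6 shares the same interleaved thinking and multi-step tool calling design as K2 Thinking. For usage examples, please refer to the K2 Thinking documentation.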
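The K2 Thinking documentation has the authoritative examples. As a rough outline of the pattern, a multi-step tool-calling loop keeps calling the model, runs whatever tools it requests, appends the results, and stops when the model no longer asks for tools. The sketch below assumes an OpenAI-compatible client and a caller-supplied `registry` of Python functions; the function name, the `max_steps` cap, and the message handling are assumptions rather than the documented API.

```python
import json
from typing import Any, Callable, Dict, List


def run_tool_loop(client: Any, model_name: str, messages: List[Any], tools: List[Dict[str, Any]],
                  registry: Dict[str, Callable[..., Any]], max_steps: int = 20) -> str:
    """Hypothetical multi-step tool loop for an OpenAI-compatible chat API.

    Calls the model, executes requested tools from `registry`, feeds the results back,
    and returns the final answer once the model stops asking for tools.
    """
    for _ in range(max_steps):
        response = client.chat.completions.create(model=model_name, messages=messages, tools=tools)
        choice = response.choices[0]
        # Keep the assistant turn exactly as returned (including any reasoning field)
        # so the next step can continue from it.
        messages.append(choice.message)
        if choice.finish_reason != "tool_calls":
            return choice.message.content
        for call in choice.message.tool_calls:
            arguments = json.loads(call.function.arguments or "{}")
            result = registry[call.function.name](**arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    raise RuntimeError(f"no final answer within {max_steps} steps")
```

Capping the number of steps keeps a misbehaving tool or prompt from looping indefinitely.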

Coding Agent Framework

Kimi K2.6 works best with Kimi Code CLI as its agent framework; try it at https://www.kimi.com/code.

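7. License

Both the code repository and the model weights are released under the Modified MIT License.

8. Third-Party Notices

Please see the Third-Party Notices.

9. Contact Us

If you have any questions, please contact us at support@moonshot.ai.

Translated from Kimi · HF · Moonshot · Recorded June 6, 2026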