aws-ml

Overcoming reward signal challenges: Verifiable rewards-based reinforcement learning with GRPO on SageMaker AI

May 8, 2026

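Summary

This post describes how AWS uses Amazon SageMaker Training Jobs to fine-tune Qwen2.5-0.5B on GSM8K, combining RLVR, GRPO, 8-shot examples, and QLoRA with two reward functions, one for format and one for correctness. On 100 test samples, accuracy rises from 11% for the base model to 41%. The authors are Surya Kari, Giuseppe Zappia, and Amin Dashti.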

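Training large language models requires accurate feedback signals, but traditional reinforcement learning (RL) often struggles to guarantee a reliable reward signal. The quality of these signals directly affects how a model learns and makes decisions. Building a robust feedback mechanism, however, can be complex and error-prone. Real-world training scenarios often introduce hidden biases, unintended incentives, and ambiguous success criteria, all of which can derail learning and lead to models that behave unpredictably or miss their intended goals. In this post, you learn how to implement reinforcement learning with verifiable rewards (RLVR), which brings verification and transparency to the reward signal and thereby improves training performance. The approach works best when an output can be objectively checked for correctness, as in mathematical reasoning, code generation, or symbolic manipulation tasks. You also learn how to layer on techniques such as Group Relative Policy Optimization (GRPO) and few-shot examples to improve results further. This post uses the GSM8K dataset (Grade School Math 8K, a collection of grade school math problems) to improve accuracy on math problem solving, but the techniques can be adapted to many other use cases.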

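Technical overview

Before diving into the implementation, it helps to understand the RL concepts behind the approach. RL addresses model-training challenges by building a structured feedback system around a reward signal. Under this paradigm, a model learns through interaction: it receives feedback and gradually moves toward optimal behavior. RL gives the model a framework for iteratively improving its responses based on well-defined signals about output quality, which makes it a good fit for training models that interact with users and adapt their behavior to outcomes.

Traditional RL highlights one important factor: the quality of the reward signal is critical. When a reward function is imprecise or incomplete, the model can engage in "reward hacking", finding unintended ways to maximize its score without achieving the desired behavior. Recognizing this limitation, researchers began developing more rigorous methods that focus on creating reliable, well-defined reward functions.

RLVR addresses reward hacking through rule-based feedback defined by the model tuner. It uses programmatic reward functions that automatically score outputs against specific criteria, which removes the bottleneck of collecting human ratings and supports rapid iteration. Because these "verifiable" rewards come from objective, reproducible rules, RLVR suits scenarios with changing requirements: it learns general optimization strategies and can adapt quickly to new scenarios.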

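GRPO is a reinforcement learning algorithm that improves how a model learns by comparing performance within groups rather than across all the data at once. In practice, a group is the set of candidate responses sampled for the same prompt: each response is scored, and the policy is optimized relative to that group's baseline (its average reward), so no separate value model is needed to estimate the baseline. This group-aware optimization can reduce training variance, speed up convergence, and yield more consistent performance across problem types.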

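Combining RLVR with GRPO gives you a framework in which automated rewards guide learning while group-relative optimization helps drive balanced performance. You can define reward functions for different aspects of a task; during training, each candidate response is scored by every reward function and the scores are combined before responses are compared within their group, which promotes simultaneous improvement along multiple dimensions. This combination can deliver rapid adaptation and robust performance, and it suits dynamic environments that require generalization beyond the training distribution.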

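Adding few-shot learning strengthens this framework in three ways. First, few-shot examples act as templates that show the model what a good output looks like, which narrows the search space for exploration. Second, GRPO builds on these examples: it generates multiple candidate responses for each prompt and learns from their relative performance within each group. Third, verifiable rewards immediately confirm which approaches work. Together these accelerate learning: the model starts from concrete examples of the desired format, explores variations efficiently through group-based comparison, and receives unambiguous feedback on correctness.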
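To make the group-relative comparison concrete, the following is a minimal sketch of how the rewards for one prompt's candidate responses can be turned into advantages by normalizing against the group's mean and standard deviation. The function name `group_relative_advantages`, the `eps` value, and the sample rewards are illustrative assumptions, not code from the post's training script.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of completions sampled for ONE prompt.

    The group mean acts as the baseline; eps (an assumed value) avoids
    division by zero when every completion receives the same reward.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four completions of the same math problem
group = [1.5, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(group)

for reward, advantage in zip(group, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Completions that score above the group average get a positive advantage and are reinforced; those below it get a negative advantage. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.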

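Solution overview

In this section, you learn how to use Amazon SageMaker Training Jobs to fine-tune a Qwen2.5-0.5B model on SageMaker AI. Amazon SageMaker Training Jobs support distributed multi-GPU and multi-node configurations, so you can spin up high-performance clusters on demand, train billion-parameter models faster, and have resources shut down automatically when the job finishes.

Note: This post uses Qwen2.5-0.5B for its use case, but other use cases such as code generation call for a larger model (for example, Qwen2.5-Coder-7B) and correspondingly larger training instances.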

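Prerequisites

To run the example in this post on Amazon SageMaker AI, you must have the following prerequisites:

An AWS account that will contain your AWS resources.

An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, see AWS Identity and Access Management for Amazon SageMaker AI.

You can run the notebook provided in this post from your preferred development environment, including integrated development environments (IDEs) such as PyCharm or Visual Studio Code, provided your AWS credentials are properly set up and configured to access your AWS account. To set up your local environment, see Configuring settings for the AWS CLI.

Optionally, you can use Amazon SageMaker Studio for a more straightforward development process on SageMaker AI.

If you follow along with this post, you need an ml.p4d.24xlarge instance for training. You need access to these SageMaker training instances to run the example training code. If you're unsure, review your AWS service quotas on the AWS Management Console:

Choose Amazon SageMaker as the AWS service under Manage Quotas.

Select ml.p4d.24xlarge for training job usage and request a quota increase at the account level.

Access to the GitHub repo: https://github.com/aws-samples/amazon-sagemaker-generativeai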

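Environment setup

You can use your preferred IDE, such as VS Code or PyCharm, but make sure your local environment is configured to work with AWS as described in the prerequisites.

To use SageMaker Studio JupyterLab spaces, complete the following steps:

On the Amazon SageMaker AI console, choose Domains in the navigation pane, then open your domain.

In the navigation pane, under Applications and IDEs, choose Studio.

On the User profiles tab, locate your user profile, then choose Launch and Studio.

In Amazon SageMaker Studio, launch an ml.t3.medium JupyterLab notebook instance with at least 50 GB of storage. A large notebook instance isn't required, because the fine-tuning job runs on a separate, ephemeral GPU-accelerated training instance.

To begin fine-tuning, clone the GitHub repo and navigate to the 3_distributed_training/reinforcement-learning/grpo-with-verifiable-reward directory, then launch the model-finetuning-grpo-rlvr.ipynb notebook with a Python 3.12 or later kernel.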

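Prepare the dataset for fine-tuning

Running GRPO with RLVR requires the final answer to each question so that the reward can be computed. Start by preparing the data, extracting the final answer for each question.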

dataset = GSM8K(split='train', include_answer=False, include_reasoning=True, few_shot=True, num_shots=8, seed=None, cot=True).dataset.shuffle(seed=42)

Dataset({ features: ['question', 'answer', 'prompt', 'final_answer'], num_rows: 7473 })

In addition, this example uses few-shot examples (8 shots) to improve training performance. For more on few-shot examples in reinforcement learning, see the paper "Reinforcement Learning for Reasoning in Large Language Models with One Training Example". Although that paper focuses on single-shot examples, this post compares performance across several shot counts (0, 2, 4, and 8).

Each input contains 8 examples followed by the question to solve:

"Question: Mark has $50 and buys a toy that costs $35. How much money does he have left?
Solution: Let's think step by step. To find out how much money Mark has left, subtract the cost of the toy from the total amount of money Mark has. So, $50 - $35 = $15.
#### The final answer is 15
Question: Emily has 3 times as many pencils as Alice. If Alice has 15 pencils, how many pencils does Emily have?
Solution: Let's think step by step. To find out how many pencils Emily has, we multiply the number of pencils Alice has by 3. Alice has 15 pencils, so Emily has 15 * 3 = 45 pencils.
#### The final answer is 45
Question: Jack has collected 12 more marbles than Kevin. If Kevin has 27 marbles, how many marbles does Jack have?
Solution: Let's think step by step. To find how many marbles Jack has, we add 12 to the number of marbles Kevin has. So, Jack has 27 + 12 = 39 marbles.
#### The final answer is 39
Question: There are 24 students in a classroom. If each group must have 4 students, how many groups can be formed?
Solution: Let's think step by step. To find how many groups can be formed, we divide the number of students by the number of students per group. So, 24 / 4 = 6 groups can be formed.
#### The final answer is 6
Question: Samantha baked 40 cookies and wants to divide them equally into bags, with each bag containing 5 cookies. How many bags will Samantha need?
Solution: Let's think step by step. To find the number of bags needed, divide the total number of cookies by the number of cookies per bag. Thus, 40 divided by 5 equals 8.
#### The final answer is 8
Question: A pack of pencils costs $4. If you buy 7 packs, how much will you spend in total?
Solution: Let's think step by step. The total cost is found by multiplying the cost per pack by the number of packs. Hence, you spend 7 * $4 = $28.
#### The final answer is 28
Question: A book has 240 pages, and Sarah reads 20 pages each day. How many days will it take her to finish the book?
Solution: Let's think step by step. Sarah reads 20 pages per day, so we divide the total pages by the number of pages she reads per day. Therefore, it takes her 240 / 20 = 12 days to finish the book.
#### The final answer is 12
Question: A farmer has a total of 80 apples and oranges. If he has 30 apples, how many oranges does he have?
Solution: Let's think step by step. To determine the number of oranges, we subtract the number of apples from the total number of fruits. So, the number of oranges is 80 - 30 = 50.
#### The final answer is 50
Question: Mimi picked up 2 dozen seashells on the beach. Kyle found twice as many shells as Mimi and put them in his pocket. Leigh grabbed one-third of the shells that Kyle found. How many seashells did Leigh have?
Solution: Let's think step by step."

After preparing the data, hold out 10% as a validation set and push both the training and validation sets to Amazon S3.
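The 90/10 hold-out can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: `split_train_val`, the toy records, and the seed are hypothetical, and the S3 upload step is omitted.

```python
import json
import random


def split_train_val(records, val_fraction=0.1, seed=42):
    """Shuffle a copy of the records and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Toy stand-in for the prepared GSM8K rows (hypothetical content)
records = [{"prompt": f"Question {i}", "final_answer": str(i)} for i in range(100)]
train_records, val_records = split_train_val(records)

# Serialize each split, for example to a dataset.json file per channel
train_json = json.dumps(train_records)
val_json = json.dumps(val_records)
print(len(train_records), len(val_records))
```

Fixing the seed keeps the split reproducible across runs, so validation metrics from different training jobs stay comparable.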

Verifiable Reward Function

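This GRPO implementation for mathematical reasoning uses a dual-reward system that provides objective, verifiable feedback during training. It takes advantage of the inherently verifiable nature of math problems to create a reliable training signal without human annotation or subjective evaluation. You implement two complementary reward functions that together steer the model toward both the correct response format and mathematically accurate results: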

Format Reward Function

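This function helps verify that the model has learned to structure its responses correctly:

Pattern Matching: searches for the specific format #### The final answer is [number]

Consistent Scoring: 0.5 points for a correct format, 0.0 points for an incorrect one

Training Signal: encourages the model to follow the expected answer structure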

# Format reward function
import re

def format_reward_func_qa(completions, **kwargs):
    # A newline must precede the "####" answer marker
    pattern = r"\n#### The final answer is \d+"
    completion_contents = [completion for completion in completions]
    matches = [re.search(pattern, content) for content in completion_contents]
    # 0.5 for a correctly formatted completion, 0.0 otherwise
    return [0.5 if match else 0.0 for match in matches]
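As a quick illustration of what the format pattern accepts and rejects, the sketch below applies the same regular expression to a few hand-written completions. The helper name `format_reward` and the sample strings are assumptions made for this example.

```python
import re

FORMAT_PATTERN = r"\n#### The final answer is \d+"


def format_reward(completion):
    """Return 0.5 when the completion contains the expected answer line."""
    return 0.5 if re.search(FORMAT_PATTERN, completion) else 0.0


samples = [
    "So, 50 - 35 = 15.\n#### The final answer is 15",  # expected format
    "So, 50 - 35 = 15. The answer is 15",              # marker missing
    "#### The final answer is 15",                     # no newline before marker
    "Total is 12.5.\n#### The final answer is 12.5",   # matches on the digits "12"
]
scores = [format_reward(sample) for sample in samples]
print(scores)
```

Note that the pattern only checks for the marker line followed by digits. It does not check that the number is right, which is left to the correctness reward.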

Correctness Reward Function

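This function provides the core mathematical verification:

Answer Extraction: uses a regex to pull the numeric answer out of the formatted response

Normalization: removes common formatting characters (commas, currency symbols, units)

Precision Comparison: uses a tolerance of 1e-3 to handle floating-point precision

Binary Scoring: 1.0 points for a correct answer, 0.0 points for an incorrect one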

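# Correctness reward function
import re

def correctness_reward_func_qa(completions, final_answer, **kwargs):
    rewards = []
    for completion, ground_truth in zip(completions, final_answer):
        try:
            # Capture the first number that follows the "####" marker
            match = re.search(r'####.*?([\d,]+(?:\.\d+)?)', completion)
            if match:
                answer = match.group(1)
                # Strip common formatting characters before converting to float
                for remove_char in [',', '$', '%', 'g']:
                    answer = answer.replace(remove_char, '')
                # A tolerance of 1e-3 absorbs floating-point error
                if abs(float(answer) - float(ground_truth)) < 1e-3:
                    rewards.append(1.0)
                else:
                    rewards.append(0.0)
            else:
                rewards.append(0.0)
        except ValueError:
            # Answers that cannot be parsed as numbers score zero
            rewards.append(0.0)
    return rewards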
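The extraction, normalization, and tolerance steps listed above can be exercised on a few inputs with the sketch below. It assumes the pattern is meant to skip any text between the `####` marker and the number (`.*?`). The helpers `extract_number` and `is_correct` are hypothetical names introduced for this example.

```python
import re


def extract_number(completion):
    """Return the first number after '####' as a float, or None if absent."""
    match = re.search(r"####.*?([\d,]+(?:\.\d+)?)", completion)
    if not match:
        return None
    try:
        # Remove thousands separators before parsing
        return float(match.group(1).replace(",", ""))
    except ValueError:
        return None


def is_correct(completion, ground_truth, tol=1e-3):
    """Compare the extracted number with the ground truth within a tolerance."""
    value = extract_number(completion)
    return value is not None and abs(value - float(ground_truth)) < tol


checks = [
    is_correct("...\n#### The final answer is 1,234", "1234"),  # comma removed
    is_correct("...\n#### The final answer is 12.50", "12.5"),  # equal within tolerance
    is_correct("...\n#### The final answer is 7", "8"),         # wrong number
    is_correct("The answer is 7", "7"),                         # marker missing
    is_correct("\n#### So, the answer is 5", "5"),              # stray comma captured first
]
print(checks)
```

The last case shows a limitation of this kind of pattern: a comma in the text after the marker is captured before the number, so the answer fails to parse and scores zero. Keeping the answer line in the fixed `#### The final answer is N` form avoids this.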

Integrating RLVR with GRPO

These reward functions are integrated into the GRPO training pipeline through GRPOTrainer:

rewards_funcs = [format_reward_func_qa, correctness_reward_func_qa]

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
    reward_funcs=rewards_funcs,
)

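During training, GRPO uses these reward functions to compute policy gradients. First, the model generates multiple completions for each math problem. Next, each response is scored by both reward functions: the format reward function awards up to 0.5 points for a correct response structure, and the correctness reward function awards up to 1.0 points for a mathematically accurate answer, so the maximum combined reward per completion is 1.5. GRPO then compares the completions within their group to identify the best responses. Finally, in the policy update step, the loss function uses the reward differences to update the model parameters: completions with higher rewards become more probable and completions with lower rewards become less probable. This relative ranking drives the optimization.

The following example shows how to fine-tune Qwen2.5-0.5B. The recipe is in the scripts folder, and you can customize it or change the base model. The example uses GRPO with verifiable rewards together with Quantized Low-Rank Adaptation (QLoRA). QLoRA is used here to reduce training resource requirements and speed up training, at the cost of a small accuracy trade-off.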
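The scoring arithmetic can be illustrated with a short sketch. It assumes the two reward values are simply added without weights, which is consistent with the stated maximum of 1.5. The function name `combined_rewards` and the sample scores are made up for this example.

```python
def combined_rewards(format_scores, correctness_scores):
    """Add the per-completion scores from the two reward functions."""
    return [f + c for f, c in zip(format_scores, correctness_scores)]


# Hypothetical scores for four completions of one prompt
format_scores = [0.5, 0.5, 0.0, 0.5]
correctness_scores = [1.0, 0.0, 1.0, 0.0]

totals = combined_rewards(format_scores, correctness_scores)
group_mean = sum(totals) / len(totals)
best_index = max(range(len(totals)), key=totals.__getitem__)

print(totals, group_mean, best_index)
```

Here the first completion is both well formatted and correct, so it earns the full 1.5 and sits furthest above the group mean. The third completion is correct but badly formatted, so it still ranks above the two well-formatted wrong answers.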

# Model arguments
model_name_or_path: Qwen/Qwen2.5-0.5B
tokenizer_name_or_path: Qwen/Qwen2.5-0.5B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true
output_dir: /opt/ml/model/Qwen2.5-0.5B-RL-VR-GRPO

# Dataset arguments
train_dataset_id_or_path: /opt/ml/input/data/train/dataset.json
test_dataset_id_or_path: /opt/ml/input/data/val/dataset.json
dataset_splits: 'train'
max_seq_length: 2048
packing: true

# LoRA arguments
use_peft: true
load_in_4bit: true
lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"]
lora_modules_to_save: ["lm_head", "embed_tokens"]
lora_r: 16
lora_alpha: 16

# Training arguments
num_train_epochs: 2
per_device_train_batch_size: 16
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: True
learning_rate: 1.84e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1

# Logging arguments
logging_strategy: steps
logging_steps: 5
report_to:
- mlflow
save_strategy: "no"
seed: 42

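Recipe overview

This recipe implements Group Relative Policy Optimization (GRPO) with verifiable rewards to fine-tune the Qwen2.5-0.5B model on mathematical reasoning tasks. It uses a dual-reward system that objectively evaluates both answer format and mathematical correctness without human annotation.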

Important hyperparameters:

learning_rate: 1.84e-4 – learning rate tuned for GRPO training

num_train_epochs: 2 – number of training epochs, kept low to avoid overfitting

per_device_train_batch_size: 16 with gradient_accumulation_steps: 2 – effective batch size of 32 per device

max_seq_length: 2048 – context window for 8-shot prompting

lora_r: 16 and lora_alpha: 16 – LoRA rank and scaling parameters

warmup_ratio: 0.1 with a cosine scheduler – learning rate scheduling

lora_target_modules – targets the attention and MLP layers for adaptation
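The arithmetic behind a few of these settings is sketched below. The GPU count of 8 and the total step count passed to `warmup_steps` are assumptions for illustration. The LoRA scaling shown is the standard alpha / r factor, and the warmup calculation is a rough approximation of what a trainer derives from `warmup_ratio`.

```python
# Values taken from the recipe
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
lora_r = 16
lora_alpha = 16
warmup_ratio = 0.1

# Samples contributing to one optimizer step on a single device
per_device_effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Assumption: one training process per GPU on an 8-GPU instance
num_gpus = 8
global_effective_batch = per_device_effective_batch * num_gpus

# Standard LoRA scaling applied to the low-rank update
lora_scaling = lora_alpha / lora_r


def warmup_steps(total_steps, ratio=warmup_ratio):
    """Approximate number of warmup steps before the cosine decay starts."""
    return int(total_steps * ratio)


print(per_device_effective_batch, global_effective_batch, lora_scaling, warmup_steps(500))
```

With lora_alpha equal to lora_r, the scaling factor is 1.0, so the adapter update is applied at its learned magnitude without extra amplification or damping.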

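Next, you use a SageMaker AI training job to launch the training cluster and run the fine-tuning through the SageMaker AI ModelTrainer. ModelTrainer runs training jobs on fully managed infrastructure and takes care of environment setup, scaling, and artifact management. It also lets you specify training scripts, input data, and compute resources without manually provisioning servers. Library dependencies can be managed through the requirements.txt file in the scripts folder; ModelTrainer detects the file automatically and installs the listed dependencies at runtime.

First, set up your environment. Here you specify the instance type, the number of instances, and the location of the training container.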

from sagemaker.core import image_uris
from sagemaker.core.helper.session_helper import Session

sagemaker_session = Session()
bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
configs = load_sagemaker_config()

instance_type = "ml.g6.48xlarge"
instance_count = 1
config_filename = "Qwen2.5-0.5B.yaml"

image_uri = image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.7.1",
    instance_type=instance_type,
    image_scope="training"
)

Next, configure the environment variables, code locations, and data paths:

import os

from sagemaker.train.configs import (
    CheckpointConfig,
    Compute,
    OutputDataConfig,
    SourceCode,
    StoppingCondition,
)
from sagemaker.train.distributed import Torchrun
from sagemaker.train.model_trainer import ModelTrainer

env = {}

# Networking and NCCL settings for distributed training
env["FI_PROVIDER"] = "efa"
env["NCCL_PROTO"] = "simple"
env["NCCL_SOCKET_IFNAME"] = "eth0"
env["NCCL_IB_DISABLE"] = "1"
env["NCCL_DEBUG"] = "WARN"

# Hugging Face token and recipe location
env["HF_token"] = os.environ['hf_token']
env["CONFIG_PATH"] = f"recipes/{config_filename}"

# MLflow experiment tracking
env["MLFLOW_EXPERIMENT_NAME"] = "grpo-rlvr"
env["MLFLOW_TAGS"] = '{"source.job": "sm-training-jobs", "source.type": "grpo-rlvr", "source.framework": "pytorch"}'
env["MLFLOW_TRACKING_URI"] = MLFLOW_TRACKING_SERVER_ARN

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="run_finetuning.sh",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=3600,
)

# Define the training job name
job_name = f"train-{config_filename.split('/')[-1].replace('.', '-').replace('yaml', 'rlvr')}"

# Define the OutputDataConfig path
output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    environment=env,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    checkpoint_config=CheckpointConfig(
        s3_uri=output_path + "/checkpoint", local_path="/opt/ml/checkpoints"
    ),
)
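To see what the job name expression produces, the following standalone sketch evaluates it for the recipe file name used in this post. The bucket name `example-bucket` is a hypothetical placeholder; in the real code it comes from the SageMaker session.

```python
config_filename = "Qwen2.5-0.5B.yaml"
bucket_name = "example-bucket"  # hypothetical placeholder

# Dots are replaced with hyphens, then the "yaml" suffix becomes "rlvr"
job_name = f"train-{config_filename.split('/')[-1].replace('.', '-').replace('yaml', 'rlvr')}"
output_path = f"s3://{bucket_name}/{job_name}"
checkpoint_uri = output_path + "/checkpoint"

print(job_name)
print(output_path)
print(checkpoint_uri)
```

Replacing the dots matters because the resulting name is used as the base job name, and the model name and file extension both contain periods.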

Set up the channels for the training and validation data:

from sagemaker.train.configs import InputData

# Pass the input data
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,  # S3 path where training data is stored
)

val_input = InputData(
    channel_name="val",
    data_source=val_dataset_s3_path,  # S3 path where validation data is stored
)

# Check input channels configured
data = [train_input, val_input]

Then start the training:

model_trainer.train(input_data_config=data)

The source code for this example has the following directory structure:

scripts/
├── accelerate_configs/   # Accelerate configuration files
├── run_finetuning.sh     # Launch script for distributed training with Accelerate on SageMaker training jobs
├── run_grpo.py           # Main training script for GRPO
├── utils/                # Utilities to load data and create prompts
├── recipes/              # Predefined training configuration recipes (YAML)
└── requirements.txt      # Python dependencies installed at runtime

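To fine-tune across multiple GPUs, the example training script uses Hugging Face Accelerate and DeepSpeed ZeRO-3, which work together to train large models more efficiently. Hugging Face Accelerate simplifies launching distributed training by automatically handling device placement, process management, and mixed precision settings. DeepSpeed ZeRO-3 reduces memory usage by partitioning optimizer states, gradients, and parameters across GPUs, which lets billion-parameter models fit in GPU memory and train faster. You can run your GRPO trainer script with Hugging Face Accelerate using a simple command like the following: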

NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
echo "Detected ${NUM_GPUS} GPUs on the machine"

# Launch fine-tuning with Accelerate + DeepSpeed (Zero3)
accelerate launch \
  --config_file accelerate_configs/deepspeed_zero3.yaml \
  --num_processes ${NUM_GPUS} \
  run_grpo.py \
  --config $CONFIG_PATH

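Results

Evaluated on 100 test samples, the model trained with 8-shot GRPO reaches 41% accuracy, compared with 11% for the base model, a 3.7x improvement in chain-of-thought mathematical reasoning.

The following figure shows a clear threshold tied to context length, revealing the range of example counts at which reasoning is activated. The 0-shot (6%) and 2-shot (3%) configurations perform poorly, below even the base model, but performance rises sharply with 4-shot prompting (33%) and peaks with 8-shot context (41%). This nonlinear scaling pattern suggests that the reasoning patterns formed by GRPO training need a certain number of examples before they activate effectively. The model appears to have learned to exploit group comparisons across multiple examples, which is consistent with GRPO's group-based policy optimization, where the model learns to compare multiple generated solutions and select the best reasoning path.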
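An accuracy figure like the ones above can be computed with a small evaluation loop. The sketch below is an assumption about how such a check might look, not the post's evaluation code. `extract_final_answer`, `accuracy`, and the sample completions are hypothetical. Only the 41% and 11% figures come from the post.

```python
import re


def extract_final_answer(completion):
    """Parse the number on the '#### The final answer is N' line, if present."""
    match = re.search(r"#### The final answer is (\d[\d,]*(?:\.\d+)?)", completion)
    return float(match.group(1).replace(",", "")) if match else None


def accuracy(completions, ground_truths, tol=1e-3):
    """Fraction of completions whose extracted answer matches the ground truth."""
    correct = 0
    for completion, truth in zip(completions, ground_truths):
        value = extract_final_answer(completion)
        if value is not None and abs(value - float(truth)) < tol:
            correct += 1
    return correct / len(ground_truths)


# Hypothetical model outputs and reference answers
outputs = [
    "...\n#### The final answer is 16",
    "...\n#### The final answer is 20",
    "I am not sure.",
    "...\n#### The final answer is 1,000",
]
references = ["16", "24", "5", "1000"]
print(accuracy(outputs, references))

# Relative improvement reported in the post: 41% versus 11%
improvement = 41 / 11
print(round(improvement, 1))
```

Completions without a parseable answer line count as wrong, which keeps the metric strict about both format and correctness.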

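Extending RLVR to other domains

Although this post focuses on mathematical reasoning with GSM8K, the RLVR approach generalizes to domains with objectively verifiable outputs. The following two promising directions illustrate this versatility:

Code generation with execution-based rewards

Code generation is naturally verifiable through execution. Partial rewards can be given when code compiles and runs without errors, and full rewards when the output passes a complete set of unit tests. Domain experts specify requirements with natural language prompts, while the reward model evaluates correctness automatically by executing the code, which reduces subjective human evaluation.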

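Domain-specific text generation with semantic validation

For specialized fields such as medical or technical writing, keyword-based rewards can steer the model toward appropriate terminology. Partial rewards encourage the inclusion of required terms, while full rewards require the complete keyword set in a semantically appropriate context. For example, medical text generation can reward outputs that combine diagnostic keywords ("symptoms", "diagnosis") with treatment keywords ("therapy", "medication") in clinically valid patterns, teaching domain vocabulary through measurable objectives.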
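The keyword part of such a reward could be sketched as follows. This only checks for the presence of terms. The semantic validation described above would need an additional check that is not shown here. The reward values and the function name `keyword_reward` are assumptions for illustration.

```python
def keyword_reward(text, keywords, partial=0.5, full=1.0):
    """Return full reward when every keyword appears, partial when some do.

    Presence only: this does not verify that terms are used in a
    clinically or semantically valid way.
    """
    lowered = text.lower()
    found = sum(1 for keyword in keywords if keyword in lowered)
    if found == len(keywords):
        return full
    if found > 0:
        return partial
    return 0.0


medical_keywords = ["symptoms", "diagnosis", "therapy", "medication"]

texts = [
    "The patient's symptoms support a diagnosis of asthma; therapy includes inhaled medication.",
    "Symptoms include a persistent cough.",
    "The weather is nice today.",
]
keyword_scores = [keyword_reward(text, medical_keywords) for text in texts]
print(keyword_scores)
```

Because a model can satisfy a presence check by listing terms without meaning, a reward like this is open to the reward hacking discussed earlier unless it is paired with a stricter validity check.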

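These examples show that verifiable rewards extend beyond mathematical reasoning to any task whose correctness can be checked programmatically, laying the groundwork for broader use of this training approach.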

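Clean up

To clean up your resources and avoid further charges, follow these steps:

Delete any unused SageMaker Studio resources.

Optionally, delete the SageMaker Studio domain.

Delete any S3 buckets you created.

Verify that your training job is no longer running. To do so, choose Training on the SageMaker console and check Training jobs.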
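The same check can be scripted. The sketch below lists training jobs that are still in progress using the SageMaker `list_training_jobs` API through a client object passed in by the caller. The helper name `in_progress_training_jobs` is an assumption, and the commented usage requires boto3 and valid AWS credentials.

```python
def in_progress_training_jobs(sagemaker_client):
    """Return the names of training jobs whose status is InProgress.

    Follows NextToken pagination until all pages have been read.
    """
    names = []
    kwargs = {"StatusEquals": "InProgress"}
    while True:
        response = sagemaker_client.list_training_jobs(**kwargs)
        summaries = response.get("TrainingJobSummaries", [])
        names.extend(summary["TrainingJobName"] for summary in summaries)
        token = response.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


# Usage (requires boto3 and AWS credentials):
# import boto3
# client = boto3.client("sagemaker")
# for name in in_progress_training_jobs(client):
#     print("Still running:", name)
#     client.stop_training_job(TrainingJobName=name)
```

Taking the client as a parameter keeps the helper easy to test with a stub and lets you point it at a specific Region or profile.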

To learn more about cleaning up provisioned resources, see Clean up.

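Conclusion

In this example, you used GRPO (Group Relative Policy Optimization) to train a Qwen2.5-0.5B model on GSM8K. GSM8K is a dataset of about 8,500 grade school math word problems that require multi-step arithmetic reasoning and natural language understanding. Each problem consists of a question such as "Janet’s ducks lay 16 eggs per day…" together with a step-by-step solution that ends in a numeric answer, which makes the dataset well suited to verifiable reward training.

The implementation demonstrates the effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) on mathematical reasoning tasks. The GRPO-trained Qwen2.5-0.5B model improves 3.7x over the base model, reaching 41% accuracy on 100 GSM8K test samples against a baseline of 11%. The evaluation supports RLVR as a promising approach for domains with objectively verifiable outcomes and as an alternative to preference-based training methods. The threshold behavior suggests that GRPO learned to exploit group comparisons across multiple examples, which is consistent with its group-based optimization. This work lays the groundwork for applying verifiable reward systems to other domains that demand logical rigor and mathematical accuracy.

For more information about fully managed training on Amazon SageMaker AI, see the training section of the SageMaker AI documentation. The supporting code for this post is available on GitHub.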

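About the authors

Surya Kari is a Senior Generative AI Data Scientist at AWS who specializes in developing solutions that use state-of-the-art foundation models. He has extensive experience with advanced language models, including DeepSeek-R1, the Llama family, and Qwen, with a focus on fine-tuning and optimizing them for specific scientific applications. His expertise also covers building efficient training pipelines and deployment strategies with AWS SageMaker, so that foundation models can scale from development to production. He works with customers to design and implement generative AI solutions, helping them with model selection, fine-tuning approaches, and deployment strategies to get the best performance for their use cases.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS who focuses on helping large enterprises design and deploy ML solutions on AWS. He has more than 20 years of experience as a full stack software engineer and has spent the past 6 years at AWS focused on machine learning.

Amin Dashti is a Senior Data Scientist and researcher at AWS who combines deep theoretical insight with practical machine learning expertise. With a background in theoretical physics and more than seven years of experience, he has designed and deployed scalable models across several domains, including predictive analytics and statistical inference in financial systems, and cutting-edge applications in computer vision (CV) and natural language processing (NLP).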
