
Machine Learning | Building an LLM from Scratch: Reproducing DeepSeek's aha moment

I recently got access to 48 GB of GPU memory and, combining several open-source approaches, reproduced the aha moment; this article gives the complete code and toolchain.

The previous article, "Building LLMs from Scratch: DeepSeek's GRPO", implemented a simple version of the GRPO code, but from an engineering standpoint it did not actually reproduce DeepSeek-R1. So I recently got access to 48 GB of GPU memory, combined several open-source approaches to reproduce the aha moment, and this article gives the complete code and toolchain.

1. What is the aha moment

The DeepSeek-R1 paper mentions that the model let the authors "witness the power and beauty of reinforcement learning": in an intermediate version of DeepSeek-R1-Zero, the "aha moment" arrived, and the model learned to reflect on its own reasoning in a human-like tone.

[Figure: the aha moment example from the DeepSeek-R1 paper]

2. Which base model and training data to use

  • Since the GPU has only 48 GB of memory, the Qwen2.5 base models are a good fit, at sizes 0.5B, 1.5B, or 3B
  • There is plenty of training data to choose from (all of it can be found directly on Hugging Face; a loading sketch follows this list):

   a. AI-MO/NuminaMath-TIR: about 72K rows of math problems with solutions and answers, distilled from the NuminaMath-CoT dataset

   b. FreedomIntelligence/medical-o1-verifiable-problem: about 40K rows of medical problems, though without reasoning traces

   c. https://raw.githubusercontent.com/hkust-nlp/simpleRL-reason/refs/heads/main/train/data/math_level3to5_data_processed_with_qwen_prompt.json: the training dataset from the open-source simpleRL-reason project
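
As a quick sanity check before training, the NuminaMath-TIR dataset can be loaded and inspected with the datasets library. This is a minimal sketch; the column names used below ("problem" and "solution") are the ones the training script in section 7 relies on, so it is worth confirming them against the dataset card.

from datasets import load_dataset

# Minimal sketch: load the NuminaMath-TIR training split and inspect one row
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")
print(len(dataset))             # roughly 72K rows
sample = dataset[0]
print(sample["problem"][:200])  # the math question, used as the user prompt
print(sample["solution"][:200]) # the reference solution, used by the reward functions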

3. How to train

3.1 Designing the reward functions

The previous article, "Building LLMs from Scratch: DeepSeek's GRPO", already covered how GRPO works, including the role of reward-function design. The design rationale is omitted here; following other R1 reproduction projects, this article uses the following six reward functions:

  • accuracy_reward: checks the correctness of the answer; returns 1 if correct, 0 otherwise
  • format_reward: checks the output format; returns 1 if the completion matches ^<think>.*?</think><answer>.*?</answer>$, 0 otherwise
  • reasoning_steps_reward: counts reasoning-step markers matching (Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,); three or more markers earn the full reward of 1, fewer earn a proportional partial reward
  • cosine_reward: scales the reward with the completion length on a cosine schedule, parameterized by minimum/maximum reward values for correct answers, minimum/maximum reward values for wrong answers, and a maximum length
  • repetition_penalty_reward: computes an N-gram repetition penalty
  • length_reward: follows the Kimi k1.5 paper (https://arxiv.org/abs/2501.12599); a worked sketch is shown after this list

a. reward for correct answers: reward = 0.5 - (len - min_len)/(max_len - min_len)

b. reward for wrong answers: reward = min(0, 0.5 - (len - min_len)/(max_len - min_len))
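
To make the length reward concrete, here is a minimal standalone sketch over a toy batch of three completions. The lengths and correctness flags are made-up example values; the full implementation in section 7 computes length the same way (character count of the completion).

# Toy illustration of the Kimi-k1.5-style length reward (not the training code itself)
lengths = [120, 300, 480]        # completion lengths in characters
correct = [True, True, False]    # whether each answer was verified as correct

min_len, max_len = min(lengths), max(lengths)
rewards = []
for length, is_correct in zip(lengths, correct):
    lam = 0.5 - (length - min_len) / (max_len - min_len)
    rewards.append(lam if is_correct else min(0.0, lam))

print(rewards)  # [0.5, 0.0, -0.5]: shorter correct answers score higher, long wrong answers are penalized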

3.2 Using vLLM

To improve performance and save GPU memory, vLLM is used here. vLLM is an open-source inference acceleration framework for large models that manages the attention KV-cache tensors efficiently through PagedAttention, achieving 14-24x the throughput of HuggingFace Transformers. In the experiments for this article, a setup that previously needed about 60 GB of GPU memory ran in roughly 40 GB.

Since vLLM loads models in a way that is directly compatible with Hugging Face checkpoints, the following code is enough to get it running:

from vllm import LLM, SamplingParams
if __name__ == '__main__':
    model_path = "{model name}"
    model = LLM(model=model_path,
        tensor_parallel_size=1,
        trust_remote_code=True,
        max_model_len=10000,
        enforce_eager=True,
        gpu_memory_utilization=0.5,
        block_size=32)
    # max_tokens=1 here only exercises model loading and a single decode step; raise it for real generation
    sampling_params = SamplingParams(temperature=0, max_tokens=1, prompt_logprobs=20)

    prompt = "How is vLLM implemented?"
    response = model.generate(prompt, sampling_params, use_tqdm=False)[0]
    print(response, '\n\n', response.outputs)
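
A note on two of the parameters above: enforce_eager=True disables CUDA graph capture, trading some throughput for lower memory usage, and gpu_memory_utilization caps the fraction of GPU memory vLLM pre-allocates (0.5 here leaves roughly half of the card free for the training process). Both can be tuned to fit the available hardware.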

3.3 Accelerating training with Accelerate + DeepSpeed

Accelerate is a distributed-training library from Hugging Face, while DeepSpeed is a distributed-training framework from Microsoft. The main difference is the model scale they target: DeepSpeed supports much larger models and offers more optimization strategies and tools, such as ZeRO and ZeRO-Offload, whereas Accelerate is more stable and easier to use, well suited to small and medium training jobs. Since Accelerate already integrates DeepSpeed, adapting an existing training loop only takes a few lines of code, as follows:

#!pip install accelerate
#!pip install deepspeed
import torch
import torch.nn.functional as F
from datasets import load_dataset
# import the accelerate library
from accelerate import Accelerator

# create the accelerator
accelerator = Accelerator()
# switch to the accelerator's device
device = accelerator.device
model = torch.nn.Transformer().to(device)
optimizer = torch.optim.Adam(model.parameters())

dataset = load_dataset("{dataset to load}")
data = torch.utils.data.DataLoader(dataset, shuffle=True)

# let accelerator wrap the model, optimizer and dataloader for distributed training
model, optimizer, data = accelerator.prepare(model, optimizer, data)
model.train()
for epoch in range(10):
    for source, targets in data:
        source = source.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()

        output = model(source)
        loss = F.cross_entropy(output, targets)

        # use accelerator for the backward pass
        accelerator.backward(loss)

        optimizer.step()

The relevant configuration can be found in the zero3.yaml file shown below, or generated by running accelerate config.

4. Complete code

4.1 Commands

Python >= 3.10 is required, along with the following libraries:

pip install transformers
pip install trl
pip install --upgrade trl
pip install latex2sympy2_extended math_verify
pip install flash_attn
pip install vllm
pip install deepspeed
pip install accelerate

Run the training with:

accelerate launch --config_file zero3.yaml 0-grpotrainer_r1.py

where the zero3.yaml configuration is:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero_stage: 3 
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1 
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

4.2 Code

The full training code is fairly long; see section 7 at the end of this article.

5. Observing the aha moment

[Figure: sample training output]

As the figure shows, the model fails when it tries to answer directly, but after repeatedly adding reasoning and reflection steps it eventually reaches the correct answer.

6. Notes

(1) Installation error: ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2. Solution:
pip install -U flash-attn

(2) Installation error: ImportError: vLLM is not available and use_vllm is set to True. Please install vLLM with pip install vllm to use it. Solution:
pip install -U vllm

(3) How do you convert the trained DeepSpeed checkpoint into a model that can be loaded for inference? Solution:

from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict 

convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir="./output/GRPO-R1-1.5B",
    output_dir="./output/GRPO-R1-1.5B",
    tag="global_step9055", # 模型保存的step文件
)

(4) How do you test the model? Solution:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Qwen base model:
# model_name = "Qwen/Qwen2.5-1.5B"
# Load the locally trained model:
model_name = "./output/GRPO-R1-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir="./model")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device: ", device)
model.to(device)

chat_history_ids = None
while True:
    user_input = input("User: ")
    if user_input.lower() == "exit":
        break

    new_user_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt').to(device)

    if chat_history_ids is not None:
        input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)
    else:
        input_ids = new_user_input_ids

    chat_history_ids = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
    bot_response = tokenizer.decode(chat_history_ids[:, input_ids.shape[-1]:][0], skip_special_tokens=True)

    print("機器人: ", bot_response)

7. Code

from typing import Optional, Dict
import re, logging, os, sys, torch, math
import transformers
from transformers import (
    AutoModelForCausalLM,
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
import datasets
from datasets import load_dataset
from trl import ModelConfig, ScriptArguments, GRPOConfig, GRPOTrainer, get_peft_config
from dataclasses import dataclass, field
from latex2sympy2_extended import NormalizationConfig
from math_verify import LatexExtractionConfig, parse, verify

logger = logging.getLogger(__name__)

def verify_answer(contents, solution):
    rewards = []
    for content, sol in zip(contents, solution):
        gold_parsed = parse(
            sol,
            extraction_mode="first_match",
            extraction_config=[LatexExtractionConfig()],
        )
        print('-'*100)
        print(f'\ncontent:{content}\nsol:{sol}')
        if len(gold_parsed) != 0:
            answer_parsed = parse(
                content,
                extraction_config=[
                    LatexExtractionConfig(
                        normalization_config=NormalizationConfig(
                            nits=False,
                            malformed_operators=False,
                            basic_latex=True,
                            equations=True,
                            boxed="all",
                            units=True,
                        ),
                        # Ensures that boxed is tried first
                        boxed_match_priority=0,
                        try_extract_without_anchor=False,
                    )
                ],
                extraction_mode="first_match",
            )
            # Reward 1 if the content is the same as the ground truth, 0 otherwise
            reward = float(verify(answer_parsed, gold_parsed))
            print('-'*100)
            print(f'\nanswer_parsed:{answer_parsed}\ngold_parsed:{gold_parsed}\nreward:{reward}')
        else:
            reward = 1.0
            print(f'Failed to parse gold solution: {sol}')
        rewards.append(reward)

    return rewards

def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = verify_answer(contents, solution)
    print(f'\naccuracy rewards:{rewards}')
    return rewards

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    rewards = [1.0 if match else 0.0 for match in matches]
    print('-'*100)
    print('\nformat rewards:', rewards)
    return rewards

def reasoning_steps_reward(completions, **kwargs):
    """Reward function that checks for clear step-by-step reasoning.
    Regex pattern:
        Step \d+: - matches "Step 1:", "Step 2:", etc.
        ^\d+\. - matches numbered lists like "1.", "2.", etc. at start of line
        \n- - matches bullet points with hyphens
        \n\* - matches bullet points with asterisks
        First,|Second,|Next,|Finally, - matches transition words
    """
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [len(re.findall(pattern, content)) for content in completion_contents]
    # Magic number 3 to encourage 3 steps and more, otherwise partial reward
    return [min(1.0, count / 3) for count in matches]

def len_reward(completions: list[Dict[str, str]], solution: list[str], **kwargs) -> list[float]:
    """Compute length-based rewards to discourage overthinking and promote token efficiency.

    Taken from the Kimi 1.5 tech report: https://arxiv.org/abs/2501.12599

    Args:
        completions: List of model completions
        solutions: List of ground truth solutions

    Returns:
        List of rewards where:
        - For correct answers: reward = 0.5 - (len - min_len)/(max_len - min_len)
        - For incorrect answers: reward = min(0, 0.5 - (len - min_len)/(max_len - min_len))
    """
    contents = [completion[0]["content"] for completion in completions]

    # First check correctness of answers
    correctness = verify_answer(contents, solution)

    # Calculate lengths
    lengths = [len(content) for content in contents]
    min_len = min(lengths)
    max_len = max(lengths)

    # If all responses have the same length, return zero rewards
    if max_len == min_len:
        return [0.0] * len(completions)

    rewards = []
    for length, is_correct in zip(lengths, correctness):
        lambda_val = 0.5 - (length - min_len) / (max_len - min_len)
        reward = lambda_val if is_correct > 0.0 else min(0, lambda_val)
        rewards.append(float(reward))

    return rewards

def get_cosine_scaled_reward(
    min_value_wrong: float = -1.0,
    max_value_wrong: float = -0.5,
    min_value_correct: float = 0.5,
    max_value_correct: float = 1.0,
    max_len: int = 1000,
):
    def cosine_scaled_reward(completions, solution, **kwargs):
        """Reward function that scales based on completion length using a cosine schedule.

        Shorter correct solutions are rewarded more than longer ones.
        Longer incorrect solutions are penalized less than shorter ones.

        Args:
            completions: List of model completions
            solution: List of ground truth solutions

        This function is parameterized by the following arguments:
            min_value_wrong: Minimum reward for wrong answers
            max_value_wrong: Maximum reward for wrong answers
            min_value_correct: Minimum reward for correct answers
            max_value_correct: Maximum reward for correct answers
            max_len: Maximum length for scaling
        """
        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        correctness = verify_answer(contents, solution)
        lengths = [len(content) for content in contents]
        for gen_len, is_correct in zip(lengths, correctness):
            # Apply cosine scaling based on length
            progress = gen_len / max_len
            cosine = math.cos(progress * math.pi)

            if is_correct > 0:
                min_value = min_value_correct
                max_value = max_value_correct
            else:
                # Swap min/max for incorrect answers
                min_value = max_value_wrong
                max_value = min_value_wrong

            reward = min_value + 0.5 * (max_value - min_value) * (1.0 + cosine)
            rewards.append(float(reward))

        return rewards

    return cosine_scaled_reward


def get_repetition_penalty_reward(ngram_size: int, max_penalty: float):
    """
    Computes N-gram repetition penalty as described in Appendix C.2 of https://arxiv.org/abs/2502.03373.
    Reference implementation from: https://github.com/eddycmu/demystify-long-cot/blob/release/openrlhf/openrlhf/reward/repetition.py

    Args:
    ngram_size: size of the n-grams
    max_penalty: Maximum (negative) penalty for wrong answers
    """
    if max_penalty > 0:
        raise ValueError(f"max_penalty {max_penalty} should not be positive")

    def zipngram(text: str, ngram_size: int):
        words = text.lower().split()
        return zip(*[words[i:] for i in range(ngram_size)])

    def repetition_penalty_reward(completions, **kwargs) -> float:
        """
        Reward function that penalizes repetitions.
        ref implementation: https://github.com/eddycmu/demystify-long-cot/blob/release/openrlhf/openrlhf/reward/repetition.py

        Args:
            completions: List of model completions
        """

        contents = [completion[0]["content"] for completion in completions]
        rewards = []
        for completion in contents:
            if completion == "":
                rewards.append(0.0)
                continue
            if len(completion.split()) < ngram_size:
                rewards.append(0.0)
                continue

            ngrams = set()
            total = 0
            for ng in zipngram(completion, ngram_size):
                ngrams.add(ng)
                total += 1

            scaling = 1 - len(ngrams) / total
            reward = scaling * max_penalty
            rewards.append(reward)
        return rewards

    return repetition_penalty_reward

SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)

@dataclass
class R1GRPOScriptArguments(ScriptArguments):
    reward_funcs: list[str] = field(
        default_factory = lambda: ["accuracy", "format"],
        metadata = {
            "help": f"List of reward functions. Available options: 'accuracy', 'format', 'reasoning_steps', 'len', 'get_cosine_scaled', 'get_repetition_penalty'"
        },
    )
    cosine_min_value_wrong: float = field(
        default=0.0,
        metadata={"help": "Minimum reward for wrong answers"},
    )
    cosine_max_value_wrong: float = field(
        default=-0.5,
        metadata={"help": "Maximum reward for wrong answers"},
    )
    cosine_min_value_correct: float = field(
        default=0.5,
        metadata={"help": "Minimum reward for correct answers"},
    )
    cosine_max_value_correct: float = field(
        default=1.0,
        metadata={"help": "Maximum reward for correct answers"},
    )
    cosine_max_len: int = field(
        default=1000,
        metadata={"help": "Maximum length for scaling"},
    )
    repetition_n_grams: int = field(
        default=3,
        metadata={"help": "Number of n-grams for repetition penalty reward"},
    )
    repetition_max_penalty: float = field(
        default=-1.0,
        metadata={"help": "Maximum (negative) penalty for for repetition penalty reward"},
    )

@dataclass
class R1GRPOConfig(GRPOConfig):
    """
    args for callbacks, benchmarks etc
    """
    benchmarks: list[str] = field(
        default_factory=lambda: [], metadata={"help": "The benchmarks to run after training."}
    )
    callbacks: list[str] = field(
        default_factory=lambda: [], metadata={"help": "The callbacks to run during training."}
    )
    system_prompt: Optional[str] = field(
        default=None, metadata={"help": "The optional system prompt to use for benchmarking."}
    )


def main(script_args, training_args, model_args):
    # Set seed for reproducibility
    set_seed(training_args.seed)

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process a small summary
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Model parameters {model_args}")
    logger.info(f"Script parameters {script_args}")
    logger.info(f"Data parameters {training_args}")

    # Check for last checkpoint
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir):
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
        logger.info(f"Last checkpoint detected, resuming training at {last_checkpoint=}.")
    if last_checkpoint isnotNoneand training_args.resume_from_checkpoint isNone:
        logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.")

    # Load the dataset
    dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

    # Get reward functions
    REWARD_FUNCS_REGISTRY = {
        "accuracy": accuracy_reward,
        "format": format_reward,
        "reasoning_steps": reasoning_steps_reward,
        "cosine": get_cosine_scaled_reward(
            min_value_wrong=script_args.cosine_min_value_wrong,
            max_value_wrong=script_args.cosine_max_value_wrong,
            min_value_correct=script_args.cosine_min_value_correct,
            max_value_correct=script_args.cosine_max_value_correct,
            max_len=script_args.cosine_max_len,
        ),
        "repetition_penalty": get_repetition_penalty_reward(
            ngram_size=script_args.repetition_n_grams,
            max_penalty=script_args.repetition_max_penalty,
        ),
        "length": len_reward,
    }
    reward_funcs = [REWARD_FUNCS_REGISTRY[func] for func in script_args.reward_funcs]

    # Format into conversation
    def make_conversation(example):
        return {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": example["problem"]},
            ],
        }

    dataset = dataset.map(make_conversation)
    for split in dataset:
        if"messages"in dataset[split].column_names:
            dataset[split] = dataset[split].remove_columns("messages")

    logger.info("*** Initializing model kwargs ***")
    torch_dtype = (
        model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype)
    )

    training_args.gradient_checkpointing = True
    model_kwargs = dict(
        revision = model_args.model_revision,
        trust_remote_code = model_args.trust_remote_code,
        attn_implementation = model_args.attn_implementation,
        torch_dtype = torch_dtype,
        use_cache = False if training_args.gradient_checkpointing else True,
    )

    model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, 
                                                 load_in_4bit=False, **model_kwargs)

    print(model_args.model_name_or_path)
    #############################
    # Initialize the R1GRPO trainer
    #############################
    trainer = GRPOTrainer(
        model = model,
        reward_funcs = reward_funcs,
        args = training_args,
        train_dataset = dataset[script_args.dataset_train_split],
        eval_dataset = dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
        peft_config = get_peft_config(model_args),
    )

    ###############
    # Training loop
    ###############
    logger.info("*** Train ***")
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    metrics = train_result.metrics
    metrics["train_samples"] = len(dataset[script_args.dataset_train_split])
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    ##################################
    # Save model and create model card
    ##################################
    logger.info("*** Save model ***")
    trainer.save_model(training_args.output_dir)
    logger.info(f"Model saved to {training_args.output_dir}")

    # Save everything else on main process
    kwargs = {
        "dataset_name": script_args.dataset_name,
        "tags": ["GRPOTrainer-R1"],
    }
    if trainer.accelerator.is_main_process:
        trainer.create_model_card(**kwargs)
        # Restore k,v cache for fast inference
        trainer.model.config.use_cache = True
        trainer.model.config.save_pretrained(training_args.output_dir)

script_config = {
    "dataset_name": "AI-MO/NuminaMath-TIR",
    "dataset_config": "default",
    "reward_funcs": [
        "accuracy",
        "format",
        "reasoning_steps",
    ]
}

training_config = {
    "output_dir": "output/GRPO-R1-1.5B", # 模型輸出目錄
    "overwrite_output_dir": True, # 是否覆蓋輸出目錄
    "do_eval": True, # 是否進行評估
    "eval_strategy": "steps", # 評估策略,按步數進行評估
    "eval_steps": 100, # 每100步進行一次評估
    "per_device_train_batch_size": 4, # 每個設備上的訓練批次大小
    "per_device_eval_batch_size": 4, # 每個設備上的評估批次大小
    "gradient_accumulation_steps": 8, # 梯度累積步數
    "learning_rate": 1.0e-06, # 學習率
    "num_train_epochs": 1.0, # 訓練的總輪數
    "max_steps": -1, # 最大訓練步數,-1表示不限制
    "lr_scheduler_type": "cosine", # 學習率調度器類型,使用余弦退火
    "warmup_ratio": 0.1, # 預熱比例
    "log_level": "info", # 日志記錄級別
    "logging_strategy": "steps", # 日志記錄策略,按步數記錄
    "logging_steps": 100, # 每100步記錄一次日志
    "save_strategy": "no", # 保存策略,不保存
    "seed": 42, # 隨機種子
    "bf16": True, # 是否使用bfloat16精度
    "gradient_checkpointing": True, # 是否使用梯度檢查點
    "gradient_checkpointing_kwargs": {
        "use_reentrant": False# 梯度檢查點的額外參數,是否使用reentrant模式
    },
    "max_prompt_length": 128, # 最大提示長度
    "num_generations": 4, # 生成的數量
    "max_completion_length": 256, # 最大完成長度
    "use_vllm": True, # 是否使用vLLM
    "vllm_device": "auto", # vLLM設備,自動選擇
    "vllm_gpu_memory_utilization": 0.8, # vLLM GPU內存利用率
    "resume_from_checkpoint": "output/GRPO-R1-1.5B", # 恢復檢查點,如果沒有latest文件,需要添加latest文件類似`global_step9055`
}

model_config = {
    "model_name_or_path": "Qwen/Qwen2.5-1.5B-Instruct",
    "model_revision": "main",
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
}

if __name__ == "__main__":
    script_args = R1GRPOScriptArguments(**script_config)
    training_args = R1GRPOConfig(**training_config)
    model_args = ModelConfig(**model_config)
    main(script_args, training_args, model_args)

References

(1)https://github.com/agentica-project/deepscaler

(2)https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset

(3)https://zhuanlan.zhihu.com/p/21393382793

(4)https://github.com/hkust-nlp/simpleRL-reason

(5)https://mp.weixin.qq.com/s/RbQnInTa00ZISvJL7vORzA

(6)https://zhuanlan.zhihu.com/p/629644249
