Machine Learning | Building a Large Model from Scratch: Model Pretraining


1. Parameter Initialization

The parameter-configuration template:

from transformers import PretrainedConfig

class MyPretrainConfig(PretrainedConfig):
    model_type = "myllm"

    def __init__(
            self,
            dim: int = 512,
            n_layers: int = 8,
            n_heads: int = 16,
            n_kv_heads: int = 8,
            vocab_size: int = 6400,
            hidden_dim: int = None,
            multiple_of: int = 64,
            norm_eps: float = 1e-5,
            max_seq_len: int = 512,
            dropout: float = 0.0,
            flash_attn: bool = True,
            use_moe: bool = False,
            num_experts_per_tok=2,
            n_routed_experts=4,
            n_shared_experts: bool = True,
            scoring_func='softmax',
            aux_loss_alpha=0.01,
            seq_aux=True,
            norm_topk_prob=True,
            **kwargs,
    ):
        self.dim = dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.multiple_of = multiple_of
        self.norm_eps = norm_eps
        self.max_seq_len = max_seq_len
        self.dropout = dropout
        self.flash_attn = flash_attn
        self.num_experts_per_tok = num_experts_per_tok  # number of experts selected per token
        self.n_routed_experts = n_routed_experts        # total number of routed experts
        self.n_shared_experts = n_shared_experts        # shared experts
        self.scoring_func = scoring_func                # scoring function, default 'softmax'
        self.aux_loss_alpha = aux_loss_alpha            # alpha weight of the auxiliary loss
        self.seq_aux = seq_aux                          # whether to compute the auxiliary loss at the sequence level
        self.norm_topk_prob = norm_topk_prob            # whether to normalize the top-k probabilities
        super().__init__(**kwargs)

This relies on PretrainedConfig from the transformers library. The MyPretrainConfig parameters are:

  • dim: int = 512: model (embedding) dimension, default 512
  • n_layers: int = 8: number of Transformer layers, default 8
  • n_heads: int = 16: number of attention heads, default 16
  • n_kv_heads: int = 8: number of key/value heads, default 8
  • vocab_size: int = 6400: vocabulary size, default 6400
  • hidden_dim: int = None: feed-forward hidden dimension; default None, in which case it is derived from dim (see FeedForward)
  • multiple_of: int = 64: the feed-forward hidden dimension is rounded up to a multiple of this value, default 64
  • norm_eps: float = 1e-5: epsilon used by the normalization layers, default 1e-5
  • max_seq_len: int = 512: maximum sequence length, default 512
  • dropout: float = 0.0: dropout probability, default 0.0
  • flash_attn: bool = True: whether to use Flash Attention (scaled_dot_product_attention), default True
  • num_experts_per_tok=2: number of experts selected per token, default 2
  • n_routed_experts=4: total number of routed experts, default 4
  • n_shared_experts: bool = True: whether to use shared experts, default True
  • scoring_func='softmax': expert scoring function, default 'softmax'
  • aux_loss_alpha=0.01: alpha weight of the auxiliary loss, default 0.01
  • seq_aux=True: whether to compute the auxiliary loss at the sequence level, default True
  • norm_topk_prob=True: whether to normalize the top-k probabilities, default True
  • **kwargs: any other keyword arguments, forwarded to the parent constructor

PretrainedConfig provides the template for pretraining hyperparameters; since every model is different, the configuration is usually serialized to a file and shipped together with the model.
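Because MyPretrainConfig subclasses PretrainedConfig, it inherits the standard serialization helpers, so the configuration can be written to and restored from a config.json. A minimal sketch (the directory name my_model_dir is just a placeholder):

config = MyPretrainConfig(dim=512, n_layers=8)
config.save_pretrained("my_model_dir")                # writes my_model_dir/config.json
loaded = MyPretrainConfig.from_pretrained("my_model_dir")
print(loaded.dim, loaded.n_layers)                    # 512 8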

2. Loading the Preprocessed Data

Load the preprocessed data produced in the previous article:

data_path_list = [f'./pretrain_data.bin']
train_ds = PretrainDataset(data_path_list, max_length=max_seq_len, memmap=True)
train_sampler = None
num_workers = 16  # adjust according to the number of CPU cores on your system
train_loader = DataLoader(
    train_ds,
    batch_size=batch_size,
    pin_memory=True,
    drop_last=False,
    shuffle=False,
    num_workers=num_workers,
    sampler=train_sampler
)

PretrainDataset is the loading code; it exposes the token file (memory-mapped, or fully loaded into memory) as fixed-length samples that the DataLoader can fetch:

class PretrainDataset(Dataset):
    def __init__(self, data_path_lst, max_length=512, memmap=False):
        super().__init__()
        if memmap:
            with open(data_path_lst[0], 'r') as f:
                nbytes = f.seek(0, 2)
                flen = f.tell() // np.dtype('uint16').itemsize
            self.data = np.memmap(data_path_lst[0], dtype=np.dtype('uint16'), shape=(flen // max_length, max_length))
        else:
            data_lst = []
            for data_path in data_path_lst:
                with open(data_path, 'rb') as f:
                    data = np.fromfile(f, dtype=np.uint16)
                    data_lst.append(data)
            data = np.concatenate(data_lst)
            data = data[:max_length * int(len(data) / max_length)]
            self.data = data.reshape(-1, max_length)
        print("memmap:{} train data.shape:{}".format(memmap, self.data.shape))
        print("downloading finished.....")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index: int):
        sample = self.data[index]
        X = np.array(sample[:-1]).astype(np.int64)
        Y = np.array(sample[1:]).astype(np.int64)

        return torch.from_numpy(X), torch.from_numpy(Y)

Here Dataset is the standard base class imported with from torch.utils.data import Dataset.
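Each row therefore yields a next-token prediction pair: Y is X shifted left by one position, so Y[i] is the target for input X[i]. A tiny illustration with made-up token ids:

import numpy as np

sample = np.array([10, 11, 12, 13, 14], dtype=np.uint16)  # one (shortened) row of self.data
X = sample[:-1].astype(np.int64)                          # [10, 11, 12, 13] -> model input
Y = sample[1:].astype(np.int64)                           # [11, 12, 13, 14] -> targets, shifted by one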

3. Initializing the Model

The model initialization borrows from llama2.c (https://github.com/karpathy/llama2.c/blob/master/model.py) and uses only the decoder part of the Transformer, i.e. a Decoder-Only architecture. The main logic is:

  • Initialization: create tok_embeddings, dropout, the layers list, the CausalLMOutputWithPast output container, etc.
  • forward: run the input through the layers and return the output

The code is as follows:

class Transformer(PreTrainedModel):
    last_loss: Optional[torch.Tensor]

    def __init__(self, params: MyPretrainConfig):
        super().__init__(params)
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)
        self.dropout = nn.Dropout(params.dropout)
        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))
        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = nn.Linear(params.dim, params.vocab_size, bias=False)

        # share the unembedding parameters with the embedding parameters
        self.tok_embeddings.weight = self.output.weight # https://paperswithcode.com/method/weight-tying

        # some useful precompute for the RoPE relative positional embeddings
        freqs_cos, freqs_sin = precompute_freqs_cis(self.params.dim // self.params.n_heads, self.params.max_seq_len)
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('w3.weight') or pn.endswith('wo.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * params.n_layers))

        # Initialize attribute for the loss of the last forward call. This will be set if the forward is called with a targets tensor.
        self.last_loss = None
        self.OUT = CausalLMOutputWithPast()

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, tokens: torch.Tensor, targets: Optional[torch.Tensor] = None) -> torch.Tensor:
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        h = self.dropout(h)
        freqs_cos = self.freqs_cos[:seqlen]
        freqs_sin = self.freqs_sin[:seqlen]

        for layer in self.layers:
            h = layer(h, freqs_cos, freqs_sin)
        h = self.norm(h)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.output(h)
            self.last_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the output on the very last position
            logits = self.output(h[:, [-1], :]) # note: using list [-1] to preserve the time dim
            self.last_loss = None

        self.OUT.__setitem__('logits', logits)
        self.OUT.__setitem__('last_loss', self.last_loss)
        return self.OUT
...

Then initialize the model with this class and print it:

def init_model():
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    model = Transformer(lm_config).to(device)
    print(f'Total LLM parameters: {count_parameters(model) / 1e6:.3f} M')
    return model

model = init_model()
print(model)

The printed output is:

Transformer(
  (tok_embeddings): Embedding(6400, 512)
  (dropout): Dropout(p=0.0, inplace=False)
  (layers): ModuleList(
    (0-7): 8 x TransformerBlock(
      (attention): Attention(
        (wq): Linear(in_features=512, out_features=512, bias=False)
        (wk): Linear(in_features=512, out_features=256, bias=False)
        (wv): Linear(in_features=512, out_features=256, bias=False)
        (wo): Linear(in_features=512, out_features=512, bias=False)
        (attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
      (feed_forward): FeedForward(
        (w1): Linear(in_features=512, out_features=1408, bias=False)
        (w2): Linear(in_features=1408, out_features=512, bias=False)
        (w3): Linear(in_features=512, out_features=1408, bias=False)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (attention_norm): RMSNorm()
      (ffn_norm): RMSNorm()
    )
  )
  (norm): RMSNorm()
  (output): Linear(in_features=512, out_features=6400, bias=False)
)
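The 1408 in the FeedForward layers is not set explicitly: because hidden_dim is None in the config, FeedForward.__init__ (see the appendix) derives it from dim and multiple_of:

dim, multiple_of = 512, 64
hidden_dim = 4 * dim                                                        # 2048
hidden_dim = int(2 * hidden_dim / 3)                                        # 1365
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)  # rounded up to 1408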

Model initialization is not covered in detail here; a later article in this series will analyze the llama2.c source and explain how the model is built.
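As a quick sanity check of the forward logic (dummy token ids, shapes only): with targets the model projects every position and sets last_loss; without targets it only projects the final position.

with torch.no_grad():
    tokens = torch.randint(0, lm_config.vocab_size, (2, lm_config.max_seq_len)).to(device)
    out = model(tokens, targets=tokens)   # training path: out.logits is (2, 512, 6400), out.last_loss is a scalar
    out = model(tokens)                   # inference path: out.logits is (2, 1, 6400), out.last_loss is None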

4. Choosing the Optimizer

After the model is initialized, create the gradient scaler and the optimizer:

scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))  # grad scaling is only needed for fp16 autocast
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

4.1 GradScaler

GradScaler in PyTorch performs gradient scaling for Automatic Mixed Precision (AMP) training. Its main roles are:

  • Preventing gradient underflow: with mixed precision, weights and activations may be kept in lower precision (e.g. FP16), so gradients computed in the backward pass can become so small that they underflow to zero; GradScaler automatically adjusts the scaling factor so gradients do not underflow before the update;
  • Speeding up training: mixed precision reduces memory usage and compute time; by adjusting the scale factor dynamically, GradScaler lets you exploit mixed precision while keeping the numerics stable;
  • Simplifying code: with GradScaler there is no need to manage the scaling and unscaling manually;

During training you typically call scaler.scale(loss).backward() to compute gradients of the scaled loss, scaler.step(optimizer) to update the model parameters, and scaler.update() to adjust the scale factor, which keeps training both stable and efficient.
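The scale factor is dynamic and can be inspected at any time; if you checkpoint mid-training, the scaler state should be saved alongside the model and optimizer so the scale/growth tracking survives a resume. This is an illustrative sketch, not part of the training script below:

print(scaler.get_scale())                # current dynamic scale, e.g. 65536.0 right after initialization

checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),       # restore with scaler.load_state_dict(...) on resume
}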

4.2 optimizer

The optimizer is a core component in deep learning: it updates the model parameters to minimize the loss function. Concretely, its roles include:

  • Parameter updates: the optimizer uses the computed gradients to update the model parameters (weights and biases), trying to improve the model's fit to the training data;
  • Controlling the learning rate: the optimizer applies a learning rate to control the size of each update; this hyperparameter determines the step taken toward the optimum at each iteration;
  • Implementing different optimization algorithms: PyTorch provides many optimizers (SGD, Adam, RMSprop, ...), each with its own update rule; the choice affects convergence speed and final performance;
  • Momentum and adaptive learning rates: optimizers such as Adam and RMSprop use momentum and per-parameter adaptive learning rates to speed up convergence and improve stability, exploring the parameter space more effectively;
  • Regularization support: some optimizers integrate regularization (e.g. L2 weight decay) to reduce overfitting;

In the training loop below, these pieces turn the loss into a parameter update:

# Backward pass
scaler.scale(loss).backward()

# Gradient clipping and parameter update
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()

# Zero the gradients
optimizer.zero_grad(set_to_none=True)
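The script below uses plain optim.Adam, but the appendix model also carries a configure_optimizers helper (taken from llama2.c) that builds AdamW with weight decay applied only to 2D tensors and uses the fused kernel on CUDA when available. A possible drop-in alternative (the weight_decay and betas values here are only illustrative):

optimizer = model.configure_optimizers(
    weight_decay=0.1,             # illustrative value
    learning_rate=learning_rate,
    betas=(0.9, 0.95),            # illustrative value
    device_type=device_type,
)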

5. Training Loop

With the preprocessed data loaded, the model initialized, and the optimizer created, we can run the training loop. The key detail is the learning-rate schedule: get_lr sets the learning rate for each iteration using a cosine decay (its definition is shown after the loop below, and in full in the appendix). The code is as follows:

for epoch in range(epochs):
    start_time = time.time()

    for step, (X, Y) in enumerate(train_loader):
        X = X.to(device)
        Y = Y.to(device)

        # Set the learning rate for this iteration
        lr = get_lr(epoch * iter_per_epoch + step, epochs * iter_per_epoch)
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # Forward pass and loss
        with ctx:
            out = model(X, Y)
            loss = out.last_loss

        # Backward pass
        scaler.scale(loss).backward()

        # Gradient clipping and parameter update every accumulation_steps micro-batches
        if (step + 1) % accumulation_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()

            # Zero the gradients only after an optimizer step so they can
            # accumulate across the intermediate micro-batches
            optimizer.zero_grad(set_to_none=True)

        if step % 100 == 0:
            spend_time = time.time() - start_time
            print(
                'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.7f} epoch_Time:{}min:'.format(
                    epoch,
                    epochs,
                    step,
                    iter_per_epoch,
                    loss.item(),
                    optimizer.param_groups[-1]['lr'],
                    spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
            model.eval()
            ckp = f'{save_dir}/pretrain_{lm_config.dim}.pth'
            state_dict = model.state_dict()
            torch.save(state_dict, ckp)
            model.train()

  • out = model(X, Y): forward pass, computing the outputs
  • scaler.scale(loss).backward(): backward pass, computing the gradients; the parameters are updated every accumulation_steps steps
  • model.eval() / model.train(): switch to evaluation mode, save the current model to the target folder, then switch back to training mode
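The get_lr call above implements a cosine decay from learning_rate down to learning_rate / 10 over all iterations (warmup_iters is 0 here); its definition, repeated from the appendix:

def get_lr(it, all):
    warmup_iters = 0
    lr_decay_iters = all
    min_lr = learning_rate / 10

    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)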

On a T4 GPU this training run took 30+ hours to complete; on CPU expect roughly 4x that. The complete code is in the appendix if you want to try it yourself.

Appendix

Complete code:

import os
import time
import math
import warnings
import inspect
import numpy as np
import torch
from torch import optim
from torch.utils.data import DataLoader
from contextlib import nullcontext
from model.model import Transformer
from torch.utils.data import Dataset
from transformers import PretrainedConfig
from typing import Any, Optional, Tuple
import torch.nn.functional as F
from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast
os.environ["TOKENIZERS_PARALLELISM"] = "false"

warnings.filterwarnings('ignore')
basepath = "../datasets"

class MyPretrainConfig(PretrainedConfig):
    model_type = "myllm"

    def __init__(
            self,
            dim: int = 512,
            n_layers: int = 8,
            n_heads: int = 16,
            n_kv_heads: int = 8,
            vocab_size: int = 6400,
            hidden_dim: int = None,
            multiple_of: int = 64,
            norm_eps: float = 1e-5,
            max_seq_len: int = 512,
            dropout: float = 0.0,
            flash_attn: bool = True,
            num_experts_per_tok=2,
            n_routed_experts=4,
            n_shared_experts: bool = True,
            scoring_func='softmax',
            aux_loss_alpha=0.01,
            seq_aux=True,
            norm_topk_prob=True,
            **kwargs,
    ):
        self.dim = dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.multiple_of = multiple_of
        self.norm_eps = norm_eps
        self.max_seq_len = max_seq_len
        self.dropout = dropout
        self.flash_attn = flash_attn
        self.num_experts_per_tok = num_experts_per_tok  # number of experts selected per token
        self.n_routed_experts = n_routed_experts        # total number of routed experts
        self.n_shared_experts = n_shared_experts        # shared experts
        self.scoring_func = scoring_func                # scoring function, default 'softmax'
        self.aux_loss_alpha = aux_loss_alpha            # alpha weight of the auxiliary loss
        self.seq_aux = seq_aux                          # whether to compute the auxiliary loss at the sequence level
        self.norm_topk_prob = norm_topk_prob            # whether to normalize the top-k probabilities
        super().__init__(**kwargs)

class PretrainDataset(Dataset):
    def __init__(self, data_path_lst, max_length=512, memmap=False):
        super().__init__()
        if memmap:
            with open(data_path_lst[0], 'r') as f:
                nbytes = f.seek(0, 2)
                flen = f.tell() // np.dtype('uint16').itemsize
            self.data = np.memmap(data_path_lst[0], dtype=np.dtype('uint16'), shape=(flen // max_length, max_length))
        else:
            data_lst = []
            for data_path in data_path_lst:
                with open(data_path, 'rb') as f:
                    data = np.fromfile(f, dtype=np.uint16)
                    data_lst.append(data)
            data = np.concatenate(data_lst)
            data = data[:max_length * int(len(data) / max_length)]
            self.data = data.reshape(-1, max_length)
        print("memmap:{} train data.shape:{}".format(memmap, self.data.shape))
        print("downloading finished.....")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index: int):
        sample = self.data[index]
        X = np.array(sample[:-1]).astype(np.int64)
        Y = np.array(sample[1:]).astype(np.int64)

        return torch.from_numpy(X), torch.from_numpy(Y)
    
class RMSNorm(torch.nn.Module):
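    # Root-mean-square LayerNorm: normalizes by the RMS over the last dimension
    # (no mean-centering, no bias), then applies a learned per-channel scale.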
    def __init__(self, dim: int, eps: float):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
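    # Precompute the RoPE tables: freqs[t, i] = t / theta^(2i / dim) for every
    # position t < end; their cos/sin are applied to q/k in apply_rotary_emb.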
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)  # type: ignore
    freqs = torch.outer(t, freqs).float()  # type: ignore
    freqs_cos = torch.cos(freqs)  # real part
    freqs_sin = torch.sin(freqs)  # imaginary part
    return freqs_cos, freqs_sin

def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(shape)

def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cos: torch.Tensor,
    freqs_sin: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:

    # reshape xq and xk to match the complex representation
    xq_r, xq_i = xq.float().reshape(xq.shape[:-1] + (-1, 2)).unbind(-1)
    xk_r, xk_i = xk.float().reshape(xk.shape[:-1] + (-1, 2)).unbind(-1)

    # reshape freqs_cos and freqs_sin for broadcasting
    freqs_cos = reshape_for_broadcast(freqs_cos, xq_r)
    freqs_sin = reshape_for_broadcast(freqs_sin, xq_r)

    # apply rotation using real numbers
    xq_out_r = xq_r * freqs_cos - xq_i * freqs_sin
    xq_out_i = xq_r * freqs_sin + xq_i * freqs_cos
    xk_out_r = xk_r * freqs_cos - xk_i * freqs_sin
    xk_out_i = xk_r * freqs_sin + xk_i * freqs_cos

    # flatten last two dimensions
    xq_out = torch.stack([xq_out_r, xq_out_i], dim=-1).flatten(3)
    xk_out = torch.stack([xk_out_r, xk_out_i], dim=-1).flatten(3)

    return xq_out.type_as(xq), xk_out.type_as(xk)

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )

class Attention(nn.Module):
    def __init__(self, args: MyPretrainConfig):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        assert args.n_heads % self.n_kv_heads == 0
        model_parallel_size = 1
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads
        self.wq = nn.Linear(args.dim, args.n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(args.n_heads * self.head_dim, args.dim, bias=False)
        self.attn_dropout = nn.Dropout(args.dropout)
        self.resid_dropout = nn.Dropout(args.dropout)
        self.dropout = args.dropout

        # use flash attention or a manual implementation?
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            mask = torch.full((1, 1, args.max_seq_len, args.max_seq_len), float("-inf"))
            mask = torch.triu(mask, diagonal=1)
            self.register_buffer("mask", mask)

    def forward(
        self,
        x: torch.Tensor,
        freqs_cos: torch.Tensor,
        freqs_sin: torch.Tensor,
    ):
        bsz, seqlen, _ = x.shape

        # QKV
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        # RoPE relative positional embeddings
        xq, xk = apply_rotary_emb(xq, xk, freqs_cos, freqs_sin)

        # grouped multiquery attention: expand out keys and values
        xk = repeat_kv(xk, self.n_rep)  # (bs, seqlen, n_local_heads, head_dim)
        xv = repeat_kv(xv, self.n_rep)  # (bs, seqlen, n_local_heads, head_dim)

        # make heads into a batch dimension
        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        xk = xk.transpose(1, 2)
        xv = xv.transpose(1, 2)

        # flash implementation
        if self.flash:
            output = torch.nn.functional.scaled_dot_product_attention(xq, xk, xv, attn_mask=None, dropout_p=self.dropout if self.training else 0.0, is_causal=True)
        else:
            # manual implementation
            scores = torch.matmul(xq, xk.transpose(2, 3)) / math.sqrt(self.head_dim)
            assert hasattr(self, 'mask')
            scores = scores + self.mask[:, :, :seqlen, :seqlen]   # (bs, n_local_heads, seqlen, cache_len + seqlen)
            scores = F.softmax(scores.float(), dim=-1).type_as(xq)
            scores = self.attn_dropout(scores)
            output = torch.matmul(scores, xv)  # (bs, n_local_heads, seqlen, head_dim)

        # restore time as batch dimension and concat heads
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)

        # final projection into the residual stream
        output = self.wo(output)
        output = self.resid_dropout(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, multiple_of: int, dropout: float):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = 4 * dim
            hidden_dim = int(2 * hidden_dim / 3)
            hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
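        # SwiGLU-style feed-forward: silu(w1(x)) gated by w3(x), then projected back with w2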
        return self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: MyPretrainConfig):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
            hidden_dim=args.hidden_dim,
            multiple_of=args.multiple_of,
            dropout=args.dropout,
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(self, x, freqs_cos, freqs_sin):
        h = x + self.attention.forward(self.attention_norm(x), freqs_cos, freqs_sin)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out

class Transformer(PreTrainedModel):
    last_loss: Optional[torch.Tensor]

    def __init__(self, params: MyPretrainConfig):
        super().__init__(params)
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)
        self.dropout = nn.Dropout(params.dropout)
        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))
        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = nn.Linear(params.dim, params.vocab_size, bias=False)

        # share the unembedding parameters with the embedding parameters
        self.tok_embeddings.weight = self.output.weight # https://paperswithcode.com/method/weight-tying

        # some useful precompute for the RoPE relative positional embeddings
        freqs_cos, freqs_sin = precompute_freqs_cis(self.params.dim // self.params.n_heads, self.params.max_seq_len)
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('w3.weight') or pn.endswith('wo.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * params.n_layers))

        # Initialize attribute for the loss of the last forward call. This will be set if the forward is called with a targets tensor.
        self.last_loss = None
        self.OUT = CausalLMOutputWithPast()

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, tokens: torch.Tensor, targets: Optional[torch.Tensor] = None) -> torch.Tensor:
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        h = self.dropout(h)
        freqs_cos = self.freqs_cos[:seqlen]
        freqs_sin = self.freqs_sin[:seqlen]

        for layer in self.layers:
            h = layer(h, freqs_cos, freqs_sin)
        h = self.norm(h)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.output(h)
            self.last_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the output on the very last position
            logits = self.output(h[:, [-1], :]) # note: using list [-1] to preserve the time dim
            self.last_loss = None

        self.OUT.__setitem__('logits', logits)
        self.OUT.__setitem__('last_loss', self.last_loss)
        return self.OUT

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
        print(f"using fused AdamW: {use_fused}")

        return optimizer

    def estimate_mfu(self, fwdbwd_per_iter, dt):
        """ estimate model flops utilization (MFU) in units of A100 bfloat16 peak FLOPS """
        # first estimate the number of flops we do per iteration.
        # see PaLM paper Appendix B as ref: https://arxiv.org/abs/2204.02311
        N = sum(p.numel() for p in self.parameters())
        cfg = self.params
        L, H, Q, T = cfg.n_layers, cfg.n_heads, cfg.dim//cfg.n_heads, cfg.max_seq_len
        flops_per_token = 6*N + 12*L*H*Q*T
        flops_per_fwdbwd = flops_per_token * T
        flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
        # express our flops throughput as ratio of A100 bfloat16 peak flops
        flops_achieved = flops_per_iter * (1.0/dt) # per second
        flops_promised = 312e12 # A100 GPU bfloat16 peak flops is 312 TFLOPS
        mfu = flops_achieved / flops_promised
        return mfu

    @torch.inference_mode()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        Also note this is a super inefficient version of sampling with no key/value cache.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.params.max_seq_len else idx[:, -self.params.max_seq_len:]
            # forward the model to get the logits for the index in the sequence
            logits = self(idx_cond)
            logits = logits[:, -1, :] # crop to just the final time step
            if temperature == 0.0:
                # "sample" the single most likely index
                _, idx_next = torch.topk(logits, k=1, dim=-1)
            else:
                # pluck the logits at the final step and scale by desired temperature
                logits = logits / temperature
                # optionally crop the logits to only the top k options
                if top_k is not None:
                    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                    logits[logits < v[:, [-1]]] = -float('Inf')
                # apply softmax to convert logits to (normalized) probabilities
                probs = F.softmax(logits, dim=-1)
                idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

def get_lr(it, all):
    warmup_iters = 0
    lr_decay_iters = all
    min_lr = learning_rate / 10

    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

def init_model():
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    model = Transformer(lm_config).to(device)
    print(f'Total LLM parameters: {count_parameters(model) / 1e6:.3f} M')
    return model


if __name__ == "__main__":
    # -----------------------------------------------------------------------------
    lm_config = MyPretrainConfig()
    max_seq_len = lm_config.max_seq_len
    out_dir = 'out'
    epochs = 20             # number of training epochs
    batch_size = 8          # batch size
    learning_rate = 1e-4    # learning rate
    device = 'cuda:0'       # or cpu
    dtype = 'bfloat16'
    save_dir = os.path.join(out_dir)
    os.makedirs(save_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)
    tokens_per_iter = batch_size * max_seq_len
    torch.manual_seed(1337)
    device_type = device if "cuda" in device else "cpu"
    print(f"device_type: {device_type}")
    ctx = (
        nullcontext()
        if device_type == "cpu"
        else torch.cuda.amp.autocast()
    )
    # -----------------------------------------------------------------------------

    # -----init dataloader------
    data_path_list = [f'{basepath}/pretrain_data.bin']
    train_ds = PretrainDataset(data_path_list, max_length=max_seq_len, memmap=True)
    train_sampler = None
    num_workers = 16  # adjust according to the number of CPU cores on your system
    train_loader = DataLoader(
        train_ds,
        batch_size=batch_size,
        pin_memory=True,
        drop_last=False,
        shuffle=False,
        num_workers=num_workers,
        sampler=train_sampler
    )

    # init model
    model = init_model()
    print(model)
    scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))  # grad scaling is only needed for fp16 autocast
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # training loop
    accumulation_steps = 8
    iter_per_epoch = len(train_loader)
    for epoch in range(epochs):
        start_time = time.time()

        for step, (X, Y) in enumerate(train_loader):
            X = X.to(device)
            Y = Y.to(device)

            # Set the learning rate for this iteration
            lr = get_lr(epoch * iter_per_epoch + step, epochs * iter_per_epoch)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr

            # Forward pass and loss
            with ctx:
                out = model(X, Y)
                loss = out.last_loss

            # Backward pass
            scaler.scale(loss).backward()

            # Gradient clipping and parameter update every accumulation_steps micro-batches
            if (step + 1) % accumulation_steps == 0:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                scaler.step(optimizer)
                scaler.update()

                # Zero the gradients only after an optimizer step so they can
                # accumulate across the intermediate micro-batches
                optimizer.zero_grad(set_to_none=True)

            if step % 100 == 0:
                spend_time = time.time() - start_time
                print(
                    'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.7f} epoch_Time:{}min:'.format(
                        epoch,
                        epochs,
                        step,
                        iter_per_epoch,
                        loss.item(),
                        optimizer.param_groups[-1]['lr'],
                        spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
                model.eval()
                ckp = f'{save_dir}/pretrain_{lm_config.dim}.pth'
                state_dict = model.state_dict()
                torch.save(state_dict, ckp)
                model.train()

References

(1)https://github.com/jingyaogong/minimind?tab=readme-ov-file#%E6%95%B0%E6%8D%AE%E9%9B%86%E4%B8%8B%E8%BD%BD%E5%9C%B0%E5%9D%80
(2)https://github.com/karpathy/llama2.c/blob/master/train.py
