A Deep Dive into the Multimodal Large Model Qwen2 (Original)

一起AI技術

Published 2024-11-15 15:09


Preface

In this chapter we take a closer look at Qwen2-VL and try out its multimodal capabilities on video.

Materials

Paper title: "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution"

Paper URL: https://arxiv.org/pdf/2409.12191

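Reading the paper

Key points of the paper

According to the Qwen2-VL paper, the model introduces three key upgrades to further strengthen its perception and understanding of visual information, video included:

Naive Dynamic Resolution: lets the model process images of any resolution without changing the model architecture.
Multimodal Rotary Position Embedding (M-RoPE): encodes position along three dimensions (temporal, height, and width), which models the positional information of multimodal inputs.
Unified image and video understanding: a mixed training regimen combines image and video data so that the model is proficient at both image understanding and video understanding.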

Upgrade 1: Naive Dynamic Resolution

Model architecture

[Figure: model architecture diagram]

Original text

Naive Dynamic Resolution A key architectural improvement in Qwen2-VL is the introduction of naive dynamic resolution support (Dehghani et al., 2024). Unlike Qwen-VL, Qwen2-VL can now process images of any resolution, dynamically converting them into a variable number of visual tokens. To support this feature, we modified ViT by removing the original absolute position embeddings and introducing 2D-RoPE (Su et al., 2024; Su, 2021) to capture the two-dimensional positional information of images. At the inference stage, images of varying resolutions are packed into a single sequence, with the packed length controlled to limit GPU memory usage. Furthermore, to reduce the visual tokens of each image, a simple MLP layer is employed after the ViT to compress adjacent 2 × 2 tokens into a single token, with the special <|vision_start|> and <|vision_end|> tokens placed at the beginning and end of the compressed visual tokens. As a result, an image with a resolution of 224 × 224, encoded with a ViT using patch_size=14, will be compressed to 66 tokens before entering LLM.

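Paper translation

Naive Dynamic Resolution is one of the key architectural improvements in Qwen2-VL. Unlike its predecessor, Qwen2-VL can now process images of any resolution and dynamically convert them into a variable number of visual tokens. To support this, we modified the ViT, removing the original absolute position embeddings and introducing 2D-RoPE to capture the two-dimensional positional information of images. At inference time, images of different resolutions are packed into a single sequence, with the packed length capped to limit GPU memory usage. In addition, to reduce the number of visual tokens per image, a simple MLP layer after the ViT compresses each group of adjacent 2×2 tokens into one token, and the special <|vision_start|> and <|vision_end|> tokens are placed at the start and end of the compressed visual tokens. As a result, a 224×224 image encoded with patch_size = 14 is compressed to 66 tokens before entering the LLM.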

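Understanding the paper

Image patches: in a Vision Transformer (ViT), the image is split into small patches. patch_size = 14 means each patch is 14x14 pixels.

Image resolution: suppose the input image is 224×224 pixels.
Number of patches:

Horizontal: 224 / 14 = 16

Vertical: 224 / 14 = 16, so the total is 16 × 16 = 256 patches.

Compressing visual tokens: to reduce the number of visual tokens fed to the model, Qwen2-VL uses a simple MLP layer that compresses each group of adjacent 2x2 patches into one visual token. Each 2x2 group contains 4 patches, so the 256 patches are compressed to 256 / 4 = 64 visual tokens.
Special tokens: two special tokens, <|vision_start|> and <|vision_end|>, are added to the compressed visual token sequence to mark where the visual information starts and ends. The final count is therefore 64 + 2 = 66 tokens.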
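The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting rule described in the paper, not code from the model; the function name `visual_token_count` and its parameters are made up for illustration, and it assumes the image height and width are already multiples of patch_size × 2.

```python
def visual_token_count(height, width, patch_size=14, merge_size=2, special_tokens=2):
    """Count the tokens one image contributes, following the paper's arithmetic.

    Hypothetical helper: assumes height and width are multiples of
    patch_size * merge_size (28 with the defaults).
    """
    unit = patch_size * merge_size
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    patches = (height // patch_size) * (width // patch_size)  # 16 * 16 = 256 for 224x224
    merged = patches // (merge_size ** 2)                      # 2x2 merge: 256 / 4 = 64
    return merged + special_tokens                             # plus <|vision_start|>, <|vision_end|>


print(visual_token_count(224, 224))  # 66
```

Doubling the side length to 448×448 quadruples the merged tokens (256 instead of 64), which is why the number of visual tokens varies with resolution.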

Upgrade 2: Multimodal Rotary Position Embedding

Model architecture

[Figure: model architecture diagram]

Original text

Multimodal Rotary Position Embedding (M-RoPE) Another key architectural enhancement is the innovation of Multimodal Rotary Position Embedding (M-RoPE). Unlike the traditional 1D-RoPE in LLMs, which is limited to encoding one-dimensional positional information, M-RoPE effectively models the positional information of multimodal inputs. This is achieved by deconstructing the original rotary embedding into three components: temporal, height, and width. For text inputs, these components utilize identical position IDs, making M-RoPE functionally equivalent to 1D-RoPE (Su, 2024). When processing images, the temporal IDs of each visual token remain constant, while distinct IDs are assigned to the height and width components based on the token’s position in the image. For videos, which are treated as sequences of frames, the temporal ID increments for each frame, while the height and width components follow the same ID assignment pattern as images. In scenarios where the model’s input encompasses multiple modalities, position numbering for each modality is initialized by incrementing the maximum position ID of the preceding modality by one. An illustration of M-RoPE is shown in Figure 3. M-RoPE not only enhances the modeling of positional information but also reduces the value of position IDs for images and videos, enabling the model to extrapolate to longer sequences during inference.

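Paper translation

Multimodal Rotary Position Embedding (M-RoPE): another key architectural enhancement is the introduction of M-RoPE. Unlike the traditional 1D-RoPE used in large language models, which can only encode one-dimensional positional information, M-RoPE effectively models the positional information of multimodal inputs. It does so by decomposing the original rotary embedding into three components: temporal, height, and width. For text inputs, the three components use identical position IDs, which makes M-RoPE functionally equivalent to 1D-RoPE. When processing images, the temporal ID of every visual token stays constant, while the height and width components receive different IDs according to the token's position in the image. Videos are treated as sequences of frames: the temporal ID increments with each frame, while the height and width components follow the same assignment pattern as for images. When the input contains several modalities, the position numbering of each modality starts at the maximum position ID of the preceding modality plus one. Figure 3 shows an illustration of M-RoPE. M-RoPE not only improves the modeling of positional information but also lowers the numeric values of the position IDs for images and videos, which lets the model extrapolate to longer sequences at inference time.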

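Understanding the paper

Position embedding: a position embedding tells the model where each element sits in the input. For example, when processing text the model needs to know that in "我愛你" ("I love you"), "我" is the first word and "愛" is the second.
M-RoPE: the M-RoPE introduced by Qwen2-VL is a more elaborate scheme that handles not only text but also images and video. M-RoPE splits the position embedding into three components:

Temporal: applies to video or sequential data and indicates the order of frames.

Height and width: apply to images and give the position (row and column) of each visual token in the image.

How each data type is handled:

For text input:

Identical IDs: the temporal, height, and width components of each text token share the same position ID, and tokens are numbered in sequence order, so M-RoPE behaves like 1D-RoPE.

For image input:
Constant temporal ID: every visual token (patch) in the image keeps the same temporal ID, while the height and width IDs vary with the token's position in the image. For example, the top-left patch might be (1,1) and the bottom-right patch (16,16).
For video input:
Incrementing temporal ID: each frame of the video has a different temporal ID that indicates its order in the sequence. Within each frame, the height and width components are still assigned by position in the image.

ID initialization across modalities: when the input mixes modalities, for example text together with an image, M-RoPE starts each modality's position numbering at the maximum position ID of the preceding modality plus one, so the IDs do not collide. Text that follows an image, for instance, starts at the image's maximum ID plus one.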
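The assignment rules above can be sketched in plain Python. This is a simplified illustration written from the paper's description, not the model's actual implementation; the function `mrope_position_ids` and the segment format are assumptions made for this example, and IDs start at 0 here.

```python
def mrope_position_ids(segments):
    """Assign (temporal, height, width) position IDs to a mixed input.

    segments is a list of ("text", n_tokens) or ("vision", frames, rows, cols),
    where rows and cols count visual tokens per frame. Hypothetical format,
    used only for this sketch.
    """
    ids = []
    start = 0  # first free position ID for the next segment
    for seg in segments:
        if seg[0] == "text":
            _, n_tokens = seg
            # Text: all three components share one ID, as in 1D-RoPE.
            seg_ids = [(start + i, start + i, start + i) for i in range(n_tokens)]
        else:
            _, frames, rows, cols = seg
            # Vision: temporal ID per frame, height/width IDs per row/column.
            seg_ids = [
                (start + t, start + r, start + c)
                for t in range(frames)
                for r in range(rows)
                for c in range(cols)
            ]
        ids.extend(seg_ids)
        if seg_ids:
            # The next modality starts at the maximum ID used so far plus one.
            start = max(max(triple) for triple in seg_ids) + 1
    return ids


# 3 text tokens, one image of 2x2 visual tokens, then 2 more text tokens.
for triple in mrope_position_ids([("text", 3), ("vision", 1, 2, 2), ("text", 2)]):
    print(triple)
```

In this example the four image tokens only advance the position counter by 2 (from 3 to 5) rather than by 4, which illustrates the paper's point that M-RoPE keeps position IDs small for images and videos.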

Upgrade 3: Unified image and video understanding

Original text

Unified Image and Video Understanding Qwen2-VL employs a mixed training regimen incorporating both image and video data, ensuring proficiency in image understanding and video comprehension. To preserve video information as completely as possible, we sampled each video at two frames per second. Additionally, we integrated 3D convolutions (Carreira and Zisserman, 2017) with a depth of two to process video inputs, allowing the model to handle 3D tubes instead of 2D patches, thus enabling it to process more video frames without increasing the sequence length (Arnab et al., 2021). For consistency, each image is treated as two identical frames. To balance the computational demands of long video processing with overall training efficiency, we dynamically adjust the resolution of each video frame, limiting the total number of tokens per video to 16384. This training approach strikes a balance between the model’s ability to comprehend long videos and training efficiency.

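Paper translation

Unified image and video understanding: Qwen2-VL uses a mixed training regimen that combines image and video data, ensuring proficiency in both image understanding and video understanding. To preserve video information as completely as possible, each video is sampled at two frames per second. In addition, 3D convolutions with a depth of two are used to process video inputs, so the model handles 3D tubes instead of 2D patches and can therefore process more video frames without increasing the sequence length. For consistency, each image is treated as two identical frames. To balance the compute required for long videos against overall training efficiency, the resolution of each video frame is adjusted dynamically so that the total number of tokens per video does not exceed 16384. This training approach balances the model's ability to understand long videos against training efficiency.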
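To see how the sampling rate, the depth-two 3D convolution, and the 16384-token limit interact, here is a rough back-of-the-envelope sketch. The helper `video_token_count` is hypothetical and only follows the numbers quoted above (2 frames per second, two frames per tube, 14-pixel patches, 2×2 merging); it assumes frame sizes that are multiples of 28 and is not how the training pipeline itself is written.

```python
MAX_VIDEO_TOKENS = 16384  # per-video limit quoted in the paper


def video_token_count(duration_s, height, width, fps=2.0,
                      temporal_patch_size=2, patch_size=14, merge_size=2):
    """Estimate the visual tokens for a video (illustrative arithmetic only)."""
    frames = int(duration_s * fps)
    # Two consecutive frames form one 3D tube; round up for an odd frame count.
    tubes = -(-frames // temporal_patch_size)
    tokens_per_tube = (height // patch_size // merge_size) * (width // patch_size // merge_size)
    return tubes * tokens_per_tube


one_minute = video_token_count(60, 448, 448)
two_minutes = video_token_count(120, 448, 448)
print(one_minute, one_minute <= MAX_VIDEO_TOKENS)    # 15360 True
print(two_minutes, two_minutes <= MAX_VIDEO_TOKENS)  # 30720 False
```

Under these assumptions a two-minute clip at 448×448 exceeds the budget, so the frame resolution has to be lowered, which matches the dynamic resolution adjustment described in the paper.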

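Model deployment (with flash_attention)

In the previous chapter, "[Course Summary] Day 31: A First Look at Multimodal Large Models", we deployed the Qwen2-VL model. Multimodal large models take up a lot of GPU memory, so here we use flash_attention to speed up inference and reduce memory usage.

Prepare the environment

Step 1: Start a PAI-DSW GPU environment on the ModelScope platform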

# Check the CUDA version (run in a shell)
nvcc --version

# Check the PyTorch version (run in Python)
import torch
print(torch.__version__)
print(torch.cuda.is_available())

Output:

[Screenshot: output of the version checks]

The environment has CUDA 12.1 and PyTorch 2.3.1.

Pull the code

Step 2: Download the Tongyi Qianwen Qwen2-VL-2B-Instruct model

# Make sure git lfs is installed
git lfs install

# Download the model
git clone https://www.modelscope.cn/Qwen/Qwen2-VL-2B-Instruct.git

Install flash_attention

Step 3: Install flash_attention

pip install flash-attn

Output:

[Screenshot: installation output]

Import the required libraries

from transformers import Qwen2VLForConditionalGeneration
from transformers import AutoTokenizer
from transformers import AutoProcessor
import torch
from qwen_vl_utils import process_vision_info

Load the model

# Set the model path
model_dir = "Qwen2-VL-2B-Instruct"

# Load the model with flash-attention
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

Output:

[Screenshot: model loading output]
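If flash-attn is not installed in the environment, loading with attn_implementation="flash_attention_2" fails. A small guard like the one below can choose a fallback. This is a sketch and not part of the original walkthrough: the helper name `pick_attn_implementation` is made up, and it assumes the installed transformers version accepts "sdpa" (PyTorch's built-in scaled dot-product attention) as an alternative value.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa".

    This only checks that the package is installed; it does not verify that
    the GPU and CUDA version actually support FlashAttention.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can then be passed as the attn_implementation argument of Qwen2VLForConditionalGeneration.from_pretrained in place of the hard-coded value.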

Model structure

After loading, printing model shows the structure of Qwen2-VL:

Qwen2VLForConditionalGeneration(
  (visual): Qwen2VisionTransformerPretrainedModel(
    (patch_embed): PatchEmbed(
      (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
    )
    (rotary_pos_emb): VisionRotaryEmbedding()
    (blocks): ModuleList(
      (0-31): 32 x Qwen2VLVisionBlock(
        (norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (attn): VisionFlashAttention2(
          (qkv): Linear(in_features=1280, out_features=3840, bias=True)
          (proj): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (mlp): VisionMlp(
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (act): QuickGELUActivation()
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        )
      )
    )
    (merger): PatchMerger(
      (ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
      (mlp): Sequential(
        (0): Linear(in_features=5120, out_features=5120, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=5120, out_features=1536, bias=True)
      )
    )
  )
  (model): Qwen2VLModel(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2VLDecoderLayer(
        (self_attn): Qwen2VLFlashAttention2(
          (q_proj): Linear(in_features=1536, out_features=1536, bias=True)
          (k_proj): Linear(in_features=1536, out_features=256, bias=True)
          (v_proj): Linear(in_features=1536, out_features=256, bias=True)
          (o_proj): Linear(in_features=1536, out_features=1536, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1536, out_features=8960, bias=False)
          (up_proj): Linear(in_features=1536, out_features=8960, bias=False)
          (down_proj): Linear(in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((1536,), eps=1e-06)
  )
  (lm_head): Linear(in_features=1536, out_features=151936, bias=False)
)

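Notes:

The Qwen2-VL model consists of two main parts: a vision encoder and a language model.
Vision encoder (Qwen2VisionTransformerPretrainedModel):

Patch Embedding: uses Conv3d to embed the image, splitting it into patches and extracting features. The kernel size is (2, 14, 14), and the stride is also (2, 14, 14).

Rotary Positional Embedding: as described in the paper, rotary position embeddings strengthen the vision model's perception of position.

Transformer Blocks: 32 Qwen2VLVisionBlock modules, each with two Layer Normalization layers, an attention module, and an MLP; the attention module uses a Linear layer for the QKV (query, key, value) projection.

Patch Merger: merges the extracted features using LayerNorm and an MLP (multilayer perceptron).

Language model (Qwen2VLModel):
Token Embedding: an Embedding layer turns the input text tokens into dense vectors of dimension 1536.
Decoder Layers: 28 Qwen2VLDecoderLayer modules, each with self-attention and an MLP; the self-attention (Qwen2VLFlashAttention2) computes attention from linear Q, K, and V projections and uses rotary embeddings to encode sequence position.
Norm Layer: uses Qwen2RMSNorm for normalization, which helps keep training stable.
Output layer (lm_head):
A final linear layer maps the model output back to the vocabulary size (151936) to generate text.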
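A few of the numbers in the printed structure can be related to the paper with simple arithmetic. The sketch below only uses values copied from the model printout; the interpretation in the comments (2×2 merging, and grouped-query attention with equally sized heads) is my reading of those shapes rather than something the printout states.

```python
# Values copied from the printed model structure.
in_channels, kernel = 3, (2, 14, 14)   # Conv3d(3, 1280, kernel_size=(2, 14, 14))
vision_dim, merger_in = 1280, 5120     # ViT width, PatchMerger MLP input width
q_out, kv_out = 1536, 256              # q_proj and k_proj/v_proj output widths

# Each 3D patch (2 frames x 14 x 14 pixels x 3 channels) is projected to 1280 dims.
values_per_patch = in_channels * kernel[0] * kernel[1] * kernel[2]

# The merger concatenates 5120 / 1280 = 4 patch features, i.e. one 2x2 group.
patches_per_merged_token = merger_in // vision_dim

# K/V projections are 6x narrower than Q, consistent with grouped-query
# attention (assuming all heads have the same size).
query_to_kv_ratio = q_out // kv_out

print(values_per_patch, patches_per_merged_token, query_to_kv_ratio)  # 1176 4 6
```

The temporal kernel size of 2 in the Conv3d is also what the paper means by 3D convolutions "with a depth of two", and it matches temporal_patch_size in the processor configuration.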

Load the processor

processor = AutoProcessor.from_pretrained(model_dir)

Processor configuration

Printing the processor gives the following:

Qwen2VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer:Qwen2TokenizerFast(name_or_path='Qwen2-VL-2B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token':'<|im_end|>','pad_token':'<|endoftext|>','additional_special_tokens':['<|im_start|>','<|im_end|>','<|object_ref_start|>','<|object_ref_end|>','<|box_start|>','<|box_end|>','<|quad_start|>','<|quad_end|>','<|vision_start|>','<|vision_end|>','<|vision_pad|>','<|image_pad|>','<|video_pad|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
151643:AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151644:AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151645:AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151646:AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151647:AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151648:AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151649:AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151650:AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151651:AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151652:AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151653:AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151654:AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151655:AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151656:AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

{
"chat_template":"{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
"processor_class":"Qwen2VLProcessor"
}

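Notes:

Image processor (Qwen2VLImageProcessor)

Convert to RGB - do_convert_rgb: set to true, so input images are converted to RGB to keep the color channels consistent.
Normalize - do_normalize: set to true, so images are standardized to the mean and variance the model expects.
Rescale - do_rescale: set to true, so pixel values are scaled to the range [0, 1].
Resize - do_resize: set to true, so images are resized to an input size the model accepts.
Mean and standard deviation: image_mean: [0.48145466, 0.4578275, 0.40821073], the per-channel mean used for normalization. image_std: [0.26862954, 0.26130258, 0.27577711], the per-channel standard deviation used for normalization.
Pixel limits: max_pixels: 12845056, the maximum number of pixels in a processed image. min_pixels: 3136, the minimum number of pixels in a processed image.
Patch size - patch_size: 14, the side length of the patches the image is split into.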
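The configuration values above fit together numerically. The following sketch reproduces the rescale and normalize steps for a single RGB pixel and converts the pixel limits into token counts. It is an illustration built from the printed configuration, not the image processor's real code, and `preprocess_pixel` is a made-up helper name.

```python
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255
PATCH_SIZE, MERGE_SIZE = 14, 2
MIN_PIXELS, MAX_PIXELS = 3136, 12845056


def preprocess_pixel(rgb):
    """Rescale one 0-255 RGB pixel to [0, 1], then normalize per channel."""
    return [
        (value * RESCALE_FACTOR - mean) / std
        for value, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    ]


# One merged visual token covers a (14 * 2) x (14 * 2) = 28 x 28 pixel area.
pixels_per_token = (PATCH_SIZE * MERGE_SIZE) ** 2
min_tokens = MIN_PIXELS // pixels_per_token
max_tokens = MAX_PIXELS // pixels_per_token

print(preprocess_pixel([255, 255, 255]))
print(pixels_per_token, min_tokens, max_tokens)  # 784 4 16384
```

Read this way, min_pixels and max_pixels bound an image to between 4 and 16384 visual tokens; the upper bound is the same 16384 figure the paper quotes as the per-video token limit.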

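Tokenizer (Qwen2TokenizerFast)

Vocabulary size - vocab_size: 151643, the size of the base vocabulary; the added special tokens listed above take IDs 151643 through 151656.
Maximum length - model_max_length: 32768, the maximum sequence length, in tokens, that the model can handle.
Fast mode - is_fast: set to True, meaning the fast tokenizer is used for better throughput.
Padding and truncation:

padding_side: 'left', meaning padding is added on the left of the text.

truncation_side: 'right', meaning text is truncated on the right.

Special tokens - special_tokens: includes several special tokens, for example:
<|vision_start|> and <|vision_end|>, which mark the start and end of visual content.
<|vision_pad|>, <|image_pad|>, and <|video_pad|>, which are placeholder tokens for visual, image, and video content.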
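The effect of padding_side='left' is easiest to see on a toy batch. The function below is a hypothetical stand-in for what a tokenizer does when padding=True, written only to illustrate the idea; the pad ID 151643 is the ID of <|endoftext|> from the printout above, and the other token IDs are arbitrary.

```python
def pad_batch(sequences, pad_id, padding_side="left"):
    """Pad token-ID lists to a common length on the chosen side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + list(seq) if padding_side == "left" else list(seq) + fill)
    return padded


PAD_ID = 151643  # <|endoftext|>, the pad_token shown in the tokenizer printout
batch = pad_batch([[11, 12, 13], [21]], PAD_ID)
print(batch)  # [[11, 12, 13], [151643, 151643, 21]]
```

With left padding every row ends on a real prompt token, so in batched generation the newly generated tokens follow the prompt directly instead of following pad tokens.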

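Build the conversation messages

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://17aitech.com/wp-content/uploads/2024/10/missile.jpeg",
            },
            # Prompt: "Describe this picture and, if possible, give the specific model and specifications."
            {"type": "text", "text": "描述一下這張圖片，可以的話給出具體參數型號."},
        ],
    }
]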

Notes:

The image URL is https://17aitech.com/wp-content/uploads/2024/10/missile.jpeg
qwen_vl_utils downloads the image from this URL automatically.
The image looks like this:

[Image: a missile]

Missile

Data preprocessing

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

Notes:

Inspecting text shows the rendered conversation template: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>描述一下這張圖片，可以的話給出具體參數型號.<|im_end|>\n<|im_start|>assistant\n'
Here <|image_pad|> is the placeholder for the image: it marks where the image's visual tokens are inserted.
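To make the template logic easier to follow, here is a plain-Python re-implementation of the chat_template shown in the processor printout. It is a reading aid and a sketch, not what processor.apply_chat_template runs internally: the function name `render_chat` is made up, the add_vision_id branch ("Picture 1: ", "Video 1: ") is left out, and the image path and prompt in the example are placeholders.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render messages the way the printed Jinja chat_template does (simplified)."""
    parts = []
    for index, message in enumerate(messages):
        # A default system prompt is added when the first message is not a system message.
        if index == 0 and message["role"] != "system":
            parts.append("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n")
        parts.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)
        else:
            for item in content:
                if item.get("type") == "image" or "image" in item or "image_url" in item:
                    parts.append("<|vision_start|><|image_pad|><|vision_end|>")
                elif item.get("type") == "video" or "video" in item:
                    parts.append("<|vision_start|><|video_pad|><|vision_end|>")
                elif "text" in item:
                    parts.append(item["text"])
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.jpg"},  # placeholder path
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(render_chat(example))
```

The rendered string has the same shape as the text value described above: a default system turn, the user turn with the vision placeholder followed by the prompt, and an open assistant turn for the model to complete.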

Model inference

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Output:

[Screenshot: the model's description of the image]
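The list comprehension in the inference code removes the prompt from each generated sequence, because for a decoder-only model generate returns the prompt tokens followed by the new tokens. The toy example below shows the same slicing on plain Python lists; the token IDs are arbitrary and `trim_prompt` is just a name chosen for this sketch.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


prompt_batch = [[101, 102, 103], [0, 201, 202]]               # second row is left-padded
generated_batch = [[101, 102, 103, 7, 8], [0, 201, 202, 9, 10]]
print(trim_prompt(prompt_batch, generated_batch))  # [[7, 8], [9, 10]]
```

Only the trimmed IDs are passed to batch_decode, so the printed text contains the model's answer without the system prompt or the user's question.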

Recognizing an animated GIF

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://17aitech.com/wp-content/uploads/2024/09/%E6%A3%80%E7%B4%A2%E5%88%B0%E7%AD%94%E6%A1%88.gif",
            },
            # Prompt: "Describe this picture."
            {"type": "text", "text": "描述一下這張圖片."},
        ],
    }
]

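Original GIF:

[Image: the original animated GIF]

Recognition result:

[Screenshot: the model's description of the GIF]

Recognizing a video

First, download an .mp4 video to the local machine; the video comes from 好看視頻 (Haokan Video).

Note: I once worked on a project that measured software startup speed by counting video frames, so let's see whether the large model can produce that result easily.

Next, upload the video to the server.

[Screenshot: the uploaded video file on the server]

Then change the messages as follows: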

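messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file://start_speed.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            # Prompt: "Please describe this video, and also count the number of frames each of
            # the two phones takes from launch to display, and output the result."
            {"type": "text", "text": "請描述這段視頻，同時計算兩個手機各自從啟動到顯示各自的幀數并輸出結果."},
        ],
    }
]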

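Leave the rest of the code unchanged and run it. The result is as follows:

[Screenshot: the model's description of the video]

As the output shows, Qwen2-VL can recognize the content of the video. It did not give the frame counts for each phone, but it did identify the brands of the two phones and which one is faster.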
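When trying several videos or sampling rates, it can help to wrap the message structure in a small helper. The sketch below is hypothetical: `build_video_messages` and `approx_sampled_frames` are names invented here, the dictionary keys are the ones used in the messages above, and the frame estimate assumes the fps field means "frames sampled per second", which is my reading of the field and not something verified against qwen_vl_utils.

```python
def build_video_messages(video_path, prompt, fps=1.0, max_pixels=360 * 420):
    """Build a single-turn user message containing a local video and a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": f"file://{video_path}",
                    "max_pixels": max_pixels,
                    "fps": fps,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]


def approx_sampled_frames(duration_s, fps):
    """Rough estimate of how many frames are sampled (assumption, see lead-in)."""
    return int(duration_s * fps)


messages = build_video_messages("start_speed.mp4", "Describe this video.")
print(messages[0]["content"][0]["video"])  # file://start_speed.mp4
print(approx_sampled_frames(12, 1.0))      # 12
```

If that reading of fps is right, sampling at 1.0 means the model sees about one frame per second, so it has little basis for counting individual frames of the original recording; that may be one reason it compared the two phones without giving frame counts.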

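Summary

To strengthen the model, Qwen2-VL makes three main improvements:

1. Naive Dynamic Resolution: lets the model process images of any resolution without changing the model architecture.

2. Multimodal Rotary Position Embedding: encodes position along three dimensions (temporal, height, and width), which models the positional information of multimodal inputs.

3. Unified image and video understanding: a mixed training regimen combines image and video data so that the model is proficient at both image understanding and video understanding.

The Qwen2-VL model consists mainly of a vision encoder and a language model.
Qwen2-VL can be accelerated with FlashAttention; check the CUDA and torch versions before using it.
Besides still images, Qwen2-VL can also recognize animated GIFs and videos.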

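References

Zhihu: [Close Reading] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

This article is reposted from the WeChat public account 一起AI技術. Author: Dongming

Original link: https://mp.weixin.qq.com/s/Lo8aPBkIenwgcRy8WAlLlQ

Copyright belongs to the author. If you repost this article, please credit the source; otherwise legal liability will be pursued.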
