Following the processing order, the input question text first goes through the tokenizer, which splits the sentence into tokens; a word can be broken into a [stem] part and a [feature/affix] part.

Tokenizing a word is similar to splitting "friendly" into two pieces: the word itself is decomposed into a base form plus a reusable piece shared by many words, so only the base forms need their own entries, which reduces the required vocabulary size.

Later in processing, each token is mapped to a word vector of some fixed dimensionality (e.g. 200, 300, 768, or 1536) made up of floating-point values.

What comes out of this step is a sequence of input_ids, which is then fed into the backbone of Qwen2.
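A minimal sketch of this step with the Hugging Face tokenizer (the checkpoint name is just an example; any Qwen2-family tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

# Example checkpoint name; substitute whichever Qwen2 checkpoint you actually use.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")

text = "unfriendly weather"
print(tokenizer.tokenize(text))      # subword pieces; a word may split into a stem plus smaller pieces
encoded = tokenizer(text, return_tensors="pt")
print(encoded["input_ids"])          # the input_ids tensor that is fed into the Qwen2 backbone
```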
```python
class Qwen2Model(Qwen2PreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]

    Args:
        config: Qwen2Config
    """

    def __init__(self, config: Qwen2Config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size  # vocabulary size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)  # token embedding lookup (torch.nn.Embedding)
        self.layers = nn.ModuleList(
            [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self._attn_implementation = config._attn_implementation
        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()
```

- Sets two model attributes: `padding_idx` (the index of the padding token) and `vocab_size` (the vocabulary size)
- Initializes the embedding layer, the decoder layers, and the normalization layer
- Embedding layer (`nn.Embedding`): maps the input token ids to dense vector representations.
- Decoder layers (`nn.ModuleList()`): the model stacks several decoder layers, each defined by `Qwen2DecoderLayer`
- Normalization layer `Qwen2RMSNorm`: uses Root Mean Square Layer Normalization (a minimal sketch follows this list)
- `gradient_checkpointing` toggles gradient checkpointing, which is mainly used to save GPU memory
- `post_init()` finishes weight initialization and some preparatory checks
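For reference, a minimal RMSNorm sketch written from the formula (the real `Qwen2RMSNorm` also handles dtype casting around the statistics):

```python
import torch
import torch.nn as nn

class SimpleRMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescale by 1/RMS(x); no mean subtraction and no bias."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)   # mean of squares over the hidden dimension
        x = x * torch.rsqrt(variance + self.eps)     # divide by the root mean square
        return self.weight * x
```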
The embedding uses `torch.nn.Embedding`; my knowledge here is still limited, so this part will be filled in gradually later.
After the input_ids pass through the embedding layer, the resulting vectors are fed into the decoder layers for processing.

This part is a multi-layer (stacked) structure.
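A minimal sketch of that embedding lookup (the sizes below are made-up toy values, not the real Qwen2 config):

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, pad_token_id = 1000, 64, 0   # toy values for illustration only
embed_tokens = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id)

input_ids = torch.tensor([[5, 42, 7, 0]])             # (batch=1, seq_len=4); 0 is the padding id here
inputs_embeds = embed_tokens(input_ids)               # (1, 4, hidden_size): one vector per token
print(inputs_embeds.shape)                            # torch.Size([1, 4, 64])
```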
```python
hidden_states = inputs_embeds

# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None

for decoder_layer in self.layers:
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    if self.gradient_checkpointing and self.training:
        layer_outputs = self._gradient_checkpointing_func(
            decoder_layer.__call__,
            hidden_states,
            attention_mask,
            position_ids,
            past_key_values,
            output_attentions,
            use_cache,
        )
    else:
        layer_outputs = decoder_layer(
            hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_values,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )

    hidden_states = layer_outputs[0]

    if use_cache:
        next_decoder_cache = layer_outputs[2 if output_attentions else 1]

    if output_attentions:
        all_self_attns += (layer_outputs[1],)

hidden_states = self.norm(hidden_states)

# add hidden states from the last decoder layer
if output_hidden_states:
    all_hidden_states += (hidden_states,)

next_cache = None
if use_cache:
    next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache

if not return_dict:
    return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
return BaseModelOutputWithPast(
    last_hidden_state=hidden_states,
    past_key_values=next_cache,
    hidden_states=all_hidden_states,
    attentions=all_self_attns,
)
```

The overall logic: hidden_states is passed through each decoder layer in turn, and each layer's output becomes the next layer's input (with the per-layer hidden states optionally collected along the way); after the last layer, a final RMSNorm is applied and that final hidden state is appended before everything is returned (see the sketch below).
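A toy restatement of that loop, with dummy modules standing in for `Qwen2DecoderLayer` and `Qwen2RMSNorm` just to show the data flow (not the real layers):

```python
import torch
import torch.nn as nn

hidden_size, num_layers = 16, 4
layers = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(num_layers))  # stand-ins for Qwen2DecoderLayer
norm = nn.LayerNorm(hidden_size)                                                        # stand-in for the final Qwen2RMSNorm

hidden_states = torch.randn(1, 8, hidden_size)   # plays the role of inputs_embeds: (batch, seq_len, hidden)
for layer in layers:
    hidden_states = layer(hidden_states)         # each layer's output is the next layer's input
hidden_states = norm(hidden_states)              # one final normalization after the last layer
```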
Now for the decoder layer itself:
```python
class Qwen2DecoderLayer(nn.Module):
    def __init__(self, config: Qwen2Config, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        if config.use_sliding_window and config._attn_implementation != "flash_attention_2":
            logger.warning_once(
                f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
                "unexpected results may be encountered."
            )
        self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        self.mlp = Qwen2MLP(config)
        self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        **kwargs,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        if "padding_mask" in kwargs:
            warnings.warn(
                "Passing `padding_mask` is deprecated and will be removed in v4.37. "
                "Please make sure use `attention_mask` instead.`"
            )
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, sequence_length)` where padding elements are indicated by 0.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
        """

        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs
```

Save the residual:

```python
residual = hidden_states
```

Normalize (RMSNorm):

```python
self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
hidden_states = self.input_layernorm(hidden_states)
```

Feed the result into the attention module:

```python
self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
    hidden_states=hidden_states,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_value=past_key_value,
    output_attentions=output_attentions,
    use_cache=use_cache,
)
```

This produces a new hidden_states, which is added to the residual saved earlier:

```python
hidden_states = residual + hidden_states
```

Save a new residual:

```python
residual = hidden_states
```

Go through RMSNorm again:

```python
hidden_states = self.post_attention_layernorm(hidden_states)
```

Apply the MLP:

```python
hidden_states = self.mlp(hidden_states)
```

Add the result to the residual:

```python
hidden_states = residual + hidden_states
```

Then output the result.

So, summarizing the key steps above, it is easy to see that

Decoder = MLP + attn + norm
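That pre-norm residual wiring can be compressed into a few lines. The sketch below uses toy `nn.Linear` / `nn.LayerNorm` stand-ins for the real attention, MLP, and RMSNorm modules; only the order of operations matches `Qwen2DecoderLayer`:

```python
import torch
import torch.nn as nn

class ToyDecoderLayer(nn.Module):
    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)           # stand-in for Qwen2RMSNorm
        self.self_attn = nn.Linear(hidden_size, hidden_size)       # stand-in for self-attention
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)  # stand-in for Qwen2RMSNorm
        self.mlp = nn.Linear(hidden_size, hidden_size)             # stand-in for Qwen2MLP

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # norm -> attention -> add residual
        hidden_states = hidden_states + self.self_attn(self.input_layernorm(hidden_states))
        # norm -> MLP -> add residual
        hidden_states = hidden_states + self.mlp(self.post_attention_layernorm(hidden_states))
        return hidden_states
```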
After that comes the output stage.

Now let's go back and look at attn, which is the real highlight.
First, the incoming hidden_states is projected into three parts (query, key, value):

```python
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
```

Then reshape and transpose them into the layout needed for the later matrix multiplications:

```python
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
```
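To make the shape bookkeeping concrete (the numbers are just example sizes):

```python
import torch

bsz, q_len, num_heads, head_dim = 1, 5, 8, 64            # example sizes; hidden_size = num_heads * head_dim = 512
hidden = torch.randn(bsz, q_len, num_heads * head_dim)   # (batch, seq_len, hidden_size)

# (bsz, q_len, hidden) -> (bsz, q_len, num_heads, head_dim) -> (bsz, num_heads, q_len, head_dim)
q = hidden.view(bsz, q_len, num_heads, head_dim).transpose(1, 2)
print(q.shape)   # torch.Size([1, 8, 5, 64])
```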
Apply rotary position embeddings (RoPE) to the query and key tensors: the cosine and sine components are multiplied with the query/key tensors and the results are summed, which realizes the rotary positional encoding.

```python
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
```
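Roughly, `apply_rotary_pos_emb` does the following (a sketch of the idea; the exact cos/sin cache indexing and signature differ between transformers versions):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Swap the two halves of the last dimension and negate the second half."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_sketch(q, k, cos, sin):
    # cos/sin are per-position tables already gathered for the current positions
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```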
Then repeat_kv is applied to the keys and values (I'm not sure whether the figure above depicts this correctly).

For small models this step is not strictly required.

```python
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
```
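`repeat_kv` duplicates each key/value head so that grouped-query attention lines up with the query heads; when the number of KV heads already equals the number of query heads it is a no-op, which is why the step is optional for such models. A sketch of the idea:

```python
import torch

def repeat_kv_sketch(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """(batch, num_kv_heads, seq_len, head_dim) -> (batch, num_kv_heads * n_rep, seq_len, head_dim)."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states  # nothing to repeat: KV heads already match the query heads
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)
```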
Compute the scaled dot-product attention scores:

```python
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
```

Add the causal attention mask, so each position can only attend to earlier positions (enforcing the left-to-right reading order):

```python
attn_weights = attn_weights + attention_mask
```
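The additive mask holds 0 where attention is allowed and a very large negative value where it is not, so after softmax the forbidden positions receive (almost) zero weight. A minimal causal-only example (the real mask also folds in padding and is broadcast over batch and heads):

```python
import torch

seq_len = 4
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(causal_mask)
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```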
Softmax, dropout, then multiply with value_states:

```python
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
```

Transpose and reshape the result, then apply the final output projection o_proj:

```python
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
```

Finally, return the outputs:

```python
return attn_output, attn_weights, past_key_value
```
Due to limits of ability and space, the mathematical details are not written up here yet; they will be added later.

Still, working through how Qwen operates has nicely tied together my earlier knowledge of GPT-like transformer models, which was quite interesting.