
Building a Large Model by Hand — How Qwen2 Works


Following the processing order, the input text (the question) first goes through the tokenizer.

Tokenizer

The tokenizer splits a sentence into individual tokens, and a word can be further split into a [stem] part and a [feature/affix] part.

Tokenizing a word is similar to splitting "friendly" into two pieces: the word is decomposed into a stem plus a reusable affix, so the model only needs to recognize the stem, which reduces the size of the vocabulary that is required.

Later in processing, each token corresponds to a word vector of a certain dimension (e.g., 200, 300, 768, or 1536) made up of floating-point numbers.

What the tokenizer produces is a sequence of input_ids, which is then fed into the Qwen2 backbone.
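As a concrete illustration, here is a minimal sketch of this step using the Hugging Face `transformers` tokenizer (the checkpoint name `Qwen/Qwen2-7B-Instruct` is an assumption for the example; any Qwen2 checkpoint works the same way):

```python
# Minimal sketch: text -> tokens -> input_ids
# assumes `transformers` is installed and the checkpoint name below is reachable
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

text = "The weather is friendly today"
tokens = tokenizer.tokenize(text)                           # "friendly" may be split into sub-word pieces
input_ids = tokenizer(text, return_tensors="pt").input_ids  # tensor of shape (1, seq_len)

print(tokens)
print(input_ids)
```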

Qwen2 Backbone

```python
class Qwen2Model(Qwen2PreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]

    Args:
        config: Qwen2Config
    """

    def __init__(self, config: Qwen2Config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size  # vocabulary size
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)  # torch.nn.Embedding lookup
        self.layers = nn.ModuleList(
            [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self._attn_implementation = config._attn_implementation
        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()
```

- Sets two attributes of the model: `padding_idx` (the index of the padding token) and `vocab_size` (the size of the vocabulary)
- Initializes the model's embedding layer, decoder layers, and normalization layer
- Embedding layer (`nn.Embedding`): the model uses an embedding layer to map input tokens to dense vector representations.
- Decoder layers (`nn.ModuleList()`): the model contains multiple decoder layers, each defined by `Qwen2DecoderLayer`
- Normalization layer `Qwen2RMSNorm`: the normalization uses Root Mean Square Layer Normalization
- Sets whether to use `gradient_checkpointing`, mainly to save GPU memory
- Calls `post_init()` to finish initialization and run some preparation checks

Embedding

This uses `torch.nn.Embedding`; limited by my current knowledge, this part will be expanded later.
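As a rough sketch of what the embedding lookup does (the sizes below are illustrative assumptions, not Qwen2's real configuration):

```python
# Illustrative sketch of the embedding lookup; vocab_size / hidden_size values are made up
import torch
import torch.nn as nn

vocab_size, hidden_size, pad_token_id = 151936, 8, 0      # hidden_size=8 only for readability
embed_tokens = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id)

input_ids = torch.tensor([[101, 2345, 7, 0]])             # (batch=1, seq_len=4); 0 is the padding id
inputs_embeds = embed_tokens(input_ids)                   # (1, 4, hidden_size)
print(inputs_embeds.shape)
```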

Hidden_states

After the input_ids pass through the embedding layer, the resulting vectors (inputs_embeds) are fed into the decoder layers for processing.

Decoder_layers

This part is a stack of multiple layers.

```python
hidden_states = inputs_embeds

# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None

for decoder_layer in self.layers:
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    if self.gradient_checkpointing and self.training:
        layer_outputs = self._gradient_checkpointing_func(
            decoder_layer.__call__,
            hidden_states,
            attention_mask,
            position_ids,
            past_key_values,
            output_attentions,
            use_cache,
        )
    else:
        layer_outputs = decoder_layer(
            hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_values,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )

    hidden_states = layer_outputs[0]

    if use_cache:
        next_decoder_cache = layer_outputs[2 if output_attentions else 1]

    if output_attentions:
        all_self_attns += (layer_outputs[1],)

hidden_states = self.norm(hidden_states)

# add hidden states from the last decoder layer
if output_hidden_states:
    all_hidden_states += (hidden_states,)

next_cache = None
if use_cache:
    next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache

if not return_dict:
    return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
return BaseModelOutputWithPast(
    last_hidden_state=hidden_states,
    past_key_values=next_cache,
    hidden_states=all_hidden_states,
    attentions=all_self_attns,
)
```

The overall logic: hidden_states are passed through the decoder layers one by one. Inside each layer the input is normalized, processed (attention + MLP), and added back to its input through residual connections; the result becomes the input of the next layer. After the last layer, hidden_states are normalized once more by self.norm, and (if output_hidden_states is set) the hidden states of every layer, including the last, are collected and returned.

As for the decoder layer itself:

```python
class Qwen2DecoderLayer(nn.Module):
    def __init__(self, config: Qwen2Config, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        if config.use_sliding_window and config._attn_implementation != "flash_attention_2":
            logger.warning_once(
                f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
                "unexpected results may be encountered."
            )
        self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        self.mlp = Qwen2MLP(config)
        self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        **kwargs,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        if "padding_mask" in kwargs:
            warnings.warn(
                "Passing `padding_mask` is deprecated and will be removed in v4.37. "
                "Please make sure use `attention_mask` instead.`"
            )
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, sequence_length)` where padding elements are indicated by 0.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
        """
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs
```

Store the residual:

residual = hidden_states

Normalize (RMSNorm):

self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

hidden_states = self.input_layernorm(hidden_states)
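For reference, here is a minimal sketch of what `Qwen2RMSNorm` computes (the class name `RMSNormSketch` is mine; the formula is the standard RMSNorm: scale x by 1/sqrt(mean(x²)+eps) and multiply by a learned weight):

```python
import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    """Minimal RMSNorm sketch: x / sqrt(mean(x^2) + eps), scaled by a learned weight."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.float().pow(2).mean(-1, keepdim=True)    # mean of squares over the hidden dim
        x_normed = x.float() * torch.rsqrt(variance + self.eps)
        return (self.weight * x_normed).to(x.dtype)
```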

Pass into the self-attention module:

self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )

This produces a new hidden_states, which is added to the residual stored earlier:

hidden_states = residual + hidden_states

Store a new residual:

residual = hidden_states

Apply RMSNorm again (the post-attention layernorm):

hidden_states = self.post_attention_layernorm(hidden_states)

Run the MLP:

  hidden_states = self.mlp(hidden_states)
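`Qwen2MLP` is a SwiGLU-style feed-forward block (gate, up, and down projections with a SiLU activation). A minimal sketch, assuming the usual gate/up/down naming (the class name `MLPSketch` is mine):

```python
import torch.nn as nn

class MLPSketch(nn.Module):
    """SwiGLU-style MLP sketch: down( SiLU(gate(x)) * up(x) )."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
```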

Add the result to the residual:

hidden_states = hidden_states + residual

Then output.

Summarizing the key steps above, it is easy to see that:

Decoder = MLP + attn + norm

After that comes the output stage.

Now let's go back and look at the attention module, which is the real highlight.

attn

First, the input hidden_states is passed through three projections:

        query_states = self.q_proj(hidden_states)

        key_states = self.k_proj(hidden_states)

        value_states = self.v_proj(hidden_states)

Then reshape and transpose them into the form needed for the subsequent matrix multiplications:

        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)

        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
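To make the shapes concrete, here is a self-contained toy version of these two steps (all dimensions are made-up small values; the bias settings on the projections follow my understanding that Qwen2 uses biases on q/k/v_proj):

```python
# Toy sketch of the Q/K/V projections and the reshape/transpose that follows
import torch
import torch.nn as nn

bsz, q_len, hidden_size = 1, 4, 16
num_heads, num_key_value_heads, head_dim = 4, 2, 4     # grouped-query attention: fewer KV heads

q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=True)
k_proj = nn.Linear(hidden_size, num_key_value_heads * head_dim, bias=True)
v_proj = nn.Linear(hidden_size, num_key_value_heads * head_dim, bias=True)

hidden_states = torch.randn(bsz, q_len, hidden_size)
query_states = q_proj(hidden_states).view(bsz, q_len, num_heads, head_dim).transpose(1, 2)            # (1, 4, 4, 4)
key_states = k_proj(hidden_states).view(bsz, q_len, num_key_value_heads, head_dim).transpose(1, 2)    # (1, 2, 4, 4)
value_states = v_proj(hidden_states).view(bsz, q_len, num_key_value_heads, head_dim).transpose(1, 2)  # (1, 2, 4, 4)
print(query_states.shape, key_states.shape, value_states.shape)
```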

Apply rotary position embedding (RoPE) to the query and key tensors: the cosine and sine components of the rotary embedding are multiplied with the query and key tensors and the results are summed, which realizes the rotary position encoding.

        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)

        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
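A sketch of how this application typically looks, using the common "rotate half" formulation (not copied verbatim from the library; function names with `_sketch` are mine):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Swap the two halves of the last dimension and negate the second half."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_sketch(q, k, cos, sin, position_ids):
    """q, k: (bsz, heads, seq_len, head_dim); cos, sin: (max_seq_len, head_dim)."""
    cos = cos[position_ids].unsqueeze(1)   # -> (bsz, 1, seq_len, head_dim), broadcast over heads
    sin = sin[position_ids].unsqueeze(1)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```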

Then apply repeat_kv to the keys and values (grouped-query attention: the KV heads are repeated so that K/V match the number of query heads).

In smaller models this step is not strictly necessary: when the number of KV heads equals the number of query heads, num_key_value_groups is 1 and repeat_kv is a no-op.

        key_states = repeat_kv(key_states, self.num_key_value_groups)

        value_states = repeat_kv(value_states, self.num_key_value_groups)
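A sketch of what repeat_kv does (function name `repeat_kv_sketch` is mine; the expand/reshape trick follows the usual implementation style):

```python
import torch

def repeat_kv_sketch(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand each KV head n_rep times; hidden_states: (batch, num_key_value_heads, seq_len, head_dim)."""
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:                         # num_heads == num_key_value_heads: nothing to do
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_key_value_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
```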

Compute the scaled dot-product attention scores:

        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

Add the causal attention mask, which enforces the left-to-right reading order (a position can only attend to itself and earlier positions):

attn_weights = attn_weights + attention_mask
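A toy sketch of such a causal mask (shapes and values are illustrative, not the library's mask helper):

```python
# Upper triangle set to -inf so the softmax assigns zero weight to future positions
import torch

seq_len = 5
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
attn_weights = torch.randn(1, 1, seq_len, seq_len) + causal_mask   # broadcast over (bsz, heads)
probs = attn_weights.softmax(dim=-1)    # each row attends only to itself and earlier positions
print(probs[0, 0])
```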

Softmax, dropout, then multiply with value_states:

        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)

        attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)

        attn_output = torch.matmul(attn_weights, value_states)

Transpose and reshape back to (bsz, q_len, hidden_size), then apply the final output projection o_proj:

        attn_output = attn_output.transpose(1, 2).contiguous()

        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        attn_output = self.o_proj(attn_output)

Finally, return:

return attn_output, attn_weights, past_key_value 

Summary

Limited by ability and space, the mathematical derivations have not been written up yet; they will be added later.

Still, working through the principles of Qwen2 nicely tied together my earlier knowledge of GPT-like transformer models, which was quite interesting.
