Following the processing order, the input question text first goes through the tokenizer, which splits the sentence into tokens; a word can be broken into a [stem] part and a [feature/affix] part.

Tokenizing a word is similar to splitting "friendly" into two pieces: the word itself is decomposed into a base form plus a reusable piece shared by many words, so only the base forms need their own entries, which reduces the required vocabulary size.

Later in processing, each token is mapped to a word vector of some fixed dimensionality (e.g. 200, 300, 768, or 1536) made up of floating-point values.

What comes out of this step is a sequence of input_ids, which is then fed into the backbone of Qwen2.
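A minimal sketch of this step with the Hugging Face tokenizer (the checkpoint name is just an example; any Qwen2-family tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

# Example checkpoint name; substitute whichever Qwen2 checkpoint you actually use.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")

text = "unfriendly weather"
print(tokenizer.tokenize(text))      # subword pieces; a word may split into a stem plus smaller pieces
encoded = tokenizer(text, return_tensors="pt")
print(encoded["input_ids"])          # the input_ids tensor that is fed into the Qwen2 backbone
```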
```python
class Qwen2Model(Qwen2PreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]

    Args:
        config: Qwen2Config
    """

    def __init__(self, config: Qwen2Config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size  # vocabulary size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)  # token embedding lookup (torch.nn.Embedding)
        self.layers = nn.ModuleList(
            [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self._attn_implementation = config._attn_implementation
        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()
```

- Sets two model attributes: `padding_idx` (the index of the padding token) and `vocab_size` (the vocabulary size)
- Initializes the embedding layer, the decoder layers, and the normalization layer
- Embedding layer (`nn.Embedding`): maps the input token ids to dense vector representations.
- Decoder layers (`nn.ModuleList()`): the model stacks several decoder layers, each defined by `Qwen2DecoderLayer`
- Normalization layer `Qwen2RMSNorm`: uses Root Mean Square Layer Normalization (a minimal sketch follows this list)
- `gradient_checkpointing` toggles gradient checkpointing, which is mainly used to save GPU memory
- `post_init()` finishes weight initialization and some preparatory checks
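For reference, a minimal RMSNorm sketch written from the formula (the real `Qwen2RMSNorm` also handles dtype casting around the statistics):

```python
import torch
import torch.nn as nn

class SimpleRMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescale by 1/RMS(x); no mean subtraction and no bias."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)   # mean of squares over the hidden dimension
        x = x * torch.rsqrt(variance + self.eps)     # divide by the root mean square
        return self.weight * x
```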
The embedding uses `torch.nn.Embedding`; my knowledge here is still limited, so this part will be filled in gradually later.
After the input_ids pass through the embedding layer, the resulting vectors are fed into the decoder layers for processing.

This part is a multi-layer (stacked) structure.
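A minimal sketch of that embedding lookup (the sizes below are made-up toy values, not the real Qwen2 config):

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, pad_token_id = 1000, 64, 0   # toy values for illustration only
embed_tokens = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id)

input_ids = torch.tensor([[5, 42, 7, 0]])             # (batch=1, seq_len=4); 0 is the padding id here
inputs_embeds = embed_tokens(input_ids)               # (1, 4, hidden_size): one vector per token
print(inputs_embeds.shape)                            # torch.Size([1, 4, 64])
```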
```python
hidden_states = inputs_embeds

# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None

for decoder_layer in self.layers:
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    if self.gradient_checkpointing and self.training:
        layer_outputs = self._gradient_checkpointing_func(
            decoder_layer.__call__,
            hidden_states,
            attention_mask,
            position_ids,
            past_key_values,
            output_attentions,
            use_cache,
        )
    else:
        layer_outputs = decoder_layer(
            hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_values,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )

    hidden_states = layer_outputs[0]

    if use_cache:
        next_decoder_cache = layer_outputs[2 if output_attentions else 1]

    if output_attentions:
        all_self_attns += (layer_outputs[1],)

hidden_states = self.norm(hidden_states)

# add hidden states from the last decoder layer
if output_hidden_states:
    all_hidden_states += (hidden_states,)

next_cache = None
if use_cache:
    next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache

if not return_dict:
    return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
return BaseModelOutputWithPast(
    last_hidden_state=hidden_states,
    past_key_values=next_cache,
    hidden_states=all_hidden_states,
    attentions=all_self_attns,
)
```

The overall logic: hidden_states is passed through each decoder layer in turn, and each layer's output becomes the next layer's input (with the per-layer hidden states optionally collected along the way); after the last layer, a final RMSNorm is applied and that final hidden state is appended before everything is returned (see the sketch below).
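A toy restatement of that loop, with dummy modules standing in for `Qwen2DecoderLayer` and `Qwen2RMSNorm` just to show the data flow (not the real layers):

```python
import torch
import torch.nn as nn

hidden_size, num_layers = 16, 4
layers = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(num_layers))  # stand-ins for Qwen2DecoderLayer
norm = nn.LayerNorm(hidden_size)                                                        # stand-in for the final Qwen2RMSNorm

hidden_states = torch.randn(1, 8, hidden_size)   # plays the role of inputs_embeds: (batch, seq_len, hidden)
for layer in layers:
    hidden_states = layer(hidden_states)         # each layer's output is the next layer's input
hidden_states = norm(hidden_states)              # one final normalization after the last layer
```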
Now for the decoder layer itself:
```python
class Qwen2DecoderLayer(nn.Module):
    def __init__(self, config: Qwen2Config, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        if config.use_sliding_window and config._attn_implementation != "flash_attention_2":
            logger.warning_once(
                f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
                "unexpected results may be encountered."
            )
        self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        self.mlp = Qwen2MLP(config)
        self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        **kwargs,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        if "padding_mask" in kwargs:
            warnings.warn(
                "Passing `padding_mask` is deprecated and will be removed in v4.37. "
                "Please make sure use `attention_mask` instead.`"
            )
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, sequence_length)` where padding elements are indicated by 0.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
        """

        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs
```

Save the residual:

```python
residual = hidden_states
```

Normalize (RMSNorm):

```python
self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
hidden_states = self.input_layernorm(hidden_states)
```

Feed the result into the attention module:

```python
self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
    hidden_states=hidden_states,
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_value=past_key_value,
    output_attentions=output_attentions,
    use_cache=use_cache,
)
```

This produces a new hidden_states, which is added to the residual saved earlier:

```python
hidden_states = residual + hidden_states
```

Save a new residual:

```python
residual = hidden_states
```

Go through RMSNorm again:

```python
hidden_states = self.post_attention_layernorm(hidden_states)
```

Apply the MLP:

```python
hidden_states = self.mlp(hidden_states)
```

Add the result to the residual:

```python
hidden_states = residual + hidden_states
```

Then output the result.

So, summarizing the key steps above, it is easy to see that

Decoder = MLP + attn + norm
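That pre-norm residual wiring can be compressed into a few lines. The sketch below uses toy `nn.Linear` / `nn.LayerNorm` stand-ins for the real attention, MLP, and RMSNorm modules; only the order of operations matches `Qwen2DecoderLayer`:

```python
import torch
import torch.nn as nn

class ToyDecoderLayer(nn.Module):
    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)           # stand-in for Qwen2RMSNorm
        self.self_attn = nn.Linear(hidden_size, hidden_size)       # stand-in for self-attention
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)  # stand-in for Qwen2RMSNorm
        self.mlp = nn.Linear(hidden_size, hidden_size)             # stand-in for Qwen2MLP

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # norm -> attention -> add residual
        hidden_states = hidden_states + self.self_attn(self.input_layernorm(hidden_states))
        # norm -> MLP -> add residual
        hidden_states = hidden_states + self.mlp(self.post_attention_layernorm(hidden_states))
        return hidden_states
```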
After that comes the output stage.

Now let's go back and look at attn, which is the real highlight.
First, the incoming hidden_states is projected into three parts (query, key, value):

```python
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
```

Then reshape and transpose them into the layout needed for the later matrix multiplications:

```python
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
```
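To make the shape bookkeeping concrete (the numbers are just example sizes):

```python
import torch

bsz, q_len, num_heads, head_dim = 1, 5, 8, 64            # example sizes; hidden_size = num_heads * head_dim = 512
hidden = torch.randn(bsz, q_len, num_heads * head_dim)   # (batch, seq_len, hidden_size)

# (bsz, q_len, hidden) -> (bsz, q_len, num_heads, head_dim) -> (bsz, num_heads, q_len, head_dim)
q = hidden.view(bsz, q_len, num_heads, head_dim).transpose(1, 2)
print(q.shape)   # torch.Size([1, 8, 5, 64])
```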
Apply rotary position embeddings (RoPE) to the query and key tensors: the cosine and sine components are multiplied with the query/key tensors and the results are summed, which realizes the rotary positional encoding.

```python
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
```
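Roughly, `apply_rotary_pos_emb` does the following (a sketch of the idea; the exact cos/sin cache indexing and signature differ between transformers versions):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Swap the two halves of the last dimension and negate the second half."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_sketch(q, k, cos, sin):
    # cos/sin are per-position tables already gathered for the current positions
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```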
Then repeat_kv is applied to the keys and values (I'm not sure whether the figure above depicts this correctly).

For small models this step is not strictly required.

```python
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
```
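`repeat_kv` duplicates each key/value head so that grouped-query attention lines up with the query heads; when the number of KV heads already equals the number of query heads it is a no-op, which is why the step is optional for such models. A sketch of the idea:

```python
import torch

def repeat_kv_sketch(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """(batch, num_kv_heads, seq_len, head_dim) -> (batch, num_kv_heads * n_rep, seq_len, head_dim)."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states  # nothing to repeat: KV heads already match the query heads
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)
```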
Compute the scaled dot-product attention scores:

```python
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
```

Add the causal attention mask, so each position can only attend to earlier positions (enforcing the left-to-right reading order):

```python
attn_weights = attn_weights + attention_mask
```
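The additive mask holds 0 where attention is allowed and a very large negative value where it is not, so after softmax the forbidden positions receive (almost) zero weight. A minimal causal-only example (the real mask also folds in padding and is broadcast over batch and heads):

```python
import torch

seq_len = 4
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(causal_mask)
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```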
Softmax, dropout, then multiply with value_states:

```python
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
```

Transpose and reshape the result, then apply the final output projection o_proj:

```python
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
```

Finally, return the outputs:

```python
return attn_output, attn_weights, past_key_value
```
Due to limits of ability and space, the mathematical details are not written up here yet; they will be added later.

Still, working through how Qwen operates has nicely tied together my earlier knowledge of GPT-like transformer models, which was quite interesting.