BERT: Principles and Code Walkthrough

Author: 菜鸟追梦旅行 | 2024-04-03 15:30:06

BERT Principles

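BERT is a pre-trained language model proposed by Google in 2018. Its pre-training learns deep bidirectional representations that condition on both the left and the right context at the same time. To apply the pre-trained BERT representations to a downstream task, it is enough to add one task-specific output layer and fine-tune the model, which gave state-of-the-art results on a wide range of tasks at the time.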

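The main contributions can be summarized in three points:

1. We demonstrate the importance of bidirectional pre-training for language representations. Unlike earlier attempts, which all used unidirectional language models, BERT uses a masked language model to obtain bidirectional representations.

2. We show that pre-trained representations can replace a large amount of heavily engineered task-specific architecture. BERT is the first fine-tuning-based representation model to achieve state-of-the-art results on a large suite of both sentence-level and token-level tasks.

3. BERT achieves state-of-the-art results on 11 NLP tasks. We also run ablation experiments showing that the bidirectional nature of the model is the single most important factor.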

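1. Model architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder, based on [Attention is all you need](https://link.zhihu.com/?target=https%3A//arxiv.org/abs/1706.03762). In this work we denote the number of layers (Transformer blocks) as $L$, the hidden size as $H$, and the number of self-attention heads as $A$. In all cases we set the feed-forward/filter size to $4H$, i.e. 3072 for $H=768$ and 4096 for $H=1024$. The two model sizes are: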

${BERT_{BASE}:L=12,H=768,A=12}$, 110M parameters in total

${BERT_{LARGE}:L=24,H=1024,A=16}$, 340M parameters in total
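
The two totals can be roughly sanity-checked from $L$ and $H$ alone. The sketch below is a back-of-the-envelope count, not the official accounting. It assumes the 30,000-token vocabulary and 512 positions mentioned later in this post, two segment types, one LayerNorm after the embeddings and two per layer, and a pooler layer on top. The function name `bert_param_count` is made up for illustration. The number of heads $A$ does not change the count, because the heads split the same $H \times H$ projections.

```python
def bert_param_count(num_layers, hidden, vocab_size=30000,
                     max_positions=512, type_vocab_size=2):
    """Approximate parameter count of a BERT encoder plus pooler."""
    ffn = 4 * hidden  # feed-forward size is 4H, as stated in the text
    # Token, position, and segment embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + 2 * hidden
    # Q, K, V, and output projections (weights + biases) plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two dense layers (H -> 4H -> H, weights + biases) plus one LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + 2 * hidden
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT_BASE  ~ {base / 1e6:.0f}M parameters")
print(f"BERT_LARGE ~ {large / 1e6:.0f}M parameters")
```

Under these assumptions the counts land close to the quoted 110M and 340M; the small gap comes from rounding and from the exact vocabulary size of the released checkpoints.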

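[Figure: comparison of the BERT, OpenAI GPT, and ELMo pre-training architectures]

${BERT_{BASE}}$ was chosen to have the same model size as OpenAI GPT so that the two can be compared. A bidirectional Transformer is usually referred to as a Transformer encoder, while a left-to-right Transformer is usually referred to as a Transformer decoder, because it can be used to generate text. The figure above compares BERT, OpenAI GPT, and ELMo.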

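2. Input representation

The input can be either a single sentence or a pair of sentences packed into one token sequence. For a given token, its input representation is the sum of the corresponding token, segment, and position embeddings.

[Figure: BERT input representation, the sum of token embeddings, segment embeddings, and position embeddings]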

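The details are as follows:

1. We use WordPiece embeddings with a vocabulary of 30,000 tokens. Split word pieces are marked with ##.

2. We use learned position embeddings, supporting sequences of up to 512 tokens.

3. The first token of every sequence is the special classification embedding, written [CLS]. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks this vector is ignored.

4. Sentence pairs are packed into a single sequence and separated by [SEP]. In addition, a learned sentence embedding ${E_A}$ is added to every token of sentence A and a sentence embedding ${E_B}$ to every token of sentence B (the Segment Embeddings in the figure above).

5. For single-sentence inputs we only use ${E_A}$.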
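
The packing rules above can be sketched in a few lines of plain Python. This is only an illustration: the function name `pack_inputs`, the already-tokenized example sentences, and the choice to count the trailing [SEP] of each sentence as part of that sentence's segment are assumptions, not something taken from the original text.

```python
def pack_inputs(tokens_a, tokens_b=None, max_seq_length=512):
    """Build tokens, segment ids, and position ids for one BERT input."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # sentence A, embedding E_A
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B, embedding E_B
    if len(tokens) > max_seq_length:
        raise ValueError("sequence is longer than max_seq_length")
    position_ids = list(range(len(tokens)))       # indices into the position table
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = pack_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
for row in zip(tokens, segment_ids, position_ids):
    print(row)
```

Each printed row shows the three indices that select the token, segment, and position embeddings which are then summed for that position.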

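3. Pre-training tasks

Unlike ELMo and OpenAI GPT, we do not pre-train BERT with a traditional left-to-right or right-to-left language model. Instead, we use two new unsupervised prediction tasks, described below.

3.1 Task 1: Masked LM

In a deep bidirectional model, conditioning on both directions would let every word indirectly “see itself” through the multi-layer context, which is why standard language models are trained in only one direction. To train a deep bidirectional representation, we instead mask a certain percentage of the input tokens at random and then predict only those masked tokens. We call this procedure “masked LM” (MLM). The final hidden vectors corresponding to the masked tokens are fed into a softmax over the vocabulary, so the output dimension equals the vocabulary size. We mask 15% of the tokens in each sequence at random. Unlike denoising auto-encoders, we only predict the masked tokens rather than reconstructing the entire input.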

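Although this gives us a bidirectional pre-trained model, it has two drawbacks. The first is a mismatch between pre-training and fine-tuning, because the [MASK] token is never seen during fine-tuning. To mitigate this, we do not always replace the “masked” words with the actual [MASK] token. Instead, the training data generator chooses 15% of the tokens at random, for example hairy in my dog is hairy, and then does the following:

1. 80% of the time: replace the chosen word with [MASK], e.g. my dog is hairy -> my dog is [MASK]

2. 10% of the time: replace the chosen word with a random word, e.g. my dog is hairy -> my dog is apple

3. 10% of the time: keep the word unchanged, e.g. my dog is hairy -> my dog is hairy

The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. In addition, because random replacement only happens for 1.5% of all tokens (10% of 15%), it does not seem to harm the model's language understanding capability.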
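
The 80% / 10% / 10% rule can be sketched as follows. This is a simplified illustration rather than the data generator from the official repository: the function name `mask_tokens`, the toy vocabulary, and the choice to mask at least one token per sequence are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels maps position -> original token."""
    rng = rng or random.Random()
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    num_to_mask = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    output, labels = list(tokens), {}
    for i in rng.sample(candidates, num_to_mask):
        labels[i] = tokens[i]              # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            output[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            output[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: leave the original token in place
    return output, labels


vocab = ["my", "dog", "is", "hairy", "apple", "cat", "runs"]
tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked, labels)
```

Note that `labels` is recorded for every chosen position, including the ones left unchanged, so the loss is still computed there even though the input looks untouched.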

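The second drawback is that only 15% of the tokens in each batch are predicted, so more pre-training steps may be needed for the model to converge. In Section 5.3 we show that MLM converges somewhat more slowly than a left-to-right model, but the empirical improvements far outweigh the increased training cost.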

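3.2 Task 2: Next sentence prediction

Many important downstream tasks, such as question answering and natural language inference, are based on understanding the relationship between two sentences, which is not directly captured by language modeling. To train a model that understands sentence relationships, we pre-train a binary next sentence prediction task, for which training data is easy to obtain. Specifically, when choosing sentences A and B for each pre-training example, 50% of the time B is the actual sentence that follows A (labeled IsNext), and 50% of the time B is a random sentence (a negative example, labeled NotNext).

We choose the NotNext sentences completely at random, and the final pre-trained model reaches 97%-98% accuracy on this task. Although the task looks simple, we show in Section 5.1 that pre-training on it is very beneficial for both question answering and natural language inference.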
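
A minimal sketch of how such sentence pairs could be sampled is given below. It assumes at least two documents, each with two or more sentences, and draws the negative sentence from a different document; that last detail, the function name `make_nsp_example`, and the toy documents are illustrative assumptions rather than something stated in this post.

```python
import random


def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, label) for one training pair.

    Assumes at least two documents, each a list of two or more sentences.
    """
    doc_index = rng.randrange(len(documents))
    doc = documents[doc_index]
    i = rng.randrange(len(doc) - 1)   # keep room for a true next sentence
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], "IsNext"       # B really follows A
    # Negative example: B is a random sentence from another document.
    other_docs = [d for j, d in enumerate(documents) if j != doc_index]
    return doc[i], rng.choice(rng.choice(other_docs)), "NotNext"


documents = [
    ["the man went to the store", "he bought a gallon of milk", "then he went home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
rng = random.Random(0)
for _ in range(4):
    print(make_nsp_example(documents, rng))
```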

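3.4 Pre-training procedure

The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use BooksCorpus (800M words) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. We use a document-level corpus rather than a shuffled sentence-level corpus so that long contiguous sequences can be extracted.

To generate each training input sequence, we sample two spans of text from the corpus. We call them “sentences”, although they are typically much longer than single sentences. The combined length of the two is at most 512 tokens.

We train with a batch size of 256 sequences (256 sequences * 512 tokens = 131,072, roughly 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3-billion-word corpus. We use Adam with a learning rate of 1e-4, ${{\beta_1}=0.9}$, ${{\beta_2}=0.999}$, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate afterwards. We use a dropout probability of 0.1 on all layers and, like OpenAI GPT, the gelu activation. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.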
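
The batch arithmetic and the warmup-then-linear-decay schedule can be checked with a short sketch. The schedule below decays linearly from the peak at the end of warmup to zero at the last step; this is one simple reading of the description and may differ in detail from the optimizer code in the official repository. The function name `learning_rate` is made up for the example.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


tokens_per_batch = 256 * 512                          # sequences * tokens
epochs = tokens_per_batch * 1_000_000 / 3.3e9         # steps over a 3.3B-word corpus
print("tokens per batch:", tokens_per_batch)
print("approximate epochs:", round(epochs, 1))
for step in (0, 5_000, 10_000, 500_000, 1_000_000):
    print(step, learning_rate(step))
```

The epoch estimate treats words and tokens as interchangeable, which is only approximately true for WordPiece input.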

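For hardware, ${BERT_{BASE}}$ was trained on 4 Cloud TPUs (16 TPU chips in total) and ${BERT_{LARGE}}$ on 16 Cloud TPUs (64 TPU chips in total). Each pre-training run took 4 days to complete.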

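3.5 Fine-tuning procedure

For sequence-level classification tasks, fine-tuning BERT is straightforward. To obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state of the first input token, which by construction is always [CLS]. We denote this vector as $C \in \mathbb{R}^H$. The only new parameters added during fine-tuning are those of a classification layer $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of class labels, and the label probabilities are computed with a standard softmax, $P = \mathrm{softmax}(CW^T)$.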
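
As a small illustration of this classification head, the sketch below writes C for the final [CLS] vector (length H) and W for the weights of the new classification layer (K rows of length H), and computes softmax(C W^T). The toy numbers and the function names `softmax` and `classify` are assumptions made for the example.

```python
import math


def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(c, w):
    """c: [CLS] vector of length H; w: K rows of length H. Returns softmax(C W^T)."""
    logits = [sum(ci * wi for ci, wi in zip(c, row)) for row in w]
    return softmax(logits)


c = [0.5, -1.0, 2.0]        # toy [CLS] vector, H = 3
w = [[1.0, 0.0, 0.0],       # toy classification weights, K = 2
     [0.0, 0.0, 1.0]]
probs = classify(c, w)
print(probs)
```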

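During fine-tuning, most hyperparameters are the same as in pre-training. The optimizer hyperparameters depend on the task, but we found the following ranges of values to work very well:

Batch size: 16, 32
Learning rate (Adam): 5e-5, 3e-5, 2e-5
Number of epochs: 3, 4

We also observed that large datasets are far less sensitive to the choice of hyperparameters than small ones. Fine-tuning is usually very fast, so it is reasonable to run an exhaustive search over the values above and pick the best-performing setting.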
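
The exhaustive search over this grid is small (2 * 3 * 2 = 12 runs) and can be sketched as below. The `dummy_evaluate` function is a stand-in so that the example runs on its own; a real version would fine-tune BERT with the given settings and return a validation metric. All names here are illustrative.

```python
from itertools import product

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
num_epochs = [3, 4]


def search(evaluate):
    """Try every combination and return (best_score, best_config)."""
    best = None
    for bs, lr, ep in product(batch_sizes, learning_rates, num_epochs):
        score = evaluate(batch_size=bs, learning_rate=lr, epochs=ep)
        if best is None or score > best[0]:
            best = (score, {"batch_size": bs, "learning_rate": lr, "epochs": ep})
    return best


def dummy_evaluate(batch_size, learning_rate, epochs):
    # Placeholder score that peaks at (32, 3e-5, 3); not a real metric.
    return (-abs(learning_rate - 3e-5)
            - abs(batch_size - 32) * 1e-6
            - abs(epochs - 3) * 1e-7)


print(search(dummy_evaluate))
```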

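4. Experiments

This section covers fine-tuning experiments on downstream tasks. The original paper describes four task types; we do not go into detail here and only give the schematic structure for each of them. The main thing to look at is how the sentence representation and the input/output format differ between tasks.
[Figure: fine-tuning architectures for the four task types]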

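5. Ablation studies

This part presents ablation experiments on three aspects: pre-training tasks, model size, and features.

5.1 Pre-training tasks

No NSP (no next sentence prediction): a model pre-trained without the next sentence prediction task.
LTR (left-to-right) & No NSP: a model pre-trained with a left-to-right language model instead of the masked LM, so every input token is predicted and no masking is applied.

The results are as follows:

[Table: ablation results for the pre-training tasks]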

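5.2 Effect of model size

The effect of model size on fine-tuning accuracy is shown in the table below:
[Table: fine-tuning accuracy for different model sizes]

5.3 Effect of the number of training steps

The accuracy obtained after different numbers of training steps is shown in the figure below:

[Figure: accuracy versus number of training steps]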

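5.4 Feature-based approach

All of the results so far use the fine-tuning approach, in which a simple classification layer is added on top of the pre-trained model and all parameters are fine-tuned jointly. However, the pre-trained representations can also be transferred in a feature-based way, where fixed features are extracted from the model. This has some advantages: 1. Not every NLP task can be easily represented by a Transformer encoder architecture, so some tasks require a task-specific model architecture to be added. 2. It is computationally cheap, because the expensive representation of the data can be computed once and then reused by cheaper models built on top of it.

6. Conclusion

Our main contribution is to generalize this pre-training approach to deep bidirectional architectures, so that the same pre-trained model can successfully handle a broad range of natural language tasks. It also opens up new directions for follow-up research.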

7. Code

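GitHub: https://github.com/google-research/bert

We use Google's original TensorFlow code and add annotations to some of the key parts.

7.1 The main function

The main function builds a custom model function and hands it to an Estimator for training. See the code comments for details.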

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  if not FLAGS.do_train and not FLAGS.do_eval:
    raise ValueError("At least one of `do_train` or `do_eval` must be True.")

  bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

  tf.gfile.MakeDirs(FLAGS.output_dir)

  input_files = []
  for input_pattern in FLAGS.input_file.split(","):
    input_files.extend(tf.gfile.Glob(input_pattern))

  tf.logging.info("*** Input Files ***")
  for input_file in input_files:
    tf.logging.info("  %s" % input_file)

  tpu_cluster_resolver = None
  if FLAGS.use_tpu and FLAGS.tpu_name:
    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
        FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)

  is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
  run_config = tf.contrib.tpu.RunConfig(                # run (training) configuration
      cluster=tpu_cluster_resolver,
      master=FLAGS.master,
      model_dir=FLAGS.output_dir,
      save_checkpoints_steps=FLAGS.save_checkpoints_steps,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=FLAGS.iterations_per_loop,
          num_shards=FLAGS.num_tpu_cores,
          per_host_input_for_training=is_per_host))

  model_fn = model_fn_builder(               # custom model_fn used by the Estimator for training
      bert_config=bert_config,
      init_checkpoint=FLAGS.init_checkpoint,
      learning_rate=FLAGS.learning_rate,
      num_train_steps=FLAGS.num_train_steps,
      num_warmup_steps=FLAGS.num_warmup_steps,
      use_tpu=FLAGS.use_tpu,
      use_one_hot_embeddings=FLAGS.use_tpu)

  # If TPU is not available, this will fall back to normal Estimator on CPU
  # or GPU.
  estimator = tf.contrib.tpu.TPUEstimator(            # create the TPUEstimator
      use_tpu=FLAGS.use_tpu,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=FLAGS.train_batch_size,
      eval_batch_size=FLAGS.eval_batch_size)

  if FLAGS.do_train:        # training
    tf.logging.info("***** Running training *****")
    tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
    train_input_fn = input_fn_builder(          # build the training input function
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=True)
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)

  if FLAGS.do_eval:                   # evaluation
    tf.logging.info("***** Running evaluation *****")
    tf.logging.info("  Batch size = %d", FLAGS.eval_batch_size)

    eval_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=False)

    result = estimator.evaluate(
        input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)

    output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
    with tf.gfile.GFile(output_eval_file, "w") as writer:
      tf.logging.info("***** Eval results *****")
      for key in sorted(result.keys()):
        tf.logging.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))

7.2 BertConfig settings

This class holds the configuration parameters used to build BERT.

class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
               
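    self.vocab_size = vocab_size    # number of tokens in the vocabulary
    self.hidden_size = hidden_size   # hidden size of the encoder layers
    self.num_hidden_layers = num_hidden_layers  # number of Transformer layers
    self.num_attention_heads = num_attention_heads # number of attention heads in each layer
    self.hidden_act = hidden_act   # activation function (gelu)
    self.intermediate_size = intermediate_size  # size of the feed-forward (intermediate) layer
    self.hidden_dropout_prob = hidden_dropout_prob  # dropout probability for the hidden layers
    self.attention_probs_dropout_prob = attention_probs_dropout_prob  # dropout probability applied to the attention probabilities after the softmax
    self.max_position_embeddings = max_position_embeddings  # maximum supported sequence length (must be >= seq_length); sizes the position embedding table
    self.type_vocab_size = type_vocab_size  # vocabulary size of token_type_ids (segment ids); BERT inputs only use 0 and 1
    self.initializer_range = initializer_range  # stddev of the truncated normal weight initializer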

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
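
To make the role of these fields concrete, here is a small standalone sketch that parses a JSON config of the same shape and checks two relationships the rest of the code relies on. The JSON text is a hypothetical example that mirrors the constructor defaults, with the 30,000-token vocabulary mentioned earlier in this post; `load_config` is an illustrative helper, not part of the repository.

```python
import json

# Hypothetical config text in the shape that from_json_file reads.
config_text = json.dumps({
    "vocab_size": 30000,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
})


def load_config(text):
    """Parse the JSON text and check that the sizes are consistent."""
    cfg = json.loads(text)
    if cfg["hidden_size"] % cfg["num_attention_heads"] != 0:
        raise ValueError("hidden_size must be a multiple of num_attention_heads")
    return cfg


cfg = load_config(config_text)
size_per_head = cfg["hidden_size"] // cfg["num_attention_heads"]
print("size per attention head:", size_per_head)
print("feed-forward size is 4H:", cfg["intermediate_size"] == 4 * cfg["hidden_size"])
```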

7.3 BertModel

This class contains the core BERT code. A few of its main components are shown below.

7.3.1 Constructor

class BertModel(object):
  """BERT model ("Bidirectional Encoder Representations from Transformers").
  """

  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=False,
               scope=None):
    """Constructor for BertModel.

    Args:
      config: `BertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
        whether dropout will be applied.
      input_ids: int32 Tensor of shape [batch_size, seq_length].
      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
        embeddings or tf.embedding_lookup() for the word embeddings.
      scope: (optional) variable scope. Defaults to "bert".

    Raises:
      ValueError: The config is invalid or one of the input tensor shapes
        is invalid.
    """
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        (self.embedding_output, self.embedding_table) = embedding_lookup(   # word embedding
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(        # add token type (segment) and position embeddings
            input_tensor=self.embedding_output,                 # [batch_size, seq_length, embedding_size]
            use_token_type=True,
            token_type_ids=token_type_ids,                      # [batch_size, seq_length]
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)


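      #### How the model outputs are used ####
      # get_pooled_output returns the transformed [CLS] (first token) representation of each sequence in the batch. BERT treats it as a summary of the whole sequence; it is suited to sentence-level classification.
      # get_sequence_output returns the final encoder output of shape [batch_size, seq_length, hidden_size], i.e. one vector per token; it is suited to token-level tasks such as sequence labeling.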
      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))
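
The constructor turns the 2D `input_mask` into a 3D attention mask before running the encoder. The sketch below shows the idea with plain Python lists: every "from" position gets the same row, which says which "to" positions hold real tokens. It illustrates the shape change only and is not the TensorFlow helper itself; the function name is made up for the example.

```python
def attention_mask_from_input_mask(input_mask):
    """input_mask: [batch_size, seq_length] with 1 for real tokens, 0 for padding.

    Returns a [batch_size, seq_length, seq_length] mask whose entry [b][f][t]
    is 1 when position f of sequence b may attend to position t.
    """
    return [[list(row) for _ in row] for row in input_mask]


mask_3d = attention_mask_from_input_mask([[1, 1, 0]])
for from_row in mask_3d[0]:
    print(from_row)
```

Only attention *to* padding is blocked here; padded "from" positions still produce outputs, which downstream code simply ignores.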

7.3.2 The embedding_lookup function

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])    # add a trailing dimension: [batch_size, seq_length, 1]

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])    # [batch_size*seq_length]
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)     # [batch_size*seq_length, vocab_size]
    output = tf.matmul(one_hot_input_ids, embedding_table)               # [batch_size*seq_length, embedding_size]
  else:   # look up rows of the table by index
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  # output: [batch_size*seq_length*num_inputs, embedding_size]
  # reshaped to: [batch_size, seq_length, num_inputs*embedding_size]
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)
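
The two branches of this function compute the same result: multiplying a one-hot matrix by the embedding table selects the same rows that a gather would. A tiny pure-Python check with a made-up 4-word table (all names and values here are illustrative):

```python
embedding_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # [vocab_size=4, embedding_size=2]
input_ids = [2, 0, 3]


def lookup_gather(table, ids):
    """Pick rows of the table by index."""
    return [list(table[i]) for i in ids]


def lookup_one_hot(table, ids):
    """Build one-hot rows and multiply them with the table."""
    vocab_size, width = len(table), len(table[0])
    one_hot = [[1.0 if v == i else 0.0 for v in range(vocab_size)] for i in ids]
    return [[sum(row[v] * table[v][d] for v in range(vocab_size))
             for d in range(width)]
            for row in one_hot]


gathered = lookup_gather(embedding_table, input_ids)
multiplied = lookup_one_hot(embedding_table, input_ids)
print(gathered)
print(multiplied)
```

The one-hot path does far more arithmetic, so which branch is preferable depends on the hardware; in the `main` function shown earlier, `use_one_hot_embeddings` is set from `FLAGS.use_tpu`.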


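7.3.3 Summing the three embeddings

This function starts from the word embedding vectors computed above, builds the segment embeddings (used for sentence pairs) and the position embeddings, and adds the three together to obtain the final input representation.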

def embedding_postprocessor(input_tensor,                       # [batch_size, seq_length, embedding_size]
                            use_token_type=False,
                            token_type_ids=None,                # [batch_size, seq_length]
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:              # segment embeddings (used to tell the two sentences of a pair apart)
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])                     # [batch_size*seq_length]
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size) # [batch_size*seq_length, token_type_vocab_size]; BERT inputs only use ids 0 and 1
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)           # [batch_size*seq_length, width]
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])        # [batch_size, seq_length, width=embedding_size]
    output += token_type_embeddings

  if use_position_embeddings:                                                  # position embeddings
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)      # ensure seq_length <= max_position_embeddings
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],        # [seq_length, embedding_size]
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])                   # [1, seq_length, embedding_size]
      position_embeddings = tf.reshape(position_embeddings,                  # [1, seq_length, embedding_size]
                                       position_broadcast_shape)             # broadcasts against output: [batch_size, seq_length, embedding_size] + [1, seq_length, embedding_size]
      output += position_embeddings                                          # every sequence in the batch gets the same position embeddings, so broadcasting adds them to each of the batch_size sequences

  output = layer_norm_and_dropout(output, dropout_prob)
  return output
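
The same computation can be followed on toy data with plain lists. The sketch below sums word, segment, and position vectors per token and then normalizes each vector. It leaves out dropout and the learned scale and shift of layer normalization, and the epsilon value is an assumption, so it only illustrates the data flow. The names `layer_norm` and `postprocess` are made up for the example.

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def postprocess(word_emb, token_type_ids, type_table, position_table):
    """word_emb: [batch, seq, width]. Returns the normalized sum of the three embeddings."""
    output = []
    for seq_emb, seq_types in zip(word_emb, token_type_ids):
        rows = []
        for pos, (vec, type_id) in enumerate(zip(seq_emb, seq_types)):
            summed = [w + s + p for w, s, p in
                      zip(vec, type_table[type_id], position_table[pos])]
            rows.append(layer_norm(summed))
        output.append(rows)
    return output


word_emb = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
            [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]]       # two identical sequences
token_type_ids = [[0, 1], [0, 1]]
type_table = [[0.1, 0.1, 0.1], [0.2, 0.0, 0.2]]       # rows for segment A and B
position_table = [[0.0, 0.0, 0.5], [0.5, 0.0, 0.0]]   # rows for positions 0 and 1
out = postprocess(word_emb, token_type_ids, type_table, position_table)
print(out[0])
```

Because the position table is indexed only by position, both sequences in the batch receive identical position vectors, which is what the broadcast in the TensorFlow code achieves.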

7.3.4 attention layer

This is essentially an implementation of the multi-head attention from Google's Transformer.

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.
  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
      from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
      attention probabilities.
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
      * from_seq_length, num_attention_heads * size_per_head]. If False, the
      output will be of shape [batch_size, from_seq_length, num_attention_heads
      * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
      of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `to_tensor`.

  Returns:
    float Tensor of shape [batch_size, from_seq_length,
      num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
      true, this will be of shape [batch_size * from_seq_length,
      num_attention_heads * size_per_head]).

  Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)    # [batch_size*seq_length, hidden_size]
  to_tensor_2d = reshape_to_matrix(to_tensor)        # [batch_size*seq_length, hidden_size]


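  # First project the inputs into query, key, and value with dense layers. The activation is None by default, so these are plain linear projections, as in the original Transformer.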
  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))    # [batch_size*seq_length, hidden_size], where hidden_size = num_attention_heads*size_per_head

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

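  # Reshape to 4-D for the attention matrix computation.
  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,        # moves num_attention_heads to the second dimension: each sequence has N heads, each head sees F tokens, and each token is an H-dimensional vector. Different heads learn features in different subspaces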
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores (scaled dot-product attention).
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # This part turns the padded positions at the end of each sequence into a large negative value; positions holding real data get 0.
    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    # After the addition, positions with real data are unchanged (0 is added) and padded positions receive a large negative value.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # Multiply the attention probabilities by the values.
  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # return a 2-D result
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
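
To see the scaling and the masking trick in isolation, here is a single-head version on plain lists. It applies the same three steps as the function above (dot products scaled by 1/sqrt(size_per_head), adding -10000 at masked positions, softmax, then a weighted sum of the values) but omits the learned projections, dropout, and the multi-head reshaping. The function names and toy inputs are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(query, key, value, to_mask):
    """query: [F, H]; key and value: [T, H]; to_mask: length T of 1 (real) / 0 (padding)."""
    scale = 1.0 / math.sqrt(len(query[0]))
    context = []
    for q in query:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        # Add 0 at real positions and -10000 at masked positions.
        scores = [s + (1.0 - m) * -10000.0 for s, m in zip(scores, to_mask)]
        probs = softmax(scores)
        context.append([sum(p * v[d] for p, v in zip(probs, value))
                        for d in range(len(value[0]))])
    return context


query = key = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]   # the last row belongs to a padded position
context = single_head_attention(query, key, value, to_mask=[1, 1, 0])
for row in context:
    print(row)
```

The large values in the last row of `value` never show up in the output, because the softmax weight of the masked position is effectively zero.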

7.3.5 transformer model

An implementation of the Transformer encoder. See the comments for the key points.

def transformer_model(input_tensor,
                      attention_mask=None,      # [batch_size, from_seq_length, to_seq_length]
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  # As noted above, the input is flattened to 2D once here, to
  # [batch_size*seq_length, hidden_size], and restored to 3D at the end.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(         # self-attention (multi-head attention)
              from_tensor=layer_input,              # [batch_size*seq_length, hidden_size]
              to_tensor=layer_input,                # [batch_size*seq_length, hidden_size]
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(       # linear projection of the attention output
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)       # residual connection, then layer norm

      # Feed-forward step: project up to `intermediate_size`, then back down.
      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(                      # project up
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(                           # project down
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)      # add the residual, then layer norm
        prev_output = layer_output                                      # this layer's output feeds the next layer
        all_layer_outputs.append(layer_output)                          # collect the output of every layer

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
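
The wiring of each layer in `transformer_model` (attention, projection, residual plus layer norm, then the feed-forward block with a second residual plus layer norm) can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the original implementation: biases, dropout and the learnable layer-norm scale and offset are omitted, `attend` is a placeholder for `attention_layer`, the GELU uses the tanh approximation, and every name and size below is hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize over the last axis; learnable scale/offset omitted (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def encoder_layer(x, attend, w_attn_out, w_up, w_down):
    """One block on a 2D input x of shape [batch_size*seq_length, hidden_size]."""
    attention_output = attend(x) @ w_attn_out               # projection after attention
    attention_output = layer_norm(attention_output + x)     # residual, then layer norm
    intermediate_output = gelu(attention_output @ w_up)     # project up, activation
    layer_output = intermediate_output @ w_down             # project back down
    return layer_norm(layer_output + attention_output)      # residual, then layer norm


rng = np.random.default_rng(0)
batch_size, seq_length, hidden_size, intermediate_size = 2, 4, 8, 32
num_hidden_layers = 3

# Flatten to 2D once, as transformer_model does with reshape_to_matrix.
input_tensor = rng.normal(size=(batch_size, seq_length, hidden_size))
prev_output = input_tensor.reshape(batch_size * seq_length, hidden_size)

all_layer_outputs = []
for _ in range(num_hidden_layers):
    # Fresh random weights per layer, standing in for trained parameters.
    w_attn_out = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
    w_up = rng.normal(scale=0.02, size=(hidden_size, intermediate_size))
    w_down = rng.normal(scale=0.02, size=(intermediate_size, hidden_size))
    prev_output = encoder_layer(prev_output, lambda t: t, w_attn_out, w_up, w_down)
    all_layer_outputs.append(prev_output)

# Restore the 3D shape at the end, as reshape_from_matrix does.
final_output = prev_output.reshape(batch_size, seq_length, hidden_size)
print(final_output.shape)
```

Because both residual additions require matching widths, the input width must equal `hidden_size`, which is the condition `transformer_model` checks before entering the loop.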
