A summary of BERT

BERT model input and output


Three embeddings: token embeddings (word vectors), position embeddings (position vectors), and segment embeddings (text vectors). The input representation is the sum of the three.

  • About Token Embedding

Token embedding converts each token into a vector of fixed dimension. In BERT, each token is converted to a 768-dimensional vector representation. Before the input text is passed to the token embedding layer, it must first be tokenized.

The tokenization method is WordPiece tokenization. Two special tokens are inserted into the tokenization result: [CLS] at the beginning and [SEP] at the end.


  • About Position Embedding
Position embedding in BERT differs from the original Transformer. The Transformer computes the value of each dimension directly from a formula, whereas in BERT the position embedding is learned. For example, with a maximum sequence length of 512, each sentence generates a one-dimensional array of position indices [0, 1, 2, …, 511], which is repeated for every item in the batch, so the actual input has shape [batch, seq_len]. These indices are then encoded through a lookup (equivalent to one-hot encoding followed by a matrix multiplication), exactly as for token embedding, and the final output has shape [batch, seq_len, d_model], the same dimensions as the token embedding output.
  • About Segment Embedding
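The way the three embeddings combine can be sketched as follows. This is a minimal illustration, not BERT's actual code: the table sizes, the `make_table` and `embed` helpers, and the random weights are all assumptions chosen to keep the example small (the real model uses 768 dimensions, learned weights, and applies layer normalization and dropout after the sum).

```python
import random

def make_table(num_rows, dim, seed):
    """Build a small random lookup table standing in for learned weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def embed(input_ids, segment_ids, token_table, position_table, segment_table):
    """Sum token, position, and segment embeddings for one sequence."""
    output = []
    for position, (token_id, segment_id) in enumerate(zip(input_ids, segment_ids)):
        tok = token_table[token_id]
        pos = position_table[position]   # row index is simply the position 0, 1, 2, ...
        seg = segment_table[segment_id]  # row 0 for sentence A, row 1 for sentence B
        output.append([t + p + s for t, p, s in zip(tok, pos, seg)])
    return output

HIDDEN = 8                                        # toy size (assumption)
token_table = make_table(30, HIDDEN, seed=0)      # toy vocabulary of 30 ids
position_table = make_table(16, HIDDEN, seed=1)   # toy maximum length of 16
segment_table = make_table(2, HIDDEN, seed=2)     # two segments: A and B

input_ids = [1, 7, 9, 2, 11, 2]       # hypothetical ids; id 2 plays the role of [SEP]
segment_ids = [0, 0, 0, 0, 1, 1]
vectors = embed(input_ids, segment_ids, token_table, position_table, segment_table)
print(len(vectors), len(vectors[0]))  # one vector of size HIDDEN per token
```

Note that the same token id at two different positions (id 2 above) ends up with two different vectors, because the position and segment terms differ.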

(Reference description: https://blog.csdn.net/u010099080/Article/details/102587954)

tokenization.py in the BERT source code handles preprocessing and tokenization. It contains two tokenizers, BasicTokenizer and WordpieceTokenizer, and FullTokenizer is the combination of the two: BasicTokenizer first produces a list of coarse tokens, then WordpieceTokenizer is applied to each of those tokens to get the final tokenization result.

BasicTokenizer is a preliminary tokenizer. For an input string, the process is roughly: convert to unicode -> remove odd characters -> handle Chinese (check whether a character is Chinese, and split Chinese text character by character) -> split on whitespace -> strip accents and split on punctuation -> split on whitespace again.

WordpieceTokenizer further splits each token produced by BasicTokenizer into subwords (continuation subwords start with ##); the vocabulary is introduced at this stage. This class has only two methods: an initializer, __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200), and a tokenization method, tokenize(self, text). tokenize(self, text) is the main tokenization method. The general idea is to split a word into multiple pieces from left to right, making each piece as long as possible. According to the source code, this method is called the greedy longest-match-first algorithm.
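The greedy longest-match-first idea can be sketched in a few lines. This is a simplified illustration rather than the BERT source itself: the function name `wordpiece_tokenize` and the toy vocabulary are assumptions, and the input is taken to be a single word already produced by the basic tokenizer.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    """Split one word into subwords by greedy longest-match-first."""
    if len(word) > max_input_chars_per_word:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(current)
        start = end
    return pieces

toy_vocab = {"jack", "##son", "##ville", "un", "##aff", "##able"}  # hypothetical vocabulary
print(wordpiece_tokenize("jacksonville", toy_vocab))  # ['jack', '##son', '##ville']
print(wordpiece_tokenize("unaffable", toy_vocab))     # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

If any position in the word cannot be matched, the whole word is mapped to the unknown token rather than being partially split.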




Building input samples

def convert_single_example(max_seq_length, tokenizer, text_a, text_b=None):
    # Note: _truncate_seq_pair is the helper defined alongside this function in the BERT source.
    tokens_a = tokenizer.tokenize(text_a)
    tokens_b = None
    if text_b:
        tokens_b = tokenizer.tokenize(text_b)  # Tokenize the second sentence, if there is one
    if tokens_b:
        # With two sentences, their combined length must not exceed max_seq_length - 3,
        # because [CLS], [SEP], [SEP] still have to be added.
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # With a single sentence, only [CLS] and [SEP] are added,
        # so the sentence must not exceed max_seq_length - 2.
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    # Convert to BERT's input. The type_ids below correspond to segment_ids in the source code.
    # (a) Two sentences:
    #   tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #   type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
    # (b) Single sentence:
    #   tokens:   [CLS] the dog is hairy . [SEP]
    #   type_ids: 0     0   0   0  0     0 0
    # "type_ids" distinguishes the first sentence from the second:
    # the first sentence is 0 and the second sentence is 1. During pre-training the
    # corresponding embedding is added to the token vectors. This is not strictly necessary,
    # because [SEP] already separates the two sentences, but type_ids make learning easier.

    tokens = ["[CLS]"]
    segment_ids = [0]
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)
    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)  # Convert tokens into vocabulary ids
    # Create the mask: 1 for real tokens, 0 for padding
    input_mask = [1] * len(input_ids)
    # Pad the inputs with 0 up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    # These correspond to the input_ids, input_mask, segment_ids parameters when creating a BERT model
    return input_ids, input_mask, segment_ids
  • Single-text classification tasks: the BERT model inserts a [CLS] symbol before the text and uses the output vector corresponding to that symbol as the semantic representation of the entire text.


  • Sentence-pair classification tasks: practical applications of this task include question answering (judging whether an answer matches a question) and sentence matching (whether two sentences express the same meaning). For this task, in addition to adding the [CLS] symbol and using its output as the semantic representation of the text, the BERT model separates the two input sentences with a [SEP] symbol and adds a different text (segment) vector to each of the two sentences.

  • Sequence labeling tasks: practical applications of this task include Chinese word segmentation and new-word discovery (labeling each character as the first, middle, or last character of a word), answer extraction (the start and end of the answer), etc. For this task, the BERT model uses the output vector of each token in the text to label (classify) that token. In the figure, B, I, and E denote the first character, an intermediate character, and the last character of a word.

Model output

To get the output of the BERT model, use one of two methods: model.get_sequence_output() and model.get_pooled_output().

output_layer = model.get_sequence_output()  # Output for every token: [batch_size, seq_length, hidden_size]
"""Gets final hidden layer of encoder.

float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the final hidden layer of the transformer encoder.
"""
output_layer = model.get_pooled_output()  # Output for the whole sentence: [batch_size, hidden_size]


The mask problem

In the Masked LM pre-training task, 15% of the tokens in the text are selected, replaced with the [MASK] token, and then predicted at the last layer. However, the [MASK] token never appears in downstream tasks, which creates a mismatch between pre-training and fine-tuning. To reduce the impact of this mismatch on the model, the selected 15% of tokens are handled as follows:

1. 80% of the tokens are replaced with the [MASK] token

2. 10% of the tokens are replaced with a random token

3. 10% of the tokens remain unchanged but still need to be predicted

[MASK] replacement (first point): this is the main part of Masked LM. It lets the model fuse truly bidirectional semantic information without leaking the label.

Random replacement (second point): the last layer must predict the true word at the randomly replaced position, and the model does not know which positions were randomly replaced. This forces the model to learn a representation of the global context at every token, which also helps BERT obtain better context-dependent word vectors (the most important property for handling words with multiple meanings).

Unchanged (third point): 10% of the selected tokens are leaked (15% * 10% = 1.5% of all tokens). This gives the model a certain bias, equivalent to an extra reward, that pulls the model's representation of a word toward the word's true representation. Here the input layer holds the true embedding of the word to be predicted, and the embedding at that position in the output layer is obtained after many layers of self-attention, so it still retains some information from the input embedding. That retained information is the extra reward brought by feeding in a proportion of true words, and it ultimately shifts the model's output vector toward the true embedding of the input layer. If only [MASK] were used, the model would only need to get the output-layer classification right and would not care about the vector representation at the output layer, which could lead to poor final output vectors.

Replacing 10% of the selected tokens with a random word is equivalent to a text error-correction task and gives the BERT model some error-correction ability. Leaving 10% of the selected tokens unchanged alleviates the input mismatch between pre-training and fine-tuning (during pre-training the input sentence contains [MASK], while during fine-tuning the input is a complete sentence).
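The 80% / 10% / 10% rule described above can be sketched as follows. This is an illustrative simplification, not the BERT pre-training code: the function name `mask_tokens`, the toy vocabulary, and the rounding of the number of selected positions are all assumptions.

```python
import random

def mask_tokens(tokens, vocab_words, rng, masked_lm_prob=0.15):
    """Apply the 80/10/10 masking rule; returns (masked tokens, {position: original})."""
    special = {"[CLS]", "[SEP]"}
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    rng.shuffle(candidates)
    num_to_predict = max(1, int(round(len(candidates) * masked_lm_prob)))
    output = list(tokens)
    labels = {}
    for index in candidates[:num_to_predict]:
        labels[index] = tokens[index]                # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            output[index] = "[MASK]"                 # 80%: replace with [MASK]
        elif roll < 0.9:
            output[index] = rng.choice(vocab_words)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return output, labels

tokens = ["[CLS]", "the", "dog", "is", "hairy", ".", "[SEP]"]
vocab_words = ["the", "dog", "cat", "is", "hairy", "smooth", "."]  # hypothetical vocabulary
masked, labels = mask_tokens(tokens, vocab_words, random.Random(0))
print(masked)
print(labels)
```

In all three cases the selected position is recorded in `labels`, so the prediction loss is computed on it whether or not the visible token was changed.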



Model structure

About the attention mechanism

The attention mechanism takes the semantic vectors of the target word and of its context words as input. It first applies linear transformations to obtain the Query vector of the target word, the Key vector of each context word, and the original Value vectors of the target word and the context words. It then computes the similarity between the Query vector and each Key vector as weights, and uses those weights to combine the Value vector of the target word with the Value vectors of the context words. The weighted combination is the output of attention, i.e. the enhanced semantic vector of the target word.

Self-Attention: for the input text, we need to enhance the semantic vector of every word, so we take each word in turn as the Query and form a weighted combination of the semantic information of all words in the text, obtaining an enhanced semantic vector for each word. In this case the Query, Key, and Value vector representations all come from the same input text, so this attention mechanism is also called self-attention.
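The computation just described can be sketched in plain Python. This is a toy illustration under stated assumptions: the similarity is a dot product scaled by the square root of the key dimension and normalized with softmax, as in the Transformer formulation, and the projection matrices are identity matrices purely to keep the numbers easy to follow. The helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape [seq_len, d]."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    # scores[i][j]: similarity between the Query of word i and the Key of word j
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of all Value vectors.
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]       # toy projection weights (assumption)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three "words", two dimensions each
output, weights = self_attention(x, identity, identity, identity)
print(len(output), len(output[0]))        # 3 2
```

The output has the same shape as the input: one enhanced vector per word, each a mixture of the Value vectors of every word in the sequence.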


Multi-head Self-Attention: to increase the diversity of attention, the authors of the paper use several different self-attention modules to obtain enhanced semantic vectors of each word in different semantic spaces, and then linearly combine the multiple enhanced semantic vectors of each word. The result is a final enhanced semantic vector with the same length as the original word vector.

Using multiple attention heads extracts information from different angles and makes the information extraction more comprehensive.


[The input and output of multi-head self-attention have exactly the same form: the input is the original vector representation of each word in the text, and the output is the enhanced vector representation of each word after the semantic information of the full text has been blended in.]

Transformer Encoder: three key operations are added on top of multi-head self-attention:

  • Residual connection: adds the input of the module directly to its output to form the final output. The basic consideration behind this operation is that modifying the input is much easier than reconstructing the entire output ("adding flowers to brocade" is much easier than "sending charcoal in the snow"!). This makes the network easier to train and mitigates the vanishing-gradient problem that arises when the network has many layers.
  • Layer Normalization: normalizes the nodes of a given layer to zero mean and unit variance.

    The difference between layer normalization and batch normalization:

    The dimensions over which the statistics are computed differ. BN computes them over the same channel across the different samples in a batch. LN computes them over the different features of a single sample.

  • Linear transformation: applies two linear transformations to the enhanced semantic vector of each word to increase the expressive power of the whole model. The transformed vector has the same length as the original vector.
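The residual connection and layer normalization steps can be sketched together. This is a minimal illustration with assumptions labeled: the helper names are hypothetical, the sublayer output is a hand-written stand-in for an attention result, and the learned scale and shift parameters that real layer normalization includes are omitted.

```python
import math

def layer_norm(vector, eps=1e-12):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization, applied per token."""
    return [
        layer_norm([a + b for a, b in zip(inp, out)])
        for inp, out in zip(sublayer_input, sublayer_output)
    ]

x = [[1.0, 2.0, 3.0, 4.0], [2.0, 0.0, -2.0, 4.0]]               # two tokens, four features
sublayer_out = [[0.5, 0.5, 0.5, 0.5], [1.0, -1.0, 1.0, -1.0]]   # stand-in for attention output
y = add_and_norm(x, sublayer_out)
print(y)
```

The statistics are computed across the features of each token separately, which is the layer-normalization side of the comparison with batch normalization: no other sample in the batch affects the result.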



[The input and output of the Transformer Encoder also have exactly the same form, so the Transformer Encoder can likewise be seen as converting the semantic vector of each word in the input text into an enhanced semantic vector of the same length.]

BERT uses the Transformer encoder to encode its input. When encoding a token, the self-attention mechanism in the encoder uses both the token and its context, and "using the context" is where the bidirectionality shows: the context on both sides of the token is taken into account.