BERT model input and output
BERT's input is the sum of three embeddings: token embeddings (word vectors), position embeddings (position vectors), and segment embeddings (sentence/segment vectors).
- About Token Embeddings
Token embeddings map each token to a fixed-dimension vector; in BERT each token is represented as a 768-dimensional vector. Before the input text is fed to the token embedding layer, it must first be tokenized. The tokenization method is WordPiece. Two special tokens are then inserted into the tokenized result: [CLS] at the beginning and [SEP] at the end.
- About Position Embeddings
Position embeddings encode each token's position in the sequence, so that the model can distinguish identical tokens appearing at different positions; in BERT these are learned rather than the fixed sinusoidal encodings of the original Transformer.
- About Segment Embeddings
Segment embeddings distinguish the two sentences of an input pair: tokens of the first sentence get segment id 0 and tokens of the second get segment id 1.
(Reference: https://blog.csdn.net/u010099080/Article/details/102587954)
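The three embeddings above are summed element-wise to form the model input. A minimal numpy sketch, using small random stand-in tables (the table sizes and ids here are assumptions for illustration, not trained BERT weights):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, MAX_POS, NUM_SEGMENTS, HIDDEN = 1000, 512, 2, 768

# Random stand-ins for the three learned embedding tables
token_table = rng.normal(size=(VOCAB_SIZE, HIDDEN))
position_table = rng.normal(size=(MAX_POS, HIDDEN))
segment_table = rng.normal(size=(NUM_SEGMENTS, HIDDEN))

def bert_input_embedding(input_ids, segment_ids):
    """Sum token, position and segment embeddings element-wise."""
    positions = np.arange(len(input_ids))
    return (token_table[input_ids]
            + position_table[positions]
            + segment_table[segment_ids])

# e.g. a 4-token sequence, all in segment 0 (ids are made up)
emb = bert_input_embedding([2, 15, 37, 3], [0, 0, 0, 0])
print(emb.shape)  # (4, 768)
```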
In the BERT source code, tokenization.py handles text preprocessing with two tokenizers: BasicTokenizer and WordpieceTokenizer. FullTokenizer is a combination of the two: BasicTokenizer first produces a list of coarse tokens, then each token is passed through WordpieceTokenizer to obtain the final subwords.
BasicTokenizer is a preliminary tokenizer. For an input string, the process is roughly: convert to unicode -> remove various strange characters -> handle Chinese (check whether each character is Chinese; Chinese text is split character by character) -> split on whitespace -> remove accents and extra characters and split on punctuation -> split on whitespace again.
WordpieceTokenizer further splits the tokens produced by BasicTokenizer into subwords (non-initial subwords start with ##); the vocabulary is introduced at this stage. The class has only two methods: an initializer __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200) and a tokenization method tokenize(self, text).
tokenize(self, text): this is the main tokenization method. The rough idea is to split a word into multiple subwords from left to right, making each subword as long as possible. In the source code's words, this is the greedy longest-match-first algorithm.
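The greedy longest-match-first idea can be sketched as follows (a simplified standalone version with a toy vocabulary; the real WordpieceTokenizer also handles whitespace splitting and works on already-tokenized input):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=200):
    """Greedy longest-match-first subword splitting, following the
    idea in BERT's WordpieceTokenizer (simplified sketch)."""
    if len(word) > max_chars:
        return [unk_token]
    sub_tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_substr = None
        # Try the longest remaining span first, shrinking from the right
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr  # non-initial subwords get the ## prefix
            if substr in vocab:
                cur_substr = substr
                break
            end -= 1
        if cur_substr is None:  # no subword matched: unknown word
            return [unk_token]
        sub_tokens.append(cur_substr)
        start = end
    return sub_tokens

# Toy vocabulary (an assumption for illustration)
vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```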
The preprocessing routine that converts a single example into BERT inputs (from the BERT repo, with comments translated):

```python
def convert_single_example(max_seq_length, tokenizer, text_a, text_b=None):
    tokens_a = tokenizer.tokenize(text_a)
    tokens_b = None
    if text_b:
        tokens_b = tokenizer.tokenize(text_b)

    if tokens_b:
        # If there is a second sentence, the total length of the two
        # sentences must be less than max_seq_length - 3, because we
        # need room for [CLS], [SEP], [SEP]
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # With a single sentence we only need [CLS] before and [SEP]
        # after, so the sentence must be at most max_seq_length - 2
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    # Convert to BERT's input. Note: type_ids below corresponds to
    # segment_ids in the source code.
    # (a) Two sentences:
    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
    # (b) Single sentence:
    #  tokens:   [CLS] the dog is hairy . [SEP]
    #  type_ids: 0     0   0   0  0     0 0
    #
    # "type_ids" distinguish the first sentence from the second:
    # 0 for the first sentence, 1 for the second. During pre-training
    # they are added to the token embeddings. Strictly speaking this
    # is not required, since [SEP] already separates the two sentences,
    # but type_ids make it easier for the model to learn.
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)
    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)  # tokens -> ids
    # The mask has 1 for real tokens and 0 for padding
    input_mask = [1] * len(input_ids)
    # Zero-pad up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    # These correspond to the input_ids, input_mask, segment_ids
    # parameters used when creating a BERT model
    return input_ids, input_mask, segment_ids
```
- For text classification tasks, the BERT model inserts a [CLS] symbol before the text and uses the output vector corresponding to that symbol as the semantic representation of the entire text.
- Sentence pair classification tasks: practical application scenarios include question answering (judging whether a question and an answer match) and sentence matching (whether two sentences express the same thing). For this task, besides adding the [CLS] symbol and using its output as the semantic representation, the BERT model also separates the two input sentences with a [SEP] symbol and attaches two different segment embeddings to the two sentences respectively.
- Sequence labeling tasks: practical application scenarios include Chinese word segmentation & new-word discovery (labeling whether each character is the first, middle, or last character of a word) and answer extraction (locating the start and end of the answer). For this task, the BERT model uses the output vector of every token in the text for classification, as shown in the following figure (B, I, E represent the first character, middle character, and last character of a word).
To get the output of the BERT model, use:

```python
# Per-token output, shape [batch_size, seq_length, hidden_size].
# Per the source docstring: "Gets final hidden layer of encoder" --
# a float Tensor of shape [batch_size, seq_length, hidden_size]
# corresponding to the final hidden state of the transformer encoder.
output_layer = model.get_sequence_output()

# Sentence-level (pooled) output: the [CLS] position's hidden state
# after a dense layer, shape [batch_size, hidden_size]
output_layer = model.get_pooled_output()
```
In the Masked LM pre-training task, the [MASK] token replaces 15% of the words in the text, and the model predicts them at the last layer. However, [MASK] never appears in downstream tasks, causing a mismatch between pre-training and fine-tuning. To reduce the impact of this inconsistency on the model, within that 15% of the corpus:
1. 80% of the tokens are replaced with the [MASK] token
2. 10% of the tokens are replaced with a random token
3. 10% of the tokens remain unchanged but must still be predicted
The replacement in point 1 is the main part of Masked LM; it lets the model fuse true bidirectional semantic information without leaking the label.
The random replacement in point 2: because the last layer must predict the true word at a randomly replaced position, and the model does not know which positions have been randomly replaced, it is forced to learn a representation of the global context at every position. This also helps BERT obtain better context-dependent word representations (the key property for handling polysemy).
Point 3, keeping 10% unchanged: this leaks the true word (15% * 10% = 1.5% of all tokens), which gives the model a useful bias, effectively an extra reward that pulls the model's representation of a word toward its true representation. At such a position, the input layer receives the true embedding of the word to be predicted, and the output-layer embedding at that position, obtained after many self-attention layers, still retains part of the input embedding's information; this retained part is the extra reward brought by keeping a certain proportion of tokens unchanged, and it ultimately shifts the model's output vectors toward the true embeddings of the input. If [MASK] were always used, the model would only need to get the output-layer classification right and would not care about the quality of the output-layer vector representations, which could degrade the final vectors.
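The 80/10/10 corruption scheme can be sketched as follows (a simplified illustration; the function name, toy vocabulary, and seeding are assumptions, not BERT's actual TensorFlow implementation):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Masked LM corruption sketch: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random token, 10% kept unchanged.
    `vocab` is a list of tokens to draw random replacements from."""
    rng = random.Random(seed)
    output = list(tokens)
    labels = [None] * len(tokens)  # None = position is not predicted
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # never corrupt special tokens
        if rng.random() < mask_prob:
            labels[i] = token  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                output[i] = "[MASK]"           # 80%: replace with [MASK]
            elif r < 0.9:
                output[i] = rng.choice(vocab)  # 10%: random token
            # else: 10% keep the original token unchanged
    return output, labels

tokens = ["[CLS]"] + ["w%d" % i for i in range(20)] + ["[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab=["a", "b", "c"], seed=0)
```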
About the Attention mechanism
The attention mechanism takes the semantic vectors of a target word and its context words as input. Via linear transformations it first obtains the Query vector of the target word, the Key vectors of the context words, and the original Value vectors of the target word and context words. It then computes similarity weights between the Query vector and each Key vector, and finally forms a weighted sum of the target word's Value vector and each context word's Value vector as the attention output, i.e., an enhanced semantic representation of the target word.
Self-Attention: for an input text, we want to enhance the semantic vector of every word, so each word serves as a Query and aggregates the semantic information of all words in the text, yielding an enhanced semantic vector for each word, as shown in the figure below. In this case, the Query, Key, and Value vector representations all come from the same input text, so the mechanism is called Self-Attention.
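The Query/Key/Value computation described above can be sketched in numpy as follows (a single-head sketch with random weight matrices, which are assumptions; scaled dot-product similarity is used, as in the Transformer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every word's vector acts as a Query
    and attends over all words' Keys, then takes a weighted sum of Values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # similarity of each Query to each Key
    return weights @ V                          # weighted sum of the Value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))         # one vector per word
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 16) -- one enhanced vector per word
```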
Multi-head Self-Attention: to enhance the diversity of attention, the paper's authors further use multiple Self-Attention modules to obtain, for each word, enhanced semantic vectors in different semantic spaces, and then linearly combine each word's multiple enhanced vectors into a final enhanced semantic vector of the same length.
Multiple attention heads extract information from different angles, improving the comprehensiveness of information extraction.
[Multi-head Self-Attention's input and output are identical in form: the input is the original semantic vector of each word in the text, and the output is each word's enhanced vector after fusing the semantic information of the full text.]
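A minimal numpy sketch of the multi-head combination (per-head projections and the output mixing matrix are random assumptions; real implementations batch the heads into single matrix multiplies):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, heads, Wo):
    """Each head runs its own self-attention in a smaller subspace;
    the head outputs are concatenated and linearly mixed by Wo,
    so the output length equals the input length."""
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        w = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outs.append(w @ V)
    concat = np.concatenate(outs, axis=-1)  # (seq_len, num_heads * d_head)
    return concat @ Wo                      # back to (seq_len, d_model)

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 4, 12, 3
d_head = d_model // num_heads
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
Wo = rng.normal(size=(num_heads * d_head, d_model))
X = rng.normal(size=(seq_len, d_model))
out = multi_head_self_attention(X, heads, Wo)
print(out.shape)  # (4, 12) -- same form as the input
```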
The Transformer Encoder adds three key operations on top of Multi-head Self-Attention:
- Residual connection: the module's input is added directly to its output as the final output. One basic intuition behind this operation is that modifying the input is much easier than reconstructing the entire output from scratch (improving something that already works is far easier than building it from nothing). This makes the network easier to train and mitigates the vanishing-gradient problem as the number of layers grows.
- Layer Normalization: normalize each layer's activations to zero mean and unit variance.
The difference between Layer Normalization and Batch Normalization: the dimensions over which the statistics are computed differ. BN normalizes the same channel across the different samples of a batch, while LN normalizes across the different features of a single sample.
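The axis difference between the two normalizations can be shown concretely (a sketch without the learned gain/bias parameters that real LN/BN layers have, and with an assumed activation shape of (batch, seq_len, hidden)):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(2, 3, 4))  # (batch, seq_len, hidden)

# LayerNorm: normalize each token's feature vector (last axis),
# independently for every sample and position.
ln = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

# BatchNorm: normalize each feature/channel across the batch
# (here across batch and sequence positions).
bn = (x - x.mean(axis=(0, 1), keepdims=True)) / x.std(axis=(0, 1), keepdims=True)

print(np.allclose(ln.mean(axis=-1), 0))     # True: per-token mean ~ 0
print(np.allclose(bn.mean(axis=(0, 1)), 0)) # True: per-feature mean ~ 0
```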
- Linear transformation: two linear transformations are applied to each word's enhanced semantic vector to increase the expressive power of the model. The transformed vectors have the same length as the original vectors.
[The Transformer Encoder's input and output are still identical in form, so the Transformer Encoder can likewise transform the semantic vector of each word in the input text into an enhanced semantic vector of the same length.]
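Putting the residual connection, layer normalization, and the two linear transformations together, one encoder sub-layer can be sketched as follows (names and sizes are hypothetical; real BERT wraps both the attention and feed-forward sub-layers this way and uses learned gain/bias in LayerNorm and GELU instead of ReLU):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each vector to zero mean and unit variance (no learned params)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between; the second
    maps back to the original width, so the shape is preserved."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_sublayer(x, sublayer):
    """Residual connection + layer normalization around a sublayer."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
y = encoder_sublayer(x, lambda h: feed_forward(h, W1, b1, W2, b2))
print(y.shape)  # (4, 8) -- same shape as the input
```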