Instructions for Pytorch BertModel

  • 2021-10-15 11:01:39
  • OfStack

Basic introduction

Environment: Python 3.5+, PyTorch 0.4.1/1.0.0

Installation:


pip install pytorch-pretrained-bert

Required parameters:

--data_dir: "str": The data root directory contains three data files, train. xxx/dev. xxx/test. xxx.

--vocab_dir: "str": Thesaurus file address.

--bert_model: "str": The pre-trained model of bert is stored. It needs to be an gz file, such as ".. x/xx/bert-base-chinese. tar. gz", which contains an bert_config. json and pytorch_model. bin file.

--task_name: "str": The parameter used to select the corresponding data set, such as "cola", corresponds to the data set.

--output_dir: "str": Model prediction results and model parameter storage directory.

Simple example:

Import the required packages


import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

Create the tokenizer


tokenizer = BertTokenizer.from_pretrained(vocab_dir)  # vocab_dir holds the value of --vocab_dir

Required parameter: --vocab_dir (the vocabulary file described above).

Available methods:

tokenize: takes a sentence and splits it into tokens according to --vocab_dir using a greedy longest-match strategy; returns a list of tokens.

convert_tokens_to_ids: converts the token list into the corresponding list of vocabulary ids.

convert_ids_to_tokens: converts a list of ids back into a list of tokens.


text = '[CLS] 武松打老虎 [SEP] 你在哪 [SEP]'  # "[CLS] Wu Song fights the tiger [SEP] Where are you [SEP]"
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
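
For completeness, convert_ids_to_tokens (listed above but not used in the snippet) maps the ids back to the token list, e.g.:


print(tokenizer.convert_ids_to_tokens(indexed_tokens))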

Note that the special tokens ([CLS]/[SEP]) are not tokenized correctly here (they get split into pieces), and the Chinese BERT vocabulary is character-level, so the sentence is split into individual characters:


['[', 'cl', '##s', ']', '武', '松', '打', '老', '虎', '[', 'sep', ']', '你', '在', '哪', '[', 'sep', ']']
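
One way to avoid the mangled special tokens (a sketch, not part of the original article) is to tokenize only the raw sentences and add [CLS]/[SEP] as whole tokens yourself:


tokens = ['[CLS]'] + tokenizer.tokenize('武松打老虎') + ['[SEP]'] \
       + tokenizer.tokenize('你在哪') + ['[SEP]']
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
segments_ids = [0] * 7 + [1] * 4  # [CLS] + sentence 1 + [SEP] = 7 tokens, sentence 2 + [SEP] = 4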

Create the BERT model and load the pre-trained weights:


model = BertModel.from_pretrained(bert_model)  # bert_model holds the value of --bert_model

Move everything to the GPU:


tokens_tensor = tokens_tensor.cuda()
segments_tensors = segments_tensors.cuda()
model.cuda()

Forward propagation:


encoded_layers, pooled_output= model(tokens_tensor, segments_tensors)

Parameters:

input_ids: (batch_size, seq_len): the Tensor of input token ids.

token_type_ids=None: (batch_size, seq_len): an input instance can contain two sentences; this marks which sentence each token belongs to (0 or 1).

attention_mask=None: (batch_size, seq_len): marks the valid (non-padding) positions so that attention ignores padding.

output_all_encoded_layers=True: controls whether the outputs of all encoder layers are returned.

Return values:

encoded_layers: a list of length num_hidden_layers whose elements are Tensors of shape (batch_size, sequence_length, hidden_size); with output_all_encoded_layers=False only the last layer's Tensor is returned.

pooled_output: (batch_size, hidden_size): the hidden state of the first token [CLS] in the last encoder layer, passed through a Linear layer and a Tanh() activation; it represents the whole sentence.
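
A minimal sketch tying these parameters together, reusing the tensors built above (eval mode and no_grad are added here for inference and are not in the original snippet):


model.eval()
attention_mask = torch.ones_like(tokens_tensor)  # no padding in this single-example batch
with torch.no_grad():
    last_layer, pooled_output = model(tokens_tensor, segments_tensors,
                                      attention_mask=attention_mask,
                                      output_all_encoded_layers=False)
print(last_layer.size())     # (batch_size, sequence_length, hidden_size)
print(pooled_output.size())  # (batch_size, hidden_size)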

Supplement: using BERT with PyTorch (transformers)

It is mainly divided into the following steps:

Download the pre-trained model and put it in a local directory.

Use BertModel and BertTokenizer from transformers to load the model and the tokenizer.

Use the tokenizer's encode and decode functions to encode and decode text, paying attention to the add_special_tokens and skip_special_tokens parameters.

The input of forward is a tensor of shape [batch_size, seq_length]; pay attention to the attention_mask parameter.

The output is a tuple; its first element is the hidden state of BERT's last transformer layer, with size [batch_size, seq_length, hidden_size], i.e. BERT's final output, which is fed to downstream tasks.


# -*- encoding: utf-8 -*-
import warnings
warnings.filterwarnings('ignore')
from transformers import BertModel, BertTokenizer, BertConfig
import os
from os.path import dirname, abspath
root_dir = dirname(dirname(dirname(abspath(__file__))))
import torch
# Download the pre-trained model from the official site and put it in this directory
pretrained_path = os.path.join(root_dir, 'pretrained/bert_zh')
# Load the bert model from the files
model = BertModel.from_pretrained(pretrained_path)
# Load the vocabulary from the bert directory
tokenizer = BertTokenizer.from_pretrained(pretrained_path)
print(f'vocab size :{tokenizer.vocab_size}')
# Encode '[PAD]' and '[SEP]'
print(tokenizer.encode('[PAD]'))
print(tokenizer.encode('[SEP]'))
# Encode a Chinese sentence; by default special tokens are added:
# [CLS] at the beginning of the sentence and [SEP] at the end
ids = tokenizer.encode("我是中国人", add_special_tokens=True)
# In the result, 101 is the id of [CLS] and 2769 is the id of "我" ("I")
# [101, 2769, 3221, 704, 1744, 782, 102]
print(ids)
# Decode the ids back to Chinese; special tokens are not skipped by default
print(tokenizer.decode([101, 2769, 3221, 704, 1744, 782, 102], skip_special_tokens=False))
# print(model)
inputs = torch.tensor(ids).unsqueeze(0)
# forward; result is a tuple, the first tensor is the last hidden-state
result = model(inputs)
# [1, 7, 768]
print(result[0].size())
# [1, 768]
print(result[1].size())
for name, parameter in model.named_parameters():
  # Print the name of every parameter in every layer
  print(name)
  # All parameters have requires_grad=True by default, i.e. they are trainable
  print(parameter.requires_grad)
  # To train only the parameters of transformer layer 11 (the last layer):
  if '11' in name:
    parameter.requires_grad = True
  else:
    parameter.requires_grad = False
print([p.requires_grad for name, p in model.named_parameters()])
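
After freezing everything except layer 11 as above, only the still-trainable parameters need to be passed to the optimizer; a short sketch (the optimizer and learning rate are just example choices):


from torch.optim import Adam
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = Adam(trainable_params, lr=2e-5)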

To add the attention_mask:

Where 101 is [CLS], 102 is [SEP], and 0 is [PAD]


>>> a
tensor([[101,  3,  4, 23, 11,  1, 102,  0,  0,  0]])
>>> notpad = a!=0
>>> notpad
tensor([[ True, True, True, True, True, True, True, False, False, False]])
>>> notcls = a!=101
>>> notcls
tensor([[False, True, True, True, True, True, True, True, True, True]])
>>> notsep = a!=102
>>> notsep
tensor([[ True, True, True, True, True, True, False, True, True, True]])
>>> mask = notpad & notcls & notsep
>>> mask
tensor([[False, True, True, True, True, True, False, False, False, False]])
>>> 
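
Note that for the model's attention_mask parameter usually only padding is masked out ([CLS]/[SEP] are kept); the combined mask above, which also drops [CLS]/[SEP], is closer to a token-level loss mask. A sketch of building a padding mask for a batch of two sentences, assuming the model and tokenizer from the listing above:


batch = [tokenizer.encode("我是中国人", add_special_tokens=True),
         tokenizer.encode("你在哪", add_special_tokens=True)]
max_len = max(len(ids) for ids in batch)
# 0 is the id of [PAD], so pad with zeros up to the longest sequence
input_ids = torch.tensor([ids + [0] * (max_len - len(ids)) for ids in batch])
attention_mask = (input_ids != 0).long()
outputs = model(input_ids, attention_mask=attention_mask)
print(outputs[0].size())  # [2, max_len, 768]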
