How to Create a Transformer Architecture Model for Natural Language Processing


The Data Science Lab


The goal is to create a model that accepts a sequence of words such as “The man ran through the {blank} door” and then predicts the words most likely to fill in the blank.

This article explains how to create a transformer architecture model for natural language processing. Specifically, the goal is to create a model that accepts a sequence of words such as “The man ran through the {blank} door” and then predicts the words most likely to fill in the blank.

Transformer architecture (TA) models such as BERT (bidirectional encoder representations from transformers) and GPT (generative pre-trained transformer) have revolutionized natural language processing (NLP). But TA systems are extremely complex, and implementing them from scratch can take hundreds or even thousands of man-hours. The Hugging Face (HF) library is open source code that contains pre-trained TA models and a set of APIs for working with the models. The HF library makes implementing NLP systems that use TA models much less difficult.

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo program is an example of “fill in the blank.” The source sentence is “The man ran through the {blank} door” and the goal is to determine reasonable words for the {blank}.

Figure 1: Fill-in-the-Blank Using the Hugging Face Code Library

The demo program begins by loading a pre-trained DistilBERT language model into memory. DistilBERT is a condensed version of the huge BERT language model. The source sentence is fed to a tokenizer object, which breaks the sentence into words/tokens and assigns an integer ID to each token. For example, one of the tokens is “man” and its ID is 1299, and the token that represents the blank word is [MASK] and its ID is 103.

The token IDs are fed to the DistilBERT model, and the model computes the likelihoods of 28,996 possible words/tokens to fill in the blank. The top five candidates for the blank in “The man ran through the {blank} door” are: “front”, “bathroom”, “kitchen”, “back” and “garage”.

This article assumes you have intermediate or better skill with a C-family programming language, preferably Python, and basic familiarity with PyTorch, but it does not assume you know anything about the Hugging Face code library. The complete source code for the demo program is presented in this article, and the code is also available in the accompanying download file.

To run the demo program, you must have Python, PyTorch and HF transformers installed on your machine. The demo programs were developed on Windows 10 using the Anaconda 2020.02 64-bit distribution (which contains Python 3.7.6), PyTorch version 1.8.0 for CPU installed via pip, and HF transformers version 4.11.3. Installation is not trivial. You can find detailed step-by-step installation instructions for PyTorch in my blog post. Installing the HF transformers library is relatively simple: you can just run the shell command “pip install transformers”.
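
As a quick sanity check after installation, you can verify that the libraries load and display their version strings from a short Python script or an interactive shell (this snippet is not part of the demo program):

import numpy as np
import torch as T
import transformers

print(np.__version__)            # NumPy version
print(T.__version__)             # PyTorch -- the demo used 1.8.0 (CPU)
print(transformers.__version__)  # HF transformers -- the demo used 4.11.3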

The Complete Demo Program

The complete code for the demo program, with a few minor edits to save space, is presented in Listing 1. I indent with two spaces rather than the standard four spaces. The backslash character is used for line continuation to break down long statements.

The demo program imports three libraries:

import numpy as np
import torch as T
from transformers import AutoModelForMaskedLM, AutoTokenizer

The Hugging Face transformers library can work with either the PyTorch (torch) or TensorFlow deep neural network libraries. The demo uses PyTorch. Technically, the NumPy library isn't required to use HF transformers, but in practice most programs will use NumPy.

Listing 1: The Fill-in-the-Blank Demo Program

# fill_blank_hf.py
# Anaconda 2020.02 (Python 3.7.6)
# PyTorch 1.8.0  HF 4.11.3  Windows 10

import numpy as np
import torch as T
from transformers import AutoModelForMaskedLM, AutoTokenizer

def main():
  print("nBegin fill--blank using Transformer Architecture ")

  print("nLoading DistilBERT language model into memory ")
  model = 
    AutoModelForMaskedLM.from_pretrained('distilbert-base-cased')
  toker = 
    AutoTokenizer.from_pretrained('distilbert-base-cased')

  sentence = "The man ran through the {BLANK} door."

  print("nThe source fill-in-the-blank sentence is: ")
  print(sentence)

  sentence = f"The man ran through the {toker.mask_token} door."

  print("nConverting sentence to token IDs ")
  inpts = toker(sentence, return_tensors="pt")
  # inpts = toker(sentence, return_tensors=None)
  print(inpts)

  # {'input_ids': tensor([[ 101, 1109, 1299, 1868, 1194, 1103,
  #                         103, 1442,  119,  102]]),
  #  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

  print("nIDs and their tokens: ")
  n = len(inpts['input_ids'][0])  # 10
  for i in range(n):
    id = inpts['input_ids'][0][i]
    w = toker.decode(id)
    print("%6d %s " % (id, w))

  print("nComputing all 28,996 output possibilities ")
  blank_id = toker.mask_token_id  # ID of {blank} = 103
  blank_id_idx = T.where(inpts['input_ids'] == blank_id)[1]

  # with T.no_grad():
  #   all_logits = model(**inpts).logits  # shortcut form
  ids = inpts['input_ids']
  mask = inpts['attention_mask']
  with T.no_grad():
    output = model(ids, mask)
  # print(output)

  all_logits = output.logits  # [1, 10, 28996] 
  pred_logits = all_logits[0, blank_id_idx, :]  # [1, 28996]

  print("nExtracting IDs of top five predicted words: ")
  top_ids = T.topk(pred_logits, 5, dim=1).indices[0].tolist()
  print(top_ids)

  print("nThe top five predicteds as words: ")
  for id in top_ids:
    print(toker.decode([id]))

  print("nConverting logit outputs to pseudo-probabilities ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  pred_probs = T.softmax(pred_logits, dim=1).numpy()
  pred_probs = np.sort(pred_probs[0])[::-1]  # high p to low p
  top_probs = pred_probs[0:5]
  print("nThe top five corresponding pseudo-probabilities: ")
  print(top_probs)
  # [0.1689  0.0630  0.0447  0.0432  0.0323]

  print("nEnd fill-in-the-blank demo ")
  return  # end-of-function main()

if __name__ == "__main__":
  main()

The demo has a single main() function and no helper functions. The demo begins with:

def main():
  print("\nBegin fill-in-the-blank using Transformer Architecture ")
  print("\nLoading DistilBERT language model into memory ")
  model = \
    AutoModelForMaskedLM.from_pretrained('distilbert-base-cased')
  toker = \
    AutoTokenizer.from_pretrained('distilbert-base-cased')
. . .

The HF library contains many different transformer architecture language models. The demo loads the distilbert-base-cased model (65 million weights) into memory. Examples of other models include bert-large-cased (335 million weights, trained using Wikipedia articles and text from books) and gpt2-medium (345 million weights). The first time you run the program, the code reaches out over the Internet and downloads the model. On subsequent runs of the program, the code will use the cached version of the model. On Windows systems, cached HF models are stored by default at C:\Users\(user)\.cache\huggingface\transformers.
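
If you want to experiment with one of the other masked-language models, only the checkpoint name string changes. A minimal sketch (not part of the demo) that loads the much larger bert-large-cased model and its tokenizer instead:

  # same pattern as the demo, just a different pre-trained checkpoint
  model = \
    AutoModelForMaskedLM.from_pretrained('bert-large-cased')
  toker = \
    AutoTokenizer.from_pretrained('bert-large-cased')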

In general, each HF model has its own associated tokenizer that breaks the source sequence text into tokens. This is different from earlier language systems, which often use a generic tokenizer such as spaCy. Therefore, the demo loads the distilbert-base-cased tokenizer.

Tokenization
Breaking an NLP source phrase/sequence into words/tokens is much trickier than you might expect if you're new to NLP. The demo sets up a source text sequence and tokenizes it like so:

  sentence = "The man ran through the {BLANK} door."
  print("The source fill-in-the-blank sentence is: ")
  print(sentence)

  sentence = f"The man ran through the {toker.mask_token} door."

  print("Converting sentence to token IDs ")
  inpts = toker(sentence, return_tensors="pt")
  print(inpts)

The “f” in front of the source string, combined with the {toker.mask_token} variable, is the relatively new (Python 3.6) “f-string” syntax for formatting strings. The source string is passed to the toker tokenizer object with a return_tensors=“pt” argument. The “pt” means to return the tokenized information as PyTorch tensors rather than the default Python lists. The idea here is that the tokenized information will be fed to the DistilBERT model, which requires PyTorch tensors, so if you omit the return_tensors=“pt” argument, you'd have to convert the results to PyTorch tensors later.
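
As a minimal illustration (not part of the demo), if you call the tokenizer without the return_tensors argument, the input_ids come back as an ordinary Python list and you'd have to build the tensors yourself:

  inpts = toker(sentence)            # no return_tensors argument
  print(type(inpts['input_ids']))    # <class 'list'>
  ids  = T.tensor([inpts['input_ids']])       # shape [1, 10]
  mask = T.tensor([inpts['attention_mask']])  # shape [1, 10]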

The return result of calling the toker method with return_tensors=“pt” is:

{'input_ids': tensor([[ 101, 1109, 1299, 1868,
                       1194, 1103,  103, 1442,
                        119,  102]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1,
                            1, 1, 1, 1, 1]])}

The input_ids field contains the integer IDs of each token. The attention_mask field tells the system which tokens to use (1) or ignore (0). In this example, the demo uses all tokens.
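
The attention mask matters more when you tokenize a batch of sequences that have different lengths. A minimal sketch (not part of the demo) that pads the shorter sequence and produces 0 values in its attention mask:

  batch = toker(["The man ran.",
    "The man ran through the door."],
    padding=True, return_tensors="pt")
  print(batch['attention_mask'])  # shorter row ends with 0s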

The demo shows how the source text sequence was tokenized with these statements:

  print("IDs and their tokens: ")
  n = len(inpts['input_ids'][0])  # 10
  for i in range(n):
    id = inpts['input_ids'][0][i]
    w = toker.decode(id)
    print("%6d %s " % (id, w))

The for-loop iterates through each of the 10 token IDs and displays the associated word/token using the decode() method. The results are:

   101 [CLS]
  1109 The
  1299 man
  1868 ran
  1194 through
  1103 the
   103 [MASK]
  1442 door
   119 .
   102 [SEP]

The [CLS] token stands for “classifier” and is used internally. The [SEP] token means separator.

This example is a bit misleading because each word in the source sentence produces a single token. But tokenization doesn't always work that way. For example, if the source sequence were “The man floogled”, the fake word “floogled” would be tokenized as:

  1109 The
  1299 man
 22593 fl
  5658 ##oo
  8384 ##gled

The point is that in casual usage it's common to use terms such as source “sentence” and tokenized “words”, but it's more accurate to use the terms sequence (instead of sentence) and tokens (instead of words).
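
If you want to see how a particular word gets split up without running the full model, the tokenizer's tokenize() method shows the sub-word pieces directly. A minimal sketch (not part of the demo):

  print(toker.tokenize("The man floogled"))
  # something like: ['The', 'man', 'fl', '##oo', '##gled']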

Feeding the Tokens to the Model
The demo prepares to feed the tokenized source sequence to the model with these statements:

  print("Computing all 28,996 output possibilities ")
  blank_id = toker.mask_token_id  # ID of {blank} = 103
  blank_id_idx = T.where(inpts['input_ids'] == blank_id)[1]  # [6]

Different tokenizers use different IDs for the {blank} token, so the demo gets the ID programmatically rather than hard-coding it with a blank_id = 103 statement. The PyTorch where() function finds the index of a target value in a tensor. In this case, the location of the {blank} / [MASK] token is at index [6]. This index is needed to extract the results.
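
As a minimal standalone illustration of this where() pattern (not part of the demo), applied to a tiny stand-in for the input_ids tensor:

  t = T.tensor([[101, 1109, 103, 102]])  # pretend 103 is the [MASK] ID
  idx = T.where(t == 103)                # (tensor([0]), tensor([2]))
  print(idx[1])                          # tensor([2]) -- column position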

The tokenized IDs and attention mask values are fed to the DistilBERT model like so:

  ids = inpts['input_ids']
  mask = inpts['attention_mask']
  with T.no_grad():
    output = model(ids, mask)

The no_grad() block is used so that the output results aren't hooked into the underlying PyTorch computational network that is the DistilBERT model. Instead of passing the tokenized IDs and attention mask tensors to the DistilBERT model separately, it's possible to pass them together and fetch the output logits using this shortcut syntax:

  with T.no_grad():
    all_logits = model(**inpts).logits  # shortcut form

The raw output logits are retrieved, and then the logits of interest are extracted, with these two statements:

  all_logits = output.logits  # [1, 10, 28996] 
  pred_logits = all_logits[0, blank_id_idx, :]  # [1, 28996]

The output object returned by the model is:

MaskedLMOutput(loss=None,
               logits=tensor([[[  -6.4189, . .  -5.4534],
                               [  -7.3583, . .  -6.3187],
                               [ -12.7053, . . -11.8682]]]),
               hidden_states=None,
               attentions=None)

For a fill-in-the-blank problem, the only relevant field is the logits information. The shape of the 3D logits tensor for the given input sequence is [1, 10, 28996]. The 28996 in the third dimension represents every possible word/token to fill in the blank. The 10 in the second dimension represents each of the 10 input token IDs. The only one of these 10 sets of logits that's needed is the one that makes predictions for the [MASK] token at index [6], which was stored in blank_id_idx, so those values are extracted and stored in pred_logits. Dealing with tricky indexing into multidimensional tensors isn't conceptually difficult, but it does take a bit of time to get used to.

Interpreting the Results
The pred_logits tensor holds 28,996 values. Each logit value represents the likelihood of a word/token filling the blank in the sentence “The man ran through the {blank} door”, and the index values 0, 1, 2, . . represent token IDs. Larger logit values are more likely. You could simply find the largest logit value and then fetch the index of its location to get the single most likely word/token, but a better approach is to find the five most likely words/tokens. The demo uses the handy torch.topk() function to do this:

  print("Extracting IDs of top five predicted words: ")
  top_ids = T.topk(pred_logits, 5, dim=1).indices[0].tolist()
  print(top_ids)

  print("The top five predicteds as words: ")
  for id in top_ids:
    print(toker.decode([id]))

The 28,996 logit values are difficult to interpret, so the demo program converts the logits to pseudo-probabilities using the softmax() function. The resulting values sum to 1.0, so they loosely represent probabilities.
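
The statements in Listing 1 that perform the conversion and display the five largest pseudo-probabilities are:

  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  pred_probs = T.softmax(pred_logits, dim=1).numpy()
  pred_probs = np.sort(pred_probs[0])[::-1]  # high p to low p
  top_probs = pred_probs[0:5]
  print("\nThe top five corresponding pseudo-probabilities: ")
  print(top_probs)
  # [0.1689  0.0630  0.0447  0.0432  0.0323]

The largest pseudo-probability, roughly 0.17, corresponds to the top-ranked candidate “front.”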

