Embeddings can be used in machine learning to represent data, reducing the dimensionality of the dataset and capturing latent factors between data points. Commonly this is applied to words, reducing, say, a 400,000-dimensional one-hot word vector to a 50-dimensional vector, but it could equally be used to map post codes or other token-encoded data. Another use case might be in recommender systems.
GloVe (Global Vectors for Word Representation) was developed at Stanford and more information can be found here. There are a few pre-trained datasets, including Wikipedia, a web crawl and a Twitter set, each increasing the number of words in the vocabulary and offering varying embedding dimensions. We will be using the smallest Wikipedia dataset and, for this sample, will pick the 50-dimensional embedding.
Obtaining the embeddings
Let's start by noting all the dependencies we'll use below:
import os
import urllib.request
import zipfile
import nltk
import numpy as np
import tensorflow as tf
Next we define a few paths to make things easier and ensure our Python script can obtain and extract the data, whether we already have it locally or need to retrieve it from the web. Here we also define EMBEDDING_DIMENSION, the length of the vector representing each word. After parsing the weight file we will later define VOCAB_SIZE, the total number of word tokens we will use, along with UNKNOWN_TOKEN, an index used for any words we encounter that aren't in the dataset.
EMBEDDING_DIMENSION = 50  # Available dimensions for 6B data are 50, 100, 200 and 300
data_directory = '/data/glove'

if not os.path.isdir(data_directory):
    os.makedirs(data_directory)

glove_weights_file_path = os.path.join(data_directory, f'glove.6B.{EMBEDDING_DIMENSION}d.txt')

if not os.path.isfile(glove_weights_file_path):
    # GloVe embedding weights can be downloaded from https://nlp.stanford.edu/projects/glove/
    glove_fallback_url = 'http://nlp.stanford.edu/data/glove.6B.zip'
    local_zip_file_path = os.path.join(data_directory, os.path.basename(glove_fallback_url))

    if not os.path.isfile(local_zip_file_path):
        print(f'Retrieving glove weights from {glove_fallback_url}')
        urllib.request.urlretrieve(glove_fallback_url, local_zip_file_path)

    with zipfile.ZipFile(local_zip_file_path, 'r') as z:
        print(f'Extracting glove weights from {local_zip_file_path}')
        z.extractall(path=data_directory)
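As a quick sanity check, here is a minimal sketch (my own addition, assuming the download and extraction above succeeded) that peeks at the first line of the weights file; the format, one word followed by its space-separated weights, is described in the next section:
# Hypothetical sanity check, not part of the original walkthrough
with open(glove_weights_file_path, 'r') as file:
    first_line = file.readline().split()
    print(first_line[0])        # the word itself, e.g. 'the'
    print(len(first_line[1:]))  # number of weights, should equal EMBEDDING_DIMENSION (50)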
Loading embeddings
The downloaded file has one word per line, in descending frequency of usage in the dataset. Each line is space separated, with the word first followed by the decimal numbers of that word's vector representation. Here we will keep three data structures for various uses:
- word2idx: a dictionary mapping words to their index token - used for converting a sequence of words to a sequence of integers for embedding lookup
- idx2word: a list of words in order - used for decoding an integer sequence back to words (a sketch of building it follows the loading code below)
- weights: a matrix of size VOCAB_SIZE x EMBEDDING_DIMENSION containing the vector for each word
PAD_TOKEN = 0
word2idx = {'PAD': PAD_TOKEN}  # dict so we can lookup indices for tokenising our text later from string to sequence of integers
weights = []
with open(glove_weights_file_path, 'r') as file:
    for index, line in enumerate(file):
        values = line.split()  # Word and weights separated by space
        word = values[0]  # Word is first symbol on each line
        word_weights = np.asarray(values[1:], dtype=np.float32)  # Remainder of line is weights for word
        word2idx[word] = index + 1  # PAD is our zeroth index so shift by one
        weights.append(word_weights)

        if index + 1 == 40_000:
            # Limit vocabulary to top 40k terms
            break
EMBEDDING_DIMENSION = len(weights[0])
# Insert the PAD weights at index 0 now we know the embedding dimension
weights.insert(0, np.random.randn(EMBEDDING_DIMENSION))
# Append the unknown token to the end of the vocab and initialize it as random
UNKNOWN_TOKEN=len(weights)
word2idx['UNK'] = UNKNOWN_TOKEN
weights.append(np.random.randn(EMBEDDING_DIMENSION))
# Construct our final vocab
weights = np.asarray(weights, dtype=np.float32)
VOCAB_SIZE=weights.shape[0]
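The list above also mentions idx2word, which the loading loop does not build. A minimal sketch of constructing it from word2idx (my own addition; the original code only builds word2idx and weights):
# Build idx2word by sorting the vocabulary by index - used for decoding integer sequences back to words
idx2word = [word for word, _ in sorted(word2idx.items(), key=lambda item: item[1])]
# e.g. idx2word[PAD_TOKEN] gives back 'PAD' and idx2word[UNKNOWN_TOKEN] gives back 'UNK'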
Embeddings in TensorFlow
The remaining piece is converting a string to a sequence of integers. Tokenising is easily achieved using nltk.word_tokenize and will be unique to your problem, but an example is:
features = {}
features['word_indices'] = nltk.word_tokenize('hello world') # ['hello', 'world']
features['word_indices'] = [word2idx.get(word, UNKNOWN_TOKEN) for word in features['word_indices']]
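Note that nltk.word_tokenize relies on NLTK's punkt tokeniser models, which need downloading once. Since our vocabulary reserves index 0 for padding, here is a minimal sketch (my own addition, not from the original walkthrough) of tokenising a small batch and padding each sequence to the same length with PAD_TOKEN:
nltk.download('punkt')  # One-off download of the tokeniser models used by nltk.word_tokenize

sentences = ['hello world', 'the quick brown fox']
tokenised = [nltk.word_tokenize(sentence) for sentence in sentences]
indexed = [[word2idx.get(word, UNKNOWN_TOKEN) for word in tokens] for tokens in tokenised]

# Pad every sequence to the length of the longest using our reserved PAD_TOKEN (index 0)
max_length = max(len(sequence) for sequence in indexed)
padded = [sequence + [PAD_TOKEN] * (max_length - len(sequence)) for sequence in indexed]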
Finally, from within TensorFlow we define a variable to hold our embedding weights using tf.get_variable. This will either create or load the variable into the graph. The most common initializer for variables is one that creates random weights, but we wish to load our GloVe weights and so will use a tf.constant_initializer to initialize the variable with the weights we loaded previously, and indicate that these shouldn't be updated during training by setting trainable=False. The actual embedding of our sequence of word indices to embedded vectors is then done by tf.nn.embedding_lookup. This is essentially a vector retrieval: each word index indexes the zeroth axis of the weights matrix and the corresponding embedding vector is returned.
glove_weights_initializer = tf.constant_initializer(weights)
embedding_weights = tf.get_variable(
    name='embedding_weights',
    shape=(VOCAB_SIZE, EMBEDDING_DIMENSION),
    initializer=glove_weights_initializer,
    trainable=False)

embedding = tf.nn.embedding_lookup(embedding_weights, features['word_indices'])
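To see the lookup in action outside of a larger model, a minimal sketch using a TensorFlow 1.x session (an assumption on my part that you are on the 1.x API, which the tf.get_variable and tf.nn.embedding_lookup calls above imply):
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # Runs the constant initializer, copying the GloVe weights in
    embedded_words = sess.run(embedding)
    print(embedded_words.shape)  # (sequence_length, EMBEDDING_DIMENSION), e.g. (2, 50) for 'hello world'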
Summary
Using pre-trained embeddings can enhance an NLP model by giving meaning to the vector input for word tokens, utilising existing results trained on large corpora such as Wikipedia. Today we covered:
- Retrieving the GloVe embeddings
- Parsing the file into useful data structures
- Initializing a TensorFlow embedding layer with these weights