This post walks through the named entity recognition (NER) task using an RNN model.

The RNN model for NER

Let \(\mathbf{x}^t\) be the one-hot vector for the word at time step \(t\), and define \(\mathbf{E}\in \mathbb{R}^{V\times D}, \mathbf{W}_h \in \mathbb{R}^{H\times H}, \mathbf{W}_e\in \mathbb{R}^{D\times H}, \mathbf{U} \in \mathbb{R}^{H\times C}\) (with \(C=5\) classes), \(\mathbf{b}_1\in \mathbb{R}^{H}, \mathbf{b}_2 \in \mathbb{R}^{C}\). The RNN model that makes a prediction at time step \(t\) can be expressed as \[\mathbf{e}^t = \mathbf{x}^t\mathbf{E}\\ \mathbf{h}^t = \sigma(\mathbf{h}^{t-1}\mathbf{W}_h + \mathbf{e}^t\mathbf{W}_e+\mathbf{b}_1) \\ \hat{\mathbf{y}}^t = \mathrm{softmax}(\mathbf{h}^t\mathbf{U} + \mathbf{b}_2)\\ J = \sum_t CE(\mathbf{y}^t,\hat{\mathbf{y}}^t) =-\sum_t\sum_i y_i^t\log(\hat{y}_i^t).\]

The GRU model for NER
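The GRU cell keeps the same embedding layer, output layer and loss, but replaces the simple recurrence with gated updates. In the same row-vector notation (these are the standard GRU equations in one common convention; the parameter names are not tied to the assignment code): \[\mathbf{z}^t = \sigma(\mathbf{h}^{t-1}\mathbf{W}_z + \mathbf{e}^t\mathbf{W}_{ez}+\mathbf{b}_z)\\ \mathbf{r}^t = \sigma(\mathbf{h}^{t-1}\mathbf{W}_r + \mathbf{e}^t\mathbf{W}_{er}+\mathbf{b}_r)\\ \tilde{\mathbf{h}}^t = \tanh\big((\mathbf{r}^t\circ\mathbf{h}^{t-1})\mathbf{W}_h + \mathbf{e}^t\mathbf{W}_e+\mathbf{b}_1\big)\\ \mathbf{h}^t = \mathbf{z}^t\circ\mathbf{h}^{t-1} + (1-\mathbf{z}^t)\circ\tilde{\mathbf{h}}^t,\] where \(\circ\) denotes element-wise multiplication; \(\hat{\mathbf{y}}^t\) and \(J\) are then computed from \(\mathbf{h}^t\) exactly as in the RNN model.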

Code:

Using load_and_preprocess_data(), the raw training and test data, e.g.

[(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
  ['ORG', 'O', 'MISC', 'O', 'O', 'O', 'MISC', 'O', 'O']),
 (['Peter', 'Blackburn'], ['PER', 'PER'])]

become a list of tuples of the form ([[word id, case type id], ...], [label ids]):

[([[1, 12],
   [2, 11],
   [3, 13],
   [4, 11],
   [5, 11],
   [6, 11],
   [7, 13],
   [8, 11],
   [9, 14]],
  [1, 4, 3, 4, 4, 4, 3, 4, 4]),
 ([[10, 13], [11, 13]], [0, 0])]

load_embeddings() matches each word in a sentence with its word vector by looking the word up in the helper.tok2id dictionary. The embedding matrix has shape \(19\times 50\) for the two-sentence example, with an embedding size of 50.
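A rough sketch of what this matching step amounts to (hypothetical names; the actual helper code may differ): row \(i\) of the embedding matrix holds the pretrained 50-dimensional vector of the token whose id is \(i\) in helper.tok2id, so the embedding lookup can later be done by id.

import numpy as np

def build_embedding_matrix(tok2id, word_vectors, embed_size=50):
    embeddings = np.zeros((len(tok2id) + 1, embed_size), dtype=np.float32)
    for token, idx in tok2id.items():
        if token in word_vectors:              # known token: copy its pretrained vector
            embeddings[idx] = word_vectors[token]
    return embeddings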

These two steps (load_and_preprocess_data() and load_embeddings()) are the same as in the window-based model. The steps below are where building the RNN model differs from building the window-based model.

Defining the RNN cell

Essentially, this step creates the hidden state update

\[\mathbf{h}^t = \sigma(\mathbf{h}^{t-1}\mathbf{W}_h + \mathbf{e}^t\mathbf{W}_e+\mathbf{b}_1).\]

  • define the variables \(\mathbf{W}_h, \mathbf{W}_e, \mathbf{b}_1\) with shapes \(\mathbf{W}_h\in \mathbb{R}^{H\times H}\), \(\mathbf{W}_e \in \mathbb{R}^{(n_{\text{window features}}\times D)\times H}\), \(\mathbf{b}_1 \in \mathbb{R}^{1\times H}\).

  • create \(\mathbf{h}^t\) following the formula

  • return output and new_state. In a basic RNN, the output is the same as the new state \(\mathbf{h}^t\).

class RNNCell(tf.nn.rnn_cell.RNNCell):
    """Wrapper around our RNN cell implementation that allows us to play
    nicely with TensorFlow.
    """
    def __init__(self, input_size, state_size):
        self.input_size = input_size
        self._state_size = state_size

    @property
    def state_size(self):
        return self._state_size

    @property
    def output_size(self):
        return self._state_size

    def __call__(self, inputs, state, scope=None):
        scope = scope or type(self).__name__

        with tf.variable_scope(scope):
            b = tf.get_variable(name='b', shape = [self.state_size], initializer=tf.contrib.layers.xavier_initializer(seed=1))
            W_x = tf.get_variable(name='W_x', shape = [self.input_size, self.state_size], initializer=tf.contrib.layers.xavier_initializer(seed=2))
            W_h = tf.get_variable(name='W_h', shape = [self.state_size, self.state_size], initializer=tf.contrib.layers.xavier_initializer(seed=3))
            z1 = tf.matmul(inputs,W_x) + tf.matmul(state, W_h) + b
            new_state = tf.nn.sigmoid(z1)
            
        output = new_state
        return output, new_state
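As a quick sanity check (a sketch, using the same sizes as in add_prediction_op() below), one step of the cell maps a (batch_size, 6 * 50) input and a (batch_size, 300) state to a new (batch_size, 300) state:

x_t = tf.placeholder(tf.float32, [None, 6 * 50])    # n_window_features * embed_size
h_prev = tf.zeros([tf.shape(x_t)[0], 300])          # initial hidden state
cell = RNNCell(6 * 50, 300)
output, new_state = cell(x_t, h_prev, scope="RNN")  # both have shape (batch_size, 300)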

Padding and masking

Implementing an RNN requires us to unroll the computation over the whole sentence. Unfortunately, each sentence can be of arbitrary length and this would cause the RNN to be unrolled a different number of times for different sentences, making it impossible to BATCH PROCESS the data (assignment 3).

The most common way to address the unequal lengths of the sentences is to pad them. For example, suppose the longest sentence in the inputs has \(M\) tokens. In our two-sentence example the lengths are 9 and 2, respectively, both \(<M\), so we pad the two sentences with zeros until they reach length \(M\). Meanwhile, a mask vector is also created for each sentence. The mask vector has True wherever there was a token in the original sequence and False for padded positions:

# before
[([[89, 2647],
   [1070, 2646],
   [115, 2648],
   [288, 2646],
   [7, 2646],
   [1071, 2646],
   [78, 2648],
   [392, 2646],
   [1, 2649]],
  [1, 4, 3, 4, 4, 4, 3, 4, 4]),
 ([[1072, 2648], [175, 2648]], [0, 0])]
# after
[([[89, 2647],
   [1070, 2646],
   [115, 2648],
   [288, 2646],
   [7, 2646],
   [1071, 2646],
   [78, 2648],
   [392, 2646],
   [1, 2649],
   [0, 0]],
  [1, 4, 3, 4, 4, 4, 3, 4, 4, 4],
  [True, True, True, True, True, True, True, True, True, False]),
 ([[1072, 2648],
   [175, 2648],
   [0, 0],
   [0, 0],
   [0, 0],
   [0, 0],
   [0, 0],
   [0, 0],
   [0, 0],
   [0, 0]],
  [0, 0, 4, 4, 4, 4, 4, 4, 4, 4],
  [True, True, False, False, False, False, False, False, False, False])]

The returned ret is a list of tuples containing three lists: the padded sentence (each token represented by two features), the padded labels, and the mask vector.
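A minimal sketch of this padding logic (the actual pad_sequences() in the assignment may differ in details; here the padding label 4 is the id of the 'O' tag and the padding token is a vector of zeros):

def pad_sequences_sketch(data, max_length, n_features=2, pad_label=4):
    ret = []
    zero_vector = [0] * n_features                   # all-zero padding token
    for sentence, labels in data:
        pad_len = max(0, max_length - len(sentence))
        new_sentence = sentence[:max_length] + [zero_vector] * pad_len
        new_labels = labels[:max_length] + [pad_label] * pad_len
        mask = [True] * min(len(sentence), max_length) + [False] * pad_len
        ret.append((new_sentence, new_labels, mask))
    return ret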

RNN model

In the RNN model, each sentence is first windowed using window_size = 1, then padded to a common length (max_length) using featurize_windows(), window_iterator() and pad_sequences(). These three functions are wrapped in preprocess_sequence_data():

#before

[([[89, 2647],
   [1070, 2646],
   [115, 2648],
   [288, 2646],
   [7, 2646],
   [1071, 2646],
   [78, 2648],
   [392, 2646],
   [1, 2649]],
  [1, 4, 3, 4, 4, 4, 3, 4, 4]),
 ([[1072, 2648], [175, 2648]], [0, 0])]
#after: see Out[31] below, produced by preprocess_sequence_data()
def preprocess_sequence_data(examples):
    from util import window_iterator

    def featurize_windows(data, start, end, window_size = 1):
        """Uses the input sequences in @data to construct new windowed data points.
        """
        ret = []
        for sentence, labels in data:
            sentence_ = []
            for window in window_iterator(sentence, window_size, beg=start, end=end):
                # e.g. window = [[1072, 2648], [175, 2648], [2651, 2646]];
                # sum(window, []) flattens the list of lists into one feature list
                sentence_.append(sum(window, []))
            ret.append((sentence_, labels))
        return ret

    examples = featurize_windows(examples, [2650, 2646], [2651, 2646])  # start/end sentinel tokens
    return pad_sequences(examples, 10)  # suppose the max length in this example is 10

preprocess_sequence_data(train[0:2])

Out[31]: 
[([[2650, 2646, 89, 2647, 1070, 2646],
   [89, 2647, 1070, 2646, 115, 2648],
   [1070, 2646, 115, 2648, 288, 2646],
   [115, 2648, 288, 2646, 7, 2646],
   [288, 2646, 7, 2646, 1071, 2646],
   [7, 2646, 1071, 2646, 78, 2648],
   [1071, 2646, 78, 2648, 392, 2646],
   [78, 2648, 392, 2646, 1, 2649],
   [392, 2646, 1, 2649, 2651, 2646],
   [0, 0, 0, 0, 0, 0]],
  [1, 4, 3, 4, 4, 4, 3, 4, 4, 4],
  [True, True, True, True, True, True, True, True, True, False]),
 ([[2650, 2646, 1072, 2648, 175, 2648],
   [1072, 2648, 175, 2648, 2651, 2646],
   [0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 0]],
  [0, 0, 4, 4, 4, 4, 4, 4, 4, 4],
  [True, True, False, False, False, False, False, False, False, False])]
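For reference, the windows inside each sentence of Out[31] come from util.window_iterator(). A sketch of the behavior it implements, reconstructed from the example above (not the exact util.py code):

def window_iterator_sketch(seq, n, beg, end):
    # For each position i, yield the window seq[i-n : i+n+1], padding with the
    # sentinel tokens beg / end where the window runs past the sentence boundary.
    for i in range(len(seq)):
        left = max(0, i - n)
        right = min(len(seq), i + n + 1)
        yield [beg] * max(0, n - i) + list(seq[left:right]) + [end] * max(0, i + n + 1 - len(seq))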

NOTICE THE DIFFERENT FORMAT AFTER preprocess_sequence_data FOR THE WINDOW-BASED MODEL AND THE RNN MODEL: THE WINDOW-BASED MODEL PRODUCES 9000+ WINDOWED INPUTS, EACH WITH A LABEL FOR THE CENTER WORD; THE RNN MODEL STILL HAS 700+ SENTENCES (EACH SENTENCE CONTAINS ITS OWN WINDOWED INPUTS), AND THE LABELS FOR A SENTENCE ARE KEPT IN A SEPARATE LIST ALONGSIDE IT.

The output Out[31] is assigned to train_examples. The same preprocessing is applied to the development data, and the result is assigned to dev_set:

#train and dev is in the form of the 'before' preprocessing
train_examples = self.preprocess_sequence_data(train) 
dev_set = self.preprocess_sequence_data(dev)

With the data in the desired form (and model.cell set to "rnn"), initialize the RNN model:

model = RNNModel(helper, config, embeddings, args.cell)

# inside the RNNModel constructor, the following attributes are set:
model.helper = helper
model.report = report
model.pretrained_embeddings = embeddings
model.max_length = 120
model.cell = args.cell

Then fit the RNN model to the train and dev datasets, which are the outputs of load_and_preprocess_data(). Inside model.fit(), train and dev are first passed through preprocess_sequence_data(); the resulting outputs are train_examples and dev_set, respectively.

We will train and evaluate the RNN model for 10 epochs. In each epoch, we first initialize a file summary writer for TensorBoard:

writer = tf.summary.FileWriter('./tb/epoch {}'.format(epoch), sess.graph)

Then divide train_examples into minibatches as in the window-based model, and train the RNN model on each batch by supplying the feed dictionary to the placeholders and using the loss and training operations defined in add_loss_op() and add_training_op(). The RNN unrolling that produces a prediction at each time step is implemented in add_prediction_op():

   def add_embedding(self):

        embedded = tf.Variable(self.pretrained_embeddings)
        embeddings = tf.nn.embedding_lookup(embedded, self.input_placeholder)
        embeddings = tf.reshape(embeddings, [-1, self.max_length, 6 * 50])                                                     

        return embeddings

    def add_prediction_op(self):

        x = self.add_embedding() #(None, max_length, n_window_features*embed_size)
        dropout_rate = self.dropout_placeholder

        preds = [] # Predicted output at each timestep should go here!
        
        #Initialize cell
        if self.cell == "rnn":
            cell = RNNCell(6 * 50, 300)
        elif self.cell == "gru":
            cell = GRUCell(6 * 50, 300)
        else:
            raise ValueError("Unsupported cell type: " + self.cell)

        # Initialize state as vector of zeros.
        with tf.variable_scope("Layer1"):       
            b2 = tf.get_variable(name='b2', shape = [5],        
                                 initializer=tf.constant_initializer(0))
            U = tf.get_variable(name='U', shape = [300, 5],
                                initializer=tf.contrib.layers.xavier_initializer(seed=4)) 

        input_shape = tf.shape(x) #[batch size, max_length, n_features*embed_size]
        h = tf.zeros([input_shape[0], 300]) #t_0 state

        with tf.variable_scope("RNN"):
            for time_step in range(self.max_length):
                if time_step > 0:
                    tf.get_variable_scope().reuse_variables()

                o, h =  cell(x[:,time_step, :], h, scope="RNN") #__call__ in RNNcell
                h_drop = tf.nn.dropout(h, dropout_rate)
                pred = tf.matmul(h_drop, U) + b2  # pred: (batch_size, n_class)
                preds.append(pred)  # preds: a list of max_length tensors, each of shape (batch_size, n_class)

        # Make sure to reshape @preds here.
        preds = tf.stack(preds, axis=1)  # preds: (batch_size, max_length, n_class)

        assert preds.get_shape().as_list() == [None, 120, 5], "predictions are not of the right shape. Expected {}, got {}".format([None, 120, 5], preds.get_shape().as_list())
        return preds

Here, cell = RNNCell(6 * 50, 300) initializes the RNN cell (the class RNNCell introduced above) with input size \(n_{\text{window features}} \times embed\ size = (2\times3)\times 50 = 300\) and hidden size 300. The initial hidden state for time step 0 has size \(batch\ size \times hidden\ size = batch\ size \times 300\). For each time step (for time_step in range(self.max_length):), the output o and hidden state h are updated by cell(x[:,time_step, :], h, scope="RNN") (the __call__ of RNNCell; for each word, the input is its window features, of size \(6\times 50\)), and the prediction for that time step is computed as pred = tf.matmul(h_drop, U) + b2. The loop then moves to the next time step using the previously updated h, updates h again, makes the next prediction, and so on.
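To keep the shapes straight, here is a trace through one batch of size B (these follow directly from the code above):

# x                          : (B, max_length, 6 * 50)   returned by add_embedding()
# x[:, time_step, :]         : (B, 300)                  cell input at one time step
# h, h_drop                  : (B, 300)                  hidden state
# pred = h_drop @ U + b2     : (B, 5)                    class scores for that time step
# preds = tf.stack(preds, 1) : (B, max_length, 5)        scores for all time steps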

With the preds from add_prediction_op(), we can define add_loss_op() and add_training_op(). Recall that when building the model, we have

self.add_placeholders()
self.pred = self.add_prediction_op()
self.loss = self.add_loss_op(self.pred)
self.train_op = self.add_training_op(self.loss)
tf.summary.scalar('loss1', self.loss) # add a scalar summary to track the loss value
self.merged = tf.summary.merge_all() # merge all summaries so each batch's loss can be written to TensorBoard

Then

def add_loss_op(self, preds):
      pred_mask = tf.boolean_mask(preds, self.mask_placeholder)
      label_mask = tf.boolean_mask(self.labels_placeholder, self.mask_placeholder)
      loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=pred_mask, labels=label_mask)
      loss = tf.reduce_mean(loss)
      return loss

def add_training_op(self, loss): 
      adam_optim = tf.train.AdamOptimizer(0.001)
      train_op = adam_optim.minimize(loss)

      return train_op
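The masking in add_loss_op() drops the padded time steps before the cross entropy is averaged; schematically (shapes implied by the placeholders above):

# preds                         : (batch_size, max_length, 5)
# self.mask_placeholder         : (batch_size, max_length)   True at real tokens
# tf.boolean_mask(preds, mask)  : (n_real_tokens, 5)         logits of real tokens only
# tf.boolean_mask(labels, mask) : (n_real_tokens,)           their gold label ids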

The loss and training operations are run in train_on_batch(); in addition to the loss value, the merged summary of the loss for each batch in each epoch is also returned:

def train_on_batch(self, sess, inputs_batch, labels_batch, mask_batch):

        feed = self.create_feed_dict(inputs_batch, labels_batch=labels_batch, mask_batch=mask_batch,
                                     dropout=0.5)
        summary, _, loss = sess.run([self.merged, self.train_op, self.loss], feed_dict=feed)
        return summary, loss

for i, batch in enumerate(minibatches(train_examples, 140)):
    summary, loss = self.train_on_batch(sess, *batch)
    writer.add_summary(summary, i)
    writer.flush()
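For completeness, a minimal sketch of what minibatches() does (assumed behavior, not the exact util code): it slices the list of (sentence, labels, mask) tuples into batches and regroups each batch column-wise, so that *batch unpacks into the three arguments train_on_batch() expects.

import numpy as np

def minibatches_sketch(data, batch_size, shuffle=True):
    indices = np.arange(len(data))
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(data), batch_size):
        batch = [data[i] for i in indices[start:start + batch_size]]
        # regroup [(s1, l1, m1), (s2, l2, m2), ...] into [inputs_batch, labels_batch, mask_batch]
        yield [list(col) for col in zip(*batch)]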

After training on all batches in the epoch, evaluate the trained model on dev_set (in the code, inputs is the same as dev_set). The dev_set is divided into minibatches, and predict_on_batch() is called on each one:

def output(self, sess, inputs_raw, inputs):
    # inputs is the preprocessed dev_set; inputs_raw is the corresponding raw data
    preds = []
    for i, batch in enumerate(minibatches(inputs, 140, shuffle=False)):
        # Ignore the labels in the batch when predicting
        batch = batch[:1] + batch[2:]
        preds_ = self.predict_on_batch(sess, *batch)
        preds += list(preds_)
        prog.update(i + 1, [])  # prog: a progress bar (initialization omitted here)
    return self.consolidate_predictions(inputs_raw, inputs, preds)

where self.predict_on_batch is

def predict_on_batch(self, sess, inputs_batch, mask_batch):
        feed = self.create_feed_dict(inputs_batch=inputs_batch, mask_batch=mask_batch)
        predictions = sess.run(tf.argmax(self.pred, axis=2), feed_dict=feed)
        return predictions

by using the trained \(\mathbf{W}_h, \mathbf{W}_e, \mathbf{U}, \mathbf{b}_1, \mathbf{b}_2\).

DIFFERENT BATCH SIZES GIVE DIFFERENT VALIDATION SCORES. IN THIS EXAMPLE, THE BATCH SIZE IS 140. At epoch 10, the results are (in the confusion matrix below, rows are gold labels and columns are guessed labels):

I0713 10:28:00.729423 139762688649024 <ipython-input-39-b9d6a5ec9b7a>:213] Epoch 10 out of 10



6/6 [==============================] - 15s - train loss: 0.3258    

I0713 10:28:19.765505 139762688649024 <ipython-input-39-b9d6a5ec9b7a>:180] Evaluating on development data



6/6 [==============================] - 48s    

I0713 10:29:08.066092 139762688649024 <ipython-input-39-b9d6a5ec9b7a>:182] Token-level confusion matrix:
go\gu   PER     ORG     LOC     MISC    O      
PER     689.00  33.00   14.00   7.00    81.00  
ORG     151.00  69.00   26.00   18.00   127.00 
LOC     40.00   26.00   413.00  16.00   82.00  
MISC    44.00   27.00   44.00   71.00   72.00  
O       71.00   27.00   39.00   8.00    6971.00

I0713 10:29:08.068744 139762688649024 <ipython-input-39-b9d6a5ec9b7a>:183] Token-level scores:
label   acc     prec    rec     f1   
PER     0.95    0.69    0.84    0.76 
ORG     0.95    0.38    0.18    0.24 
LOC     0.97    0.77    0.72    0.74 
MISC    0.97    0.59    0.28    0.38 
O       0.94    0.95    0.98    0.96 
micro   0.96    0.90    0.90    0.90 
macro   0.96    0.68    0.60    0.62 
not-O   0.96    0.68    0.61    0.64 

I0713 10:29:08.070044 139762688649024 <ipython-input-39-b9d6a5ec9b7a>:184] Entity level P/R/F1: 0.54/0.54/0.54

The loss plot from TensorBoard is shown below.

[Figure: TensorBoard loss plot]

In the early epochs, the loss decreases noticeably; in later epochs, the loss values remain roughly unchanged.