This project extends Neural Transition-Based Dependency Parsing (Stanford CS224n, Assignment 2, Q2). The goal is to build a three-layer neural network in TensorFlow that parses the dependency structure of sentences. The original code contributed by hankcs was written in Python 2; I revised it to run in Python 3. The training data for the neural dependency parsing model is the Penn Treebank, and each sentence carries ‘word’, ‘POS’, ‘head’ and ‘label’ fields, where ‘head’ is the index of each word’s head and ‘label’ is the dependency relation.

First, import the necessary modules and set up the configuration for the parser:

import time
import os
import logging
from collections import Counter
import numpy as np

P_PREFIX = '<p>:'
L_PREFIX = '<l>:'
UNK = '<UNK>'
NULL = '<NULL>'
ROOT = '<ROOT>'

class Config(object): 
    language = 'english'
    with_punct = True
    unlabeled = True 
    lowercase = True
    use_pos = True # use part of speech as input features
    use_dep = True # use dependency-label features (only meaningful for labeled parsing; forced off below when unlabeled)
    use_dep = use_dep and (not unlabeled)
    data_path = './data'
    train_file = 'train.conll'
    dev_file = 'dev.conll'
    test_file = 'test.conll'
    embedding_file = './data/en-cw.txt'   
    n_features = 36 # 18 word features + 18 POS features; the 12 dependency-label features are off when unlabeled
    n_classes = 3 # want to predict shift, left-arc or right-arc
    dropout = 0.5
    embed_size = 50
    hidden_size = 200
    batch_size = 2048
    n_epochs = 10
    lr = 0.001
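
As a quick check on these settings (a small sketch, not part of the original assignment code), the 36 features decompose into 18 word-id features plus 18 POS features; the 12 dependency-label features are switched off because unlabeled is True, which forces use_dep to False:

config = Config()
n_feat = 18 + (18 if config.use_pos else 0) + (12 if config.use_dep else 0)
assert config.use_dep is False and n_feat == config.n_features == 36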

Define some utility functions:

# Read the training data (CoNLL format).
# The first 5 rows of the training file look like this:
'''
1	In	_	ADP	IN	_	5	case	_	_
2	an	_	DET	DT	_	5	det	_	_
3	Oct.	_	PROPN	NNP	_	5	compound	_	_
4	19	_	NUM	CD	_	5	nummod	_	_
5	review	_	NOUN	NN	_	45	nmod	_	_
'''

def read_conll(in_file, lowercase=False, max_example=None):
    examples = []
    with open(in_file) as f:
        word, pos, head, label = [], [], [], []
        for line in f.readlines():
            sp = line.strip().split('\t') # split each line into its 10 tab-separated CoNLL columns
            if len(sp) == 10:
                if '-' not in sp[0]:  # skip multi-word token lines whose id contains '-' (e.g. '1-2')
                    word.append(sp[1].lower() if lowercase else sp[1])
                    pos.append(sp[4])
                    head.append(int(sp[6]))
                    label.append(sp[7])
            elif len(word) > 0: # a blank line ends the sentence: store it as a dictionary
                examples.append({'word': word, 'pos': pos, 'head': head, 'label': label})
                word, pos, head, label = [], [], [], []
                if (max_example is not None) and (len(examples) == max_example):
                    break
        if len(word) > 0: # append the last sentence if the file does not end with a blank line
            examples.append({'word': word, 'pos': pos, 'head': head, 'label': label})
    return examples
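
To sanity-check read_conll (a sketch; the five sample rows above are written to a temporary file, and lowercase=True is used as in the real pipeline):

import tempfile

sample = """1\tIn\t_\tADP\tIN\t_\t5\tcase\t_\t_
2\tan\t_\tDET\tDT\t_\t5\tdet\t_\t_
3\tOct.\t_\tPROPN\tNNP\t_\t5\tcompound\t_\t_
4\t19\t_\tNUM\tCD\t_\t5\tnummod\t_\t_
5\treview\t_\tNOUN\tNN\t_\t45\tnmod\t_\t_
"""

with tempfile.NamedTemporaryFile('w', suffix='.conll', delete=False) as tmp:
    tmp.write(sample)

print(read_conll(tmp.name, lowercase=True))
# [{'word': ['in', 'an', 'oct.', '19', 'review'], 'pos': ['IN', 'DT', 'NNP', 'CD', 'NN'],
#   'head': [5, 5, 5, 5, 45], 'label': ['case', 'det', 'compound', 'nummod', 'nmod']}]
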
def build_dict(keys, n_max=None, offset=0):
    count = Counter()
    for key in keys:
        count[key] += 1
    ls = count.most_common() if n_max is None \
        else count.most_common(n_max)

    return {w[0]: index + offset for (index, w) in enumerate(ls)}

def punct(language, pos):
    if language == 'english':
        return pos in ["''", ",", ".", ":", "``", "-LRB-", "-RRB-"]
    elif language == 'chinese':
        return pos == 'PU'
    elif language == 'french':
        return pos == 'PUNC'
    elif language == 'german':
        return pos in ["$.", "$,", "$["]
    elif language == 'spanish':
        # http://nlp.stanford.edu/software/spanish-faq.shtml
        return pos in ["f0", "faa", "fat", "fc", "fd", "fe", "fg", "fh",
                       "fia", "fit", "fp", "fpa", "fpt", "fs", "ft",
                       "fx", "fz"]
    elif language == 'universal':
        return pos == 'PUNCT'
    else:
        raise ValueError('language: %s is not supported.' % language)
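
build_dict is easiest to see on a toy input (a sketch): the most frequent key gets the smallest id, and offset shifts every id so that several vocabularies (labels, POS tags, words) can share one id space.

print(build_dict(['NN', 'DT', 'NN', 'VB'], offset=10))
# e.g. {'NN': 10, 'DT': 11, 'VB': 12} -- 'NN' comes first because it is the most frequent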


Now, define the Preprocessor object to extract the features of the context words for each word in the training data (word, POS and label features), compute the correct (oracle) transitions and the legal transitions, and then create instances: a long list of tuples of (features, legal labels, gold_t, i.e. the correct transition). (In the assignment this object is called Parser; I was confused by the parse method on that object, so I pulled parse out as a standalone function, and the Parser object is renamed Preprocessor and passed to parse() as an argument.)

class Preprocessor(object):
    # initialize the dictionary {token: id}, and {id: token},  
    def __init__(self, dataset):
        root_labels = list([l for ex in dataset
                           for (h, l) in zip(ex['head'], ex['label']) if h == 0]) #h==0 is root
        counter = Counter(root_labels)
        if len(counter) > 1:
            logging.info('Warning: more than one root label')
            logging.info(counter)
        self.root_label = counter.most_common()[0][0]
        deprel = [self.root_label] + list(set([w for ex in dataset
                                               for w in ex['label']
                                               if w != self.root_label]))
        '''
        deprel (the dependency relation labels) holds the root label first, followed by every other
        label seen in the training data, e.g. root, nsubj, advmod, dobj, ... (see the id2tok example below).
        '''
        
        tok2id = {L_PREFIX + l: i for (i, l) in enumerate(deprel)}
        tok2id[L_PREFIX + NULL] = self.L_NULL = len(tok2id)

        config = Config()
        self.unlabeled = config.unlabeled
        self.with_punct = config.with_punct
        self.use_pos = config.use_pos
        self.use_dep = config.use_dep
        self.language = config.language

        if self.unlabeled: #just S, LA, RA
            trans = ['L', 'R', 'S']
            self.n_deprel = 1
        else: # labeled parsing: one 'L-<label>' and one 'R-<label>' transition per dependency label, plus 'S'
            trans = ['L-' + l for l in deprel] + ['R-' + l for l in deprel] + ['S']
            self.n_deprel = len(deprel)

        self.n_trans = len(trans)
        self.tran2id = {t: i for (i, t) in enumerate(trans)}
        self.id2tran = {i: t for (i, t) in enumerate(trans)}
        
        '''
        example of id2tran is:
        {0: 'L', 1: 'R', 2: 'S'}
        '''
        # logging.info('Build dictionary for part-of-speech tags.')
        tok2id.update(build_dict([P_PREFIX + w for ex in dataset for w in ex['pos']],
                                  offset=len(tok2id)))
        tok2id[P_PREFIX + UNK] = self.P_UNK = len(tok2id)
        tok2id[P_PREFIX + NULL] = self.P_NULL = len(tok2id)
        tok2id[P_PREFIX + ROOT] = self.P_ROOT = len(tok2id)

        # logging.info('Build dictionary for words.')
        tok2id.update(build_dict([w for ex in dataset for w in ex['word']],
                                  offset=len(tok2id)))
        tok2id[UNK] = self.UNK = len(tok2id)
        tok2id[NULL] = self.NULL = len(tok2id)
        tok2id[ROOT] = self.ROOT = len(tok2id)
        
        '''
        example of id2tok is:
        {0: '<l>:root', 1: '<l>:nsubj', 2: '<l>:advmod', 3: '<l>:dobj', 4: '<l>:acl', 5: '<l>:nummod', 6: '<l>:nmod', 7: '<l>:compound', 8: '<l>:xcomp', 
        9: '<l>:auxpass', 10: '<l>:det', 11: '<l>:punct', 12: '<l>:mark', 13: '<l>:case', 14: '<l>:conj', 15: '<l>:appos', 16: '<l>:nmod:tmod', 
        17: '<l>:dep', 18: '<l>:amod', 19: '<l>:nmod:poss', 20: '<l>:nsubjpass', 21: '<l>:cc', 22: '<l>:ccomp', 23: '<l>:<NULL>', 24: '<p>:NNP', 
        25: '<p>:NN', 26: '<p>:IN', 27: '<p>:DT', 28: '<p>:,', 29: '<p>:NNS', 30: '<p>:JJ', 31: '<p>:CD', 32: '<p>:CC', 33: '<p>:VBD', 34: '<p>:.', 
        35: '<p>:VBN', 36: '<p>:VBZ', 37: '<p>:``', 38: "<p>:''", 39: '<p>:VB', 40: '<p>:TO', 41: '<p>:PRP', 42: '<p>:POS', 43: '<p>:-LRB-', 
        44: '<p>:-RRB-', 45: '<p>:RB', 46: '<p>:NNPS', 47: '<p>:PRP$', 48: '<p>:<UNK>', 49: '<p>:<NULL>', 50: '<p>:<ROOT>', 51: ',', 52: 'in', 53: 'the', 
        54: '.', 55: 'cars', 56: 'and', 57: 'of', 58: '``', 59: "''", 60: 'at', 61: 'to', 62: 'haag', 63: 'said', 64: 'u.s.', 65: 'luxury', 66: 'auto', 
        67: 'maker', 68: 'an', 69: 'oct.', 70: '19', 71: 'review', 72: 'misanthrope', 73: 'chicago', 74: "'s", 75: 'goodman', 76: 'theatre', 77: '-lrb-', 
        78: 'revitalized', 79: 'classics', 80: 'take', 81: 'stage', 82: 'windy', 83: 'city', 84: 'leisure', 85: '&', 86: 'arts', 87: '-rrb-', 88: 'role',
        89: 'celimene', 90: 'played', 91: 'by', 92: 'kim', 93: 'cattrall', 94: 'was', 95: 'mistakenly', 96: 'attributed', 97: 'christina', 98: 'ms.', 
        99: 'plays', 100: 'elianti', 101: 'rolls-royce', 102: 'motor', 103: 'inc.', 104: 'it', 105: 'expects', 106: 'its', 107: 'sales', 108: 'remain', 
        109: 'steady', 110: 'about', 111: '1,200', 112: '1990', 113: 'last', 114: 'year', 115: 'sold', 116: '1,214', 117: 'howard', 118: 'mosher', 
        119: 'president', 120: 'chief', 121: 'executive', 122: 'officer', 123: 'he', 124: 'anticipates', 125: 'growth', 126: 'for', 127: 'britain', 
        128: 'europe', 129: 'far', 130: 'eastern', 131: 'markets', 132: '<UNK>', 133: '<NULL>', 134: '<ROOT>'}
        '''
        self.tok2id = tok2id
        self.id2tok = {v: k for (k, v) in tok2id.items()}

        self.n_features = 18 + (18 if config.use_pos else 0) + (12 if config.use_dep else 0) # 18 word features: 3 from the stack, 3 from the buffer, and 6 children of each of the top two stack items; POS features mirror them, label features cover only the 12 child positions
        self.n_tokens = len(tok2id) 
    
    '''
    vectorize() rewrites each sentence as {'word': [token ids], 'pos': [POS ids from tok2id], 'head': [head indices], 'label': [label ids]}.
    The resulting vec_examples (the training data in id form) look like this:
        [{'word': [134, 52, 68, 69, 70, 71, 57, 58, 53, 72, 59, 60, 73, 74, 75, 76, 77, 58, 78, 79, 80, 53, 81, 52, 82, 83, 51, 59, 84, 85, 86, 87, 51, 53, 88, 57, 89, 51, 90, 91, 92, 93, 51, 94, 95, 96, 61, 97, 62, 54], 
        'pos': [50, 26, 27, 24, 31, 25, 26, 37, 27, 25, 38, 26, 24, 42, 24, 24, 43, 37, 35, 29, 39, 27, 25, 26, 24, 24, 28, 38, 25, 32, 29, 44, 28, 27, 25, 26, 24, 28, 35, 26, 24, 24, 28, 33, 45, 35, 40, 24, 24, 34], 
        'head': [-1, 5, 5, 5, 5, 45, 9, 9, 9, 5, 9, 15, 15, 12, 15, 9, 20, 20, 19, 20, 5, 22, 20, 25, 25, 20, 20, 20, 20, 28, 28, 20, 45, 34, 45, 36, 34, 34, 34, 41, 41, 38, 34, 45, 45, 0, 48, 48, 45, 45], 
        'label': [-1, 13, 10, 7, 5, 6, 13, 11, 10, 6, 11, 13, 19, 13, 7, 6, 11, 11, 18, 1, 17, 10, 3, 13, 7, 6, 11, 11, 17, 21, 14, 11, 11, 10, 20, 13, 6, 11, 4, 13, 7, 6, 11, 9, 2, 0, 13, 7, 6, 11]}, 
        {'word': [134, 98, 62, 99, 100, 54], 'pos': [50, 24, 24, 36, 24, 34], 'head': [-1, 2, 3, 0, 3, 3], 'label': [-1, 7, 1, 0, 3, 11]},{...}...{...}
    '''
    def vectorize(self, examples): 
        vec_examples = []
        for ex in examples:
            word = [self.ROOT] + [self.tok2id[w] if w in self.tok2id
                                  else self.UNK for w in ex['word']] # a list of word id for each sentence in dataset
            pos = [self.P_ROOT] + [self.tok2id[P_PREFIX + w] if P_PREFIX + w in self.tok2id # a list of POS id
                                   else self.P_UNK for w in ex['pos']]
            head = [-1] + ex['head']
            label = [-1] + [self.tok2id[L_PREFIX + w] if L_PREFIX + w in self.tok2id
                            else -1 for w in ex['label']]
            vec_examples.append({'word': word, 'pos': pos,
                                 'head': head, 'label': label})
        return vec_examples # each sentence is a dictionary with list of 'word','POS', 'head' and 'label
    
    '''
    This function is important: it first finds the context positions (the top of the stack, the front of
    the buffer, and their children), then looks up the corresponding word ids from ex['word'], POS ids
    from ex['pos'] and label ids from ex['label'], and finally concatenates them into one long feature
    vector per parser state. One example feature vector is:
    [133, 133, 134, 52, 68, 69, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 133, 49, 49, 50, 26, 27, 24, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49]
    i.e. 36 ids from tok2id, laid out as [word features, p_features, l_features].
    '''
    def extract_features(self, stack, buf, arcs, ex):
        if stack[0] == "ROOT":
            stack[0] = 0

        def get_lc(k): # left children of k: arcs are (head, dependent, label), so collect dependents that sit to the left of k
            return sorted([arc[1] for arc in arcs if arc[0] == k and arc[1] < k])

        def get_rc(k): # right children of k: dependents that sit to the right of k, rightmost first
            return sorted([arc[1] for arc in arcs if arc[0] == k and arc[1] > k],
                          reverse=True)

        p_features = []
        l_features = []
        features = [self.NULL] * (3 - len(stack)) + [ex['word'][x] for x in stack[-3:]] #if len(stack)>=3, no NULL, features contain the last 3 words in the stack
        features += [ex['word'][x] for x in buf[:3]] + [self.NULL] * (3 - len(buf)) # extract the first 3 words in the buffer if len(buf) >=3
        if self.use_pos: # extracting the corresponding pos features, 3 from stack, 3 from buffer if stack and buffer have length>=3
            p_features = [self.P_NULL] * (3 - len(stack)) + [ex['pos'][x] for x in stack[-3:]]
            p_features += [ex['pos'][x] for x in buf[:3]] + [self.P_NULL] * (3 - len(buf))

        for i in range(2): #i = 0, 1
            if i < len(stack):
                k = stack[-i-1] #the kth word's context: i=0, the -1 word's context, i=1, the -2 word's context
                lc = get_lc(k) # the left context word of kth word in stack = [(s3,) s2, s1]
                rc = get_rc(k) # the right context word
                llc = get_lc(lc[0]) if len(lc) > 0 else [] #llc
                rrc = get_rc(rc[0]) if len(rc) > 0 else [] #rrc

                features.append(ex['word'][lc[0]] if len(lc) > 0 else self.NULL)
                features.append(ex['word'][rc[0]] if len(rc) > 0 else self.NULL)
                features.append(ex['word'][lc[1]] if len(lc) > 1 else self.NULL)
                features.append(ex['word'][rc[1]] if len(rc) > 1 else self.NULL)
                features.append(ex['word'][llc[0]] if len(llc) > 0 else self.NULL)
                features.append(ex['word'][rrc[0]] if len(rrc) > 0 else self.NULL)

                if self.use_pos:
                    p_features.append(ex['pos'][lc[0]] if len(lc) > 0 else self.P_NULL)
                    p_features.append(ex['pos'][rc[0]] if len(rc) > 0 else self.P_NULL)
                    p_features.append(ex['pos'][lc[1]] if len(lc) > 1 else self.P_NULL)
                    p_features.append(ex['pos'][rc[1]] if len(rc) > 1 else self.P_NULL)
                    p_features.append(ex['pos'][llc[0]] if len(llc) > 0 else self.P_NULL)
                    p_features.append(ex['pos'][rrc[0]] if len(rrc) > 0 else self.P_NULL)

                if self.use_dep:
                    l_features.append(ex['label'][lc[0]] if len(lc) > 0 else self.L_NULL)
                    l_features.append(ex['label'][rc[0]] if len(rc) > 0 else self.L_NULL)
                    l_features.append(ex['label'][lc[1]] if len(lc) > 1 else self.L_NULL)
                    l_features.append(ex['label'][rc[1]] if len(rc) > 1 else self.L_NULL)
                    l_features.append(ex['label'][llc[0]] if len(llc) > 0 else self.L_NULL)
                    l_features.append(ex['label'][rrc[0]] if len(rrc) > 0 else self.L_NULL)
            else:
                features += [self.NULL] * 6
                if self.use_pos:
                    p_features += [self.P_NULL] * 6
                if self.use_dep:
                    l_features += [self.L_NULL] * 6

        features += p_features + l_features
        assert len(features) == self.n_features
        return features 

    '''
    This returns the correct (oracle) transition for training purposes:
    1. if the stack holds only ROOT, the only correct transition is shift
    2. if the stack ends in [..., i1, i0], i1 is not ROOT, and i1's head is i0, then left-arc is correct
    3. if the stack ends in [..., i1, i0], i0's head is i1 (i1 may or may not be ROOT), and no word left
       in the buffer has i0 as its head, then right-arc is correct
    4. otherwise shift, or None when the buffer is empty
    '''
    def get_oracle(self, stack, buf, ex):
        if len(stack) < 2:
            return self.n_trans - 1 # if the stack holds only ROOT, the only option is shift, whose id is n_trans - 1 (2 when unlabeled)
        i0 = stack[-1]
        i1 = stack[-2]
        h0 = ex['head'][i0]
        h1 = ex['head'][i1]
        l0 = ex['label'][i0]
        l1 = ex['label'][i1]

        if self.unlabeled:
            if (i1 > 0) and (h1 == i0): # i1 is not ROOT and its head is i0: left-arc
                return 0
            elif (i1 >= 0) and (h0 == i1) and (not any([x for x in buf if ex['head'][x] == i0])):
                return 1 # i0's head is i1 and no word left in the buffer depends on i0: right-arc
            else:
                return None if len(buf) == 0 else 2 # no valid transition when the buffer is empty; otherwise shift
        else:
            if (i1 > 0) and (h1 == i0):
                return l1 if (l1 >= 0) and (l1 < self.n_deprel) else None
            elif (i1 >= 0) and (h0 == i1) and \
                 (not any([x for x in buf if ex['head'][x] == i0])):
                return l0 + self.n_deprel if (l0 >= 0) and (l0 < self.n_deprel) else None
            else:
                return None if len(buf) == 0 else self.n_trans - 1
    
    '''
    For each sentence in the training data, create a stack holding only ROOT (id 0), a buffer holding
    every token of the sentence, and an empty arc list. Get the oracle transition for the current state
    (for the initial state of [ROOT] plus a full buffer, the oracle is shift) and the legal transitions
    (only shift is legal initially, so legal_labels returns [0, 0, 1]), record the instance, then update
    the stack, buffer and arcs and repeat for 2 * n_words steps. The tuples collected while walking
    through one sentence are appended to all_instances, which ends up as one flat list over all sentences.
    '''
    def create_instances(self, examples):
        all_instances = []
        succ = 0
        for n_ex, ex in enumerate(examples):
            n_words = len(ex['word']) - 1

            # arcs = {(h, t, label)}
            stack = [0]
            buf = [i + 1 for i in range(n_words)]
            arcs = []
            instances = []
            for i in range(n_words * 2): # a complete parse of a sentence takes exactly 2 * n_words transitions: each word is shifted once and reduced once
                gold_t = self.get_oracle(stack, buf, ex)
                if gold_t is None:
                    break
                legal_labels = self.legal_labels(stack, buf)
                assert legal_labels[gold_t] == 1 # the oracle transition must be among the legal transitions
                instances.append((self.extract_features(stack, buf, arcs, ex),
                                  legal_labels, gold_t)) 
                if gold_t == self.n_trans - 1: # gold_t is shift= 2
                    stack.append(buf[0])
                    buf = buf[1:]
                elif gold_t < self.n_deprel: # left-arc (gold_t == 0 when unlabeled)
                    arcs.append((stack[-1], stack[-2], gold_t)) # the top of the stack becomes the head of the second-from-top
                    stack = stack[:-2] + [stack[-1]] # remove the dependent (second-from-top) from the stack
                else: # right-arc
                    arcs.append((stack[-2], stack[-1], gold_t - self.n_deprel)) # the second-from-top becomes the head of the top
                    stack = stack[:-1] # remove the dependent (top of the stack)
            succ += 1
            all_instances += instances # extend the flat list with this sentence's tuples of (context features, legal_labels, gold_t)
        return all_instances

    def legal_labels(self, stack, buf):
        labels = ([1] if len(stack) > 2 else [0]) * self.n_deprel # left-arc is legal only when the stack holds at least two words besides ROOT
        labels += ([1] if len(stack) >= 2 else [0]) * self.n_deprel # right-arc is legal when the stack holds at least two items (ROOT may receive a right-arc)
        labels += [1] if len(buf) > 0 else [0] # shift is legal when the buffer is non-empty
        return labels # length == n_trans (3 when unlabeled)
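
To see the oracle and instance creation in action, here is a tiny worked example (a sketch built on a made-up three-word sentence "He sleeps .", whose CoNLL heads are 2, 0 and 2; vectorize() prepends ROOT, so those head indices line up with positions in the padded sentence):

toy = [{'word': ['he', 'sleeps', '.'],
        'pos': ['PRP', 'VBZ', '.'],
        'head': [2, 0, 2],
        'label': ['nsubj', 'root', 'punct']}]
prep = Preprocessor(toy)
instances = prep.create_instances(prep.vectorize(toy))
print([prep.id2tran[gold_t] for (features, legal, gold_t) in instances])
# ['S', 'S', 'L', 'S', 'R', 'R'] -- shift "he", shift "sleeps", left-arc (sleeps -> he),
# shift ".", right-arc (sleeps -> .), right-arc (ROOT -> sleeps)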

After defining the Preprocessor, we are ready to load and preprocess the data. The function below reads in the train, dev and test sets, optionally slices them down to a few hundred sentences for debugging, initializes the Preprocessor, builds a randomly initialized embedding matrix for every token in the train set and overwrites each row with the pretrained vector of its token where one exists, vectorizes the train, dev and test sets, and finally creates the training instances.

def load_and_preprocess_data(reduced=True):
    config = Config()

    print("Loading data..."),
    start = time.time()
    train_set = read_conll(os.path.join(config.data_path, config.train_file),
                           lowercase=config.lowercase)
    dev_set = read_conll(os.path.join(config.data_path, config.dev_file),
                         lowercase=config.lowercase)
    test_set = read_conll(os.path.join(config.data_path, config.test_file),
                          lowercase=config.lowercase)
    if reduced:
        train_set = train_set[:1000]
        dev_set = dev_set[:500]
        test_set = test_set[:500]
    print("took {:.2f} seconds".format(time.time() - start))

    print("Building parser..."),
    start = time.time()
    preprocessor = Preprocessor(train_set)
    print("took {:.2f} seconds".format(time.time() - start))

    print("Loading pretrained embeddings..."),
    start = time.time()
    word_vectors = {}
    for line in open(config.embedding_file).readlines():
        sp = line.strip().split()
        word_vectors[sp[0]] = [float(x) for x in sp[1:]] # sp[0] is the word, sp[1:] is its 50-dimensional vector
    embeddings_matrix = np.asarray(np.random.normal(0, 0.9, (preprocessor.n_tokens, 50)), dtype='float32') # the embedding matrix is n_tokens x 50, randomly initialized

    for token in preprocessor.tok2id: #all token/ POS/ label in train set
        i = preprocessor.tok2id[token]
        if token in word_vectors:
            embeddings_matrix[i] = word_vectors[token]
        elif token.lower() in word_vectors:
            embeddings_matrix[i] = word_vectors[token.lower()] #fill embedding matrix with corresponding word_vectors as input x
    print("took {:.2f} seconds".format(time.time() - start))
    
    print("Vectorizing data..."),
    start = time.time()
    train_set = preprocessor.vectorize(train_set)
    dev_set = preprocessor.vectorize(dev_set) # each sentence corresponds to a dictionary of {'word': ids, 'pos': ids, 'head': ids, 'label': ids}
    test_set = preprocessor.vectorize(test_set)
    print("took {:.2f} seconds".format(time.time() - start))

    print("Preprocessing training data..."),
    start = time.time()
    train_examples = preprocessor.create_instances(train_set) # each sentence contributes 2 * n_words tuples of (context features, legal labels, gold_t)
    print("took {:.2f} seconds".format(time.time() - start))

    return preprocessor, embeddings_matrix, train_examples, dev_set, test_set,
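
The embedding-filling loop above is worth isolating (a sketch with made-up tokens and vectors): rows whose token appears in the pretrained vectors are overwritten, while every other row keeps its random initialization.

word_vectors = {'the': [0.1] * 50, 'cars': [0.2] * 50}   # hypothetical pretrained vectors
tok2id = {'the': 0, 'cars': 1, '<UNK>': 2}               # hypothetical vocabulary
emb = np.asarray(np.random.normal(0, 0.9, (len(tok2id), 50)), dtype='float32')
for token, i in tok2id.items():
    if token in word_vectors:
        emb[i] = word_vectors[token]
print(emb[0, 0], emb[1, 0], emb[2, 0])  # rows 0 and 1 are now pretrained (0.1 and 0.2); row 2 stays random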

Next, define the PartialParse object. Each sentence is treated as a partial parse with attributes ‘stack’, ‘buffer’ and ‘dependencies’, and a ‘parse_step’ method that updates the stack and buffer according to a predicted transition; the ‘parse’ method applies a sequence of transitions and returns the resulting dependencies.

class PartialParse(object):
    def __init__(self, sentence):
        """Initializes this partial parse.
        Your code should initialize the following fields:
            self.stack: The current stack represented as a list with the top of the stack as the
                        last element of the list.
            self.buffer: The current buffer represented as a list with the first item on the
                         buffer as the first item of the list
            self.dependencies: The list of dependencies produced so far. Represented as a list of
                    tuples where each tuple is of the form (head, dependent).
                    Order for this list doesn't matter.
        The root token should be represented with the string "ROOT"
        Args:
            sentence: The sentence to be parsed as a list of words.
                      Your code should not modify the sentence.
        """
        # The sentence being parsed is kept for bookkeeping purposes. Do not use it in your code.
        self.sentence = sentence

        ### YOUR CODE HERE
        self.stack = ['ROOT']
        self.buffer = sentence[:]
        self.dependencies = []
        ### END YOUR CODE

    def parse_step(self, transition):
        """Performs a single parse step by applying the given transition to this partial parse
        Args:
            transition: A string that equals "S", "LA", or "RA" representing the shift, left-arc,
                        and right-arc transitions.
        """
        ### YOUR CODE HERE
        if transition == "S":
            self.stack.append(self.buffer[0])
            self.buffer.pop(0)
        elif transition == "LA":
            self.dependencies.append((self.stack[-1], self.stack[-2]))
            self.stack.pop(-2)
        else:
            self.dependencies.append((self.stack[-2], self.stack[-1]))
            self.stack.pop(-1)
            ### END YOUR CODE

    def parse(self, transitions):
        """Applies the provided transitions to this PartialParse
        Args:
            transitions: The list of transitions in the order they should be applied
        Returns:
            dependencies: The list of dependencies produced when parsing the sentence. Represented
                          as a list of tuples where each tuple is of the form (head, dependent)
        """
        for transition in transitions:
            self.parse_step(transition)
        return self.dependencies
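
A quick check of the transition mechanics on a toy sentence (a sketch; the sentence and the transition sequence are made up):

pp = PartialParse(['I', 'love', 'NLP'])
print(pp.parse(['S', 'S', 'LA', 'S', 'RA', 'RA']))
# [('love', 'I'), ('love', 'NLP'), ('ROOT', 'love')]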

Next, define the wrapper used to predict the transitions for each feature set in each sentence. This is used when evaluating the dev set and test set. It is important to pass sess as an argument to the predict method so that the most recently updated TensorFlow variables from the training stage (the weight matrices and bias terms in this case) are applied to the evaluation set.

from utils.model import Model #want to use its predict_on_batch method

class ModelWrapper(object):
    def __init__(self, sess, preprocessor, dataset, sentence_id_to_idx):
        self.sess = sess
        self.preprocessor = preprocessor
        self.dataset = dataset
        self.sentence_id_to_idx = sentence_id_to_idx

    def predict(self, model, partial_parses): #model and partial_parses are imported and defined above.
        mb_x = [self.preprocessor.extract_features(p.stack, p.buffer, p.dependencies,
                                                   self.dataset[self.sentence_id_to_idx[id(p.sentence)]]) # context word/POS/label features in id form for each partial parse
                for p in partial_parses] # the partial parses are created in the parse() function below
        mb_x = np.array(mb_x).astype('int32') # the input batch, taken from the dev or test set
        mb_l = [self.preprocessor.legal_labels(p.stack, p.buffer) for p in partial_parses]
        
        pred = model.predict_on_batch(self.sess, mb_x)
        pred = np.argmax(pred + 10000 * np.array(mb_l).astype('float32'), 1)
        pred = ["S" if p == 2 else ("LA" if p == 0 else "RA") for p in pred]
        return pred

def parse(sess, model, preprocessor, dataset, eval_batch_size=5000): # dataset is the dev or test set
    sentences = []
    sentence_id_to_idx = {}
    for i, example in enumerate(dataset):
        n_words = len(example['word']) - 1
        sentence = [j + 1 for j in range(n_words)]
        #sentence = [j for j in example['word']]
        sentences.append(sentence)
        sentence_id_to_idx[id(sentence)] = i
    
    MW = ModelWrapper(sess, preprocessor, dataset, sentence_id_to_idx)
    #dependencies = minibatch_parse(sentences, model, eval_batch_size)
    partial_parses = [PartialParse(s) for s in sentences]

    unfinished_parse = partial_parses

    while len(unfinished_parse) > 0:
        minibatch = unfinished_parse[0:eval_batch_size]
        # perform transition and single step parser on the minibatch until it is empty
        while len(minibatch) > 0:
            transitions = MW.predict(model, minibatch) # the model holds the variables trained in sess, so it can predict transitions for this minibatch; feature extraction happens inside ModelWrapper.predict()
            for index, action in enumerate(transitions):
                minibatch[index].parse_step(action) # apply each predicted transition, updating that partial parse's stack and buffer
            minibatch = [parse for parse in minibatch if len(parse.stack) > 1 or len(parse.buffer) > 0]

        # move to the next batch
        unfinished_parse = unfinished_parse[eval_batch_size:]

    dependencies = []
    for n in range(len(sentences)):
        dependencies.append(partial_parses[n].dependencies)
    
    # compute the UAS (unlabeled attachment score): the fraction of words whose predicted head is correct
    UAS = all_tokens = 0.0
    for i, ex in enumerate(dataset):
        head = [-1] * len(ex['word'])
        for h, t in dependencies[i]:
            head[t] = h
        for pred_h, gold_h, gold_l, pos in zip(head[1:], ex['head'][1:], ex['label'][1:], ex['pos'][1:]):
            assert preprocessor.id2tok[pos].startswith(P_PREFIX)
            pos_str = preprocessor.id2tok[pos][len(P_PREFIX):]
            if (preprocessor.with_punct) or (not punct(preprocessor.language, pos_str)):
                UAS += 1 if pred_h == gold_h else 0
                all_tokens += 1
    UAS /= all_tokens
    return UAS, dependencies 
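
The masking trick inside ModelWrapper.predict is compact but easy to miss, so here it is in isolation (a sketch with made-up scores): adding a large constant to the scores of the legal transitions guarantees that the argmax never picks an illegal one, even if its raw score is the highest.

scores = np.array([[2.5, 0.1, 1.0]])            # raw model scores for L, R, S
legal = np.array([[0, 1, 1]], dtype='float32')  # left-arc is illegal in this parser state
print(np.argmax(scores + 10000 * legal, 1))     # [2] -> "S"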

Now the deep neural network can be built in the TensorFlow framework:

import os
import time
import tensorflow as tf
import pickle

class Config(object):
    """Holds model hyperparams and data information.
    The config class is used to store various hyperparameters and dataset
    information parameters. Model objects are passed a Config() object at
    instantiation.
    """
    n_features = 36
    n_classes = 3
    dropout = 0.5
    embed_size = 50
    hidden_size = 200
    batch_size = 2048
    n_epochs = 10
    lr = 0.001
    language = 'english'
    with_punct = True
    unlabeled = True
    lowercase = True
    use_pos = True
    use_dep = True
    use_dep = use_dep and (not unlabeled)
    data_path = './data'
    train_file = 'train.conll'
    dev_file = 'dev.conll'
    test_file = 'test.conll'
    embedding_file = './data/en-cw.txt'
    
class ParserModel(Model):
    """
    Implements a feedforward neural network with an embedding layer and single hidden layer.
    This network will predict which transition should be applied to a given partial parse
    configuration.
    """

    def add_placeholders(self):
 
        ### YOUR CODE HERE
        self.input_placeholder = tf.placeholder(tf.int32, [None, self.config.n_features])
        self.labels_placeholder = tf.placeholder(tf.float32, [None, self.config.n_classes])
        self.dropout_placeholder = tf.placeholder(tf.float32)
        self.beta_regul = tf.placeholder(tf.float32)
        ### END YOUR CODE

    def create_feed_dict(self, inputs_batch, labels_batch=None, dropout=1, beta_regul=10e-7):
  
        ### YOUR CODE HERE
        feed_dict = {self.input_placeholder: inputs_batch, \
                     self.dropout_placeholder: dropout, \
                     self.beta_regul: beta_regul
                     }
        if labels_batch is not None:
            feed_dict[self.labels_placeholder] = labels_batch
        ### END YOUR CODE
        return feed_dict

    def add_embedding(self):

        ### YOUR CODE HERE
        embedded = tf.Variable(self.pretrained_embeddings)
        embeddings = tf.nn.embedding_lookup(embedded,self.input_placeholder)
        embeddings = tf.reshape(embeddings, [-1, self.config.n_features * self.config.embed_size])
        ### END YOUR CODE
        return embeddings

    def add_prediction_op(self):

        x = self.add_embedding()
        ### YOUR CODE HERE
        xavier = xavier_weight_init()
        with tf.variable_scope("transformation"):
            b1 = tf.Variable(tf.random_uniform([self.config.hidden_size,]))
            b2 = tf.Variable(tf.random_uniform([self.config.n_classes]))

            self.W = W = xavier([self.config.n_features * self.config.embed_size, self.config.hidden_size])
            U = xavier([self.config.hidden_size, self.config.n_classes])

            z1 = tf.matmul(x,W) + b1
            h = tf.nn.relu(z1)
            h_drop = tf.nn.dropout(h,self.dropout_placeholder)
            pred = tf.matmul(h_drop, U) + b2
        ### END YOUR CODE
        return pred

    def add_loss_op(self, pred):
 
        ### YOUR CODE HERE
        loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=pred, labels=self.labels_placeholder)
        loss += self.beta_regul * tf.nn.l2_loss(self.W)
        loss = tf.reduce_mean(loss)
        ### END YOUR CODE
        return loss

    def add_training_op(self, loss):

        ### YOUR CODE HERE
        adam_optim = tf.train.AdamOptimizer(self.config.lr)
        train_op = adam_optim.minimize(loss)
        ### END YOUR CODE
        return train_op

    def train_on_batch(self, sess, inputs_batch, labels_batch):
        feed = self.create_feed_dict(inputs_batch, labels_batch=labels_batch,
                                     dropout=self.config.dropout)
        _, loss = sess.run([self.train_op, self.loss], feed_dict=feed)
        return loss

    def run_epoch(self, sess, model, preprocessor,train_examples, dev_set):
        prog = tf.keras.utils.Progbar(target=1 + len(train_examples) / self.config.batch_size)
        for i, (train_x, train_y) in enumerate(minibatches(train_examples, self.config.batch_size)):
            loss = self.train_on_batch(sess, train_x, train_y)
            prog.update(i + 1, [("train loss", loss)])

        print("Evaluating on dev set"),
        dev_UAS, _ = parse(sess, model, preprocessor, dev_set)
        print("- dev UAS: {:.2f}".format(dev_UAS * 100.0))
        return dev_UAS

    def fit(self, sess, model, saver, preprocessor,train_examples, dev_set):
        best_dev_UAS = 0
        for epoch in range(self.config.n_epochs):
            print("Epoch {:} out of {:}".format(epoch + 1, self.config.n_epochs))
            dev_UAS = self.run_epoch(sess, model, preprocessor,train_examples, dev_set)
            if dev_UAS > best_dev_UAS:
                best_dev_UAS = dev_UAS
                if saver:
                    print("New best dev UAS! Saving model in ./data/weights/parser.weights")
                    saver.save(sess, './data/weights/parser.weights')
            print()

    def __init__(self, config, pretrained_embeddings):
        self.pretrained_embeddings = pretrained_embeddings
        self.config = config
        self.build()
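
Before wiring in the helper functions, it may help to trace the tensor shapes of the forward pass with plain NumPy (a sketch; the vocabulary size of 135 is taken from the debug-mode tok2id shown earlier):

batch, n_tokens = 4, 135
cfg = Config()
ids = np.random.randint(0, n_tokens, size=(batch, cfg.n_features))     # feature ids, shape (4, 36)
E = np.random.randn(n_tokens, cfg.embed_size)                          # embedding table, (135, 50)
x = E[ids].reshape(batch, cfg.n_features * cfg.embed_size)             # flattened embeddings, (4, 1800)
W = np.random.randn(cfg.n_features * cfg.embed_size, cfg.hidden_size)  # (1800, 200)
U = np.random.randn(cfg.hidden_size, cfg.n_classes)                    # (200, 3)
h = np.maximum(x.dot(W), 0)                                            # ReLU hidden layer, (4, 200)
logits = h.dot(U)                                                      # transition scores, (4, 3)
print(x.shape, h.shape, logits.shape)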

Before combining everything, define the following helper functions. minibatches() extracts the input batch and label batch for the TensorFlow feed dict; xavier_weight_init() returns the initializer used to create the W and U matrices in the prediction op.

from utils.general_utils import get_minibatches

def minibatches(data, batch_size):
    x = np.array([d[0] for d in data])
    y = np.array([d[2] for d in data])
    one_hot = np.zeros((y.size, 3))
    one_hot[np.arange(y.size), y] = 1
    return get_minibatches([x, one_hot], batch_size)

def xavier_weight_init():
    """Returns function that creates random tensor.
    The specified function will take in a shape (tuple or 1-d array) and
    returns a random tensor of the specified shape drawn from the
    Xavier initialization distribution.
    Hint: You might find tf.random_uniform useful.
    """
    def _xavier_initializer(shape, **kwargs):
        """Defines an initializer for the Xavier distribution.
        Specifically, the output should be sampled uniformly from [-epsilon, epsilon] where
            epsilon = sqrt(6) / <sum of the sizes of shape's dimensions>
        e.g., if shape = (2, 3), epsilon = sqrt(6 / (2 + 3))
        This function will be used as a variable initializer.
        Args:
            shape: Tuple or 1-d array that species the dimensions of the requested tensor.
        Returns:
            out: tf.Tensor of specified shape sampled from the Xavier distribution.
        """
        ### YOUR CODE HERE
        epsilon = np.sqrt(6 / np.sum(shape))
        out = tf.Variable(tf.random_uniform(shape=shape, minval=-epsilon, maxval=epsilon))
        ### END YOUR CODE
        return out
    # Returns defined initializer function.
    return _xavier_initializer
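
The one-hot construction inside minibatches() in isolation (a sketch with made-up gold transition ids):

y = np.array([2, 0, 1, 2])           # gold transition ids for four parser states
one_hot = np.zeros((y.size, 3))
one_hot[np.arange(y.size), y] = 1    # row i gets a 1 in column y[i]
print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]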

Combining everything in the main() function:

def main(debug=True):
    print(80 * "=")
    print("INITIALIZING")
    print(80 * "=")
    config = Config()
    preprocessor, embeddings, train_examples, dev_set, test_set = load_and_preprocess_data(debug)
    if not os.path.exists('./data/weights/'):
        os.makedirs('./data/weights/')

    with tf.Graph().as_default():
        print("Building model..."),
        start = time.time()
        model = ParserModel(config, embeddings)
        #parser.model = model
        print("took {:.2f} seconds\n".format(time.time() - start))

        init = tf.global_variables_initializer()
        # If you are using an old version of TensorFlow, you may have to use
        # this initializer instead.
        # init = tf.initialize_all_variables()
        saver = None if debug else tf.train.Saver()

        with tf.Session() as sess:
            #parser.session = sess
            sess.run(init)

            print(80 * "=")
            print("TRAINING")
            print(80 * "=")
            model.fit(sess, model, saver, preprocessor, train_examples, dev_set) # train_examples provides the input and label batches via the minibatches() helper defined above

            if not debug:
                print(80 * "=")
                print("TESTING")
                print(80 * "=")
                print("Restoring the best model weights found on the dev set")
                saver.restore(sess, './data/weights/parser.weights')
                print("Final evaluation on test set",)
                UAS, dependencies = parse(sess, model, preprocessor, test_set)
                print("- test UAS: {:.2f}".format(UAS * 100.0))
                print("Writing predictions")
                with open('q2_test.predicted.pkl', 'wb') as f:
                    pickle.dump(dependencies, f, -1)
                print("Done!")

Finally, run the main function in debug mode:

main(True)

The results under debug mode: after 10 epochs of training, the dev UAS is 68.56.

================================================================================
INITIALIZING
================================================================================
Loading data...
took 11.97 seconds
Building parser...
took 0.09 seconds
Loading pretrained embeddings...
took 6.36 seconds
Vectorizing data...
took 0.17 seconds
Preprocessing training data...
took 3.79 seconds
Building model...
took 0.52 seconds

================================================================================
TRAINING
================================================================================
Epoch 1 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 1.0069Evaluating on dev set
- dev UAS: 45.86
Epoch 2 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.4400Evaluating on dev set
- dev UAS: 53.38
Epoch 3 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.3473Evaluating on dev set
- dev UAS: 58.92
Epoch 4 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.3010Evaluating on dev set
- dev UAS: 61.86
Epoch 5 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.2675Evaluating on dev set
- dev UAS: 62.74
Epoch 6 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.2419Evaluating on dev set
- dev UAS: 64.17
Epoch 7 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.2214Evaluating on dev set
- dev UAS: 65.05
Epoch 8 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.2016Evaluating on dev set
- dev UAS: 67.05
Epoch 9 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.1881Evaluating on dev set
- dev UAS: 67.45
Epoch 10 out of 10
24/24 [============================>.] - ETA: 0s - train loss: 0.1733Evaluating on dev set
- dev UAS: 68.56

Training on the full data, the final test UAS is 88.91.

================================================================================
INITIALIZING
================================================================================
Loading data...
took 5.62 seconds
Building parser...
took 4.05 seconds
Loading pretrained embeddings...
took 6.82 seconds
Vectorizing data...
took 4.56 seconds
Preprocessing training data...
took 144.24 seconds
Building model...
took 0.65 seconds

================================================================================
TRAINING
================================================================================
Epoch 1 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.2253Evaluating on dev set
- dev UAS: 82.99
New best dev UAS! Saving model in ./data/weights/parser.weights
Epoch 2 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.1231Evaluating on dev set
- dev UAS: 85.37
New best dev UAS! Saving model in ./data/weights/parser.weights
Epoch 3 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.1069Evaluating on dev set
- dev UAS: 86.63
New best dev UAS! Saving model in ./data/weights/parser.weights
Epoch 4 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.0969Evaluating on dev set
- dev UAS: 87.34
New best dev UAS! Saving model in ./data/weights/parser.weights
Epoch 5 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.0902Evaluating on dev set
- dev UAS: 87.09
Epoch 6 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.0839Evaluating on dev set
- dev UAS: 87.84
New best dev UAS! Saving model in ./data/weights/parser.weights
Epoch 7 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.0792Evaluating on dev set
- dev UAS: 88.21
New best dev UAS! Saving model in ./data/weights/parser.weights
Epoch 8 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.0747Evaluating on dev set
- dev UAS: 88.08
Epoch 9 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.0711Evaluating on dev set
- dev UAS: 88.26
New best dev UAS! Saving model in ./data/weights/parser.weights
Epoch 10 out of 10
928/928 [============================>.] - ETA: 0s - train loss: 0.0675Evaluating on dev set
- dev UAS: 88.52
New best dev UAS! Saving model in ./data/weights/parser.weights
================================================================================
TESTING
================================================================================
Restoring the best model weights found on the dev set
INFO:tensorflow:Restoring parameters from ./data/weights/parser.weights
Final evaluation on test set
- test UAS: 88.91
Writing predictions
Done!

The final dependencies for each sentence are in id form and look like this:

[[(7, 6), (7, 5), (7, 4), (7, 3), (7, 2), (7, 1), (7, 8), (0, 7)], [], ...]