CNTK 303: Deep Structured Semantic Modeling with LSTM Networks

DSSM stands for Deep Structured Semantic Model or, more generally, Deep Semantic Similarity Model. DSSM, developed by the MSR Deep Learning Technology Center (DLTC), is a deep neural network (DNN) modeling technique for representing text strings (sentences, queries, predicates, entity mentions, etc.) in a continuous semantic space and modeling semantic similarity between two text strings (e.g., Sent2Vec). DSSM has wide applications, including information retrieval and web search ranking (Huang et al., 2013; Shen et al., 2014a, 2014b; Palangi et al., 2016), ad selection/relevance, contextual entity search and interestingness tasks (Gao et al., 2014a), question answering (Yih et al., 2014), image captioning (Fang et al., 2014), and machine translation (Gao et al., 2014b).

DSSM can be used to develop latent semantic models that project entities of different types (e.g., queries and documents) into a common low-dimensional semantic space for a variety of machine learning tasks such as ranking and classification. For example, in web search ranking, the relevance of a document given a query can be readily computed as the distance between them in that space. With the latest GPUs from Nvidia, we can train our models on billions of words. Readers who are interested in deep learning for text processing may refer to the tutorial by He et al., 2014. We released the predictors and trained model files of the DSSM (a.k.a. Sent2Vec).


The goal is to develop a mechanism such that, given a pair of documents (say, a query and a set of web page documents), the model maps the inputs to a pair of feature vectors in a continuous, low-dimensional space, where the semantic similarity between the text strings can be compared using the cosine similarity between their vectors.

In the figure above, one can see how, given a query (\(Q\)) and a set of documents (\(D_1, D_2, \ldots, D_n\)), one can generate latent representations (a.k.a. semantic features), which can then be used to compute a pairwise distance metric. The resulting metric can be used for ranking.
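
The ranking step described above can be sketched in plain Python. This is a minimal illustration, not part of the tutorial's model: the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the three-dimensional vectors are made-up stand-ins for learned semantic features.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar, plus the scores."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Hypothetical semantic features (made-up numbers for illustration only)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 2.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # partially aligned with the query
]
order, scores = rank_documents(query_vec, doc_vecs)
print(order)
```

Because cosine similarity ignores vector length, the second document ranks first even though its norm differs from the query's.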

In the picture above, one can see that the query and the document are each mapped to a term vector. While bag-of-words modeling is a common first step when building NLP models, it is limited in its ability to capture the relative positions of words. Convolution-based or recurrence-based models perform better due to their inherent ability to leverage word positions. In this tutorial, we use a simple illustrative model in which an LSTM encodes the term vector, following the work of Palangi et al.
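
To make the limitation of bag-of-words concrete, here is a small sketch (the helper `bag_of_words` and the example sentences are illustrative assumptions, not taken from the data set). Two sentences with opposite meanings produce the same bag, whereas their token sequences differ, which is the information a recurrent model can exploit.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count word occurrences, discarding word order."""
    return Counter(sentence.split())

a = "dog bites man"
b = "man bites dog"

same_bag = bag_of_words(a) == bag_of_words(b)   # word order is lost
same_sequence = a.split() == b.split()          # word order is kept
print(same_bag, same_sequence)
```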

In this tutorial, we show you how to build such a network. We use a small sample from the Question-Answering corpus. Additionally, we use a recurrent network to develop the semantic model, as it inherently incorporates positional information along with the word tokens.

Note: The data set is very small, and the emphasis of this tutorial is on showing how to create an end-to-end modeling workflow for the DSSM network, not on the specific numerical performance we are able to get on this small data set.

In [1]:
# Import the relevant libraries
from __future__ import print_function # Use a function definition from future version (say 3.x from 2.7 interpreter); must precede other imports
import math
import numpy as np
import os

import cntk as C
import cntk.tests.test_utils
cntk.tests.test_utils.set_device_from_pytest_env() # (only needed for our build system)
C.cntk_py.set_fixed_random_seed(1) # fix a random seed for CNTK components

Data Preparation


We use a sampling of the Question Answering data set to illustrate how to model DSSM networks. The data set consists of pairs of sentences: questions and answers. In this tutorial, we have preprocessed the data into two parts:

- Vocabulary files: 1 file each for questions and answers. There are 1204 and 1019 words in the question and answer vocabularies, respectively.
- QA files: 1 file each for training and validation (hold-out) data, each converted to the CTF format. The training and validation files have 3500 and 409 sentence pairs, respectively.

Note: a small portion of the original data was provided by the author of the paper for creating an exemplar network for illustration purposes.

In [2]:
location = os.path.normpath('data/DSSM')
data = {
  'train': { 'file': 'train.pair.tok.ctf' },
  'val':{ 'file': 'valid.pair.tok.ctf' },
  'query': { 'file': 'vocab_Q.wl' },
  'answer': { 'file': 'vocab_A.wl' }
}

import requests

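def download(url, filename):
    """ utility function to download a file """
    response = requests.get(url, stream=True)
    with open(filename, "wb") as handle:
        for chunk in response.iter_content():
            handle.write(chunk)

# Create the local data directory (and any missing parents) on first use
if not os.path.exists(location):
    os.makedirs(location)
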
for item in data.values():
    path = os.path.normpath(os.path.join(location, item['file']))

    if os.path.exists(path):
        print("Reusing locally cached:", path)
    else:
        print("Starting download:", item['file'])
        url = ""%(item['file'])
        download(url, path)
        print("Download completed")
    item['file'] = path
Starting download: vocab_A.wl
Download completed
Starting download: train.pair.tok.ctf
Download completed
Starting download: valid.pair.tok.ctf
Download completed
Starting download: vocab_Q.wl
Download completed


We will be using the CTF deserializer to read the input data. However, one can write one's own readers or use NumPy arrays to feed data into the CNTK modeling workflow. You may want to open the CTF files with a text editor to inspect the input. Note that the CTF deserializer can scale to production-size data sets spanning multiple disks. The reader also abstracts the randomization of large-scale data behind a simple flag, an added convenience and time saver for the programmer.
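
For readers who have not seen a CTF file, the following sketch writes one query/answer pair in the sparse text layout that the CTF deserializer reads: each line starts with a sequence ID, followed by `|stream_name` and `index:value` entries, and lines sharing a sequence ID form one sequence. The helper `to_ctf_lines` is hypothetical, and it is an assumption that the tutorial's files follow exactly this arrangement; the stream names `S0` and `S1` are the ones used by the reader in this tutorial.

```python
def to_ctf_lines(seq_id, query_idx, answer_idx):
    """Format one query/answer pair as CTF text lines with sparse one-hot entries.

    Each line holds one time step; lines that share a sequence ID belong to
    the same sequence. The shorter stream simply stops appearing.
    """
    lines = []
    for t in range(max(len(query_idx), len(answer_idx))):
        parts = [str(seq_id)]
        if t < len(query_idx):
            parts.append("|S0 {}:1".format(query_idx[t]))
        if t < len(answer_idx):
            parts.append("|S1 {}:1".format(answer_idx[t]))
        lines.append(" ".join(parts))
    return lines

# Word indices chosen for illustration: a 3-token query and a 2-token answer
ctf = to_ctf_lines(0, [1202, 1154, 1203], [1017, 1018])
print("\n".join(ctf))
```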

In [3]:
# Define the vocabulary size (QRY-stands for question and ANS stands for answer)
QRY_SIZE = 1204
ANS_SIZE = 1019

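def create_reader(path, is_training):
    # Map the CTF fields S0 and S1 to the sparse query and answer streams
    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(
         query  = C.io.StreamDef(field='S0', shape=QRY_SIZE, is_sparse=True),
         answer = C.io.StreamDef(field='S1', shape=ANS_SIZE, is_sparse=True)
     )), randomize=is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)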
In [4]:
train_file = data['train']['file']

if os.path.exists(train_file):
    train_source = create_reader(train_file, is_training=True)
else:
    raise ValueError("Cannot locate file {0} in current directory {1}".format(train_file, os.getcwd()))

validation_file = data['val']['file']
if os.path.exists(validation_file):
    val_source = create_reader(validation_file, is_training=False)
else:
    raise ValueError("Cannot locate file {0} in current directory {1}".format(validation_file, os.getcwd()))

Model creation

The proposed LSTM-RNN model sequentially takes each word in a sentence, extracts its information, and embeds it into a semantic vector. Due to its ability to capture long-term memory, the LSTM-RNN accumulates increasingly richer information as it goes through the sentence, and when it reaches the last word, the hidden layer of the network provides a semantic representation of the whole sentence. The output of the last block is then projected to a query_vector space, also referred to as the semantic feature in the figure above.

                                                    "query vector"
                                                      | Dense |
                                                     | Dropout |
                                                      | Dense |
                                                      | last |
          +------+   +------+   +------+   +------+   +------+
     0 -->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |-->| LSTM |
          +------+   +------+   +------+   +------+   +------+
              ^          ^          ^          ^          ^
              |          |          |          |          |
          +-------+  +-------+  +-------+  +-------+  +-------+
          | Embed |  | Embed |  | Embed |  | Embed |  | Embed |
          +-------+  +-------+  +-------+  +-------+  +-------+
              ^          ^          ^          ^          ^
              |          |          |          |          |
query  ------>+--------->+--------->+--------->+--------->+
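
The idea in the diagram, running a recurrent cell over the embedded words and keeping only the final hidden state, can be sketched without CNTK. This is a simplified illustration under stated assumptions: the function `lstm_last_state` and the weight names are hypothetical, biases are omitted, the initial state is zero, and the weights are random rather than trained.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lstm_last_state(embedded_words, params, hidden_dim):
    """Run an LSTM over a list of embedding vectors and return the final hidden state."""
    h = [0.0] * hidden_dim   # hidden state
    c = [0.0] * hidden_dim   # cell state
    for x in embedded_words:
        xh = x + h  # concatenate the input with the previous hidden state
        i = [sigmoid(z) for z in matvec(params["Wi"], xh)]    # input gate
        f = [sigmoid(z) for z in matvec(params["Wf"], xh)]    # forget gate
        o = [sigmoid(z) for z in matvec(params["Wo"], xh)]    # output gate
        g = [math.tanh(z) for z in matvec(params["Wg"], xh)]  # candidate cell value
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h

random.seed(0)
EMB, HID = 3, 4  # toy sizes, much smaller than the tutorial's dimensions
params = {name: [[random.uniform(-0.5, 0.5) for _ in range(EMB + HID)]
                 for _ in range(HID)]
          for name in ("Wi", "Wf", "Wo", "Wg")}

short_sentence = [[0.1, 0.2, 0.3]] * 2   # two embedded words
long_sentence = [[0.1, 0.2, 0.3]] * 5    # five embedded words
vec_short = lstm_last_state(short_sentence, params, HID)
vec_long = lstm_last_state(long_sentence, params, HID)
print(len(vec_short), len(vec_long))
```

Sentences of different lengths yield vectors of the same size, which is what allows a fixed-size dense projection to follow the recurrence.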

Similarly, we can project the answer sentence to answer_vector. However, before we create our model, let us define its input variables. Note that each query is paired with an answer. Since both are sequences of words, we define them as sequence inputs.

In [5]:
# Create the containers for input feature (x) and the label (y)
qry = C.sequence.input_variable(QRY_SIZE)
ans = C.sequence.input_variable(ANS_SIZE)

Notice: Do you smell any problem with the aforementioned statements? If you want to see what would happen if you were to go with the declarations above, please comment out the 4 statements below and run the model. You will find that your model throws an exception. The details of the exception are explained here.

Each sequence in CNTK is associated with a dynamic axis representing the number of words in the sequence. Intuitively, when you have sequences of different sizes and vocabularies, each of them needs its own dynamic axis. This is facilitated by declaring the input data containers with a named axis. Strictly speaking, you could name just one, and the other would use the default dynamic axis. However, for clarity we name the two axes separately.

In [6]:
# Create the containers for input feature (x) and the label (y)
axis_qry = C.Axis.new_unique_dynamic_axis('axis_qry')
qry = C.sequence.input_variable(QRY_SIZE, sequence_axis=axis_qry)

axis_ans = C.Axis.new_unique_dynamic_axis('axis_ans')
ans = C.sequence.input_variable(ANS_SIZE, sequence_axis=axis_ans)

Before we can create the model we need to specify a few parameters associated with the network architecture.

In [7]:
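EMB_DIM   = 25 # Embedding dimension
HIDDEN_DIM = 50 # LSTM dimension
DSSM_DIM = 25 # Dense layer dimension
DROPOUT_RATIO = 0.2 # Dropout ratio used in create_model (assumed value; not specified in the text)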
In [8]:
def create_model(qry, ans):
    with C.layers.default_options(initial_state=0.1):
        qry_vector = C.layers.Sequential([
            C.layers.Embedding(EMB_DIM, name='embed'),
            C.layers.Recurrence(C.layers.LSTM(HIDDEN_DIM), go_backwards=False),
            C.sequence.last,  # keep only the final LSTM output ("last" in the diagram)
            C.layers.Dense(DSSM_DIM, activation=C.relu, name='q_proj'),
            C.layers.Dropout(DROPOUT_RATIO, name='dropout qdo1'),
            C.layers.Dense(DSSM_DIM, activation=C.tanh, name='q_enc')
        ])

        ans_vector = C.layers.Sequential([
            C.layers.Embedding(EMB_DIM, name='embed'),
            C.layers.Recurrence(C.layers.LSTM(HIDDEN_DIM), go_backwards=False),
            C.sequence.last,  # keep only the final LSTM output
            C.layers.Dense(DSSM_DIM, activation=C.relu, name='a_proj'),
            C.layers.Dropout(DROPOUT_RATIO, name='dropout ado1'),
            C.layers.Dense(DSSM_DIM, activation=C.tanh, name='a_enc')
        ])

    return {
        'query_vector': qry_vector(qry),
        'answer_vector': ans_vector(ans)
    }

# Create the model and store reference in `network` dictionary
network = create_model(qry, ans)

network['query'], network['axis_qry'] = qry, axis_qry
network['answer'], network['axis_ans'] = ans, axis_ans


Now that we have created a network, the next step is to find a suitable loss function: if a question is paired with the correct answer, the loss should be 0; otherwise it should be 1. In other words, this loss should maximize the similarity (dot product) between a question vector and the answer vector that answers it, and minimize the similarity between question and answer vectors that do not answer each other.

The use cases of DSSM often appear in information retrieval, where for a given query or question there are few answers amongst an ocean of poor answers or non-answers. The input data, as in this case, is a pair of a query and an answer (document or advertisement) that attracted a click. A classical way to train would be as a binary classifier that predicts click / no-click (or, equivalently, a 2-class classifier with one class each for click and no-click). One could generate pairs of queries and incorrect answers as no-click data. However, one way to simulate no-click data is to pair each query with the answers to other queries within the same minibatch. This is the concept behind the cosine_distance_with_negative_samples function. Note: This function returns 1 for a correct question and answer pair and 0 for an incorrect one, which is referred to as similarity. Hence, we use 1 - cosine_distance_with_negative_samples as our loss function.
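
The idea of borrowing negatives from the same minibatch can be sketched in plain Python. This is a conceptual illustration only, not CNTK's implementation: the helper `similarities_with_negatives`, its parameter names, and the indexing rule used to pick the other answers are assumptions, and the vectors are made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarities_with_negatives(queries, answers, shift=1, num_negative_samples=2):
    """For each query, return [positive similarity, negative similarities...]."""
    n = len(queries)
    rows = []
    for i, q in enumerate(queries):
        row = [cosine(q, answers[i])]        # the paired (clicked) answer
        for k in range(num_negative_samples):
            j = (i + shift + k) % n          # an answer that belongs to another query
            row.append(cosine(q, answers[j]))
        rows.append(row)
    return rows

# Made-up 2-dimensional vectors for a minibatch of three query/answer pairs
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
answers = [[2.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
sims = similarities_with_negatives(queries, answers)
print(sims[0])
```

The first entry of each row is the similarity to the paired answer; the remaining entries serve as simulated no-click examples at no extra data cost.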

In [9]:
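def create_loss(vector_a, vector_b):
    # num_negative_samples=5 is an assumed value; the original argument is missing from the text
    qry_ans_similarity = C.cosine_distance_with_negative_samples(vector_a, \
                                                                 vector_b, \
                                                                 shift=1, \
                                                                 num_negative_samples=5)
    return 1 - qry_ans_similarity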
In [10]:
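# Model parameters
MAX_EPOCHS = 5      # matches the "of 5" reported in the training log
EPOCH_SIZE = 10000
MINIBATCH_SIZE = 50 # assumed value; not specified in the text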
In [11]:
# Create trainer
def create_trainer(reader, network):

    # Setup the progress updater
    progress_writer = C.logging.ProgressPrinter(tag='Training', num_epochs=MAX_EPOCHS)

    # Set learning parameters
    lr_per_sample     = [0.0015625]*20 + \
                        [0.00046875]*20 + \
                        [0.00015625]*20 + \
                        [0.000046875]*10
    lr_schedule       = C.learning_parameter_schedule_per_sample(lr_per_sample, \
                                                 epoch_size=EPOCH_SIZE)
    mms               = [0]*20 + [0.9200444146293233]*20 + [0.9591894571091382]
    mm_schedule       = C.learners.momentum_schedule(mms, \
                                                     epoch_size=EPOCH_SIZE, \
                                                     minibatch_size=MINIBATCH_SIZE)
    l2_reg_weight     = 0.0002

    model = C.combine(network['query_vector'], network['answer_vector'])

    #Notify the network that the two dynamic axes are indeed same
    query_reconciled = C.reconcile_dynamic_axes(network['query_vector'], network['answer_vector'])

    network['loss'] = create_loss(query_reconciled, network['answer_vector'])
    network['error'] = None

    print('Using momentum sgd with no l2')
    dssm_learner = C.learners.momentum_sgd(model.parameters, lr_schedule, mm_schedule)

    network['learner'] = dssm_learner

    print('Using local learner')
    # Create trainer
    return C.Trainer(model, (network['loss'], network['error']), network['learner'], progress_writer)
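
The learning rate list in the trainer is a piecewise-constant schedule: each entry applies to one block of `epoch_size` samples, and the last entry is reused once the list is exhausted. A small sketch of that lookup, using a hypothetical helper `value_at` rather than any CNTK call:

```python
def value_at(schedule, epoch_size, samples_seen):
    """Look up a piecewise-constant schedule: entry k applies to the k-th block
    of `epoch_size` samples, and the last entry is reused afterwards."""
    block = samples_seen // epoch_size
    return schedule[min(block, len(schedule) - 1)]

EPOCH_SIZE = 10000
lr_per_sample = [0.0015625] * 20 + [0.00046875] * 20 + [0.00015625] * 20 + [0.000046875] * 10

lr_start = value_at(lr_per_sample, EPOCH_SIZE, 0)         # first block
lr_mid = value_at(lr_per_sample, EPOCH_SIZE, 250000)      # block 25
lr_late = value_at(lr_per_sample, EPOCH_SIZE, 10 ** 7)    # past the end of the list
print(lr_start, lr_mid, lr_late)
```

With only a handful of epochs of 10000 samples each, training never leaves the first 20 blocks, so the first learning rate is the only one in effect.
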
In [12]:
# Instantiate the trainer
trainer = create_trainer(train_source, network)
Using momentum sgd with no l2
Using local learner
In [13]:
# Train
def do_train(network, trainer, train_source):
    # define mapping from input streams to network inputs
    input_map = {
        network['query']: train_source.streams.query,
        network['answer']: train_source.streams.answer
    }

    t = 0
    for epoch in range(MAX_EPOCHS):         # loop over epochs
        epoch_end = (epoch+1) * EPOCH_SIZE
        while t < epoch_end:                # loop over minibatches on the epoch
            data = train_source.next_minibatch(MINIBATCH_SIZE, input_map= input_map)  # fetch minibatch
            trainer.train_minibatch(data)               # update model with it
            t += MINIBATCH_SIZE

        trainer.summarize_training_progress()   # print the per-epoch summary line

In [14]:
do_train(network, trainer, train_source)
Learning rate per 1 samples: 0.0015625
Momentum per 1 samples: 0.0
Finished Epoch[1 of 5]: [Training] loss = 0.343046 * 1522, metric = 0.00% * 1522 5.720s (266.1 samples/s);
Finished Epoch[2 of 5]: [Training] loss = 0.102804 * 1530, metric = 0.00% * 1530 3.464s (441.7 samples/s);
Finished Epoch[3 of 5]: [Training] loss = 0.066461 * 1525, metric = 0.00% * 1525 3.402s (448.3 samples/s);
Finished Epoch[4 of 5]: [Training] loss = 0.048511 * 1534, metric = 0.00% * 1534 3.390s (452.5 samples/s);
Finished Epoch[5 of 5]: [Training] loss = 0.035384 * 1510, metric = 0.00% * 1510 3.383s (446.3 samples/s);


Once the model is trained, we want to select a model whose error on the validation (hold-out) set is similar to its error on the training set.

Suggested Activity: Vary the number of epochs and check the training and the validation error.

The chosen model would then be used for prediction.
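
One simple way to carry out the suggested activity is to record the training and validation loss after each epoch and compare them. The sketch below uses a hypothetical helper `pick_epoch` and placeholder numbers invented purely to exercise it; they are not results from this tutorial.

```python
def pick_epoch(train_losses, val_losses):
    """Return the 1-based epoch with the lowest validation loss
    and the validation-minus-training gap at that epoch."""
    best = min(range(len(val_losses)), key=lambda e: val_losses[e])
    return best + 1, val_losses[best] - train_losses[best]

# Placeholder numbers, invented only to demonstrate the function
train_losses = [0.40, 0.20, 0.10, 0.05]
val_losses = [0.42, 0.25, 0.22, 0.30]

epoch, gap = pick_epoch(train_losses, val_losses)
print(epoch, round(gap, 2))
```

A validation loss that starts rising while the training loss keeps falling, as in the last placeholder epoch, is the usual sign that further training is overfitting.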

In [15]:
# Validate
def do_validate(network, val_source):
    # process minibatches and perform evaluation
    progress_printer = C.logging.ProgressPrinter(tag='Evaluation', num_epochs=0)

    val_map = {
        network['query']: val_source.streams.query,
        network['answer']: val_source.streams.answer
    }

    evaluator = C.eval.Evaluator(network['loss'], progress_printer)

    while True:
        minibatch_size = 100
        data = val_source.next_minibatch(minibatch_size, input_map=val_map)
        if not data:                                 # until we hit the end
            break

        evaluator.test_minibatch(data)               # accumulate the metric

    evaluator.summarize_test_progress()              # print the evaluation summary

In [16]:
do_validate(network, val_source)
Finished Evaluation [1]: Minibatch[1-35]: metric = 0.02% * 410;


We will now create vector representations of the query and the answer, and then compute the cosine similarity between the two vectors. When the answer is close to the question, one gets a high similarity, while an incorrect or partially relevant question/answer pair results in a smaller similarity. These scores are often used for ranking web documents in response to a query.

In [17]:
# load dictionaries
with open(data['query']['file']) as f:
    query_wl = [line.rstrip('\n') for line in f]
with open(data['answer']['file']) as f:
    answers_wl = [line.rstrip('\n') for line in f]
query_dict = {w: i for i, w in enumerate(query_wl)}
answers_dict = {w: i for i, w in enumerate(answers_wl)}

# let's run a sequence through
qry = 'BOS what contribution did  e1  made to science in 1665 EOS'
ans = 'BOS book author book_editions_published EOS'
ans_poor = 'BOS language human_language main_country EOS'

qry_idx = [query_dict[w+' '] for w in qry.split()] # convert to query word indices
print('Query Indices:', qry_idx)

ans_idx = [answers_dict[w+' '] for w in ans.split()] # convert to answer word indices
print('Answer Indices:', ans_idx)

ans_poor_idx = [answers_dict[w+' '] for w in ans_poor.split()] # convert to fake answer word indices
print('Poor Answer Indices:', ans_poor_idx)
Query Indices: [1202, 1154, 267, 321, 357, 648, 1070, 905, 549, 6, 1203]
Answer Indices: [1017, 135, 91, 137, 1018]
Poor Answer Indices: [1017, 501, 452, 533, 1018]

Convert the query, answer and the fake answer to one-hot representation. This is a necessary step since the input to our trained network takes one-hot encoded input.

In [18]:
# Create the one hot representations
qry_onehot = np.zeros([len(qry_idx),len(query_dict)], np.float32)
for t in range(len(qry_idx)):
    qry_onehot[t,qry_idx[t]] = 1

ans_onehot = np.zeros([len(ans_idx),len(answers_dict)], np.float32)
for t in range(len(ans_idx)):
    ans_onehot[t,ans_idx[t]] = 1

ans_poor_onehot = np.zeros([len(ans_poor_idx),len(answers_dict)], np.float32)
for t in range(len(ans_poor_idx)):
    ans_poor_onehot[t, ans_poor_idx[t]] = 1

For each of the one-hot encoded query and answer inputs, create the embeddings. Note: we use the answer embedding for both the correct answer and the poor answer. We then compute the cosine similarity between the query and each answer; a higher value indicates a better answer.

In [19]:
qry_embedding = network['query_vector'].eval([qry_onehot])
ans_embedding = network['answer_vector'].eval([ans_onehot])
ans_poor_embedding = network['answer_vector'].eval([ans_poor_onehot])

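from scipy.spatial.distance import cosine

# scipy's cosine() expects 1-D vectors, so flatten the batched outputs first
print('Query to Answer similarity:', 1-cosine(np.ravel(qry_embedding), np.ravel(ans_embedding)))
print('Query to poor-answer similarity:', 1-cosine(np.ravel(qry_embedding), np.ravel(ans_poor_embedding)))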
Query to Answer similarity: 0.99995367043
Query to poor-answer similarity: 0.999941420215