
Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# GenSen with PyTorch
In this tutorial, you will train a GenSen model for the sentence similarity task. We use the [SNLI](https://nlp.stanford.edu/projects/snli/) dataset in this example. For a more detailed walkthrough of the data processing, see [SNLI Data Prep](../01-prep-data/snli.ipynb). A quickstart version of this notebook can be found [here](../00-quick-start/).

## Notes:
Model training in this notebook requires a GPU machine. The running times shown were measured on a Standard_NC6 Azure VM with 1 NVIDIA Tesla K80 GPU and 12 GB of GPU memory. See the [README](README.md) for more details on running time.

## Overview

### What is GenSen?

GenSen is a technique to learn general purpose, fixed-length representations of sentences via multi-task training. The GenSen model combines the benefits of diverse sentence-representation learning objectives into a single multi-task framework. "This is the first large-scale reusable sentence representation model obtained by combining a set of training objectives with the level of diversity explored here, i.e. multi-lingual NMT, natural language inference, constituency parsing and skip-thought vectors." [\[1\]](#References) These representations are useful for transfer and low-resource learning. GenSen is trained on several data sources with multiple training objectives on over 100 million sentences.

The GenSen model is most similar to that of Luong et al. (2015) [\[4\]](#References), who train a many-to-many **sequence-to-sequence** model on a diverse set of weakly related tasks that includes machine translation, constituency parsing, image captioning, sequence autoencoding, and intra-sentence skip-thoughts. However, there are two key differences. "First, like McCann et al. (2017) [\[5\]](#References), their use of an attention mechanism prevents learning a fixed-length vector representation for a sentence. Second, their work aims for improvements on the same tasks on which the model is trained, as opposed to learning re-usable sentence representations that transfer elsewhere." [\[1\]](#References)

### Why GenSen?

The GenSen model reports state-of-the-art results for sentence similarity on several datasets, including SICK-R, SICK-E and STS, and competitive results on MRPC. The reported results, compared with other models, are as follows [\[3\]](#References):

| Model | MRPC | SICK-R | SICK-E | STS |
| --- | --- | --- | --- | --- |
| GenSen (Subramanian et al., 2018) | 78.6/84.4 | 0.888 | 87.8 | 78.9/78.6 |
| [InferSent](https://arxiv.org/abs/1705.02364) (Conneau et al., 2017) | 76.2/83.1 | 0.884 | 86.3 | 75.8/75.5 |
| [TF-KLD](https://www.aclweb.org/anthology/D13-1090) (Ji and Eisenstein, 2013) | 80.4/85.9 | - | - | - |

## Outline
This notebook is organized as follows:

1. Data preparation and inspection.
2. Model training and prediction.

For a more detailed deep dive into the GenSen model, check out the [GenSen Deep Dive Notebook](gensen_aml_deep_dive.ipynb).

## 0. Global Settings

In [11]:
import sys
sys.path.append("../..")

import os
import papermill as pm
import scrapbook as sb

from utils_nlp.dataset.preprocess import to_lowercase, to_nltk_tokens
from utils_nlp.dataset import snli, preprocess
from utils_nlp.models.pretrained_embeddings.glove import download_and_extract
from utils_nlp.dataset import Split
from examples.sentence_similarity.gensen_wrapper import GenSenClassifier

print("System version: {}".format(sys.version))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]


In [12]:
max_epoch = None
config_filepath = 'gensen_config.json'
base_data_path = '../../data'
nrows = None

## 1. Data Preparation and inspection

The [SNLI](https://nlp.stanford.edu/projects/snli/) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). 

### 1.1 Load the dataset

We provide a function load_pandas_df which does the following:

* Downloads the SNLI zipfile at the specified directory location
* Extracts the file based on the specified split
* Loads the split as a pandas dataframe

The zipfile contains the following files:

* snli_1.0_dev.txt
* snli_1.0_train.txt
* snli_1.0_test.txt
* snli_1.0_dev.jsonl
* snli_1.0_train.jsonl
* snli_1.0_test.jsonl
    
The loader defaults to reading from the .txt file; however, you can change this to .jsonl by setting the optional file_type parameter when calling the function.
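
For orientation, the .jsonl variant stores one JSON object per line. The sketch below reads such data using only the standard library. It is an illustration, not the loader's implementation: `read_jsonl` is a hypothetical helper, the two records are made up to mimic a few SNLI fields, and filtering on the "-" label (which SNLI uses when annotators reached no consensus) is shown only as one plausible cleaning step.

```python
import io
import json

# Two made-up records that mimic a few SNLI .jsonl fields (not real corpus rows).
sample_jsonl = io.StringIO(
    '{"gold_label": "neutral", "sentence1": "A man rides a bike.", '
    '"sentence2": "A man is outdoors."}\n'
    '{"gold_label": "-", "sentence1": "A dog runs.", '
    '"sentence2": "An animal sleeps."}\n'
)


def read_jsonl(handle, nrows=None):
    """Parse one JSON object per non-empty line, stopping after nrows rows."""
    rows = []
    for line in handle:
        if nrows is not None and len(rows) >= nrows:
            break
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


rows = read_jsonl(sample_jsonl)

# Keep only the pairs that carry a consensus label.
labelled = [row for row in rows if row["gold_label"] != "-"]
print(len(rows), len(labelled))
```

The `nrows` argument here only mirrors the idea of reading a subset of the data for quick experiments.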

In [3]:
train = snli.load_pandas_df(base_data_path, file_split=Split.TRAIN, nrows=nrows)
dev = snli.load_pandas_df(base_data_path, file_split=Split.DEV, nrows=nrows)
test = snli.load_pandas_df(base_data_path, file_split=Split.TEST, nrows=nrows)

train.head()

Unnamed: 0,gold_label,sentence1_binary_parse,sentence2_binary_parse,sentence1_parse,sentence2_parse,sentence1,sentence2,captionID,pairID,label1,label2,label3,label4,label5
0,neutral,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( is ( ( training ( his horse...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,A person is training his horse for a competition.,3416050480.jpg#4,3416050480.jpg#4r1n,neutral,,,,
1,contradiction,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( ( ( is ( at ( a diner ) ) )...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is at a diner, ordering an omelette.",3416050480.jpg#4,3416050480.jpg#4r1c,contradiction,,,,
2,entailment,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,"( ( A person ) ( ( ( ( is outdoors ) , ) ( on ...",(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is outdoors, on a horse.",3416050480.jpg#4,3416050480.jpg#4r1e,entailment,,,,
3,neutral,( Children ( ( ( smiling and ) waving ) ( at c...,( They ( are ( smiling ( at ( their parents ) ...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VB...,Children smiling and waving at camera,They are smiling at their parents,2267923837.jpg#2,2267923837.jpg#2r1n,neutral,,,,
4,entailment,( Children ( ( ( smiling and ) waving ) ( at c...,( There ( ( are children ) present ) ),(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (EX There)) (VP (VBP are) (NP (NN...,Children smiling and waving at camera,There are children present,2267923837.jpg#2,2267923837.jpg#2r1e,entailment,,,,


### 1.2 Tokenize

Now that the dataset is loaded into a pandas.DataFrame, we convert the sentences to tokens. We clean the data before tokenizing, which includes dropping unnecessary columns and renaming the relevant columns to score, sentence1, and sentence2.

In [4]:
def clean_and_tokenize(df):
    df = snli.clean_cols(df)
    df = snli.clean_rows(df)
    df = preprocess.to_lowercase(df)
    df = preprocess.to_nltk_tokens(df)
    return df

train = clean_and_tokenize(train)
dev = clean_and_tokenize(dev)
test = clean_and_tokenize(test)

The clean_and_tokenize function above first cleans the pandas dataframes, then applies lowercase standardization and tokenization. We use the [NLTK](https://www.nltk.org/) library for tokenization.

In [5]:
dev.head()

Unnamed: 0,score,sentence1,sentence2,sentence1_tokens,sentence2_tokens
0,neutral,two women are embracing while holding to go pa...,the sisters are hugging goodbye while holding ...,"[two, women, are, embracing, while, holding, t...","[the, sisters, are, hugging, goodbye, while, h..."
1,entailment,two women are embracing while holding to go pa...,two woman are holding packages.,"[two, women, are, embracing, while, holding, t...","[two, woman, are, holding, packages, .]"
2,contradiction,two women are embracing while holding to go pa...,the men are fighting outside a deli.,"[two, women, are, embracing, while, holding, t...","[the, men, are, fighting, outside, a, deli, .]"
3,entailment,"two young children in blue jerseys, one with t...",two kids in numbered jerseys wash their hands.,"[two, young, children, in, blue, jerseys, ,, o...","[two, kids, in, numbered, jerseys, wash, their..."
4,neutral,"two young children in blue jerseys, one with t...",two kids at a ballgame wash their hands.,"[two, young, children, in, blue, jerseys, ,, o...","[two, kids, at, a, ballgame, wash, their, hand..."
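
To make the lowercase-then-tokenize step concrete, here is a minimal standard-library sketch. It does not call NLTK: the regular expression is a simplified stand-in that only approximates NLTK's output on plain sentences, and `lowercase_and_tokenize` is a hypothetical helper, not part of utils_nlp.

```python
import re

# Stand-in tokenizer: one token per run of word characters or per punctuation mark.
# This is NOT NLTK's tokenizer; it only approximates it on simple sentences.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def lowercase_and_tokenize(sentence):
    """Lowercase a sentence, then split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence.lower())


# Sentences taken from the dev rows shown above.
examples = [
    "Two woman are holding packages.",
    "The men are fighting outside a deli.",
]

for sentence in examples:
    print(lowercase_and_tokenize(sentence))
```

On these simple sentences the result has the same shape as the sentence2_tokens column above: lowercase words with punctuation as separate tokens. Contractions and other edge cases are where a real tokenizer such as NLTK's would differ.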


## 2. Model application, performance and analysis of the results
The model is implemented as the GenSenClassifier class, with the training specifics hidden inside the fit() method, so that no explicit call is needed. The workflow has three steps:

**Model initialization**: This is where we tell our class how to train the model. The main parameters to specify are:
1. The config file, which contains information such as the number of training epochs and the minibatch size.
2. cache_dir, the folder where all the data will be saved.
3. The learning rate for the model.
4. The path to the pretrained embedding vectors.

**Model fit**: This is where we train the model on the data. The method takes three arguments: the training, dev and test set pandas dataframes. Note that the model is trained only on the training set. The test set is used to report the accuracy of the trained model, which in turn estimates the generalization capabilities of the algorithm. It is generally useful to look at these quantities to get a first idea of the optimization behaviour.

**Model prediction**: This is where we generate the similarity for a pair of sentences. Once the model has been trained and we are satisfied with its overall accuracy, we use the saved model to show the similarity between two provided sentences.

### 2.0 Download pretrained vectors
In this example we use GloVe for pretrained embedding vectors.

In [14]:
pretrained_embedding_path = download_and_extract(base_data_path)

Vector file already exists. No changes made.


### 2.1 Initialize Model

In [15]:
clf = GenSenClassifier(
    config_file=config_filepath,
    pretrained_embedding_path=pretrained_embedding_path,
    learning_rate=0.0001,
    cache_dir=base_data_path,
    max_epoch=max_epoch,
)

### 2.2 Train Model

In [8]:
%%time
clf.fit(train, dev, test)

  "num_layers={}".format(dropout, num_layers))
  torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)
  Variable(torch.LongTensor(sorted_src_lens), volatile=True)
  torch.nn.utils.clip_grad_norm(model.parameters(), 1.0)
  f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)
  f.softmax(class_logits).data.cpu().numpy().argmax(axis=-1)


CPU times: user 1h 19min 28s, sys: 22min 1s, total: 1h 41min 30s
Wall time: 1h 41min 22s


### 2.3 Predict

In the predict method we compute Pearson's correlation [\[2\]](#References) on the outputs of the model. The predictions of the model can be further improved by hyperparameter tuning, which we walk through in the [GenSen deep dive notebook](gensen_aml_deep_dive.ipynb).

In [16]:
sentences = [
    'The sky is blue and beautiful',
    'Love this blue and beautiful sky!'
]

results = clf.predict(sentences)
print("******** Similarity Score for sentences **************")
print(results)

# Record results with scrapbook for tests
sb.glue("results", results.to_dict())

******** Similarity Score for sentences **************
          0         1
0  1.000000  0.966793
1  0.966793  1.000000
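
The matrix above is symmetric with ones on the diagonal, which is what a pairwise Pearson correlation produces. The sketch below shows that computation in plain Python. It assumes the model yields one fixed-length vector per sentence. The vectors `emb_a` and `emb_b` are made-up toy values rather than GenSen outputs, and `pearson` and `correlation_matrix` are hypothetical helpers, not the wrapper's actual code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(vectors):
    """Pairwise Pearson correlations: entry [i][j] compares vectors i and j."""
    return [[pearson(u, v) for v in vectors] for u in vectors]


# Toy stand-ins for two sentence representations (assumed values).
emb_a = [0.2, -0.1, 0.7, 0.4]
emb_b = [0.3, -0.2, 0.6, 0.5]

matrix = correlation_matrix([emb_a, emb_b])
for row in matrix:
    print(["{:.4f}".format(value) for value in row])
```

Each diagonal entry correlates a vector with itself and is therefore 1, and the off-diagonal entries are equal because the coefficient is symmetric in its arguments.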


## References

1. Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. [*Learning general purpose distributed sentence representations via large scale multi-task learning*](https://arxiv.org/abs/1804.00079), ICLR, 2018.
2. [Pearson's correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
3. [Semantic textual similarity](http://nlpprogress.com/english/semantic_textual_similarity.html).
4. Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. [*Multi-task sequence to sequence learning*](https://arxiv.org/abs/1511.06114), 2015.
5. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. [*Learned in translation: Contextualized word vectors*](https://arxiv.org/abs/1708.00107), 2017.