Copyright (c) Microsoft Corporation.  
Licensed under the MIT License.

# Abstractive Summarization using UniLM on CNN/DailyMail

## Before you start
Set `QUICK_RUN = True` to run the notebook on a small subset of the data for a smaller number of steps. If `QUICK_RUN = False`, the notebook takes about 9 hours to run on a VM with four 16GB NVIDIA V100 GPUs. 

In [3]:
QUICK_RUN = True

## Summary
This notebook demonstrates how to fine-tune the [Unified Language Model](https://arxiv.org/abs/1905.03197) (UniLM) for the abstractive summarization task. Utility functions and classes in the microsoft/nlp-recipes repo are used to facilitate data preprocessing, model training, model scoring, result postprocessing, and model evaluation.

### Abstractive Summarization
Abstractive summarization is the task of taking an input text and summarizing its content in a shorter output text. In contrast to extractive summarization, abstractive summarization does not copy sentences directly from the input text; instead, it rephrases the input text.

### UniLM
UniLM is a state-of-the-art model developed by Microsoft Research Asia (MSRA). The model is first pre-trained on a large unlabeled natural language corpus (English Wikipedia and BookCorpus) and can then be fine-tuned on different types of labeled data for various NLP tasks such as text classification and abstractive summarization.   
The figure below shows the UniLM architecture. During pre-training, the model parameters are shared across the LM objectives (i.e., bidirectional LM, unidirectional LM, and sequence-to-sequence LM). For different NLP tasks, UniLM uses different self-attention masks to control which context each word token can access.  
The seq-to-seq LM in the third row of the figure is used for the summarization task. In the seq-to-seq LM, word tokens in the input sequence can access all the other tokens in the input sequence, but cannot access the word tokens in the output sequence. Word tokens in the output sequence can access all the tokens in the input sequence and the tokens in the output sequence generated before the current position. 
<img src="https://nlpbp.blob.core.windows.net/images/unilm_architecture.PNG" width="600" height="600">
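
The seq-to-seq access rules described above can be sketched as a small attention-mask builder. This is only an illustration in plain Python, not the code used by UniLM or by this repo; the function name `seq2seq_attention_mask` is hypothetical, and letting an output token attend to itself is an assumption based on the UniLM paper's description.

```python
def seq2seq_attention_mask(source_len, target_len):
    """Build a square mask over [source tokens | target tokens].

    mask[q][k] == 1 means the token at position q may attend to position k.
    """
    total = source_len + target_len
    mask = [[0] * total for _ in range(total)]
    for query in range(total):
        for key in range(total):
            if key < source_len:
                # Every token, in the input or the output, sees the whole input.
                mask[query][key] = 1
            elif query >= source_len and key <= query:
                # Output tokens see earlier output positions (and, by
                # assumption, themselves), never later ones.
                mask[query][key] = 1
    return mask


# Three input tokens followed by two output tokens.
mask = seq2seq_attention_mask(source_len=3, target_len=2)
for row in mask:
    print(row)
```

The first three rows (input tokens) have zeros in the last two columns, so the input never sees the output; the last two rows form a lower-triangular pattern over the output positions.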


In [4]:
%load_ext autoreload
%autoreload 2
import os
import shutil
from tempfile import TemporaryDirectory
import pprint
import scrapbook as sb
import sys
import time
import torch

nlp_path = os.path.abspath("../../")
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)

from utils_nlp.dataset.cnndm import CNNDMSummarizationDatasetOrg
from utils_nlp.models.transformers.abstractive_summarization_seq2seq import S2SAbsSumProcessor, S2SAbstractiveSummarizer
from utils_nlp.eval import compute_rouge_python

from utils_nlp.models.transformers.datasets import SummarizationDataset
from utils_nlp.dataset.cnndm import detokenize

start_time = time.time()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
# model parameters
MODEL_NAME = "unilm-base-cased"
MAX_SEQ_LENGTH = 768
MAX_SOURCE_SEQ_LENGTH = 640
MAX_TARGET_SEQ_LENGTH = 128

# use 0 for CPU
NUM_GPUS =  torch.cuda.device_count()

# fine-tuning parameters
TRAIN_PER_GPU_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 2
LEARNING_RATE = 3e-5

TOP_N = -1
WARMUP_STEPS = 500
MAX_STEPS = 5000
BEAM_SIZE = 5
if QUICK_RUN:
    TOP_N = 100
    WARMUP_STEPS = 5
    MAX_STEPS = 50
    BEAM_SIZE = 3
    if NUM_GPUS == 0:
        TOP_N = 5
        MAX_STEPS = 10

# inference parameters
TEST_PER_GPU_BATCH_SIZE = 12
FORBID_IGNORE_WORD = "."

# mixed precision setting. To enable mixed precision training, follow instructions in SETUP.md. 
# You will be able to increase the batch sizes with mixed precision training.
FP16 = False

DATA_DIR = TemporaryDirectory().name
CACHE_DIR = TemporaryDirectory().name
MODEL_DIR = "."
RESULT_DIR = "."
OUTPUT_FILE = os.path.join(RESULT_DIR, 'nlp_cnndm_finetuning_results.txt')
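
To get a feel for how the fine-tuning parameters above interact, the following sketch estimates how many training examples are consumed per optimizer step and over a whole run. It assumes the common convention that the effective batch size is the per-GPU batch size times the number of devices times the gradient accumulation steps; whether `S2SAbstractiveSummarizer.fit` follows exactly this convention is not confirmed here, and the helper names are hypothetical.

```python
def effective_batch_size(per_gpu_batch_size, num_gpus, gradient_accumulation_steps):
    # max(1, num_gpus) so that a CPU-only run (num_gpus == 0) counts as one device.
    return per_gpu_batch_size * max(1, num_gpus) * gradient_accumulation_steps


def examples_consumed(max_steps, per_gpu_batch_size, num_gpus, gradient_accumulation_steps):
    # Total examples processed after max_steps optimizer steps.
    return max_steps * effective_batch_size(
        per_gpu_batch_size, num_gpus, gradient_accumulation_steps
    )


# Quick-run settings on a single device: batch size 1, accumulation 2, 50 steps.
quick = examples_consumed(max_steps=50, per_gpu_batch_size=1,
                          num_gpus=1, gradient_accumulation_steps=2)
# Full-run settings on four GPUs: batch size 1, accumulation 2, 5000 steps.
full = examples_consumed(max_steps=5000, per_gpu_batch_size=1,
                         num_gpus=4, gradient_accumulation_steps=2)
print(quick, full)
```

Under these assumptions the quick run touches 100 examples and the full run 40000, which is a useful sanity check when changing `TOP_N`, `MAX_STEPS`, or the batch sizes.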

## Load the CNN/DailyMail dataset
The [CNN/DailyMail dataset](https://cs.nyu.edu/~kcho/DMQA/) was originally introduced for Q&A research. Multiple versions of the dataset, processed for the summarization task, are available on the web. The `CNNDMSummarizationDatasetOrg` function downloads a version from the [UniLM repo](https://github.com/microsoft/unilm) with minimal processing. The function returns the training and testing datasets as `SummarizationDataset` objects, which can be further processed for model training and testing.

In [6]:
train_ds, test_ds = CNNDMSummarizationDatasetOrg(local_path=DATA_DIR, top_n=TOP_N)
print(len(train_ds))
print(len(test_ds))

Downloading 1jiDbDbAsqy_5BM79SmX6aSu5DQVCAZq1 into /tmp/tmp9eqxhpzx/cnndm_data.zip... Done.
100
100


## Preprocessing
The `S2SAbsSumProcessor` has multiple methods for converting input data in a `SummarizationDataset`, an `IterableSummarizationDataset`, or JSON files into the format required for model training and testing. The preprocessing steps include:
- Tokenize input text
- Convert tokens into token ids
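
The two steps listed above can be illustrated with a toy example. This is a simplified sketch and not what `S2SAbsSumProcessor` does internally: the real processor uses the tokenizer and vocabulary of the selected model, whereas the vocabulary, the whitespace tokenizer, and the function names below are made up for illustration.

```python
# Hypothetical toy vocabulary; a real model vocabulary has tens of thousands of entries.
TOY_VOCAB = {"[UNK]": 0, "The": 1, "plane": 2, "crashed": 3, ".": 4}


def tokenize(text):
    # Split the final period off as its own token, then split on whitespace.
    return text.replace(".", " .").split()


def convert_tokens_to_ids(tokens, vocab):
    # Tokens missing from the vocabulary map to the unknown-token id.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = tokenize("The plane crashed in France.")
token_ids = convert_tokens_to_ids(tokens, TOY_VOCAB)
print(tokens)
print(token_ids)
```

Here `in` and `France` are not in the toy vocabulary, so both map to the id of `[UNK]`.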

In [7]:
processor = S2SAbsSumProcessor(model_name=MODEL_NAME, cache_dir=CACHE_DIR)

In [8]:
cached_features_file_train = os.path.join(RESULT_DIR, "cached_features_for_training.pt")
cached_features_file_test = os.path.join(RESULT_DIR, "cached_features_for_testing.pt")
train_dataset = processor.s2s_dataset_from_sum_ds(train_ds, cached_features_file=cached_features_file_train, train_mode=True)
test_dataset = processor.s2s_dataset_from_sum_ds(test_ds, cached_features_file=cached_features_file_test, train_mode=False)

use cached feature file ./cached_features_for_training.pt
use cached feature file ./cached_features_for_testing.pt


## Fine-tune the model

The `S2SAbstractiveSummarizer` loads a pre-trained UniLM model specified by `model_name`.  
Call `S2SAbstractiveSummarizer.list_supported_models()` to see all the supported models.  
If you want to use a model on the local disk, specify `load_model_from_dir` and `model_file_name`. This is particularly useful if you want to load a previously fine-tuned model and use it for inference directly without fine-tuning. 

In [9]:
abs_summarizer = S2SAbstractiveSummarizer(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    max_source_seq_length=MAX_SOURCE_SEQ_LENGTH,
    max_target_seq_length=MAX_TARGET_SEQ_LENGTH,
    cache_dir=CACHE_DIR
)



In [10]:
abs_summarizer.fit(
    train_dataset=train_dataset,
    num_gpus=NUM_GPUS,
    per_gpu_batch_size=TRAIN_PER_GPU_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    max_steps=MAX_STEPS,
    fp16=FP16,
    save_model_to_dir=MODEL_DIR
)

Iteration:  20%|██        | 20/100 [00:09<00:39,  2.05it/s]

timestamp: 24/07/2020 05:53:03, average loss: 3.905551, time duration: 9.688284,
                            number of examples in current reporting: 20, step 10
                            out of total 50


Iteration:  40%|████      | 40/100 [00:19<00:29,  2.04it/s]

timestamp: 24/07/2020 05:53:12, average loss: 3.218857, time duration: 9.597530,
                            number of examples in current reporting: 20, step 20
                            out of total 50


Iteration:  60%|██████    | 60/100 [00:28<00:19,  2.04it/s]

timestamp: 24/07/2020 05:53:22, average loss: 3.173424, time duration: 9.650656,
                            number of examples in current reporting: 20, step 30
                            out of total 50


Iteration:  80%|████████  | 80/100 [00:38<00:09,  2.02it/s]

timestamp: 24/07/2020 05:53:32, average loss: 2.944041, time duration: 9.670888,
                            number of examples in current reporting: 20, step 40
                            out of total 50


Iteration: 100%|██████████| 100/100 [00:48<00:00,  2.04it/s]

timestamp: 24/07/2020 05:53:41, average loss: 2.746940, time duration: 9.667172,
                            number of examples in current reporting: 20, step 50
                            out of total 50





50

## Generate summaries on testing dataset

In [11]:
predictions = abs_summarizer.predict(
    test_dataset=test_dataset,
    num_gpus=NUM_GPUS,
    per_gpu_batch_size=TEST_PER_GPU_BATCH_SIZE,
    beam_size=BEAM_SIZE,
    forbid_ignore_word=FORBID_IGNORE_WORD,
    fp16=FP16
)

Evaluating: 100%|██████████| 9/9 [04:25<00:00, 20.56s/it]


In [13]:
for r in predictions[:2]:
    print(r)

Germanwings Flight 9525 crashed into the crash site .
" The ICC is a step toward greater justice and peace . "


In [14]:
test_ds.get_source()[0]

'Marseille, France (CNN) The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted Wednesday that he was not aware of any video footage from on board the plane. Marseille prosecutor Brice Robin told CNN that " so far no videos were used in the crash investigation . " He added, " A person who has such a video needs to immediately give it to the investigators . " Robin\'s comments follow claims by two magazines, German daily Bild and French Paris Match, of a cell phone video showing the harrowing final seconds from on board Germanwings Flight 9525 as it crashed into the French Alps . All 150 on board were killed. Paris Match and Bild reported that the video was recovered from a phone at the wreckage site. The two publications described the supposed video, but did not post it on their websites . The publications said that they watched the video, which was found by a source close to the investigation. " One can hear cries of\' My God\' in several lan

In [15]:
test_ds.get_target()[0]

'Marseille prosecutor says " so far no videos were used in the crash investigation " despite media reports. Journalists at Bild and Paris Match are " very confident " the video clip is real, an editor says. Andreas Lubitz had informed his Lufthansa training school of an episode of severe depression, airline says.'

In [16]:
predictions[0]

'Germanwings Flight 9525 crashed into the crash site .'

In [17]:
with open(OUTPUT_FILE, 'w', encoding="utf-8") as f:
    for line in predictions:
        f.write(line + '\n')

## Prediction on a single input sample

In [18]:
source = """
But under the new rule, set to be announced in the next 48 hours, Border Patrol agents would immediately return anyone to Mexico — without any detainment and without any due process — who attempts to cross the southwestern border between the legal ports of entry. The person would not be held for any length of time in an American facility.

Although they advised that details could change before the announcement, administration officials said the measure was needed to avert what they fear could be a systemwide outbreak of the coronavirus inside detention facilities along the border. Such an outbreak could spread quickly through the immigrant population and could infect large numbers of Border Patrol agents, leaving the southwestern border defenses weakened, the officials argued.
The Trump administration plans to immediately turn back all asylum seekers and other foreigners attempting to enter the United States from Mexico illegally, saying the nation cannot risk allowing the coronavirus to spread through detention facilities and Border Patrol agents, four administration officials said.
The administration officials said the ports of entry would remain open to American citizens, green-card holders and foreigners with proper documentation. Some foreigners would be blocked, including Europeans currently subject to earlier travel restrictions imposed by the administration. The points of entry will also be open to commercial traffic."""

In [19]:
single_test_ds = SummarizationDataset(
    None, source=[source], source_preprocessing=[detokenize],
)
single_test_dataset = processor.s2s_dataset_from_sum_ds(single_test_ds, train_mode=False)

100%|██████████| 1/1 [00:00<00:00, 221.29it/s]


In [20]:
single_prediction = abs_summarizer.predict(
    test_dataset=single_test_dataset,
    num_gpus=NUM_GPUS,
    per_gpu_batch_size=1,
    beam_size=BEAM_SIZE,
    forbid_ignore_word=FORBID_IGNORE_WORD,
    fp16=FP16
)

Evaluating: 100%|██████████| 1/1 [00:02<00:00,  2.46s/it]


In [21]:
single_prediction[0]

'The Trump administration plans to turn back all asylum seekers and other foreigners attempting to enter the United States from Mexico illegally , saying the nation cannot risk allowing the coronavirus to spread through detention facilities and Border Patrol agents .'

## Evaluation
We provide utility functions for evaluating summarization models; details can be found in the [summarization evaluation notebook](./summarization_evaluation.ipynb).  
With the settings in this notebook and `QUICK_RUN = False`, you should get ROUGE scores close to the following numbers:  
{'rouge-1': {'f': 0.37109626943068647,
  'p': 0.4692792272280924,
  'r': 0.33322322114381886},  
 'rouge-2': {'f': 0.1690495786379728,
  'p': 0.21782900161918375,
  'r': 0.15079122430118444},  
 'rouge-l': {'f': 0.2671310062443078,
  'p': 0.3414039392451434,
  'r': 0.2392756715930202}}
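
To make the `p`, `r`, and `f` keys in these scores concrete, here is a minimal sketch of ROUGE-1 computed as unigram overlap between one candidate summary and one reference. It is a simplification for illustration only and is not the implementation behind `compute_rouge_python`, which may tokenize, stem, and aggregate differently; the function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(candidate, reference):
    # Count lowercase, whitespace-separated unigrams in each text.
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(1, sum(cand_counts.values()))
    recall = overlap / max(1, sum(ref_counts.values()))
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f_score}


scores = rouge_1(
    candidate="the plane crashed in the alps",
    reference="the plane crashed in the french alps",
)
print(scores)
```

In this example all six candidate words appear in the seven-word reference, so precision is 1, recall is 6/7, and the F-score is their harmonic mean, 12/13.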

In [22]:
rouge_scores = compute_rouge_python(cand=predictions, ref=test_ds.get_target())
pprint.pprint(rouge_scores)

Number of candidates: 100
Number of references: 100
{'rouge-1': {'f': 0.17773123687925843,
             'p': 0.2617638566079649,
             'r': 0.17949882083348662},
 'rouge-2': {'f': 0.05765103445062507,
             'p': 0.07978699966526893,
             'r': 0.06088008973211268},
 'rouge-l': {'f': 0.14359941956184336,
             'p': 0.2206589161348127,
             'r': 0.14228368692690307}}


## Distributed training with DistributedDataParallel (DDP)
This notebook uses DataParallel for multi-GPU training by default. In general, DistributedDataParallel (DDP) is recommended because of its better performance. See details [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).  
Since DDP requires multiple processes and cannot be run from the notebook, we provide a Python script [abstractive_summarization_unilm_cnndm.py](./abstractive_summarization_unilm_cnndm.py) to demonstrate how to use DDP.  

First, we save the training and testing dataset to jsonlines files to be used by the python script. This avoids multiple processes repeating the initial data pre-processing. 
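
For readers unfamiliar with the JSON Lines format, the sketch below writes one JSON object per line and reads the records back using only the standard library. It illustrates the file format only; the field names `source` and `target` and the helper names are assumptions, and the actual layout written by the dataset's `save_to_jsonl` method may differ.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    # One JSON object per line, so the file can be read record by record.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; the field names are illustrative only.
records = [
    {"source": "First article text.", "target": "First summary."},
    {"source": "Second article text.", "target": "Second summary."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_ds.jsonl")
    write_jsonl(records, path)
    loaded = read_jsonl(path)

print(loaded == records)
```

Because each line is an independent record, separate worker processes can read the same file without repeating the initial preprocessing.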

In [23]:
train_ds.save_to_jsonl(os.path.join(RESULT_DIR, "train_ds.jsonl"))
test_ds.save_to_jsonl(os.path.join(RESULT_DIR, "test_ds.jsonl"))

Next, we can execute the Python script using `torch.distributed.launch`. Set `--nproc_per_node` to the number of GPUs on your machine.

```
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 abstractive_summarization_unilm_cnndm.py
```

**Note that the Python script sets `fp16=False` by default. If you have enabled mixed precision training following the instructions in [SETUP.md](../../SETUP.md), you can call the script with the additional argument `--fp16 true`. You will be able to increase the batch sizes with mixed precision training.**

In [24]:
if os.path.exists(DATA_DIR):
    shutil.rmtree(DATA_DIR, ignore_errors=True)
if os.path.exists(CACHE_DIR):
    shutil.rmtree(CACHE_DIR, ignore_errors=True)

In [25]:
print("Total notebook running time {}".format(time.time() - start_time))

Total notebook running time 914.4297912120819


In [None]:
# for testing
sb.glue("rouge_1_f_score", rouge_scores["rouge-1"]["f"])
sb.glue("rouge_2_f_score", rouge_scores["rouge-2"]["f"])
sb.glue("rouge_l_f_score", rouge_scores["rouge-l"]["f"])