1. PEGASUS:
Pre-training with Extracted Gap-sentences
for Abstractive Summarization
Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu (2019.12, ICML 2020)
NLP: 김한길, 문경언, 주정헌
2020.07.19.
2. Two main approaches to summarization task
https://www.slideshare.net/aclanthology/abigail-see-2017-get-to-the-point-summarization-with-pointergenerator-networks
Abstractive summarization:
● More difficult
● More flexible
● Recent approach
Extractive summarization:
● Easier
● Too restrictive, but unlikely to produce an absurd summary
● Most past work is extractive
3. PEGASUS
• PEGASUS
Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models
• A large Transformer-based encoder-decoder model
with a self-supervised pre-training objective (called gap-sentences generation, GSG)
to improve fine-tuning performance on abstractive summarization,
• achieving state-of-the-art results on 12 diverse summarization datasets.
Key question:
What is a good pre-training objective
tailored for abstractive text summarization?
4. Gap Sentences Generation (GSG)
A pre-training objective
tailored for abstractive text summarization
5. Transfer learning
• Problem: insufficient (labeled) data for a new domain (when training end-to-end)
• Transfer learning: time / money ↓, performance ↑
• e.g. ① Pre-training, ② Fine-tuning
8. Pre-training objective for summarization
Key question:
What is a good pre-training objective
tailored for abstractive text summarization?
Hypothesis:
A pre-training objective that
more closely resembles the downstream task gives better fine-tuning performance
(i.e. generality (reusability) ↓, performance ↑)
9. Pre-training objective for summarization
But it is not easy…
(Diagram: pre-training on a large unlabeled dataset, followed by fine-tuning on the downstream task)
10. Pre-training objective for summarization:
Gap Sentences Generation
GSG: Self-Supervised Objective for Summarization
• ① Masking sentences from a document
and generating these gap-sentences from the rest of the document
https://towardsdatascience.com/pegasus-google-state-of-the-art-abstractive-summarization-model-627b1bbbc5ce
② Choosing putatively important sentences as gap-sentences (pseudo-summary)
How? → ROUGE1-F1 (a minimal construction sketch follows below)
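As a rough illustration of step ①, here is a minimal sketch of how a GSG training pair could be built once the gap sentences are chosen (the selection itself is covered on the following slides). The [MASK1] string, the helper name, and the toy document are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal GSG sketch (illustrative): selected sentences become [MASK1] in the
# encoder input, and their concatenation becomes the decoder target.

def make_gsg_example(sentences, selected_idx, mask_token="[MASK1]"):
    """sentences: list of sentence strings; selected_idx: indices of gap sentences."""
    selected = set(selected_idx)
    encoder_input = " ".join(
        mask_token if i in selected else s for i, s in enumerate(sentences)
    )
    decoder_target = " ".join(sentences[i] for i in sorted(selected))
    return encoder_input, decoder_target

doc = [
    "Pegasus is a winged horse in Greek mythology.",
    "The model is named after it.",
    "It is pre-trained by generating masked sentences.",
]
inp, tgt = make_gsg_example(doc, selected_idx=[2])
print(inp)  # ... [MASK1] replaces the third sentence
print(tgt)  # "It is pre-trained by generating masked sentences."
```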
11. Which sentences are important? ROUGE1-F1
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation) & F-score
• ROUGE
• Metrics for evaluating automatic summarization or machine translation
• By comparing an automatically produced summary against a set of reference summaries
• Diverse variations: ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence; does not require consecutive matches)
• e.g.
Reference: “police killed the gunman”
- Output 1: “police kill the gunman”
- Output 2: “the gunman kill police”
Recall
- ROUGE-N: Output 1 = Output 2 (both contain “police” and “the gunman”)
- ROUGE-L:
- Output 1 = 3 / 4 (“police the gunman”)
- Output 2 = 2 / 4 (“the gunman”)
(a worked code sketch of these scores follows below)
http://www.ccs.neu.edu/home/vip/teach/DMcourse/5_topicmodel_summ/notes_slides/What-is-ROUGE.pdf
https://huffon.github.io/2019/12/07/rouge/
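To make the numbers above concrete, here is a small self-contained sketch that reproduces these ROUGE-1 and ROUGE-L recall values. It is a simplified re-implementation for illustration, not the official ROUGE toolkit.

```python
# Simplified ROUGE-1 / ROUGE-L recall for the example above (illustration only).

def rouge1_recall(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(ref))
    return overlap / len(ref)

def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rougeL_recall(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    return lcs_len(cand, ref) / len(ref)

reference = "police killed the gunman"
out1 = "police kill the gunman"
out2 = "the gunman kill police"
print(rouge1_recall(out1, reference), rouge1_recall(out2, reference))  # 0.75 0.75
print(rougeL_recall(out1, reference), rougeL_recall(out2, reference))  # 0.75 0.5
```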
12. Both GSG and MLM are applied simultaneously to this example as pre-training objectives
Pre-training objective for summarization:
Gap Sentences Generation
[MASK1]: sentences masked for GSG (the masked sentences become the decoder target)
[MASK2]: tokens masked for MLM
(see the combined masking sketch below)
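The sketch below shows one way GSG and MLM masking could be applied simultaneously to the same document: whole gap sentences become [MASK1] and a fraction of the remaining tokens become [MASK2]. The 15% token-masking rate is a BERT-style assumption for illustration, not a value from this slide.

```python
import random

# Sketch of combining GSG (sentence-level [MASK1]) with MLM (token-level [MASK2]).
# The 15% MLM rate is an assumption (BERT-style), not taken from the slide.

def mask_gsg_and_mlm(sentences, gap_idx, mlm_rate=0.15, seed=0):
    rng = random.Random(seed)
    encoder_parts = []
    for i, sent in enumerate(sentences):
        if i in gap_idx:
            encoder_parts.append("[MASK1]")  # GSG: whole sentence masked
        else:
            tokens = ["[MASK2]" if rng.random() < mlm_rate else tok  # MLM: token masked
                      for tok in sent.split()]
            encoder_parts.append(" ".join(tokens))
    decoder_target = " ".join(sentences[i] for i in sorted(gap_idx))  # GSG target
    return " ".join(encoder_parts), decoder_target

doc = [
    "Pegasus is a winged horse.",
    "It appears in Greek mythology.",
    "The summarization model borrows the name.",
]
print(mask_gsg_and_mlm(doc, gap_idx={1}))
```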
14. Experiments
Pre-training corpora
• C4: text from 350M Web-pages (750GB)
• HugeNews: 1.5B articles (3.8TB) collected from news and news-like websites from 2013-2019
Downstream Tasks/Datasets
• TensorFlow Summarization Datasets for reproducibility
Experiments
1. Pre-training ablation experiments on the choices of pre-training corpus, objective, and vocabulary size
Using PEGASUS-BASE (223M) instead of PEGASUS-LARGE (568M)
2. Larger Model Results
3. Fine-tuning with low-resource
4. Qualitative Observations
15. Pre-training ablation experiments:
6.1.1. Corpus
• Pre-training on HugeNews (1.5B news-like documents)
→ more effective on the two news downstream datasets
• Pre-training on C4 (350M Web-pages)
→ more effective on the non-news informal datasets (WikiHow and Reddit TIFU)
✓ Pre-trained models transfer more effectively to downstream tasks when their domains are better aligned.
16. Pre-training ablation experiments:
6.1.2. Pre-training Objectives
How to select the “important sentences” as gap-sentences? → 6 strategies (a selection sketch follows after this list)
• Random: uniformly select m sentences at random.
• Lead: select the first m sentences.
• Principal: select the top-m sentences scored by importance,
where importance = ROUGE1-F1 (Lin, 2004) between the sentence and the rest of the document
• (Ind) sentences scored independently / (Seq) sentences selected sequentially by greedily maximizing ROUGE1-F1
• (Uniq) n-grams treated as a set / (Orig) identical n-grams may be double-counted
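The sketch below (referenced in the list above) illustrates the Principal strategy in its Ind and Seq variants with Orig-style counting. The unigram ROUGE1-F1 here is a simplified re-implementation for illustration only; the paper's exact scoring details may differ.

```python
# Illustrative sketch of Principal gap-sentence selection (Ind vs. Seq, Orig counting).
# Simplified unigram ROUGE1-F1; not the official ROUGE implementation.

def rouge1_f1(candidate_tokens, reference_tokens):
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum(min(candidate_tokens.count(w), reference_tokens.count(w))
                  for w in set(candidate_tokens))
    if overlap == 0:
        return 0.0
    p, r = overlap / len(candidate_tokens), overlap / len(reference_tokens)
    return 2 * p * r / (p + r)

def select_ind(sentences, m):
    """Ind: score each sentence independently against the rest of the document."""
    scored = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:]).split()
        scored.append((rouge1_f1(s.split(), rest), i))
    return sorted(i for _, i in sorted(scored, reverse=True)[:m])

def select_seq(sentences, m):
    """Seq: greedily grow the selected set, maximizing ROUGE1-F1 of the
    concatenated selection against the remaining sentences at each step."""
    selected = []
    while len(selected) < m:
        best = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            cand = " ".join(sentences[j] for j in sorted(selected + [i])).split()
            rest = " ".join(sentences[j] for j in range(len(sentences))
                            if j not in selected + [i]).split()
            score = rouge1_f1(cand, rest)
            if best is None or score > best[0]:
                best = (score, i)
        selected.append(best[1])
    return sorted(selected)

doc = [
    "Pegasus has wings.",
    "Pegasus is a winged horse from Greek mythology.",
    "The model takes its name from the myth.",
    "It generates masked gap sentences during pre-training.",
]
print(select_ind(doc, 2), select_seq(doc, 2))
```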
17. - Comparison of six variants: Lead, Random, Ind-Orig, Ind-Uniq, Seq-Orig, Seq-Uniq
Pre-training ablation experiments:
6.1.2. Pre-training Objectives
- MLM alone < Lead < Random < … < Ind-Orig
- MLM & Ind-Orig vs. Ind-Orig alone:
MLM improved fine-tuning performance at early pre-training
checkpoints (100k - 200k steps),
but inhibited further gains with more pre-training steps (500k)
✓ MLM is not included in PEGASUS-LARGE
- GSG gap-sentences ratio: masking 30% of sentences was found optimal
18. Pre-training ablation experiments:
6.1.3. Effect of Vocabulary
- Two tokenizers compared:
- Byte-pair encoding (BPE)
- SentencePiece Unigram algorithm (vocabulary sizes from 32k to 256k)
- Best option: Unigram 96k in the large model (a short SentencePiece sketch follows below)
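As a concrete reference, a unigram vocabulary like the 96k one above can be trained with the SentencePiece library roughly as follows. The corpus file name, model prefix, and any options beyond model_type and vocab_size are placeholders/assumptions, not the paper's actual training setup.

```python
import sentencepiece as spm

# Train a 96k unigram vocabulary (the best-performing option above).
# "corpus.txt" is a placeholder: pre-training text, one document per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="pegasus_unigram_96k",
    vocab_size=96000,
    model_type="unigram",  # alternative: "bpe" for byte-pair encoding
)

# Load the trained tokenizer and encode a sample sentence.
sp = spm.SentencePieceProcessor(model_file="pegasus_unigram_96k.model")
print(sp.encode("PEGASUS pre-training with gap sentences.", out_type=str))
```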
19. 6.2 Larger Model Results
PEGASUS-BASE (223M) → PEGASUS-LARGE (568M)
• Number of layers for Transformer blocks L = 12 → 16
• Hidden size H = 768 → 1024
• Feed-forward layer size F = 3072 → 4096
• Number of self-attention heads A = 12 → 16
Optimization: Adafactor for both pre-training and fine-tuning, with square-root learning-rate decay and dropout 0.1
GSG
• Left 20% of selected sentences unchanged in the input to encourage the model to copy (copy mechanism)
• Increased the GSR (gap-sentences ratio) to 45% to achieve a similar number of “gaps” as the optimal 30% found above
(a hyperparameter sketch follows below)
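For reference, the sketch below collects the PEGASUS-LARGE settings listed on this slide into one place and implements a generic square-root learning-rate decay. The peak learning rate and warmup steps are illustrative assumptions, not values reported on the slides.

```python
# Hyperparameter sketch for PEGASUS-LARGE as described above; peak_lr and
# warmup_steps below are illustrative assumptions, not values from the slides.

def sqrt_decay_lr(step, peak_lr=0.01, warmup_steps=10_000):
    """Linear warmup followed by learning rate ~ 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

pegasus_large = {
    "layers": 16,                         # L
    "hidden_size": 1024,                  # H
    "ffn_size": 4096,                     # F
    "attention_heads": 16,                # A
    "dropout": 0.1,
    "optimizer": "Adafactor",
    "gap_sentence_ratio": 0.45,           # GSR used for PEGASUS-LARGE
    "unchanged_selected_fraction": 0.20,  # 20% of selected sentences left in input
}

if __name__ == "__main__":
    for s in (1_000, 10_000, 100_000, 500_000):
        print(s, round(sqrt_decay_lr(s), 6))
```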
20. 6.2 Larger Model Results
The improvement from a Transformer model without pre-training (Transformer-BASE) to PEGASUS-LARGE
was more significant on smaller datasets
✓ Small text summarization datasets benefit the most from pre-training
(ROUGE1-F1 / ROUGE2-F1 / ROUGEL-F1 scores)
21. 6.3 Zero and Low-Resource Summarization
- On 8 out of 12 datasets, with just 100 examples, PEGASUS-LARGE ≥ Transformer-BASE
The dashed lines are Transformer-BASE models,
equivalent in capacity to PEGASUS-BASE, trained using the full supervised datasets with no pre-training
22. 6.4 Qualitative Observations and Human Evaluation
① Both PEGASUS-LARGE models' outputs were at least as good as the reference summaries in all cases.
② At low levels of supervision, PEGASUS-LARGE (HugeNews) was not measurably worse than human summaries on XSum and CNN/DailyMail.
③ The Reddit TIFU case, however, perhaps due to its diverse writing styles, required full supervision.
Workers were asked to rate the summaries on a 1-5 scale
A paired t-test was performed to assess whether the scores were significantly different from those of the human-written summaries
23. Conclusion
• Proposed a new pre-training objective, GSG (gap-sentences generation),
tailored for abstractive text summarization
• Identified the best gap-sentence selection strategy: principal sentence selection (Ind-Orig)
• Demonstrated the effects of pre-training corpora, gap-sentences ratios, and vocabulary sizes
• Achieved state-of-the-art results on all 12 diverse downstream datasets
• Showed that the model was able to adapt to unseen summarization datasets
very quickly, achieving strong results with as few as 1000 examples