1. PEGASUS:
Pre-training with Extracted Gap-sentences
for Abstractive Summarization
Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu (2019.12, ICML 2020)
NLP: 김한길, 문경언, 주정헌
2020.07.19.
2. Two main approaches to summarization task
https://www.slideshare.net/aclanthology/abigail-see-2017-get-to-the-point-summarization-with-pointergenerator-networks
Abstractive summarization:
● More difficult
● More flexible
● Recent approach
Extractive summarization:
● Easier
● Too restrictive, but unlikely to produce an absurd summary
● Most past work is extractive
3. PEGASUS
• PEGASUS
Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models
• A large Transformer-based encoder-decoder model
with a self-supervised pre-training objective (called gap-sentences generation, GSG)
to improve fine-tuning performance on abstractive summarization,
• achieving state-of-the-art results on 12 diverse summarization datasets.
Key question:
What is a good pre-training objective
tailored for abstractive text summarization?
4. Gap Sentences Generation (GSG)
A pre-training objective
tailored for abstractive text summarization
5. Transfer learning
• Problem: insufficient (labeled) data for a new domain (when training end-to-end)
• Transfer learning: time / money ↓, performance ↑
• e.g. ① Pre-training, ② Fine-tuning
8. Pre-training objective for summarization
Key question:
What is a good pre-training objective
tailored for abstractive text summarization?
Hypothesis:
A pre-training objective that
more closely resembles the downstream task gives better fine-tuning performance
(i.e. generality (reusability) ↓, performance ↑)
9. Pre-training objective for summarization
But it is not easy…
(Diagram: pre-training on a large unlabeled dataset, followed by fine-tuning on the downstream task)
10. Pre-training objective for summarization:
Gap Sentences Generation
GSG: Self-Supervised Objective for Summarization
• ① Masking sentences from a document
and generating these gap-sentences from the rest of the document
https://towardsdatascience.com/pegasus-google-state-of-the-art-abstractive-summarization-model-627b1bbbc5ce
② Choosing putatively important sentences as gap-sentences (pseudo-summary)
How? → ROUGE1-F1 (a minimal construction sketch follows below)
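As a rough illustration of step ①, here is a minimal sketch of how a GSG training pair could be built once the gap sentences are chosen (the selection itself is covered on the following slides). The [MASK1] string, the helper name, and the toy document are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal GSG sketch (illustrative): selected sentences become [MASK1] in the
# encoder input, and their concatenation becomes the decoder target.

def make_gsg_example(sentences, selected_idx, mask_token="[MASK1]"):
    """sentences: list of sentence strings; selected_idx: indices of gap sentences."""
    selected = set(selected_idx)
    encoder_input = " ".join(
        mask_token if i in selected else s for i, s in enumerate(sentences)
    )
    decoder_target = " ".join(sentences[i] for i in sorted(selected))
    return encoder_input, decoder_target

doc = [
    "Pegasus is a winged horse in Greek mythology.",
    "The model is named after it.",
    "It is pre-trained by generating masked sentences.",
]
inp, tgt = make_gsg_example(doc, selected_idx=[2])
print(inp)  # ... [MASK1] replaces the third sentence
print(tgt)  # "It is pre-trained by generating masked sentences."
```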
11. Which sentences are important? ROUGE1-F1
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation) & F-score
• ROUGE
• Metrics for evaluating automatic summarization or machine translation
• By comparing an automatically produced summary against a set of reference summaries
• Diverse variations: ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence; does not require consecutive matches)
• e.g.
Reference: “police killed the gunman”
- Output 1: “police kill the gunman”
- Output 2: “the gunman kill police”
Recall
- ROUGE-N: Output 1 = Output 2 (both contain “police” and “the gunman”)
- ROUGE-L:
- Output 1 = 3 / 4 (“police the gunman”)
- Output 2 = 2 / 4 (“the gunman”)
(a worked code sketch of these scores follows below)
http://www.ccs.neu.edu/home/vip/teach/DMcourse/5_topicmodel_summ/notes_slides/What-is-ROUGE.pdf
https://huffon.github.io/2019/12/07/rouge/
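To make the numbers above concrete, here is a small self-contained sketch that reproduces these ROUGE-1 and ROUGE-L recall values. It is a simplified re-implementation for illustration, not the official ROUGE toolkit.

```python
# Simplified ROUGE-1 / ROUGE-L recall for the example above (illustration only).

def rouge1_recall(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(ref))
    return overlap / len(ref)

def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rougeL_recall(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    return lcs_len(cand, ref) / len(ref)

reference = "police killed the gunman"
out1 = "police kill the gunman"
out2 = "the gunman kill police"
print(rouge1_recall(out1, reference), rouge1_recall(out2, reference))  # 0.75 0.75
print(rougeL_recall(out1, reference), rougeL_recall(out2, reference))  # 0.75 0.5
```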
12. Both GSG and MLM are applied simultaneously to this example as pre-training objectives
Pre-training objective for summarization:
Gap Sentences Generation
[MASK1]: sentences masked for GSG (the masked sentences become the decoder target)
[MASK2]: tokens masked for MLM
(see the combined masking sketch below)
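The sketch below shows one way GSG and MLM masking could be applied simultaneously to the same document: whole gap sentences become [MASK1] and a fraction of the remaining tokens become [MASK2]. The 15% token-masking rate is a BERT-style assumption for illustration, not a value from this slide.

```python
import random

# Sketch of combining GSG (sentence-level [MASK1]) with MLM (token-level [MASK2]).
# The 15% MLM rate is an assumption (BERT-style), not taken from the slide.

def mask_gsg_and_mlm(sentences, gap_idx, mlm_rate=0.15, seed=0):
    rng = random.Random(seed)
    encoder_parts = []
    for i, sent in enumerate(sentences):
        if i in gap_idx:
            encoder_parts.append("[MASK1]")  # GSG: whole sentence masked
        else:
            tokens = ["[MASK2]" if rng.random() < mlm_rate else tok  # MLM: token masked
                      for tok in sent.split()]
            encoder_parts.append(" ".join(tokens))
    decoder_target = " ".join(sentences[i] for i in sorted(gap_idx))  # GSG target
    return " ".join(encoder_parts), decoder_target

doc = [
    "Pegasus is a winged horse.",
    "It appears in Greek mythology.",
    "The summarization model borrows the name.",
]
print(mask_gsg_and_mlm(doc, gap_idx={1}))
```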
14. Experiments
Pre-training corpora
• C4: text from 350M Web-pages (750GB)
• HugeNews: 1.5B articles (3.8TB) collected from news and news-like websites from 2013-2019
Downstream Tasks/Datasets
• TensorFlow Summarization Datasets for reproducibility
Experiments
1. Pre-training ablation experiments on the choices of pre-training corpus, objective, and vocabulary size
Using PEGASUS-BASE (223M) instead of PEGASUS-LARGE (568M)
2. Larger Model Results
3. Fine-tuning with low-resource
4. Qualitative Observations
15. Pre-training ablation experiments:
6.1.1. Corpus
• Pre-training on HugeNews (1.5B news-like documents)
→ more effective on the two news downstream datasets
• Pre-training on C4 (350M Web-pages)
→ more effective on the non-news informal datasets (WikiHow and Reddit TIFU)
✓ Pre-trained models transfer more effectively to downstream tasks when their domains are better aligned.
16. Pre-training ablation experiments:
6.1.2. Pre-training Objectives
How to select the “important sentences” as gap-sentences? → 6 strategies (a selection sketch follows after this list)
• Random: uniformly select m sentences at random.
• Lead: select the first m sentences.
• Principal: select the top-m sentences scored by importance,
where importance = ROUGE1-F1 (Lin, 2004) between the sentence and the rest of the document
• (Ind) sentences scored independently / (Seq) sentences selected sequentially by greedily maximizing ROUGE1-F1
• (Uniq) n-grams treated as a set / (Orig) identical n-grams may be double-counted
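The sketch below (referenced in the list above) illustrates the Principal strategy in its Ind and Seq variants with Orig-style counting. The unigram ROUGE1-F1 here is a simplified re-implementation for illustration only; the paper's exact scoring details may differ.

```python
# Illustrative sketch of Principal gap-sentence selection (Ind vs. Seq, Orig counting).
# Simplified unigram ROUGE1-F1; not the official ROUGE implementation.

def rouge1_f1(candidate_tokens, reference_tokens):
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum(min(candidate_tokens.count(w), reference_tokens.count(w))
                  for w in set(candidate_tokens))
    if overlap == 0:
        return 0.0
    p, r = overlap / len(candidate_tokens), overlap / len(reference_tokens)
    return 2 * p * r / (p + r)

def select_ind(sentences, m):
    """Ind: score each sentence independently against the rest of the document."""
    scored = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:]).split()
        scored.append((rouge1_f1(s.split(), rest), i))
    return sorted(i for _, i in sorted(scored, reverse=True)[:m])

def select_seq(sentences, m):
    """Seq: greedily grow the selected set, maximizing ROUGE1-F1 of the
    concatenated selection against the remaining sentences at each step."""
    selected = []
    while len(selected) < m:
        best = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            cand = " ".join(sentences[j] for j in sorted(selected + [i])).split()
            rest = " ".join(sentences[j] for j in range(len(sentences))
                            if j not in selected + [i]).split()
            score = rouge1_f1(cand, rest)
            if best is None or score > best[0]:
                best = (score, i)
        selected.append(best[1])
    return sorted(selected)

doc = [
    "Pegasus has wings.",
    "Pegasus is a winged horse from Greek mythology.",
    "The model takes its name from the myth.",
    "It generates masked gap sentences during pre-training.",
]
print(select_ind(doc, 2), select_seq(doc, 2))
```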
17. - Comparison of six variants: Lead, Random, Ind-Orig, Ind-Uniq, Seq-Orig, Seq-Uniq
Pre-training ablation experiments:
6.1.2. Pre-training Objectives
- MLM alone < Lead < Random < … < Ind-Orig
- MLM & Ind-Orig vs. Ind-Orig alone:
MLM improved fine-tuning performance at early pre-training
checkpoints (100k - 200k steps),
but inhibited further gains with more pre-training steps (500k)
✓ MLM is not included in PEGASUS-LARGE
- GSG gap-sentences ratio: masking 30% of sentences was found optimal
18. Pre-training ablation experiments:
6.1.3. Effect of Vocabulary
- Two tokenizers compared:
- Byte-pair encoding (BPE)
- SentencePiece Unigram algorithm (vocabulary sizes from 32k to 256k)
- Best option: Unigram 96k in the large model (a short SentencePiece sketch follows below)
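As a concrete reference, a unigram vocabulary like the 96k one above can be trained with the SentencePiece library roughly as follows. The corpus file name, model prefix, and any options beyond model_type and vocab_size are placeholders/assumptions, not the paper's actual training setup.

```python
import sentencepiece as spm

# Train a 96k unigram vocabulary (the best-performing option above).
# "corpus.txt" is a placeholder: pre-training text, one document per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="pegasus_unigram_96k",
    vocab_size=96000,
    model_type="unigram",  # alternative: "bpe" for byte-pair encoding
)

# Load the trained tokenizer and encode a sample sentence.
sp = spm.SentencePieceProcessor(model_file="pegasus_unigram_96k.model")
print(sp.encode("PEGASUS pre-training with gap sentences.", out_type=str))
```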
19. 6.2 Larger Model Results
PEGASUS-BASE (223M) → PEGASUS-LARGE (568M)
• Number of layers for Transformer blocks L = 12 → 16
• Hidden size H = 768 → 1024
• Feed-forward layer size F = 3072 → 4096
• Number of self-attention heads A = 12 → 16
Optimization: Adafactor for both pre-training and fine-tuning, with square-root learning-rate decay and dropout 0.1
GSG
• Left 20% of selected sentences unchanged in the input to encourage the model to copy (copy mechanism)
• Increased the GSR (gap-sentences ratio) to 45% to achieve a similar number of “gaps” as the optimal 30% found above
(a hyperparameter sketch follows below)
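For reference, the sketch below collects the PEGASUS-LARGE settings listed on this slide into one place and implements a generic square-root learning-rate decay. The peak learning rate and warmup steps are illustrative assumptions, not values reported on the slides.

```python
# Hyperparameter sketch for PEGASUS-LARGE as described above; peak_lr and
# warmup_steps below are illustrative assumptions, not values from the slides.

def sqrt_decay_lr(step, peak_lr=0.01, warmup_steps=10_000):
    """Linear warmup followed by learning rate ~ 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

pegasus_large = {
    "layers": 16,                         # L
    "hidden_size": 1024,                  # H
    "ffn_size": 4096,                     # F
    "attention_heads": 16,                # A
    "dropout": 0.1,
    "optimizer": "Adafactor",
    "gap_sentence_ratio": 0.45,           # GSR used for PEGASUS-LARGE
    "unchanged_selected_fraction": 0.20,  # 20% of selected sentences left in input
}

if __name__ == "__main__":
    for s in (1_000, 10_000, 100_000, 500_000):
        print(s, round(sqrt_decay_lr(s), 6))
```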
20. 6.2 Larger Model Results
The improvement from a Transformer model without pre-training (Transformer-BASE) to PEGASUS-LARGE
was more significant on smaller datasets
✓ Small text summarization datasets benefit the most from pre-training
(ROUGE1-F1 / ROUGE2-F1 / ROUGEL-F1 scores)
21. 6.3 Zero and Low-Resource Summarization
- On 8 out of 12 datasets, with just 100 examples, PEGASUS-LARGE ≥ Transformer-BASE
The dashed lines are Transformer-BASE models,
equivalent in capacity to PEGASUS-BASE, trained using the full supervised datasets with no pre-training
22. 6.4 Qualitative Observations and Human Evaluation
① Both PEGASUS-LARGE models' outputs were at least as good as the reference summaries in all cases.
② At low levels of supervision, PEGASUS-LARGE (HugeNews) was not measurably worse than human summaries on XSum and CNN/DailyMail.
③ The Reddit TIFU case, however, perhaps due to its diverse writing styles, required full supervision.
Workers were asked to rate the summaries on a 1-5 scale
A paired t-test was performed to assess whether the scores were significantly different from those of the human-written summaries
23. Conclusion
• Proposed a new pre-training objective, GSG (gap-sentences generation),
tailored for abstractive text summarization
• Identified the best gap-sentence selection strategy: principal sentence selection (Ind-Orig)
• Demonstrated the effects of pre-training corpora, gap-sentences ratios, and vocabulary sizes
• Achieved state-of-the-art results on all 12 diverse downstream datasets
• Showed that the model was able to adapt to unseen summarization datasets
very quickly, achieving strong results with as few as 1000 examples