#TechSEOBoost | @CatalystSEM
THANK YOU TO OUR SPONSORS
Generating Qualitative Content with GPT-2
in All Languages
Vincent Terrasi, OnCrawl
Vincent Terrasi | @vincentterrasi | #TechSEOBoost
SEO Use-cases
• Image captioning with Pythia
• Visual question answering
• Abstractive Summarization with BERTsum
• Full Article generation with GPT-2
Text Spinners are bad
Google, What is bad generated content in 2016?
• Text translated by an automated tool without human review or curation before publishing
• Text generated through automated processes, such as Markov chains
• Text generated using automated synonymizing or obfuscation techniques
• Text generated from scraping Atom/RSS feeds or search results
• Stitching or combining content from different web pages without adding sufficient value
https://web.archive.org/web/20160222004700/https://support.google.com/webmasters/answer/2721306?hl=en
Google, What is bad generated content in 2019?
• Text that makes no sense to the reader but which may contain search keywords.
• Text translated by an automated tool without human review or curation before publishing
• Text generated through automated processes, such as Markov chains
• Text generated using automated synonymizing or obfuscation techniques
• Text generated from scraping Atom/RSS feeds or search results
• Stitching or combining content from different web pages without adding sufficient value
https://support.google.com/webmasters/answer/2721306?hl=en
Surprise!
2019, the best year for using AI for text generation
GPT-2, BERT, ELMo, ULMFiT (J. Howard)
Transformer and Attention Model
Patterns for Attention Model
Pattern 1: Attention to next word
Pattern 2: Attention to previous word
Pattern 3: Attention to identical/related words
Pattern 4: Attention to identical/related words in another sentence
Pattern 5: Attention to other words that are predictive of the word (next word)
Pattern 6: Attention to delimiter tokens
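Patterns like these are read off attention weight matrices. As a minimal sketch (not the code of any particular model), the weights for a single head can be computed with scaled dot-product attention. The toy vectors, the function names and the optional causal mask are assumptions made for illustration. With the causal mask used by GPT-2-style decoders, a token cannot attend to later tokens, so a pattern such as "attention to next word" can only show up in models without that mask.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (masked scores are -inf)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, causal=False):
    """Scaled dot-product attention weights for one head.

    queries, keys: one vector (list of floats) per token.
    Returns rows[i][j] = how much token i attends to token j.
    """
    d = len(keys[0])
    rows = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # hide later tokens
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        rows.append(softmax(scores))
    return rows

# Toy 2-dimensional vectors for three tokens (made-up values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(vectors, vectors, causal=True)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how one token spreads its attention over the tokens it is allowed to see.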
State of the Art
⚫ All models exist for English
⚫ Documentation is good
⚫ So we just need to translate
There are a lot of biases:
◦ Small Talk
◦ Idioms
◦ Local Named Entities
◦ Rarest Verbs
◦ Uncommon Tenses
◦ Gender rules
How to scale?
Create your own model in your language
Objectives
Use only qualitative methods to improve the quality of content created by humans
Extract the knowledge learned by the deep learning model.
Why have other attempts failed?
Quantitative:
You need a lot of data: more than 100,000 texts with a minimum of 500 words each
Qualitative:
You need high-quality texts
GPT-2 Recipe
Step 1: Training the model
Training from scratch, without a pretrained model, requires significant computing power.
You need GPUs! It took 3 days to get my first result with one GPU.
Step 2: Generating the compressed training dataset - 1/2
GPT-2 needs to learn from text in Byte Pair Encoding (BPE) format, which is a simple form of data compression.
Why?
- Predicting the next character is too imprecise
- Predicting the next word is too precise and takes a lot of computing power.
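To make this trade-off concrete, here is a minimal sketch of the BPE idea on a toy corpus: start from single characters and repeatedly merge the most frequent adjacent pair into a new symbol. The corpus, the number of merges and the function names are made up for illustration; this is not the tokenizer that GPT-2 or SentencePiece ships.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(sorted(words))
```

After a few merges, frequent words become single symbols while rare words stay split into smaller pieces, which sits between character-level and word-level prediction.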
Step 2: Generating the compressed training dataset - 2/2
I use SentencePiece to generate my BPE files.
Why?
- Unsupervised text tokenizer and detokenizer
- Purely end-to-end system that does not depend on language-specific pre/postprocessing.
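As a rough sketch of this step, the SentencePiece Python package can train a BPE model directly from a raw text file. The file names and helper names below are assumptions for illustration, not the author's exact settings, and the default vocabulary size simply reuses the n_vocab value quoted in this deck.

```python
def bpe_training_args(corpus_path, model_prefix, vocab_size=50257):
    """Keyword arguments for SentencePiece's trainer (BPE model)."""
    return {
        "input": corpus_path,          # raw text, one sentence or paragraph per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,      # lower this for a small corpus
        "model_type": "bpe",
    }

def train_and_encode(args, sample_text):
    """Train the tokenizer, then encode a sample into subword pieces."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(**args)
    sp = spm.SentencePieceProcessor(model_file=args["model_prefix"] + ".model")
    return sp.encode(sample_text, out_type=str)

# Hypothetical file names; nothing is trained until train_and_encode is called.
args = bpe_training_args("corpus_fr.txt", "fr_bpe")
print(args)
```

Calling `train_and_encode(args, "Bonjour le monde")` would then train on the corpus file and return the subword pieces for the sample sentence.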
Step 3: Fine-tuning the model
Vocabulary size: depends on the language
- n_vocab: 50257
Embedding size: default value recommended by the OpenAI team
- n_embd: 768
Number of attention heads: no greater accuracy if you increase this value
- n_head: 12
Number of layers: no greater accuracy if you increase this value
- n_layer: 12
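For a sense of scale, these four values are enough to sketch the size of the resulting network. The snippet below assumes the standard GPT-2 block layout (attention plus a feed-forward layer four times as wide, with biases and layer norms, and an output layer that shares the token embedding table) and a context length `n_ctx` of 1024, which the slide does not state. The function name is made up for illustration.

```python
hparams = {
    "n_vocab": 50257,
    "n_ctx": 1024,   # assumption: context length is not given on the slide
    "n_embd": 768,
    "n_head": 12,
    "n_layer": 12,
}

def count_parameters(hp):
    """Parameter count for a GPT-2-style decoder with tied output embeddings."""
    d = hp["n_embd"]
    if d % hp["n_head"] != 0:
        raise ValueError("n_embd must be divisible by n_head")
    embeddings = hp["n_vocab"] * d + hp["n_ctx"] * d   # token + position tables
    attention = (3 * d * d + 3 * d) + (d * d + d)      # QKV projection + output projection
    feed_forward = (4 * d * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * (2 * d)                          # two per block (scale + bias)
    block = attention + feed_forward + layer_norms
    final_norm = 2 * d
    return embeddings + hp["n_layer"] * block + final_norm

head_dim = hparams["n_embd"] // hparams["n_head"]
print(f"dimensions per attention head: {head_dim}")
print(f"parameters: {count_parameters(hparams):,}")
```

Under these assumptions the model has roughly 124 million parameters, which gives an idea of why training takes days on a single GPU.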
Step 4: Generating article text
Once the model has been trained, the gpt-2-gen command is used to generate text.
The first parameter is the path to the model.
The second is the beginning of the sentence.
Then there are two optional parameters:
◦ --tokens-to-generate: number of tokens to generate, default 42
◦ --top-k: number of candidate tokens considered at each step, default 8
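The --top-k option appears to correspond to top-k sampling: at each step only the k most likely tokens are kept, and the next token is drawn from that shortlist. A minimal sketch with made-up scores follows; the function name and the toy vocabulary are illustrative and are not taken from the gpt-2-gen implementation.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring candidates."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the shortlist only (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

vocab = ["le", "la", "un", "chat", "chien", "zèbre"]
logits = [2.0, 1.5, 0.3, 3.1, 2.9, -4.0]  # made-up scores for the next token

rng = random.Random(0)
samples = [vocab[top_k_sample(logits, k=3, rng=rng)] for _ in range(10)]
print(samples)
```

A small k keeps the output close to the most probable continuations, while a large k lets rarer tokens through.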
Results & Quality
Evaluated subjectively by a native reader.
The pyLanguagetool API was used to check the results quantitatively; it did not find any errors in the generated text.
https://github.com/Findus23/pyLanguagetool
You can find my Google Colab notebook for French here:
https://colab.research.google.com/drive/13Lbk1TYmTjoQFO6qbw_f1TJgoD5ulJwV
Warning: it is just an example using limited data.
Now it is your turn.
Further?
Parameters: top-k < 10, tokens < 10
Performance: high
Objective: very high-quality content closely related to your original training content
Use cases: anchors for internal linking, title variants, meta variants
Parameters: top-k > 50, tokens > 400
Performance: low
Objective: low-quality content because the model is weak, but the model successfully extracts all the concepts that GPT-2 learned about your dataset
Use cases: guides to help you write for a given query, with the stated purpose of saving you time
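The two parameter ranges above can be turned into reusable presets. The sketch below only builds the gpt-2-gen command line using the positional arguments and flags described in Step 4. The concrete numbers (8/8 and 60/450) are arbitrary values picked to satisfy the thresholds on the slide, and the model path and prompt are placeholders.

```python
import shlex

# Assumed example values that satisfy the slide's thresholds.
PRESETS = {
    "short": {"top_k": 8, "tokens": 8},     # top-k < 10, tokens < 10
    "long": {"top_k": 60, "tokens": 450},   # top-k > 50, tokens > 400
}

def build_command(model_path, prompt, preset):
    """Return the gpt-2-gen argument list for one preset."""
    p = PRESETS[preset]
    return [
        "gpt-2-gen",
        model_path,
        prompt,
        "--tokens-to-generate", str(p["tokens"]),
        "--top-k", str(p["top_k"]),
    ]

cmd = build_command("path/to/model", "Le référencement naturel", "short")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd)` on a machine where gpt-2-gen and a trained model are available.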
Thank You
vincent@oncrawl.com
Catalyst | @CatalystSEM | #TechSEOBoost
Thanks for viewing the SlideShare!
Watch the recording: https://youtube.com/session-example
Or contact us today to discover how Catalyst can deliver unparalleled SEO results for your business. https://www.catalystdigital.com/
