…with
Natural Language
Processing and
Text Classification
Data Natives 2015
19.11.2015 - Peter Grosskopf
Hey, I'm Peter.
Developer (mostly Ruby), Founder (of Zweitag)
Chief Development Officer @ HitFox Group
Department "Tech & Development" (TechDev)
Company Builder with 500+
employees
in AdTech, FinTech and Big Data
Company Builder =
💡 Ideas + 👥 People
How do we select the best people out of more than 1000
applications every month in a consistent way?
?
? ?
Machine Learning ?
Yeah!
I found a
solution
Not really 💩
Our Goal
Add a sort-by-relevance feature
to lower the screening costs
and invite people faster
Let's Go!
Action Steps
1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results
1. Prepare
Load data
Remove outliers
Remove stopwords (language
detection + stemming with NLTK)
Define classes for workflow states
Link data
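The cleaning step above could look roughly like the sketch below. It is a simplified stand-in, not the pipeline from the talk: the stopword list is a hypothetical, hand-written one, and language detection and NLTK stemming are left out to keep the example dependency-free.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stopword list. A real run would pick a
# per-language list (for example from NLTK) after language detection.
STOPWORDS = {"i", "am", "a", "an", "the", "and", "of", "in", "to", "is"}


def clean(text: str) -> List[str]:
    """Lowercase the text, keep runs of letters as tokens, drop stopwords."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(clean("I am a nice little text"))  # ['nice', 'little', 'text']
```

A stemmer would be applied to each remaining token after this filter.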
2. Build a model
tf-idf / bag of words
tf: term frequency
idf: inverse document frequency
Transform / Quantization
from a textual shape to a numerical
vector-form
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
term-frequency (tf)
Count occurrences in document
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
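As a minimal sketch of the counting step, assuming whitespace tokenization (the function name `term_frequency` is made up for illustration):

```python
from collections import Counter


def term_frequency(tokens):
    """Count how often each token occurs in a single document."""
    return Counter(tokens)


tokens = "i am a nice little text".split()
tf = term_frequency(tokens)
print([tf[t] for t in tokens])  # [1, 1, 1, 1, 1, 1]
```

Every token occurs once in this sentence, which is why the slide shows a factor of 1 in front of each idf value.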
inverse document
frequency (idf)
Count how often a term occurs in
the whole document set and invert
with the logarithm
d1(I play a fun game)
-> v1(i, play, a, fun, game)
d2(I am a nice little text)
-> v2(i, am, a, nice, little, text)
-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …)
-> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
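The numbers in this example can be reproduced with a few lines of plain Python. The sketch assumes a base-10 logarithm (which matches the 0.3 on the slide) and the unsmoothed formula log(N / df); scikit-learn's own idf uses a smoothed natural-log variant by default, so its values differ.

```python
import math

docs = [
    "i play a fun game".split(),
    "i am a nice little text".split(),
]


def idf(term, docs):
    """log10(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / containing)


# tf is 1 for every term of d2, so tf * idf is simply the idf value.
v2 = [round(1 * idf(term, docs), 1) for term in docs[1]]
print(v2)  # [0.0, 0.3, 0.0, 0.3, 0.3, 0.3]
```

Terms that appear in both documents ("i", "a") get weight 0, so they no longer help to tell the documents apart.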
bag of words
Simple approach to calculate the
frequency of relevant terms
Ignores contextual information 😢
better:
n-grams
n-grams
Generate new tokens by
concatenating neighbouring tokens
example (1 and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
-> From three tokens we just generated 5 tokens.
example2 (1 and 2-grams): (new, york, is, a, nice,
city)
-> (new, new_york, york, york_is, is, is_a, a,
a_nice, nice, nice_city, city)
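A small sketch of the token generation, written to reproduce the ordering of the two examples above (the helper name `ngrams` and the underscore separator follow the slide, not any particular library):

```python
def ngrams(tokens, max_n=2):
    """Return all 1..max_n-grams, grouped by their starting position."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_n + 1):
            if i + n <= len(tokens):
                out.append("_".join(tokens[i:i + n]))
    return out


print(ngrams(["nice", "little", "text"]))
# ['nice', 'nice_little', 'little', 'little_text', 'text']
```

Raising `max_n` to 4 gives the 1- to 4-gram setting used for the resumes.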
vectorize the resumes
Build 1- to 4-grams with the scikit-learn
(sklearn) TfidfVectorizer
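With scikit-learn this step might look as follows. The two resume strings are invented placeholders, and apart from `ngram_range` all vectorizer settings are left at their defaults, which is an assumption and not necessarily what the talk used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical placeholder resumes.
resumes = [
    "ruby developer with five years of experience",
    "sales manager with adtech experience",
]

# ngram_range=(1, 4) builds unigrams up to 4-grams in one pass.
vectorizer = TfidfVectorizer(ngram_range=(1, 4))
X = vectorizer.fit_transform(resumes)  # sparse matrix, one row per resume

print(X.shape[0])                                    # 2
print("ruby developer" in vectorizer.vocabulary_)    # True
```

Note that scikit-learn joins the tokens of an n-gram with a space rather than an underscore.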
Define runtime
Train-test-split by date (80/20)
Approach:
Randomly pick CVs from the test
group
Count how many CVs have to be
screened to find all the good CVs
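One way to read this setup, as a hedged sketch: split chronologically so that the newest 20 % form the test group, then measure how many CVs a recruiter has to read, in a given order, until every good one has been seen. The field names `received` and `good` and both helper functions are assumptions made for illustration.

```python
from datetime import date


def split_by_date(applications, train_share=0.8):
    """Oldest train_share of the applications for training, the rest for testing."""
    ordered = sorted(applications, key=lambda app: app["received"])
    cut = int(len(ordered) * train_share)
    return ordered[:cut], ordered[cut:]


def screened_until_all_good(labels_in_screening_order):
    """Number of CVs read, in the given order, until all good ones are found."""
    good_positions = (i for i, good in enumerate(labels_in_screening_order) if good)
    return max(good_positions, default=-1) + 1


# Hypothetical applications, deliberately not in date order.
applications = [
    {"received": date(2015, 11, day), "good": day in (2, 9)}
    for day in (5, 1, 9, 3, 10, 2, 7, 4, 8, 6)
]
train, test = split_by_date(applications)
print(len(train), len(test))  # 8 2

# Arbitrary order versus an order sorted by a (hypothetical) relevance score.
print(screened_until_all_good([False, True, False, True, False]))  # 4
print(screened_until_all_good([True, True, False, False, False]))  # 2
```

The random order gives the baseline; a useful model should push the second number well below the first.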
3. run it!
After the resumes are transformed
to vector form, the classification
is done with a classical statistical
machine learning model
(e.g. multinomial naive Bayes,
stochastic gradient descent
classifier, logistic regression and
random forest)
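A minimal sketch of this step with one of the listed models, logistic regression, chained after the vectorizer. The texts and labels are invented toy data, and default hyperparameters are assumed throughout.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = invited, 0 = rejected.
train_texts = [
    "ruby developer with startup experience",
    "senior python developer machine learning",
    "truck driver looking for work",
    "cashier with retail experience",
]
train_labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 4)), LogisticRegression())
model.fit(train_texts, train_labels)

new_texts = ["python developer", "retail cashier"]
# Probability of class 1 serves as the relevance score for sorting.
scores = model.predict_proba(new_texts)[:, 1]
ranked = sorted(zip(scores, new_texts), reverse=True)
for score, text in ranked:
    print(round(float(score), 2), text)
```

Sorting applications by this score is what turns the classifier into the sort-by-relevance feature described in the goal.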
4. Results
Generated with a combination of
stochastic gradient descent
classifier and logistic regression
with the Python machine-learning
library scikit-learn
AUC: 73.0615 %
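Assuming AUC here means the area under the ROC curve, it can be read as the probability that a randomly chosen good CV is scored higher than a randomly chosen bad one. The sketch below computes it directly from that definition on invented labels and scores.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = good CV) and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(round(auc(labels, scores), 4))  # 0.8333
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so roughly 0.73 means the model orders a good/bad pair correctly about 73 % of the time.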
Wrap Up
1. Prepare: import data, clean data
2. Build Model: vectorize the CVs with 1- to 4-grams, define train-test split
3. Run: choose machine learning model, run it!
4. Interpret: visualize results, area under curve (AUC)
Conclusion
After trying many different
approaches (doc2vec, recurrent
neural networks, feature
hashing), bag of words is still the
best
Explanation: CV documents do not
carry much semantic information
Outlook
Build a better database
Experiment with new approaches
and tune models
Build a continuous learning model
Happy End.
Thanks :-)

Into the Wild - with Natural Language Processing and Text Classification - Data Natives Conference 2015
