9. Action Steps
1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results
10. 1. Prepare
Load data
Kick out outliers
Clean out stopwords (language detection + stemming with NLTK; sketched below)
Define classes for workflow states
Link data
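A minimal sketch of the cleaning step, assuming English and German CVs (the deck only names NLTK; the langdetect package and the clean_cv helper are our assumptions):

from langdetect import detect              # assumption: deck does not name a detector
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

def clean_cv(text):
    # map the detected language code to the NLTK resource name
    lang = 'german' if detect(text) == 'de' else 'english'
    stops = set(stopwords.words(lang))
    stemmer = SnowballStemmer(lang)
    # drop stopwords, then stem what remains
    tokens = [t for t in text.lower().split() if t not in stops]
    return ' '.join(stemmer.stem(t) for t in tokens)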
11. 2. Build a model
tf-idf / bag of words
tf: term frequency
idf: inverse document frequency
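Written out: the weight of term t in document d, with N documents in total and df(t) of them containing t, is tf-idf(t, d) = tf(t, d) * log(N / df(t)) (log base 10 in the worked example on slide 14).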
12. Transform / Quantization
from textual form to a numerical vector
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
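In Python this step is one call, sketched here with scikit-learn (named later in the deck); note that scikit-learn's idf uses smoothing, so the weights differ slightly from the hand calculation on the next slides:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I play a fun game", "I am a nice little text"]
vec = TfidfVectorizer()              # lowercases and tokenizes by default
X = vec.fit_transform(docs)          # 2 x V sparse matrix of tf*idf weights
print(vec.get_feature_names_out())
# ['am' 'fun' 'game' 'little' 'nice' 'play' 'text']
# ('i' and 'a' are dropped by the default one-letter token filter)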
13. term-frequency (tf)
Count occurrences in document
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
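The counting itself is trivial; a sketch:

from collections import Counter

tf = Counter("I am a nice little text".lower().split())
print(tf["nice"])   # 1 -- every term occurs exactly once here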
14. inverse document frequency (idf)
Count how often a term occurs in the whole document set and invert with the logarithm
d1(I play a fun game)
-> v1(i, play, a, fun, game)
d2(I am a nice little text)
-> v2(i, am, a, nice, little, text)
-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …)
-> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
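The same numbers, recomputed (log base 10, N = 2 documents):

from math import log10

docs = [{"i", "play", "a", "fun", "game"},
        {"i", "am", "a", "nice", "little", "text"}]
N = len(docs)
for term in ("i", "am", "a", "nice", "little", "text"):
    df = sum(term in d for d in docs)        # document frequency
    print(term, round(log10(N / df), 1))
# i 0.0, am 0.3, a 0.0, nice 0.3, little 0.3, text 0.3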
15. bag of words
A simple approach to count the frequency of relevant terms
Ignores contextual information
Better: n-grams
16. n-grams
Generate new tokens by concatenating neighbouring tokens
example (1- and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
-> from three tokens we just generated five
example 2 (1- and 2-grams): (new, york, is, a, nice, city)
-> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
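A sketch of the 1- and 2-gram generation (the helper name is ours; scikit-learn does the same via the ngram_range parameter of its vectorizers, e.g. ngram_range=(1, 4) for the 1-to-4 grams used later):

def ngrams_1_2(tokens):
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)                            # the 1-gram
        if i + 1 < len(tokens):
            out.append(tok + "_" + tokens[i + 1])  # the 2-gram
    return out

print(ngrams_1_2(["nice", "little", "text"]))
# ['nice', 'nice_little', 'little', 'little_text', 'text']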
18. Define runtime
Train-test-split by date (80/20)
Approach:
Randomly pick CVs out of the test group
Count how many CVs have to be screened to find all the good CVs
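A sketch of this setup, assuming the CVs sit in a DataFrame with a date column and a good/bad label (all file and column names here are hypothetical):

import pandas as pd

cvs = pd.read_csv("cvs.csv", parse_dates=["date"])   # hypothetical file
cvs = cvs.sort_values("date")
cut = int(len(cvs) * 0.8)
train, test = cvs.iloc[:cut], cvs.iloc[cut:]         # 80/20 split by date

def cvs_to_screen(labels_in_screening_order):
    # how many CVs must be read, in the given order, to see all good ones
    good = sum(labels_in_screening_order)
    seen = 0
    for i, label in enumerate(labels_in_screening_order, start=1):
        seen += label
        if seen == good:
            return i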
19. 3. Run it!
After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model
(e.g. multinomial naive Bayes, stochastic gradient descent classifier, logistic regression, and random forest)
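A sketch of this step with the four models named above (X_train/X_test are the tf-idf matrices from step 2, y_train/y_test the good/bad labels):

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

models = {
    "naive bayes": MultinomialNB(),
    "sgd": SGDClassifier(loss="log_loss"),   # logistic loss -> probabilities
    "logreg": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))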
20. 4. Results
Generated with a combination of the stochastic gradient descent classifier and logistic regression, using the Python machine-learning library scikit-learn
AUC: 73.0615 %
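The deck does not say how the two models were combined; one plausible sketch is averaging their predicted probabilities and scoring with scikit-learn's ROC AUC:

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import roc_auc_score

sgd = SGDClassifier(loss="log_loss").fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# average the positive-class probabilities of both models
proba = (sgd.predict_proba(X_test)[:, 1] + lr.predict_proba(X_test)[:, 1]) / 2
print("AUC:", roc_auc_score(y_test, proba))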
21. Wrap Up
1. Prepare: import data; clean data
2. Build Model: vectorize the CVs with 1 to 4 n-grams; define train-test split
3. Run: choose machine learning model; run it!
4. Interpret: visualize results; area under curve (AUC)
22. Conclusion
After trying many different approaches (doc2vec, Recurrent Neural Networks, Feature Hashing), bag of words is still the best
Explanation: CV documents do not carry much semantic structure
23. Outlook
Build a better database
Experiment with new approaches
and tune models
Build a continuous learning model