Into the Wild - with Natural Language Processing and Text Classification - Data Natives Conference 2015

Talk from the Data Natives Conference 2015 about an experimental project for Natural Language Processing.

…with Natural Language Processing and Text Classification
Data Natives 2015
19.11.2015 - Peter Grosskopf

Hey, I’m Peter. Developer (mostly Ruby), Founder (of Zweitag), Chief Development Officer @ HitFox Group, Department „Tech & Development“ (TechDev).

Company Builder with 500+ employees in AdTech, FinTech and Big Data.

Company Builder = 💡Ideas + 👥People

How do we select the best people out of more than 1,000 applications every month, in a consistent way? Machine Learning?

Yeah! I found a solution. Not really 💩

Our Goal: add sort-by-relevance to lower the screening costs and invite people faster.

Let’s Go!

Action Steps
1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results

1. Prepare
Load data; kick out outliers; clean out stopwords (language detection + stemming with NLTK); define classes for workflow states; link data.

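The slides don't show the preparation code, so here is a minimal sketch of the stopword and stemming step. NLTK is named on the slide; the stopword-overlap language detection is an assumption (a common NLTK trick, not confirmed by the talk), and clean() is a hypothetical helper.

    # Hypothetical "Prepare" helper: guess the language via stopword overlap
    # (an assumption), then drop stopwords and apply Snowball stemming.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

    nltk.download("stopwords", quiet=True)

    def detect_language(tokens, candidates=("english", "german")):
        # Pick the candidate language whose stopword list overlaps most.
        return max(candidates,
                   key=lambda lang: len(set(tokens) & set(stopwords.words(lang))))

    def clean(text):
        tokens = [t.lower() for t in text.split() if t.isalpha()]
        lang = detect_language(tokens)
        stop = set(stopwords.words(lang))
        stemmer = SnowballStemmer(lang)
        return [stemmer.stem(t) for t in tokens if t not in stop]

    print(clean("I am a nice little text"))  # -> ['nice', 'littl', 'text']
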
2. Build a model
tf-idf / bag of words
tf: term frequency
idf: inverse document frequency

Transform / Quantization
From textual form to numerical vector form:
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)

term frequency (tf)
Count occurrences in the document:
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)

inverse document frequency (idf)
Count in how many documents of the whole set a term occurs and invert with the logarithm: idf(t) = log(N / df(t)).
d1(I play a fun game) -> v1(i, play, a, fun, game)
d2(I am a nice little text) -> v2(i, am, a, nice, little, text)
-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …)
-> v2(0, 0.3, 0, 0.3, 0.3, 0.3)

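The example only works out with base-10 logarithms (log10(2/1) ≈ 0.3); a few lines of Python reproduce the slide's numbers:

    # Reproduce the idf example above; log base 10 matches the 0.3 values.
    import math

    docs = [["i", "play", "a", "fun", "game"],
            ["i", "am", "a", "nice", "little", "text"]]

    def idf(term):
        df = sum(term in doc for doc in docs)  # documents containing the term
        return math.log10(len(docs) / df)

    d2 = docs[1]
    print([round(d2.count(t) * idf(t), 2) for t in d2])
    # -> [0.0, 0.3, 0.0, 0.3, 0.3, 0.3]
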
bag of words
Simple approach to calculate the frequency of relevant terms.
Ignores contextual information 😢
Better: n-grams

n-grams
Generate new tokens by concatenating neighbouring tokens.
Example (1- and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
-> From three tokens we just generated five.
Example 2 (1- and 2-grams): (new, york, is, a, nice, city)
-> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)

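A minimal sketch of that token generation (the ordering differs from the slide, the resulting token set is the same):

    # Build all 1..n_max-grams by joining neighbouring tokens with underscores.
    def ngrams(tokens, n_max):
        return ["_".join(tokens[i:i + n])
                for n in range(1, n_max + 1)
                for i in range(len(tokens) - n + 1)]

    print(ngrams(["nice", "little", "text"], 2))
    # -> ['nice', 'little', 'text', 'nice_little', 'little_text']
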
Vectorize the resumes
Build 1-to-4 n-grams with scikit-learn's (sklearn) TfidfVectorizer.

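Sketched with scikit-learn, assuming a stand-in corpus and default parameters beyond ngram_range:

    # Tf-idf vectorization with 1- to 4-grams, as described on the slide.
    from sklearn.feature_extraction.text import TfidfVectorizer

    resumes = ["I am a nice little text", "I play a fun game"]  # stand-in corpus
    vectorizer = TfidfVectorizer(ngram_range=(1, 4))
    X = vectorizer.fit_transform(resumes)  # sparse document-term matrix
    print(X.shape)  # (2, number of distinct 1- to 4-grams)
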
Define runtime
Train-test split by date (80/20).
Approach: randomly pick CVs out of the test group and count how many CVs have to be screened to find all the good ones.

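A sketch of a date-ordered 80/20 split; the talk doesn't show code, and the pandas columns here are hypothetical:

    # Oldest 80% of CVs for training, newest 20% for testing.
    import pandas as pd

    cvs = pd.DataFrame({
        "text": ["cv one", "cv two", "cv three", "cv four", "cv five"],
        "date": pd.to_datetime(["2015-01-10", "2015-03-02", "2015-05-20",
                                "2015-07-07", "2015-09-15"]),
    }).sort_values("date")

    cut = int(len(cvs) * 0.8)
    train, test = cvs.iloc[:cut], cvs.iloc[cut:]
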
3. Run it!
After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model (e.g. multinomial naive Bayes, a stochastic gradient descent classifier, logistic regression, or a random forest).

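A toy sketch fitting the models the slide names; the resumes and labels below are made up:

    # Fit each candidate model on tf-idf vectors of toy resumes (1 = relevant CV).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import SGDClassifier, LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    texts = ["python ruby rails", "sales marketing excel",
             "machine learning python", "cold calling crm"]
    labels = [1, 0, 1, 0]

    X = TfidfVectorizer(ngram_range=(1, 4)).fit_transform(texts)
    for model in (MultinomialNB(),
                  SGDClassifier(loss="log_loss"),  # loss="log" in older scikit-learn
                  LogisticRegression(),
                  RandomForestClassifier()):
        model.fit(X, labels)
        print(type(model).__name__, model.predict(X))
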
4. Results
Generated with a combination of a stochastic gradient descent classifier and logistic regression, using the Python machine-learning library scikit-learn.
AUC: 73.0615 %

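AUC here is the area under the ROC curve: the probability that a randomly chosen good CV is ranked above a randomly chosen bad one. Computing it with scikit-learn (labels and scores below are made up):

    from sklearn.metrics import roc_auc_score

    y_true = [1, 0, 1, 1, 0, 0]                # 1 = CV judged good by screeners
    y_score = [0.9, 0.4, 0.6, 0.7, 0.5, 0.2]   # model's relevance scores
    print(roc_auc_score(y_true, y_score))      # 1.0 on this toy data; the talk reports ~0.73
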
Wrap Up
1. Prepare: import data, clean data
2. Build Model: vectorize the CVs with 1-to-4 n-grams, define the train-test split
3. Run: choose a machine learning model, run it!
4. Interpret: visualize results, area under curve (AUC)

Conclusion
After trying many different approaches (doc2vec, Recurrent Neural Networks, Feature Hashing), bag of words is still the best.
Explanation: CV documents do not carry much semantic structure.

Outlook
Build a better database.
Experiment with new approaches and tune models.
Build a continuous learning model.

Happy End. Thanks :-)
