9. Action Steps
1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results
10. 1. Prepare
Load data
Kick out outliers
Clean out stopwords (language detection + stemming with NLTK; sketched below)
Define classes for workflow states
Link data
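A minimal sketch of the cleaning step, assuming English and German CVs (the deck only names NLTK; the langdetect package and the clean_cv helper are our assumptions):

from langdetect import detect              # assumption: deck does not name a detector
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

def clean_cv(text):
    # map the detected language code to the NLTK resource name
    lang = 'german' if detect(text) == 'de' else 'english'
    stops = set(stopwords.words(lang))
    stemmer = SnowballStemmer(lang)
    # drop stopwords, then stem what remains
    tokens = [t for t in text.lower().split() if t not in stops]
    return ' '.join(stemmer.stem(t) for t in tokens)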
11. 2. Build a model
tf-idf / bag of words
tf: term frequency
idf: inverse document frequency
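Written out: the weight of term t in document d, with N documents in total and df(t) of them containing t, is tf-idf(t, d) = tf(t, d) * log(N / df(t)) (log base 10 in the worked example on slide 14).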
12. Transform / Quantization
from textual form to a numerical vector
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
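In Python this step is one call, sketched here with scikit-learn (named later in the deck); note that scikit-learn's idf uses smoothing, so the weights differ slightly from the hand calculation on the next slides:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I play a fun game", "I am a nice little text"]
vec = TfidfVectorizer()              # lowercases and tokenizes by default
X = vec.fit_transform(docs)          # 2 x V sparse matrix of tf*idf weights
print(vec.get_feature_names_out())
# ['am' 'fun' 'game' 'little' 'nice' 'play' 'text']
# ('i' and 'a' are dropped by the default one-letter token filter)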
13. term-frequency (tf)
Count occurrences in document
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
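The counting itself is trivial; a sketch:

from collections import Counter

tf = Counter("I am a nice little text".lower().split())
print(tf["nice"])   # 1 -- every term occurs exactly once here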
14. inverse document frequency (idf)
Count how often a term occurs in the whole document set and invert with the logarithm
d1(I play a fun game)
-> v1(i, play, a, fun, game)
d2(I am a nice little text)
-> v2(i, am, a, nice, little, text)
-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …)
-> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
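The same numbers, recomputed (log base 10, N = 2 documents):

from math import log10

docs = [{"i", "play", "a", "fun", "game"},
        {"i", "am", "a", "nice", "little", "text"}]
N = len(docs)
for term in ("i", "am", "a", "nice", "little", "text"):
    df = sum(term in d for d in docs)        # document frequency
    print(term, round(log10(N / df), 1))
# i 0.0, am 0.3, a 0.0, nice 0.3, little 0.3, text 0.3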
15. bag of words
A simple approach to count the frequency of relevant terms
Ignores contextual information
Better: n-grams
16. n-grams
Generate new tokens by concatenating neighbouring tokens
example (1- and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
-> from three tokens we just generated five
example 2 (1- and 2-grams): (new, york, is, a, nice, city)
-> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
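A sketch of the 1- and 2-gram generation (the helper name is ours; scikit-learn does the same via the ngram_range parameter of its vectorizers, e.g. ngram_range=(1, 4) for the 1-to-4 grams used later):

def ngrams_1_2(tokens):
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)                            # the 1-gram
        if i + 1 < len(tokens):
            out.append(tok + "_" + tokens[i + 1])  # the 2-gram
    return out

print(ngrams_1_2(["nice", "little", "text"]))
# ['nice', 'nice_little', 'little', 'little_text', 'text']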
18. Define runtime
Train-test-split by date (80/20)
Approach:
Randomly pick CVs out of the test group
Count how many CVs have to be screened to find all the good CVs
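A sketch of this setup, assuming the CVs sit in a DataFrame with a date column and a good/bad label (all file and column names here are hypothetical):

import pandas as pd

cvs = pd.read_csv("cvs.csv", parse_dates=["date"])   # hypothetical file
cvs = cvs.sort_values("date")
cut = int(len(cvs) * 0.8)
train, test = cvs.iloc[:cut], cvs.iloc[cut:]         # 80/20 split by date

def cvs_to_screen(labels_in_screening_order):
    # how many CVs must be read, in the given order, to see all good ones
    good = sum(labels_in_screening_order)
    seen = 0
    for i, label in enumerate(labels_in_screening_order, start=1):
        seen += label
        if seen == good:
            return i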
19. 3. Run it!
After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model
(e.g. multinomial naive Bayes, stochastic gradient descent classifier, logistic regression, and random forest)
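A sketch of this step with the four models named above (X_train/X_test are the tf-idf matrices from step 2, y_train/y_test the good/bad labels):

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

models = {
    "naive bayes": MultinomialNB(),
    "sgd": SGDClassifier(loss="log_loss"),   # logistic loss -> probabilities
    "logreg": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))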
20. 4. Results
Generated with a combination of the stochastic gradient descent classifier and logistic regression, using the Python machine-learning library scikit-learn
AUC: 73.0615 %
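The deck does not say how the two models were combined; one plausible sketch is averaging their predicted probabilities and scoring with scikit-learn's ROC AUC:

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import roc_auc_score

sgd = SGDClassifier(loss="log_loss").fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# average the positive-class probabilities of both models
proba = (sgd.predict_proba(X_test)[:, 1] + lr.predict_proba(X_test)[:, 1]) / 2
print("AUC:", roc_auc_score(y_test, proba))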
21. Wrap Up
1. Prepare: import data; clean data
2. Build Model: vectorize the CVs with 1 to 4 n-grams; define train-test split
3. Run: choose machine learning model; run it!
4. Interpret: visualize results; area under curve (AUC)
22. Conclusion
After trying many different approaches (doc2vec, Recurrent Neural Networks, Feature Hashing), bag of words is still the best
Explanation: CV documents do not carry much semantic structure
23. Outlook
Build a better database
Experiment with new approaches
and tune models
Build a continuous learning model