



Machine intelligence in HR technology: resume analysis at scale.



Similarity matching, resume processing and no-frills deployment of deep learning models
Matching jobs to people

—



We apply data science to large numbers of resumes in real time, telling recruiters who the most qualified candidates are for their job requirements and explaining why.
Resume processing and profile analysis

—



Opening scans through resume files and database candidate profiles to recommend the perfect candidates for any given raw job description by analyzing patterns in candidate history, weighing up skills and fetching candidate code & portfolios to support the decision.
A high-level overview of our platform is here: https://speakerdeck.com/amorroxic/opening-dot-io-system-architecture
Quick overview: resume logic pipeline
[Diagram: resume pipeline. Labels: input doc->pdf; string / byte array (pdf); download; read pdf; resume text; text stream; tasks emitting json (feature extraction, topics extraction, education parser, elasticsearch percolator, regex (email, etc), salary regression, extra tasks, … (10 other tasks)); combine]
Reactive streams - successive aggregation of state generated by specialized actors
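The "successive aggregation of state" idea can be sketched in a few lines. This is a minimal illustration using Python's asyncio, not the platform's reactive-streams/actor code; the three task functions are hypothetical stand-ins for the specialized tasks named in the diagram.

```python
import asyncio
import re

# Hypothetical stand-ins for specialized tasks; each returns a JSON-like dict.
async def extract_emails(text: str) -> dict:
    return {"emails": re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)}

async def parse_education(text: str) -> dict:
    # Toy rule: keep any line that mentions a degree keyword.
    keywords = ("bsc", "msc", "phd")
    lines = [ln.strip() for ln in text.splitlines()
             if any(k in ln.lower() for k in keywords)]
    return {"education": lines}

async def count_tokens(text: str) -> dict:
    return {"token_count": len(text.split())}

async def process_resume(text: str) -> dict:
    """Run all tasks concurrently and fold each partial result into one state."""
    tasks = [extract_emails(text), parse_education(text), count_tokens(text)]
    state: dict = {}
    for finished in asyncio.as_completed(tasks):
        state.update(await finished)  # aggregate results as they arrive
    return state

resume = "Jane Doe\njane@example.com\nMSc Computer Science\nScala developer"
profile = asyncio.run(process_resume(resume))
```

Each task owns one key of the combined state, so the order in which tasks finish does not change the final profile.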
Information extraction flow
[Diagram: information extraction. Labels: text; regex; links/emails/phones/etc; links screenshot (array[url], screenshot, …); github link; code extraction; simeria http call; experience vector; summary vector; vec; json; combine; search index]
Async i/o & search index creation. Indexes (candidate vectors) generated/stored on-the-fly.
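The regex step of this flow (links / emails / phones, with GitHub links singled out for code extraction) might look roughly like the sketch below. The patterns are deliberately simple and are assumptions for illustration, not the production expressions.

```python
import re

# Simplified, illustrative patterns; real resumes need more robust ones.
URL_RE = re.compile(r"https?://[^\s<>]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(text: str) -> dict:
    """Pull links, GitHub links, emails and phone numbers out of resume text."""
    urls = URL_RE.findall(text)
    return {
        "links": urls,
        "github": [u for u in urls if "github.com/" in u],
        "emails": EMAIL_RE.findall(text),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
    }

sample = ("Contact: jane@example.com, +353 1 555 0100. "
          "Code: https://github.com/janedoe and https://janedoe.example/portfolio")
result = extract_contacts(sample)
```

The GitHub links are the ones a later stage would hand to code extraction; the remaining URLs would go to the screenshot step.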
Matching pipeline
[Diagram: matching pipeline. Labels: provided title; search; job title and job description, each passing through neural parsing, encoder network, dense vector]
Matching jobs/candidates and people similar to each other in high volumes of resumes

—



All input encoded as dense vectors
Similarity = angular/cosine sim between sets of encodings
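For reference, cosine and angular similarity between two dense encodings can be computed as below. This is a plain-Python sketch; the vectors and candidate ids are made up for illustration, and a real system would use a vectorized library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angular_similarity(a, b):
    """1 - angle/pi: 1.0 for identical directions, 0.0 for opposite ones."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return 1.0 - math.acos(cos) / math.pi

def rank(job_vec, candidates):
    """Order candidate ids by cosine similarity to the job encoding."""
    return sorted(candidates,
                  key=lambda cid: cosine_similarity(job_vec, candidates[cid]),
                  reverse=True)

# Hypothetical encodings.
job = [1.0, 0.0, 0.0]
candidates = {"c1": [0.9, 0.1, 0.0], "c2": [0.0, 1.0, 0.0], "c3": [0.5, 0.5, 0.0]}
ranking = rank(job, candidates)
```

Both measures order candidates identically; angular similarity simply maps the angle linearly onto [0, 1].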
Real time queries
Fast matching - computing similarity over vast vector collections (x2)
—
Expensive to compute similarity metrics in real time -> k-nn approximations.
[Diagram: job title (dense input) and job description (dense input) each query random projection trees for candidates; the two results are combined as A * x + B * (1-x), where x is the search bias]
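The blend A * x + B * (1-x) from the diagram can be written out as follows. The sketch assumes A and B are per-candidate similarity scores from the title and description lookups, that x lies in [0, 1], and that a candidate missing from one lookup scores 0 there; none of this is stated on the slide.

```python
def blended_score(a: float, b: float, x: float) -> float:
    """A * x + B * (1 - x), with x acting as the search bias."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("search bias x is assumed to lie in [0, 1]")
    return a * x + b * (1.0 - x)

def merge_candidates(title_hits: dict, description_hits: dict, x: float) -> list:
    """Combine two {candidate_id: similarity} result sets into one ranking."""
    ids = set(title_hits) | set(description_hits)
    scored = {
        cid: blended_score(title_hits.get(cid, 0.0),
                           description_hits.get(cid, 0.0), x)
        for cid in ids
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores from the two approximate k-nn lookups.
title_hits = {"c1": 0.9, "c2": 0.4}
description_hits = {"c2": 0.8, "c3": 0.6}
ranked = merge_candidates(title_hits, description_hits, x=0.25)
```

With x = 0.25 the description dominates, so candidates that only matched the title drop down the ranking.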
Encoding input and NLP models architecture
General considerations
—
Mostly seq2seq, siamese, attention architectures
Input is mostly word vectors - however at times we augment input features with n-grams / character-level information

PARSING: multi-class, seq2seq, character-level output (dates / OOV names / ..)
SIMILARITY/ENCODERS: siamese networks
UP-SKILLING: model ensembles (input -> latent space -> salary regression -> sequences)
SUMMARIES: current area of research

We train multiple models for various contexts (jobs / resumes / ..)
Caution on word embedding
A potentially trivial example, but it shows why models should ideally be trained on data specific to the problem domain

—



fastText (own corpus, 10gb)



“scala’s”, “java/c++/scala”, “java/scala”, “clojure”, .. 

similarity “scala” - “opera” = 0.17 (very syntax oriented)

fastText (own corpus, no character n-grams)



“kotlin”, “clojure”, “haskell”, “scala’s”, “f#”, .. 

similarity “scala” - “opera” = 0.14 (good)

fastText (facebook pre-trained vectors, en wiki)



“traviata”, “barbiere”, “teatro”, “verdi”, .. 

similarity “scala” - “opera” = 0.57 (very broad)

word2vec (own corpus)



“kotlin”, “clojure”, “haskell”, “f#”, .. 

similarity “scala” - “opera” = 0.05 (very specific)

[Diagram: neighbours of “scala”. Syntactic bias: char n-grams in skipgram/cbow. Semantic bias: no char n-grams in skipgram/cbow]
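Why character n-grams pull in spelling variants can be seen from the subword units themselves. The sketch below builds fastText-style character n-grams (word wrapped in "<" and ">", lengths 3 to 6, which are fastText's defaults) and compares overlap; the Jaccard measure is only an illustration, not how fastText computes similarity.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> set:
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

def jaccard(a: set, b: set) -> float:
    """Share of n-grams two words have in common."""
    return len(a & b) / len(a | b)

scala = char_ngrams("scala")
overlap_variant = jaccard(scala, char_ngrams("scala's"))   # spelling variant
overlap_language = jaccard(scala, char_ngrams("kotlin"))   # related language
```

"scala" shares many n-grams with "scala's" and none with "kotlin", so a model that sums subword vectors starts out biased toward surface-form neighbours, which matches the syntax-oriented list in the first configuration.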
Similarity network architecture
Sequence encoders and similarity core

—



Recurrent networks sharing weights (siamese architecture)
[Diagram: two recurrent encoders with shared weights. Branch (a): inputs x1..x4 "she loves data science" -> hidden states h1..h4. Branch (b): inputs x1..x3 "machine learning rocks" -> hidden states h1..h3. Both branches feed an objective score]
Input encoding derives from the trained sim network: activations from the last dense layer before output.
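A toy version of the weight-sharing idea is sketched below: one recurrent encoder instance processes both sequences and the two final states are compared. The weights are random and untrained, the sizes are arbitrary, and the final hidden state stands in for the "last dense layer" activations mentioned above, so this only illustrates the structure.

```python
import math
import random

class TinyRNNEncoder:
    """Elman-style recurrent encoder; a single instance serves both branches."""

    def __init__(self, vocab, embed_dim=8, hidden_dim=6, seed=0):
        rng = random.Random(seed)
        rand = lambda: rng.uniform(-0.5, 0.5)
        self.embed = {w: [rand() for _ in range(embed_dim)] for w in vocab}
        self.w_in = [[rand() for _ in range(embed_dim)] for _ in range(hidden_dim)]
        self.w_rec = [[rand() for _ in range(hidden_dim)] for _ in range(hidden_dim)]
        self.hidden_dim = hidden_dim

    def encode(self, tokens):
        """Return the final hidden state after reading the whole sequence."""
        h = [0.0] * self.hidden_dim
        for tok in tokens:
            x = self.embed[tok]
            h = [
                math.tanh(sum(wi * xi for wi, xi in zip(self.w_in[j], x))
                          + sum(wr * hp for wr, hp in zip(self.w_rec[j], h)))
                for j in range(self.hidden_dim)
            ]
        return h

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def siamese_score(encoder, seq_a, seq_b):
    """Both sequences go through the same weights; the score compares encodings."""
    return cosine(encoder.encode(seq_a), encoder.encode(seq_b))

seq_a = "she loves data science".split()
seq_b = "machine learning rocks".split()
encoder = TinyRNNEncoder(vocab=set(seq_a) | set(seq_b))
score = siamese_score(encoder, seq_a, seq_b)
```

Because the weights are shared, identical inputs always map to identical encodings; training would then push matching pairs together and non-matching pairs apart.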
Models as http micro-services

—



Components: Simeria (horizontal scale), Yenisei (vertical), model servers

All native binaries - golang (simeria), c (yenisei & model servers)



Identical provisioning for dev/prod (Ansible), model hot-swap / roll-back with zero downtime (TensorFlow Serving), AWS/Azure VMs.
Deployments at scale - Opening Baikal VMs
[Diagram: deployment topology. Labels: processing / search (json); simeria (horizontal); two yenisei instances (vertical), each with three model servers; vector; candidates; LSH query; protocols: http, grpc]
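The hot-swap / roll-back behaviour can be illustrated with a small in-process registry. TensorFlow Serving handles this itself through versioned model directories; the class below is a hypothetical sketch of the idea only, and the "models" are plain functions.

```python
import threading

class ModelRegistry:
    """Keeps a version history so requests always see one complete model."""

    def __init__(self, model, version: str):
        self._lock = threading.Lock()
        self._history = [(version, model)]

    def current(self):
        with self._lock:
            return self._history[-1]

    def swap(self, model, version: str) -> None:
        """Make a new model live; in-flight requests keep the one they fetched."""
        with self._lock:
            self._history.append((version, model))

    def rollback(self) -> str:
        """Drop the live model and return the version that is live afterwards."""
        with self._lock:
            if len(self._history) > 1:
                self._history.pop()
            return self._history[-1][0]

    def predict(self, features):
        version, model = self.current()
        return {"version": version, "output": model(features)}

registry = ModelRegistry(lambda f: sum(f), "v1")
before = registry.predict([1, 2])
registry.swap(lambda f: 2 * sum(f), "v2")
after = registry.predict([1, 2])
live_version = registry.rollback()
```

A request never observes a half-loaded model because the swap is a single list append under the lock.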
Search approximation take 1: random projection trees
Forced to optimize this from day one: not a problem of high traffic in regular usage, but one of large I/O spikes at ingestion, with each customer potentially having 1M+ resumes = 60M i/o requests (conversions/screenshots/etc), 100M queries (regressions, vectors, etc) and real-time search.

—



Reduced number of lookups via hyperplanes:
k random partitions of set elements using a suitable sim metric (e.g. cosine)
[Diagram: dense input is routed to partitions of ids; sim is computed within the selected partitions; results are sorted into candidates]
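A small forest of random projection trees can be sketched as follows: each tree splits the ids recursively with random hyperplanes through the origin, and a query computes exact cosine similarity only for the ids in the leaves it lands in. Leaf size, tree count and the random data are arbitrary choices for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def build_tree(ids, vectors, rng, leaf_size=2):
    """Internal nodes are (normal, left, right) tuples; leaves are lists of ids."""
    if len(ids) <= leaf_size:
        return ids
    dim = len(vectors[ids[0]])
    normal = [rng.gauss(0, 1) for _ in range(dim)]  # random hyperplane
    left = [i for i in ids if dot(vectors[i], normal) < 0]
    right = [i for i in ids if dot(vectors[i], normal) >= 0]
    if not left or not right:
        return ids  # degenerate split: keep this node as a leaf
    return (normal,
            build_tree(left, vectors, rng, leaf_size),
            build_tree(right, vectors, rng, leaf_size))

def leaf_for(tree, query):
    """Follow the query's side of each hyperplane down to a leaf."""
    while isinstance(tree, tuple):
        normal, left, right = tree
        tree = left if dot(query, normal) < 0 else right
    return tree

def query_forest(forest, vectors, query, top_n=3):
    """Union the leaves from every tree, then sort by exact cosine similarity."""
    candidates = set()
    for tree in forest:
        candidates.update(leaf_for(tree, query))
    return sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)[:top_n]

rng = random.Random(42)
vectors = {i: [rng.gauss(0, 1) for _ in range(8)] for i in range(50)}
forest = [build_tree(list(vectors), vectors, rng) for _ in range(5)]
nearest = query_forest(forest, vectors, vectors[7])
```

More trees raise recall at the cost of build time and memory, which is the trade-off discussed on the next slide.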
Random projection trees: issues
Good.
—
Good recall, fast queries
The bad.
—
Slow to generate
The ugly.
—
Memory usage

Locality sensitive hashing
Hashing functions generating identical hashes for similar (but not identical) input.
Various implementations for different distances: Hyperplane, Cross polytope (cosine), MinHash LSH (Jaccard), …
Survey: https://arxiv.org/pdf/1408.2927.pdf
Alternatives.
—
We use Super-Bit LSH (internal variant, golang) but there’s a wide array of libraries readily available:
FALCONN, ANNOY, FLANN, RPFOREST, ..
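The hyperplane family mentioned above can be sketched in a few lines: each bit of the hash records which side of a random hyperplane the vector falls on, so vectors separated by a small angle tend to share a bucket. This is plain sign-random-projection LSH in Python, not the internal Super-Bit variant, and the class and parameter names are illustrative.

```python
import random
from collections import defaultdict

class HyperplaneLSH:
    """Sign-random-projection hashing for cosine similarity."""

    def __init__(self, dim: int, n_bits: int = 8, seed: int = 0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def signature(self, vec) -> int:
        """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
        bits = 0
        for plane in self.planes:
            side = sum(p * v for p, v in zip(plane, vec)) >= 0
            bits = (bits << 1) | int(side)
        return bits

    def add(self, item_id, vec) -> None:
        self.buckets[self.signature(vec)].append(item_id)

    def candidates(self, vec) -> list:
        """Ids whose signature equals the query's signature."""
        return list(self.buckets.get(self.signature(vec), []))

lsh = HyperplaneLSH(dim=4)
v = [0.2, -1.0, 0.5, 0.3]
lsh.add("resume-a", v)
same_direction = [2 * x for x in v]  # same angle, different length
hits = lsh.candidates(same_direction)
```

Only the direction of a vector affects its signature, which is why the scaled copy lands in the same bucket; in practice several hash tables are combined to trade recall against lookup cost.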
The bigger picture
[Diagram labels: client; resume file S3 bucket path; http://x.x.x.x/parser; http request (internal); future(json); connemara (resume parsing, i/o, task orchestration)]

Supporting infrastructure (i/o & conversion)
Load balancing containerized services via Fabio
[Diagram labels: doc->pdf: byte array, http post, conversion service, storage, pdf byte array; screenshot: string / byte array, http post, screenshot service, http response (zip with images), storage; json {“path”: …”, “url”: “… }; multiple conversion service and screenshot service instances]
Supporting infrastructure (i/o & conversion)
Docker containers (http micro servers, golang)
Deployed via Mesos / Marathon
Mesos - kernel abstraction over a cluster (exposes several machines as if they were one)
Marathon - Mesos init system
Discovery & http load balancing - Consul / Fabio

Conversion (document->pdf) service: http://convert.opening.io/doc-to-pdf
URL screenshots service: http://convert.opening.io/visitor
Conversion (pdf screenshots) service: http://convert.opening.io/pdf-to-img

Generic demo: http://engineering.opening.io/demo.html
[Diagram: doc to pdf container. Labels: libreoffice, ramdisk, golang web server (iris)]
Thank you.





https://opening.io





@openingdublin

founders@opening.io





25 Oxford Lane, Ranelagh

Dublin, Ireland, European Union.
Catalyser programme

Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
