Talk Abstract: Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or to interact with directly. We believe that the links between similar users' queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.
Crowdsourced query augmentation through the semantic discovery of domain specific jargon
1. Crowdsourced Query Augmentation through the
Semantic Discovery of Domain-specific Jargon
Khalifeh AlJadda, Mohammed Korayem,
Trey Grainger, Chris Russell
2014.10.28 - 2014 IEEE International Conference on Big Data - Washington, D.C.
2. Authors
• Khalifeh AlJadda
– Ph.D. Candidate, University of Georgia
• Mohammed Korayem
– Ph.D. Candidate, Indiana University
• Trey Grainger
– Director of Engineering, Search, CareerBuilder
• Chris Russell
– Engineering Lead, Relevancy & Recommendations, CareerBuilder
3. The problem
• Traditional search engines (e.g. Lucene, Solr, Elasticsearch) tokenize text
and find documents containing those tokens and linguistic variations:
– User’s Search: machine learning
Tokenization: ["machine", "learning"] =>
Stemming: ["machin", "learn"]
Final Query: machin AND learn
This could match a document for a “machinist” who has “learned” something.
– software architect => … => software AND architect
• Might identify a building architect requiring knowledge of specialized architecture software
– account manager => … => account AND manag
• Will match text such as “need to manage the process and account for any variances”
• We need a way to identify and search for the meaning of keyword
phrases, not just the individual text tokens
– e.g. machine learning = "machine learning" OR "data scientist" OR
"mahout" OR "svm" OR "neural networks" …
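The false match described on this slide can be reproduced with a minimal Python sketch. The `toy_stem` suffix stripper is a hypothetical stand-in for a real stemmer (it is not how Lucene, Solr, or Elasticsearch stem text), and the sample document is invented for illustration.

```python
import re

# Toy suffix stripper: a hypothetical stand-in for a real stemmer,
# just aggressive enough to reproduce the slide's example.
SUFFIXES = ("ing", "ist", "ed", "e")

def toy_stem(token):
    for suffix in SUFFIXES:
        # Keep at least three characters of stem.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, split on non-letters, and stem each token."""
    return [toy_stem(t) for t in re.findall(r"[a-z]+", text.lower())]

def matches_all_tokens(query, document):
    """Token-level AND query: every stemmed query token appears somewhere."""
    return set(analyze(query)) <= set(analyze(document))

def matches_phrase(query, document):
    """Phrase query: the query words appear together, in order."""
    return query.lower() in document.lower()

# Invented sample document.
doc = "Experienced machinist who has learned to operate a lathe."

print(analyze("machine learning"))                  # ['machin', 'learn']
print(matches_all_tokens("machine learning", doc))  # True (a false positive)
print(matches_phrase("machine learning", doc))      # False
```

The token-level query matches because "machinist" and "learned" reduce to the same stems as the query words, while the phrase-level check correctly rejects the document.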
4. Goals for the proposed system
• System should be language-agnostic. We don’t want custom NLP
rules to be required for each language (we support dozens of
languages).
• The output of the system should be human-readable. We want to
show users how we enhance their queries in language they will
understand so they can modify our enhancements.
• The system should be very high-precision (since end-users will be
seeing and critiquing the output) and should be automatically
updated based upon new data.
• The system must be fast and scalable, handling billions of search
log entries (offline) and processing millions of queries an hour in
real time.
5. Alternate Techniques
• Latent Semantic Indexing
– Approach involves doing dimensionality reduction of text
across your corpus to derive underlying relationships
between terms:
• e.g. java => programming, c# => programming,
therefore they are related.
– Pros:
• Can be run automatically against your corpus of data to
discover underlying (latent) relationships between
terms, which requires very little human work
– Cons:
• The latent relationships often aren’t represented as a
human would express them, so it would confuse users
if they saw this information.
6. Alternate Techniques
• Manual building of taxonomies
– Approach requires hiring human data analysts to manually
build, correct, and improve taxonomies
– Pros:
• high-precision relationships can be mapped depending
upon the quality of your hired data analysts
– Cons:
• Requires human analysts to comb through hundreds of
thousands of data points and generate lists of
important phrases and relationships, which go stale
• Requires expertise in every supported spoken language
to rebuild taxonomies per-language
7. Example use case
• User’s Query:
machine learning research and development Portland, OR software engineer
AND hadoop java
• Traditional Search Engine Parsing:
(machine AND learning AND research AND development AND portland) OR (software
AND engineer AND hadoop AND java)
• Ideal Parsing:
"machine learning" AND "research and development" AND "Portland, OR" AND
"software engineer" AND hadoop AND java
• Semantically Enhanced Query:
("machine learning" OR "computer vision" OR "data mining" OR matlab) AND
("research and development" OR "r&d") AND ("Portland, OR" OR "Portland,
Oregon") AND ("software engineer" OR "software developer") AND (hadoop OR
"big data" OR hbase OR hive) AND (java OR j2ee)
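One way to get from the ideal parsing to the semantically enhanced query is sketched below in Python. The related phrases are copied from this slide. The `quote` and `enhance` helpers and the quoting rule are assumptions made for illustration, not the authors' implementation, and the sketch starts from already-segmented phrases rather than parsing the raw query.

```python
import re

# Related phrases copied from the slide's example; in the proposed system
# these would come from the mined jargon relationships.
RELATED = {
    "machine learning": ["computer vision", "data mining", "matlab"],
    "research and development": ["r&d"],
    "Portland, OR": ["Portland, Oregon"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}

def quote(term):
    """Quote anything that is not a single alphanumeric token (assumed rule)."""
    return term if re.fullmatch(r"[A-Za-z0-9]+", term) else f'"{term}"'

def enhance(phrases, related=RELATED):
    """OR each phrase with its related phrases, then AND the groups together."""
    groups = []
    for phrase in phrases:
        alternatives = [phrase] + related.get(phrase, [])
        groups.append("(" + " OR ".join(quote(t) for t in alternatives) + ")")
    return " AND ".join(groups)

# The phrases from the "Ideal Parsing" bullet, already segmented.
ideal_parse = ["machine learning", "research and development", "Portland, OR",
               "software engineer", "hadoop", "java"]
print(enhance(ideal_parse))
```

Keeping the output as a readable boolean query means the user can see each added alternative and delete the ones they do not want.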
8. Proposed strategy
1. Mine user search logs for a list of common phrases (“jargon”)
within our domain.
2. Perform collaborative filtering on the common jargon (“users who
searched for that phrase also searched for this phrase”)
3. Remove noise through several methodologies:
– Segment search phrases based upon the
classification of users
– Consider shared jargon used by multiple
sides of our two-sided market (i.e. both
Job Seekers and Recruiters utilize the
same phrase)
– Validate that the two “related” phrases
actually co-occur in real content (i.e. within
the same job or resume) with some
frequency
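The noise-removal step can be sketched in Python as two filters applied in sequence: keep only candidate pairs proposed by both user groups, then keep only pairs that co-occur in real content. The function name, the `min_docs` threshold, the substring matching, and all sample data below are assumptions made for illustration.

```python
def pair(a, b):
    """Order-independent key for two related phrases."""
    return frozenset((a, b))

def filter_related(seeker_pairs, recruiter_pairs, documents, min_docs=2):
    """Keep pairs found on both sides of the market that also co-occur in content.

    documents: lowercase job or resume texts.
    min_docs: hypothetical threshold standing in for "some frequency".
    """
    shared = seeker_pairs & recruiter_pairs
    kept = set()
    for candidate in shared:
        a, b = tuple(candidate)
        # Crude substring test; a real system would match analyzed phrases.
        hits = sum(1 for text in documents if a in text and b in text)
        if hits >= min_docs:
            kept.add(candidate)
    return kept

# Invented candidate pairs from each user group.
seekers = {pair("java", "j2ee"), pair("java", "coffee"),
           pair("rn", "registered nurse"), pair("driver", "cdl")}
recruiters = {pair("java", "j2ee"), pair("java", "hadoop"),
              pair("rn", "registered nurse"), pair("driver", "cdl")}
# Invented job and resume snippets.
docs = [
    "senior java developer with j2ee experience",
    "java and j2ee architect",
    "registered nurse (rn) for night shift",
    "rn or registered nurse wanted",
    "barista serving coffee",
]

print(filter_related(seekers, recruiters, docs))
```

In this sample, "java / coffee" and "java / hadoop" are dropped because only one side proposed them, and "driver / cdl" is dropped because the two phrases never appear together in the sample content.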
9. Finding and Scoring Related Jargon
● Implementation:
Map/Reduce job which finds and scores similar searches run by the same users
○ Jane searched for “registered nurse” and “r.n.” and “nurse”.
○ Zeke searched for “java developer” and “scala” and “jvm” and “j2ee”
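The counting half of such a job can be sketched in plain Python as a map step and a reduce step. This is a single-process illustration under assumed names (`map_user`, `reduce_counts`), not the actual Map/Reduce job, and the third user is hypothetical.

```python
from collections import Counter
from itertools import combinations

def map_user(searches):
    """Map step: emit one (phrase_a, phrase_b) key per distinct pair for a user."""
    phrases = sorted({s.strip().lower() for s in searches})
    return list(combinations(phrases, 2))

def reduce_counts(emitted):
    """Reduce step: sum the emitted keys into per-pair user counts."""
    return Counter(emitted)

logs = {
    "jane": ["registered nurse", "r.n.", "nurse"],
    "zeke": ["java developer", "scala", "jvm", "j2ee"],
    "ann": ["nurse", "r.n."],  # hypothetical third user
}

emitted = [key for searches in logs.values() for key in map_user(searches)]
cooccurrences = reduce_counts(emitted)
print(cooccurrences[("nurse", "r.n.")])  # 2 (Jane and the hypothetical Ann)
```

Sorting the phrases before pairing gives each pair a single canonical key, so counts for the same two phrases are never split across two orderings.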
10. Finding and Scoring Related Jargon
Similarity Score:
To do the collaborative filtering, we look at two similarity measures:
1. Search Co-occurrences - provides raw, real-world correlation
2. Point-wise Mutual Information - examines probability of terms being
related by contrasting individual vs joint distributions:
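The formula from the slide is not preserved in this text. The standard definition of point-wise mutual information is PMI(a, b) = log( P(a, b) / (P(a) P(b)) ), and the Python sketch below applies it to per-user counts. Estimating the probabilities as fractions of users is an assumption, the counts are hypothetical, and the sketch does not attempt the final score, since how the two measures are combined is not shown here.

```python
import math

def pmi(pair_users, users_a, users_b, total_users):
    """Standard PMI: log( P(a,b) / (P(a) * P(b)) ).

    With P(x) estimated as users_x / total_users, the ratio simplifies to
    pair_users * total_users / (users_a * users_b).
    Only meaningful for pairs that co-occur at least once (pair_users > 0).
    """
    return math.log(pair_users * total_users / (users_a * users_b))

# Hypothetical counts: 1000 users, 50 searched phrase a, 80 searched phrase b.
total = 1000
strongly_related = pmi(pair_users=40, users_a=50, users_b=80, total_users=total)
independent = pmi(pair_users=4, users_a=50, users_b=80, total_users=total)

print(strongly_related)  # log(10), roughly 2.3
print(independent)       # 0.0: the pair co-occurs exactly as often as chance predicts
```

A positive value means the two phrases are searched together more often than their individual popularity would predict, which is what separates real relationships from pairs that co-occur only because both phrases are common.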
Final Score:
12. Example output
Cashier => retail, retail cashier, customer service, cashiers
CDL => cdl driver, cdl a, driver
Data Scientist => machine learning, big data
14. Follow-on work: Differentiating related Jargon
Synonyms: cpa => Certified Public Accountant
rn => Registered Nurse
r.n. => Registered Nurse
Ambiguous Terms*: driver => driver (trucking) ~80%
driver => driver (software) ~20%
Related Terms: r.n. => nursing, bsn
hadoop => mapreduce, hive, pig
*disambiguated based upon user and query context
15. Applicability of Methodology
• Can be used to discover domain-specific jargon across most
domains (not just employment search)
• Can be used to discover related jargon in any language since
the jargon and relationships are crowd-sourced at the phrase
level
• Intersecting input from both sides of a two-sided market, which
yields the high-precision results, is optional. If you only have
a single source of user queries, you will just get lower-precision
mappings.
• The only absolute requirement is sufficient search log history
mapping users to multiple search phrases
18. Contact Info
▪ Trey Grainger
trey.grainger@careerbuilder.com
@treygrainger
Other presentations:
http://www.treygrainger.com http://solrinaction.com
Yes, WE ARE HIRING @CareerBuilder. Come talk with me if you are interested…