SlideShare a Scribd company logo
1 of 35
Download to read offline
CompanyDepot:	Employer	
Name	Normalization	in	the	
Online	Recruitment	Industry
Qiaoling	Liu
Sep.	2017
The	Employer	Name	Normalization	Task
links	employer	names	in	job	postings	or	resumes	to	entities in	an	
employer	knowledge	base	(KB)
A	domain-specific	case	of	entity	linking
Traditional entity	
linking
links
entity	mentions
in
text
to	entities	in
often	a global	KB
Employer	name	
normalization
employer	names jobs	/	resumes	 an	employer KB
Key	Challenges
1. Handle	name	variations
Ø Legacy	names,	nicknames,	acronyms,	typos
2. Handle	irrelevant	or	unlinkable input	data
Ø E.g.,	“self-employed”,	“not	specified”
3. Handle	employer	names	from	both	job	postings	and	resumes
Ø Different	semi-structured	formats
4. Leverage	the	location/url context
Ø e.g.,	(Macys.com,	San	Francisco)
5. Handle	duplicates	in	the	KB
Ø e.g.,	{“Enterprise	Rent	A	Car”,	“Enterprise	Rentacar”,	“Enterprise	Rent-A-Car	Company”}
Unique	Challenges!
Common	Challenges!
Two	Levels	of	Employer	Name	Normalization
•Handle	name	variations
•Handle	irrelevant	or	unlinkable input	data
•Handle	employer	names	from	both	job	postings	&	resumes
•Leverage	the	location/url context
Entity-level	
normalization
•Handle	duplicates	in	the	KB
Cluster-level	
normalization
Entity-Level	Normalization
-- mapping	a	query	to	an	entity
Entity-Level Normalization
- mapping a query to an entity
Walmart
Pharmacy
Target
Pharmacy
Walmart
Supercenter
Walmart
Wal-Mart
Stores, Inc.
target.com
Target
Corporation
Entities
walmart pharmacy target.comQueries walmart
Cluster-Level	Normalization
-- mapping	a	query	to	a	cluster	of	entities
Cluster-Level Normalization
- mapping a query to a cluster of entities
Walmart
Pharmacy
Target
Pharmacy
Walmart
Supercenter
Walmart
Wal-Mart
Stores, Inc.
target.com
Target
Corporation
Entities
walmart pharmacy target.comwalmartQueries
Architecture	of
CompanyDepot
atistics and examples for mapping sources.
e Example
K IBM Corp. ! International Business Machines Corporation
MSFT ! Microso Corporation
K Amazon Web Services, Inc. ! Amazon.com, Inc.
M bankofamerica ! Bank of America Corporation
M pricewaterhouse coopers ! PwC
is ready, it can take normalization requests. Each
of an employer name and its location context (part
ation information could be empty). e system then
searcher to retrieve a list of N employer entities.
date entities are then sent to the reranking step,
s a feature vector for each entity and uses a machine
anking model to rank them. Finally, the top-ranked
the validation step to decide whether it is a correct
uery using a binary classier. If it says yes, the
this entity to the user; otherwise, it outputs NIL.
ng Sources
s are used in both our entity-level normalization (to
sion) and cluster-level normalization (to do graph-
g). Each source contains a set of mappings from
o normalized forms. Table 1 shows the statistics and
ach source. We describe how each mapping source
w:
Cluster Result
Entity
Result
Query
Employer
Knowledge
Base
Mapping
Source 2
Mapping
Source 1
Mapping
Source 5
Mapping
Source 4
Mapping
Source 3
Client
KB
Index
Clusters
Mapping
Index
Cluster
Index
Reranking Step
Indexing Step
Retrieval Step
Validation Step
Clustering Step
Cluster Lookup
Offline
Online
Learning	to	Rank
Entity-Level	Normalization
Query	Expansion	using	External	Knowledge	from		
5	Mapping	Sources
Table 1: Statistics and examples for mapping sources.
Source Size Example
Wikipedia 135K IBM Corp. ! International Business Machines Corporation
Stock 6K MSFT ! Microso Corporation
Hierarchy 272K Amazon Web Services, Inc. ! Amazon.com, Inc.
Legacy 26M bankofamerica ! Bank of America Corporation
Provider 10M pricewaterhouse coopers ! PwC
Once the index is ready, it can take normalization requests. Each
Indexing	Step
• Using	Lucene	indexer
Table 2: Index structure.
(a) A document in the KB index.
Field Value
id 15
normalized form International Business Machines Corporation
calibrated name internationalbusinessmachines
domain ibm.com
json {“id”: “15”, “normalized form”: “International
Business Machines Corporation”, …}
(b) A document in the mapping index.
Field Value
surface form IBM
normalized form International Business Machines Corporation
mapping source wikipedia
(c) A document in the cluster index.
Field Value
cluster member key internationalbusinessmachines
cluster representative International Business Machines Corporation
Retrieval	Step
1. Get	top	1000	entities	using	Lucene	aggregated	search	combining			
(1)	keyword	searches;	(2)	fuzzy	searches;	(3)	phrase	searches.
• Query	expansion	based	on	mappings,	e.g.,	MSFT	-	Microsoft	Corporation
2. From	these	results,	get	top	N1 entities	by	Lucene	score,	top	N2
entities	by	Levenshtein Distance,	top	N3 entities	by	mapping	table,	
and	top	N4 entities	by	url matching.
3. Return	the	pool	of	N=N1+N2+N3+N4 entities	(N	is	about	10~20).
Reranking Step
1. Generate	Features	for	each	entity:	
• query	features:	query	length,	if	query	location/url is	specified,	etc.
• query-entity	features:	Lucene	score,	string	similarity,	location/url match,	etc.
• entity	features:	entity	popularity,	#	locations,	legal	word	presence,	etc.
2. Learn	to	rank	the	entities	using	coordinate	ascent	in	RankLib,	
• a	list-wise	method	that	can	directly	optimize	any	user	specified	ranking	
measure	(e.g.,	P@1).
Validation	Step
1. Generate	features	for	the	top-ranked	entity
• all	features	from	the	previous	step;	
• score	of	the	learning-to-rank	method.
2. Classify	the	top-ranked	entity	into	CORRECT	or	WRONG
• binary	classification	using	LibSVM.
Cluster-Level	Normalization
Graph-Based	Clustering	using	External	Knowledge	
from	5	Mapping	Sources
Table 1: Statistics and examples for mapping sources.
Source Size Example
Wikipedia 135K IBM Corp. ! International Business Machines Corporation
Stock 6K MSFT ! Microso Corporation
Hierarchy 272K Amazon Web Services, Inc. ! Amazon.com, Inc.
Legacy 26M bankofamerica ! Bank of America Corporation
Provider 10M pricewaterhouse coopers ! PwC
Once the index is ready, it can take normalization requests. Each
Create	an	Undirected	GraphCreate an Undirected Graph
walmartcanada
targetpharmacywalmartsupercenter
walmart
walmartstores
target
wamart
targets
4
1
3
2
3
2
targetstore
1
Remove	Low-Quality	EdgesRemove Low-Quality Edges
walmartcanada
walmartsupercenter
walmart
walmartstores
wamart
4
1
3
2
targetpharmacy
target
targets
3
2
targetstore
1
Find	All	Connected	Components	as	ClustersFind All Connected Components as Clusters
walmartcanada
walmartsupercenter
walmart
walmartstores
wamart
targetpharmacy
target
targets targetstore
Select	Cluster	Representative	Entity
Wal-Mart Stores, Inc. Target Corporation
Select Cluster Representative Entity
walmartcanada
walmartsupercenter
walmart
walmartstores
targetpharmacy
target
targetstore
Experiments
Entity-Level	Datasets
Table 4: Statistics about the entity-level datasets. %Country
(State, URL) means the percentage of queries with country
(state, url) specied. %US means the percentage of queries
with country=US when country is specied.
Dataset #eries %Country %US %State %URL
RDB 1098 58.5% 96.4% 50.9% 0%
EDGE 1093 97.3% 45.3% 20.8% 0%
JOB1 1100 100% 100% 99.7% 0%
JOB2 500 100% 98.4% 100% 0%
JOBFEED 453 87.5% 100% 87.5% 100%
Metrics	for	Entity-Level	Normalization
• Ic:	correct	results;	Iw:	wrong	results;	In:	null	results
• Precision	=	Ic /	(Ic +	Iw):	percentage	of	correct	results	out	of	all	non-
null	results.
• Coverage	=	(Ic +	Iw)	/	(Ic +	Iw +	In):	percentage	of	queries	that	a	non-
null	result	is	returned.
Results	on	Entity-Level	Normalization	Datasets
0.0 0.2 0.4 0.6 0.8 1.0
0.20.40.60.81.0
Coverage
Precision
●●
●
●
●
●
●
●
RDB: CD−V2−E
RDB: CD−V1
RDB: Legacy
RDB: WService
EDGE: CD−V2−E
EDGE: CD−V1
EDGE: Legacy
EDGE: WService
JOB1: CD−V2−E
JOB1: CD−V1
JOB1: Legacy
JOB1: WService
JOB2: CD−V2−E
JOB2: CD−V1
JOB2: Legacy
JOB2: WService
0.0 0.2 0.4 0.6 0.8 1.0
0.800.901.00
Coverage
Precision
JOBFEED: CD−V2−E (using query url)
JOBFEED: CD−V2−E (ignoring query url)
JOBFEED: WService
Figure 5: Results on JOBFEED (entity-level normalization).
Table 7: Results on cluster-level normalization datasets.
(a) Resume dataset.
System SuccessRate DiversityReductionRatio F-score
CD-V2-C 0.963 0.704 0.814
CD-V1.5-C 0.897 0.688 0.779
CD-V2-E 0.958 0.416 0.580
(b) Job dataset.
0.40.60.81.0
●●
●
●
●
●
●
RDB: CD−V2−E
RDB: CD−V1
RDB: Legacy
RDB: WService
EDGE: CD−V2−E
EDGE: CD−V1
EDGE: Legacy
EDGE: WService
JOB1: CD−V2−E
JOB1: CD−V1
JOB1: Legacy
JOB1: WService
JOB2: CD−V2−E
0.0 0.2 0.4 0.6 0.8 1.0
0.800.901.00
Coverage
Precision
JOBFEED: CD−V2−E (using query url)
JOBFEED: CD−V2−E (ignoring query url)
JOBFEED: WService
Figure 5: Results on JOBFEED (entity-level normalization).
Table 7: Results on cluster-level normalization datasets.
(a) Resume dataset.
Cluster-Level	Datasets
• Resume	dataset
• Search	for	resumes	by	98	most	frequent	search	queries	about	companies
• Get	20	most	frequent	raw	employer	names	from	these	resumes
• Collect	817	unique	raw	employer	names	from	resumes
• Job	dataset
• Get	top	182	employer	entities	with	the	most	jobs	by	a	baseline	normalizer	
• Get	the	raw	employer	names	in	the	jobs	posted	by	these	entities
• Collect	6515	unique	raw	employer	names	from	job	postings
Metrics	for	Cluster-Level	Normalization
• Success	Rate	(SR):	
• how	likely	the	system	returns	a	correct	result.
• Diversity	Reduction	Ratio	(DRR):	
• how	much	result	diversity	the	system	reduces	correctly	via	clustering.
• Light-weight	labeling:	
• for	each	query,	label	whether	the	result	returned	by	the	system	is	correct.
each query q, we label whether the result r returned by the sys-
tem is correct or not. Let QS be the set of successful queries,
i.e., the queries which receive a correct result, i.e., QS = {q 2
Q | fC (q) is a correct result for q}. We dene Success Rate (SR) of
the system as
SR =
|QS |
|Q|
(1)
To measure the diversity in results returned by a system, we
adapted the true diversity metric [14] which is dened based on
entropy. As it does not maer how diverse the wrong results are,
we only compute the diversity in the correct results. Let QS |r be
the set of successful queries that are mapped to the cluster of r, i.e.,
QS |r = {q 2 QS | fC (q) = r}. We rst compute the entropy of the
correct results as
H =
’
r 2R
|QS |r |
|QS |
· ln
✓
|QS |r |
|QS |
◆
(2)
e above entropy H 2 [0,ln(|QS |)] is not linear to |QS |, which
makes it a lile hard to understand and interpret. So True Diver-
Train EDGE
Test RDB EDGE
and shared across system
correct cluster for each i
evaluation indicators. 
entity-level normalizatio
6.3 Systems and
Table 5 summarizes the
6.3.1 Results of Entit
normalization datasets,
CD-V1, CD-V2-E, Legac
E, the output contains a
By varying the threshol
precision-coverage curv
condence score is not
(q) is a correct result for q}. We dene Success Rate (SR) of
stem as
SR =
|QS |
|Q|
(1)
measure the diversity in results returned by a system, we
ed the true diversity metric [14] which is dened based on
py. As it does not maer how diverse the wrong results are,
ly compute the diversity in the correct results. Let QS |r be
of successful queries that are mapped to the cluster of r, i.e.,
= {q 2 QS | fC (q) = r}. We rst compute the entropy of the
t results as
H =
’
r 2R
|QS |r |
|QS |
· ln
✓
|QS |r |
|QS |
◆
(2)
bove entropy H 2 [0,ln(|QS |)] is not linear to |QS |, which
it a lile hard to understand and interpret. So True Diver-
4] is proposed as TD = exp(H). It gives the eective number
rect clusters returned by the system, and is linear to |QS |.
and shared across systems. 
correct cluster for each input a
evaluation indicators. ird,
entity-level normalization: Suc
6.3 Systems and Resu
Table 5 summarizes the system
6.3.1 Results of Entity-Leve
normalization datasets, we co
CD-V1, CD-V2-E, Legacy, and
E, the output contains a con
By varying the threshold on t
precision-coverage curve. For
condence score is not availa
precision and coverage value.
Figure 4 shows the precision
H =
’
r 2R
|QS |r |
|QS |
· ln
✓
|QS |r |
|QS |
◆
(2)
e above entropy H 2 [0,ln(|QS |)] is not linear to |QS |, which
makes it a lile hard to understand and interpret. So True Diver-
sity [14] is proposed as TD = exp(H). It gives the eective number
of correct clusters returned by the system, and is linear to |QS |.
Based on the above True Diversity, we can compute how much
result diversity the system reduces correctly, i.e., Diversity Reduc-
tion Ratio (DRR), which is in range [0, 1]:
DRR = 1
exp(H) 1
|QS | 1
(3)
Finally, we compute the f-score (or the harmonic mean) of Suc-
cess Rate and Diversity Reduction Ratio to measure the normaliza-
tion quality:
F-score =
2 · SR · DRR
SR + DRR
(4)
tion Ratio (DRR), which is in range [0, 1]:
DRR = 1
exp(H) 1
|QS | 1
Finally, we compute the f-score (or the harmon
cess Rate and Diversity Reduction Ratio to measu
tion quality:
F-score =
2 · SR · DRR
SR + DRR
e proposed metric has three merits. First, it is
showing the correctness and diversity of the resu
cluster-level normalization system. Second, it only
labeling eort, i.e., labeling for each (query, resu
the result is correct for the query or not. e labe
Results	on	Cluster-Level	Normalization	Datasets
1.0
asets.
means
Figure 5: Results on JOBFEED (entity-level normalization).
Table 7: Results on cluster-level normalization datasets.
(a) Resume dataset.
System SuccessRate DiversityReductionRatio F-score
CD-V2-C 0.963 0.704 0.814
CD-V1.5-C 0.897 0.688 0.779
CD-V2-E 0.958 0.416 0.580
(b) Job dataset.
System SuccessRate DiversityReductionRatio F-score
CD-V2-C 0.904 0.979 0.940
CD-V1.5-C 0.778 0.981 0.868
CD-V2-E 0.905 0.926 0.915
the other hand, CD-V2-C has a much higher diversity reduction
the trad
data so
the em
as han
employ
and du
per ada
system
duplica
2.2
e sys
the foll
on exte
to impr
cluster
normal
Our
malizat
domain
e employer name normalization task discussed in
can be viewed as a general entity linking problem, yet it d
the traditional entity linking task in three aspects [20]: (
data sources; (2) dierent contexts; (3) dierent KBs.
the employer name normalization task has unique chall
as handling the location and the url context associate
employer names in jobs and resumes, as well as hand
and duplicate entities in the KB. e system proposed
per adapts the three-module framework used in the en
systems. We also propose cluster-level normalization
duplicate results, which is not considered in entity linki
2.2 Domain-Specic Name Normalizati
e system described in this paper extends the system in
the following contributions: (1) performing query expan
on external mapping sources and supporting using u
to improve normalization quality; (2) supporting norm
cluster level; (3) proposing a new metric for evaluating c
normalization. More details will be described in Section
Our work is also related to a set of domain-specic
malization applications. For example, within the same r
From	entity-level	normalization	
to	cluster-level	normalization:
ü Correctness remained
ü Diversity reduced
Candidate	Search	Results	Facets
Conclusion	and	Future	Work
ü Presented	CompanyDepot:	supporting	employer	name	
normalization	at	both	entity	and	cluster	level
ü Proposed	new	metrics	for	cluster-level	normalization
q Improve	clustering,	e.g.,	merge	and	split
q Develop	more	features	for	entity	quality	and	query	segmentation.	
q Improve	the	quality	and	coverage	of	the	employer	KB
Thank	you!
Any	questions?
qiaoling.liu@careerbuilder.com
Backup
Calibrating	Employer	Names
1. Convert	the	name	to	lowercase,	and	replace	’s	with	s;
2. Convert	all	the	non-alphanumeric	characters	to	space;	
3. Remove	stop-phrases	(e.g.,	“pvt ltd”	and	“l	l	c”)	and	stop-words	(e.g.,	“inc”,	
“corporation”,	“incorporated”,	and	“the”);	
4. Expand	commonly	used	abbreviations,	e.g.,	“ctr”	-	“center”,	“svc”	-	
“services”;	
5. remove	all	spaces	in	the	name.	
Employer	name After	calibration
International	Business	Machines	
Corporation
internationalbusinessmachines
Sherman		Howard	L.L.C. shermanhoward
Oxnard	Police	Dept oxnardpolicedepartment
Macy’s,	Inc. macys
Related	Work
• Entity	Linking	with	a	Knowedge Base
• Domain-Specific	Name	Normalization
• Deduplicating Domain-Specific	KBs
• Clustering	Methods	and	Evaluation	Metrics
Qiaoling Liu, Lead Data Scientist, CareerBuilder at MLconf ATL 2017
Qiaoling Liu, Lead Data Scientist, CareerBuilder at MLconf ATL 2017

More Related Content

What's hot

Splunk 5 Overview Analyst v1.0
Splunk 5 Overview Analyst v1.0Splunk 5 Overview Analyst v1.0
Splunk 5 Overview Analyst v1.0
Splunk
 

What's hot (20)

Webinar: Don't Leave Your Data in the Dark
Webinar: Don't Leave Your Data in the DarkWebinar: Don't Leave Your Data in the Dark
Webinar: Don't Leave Your Data in the Dark
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
 
Still on IBM BigInsights? We have the right path for you
Still on IBM BigInsights? We have the right path for youStill on IBM BigInsights? We have the right path for you
Still on IBM BigInsights? We have the right path for you
 
Intro to Machine Learning with H2O and Python - Denver
Intro to Machine Learning with H2O and Python - DenverIntro to Machine Learning with H2O and Python - Denver
Intro to Machine Learning with H2O and Python - Denver
 
Security Breakout Session
Security Breakout Session Security Breakout Session
Security Breakout Session
 
Db2 event store
Db2 event storeDb2 event store
Db2 event store
 
H2O Driverless AI Workshop
H2O Driverless AI WorkshopH2O Driverless AI Workshop
H2O Driverless AI Workshop
 
Big Data for Everyman
Big Data for EverymanBig Data for Everyman
Big Data for Everyman
 
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
Building a Streaming Microservices Architecture - Data + AI Summit EU 2020
 
Big Data Application Architectures - IoT
Big Data Application Architectures - IoTBig Data Application Architectures - IoT
Big Data Application Architectures - IoT
 
Migrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for DatabricksMigrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for Databricks
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
 
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital OneUsing H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
Using H2O for Mobile Transaction Forecasting & Anomaly Detection - Capital One
 
Adobe Behance Scales to Millions of Users at Lower TCO with Neo4j
Adobe Behance Scales to Millions of Users at Lower TCO with Neo4jAdobe Behance Scales to Millions of Users at Lower TCO with Neo4j
Adobe Behance Scales to Millions of Users at Lower TCO with Neo4j
 
Introducing Splunk – The Big Data Engine
Introducing Splunk – The Big Data EngineIntroducing Splunk – The Big Data Engine
Introducing Splunk – The Big Data Engine
 
EsgynDB: A Big Data Engine. Simplifying Fast and Reliable Mixed Workloads
EsgynDB: A Big Data Engine. Simplifying Fast and Reliable Mixed Workloads EsgynDB: A Big Data Engine. Simplifying Fast and Reliable Mixed Workloads
EsgynDB: A Big Data Engine. Simplifying Fast and Reliable Mixed Workloads
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
Splunk 5 Overview Analyst v1.0
Splunk 5 Overview Analyst v1.0Splunk 5 Overview Analyst v1.0
Splunk 5 Overview Analyst v1.0
 
Creando un Portal Oracle para una Empresa
Creando un Portal Oracle para una EmpresaCreando un Portal Oracle para una Empresa
Creando un Portal Oracle para una Empresa
 
Splunk introduction
Splunk introductionSplunk introduction
Splunk introduction
 

Similar to Qiaoling Liu, Lead Data Scientist, CareerBuilder at MLconf ATL 2017

Linqtosql 090629035715 Phpapp01
Linqtosql 090629035715 Phpapp01Linqtosql 090629035715 Phpapp01
Linqtosql 090629035715 Phpapp01
google
 
Dev-In-Town:Linq To Sql by Chan Ming Man
Dev-In-Town:Linq To Sql by Chan Ming ManDev-In-Town:Linq To Sql by Chan Ming Man
Dev-In-Town:Linq To Sql by Chan Ming Man
Quek Lilian
 
Cis247 a ilab 2 of 7 employee class
Cis247 a ilab 2 of 7 employee classCis247 a ilab 2 of 7 employee class
Cis247 a ilab 2 of 7 employee class
ccis224477
 
Cis247 a ilab 2 of 7 employee class
Cis247 a ilab 2 of 7 employee classCis247 a ilab 2 of 7 employee class
Cis247 a ilab 2 of 7 employee class
cis247
 
Cis247 i lab 2 of 7 employee class
Cis247 i lab 2 of 7 employee classCis247 i lab 2 of 7 employee class
Cis247 i lab 2 of 7 employee class
sdjdskjd9097
 
Richard Simmons
Richard SimmonsRichard Simmons
Richard Simmons
rasimmons
 
Cis247 i lab 3 overloaded methods and static methods variables
Cis247 i lab 3 overloaded methods and static methods variablesCis247 i lab 3 overloaded methods and static methods variables
Cis247 i lab 3 overloaded methods and static methods variables
sdjdskjd9097
 
Cis247 a ilab 3 overloaded methods and static methods variables
Cis247 a ilab 3 overloaded methods and static methods variablesCis247 a ilab 3 overloaded methods and static methods variables
Cis247 a ilab 3 overloaded methods and static methods variables
ccis224477
 
Cis 247 all i labs
Cis 247 all i labsCis 247 all i labs
Cis 247 all i labs
ccis224477
 

Similar to Qiaoling Liu, Lead Data Scientist, CareerBuilder at MLconf ATL 2017 (20)

Linqtosql 090629035715 Phpapp01
Linqtosql 090629035715 Phpapp01Linqtosql 090629035715 Phpapp01
Linqtosql 090629035715 Phpapp01
 
Dev-In-Town:Linq To Sql by Chan Ming Man
Dev-In-Town:Linq To Sql by Chan Ming ManDev-In-Town:Linq To Sql by Chan Ming Man
Dev-In-Town:Linq To Sql by Chan Ming Man
 
[React Native Tutorial] Lecture 6: Component, Props, and Network
[React Native Tutorial] Lecture 6: Component, Props, and Network[React Native Tutorial] Lecture 6: Component, Props, and Network
[React Native Tutorial] Lecture 6: Component, Props, and Network
 
1z0 591
1z0 5911z0 591
1z0 591
 
Webinar: What's new in the .NET Driver
Webinar: What's new in the .NET DriverWebinar: What's new in the .NET Driver
Webinar: What's new in the .NET Driver
 
Cis247 a ilab 2 of 7 employee class
Cis247 a ilab 2 of 7 employee classCis247 a ilab 2 of 7 employee class
Cis247 a ilab 2 of 7 employee class
 
Cis247 a ilab 2 of 7 employee class
Cis247 a ilab 2 of 7 employee classCis247 a ilab 2 of 7 employee class
Cis247 a ilab 2 of 7 employee class
 
Evolutionary db development
Evolutionary db development Evolutionary db development
Evolutionary db development
 
Cis247 i lab 2 of 7 employee class
Cis247 i lab 2 of 7 employee classCis247 i lab 2 of 7 employee class
Cis247 i lab 2 of 7 employee class
 
Advanced Index Tuning
Advanced Index TuningAdvanced Index Tuning
Advanced Index Tuning
 
Richard Simmons
Richard SimmonsRichard Simmons
Richard Simmons
 
Improving the Quality of Existing Software - DevIntersection April 2016
Improving the Quality of Existing Software - DevIntersection April 2016Improving the Quality of Existing Software - DevIntersection April 2016
Improving the Quality of Existing Software - DevIntersection April 2016
 
Data Entry Operator Certification
Data Entry Operator CertificationData Entry Operator Certification
Data Entry Operator Certification
 
Cis247 i lab 3 overloaded methods and static methods variables
Cis247 i lab 3 overloaded methods and static methods variablesCis247 i lab 3 overloaded methods and static methods variables
Cis247 i lab 3 overloaded methods and static methods variables
 
VSSML17 L5. Basic Data Transformations and Feature Engineering
VSSML17 L5. Basic Data Transformations and Feature EngineeringVSSML17 L5. Basic Data Transformations and Feature Engineering
VSSML17 L5. Basic Data Transformations and Feature Engineering
 
CMP323_AWS Batch Easy & Efficient Batch Computing on Amazon Web Services
CMP323_AWS Batch Easy & Efficient Batch Computing on Amazon Web ServicesCMP323_AWS Batch Easy & Efficient Batch Computing on Amazon Web Services
CMP323_AWS Batch Easy & Efficient Batch Computing on Amazon Web Services
 
Cis247 a ilab 3 overloaded methods and static methods variables
Cis247 a ilab 3 overloaded methods and static methods variablesCis247 a ilab 3 overloaded methods and static methods variables
Cis247 a ilab 3 overloaded methods and static methods variables
 
Improving the Quality of Existing Software
Improving the Quality of Existing SoftwareImproving the Quality of Existing Software
Improving the Quality of Existing Software
 
Cis 247 all i labs
Cis 247 all i labsCis 247 all i labs
Cis 247 all i labs
 
Necessary Evils, Building Optimized CRUD Procedures
Necessary Evils, Building Optimized CRUD ProceduresNecessary Evils, Building Optimized CRUD Procedures
Necessary Evils, Building Optimized CRUD Procedures
 

More from MLconf

Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
MLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
MLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
MLconf
 

More from MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data Collection
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Qiaoling Liu, Lead Data Scientist, CareerBuilder at MLconf ATL 2017

  • 4. Key Challenges 1. Handle name variations Ø Legacy names, nicknames, acronyms, typos 2. Handle irrelevant or unlinkable input data Ø E.g., “self-employed”, “not specified” 3. Handle employer names from both job postings and resumes Ø Different semi-structured formats 4. Leverage the location/url context Ø e.g., (Macys.com, San Francisco) 5. Handle duplicates in the KB Ø e.g., {“Enterprise Rent A Car”, “Enterprise Rentacar”, “Enterprise Rent-A-Car Company”} Unique Challenges! Common Challenges!
  • 6. Entity-Level Normalization -- mapping a query to an entity Entity-Level Normalization - mapping a query to an entity Walmart Pharmacy Target Pharmacy Walmart Supercenter Walmart Wal-Mart Stores, Inc. target.com Target Corporation Entities walmart pharmacy target.comQueries walmart
  • 7. Cluster-Level Normalization -- mapping a query to a cluster of entities Cluster-Level Normalization - mapping a query to a cluster of entities Walmart Pharmacy Target Pharmacy Walmart Supercenter Walmart Wal-Mart Stores, Inc. target.com Target Corporation Entities walmart pharmacy target.comwalmartQueries
  • 8. Architecture of CompanyDepot atistics and examples for mapping sources. e Example K IBM Corp. ! International Business Machines Corporation MSFT ! Microso Corporation K Amazon Web Services, Inc. ! Amazon.com, Inc. M bankofamerica ! Bank of America Corporation M pricewaterhouse coopers ! PwC is ready, it can take normalization requests. Each of an employer name and its location context (part ation information could be empty). e system then searcher to retrieve a list of N employer entities. date entities are then sent to the reranking step, s a feature vector for each entity and uses a machine anking model to rank them. Finally, the top-ranked the validation step to decide whether it is a correct uery using a binary classier. If it says yes, the this entity to the user; otherwise, it outputs NIL. ng Sources s are used in both our entity-level normalization (to sion) and cluster-level normalization (to do graph- g). Each source contains a set of mappings from o normalized forms. Table 1 shows the statistics and ach source. We describe how each mapping source w: Cluster Result Entity Result Query Employer Knowledge Base Mapping Source 2 Mapping Source 1 Mapping Source 5 Mapping Source 4 Mapping Source 3 Client KB Index Clusters Mapping Index Cluster Index Reranking Step Indexing Step Retrieval Step Validation Step Clustering Step Cluster Lookup Offline Online Learning to Rank
  • 10. Query Expansion using External Knowledge from 5 Mapping Sources Table 1: Statistics and examples for mapping sources. Source Size Example Wikipedia 135K IBM Corp. ! International Business Machines Corporation Stock 6K MSFT ! Microso Corporation Hierarchy 272K Amazon Web Services, Inc. ! Amazon.com, Inc. Legacy 26M bankofamerica ! Bank of America Corporation Provider 10M pricewaterhouse coopers ! PwC Once the index is ready, it can take normalization requests. Each
  • 11. Indexing Step • Using Lucene indexer Table 2: Index structure. (a) A document in the KB index. Field Value id 15 normalized form International Business Machines Corporation calibrated name internationalbusinessmachines domain ibm.com json {“id”: “15”, “normalized form”: “International Business Machines Corporation”, …} (b) A document in the mapping index. Field Value surface form IBM normalized form International Business Machines Corporation mapping source wikipedia (c) A document in the cluster index. Field Value cluster member key internationalbusinessmachines cluster representative International Business Machines Corporation
  • 12. Retrieval Step 1. Get top 1000 entities using Lucene aggregated search combining (1) keyword searches; (2) fuzzy searches; (3) phrase searches. • Query expansion based on mappings, e.g., MSFT - Microsoft Corporation 2. From these results, get top N1 entities by Lucene score, top N2 entities by Levenshtein Distance, top N3 entities by mapping table, and top N4 entities by url matching. 3. Return the pool of N=N1+N2+N3+N4 entities (N is about 10~20).
  • 13. Reranking Step 1. Generate Features for each entity: • query features: query length, if query location/url is specified, etc. • query-entity features: Lucene score, string similarity, location/url match, etc. • entity features: entity popularity, # locations, legal word presence, etc. 2. Learn to rank the entities using coordinate ascent in RankLib, • a list-wise method that can directly optimize any user specified ranking measure (e.g., P@1).
  • 14. Validation Step 1. Generate features for the top-ranked entity • all features from the previous step; • score of the learning-to-rank method. 2. Classify the top-ranked entity into CORRECT or WRONG • binary classification using LibSVM.
  • 16. Graph-Based Clustering using External Knowledge from 5 Mapping Sources Table 1: Statistics and examples for mapping sources. Source Size Example Wikipedia 135K IBM Corp. ! International Business Machines Corporation Stock 6K MSFT ! Microso Corporation Hierarchy 272K Amazon Web Services, Inc. ! Amazon.com, Inc. Legacy 26M bankofamerica ! Bank of America Corporation Provider 10M pricewaterhouse coopers ! PwC Once the index is ready, it can take normalization requests. Each
  • 17. Create an Undirected GraphCreate an Undirected Graph walmartcanada targetpharmacywalmartsupercenter walmart walmartstores target wamart targets 4 1 3 2 3 2 targetstore 1
  • 19. Find All Connected Components as ClustersFind All Connected Components as Clusters walmartcanada walmartsupercenter walmart walmartstores wamart targetpharmacy target targets targetstore
  • 20. Select Cluster Representative Entity Wal-Mart Stores, Inc. Target Corporation Select Cluster Representative Entity walmartcanada walmartsupercenter walmart walmartstores targetpharmacy target targetstore
  • 22. Entity-Level Datasets Table 4: Statistics about the entity-level datasets. %Country (State, URL) means the percentage of queries with country (state, url) specied. %US means the percentage of queries with country=US when country is specied. Dataset #eries %Country %US %State %URL RDB 1098 58.5% 96.4% 50.9% 0% EDGE 1093 97.3% 45.3% 20.8% 0% JOB1 1100 100% 100% 99.7% 0% JOB2 500 100% 98.4% 100% 0% JOBFEED 453 87.5% 100% 87.5% 100%
  • 23. Metrics for Entity-Level Normalization • Ic: correct results; Iw: wrong results; In: null results • Precision = Ic / (Ic + Iw): percentage of correct results out of all non- null results. • Coverage = (Ic + Iw) / (Ic + Iw + In): percentage of queries that a non- null result is returned.
  • 24. Results on Entity-Level Normalization Datasets 0.0 0.2 0.4 0.6 0.8 1.0 0.20.40.60.81.0 Coverage Precision ●● ● ● ● ● ● ● RDB: CD−V2−E RDB: CD−V1 RDB: Legacy RDB: WService EDGE: CD−V2−E EDGE: CD−V1 EDGE: Legacy EDGE: WService JOB1: CD−V2−E JOB1: CD−V1 JOB1: Legacy JOB1: WService JOB2: CD−V2−E JOB2: CD−V1 JOB2: Legacy JOB2: WService 0.0 0.2 0.4 0.6 0.8 1.0 0.800.901.00 Coverage Precision JOBFEED: CD−V2−E (using query url) JOBFEED: CD−V2−E (ignoring query url) JOBFEED: WService Figure 5: Results on JOBFEED (entity-level normalization). Table 7: Results on cluster-level normalization datasets. (a) Resume dataset. System SuccessRate DiversityReductionRatio F-score CD-V2-C 0.963 0.704 0.814 CD-V1.5-C 0.897 0.688 0.779 CD-V2-E 0.958 0.416 0.580 (b) Job dataset. 0.40.60.81.0 ●● ● ● ● ● ● RDB: CD−V2−E RDB: CD−V1 RDB: Legacy RDB: WService EDGE: CD−V2−E EDGE: CD−V1 EDGE: Legacy EDGE: WService JOB1: CD−V2−E JOB1: CD−V1 JOB1: Legacy JOB1: WService JOB2: CD−V2−E 0.0 0.2 0.4 0.6 0.8 1.0 0.800.901.00 Coverage Precision JOBFEED: CD−V2−E (using query url) JOBFEED: CD−V2−E (ignoring query url) JOBFEED: WService Figure 5: Results on JOBFEED (entity-level normalization). Table 7: Results on cluster-level normalization datasets. (a) Resume dataset.
  • 25. Cluster-Level Datasets • Resume dataset • Search for resumes by 98 most frequent search queries about companies • Get 20 most frequent raw employer names from these resumes • Collect 817 unique raw employer names from resumes • Job dataset • Get top 182 employer entities with the most jobs by a baseline normalizer • Get the raw employer names in the jobs posted by these entities • Collect 6515 unique raw employer names from job postings
  • 26. Metrics for Cluster-Level Normalization • Success Rate (SR): • how likely the system returns a correct result. • Diversity Reduction Ratio (DRR): • how much result diversity the system reduces correctly via clustering. • Light-weight labeling: • for each query, label whether the result returned by the system is correct. each query q, we label whether the result r returned by the sys- tem is correct or not. Let QS be the set of successful queries, i.e., the queries which receive a correct result, i.e., QS = {q 2 Q | fC (q) is a correct result for q}. We dene Success Rate (SR) of the system as SR = |QS | |Q| (1) To measure the diversity in results returned by a system, we adapted the true diversity metric [14] which is dened based on entropy. As it does not maer how diverse the wrong results are, we only compute the diversity in the correct results. Let QS |r be the set of successful queries that are mapped to the cluster of r, i.e., QS |r = {q 2 QS | fC (q) = r}. We rst compute the entropy of the correct results as H = ’ r 2R |QS |r | |QS | · ln ✓ |QS |r | |QS | ◆ (2) e above entropy H 2 [0,ln(|QS |)] is not linear to |QS |, which makes it a lile hard to understand and interpret. So True Diver- Train EDGE Test RDB EDGE and shared across system correct cluster for each i evaluation indicators. entity-level normalizatio 6.3 Systems and Table 5 summarizes the 6.3.1 Results of Entit normalization datasets, CD-V1, CD-V2-E, Legac E, the output contains a By varying the threshol precision-coverage curv condence score is not (q) is a correct result for q}. We dene Success Rate (SR) of stem as SR = |QS | |Q| (1) measure the diversity in results returned by a system, we ed the true diversity metric [14] which is dened based on py. As it does not maer how diverse the wrong results are, ly compute the diversity in the correct results. Let QS |r be of successful queries that are mapped to the cluster of r, i.e., = {q 2 QS | fC (q) = r}. We rst compute the entropy of the t results as H = ’ r 2R |QS |r | |QS | · ln ✓ |QS |r | |QS | ◆ (2) bove entropy H 2 [0,ln(|QS |)] is not linear to |QS |, which it a lile hard to understand and interpret. So True Diver- 4] is proposed as TD = exp(H). It gives the eective number rect clusters returned by the system, and is linear to |QS |. and shared across systems. correct cluster for each input a evaluation indicators. ird, entity-level normalization: Suc 6.3 Systems and Resu Table 5 summarizes the system 6.3.1 Results of Entity-Leve normalization datasets, we co CD-V1, CD-V2-E, Legacy, and E, the output contains a con By varying the threshold on t precision-coverage curve. For condence score is not availa precision and coverage value. Figure 4 shows the precision H = ’ r 2R |QS |r | |QS | · ln ✓ |QS |r | |QS | ◆ (2) e above entropy H 2 [0,ln(|QS |)] is not linear to |QS |, which makes it a lile hard to understand and interpret. So True Diver- sity [14] is proposed as TD = exp(H). It gives the eective number of correct clusters returned by the system, and is linear to |QS |. Based on the above True Diversity, we can compute how much result diversity the system reduces correctly, i.e., Diversity Reduc- tion Ratio (DRR), which is in range [0, 1]: DRR = 1 exp(H) 1 |QS | 1 (3) Finally, we compute the f-score (or the harmonic mean) of Suc- cess Rate and Diversity Reduction Ratio to measure the normaliza- tion quality: F-score = 2 · SR · DRR SR + DRR (4) tion Ratio (DRR), which is in range [0, 1]: DRR = 1 exp(H) 1 |QS | 1 Finally, we compute the f-score (or the harmon cess Rate and Diversity Reduction Ratio to measu tion quality: F-score = 2 · SR · DRR SR + DRR e proposed metric has three merits. First, it is showing the correctness and diversity of the resu cluster-level normalization system. Second, it only labeling eort, i.e., labeling for each (query, resu the result is correct for the query or not. e labe
  • 27. Results on Cluster-Level Normalization Datasets 1.0 asets. means Figure 5: Results on JOBFEED (entity-level normalization). Table 7: Results on cluster-level normalization datasets. (a) Resume dataset. System SuccessRate DiversityReductionRatio F-score CD-V2-C 0.963 0.704 0.814 CD-V1.5-C 0.897 0.688 0.779 CD-V2-E 0.958 0.416 0.580 (b) Job dataset. System SuccessRate DiversityReductionRatio F-score CD-V2-C 0.904 0.979 0.940 CD-V1.5-C 0.778 0.981 0.868 CD-V2-E 0.905 0.926 0.915 the other hand, CD-V2-C has a much higher diversity reduction
  • 28. the trad data so the em as han employ and du per ada system duplica 2.2 e sys the foll on exte to impr cluster normal Our malizat domain e employer name normalization task discussed in can be viewed as a general entity linking problem, yet it d the traditional entity linking task in three aspects [20]: ( data sources; (2) dierent contexts; (3) dierent KBs. the employer name normalization task has unique chall as handling the location and the url context associate employer names in jobs and resumes, as well as hand and duplicate entities in the KB. e system proposed per adapts the three-module framework used in the en systems. We also propose cluster-level normalization duplicate results, which is not considered in entity linki 2.2 Domain-Specic Name Normalizati e system described in this paper extends the system in the following contributions: (1) performing query expan on external mapping sources and supporting using u to improve normalization quality; (2) supporting norm cluster level; (3) proposing a new metric for evaluating c normalization. More details will be described in Section Our work is also related to a set of domain-specic malization applications. For example, within the same r From entity-level normalization to cluster-level normalization: ü Correctness remained ü Diversity reduced Candidate Search Results Facets
  • 29. Conclusion and Future Work ü Presented CompanyDepot: supporting employer name normalization at both entity and cluster level ü Proposed new metrics for cluster-level normalization q Improve clustering, e.g., merge and split q Develop more features for entity quality and query segmentation. q Improve the quality and coverage of the employer KB
  • 32. Calibrating Employer Names 1. Convert the name to lowercase, and replace ’s with s; 2. Convert all the non-alphanumeric characters to space; 3. Remove stop-phrases (e.g., “pvt ltd” and “l l c”) and stop-words (e.g., “inc”, “corporation”, “incorporated”, and “the”); 4. Expand commonly used abbreviations, e.g., “ctr” - “center”, “svc” - “services”; 5. remove all spaces in the name. Employer name After calibration International Business Machines Corporation internationalbusinessmachines Sherman Howard L.L.C. shermanhoward Oxnard Police Dept oxnardpolicedepartment Macy’s, Inc. macys
  • 33. Related Work • Entity Linking with a Knowedge Base • Domain-Specific Name Normalization • Deduplicating Domain-Specific KBs • Clustering Methods and Evaluation Metrics