RECORD LINKAGE,
A REAL USE CASE WITH SPARK ML
Alexis Seigneurin - Pascale Mkhael
Who I am
• Software engineer for 15 years
• Consultant at IpponTech in Paris, France
• Spark trainer
• Favorite subjects: Spark, Machine Learning, Cassandra
• @aseigneurin
The DIL’s mission: Help AXA become a data-driven company…
> By focusing on…
• BUILDING technological platforms using Big Data technologies
• SUPPORTING AXA entities’ Big Data projects (tools and/or expertise)
• EXPLORING innovative opportunities to transform the insurance business
> The DIL in figures:
• 1 team
• 3 sites
• 77 opportunities
• 100+ TB
• 3 sets of platforms
The project
• Record Linkage with Machine learning
• Use cases:
• Find new clients who come from insurance comparison services
→ Commission
• Find duplicates in existing files (acquisitions)
• Record Linkage
• Entity resolution
• Deduplication
• Entity disambiguation
• …
Overview
• Find duplicates!
Purpose
+---+-------+------------+----------+------------+---------+------------+
| ID|    veh|codptgar_veh|dt_nais_cp|dt_permis_cp|redmaj_cp|     formule|
+---+-------+------------+----------+------------+---------+------------+
|...|PE28221|       50000|1995-10-12|  2013-10-08|    100.0|       TIERS|
|...|FO26016|       59270|1967-01-01|  1987-02-01|    100.0|VOL_INCENDIE|
|...|FI19107|       77100|1988-09-27|  2009-09-13|    105.0|TOUS_RISQUES|
|...|RE07307|       69100|1984-08-15|  2007-04-20|     50.0|       TIERS|
|...|FO26016|       59270|1967-01-07|  1987-02-01|    105.0|TOUS_RISQUES|
+---+-------+------------+----------+------------+---------+------------+
Steps
1. Preprocessing
   1. Find potential duplicates
   2. Feature engineering
2. Manual labeling of a sample
3. Machine Learning to make predictions on the rest of the records
Prototype
• Crafted by a Data Scientist
• No architecture, no version control, no unit tests…
→ Not ready for production
• Spark, but a lot of Spark SQL (data processing)
• Machine Learning in Python (scikit-learn)
→ Objective: industrialization of the code
Preprocessing
• Data (CSV) + Schema (JSON)
Inputs
000010;Jose;Lester;10/10/1970	
000011;José;Lester;10/10/1970	
000012;Tyler;Hunt;12/12/1972	
000013;Tiler;Hunt;25/12/1972	
000014;Patrick;Andrews;1973-12-13
{
  "tableSchemaBeforeSelection": [
    {
      "name": "ID",
      "typeField": "StringType",
      "hardJoin": false
    },
    {
      "name": "name",
      "typeField": "StringType",
      "hardJoin": true,
      "cleaning": "digitLetter",
      "listFeature": [ "scarcity" ],
      "listDistance": [ "equality", "soundLike" ]
    },
    ...
• Spark CSV module → DataFrame
• Don’t use type inference (inferred numeric types would turn an ID such as 000010 into 10)
Data loading
+------+-------+-------+----------+
|    ID|   name|surname|   birthDt|
+------+-------+-------+----------+
|000010|   Jose| Lester|10/10/1970|
|000011|   José| Lester|10/10/1970|
|000012|  Tyler|   Hunt|12/12/1972|
|000013|  Tiler|   Hunt|25/12/1972|
|000014|Patrick|Andrews|1973-12-13|
+------+-------+-------+----------+
• Parsing of dates, numbers…
• Cleaning of strings
Data cleansing
+------+-------+-------+----------+
|    ID|   name|surname|   birthDt|
+------+-------+-------+----------+
|000010|   jose| lester|1970-10-10|
|000011|   jose| lester|1970-10-10|
|000012|  tyler|   hunt|1972-12-12|
|000013|  tiler|   hunt|1972-12-25|
|000014|patrick|andrews|      null|
+------+-------+-------+----------+
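The cleaning step shown above can be sketched in plain Scala, outside Spark. This is only an illustration: the object and function names are hypothetical, the `dd/MM/yyyy` input format is inferred from the sample rows, and the exact behaviour of the `digitLetter` cleaning rule is an assumption (here: strip accents, keep letters, digits and spaces, lowercase).

```scala
import java.text.Normalizer
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object Cleaning {
  // Assumed input date format, inferred from the sample rows (dd/MM/yyyy).
  private val inputDateFormat = DateTimeFormatter.ofPattern("dd/MM/yyyy")

  /** Strips accents, drops anything that is not a letter, digit or space, then lowercases. */
  def cleanText(raw: String): String = {
    val withoutAccents = Normalizer
      .normalize(raw, Normalizer.Form.NFD) // split "é" into "e" + combining accent
      .replaceAll("\\p{M}", "")            // remove the combining marks
    withoutAccents
      .replaceAll("[^A-Za-z0-9 ]", "")     // keep letters, digits and spaces only
      .trim
      .toLowerCase
  }

  /** Returns None (shown as null in the DataFrame) when the value is not in the expected format. */
  def parseDate(raw: String): Option[LocalDate] =
    Try(LocalDate.parse(raw.trim, inputDateFormat)).toOption
}
```

In the real pipeline such functions would be wrapped as Spark UDFs; a date like `1973-12-13`, which does not follow the expected format, ends up as a missing value rather than failing the job.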
• Convert strings to phonetics (Beider-Morse)
• …
Feature calculation
+------+-------+-------+----------+--------------------+
|    ID|   name|surname|   birthDt|      BMencoded_name|
+------+-------+-------+----------+--------------------+
|000010|   jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
|000011|   jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
|000012|  tyler|   hunt|1972-12-12|               tilir|
|000013|  tiler|   hunt|1972-12-25|    tQlir|tili|tilir|
|000014|patrick|andrews|      null|pYtrQk|pYtrik|pat...|
+------+-------+-------+----------+--------------------+
• Self-join (more on that later…)
Find potential duplicates
+------+------+---------+...+------+------+---------+...
|  ID_1|name_1|surname_1|...|  ID_2|name_2|surname_2|...
+------+------+---------+...+------+------+---------+...
|000010|  jose|   lester|...|000011|  jose|   lester|...
|000012| tyler|     hunt|...|000013| tiler|     hunt|...
+------+------+---------+...+------+------+---------+...
• Several distance algorithms:
• Levenshtein distance, date difference…
Distance calculation
+------+...+------+...+-------------+--------------+...+----------------+
|  ID_1|...|  ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
+------+...+------+...+-------------+--------------+...+----------------+
|000010|...|000011|...|          0.0|           0.0|...|             0.0|
|000012|...|000013|...|          1.0|           0.0|...|            13.0|
+------+...+------+...+-------------+--------------+...+----------------+
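A minimal sketch of three of the distances mentioned above, written in plain Scala rather than as Spark UDFs. The object and function names are hypothetical; the convention that `equality` returns 0.0 for identical values and 1.0 otherwise is read off the table, and `dateDiff` is assumed to be an absolute difference in days.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object Distances {
  /** 0.0 when both values are identical, 1.0 otherwise (convention read from the table). */
  def equality(a: String, b: String): Double = if (a == b) 0.0 else 1.0

  /** Classic Levenshtein edit distance, computed row by row. */
  def levenshtein(a: String, b: String): Int = {
    val firstRow = (0 to b.length).toArray
    a.indices.foldLeft(firstRow) { (prev, i) =>
      val curr = new Array[Int](b.length + 1)
      curr(0) = i + 1
      for (j <- b.indices) {
        val cost = if (a(i) == b(j)) 0 else 1
        curr(j + 1) = math.min(
          math.min(curr(j) + 1, prev(j + 1) + 1), // insertion, deletion
          prev(j) + cost                          // substitution (or match)
        )
      }
      curr
    }(b.length)
  }

  /** Absolute difference between two dates, in days (assumed unit). */
  def dateDiff(a: LocalDate, b: LocalDate): Double =
    math.abs(ChronoUnit.DAYS.between(a, b)).toDouble
}
```

For the second pair in the table, `dateDiff` between 1972-12-12 and 1972-12-25 gives 13.0, which matches the `dateDiff_birthDt` column.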
• Standardization of distances only
• Vectorization (2 vectors)
Standardization / vectorization
+------+------+---------+----------+------+------+---------+----------+------------+--------------+
|  ID_1|name_1|surname_1| birthDt_1|  ID_2|name_2|surname_2| birthDt_2|   distances|other_features|
+------+------+---------+----------+------+------+---------+----------+------------+--------------+
|000010|  jose|   lester|1970-10-10|000011|  jose|   lester|1970-10-10|[0.0,0.0,...|  [2.0,2.0,...|
|000012| tyler|     hunt|1972-12-12|000013| tiler|     hunt|1972-12-25|[1.0,1.0,...|  [1.0,2.0,...|
+------+------+---------+----------+------+------+---------+----------+------------+--------------+
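The slides do not say which standardization is applied. As an assumption, the sketch below uses a z-score (zero mean, unit variance, population standard deviation) on each distance column and then assembles one vector per record pair. All names are hypothetical, and this is plain Scala rather than the Spark ML transformers the project may have used.

```scala
object Standardization {
  /** Z-score of one distance column: (value - mean) / population standard deviation. */
  def standardize(values: Seq[Double]): Seq[Double] = {
    require(values.nonEmpty, "cannot standardize an empty column")
    val mean = values.sum / values.size
    val variance = values.map(v => math.pow(v - mean, 2)).sum / values.size
    val std = math.sqrt(variance)
    // A constant column carries no information: map it to zeros instead of dividing by 0.
    if (std == 0.0) values.map(_ => 0.0) else values.map(v => (v - mean) / std)
  }

  /** Standardizes every column, then returns one vector of distances per row. */
  def toVectors(columns: Seq[Seq[Double]]): Seq[Seq[Double]] =
    columns.map(standardize).transpose
}
```

Only the distance columns go through `standardize`; the non-distance features would be assembled into their own vector unchanged, which is why the table shows two vector columns.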
Spark SQL → DataFrames
From SQL…
• Generated SQL queries
• Hard to maintain (especially where UDFs are concerned)
val cleaningRequest = tableSchema.map(x => {
  x.CleaningFuction match {
    case (Some(a), _) => a + "(" + x.name + ") as " + x.name
    case _ => x.name
  }
}).mkString(", ")
val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
cleanedTable.registerTempTable(schema.tableName + "_cleaned")
… to DataFrames
• DataFrame primitives
• More work done by the Scala compiler
val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
  case (df, field) =>
    val udf: UserDefinedFunction = ... // get the cleaning UDF

    df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
      .drop(field.name)
      .withColumnRenamed(field.name + "_cleaned", field.name)
}
Unit testing
• ScalaTest + Scoverage
• Coverage of all the data processing operations
• Comparison of Row objects
val resDF = schema.cleanTable(rows)

"The cleaning process" should "clean text fields" in {
  val res = resDF.select("ID", "name", "surname").collect()
  val expected = Array(
    Row("000010", "jose", "lester"),
    Row("000011", "jose", "lester ea"),
    Row("000012", "jose", "lester")
  )
  res should contain theSameElementsAs expected
}

"The cleaning process" should "parse dates" in {
...
Unit testing
000010;Jose;Lester;10/10/1970
000011;Jose =-+;Lester éà;10/10/1970
000012;Jose;Lester;invalid date
Matching potential duplicates
Join strategy
• For record linkage, first merge the two sources
• Then self-join
[Diagram: Prospects, New clients, Duplicate]
Join - Volume of data
• Input: 1M records
• Cartesian product: 1,000 billion (10^12) records
→ Find an appropriate join condition
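The orders of magnitude can be checked with a few lines of Scala. The function names are hypothetical; the second function shows what an ordering condition such as `ID_1 < ID_2` alone would leave, namely each unordered pair exactly once.

```scala
object JoinVolume {
  /** Number of rows produced by an unconstrained self-join of n records. */
  def cartesianPairs(n: Long): Long = n * n

  /** Number of pairs left when each unordered pair is kept once (ID_1 < ID_2). */
  def orderedPairs(n: Long): Long = n * (n - 1) / 2
}
```

For 1M records the ordering condition only halves the volume (about 5 × 10^11 pairs), so it is not enough by itself: the join also needs conditions on the fields themselves.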
Join condition
• Multiple joins, each on 2 fields
• Equality of values or custom condition (UDF)
• Union between all the intermediate results
• E.g. with fields name, surname, birth_date:
df1.join(df2, (df1("ID_1") < df2("ID_2"))
  && (df1("name_1") === df2("name_2"))
  && (soundLike(df1("surname_1"), df2("surname_2")))
UNION
df1.join(df2, (df1("ID_1") < df2("ID_2"))
  && (df1("name_1") === df2("name_2"))
  && (df1("birth_date_1") === df2("birth_date_2")))
UNION
df1.join(df2, (df1("ID_1") < df2("ID_2"))
  && (soundLike(df1("surname_1"), df2("surname_2")))
  && (df1("birth_date_1") === df2("birth_date_2")))
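The logic of these three joins and their union can be illustrated on a small in-memory sample in plain Scala. Everything here is a sketch under assumptions: `Person`, `candidatePairs` and `skeleton` are hypothetical names, and `soundLike` is a crude stand-in (same consonant skeleton), not the Beider-Morse based UDF used in the project.

```scala
object Blocking {
  final case class Person(id: String, name: String, surname: String, birthDate: Option[String])

  private val vowels = Set('a', 'e', 'i', 'o', 'u', 'y')

  // Crude phonetic stand-in: compare the strings once vowels are removed.
  private def skeleton(s: String): String = s.filterNot(vowels)
  def soundLike(a: String, b: String): Boolean = skeleton(a) == skeleton(b)

  // A missing date never matches, mirroring SQL null semantics.
  private def sameDate(a: Person, b: Person): Boolean =
    a.birthDate.isDefined && a.birthDate == b.birthDate

  /** Union of the pairs matched by each 2-field condition, each pair kept once (id1 < id2). */
  def candidatePairs(records: Seq[Person]): Set[(String, String)] = {
    val conditions: Seq[(Person, Person) => Boolean] = Seq(
      (a, b) => a.name == b.name && soundLike(a.surname, b.surname),
      (a, b) => a.name == b.name && sameDate(a, b),
      (a, b) => soundLike(a.surname, b.surname) && sameDate(a, b)
    )
    val pairs = for {
      cond <- conditions
      a <- records
      b <- records
      if a.id < b.id && cond(a, b)
    } yield (a.id, b.id)
    pairs.toSet // the Set plays the role of the UNION and removes repeated pairs
  }
}
```

The nested loops are only acceptable for a toy sample. The point of writing the conditions as joins in Spark is precisely to avoid enumerating the full Cartesian product, since each join contains at least one plain equality that the engine can use as a join key.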
DataFrames extension
• 3 types of columns
DataFrames extension
+------+...+------+...+-------------+--------------+...+----------------+
|  ID_1|...|  ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
+------+...+------+...+-------------+--------------+...+----------------+
|000010|...|000011|...|          0.0|           0.0|...|             0.0|
|000012|...|000013|...|          1.0|           0.0|...|            13.0|
+------+...+------+...+-------------+--------------+...+----------------+
Column types: Data, Distances, Non-distance features
• DataFrame columns have a name and a data type
• DataFrameExt = DataFrame + metadata over columns
DataFrames extension
case class OutputColumn(name: String, columnType: ColumnType)

class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
  def show() = df.show()
  def drop(colName: String): DataFrameExt = ...
  def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
  ...
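The slides do not show how `ColumnType` is defined. One possible shape, given the three kinds of columns, is a sealed trait with three cases. The sketch below is an assumption, kept free of Spark so that it runs standalone; `Data`, `Distance`, `Feature` and `namesOfType` are hypothetical names.

```scala
object ColumnMetadata {
  // Hypothetical definition: one case per kind of column.
  sealed trait ColumnType
  case object Data extends ColumnType     // original fields (ID, name, ...)
  case object Distance extends ColumnType // computed distances between the two records
  case object Feature extends ColumnType  // non-distance features

  final case class OutputColumn(name: String, columnType: ColumnType)

  /** Names of the columns of a given type, in their original order. */
  def namesOfType(columns: Seq[OutputColumn], columnType: ColumnType): Seq[String] =
    columns.filter(_.columnType == columnType).map(_.name)
}
```

With this metadata, a step such as standardization can ask for the `Distance` columns by type instead of relying on naming conventions.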
Labeling
• Manual operation
• Is this a duplicate? → Yes / No
• Performed on a sample of the potential duplicates
• Between 1,000 and 10,000 records
Predictions
• Machine Learning
• Random Forests
• (Gradient-Boosted Trees also give good results)
• Training on the potential duplicates labeled by hand
• Predictions on the potential duplicates not labeled by hand
Predictions
• Sample: 1000 records
• Training set: 800 records
• Test set: 200 records
• Results
• True positives: 53
• False positives: 2
• True negatives: 126
• False negatives: 5
→ Found 53 of the 58 expected duplicates (53+5), with only 2 false positives
• Precision ≈ 96% (53 / 55)
• Recall ≈ 91% (53 / 58)
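Precision and recall follow directly from the four counts listed above. A minimal sketch, with hypothetical names:

```scala
object Metrics {
  final case class ConfusionMatrix(tp: Int, fp: Int, tn: Int, fn: Int) {
    /** Share of predicted duplicates that really are duplicates. */
    def precision: Double = tp.toDouble / (tp + fp)

    /** Share of real duplicates that were found. */
    def recall: Double = tp.toDouble / (tp + fn)
  }
}
```

Calling `Metrics.ConfusionMatrix(tp = 53, fp = 2, tn = 126, fn = 5)` and reading `precision` and `recall` gives the ratios 53 / 55 and 53 / 58.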
Summary & Conclusion
Summary
✓ Single engine for Record Linkage and Deduplication
✓ Machine Learning → Specific rules for each dataset
✓ Higher identification of matches
• Previously ~50% → Now 80-90%
Thank you!
@aseigneurin

The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Recently uploaded (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015

  • 1. RECORD LINKAGE, A REAL USE CASE WITH SPARK ML Alexis Seigneurin - Pascale Mkhael
  • 2. Who I am
      • Software engineer for 15 years
      • Consultant at IpponTech in Paris, France
      • Spark trainer
      • Favorite subjects: Spark, Machine Learning, Cassandra
      • @aseigneurin
  • 3. The DIL’s mission: Help AXA become a data-driven company…
      • By focusing on…
          • BUILDING technological platforms using Big Data technologies
          • SUPPORTING AXA entities’ Big Data projects (tools and/or expertise)
          • EXPLORING innovative opportunities to transform the insurance business
      • The DIL in figures:
          • 1 team
          • 3 sites
          • 77 opportunities
          • 100+ TB
          • 3 sets of platforms
  • 4. The project
      • Record Linkage with Machine Learning
      • Use cases:
          • Find new clients who come from insurance comparison services → Commission
          • Find duplicates in existing files (acquisitions)
      • Record Linkage
      • Entity resolution
      • Deduplication
      • Entity disambiguation
      • …
  • 7. Steps
      1. Preprocessing
          1. Find potential duplicates
          2. Feature engineering
      2. Manual labeling of a sample
      3. Machine Learning to make predictions on the rest of the records
  • 8. Prototype
      • Crafted by a Data Scientist
      • Not architected, not versioned, not unit tested… → Not ready for production
      • Spark, but a lot of Spark SQL (data processing)
      • Machine Learning in Python (scikit-learn)
      → Objective: industrialization of the code
  • 10. Inputs
      • Data (CSV) + Schema (JSON)

      000010;Jose;Lester;10/10/1970
      000011;José;Lester;10/10/1970
      000012;Tyler;Hunt;12/12/1972
      000013;Tiler;Hunt;25/12/1972
      000014;Patrick;Andrews;1973-12-13

      {
        "tableSchemaBeforeSelection": [
          {
            "name": "ID",
            "typeField": "StringType",
            "hardJoin": false
          },
          {
            "name": "name",
            "typeField": "StringType",
            "hardJoin": true,
            "cleaning": "digitLetter",
            "listFeature": [ "scarcity" ],
            "listDistance": [ "equality", "soundLike" ]
          },
          ...
  • 11. Data loading
      • Spark CSV module → DataFrame
      • Don’t use type inference

      +------+-------+-------+----------+
      |    ID|   name|surname|   birthDt|
      +------+-------+-------+----------+
      |000010|   Jose| Lester|10/10/1970|
      |000011|   José| Lester|10/10/1970|
      |000012|  Tyler|   Hunt|12/12/1972|
      |000013|  Tiler|   Hunt|25/12/1972|
      |000014|Patrick|Andrews|1970-10-10|
      +------+-------+-------+----------+
  • 12. Data cleansing
      • Parsing of dates, numbers…
      • Cleaning of strings

      +------+-------+-------+----------+
      |    ID|   name|surname|   birthDt|
      +------+-------+-------+----------+
      |000010|   jose| lester|1970-10-10|
      |000011|   jose| lester|1970-10-10|
      |000012|  tyler|   hunt|1972-12-12|
      |000013|  tiler|   hunt|1972-12-25|
      |000014|patrick|andrews|      null|
      +------+-------+-------+----------+
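
The cleaning step above can be sketched in plain Scala, without Spark. This is only an illustration, not the project's code: the names `Cleaning`, `cleanString` and `parseDate` are hypothetical, and the `dd/MM/yyyy` input format is an assumption inferred from the sample rows.

```scala
import java.text.Normalizer
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import java.util.Locale

object Cleaning {
  // Assumed input format, inferred from sample values such as 10/10/1970.
  private val inputFormat = DateTimeFormatter.ofPattern("dd/MM/yyyy")

  // Lower-case, strip accents, and keep only letters, digits and single spaces.
  def cleanString(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD)
      .replaceAll("\\p{M}", "")       // drop combining accent marks
      .toLowerCase(Locale.ROOT)
      .replaceAll("[^a-z0-9 ]", "")   // drop punctuation and symbols
      .replaceAll(" +", " ")          // collapse repeated spaces
      .trim

  // Returns None when the value does not match the expected format
  // (such values appear as null in the cleaned table).
  def parseDate(s: String): Option[LocalDate] =
    try Some(LocalDate.parse(s, inputFormat))
    catch { case _: DateTimeParseException => None }
}
```

Returning an `Option` keeps badly formatted dates from failing the whole job, which matches the `null` birth date shown for record 000014.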
  • 13. Feature calculation
      • Convert strings to phonetics (Beider-Morse)
      • …

      +------+-------+-------+----------+--------------------+
      |    ID|   name|surname|   birthDt|      BMencoded_name|
      +------+-------+-------+----------+--------------------+
      |000010|   jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
      |000011|   jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
      |000012|  tyler|   hunt|1972-12-12|               tilir|
      |000013|  tiler|   hunt|1972-12-25|    tQlir|tili|tilir|
      |000014|patrick|andrews|      null|pYtrQk|pYtrik|pat...|
      +------+-------+-------+----------+--------------------+
  • 14. Find potential duplicates
      • Self-join (more on that later…)

      +------+------+---------+...+------+------+---------+...
      |  ID_1|name_1|surname_1|...|  ID_2|name_2|surname_2|...
      +------+------+---------+...+------+------+---------+...
      |000010|  jose|   lester|...|000011|  jose|   lester|...
      |000012| tyler|     hunt|...|000013| tiler|     hunt|...
      +------+------+---------+...+------+------+---------+...
  • 15. Distance calculation
      • Several distance algorithms:
          • Levenshtein distance, date difference…

      +------+...+------+...+-------------+--------------+...+----------------+
      |  ID_1|...|  ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
      +------+...+------+...+-------------+--------------+...+----------------+
      |000010|...|000011|...|          0.0|           0.0|...|             0.0|
      |000012|...|000013|...|          1.0|           0.0|...|            13.0|
      +------+...+------+...+-------------+--------------+...+----------------+
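
As a rough illustration of the distances named above, here is a plain-Scala sketch. The object and method names are hypothetical, and the conventions (0.0 for equal values, a date difference in days) are assumptions read off the sample table rather than the project's actual definitions.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object Distances {
  // 0.0 when the two values are equal, 1.0 otherwise (as in equality_name).
  def equality(a: String, b: String): Double = if (a == b) 0.0 else 1.0

  // Absolute number of days between two dates (as in dateDiff_birthDt).
  def dateDiff(a: LocalDate, b: LocalDate): Double =
    math.abs(ChronoUnit.DAYS.between(a, b)).toDouble

  // Levenshtein edit distance, computed one row at a time.
  def levenshtein(a: String, b: String): Int = {
    val firstRow = (0 to b.length).toVector
    a.foldLeft(firstRow) { (prev, ca) =>
      b.zipWithIndex.foldLeft(Vector(prev.head + 1)) { case (row, (cb, j)) =>
        val substitution = prev(j) + (if (ca == cb) 0 else 1)
        val insertion = row.last + 1
        val deletion = prev(j + 1) + 1
        row :+ math.min(substitution, math.min(insertion, deletion))
      }
    }.last
  }
}
```

For the sample pair 000012 / 000013, `equality("tyler", "tiler")` gives 1.0 and the two birth dates are 13 days apart, in line with the table.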
  • 16. Standardization / vectorization
      • Standardization of distances only
      • Vectorization (2 vectors)

      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
      |  ID_1|name_1|surname_1| birthDt_1|  ID_2|name_2|surname_2| birthDt_2|   distances|other_features|
      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
      |000010|  jose|   lester|1970-10-10|000011|  jose|   lester|1970-10-10|[0.0,0.0,...|  [2.0,2.0,...|
      |000012| tyler|     hunt|1972-12-12|000013| tiler|     hunt|1972-12-25|[1.0,1.0,...|  [1.0,2.0,...|
      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
  • 18. From SQL…
      • Generated SQL queries
      • Hard to maintain (especially with regard to UDFs)

      val cleaningRequest = tableSchema.map(x => {
        x.CleaningFuction match {
          case (Some(a), _) => a + "(" + x.name + ") as " + x.name
          case _ => x.name
        }
      }).mkString(", ")

      val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
      cleanedTable.registerTempTable(schema.tableName + "_cleaned")
  • 19. … to DataFrames
      • DataFrame primitives
      • More work done by the Scala compiler

      val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
        case (df, field) =>
          val udf: UserDefinedFunction = ... // get the cleaning UDF
          df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
            .drop(field.name)
            .withColumnRenamed(field.name + "_cleaned", field.name)
      }
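
The `foldLeft` pattern used above can be tried without Spark. The sketch below applies the same idea to a single record held in a `Map`; the `Field` case class and the `clean` function are hypothetical stand-ins for the project's schema and DataFrame operations.

```scala
object FoldLeftCleaning {
  // Hypothetical stand-in for a schema field: a column name plus an
  // optional cleaning function.
  final case class Field(name: String, cleaning: Option[String => String])

  type Record = Map[String, String]

  // Keep only the fields that define a cleaning function, then fold over
  // them, threading the progressively cleaned record as the accumulator.
  def clean(fields: Seq[Field], record: Record): Record =
    fields
      .collect { case Field(name, Some(f)) => name -> f }
      .foldLeft(record) { case (acc, (name, f)) =>
        acc.get(name) match {
          case Some(value) => acc.updated(name, f(value))
          case None        => acc // column absent: leave the record unchanged
        }
      }
}
```

Each step takes the previous result and returns a new one, which is how the DataFrame version chains one `withColumn` per field without any mutable state.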
  • 21. Unit testing
      • ScalaTest + Scoverage
      • Coverage of all the data processing operations
      • Comparison of Row objects
  • 22. Unit testing
      Input data:

      000010;Jose;Lester;10/10/1970
      000011;Jose =-+;Lester éà;10/10/1970
      000012;Jose;Lester;invalid date

      Test:

      val resDF = schema.cleanTable(rows)

      "The cleaning process" should "clean text fields" in {
        val res = resDF.select("ID", "name", "surname").collect()
        val expected = Array(
          Row("000010", "jose", "lester"),
          Row("000011", "jose", "lester ea"),
          Row("000012", "jose", "lester")
        )
        res should contain theSameElementsAs expected
      }

      "The cleaning process" should "parse dates" in { ...
  • 24. Join strategy
      • For record linkage, first merge the two sources
      • Then self-join
      • [Diagram labels: Prospects, New clients, Duplicate]
  • 25. Join - Volume of data
      • Input: 1M records
      • Cartesian product: 1,000 billion records
      → Find an appropriate join condition
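
The volume figure above is simple arithmetic, sketched here for reference. The object and method names are illustrative only. The second function shows how many pairs remain when each unordered pair is kept once, for example with an `ID_1 < ID_2` condition as used in the join examples.

```scala
object PairCounts {
  // Rows produced by a full Cartesian product of n records with themselves.
  def cartesian(n: Long): Long = n * n

  // Distinct unordered pairs: each pair counted once, no self-pairs.
  def unorderedPairs(n: Long): Long = n * (n - 1) / 2
}
```

With n = 1,000,000, `cartesian` gives 10^12 rows (1,000 billion) and `unorderedPairs` still gives roughly 5 × 10^11, so ordering the IDs alone is not enough and a selective join condition is needed.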
  • 26. Join condition
      • Multiple joins on 2 fields
      • Equality of values or custom condition (UDF)
      • Union between all the intermediate results
      • E.g. with fields name, surname, birth_date:

      df1.join(df2, (df1("ID_1") < df2("ID_2"))
        && (df1("name_1") === df2("name_2"))
        && (soundLike(df1("surname_1"), df2("surname_2"))))

      UNION

      df1.join(df2, (df1("ID_1") < df2("ID_2"))
        && (df1("name_1") === df2("name_2"))
        && (df1("birth_date_1") === df2("birth_date_2")))

      UNION

      df1.join(df2, (df1("ID_1") < df2("ID_2"))
        && (soundLike(df1("surname_1"), df2("surname_2")))
        && (df1("birth_date_1") === df2("birth_date_2")))
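
The logic of "several two-field joins, then a union" can be sketched on an in-memory collection. This plain-Scala version illustrates only which pairs are selected, not how Spark executes the joins. All names are hypothetical, and simple surname equality stands in for the `soundLike` UDF.

```scala
object CandidatePairs {
  final case class Person(id: String, name: String, surname: String, birthDate: String)

  type Condition = (Person, Person) => Boolean

  val sameName: Condition = (a, b) => a.name == b.name
  // Assumption: plain equality used here as a stand-in for soundLike.
  val sameSurname: Condition = (a, b) => a.surname == b.surname
  val sameBirthDate: Condition = (a, b) => a.birthDate == b.birthDate

  // One "join": keep ordered pairs (a.id < b.id) for which both conditions hold.
  def join(records: Seq[Person], c1: Condition, c2: Condition): Set[(String, String)] =
    (for {
      a <- records
      b <- records
      if a.id < b.id && c1(a, b) && c2(a, b)
    } yield (a.id, b.id)).toSet

  // Union of the three two-field joins; a Set drops pairs found more than once.
  def candidates(records: Seq[Person]): Set[(String, String)] =
    join(records, sameName, sameSurname) ++
      join(records, sameName, sameBirthDate) ++
      join(records, sameSurname, sameBirthDate)
}
```

A pair becomes a candidate as soon as two of the three fields agree, so one field may differ (a typo, a wrong date) without the pair being missed.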
  • 28. DataFrames extension
      • 3 types of columns: Data, Distances, Non-distance features

      +------+...+------+...+-------------+--------------+...+----------------+
      |  ID_1|...|  ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
      +------+...+------+...+-------------+--------------+...+----------------+
      |000010|...|000011|...|          0.0|           0.0|...|             0.0|
      |000012|...|000013|...|          1.0|           0.0|...|            13.0|
      +------+...+------+...+-------------+--------------+...+----------------+
  • 29. DataFrames extension
      • DataFrame columns have a name and a data type
      • DataFrameExt = DataFrame + metadata over columns

      case class OutputColumn(name: String, columnType: ColumnType)

      class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
        def show() = df.show()
        def drop(colName: String): DataFrameExt = ...
        def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
        ...
      }
  • 31. Labeling
      • Manual operation
          • Is this a duplicate? → Yes / No
      • Performed on a sample of the potential duplicates
          • Between 1,000 and 10,000 records
  • 34. Predictions
      • Machine Learning
          • Random Forests
          • (Gradient Boosting Trees also give good results)
      • Training on the potential duplicates labeled by hand
      • Predictions on the potential duplicates not labeled by hand
  • 35. Predictions
      • Sample: 1000 records
          • Training set: 800 records
          • Test set: 200 records
      • Results
          • True positives: 53
          • False positives: 2
          • True negatives: 126
          • False negatives: 5
      → Found 53 of the 58 expected duplicates (53+5), with only 2 errors
      • Precision ≈ 93%
      • Recall ≈ 91%
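
For reference, precision and recall follow directly from the confusion-matrix counts. The sketch below uses illustrative names and the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN).

```scala
object Metrics {
  // Share of predicted duplicates that really are duplicates.
  def precision(truePositives: Int, falsePositives: Int): Double =
    truePositives.toDouble / (truePositives + falsePositives)

  // Share of real duplicates that were found.
  def recall(truePositives: Int, falseNegatives: Int): Double =
    truePositives.toDouble / (truePositives + falseNegatives)
}
```

With the counts listed above, `recall(53, 5)` is 53/58, about 91%, and `precision(53, 2)` is 53/55, about 96%, which is slightly above the rounded precision figure quoted on the slide.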
  • 37. Summary
      ✓ Single engine for Record Linkage and Deduplication
      ✓ Machine Learning → Specific rules for each dataset
      ✓ Higher identification of matches
          • Previously ~50% → Now 80-90%