
Record Linkage, a real use case with Spark ML - Paris Spark meetup, Dec 2015

Record Linkage, a Spark ML use case, by Alexis Seigneurin

Record Linkage is the process of finding, within a data set, the records that represent the same entity. It is particularly hard when, as in our case, you work with anonymized data. That is where Machine Learning comes to the rescue! We implemented a Record Linkage algorithm with Spark SQL (DataFrames) and Spark ML rather than relying on static rules. We will walk through the feature engineering process, explain why we had to extend Spark DataFrames to preserve metadata throughout the processing pipeline, and show how we used Machine Learning to reconcile records. Finally, we will see how we industrialized this application.

Alexis Seigneurin: a developer for 15 years, I care deeply about data processing, analysis and storage. At Ippon, I mainly work on consulting and architecture engagements around big data technologies. I also run the Spark training course at Ippon.



  1. RECORD LINKAGE, A REAL USE CASE WITH SPARK ML
     Alexis Seigneurin - Pascale Mkhael
  2. Who I am
     • Software engineer for 15 years
     • Consultant at Ippon Technologies in Paris, France
     • Spark trainer
     • Favorite subjects: Spark, Machine Learning, Cassandra
     • @aseigneurin
  3. The DIL's mission: help AXA become a data-driven company…
     By focusing on:
     • BUILDING technological platforms using Big Data technologies
     • SUPPORTING AXA entities' Big Data projects (tools and/or expertise)
     • EXPLORING innovative opportunities to transform the insurance business
     The DIL in figures: 1 TEAM, 3 SITES, 77 OPPORTUNITIES, 100+ TB, 3 SETS OF PLATFORMS
  4. The project
     • Record Linkage with Machine Learning
     • Use cases:
       • Find new clients who come from insurance comparison services → Commission
       • Find duplicates in existing files (acquisitions)
     • Related terms: Record Linkage, entity resolution, deduplication, entity disambiguation, …
  5. Overview
  6. Purpose
     • Find duplicates!
     +---+-------+------------+----------+------------+---------+------------+
     | ID|    veh|codptgar_veh|dt_nais_cp|dt_permis_cp|redmaj_cp|     formule|
     +---+-------+------------+----------+------------+---------+------------+
     |...|PE28221|       50000|1995-10-12|  2013-10-08|    100.0|       TIERS|
     |...|FO26016|       59270|1967-01-01|  1987-02-01|    100.0|VOL_INCENDIE|
     |...|FI19107|       77100|1988-09-27|  2009-09-13|    105.0|TOUS_RISQUES|
     |...|RE07307|       69100|1984-08-15|  2007-04-20|     50.0|       TIERS|
     |...|FO26016|       59270|1967-01-07|  1987-02-01|    105.0|TOUS_RISQUES|
     +---+-------+------------+----------+------------+---------+------------+
  7. Steps
     1. Preprocessing
        1. Find potential duplicates
        2. Feature engineering
     2. Manual labeling of a sample
     3. Machine Learning to make predictions on the rest of the records
  8. Prototype
     • Crafted by a Data Scientist
     • Not architected, not versioned, not unit tested… → not ready for production
     • Spark, but a lot of Spark SQL (data processing)
     • Machine Learning in Python (scikit-learn)
     → Objective: industrialization of the code
  9. Preprocessing
  10. Inputs
      • Data (CSV) + Schema (JSON)
      000010;Jose;Lester;10/10/1970
      000011;José;Lester;10/10/1970
      000012;Tyler;Hunt;12/12/1972
      000013;Tiler;Hunt;25/12/1972
      000014;Patrick;Andrews;1973-12-13
      {
        "tableSchemaBeforeSelection": [
          {
            "name": "ID",
            "typeField": "StringType",
            "hardJoin": false
          },
          {
            "name": "name",
            "typeField": "StringType",
            "hardJoin": true,
            "cleaning": "digitLetter",
            "listFeature": [ "scarcity" ],
            "listDistance": [ "equality", "soundLike" ]
          },
          ...
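      A minimal sketch of how such a JSON schema could be mapped to Scala case classes. The field names mirror the JSON above; the use of json4s and the class names (FieldSchema, TableSchema, loadSchema) are assumptions for illustration, not the project's actual code.

        import org.json4s._
        import org.json4s.jackson.JsonMethods.parse

        // One entry of "tableSchemaBeforeSelection"; optional keys map to Option fields
        case class FieldSchema(name: String,
                               typeField: String,
                               hardJoin: Boolean,
                               cleaning: Option[String],
                               listFeature: Option[List[String]],
                               listDistance: Option[List[String]])

        case class TableSchema(tableSchemaBeforeSelection: List[FieldSchema])

        implicit val formats: Formats = DefaultFormats

        // Parse the JSON schema file content into the case classes above
        def loadSchema(json: String): TableSchema = parse(json).extract[TableSchema]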
  11. Data loading
      • Spark CSV module → DataFrame
      • Don't use type inference
      +------+-------+-------+----------+
      |    ID|   name|surname|   birthDt|
      +------+-------+-------+----------+
      |000010|   Jose| Lester|10/10/1970|
      |000011|   José| Lester|10/10/1970|
      |000012|  Tyler|   Hunt|12/12/1972|
      |000013|  Tiler|   Hunt|25/12/1972|
      |000014|Patrick|Andrews|1970-10-10|
      +------+-------+-------+----------+
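      A minimal sketch of this loading step, assuming the spark-csv package (com.databricks:spark-csv) that was current at the time; the path, separator and header option are illustrative. Type inference is deliberately off so that every column comes in as a string and parsing stays under the application's control.

        import org.apache.spark.sql.SQLContext

        def loadCsv(sqlContext: SQLContext, path: String) =
          sqlContext.read
            .format("com.databricks.spark.csv")
            .option("header", "false")
            .option("delimiter", ";")
            .option("inferSchema", "false")   // keep everything as StringType
            .load(path)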
  12. Data cleansing
      • Parsing of dates, numbers…
      • Cleaning of strings
      +------+-------+-------+----------+
      |    ID|   name|surname|   birthDt|
      +------+-------+-------+----------+
      |000010|   jose| lester|1970-10-10|
      |000011|   jose| lester|1970-10-10|
      |000012|  tyler|   hunt|1972-12-12|
      |000013|  tiler|   hunt|1972-12-25|
      |000014|patrick|andrews|      null|
      +------+-------+-------+----------+
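      A hedged sketch of what the "digitLetter" string cleaning named in the schema could look like as a UDF: lower-case, strip accents, keep only letters, digits and spaces. The date parsing shown in the table would be a similar UDF built on a date formatter. The function and column names are illustrative, not the project's actual code.

        import java.text.Normalizer
        import org.apache.spark.sql.functions.udf

        val cleanDigitLetter = udf { s: String =>
          if (s == null) null
          else Normalizer.normalize(s.toLowerCase, Normalizer.Form.NFD)
            .replaceAll("\\p{M}", "")        // drop diacritics (é -> e)
            .replaceAll("[^a-z0-9 ]", "")    // keep letters, digits, spaces
            .trim
        }

        val cleanedDF = df.withColumn("name", cleanDigitLetter(df("name")))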
  13. Feature calculation
      • Convert strings to phonetics (Beider-Morse)
      • …
      +------+-------+-------+----------+--------------------+
      |    ID|   name|surname|   birthDt|      BMencoded_name|
      +------+-------+-------+----------+--------------------+
      |000010|   jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
      |000011|   jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
      |000012|  tyler|   hunt|1972-12-12|               tilir|
      |000013|  tiler|   hunt|1972-12-25|    tQlir|tili|tilir|
      |000014|patrick|andrews|      null|pYtrQk|pYtrik|pat...|
      +------+-------+-------+----------+--------------------+
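      A hedged sketch of the Beider-Morse encoding wrapped in a UDF, assuming the encoder from Apache Commons Codec (which returns a pipe-separated list of phonetic forms, as seen in the BMencoded_name column above). Column names are illustrative.

        import org.apache.commons.codec.language.bm.BeiderMorseEncoder
        import org.apache.spark.sql.functions.udf

        // Encode a string into its Beider-Morse phonetic forms ("ios|iosi|ioz|...")
        val beiderMorse = udf { s: String =>
          if (s == null) null else new BeiderMorseEncoder().encode(s)
        }

        val withPhonetics = df.withColumn("BMencoded_name", beiderMorse(df("name")))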
  14. Find potential duplicates
      • Auto-join (more on that later…)
      +------+------+---------+...+------+------+---------+...
      |  ID_1|name_1|surname_1|...|  ID_2|name_2|surname_2|...
      +------+------+---------+...+------+------+---------+...
      |000010|  jose|   lester|...|000011|  jose|   lester|...
      |000012| tyler|     hunt|...|000013| tiler|     hunt|...
      +------+------+---------+...+------+------+---------+...
  15. Distance calculation
      • Several distance algorithms:
        • Levenshtein distance, date difference…
      +------+...+------+...+-------------+--------------+...+----------------+
      |  ID_1|...|  ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
      +------+...+------+...+-------------+--------------+...+----------------+
      |000010|...|000011|...|          0.0|           0.0|...|             0.0|
      |000012|...|000013|...|          1.0|           0.0|...|            13.0|
      +------+...+------+...+-------------+--------------+...+----------------+
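      A minimal sketch of two such distance columns using built-in Spark SQL functions (levenshtein and datediff both exist in Spark 1.5+). The 0/1 equality flag and the column names follow the table above; "pairs" stands for the DataFrame of candidate pairs and the actual distance set is larger in practice.

        import org.apache.spark.sql.functions.{abs, col, datediff, levenshtein, when}

        val withDistances = pairs
          .withColumn("equality_name",
            when(col("name_1") === col("name_2"), 0.0).otherwise(1.0))
          .withColumn("levenshtein_name",
            levenshtein(col("name_1"), col("name_2")).cast("double"))
          .withColumn("dateDiff_birthDt",   // birthDt_* already parsed to dates
            abs(datediff(col("birthDt_1"), col("birthDt_2"))).cast("double"))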
  16. Standardization / vectorization
      • Standardization of distances only
      • Vectorization (2 vectors)
      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
      |  ID_1|name_1|surname_1| birthDt_1|  ID_2|name_2|surname_2| birthDt_2|   distances|other_features|
      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
      |000010|  jose|   lester|1970-10-10|000011|  jose|   lester|1970-10-10|[0.0,0.0,...| [2.0,2.0,...|
      |000012| tyler|     hunt|1972-12-12|000013| tiler|     hunt|1972-12-25|[1.0,1.0,...| [1.0,2.0,...|
      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
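      A hedged sketch of this step with spark.ml's VectorAssembler and StandardScaler: the distance columns go into one vector that gets standardized, the remaining features into a second vector that does not. The column lists are illustrative, not the project's actual configuration.

        import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

        val distanceCols = Array("equality_name", "soundLike_name", "dateDiff_birthDt")
        val otherCols    = Array("scarcity_name", "scarcity_surname")

        val distanceAssembler = new VectorAssembler()
          .setInputCols(distanceCols).setOutputCol("distances_raw")
        val otherAssembler = new VectorAssembler()
          .setInputCols(otherCols).setOutputCol("other_features")
        val scaler = new StandardScaler()
          .setInputCol("distances_raw").setOutputCol("distances")

        val assembled  = otherAssembler.transform(distanceAssembler.transform(withDistances))
        val vectorized = scaler.fit(assembled).transform(assembled)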
  17. Spark SQL → DataFrames
  18. From SQL…
      • Generated SQL requests
      • Hard to maintain (especially with regard to UDFs)
      val cleaningRequest = tableSchema.map(x => {
        x.CleaningFuction match {
          case (Some(a), _) => a + "(" + x.name + ") as " + x.name
          case _ => x.name
        }
      }).mkString(", ")
      val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
      cleanedTable.registerTempTable(schema.tableName + "_cleaned")
  19. … to DataFrames
      • DataFrame primitives
      • More work done by the Scala compiler
      val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
        case (df, field) =>
          val udf: UserDefinedFunction = ... // get the cleaning UDF
          df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
            .drop(field.name)
            .withColumnRenamed(field.name + "_cleaned", field.name)
      }
  20. Unit testing
  21. Unit testing
      • ScalaTest + Scoverage
      • Coverage of all the data processing operations
      • Comparison of Row objects
  22. Unit testing
      Input data:
        000010;Jose;Lester;10/10/1970
        000011;Jose =-+;Lester éà;10/10/1970
        000012;Jose;Lester;invalid date
      Test code:
        val resDF = schema.cleanTable(rows)
        "The cleaning process" should "clean text fields" in {
          val res = resDF.select("ID", "name", "surname").collect()
          val expected = Array(
            Row("000010", "jose", "lester"),
            Row("000011", "jose", "lester ea"),
            Row("000012", "jose", "lester")
          )
          res should contain theSameElementsAs expected
        }
        "The cleaning process" should "parse dates" in { ...
  23. Matching potential duplicates
  24. Join strategy
      • For record linkage, first merge the two sources
      • Then auto-join
      (diagram: Prospects + New clients → Duplicate)
  25. Join - Volume of data
      • Input: 1M records
      • Cartesian product: 1000 B records
      → Find an appropriate join condition
  26. Join condition
      • Multiple joins on 2 fields
      • Equality of values or custom condition (UDF)
      • Union of all the intermediate results
      • E.g. with fields name, surname, birth_date:
        df1.join(df2, (df1("ID_1") < df2("ID_2"))
          && (df1("name_1") === df2("name_2"))
          && (soundLike(df1("surname_1"), df2("surname_2"))))

        df1.join(df2, (df1("ID_1") < df2("ID_2"))
          && (df1("name_1") === df2("name_2"))
          && (df1("birth_date_1") === df2("birth_date_2")))

        df1.join(df2, (df1("ID_1") < df2("ID_2"))
          && (soundLike(df1("surname_1"), df2("surname_2")))
          && (df1("birth_date_1") === df2("birth_date_2")))

        → UNION of these three intermediate results
  27. DataFrames extension
  28. DataFrames extension
      • 3 types of columns: data (ID_1, ID_2, …), distances, and non-distance features
      +------+...+------+...+-------------+--------------+...+----------------+
      |  ID_1|...|  ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
      +------+...+------+...+-------------+--------------+...+----------------+
      |000010|...|000011|...|          0.0|           0.0|...|             0.0|
      |000012|...|000013|...|          1.0|           0.0|...|            13.0|
      +------+...+------+...+-------------+--------------+...+----------------+
  29. DataFrames extension
      • DataFrame columns have a name and a data type
      • DataFrameExt = DataFrame + metadata over columns
      case class OutputColumn(name: String, columnType: ColumnType)

      class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
        def show() = df.show()
        def drop(colName: String): DataFrameExt = ...
        def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
        ...
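      One plausible way to keep the metadata in sync, a hedged sketch rather than the project's actual implementation: every operation both delegates to the wrapped DataFrame and updates outputColumns.

        def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt =
          new DataFrameExt(df.withColumn(colName, col),
                           outputColumns :+ OutputColumn(colName, columnType))

        def drop(colName: String): DataFrameExt =
          new DataFrameExt(df.drop(colName),
                           outputColumns.filterNot(_.name == colName))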
  30. Labeling
  31. Labeling
      • Manual operation
      • Is this a duplicate? → Yes / No
      • Performed on a sample of the potential duplicates
        • Between 1,000 and 10,000 records
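      A hedged sketch of how such a sample could be extracted and written out for manual labeling (here as CSV via spark-csv); the DataFrame name, fraction and output path are illustrative.

        // Take a small random sample of the candidate pairs for hand labeling
        val sample = potentialDuplicates.sample(withReplacement = false, fraction = 0.01, seed = 42)

        sample.write
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .save("/data/pairs_to_label")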
  32. Labeling
  33. Predictions
  34. Predictions
      • Machine Learning
        • Random Forests
        • (Gradient Boosted Trees also give good results)
      • Training on the potential duplicates labeled by hand
      • Predictions on the potential duplicates not labeled by hand
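      A minimal sketch of this learning step with spark.ml's RandomForestClassifier: train on the hand-labeled pairs, predict on the rest. The column names ("label", "distances") follow the earlier slides; the DataFrame names and the number of trees are illustrative assumptions.

        import org.apache.spark.ml.classification.RandomForestClassifier

        val rf = new RandomForestClassifier()
          .setLabelCol("label")
          .setFeaturesCol("distances")
          .setNumTrees(50)

        val model       = rf.fit(labeledPairs)            // sample labeled by hand
        val predictions = model.transform(unlabeledPairs) // remaining potential duplicates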
  35. Predictions
      • Sample: 1000 records
        • Training set: 800 records
        • Test set: 200 records
      • Results:
        • True positives: 53
        • False positives: 2
        • True negatives: 126
        • False negatives: 5
      → Found 53 duplicates out of the 58 expected (53+5), with only 2 errors
      • Precision ≈ 93%
      • Recall ≈ 91%
  36. Summary & Conclusion
  37. Summary
      ✓ Single engine for Record Linkage and Deduplication
      ✓ Machine Learning → specific rules for each dataset
      ✓ Higher identification of matches
        • Previously ~50% → now 80-90%
  38. Thank you! @aseigneurin

