Applied Analytics with Greenplum Hadoop:
Standardizing 113+ Million Merchant Names
with RegEx and Fuzzy Matching

Ian Andrews
Mike Goddard
© Copyright 2012 EMC Corporation. All rights reserved.                  1
Greenplum, A Division of EMC
• 10 years of experience building and supporting enterprise-class massively
  parallel data processing software based on open source technology
• Silicon Valley-based core engineering talent from
  Yahoo!, Teradata, Oracle, Amazon, Microsoft, IBM, etc.
• 1,000 (and growing) personnel focused on Greenplum’s Big Data Platform
     – Greenplum Database
     – Greenplum HD (Hadoop)
     – Chorus
     – Data Computing Appliances
     – Data Scientists
     – Pivotal Labs
• Fully integrated with EMC’s award-winning global support infrastructure.
• 500+ customers in production globally across all industry segments.
• Established relationships with ecosystem partners:
  Informatica, SAS, Talend, Pentaho, MicroStrategy, etc.
• Strategic development relationship with VMware around virtual big data
  platforms


Greenplum Unified Analytic Platform




Transaction Data - Merchant Name Standardization System




Overview of Findings
• Transaction data is difficult to analyze because the merchant
  names found in credit and debit data are unstructured
  and not standardized, even within a single business entity
• We created a system for cleaning and standardizing
  merchant names
         –      Stage 1: feature extraction
         –      Stage 2: automated cleanup using regular expressions
         –      Stage 3: fuzzy matching algorithm
         –      Stage 4: application of manual rules
• This is an open system, easy to use, extend and modify
• We used the results to do some preliminary analysis on
  the transaction data


Background Information -
 Credit and Debit Data Overview
Credit Transactions (1)
• 1,396,344 distinct merchant names
• 16,554,889 credit transactions ($1,979,801,143.50)
• 161,931 households with credit transactions
• Min: -$32,585
• Max: $99,000
• Average: $120
• Std. Deviation: $496

Debit Transactions
• 2,598,462 distinct merchant names
• 96,658,020 debit transactions ($3,471,084,518.72)
• 435,615 households with debit transactions
• Min: $0.01
• Max: $39,404
• Average: $36
• Std. Deviation: $89

Share of number of transactions: Debit 85.38%, Credit 14.62%
Share of sum of transactions:    Debit 63.68%, Credit 36.32%

(1) Excludes 13 SIC codes in the depository institution activity group
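The chart percentages and averages on this slide follow directly from the counts and dollar totals listed above. A minimal sketch of that arithmetic, using the slide's figures (the variable and function names are made up for illustration):

```python
# Figures copied from the slide; only the names are invented here.
credit_count, debit_count = 16_554_889, 96_658_020
credit_amount, debit_amount = 1_979_801_143.50, 3_471_084_518.72


def percent_share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


total_count = credit_count + debit_count
total_amount = credit_amount + debit_amount

credit_count_share = percent_share(credit_count, total_count)
credit_amount_share = percent_share(credit_amount, total_amount)
credit_average = credit_amount / credit_count
debit_average = debit_amount / debit_count

print(f"credit share of transaction count: {credit_count_share:.2f}%")
print(f"credit share of transaction sum:   {credit_amount_share:.2f}%")
print(f"average credit transaction: ${credit_average:.0f}")
print(f"average debit transaction:  ${debit_average:.0f}")
```

Credit accounts for roughly one in seven transactions but more than a third of the dollars, which is consistent with the much larger average credit ticket.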



Why standardize merchant names?
• Because the same business is recorded under many different
  names across locations, a single business entity appears as
  many separate merchants in the database
• Examples

WAL-MART                   PAYPAL                 STARBUCKS
WALMART PORTRAITS 23093    PAYPAL *SACCAR.COM     STARBUCKSSTORE.COM-USD
WAL-MART #2366 SE2         PAYPAL *BRICKSUPPLY    STARBUCKS CORP00034488
WAL-MART STORE#1041        PAYPAL *BRETT2010FL    SS-STARBUCKS
WAL-MART SUPERCENTER 20    PAYPAL *UNITED         T1 STARBUCKS J10431542
WAL MART LINCOLN           PAYPAL *TL5354         STARBUCKS C #112201505
WALMART.COM RELOAD         PAYPAL *CAR-KIT.COM    STARBUCKS WEST30081525




Examples of names passing through the
merchant name standardization system
                   Example 1                     Example 2

Original:          GIANT FOOD #089               PETSMART INC 1963

Stage 1
Features:          Length: 15                    Length: 17
                   1st White Space: 6            1st White Space: 9
                   1st Special Character: 12     Business Suffix: 10
                   1st Digit: 13                 1st Digit: 14

Stage 2
Regex:             [^(?-i)a-z]                   [^(?-i)a-z]|( INC )$
                   Remove all numbers (0-9),     Remove all numbers (0-9),
                   white space,                  white space, special
                   & special characters          characters, & remove
                                                 business suffix

Stage 3
Fuzzy Matching:    1016 (count of                <170 PETSMART FOUND
                   GIANTFOOD matches)            (Not run)

Stage 4
Manual Override:   None                          None

Final Results:     GIANTFOOD                     PETSMART
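Stages 1 and 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system's actual code: positions are reported 1-based (as the slide's values imply), the business-suffix list is hypothetical, and the cleanup uses a simplified `[^A-Z]` pattern on upper-cased input rather than the regex shown on the slide.

```python
import re

# Hypothetical suffix list; the deck only shows "INC" as an example.
BUSINESS_SUFFIX = re.compile(r"\b(INC|LLC|CORP|CO|LTD)\b")


def first_pos(pattern, text):
    """1-based position of the first match of `pattern`, or None."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else None


def extract_features(name):
    """Stage 1: simple positional features of the raw merchant name."""
    return {
        "length": len(name),
        "first_whitespace": first_pos(r"\s", name),
        "first_special": first_pos(r"[^A-Za-z0-9\s]", name),
        "first_digit": first_pos(r"\d", name),
        "business_suffix": first_pos(BUSINESS_SUFFIX, name),
    }


def clean_name(name):
    """Stage 2: drop a business suffix, then everything that is not A-Z."""
    without_suffix = BUSINESS_SUFFIX.sub("", name.upper())
    return re.sub(r"[^A-Z]", "", without_suffix)


for raw in ("GIANT FOOD #089", "PETSMART INC 1963"):
    print(raw, extract_features(raw), clean_name(raw))
```

Removing the suffix before stripping non-letters matters: once the spaces are gone, `INC` can no longer be told apart from letters that belong to the name.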



Example Results - STARBUCKS
                    Pre-Standardization                         Post-Standardization
STARBUCKS DELI20371514                                   STARBUCKS
STARBUCKS-ARIFJAN CAMP2                                  STARBUCKS
STARBUCKS C #112201505                                   STARBUCKS
STARBUCKS USA 00115832                                   STARBUCKS
STARBUCK'S CAFE CROWNE                                   STARBUCKS
STARBUCKS CORP00134759                                   STARBUCKS
ATL MED CTR STARBUCKS                                    STARBUCKS
T3 N STARBUCKS30031512                                   STARBUCKS
STARBUCKS COFEE                                          STARBUCKS
STARBUCKS LA ISLA                                        STARBUCKS
OMNI FT WORTH - STARBUCKS                                STARBUCKS
ST. RITA'S STARBUCKS                                     STARBUCKS
MGM GRND STARBUCKS-CASINO                                STARBUCKS
006 STARBUCKS AMR                                        STARBUCKS
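The deck does not say which fuzzy matching algorithm Stage 3 uses. As a stand-in, the sketch below scores each alphabetic token of a raw name against a list of canonical names with `difflib.SequenceMatcher` from the Python standard library. The canonical list, the 0.85 cutoff, and the minimum token length of 4 are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical canonical names; a real list would come from the cleaned data.
CANONICAL = ["STARBUCKS", "WALMART", "PAYPAL"]


def fuzzy_standardize(raw_name, canonical=CANONICAL, cutoff=0.85):
    """Return the canonical name most similar to any token, or None."""
    # Split on anything that is not a letter and ignore very short tokens.
    tokens = [t for t in re.split(r"[^A-Z]+", raw_name.upper()) if len(t) >= 4]
    best_name, best_score = None, cutoff
    for token in tokens:
        for name in canonical:
            score = SequenceMatcher(None, token, name).ratio()
            if score >= best_score:
                best_name, best_score = name, score
    return best_name


for raw in ("STARBUCK'S CAFE CROWNE", "ATL MED CTR STARBUCKS",
            "T3 N STARBUCKS30031512", "WAL MART LINCOLN"):
    print(f"{raw!r} -> {fuzzy_standardize(raw)}")
```

With this token-based approach, `STARBUCK'S` still resolves because the token `STARBUCK` is within the cutoff of `STARBUCKS`, while a split name such as `WAL MART LINCOLN` finds no match. Cases like that are the kind a Stage 4 manual rule would cover.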




90% of all transactions occur at 7% of the
  merchants
Company Name       Total Transactions
MCDONALDS                   4,309,728
SPEEDWAY                    2,032,474
WALMART                     1,606,446
KROGER                      1,564,819
SHELLOIL                    1,546,056
SHEETZ                      1,358,977
SUBWAY                      1,280,037
REDBOX                      1,236,148
EXXONMOBIL                  1,205,451
WAWA                        1,197,711
SUNO                        1,180,799
WENDYS                      1,066,628
MARATHONOIL                 1,050,593
MEIJER                      1,017,998
STARBUCKS                   1,002,805

Gini Coefficient = 0.9447
•   0 represents equality
•   1 represents all transactions at 1 merchant
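The deck reports the Gini coefficient without showing how it was computed. A minimal sketch using the standard sorted-rank formula is below; the function name and the toy inputs are illustrative, and the per-merchant transaction counts are assumed to be available as a plain list.

```python
def gini(values):
    """Gini coefficient of non-negative totals (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, i = 1..n ascending
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([5, 5, 5, 5]))    # every merchant has the same count
print(gini([0, 0, 0, 20]))   # one merchant has every transaction
```

For a finite number of merchants the maximum is (n - 1) / n rather than exactly 1, so the "all transactions at 1 merchant" case approaches 1 as the merchant count grows.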



90% of the total spend in 2011 occurred
 at top 8.3% of merchants
Company Name          Total Spent
WALMART            $87,454,235.66
KROGER             $63,850,902.99
SPEEDWAY           $54,270,752.65
TARGET             $48,086,797.70
MEIJER             $46,716,327.56
WMSUPERCENTER      $46,650,761.15
SHELLOIL           $45,115,993.12
GIANTEAGLE         $44,668,211.07
ATT                $44,497,819.88
VERIZONWRLS        $41,971,943.31
LOWES              $34,952,686.13
SUNO               $34,498,328.42
EXXONMOBIL         $33,695,575.95
MCDONALDS          $30,869,463.74
SHEETZ             $30,273,183.81

Gini Coefficient = 0.9408
•   0 represents equality
•   1 represents all money spent at 1 merchant
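A headline such as "90% of the spend at the top 8.3% of merchants" can be reproduced by sorting merchants by total and accumulating until the target share is reached. The sketch below assumes per-merchant totals are available as a list; the function name and the toy numbers are hypothetical.

```python
def merchant_share_for(totals, target=0.90):
    """Smallest fraction of merchants, largest first, covering `target` of the total."""
    ordered = sorted(totals, reverse=True)
    grand_total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= target * grand_total:
            return count / len(ordered)
    return 1.0


# Toy example: five merchants with spend 50, 30, 15, 3 and 2.
print(merchant_share_for([50, 30, 15, 3, 2]))  # 0.6: three of five merchants
```

The same function works for transaction counts instead of dollars, which is how the two headlines (7% for transactions, 8.3% for spend) can come from one routine.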


SIC codes alone are problematic; they
can differ greatly across like businesses
• On average, each of the 1,000 most frequently occurring
  merchants has ~6 SIC codes associated with its
  cleaned merchant name

WALMART                  TARGET                     SAFEWAY        KROGER    AT&T        VERIZON          T-MOBILE
4814                     5310                       5411           12        1711        4812             12
4816                     5411                       5499           5411      2741        4814             4812
5300                     5732                       5921           5499      3640        4899             5732
5411                     8043                                      5541      4112        5999             5999
6300                     8099                                      5542      5971        7311             7299
…                        …                                                   …           7399             …
Total 31                 Total 8                                             Total 71                     Total 10

      6 total matches                                      2 total matches              4 total matches


Relative Value Add segments created by
splitting population into deciles based on RVA




• Relative Value Added (RVA) provides an estimated ordinal
  ranking of customers using balance and transaction data (a
  rough precursor of EVA)
• The RVA was created to put a context around the merchant
  name discovery, the distribution of PNC’s products and how
  they interact
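The deck does not give the RVA formula, so the sketch below only illustrates the decile split itself. It assumes an RVA score has already been computed per household and that cohort 1 holds the lowest scores; both the input layout and the cohort ordering are assumptions.

```python
from collections import Counter


def assign_deciles(scores):
    """Map each id to a decile 1..10 by ascending score rank."""
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    # rank runs 0..n-1, so (rank * 10) // n runs 0..9
    return {key: (rank * 10) // n + 1 for rank, key in enumerate(ranked)}


# Hypothetical RVA scores for 20 households.
rva = {f"hh{i}": float(i) for i in range(20)}
deciles = assign_deciles(rva)
print(Counter(deciles.values()))  # two households in each of the ten deciles
```

Ranking before bucketing keeps the segments equal in size even when the score distribution is heavily skewed, which is the usual reason to prefer deciles over fixed score thresholds.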

Segment Profiles
Index: % segment / % population
Cohort                           1     2     3     4     5     6     7     8     9    10
Cellular telephone providers
ATT                           1.00  0.86  1.18  1.24  1.14  1.04  0.97  0.91  0.86  0.79
SPRINT                        1.75  0.55  1.93  1.72  1.15  0.81  0.67  0.56  0.50  0.36
TMOBILE                       1.35  0.95  1.38  1.36  1.06  0.86  0.92  0.81  0.71  0.60
VERIZONWRLS                   0.95  0.52  1.18  1.32  1.28  1.11  1.01  0.95  0.90  0.78
Retail stores
SEARSROEBUCK                  0.64  1.60  0.60  0.63  0.79  0.90  1.03  1.12  1.25  1.45
TJMAXX                        0.68  1.46  0.71  0.66  0.83  0.96  1.02  1.12  1.22  1.32
TARGET                        0.72  1.51  0.63  0.69  0.87  1.02  1.11  1.16  1.18  1.12
WALMART                       0.82  1.77  0.82  0.82  0.88  0.89  0.92  0.97  1.00  1.11
STAPLES                       0.69  1.72  0.71  0.55  0.68  0.88  0.97  1.06  1.19  1.54
STARBUCKS                     0.82  0.47  0.81  0.88  1.04  1.21  1.23  1.23  1.19  1.14
PAYPAL                        1.13  1.51  1.03  0.86  0.82  0.91  1.00  0.90  0.92  0.93
Groceries
PUBLIX                        0.84  3.16  0.35  0.45  0.56  0.72  0.83  0.86  0.94  1.27
MENARDS                       0.75  3.66  0.42  0.38  0.55  0.71  0.77  0.93  0.85  0.98
KROGER                        0.79  1.13  0.79  0.87  1.00  1.01  1.03  1.10  1.09  1.20
Gas and convenience stores
EXXONMOBIL                    1.07  0.93  1.04  1.03  1.01  0.99  1.00  0.96  0.96  1.01
SHEETZ                        0.87  0.36  0.91  1.01  0.96  0.96  1.04  1.21  1.37  1.31
SHELLOIL                      1.12  1.04  1.03  1.04  1.01  1.01  0.98  0.93  0.93  0.91
SPEEDWAY                      1.17  0.90  1.25  1.24  1.16  1.04  0.97  0.87  0.77  0.63
Hotels
HILTON                        0.69  1.70  0.49  0.53  0.76  1.02  1.15  1.14  1.16  1.36
RAMADAINN                     0.75  2.29  0.40  0.64  0.90  0.88  1.00  1.10  0.90  1.13
RESIDENCEINN                  0.92  1.94  0.56  0.73  0.68  0.84  1.00  0.82  0.97  1.55
ROYALINN                      0.23  0.87  1.07  0.81  0.99  0.85  0.78  0.49  1.04  2.87

• Target’s marketing to higher income households seems to have worked
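The index defined in the slide header (% segment / % population) can be sketched as a ratio of two shares. The deck does not state whether the shares are counted in households, transactions, or dollars; the sketch below assumes households, and the function and parameter names are hypothetical.

```python
def segment_index(segment_shoppers, segment_size,
                  population_shoppers, population_size):
    """Index = share of the segment using a merchant / share of everyone using it."""
    pct_segment = segment_shoppers / segment_size
    pct_population = population_shoppers / population_size
    return pct_segment / pct_population


# Toy numbers: 30 of 100 households in a cohort shop at a merchant,
# versus 200 of 1,000 households overall.
print(round(segment_index(30, 100, 200, 1000), 2))  # 1.5
```

An index above 1.00 means the cohort is over-represented at that merchant relative to the whole population, and an index below 1.00 means it is under-represented.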




Segment Profiles
Index: % segment / % population
Cohort                           1     2     3     4     5     6     7     8     9    10
Cellular telephone providers
ATT                           1.00  0.86  1.18  1.24  1.14  1.04  0.97  0.91  0.86  0.79
SPRINT                        1.75  0.55  1.93  1.72  1.15  0.81  0.67  0.56  0.50  0.36
TMOBILE                       1.35  0.95  1.38  1.36  1.06  0.86  0.92  0.81  0.71  0.60
VERIZONWRLS                   0.95  0.52  1.18  1.32  1.28  1.11  1.01  0.95  0.90  0.78
Retail stores
SEARSROEBUCK                  0.64  1.60  0.60  0.63  0.79  0.90  1.03  1.12  1.25  1.45
TJMAXX                        0.68  1.46  0.71  0.66  0.83  0.96  1.02  1.12  1.22  1.32
TARGET                        0.72  1.51  0.63  0.69  0.87  1.02  1.11  1.16  1.18  1.12
WALMART                       0.82  1.77  0.82  0.82  0.88  0.89  0.92  0.97  1.00  1.11
STAPLES                       0.69  1.72  0.71  0.55  0.68  0.88  0.97  1.06  1.19  1.54
STARBUCKS                     0.82  0.47  0.81  0.88  1.04  1.21  1.23  1.23  1.19  1.14
PAYPAL                        1.13  1.51  1.03  0.86  0.82  0.91  1.00  0.90  0.92  0.93
Groceries
PUBLIX                        0.84  3.16  0.35  0.45  0.56  0.72  0.83  0.86  0.94  1.27
MENARDS                       0.75  3.66  0.42  0.38  0.55  0.71  0.77  0.93  0.85  0.98
KROGER                        0.79  1.13  0.79  0.87  1.00  1.01  1.03  1.10  1.09  1.20
Gas and convenience stores
EXXONMOBIL                    1.07  0.93  1.04  1.03  1.01  0.99  1.00  0.96  0.96  1.01
SHEETZ                        0.87  0.36  0.91  1.01  0.96  0.96  1.04  1.21  1.37  1.31
SHELLOIL                      1.12  1.04  1.03  1.04  1.01  1.01  0.98  0.93  0.93  0.91
SPEEDWAY                      1.17  0.90  1.25  1.24  1.16  1.04  0.97  0.87  0.77  0.63
Hotels
HILTON                        0.69  1.70  0.49  0.53  0.76  1.02  1.15  1.14  1.16  1.36
RAMADAINN                     0.75  2.29  0.40  0.64  0.90  0.88  1.00  1.10  0.90  1.13
RESIDENCEINN                  0.92  1.94  0.56  0.73  0.68  0.84  1.00  0.82  0.97  1.55
ROYALINN                      0.23  0.87  1.07  0.81  0.99  0.85  0.78  0.49  1.04  2.87

• AT&T and Verizon appear to be gaining more high value customers




Summary of Findings
• We cleaned and standardized merchant names and
         – Found 1.1 million distinct merchants from the original 113+ million
         – Discovered that 90% of transactions and 90% of the money spent
           happened at fewer than 10% of the merchants
         – Identified that SIC codes differ significantly across like businesses
         – Identified differences in credit and debit purchase behavior
         – In reaction to the announcement Square made on August 8th, we
           used the cleaned merchant names to evaluate the potential impact
           of the trend towards alternative payment methods
• Segmentation augmented by a value added metric
         – We found that segmenting customers based on a rough measure of
           value added and combining that with transaction data can provide
           interesting insights
         – Prediction of migration from low to high value segments seems
           possible


© Copyright 2012 EMC Corporation. All rights reserved.                               16

More Related Content

What's hot

Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in KafkaJoel Koshy
 
Alexei vladishev - Open Source Monitoring With Zabbix
Alexei vladishev - Open Source Monitoring With ZabbixAlexei vladishev - Open Source Monitoring With Zabbix
Alexei vladishev - Open Source Monitoring With ZabbixAndré Déo
 
Blockchain Security Issues and Challenges
Blockchain Security Issues and Challenges Blockchain Security Issues and Challenges
Blockchain Security Issues and Challenges Merlec Mpyana
 
Introduction to Recommendation System
Introduction to Recommendation SystemIntroduction to Recommendation System
Introduction to Recommendation SystemMinha Hwang
 
Blockchain 101 by imran bashir
Blockchain 101  by imran bashirBlockchain 101  by imran bashir
Blockchain 101 by imran bashirImran Bashir
 
Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...
Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...
Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...Edureka!
 
Webinar: PostgreSQL continuous backup and PITR with Barman
Webinar: PostgreSQL continuous backup and PITR with BarmanWebinar: PostgreSQL continuous backup and PITR with Barman
Webinar: PostgreSQL continuous backup and PITR with BarmanGabriele Bartolini
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Using Graph Algorithms for Advanced Analytics - Part 2 Centrality
Using Graph Algorithms for Advanced Analytics - Part 2 CentralityUsing Graph Algorithms for Advanced Analytics - Part 2 Centrality
Using Graph Algorithms for Advanced Analytics - Part 2 CentralityTigerGraph
 
Intro to Web3 and Polygon.pdf
Intro to Web3 and Polygon.pdfIntro to Web3 and Polygon.pdf
Intro to Web3 and Polygon.pdfTinaBregovi
 
Getting Started in Web3 with MetaMask.pptx
Getting Started in Web3 with MetaMask.pptxGetting Started in Web3 with MetaMask.pptx
Getting Started in Web3 with MetaMask.pptxssuser455e28
 
Enjin Coin - Pitch Deck For Investors
Enjin Coin - Pitch Deck For InvestorsEnjin Coin - Pitch Deck For Investors
Enjin Coin - Pitch Deck For InvestorsSimonKertonegoro
 
Fraud Detection and Neo4j
Fraud Detection and Neo4j Fraud Detection and Neo4j
Fraud Detection and Neo4j Max De Marzi
 
Netflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 StarsNetflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 StarsXavier Amatriain
 
Postgres connections at scale
Postgres connections at scalePostgres connections at scale
Postgres connections at scaleMydbops
 
How to Secure Your Scylla Deployment: Authorization, Encryption, LDAP Authent...
How to Secure Your Scylla Deployment: Authorization, Encryption, LDAP Authent...How to Secure Your Scylla Deployment: Authorization, Encryption, LDAP Authent...
How to Secure Your Scylla Deployment: Authorization, Encryption, LDAP Authent...ScyllaDB
 

What's hot (20)

Consumer offset management in Kafka
Consumer offset management in KafkaConsumer offset management in Kafka
Consumer offset management in Kafka
 
Alexei vladishev - Open Source Monitoring With Zabbix
Alexei vladishev - Open Source Monitoring With ZabbixAlexei vladishev - Open Source Monitoring With Zabbix
Alexei vladishev - Open Source Monitoring With Zabbix
 
Blockchain Security Issues and Challenges
Blockchain Security Issues and Challenges Blockchain Security Issues and Challenges
Blockchain Security Issues and Challenges
 
Introduction to Recommendation System
Introduction to Recommendation SystemIntroduction to Recommendation System
Introduction to Recommendation System
 
Blockchain 101 by imran bashir
Blockchain 101  by imran bashirBlockchain 101  by imran bashir
Blockchain 101 by imran bashir
 
Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...
Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...
Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...
 
Webinar: PostgreSQL continuous backup and PITR with Barman
Webinar: PostgreSQL continuous backup and PITR with BarmanWebinar: PostgreSQL continuous backup and PITR with Barman
Webinar: PostgreSQL continuous backup and PITR with Barman
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Using Graph Algorithms for Advanced Analytics - Part 2 Centrality
Using Graph Algorithms for Advanced Analytics - Part 2 CentralityUsing Graph Algorithms for Advanced Analytics - Part 2 Centrality
Using Graph Algorithms for Advanced Analytics - Part 2 Centrality
 
Bitcoin & Bitcoin Mining
Bitcoin & Bitcoin MiningBitcoin & Bitcoin Mining
Bitcoin & Bitcoin Mining
 
Blockchain concepts
Blockchain conceptsBlockchain concepts
Blockchain concepts
 
Intro to Web3 and Polygon.pdf
Intro to Web3 and Polygon.pdfIntro to Web3 and Polygon.pdf
Intro to Web3 and Polygon.pdf
 
Getting Started in Web3 with MetaMask.pptx
Getting Started in Web3 with MetaMask.pptxGetting Started in Web3 with MetaMask.pptx
Getting Started in Web3 with MetaMask.pptx
 
Enjin Coin - Pitch Deck For Investors
Enjin Coin - Pitch Deck For InvestorsEnjin Coin - Pitch Deck For Investors
Enjin Coin - Pitch Deck For Investors
 
Fraud Detection and Neo4j
Fraud Detection and Neo4j Fraud Detection and Neo4j
Fraud Detection and Neo4j
 
Netflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 StarsNetflix Recommendations - Beyond the 5 Stars
Netflix Recommendations - Beyond the 5 Stars
 
Introduction to Blockchain
Introduction to BlockchainIntroduction to Blockchain
Introduction to Blockchain
 
Postgresql
PostgresqlPostgresql
Postgresql
 
Postgres connections at scale
Postgres connections at scalePostgres connections at scale
Postgres connections at scale
 
How to Secure Your Scylla Deployment: Authorization, Encryption, LDAP Authent...
How to Secure Your Scylla Deployment: Authorization, Encryption, LDAP Authent...How to Secure Your Scylla Deployment: Authorization, Encryption, LDAP Authent...
How to Secure Your Scylla Deployment: Authorization, Encryption, LDAP Authent...
 

Viewers also liked

Complex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeComplex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeNati Shalom
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaData Science London
 
Open Stack Days israel Keynote 2017
Open Stack Days israel Keynote 2017Open Stack Days israel Keynote 2017
Open Stack Days israel Keynote 2017Nati Shalom
 
The Storyteller's Secret: 3 Keys to Mastering Storytelling to Win Hearts and ...
The Storyteller's Secret: 3 Keys to Mastering Storytelling to Win Hearts and ...The Storyteller's Secret: 3 Keys to Mastering Storytelling to Win Hearts and ...
The Storyteller's Secret: 3 Keys to Mastering Storytelling to Win Hearts and ...Carmine Gallo
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)Prof. Dr. Diego Kuonen
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 

Viewers also liked (11)

Complex Analytics with NoSQL Data Store in Real Time
Complex Analytics with NoSQL Data Store in Real TimeComplex Analytics with NoSQL Data Store in Real Time
Standardizing 113 Million Merchant Names with RegEx and Fuzzy Matching

  • 1. Applied Analytics with Greenplum Hadoop: Standardizing +113 million Merchant Names with RegEx and Fuzzy Matching Ian Andrews Mike Goddard © Copyright 2012 EMC Corporation. All rights reserved. 1
  • 2. Greenplum, A Division of EMC • 10 years of experience building and supporting enterprise-class massively parallel data processing software based on open source technology • Silicon-valley based core engineering talent from Yahoo!, Teradata, Oracle, Amazon, Microsoft, IBM, etc • 1,000 (and growing) personnel focused on Greenplum’s Big Data Platform – Greenplum Database – Greenplum HD (Hadoop) – Chorus – Data Computing Appliances – Data Scientists – Pivotal Labs • Fully integrated with EMC’s award-winning global support infrastructure. • 500+ customers in production globally across all industry segments. • Established relationships with ecosystems partners: Informatica, SAS, Talend, Pentaho, Microstrategy, etc. • Strategic development relationship with VMware around virtual big data platforms
  • 3. Greenplum Unified Analytic Platform
  • 4. Transaction Data - Merchant Name Standardization System
  • 5. Overview of Findings • Transaction data is difficult to analyze because the merchant names found in credit and debit data are unstructured and not standardized, even within a single business entity • We created a system for cleaning and standardizing merchant names – Stage 1: feature extraction – Stage 2: automated cleanup using regular expressions – Stage 3: fuzzy matching algorithm – Stage 4: application of manual rules • This is an open system, easy to use, extend and modify • We used the results to do some preliminary analysis on the transaction data
  • 6. Background Information - Credit and Debit Data Overview
      Credit Transactions (1)
        • 1,396,344 distinct merchant names
        • 16,554,889 credit transactions ($1,979,801,143.50)
        • 161,931 households with credit transaction
        • Min: -$32,585
        • Max: $99,000
        • Average: $120
        • Std. Deviation: $496
      Debit Transactions
        • 2,598,462 distinct merchant names
        • 96,658,020 debit transactions ($3,471,084,518.72)
        • 435,615 households with debit transaction
        • Min: $0.01
        • Max: $39,404
        • Average: $36
        • Std. Deviation: $89
      % # transactions: Debit 85.38%, Credit 14.62%
      % sum transactions: Debit 63.68%, Credit 36.32%
      (1) Excludes 13 Sic Codes in depository institution activity group
  • 7. Why standardize merchant names? • Because the same business is recorded under multiple names across locations, a single business entity appears as many merchants in the database • Examples WAL-MART PAYPAL STARBUCKS WALMART PORTRAITS 23093 PAYPAL *SACCAR.COM STARBUCKSSTORE.COM-USD WAL-MART #2366 SE2 PAYPAL *BRICKSUPPLY STARBUCKS CORP00034488 WAL-MART STORE#1041 PAYPAL *BRETT2010FL SS-STARBUCKS WAL-MART SUPERCENTER 20 PAYPAL *UNITED T1 STARBUCKS J10431542 WAL MART LINCOLN PAYPAL *TL5354 STARBUCKS C #112201505 WALMART.COM RELOAD PAYPAL *CAR-KIT.COM STARBUCKS WEST30081525
  • 8. Examples of name passing thru merchant name standardization system
      Example 1
        Original: GIANT FOOD #089
        Stage 1 Features: Length: 14; 1st White Space: 6; 1st Special Characters: 12; 1st Digit: 13
        Stage 2 Regex: [^(?-i)a-z]
          Remove all numbers (0-9), white space, & special characters
        Stage 3 Fuzzy Matching: 1016 (count of GIANTFOOD matches)
        Stage 4 Manual Override: None
        Final Results: GIANTFOOD
      Example 2
        Original: PETSMART INC 1963
        Stage 1 Features: Length: 17; 1st White Space: 9; Business Suffix: 10; 1st Digit: 14
        Stage 2 Regex: [^(?-i)a-z]|( INC )$
          Remove all numbers (0-9), white space, special characters, & remove business suffix
        Stage 3 Fuzzy Matching: <170 PETSMART FOUND (Not run)
        Stage 4 Manual Override: None
        Final Results: PETSMART
  • 9. Example Results - STARBUCKS Pre-Standardization Post-Standardization STARBUCKS DELI20371514 STARBUCKS STARBUCKS-ARIFJAN CAMP2 STARBUCKS STARBUCKS C #112201505 STARBUCKS STARBUCKS USA 00115832 STARBUCKS STARBUCK'S CAFE CROWNE STARBUCKS STARBUCKS CORP00134759 STARBUCKS ATL MED CTR STARBUCKS STARBUCKS T3 N STARBUCKS30031512 STARBUCKS STARBUCKS COFEE STARBUCKS STARBUCKS LA ISLA STARBUCKS OMNI FT WORTH - STARBUCKS STARBUCKS ST. RITA'S STARBUCKS STARBUCKS MGM GRND STARBUCKS-CASINO STARBUCKS 006 STARBUCKS AMR STARBUCKS
  • 10. 90% of all transactions occur at 7% of the merchants
      Company Name     Total Transactions
      MCDONALDS        4,309,728
      SPEEDWAY         2,032,474
      WALMART          1,606,446
      KROGER           1,564,819
      SHELLOIL         1,546,056
      SHEETZ           1,358,977
      SUBWAY           1,280,037
      REDBOX           1,236,148
      EXXONMOBIL       1,205,451
      WAWA             1,197,711
      SUNO             1,180,799
      WENDYS           1,066,628
      MARATHONOIL      1,050,593
      MEIJER           1,017,998
      STARBUCKS        1,002,805
      Gini Coefficient = 0.9447
        • 0 represents equality
        • 1 represents all transactions at 1 merchant
  • 11. 90% of the total spend in 2011 occurred at top 8.3% of merchants
      Company Name     Total spent
      WALMART          $87,454,235.66
      KROGER           $63,850,902.99
      SPEEDWAY         $54,270,752.65
      TARGET           $48,086,797.70
      MEIJER           $46,716,327.56
      WMSUPERCENTER    $46,650,761.15
      SHELLOIL         $45,115,993.12
      GIANTEAGLE       $44,668,211.07
      ATT              $44,497,819.88
      VERIZONWRLS      $41,971,943.31
      LOWES            $34,952,686.13
      SUNO             $34,498,328.42
      EXXONMOBIL       $33,695,575.95
      MCDONALDS        $30,869,463.74
      SHEETZ           $30,273,183.81
      Gini Coefficient = 0.9408
        • 0 represents equality
        • 1 represents all money spent at 1 merchant
  • 12. SIC codes alone are problematic; they can differ greatly across like businesses • On average the top 1,000 most frequently occurring merchants have ~6 SIC codes associated with their cleaned merchant name WALMART TARGET SAFEWAY KROGER AT&T VERIZON T-MOBILE 4814 5310 5411 12 1711 4812 12 4816 5411 5499 5411 2741 4814 4812 5300 5732 5921 5499 3640 4899 5732 5411 8043 5541 4112 5999 5999 6300 8099 5542 5971 7311 7299 … … … 7399 … Total 31 Total 8 Total 71 Total 10 6 total matches 2 total matches 4 total matches
  • 13. Relative Value Add segments created by splitting population into deciles based on RVA RVA • Relative Value Added (RVA) provides an estimated ordinal ranking of customers using balance and transaction data (a rough precursor of EVA) • The RVA was created to put a context around the merchant name discovery, the distribution of PNC’s products and how they interact
  • 14. Segment Profiles
      Index: % segment / % population
      Callout: Target’s marketing to higher income households seems to have worked
      Cohort                1     2     3     4     5     6     7     8     9    10
      Cellular telephone providers
      ATT                1.00  0.86  1.18  1.24  1.14  1.04  0.97  0.91  0.86  0.79
      SPRINT             1.75  0.55  1.93  1.72  1.15  0.81  0.67  0.56  0.50  0.36
      TMOBILE            1.35  0.95  1.38  1.36  1.06  0.86  0.92  0.81  0.71  0.60
      VERIZONWRLS        0.95  0.52  1.18  1.32  1.28  1.11  1.01  0.95  0.90  0.78
      Retail stores
      SEARSROEBUCK       0.64  1.60  0.60  0.63  0.79  0.90  1.03  1.12  1.25  1.45
      TJMAXX             0.68  1.46  0.71  0.66  0.83  0.96  1.02  1.12  1.22  1.32
      TARGET             0.72  1.51  0.63  0.69  0.87  1.02  1.11  1.16  1.18  1.12
      WALMART            0.82  1.77  0.82  0.82  0.88  0.89  0.92  0.97  1.00  1.11
      STAPLES            0.69  1.72  0.71  0.55  0.68  0.88  0.97  1.06  1.19  1.54
      STARBUCKS          0.82  0.47  0.81  0.88  1.04  1.21  1.23  1.23  1.19  1.14
      PAYPAL             1.13  1.51  1.03  0.86  0.82  0.91  1.00  0.90  0.92  0.93
      Groceries
      PUBLIX             0.84  3.16  0.35  0.45  0.56  0.72  0.83  0.86  0.94  1.27
      MENARDS            0.75  3.66  0.42  0.38  0.55  0.71  0.77  0.93  0.85  0.98
      KROGER             0.79  1.13  0.79  0.87  1.00  1.01  1.03  1.10  1.09  1.20
      Gas and convenience stores
      EXXONMOBIL         1.07  0.93  1.04  1.03  1.01  0.99  1.00  0.96  0.96  1.01
      SHEETZ             0.87  0.36  0.91  1.01  0.96  0.96  1.04  1.21  1.37  1.31
      SHELLOIL           1.12  1.04  1.03  1.04  1.01  1.01  0.98  0.93  0.93  0.91
      SPEEDWAY           1.17  0.90  1.25  1.24  1.16  1.04  0.97  0.87  0.77  0.63
      Hotels
      HILTON             0.69  1.70  0.49  0.53  0.76  1.02  1.15  1.14  1.16  1.36
      RAMADAINN          0.75  2.29  0.40  0.64  0.90  0.88  1.00  1.10  0.90  1.13
      RESIDENCEINN       0.92  1.94  0.56  0.73  0.68  0.84  1.00  0.82  0.97  1.55
      ROYALINN           0.23  0.87  1.07  0.81  0.99  0.85  0.78  0.49  1.04  2.87
  • 15. Segment Profiles
      Index: % segment / % population
      Callout: AT&T and Verizon appear to be gaining more high value customers
      Cohort                1     2     3     4     5     6     7     8     9    10
      Cellular telephone providers
      ATT                1.00  0.86  1.18  1.24  1.14  1.04  0.97  0.91  0.86  0.79
      SPRINT             1.75  0.55  1.93  1.72  1.15  0.81  0.67  0.56  0.50  0.36
      TMOBILE            1.35  0.95  1.38  1.36  1.06  0.86  0.92  0.81  0.71  0.60
      VERIZONWRLS        0.95  0.52  1.18  1.32  1.28  1.11  1.01  0.95  0.90  0.78
      Retail stores
      SEARSROEBUCK       0.64  1.60  0.60  0.63  0.79  0.90  1.03  1.12  1.25  1.45
      TJMAXX             0.68  1.46  0.71  0.66  0.83  0.96  1.02  1.12  1.22  1.32
      TARGET             0.72  1.51  0.63  0.69  0.87  1.02  1.11  1.16  1.18  1.12
      WALMART            0.82  1.77  0.82  0.82  0.88  0.89  0.92  0.97  1.00  1.11
      STAPLES            0.69  1.72  0.71  0.55  0.68  0.88  0.97  1.06  1.19  1.54
      STARBUCKS          0.82  0.47  0.81  0.88  1.04  1.21  1.23  1.23  1.19  1.14
      PAYPAL             1.13  1.51  1.03  0.86  0.82  0.91  1.00  0.90  0.92  0.93
      Groceries
      PUBLIX             0.84  3.16  0.35  0.45  0.56  0.72  0.83  0.86  0.94  1.27
      MENARDS            0.75  3.66  0.42  0.38  0.55  0.71  0.77  0.93  0.85  0.98
      KROGER             0.79  1.13  0.79  0.87  1.00  1.01  1.03  1.10  1.09  1.20
      Gas and convenience stores
      EXXONMOBIL         1.07  0.93  1.04  1.03  1.01  0.99  1.00  0.96  0.96  1.01
      SHEETZ             0.87  0.36  0.91  1.01  0.96  0.96  1.04  1.21  1.37  1.31
      SHELLOIL           1.12  1.04  1.03  1.04  1.01  1.01  0.98  0.93  0.93  0.91
      SPEEDWAY           1.17  0.90  1.25  1.24  1.16  1.04  0.97  0.87  0.77  0.63
      Hotels
      HILTON             0.69  1.70  0.49  0.53  0.76  1.02  1.15  1.14  1.16  1.36
      RAMADAINN          0.75  2.29  0.40  0.64  0.90  0.88  1.00  1.10  0.90  1.13
      RESIDENCEINN       0.92  1.94  0.56  0.73  0.68  0.84  1.00  0.82  0.97  1.55
      ROYALINN           0.23  0.87  1.07  0.81  0.99  0.85  0.78  0.49  1.04  2.87
  • 16. Summary of Findings
    • We cleaned and standardized merchant names and
         – Found 1.1 million distinct merchants among the original 113+ million merchant names
         – Discovered that 90% of transactions and 90% of the money spent occurred at fewer than 10% of the merchants
         – Identified that SIC codes differ significantly across like businesses
         – Identified differences in credit and debit purchase behavior
         – Used the cleaned merchant names to evaluate the potential impact of the trend toward alternative payment methods, in reaction to the announcement Square made on August 8th
    • Segmentation augmented by a value-added metric
         – Segmenting customers on a rough measure of value added, combined with transaction data, can provide interesting insights
         – Prediction of migration from low- to high-value segments seems possible
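
The concentration finding above (90% of transactions at fewer than 10% of merchants) can be checked with a simple cumulative count once merchant names are standardized. The sketch below is an assumption about how such a check might look, not the authors' method; the function name `merchant_share_for_coverage` and the toy counts are hypothetical.

```python
def merchant_share_for_coverage(txn_counts, coverage=0.9):
    """Fraction of merchants, taken largest first, needed to reach
    `coverage` of all transactions.

    txn_counts: dict mapping standardized merchant name -> transaction count.
    """
    counts = sorted(txn_counts.values(), reverse=True)
    if not counts:
        raise ValueError("txn_counts must not be empty")
    total = sum(counts)
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        if running >= coverage * total:
            return rank / len(counts)
    return 1.0


# Hypothetical toy data: one dominant merchant out of five.
toy_counts = {"WALMART": 950, "TARGET": 30, "KROGER": 10, "SHEETZ": 5, "PUBLIX": 5}
share = merchant_share_for_coverage(toy_counts)
```

In this toy case a single merchant out of five already covers 90% of transactions, so the function returns 0.2. The same loop applied to dollar amounts instead of counts would give the corresponding figure for money spent.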

Editor's Notes

  1. SCRIPT: This diagram depicts the Greenplum Unified Analytics Platform (UAP). Let’s take a high-level look at it as a stack diagram. The foundation of UAP is Greenplum Database for analyzing your structured data, alongside Greenplum Hadoop for co-processing unstructured data. These two components are fused together by Greenplum gNet, which allows for parallel data exchange and parallel query access. They are overlaid with a unified data access and query layer that combines the languages of choice for your analysts (SQL, MapReduce, etc.). Above the access layer sits our powerful partner tool and services layer. We are not about locking customers into a single tool or stack. Instead we work with the tool vendor of your choice, be it SAS or R, Microstrategy or Informatica. And what truly enables productivity and ensures you are getting maximum value out of your data scientist team is Greenplum Chorus. What sets this diagram apart from a typical vendor example is the inclusion of people: Data Stakeholders. UAP is designed to enable an emerging group of talent, the new practitioners, that we refer to as the Data Science team. This team can include the data platform administrator, data scientists, analysts, engineers, BI teams, and, most importantly, the line-of-business user. We develop, package, and support this as a unified software platform available on your favorite commodity hardware, on cloud infrastructure, or from our modular Data Computing Appliance.