SlideShare a Scribd company logo
1 of 18
Download to read offline
© Nube Technologies
Real Time Fuzzy Matching With Spark and
ElasticSearch
© Nube Technologies
About Us
The only way to do great work is to love what
you do.
- Steve Jobs
© Nube Technologies
The problem - lake or swamp?
© Nube Technologies
Duplicates
© Nube Technologies
Challenges
● Quadratic problem
● No standard notion of similarity
● Omissions, typos and other issues
● Different languages
© Nube Technologies
Use Case - Customer Record Dedup
© Nube Technologies
Use Case - Customer Record Dedup
© Nube Technologies
Use Case - Shopping Site Comparison
© Nube Technologies
Use Case - Shopping Site Comparison
© Nube Technologies
Other Use Cases
● Cross selling
● Financial Credit Ratings
● Fraud Analytics
● Catalog and inventory management
● Household and individual level analytics.
© Nube Technologies
Lets start wishing...
● Data variety
● Scalable
● No manual configuration of rules or
algorithms
● Multi language
● Real time
© Nube Technologies
Reifier - learn
© Nube Technologies
Reifier - learn
© Nube Technologies
Reifier - learn
© Nube Technologies
Reifier - learn
© Nube Technologies
Real Time
Spark + ElasticSearch
© Nube Technologies
Spark Benefits
● Distributed
● Scalable
● Fast
● Machine Learning
● Sampling
● No need to orchestrate multiple jobs
© Nube Technologies
Thank You!
www.nubetech.co
sonal@nubetech.co

More Related Content

Similar to 06 sonal goyal

Jaya WWW talk 2023.pdf
Jaya WWW talk 2023.pdfJaya WWW talk 2023.pdf
Jaya WWW talk 2023.pdfJaya Kawale
 
Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Kent Graziano
 
Drawbridge_MeetUp_June19_072414
Drawbridge_MeetUp_June19_072414Drawbridge_MeetUp_June19_072414
Drawbridge_MeetUp_June19_072414Nitin Panjwani
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Tugdual Grall
 
Usability247 business overview
Usability247 business overviewUsability247 business overview
Usability247 business overviewUX247
 
Agile methods and dw mha
Agile methods and dw mhaAgile methods and dw mha
Agile methods and dw mhaAgileDenver
 
Starting a Drupal 8 Project? Let’s do a Technical Discovery - DrupalConAsia 2...
Starting a Drupal 8 Project? Let’s do a Technical Discovery - DrupalConAsia 2...Starting a Drupal 8 Project? Let’s do a Technical Discovery - DrupalConAsia 2...
Starting a Drupal 8 Project? Let’s do a Technical Discovery - DrupalConAsia 2...Ravindra Singh
 
Deploying your SaaS stack OnPrem
Deploying your SaaS stack OnPremDeploying your SaaS stack OnPrem
Deploying your SaaS stack OnPremKris Buytaert
 
ViewPro | Productivity never looked better...
ViewPro | Productivity never looked better...ViewPro | Productivity never looked better...
ViewPro | Productivity never looked better...Niall O'Callaghan
 
Why the h# should I use Appium with React Native
Why the h# should I use Appium with React NativeWhy the h# should I use Appium with React Native
Why the h# should I use Appium with React NativeWim Selles
 
7 things to consider when choosing your IaaS provider for ISV/SaaS
7 things to consider when choosing your IaaS provider for ISV/SaaS7 things to consider when choosing your IaaS provider for ISV/SaaS
7 things to consider when choosing your IaaS provider for ISV/SaaSFrederik Denkens
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheapMarc Cluet
 
A Brave New World of Delivering IT
A Brave New World of Delivering ITA Brave New World of Delivering IT
A Brave New World of Delivering ITXebiaLabs
 
Dev ops tutorial for beginners what is devops & devops tools
Dev ops tutorial for beginners what is devops & devops toolsDev ops tutorial for beginners what is devops & devops tools
Dev ops tutorial for beginners what is devops & devops toolsJanBask Training
 
Targeted documentation STC Houston, Mar 20, 2012
Targeted documentation   STC Houston, Mar 20, 2012Targeted documentation   STC Houston, Mar 20, 2012
Targeted documentation STC Houston, Mar 20, 2012STC_Houston
 
Dutch entrepreneurs visiting twago in Berlin
Dutch entrepreneurs visiting twago in BerlinDutch entrepreneurs visiting twago in Berlin
Dutch entrepreneurs visiting twago in BerlinGunnar Berning
 

Similar to 06 sonal goyal (20)

Jaya WWW talk 2023.pdf
Jaya WWW talk 2023.pdfJaya WWW talk 2023.pdf
Jaya WWW talk 2023.pdf
 
PHP + Business = Money!
PHP + Business = Money!PHP + Business = Money!
PHP + Business = Money!
 
M6d cassandra summit
M6d cassandra summitM6d cassandra summit
M6d cassandra summit
 
Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)
 
Drawbridge_MeetUp_June19_072414
Drawbridge_MeetUp_June19_072414Drawbridge_MeetUp_June19_072414
Drawbridge_MeetUp_June19_072414
 
New Relic
New RelicNew Relic
New Relic
 
Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015
 
Usability247 business overview
Usability247 business overviewUsability247 business overview
Usability247 business overview
 
Agile methods and dw mha
Agile methods and dw mhaAgile methods and dw mha
Agile methods and dw mha
 
Starting a Drupal 8 Project? Let’s do a Technical Discovery - DrupalConAsia 2...
Starting a Drupal 8 Project? Let’s do a Technical Discovery - DrupalConAsia 2...Starting a Drupal 8 Project? Let’s do a Technical Discovery - DrupalConAsia 2...
Starting a Drupal 8 Project? Let’s do a Technical Discovery - DrupalConAsia 2...
 
Deploying your SaaS stack OnPrem
Deploying your SaaS stack OnPremDeploying your SaaS stack OnPrem
Deploying your SaaS stack OnPrem
 
ViewPro | Productivity never looked better...
ViewPro | Productivity never looked better...ViewPro | Productivity never looked better...
ViewPro | Productivity never looked better...
 
Why the h# should I use Appium with React Native
Why the h# should I use Appium with React NativeWhy the h# should I use Appium with React Native
Why the h# should I use Appium with React Native
 
7 things to consider when choosing your IaaS provider for ISV/SaaS
7 things to consider when choosing your IaaS provider for ISV/SaaS7 things to consider when choosing your IaaS provider for ISV/SaaS
7 things to consider when choosing your IaaS provider for ISV/SaaS
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheap
 
Go Revel Gooo...
Go Revel Gooo...Go Revel Gooo...
Go Revel Gooo...
 
A Brave New World of Delivering IT
A Brave New World of Delivering ITA Brave New World of Delivering IT
A Brave New World of Delivering IT
 
Dev ops tutorial for beginners what is devops & devops tools
Dev ops tutorial for beginners what is devops & devops toolsDev ops tutorial for beginners what is devops & devops tools
Dev ops tutorial for beginners what is devops & devops tools
 
Targeted documentation STC Houston, Mar 20, 2012
Targeted documentation   STC Houston, Mar 20, 2012Targeted documentation   STC Houston, Mar 20, 2012
Targeted documentation STC Houston, Mar 20, 2012
 
Dutch entrepreneurs visiting twago in Berlin
Dutch entrepreneurs visiting twago in BerlinDutch entrepreneurs visiting twago in Berlin
Dutch entrepreneurs visiting twago in Berlin
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

06 sonal goyal