SlideShare a Scribd company logo
1 of 21
Download to read offline
Luis Peinado Fuentes, BBVA Data & Analytics
Jose A. Rodriguez Serrano, BBVA Data & Analytics
Tagging Text in Money Transfers:
A Use-Case of Spark in Banking
#EUds7 @BBVAData
Our Goal:
Improving User Experience
in Retail Banking Products
with Machine Learning
(Using Spark)
#EUds7
Dashboard View of Expenses and Income
Anticipated Expenses
Customer Comparison Engine
#EUds7
All these examples require
categorized bank transactions
#EUds7
GAS COMPANY VAT 1234567X
Categorizing Bank Transactions
#EUds7
232 - 60.30 € 23/10/2015
...
BILL NO. A12345
Account operation
Taxonomy rules
(Hive / Spark)
Identifier matching
(Hive)
Company name
matching
(Lucene)
Text classification
(Spark + MLlib)
Other busines
rules (Hive)
7 Million Daily Operations
(Card and Bank account)
A12345
Bill Transaction
gas bill
groceries
public transport
rent
+ About 60 categories
health insurance
The Problem: Categorizing Transfers
#EUds7
007 - 60.30 € 23/10/2015 ...
...
PAYMENT RENT OCTOBER FLAT 15
Account operation
Code = 007
Amount < 0
Category =
Outgoing Transfer
Why not…
Category =
Household
Expenses?
700,000 Daily Operations (Transfers)
Data Science Challenges
• We did not know the data source in advance
• We did not have a labeled set
• A fraction of texts is useless (“detection” rather than classification)
• Distribution of categories is imbalanced
• Prefer false negatives over false positives
• Very short text, language not even syntactically correct
#EUds7
Basic Pipeline (in Pro since 2016)
#EUds7
GAS BILL/OCTOBER
Source text Sentence
classification
GAS
BILL
OCTOBER
Tokenization Word
Encoding
Sentence
Pooling
• First implementation: TF-IDF features + linear classifier (98% precision, 21% recall
• Further tests with word2vec + Vector of Locally Aggregated Descriptors (VLAD)
• Implemented in Spark/Scala, using MLlib classes
• Own classes implemented for Multi-class Logistic Regression, VLAD
• Scala dependency injection useful to quickly setup variants of the above steps
…
Where Do We Get a Dataset From?
First Attempt:
Domain adaptation
Second Attempt:
“Assisted Dataset creation”
#EUds7
x y x’ y’
x y
y’
Non-Transfers Transfers
But: p(x, y) ≠ p(x’, y’)
x’
Transfers
p(x, y) = p(x’, y’)
Assisted annotation
with tagging interface.
Tags suggested by rules,
incomplete training,
similar-string matches,
and confirmed manually.
> 45K Transfers Tagged
Exploring better embeddings/poolings: w2v + VLAD
Word2vec “Synonyms”
#EUds7
viaje à billete vuelo hotel reserva billetes gastos
boda regalo casa
alquiler à piso cdad alq comunidad local garaje mes prop
calle
pago à factura fra fras fact 14 resto fac 50 facturas
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms =
model.findSynonyms(“alquiler”, 10)
Vector of Locally Aggregated Descriptors (VLAD)
#EUds7
Word 1 Word 2 Word N…
…
Cluster k
1. Assign word code
to nearest cluster
2. Compute
residuals for each
cluster
…
3. Concatenate
residuals
Embedding
Own implementation in Spark/Scala
Jegou et al., Aggregating local descriptors into
compact codes, IEEE TPAMI, 2012
Results of the Experiment
#EUds7
Method Recall @ Prec=98%
TF-IDF 21.0 %
Word2vec (avg pool) 24.5 %
Word2vec (avg pool)+
Amount
26.9 %
Word2vec (VLAD pool) 37.3 %
• The combination of w2v + VLAD is not in production
• Currently working on own multi-word embedding which yields similar results
P/R curve for a tag
We wantto be here
(high precision,
fewer false positives).
Enablers
#EUds7
7065%
3004%
1320%
741%
271%
LONDRES/%LONDON%
PARIS%
NEW%YORK%/%NUEVA%YORK%/%NEWYORK%
ROMA%
AMSTERDAM%
When do household expenses happen?
Aggregated statistics about trip destinations
Bconomy –
Identifying Rental Expenses
Other Challenges
The system goes to the wild…
• Failure cases stay within
the tolerated 2%
Open system
• Word black-list needed
• Proposal for daily quality
checks
(plots: daily evolution of
number of word-tag pairs)
M. Zinkevich, Rules of ML:
Best Practices for ML Engineering
#EUds7
Conclusions
How Spark Helped Us
• Spark (+MLlib) was crucial in this use case.
• We complemented with own implementations of Multi-Class LR and Vector of
Locally aggregated descriptors.
• We took advantage of Scala’s code injection mechanisms to frame the
classification process as a set of standardized steps.
Data Science Experience
• Running text classifier in a real production system
• Improving the standard components through experiments on word2vec and
VLAD
• Ongoing work: own embedding method
#EUds7
Luis Peinado Fuentes, BBVA Data & Analytics
Jose A. Rodriguez Serrano, BBVA Data & Analytics
Tagging Text in Money Transfers:
A Use-Case of Spark in Banking
#EUds7@BBVAData
Thanks to:
• Roberto Maestre and Advisory Team @ BBVA D&A for
insightful comments
• Everyone at BBVA who made using Spark possible
and disseminated Spark and Scala knowledge
Appendix
(Just in case)
#EUds7
Exploring better embeddings/poolings: w2v + VLAD
Word2vec “Synonyms”
#EUds7
viaje à billete vuelo hotel reserva billetes gastos boda regalo casa
alquiler à piso cdad alq comunidad local garaje mes prop calle
jose à antonio juan manuel francisco luis maria miguel carmen jesus
lloguer à pis pagament quota comunitat jordi josep despeses parking joan
2014 à 2013 06 07 08 04 05 02 09 03
madrid à viaje hotel barcelona sevilla malaga la san sl reserva
pago à factura fra fras fact 14 resto fac 50 facturas
enero à febrero marzo abril mayo septiembre noviembre junio agosto
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms =
model.findSynonyms(“alquiler”, 10)
References
Scientific literature
• Jegou et al., Aggregating local descriptors into compact codes, IEEE Trans. on Pattern Analysis
and Machine Intelligence, 2012.
• Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant
Features, ECML 1998.
• Mikolov et al., Distributed representation of words and phrases and their compositionality, NIPS
2013.
• Clinchant and Perronnin, Aggregating Continuous Word Embeddings for Information Retrieval,
Workshop on Continuous Vector Space Models and Their Compositionality, 2013.
Docs and posts
• Feature Extraction and Transformation using the RDD API
• Word Embeddings in 2017: Trends and Future Directions
• M. Zinkevich, Rules of ML: Best Practices for ML Engineering
#EUds7

More Related Content

Similar to Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano

Khaled_Waheed_resume
Khaled_Waheed_resumeKhaled_Waheed_resume
Khaled_Waheed_resumeKhaled Waheed
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerDataWorks Summit
 
PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time AnalyticsAnil Madan
 
OgH Data Visualization Special Part III
OgH Data Visualization Special Part IIIOgH Data Visualization Special Part III
OgH Data Visualization Special Part IIILuc Bors
 
Introducing Stitch
Introducing Stitch Introducing Stitch
Introducing Stitch MongoDB
 
Devoxx UK 2022 - Application security: What should the attack landscape look ...
Devoxx UK 2022 - Application security: What should the attack landscape look ...Devoxx UK 2022 - Application security: What should the attack landscape look ...
Devoxx UK 2022 - Application security: What should the attack landscape look ...Chris Swan
 
Discover Data That Matters- Deep dive into WSO2 Analytics
Discover Data That Matters- Deep dive into WSO2 AnalyticsDiscover Data That Matters- Deep dive into WSO2 Analytics
Discover Data That Matters- Deep dive into WSO2 AnalyticsSriskandarajah Suhothayan
 
How Will Knowledge Graphs Improve Clinical Reporting Workflows
How Will Knowledge Graphs Improve Clinical Reporting WorkflowsHow Will Knowledge Graphs Improve Clinical Reporting Workflows
How Will Knowledge Graphs Improve Clinical Reporting WorkflowsNeo4j
 
Manigandan Devarajan_Cards and Payments_BA_Presales_T
Manigandan Devarajan_Cards and Payments_BA_Presales_TManigandan Devarajan_Cards and Payments_BA_Presales_T
Manigandan Devarajan_Cards and Payments_BA_Presales_TManigandan Devarajan
 
MongoDB BI Connector & Tableau
MongoDB BI Connector & Tableau MongoDB BI Connector & Tableau
MongoDB BI Connector & Tableau MongoDB
 
Leverage the power of machine learning on windows
Leverage the power of machine learning on windowsLeverage the power of machine learning on windows
Leverage the power of machine learning on windowsJosé António Silva
 
354836_(General_Format)Mahaboob Basha Shaik
354836_(General_Format)Mahaboob Basha Shaik354836_(General_Format)Mahaboob Basha Shaik
354836_(General_Format)Mahaboob Basha ShaikMahaboob Basha Shaik
 
It's the markup that matters - Hidde de Vries - Codeland
It's the markup that matters - Hidde de Vries - CodelandIt's the markup that matters - Hidde de Vries - Codeland
It's the markup that matters - Hidde de Vries - CodelandHiddedeVries6
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayAjay Shriwastava
 

Similar to Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano (20)

Khaled_Waheed_resume
Khaled_Waheed_resumeKhaled_Waheed_resume
Khaled_Waheed_resume
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
 
CV RCD- Eng
CV RCD- EngCV RCD- Eng
CV RCD- Eng
 
Aamod_Chandra
Aamod_ChandraAamod_Chandra
Aamod_Chandra
 
PASHA MSBI
PASHA MSBIPASHA MSBI
PASHA MSBI
 
PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time Analytics
 
OgH Data Visualization Special Part III
OgH Data Visualization Special Part IIIOgH Data Visualization Special Part III
OgH Data Visualization Special Part III
 
Introducing Stitch
Introducing Stitch Introducing Stitch
Introducing Stitch
 
Devoxx UK 2022 - Application security: What should the attack landscape look ...
Devoxx UK 2022 - Application security: What should the attack landscape look ...Devoxx UK 2022 - Application security: What should the attack landscape look ...
Devoxx UK 2022 - Application security: What should the attack landscape look ...
 
QueensLab presentation
QueensLab presentation QueensLab presentation
QueensLab presentation
 
Discover Data That Matters- Deep dive into WSO2 Analytics
Discover Data That Matters- Deep dive into WSO2 AnalyticsDiscover Data That Matters- Deep dive into WSO2 Analytics
Discover Data That Matters- Deep dive into WSO2 Analytics
 
How Will Knowledge Graphs Improve Clinical Reporting Workflows
How Will Knowledge Graphs Improve Clinical Reporting WorkflowsHow Will Knowledge Graphs Improve Clinical Reporting Workflows
How Will Knowledge Graphs Improve Clinical Reporting Workflows
 
Manigandan Devarajan_Cards and Payments_BA_Presales_T
Manigandan Devarajan_Cards and Payments_BA_Presales_TManigandan Devarajan_Cards and Payments_BA_Presales_T
Manigandan Devarajan_Cards and Payments_BA_Presales_T
 
EDI_gowtam
EDI_gowtamEDI_gowtam
EDI_gowtam
 
ELAVARASAN.pdf
ELAVARASAN.pdfELAVARASAN.pdf
ELAVARASAN.pdf
 
MongoDB BI Connector & Tableau
MongoDB BI Connector & Tableau MongoDB BI Connector & Tableau
MongoDB BI Connector & Tableau
 
Leverage the power of machine learning on windows
Leverage the power of machine learning on windowsLeverage the power of machine learning on windows
Leverage the power of machine learning on windows
 
354836_(General_Format)Mahaboob Basha Shaik
354836_(General_Format)Mahaboob Basha Shaik354836_(General_Format)Mahaboob Basha Shaik
354836_(General_Format)Mahaboob Basha Shaik
 
It's the markup that matters - Hidde de Vries - Codeland
It's the markup that matters - Hidde de Vries - CodelandIt's the markup that matters - Hidde de Vries - Codeland
It's the markup that matters - Hidde de Vries - Codeland
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjay
 

More from Spark Summit

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit
 

More from Spark Summit (20)

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 

Recently uploaded

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 

Recently uploaded (20)

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 

Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano

  • 1. Luis Peinado Fuentes, BBVA Data & Analytics Jose A. Rodriguez Serrano, BBVA Data & Analytics Tagging Text in Money Transfers: A Use-Case of Spark in Banking #EUds7 @BBVAData
  • 2. Our Goal: Improving User Experience in Retail Banking Products with Machine Learning (Using Spark) #EUds7
  • 3. Dashboard View of Expenses and Income
  • 6. All these examples require categorized bank transactions #EUds7
  • 7. GAS COMPANY VAT 1234567X Categorizing Bank Transactions #EUds7 232 - 60.30 € 23/10/2015 ... BILL NO. A12345 Account operation Taxonomy rules (Hive / Spark) Identifier matching (Hive) Company name matching (Lucene) Text classification (Spark + MLlib) Other busines rules (Hive) 7 Million Daily Operations (Card and Bank account) A12345 Bill Transaction gas bill groceries public transport rent + About 60 categories health insurance
  • 8. The Problem: Categorizing Transfers #EUds7 007 - 60.30 € 23/10/2015 ... ... PAYMENT RENT OCTOBER FLAT 15 Account operation Code = 007 Amount < 0 Category = Outgoing Transfer Why not… Category = Household Expenses? 700,000 Daily Operations (Transfers)
  • 9. Data Science Challenges • We did not know the data source in advance • We did not have a labeled set • A fraction of texts is useless (“detection” rather than classification) • Distribution of categories is imbalanced • Prefer false negatives over false positives • Very short text, language not even syntactically correct #EUds7
  • 10. Basic Pipeline (in Pro since 2016) #EUds7 GAS BILL/OCTOBER Source text Sentence classification GAS BILL OCTOBER Tokenization Word Encoding Sentence Pooling • First implementation: TF-IDF features + linear classifier (98% precision, 21% recall • Further tests with word2vec + Vector of Locally Aggregated Descriptors (VLAD) • Implemented in Spark/Scala, using MLlib classes • Own classes implemented for Multi-class Logistic Regression, VLAD • Scala dependency injection useful to quickly setup variants of the above steps …
  • 11. Where Do We Get a Dataset From? First Attempt: Domain adaptation Second Attempt: “Assisted Dataset creation” #EUds7 x y x’ y’ x y y’ Non-Transfers Transfers But: p(x, y) ≠ p(x’, y’) x’ Transfers p(x, y) = p(x’, y’) Assisted annotation with tagging interface. Tags suggested by rules, incomplete training, similar-string matches, and confirmed manually. > 45K Transfers Tagged
  • 12. Exploring better embeddings/poolings: w2v + VLAD Word2vec “Synonyms” #EUds7 viaje à billete vuelo hotel reserva billetes gastos boda regalo casa alquiler à piso cdad alq comunidad local garaje mes prop calle pago à factura fra fras fact 14 resto fac 50 facturas val word2vec = new Word2Vec() val model = word2vec.fit(input) val synonyms = model.findSynonyms(“alquiler”, 10)
  • 13. Vector of Locally Aggregated Descriptors (VLAD) #EUds7 Word 1 Word 2 Word N… … Cluster k 1. Assign word code to nearest cluster 2. Compute residuals for each cluster … 3. Concatenate residuals Embedding Own implementation in Spark/Scala Jegou et al., Aggregating local descriptors into compact codes, IEEE TPAMI, 2012
  • 14. Results of the Experiment #EUds7 Method Recall @ Prec=98% TF-IDF 21.0 % Word2vec (avg pool) 24.5 % Word2vec (avg pool)+ Amount 26.9 % Word2vec (VLAD pool) 37.3 % • The combination of w2v + VLAD is not in production • Currently working on own multi-word embedding which yields similar results P/R curve for a tag We wantto be here (high precision, fewer false positives).
  • 15. Enablers #EUds7 7065% 3004% 1320% 741% 271% LONDRES/%LONDON% PARIS% NEW%YORK%/%NUEVA%YORK%/%NEWYORK% ROMA% AMSTERDAM% When do household expenses happen? Aggregated statistics about trip destinations Bconomy – Identifying Rental Expenses
  • 16. Other Challenges The system goes to the wild… • Failure cases stay within the tolerated 2% Open system • Word black-list needed • Proposal for daily quality checks (plots: daily evolution of number of word-tag pairs) M. Zinkevich, Rules of ML: Best Practices for ML Engineering #EUds7
  • 17. Conclusions How Spark Helped Us • Spark (+MLlib) was crucial in this use case. • We complemented with own implementations of Multi-Class LR and Vector of Locally aggregated descriptors. • We took advantage of Scala’s code injection mechanisms to frame the classification process as a set of standardized steps. Data Science Experience • Running text classifier in a real production system • Improving the standard components through experiments on word2vec and VLAD • Ongoing work: own embedding method #EUds7
  • 18. Luis Peinado Fuentes, BBVA Data & Analytics Jose A. Rodriguez Serrano, BBVA Data & Analytics Tagging Text in Money Transfers: A Use-Case of Spark in Banking #EUds7@BBVAData Thanks to: • Roberto Maestre and Advisory Team @ BBVA D&A for insightful comments • Everyone at BBVA who made using Spark possible and disseminated Spark and Scala knowledge
  • 20. Exploring better embeddings/poolings: w2v + VLAD Word2vec “Synonyms” #EUds7 viaje à billete vuelo hotel reserva billetes gastos boda regalo casa alquiler à piso cdad alq comunidad local garaje mes prop calle jose à antonio juan manuel francisco luis maria miguel carmen jesus lloguer à pis pagament quota comunitat jordi josep despeses parking joan 2014 à 2013 06 07 08 04 05 02 09 03 madrid à viaje hotel barcelona sevilla malaga la san sl reserva pago à factura fra fras fact 14 resto fac 50 facturas enero à febrero marzo abril mayo septiembre noviembre junio agosto val word2vec = new Word2Vec() val model = word2vec.fit(input) val synonyms = model.findSynonyms(“alquiler”, 10)
  • 21. References Scientific literature • Jegou et al., Aggregating local descriptors into compact codes, IEEE Trans. on Pattern Analysis and Machine Intelligence, 2012. • Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, ECML 1998. • Mikolov et al., Distributed representation of words and phrases and their compositionality, NIPS 2013. • Clinchant and Perronnin, Aggregating Continuous Word Embeddings for Information Retrieval, Workshop on Continuous Vector Space Models and Their Compositionality, 2013. Docs and posts • Feature Extraction and Transformation using the RDD API • Word Embeddings in 2017: Trends and Future Directions • M. Zinkevich, Rules of ML: Best Practices for ML Engineering #EUds7