Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano

Luis Peinado Fuentes, BBVA Data & Analytics
Jose A. Rodriguez Serrano, BBVA Data & Analytics
Tagging Text in Money Transfers:
A Use-Case of Spark in Banking
#EUds7 @BBVAData

Our Goal:
Improving User Experience
in Retail Banking Products
with Machine Learning
(Using Spark)
#EUds7

Dashboard View of Expenses and Income

Customer Comparison Engine
#EUds7

All these examples require
categorized bank transactions
#EUds7

GAS COMPANY VAT 1234567X
Categorizing Bank Transactions
#EUds7
232 - 60.30 € 23/10/2015
...
BILL NO. A12345
Account operation
Taxonomy rules
(Hive / Spark)
Identifier matching
(Hive)
Company name
matching
(Lucene)
Text classification
(Spark + MLlib)
Other busines
rules (Hive)
7 Million Daily Operations
(Card and Bank account)
A12345
Bill Transaction
gas bill
groceries
public transport
rent
+ About 60 categories
health insurance

The Problem: Categorizing Transfers
#EUds7
007 - 60.30 € 23/10/2015 ...
...
PAYMENT RENT OCTOBER FLAT 15
Account operation
Code = 007
Amount < 0
Category =
Outgoing Transfer
Why not…
Category =
Household
Expenses?
700,000 Daily Operations (Transfers)

Data Science Challenges
• We did not know the data source in advance
• We did not have a labeled set
• A fraction of texts is useless (“detection” rather than classification)
• Distribution of categories is imbalanced
• Prefer false negatives over false positives
• Very short text, language not even syntactically correct
#EUds7

Basic Pipeline (in Pro since 2016)
#EUds7
GAS BILL/OCTOBER
Source text Sentence
classification
GAS
BILL
OCTOBER
Tokenization Word
Encoding
Sentence
Pooling
• First implementation: TF-IDF features + linear classifier (98% precision, 21% recall
• Further tests with word2vec + Vector of Locally Aggregated Descriptors (VLAD)
• Implemented in Spark/Scala, using MLlib classes
• Own classes implemented for Multi-class Logistic Regression, VLAD
• Scala dependency injection useful to quickly setup variants of the above steps
…

Where Do We Get a Dataset From?
First Attempt:
Domain adaptation
Second Attempt:
“Assisted Dataset creation”
#EUds7
x y x’ y’
x y
y’
Non-Transfers Transfers
But: p(x, y) ≠ p(x’, y’)
x’
Transfers
p(x, y) = p(x’, y’)
Assisted annotation
with tagging interface.
Tags suggested by rules,
incomplete training,
similar-string matches,
and confirmed manually.
> 45K Transfers Tagged

Exploring better embeddings/poolings: w2v + VLAD
Word2vec “Synonyms”
#EUds7
viaje à billete vuelo hotel reserva billetes gastos
boda regalo casa
alquiler à piso cdad alq comunidad local garaje mes prop
calle
pago à factura fra fras fact 14 resto fac 50 facturas
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms =
model.findSynonyms(“alquiler”, 10)

Vector of Locally Aggregated Descriptors (VLAD)
#EUds7
Word 1 Word 2 Word N…
…
Cluster k
1. Assign word code
to nearest cluster
2. Compute
residuals for each
cluster
…
3. Concatenate
residuals
Embedding
Own implementation in Spark/Scala
Jegou et al., Aggregating local descriptors into
compact codes, IEEE TPAMI, 2012

Results of the Experiment
#EUds7
Method Recall @ Prec=98%
TF-IDF 21.0 %
Word2vec (avg pool) 24.5 %
Word2vec (avg pool)+
Amount
26.9 %
Word2vec (VLAD pool) 37.3 %
• The combination of w2v + VLAD is not in production
• Currently working on own multi-word embedding which yields similar results
P/R curve for a tag
We wantto be here
(high precision,
fewer false positives).

Enablers
#EUds7
7065%
3004%
1320%
741%
271%
LONDRES/%LONDON%
PARIS%
NEW%YORK%/%NUEVA%YORK%/%NEWYORK%
ROMA%
AMSTERDAM%
When do household expenses happen?
Aggregated statistics about trip destinations
Bconomy –
Identifying Rental Expenses

Other Challenges
The system goes to the wild…
• Failure cases stay within
the tolerated 2%
Open system
• Word black-list needed
• Proposal for daily quality
checks
(plots: daily evolution of
number of word-tag pairs)
M. Zinkevich, Rules of ML:
Best Practices for ML Engineering
#EUds7

Conclusions
How Spark Helped Us
• Spark (+MLlib) was crucial in this use case.
• We complemented with own implementations of Multi-Class LR and Vector of
Locally aggregated descriptors.
• We took advantage of Scala’s code injection mechanisms to frame the
classification process as a set of standardized steps.
Data Science Experience
• Running text classifier in a real production system
• Improving the standard components through experiments on word2vec and
VLAD
• Ongoing work: own embedding method
#EUds7

Luis Peinado Fuentes, BBVA Data & Analytics
Jose A. Rodriguez Serrano, BBVA Data & Analytics
Tagging Text in Money Transfers:
A Use-Case of Spark in Banking
#EUds7@BBVAData
Thanks to:
• Roberto Maestre and Advisory Team @ BBVA D&A for
insightful comments
• Everyone at BBVA who made using Spark possible
and disseminated Spark and Scala knowledge

Appendix
(Just in case)
#EUds7

Exploring better embeddings/poolings: w2v + VLAD
Word2vec “Synonyms”
#EUds7
viaje à billete vuelo hotel reserva billetes gastos boda regalo casa
alquiler à piso cdad alq comunidad local garaje mes prop calle
jose à antonio juan manuel francisco luis maria miguel carmen jesus
lloguer à pis pagament quota comunitat jordi josep despeses parking joan
2014 à 2013 06 07 08 04 05 02 09 03
madrid à viaje hotel barcelona sevilla malaga la san sl reserva
pago à factura fra fras fact 14 resto fac 50 facturas
enero à febrero marzo abril mayo septiembre noviembre junio agosto
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms =
model.findSynonyms(“alquiler”, 10)

References
Scientific literature
• Jegou et al., Aggregating local descriptors into compact codes, IEEE Trans. on Pattern Analysis
and Machine Intelligence, 2012.
• Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant
Features, ECML 1998.
• Mikolov et al., Distributed representation of words and phrases and their compositionality, NIPS
2013.
• Clinchant and Perronnin, Aggregating Continuous Word Embeddings for Information Retrieval,
Workshop on Continuous Vector Space Models and Their Compositionality, 2013.
Docs and posts
• Feature Extraction and Transformation using the RDD API
• Word Embeddings in 2017: Trends and Future Directions
• M. Zinkevich, Rules of ML: Best Practices for ML Engineering
#EUds7

Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano

Recommended

Recommended

More Related Content

Similar to Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano

Similar to Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano (20)

More from Spark Summit

More from Spark Summit (20)

Recently uploaded

Recently uploaded (20)

Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano