What's happening to my clients? Extracting value from news articles

WHAT'S HAPPENING TO MY CLIENTS*?
Extracting value from news articles
* or partners, competitors, suppliers, etc…
Ugo Scaiella @ Speck&Tech
7 Apr 2022

ABOUT ME
★ Master + 3 yrs research on IR/ML @ UniPI
★ 2 years @ ION Trading as SWE
★ From 2013 @ spaziodati.eu
★ Led DandelionAPI dev, now SWE Manager
★ Strongly-typed languages lover
★ In troubled relationship with Guido
★ Wagyu addicted
★ Wannabe grill master
★ Father of 4
[Image: me, working on Neural Alcoholic Networks]

ATOKA.IO Info about
6M companies in Italy

WHAT IF MY PORTFOLIO IS MADE OF THOUSANDS OF SMEs?

LET'S TRY TO SOLVE THIS
★ COMPANYTXT: entity linking vs Atoka 6M companies DB
★ SEDANO: our news processing pipeline
★ ATOKA NEWS: search engine and news monitoring for end users

THIS IS COMPANYTXT
… Michele Barbera presenta il nuovo prodotto di SpazioDati, Atoka …
(… Michele Barbera presents SpazioDati's new product, Atoka …)
★ MENTION IDENTIFICATION
★ CANDIDATE EXTRACTION
✴ SPAZIODATI: 2 entities
SpazioDati (Trento), Spazio Dati (Sassuolo)
✴ MICHELE BARBERA: 95 entities
…, CEO @SpazioDati, …
★ DISAMBIGUATION

MENTION IDENTIFICATION
CANNOT BE SYNTAX BASED ONLY
CAVIT CANTINA VITICOLTORI CONSORZIO CANTINE SOCIALI DEL TRENTINO SOCIETA'
COOPERATIVA PIU' BREVEMENTE CAVIT S.C. PER FINALITA' PRODUTTIVE POTRA' OPERARE
ANCHE COME CANTINA PRODUTTORI, VITICOLTORI TRENTINI, VINTRENTO, TRENTINA VINI,
VILLALTA, ACCADEMIA DELLO SPUMANTE TRENTINO, CAVIT, C.V., C.C.S.T., RA.VIN
CANNOT BE REGEX ONLY
PANINI
GRUPPO DISTRIBUZIONE
DIMENSION
AZIENDA TRASPORTI
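
To make the point above concrete, here is a minimal sketch (not the CompanyTXT implementation) comparing a legal-suffix regex with a lookup over pre-computed aliases. The alias table, the company IDs and the sample sentences are all made up for illustration. The regex misses a bare "Panini", while the alias lookup also fires on "panini" meaning sandwiches, which is why a model has to make the final call.

```python
import re

# Hypothetical alias table: lower-cased surface form -> made-up company ID.
ALIASES = {"panini": "C1", "cavit": "C2", "gruppo distribuzione": "C3"}

# Naive rule: one capitalised word followed by a legal-form suffix.
SUFFIX_RULE = re.compile(r"\b([A-Z][A-Za-z]+) (?:S\.P\.A\.|S\.R\.L\.|S\.C\.)")


def regex_mentions(text):
    """Return names found by the legal-suffix rule only."""
    return [m.group(1) for m in SUFFIX_RULE.finditer(text)]


def alias_mentions(text, max_len=3):
    """Return every token n-gram (up to max_len tokens) found in ALIASES."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    for i in range(len(tokens)):
        # Never read past the end of the token list.
        for n in range(1, min(max_len, len(tokens) - i) + 1):
            span = " ".join(tokens[i:i + n])
            if span in ALIASES:
                found.append(span)
    return found


# The regex only works when the legal form is spelled out.
print(regex_mentions("Cavit S.C. presenta i risultati"))
print(regex_mentions("Panini annuncia una nuova collezione"))
# The alias lookup finds the company, but also the sandwiches.
print(alias_mentions("Panini annuncia una nuova collezione"))
print(alias_mentions("Ho mangiato due panini a pranzo"))
```

Neither rule can tell the two "panini" apart; that decision needs context, which is what the NER step is for.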

CANDIDATE SELECTION
FUZZY MATCHING
yeah, but "authoritative sources" ⇏ "good data":
"V.N.P. - VALSA NUOVA PERLINO S.P.A.,"SIGLABILE:"V.N.P. S.P.A" "
VALSA S.P.A.","PERLINO S.P.A.","PERLINO OPTIMA S.P.A.","V.A.T. S.P.A.",
"P.A.T. S.P.A.", "P.O. S.P.A.","SCANAVINO S.P.A.","FILIPETTI S.P.A.","CA' VERGANA S.P.A.",
"TERRE DEI SESI S.P.A.","SANDILIANO S.P.A.","TERRE DEI SOLARI S.P
DEAL WITH ACRONYMS
SOCIETA' NAZIONALE APPALTI MANUTENZIONI LAZIO SUD S.N.A.M. SOCIETA' A
RESPONSABILITA LIMITATA
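
A minimal sketch of candidate selection that combines fuzzy name matching with a naive acronym check. It uses the standard-library `difflib` for similarity, which is an assumption made for illustration and not necessarily what CompanyTXT uses. The registry IDs, the list of legal forms and the 0.8 threshold are hypothetical. As the S.N.A.M. example shows, real acronyms do not always cover every word of the name, so a first-letter rule like this one is only a starting point.

```python
import re
from difflib import SequenceMatcher

# Tiny, hypothetical list of legal-form tokens to ignore when comparing names.
LEGAL_FORMS = {"spa", "srl", "snc", "sas"}


def name_key(name):
    """Lower-case, strip punctuation and legal forms: 'V.N.P. S.P.A.' -> 'vnp'."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(t for t in cleaned.split() if t not in LEGAL_FORMS)


def acronym(name):
    """First letter of every remaining token: 'VALSA NUOVA PERLINO' -> 'vnp'."""
    return "".join(token[0] for token in name_key(name).split())


def candidates(query, registry, threshold=0.8):
    """Return IDs whose official name is similar to, or abbreviated by, query."""
    q = name_key(query)
    if not q:
        return []
    hits = []
    for company_id, official_name in registry.items():
        similarity = SequenceMatcher(None, q, name_key(official_name)).ratio()
        if similarity >= threshold or q == acronym(official_name):
            hits.append(company_id)
    return hits


# Names taken from the slide above; the IDs are made up.
REGISTRY = {
    "R1": "VALSA NUOVA PERLINO S.P.A.",
    "R2": "SCANAVINO S.P.A.",
    "R3": "FILIPETTI S.P.A.",
}

print(candidates("V.N.P.", REGISTRY))        # matched through the acronym
print(candidates("Scanavno Spa", REGISTRY))  # matched despite the typo
```

Scanning the whole registry per query is fine for a sketch; with 6M companies the name keys would have to be indexed ahead of time.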

NOT JOKING
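GE.A. S.R.L. (LA LETTERA E DELLA PAROLA
GE.A. DEVE INTENDERSI SCRITTA CON
CARATTERI MINUSCOLI)
[i.e. "the letter E of the word GE.A. is to be read as lowercase"]
U GARIXAN DI ZUNINO ZULEIKA E CAMILLO
LORENZO - S.N.C. ***(LA LETTERA "A" DELLA
PAROLA GARIXAN E' DA INTENDERSI
ACCENTATA CON ACCENTO ACUTO)***
[i.e. "the letter A of the word GARIXAN is to be read with an acute accent"]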
SOCIETA' COOPERATIVA DI CONSUMO DI
GNOCCA

DISAMBIGUATION
Ideally, let's exploit
everything we know about
the company:
★ Locations
★ Sectors
★ Related companies
★ Key people
Hard part is to mix
everything together
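[Figure: context graph around SpazioDati, linking locations (Pisa, Trento), key people (Gabriele Antonelli, Michele Barbera), a related company (Cerved Group) and sectors (Big Data, Business Intelligence, Lead generation)]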
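
One simple way to "mix everything together" is a weighted overlap between what the article mentions and what is known about each candidate. The sketch below is only an illustration: the talk says the real disambiguator is a random forest, whereas this uses hand-picked weights. The `Candidate` class, the weights, the IDs and most attribute values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    company_id: str
    locations: set = field(default_factory=set)
    people: set = field(default_factory=set)
    sectors: set = field(default_factory=set)


# Made-up weights: a matching key person counts more than a matching sector.
WEIGHTS = {"locations": 1.0, "people": 2.0, "sectors": 0.5}


def score(candidate, context_terms):
    """Weighted count of candidate attributes that also appear in the article."""
    return sum(
        weight * len(getattr(candidate, feature) & context_terms)
        for feature, weight in WEIGHTS.items()
    )


def disambiguate(candidates, context_terms, min_score=1.0):
    """Return the best-scoring candidate, or None if the evidence is too weak."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(c, context_terms))
    return best if score(best, context_terms) >= min_score else None


# The two "SpazioDati" entities from the earlier slide, with illustrative attributes.
trento = Candidate("SD-TRENTO", {"trento", "pisa"}, {"michele barbera"}, {"big data"})
sassuolo = Candidate("SD-SASSUOLO", {"sassuolo"})

# Terms extracted from "… Michele Barbera presenta il nuovo prodotto di SpazioDati, Atoka …"
context = {"michele barbera", "atoka"}
winner = disambiguate([trento, sassuolo], context)
print(winner.company_id if winner else None)
```

Returning `None` below a minimum score matters in practice: linking a mention to the wrong company is usually worse than leaving it unlinked.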

IMPLEMENTATION
★ MENTIONS
✴ Current: pattern matching on pre-computed mentions + NER
✴ Working on: fine-tuned NER
★ CANDIDATES
✴ Current: pre-computed
✴ Working on: fuzzy matching on names
★ DISAMBIGUATION
✴ Current: only structured links (people, companies)
✴ Working on: add also contextual information about locations and activity
★ NLP PIPELINE
✴ Current: separated steps
✴ Working on: fully integrated pipeline
★ LANGUAGES
✴ Current: only Italian
✴ Working on: major EU languages

RESULTS
★ Only società di capitali (limited companies)
★ Not bad, but not WOW!
★ Huge room for improvement

TECH STACK
★ Current: mainly Java
★ NER and disambiguator: simple and fast random forests
★ Ad hoc, optimized data structures
★ It's getting old 🥺
★ Now working with: BERT, TensorFlow and NNs
★ Still not taking timing into account 😬

TAKEAWAYS
★ Never use an Apple M1 for these jobs… maybe in a couple of years
★ Language models are REALLY effective, it's not just hype
… but if you want to reach real-world quality, you have to adapt them
★ NLP building blocks (POS taggers, encoders, etc.) are now a commodity
… but the availability of good training data for end-to-end tasks is still THE problem
★ You need GPUs for those models only when you have a lot of data
… and in that case, GPUs really make the difference
[Image: a sweating g3.16xlarge]

MAIN PIPELINE
★ Cleansing: dirty work
★ Deduplication: remove duplicate articles
★ Annotation: company annotations
★ Classifier A: business events
★ Classifier B: locations

CLEANSING
Il titolo ? arrivato a perdere
oltre il 3 per cento. n n
Tronchetti: <Apertura in
calo? Vediamo prossimi
mesi,l'azienda ? solida>
di Andrea Fontana
seguici su Twitter
★ Data cleansing is like cleaning sewer pipes: someone has to do it
★ Web news is web data, so HORRIBLE DATA
★ NLP tools are significantly affected by bad input text
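
As a rough illustration of the "dirty work", the sketch below applies a few hand-written rules to the sample text from this slide. The rules and the boilerplate list are hypothetical, not the ones used in the real pipeline. Note what it cannot do: the `?` standing in for a lost accented letter is left untouched, because no simple rule can tell it apart from a genuine question mark, and the byline is kept as well.

```python
import re

# Hypothetical boilerplate phrases to drop wherever they appear.
BOILERPLATE = [re.compile(r"seguici su twitter", re.IGNORECASE)]


def clean(text):
    """Apply a few conservative clean-up rules and return tidy lines."""
    # Stand-alone "n" tokens are what is left of "\n" after the backslash was lost.
    text = re.sub(r"(?:\s+n\b)+", "\n", text)
    for pattern in BOILERPLATE:
        text = pattern.sub("", text)
    # Restore the space after a comma, but leave decimals such as "3,5" alone.
    text = re.sub(r",(?=[^\s\d])", ", ", text)
    # Collapse repeated whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


raw = (
    "Il titolo ? arrivato a perdere oltre il 3 per cento. n n "
    "Tronchetti: <Apertura in calo? Vediamo prossimi mesi,l'azienda ? solida> "
    "di Andrea Fontana seguici su Twitter"
)
print(clean(raw))
```

Rules like these are cheap but brittle; each news source tends to need its own list.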

DEDUPLICATION
★ Same article, different newspapers
★ Stopword removal + stemming + discarding the shortest phrases + locality-sensitive hashing
★ Streaming approach, using Redis for caching
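
The recipe above can be sketched with MinHash signatures and banded locality-sensitive hashing. This is an illustration under several assumptions, not the production code: the stopword list is a tiny English stand-in, `stem` is a crude prefix cut in place of a real stemmer, the sample articles are invented, and an in-memory dictionary plays the role that Redis has in the streaming setup.

```python
import hashlib
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "at", "on", "in", "after", "over", "about"}
NUM_HASHES = 32
BANDS = 8
ROWS = NUM_HASHES // BANDS  # 4 signature values per band


def stem(token):
    """Crude stand-in for a real stemmer: keep the first five characters."""
    return token[:5]


def shingles(text, min_tokens=4, size=3):
    """Three-token shingles from phrases that survive stopword removal."""
    result = set()
    for phrase in re.split(r"[.!?\n]+", text.lower()):
        tokens = [stem(t) for t in re.findall(r"\w+", phrase) if t not in STOPWORDS]
        if len(tokens) < min_tokens:  # discard the shortest phrases
            continue
        for i in range(len(tokens) - size + 1):
            result.add(" ".join(tokens[i:i + size]))
    return result


def _hash(seed, value):
    digest = hashlib.blake2b(f"{seed}:{value}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(shingle_set):
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES))


class Deduplicator:
    def __init__(self):
        self.buckets = defaultdict(set)  # in-memory stand-in for Redis

    def seen_before(self, doc_id, text):
        """Index the article and report whether it shares a band with an earlier one."""
        shingle_set = shingles(text)
        if not shingle_set:
            return False
        signature = minhash(shingle_set)
        keys = [(band, signature[band * ROWS:(band + 1) * ROWS]) for band in range(BANDS)]
        duplicate = any(key in self.buckets for key in keys)
        for key in keys:
            self.buckets[key].add(doc_id)
        return duplicate


first = "The board of Acme appointed a new chief executive after the annual shareholders meeting."
copy = "THE BOARD OF ACME APPOINTED A NEW CHIEF EXECUTIVE, AFTER THE ANNUAL SHAREHOLDERS MEETING"
other = "Workers at a steel plant went on strike over a dispute about night shift pay."

dedup = Deduplicator()
print(dedup.seen_before("n1", first), dedup.seen_before("n2", copy), dedup.seen_before("n3", other))
```

Because each article is checked against the index and then added to it in a single pass, the approach fits a stream of incoming news; more bands with fewer rows would catch looser rewrites at the cost of more false positives.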

BUSINESS EVENTS
★ Management changes
★ Economic results
★ Launch of new products
★ Failures
★ Accidents
★ Strikes
★ …

AND MANY OTHERS
★ E-S-G themes
✴ Environmental
✴ Social
✴ Governance
★ Sentiment
★ Locations: provinces

TECH STACK
★ Mainly Python
★ Celery for distributed tasks
★ Django + Postgres for API
★ gRPC + Golang for core clustering algorithm
★ Classifiers: scikit-learn and TF
★ S3 for raw data
★ ES for news articles storage
★ Redis for caching
★ K8s cluster on AWS
★ Sentry, ELK, Prometheus, Grafana

CHALLENGES
★ Scalability
★ Streaming
★ Operational burden
★ Idempotency

LAST BUT NOT LEAST…
SPAZIODATI = Startup culture (team of 40, several in this room, talk to us!)
within one of the largest fintech companies in the world: ION Group (>15k employees)
https://spaziodati.eu/en/jobs/
But if you’re smart we welcome any role, even if not listed :-)
Yes, vegetarians too!

CREDITS: This presentation template was created by Slidesgo,
including icons by Flaticon and infographics & images by Freepik
THANKS!
QUESTIONS?
