ABSTRACT: When we think about web news as a source of information, it seems there is nothing new to talk about. Google News and similar services have been around for years, and anyone can access that huge amount of information, mostly for free. However, if you want to extract real value from news articles, things get much more complicated. At SpazioDati we focus on collecting as much information as possible about all Italian companies from many different sources, and news is one of the richest, but also one of the hardest, sources we deal with. So we built Sedano, our news processing pipeline, which ingests, cleans, deduplicates, annotates, classifies and clusters several thousand news articles per day and makes them available to our users. We will talk about the challenges we faced, the solutions we implemented, and the open issues we are currently working on.
BIO: Ugo Scaiella is Software Engineering Manager at SpazioDati where he leads a team of more than 30 highly skilled and talented engineers and data scientists. Previously, he spent several years developing and playing with Machine Learning and Information Retrieval systems both in industry and academic environments. When not dealing with crazy deadlines and an insane amount of projects simultaneously, you might find him grilling a wagyu steak or waiting for the pork ribs to reach 98°C inside.
What's happening to my clients? Extracting value from news articles
1. WHAT'S HAPPENING
TO MY CLIENTS*?
Extracting value from news articles
* or partners, competitors, suppliers, etc…
Ugo Scaiella @ Speck&Tech
7 Apr 2022
2. ★ Master + 3 yrs research on IR/ML @ UniPI
★ 2 years @ ION Trading as SWE
★ From 2013 @ spaziodati.eu
★ Led DandelionAPI dev, now SWE Manager
★ Strongly-typed languages lover
★ In troubled relationship with Guido
Wagyu addicted
Wannabe grill master
Father of 4
ABOUT ME
Me, working on
Neural Alcoholic
Networks
10. COMPANYTXT
Entity linking vs Atoka 6M
companies DB
SEDANO
Our news processing
pipeline
ATOKA NEWS
Search engine and news
monitoring for end users
LET'S TRY TO SOLVE THIS
12. SPAZIODATI
2 entities
★ SpazioDati (Trento)
★ Spazio Dati (Sassuolo)
MICHELE BARBERA
95 entities
★ …
★ CEO @SpazioDati
★ …
THIS IS COMPANYTXT
MENTION
IDENTIFICATION
CANDIDATE
EXTRACTION
DISAMBIGUATION
… Michele Barbera presenta
il nuovo prodotto di
SpazioDati, Atoka …
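The three CompanyTXT steps can be sketched on the sample sentence as follows. This is a toy illustration, not the production code: the knowledge base, the entity ids and the context words are all made up, and a plain word-overlap count stands in for the real disambiguator.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base: surface form -> candidate entity ids.
# The ids are invented for this sketch.
KB = {
    "spaziodati": ["spaziodati-trento", "spazio-dati-sassuolo"],
    "michele barbera": ["barbera-ceo-spaziodati", "barbera-other"],
}

# Hypothetical context words attached to each entity.
CONTEXT = {
    "spaziodati-trento": {"atoka", "trento"},
    "spazio-dati-sassuolo": {"sassuolo"},
    "barbera-ceo-spaziodati": {"spaziodati", "atoka"},
    "barbera-other": set(),
}


@dataclass
class Link:
    mention: str
    entity: str


def identify_mentions(text):
    """Step 1: find spans that may refer to an entity (here: dictionary lookup)."""
    lowered = text.lower()
    return [name for name in KB if name in lowered]


def extract_candidates(mention):
    """Step 2: list every entity the mention could refer to."""
    return KB[mention]


def disambiguate(candidates, text):
    """Step 3: keep the candidate whose context overlaps most with the text."""
    words = set(text.lower().replace(",", " ").split())
    return max(candidates, key=lambda c: len(CONTEXT[c] & words))


def link(text):
    """Run the three steps in sequence."""
    return [
        Link(m, disambiguate(extract_candidates(m), text))
        for m in identify_mentions(text)
    ]


text = "Michele Barbera presenta il nuovo prodotto di SpazioDati, Atoka"
links = link(text)
```

Here "SpazioDati" resolves to the Trento entity and "Michele Barbera" to the CEO entry, only because "Atoka" and "SpazioDati" appear in the same sentence.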
13. MENTION IDENTIFICATION
CANNOT BE SYNTAX BASED ONLY
CAVIT CANTINA VITICOLTORI CONSORZIO CANTINE SOCIALI DEL TRENTINO SOCIETA'
COOPERATIVA PIU' BREVEMENTE CAVIT S.C. PER FINALITA' PRODUTTIVE POTRA' OPERARE
ANCHE COME CANTINA PRODUTTORI, VITICOLTORI TRENTINI, VINTRENTO, TRENTINA VINI,
VILLALTA, ACCADEMIA DELLO SPUMANTE TRENTINO, CAVIT, C.V., C.C.S.T., RA.VIN
CANNOT BE REGEX ONLY
PANINI
GRUPPO DISTRIBUZIONE
DIMENSION
AZIENDA TRASPORTI
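A small sketch of why neither approach works alone. The regex rule, the name list and the sample sentences are all hypothetical: a legal-form regex misses companies that the press cites by bare name, while a bare-name lookup fires on ordinary words.

```python
import re

# Naive rule (assumption): capitalised words followed by a legal form.
LEGAL_FORM_RULE = re.compile(r"((?:[A-Z][\w']*\s)+)(?:S\.p\.A\.|S\.r\.l\.)")

# Naive dictionary (assumption): company names that are also common words.
COMPANY_NAMES = {"panini", "dimension"}


def regex_mentions(text):
    """Return capitalised phrases that are followed by S.p.A. or S.r.l."""
    return [m.group(1).strip() for m in LEGAL_FORM_RULE.finditer(text)]


def dictionary_mentions(text):
    """Return every token that equals a known company name."""
    return [t for t in re.findall(r"\w+", text.lower()) if t in COMPANY_NAMES]


# The regex works only when the legal form is written out...
found = regex_mentions("Panini S.p.A. lancia un nuovo album")
# ...but news articles usually drop it, so the mention is missed.
missed = regex_mentions("Panini lancia un nuovo album")
# The dictionary lookup instead matches sandwiches ("panini") at the bar.
false_positive = dictionary_mentions("Compro due panini al bar")
```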
14. CANDIDATE SELECTION
FUZZY MATCHING
yeah, but "authoritative sources" ⇏ "good data":
"V.N.P. - VALSA NUOVA PERLINO S.P.A.,"SIGLABILE:"V.N.P. S.P.A" "
VALSA S.P.A.","PERLINO S.P.A.","PERLINO OPTIMA S.P.A.","V.A.T. S.P.A.",
"P.A.T. S.P.A.", "P.O. S.P.A.","SCANAVINO S.P.A.","FILIPETTI S.P.A.","CA' V
ERGANA S.P.A.","TERRE DEI SESI S.P.A.","SANDILIANO S.P.A.","TERRE
DEI SOLARI S.P
DEAL WITH ACRONYMS
SOCIETA' NAZIONALE APPALTI MANUTENZIONI LAZIO SUD S.N.A.M. SOCIET A' A
RESPONSABILITA LIMITATA
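A minimal sketch of candidate selection with fuzzy matching and acronym handling, using only the Python standard library. The legal-form list, the 0.8 threshold and the rule "an acronym match counts as a full match" are assumptions made for illustration, not the rules used in CompanyTXT.

```python
import re
from difflib import SequenceMatcher

# Illustrative, incomplete list of Italian legal forms (dots already removed).
LEGAL_FORMS = {"spa", "srl", "snc", "sas", "sc"}


def normalize(name):
    """Lowercase, drop dots and apostrophes, remove legal-form tokens."""
    cleaned = re.sub(r"[.']", "", name.lower())
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return [t for t in tokens if t not in LEGAL_FORMS]


def acronym(tokens):
    """First letters of a multi-word name; empty for single-word names."""
    return "".join(t[0] for t in tokens) if len(tokens) >= 2 else ""


def similarity(a, b):
    """Score in [0, 1]: 1.0 on acronym match, else a character-level ratio."""
    ta, tb = normalize(a), normalize(b)
    sa, sb = " ".join(ta), " ".join(tb)
    if not sa or not sb:
        return 0.0
    if sa == acronym(tb) or sb == acronym(ta):
        return 1.0
    return SequenceMatcher(None, sa, sb).ratio()


def candidates(mention, names, threshold=0.8):
    """Return the names scoring at least `threshold`, best first."""
    scored = [(similarity(mention, n), n) for n in names]
    return [n for s, n in sorted(scored, reverse=True) if s >= threshold]


registry = ["VALSA NUOVA PERLINO S.P.A.", "SCANAVINO S.P.A.", "FILIPETTI S.P.A."]
matches = candidates("VNP", registry)
```

With this toy registry, the mention "VNP" retrieves only "VALSA NUOVA PERLINO S.P.A.". Official names that embed their own acronym, like the S.N.A.M. example above, would need extra handling that this sketch does not attempt.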
15. NOT JOKING
GE.A. S.R.L. (LA LETTERA E DELLA PAROLA
GE.A. DEVE INTENDERSI SCRITTA CON
CARATTERI MINUSCOLI)
U GARIXAN DI ZUNINO ZULEIKA E CAMILLO
LORENZO - S.N.C. ***(LA LETTERA "A" DELLA
PA ROLA GARIXAN E' DA INTENDERSI
ACCENTATA CON ACCENTO ACUTO)***
SOCIETA' COOPERATIVA DI CONSUMO DI
GNOCCA
16. DISAMBIGUATION
Ideally, let's exploit
everything we know about
the company:
★ Locations
★ Sectors
★ Related companies
★ Key people
Hard part is to mix
everything together
Pisa
Trento
Gabriele
Antonelli
Michele
Barbera
SpazioDati
Cerved Group Big Data
Business
Intelligence
Lead
generation
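One simple way to mix the signals is a weighted count of how many known attributes of each candidate appear in the article. This is a hand-weighted sketch, not the random-forest disambiguator mentioned later in the talk: the weights are arbitrary and the attributes of the Sassuolo company are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Company:
    name: str
    locations: set = field(default_factory=set)
    sectors: set = field(default_factory=set)
    related: set = field(default_factory=set)
    people: set = field(default_factory=set)


# Hypothetical weights: people and related companies count more than places.
WEIGHTS = {"locations": 1.0, "sectors": 1.0, "related": 2.0, "people": 3.0}


def score(company, article_text):
    """Weighted number of the company's known attributes found in the text."""
    text = article_text.lower()
    total = 0.0
    for feature, weight in WEIGHTS.items():
        values = getattr(company, feature)
        total += weight * sum(1 for v in values if v.lower() in text)
    return total


def disambiguate(candidates, article_text):
    """Pick the candidate with the highest score."""
    return max(candidates, key=lambda c: score(c, article_text))


trento = Company(
    "SpazioDati (Trento)",
    locations={"Trento", "Pisa"},
    sectors={"Big Data", "Business Intelligence", "Lead generation"},
    related={"Cerved Group"},
    people={"Michele Barbera", "Gabriele Antonelli"},
)
# Attributes below are made up for the sketch.
sassuolo = Company("Spazio Dati (Sassuolo)", locations={"Sassuolo"})

article = "Michele Barbera presenta il nuovo prodotto di SpazioDati, Atoka"
best = disambiguate([sassuolo, trento], article)
```

In practice the weights would be learned from labelled data rather than set by hand, which is where the "hard part" of mixing everything together lies.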
17. IMPLEMENTATION (current → working on)
MENTIONS: Pattern matching on pre-computed mentions + NER → Fine-tuned NER
CANDIDATES: Pre-computed → Fuzzy matching on names
DISAMBIGUATION: Only structured links (people, companies) → Add also contextual information about locations and activity
NLP PIPELINE: Separated steps → Fully integrated pipeline
LANGUAGES: Only Italian → Major EU languages
18. ★ Only Società di Capitale
★ Not bad, but not WOW!
★ Huge room for
improvement
RESULTS
19. TECH STACK
★ Current: mainly java
★ NER and Disambiguator:
simple and fast random
forests
★ Ad hoc and optimized
data-structure
★ It's getting old 🥺
★ Now working with: BERT,
TensorFlow and NNs
★ Still not taking into account
timing 😬
20. TAKEAWAYS
★ Never use Apple M1 for these jobs… maybe in a couple of years
★ Language models are REALLY effective, it's not just the hype
… but, if you really want to reach a real-world level, you have to adapt them
★ NLP building blocks (POS tagger, encoders, etc…) are now a commodity,
… but availability of good training data for e2e task is still THE problem
★ You need GPUs for those models only when you have a lot of data
… and in that case, GPUs really make the difference
a sweating g3.16xlarge
21. COMPANYTXT
Entity linking vs Atoka 6M
companies DB
SEDANO
Our news processing
pipeline
ATOKA NEWS
Search engine and news
monitoring for end users
LET'S TRY TO SOLVE THIS
24. CLEANSING
Il titolo ? arrivato a perdere
oltre il 3 per cento. n n
Tronchetti: <Apertura in
calo? Vediamo prossimi
mesi,l'azienda ? solida>
di Andrea Fontana
seguici su Twitter
★ Data cleansing is like cleaning
sewer pipes: someone has
to do it
★ Web News is Web Data, so
HORRIBLE DATA
★ NLP tools are significantly
affected by bad input texts
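A minimal cleansing sketch for the sample above, assuming the stray "n n" is what remains of escaped newlines and that bylines and social prompts should be dropped before NLP. The patterns are hypothetical and cover only this sample; the "?" that replaced accented letters cannot be restored without the original bytes, so it is left untouched.

```python
import re

# Hypothetical boilerplate patterns, derived from the slide's sample only.
BOILERPLATE = [
    re.compile(r"^seguici su \w+$", re.I),  # "follow us on ..." prompts
    re.compile(r"^di [A-Z]\w+ [A-Z]\w+$"),  # bylines such as "di Andrea Fontana"
]

# Leftover of escaped newlines: isolated "n" tokens right after punctuation.
NEWLINE_RESIDUE = re.compile(r"(?<=[.!?]) (?:n )+")


def clean(raw):
    """Collapse whitespace, drop newline residue and boilerplate lines."""
    kept = []
    for line in raw.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        line = NEWLINE_RESIDUE.sub(" ", line)
        if not line or any(p.match(line) for p in BOILERPLATE):
            continue
        kept.append(line)
    return "\n".join(kept)


raw = (
    "Il titolo ? arrivato a perdere oltre il 3 per cento. n n "
    "Tronchetti: <Apertura in calo?>\n"
    "di Andrea Fontana\n"
    "seguici su Twitter"
)
cleaned = clean(raw)
```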
25. DEDUPLICATION
★ Same articles, different newspapers
★ Stopword removal + stemming +
discarding shortest phrases +
locality-sensitive hashing
★ Streaming approach, use Redis for
caching
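The deduplication steps can be sketched with MinHash signatures and banded buckets, one common form of locality-sensitive hashing. Everything below is an assumption made for illustration: the stopword list, the parameters, and the in-memory dict that stands in for Redis. Stemming is omitted because the standard library has no stemmer.

```python
import hashlib
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"il", "lo", "la", "i", "gli", "le", "di", "a", "da", "in",
             "su", "per", "e", "un", "una", "del", "della"}
NUM_HASHES = 32
BANDS = 8                      # 8 bands of 4 rows each
ROWS = NUM_HASHES // BANDS
MIN_PHRASE_TOKENS = 4          # phrases shorter than this are discarded


def phrases(text):
    """Split into sentences, remove stopwords, discard the shortest phrases."""
    out = []
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = [t for t in re.findall(r"\w+", sentence) if t not in STOPWORDS]
        if len(tokens) >= MIN_PHRASE_TOKENS:
            out.append(" ".join(tokens))
    return out


def shingles(text, k=3):
    """Set of k-token windows over the surviving phrases."""
    tokens = " ".join(phrases(text)).split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, shingle):
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(shingle_set):
    """One minimum per seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES)]


class Deduplicator:
    """Streaming check: each article is compared only against LSH buckets."""

    def __init__(self):
        self.buckets = {}  # stand-in for Redis: band key -> first article id

    def check_and_add(self, article_id, text):
        """Return the id of a near-duplicate seen earlier, else store and return None."""
        sh = shingles(text)
        if not sh:
            return None
        sig = minhash(sh)
        keys = [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]
        for key in keys:
            if key in self.buckets:
                return self.buckets[key]
        for key in keys:
            self.buckets[key] = article_id
        return None


dedup = Deduplicator()
a = "Pirelli chiude il trimestre con ricavi in crescita. Il titolo sale in Borsa dopo i conti."
b = "PIRELLI chiude trimestre con ricavi in crescita! Titolo sale in Borsa dopo conti."
c = "Sciopero dei lavoratori allo stabilimento di Torino. I sindacati chiedono un nuovo contratto aziendale."
results = [dedup.check_and_add("a", a), dedup.check_and_add("b", b), dedup.check_and_add("c", c)]
```

Articles `a` and `b` differ only in case, punctuation and stopwords, so they end up in the same buckets, while `c` shares no shingle with them. More bands with fewer rows make the check more tolerant of small edits, at the cost of more false matches.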
26. BUSINESS EVENTS
★ Management changes
★ Economic results
★ Launch of new products
★ Failures
★ Accidents
★ Strikes
★ …
27. AND MANY OTHERS
★ E-S-G themes
✴ Environmental
✴ Social
✴ Governance
★ Sentiment
★ Locations: provinces
28. TECH STACK
★ Mainly Python
★ Celery for distributed tasks
★ Django + Postgres for API
★ gRPC + Go for the core
clustering algorithm
★ Classifiers: scikit-learn and TensorFlow
★ S3 for raw data
★ ES for news articles storage
★ Redis for caching
★ K8s cluster on AWS
★ Sentry, ELK, Prometheus,
Grafana
30. COMPANYTXT
Entity linking vs Atoka 6M
companies DB
SEDANO
Our news processing
pipeline
ATOKA NEWS
Search engine and news
monitoring for end users
LET'S TRY TO SOLVE THIS
36. LAST BUT NOT LEAST…
SPAZIODATI = Startup culture (team of 40, several in this room, talk to us!)
within one of the largest fintech companies in the world: ION Group (>15k
employees)
https://spaziodati.eu/en/jobs/
But if you’re smart we welcome any role, even if not listed :-)
Yes, vegetarians too!
37. THANKS!
QUESTIONS?