AI@Wholi
Traian Rebedea, Ph.D.
traian@wholi.com
>> Bucharest.AI #5 << :: 05 Dec 17
What is Wholi?
• The right people find each other (at the right time)
• People search at first, company search now
• Aggregating signals from all public online sources
• Building a knowledge graph (e.g. People Graph, Company Graph)
• Use the graph to make informed decisions (e.g. contacting, selling)
• $5M+ total funding, last round in 2016
• Working with tens of companies for lead generation & scoring
Wholi – from people data to information
From a technical perspective
AI @ Wholi
Two main directions
• PART 1: Machine learning for entity extraction
• Natural language processing (NLP), information extraction
• PART 2: Matching profiles using deep learning classifier
• Deep learning, word embeddings
Machine learning for entity extraction
• Given the bio (description) from a social profile
or news item about a person/company
• Extract features (entities, relations) useful
for matching and for search/filtering
• Use NER (Named Entity Recognition) to extract some
standard entities
• Bootstrapped entity / relationship learning
• Extracting entities from email signatures
CoreNLP for NER
• Out of the box, with some tuning for speed
• Used pretrained NER tagger in CoreNLP
• To improve speed, we kept only the CRF classifier and dropped the regex-based
rules for time and money expressions
• Detects three types of named entities: PERSON, ORGANIZATION and
LOCATION
• Trained on a news dataset, so accuracy degrades on
social profiles
• Accuracy can be improved by adding new features (e.g. a gazetteer built from
our index) and by retraining on a new dataset of bios from social profiles
Extracting topics from text
• Topic modelling (e.g. LDA, Latent Dirichlet Allocation) extracts a mix (distribution) of
topics from a given text
• Unfortunately, topics are not labeled automatically; we only know the most important words in
each topic
• Given the text in a social profile, compute the most important topic(s)
• Uber Technologies Inc. is an American international transportation network company headquartered in San Francisco, California.
The company develops, markets and operates the Uber mobile app, which allows consumers with smartphones to submit a trip
request which is then routed to Uber drivers who use their own cars.[1][2] By May 28, 2015, the service was available in 58
countries and 300 cities worldwide => [taxi, transportation, ]
• Connectifier is a next-gen search platform specifically targeting hiring - one of the most fundamental pieces of our economy.
Discover, qualify, and connect with exceptional job candidates twice as effectively. => [hiring, career and jobs, recruitment,
marketplace, coaching]
• Dropbox is a free service that lets you bring your photos, docs, and videos anywhere and share them easily. Dropbox was founded
in 2007 by Drew Houston and Arash Ferdowsi, two MIT students tired of emailing files to themselves to work from more than one
computer. => [cloud, collaborative documents, collaboration, file sharing, storage, photo sharing, photo editing, ]
• PeopleGraph is building a people search engine. 3 billion searches a day are for people, yet there's no way to comprehensively
search someone's online footprint. We are well on the way to solving that problem. => [search, people, ]
Bootstrapped entity / relationship learning
• I live in Bucharest<LOC>, working at Wholi<ORG>. Recently I’ve visited
Google<ORG>’s office in Zurich<LOC>.
• Wholi and Google relate to the author differently in this example (employer vs. place visited)
• Identify the relationship between the profile (person, company) and
each entity in the bio
• Also infer other types of entities / relations
• E.g. hobbies (likes, interests), software skills, roles, etc.
Bootstrapped entity / relationship learning
• Work with Alex Tomescu, MSc @ Chalmers University of Technology
Software developer @ Google
• Started from SPIED in CoreNLP - Pattern-based entity extraction and visualization
• Similar work exists in other information extraction systems (e.g. NELL, Probase, Knowledge Vault)
• Bootstrapping is a semi-supervised process that iteratively discovers
more and more examples (R ↗) while keeping fairly good precision (P ↘)
1. Start with a (small) seed of correctly classified instances (P high, R low)
2. Using a huge unlabeled dataset, find patterns to discover these & new instances, keep only
high confidence patterns
3. Discover new instances (some may be incorrect, most should be correct => P decreases, R
increases) & repeat from 2 until… performance degrades too much
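The loop above can be sketched in a few lines of Python. This is a toy illustration rather than SPIED itself: the tiny corpus, the pattern form (the single token to the left of a candidate word), the `min_precision` threshold and the `bootstrap` function name are all assumptions made for the example.

```python
from collections import defaultdict


def bootstrap(sentences, seeds, min_precision=0.5, max_rounds=3):
    """Toy bootstrapped entity learning (hypothetical helper).

    A pattern is the single token to the left of a candidate word.
    A pattern is kept when at least `min_precision` of the words it
    extracts are already known instances.
    """
    known = set(seeds)
    for _ in range(max_rounds):
        # Step 2: collect every word that each left-context pattern extracts.
        extracted = defaultdict(set)
        for sentence in sentences:
            tokens = sentence.lower().split()
            for left, word in zip(tokens, tokens[1:]):
                extracted[left].add(word)
        # Keep only the high-confidence patterns.
        good = [
            words for words in extracted.values()
            if len(words & known) / len(words) >= min_precision
        ]
        # Step 3: add the new instances found by the kept patterns.
        new = set().union(*good) - known if good else set()
        if not new:
            break  # nothing new was discovered, so stop iterating
        known |= new
    return known


corpus = [
    "i enjoy chess",
    "i enjoy hiking",
    "i enjoy sailing",
    "i hate mondays",
    "we play chess",
    "we play poker",
    "we play tennis",
]
learned = bootstrap(corpus, seeds={"chess", "hiking", "poker"})
```

Here the contexts "enjoy" and "play" pass the confidence threshold, so "sailing" and "tennis" join the seed set, while "mondays" (seen only after "hate") does not. A real system would also score the newly extracted instances and stop when precision on a held-out set drops too far.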
Bootstrapped entity / relationship learning
• Figure retrieved from: https://nlp.stanford.edu/software/patternslearning.html
Bootstrapped entity / relationship learning
• Seeds either initialized manually (e.g. hobbies), or from labeled data in our index (e.g. worksAt, livesIn)
• Seed hobbies:
• football, poker, martial arts, chess, swimming, jogging, hiking, fishing, archery, cooking
• Computes:
• baketball[0.0016682103969337127] baseball[0.0018129626504372067]
• dogs[0.0011119719042083742]
• drinks[1.4506528794721354E-4] beer[1.1687126552358244E-4] martini[2.4980405990967884E-4]
• husband[1.7630337419381701E-4]
• church[1.3872841379648836E-4] god[2.0515591306815217E-4]
• gadgets[2.67475711957433E-4]
• music[1.2155990349464713E-4]
• games[1.7018667163236963E-4]
• singing[1.5344948928898647E-4]
• birds[2.746465269584744E-4]
• facebook[1.5347921795982061E-4]
Bootstrapped entity / relationship learning
• Patterns for role:
left 1.0000 : __company__ - [support=6891]
right 0.9802 : di __company__ [support=1879]
right 0.9442 : bei __company__ __end__ [support=5043]
right 0.5559 : @ __company__ __end__ [support=343]
left 1.0000 : __start__ __role__ at __company__ and [support=90]
Bootstrapped entity / relationship learning - Conclusions
• Relationship learning can provide useful features for matching and for
search/filtering
• However, even in the best setting we obtained only a
minor boost in precision (not statistically significant), with a more
noticeable increase in recall (about 1%)
• It’s more useful for search in cases when this information can only be
inferred from text
Matching profiles using deep learning
• Work with Vicentiu Ciorbaru, AI @ UPB MSc
Software developer @ MariaDB
• Dataset built by Dev team @ Wholi
• Used as one of the validation datasets for the different profile
matching methods developed for people search
• Main objective: cluster as many online profile pages of the same
person as possible to build their online identity
Introduction
• Online people search is a significant part of web search
• 11-17% of queries include a person name
• ~4% contain only a person name
• Name ambiguity makes people search a complex problem to solve
efficiently
• Huge overlap in person names worldwide
• The most popular 90,000 full names (first and last name) worldwide are shared by
100M+ individuals
• An important aspect in people search is to find most/all online sources of
information (e.g. web pages) related to the same person
• Recent shift from general web pages to specific ones, like social networking sites and
other professional communities
Introduction
• Our problem: given a set of web pages extracted from online social
networks, determine the profiles which relate to the same individual
• Profile matching (or deduplication)
• Generates a (more) complete online identity for an individual
• Only uses public online information, however adding up all this information about a
person can cause privacy concerns
Related work
• Two main directions
• Personal web pages deduplication
• Matching social networking profiles
• First problem is more generic and complex, as one also needs to
extract personal information (e.g. name, occupation, etc.) from a
wide range of different structured web pages
• Entity deduplication, in general, is a very complex field of study in
Databases, Natural Language Processing (NLP), and Information
Retrieval
Related work
• Web People Search (WePS) datasets and competitions
• Given all web pages returned by a generic search engine for a popular name,
group pages such that each group corresponds to one specific person
• Most solutions employ clustering of the web pages using features extracted from
pages such as Wikipedia concepts, Named Entity Recognition (NER), bag of words
(BoW), and hyperlinks and different similarity measures
• A pairwise approach for solving this problem was also proposed
• Compute the probability that two pages refer to the same person
• Cluster pages by joining pairs that have a high probability to represent the same person
• WePS proposed B-cubed precision and recall for assessing performance
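B-cubed precision and recall are computed per item and then averaged: for each page, precision is the share of its predicted cluster that truly belongs with it, and recall is the share of its true cluster that was recovered. A minimal sketch, assuming every item appears in exactly one predicted and one true cluster (the `b_cubed` name and the sample ids are illustrative):

```python
def b_cubed(predicted, truth):
    """Return (precision, recall) averaged over all items.

    `predicted` and `truth` are lists of clusters (iterables of item ids);
    every item is assumed to appear in exactly one cluster of each.
    """
    pred_of = {item: frozenset(c) for c in predicted for item in c}
    true_of = {item: frozenset(c) for c in truth for item in c}
    precision = recall = 0.0
    for item, pred_cluster in pred_of.items():
        true_cluster = true_of[item]
        # Items that share both the predicted and the true cluster.
        overlap = len(pred_cluster & true_cluster)
        precision += overlap / len(pred_cluster)
        recall += overlap / len(true_cluster)
    n = len(pred_of)
    return precision / n, recall / n


truth = [["a1", "a2", "a3"], ["b1"]]
predicted = [["a1", "a2"], ["a3", "b1"]]
p, r = b_cubed(predicted, truth)  # p = 0.75, r = 2/3
```

Splitting one person's pages across clusters lowers recall, while merging pages of different people lowers precision, which is why both figures are reported together.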
Related work
• More recent research focused on linking social networking profiles
belonging to the same individual
• Zhang et al. (2015) proposed a binary classifier using a probabilistic
graphical model (factor graph)
• Features computed using BoW and TF-IDF for the text in each profile,
but also its social status (position of node in network) and
connections
• Our solution only uses textual features, since the dataset does not
contain connections (e.g. friends or followers)
• These additional features, or others such as the avatar/profile image, could only
improve the results
Dataset
• Snapshot of multiple social networking profiles collected from 15
different online social networks and community websites
• Academia, Code-Project, Facebook, Github, Google+, Instagram, Lanyrd, Linkedin, Mashable, Medium, Moz, Quora,
Slideshare, Twitter, and Vimeo
• For each profile, we extracted some/all of the following information:
username, name (full name or distinct first and last names), gender,
bio (short description), interests, publications, jobs, etc.
• The average number of social profiles per individual is 2.04
and the maximum is 10
• Most profile pages feature a brief description (bio) of the owner
• Profiles do not contain connections, nor posts written by the owner
Dataset
• Ground truth obtained from the website about.me
• Complete online information for professionals
• Contains links to several social networking profiles
of the same person, added manually by each user
• Dataset contains information from over 200,000
about.me accounts
• Total number of extracted social networking profiles:
500,000+
• The corpus was created by Wholi and is one of
the largest corpora used for social profile matching
• While other datasets (Perito et al., 2011; Zhang et al.,
2016) have a larger number of distinct profiles, ground
truth is one order of magnitude larger for our dataset
• 200,000+ compared to ~10,000 items
• This allows training more complex classifiers, including
deep neural networks
Dataset
• Ground truth data has been manually entered by users
• It might be incorrect in some cases (entry errors, user misbehaviour)
• Resembles crowdsourced datasets, which are very popular lately to train complex models
• Train and test sets respect the following rules:
1. Train and test sets should contain different online identities (i.e. different individuals)
2. The clusters in the training set should have no entries present in the test set in order to
avoid overfitted models
3. Test set has the same distribution for cluster sizes as the train set to provide a relevant
comparison for various sized online identities
• Positive items extracted from about.me accounts, negative ones added
randomly between profiles with similar names, location, etc.
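One way to satisfy the three rules is to split whole identity clusters, never individual profiles, and to stratify the split by cluster size. The sketch below is an assumption about how such a split could be done, not the procedure actually used for the dataset; `split_identities`, the 20% test fraction and the profile ids are hypothetical.

```python
import random
from collections import defaultdict


def split_identities(clusters, test_fraction=0.2, seed=0):
    """Split whole identity clusters into train/test, stratified by size."""
    # Group clusters by how many profiles they contain.
    by_size = defaultdict(list)
    for cluster in clusters:
        by_size[len(cluster)].append(cluster)
    rng = random.Random(seed)
    train, test = [], []
    for size in sorted(by_size):
        group = by_size[size]
        rng.shuffle(group)
        # Take the same fraction from every size bucket, so both sets
        # keep a similar distribution of cluster sizes.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical identities: 10 with two profiles, 5 with three profiles.
clusters = [[f"p{i}_{j}" for j in range(2)] for i in range(10)]
clusters += [[f"q{i}_{j}" for j in range(3)] for i in range(5)]
train, test = split_identities(clusters)
```

Because a cluster is assigned to exactly one side, no profile of a test identity can leak into training. Negative pairs would then be sampled separately inside each set.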
Proposed solution
• Main contribution is using a deep neural network (NN) for matching online social
networking profiles
• The NN can efficiently use both textual features and domain-specific
ones
• Also performed a comparison with other solutions used in previous studies,
employing both unsupervised and supervised methods
• For the unsupervised approach, we first generated the feature vector for each
profile, then applied Hierarchical Agglomerative Clustering (HAC) using cosine
distance
• For binary classification we have a twofold objective
1. Detect whether two profiles refer to the same person and should be matched (pairwise
matching)
2. For the graph of connected profiles discovered in phase 1, compute its connected
components
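The second phase can be sketched with a union-find structure: every pair that the classifier accepts becomes an edge, and each connected component is one online identity. The function name and the profile ids below are illustrative, and the accepted pairs are assumed to come from the pairwise classifier of phase 1.

```python
def connected_components(profiles, matched_pairs):
    """Group profiles into identities from pairwise match decisions."""
    parent = {p: p for p in profiles}

    def find(p):
        # Follow parent links to the root, compressing the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Each accepted pair merges the two components it touches.
    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for p in profiles:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


profiles = ["twitter/x", "github/x", "linkedin/x", "vimeo/y", "quora/z"]
pairs = [("twitter/x", "github/x"), ("github/x", "linkedin/x")]
identities = connected_components(profiles, pairs)
```

Note that a single false-positive pair is enough to merge two identities, which helps explain why false positives are worth penalizing heavily during training.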
Extracted features
• Given a pair of profiles (a, b)
• Domain-specific features: distance-based measures on names (full, first,
last) and usernames, matching gender, matching location, matching
company/employer, etc.
• Text-based features
• Computed from all the other textual attributes in a profile (e.g. bio, publications, interests)
• Used precomputed Word2Vec word embeddings with 300 dimensions, averaged over all
words in a profile
• Also computed cosine and Euclidean distance between word embeddings of the candidate
pair (a, b)
• Feature normalization
• Compute the z-scores for each feature
• Whitening using Principal Component Analysis (PCA) in a 25-dimensional vector space to
remove noise
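The text-based features can be illustrated in plain Python. This is a sketch under stated assumptions: toy 3-dimensional vectors stand in for the 300-dimensional Word2Vec embeddings, the helper names are made up for the example, and the PCA whitening step is left out.

```python
import math


def average_embedding(words, vectors):
    """Average the vectors of the words that have an embedding."""
    found = [vectors[w] for w in words if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not found:
        return [0.0] * dim
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def z_scores(values):
    """Standardise one feature column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std if std else 0.0 for x in values]


# Toy 3-dimensional vectors standing in for 300-dimensional Word2Vec ones.
vectors = {
    "python": [1.0, 0.0, 0.0],
    "developer": [0.0, 1.0, 0.0],
    "chef": [0.0, 0.0, 1.0],
}
a = average_embedding("python developer".split(), vectors)
b = average_embedding("developer python".split(), vectors)
c = average_embedding("chef".split(), vectors)
```

Averaging discards word order, so profiles `a` and `b` end up with the same vector and a cosine distance of (numerically) zero, while `a` and `c` share no direction and have a cosine distance of 1.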
Deep neural network for profile matching
• Given the very large dataset and the recent advances of deep learning, we
propose a deep NN model for profile matching
• Deep NNs should be able to model more complex non-linear combinations of the
different features (domain specific, word embeddings)
• Proposed a model which uses 6 fully-connected (FC) layers with different
activation functions
• The loss function uses cross-entropy, with an added weight for false positives
which contribute 10 times more to the loss
• Penalizes false connections between profiles and counteracts the imbalanced distribution
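One plausible reading of this loss is a binary cross-entropy in which the term for negative pairs, the only term that can produce a false positive, is multiplied by 10. The sketch below follows that reading; the exact formulation used in the model is not given in the slides, and `weighted_cross_entropy` and `fp_weight` are names chosen for the example.

```python
import math


def weighted_cross_entropy(y_true, y_prob, fp_weight=10.0, eps=1e-12):
    """Mean binary cross-entropy where errors on negative pairs
    (the source of false positives) are scaled by `fp_weight`."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -math.log(p)
        else:
            total += -fp_weight * math.log(1.0 - p)
    return total / len(y_true)


# The same confidence error costs 10x more on a negative pair.
loss_false_negative = weighted_cross_entropy([1], [0.2])
loss_false_positive = weighted_cross_entropy([0], [0.8])
```

With this weighting the network is pushed to accept a pair only when it is quite sure, trading some recall for precision.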
Deep neural network for profile matching
• The first layer takes as input the features computed for the candidate profile pair
and projects them into a larger feature space (612 → 1024)
• The next two layers iteratively reduce the dimensionality of the representation to
a denser feature space
• The final layers employ ReLU activation for the neurons, as ReLU units are known
to provide better results for binary classification (Nair & Hinton, 2010)
• Dropout is employed to avoid overfitting
Results
• Experiments performed using an imbalanced test set with one
positive profile pair for 100 negative ones
• Reflects a real-world scenario, where for each correct match between two
profiles, one compares tens/hundreds of incorrect (but similar) candidates
• Table shows B-cubed precision and recall obtained on the test set
• Using same names or similar names as baselines for comparison
Results
• Unsupervised methods (HAC) obtain poorer results than baseline mainly because
cosine is not a good measure for cluster/item similarity for our feature vectors
• The RF classifier performs well only when domain specific features are added to
the word embeddings
• The large training set limited the number of trees (to 12) in the forest
• RF usually performs poorly when using word embeddings for a pair of documents (as it
cannot compute a more complex similarity function)
• Mini-batch training of NNs allows using larger datasets than for RF
• The deep NN model learns a more complex combination between word
embeddings and domain specific features, grouping profiles with similar
embeddings and similar names
• Deep NN is the only model which can achieve both high recall (R=0.85) and high
precision (P=0.95)
Results – examples
• Ground truth
• ['twitter/etniqminerals', 'instagram/etniqminerals', 'googleplus/106318957043871071183', 'facebook/etniqminerals',
'facebook/rockcityelitesalsa', 'facebook/1renaissancewoman', 'facebook/naturalblackgirlguide', 'linkedin/leahpatterson']
• Computed
• ['facebook/1renaissancewoman', 'linkedin/leahpatterson', 'googleplus/106318957043871071183']
• ['twitter/etniqminerals', 'instagram/etniqminerals', 'facebook/etniqminerals']
• [ 'facebook/naturalblackgirlguide']
• “Leah Patterson” is an individual who has two different companies “Etniq Minerals” and “Natural Black
Girl Guide”
• Ground truth
• 3 different individuals whose first name is “Tim” and all of them work in IT
• Computed
• ['googleplus/113375270405699485276', 'linkedin/timsmith78', 'googleplus/117829094399867770981',
'twitter/bbyxinnocenz', 'facebook/tim.tio.5', 'vimeo/user616297', 'linkedin/timtio', 'twitter/wbcsaint', 'twitter/turnitontim']
Matching profiles using deep learning - Conclusions
• Proposed a large dataset for matching online social networking
profiles
• This allowed us to train a deep neural network for profile matching
using both domain-specific features and word embeddings generated
from textual descriptions from social profiles
• Experiments showed that the NN surpassed both unsupervised and
supervised models, achieving a high precision (P = 0.95) with a good
recall rate (R = 0.85)
• Further advancements can be made by training more complex deep
learning models, using recurrent or convolutional networks, and by
adding features extracted from profile pictures
Summing up…
• Machine learning can be used to improve feature extraction and matching for people /
company search
• This requires a mix of NLP, ML, and information extraction techniques
• Also datasets for training / fitting parameters and testing / validation
• Improvements for matching were only marginally relevant using relationship extraction,
however filtering/search benefits more
• Deep learning is a solution for improving matching, especially for profiles that contain a
larger textual description
• We are moving from people data to company data, therefore we need to change / adapt
the proposed methods and datasets
Thank you!
traian@wholi.com
Bootstrapped entity / relationship learning - Results
• For other relations (worksAt, livesIn, studiedAt, worksAs) the seeds
are generated from our index
• Use a test set to compute precision and recall (would like large R,
even with a smaller P)
Discussions (if time)
• A similar architecture has been proposed by Google (Covington et al., 2016) for
recommending YouTube videos
• However, we found this work only recently
About me
• Passionate about Machine Learning, Natural Language Processing and
algorithms, with solid experience as a ML scientist both in companies and
academia.
• Teaching & mentoring (A&C @ UPB) to transmit some of my passion and
interest to others.
• Linkedin profile: https://www.linkedin.com/in/trebedea/
• Google Scholar:
https://scholar.google.ro/citations?user=7NxaE1MAAAAJ&hl=en&oi=ao
• Deep learning meetup (more scientific, less business)
https://www.meetup.com/Bucharest-Deep-Learning/
>> Bucharest.AI #5 << :: 05 Dec 17

More Related Content

What's hot

An Introduction to Information Retrieval and Applications
 An Introduction to Information Retrieval and Applications An Introduction to Information Retrieval and Applications
An Introduction to Information Retrieval and Applications sathish sak
 
The Social Semantic Web
The Social Semantic WebThe Social Semantic Web
The Social Semantic WebJohn Breslin
 
Future of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic WebFuture of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic Webis20090
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISrathnaarul
 
Internet 信息检索中的数学
Internet 信息检索中的数学Internet 信息检索中的数学
Internet 信息检索中的数学Xu jiakon
 
Question answering in linked data
Question answering in linked dataQuestion answering in linked data
Question answering in linked dataReza Ramezani
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Andre Freitas
 
992 sms10 social_media_services
992 sms10 social_media_services992 sms10 social_media_services
992 sms10 social_media_servicessiyaza
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
 
Looking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseLooking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseArjen de Vries
 
Interlinking Online Communities and Enriching Social Software with the Semant...
Interlinking Online Communities and Enriching Social Software with the Semant...Interlinking Online Communities and Enriching Social Software with the Semant...
Interlinking Online Communities and Enriching Social Software with the Semant...John Breslin
 
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...Artificial Intelligence Institute at UofSC
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...Daniel Katz
 
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineYi Zeng
 
The Importance of being LOUD
The Importance of being LOUDThe Importance of being LOUD
The Importance of being LOUDRobert Sanderson
 
Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)Robert Sanderson
 

What's hot (20)

An Introduction to Information Retrieval and Applications
 An Introduction to Information Retrieval and Applications An Introduction to Information Retrieval and Applications
An Introduction to Information Retrieval and Applications
 
The Social Semantic Web
The Social Semantic WebThe Social Semantic Web
The Social Semantic Web
 
Future of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic WebFuture of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic Web
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
 
Mazhiming
MazhimingMazhiming
Mazhiming
 
Social network analysis
Social network analysisSocial network analysis
Social network analysis
 
Internet 信息检索中的数学
Internet 信息检索中的数学Internet 信息检索中的数学
Internet 信息检索中的数学
 
Question answering in linked data
Question answering in linked dataQuestion answering in linked data
Question answering in linked data
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
992 sms10 social_media_services
992 sms10 social_media_services992 sms10 social_media_services
992 sms10 social_media_services
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
Looking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseLooking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterprise
 
Interlinking Online Communities and Enriching Social Software with the Semant...
Interlinking Online Communities and Enriching Social Software with the Semant...Interlinking Online Communities and Enriching Social Software with the Semant...
Interlinking Online Communities and Enriching Social Software with the Semant...
 
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Men...
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
 
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support Engine
 
Recommender systems
Recommender systemsRecommender systems
Recommender systems
 
The Importance of being LOUD
The Importance of being LOUDThe Importance of being LOUD
The Importance of being LOUD
 
Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)Structural Metadata in RDF (IS575)
Structural Metadata in RDF (IS575)
 
Q046049397
Q046049397Q046049397
Q046049397
 

Similar to AI at Wholi: Machine Learning and Deep Learning for Entity Extraction and Profile Matching

Machine Learning using Big data
Machine Learning using Big data Machine Learning using Big data
Machine Learning using Big data Vaibhav Kurkute
 
The evolution of Search spscinci
The evolution of Search spscinciThe evolution of Search spscinci
The evolution of Search spscinciJohnny Lopez
 
FSU SLIS InfoSvcs Wk 3 - Web Search & Evaluation
FSU SLIS InfoSvcs Wk 3 - Web Search & EvaluationFSU SLIS InfoSvcs Wk 3 - Web Search & Evaluation
FSU SLIS InfoSvcs Wk 3 - Web Search & EvaluationLorri Mon
 
Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data ManagementeXascale Infolab
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Data Analytics Unleashed 2017
Data Analytics Unleashed 2017Data Analytics Unleashed 2017
Data Analytics Unleashed 2017Michael Perillo
 
Todd davis facebook sourcing
Todd davis facebook sourcingTodd davis facebook sourcing
Todd davis facebook sourcingTalent42
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in RomaniaVlad Posea
 
SEMPO Canada Summit in Vancouver May 2013
SEMPO Canada Summit in Vancouver May 2013SEMPO Canada Summit in Vancouver May 2013
SEMPO Canada Summit in Vancouver May 2013Duane Forrester
 
Physical and Online Card Sorts: A Practical Overview and Case Study
Physical and Online Card Sorts: A Practical Overview and Case StudyPhysical and Online Card Sorts: A Practical Overview and Case Study
Physical and Online Card Sorts: A Practical Overview and Case StudyBob Thomas
 
Advanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsAdvanced Research Investigations for SIU Investigators
Advanced Research Investigations for SIU InvestigatorsSloan Carne
 
Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...
Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...
Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...Connotate
 
Search Analytics for Fun and Profit
Search Analytics for Fun and ProfitSearch Analytics for Fun and Profit
Search Analytics for Fun and ProfitLouis Rosenfeld
 
MECO3602 2014, Week 4 Lecture 'Duck Duck Go[ogle]: The politics of search
MECO3602 2014, Week 4 Lecture 'Duck Duck Go[ogle]: The politics of searchMECO3602 2014, Week 4 Lecture 'Duck Duck Go[ogle]: The politics of search
MECO3602 2014, Week 4 Lecture 'Duck Duck Go[ogle]: The politics of searchUniversity of Sydney
 
10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19
10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-1910 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19
10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19rgillis
 
Search Analytics for Content Strategists
Search Analytics for Content StrategistsSearch Analytics for Content Strategists
Search Analytics for Content StrategistsLouis Rosenfeld
 
SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...
SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...SEOktoberfest 2022 - Blending SEO, Discover, & Entity Extraction to Analyze D...
AI at Wholi: Machine Learning and Deep Learning for Entity Extraction and Profile Matching

  • 1. AI@Wholi Traian Rebedea, Ph.D. traian@wholi.com >> Bucharest.AI #5 << :: 05 Dec 17
  • 2. What is Wholi?
      • The right people find each other (at the right time)
      • People search at first, company search now
      • Aggregating signals from all public online sources
      • Building a knowledge graph (e.g. People Graph, Company Graph)
      • Use the graph to make informed decisions (e.g. contacting, selling)
      • $5M+ total funding, last round in 2016
      • Working with tens of companies for lead generation & scoring
  • 3. Wholi – from people data to information
  • 4. Wholi – from people data to information
  • 5. From a technical perspective
  • 6. From a technical perspective
  • 7. AI @ Wholi: two main directions
      • PART 1: Machine learning for entity extraction
          • Natural language processing (NLP), information extraction
      • PART 2: Matching profiles using a deep learning classifier
          • Deep learning, word embeddings
  • 8. Machine learning for entity extraction
      • Given the bio (description) from a social profile or news item about a person/company
      • Extract features (entities, relations) useful for matching and for search/filtering
      • Use NER (Named Entity Recognition) to extract some standard entities
      • Bootstrapped entity / relationship learning
      • Extracting entities from email signatures
  • 9. CoreNLP for NER
      • Out of the box, with some tuning for speed
      • Used the pretrained NER tagger in CoreNLP
      • To improve speed, we kept only the CRF classifier and dropped the regex-based recognizers for time and money entities
      • Detects three types of named entities: PERSON, ORGANIZATION and LOCATION
      • Trained on a news dataset, therefore accuracy degrades on social profiles
      • Accuracy can be improved by integrating new features (e.g. a gazetteer built from our index) and by training on a new dataset of bios from social profiles
  • 10. Extracting topics from text
      • Topic modelling (e.g. LDA – Latent Dirichlet Allocation) makes it possible to extract a mix (distribution) of topics from a given text
      • Unfortunately, topics are not labeled automatically; we only know the most important words in each topic
      • Given the text in a social profile, compute the most important topic(s)
          • Uber Technologies Inc. is an American international transportation network company headquartered in San Francisco, California. The company develops, markets and operates the Uber mobile app, which allows consumers with smartphones to submit a trip request which is then routed to Uber drivers who use their own cars. By May 28, 2015, the service was available in 58 countries and 300 cities worldwide => [taxi, transportation, ]
          • Connectifier is a next-gen search platform specifically targeting hiring - one of the most fundamental pieces of our economy. Discover, qualify, and connect with exceptional job candidates twice as effectively. => [hiring, career and jobs, recruitment, marketplace, coaching]
          • Dropbox is a free service that lets you bring your photos, docs, and videos anywhere and share them easily. Dropbox was founded in 2007 by Drew Houston and Arash Ferdowsi, two MIT students tired of emailing files to themselves to work from more than one computer. => [cloud, collaborative documents, collaboration, file sharing, storage, photo sharing, photo editing, ]
          • PeopleGraph is building a people search engine. 3 billion searches a day are for people, yet there's no way to comprehensively search someone's online footprint. We are well on the way to solving that problem. => [search, people, ]
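To make the idea of "a distribution of topics per text" concrete, here is a minimal sketch of LDA trained with collapsed Gibbs sampling, written in plain Python. It is an illustration only, not the implementation used at Wholi: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for this example. As the slide notes, the resulting topics are unlabeled; only their most frequent words are known.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hypothetical helper, not Wholi's code).

    docs: list of tokenized documents (lists of words).
    Returns (theta, top_words): per-document topic distributions and the
    most frequent words of each (unlabeled) topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total words per topic
    assignments = []

    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic distribution for each document (each row sums to 1).
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [w for w, c in counts.most_common(3) if c > 0] for counts in topic_word
    ]
    return theta, top_words
```

For a profile bio, the "most important topic(s)" would then be the indices with the largest values in the corresponding row of `theta`; mapping those indices to human-readable labels such as "taxi" or "file sharing" still requires a separate labeling step.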
  • 11. Bootstrapped entity / relationship learning
      • I live in Bucharest<LOC>, working at Wholi<ORG>. Recently I’ve visited Google<ORG>’s office in Zurich<LOC>.
      • Wholi and Google are both tagged ORG, but they have different semantics in the example (employer vs. a company that was visited)
      • Identify the relationship between the profile (person, company) and each entity in the bio
      • Also infer other types of entities / relations
          • E.g. hobbies (likes, interests), software skills, roles, etc.
  • 12. Bootstrapped entity / relationship learning
      • Work with Alex Tomescu (MSc @ Chalmers University of Technology, software developer @ Google)
      • Started from SPIED in CoreNLP: pattern-based entity extraction and visualization
      • Similar work exists in other information extraction systems (e.g. NELL, Probase, Knowledge Vault)
      • Bootstrapping is a semi-supervised process that iteratively discovers more and more examples (recall R ↗) while keeping reasonably good precision (P ↘)
          1. Start with a (small) seed of correctly classified instances (P high, R low)
          2. Using a huge unlabeled dataset, find patterns that discover these and new instances; keep only high-confidence patterns
          3. Discover new instances (some may be incorrect, most should be correct => P decreases, R increases) and repeat from step 2 until performance degrades too much
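The three bootstrapping steps can be sketched in a few lines of Python. This is a deliberately simplified illustration, not SPIED and not Wholi's pipeline: a "pattern" here is just the pair of words immediately to the left and right of a token, and the function name `bootstrap` and the thresholds `min_precision` and `min_support` are assumptions made for the example.

```python
import re
from collections import defaultdict


def bootstrap(sentences, seeds, rounds=3, min_precision=0.5, min_support=2):
    """Toy pattern-based bootstrapping (hypothetical helper).

    A pattern is the (left word, right word) context of a token.
    Returns the seed set extended with newly discovered instances.
    """
    known = set(seeds)
    for _ in range(rounds):
        # Step 2a: record which tokens appear inside each context pattern.
        pattern_hits = defaultdict(set)
        for sentence in sentences:
            tokens = ["__start__"] + re.findall(r"\w+", sentence.lower()) + ["__end__"]
            for i in range(1, len(tokens) - 1):
                pattern_hits[(tokens[i - 1], tokens[i + 1])].add(tokens[i])

        # Step 2b: keep only high-confidence patterns, i.e. those that match
        # enough known instances and are mostly filled by known instances.
        discovered = set()
        for matched in pattern_hits.values():
            positives = matched & known
            if len(positives) >= min_support and len(positives) / len(matched) >= min_precision:
                # Step 3: everything else matched by the pattern is a new instance.
                discovered |= matched - known

        if not discovered:
            break  # nothing new was found, so further rounds cannot help
        known |= discovered
    return known
```

With seed hobbies `{"football", "chess"}` and sentences such as "I enjoy football on weekends", "I enjoy chess on weekends" and "I enjoy hiking on weekends", the context `(enjoy, on)` is trusted and `hiking` is added. Lowering `min_precision` raises recall at the cost of precision, which is the trade-off described in step 3.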
  • 13. Bootstrapped entity / relationship learning
      • Figure retrieved from: https://nlp.stanford.edu/software/patternslearning.html
  • 14. Bootstrapped entity / relationship learning
      • Seeds either initialized manually (e.g. hobbies), or from labeled data in our index (e.g. worksAt, livesIn)
      • Seed hobbies:
          • football, poker, martial arts, chess, swimming, jogging, hiking, fishing, archery, cooking
      • Computes:
          • baketball[0.0016682103969337127] baseball[0.0018129626504372067]
          • dogs[0.0011119719042083742]
          • drinks[1.4506528794721354E-4] beer[1.1687126552358244E-4] martini[2.4980405990967884E-4]
          • husband[1.7630337419381701E-4]
          • church[1.3872841379648836E-4] god[2.0515591306815217E-4]
          • gadgets[2.67475711957433E-4]
          • music[1.2155990349464713E-4]
          • games[1.7018667163236963E-4]
          • singing[1.5344948928898647E-4]
          • birds[2.746465269584744E-4]
          • facebook[1.5347921795982061E-4]
  • 15. Bootstrapped entity / relationship learning
      • Patterns for role:
          left  1.0000 : __company__ - [support=6891]
          right 0.9802 : di __company__ [support=1879]
          right 0.9442 : bei __company__ __end__ [support=5043]
          right 0.5559 : @ __company__ __end__ [support=343]
          left  1.0000 : __start__ __role__ at __company__ and [support=90]
  • 16. Bootstrapped entity / relationship learning - Conclusions
      • Relationship learning can provide useful features for matching and for search/filtering
      • However, even in the best setting we obtained only a minor boost in precision (not statistically significant), with a more notable increase in recall (about 1%)
      • It is more useful for search, in cases where this information can only be inferred from text
  • 17. Matching profiles using deep learning
      • Work with Vicentiu Ciorbaru (AI MSc @ UPB, software developer @ MariaDB)
      • Dataset built by the Dev team @ Wholi
      • Used as one of the validation datasets for the different profile matching methods developed for people search
      • Main objective: cluster as many online profile pages of the same person as possible, to build their online identity
  • 18. Introduction
      • Online people search is a significant part of web search
          • 11-17% of queries include a person name
          • ~4% contain only a person name
      • Name ambiguity makes people search a complex problem to solve efficiently
          • Huge overlap in person names worldwide
          • The most popular 90,000 full names (first and last name) worldwide are shared by 100M+ individuals
      • An important aspect in people search is to find most/all online sources of information (e.g. web pages) related to the same person
      • Recent shift from general web pages to specific ones, like social networking sites and other professional communities
  • 19. Introduction
      • Our problem: given a set of web pages extracted from online social networks, determine the profiles which relate to the same individual
      • Profile matching (or deduplication)
      • Generates a (more) complete online identity for an individual
      • Only uses public online information; however, aggregating all this information about a person can raise privacy concerns
  • 20. Related work
      • Two main directions
          • Personal web pages deduplication
          • Matching social networking profiles
      • First problem is more generic and complex, as one also needs to extract personal information (e.g. name, occupation, etc.) from a wide range of different structured web pages
      • Entity deduplication, in general, is a very complex field of study in Databases, Natural Language Processing (NLP), and Information Retrieval
  • 21. Related work
      • Web People Search (WePS) datasets and competitions
          • Given all web pages returned by a generic search engine for a popular name, group pages such that each group corresponds to one specific person
      • Most solutions cluster the web pages using features extracted from the pages (such as Wikipedia concepts, Named Entity Recognition (NER), bag of words (BoW), and hyperlinks) together with different similarity measures
      • A pairwise approach for solving this problem was also proposed
          • Compute the probability that two pages refer to the same person
          • Cluster pages by joining pairs that have a high probability of representing the same person
      • WePS proposed B-cubed precision and recall for assessing performance
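B-cubed precision and recall are computed per item and then averaged: for each page, precision is the share of its predicted cluster that truly belongs to the same person, and recall is the share of that person's pages that landed in the predicted cluster. The sketch below covers the standard non-overlapping case (each page in exactly one cluster); the function name `b_cubed` and the input format are assumptions made for illustration.

```python
def b_cubed(predicted, gold):
    """B-cubed precision and recall for non-overlapping clusterings.

    predicted, gold: dicts mapping each item to a cluster id; both must
    cover the same items. Returns (precision, recall).
    """
    def group(labels):
        groups = {}
        for item, cluster_id in labels.items():
            groups.setdefault(cluster_id, set()).add(item)
        return groups

    predicted_groups = group(predicted)
    gold_groups = group(gold)

    precision_sum = 0.0
    recall_sum = 0.0
    for item in gold:
        predicted_cluster = predicted_groups[predicted[item]]
        gold_cluster = gold_groups[gold[item]]
        overlap = len(predicted_cluster & gold_cluster)  # includes the item itself
        precision_sum += overlap / len(predicted_cluster)
        recall_sum += overlap / len(gold_cluster)

    n = len(gold)
    return precision_sum / n, recall_sum / n
```

For example, if pages a, b, c belong to one person and d to another, while the system outputs clusters {a, b} and {c, d}, the per-item precisions are 1, 1, 0.5, 0.5 and the per-item recalls are 2/3, 2/3, 1/3, 1, giving a B-cubed precision of 0.75 and a recall of 2/3.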
  • 22. Related work
      • More recent research focused on linking social networking profiles belonging to the same individual
      • Zhang et al. (2015) proposed a binary classifier using a probabilistic graphical model (factor graph)
          • Features computed using BoW and TF-IDF for the text in each profile, but also its social status (position of node in network) and connections
      • Our solution only uses textual features, since the dataset does not contain connections (e.g. friends or followers)
          • These additional features, or others such as the avatar/profile image, would only improve the results
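As an illustration of the textual features mentioned above, the following sketch builds sparse TF-IDF vectors from profile bios and compares them with cosine similarity. It assumes whitespace tokenization and a smoothed IDF of the form log((1 + N) / (1 + df)) + 1; both choices, and the function names `tfidf_vectors` and `cosine`, are assumptions for this example rather than the exact weighting used in the cited work or at Wholi.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict word -> weight) per text."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two bios that share distinctive words (an employer name, a city) receive a high similarity, while bios with no words in common score exactly 0. A pairwise classifier can use such a score as one input feature among others.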
  • 23. Dataset
      • Snapshot of multiple social networking profiles collected from 15 different online social networks and community websites
          • Academia, Code-Project, Facebook, Github, Google+, Instagram, Lanyrd, Linkedin, Mashable, Medium, Moz, Quora, Slideshare, Twitter, and Vimeo
      • For each profile, we extracted some/all of the following information: username, name (full name or distinct first and last names), gender, bio (short description), interests, publications, jobs, etc.
      • The average number of social profiles per individual is 2.04 and the maximum is 10
      • Most profile pages feature a brief description (bio) of the owner
      • Profiles do not contain connections, nor posts written by the owner
  • 24. Dataset • Ground truth obtained from the website about.me • Complete online information for professionals • Contains links to several social networking profiles of the same person, added manually by each user • Dataset contains information from over 200,000 about.me accounts • Total number of extracted social networking profiles: 500,000+ • The corpus was created by Wholi and is one of the largest corpora used for social profile matching • While other datasets (Perito et al., 2011; Zhang et al., 2016) have a larger number of distinct profiles, ground truth is one order of magnitude larger for our dataset • 200,000+ compared to ~10,000 items • This allows training more complex classifiers, including deep neural networks
  • 25. Dataset • Ground truth data has been manually entered by users • It might be incorrect in some cases (entry errors, user misbehaviour) • Resembles crowdsourced datasets, which have recently become popular for training complex models • Train and test sets respect the following rules: 1. Train and test sets should contain different online identities (i.e. different individuals) 2. The clusters in the training set should have no entries present in the test set in order to avoid overfitted models 3. Test set has the same distribution for cluster sizes as the train set to provide a relevant comparison for various sized online identities • Positive items extracted from about.me accounts, negative ones added randomly between profiles with similar names, location, etc.
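The three split rules above can be sketched as a split at the identity (cluster) level, stratified by cluster size. This is only an illustration under stated assumptions: the function name `split_identities`, the `test_fraction` parameter, and the dictionary input format are hypothetical and not taken from the talk.

```python
import random
from collections import defaultdict


def split_identities(clusters, test_fraction=0.2, seed=0):
    """Split whole identities into train/test, stratified by cluster size.

    clusters: dict mapping identity id -> list of profile ids (hypothetical).
    Every identity lands in exactly one of the two sets (rules 1 and 2), and
    each cluster size contributes the same fraction to the test set (rule 3).
    """
    by_size = defaultdict(list)
    for identity, profiles in clusters.items():
        by_size[len(profiles)].append(identity)

    rng = random.Random(seed)
    train, test = {}, {}
    for size in sorted(by_size):
        ids = sorted(by_size[size])  # sort first so the shuffle is reproducible
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        for i, identity in enumerate(ids):
            target = test if i < n_test else train
            target[identity] = clusters[identity]
    return train, test


# Toy data: ten identities with 2 profiles each, five with 3 profiles each.
clusters = {f"p{i}": [f"p{i}/a", f"p{i}/b"] for i in range(10)}
clusters.update({f"q{i}": [f"q{i}/a", f"q{i}/b", f"q{i}/c"] for i in range(5)})
train, test = split_identities(clusters, test_fraction=0.2)
```

Splitting by identity rather than by profile pair is what keeps profiles of the same person from appearing on both sides of the split.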
  • 26. Proposed solution • Main contribution is using a deep neural network (NN) for matching online social networking profiles • NN is able to make efficient use of both textual features and domain-specific ones • Also performed a comparison with other solutions used in previous studies, employing both unsupervised and supervised methods • For the unsupervised approach, we first generated the feature vector for each profile, then applied Hierarchical Agglomerative Clustering (HAC) using cosine distance • For binary classification we have a twofold objective 1. Detect whether two profiles refer to the same person and should be matched (pairwise matching) 2. For the graph of connected profiles discovered in phase 1, compute its connected components
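The second phase described above, turning pairwise matches into groups of profiles, can be sketched with a union-find structure. This is a generic illustration, not the implementation used at Wholi; the function name `connected_components` and the sample profile ids are made up.

```python
def connected_components(profiles, matched_pairs):
    """Group profiles into identities from pairwise match decisions.

    profiles: iterable of profile ids.
    matched_pairs: iterable of (a, b) pairs the classifier marked as a match.
    Returns a list of sets, one per connected component.
    """
    profiles = list(profiles)
    parent = {p: p for p in profiles}

    def find(p):
        # Follow parent links to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for p in profiles:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


# Hypothetical profile ids: three linked accounts and one unmatched profile.
profiles = ["twitter/x", "instagram/x", "facebook/x", "linkedin/y"]
pairs = [("twitter/x", "instagram/x"), ("instagram/x", "facebook/x")]
components = connected_components(profiles, pairs)
```

Because components are transitive, a single false positive pair can merge two identities, which is one reason to penalize false positives heavily during training.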
  • 27. Extracted features • Given a pair of profiles (a, b) • Domain-specific features: distance-based measures computed on names (full, first, last) and usernames, matching gender, matching location, matching company/employer, etc. • Text-based features • Computed from all the other textual attributes in a profile (e.g. bio, publications, interests) • Used precomputed Word2Vec word embeddings with 300 dimensions, averaged over all words in a profile • Also computed cosine and Euclidean distance between word embeddings of the candidate pair (a, b) • Feature normalization • Compute the z-scores for each feature • Whitening using Principal Component Analysis (PCA) in a 25-dimensional vector space to remove noise
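The text-based features above can be sketched in plain Python. This is a toy illustration: the helper names and the 2-dimensional embeddings are invented for readability, whereas the talk uses precomputed 300-dimensional Word2Vec vectors. The PCA whitening step is not shown.

```python
import math


def average_embedding(words, embeddings, dim):
    """Average the vectors of all known words; zero vector if none is known."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def z_scores(values):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std if std else 0.0 for x in values]


# Toy 2-dimensional embeddings (assumed values, for illustration only).
embeddings = {"data": [1.0, 0.0], "science": [0.0, 1.0]}
bio_a = average_embedding(["data", "science", "unknownword"], embeddings, dim=2)
bio_b = average_embedding(["data"], embeddings, dim=2)
pair_features = [cosine_distance(bio_a, bio_b), euclidean_distance(bio_a, bio_b)]
```

In practice each feature column would be standardized with `z_scores` over the whole training set before any dimensionality reduction.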
  • 28. Deep neural network for profile matching • Given the very large dataset and the recent advances of deep learning, we propose a deep NN model for profile matching • Deep NNs should be able to model more complex non-linear combinations of the different features (domain specific, word embeddings) • Proposed a model which uses 6 fully-connected (FC) layers with different activation functions • The loss function uses cross-entropy, with an added weight for false positives which contribute 10 times more to the loss • Penalizes false connections between profiles and counteracts the imbalanced distribution
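One plausible reading of the weighted loss above is a binary cross-entropy in which the term for negative pairs (the term that grows when a non-match is scored as a match) is multiplied by 10. The sketch below follows that reading; the exact formulation used in the talk is not given, so the function name `weighted_cross_entropy` and the `fp_weight` parameter are assumptions.

```python
import math


def weighted_cross_entropy(y_true, p_pred, fp_weight=10.0, eps=1e-12):
    """Mean binary cross-entropy with a heavier penalty on false positives.

    y_true: 1 for a matching pair, 0 for a non-matching pair.
    p_pred: predicted probability that the pair is a match.
    fp_weight: multiplier for the negative-pair term (assumed to be 10).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        if y == 1:
            total += -math.log(p)
        else:
            total += -fp_weight * math.log(1.0 - p)
    return total / len(y_true)


# The same uncertain prediction costs 10x more on a negative pair.
loss_on_negative = weighted_cross_entropy([0], [0.5])
loss_on_positive = weighted_cross_entropy([1], [0.5])
```

With this weighting the model is pushed toward high precision, which matches the stated goal of avoiding false connections between profiles.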
  • 29. Deep neural network for profile matching • The first layer takes as input the features computed for the candidate profile pair and goes into a larger feature space (612 → 1024) • The next two layers iteratively reduce the dimensionality of the representation to a denser feature space • The final layers employ ReLU activation for the neurons, as ReLU units are known to provide better results for binary classification (Nair & Hinton, 2010) • Dropout is employed to avoid overfitting
  • 30. Results • Experiments performed using an imbalanced test set with one positive profile pair for 100 negative ones • Reflects a real-world scenario, where for each correct match between two profiles, one compares tens/hundreds of incorrect (but similar) candidates • Table shows B-cubed precision and recall obtained on the test set • Using same names or similar names as baselines for comparison
  • 31. Results • Unsupervised methods (HAC) obtain poorer results than baseline mainly because cosine is not a good measure for cluster/item similarity for our feature vectors • The RF classifier performs well only when domain specific features are added to the word embeddings • The large training set limited the number of trees (to 12) in the forest • RF usually performs poorly when using word embeddings for a pair of documents (as they cannot compute a more complex similarity function) • Mini-batch training of NNs allows using larger datasets than for RF • The deep NN model learns a more complex combination between word embeddings and domain specific features, grouping profiles with similar embeddings and similar names • Deep NN is the only model which can achieve both high recall (R=0.85) and high precision (P=0.95)
  • 32. Results – examples • Ground truth • ['twitter/etniqminerals', 'instagram/etniqminerals', 'googleplus/106318957043871071183', 'facebook/etniqminerals', 'facebook/rockcityelitesalsa', 'facebook/1renaissancewoman', 'facebook/naturalblackgirlguide', 'linkedin/leahpatterson'] • Computed • ['facebook/1renaissancewoman', 'linkedin/leahpatterson', 'googleplus/106318957043871071183'] • ['twitter/etniqminerals', 'instagram/etniqminerals', 'facebook/etniqminerals'] • ['facebook/naturalblackgirlguide'] • “Leah Patterson” is an individual who has two different companies “Etniq Minerals” and “Natural Black Girl Guide” • Ground truth • 3 different individuals whose first name is “Tim” and all of them work in IT • Computed • ['googleplus/113375270405699485276', 'linkedin/timsmith78', 'googleplus/117829094399867770981', 'twitter/bbyxinnocenz', 'facebook/tim.tio.5', 'vimeo/user616297', 'linkedin/timtio', 'twitter/wbcsaint', 'twitter/turnitontim']
  • 33. Matching profiles using deep learning - Conclusions • Proposed a large dataset for matching online social networking profiles • This allowed us to train a deep neural network for profile matching using both domain-specific features and word embeddings generated from textual descriptions from social profiles • Experiments showed that the NN surpassed both unsupervised and supervised models, achieving a high precision (P = 0.95) with a good recall rate (R = 0.85) • Further advancements can be made by training more complex deep learning models, using recurrent or convolutional networks, and by adding features extracted from profile pictures
  • 34. Summing up… • Machine learning can be used to improve feature extraction and matching for people / company search • This requires a mix of NLP, ML, and information extraction techniques • Also datasets for training / fitting parameters and testing / validation • Improvements for matching were only marginally relevant using relationship extraction, however filtering/search benefits more • Deep learning is a solution for improving matching, especially for profiles that contain a larger textual description • We are moving from people data to company data, therefore we need to change / adapt the proposed methods and datasets
  • 36. Bootstrapped entity / relationship learning - Results • For other relations (worksAt, livesIn, studiedAt, worksAs) the seeds are generated from our index • Use a test set to compute precision and recall (would like large R, even with a smaller P)
  • 37. Discussions (if time) • A similar architecture has been proposed by Google (Covington et al., 2016) for recommending YouTube videos • However, we only found this work recently
  • 38. About me • Passionate about Machine Learning, Natural Language Processing and algorithms, with solid experience as an ML scientist both in companies and academia. • Teaching & mentoring (A&C @ UPB) to transmit some of my passion and interest to others. • Linkedin profile: https://www.linkedin.com/in/trebedea/ • Google Scholar: https://scholar.google.ro/citations?user=7NxaE1MAAAAJ&hl=en&oi=ao • Deep learning meetup (more scientific, less business) https://www.meetup.com/Bucharest-Deep-Learning/