Dato Confidential
Text Analysis with Machine Learning
• Chris DuBois
• Dato, Inc.
About me
Chris DuBois
Staff Data Scientist
Ph.D. in Statistics
Previously: probabilistic models of social networks and event data occurring over time.
Recently: tools for recommender systems, text analysis, feature engineering and model parameter search.
chris@dato.com
@chrisdubois
About Dato
We aim to help developers…
• develop valuable ML-based applications
• deploy models as intelligent microservices
Quick poll!
Agenda
• Applications of text analysis: examples and a demo
• Fundamentals
  • Text processing
  • Machine learning
  • Task-oriented tools
• Putting it all together
  • Deploying the pipeline for the application
• Quick survey of next steps
  • Advanced modeling techniques
Applications of text analysis
Product reviews and comments
User forums and customer service chats
Social media: blogs, tweets, etc.
News feeds
Speech recognition and chat bots
Aspect mining
Quick poll!
Example of a product sentiment application
Text processing fundamentals
Data
id   timestamp           user_name  text
102  2016-04-05 9:50:31  Bob        The food tasted bland and uninspired.
103  2016-04-06 3:20:25  Charles    I would absolutely bring my friends here. I could eat those fries every …
104  2016-04-08 11:52    Alicia     Too expensive for me. Even the water was too expensive.
…    …                   …          …
int  datetime            str        str
Transforming string columns
id   text
102  The food tasted bland and uninspired.
103  I would absolutely bring my friends here. I could eat those fries every …
104  Too expensive for me. Even the water was too expensive.
…    …
int  str
• Tokenizer
• CountWords
• NGramCounter
• TFIDF
• RemoveRareWords
• SentenceSplitter
• ExtractPartsOfSpeech
• stack/unstack
• Pipelines
Tokenization
Convert each document into a list of tokens.
INPUT: “The caesar salad was amazing.”
OUTPUT: [“The”, “caesar”, “salad”, “was”, “amazing.”]
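A minimal sketch of tokenization in plain Python, assuming whitespace splitting is enough to reproduce the slide's output. The function names `tokenize` and `tokenize_words` are hypothetical and are not part of GraphLab Create; the second variant shows one common alternative that lowercases and drops punctuation.

```python
import re


def tokenize(text):
    """Split on whitespace; punctuation stays attached to the token."""
    return text.split()


def tokenize_words(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("The caesar salad was amazing.")
words = tokenize_words("The caesar salad was amazing.")
print(tokens)
print(words)
```

Which variant is appropriate depends on whether case and punctuation carry signal for the task.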
Bag of words representation
Compute the number of times each word occurs.
INPUT: “The ice cream was the best.”
OUTPUT: {“the”: 2, “ice”: 1, “cream”: 1, “was”: 1, “best”: 1}
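The bag-of-words count can be sketched with the standard library. This assumes lowercasing and stripping punctuation before counting, which is what the slide's example implies ("The" and "the" are counted together); `bag_of_words` is a hypothetical helper, not a GraphLab Create function.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercased word occurs in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return dict(Counter(words))


bow = bag_of_words("The ice cream was the best.")
print(bow)
```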
N-Grams (words)
Compute the number of times each sequence of consecutive words occurs (word pairs in this example).
INPUT: “Give it away, give it away, give it away now”
OUTPUT: {“give it”: 3, “it away”: 3, “away give”: 2, “away now”: 1}
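A sketch of word n-gram counting, assuming the text is lowercased and punctuation is dropped first, as the example output suggests. The helper name `word_ngrams` and its `n` parameter are illustrative only.

```python
import re
from collections import Counter


def word_ngrams(text, n=2):
    """Count each run of n consecutive words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return dict(Counter(grams))


bigrams = word_ngrams("Give it away, give it away, give it away now")
print(bigrams)
```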
N-Grams (characters)
Compute the number of times each sequence of consecutive characters occurs (three-character sequences in this example).
INPUT: “mississippi”
OUTPUT: {“mis”: 1, “iss”: 2, “sis”: 1, “ssi”: 2, “sip”: 1, ...}
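The same idea at the character level, as a small sketch. `char_ngrams` is a hypothetical helper; it slides a window of `n` characters over the string and counts each window.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count each run of n consecutive characters in the text."""
    return dict(Counter(text[i:i + n] for i in range(len(text) - n + 1)))


trigrams = char_ngrams("mississippi")
print(trigrams)
```

Character n-grams tend to be more forgiving of misspellings than word counts, since a single typo changes only a few windows.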
TF-IDF representation
Rescale word counts to discriminate between common and distinctive words.
INPUT: {“this”: 1, “is”: 1, “a”: 2, “example”: 3}
OUTPUT: {“this”: .3, “is”: .1, “a”: .2, “example”: .9}
• Low scores for words that are common among all documents
• High scores for rare words occurring often in a document
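One common way to get this behaviour is to multiply each count by log(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below assumes that formula; the slide does not say which variant the toolkit uses, so the numbers it produces will not match the illustrative values above. The second document is made up for the example.

```python
import math


def tf_idf(docs):
    """docs: list of {word: count} dicts. Returns a list of {word: score} dicts."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for word in doc:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {word: count * math.log(n_docs / doc_freq[word]) for word, count in doc.items()}
        for doc in docs
    ]


docs = [
    {"this": 1, "is": 1, "a": 2, "example": 3},
    {"this": 1, "is": 1, "another": 1},  # hypothetical second document
]
scores = tf_idf(docs)
print(scores[0])
```

Words present in every document ("this", "is") score zero, while "example" scores highest because it is frequent in one document and absent from the other.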
Remove rare words
Remove rare words (or pre-defined stopwords).
INPUT: “I like green eggs3 and ham.”
OUTPUT: “I like green and ham.”
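A sketch of rare-word trimming over a small corpus. It assumes "rare" means appearing fewer than `threshold` times across all documents; the helper `trim_rare_words`, its parameters, and the two-document corpus are hypothetical.

```python
from collections import Counter


def trim_rare_words(docs, threshold=2, stopwords=()):
    """docs: list of token lists. Drop stopwords and tokens seen fewer
    than `threshold` times across the whole corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [tok for tok in doc if counts[tok] >= threshold and tok not in stopwords]
        for doc in docs
    ]


docs = [
    "i like green eggs3 and ham".split(),
    "i like green eggs and ham".split(),
]
trimmed = trim_rare_words(docs)
print(trimmed[0])
```

The typo "eggs3" appears only once in the corpus, so it is dropped, which mirrors the slide's example.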
Split by sentence
Convert each document into a list of sentences.
INPUT: “It was delicious. I will be back.”
OUTPUT: [“It was delicious.”, “I will be back.”]
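A deliberately naive sentence splitter, assuming sentences end in ".", "!" or "?" followed by whitespace. It will mis-split abbreviations such as "Dr. Smith"; `split_sentences` is a hypothetical helper, not the toolkit's splitter.

```python
import re


def split_sentences(text):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sentences = split_sentences("It was delicious. I will be back.")
print(sentences)
```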
Extract parts of speech
Identify words with particular parts of speech (adjectives in this example).
INPUT: “It was delicious. I will go back.”
OUTPUT: [“delicious”]
stack
Rearrange the data set to “stack” the list column.
INPUT:
id   text
102  [“The food was great.”, “I will be back.”]
103  [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]
104  [“Too expensive for me.”]
…    …
int  list
OUTPUT:
id   text
102  “The food was great.”
102  “I will be back.”
103  “I would absolutely bring my friends here.”
…    …
int  str
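The stack operation can be sketched on plain Python rows (a list of dicts standing in for the table). The function `stack` here is a hypothetical stand-in that emits one output row per list element and copies the other columns.

```python
def stack(rows, column):
    """Emit one row per element of the list stored in `column`."""
    return [dict(row, **{column: item}) for row in rows for item in row[column]]


rows = [
    {"id": 102, "text": ["The food was great.", "I will be back."]},
    {"id": 104, "text": ["Too expensive for me."]},
]
stacked = stack(rows, "text")
for row in stacked:
    print(row)
```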
unstack
Rearrange the data set to “unstack” the string column, the reverse of stack.
INPUT:
id   text
102  “The food was great.”
102  “I will be back.”
103  “I would absolutely bring my friends here.”
…    …
int  str
OUTPUT:
id   text
102  [“The food was great.”, “I will be back.”]
103  [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]
104  [“Too expensive for me.”]
…    …
int  list
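Unstacking groups rows that share a key and collects one column into a list. The sketch below uses plain dicts; `unstack` and its parameters are hypothetical names, and it relies on dicts preserving insertion order (Python 3.7+).

```python
def unstack(rows, key, column):
    """Group rows by `key` and collect the values of `column` into a list."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row[column])
    return [{key: k, column: values} for k, values in grouped.items()]


rows = [
    {"id": 102, "text": "The food was great."},
    {"id": 102, "text": "I will be back."},
    {"id": 104, "text": "Too expensive for me."},
]
unstacked = unstack(rows, "id", "text")
for row in unstacked:
    print(row)
```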
Pipelines
Combine multiple transformations to create a single pipeline.
import graphlab as gl
from graphlab.feature_engineering import WordCounter, RareWordTrimmer
# Chain transformer instances (not classes); sf is the training SFrame.
f = gl.feature_engineering.create(sf,
[WordCounter(), RareWordTrimmer()])
f.fit(data)
f.transform(new_data)
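To show what chaining fit and transform steps means, here is a library-independent sketch. The classes `CountWords`, `TrimRareWords`, and `Pipeline` are hypothetical stand-ins written for this example; they are not the GraphLab Create transformers.

```python
from collections import Counter


class CountWords:
    """Stateless step: turn each string into a {word: count} dict."""

    def fit(self, docs):
        return self

    def transform(self, docs):
        return [dict(Counter(doc.lower().split())) for doc in docs]


class TrimRareWords:
    """Stateful step: remember which words were frequent at fit time."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.keep = set()

    def fit(self, bags):
        totals = Counter()
        for bag in bags:
            totals.update(bag)  # adds the per-document counts
        self.keep = {w for w, c in totals.items() if c >= self.threshold}
        return self

    def transform(self, bags):
        return [{w: c for w, c in bag.items() if w in self.keep} for bag in bags]


class Pipeline:
    """Fit each step on the output of the previous one, then replay on new data."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, data):
        for step in self.steps:
            data = step.fit(data).transform(data)
        return self

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


pipeline = Pipeline([CountWords(), TrimRareWords(threshold=2)])
pipeline.fit(["the food was bland", "the fries were great", "the water was expensive"])
features = pipeline.transform(["the pizza was great"])
print(features)
```

The point of the design is that vocabulary learned during fit is reused unchanged on new data, so training and serving see the same features.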
Machine learning toolkits
Task-oriented toolkits and building blocks
Autotagger: matches unstructured text queries to a set of strings
Autotagger = Nearest neighbors + NGramCounter
Nearest neighbors
Reference: id (int), TF-IDF of text (dict)
Query: id (int), TF-IDF of text (dict)
Model: NearestNeighborsModel
Result: query_id (int), reference_id (int), score (float), rank (int)
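A brute-force sketch of nearest-neighbor lookup over sparse TF-IDF dictionaries. It assumes cosine similarity as the score (the slide does not say which measure the model uses) and returns rows shaped like the result table above; `cosine` and `nearest_neighbors` are hypothetical helpers, and the vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_neighbors(reference, queries, k=1):
    """Return the k most similar reference rows for every query row."""
    results = []
    for query_id, query_vec in queries.items():
        scored = [(cosine(query_vec, ref_vec), ref_id) for ref_id, ref_vec in reference.items()]
        scored.sort(key=lambda pair: -pair[0])  # highest similarity first
        for rank, (score, ref_id) in enumerate(scored[:k], start=1):
            results.append(
                {"query_id": query_id, "reference_id": ref_id, "score": score, "rank": rank}
            )
    return results


reference = {1: {"fries": 2.0, "great": 1.0}, 2: {"water": 1.0, "expensive": 2.0}}
queries = {10: {"expensive": 1.0, "water": 1.0}}
result = nearest_neighbors(reference, queries, k=2)
for row in result:
    print(row)
```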
Autotagger = Nearest Neighbors + Feature Engineering
Reference: id (int), char ngrams (dict), …
Query: id (int), char ngrams (dict), …
Model: AutotaggerModel
Result: query_id (int), reference_id (int), score (float), rank (int)
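A rough sketch of the autotagging idea: represent the query and each candidate string by its character trigrams, then pick the closest candidate. Jaccard overlap of trigram sets is used here as a simple stand-in for the similarity measure, which the slides do not specify; `trigram_set`, `autotag`, and the tag list are hypothetical.

```python
def trigram_set(text):
    """Set of lowercase character trigrams in the text."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def autotag(query, tags, k=1):
    """Return the k tags whose trigram sets overlap most with the query."""
    query_grams = trigram_set(query)
    scored = []
    for tag in tags:
        tag_grams = trigram_set(tag)
        union = query_grams | tag_grams
        score = len(query_grams & tag_grams) / len(union) if union else 0.0
        scored.append((score, tag))
    scored.sort(key=lambda pair: -pair[0])  # best match first
    return scored[:k]


tags = ["customer service", "billing", "cable box"]
best = autotag("problem with my biling statement", tags)
print(best)
```

Because matching happens on character trigrams, the misspelled "biling" in the query still lands on the "billing" tag.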
Task-oriented toolkits and building blocks
SentimentAnalysis: predicts positive/negative sentiment in unstructured text
SentimentAnalysis = WordCounter + Logistic Classifier
Logistic classifier
Training data: label (int/str), feature(s) (int/float/str/dict/list)
New data: label (int/str), feature(s) (int/float/str/dict/list)
Model: LogisticClassifier
Output: scores (float), the probability that label = 1
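The scoring step of a logistic classifier is a weighted sum of features passed through the logistic function. The sketch below assumes dictionary features and already-learned weights; `predict_probability` and the example weights are hypothetical.

```python
import math


def predict_probability(features, weights, bias=0.0):
    """Probability that label = 1 for one row of {feature_name: value}."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


weights = {"great": 2.0, "bland": -2.0}  # made-up weights for illustration
p_positive = predict_probability({"great": 1, "food": 1}, weights)
p_negative = predict_probability({"bland": 1, "food": 1}, weights)
print(p_positive, p_negative)
```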
Sentiment analysis = LogisticClassifier + Feature Engineering
Training data: label (int/str), bag of words (dict)
New data: label (int/str), bag of words (dict)
Model: SentimentAnalysis
Output: sentiment scores (float), the probability that label = 1
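Putting the two pieces together, a toy sentiment model can be sketched as bag-of-words features fed to logistic regression trained by stochastic gradient ascent on the log-likelihood. Everything here (function names, learning rate, epoch count, the four training sentences) is an assumption made for illustration, not how the toolkit trains its model.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(texts, labels, epochs=200, lr=0.5):
    """Fit per-word weights and a bias with plain stochastic gradient steps."""
    weights, bias = {}, 0.0
    bags = [bag_of_words(text) for text in texts]
    for _ in range(epochs):
        for bag, label in zip(bags, labels):
            z = bias + sum(weights.get(w, 0.0) * c for w, c in bag.items())
            error = label - sigmoid(z)  # gradient of the log-likelihood
            bias += lr * error
            for word, count in bag.items():
                weights[word] = weights.get(word, 0.0) + lr * error * count
    return weights, bias


def score(text, weights, bias):
    """Sentiment score: estimated probability that label = 1."""
    bag = bag_of_words(text)
    return sigmoid(bias + sum(weights.get(w, 0.0) * c for w, c in bag.items()))


texts = ["the food was great", "great fries", "the food was bland", "too expensive"]
labels = [1, 1, 0, 0]
weights, bias = train(texts, labels)
print(score("great fries", weights, bias))
print(score("too expensive", weights, bias))
```

With so little data the weights only reflect which words appeared with which label; a real model needs far more reviews and some regularization.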
Product sentiment toolkit
>>> m = gl.product_sentiment.create(sf, features=['Description'], splitby='sentence')
>>> m.sentiment_summary(['billing', 'cable', 'cost', 'late', 'charges', 'slow'])
+---------+----------------+-----------------+--------------+
| keyword | sd_sentiment | mean_sentiment | review_count |
+---------+----------------+-----------------+--------------+
| cable | 0.302471264675 | 0.285512408978 | 1618 |
| slow | 0.282117103769 | 0.243490314737 | 369 |
| cost | 0.283310577512 | 0.197087019219 | 291 |
| charges | 0.164350792173 | 0.0853637431588 | 1412 |
| late | 0.119163914305 | 0.0712757752753 | 2202 |
| billing | 0.159655783707 | 0.0697454360245 | 583 |
+---------+----------------+-----------------+--------------+
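The shape of this summary can be approximated in plain Python: given sentences that already carry a sentiment score, collect the scores of sentences mentioning each keyword and report their mean, standard deviation, and count. This is only a guess at the aggregation; the slides do not describe how the toolkit computes these columns, and the helper name and sample data below are made up.

```python
from statistics import mean, pstdev


def sentiment_summary(scored_sentences, keywords):
    """scored_sentences: list of (sentence, score) pairs with score in [0, 1]."""
    rows = []
    for keyword in keywords:
        scores = [s for text, s in scored_sentences if keyword in text.lower().split()]
        if scores:
            rows.append(
                {
                    "keyword": keyword,
                    "sd_sentiment": pstdev(scores),
                    "mean_sentiment": mean(scores),
                    "review_count": len(scores),  # here: number of matching sentences
                }
            )
    return sorted(rows, key=lambda row: -row["mean_sentiment"])


scored = [
    ("the cable is slow", 0.2),
    ("cable works fine", 0.8),
    ("billing was late", 0.1),
]
summary = sentiment_summary(scored, ["billing", "cable"])
for row in summary:
    print(row)
```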
Code for deploying the demo
Advanced topics in modeling text data
Topic models
Cluster words according to their co-occurrence in documents.
Word2vec
Embeddings where nearby words are used in similar contexts.
Image: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
LSTM (Long Short-Term Memory)
Train a neural network to predict the next word given previous words.
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Conclusion
• Applications of text analysis: examples and a demo
• Fundamentals
  • Text processing
  • Machine learning
  • Task-oriented tools
• Putting it all together
  • Deploying the pipeline for the application
• Quick survey of next steps
  • Advanced modeling techniques
Next steps
Webinar material? Watch for an upcoming email!
Help getting started? Userguide, API docs, Forum
More webinars? Benchmarking ML, data mining, and more
Contact: Chris DuBois, chris@dato.com
43

More Related Content

What's hot

Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualizationDr. Hamdan Al-Sabri
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 
Synchronization in distributed computing
Synchronization in distributed computingSynchronization in distributed computing
Synchronization in distributed computingSVijaylakshmi
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learningHaris Jamil
 
Object Relational Database Management System(ORDBMS)
Object Relational Database Management System(ORDBMS)Object Relational Database Management System(ORDBMS)
Object Relational Database Management System(ORDBMS)Rabin BK
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptxmaha797959
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysisKrish_ver2
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
 
Exploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubExploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubMartin Bago
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsYoung-Geun Choi
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATAGauravBiswas9
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms Hakky St
 

What's hot (20)

Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Clustering
ClusteringClustering
Clustering
 
Synchronization in distributed computing
Synchronization in distributed computingSynchronization in distributed computing
Synchronization in distributed computing
 
Fuzzy c means manual work
Fuzzy c means manual workFuzzy c means manual work
Fuzzy c means manual work
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
Object Relational Database Management System(ORDBMS)
Object Relational Database Management System(ORDBMS)Object Relational Database Management System(ORDBMS)
Object Relational Database Management System(ORDBMS)
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Rough K Means - Numerical Example
Rough K Means - Numerical ExampleRough K Means - Numerical Example
Rough K Means - Numerical Example
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
 
Lecture13 - Association Rules
Lecture13 - Association RulesLecture13 - Association Rules
Lecture13 - Association Rules
 
Exploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubExploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science Club
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 

Viewers also liked

Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsTuri, Inc.
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big DataMax Lin
 
Presentacion de educación en México
Presentacion de educación en MéxicoPresentacion de educación en México
Presentacion de educación en Méxicombrionessauceda
 
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management
Quantifying the Performance of Garbage Collection vs. Explicit Memory ManagementQuantifying the Performance of Garbage Collection vs. Explicit Memory Management
Quantifying the Performance of Garbage Collection vs. Explicit Memory ManagementEmery Berger
 
Empowerment Awareness
Empowerment AwarenessEmpowerment Awareness
Empowerment Awarenessaltonbaird
 
Teoriaelctromagneticanocambio
TeoriaelctromagneticanocambioTeoriaelctromagneticanocambio
TeoriaelctromagneticanocambioRnr Navarro R
 
C. V Hanaa Ahmed
C. V Hanaa AhmedC. V Hanaa Ahmed
C. V Hanaa AhmedHanaa Ahmed
 
Η Σπάρτη
Η ΣπάρτηΗ Σπάρτη
Η Σπάρτηvasso76
 
LISTADO OPOSICION + MERITOS
LISTADO OPOSICION + MERITOSLISTADO OPOSICION + MERITOS
LISTADO OPOSICION + MERITOSguest0dbad523
 

Viewers also liked (17)

Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning Toolkits
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big Data
 
Presentacion de educación en México
Presentacion de educación en MéxicoPresentacion de educación en México
Presentacion de educación en México
 
LinkedIn for Beginners
LinkedIn for BeginnersLinkedIn for Beginners
LinkedIn for Beginners
 
Goal Centre e-bulletin Feb 2015
Goal Centre e-bulletin Feb 2015Goal Centre e-bulletin Feb 2015
Goal Centre e-bulletin Feb 2015
 
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management
Quantifying the Performance of Garbage Collection vs. Explicit Memory ManagementQuantifying the Performance of Garbage Collection vs. Explicit Memory Management
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management
 
Empowerment Awareness
Empowerment AwarenessEmpowerment Awareness
Empowerment Awareness
 
Ehealthhoy
EhealthhoyEhealthhoy
Ehealthhoy
 
Teoriaelctromagneticanocambio
TeoriaelctromagneticanocambioTeoriaelctromagneticanocambio
Teoriaelctromagneticanocambio
 
C. V Hanaa Ahmed
C. V Hanaa AhmedC. V Hanaa Ahmed
C. V Hanaa Ahmed
 
Η Σπάρτη
Η ΣπάρτηΗ Σπάρτη
Η Σπάρτη
 
Trabajo
TrabajoTrabajo
Trabajo
 
Cover Diari de Girona
Cover Diari de GironaCover Diari de Girona
Cover Diari de Girona
 
Taller de primer semestre
Taller de primer semestreTaller de primer semestre
Taller de primer semestre
 
certificate
certificatecertificate
certificate
 
LISTADO OPOSICION + MERITOS
LISTADO OPOSICION + MERITOSLISTADO OPOSICION + MERITOS
LISTADO OPOSICION + MERITOS
 
Publicidad
PublicidadPublicidad
Publicidad
 

Similar to Text Analysis with Machine Learning

Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Mining the social web ch1
Mining the social web ch1Mining the social web ch1
Mining the social web ch1HyeonSeok Choi
 
It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.Alex Powers
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeWim Godden
 
Xomia_20220602.pptx
Xomia_20220602.pptxXomia_20220602.pptx
Xomia_20220602.pptxLonghow Lam
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017Toria Gibbs
 
Redis for your boss 2.0
Redis for your boss 2.0Redis for your boss 2.0
Redis for your boss 2.0Elena Kolevska
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPlotly
 
Amir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseAmir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseit-people
 
Powershell for Log Analysis and Data Crunching
 Powershell for Log Analysis and Data Crunching Powershell for Log Analysis and Data Crunching
Powershell for Log Analysis and Data CrunchingMichelle D'israeli
 
"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)Portland R User Group
 
從行動廣告大數據觀點談 Big data 20150916
從行動廣告大數據觀點談 Big data   20150916從行動廣告大數據觀點談 Big data   20150916
從行動廣告大數據觀點談 Big data 20150916Craig Chao
 
Data manipulation and visualization in r 20190711 myanmarucsy
Data manipulation and visualization in r 20190711 myanmarucsyData manipulation and visualization in r 20190711 myanmarucsy
Data manipulation and visualization in r 20190711 myanmarucsySmartHinJ
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisTorsten Steinbach
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightMatthew Russell
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insightDigital Reasoning
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go ProgrammingLin Yo-An
 

Similar to Text Analysis with Machine Learning (20)

Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Mining the social web ch1
Mining the social web ch1Mining the social web ch1
Mining the social web ch1
 
It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
 
Xomia_20220602.pptx
Xomia_20220602.pptxXomia_20220602.pptx
Xomia_20220602.pptx
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017
 
Redis for your boss 2.0
Redis for your boss 2.0Redis for your boss 2.0
Redis for your boss 2.0
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
 
Amir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseAmir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's database
 
Powershell for Log Analysis and Data Crunching
 Powershell for Log Analysis and Data Crunching Powershell for Log Analysis and Data Crunching
Powershell for Log Analysis and Data Crunching
 
"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)
 
從行動廣告大數據觀點談 Big data 20150916
從行動廣告大數據觀點談 Big data   20150916從行動廣告大數據觀點談 Big data   20150916
從行動廣告大數據觀點談 Big data 20150916
 
Data manipulation and visualization in r 20190711 myanmarucsy
Data manipulation and visualization in r 20190711 myanmarucsyData manipulation and visualization in r 20190711 myanmarucsy
Data manipulation and visualization in r 20190711 myanmarucsy
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and Insight
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insight
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go Programming
 
Ggplot2 v3
Ggplot2 v3Ggplot2 v3
Ggplot2 v3
 

More from Turi, Inc.

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing VideoTuri, Inc.
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission RiskTuri, Inc.
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Turi, Inc.
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Turi, Inc.
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Turi, Inc.
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Turi, Inc.
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataTuri, Inc.
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesTuri, Inc.
 
Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinTuri, Inc.
 
Scalable data structures for data science
Scalable data structures for data scienceScalable data structures for data science
Scalable data structures for data scienceTuri, Inc.
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Turi, Inc.
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender SystemsTuri, Inc.
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringTuri, Inc.
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with DatoTuri, Inc.
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Turi, Inc.
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTuri, Inc.
 

More from Turi, Inc. (20)

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing Video
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission Risk
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log Data
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive Services
 
Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos Guestrin
 
Scalable data structures for data science
Scalable data structures for data scienceScalable data structures for data science
Scalable data structures for data science
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender Systems
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature Engineering
 
SFrame
SFrameSFrame
SFrame
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with Dato
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
 

Text Analysis with Machine Learning

  • 1. Dato Confidential. Text Analysis with Machine Learning. Chris DuBois, Dato, Inc.
  • 2. About me: Chris DuBois, Staff Data Scientist, Ph.D. in Statistics. Previously: probabilistic models of social networks and event data occurring over time. Recently: tools for recommender systems, text analysis, feature engineering, and model parameter search. Contact: chris@dato.com, @chrisdubois
  • 3. About Dato: We aim to help developers develop valuable ML-based applications and deploy models as intelligent microservices.
  • 5. Agenda: Applications of text analysis: examples and a demo; Fundamentals; Text processing; Machine learning; Task-oriented tools; Putting it all together; Deploying the pipeline for the application; Quick survey of next steps; Advanced modeling techniques.
  • 7. Applications of text analysis: Product reviews and comments
  • 8. Applications of text analysis: User forums and customer service chats
  • 9. Applications of text analysis: Social media: blogs, tweets, etc.
  • 11. Applications of text analysis: Speech recognition and chat bots
  • 14. Example of a product sentiment application
  • 16. Data. Columns: id (int), timestamp (datetime), user_name (str), text (str).
      102 | 2016-04-05 9:50:31 | Bob | The food tasted bland and uninspired.
      103 | 2016-04-06 3:20:25 | Charles | I would absolutely bring my friends here. I could eat those fries every …
      104 | 2016-04-08 11:52 | Alicia | Too expensive for me. Even the water was too expensive.
      … | … | … | …
  • 17. Transforming string columns. Columns: id (int), text (str).
      102 | The food tasted bland and uninspired.
      103 | I would absolutely bring my friends here. I could eat those fries every …
      104 | Too expensive for me. Even the water was too expensive.
      … | …
      Transformations: Tokenizer, CountWords, NGramCounter, TFIDF, RemoveRareWords, SentenceSplitter, ExtractPartsOfSpeech, stack/unstack, Pipelines.
  • 18. Tokenization: Convert each document into a list of tokens. INPUT: “The caesar salad was amazing.” OUTPUT: [“The”, “caesar”, “salad”, “was”, “amazing.”]
  • 19. Bag of words representation: Compute the number of times each word occurs. INPUT: “The ice cream was the best.” OUTPUT: { “the”: 2, “ice”: 1, “cream”: 1, “was”: 1, “best”: 1 }
  • 20. N-Grams (words): Compute the number of times each sequence of consecutive words occurs (word pairs in this example). INPUT: “Give it away, give it away, give it away now” OUTPUT: { “give it”: 3, “it away”: 3, “away give”: 2, “away now”: 1 }
  • 21. N-Grams (characters): Compute the number of times each sequence of consecutive characters occurs (three characters in this example). INPUT: “mississippi” OUTPUT: { “mis”: 1, “iss”: 2, “sis”: 1, “ssi”: 2, “sip”: 1, ... }
  • 22. TF-IDF representation: Rescale word counts to discriminate between common and distinctive words. INPUT: {“this”: 1, “is”: 1, “a”: 2, “example”: 3} OUTPUT: {“this”: .3, “is”: .1, “a”: .2, “example”: .9} Low scores for words that are common among all documents. High scores for rare words occurring often in a document.
  • 23. Remove rare words: Remove rare words (or pre-defined stopwords). INPUT: “I like green eggs3 and ham.” OUTPUT: “I like green and ham.”
  • 24. Split by sentence: Convert each document into a list of sentences. INPUT: “It was delicious. I will be back.” OUTPUT: [“It was delicious.”, “I will be back.”]
  • 25. Extract parts of speech: Identify particular parts of speech. INPUT: “It was delicious. I will go back.” OUTPUT: [“delicious”]
  • 26. stack: Rearrange the data set to “stack” the list column.
      Input (id: int, text: list):
      102 | [“The food was great.”, “I will be back.”]
      103 | [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]
      104 | [“Too expensive for me.”]
      … | …
      Output (id: int, text: str):
      102 | “The food was great.”
      102 | “I will be back.”
      103 | “I would absolutely bring my friends here.”
      … | …
  • 27. unstack: Rearrange the data set to “unstack” the string column (the reverse of stack).
      Input (id: int, text: str):
      102 | “The food was great.”
      102 | “I will be back.”
      103 | “I would absolutely bring my friends here.”
      … | …
      Output (id: int, text: list):
      102 | [“The food was great.”, “I will be back.”]
      103 | [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]
      104 | [“Too expensive for me.”]
      … | …
  • 28. Pipelines: Combine multiple transformations to create a single pipeline.
      import graphlab as gl
      from graphlab.feature_engineering import WordCounter, RareWordTrimmer
      # Chain the two transformers (passed as instances) into one pipeline
      f = gl.feature_engineering.create(sf, [WordCounter(), RareWordTrimmer()])
      f.fit(data)
      f.transform(new_data)
  • 30. Task-oriented toolkits and building blocks: Autotagger matches unstructured text queries to a set of strings. Autotagger = Nearest neighbors + NGramCounter
  • 31. Nearest neighbors: Reference data and Query data each have columns id (int) and TF-IDF of text (dict). Both feed a NearestNeighborsModel. Result columns: query_id (int), reference_id (int), score (float), rank (int).
  • 32. Autotagger = Nearest Neighbors + Feature Engineering: Reference data and Query data each have columns id (int) and char ngrams (dict). Both feed an AutotaggerModel. Result columns: query_id (int), reference_id (int), score (float), rank (int).
  • 33. Task-oriented toolkits and building blocks: SentimentAnalysis predicts positive/negative sentiment in unstructured text. SentimentAnalysis = WordCounter + Logistic Classifier
  • 34. Logistic classifier: Training data and New data each have columns label (int/str) and feature(s) (int/float/str/dict/list). Training data fits a LogisticClassifier; for New data it returns scores (float): the probability that label = 1.
  • 35. Sentiment analysis = LogisticClassifier + Feature Engineering: Training data and New data each have columns label (int/str) and bag of words (dict). Training data fits a SentimentAnalysis model; for New data it returns sentiment scores (float): the probability that label = 1.
  • 36. Product sentiment toolkit
      >>> m = gl.product_sentiment.create(sf, features=['Description'], splitby='sentence')
      >>> m.sentiment_summary(['billing', 'cable', 'cost', 'late', 'charges', 'slow'])
      +---------+----------------+-----------------+--------------+
      | keyword | sd_sentiment   | mean_sentiment  | review_count |
      +---------+----------------+-----------------+--------------+
      | cable   | 0.302471264675 | 0.285512408978  | 1618         |
      | slow    | 0.282117103769 | 0.243490314737  | 369          |
      | cost    | 0.283310577512 | 0.197087019219  | 291          |
      | charges | 0.164350792173 | 0.0853637431588 | 1412         |
      | late    | 0.119163914305 | 0.0712757752753 | 2202         |
      | billing | 0.159655783707 | 0.0697454360245 | 583          |
      +---------+----------------+-----------------+--------------+
  • 37. Code for deploying the demo
  • 38. Advanced topics in modeling text data
  • 39. Topic models: Cluster words according to their co-occurrence in documents.
  • 40. Word2vec: Embeddings where nearby words are used in similar contexts. Image: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
  • 41. LSTM (Long Short-Term Memory): Train a neural network to predict the next word given previous words. Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  • 42. Conclusion: Applications of text analysis: examples and a demo; Fundamentals; Text processing; Machine learning; Task-oriented tools; Putting it all together; Deploying the pipeline for the application; Quick survey of next steps; Advanced modeling techniques.
  • 43. Next steps: Webinar material? Watch for an upcoming email! Help getting started? Userguide, API docs, Forum. More webinars? Benchmarking ML, data mining, and more. Contact: Chris DuBois, chris@dato.com
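
The text-processing steps on slides 18 to 22 (tokenization, bag of words, word and character n-grams, TF-IDF) can be sketched in plain Python. This is a minimal illustration, not the GraphLab Create implementation: the function names are hypothetical, the tokenizer is assumed to lowercase and drop punctuation (slide 18's output keeps both), and the TF-IDF weighting shown (count times log of N over document frequency) is one common variant chosen here as an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Count how many times each token occurs in the text."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count each run of n consecutive tokens."""
    tokens = tokenize(text)
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def char_ngrams(text, n=3):
    """Count each run of n consecutive characters."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def tf_idf(docs):
    """Rescale word counts by log(N / document frequency).

    `docs` is a list of bag-of-words dicts. A word that appears in
    every document gets a score of 0.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in doc)
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in doc.items()}
        for doc in docs
    ]


if __name__ == "__main__":
    print(bag_of_words("The ice cream was the best."))
    print(word_ngrams("Give it away, give it away, give it away now"))
    print(char_ngrams("mississippi"))
    print(tf_idf([bag_of_words("The food was bland."),
                  bag_of_words("The fries, the fries!")]))
```

With this weighting, "the" scores 0 in the last example because it appears in both documents, while "fries" scores highest because it is frequent in one document and absent from the other.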

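The stack and unstack operations on slides 26 and 27 can be illustrated on a list of row dictionaries. This is a sketch under stated assumptions: `stack` and `unstack` here are hypothetical plain-Python helpers written for illustration, not the SFrame methods, and rows are assumed to be dicts keyed by column name.

```python
from collections import defaultdict


def stack(rows, column):
    """Emit one output row per element of the list stored in `column`."""
    stacked = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)      # copy the other columns unchanged
            new_row[column] = item   # replace the list with one element
            stacked.append(new_row)
    return stacked


def unstack(rows, key, column):
    """Group rows by `key` and collect each group's `column` values in a list."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[column])
    # dicts keep insertion order, so groups appear in first-seen order
    return [{key: k, column: values} for k, values in grouped.items()]


if __name__ == "__main__":
    reviews = [
        {"id": 102, "text": ["The food was great.", "I will be back."]},
        {"id": 104, "text": ["Too expensive for me."]},
    ]
    sentences = stack(reviews, "text")
    for row in sentences:
        print(row)
    print(unstack(sentences, "id", "text") == reviews)
```

Stacking after sentence splitting gives one row per sentence, which is the shape a per-sentence sentiment model needs; unstacking by `id` restores one row per review.
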
Editor's Notes

  1. We aim to help developers create and grow business value with ML. Immediately create a business application using our fully customizable toolkits, and continuously and easily enhance it: business-task-centered ML toolkits (recommenders, churn prediction, lead scoring, …), not just ML algorithms like boosted trees and deep learning; automated model selection, parameter optimization, and evaluation; painless feature engineering. Delightful user experience for developers: use SFrames to manipulate multiple data types, and easily create features with pre-defined transformations or create your own; 50+ robust ML algorithms where out-of-core computation eliminates scale pains. Deploy models as intelligent microservices: easily compose intelligent applications from hosted microservices, using the customer’s own infrastructure or AWS; host any model; performant for production; model experimentation, monitoring, and management.