Revolutionize Text Mining with Spark and Zeppelin

April 2017
Yanbo Liang
Apache Spark committer
Software engineer @ Hortonworks
© Hortonworks Inc. 2011–2016. All Rights Reserved

Agenda
• Text mining workflow on Big Data
• Text mining with Spark and MLlib
• Spark and Zeppelin as the platform

Text Mining: Practical Applications
• Text classification
  – Spam filtering
  – Fraud detection
• Text clustering
• Sentiment analysis
• Entity extraction
• Recommendations
• Automatic labeling
• Contextual advertising

Traditional Text Mining
• Commercial software
  – IBM SPSS, RapidMiner, SAS
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R

Text Mining on Big Data
[Diagram: Data Scientists, Software engineers]

Why Apache Spark MLlib
• Scalable machine learning algorithms on top of Spark
  – Alternating Least Squares on Spotify data
    • 50+ million users x 30+ million songs, 50 billion ratings
    • For rank 10 with 10 iterations, ~1 hour running time
• Workflow utilities
  – ML pipeline
  – Model import/export
  – Cross validation

Text Mining workflow
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
Phases: Data Science, Software engineering

Load data
Text                      Label
I bought the game…        4
Do NOT bother try…        1
This shirt is awesome…    5
never got it. Seller…     1
I ordered this to…        3
Stages: Dataset → Feature engineering → Model training → Model evaluation

Extract features
Text                      Label   Words                 Features
I bought the game…        4       “i”, “bought”, …      [1, 0, 3, 9, …]
Do NOT bother try…        1       “do”, “not”, …        [0, 0, 11, 0, …]
This shirt is awesome…    5       “this”, “shirt”, …    [0, 2, 3, 1, …]
never got it. Seller…     1       “never”, “got”, …     [1, 2, 0, 0, …]
I ordered this to…        3       “i”, “ordered”, …     [1, 0, 0, 3, …]
Stages: Dataset → Feature engineering → Model training → Model evaluation
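The Words and Features columns above can be illustrated with a minimal plain-Python sketch. It is not Spark code: the function names `tokenize`, `build_vocabulary` and `count_vector` are hypothetical, and the regex split only approximates what a tokenizer stage followed by a count-based vectorizer would do.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def build_vocabulary(token_lists):
    """Assign each distinct token a column index, in alphabetical order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: index for index, tok in enumerate(vocab)}


def count_vector(tokens, vocab):
    """Return one count per vocabulary entry; unknown tokens are ignored."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]


# Hypothetical two-document corpus used only for illustration.
corpus = [tokenize("never got it. Seller"), tokenize("got it")]
vocabulary = build_vocabulary(corpus)
features = [count_vector(tokens, vocabulary) for tokens in corpus]
```

Each document becomes a fixed-length vector whose length equals the vocabulary size, which is the shape the model training stage expects.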

Fit a model
Text                      Label   Words                 Features            Probability   Prediction
I bought the game…        4       “i”, “bought”, …      [1, 0, 3, 9, …]     0.84
Do NOT bother try…        1       “do”, “not”, …        [0, 0, 11, 0, …]    0.62
This shirt is awesome…    5       “this”, “shirt”, …    [0, 2, 3, 1, …]     0.95
never got it. Seller…     1       “never”, “got”, …     [1, 2, 0, 0, …]     0.71
I ordered this to…        3       “i”, “ordered”, …     [1, 0, 0, 3, …]     0.74
Stages: Dataset → Feature engineering → Model training → Model evaluation

Evaluate
Text                      Label   Words                 Features            Probability   Prediction
I bought the game…        4       “i”, “bought”, …      [1, 0, 3, 9, …]     0.84
Do NOT bother try…        1       “do”, “not”, …        [0, 0, 11, 0, …]    0.62
This shirt is awesome…    5       “this”, “shirt”, …    [0, 2, 3, 1, …]     0.95
never got it. Seller…     1       “never”, “got”, …     [1, 2, 0, 0, …]     0.71
I ordered this to…        3       “i”, “ordered”, …     [1, 0, 0, 3, …]     0.74
Stages: Dataset → Feature engineering → Model training → Model evaluation

Key abstractions of the Spark ML pipeline
• Transformer
  – Feature transformers (e.g., HashingTF) and trained ML models (e.g., NaiveBayesModel).
• Estimator
  – ML algorithms for training models (e.g., NaiveBayes).
• Evaluator
  – Evaluates predictions and computes metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).
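To make the three roles concrete, here is a small plain-Python sketch of the same contract. It is an illustration, not Spark's API: rows are plain dicts instead of DataFrame rows, and `Tokenizer`, `MajorityLabelClassifier`, `MajorityLabelModel` and `AccuracyEvaluator` are hypothetical stand-ins chosen for brevity.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Transformer(ABC):
    """Maps a dataset to a new dataset, usually by adding a column."""
    @abstractmethod
    def transform(self, rows): ...


class Estimator(ABC):
    """Learns from a dataset and returns a fitted Transformer (the model)."""
    @abstractmethod
    def fit(self, rows): ...


class Evaluator(ABC):
    """Reduces a dataset of predictions to a single metric."""
    @abstractmethod
    def evaluate(self, rows): ...


class Tokenizer(Transformer):
    def transform(self, rows):
        # Add a "words" column without mutating the input rows.
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class MajorityLabelModel(Transformer):
    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        # Predict the same label for every row.
        return [dict(r, prediction=self.label) for r in rows]


class MajorityLabelClassifier(Estimator):
    def fit(self, rows):
        label, _ = Counter(r["label"] for r in rows).most_common(1)[0]
        return MajorityLabelModel(label)


class AccuracyEvaluator(Evaluator):
    def evaluate(self, rows):
        correct = sum(1 for r in rows if r["prediction"] == r["label"])
        return correct / len(rows)


# Labels taken from the example table in the slides; texts stay truncated.
data = [
    {"text": "I bought the game…", "label": 4},
    {"text": "Do NOT bother try…", "label": 1},
    {"text": "This shirt is awesome…", "label": 5},
    {"text": "never got it. Seller…", "label": 1},
    {"text": "I ordered this to…", "label": 3},
]
tokenized = Tokenizer().transform(data)
model = MajorityLabelClassifier().fit(tokenized)
accuracy = AccuracyEvaluator().evaluate(model.transform(tokenized))
```

The key point is that a fitted model is itself a Transformer, so trained models and feature transformers can be chained the same way.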

Spark’s Text Mining algorithms
• LDA for topic modeling
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer: turns documents into vectors based on word counts
• HashingTF and IDF: weight the words of a document by their importance with respect to the corpus
• And much more
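A rough plain-Python sketch of the hashing-TF and IDF idea follows. Two assumptions to flag: `zlib.crc32` stands in for the hash function (Spark uses a different one), and the function names are hypothetical. The smoothed formula log((m + 1) / (df + 1)), with m the number of documents, is the one documented for MLlib's IDF.

```python
import math
import zlib


def hashing_tf(tokens, num_features=16):
    """Count tokens into a fixed-size vector using a hash as the index."""
    vector = [0] * num_features
    for tok in tokens:
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        vector[index] += 1  # colliding tokens share a bucket
    return vector


def fit_idf(tf_vectors):
    """Compute smoothed inverse document frequency per feature index."""
    num_docs = len(tf_vectors)
    num_features = len(tf_vectors[0])
    doc_freq = [
        sum(1 for vec in tf_vectors if vec[j] > 0) for j in range(num_features)
    ]
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]


def tf_idf(tf_vector, idf):
    """Scale each term frequency by the IDF weight of its index."""
    return [tf * weight for tf, weight in zip(tf_vector, idf)]
```

A term that appears in every document gets weight log(1) = 0, so only terms that distinguish documents contribute to the features. Hashing avoids building a vocabulary, at the cost of occasional collisions.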

MLlib Text Mining Pipeline - classification
[Diagram] Dataset → RegexTokenizer → StopWordsRemover → CountVectorizer or HashingTF → IDF → StringIndexer → NaiveBayes, LogisticRegression, SVM or MLP → text classification

MLlib Text Mining Pipeline - topic model
[Diagram] Dataset → RegexTokenizer → StopWordsRemover → CountVectorizer or HashingTF → IDF → LDA → topic model

MLlib Text Mining Pipeline - recommendation
[Diagram] Dataset → RegexTokenizer → Word2Vec → recommendation

MLlib Text Mining Pipeline
[Diagram combining the three pipelines]
Stages: RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer
Algorithms: NaiveBayes, LogisticRegression, SVM, MLP, LDA, Word2Vec
Outputs: text classification, topic model, recommendation
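How stages chain into a pipeline can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: the class names deliberately mirror Spark's but are local stand-ins, rows are dicts, and the stop-word list is hypothetical. During fitting, a stage with a `fit` method is trained on the data produced so far and replaced by its model; other stages are applied as-is.

```python
from collections import Counter


class Tokenizer:
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class StopWordsRemover:
    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def transform(self, rows):
        return [
            dict(r, words=[w for w in r["words"] if w not in self.stop_words])
            for r in rows
        ]


class CountVectorizerModel:
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        out = []
        for r in rows:
            counts = Counter(r["words"])
            out.append(dict(r, features=[counts[w] for w in self.vocabulary]))
        return out


class CountVectorizer:
    def fit(self, rows):
        vocabulary = sorted({w for r in rows for w in r["words"]})
        return CountVectorizerModel(vocabulary)


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; transformers are used directly.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"text": "this shirt is awesome"}, {"text": "never got the shirt"}]
pipeline = Pipeline(
    [Tokenizer(), StopWordsRemover(["this", "is", "the"]), CountVectorizer()]
)
pipeline_model = pipeline.fit(train)
scored = pipeline_model.transform([{"text": "awesome shirt shirt"}])
```

The fitted pipeline carries the vocabulary learned at training time, so new text is scored with exactly the same feature layout.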

Demo
• Load the file contents and the categories
• Extract feature vectors suitable for machine learning
• Train a linear model to perform categorization
• Use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
https://github.com/yanboliang/dataworks-munich-2017
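The grid search step can be sketched in plain Python as below. This is not the demo's code (see the repository for that) and not Spark's tuning API; it only shows the idea of scoring every parameter combination with k-fold cross validation. The toy cue-word classifier is an assumption made for brevity and is not a linear model; `lowercase` plays the role of a feature extraction setting and `threshold` of a classifier setting.

```python
from itertools import product


def k_folds(rows, k):
    """Yield (train, validation) splits; row i belongs to fold i % k."""
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        valid = [r for i, r in enumerate(rows) if i % k == fold]
        yield train, valid


def grid_search(rows, param_grid, fit, score, k=3):
    """Return the parameter combination with the best mean validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            score(fit(train, **params), valid) for train, valid in k_folds(rows, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def fit(train, lowercase, threshold):
    """Toy classifier: count words seen only in positive training texts."""
    def words(text):
        return (text.lower() if lowercase else text).split()

    positive = {w for text, label in train if label == 1 for w in words(text)}
    negative = {w for text, label in train if label == 0 for w in words(text)}
    cue_words = positive - negative

    def predict(text):
        hits = sum(1 for w in words(text) if w in cue_words)
        return 1 if hits >= threshold else 0

    return predict


def accuracy(model, rows):
    return sum(1 for text, label in rows if model(text) == label) / len(rows)


# Hypothetical labeled texts (1 = positive, 0 = negative).
reviews = [
    ("Great game", 1), ("bad seller", 0), ("GREAT shirt", 1),
    ("Bad game", 0), ("great seller", 1), ("BAD shirt", 0),
]
grid = {"lowercase": [False, True], "threshold": [1, 2]}
best_params, best_score = grid_search(reviews, grid, fit, accuracy, k=3)
```

Searching feature extraction and classifier settings together matters because the best classifier setting can depend on how the text was prepared, as the case-folding option shows here.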

Customizing ML Pipelines
• MLlib 2.1 includes:
  – 30+ feature transformers (Tokenizer, Word2Vec, …)
  – 25+ models (for classification, regression, clustering, …)
  – Model tuning & evaluation
• But some applications require customized Transformers & Models

Options for customization
• Existing use cases:
  – spark-corenlp
  – spark-vlbfgs
• Extend abstractions
  – Transformer
  – Estimator & Model
  – Evaluator

Spark virtual environment
[Diagram: Data Scientist A, Data Scientist B; environments labeled Python 2.7 (x5) and Python 3.5 (x3)]

Text Mining workflow
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
Callout: Duplicated and error-prone

ML persistence
• Create Pipeline
• Persist model or Pipeline:
  – model.save("s3n://…")
• Load Pipeline (Java/Scala)
  – Model.load("s3n://…")
• Deploy in production
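The save-then-load handoff can be illustrated with a small stdlib-only sketch. It is an analogy rather than Spark's persistence code: `VocabularyModel` and its JSON layout are hypothetical, and a local temporary file replaces the elided storage path. The point is that a fitted model written in a language-neutral format can be reloaded elsewhere without re-implementing the training logic.

```python
import json
import os
import tempfile


class VocabularyModel:
    """A fitted model whose only learned state is its vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def transform(self, words):
        # One count per vocabulary entry, in vocabulary order.
        return [words.count(w) for w in self.vocabulary]

    def save(self, path):
        payload = {"type": "VocabularyModel", "vocabulary": self.vocabulary}
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as handle:
            payload = json.load(handle)
        if payload.get("type") != "VocabularyModel":
            raise ValueError("unexpected model type in " + path)
        return cls(payload["vocabulary"])


def round_trip(model):
    """Save the model to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        model.save(path)
        return VocabularyModel.load(path)


original = VocabularyModel(["awesome", "shirt"])
restored = round_trip(original)
```

Because the restored model produces the same features as the original, the production side only needs to load and score, which removes the duplicated re-implementation step.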

Data scientists work with software engineers
Data Scientists: Explore data → Create pipeline → Find best params → Save model
Software engineers: Load model → Deploy in production → Scoring on batch/streaming data
