SlideShare a Scribd company logo
1 of 28
Download to read offline
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
RevolutionizeTextMining
withSparkandZeppelin
April2017
YanboLiang
ApacheSparkcommitter
Softwareengineer@Hortonworks
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Agenda
TextminingworkflowonBigData
TextminingwithSparkandMLlib
SparkandZeppelinastheplatform
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
TextMining:PracticalApplications
•Textclassification
–Spamfiltering
–Frauddetection
•Textclustering
•Sentimentanalysis
•Entityextraction
•Recommendations
•Automaticlabeling
•Contextualadvertising
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
TraditionalTextMining
•Commercialsoftware
•Opensourcesoftware
–Gensim,KNIME,NLTK,
sklearn,R
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
TraditionalTextMining
•Commercialsoftware
–IBMSPSS,RapidMiner,SAS
•Opensourcesoftware
–Gensim,KNIME,NLTK,
sklearn,R
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
TextMiningonBigData
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
TextMiningonBigData
DataScientistsSoftwareengineers
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
WhyApacheSparkMLlib
•ScalablemachinelearningalgorithmsontopofSpark
–AlternatingLeastSquaresonSpotifydata
•50+millionusersx30+millionsongs,50billionratings
•Forrank10with10iterations,~1hourrunningtime
•Workflowutilities
–MLpipeline
–Modelimport/export
–crossvalidation
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
TextMiningworkflow
•Prototype(Python/R)
•CreatePipeline
–Loaddataset
–Extractrawfeatures
–Transformfeatures
–Selectkeyfeatures
–Fitandchoosebestmodels
•Re-implementPipelinefor
production(Java/Scala)
•DeployPipeline
•Scoring
DataScienceSoftwareengineering
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
TextMiningworkflow
•Prototype(Python/R)
•CreatePipeline
–Loaddataset
–Extractrawfeatures
–Transformfeatures
–Selectkeyfeatures
–Fitandchoosebestmodels
•Re-implementPipelinefor
production(Java/Scala)
•DeployPipeline
•Scoring
DataScienceSoftwareengineering
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Loaddata
TextLabel
Iboughtthegame…4
DoNOTbothertry…1
Thisshirtisawesome…5
nevergotit.Seller…1
Iorderedthisto…3
Dataset
Feature
engineering
Model
training
Model
evaluation
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Extractfeatures
TextLabelWordsFeatures
Iboughtthegame…4“i”,“bought”,…[1,0,3,9,…]
DoNOTbothertry…1“do”,“not”,…[0,0,11,0,…]
Thisshirtisawesome…5“this”,“shirt”,…[0,2,3,1,…]
nevergotit.Seller…1“never”,“got”,…[1,2,0,0,…]
Iorderedthisto…3“i”,“ordered”,…[1,0,0,3,…]
Dataset
Feature
engineering
Model
training
Model
evaluation
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Fitamodel
TextLabelWordsFeaturesProbabilityPrediction
Iboughtthegame…4“i”,“bought”,…[1,0,3,9,…]0.84
DoNOTbothertry…1“do”,“not”,…[0,0,11,0,…]0.62
Thisshirtisawesome…5“this”,“shirt”,…[0,2,3,1,…]0.95
nevergotit.Seller…1“never”,“got”,…[1,2,0,0,…]0.71
Iorderedthisto…3“i”,“ordered”,…[1,0,0,3,…]0.74
Dataset
Feature
engineering
Model
training
Model
evaluation
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Evaluate
TextLabelWordsFeaturesProbabilityPrediction
Iboughtthegame…4“i”,“bought”,…[1,0,3,9,…]0.84
DoNOTbothertry…1“do”,“not”,…[0,0,11,0,…]0.62
Thisshirtisawesome…5“this”,“shirt”,…[0,2,3,1,…]0.95
nevergotit.Seller…1“never”,“got”,…[1,2,0,0,…]0.71
Iorderedthisto…3“i”,“ordered”,…[1,0,0,3,…]0.74
Dataset
Feature
engineering
Model
training
Model
evaluation
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
KeyabstractionofSparkMLpipeline
•Transformer
–Featuretransformers(e.g.,HashingTF)andtrainedMLmodels(e.g.,NaiveBayesModel).
•Estimator
–MLalgorithmsfortrainingmodels(e.g.,NaiveBayes).
•Evaluator
–Theseevaluatepredictionsandcomputemetrics,usefulfortuningalgorithmparameters(e.g.,
BinaryClassificationEvaluator).
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Spark’sTextMiningalgorithms
•LDAfortopicmodel
•Word2Vecanunsupervisedwaytoturnwordsintofeaturesbasedontheirmeaning
•CountVectorizerturnsdocumentsintovectorsbasedonwordcount
•HashingTF-IDFcalculatesimportantwordsofadocumentwithrespecttothecorpus
•Andmuchmore
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
MLlibTextMiningPipeline-classification
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer
HashingTF
IDF
StringIndexer
NaiveBayes
LogisticRegression
SVM
MLP
textclassification
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
MLlibTextMiningPipeline–topicmodel
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer
HashingTF
IDFLDAtopicmodel
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
MLlibTextMiningPipeline-recommendation
Dataset
RegexTokenizerWord2Vec
recommendation
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
MLlibTextMiningPipeline
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer
HashingTF
IDF
StringIndexer
NaiveBayes
LogisticRegression
SVM
MLP
LDA
Word2Vec
textclassification
topicmodel
recommendation
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Demo
•loadthefilecontentsandthecategories
•extractfeaturevectorssuitableformachinelearning
•trainalinearmodeltoperformcategorization
•useagridsearchstrategytofindagoodconfigurationofboththefeatureextraction
componentsandtheclassifier
https://github.com/yanboliang/dataworks-munich-2017
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
CustomingMLPipelines
•MLlib2.1includes:
–30+featuretransformers(Tokenizer,Word2Vec,…)
–25+models(forclassification,regression,clustering,…)
–Modeltuning&evaluation
•Butsomeapplicationsrequirecustomized
–Transformers&Models
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Optionsforcustomization
•Existingusecases:
–spark-corenlp
–spark-vlbfgs
•Extendabstractions
–Transformer
–Estimator&Model
–Evaluator
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Sparkvirtualenvironment
DataScientistADataScientistB
Python2.7
Python2.7
Python2.7
Python2.7
Python2.7
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Sparkvirtualenvironment
DataScientistADataScientistB
Python2.7
Python2.7
Python2.7
Python2.7
Python2.7
Python3.5
Python3.5
Python3.5
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
TextMiningworkflow
•Prototype(Python/R)
•CreatePipeline
–Loaddataset
–Extractrawfeatures
–Transformfeatures
–Selectkeyfeatures
–Fitandchoosebestmodels
•Re-implementPipelinefor
production(Java/Scala)
•DeployPipeline
•Scoring
DataScienceSoftwareengineering
Duplicatedand
error-prone
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
MLpersistence
•Prototype(Python/R)
•CreatePipeline
•LoadPipeline(Java/Scala)
–Model.load(“s3n://…”)
•Deployinproduction
DataScienceSoftwareengineering
PersistmodelorPipeline:
model.save(“s3n://…”)
‹#
›
©HortonworksInc.2011–2016.AllRightsReserved
Datascientistsworkwithsoftwareengineer
DataScientistsSoftwareengineers
Exploredata
Createpipeline
Findbestparams
Savemodel
Loadmodel
Deployinproduction
Scoringon
batch/streamingdata

More Related Content

What's hot

Apache NiFi: Ingesting Enterprise Data At Scale
Apache NiFi:   Ingesting Enterprise Data At Scale Apache NiFi:   Ingesting Enterprise Data At Scale
Apache NiFi: Ingesting Enterprise Data At Scale Timothy Spann
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseAldrin Piri
 
Apache NiFi + Tensorflow + Hadoop: Big Data AI サンドイッチの作り方
Apache NiFi + Tensorflow + Hadoop:Big Data AI サンドイッチの作り方Apache NiFi + Tensorflow + Hadoop:Big Data AI サンドイッチの作り方
Apache NiFi + Tensorflow + Hadoop: Big Data AI サンドイッチの作り方HortonworksJapan
 
Data Science with Apache Spark - Crash Course - HS16SJ
Data Science with Apache Spark - Crash Course - HS16SJData Science with Apache Spark - Crash Course - HS16SJ
Data Science with Apache Spark - Crash Course - HS16SJDataWorks Summit/Hadoop Summit
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveAldrin Piri
 
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...DataWorks Summit/Hadoop Summit
 
Flink and NiFi, Two Stars in the Apache Big Data Constellation
Flink and NiFi, Two Stars in the Apache Big Data ConstellationFlink and NiFi, Two Stars in the Apache Big Data Constellation
Flink and NiFi, Two Stars in the Apache Big Data ConstellationMatthew Ring
 
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiReal-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiTimothy Spann
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善HortonworksJapan
 
Introduction to Streaming Analytics Manager
Introduction to Streaming Analytics ManagerIntroduction to Streaming Analytics Manager
Introduction to Streaming Analytics ManagerYifeng Jiang
 
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFiIntelligently Collecting Data at the Edge – Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFiDataWorks Summit
 
Data in the Cloud Crash Course
Data in the Cloud Crash CourseData in the Cloud Crash Course
Data in the Cloud Crash CourseDataWorks Summit
 
Apache NiFi Meetup - Princeton NJ 2016
Apache NiFi Meetup - Princeton NJ 2016Apache NiFi Meetup - Princeton NJ 2016
Apache NiFi Meetup - Princeton NJ 2016Timothy Spann
 

What's hot (20)

The Elephant in the Clouds
The Elephant in the CloudsThe Elephant in the Clouds
The Elephant in the Clouds
 
Apache NiFi: Ingesting Enterprise Data At Scale
Apache NiFi:   Ingesting Enterprise Data At Scale Apache NiFi:   Ingesting Enterprise Data At Scale
Apache NiFi: Ingesting Enterprise Data At Scale
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
Apache NiFi + Tensorflow + Hadoop: Big Data AI サンドイッチの作り方
Apache NiFi + Tensorflow + Hadoop:Big Data AI サンドイッチの作り方Apache NiFi + Tensorflow + Hadoop:Big Data AI サンドイッチの作り方
Apache NiFi + Tensorflow + Hadoop: Big Data AI サンドイッチの作り方
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
Data Science with Apache Spark - Crash Course - HS16SJ
Data Science with Apache Spark - Crash Course - HS16SJData Science with Apache Spark - Crash Course - HS16SJ
Data Science with Apache Spark - Crash Course - HS16SJ
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Dataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJDataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJ
 
Apache Atlas: Governance for your Data
Apache Atlas: Governance for your DataApache Atlas: Governance for your Data
Apache Atlas: Governance for your Data
 
#HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course #HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course
 
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
 
Flink and NiFi, Two Stars in the Apache Big Data Constellation
Flink and NiFi, Two Stars in the Apache Big Data ConstellationFlink and NiFi, Two Stars in the Apache Big Data Constellation
Flink and NiFi, Two Stars in the Apache Big Data Constellation
 
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiReal-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
 
Introduction to Streaming Analytics Manager
Introduction to Streaming Analytics ManagerIntroduction to Streaming Analytics Manager
Introduction to Streaming Analytics Manager
 
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFiIntelligently Collecting Data at the Edge – Intro to Apache MiNiFi
Intelligently Collecting Data at the Edge – Intro to Apache MiNiFi
 
Data in the Cloud Crash Course
Data in the Cloud Crash CourseData in the Cloud Crash Course
Data in the Cloud Crash Course
 
Intro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJIntro to Spark & Zeppelin - Crash Course - HS16SJ
Intro to Spark & Zeppelin - Crash Course - HS16SJ
 
Apache NiFi Meetup - Princeton NJ 2016
Apache NiFi Meetup - Princeton NJ 2016Apache NiFi Meetup - Princeton NJ 2016
Apache NiFi Meetup - Princeton NJ 2016
 

Similar to Revolutionize Text Mining with Spark and Zeppelin

Knowledge_Based_Systems_Siemens
Knowledge_Based_Systems_SiemensKnowledge_Based_Systems_Siemens
Knowledge_Based_Systems_SiemensVinay Bhat
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Mac Moore
 
Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018Timothy Spann
 
API Description Languages
API Description LanguagesAPI Description Languages
API Description LanguagesAkana
 
API Description Languages
API Description LanguagesAPI Description Languages
API Description LanguagesAkana
 
Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Mac Moore
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGskumpf
 
API Description Languages: Which Is The Right One For Me?
 API Description Languages: Which Is The Right One For Me?  API Description Languages: Which Is The Right One For Me?
API Description Languages: Which Is The Right One For Me? ProgrammableWeb
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData DayJohn Park
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Spark Summit
 
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...Databricks
 
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...Databricks
 
Pyspark vs Spark Let's Unravel the Bond!
Pyspark vs Spark Let's Unravel the Bond!Pyspark vs Spark Let's Unravel the Bond!
Pyspark vs Spark Let's Unravel the Bond!ankitbhandari32
 
Sudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta_Mukherjee_Resume-Nov_2022.pdfSudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta_Mukherjee_Resume-Nov_2022.pdfSudipta Mukherjee
 
2018 04 20 Azure Global Bootcamp - Artificial Intelligence and Cognitive Serv...
2018 04 20 Azure Global Bootcamp - Artificial Intelligence and Cognitive Serv...2018 04 20 Azure Global Bootcamp - Artificial Intelligence and Cognitive Serv...
2018 04 20 Azure Global Bootcamp - Artificial Intelligence and Cognitive Serv...Bruno Capuano
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Hortonworks
 
"React Native" by Vanessa Leo e Roberto Brogi
"React Native" by Vanessa Leo e Roberto Brogi "React Native" by Vanessa Leo e Roberto Brogi
"React Native" by Vanessa Leo e Roberto Brogi ThinkOpen
 
Continuous Delivery in a content centric world
Continuous Delivery in a content centric worldContinuous Delivery in a content centric world
Continuous Delivery in a content centric worldJeroen Reijn
 
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found..."Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...Dataconomy Media
 

Similar to Revolutionize Text Mining with Spark and Zeppelin (20)

Knowledge_Based_Systems_Siemens
Knowledge_Based_Systems_SiemensKnowledge_Based_Systems_Siemens
Knowledge_Based_Systems_Siemens
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015
 
Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018
 
API Description Languages
API Description LanguagesAPI Description Languages
API Description Languages
 
API Description Languages
API Description LanguagesAPI Description Languages
API Description Languages
 
Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
 
API Description Languages: Which Is The Right One For Me?
 API Description Languages: Which Is The Right One For Me?  API Description Languages: Which Is The Right One For Me?
API Description Languages: Which Is The Right One For Me?
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
 
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
 
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
Spark HBase Connector: Feature Rich and Efficient Access to HBase Through Spa...
 
Pyspark vs Spark Let's Unravel the Bond!
Pyspark vs Spark Let's Unravel the Bond!Pyspark vs Spark Let's Unravel the Bond!
Pyspark vs Spark Let's Unravel the Bond!
 
Sudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta_Mukherjee_Resume-Nov_2022.pdfSudipta_Mukherjee_Resume-Nov_2022.pdf
Sudipta_Mukherjee_Resume-Nov_2022.pdf
 
2018 04 20 Azure Global Bootcamp - Artificial Intelligence and Cognitive Serv...
2018 04 20 Azure Global Bootcamp - Artificial Intelligence and Cognitive Serv...2018 04 20 Azure Global Bootcamp - Artificial Intelligence and Cognitive Serv...
2018 04 20 Azure Global Bootcamp - Artificial Intelligence and Cognitive Serv...
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
 
"React Native" by Vanessa Leo e Roberto Brogi
"React Native" by Vanessa Leo e Roberto Brogi "React Native" by Vanessa Leo e Roberto Brogi
"React Native" by Vanessa Leo e Roberto Brogi
 
Continuous Delivery in a content centric world
Continuous Delivery in a content centric worldContinuous Delivery in a content centric world
Continuous Delivery in a content centric world
 
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found..."Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
 
Data streaming
Data streamingData streaming
Data streaming
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 

Recently uploaded

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 

Recently uploaded (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

Revolutionize Text Mining with Spark and Zeppelin