Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
1.
Revolutionize Text Mining with Spark and Zeppelin
April 2017
Yanbo Liang
Apache Spark committer
Software engineer @ Hortonworks
© Hortonworks Inc. 2011–2016. All Rights Reserved
2.
Agenda
Text mining workflow on Big Data
Text mining with Spark and MLlib
Spark and Zeppelin as the platform
3.
Text Mining: Practical Applications
• Text classification
  – Spam filtering
  – Fraud detection
• Text clustering
• Sentiment analysis
• Entity extraction
• Recommendations
• Automatic labeling
• Contextual advertising
4.
Traditional Text Mining
• Commercial software
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R
5.
Traditional Text Mining
• Commercial software
  – IBM SPSS, RapidMiner, SAS
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R
6.
Text Mining on Big Data
7.
Text Mining on Big Data
Data Scientists, Software engineers
8.
Why Apache Spark MLlib
• Scalable machine learning algorithms on top of Spark
  – Alternating Least Squares on Spotify data
    • 50+ million users x 30+ million songs, 50 billion ratings
    • For rank 10 with 10 iterations, ~1 hour running time
• Workflow utilities
  – ML pipeline
  – Model import/export
  – Cross validation
9.
Text Mining workflow
Data Science
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
Software engineering
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
10.
Text Mining workflow
Data Science
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
Software engineering
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
11.
Load data
Text | Label
I bought the game… | 4
Do NOT bother try… | 1
This shirt is awesome… | 5
never got it. Seller… | 1
I ordered this to… | 3
Workflow stages: Dataset, Feature engineering, Model training, Model evaluation
12.
Extract features
Text | Label | Words | Features
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …]
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …]
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …]
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …]
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …]
Workflow stages: Dataset, Feature engineering, Model training, Model evaluation
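The Words and Features columns above can be sketched in plain Python. This is a minimal illustration using only the standard library, not Spark's RegexTokenizer or CountVectorizer; the helper names (`tokenize`, `build_vocabulary`, `count_vector`) and the two review texts are hypothetical, since the slide's own texts are truncated.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def count_vector(tokens, vocab):
    """Return a dense list of per-word counts aligned with the vocabulary order."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]


# Hypothetical reviews, made up for this sketch.
reviews = ["I bought the game and I love the game", "Do NOT bother"]
words = [tokenize(r) for r in reviews]
vocab = build_vocabulary(words)
features = [count_vector(w, vocab) for w in words]
```

Each row ends up with one count per vocabulary word, which is the shape of the Features column in the table.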
13.
Fit a model
Text | Label | Words | Features | Probability | Prediction
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …] | 0.8 | 4
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …] | 0.6 | 2
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …] | 0.9 | 5
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …] | 0.7 | 1
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …] | 0.7 | 4
Workflow stages: Dataset, Feature engineering, Model training, Model evaluation
14.
Evaluate
Text | Label | Words | Features | Probability | Prediction
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …] | 0.8 | 4
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …] | 0.6 | 2
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …] | 0.9 | 5
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …] | 0.7 | 1
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …] | 0.7 | 4
Workflow stages: Dataset, Feature engineering, Model training, Model evaluation
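As a worked example of the evaluation step, the sketch below computes two simple metrics over the five rows of the table. It rests on an assumption: the extracted slide text fuses the last two columns (for example 0.84), and this sketch reads each such value as a probability followed by a one-digit prediction (0.8 and 4). The function names are hypothetical and the slide does not say which metric the talk used.

```python
# Labels as shown in the Label column of the slide.
labels = [4, 1, 5, 1, 3]
# Predictions under the assumed reading of the fused Probability/Prediction values.
predictions = [4, 2, 5, 1, 4]


def accuracy(y_true, y_pred):
    """Fraction of rows whose prediction equals the label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average distance between predicted and actual rating."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


acc = accuracy(labels, predictions)
mae = mean_absolute_error(labels, predictions)
```

Under that reading, three of the five predictions match their labels and the two misses are each off by one rating point.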
15.
Key abstractions of Spark ML pipeline
• Transformer
  – Feature transformers (e.g., HashingTF) and trained ML models (e.g., NaiveBayesModel).
• Estimator
  – ML algorithms for training models (e.g., NaiveBayes).
• Evaluator
  – These evaluate predictions and compute metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).
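The relationship between the three abstractions can be sketched in a few lines of plain Python. This is an illustration of the roles only, not MLlib's classes: `Tokenizer`, `MajorityLabel`, `MajorityLabelModel`, and `AccuracyEvaluator` are hypothetical names, rows are plain dictionaries instead of DataFrame rows, and the "algorithm" simply predicts the most frequent label.

```python
class Tokenizer:
    """Transformer: maps rows to rows and needs no training."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class MajorityLabel:
    """Estimator: fit() learns from the data and returns a model."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        most_common = max(sorted(set(labels)), key=labels.count)
        return MajorityLabelModel(most_common)


class MajorityLabelModel:
    """A trained model is itself a Transformer: it adds a prediction to each row."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class AccuracyEvaluator:
    """Evaluator: reduces a set of predictions to a single metric."""

    def evaluate(self, rows):
        hits = sum(1 for row in rows if row["prediction"] == row["label"])
        return hits / len(rows)


# Hypothetical toy dataset.
rows = [
    {"text": "Great shirt", "label": 1},
    {"text": "Never arrived", "label": 0},
    {"text": "Love it", "label": 1},
]
tokenized = Tokenizer().transform(rows)
model = MajorityLabel().fit(tokenized)
accuracy = AccuracyEvaluator().evaluate(model.transform(tokenized))
```

The point of the split is that a fitted model exposes the same `transform` interface as a feature step, so both can be chained in one pipeline.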
16.
Spark’s Text Mining algorithms
• LDA for topic modeling
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word count
• HashingTF-IDF calculates important words of a document with respect to the corpus
• And much more
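The HashingTF-IDF idea from the list above can be sketched as follows. This is a plain-Python approximation, not MLlib's implementation: `zlib.crc32` is a stand-in for whatever hash function Spark uses, the bucket count of 16 is arbitrary, and the smoothed formula log((N + 1) / (df + 1)) is, to my understanding, the form given in the MLlib feature guide. The two documents are hypothetical.

```python
import math
import zlib


def hashing_tf(tokens, num_features=16):
    """Term frequencies in a fixed-size vector; a hash picks each token's bucket."""
    vec = [0] * num_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1
    return vec


def idf_weights(tf_vectors):
    """Smoothed inverse document frequency per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    weights = []
    for j in range(len(tf_vectors[0])):
        df = sum(1 for vec in tf_vectors if vec[j] > 0)
        weights.append(math.log((n_docs + 1) / (df + 1)))
    return weights


def tf_idf(tf_vectors):
    """Scale each term frequency by the corpus-wide weight of its bucket."""
    weights = idf_weights(tf_vectors)
    return [[tf * w for tf, w in zip(vec, weights)] for vec in tf_vectors]


# Hypothetical tokenized documents.
docs = [["spark", "mllib", "spark"], ["spark", "zeppelin"]]
tf = [hashing_tf(d) for d in docs]
scores = tf_idf(tf)
```

A term that appears in every document, such as "spark" here, gets weight log(1) = 0, so only terms that distinguish a document from the rest of the corpus keep a non-zero score. Hashing avoids building a vocabulary, at the cost of occasional collisions between tokens.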
17.
MLlib Text Mining Pipeline - classification
Pipeline diagram: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer, NaiveBayes, LogisticRegression, SVM, MLP, text classification
18.
MLlib Text Mining Pipeline – topic model
Pipeline diagram: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, LDA, topic model
19.
MLlib Text Mining Pipeline - recommendation
Pipeline diagram: Dataset, RegexTokenizer, Word2Vec, recommendation
20.
MLlib Text Mining Pipeline
Pipeline diagram: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer, NaiveBayes, LogisticRegression, SVM, MLP, LDA, Word2Vec, text classification, topic model, recommendation
21.
Demo
• load the file contents and the categories
• extract feature vectors suitable for machine learning
• train a linear model to perform categorization
• use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
https://github.com/yanboliang/dataworks-munich-2017
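The grid search step in the demo outline can be sketched without Spark. The example below is a toy under stated assumptions: the keyword classifier, the `min_len` and `threshold` parameters, and the four labeled texts are all hypothetical, and it is not the code from the linked repository. For brevity it scores each configuration on the same data it was chosen on; a real search would score on held-out data, for instance with cross validation.

```python
from itertools import product

POSITIVE = {"awesome", "love", "great"}  # hypothetical keyword list


def extract(text, min_len):
    """Feature extraction step: keep lowercase tokens of at least min_len characters."""
    return [t for t in text.lower().split() if len(t) >= min_len]


def predict(tokens, threshold):
    """Classifier step: label 1 when enough positive keywords appear."""
    return 1 if sum(t in POSITIVE for t in tokens) >= threshold else 0


def accuracy(data, min_len, threshold):
    """Fraction of texts classified correctly under one configuration."""
    hits = sum(predict(extract(text, min_len), threshold) == label for text, label in data)
    return hits / len(data)


def grid_search(data, grid):
    """Score every combination of parameter values and return the best one."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = accuracy(data, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical labeled texts (1 = positive, 0 = negative).
data = [
    ("awesome shirt love it", 1),
    ("never got it", 0),
    ("great game love it", 1),
    ("great but broken", 0),
]
grid = {"min_len": [1, 6], "threshold": [1, 2]}
best_params, best_score = grid_search(data, grid)
```

The grid covers one feature-extraction parameter and one classifier parameter, which mirrors the slide's point that both parts of the pipeline are tuned together.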
22.
Customizing ML Pipelines
• MLlib 2.1 includes:
  – 30+ feature transformers (Tokenizer, Word2Vec, …)
  – 25+ models (for classification, regression, clustering, …)
  – Model tuning & evaluation
• But some applications require customized
  – Transformers & Models
23.
Options for customization
• Existing use cases:
  – spark-corenlp
  – spark-vlbfgs
• Extend abstractions
  – Transformer
  – Estimator & Model
  – Evaluator
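Extending the Transformer abstraction might look like the sketch below. It is a plain-Python analogy, not a subclass of Spark's own Transformer: the base class, `WhitespaceTokenizer`, `NegationTagger`, and `Pipeline` are hypothetical names, and the negation-tagging rule is just one example of a step an application might need that a library does not ship.

```python
from abc import ABC, abstractmethod


class Transformer(ABC):
    """Hypothetical base class: a single method that maps rows to rows."""

    @abstractmethod
    def transform(self, rows):
        raise NotImplementedError


class WhitespaceTokenizer(Transformer):
    """Built-in style step: lowercase and split on whitespace."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class NegationTagger(Transformer):
    """Custom step: drop 'not' and prefix the token that follows it."""

    def transform(self, rows):
        out = []
        for row in rows:
            tagged, negate = [], False
            for tok in row["words"]:
                if tok == "not":
                    negate = True
                    continue
                tagged.append("NOT_" + tok if negate else tok)
                negate = False
            out.append(dict(row, words=tagged))
        return out


class Pipeline(Transformer):
    """Runs the stages in order; custom and built-in stages share one interface."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


pipeline = Pipeline([WhitespaceTokenizer(), NegationTagger()])
result = pipeline.transform([{"text": "Do NOT bother trying"}])
```

Because the custom stage implements the same method as the others, it slots into the pipeline without changes to the surrounding stages.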
24.
Spark virtual environment
Diagram labels: Data Scientist A, Data Scientist B; Python 2.7 (shown five times)
25.
Spark virtual environment
Diagram labels: Data Scientist A, Data Scientist B; Python 2.7 (shown five times), Python 3.5 (shown three times)
26.
Text Mining workflow
Data Science
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
Software engineering
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
Duplicated and error-prone
27.
ML persistence
Data Science
• Prototype (Python/R)
• Create Pipeline
Persist model or Pipeline: model.save("s3n://…")
Software engineering
• Load Pipeline (Java/Scala)
  – Model.load("s3n://…")
• Deploy in production
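The save-then-load handoff on this slide can be illustrated with a small stand-in model. This sketch does not use MLlib's persistence format or an S3 path: `KeywordModel` is hypothetical, the parameters are written as JSON to a temporary local file, and both sides are in Python here, whereas the slide has the loading side in Java/Scala.

```python
import json
import os
import tempfile


class KeywordModel:
    """Hypothetical stand-in for a fitted model: a keyword list plus a threshold."""

    def __init__(self, keywords, threshold):
        self.keywords = sorted(keywords)
        self.threshold = threshold

    def predict(self, text):
        hits = sum(tok in self.keywords for tok in text.lower().split())
        return 1 if hits >= self.threshold else 0

    def save(self, path):
        """Write the learned parameters in a language-neutral format (JSON)."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"keywords": self.keywords, "threshold": self.threshold}, f)

    @classmethod
    def load(cls, path):
        """Rebuild the model from saved parameters, with no retraining."""
        with open(path, encoding="utf-8") as f:
            params = json.load(f)
        return cls(params["keywords"], params["threshold"])


# "Data science" side saves; "software engineering" side loads and scores.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    KeywordModel({"awesome", "love"}, threshold=1).save(path)
    restored = KeywordModel.load(path)

prediction = restored.predict("This shirt is awesome")
```

Persisting the fitted parameters rather than re-implementing the pipeline is what removes the duplicated, error-prone step from the earlier workflow slide.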
28.
Data scientists work with software engineers
Data Scientists
• Explore data
• Create pipeline
• Find best params
• Save model
Software engineers
• Load model
• Deploy in production
• Scoring on batch/streaming data