An excursion into Text Analytics with Apache Spark

Krishna Sankar

For our Strata 2016 Tutorial http://goo.gl/SfuSnn

Text Analytics
In which we analyze the mood of the nation & then the 2016 US Presidential primary debates (Don’t worry, we will keep it civilized)
Krishna Sankar
https://www.linkedin.com/in/ksankar
March 29, 2016

Mood Of The Nation
[Figure: panels labeled WJC, JFK, FDR, GW, AL, JFK]

Text Analytics Pipeline
§ Understand the tutorial in the context of a broader reference architecture (e.g. next 2 slides); select components as needed by the app
§ Download Data
o http://stateoftheunion.onetwothree.net/texts/index.html
o http://www.presidency.ucsb.edu/debates.php
§ Split Into Words (n-grams)
o Spark RDD mechanisms
o DataFrames (especially for ml pipelines)
§ Canonicalize
o Case transformation, special characters, space elimination, …
o Stemming, lemmatization
§ Extract Features
o Common word elimination
o TF-IDF
§ Analytics
o Exploration
o Knowledge Representation (Knowledge Graph)
§ ml pipeline/models
o Logistic Regression
o Bayesian Models
o LDA
o Deep Learning
§ Also listed
o (Topics, Classification, …)
o Domain Scoping
o word2vec
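
The canonicalize, split and feature-extraction stages of the pipeline can be sketched in plain Python. This is only an illustrative standard-library stand-in, not the tutorial's Spark notebook: the function names (`canonicalize`, `tokenize`, `ngrams`, `tf_idf`), the tiny `STOP_WORDS` set and the two sample sentences are assumptions made for the example. The weighting uses the smoothed form idf = log((N + 1) / (df + 1)) given in the Spark MLlib feature-extraction documentation.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "we", "our"}

def canonicalize(text):
    """Lower-case, turn non-letters into spaces, collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text, remove_stop_words=True):
    """Split canonicalized text into words, optionally dropping common words."""
    words = canonicalize(text).split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

def ngrams(words, n=2):
    """Return the consecutive n-word tuples of a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {t: math.log((n_docs + 1) / (d + 1)) for t, d in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in docs]

# Invented sample sentences, not taken from the real speeches.
speeches = [
    "The State of the Union is strong!",
    "We must strengthen our economy, and our union.",
]
tokens = [tokenize(s) for s in speeches]
bigrams = ngrams(tokens[0])
weights = tf_idf(tokens)
```

A term that occurs in every document (here "union") gets weight 0, which is why TF-IDF pairs naturally with common word elimination.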

Reference Architectures for Text Analytics
https://doubleclix.wordpress.com/2015/07/05/twitter-2-0-curated-signals-applied-intelligence-stratified-inference/

Reference Architectures for Text Analytics

Text Analytics
§ A data scientist can use RDD primitives to do interesting work with text
- Map-reduce in a couple of lines!
- But it is not exactly the same as Hadoop MapReduce (see the excellent blog post by Sean Owen [1])
- Set differences using subtractByKey
- Ability to sort a map by values (or by any arbitrary function, for that matter)
- DataFrame operations
§ TF-IDF as feature extraction: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
§ LDA for topic extraction / ml pipeline in the next section
§ Good & bad: mllib features span a continuum of libraries, from RDDs to DataFrames to Datasets to feature extraction to ml models to ml pipelines and beyond …
[1] http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
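
The RDD-style operations on this slide (word count, set difference by key, sorting by value) can be mimicked with plain Python dictionaries. This is a hedged sketch rather than Spark code: `word_count`, `subtract_by_key` and `sort_by_value` are hypothetical helper names and the sample lines are invented. In Spark, the corresponding RDD methods would be `flatMap`/`map`/`reduceByKey`, `subtractByKey` and `sortBy`.

```python
from collections import Counter

def word_count(lines):
    """Analogue of flatMap(split) -> map((word, 1)) -> reduceByKey(add)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return dict(counts)

def subtract_by_key(left, right):
    """Keep the pairs of `left` whose key does not appear in `right`."""
    return {k: v for k, v in left.items() if k not in right}

def sort_by_value(counts, descending=True):
    """Sort (word, count) pairs by count; ties are broken alphabetically."""
    sign = -1 if descending else 1
    return sorted(counts.items(), key=lambda kv: (sign * kv[1], kv[0]))

# Invented sample text standing in for two speeches.
speech_a = ["jobs and freedom and peace", "peace through jobs"]
speech_b = ["freedom and security"]

counts_a = word_count(speech_a)
counts_b = word_count(speech_b)
only_in_a = subtract_by_key(counts_a, counts_b)  # words used only in speech A
ranked = sort_by_value(only_in_a)
```

The set difference is what makes comparisons such as "which words did one speaker use that another did not" a one-liner once the counts exist.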

Code Walkthru
§ 03-Text Analytics Notebook
