A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal Greenplum Database

Big Data Pipeline for Topic and Sentiment Analysis, with Applications
Srivatsan Ramanujam (@being_bayesian)
Senior Data Scientist, Pivotal

11 Jan 2014

© Copyright 2013 Pivotal. All rights reserved.


Agenda
Introduction
The Problem
The Platform
The Pipeline
Live Demo: Topic and Sentiment Analysis Engine
Applications in real world customer engagements



Pivotal: A New Platform for a New Era
Data-Driven Application Development
• Pivotal Data Science Labs
• App Fabric: “The new Middleware”
• Data Fabric: “The new Database”
• Cloud Fabric: “The new OS”
• ...ETC: “The new Hardware”


The Problem
Make sense of large volumes of unstructured text and integrate it with structured data sources to make better predictions.
Approaches
– Topic Analysis
– Sentiment Analysis



The Platform



Pivotal Greenplum MPP DB
Think of it as multiple PostgreSQL servers:
– Master
– Segments/Workers
Rows are distributed across segments by a particular field (or randomly).
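
Distributing rows by a field can be pictured as hashing the field value and taking it modulo the number of segments. The sketch below is only an illustration of that idea: `segment_for`, `distribute` and the use of `zlib.crc32` are assumptions made here, not Greenplum's actual hash function or API.

```python
import zlib
from collections import defaultdict

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index with a stable hash."""
    # zlib.crc32 is a stand-in; the database uses its own internal hash.
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key_field, num_segments):
    """Group rows by the segment that their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return dict(segments)

# Hypothetical customer rows, distributed by state code.
rows = [
    {"customer_id": 1, "state": "CA"},
    {"customer_id": 2, "state": "NY"},
    {"customer_id": 3, "state": "CA"},
    {"customer_id": 4, "state": "TX"},
]
placement = distribute(rows, "state", num_segments=3)
```

All rows that share a key value land on the same segment, which is what makes per-group computation on that key local to one segment.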



Pivotal Hadoop

• The pipeline in this talk can be run on Pivotal Hadoop + HAWQ

Data Parallelism vs. Task Parallelism
Data Parallelism: Little or no effort is required to break up the problem into a number of parallel tasks, and there is no dependency (or communication) between those parallel tasks.
– Ex: Build one churn model for each state in the US simultaneously, when customer data is distributed by state code.

Task Parallelism: Split the problem into independent sub-tasks which can be executed in parallel.
– Ex: Build one churn model in parallel for the entire US, even though customer data is distributed by state code.
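
The data-parallel example above (one churn model per state) can be sketched in plain Python. This is a toy under stated assumptions: the "model" is just the observed churn rate, and `fit_churn_rate`, `fit_per_state` and the sample rows are hypothetical names and data; a thread pool stands in for the database segments.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def fit_churn_rate(rows):
    """Toy 'model': the observed churn rate in one partition."""
    return sum(r["churned"] for r in rows) / len(rows)

def fit_per_state(customers, max_workers=4):
    """Data parallelism: fit one independent model per state partition."""
    partitions = defaultdict(list)
    for row in customers:
        partitions[row["state"]].append(row)
    states = sorted(partitions)
    # Each partition is processed on its own; no task needs another's result.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        models = list(pool.map(fit_churn_rate, (partitions[s] for s in states)))
    return dict(zip(states, models))

customers = [
    {"state": "CA", "churned": 1},
    {"state": "CA", "churned": 0},
    {"state": "NY", "churned": 0},
    {"state": "NY", "churned": 0},
    {"state": "TX", "churned": 1},
]
models = fit_per_state(customers)
```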



User-Defined Functions (UDFs)
PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Simple UDFs are SQL queries with calling arguments and return types.

Definition:

CREATE FUNCTION times2(INT)
RETURNS INT
AS $$
SELECT 2 * $1
$$ LANGUAGE sql;

Execution:

SELECT times2(1);
 times2
--------
      2
(1 row)

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
• Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database cluster
• Data Parallelism:
  - PL/X piggybacks on Greenplum’s MPP architecture

[Diagram: SQL arrives at the Master Host (backed by a Standby Master); the Interconnect links it to several Segment Hosts, each running multiple Segments.]

Going Beyond Data Parallelism
Data-parallel computation via PL/Python libraries only allows us to run ‘n’ models in parallel.
This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to build a single model on all the available data.

For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
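
One way to see how a single model can be built over data spread across segments: each segment computes small sufficient statistics, the statistics are merged, and the model is solved once at the end. The sketch below does this for a one-variable linear regression. It is not MADlib's code; `transition`, `merge` and `final` are names chosen here to echo the shape of a user-defined aggregate, and the data is made up.

```python
from functools import reduce

def transition(rows):
    """Per-segment pass: accumulate sufficient statistics for y = a + b*x."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def merge(s1, s2):
    """Combine the statistics of two segments by element-wise addition."""
    return tuple(a + b for a, b in zip(s1, s2))

def final(stats):
    """Solve the normal equations for intercept a and slope b."""
    n, sx, sy, sxx, sxy = stats
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Hypothetical (x, y) rows living on three different segments.
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
intercept, slope = final(reduce(merge, map(transition, segments)))
```

Only the five running sums travel between segments, so the amount of data exchanged does not grow with the number of rows.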



Scalable, in-database ML
• Open Source! https://github.com/madlib/madlib
• Works on Greenplum DB and PostgreSQL
• Active development by Pivotal
  - Latest Release: 1.4 (Dec 2013)
• Downloads and Docs: http://madlib.net/

MADlib In-Database Functions

Predictive Modeling Library
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Linear Systems
• Sparse and Dense Solvers

Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions

Architecture
Layers, from top to bottom (implementation in parentheses):
– User Interface (SQL, generated from specification)
– “Driver” Functions: outer loops of iterative algorithms, optimizer invocations (Python with templated SQL)
– High-level Abstraction Layer: iteration controller, ... (Python)
– RDBMS Built-in Functions
– Functions for Inner Loops, for streaming algorithms (C++)
– Low-level Abstraction Layer: matrix operations, C++ to RDBMS type bridge, … (C++)
– RDBMS Query Processing (Greenplum, PostgreSQL, …)

MADlib on Hadoop

• A subset of the MADlib algorithms available on Pivotal Greenplum DB works out of the box on HAWQ.
• Other functions are being ported.
• With the general availability of, and support for, User-Defined Functions in HAWQ, MADlib will attain full parity with GPDB.


The Pipeline

→ Tweet Stream
→ Stored on HDFS
→ Loaded as external tables into GPDB (gpfdist)
→ Parallel parsing of JSON and extraction of fields using PL/Python
→ Topic Analysis through MADlib pLDA
→ Sentiment Analysis through custom PL/Python functions
→ D3.js
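
The JSON parsing step can be sketched as the kind of Python that would sit inside a PL/Python function body and run on every segment over its local rows. `extract_fields` is a hypothetical helper, and the choice of fields (`id`, `created_at`, `text`, `user.screen_name`) is an assumption about what the stream contains.

```python
import json

def extract_fields(raw_tweet):
    """Parse one JSON-encoded tweet and return the fields of interest, or None."""
    try:
        tweet = json.loads(raw_tweet)
    except ValueError:
        return None  # skip malformed lines instead of failing the whole batch
    if not isinstance(tweet, dict) or "text" not in tweet:
        return None  # e.g. delete notices carry no tweet text
    user = tweet.get("user") or {}
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet["text"],
        "screen_name": user.get("screen_name"),
    }

# Hypothetical raw lines as they might arrive from the stream.
raw_lines = [
    '{"id": 1, "created_at": "Sat Jan 11 10:00:00 +0000 2014", '
    '"text": "Loving the new phone!", "user": {"screen_name": "alice"}}',
    'not json at all',
    '{"delete": {"status": {"id": 2}}}',
]
parsed = [row for row in map(extract_fields, raw_lines) if row is not None]
```

Returning `None` for bad input keeps one malformed record from aborting the parse of an entire table.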


Topic Analysis – MADlib pLDA
Natural Language Processing - GPText
→ Filter relevant content
→ Align data
→ Social media tokenizer
→ Stemming, frequency filtering
→ Prepare dataset for topic modeling
→ MADlib Topic Model

Outputs:
• Topic graph
• Topic composition
• Topic clouds
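
The preparation steps above (tokenizing, stemming, frequency filtering, building the dataset for topic modeling) can be sketched in a few lines. Everything here is an assumption made for illustration: the regex tokenizer, the crude suffix stripper standing in for a real stemmer, the `min_df` threshold, and the output layout of word-id counts per document are not GPText's or MADlib's actual behavior.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[#@]?\w+")

def tokenize(text):
    """Lowercase a tweet, drop URLs, and keep words, #hashtags and @mentions."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return TOKEN_RE.findall(text)

def crude_stem(token):
    """Very rough suffix stripping; a real stemmer (e.g. Porter) does far more."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def prepare_corpus(docs, min_df=2):
    """Return (vocabulary, per-document word-id counts) after frequency filtering."""
    tokenized = [[crude_stem(t) for t in tokenize(d)] for d in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    kept = sorted(t for t, c in doc_freq.items() if c >= min_df)
    vocab = {t: i for i, t in enumerate(kept)}
    bags = [Counter(vocab[t] for t in tokens if t in vocab) for tokens in tokenized]
    return vocab, bags

docs = [
    "Dropped calls again today http://example.com/x",
    "So many dropped calls on this network",
    "New phone arrived today!",
]
vocab, bags = prepare_corpus(docs)
```

Terms that appear in fewer than `min_df` documents are dropped, which keeps the vocabulary small before the topic model sees it.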



Sentiment Analysis
We don’t have labeled data for our problem (Tweets aren’t tagged with sentiment).

Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a Tweet, but without context, sentiment prediction is futile!

Example words: “Unpredictable”, “Breakthrough”
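
A small sketch of why dictionary look-ups alone fall short. The lexicon and its polarity values are invented for illustration; `lexicon_score` simply sums the polarity of each known token and ignores everything around it.

```python
import re

# Hypothetical polarity lexicon: +1 for positive tokens, -1 for negative ones.
LEXICON = {"great": 1, "love": 1, "breakthrough": 1,
           "terrible": -1, "hate": -1, "unpredictable": -1}

def lexicon_score(text, lexicon=LEXICON):
    """Sum the polarities of known tokens; unknown tokens contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(tok, 0) for tok in tokens)

# The same word is praise for a thriller but a complaint about a car.
thriller = lexicon_score("The plot of this thriller is unpredictable")
steering = lexicon_score("The steering on this car is unpredictable")
```

Both sentences receive the same score because the look-up sees only the token, not what it describes.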



Sentiment Analysis – PL/X Functions
→ Part-of-speech tagger [1]: break up Tweets into tokens and tag their parts-of-speech
→ Semi-Supervised Sentiment Classification:
   - Phrase Extraction
   - Phrasal Polarity Scoring
→ Use learned phrasal polarities to score sentiment of new tweets
→ Sentiment Scored Tweets

[1] Parts-of-speech Tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
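
A rough sketch of the phrase extraction and scoring steps, assuming tokens arrive already tagged. The tag letters (`A` adjective, `N` noun, `R` adverb, `V` verb), the three tag patterns, the polarity table and the averaging rule are all assumptions made for this example; the talk does not specify how the polarities are learned or combined.

```python
# Tag pairs that tend to carry opinion: adjective+noun, adverb+adjective, adverb+verb.
PATTERNS = {("A", "N"), ("R", "A"), ("R", "V")}

def extract_phrases(tagged_tokens):
    """Return two-word phrases whose tag pair matches one of PATTERNS."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in PATTERNS:
            phrases.append(f"{w1.lower()} {w2.lower()}")
    return phrases

def score_tweet(tagged_tokens, phrase_polarity):
    """Average the learned polarity of the known phrases; 0.0 if none are known."""
    known = [phrase_polarity[p]
             for p in extract_phrases(tagged_tokens)
             if p in phrase_polarity]
    return sum(known) / len(known) if known else 0.0

# Hypothetical learned polarities and a hand-tagged tweet, for illustration only.
phrase_polarity = {"great coverage": 0.8, "really slow": -0.6}
tweet = [("Great", "A"), ("coverage", "N"), ("but", "&"),
         ("really", "R"), ("slow", "A"), ("data", "N")]
phrases = extract_phrases(tweet)
score = score_tweet(tweet, phrase_polarity)
```

Scoring phrases rather than single tokens is one way to bring some context back into the prediction.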



Live Demo



Real World Applications



Churn Models for Telecom Industry
Goal
– Identify customers who are likely to churn and prevent them from leaving.

Challenges
– Cost of acquiring new customers is high
– Recouping the cost of customer acquisition is hard if the customer is not retained long enough
– Barrier to switching is low for subscribers
– With mobile number portability, the barrier to switching is even lower

Good News
– Cost of retaining existing customers is lower!

Structured Features for Churn Models
The problem is extensively studied, with a rich set of approaches in the literature.
– Device
– Texting Stats
– Call Stats
– Rate Plans
– Customer Demographics

These features are great, but the models soon hit a plateau with structured features!

Blending the Unstructured with the Structured
What other sources of previously untapped data could we use?

Are our customers happy? Where? In which segments?
What are the common topics in their conversations online?

Sentiment Analysis and Topic Models
Inputs:
– Unstructured Data (External and Internal), processed by the Sentiment Analysis Engine (Classifier) and the Topic Engine (LDA)
– Structured Data: EDW
Outputs:
– More accurate likelihood to churn
– Topic Dashboard
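
Blending can be as simple as appending per-customer sentiment and topic proportions to the structured feature vector before training the churn model. The sketch below assumes customer-keyed dictionaries; `blend_features`, the neutral defaults for customers with no tweets, and the sample values are hypothetical.

```python
def blend_features(structured, sentiment, topics, num_topics):
    """Append per-customer sentiment and topic proportions to structured features."""
    blended = {}
    for customer_id, features in structured.items():
        # Customers without any text data get neutral defaults.
        topic_mix = topics.get(customer_id, [0.0] * num_topics)
        blended[customer_id] = (
            list(features)
            + [sentiment.get(customer_id, 0.0)]
            + list(topic_mix)
        )
    return blended

# Hypothetical inputs: [tenure in months, monthly minutes] per customer.
structured = {"c1": [24, 310.0], "c2": [3, 45.5]}
sentiment = {"c1": -0.4}          # average sentiment score of the customer's tweets
topics = {"c1": [0.7, 0.3]}       # topic proportions from the topic model
rows = blend_features(structured, sentiment, topics, num_topics=2)
```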


Predicting Commodity Futures through Twitter
Customer
– A major agri-business cooperative

Business Problem
– Predict price of commodity futures through Twitter

Solution
– Built Sentiment Analysis and Text Regression algorithms to predict commodity futures from Tweets
– Established the foundation for blending the structured data (market fundamentals) with unstructured data (tweets)

Challenges
– Language on Twitter does not adhere to rules of grammar and has poor structure
– No domain-specific labeled corpus of tweet sentiment; the problem is semi-supervised

The Approach

• Tweets alone had significant predictive power for the commodity of interest to us. When blended with structured features like weather data, we expect to see much better results.

What’s in it for me?



Pivotal Open Source Contributions
http://gopivotal.com/pivotal-products/open-source-software

• PyMADlib – Python Wrapper for MADlib
  - https://github.com/gopivotal/pymadlib
• PivotalR – R wrapper for MADlib
  - https://github.com/madlib-internal/PivotalR
• Part-of-speech tagger for Twitter via SQL
  - http://vatsan.github.io/gp-ark-tweet-nlp/

Questions?
@being_bayesian

