A Primer for Your Next Data Science Proof of Concept on the Cloud

Your Presenters
Alton Alexander
About Me : Data scientist. PhD dropout with a love for solving real-world business problems. Experience delivering solutions with big data, machine learning, and statistics for the marketing, manufacturing, and finance industries.
Affiliations : Front Analytics Consulting
Connect : altonalexander
alexanderalton@gmail.com
Matt Davies
About Me : “Big Data” architect / engineer with clients in retail, healthcare, e-commerce, insurance, and government. Primarily focused on operationalizing complex data-driven solutions.
Affiliations : Xpert Data Solutions
Connect : 4mattdavies
matt@xds.io
Site : http://xds.io

Agenda
● Scoping the problem and solution
● Discuss pros / cons of starting with a cloud solution
● Establishing realistic expectations, budgets, constraints, etc…
● Hands on demo
● Q&A

Scoping the Problem and Solution
● What are you trying to solve?
● What data might be helpful in answering the question(s)?
● Are there specific techniques which are known to work well?
● Do you need to use BI tools and/or export data?
● What are your timelines?
● What are your resources?

Competing on Analytics
Product
● Who are your potential customers?
● What do they want?
● Brand loyalty?
● What’s next?
Customer
● What motivates customers?
● Which channels work best?
● What else do they need?
● What is a “customer”?
Operations
● How long will X function?
● How much product waste?
● Will Y be cheaper?
e.g. Customer segmentation / profiles, market analysis, channel attribution, keywords, engagement, enrichment, churn, conversion, offers, sentiment, profiling, yield optimization, failure rates, futures

[Diagram: data pipeline, Datacenter and/or AWS]
Stages: data collectors, bulk store, batch process, live store, api service, ui
“Analytic”
Options shown on the slide, in their original order (asterisks as in the original):
● queues & oltp: *SQS, redis or couch, mongodb, rdbms
● olap: mongo, Hbase
● Thrift/Protobuf/AVRO
● sockets style: *messagepack based, netty, kafka, *kinesis
● apps: *elastic beanstalk, ec2/vm + load balance
● emitters: *messagepack
● *s3, ebs, *HDFS, cassandra
● columnar
● on file system: M/R based (pig, hive), Graph based
● off file system: anything language
● diy json: mongodb, *BaaSes, postgres
● column: Hbase/Impala, *cassandra
● graph: cypher, gremlin
● search: elasticsearch
● *stupid-simple-n-scale
● cloud/dc apps: ec2/vm + load balance, *elastic beanstalk
● sql-ish: Phoenix, Cassandra
● graph: Gremlin, Cypher
● search: elasticsearch
● d3: nvd3.js - simple, d3.js - complex, dc.js - dimensions
putting the long-A in OLAP

Pros / Cons of Starting With a Cloud Solution
Cons:
● Data sensitivity
● Less control
● Unfamiliarity with terminology and/or design
● The on-prem world is very different from the cloud: terms, risk factors, skillsets
● Data movement can be difficult
● Cloud “tax”
Pros:
● Elasticity
● Scalability
● Speed of implementation
● Focus on business problem
● Can easily create multiple instances for tests
● Less management
● Strong security
● No network or datacenter barriers
● Strong industry adoption

Pros / Cons of Starting With a Cloud Solution
General Challenges
● POC -> Production can be difficult
● Security is widely misunderstood
● Skillsets: When to hire, develop, consult, outsource

Establishing Realistic Expectations, Budgets, Constraints, Etc...
Use Case : Clear objective with identified stakeholders.
Sufficient Time : Discovery is such a large part of these projects that a projection like “Put this legacy project out in X hrs will translate to Y in big data” is not reliable.
Iterate : Like all software projects, it is usually better to iterate than to have large waterfall deployments.
Review : What went well, what failed, and where is our technical debt?

Budgets
● Know how the cloud provider makes money
● Start lean
● “Leave No Trace”
Constraints
● Time
● Complexity
● Resources

● Sleep on it
● Are you a solution in search of a need?
● Use the scientific method
● Involve yourself in the community
● Hire a consultant

POC Example: Matching multiple product offers to distinct products
http://dbs.uni-leipzig.de/file/parallel_er_with_dedoop.pdf
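To make the POC goal concrete, here is a minimal single-machine sketch of grouping product offers that appear to describe the same product. It is not the method from the linked paper. It assumes a simple blocking rule (offers are compared only when they share the same first token) and a string-similarity threshold from Python's standard `difflib`. The function names, the threshold, and the sample titles are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(title):
    """Lowercase a title and replace non-alphanumeric characters with spaces."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def match_offers(offers, threshold=0.8):
    """Group offer titles that likely describe the same product.

    Blocking: only offers whose normalized titles share a first token are compared.
    Matching: pairs with a similarity ratio >= threshold are linked together.
    """
    blocks = defaultdict(list)
    for idx, title in enumerate(offers):
        norm = normalize(title)
        key = norm.split()[0] if norm else ""
        blocks[key].append(idx)

    parent = list(range(len(offers)))  # union-find forest over offer indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for members in blocks.values():
        for a, b in combinations(members, 2):
            ratio = SequenceMatcher(
                None, normalize(offers[a]), normalize(offers[b])
            ).ratio()
            if ratio >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for idx, title in enumerate(offers):
        clusters[find(idx)].append(title)
    return sorted(clusters.values())


if __name__ == "__main__":
    sample = [
        "Canon EOS 70D Camera",
        "canon eos-70d camera",
        "Nikon D3200 Kit",
        "Canon PowerShot S120",
    ]
    for group in match_offers(sample):
        print(group)
```

Blocking keeps the number of pairwise comparisons manageable, which is also what makes this kind of job a natural fit for a distributed cluster once the data no longer fits on one machine.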

Configure and Launch your cluster
Open the AWS Web Console
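The demo launches the cluster from the web console. As an alternative, the same launch could be scripted. The sketch below builds the request for boto3's EMR `run_job_flow` call. The cluster name, key name, instance type, instance count, and release label are placeholder assumptions, and the two role names are the EMR default roles, which must already exist in the account.

```python
def build_cluster_request(name, key_name, instance_count=3,
                          instance_type="m3.xlarge",
                          release_label="emr-4.1.0"):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call.

    All default values are placeholders; adjust them to your account and budget.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hive"}, {"Name": "Ganglia"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_name,  # EC2 key pair used later for ssh
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for the demo
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def launch(request, region="us-east-1"):
    """Submit the request; requires boto3 and configured AWS credentials."""
    import boto3  # third-party; imported here so the builder works without it

    client = boto3.client("emr", region_name=region)
    return client.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Only prints the request; call launch(request) to actually start a cluster.
    request = build_cluster_request("poc-cluster", "hackathon")
    print(request)
```

Because `KeepJobFlowAliveWhenNoSteps` is set, the cluster keeps running (and billing) until it is terminated, so remember to shut it down after the POC.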

Connect to the master and Monitor Cluster
Be sure to configure your security group settings and use a private key to log in:
ssh -i ~/hackathon.pem hadoop@ec2-52-91-26-92.compute-1.amazonaws.com
Let’s set up an ssh tunnel so we can see what is happening on the cluster.
● Hadoop, Ganglia, and other applications publish user interfaces as websites hosted on the master node. For security reasons, these websites are only available on the master node's local webserver (http://localhost:port) and are not published on the Internet.
ssh -i ~/hackathon.pem -ND 8157 hadoop@ec2-52-91-26-92.compute-1.amazonaws.com
● http://ec2-52-91-26-92.compute-1.amazonaws.com

Configure Hive to query JSON
Set up the Hive table to query the underlying JSON files (see notes).
One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large, complex, highly nested JSON document is quite tedious. A tool that automates this is available at https://github.com/midpeter444/hive-json-schema.
How to get data in and out.
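To illustrate what schema generation involves, here is a minimal Python sketch that infers Hive column types from one sample JSON record and emits a `CREATE EXTERNAL TABLE` statement for the openx JSON SerDe. It is not the linked tool and is far less thorough. It assumes arrays are homogeneous, treats nulls as strings, and does not escape reserved words or handle empty objects. The table name, sample record, and S3 location are hypothetical.

```python
import json


def hive_type(value):
    """Map a parsed JSON value to a Hive type string."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        fields = ",".join(f"{key}:{hive_type(val)}" for key, val in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Assumes homogeneous arrays; an empty array falls back to strings.
        return f"array<{hive_type(value[0])}>" if value else "array<string>"
    return "string"  # strings and nulls


def create_table_ddl(table, sample_json, location):
    """Build a CREATE EXTERNAL TABLE statement from one sample JSON record."""
    record = json.loads(sample_json)
    columns = ",\n".join(
        f"  {name} {hive_type(val)}" for name, val in record.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{columns}\n)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    sample = (
        '{"id": 1, "title": "Canon EOS 70D", "price": 899.0, '
        '"tags": ["camera"], "seller": {"name": "acme", "rating": 4}}'
    )
    # Hypothetical bucket and prefix.
    print(create_table_ddl("offers", sample, "s3://example-bucket/offers/"))
```

A single record is rarely representative, so in practice the inferred schema should be checked against more of the data before the table is created.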

Bootstrap our cluster
Now we can bootstrap our cluster to load additional libraries and functions on all the nodes.
We are going to bootstrap with Python, NLP, and the Stanford library so we can pick out keywords in each record.

Map Reduce Step
How to write, test, and configure a MapReduce step
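One way to write and test such a step is a Hadoop Streaming style mapper and reducer in Python that can also run locally. The sketch below counts keywords in one field of each JSON record. It substitutes simple stopword filtering for the Stanford library used in the demo, and the field name `title`, the stopword list, and the function names are assumptions for illustration.

```python
import json
import sys
from itertools import groupby

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def mapper(lines, field="title"):
    """Emit (keyword, 1) for each non-stopword token in one JSON field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed records instead of failing the task
        text = record.get(field, "") if isinstance(record, dict) else ""
        for token in str(text).lower().split():
            word = token.strip(".,;:!?()\"'")
            if word and word not in STOPWORDS:
                yield word, 1


def reducer(pairs):
    """Sum counts per key; input must be sorted by key, as after the shuffle."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


def run_locally(lines, field="title"):
    """Simulate map, sort, and reduce in one process for quick testing."""
    return dict(reducer(sorted(mapper(lines, field))))


def main(mode, stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for streaming: mode is 'map' or 'reduce'."""
    if mode == "map":
        for word, count in mapper(stdin):
            print(f"{word}\t{count}", file=stdout)
    else:
        parsed = (line.rstrip("\n").split("\t") for line in stdin if line.strip())
        pairs = ((key, int(count)) for key, count in parsed)
        for word, total in reducer(pairs):
            print(f"{word}\t{total}", file=stdout)
```

Testing locally with `run_locally` on a handful of records is much cheaper than debugging on the cluster. To use the script as a streaming step, call `main(sys.argv[1])` under an `if __name__ == "__main__":` guard and pass `map` or `reduce` as the argument in the mapper and reducer commands.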

Retrieve and Analyze the Results
Use JDBC and R to read the results directly from RStudio.
Plot results
