A Primer for Your Next Data Science Proof of Concept on the Cloud
Your Presenters
Alton Alexander
About Me : Data scientist. PhD dropout with a love for
solving real world business problems. Experience delivering
solutions with big data, machine learning, and statistics for
marketing, manufacturing, and finance industries.
Affiliations : Front Analytics Consulting
Connect : altonalexander
alexanderalton@gmail.com
Matt Davies
About Me : “Big Data” architect / engineer with clients in
retail, healthcare, e-commerce, insurance, and government.
Primarily focused on operationalizing complex data-driven
solutions.
Affiliations : Xpert Data Solutions
Connect : 4mattdavies
matt@xds.io
Site : http://xds.io
Agenda
● Scoping the problem and solution
● Discuss pros / cons of starting with a cloud solution
● Establishing realistic expectations, budgets, constraints, etc…
● Hands on demo
● Q&A
Scoping the Problem and Solution
● What are you trying to solve?
● What data might be helpful in answering the question(s)?
● Are there specific techniques which are known to work well?
● Do you need to use BI tools and/or export data?
● What are your timelines?
● What are your resources?
Competing on Analytics
Product Customer Operations
Who are your potential customers?
What do they want?
Brand loyalty?
What’s next?
What motivates customers?
Which channels work best?
What else do they need?
What is a “customer”?
How long will X function?
How much product waste?
Will Y be cheaper?
e.g. Customer segmentation / profiles
Market analysis
Channel attribution
Keywords
Engagement
Enrichment
Churn
Conversion
Offers
Sentiment
Profiling
Yield optimization
Failure rates
Futures
data collectors
bulk
store
batch
process
live
store
api
service
ui
Datacenter and/or AWS
“Analytic”
queues & oltp
*SQS
redis or couch
mongodb
rdbms
olap
mongo
Hbase
Thrift/Protobuf/AVRO
sockets style
*messagepack based
netty
kafka
*kinesis
apps
*elastic beanstalk
ec2/vm + load balance
emitters
*messagepack
*s3
ebs
*HDFS
cassandra
columnar
on file system
M/R based (pig, hive)
Graph based
off file system
any language
diy json
mongodb
*BaaSes
postgres
column
Hbase/Impala
*cassandra
graph
cypher
gremlin
search
elasticsearch
*stupid-simple-n-scale
cloud/dc apps
ec2/vm + load balance
*elastic beanstalk
sql-ish
Phoenix
Cassandra
graph
Gremlin
Cypher
search
elasticsearch
d3
nvd3.js - simple
d3.js - complex
dc.js - dimensions
putting the long-A in OLAP
Pros / Cons of Starting With a Cloud Solution
Cons:
● Data sensitivity
● Less control
● Unfamiliarity with terminology and/or
design
● On-prem world is very different from cloud: terms, risk factors, skillsets
● Data movement can be difficult
● Cloud “tax”
Pros:
● Elasticity
● Scalability
● Speed of implementation
● Focus on business problem
● Can easily create multiple instances for
tests
● Less management
● Strong security
● No Network, Datacenter barriers
● Strong industry adoption
Pros / Cons of Starting With a Cloud Solution
General Challenges
● POC -> Production can be difficult
● Security is widely misunderstood
● Skillsets: When to hire, develop, consult, outsource
Use Case : Clear objective with identified stakeholders.
Sufficient Time : Discovery is such a large part of these projects that projections like “we put this legacy project out in X hrs, so it will translate to Y in big data” are not reliable.
Iterate : Like all software projects, it is usually better to iterate than to have large waterfall deployments.
Review : What went well, what failed, and where is our technical debt?
Establishing Realistic Expectations, Budgets, Constraints, Etc...
Establishing Realistic Expectations, Budgets, Constraints, Etc...
Budgets
● Know how the cloud provider makes money
● Start lean
● “Leave No Trace”
Constraints
● Time
● Complexity
● Resources
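The budget points above come down to simple arithmetic: cloud compute is typically billed per instance per unit of time, so a lean cluster that is shut down after use ("Leave No Trace") costs far less than one left running. A rough sketch, where the hourly rate and cluster size are made-up placeholders and not real AWS prices:

```python
def cluster_cost(node_count, hourly_rate_usd, hours):
    """Estimated compute cost: nodes x hourly rate x hours running."""
    return node_count * hourly_rate_usd * hours


# Hypothetical numbers for illustration only; check your provider's price list.
HOURLY_RATE_USD = 0.25   # assumed price per node per hour
NODES = 4                # a small, lean POC cluster

# A cluster used for one 8-hour working session, then terminated.
session_cost = cluster_cost(NODES, HOURLY_RATE_USD, hours=8)

# The same cluster forgotten and left running for a 30-day month.
forgotten_cost = cluster_cost(NODES, HOURLY_RATE_USD, hours=24 * 30)

print(f"one session:  ${session_cost:.2f}")
print(f"left running: ${forgotten_cost:.2f}")
```

Storage, data transfer, and managed-service surcharges are left out here; they are often where the "cloud tax" shows up, so a real estimate should add them.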
Establishing Realistic Expectations, Budgets, Constraints, Etc...
● Sleep on it
● Are you a solution in search of a need?
● Use the scientific method
● Involve yourself in the community
● Hire a consultant
A Quick Walkthrough on AWS
POC Example: Multiple product offers to distinct products
http://dbs.uni-leipzig.de/file/parallel_er_with_dedoop.pdf
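Mapping many product offers to distinct products is an entity resolution problem. The sketch below is a single-machine toy version of the common "block, then compare" idea, not the method from the linked paper. The offer titles, the blocking key (first word of the title, assumed to be the brand), and the 0.5 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def tokens(title):
    """Lowercased set of whitespace-separated tokens."""
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_offers(offers, threshold=0.5):
    """Return sorted pairs of offer ids whose titles look like the same product."""
    # Blocking: only compare offers that share a blocking key
    # (here the first token, assumed to be the brand).
    blocks = defaultdict(list)
    for offer_id, title in offers.items():
        words = title.lower().split()
        if words:
            blocks[words[0]].append(offer_id)

    matches = []
    for ids in blocks.values():
        for left, right in combinations(sorted(ids), 2):
            if jaccard(tokens(offers[left]), tokens(offers[right])) >= threshold:
                matches.append((left, right))
    return sorted(matches)


# Hypothetical offers from different sellers.
offers = {
    "a1": "Canon EOS 70D camera body",
    "b7": "canon eos 70d body",
    "c3": "Canon PIXMA printer",
    "d9": "Nikon D7100 camera body",
}
print(match_offers(offers))
```

Blocking keeps the number of pairwise comparisons manageable, which is also what makes the problem a natural fit for a map (assign block key) and reduce (compare within block) job on a cluster.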
EMR Overview
Create your bucket
Configure and Launch your cluster
Open the AWS Web Console
Connect to the master and Monitor Cluster
Be sure to configure your security group settings and use a private key to log in
ssh -i ~/hackathon.pem hadoop@ec2-52-91-26-92.compute-1.amazonaws.com
Let’s set up an ssh tunnel so we can see what is happening on the cluster
● Hadoop, Ganglia, and other applications publish user interfaces as websites
hosted on the master node. For security reasons, these websites are only
available on the master node's local webserver (http://localhost:port) and are not
published on the Internet.
ssh -i ~/hackathon.pem -ND 8157 hadoop@ec2-52-91-26-92.compute-1.amazonaws.com
● http://ec2-52-91-26-92.compute-1.amazonaws.com
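For reference, the tunnel command above breaks down as follows: `-N` tells ssh not to run a remote command, and `-D 8157` opens a local SOCKS proxy on port 8157 that forwards traffic through the master node. A small sketch that assembles the same command from its parts; the function name and defaults are hypothetical, and nothing is executed here:

```python
def socks_tunnel_command(key_path, master_dns, local_port=8157, user="hadoop"):
    """Build the ssh argument list for a SOCKS proxy to the master node."""
    return [
        "ssh",
        "-i", key_path,          # private key used to authenticate
        "-N",                    # do not run a remote command, only forward
        "-D", str(local_port),   # dynamic (SOCKS) forwarding on this local port
        f"{user}@{master_dns}",
    ]


cmd = socks_tunnel_command(
    "~/hackathon.pem", "ec2-52-91-26-92.compute-1.amazonaws.com"
)
print(" ".join(cmd))
```

To actually open the tunnel from Python, the list could be passed to `subprocess.run(cmd)` after expanding `~` with `os.path.expanduser`. The browser then needs to be pointed at the SOCKS proxy on `localhost:8157` before the cluster's web interfaces become reachable.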
Configure Hive to query JSON
Set up the Hive table to query the underlying JSON files -- (see notes)
/* ---[ A tool to automate creation of Hive JSON schemas ]--- */
One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large, complex, highly nested JSON document is quite tedious.
I've created a tool to automate this: https://github.com/midpeter444/hive-json-schema.
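To give a feel for what schema generation involves, here is a minimal sketch that walks one JSON document and derives Hive column types. It is an illustration only, not how the hive-json-schema tool is implemented; the sample record and the type choices (`bigint` for all integers, `string` for nulls and empty arrays) are assumptions.

```python
import json


def hive_type(value):
    """Map a parsed JSON value to a Hive column type string."""
    if isinstance(value, bool):      # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, list):
        # Assume homogeneous arrays; fall back to string for empty ones.
        element = hive_type(value[0]) if value else "string"
        return f"array<{element}>"
    return "string"                  # strings and nulls


def hive_columns(json_text):
    """Return 'name type' column definitions for a top-level JSON object."""
    record = json.loads(json_text)
    return [f"{name} {hive_type(value)}" for name, value in record.items()]


# Hypothetical record, similar in shape to a product offer.
sample = (
    '{"id": 7, "title": "Canon EOS", "price": 19.99, '
    '"tags": ["camera"], "seller": {"name": "acme", "rating": 4}}'
)
for column in hive_columns(sample):
    print(column)
```

The resulting column definitions could be pasted into a `CREATE EXTERNAL TABLE` statement that uses the JSON SerDe. A single document is rarely representative, so a real schema should be checked against several records.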
How to get data in and out.
Bootstrap our cluster
Now we can bootstrap our cluster to load additional libraries and functions on all the nodes.
We are going to bootstrap with Python, NLP, and the Stanford library so we can pick out keywords in each record.
Map Reduce Step
How to write, test and configure
a map reduce step
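One way to write and test such a step is in the Hadoop Streaming style, where the mapper and reducer exchange tab-separated `key<TAB>value` lines. The sketch below counts keywords and can be tested locally before it goes near a cluster. The `text` field name and the tiny stopword list are assumptions, and the plain word split stands in for the NLP keyword extraction used in the demo.

```python
import json
from itertools import groupby

STOPWORDS = {"the", "a", "an", "and", "of", "for", "with"}  # tiny illustrative list


def mapper(lines):
    """Emit 'keyword<TAB>1' for each non-stopword in each JSON record's text."""
    for line in lines:
        record = json.loads(line)
        for word in record.get("text", "").lower().split():
            if word not in STOPWORDS:
                yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per keyword; input must arrive sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local test: sorted() stands in for Hadoop's shuffle-and-sort phase.
records = [
    '{"text": "the camera body"}',
    '{"text": "camera lens for the camera"}',
]
result = list(reducer(sorted(mapper(records))))
print(result)
```

On the cluster, each function would be wrapped in its own script that reads `sys.stdin` and prints to stdout, and the two scripts would be configured as the mapper and reducer of a streaming step.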
Retrieve and Analyze the Results
Use JDBC and R to read the results directly from RStudio.
Plot results
Q&A

More Related Content

What's hot

Guiding Principles & Methodology for Cloud Computing Adoption
Guiding Principles & Methodology for Cloud Computing AdoptionGuiding Principles & Methodology for Cloud Computing Adoption
Guiding Principles & Methodology for Cloud Computing AdoptionKumar Arikrishnan
 
Software estimation challenge diederik wortman - metri
Software estimation challenge   diederik wortman - metriSoftware estimation challenge   diederik wortman - metri
Software estimation challenge diederik wortman - metriNesma
 
What I Did Last Summer
What I Did Last SummerWhat I Did Last Summer
What I Did Last SummerKevin Kelso
 
Bridging the gap rob de munnik - dutch tax office
Bridging the gap   rob de munnik - dutch tax officeBridging the gap   rob de munnik - dutch tax office
Bridging the gap rob de munnik - dutch tax officeNesma
 
Managing the p6 schedule from the perspective of an owner ppt
Managing the p6 schedule from the perspective of an owner pptManaging the p6 schedule from the perspective of an owner ppt
Managing the p6 schedule from the perspective of an owner pptp6academy
 
Karen_Anne_Johnson_Resume_June2016
Karen_Anne_Johnson_Resume_June2016Karen_Anne_Johnson_Resume_June2016
Karen_Anne_Johnson_Resume_June2016Karen Anne Johnson
 
Software sizing the cornerstone for iceaa's scebok - Carol Dekkers
Software sizing the cornerstone for iceaa's scebok - Carol DekkersSoftware sizing the cornerstone for iceaa's scebok - Carol Dekkers
Software sizing the cornerstone for iceaa's scebok - Carol DekkersNesma
 
6. software cost estimation finally becoming a real profession! - harold va...
6. software cost estimation   finally becoming a real profession! - harold va...6. software cost estimation   finally becoming a real profession! - harold va...
6. software cost estimation finally becoming a real profession! - harold va...Nesma
 
Agile Methodology - Data Migration v1.0
Agile Methodology - Data Migration v1.0Agile Methodology - Data Migration v1.0
Agile Methodology - Data Migration v1.0Julian Samuels
 
Ken Wong Resume
Ken Wong ResumeKen Wong Resume
Ken Wong ResumeKen Wong
 
Z0G Project Portfolio Management overview
Z0G Project Portfolio  Management overviewZ0G Project Portfolio  Management overview
Z0G Project Portfolio Management overviewChuong Nguyen
 
A benchmark based approach to determine language verbosity - Hans Kuijpers - ...
A benchmark based approach to determine language verbosity - Hans Kuijpers - ...A benchmark based approach to determine language verbosity - Hans Kuijpers - ...
A benchmark based approach to determine language verbosity - Hans Kuijpers - ...Nesma
 
Ms Dynamics Sure Step 2010
Ms Dynamics Sure Step 2010Ms Dynamics Sure Step 2010
Ms Dynamics Sure Step 2010Mohamed Aamer
 
What is rad model
What is rad modelWhat is rad model
What is rad modelrjasad
 
Camunda Roadshow 2019, Praxisbericht Wien: Migration von Legacy workflow Syst...
Camunda Roadshow 2019, Praxisbericht Wien: Migration von Legacy workflow Syst...Camunda Roadshow 2019, Praxisbericht Wien: Migration von Legacy workflow Syst...
Camunda Roadshow 2019, Praxisbericht Wien: Migration von Legacy workflow Syst...camunda services GmbH
 
Estimation of a micro services based estimation application bhawna thakur -...
Estimation of a micro services based estimation application   bhawna thakur -...Estimation of a micro services based estimation application   bhawna thakur -...
Estimation of a micro services based estimation application bhawna thakur -...Nesma
 
Continuous Performance Testing
Continuous Performance TestingContinuous Performance Testing
Continuous Performance TestingGrid Dynamics
 

What's hot (20)

Guiding Principles & Methodology for Cloud Computing Adoption
Guiding Principles & Methodology for Cloud Computing AdoptionGuiding Principles & Methodology for Cloud Computing Adoption
Guiding Principles & Methodology for Cloud Computing Adoption
 
Software estimation challenge diederik wortman - metri
Software estimation challenge   diederik wortman - metriSoftware estimation challenge   diederik wortman - metri
Software estimation challenge diederik wortman - metri
 
What I Did Last Summer
What I Did Last SummerWhat I Did Last Summer
What I Did Last Summer
 
Bridging the gap rob de munnik - dutch tax office
Bridging the gap   rob de munnik - dutch tax officeBridging the gap   rob de munnik - dutch tax office
Bridging the gap rob de munnik - dutch tax office
 
Managing the p6 schedule from the perspective of an owner ppt
Managing the p6 schedule from the perspective of an owner pptManaging the p6 schedule from the perspective of an owner ppt
Managing the p6 schedule from the perspective of an owner ppt
 
Cyndee_Blenkush_Resume
Cyndee_Blenkush_ResumeCyndee_Blenkush_Resume
Cyndee_Blenkush_Resume
 
Karen_Anne_Johnson_Resume_June2016
Karen_Anne_Johnson_Resume_June2016Karen_Anne_Johnson_Resume_June2016
Karen_Anne_Johnson_Resume_June2016
 
Software sizing the cornerstone for iceaa's scebok - Carol Dekkers
Software sizing the cornerstone for iceaa's scebok - Carol DekkersSoftware sizing the cornerstone for iceaa's scebok - Carol Dekkers
Software sizing the cornerstone for iceaa's scebok - Carol Dekkers
 
6. software cost estimation finally becoming a real profession! - harold va...
6. software cost estimation   finally becoming a real profession! - harold va...6. software cost estimation   finally becoming a real profession! - harold va...
6. software cost estimation finally becoming a real profession! - harold va...
 
ACONEX-workflow System
ACONEX-workflow SystemACONEX-workflow System
ACONEX-workflow System
 
Agile Methodology - Data Migration v1.0
Agile Methodology - Data Migration v1.0Agile Methodology - Data Migration v1.0
Agile Methodology - Data Migration v1.0
 
Ken Wong Resume
Ken Wong ResumeKen Wong Resume
Ken Wong Resume
 
Z0G Project Portfolio Management overview
Z0G Project Portfolio  Management overviewZ0G Project Portfolio  Management overview
Z0G Project Portfolio Management overview
 
A benchmark based approach to determine language verbosity - Hans Kuijpers - ...
A benchmark based approach to determine language verbosity - Hans Kuijpers - ...A benchmark based approach to determine language verbosity - Hans Kuijpers - ...
A benchmark based approach to determine language verbosity - Hans Kuijpers - ...
 
4+_exp_ETLTester
4+_exp_ETLTester4+_exp_ETLTester
4+_exp_ETLTester
 
Ms Dynamics Sure Step 2010
Ms Dynamics Sure Step 2010Ms Dynamics Sure Step 2010
Ms Dynamics Sure Step 2010
 
What is rad model
What is rad modelWhat is rad model
What is rad model
 
Camunda Roadshow 2019, Praxisbericht Wien: Migration von Legacy workflow Syst...
Camunda Roadshow 2019, Praxisbericht Wien: Migration von Legacy workflow Syst...Camunda Roadshow 2019, Praxisbericht Wien: Migration von Legacy workflow Syst...
Camunda Roadshow 2019, Praxisbericht Wien: Migration von Legacy workflow Syst...
 
Estimation of a micro services based estimation application bhawna thakur -...
Estimation of a micro services based estimation application   bhawna thakur -...Estimation of a micro services based estimation application   bhawna thakur -...
Estimation of a micro services based estimation application bhawna thakur -...
 
Continuous Performance Testing
Continuous Performance TestingContinuous Performance Testing
Continuous Performance Testing
 

Viewers also liked

Pragmatic steps to implement big data analytics
Pragmatic steps to implement big data analyticsPragmatic steps to implement big data analytics
Pragmatic steps to implement big data analyticsAlton Alexander
 
Data Scientist Skills
Data Scientist SkillsData Scientist Skills
Data Scientist SkillsDeZyre
 
Automation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report GenerationAutomation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report GenerationDmitry Grapov
 

Viewers also liked (6)

Pragmatic steps to implement big data analytics
Pragmatic steps to implement big data analyticsPragmatic steps to implement big data analytics
Pragmatic steps to implement big data analytics
 
Data Scientist Skills
Data Scientist SkillsData Scientist Skills
Data Scientist Skills
 
Automation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report GenerationAutomation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report Generation
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data and Advanced Analytics
Big Data and Advanced AnalyticsBig Data and Advanced Analytics
Big Data and Advanced Analytics
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 

Similar to A Primer for Your Next Data Science Proof of Concept on the Cloud

Avoiding Cloud Computing Planning & Implementation Failure
Avoiding Cloud Computing Planning & Implementation FailureAvoiding Cloud Computing Planning & Implementation Failure
Avoiding Cloud Computing Planning & Implementation FailureNathaniel Payne
 
A Modern Data Architecture for Risk Management... For Financial Services
A Modern Data Architecture for Risk Management... For Financial ServicesA Modern Data Architecture for Risk Management... For Financial Services
A Modern Data Architecture for Risk Management... For Financial ServicesMammoth Data
 
Mds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuMds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuDavid Scruggs
 
Cloud Computing Overview
Cloud Computing OverviewCloud Computing Overview
Cloud Computing OverviewDoug Allen
 
[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri...
[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri...[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri...
[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri...Srijan Technologies
 
The Essentials Of Project Management
The Essentials Of Project ManagementThe Essentials Of Project Management
The Essentials Of Project ManagementLaura Arrigo
 
So many clouds - 7 things to consider when choosing your IaaS provider
So many clouds - 7 things to consider when choosing your IaaS providerSo many clouds - 7 things to consider when choosing your IaaS provider
So many clouds - 7 things to consider when choosing your IaaS providerSirris
 
7 things to consider when choosing your IaaS provider for ISV/SaaS
7 things to consider when choosing your IaaS provider for ISV/SaaS7 things to consider when choosing your IaaS provider for ISV/SaaS
7 things to consider when choosing your IaaS provider for ISV/SaaSFrederik Denkens
 
Cloud Sobriety for Life Science IT Leadership (2018 Edition)
Cloud Sobriety for Life Science IT Leadership (2018 Edition)Cloud Sobriety for Life Science IT Leadership (2018 Edition)
Cloud Sobriety for Life Science IT Leadership (2018 Edition)Chris Dagdigian
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)James Serra
 
BusinessIntelligenze - On Cloud BI (English)
BusinessIntelligenze - On Cloud BI (English)BusinessIntelligenze - On Cloud BI (English)
BusinessIntelligenze - On Cloud BI (English)BusinessIntelligenze
 
Brighttalk understanding the promise of sde - final
Brighttalk   understanding the promise of sde - finalBrighttalk   understanding the promise of sde - final
Brighttalk understanding the promise of sde - finalAndrew White
 
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013Kai Wähner
 
Process big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudProcess big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudOVHcloud
 
MLConf Atlanta - Sept 2017
MLConf Atlanta - Sept 2017MLConf Atlanta - Sept 2017
MLConf Atlanta - Sept 2017Greg Werner
 
Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterpr...
Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterpr...Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterpr...
Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterpr...Data Con LA
 
What it takes to bring Hadoop to a production-ready state
What it takes to bring Hadoop to a production-ready stateWhat it takes to bring Hadoop to a production-ready state
What it takes to bring Hadoop to a production-ready stateClouderaUserGroups
 
Transitioning to the Cloud: Implications for Reliability, Redundancy & Recove...
Transitioning to the Cloud: Implications for Reliability, Redundancy & Recove...Transitioning to the Cloud: Implications for Reliability, Redundancy & Recove...
Transitioning to the Cloud: Implications for Reliability, Redundancy & Recove...RightScale
 
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Demi Ben-Ari
 

Similar to A Primer for Your Next Data Science Proof of Concept on the Cloud (20)

Avoiding Cloud Computing Planning & Implementation Failure
Avoiding Cloud Computing Planning & Implementation FailureAvoiding Cloud Computing Planning & Implementation Failure
Avoiding Cloud Computing Planning & Implementation Failure
 
A Modern Data Architecture for Risk Management... For Financial Services
A Modern Data Architecture for Risk Management... For Financial ServicesA Modern Data Architecture for Risk Management... For Financial Services
A Modern Data Architecture for Risk Management... For Financial Services
 
Mds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to herokuMds cloud saturday 2015 how to heroku
Mds cloud saturday 2015 how to heroku
 
Cloud Computing Overview
Cloud Computing OverviewCloud Computing Overview
Cloud Computing Overview
 
[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri...
[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri...[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri...
[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri...
 
The Essentials Of Project Management
The Essentials Of Project ManagementThe Essentials Of Project Management
The Essentials Of Project Management
 
So many clouds - 7 things to consider when choosing your IaaS provider
So many clouds - 7 things to consider when choosing your IaaS providerSo many clouds - 7 things to consider when choosing your IaaS provider
So many clouds - 7 things to consider when choosing your IaaS provider
 
Hello Cloud
Hello CloudHello Cloud
Hello Cloud
 
7 things to consider when choosing your IaaS provider for ISV/SaaS
7 things to consider when choosing your IaaS provider for ISV/SaaS7 things to consider when choosing your IaaS provider for ISV/SaaS
7 things to consider when choosing your IaaS provider for ISV/SaaS
 
Cloud Sobriety for Life Science IT Leadership (2018 Edition)
Cloud Sobriety for Life Science IT Leadership (2018 Edition)Cloud Sobriety for Life Science IT Leadership (2018 Edition)
Cloud Sobriety for Life Science IT Leadership (2018 Edition)
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
BusinessIntelligenze - On Cloud BI (English)
BusinessIntelligenze - On Cloud BI (English)BusinessIntelligenze - On Cloud BI (English)
BusinessIntelligenze - On Cloud BI (English)
 
Brighttalk understanding the promise of sde - final
Brighttalk   understanding the promise of sde - finalBrighttalk   understanding the promise of sde - final
Brighttalk understanding the promise of sde - final
 
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
 
Process big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudProcess big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public Cloud
 
MLConf Atlanta - Sept 2017
MLConf Atlanta - Sept 2017MLConf Atlanta - Sept 2017
MLConf Atlanta - Sept 2017
 
Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterpr...
Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterpr...Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterpr...
Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterpr...
 
What it takes to bring Hadoop to a production-ready state
What it takes to bring Hadoop to a production-ready stateWhat it takes to bring Hadoop to a production-ready state
What it takes to bring Hadoop to a production-ready state
 
Transitioning to the Cloud: Implications for Reliability, Redundancy & Recove...
Transitioning to the Cloud: Implications for Reliability, Redundancy & Recove...Transitioning to the Cloud: Implications for Reliability, Redundancy & Recove...
Transitioning to the Cloud: Implications for Reliability, Redundancy & Recove...
 
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
 

Recently uploaded

Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 

Recently uploaded (20)

Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story

A Primer for Your Next Data Science Proof of Concept on the Cloud

  • 1. A Primer for Your Next Data Science Proof of Concept on the Cloud
  • 2. Your Presenters
    Alton Alexander
      About Me: Data scientist. PhD dropout with a love for solving real world business problems. Experience delivering solutions with big data, machine learning, and statistics for marketing, manufacturing, and finance industries.
      Affiliations: Front Analytics Consulting
      Connect: altonalexander, alexanderalton@gmail.com
    Matt Davies
      About Me: “Big Data” architect / engineer with clients in retail, healthcare, e-commerce, insurance, and government. Primarily focused on operationalizing complex data-driven solutions.
      Affiliations: Xpert Data Solutions
      Connect: 4mattdavies, matt@xds.io
      Site: http://xds.io
  • 3. Agenda
    ● Scoping the problem and solution
    ● Discuss pros / cons of starting with a cloud solution
    ● Establishing realistic expectations, budgets, constraints, etc…
    ● Hands on demo
    ● Q&A
  • 4.
  • 5. Scoping the Problem and Solution
    ● What are you trying to solve?
    ● What data might be helpful in answering the question(s)?
    ● Are there specific techniques which are known to work well?
    ● Do you need to use BI tools and/or export data?
    ● What are your timelines?
    ● What are your resources?
  • 6. Competing on Analytics
    Areas: Product, Customer, Operations
    Questions: Who are your potential customers? What do they want? Brand loyalty? What’s next? What motivates customers? Which channels work best? What else do they need? What is a “customer”? How long will X function? How much product waste? Will Y be cheaper?
    e.g. Customer segmentation / profiles, Market analysis, Channel attribution, Keywords, Engagement, Enrichment, Churn, Conversion, Offers, Sentiment, Profiling, Yield optimization, Failure rates, Futures
  • 7. Architecture diagram (Datacenter and/or AWS; “Analytic”): putting the long-A in OLAP
    Stages: data collectors, bulk store, batch process, live store, api service, ui
    ● data collectors: queues & oltp, *SQS, redis or couch, mongodb, rdbms, olap, mongo, Hbase, Thrift/Protobuf/AVRO, sockets style, *messagepack based, netty, kafka, *kinesis, apps, *elastic beanstalk, ec2/vm + load balance, emitters, *messagepack
    ● bulk store: *s3, ebs, *HDFS, cassandra
    ● batch process: columnar on file system, M/R based (pig, hive), Graph based off file system, anything language diy
    ● live store: json, mongodb, *BaaSes, postgres, column, Hbase/Impala, *cassandra, graph, cypher, gremlin, search, elasticsearch, *stupid-simple-n-scale
    ● api service: cloud/dc apps, ec2/vm + load balance, *elastic beanstalk, sql-ish, Phoenix, Cassandra, graph, Gremlin, Cypher, search, elasticsearch
    ● ui: d3, nvd3.js - simple, d3.js - complex, dc.js - dimensions
  • 8. Pros / Cons of Starting With a Cloud Solution
    Cons:
    ● Data sensitivity
    ● Less control
    ● Unfamiliarity with terminology and/or design
    ● On prem world very different than cloud. Terms, risk factors, skillsets
    ● Data movement can be difficult
    ● Cloud “tax”
    Pros:
    ● Elasticity
    ● Scalability
    ● Speed of implementation
    ● Focus on business problem
    ● Can easily create multiple instances for tests
    ● Less management
    ● Strong security
    ● No Network, Datacenter barriers
    ● Strong industry adoption
  • 9. Pros / Cons of Starting With a Cloud Solution
    General Challenges
    ● POC -> Production can be difficult
    ● Security is widely misunderstood
    ● Skillsets: When to hire, develop, consult, outsource
  • 10. Establishing Realistic Expectations, Budgets, Constraints, Etc...
    Use Case: Clear objective with identified stakeholders.
    Sufficient Time: Discovery is such a large part of these projects that a projection like “we put this legacy project out in X hrs, so it will translate to Y in big data” is not reliable.
    Iterate: As with all software projects, it is usually better to iterate than to have large waterfall deployments.
    Review: What went well, what failed, and where is our technical debt?
  • 11. Establishing Realistic Expectations, Budgets, Constraints, Etc...
    Budgets
    ● Know how the cloud provider makes money
    ● Start lean
    ● “Leave No Trace”
    Constraints
    ● Time
    ● Complexity
    ● Resources
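A rough way to "start lean" is to estimate what a POC cluster costs before launching it. The sketch below is only an illustration: the function `poc_cost`, the node count, and every rate in it are hypothetical placeholders, not real provider prices, and real bills usually include other line items (data transfer, requests, managed-service surcharges).

```python
def poc_cost(nodes, hourly_rate, hours,
             storage_gb=0.0, storage_rate_gb_month=0.0, months=1.0):
    """Estimate POC spend as compute (nodes x rate x hours) plus storage.

    All rates are assumptions supplied by the caller; look up the
    provider's current price list before trusting any number.
    """
    compute = nodes * hourly_rate * hours
    storage = storage_gb * storage_rate_gb_month * months
    return round(compute + storage, 2)


# Hypothetical figures: 5 nodes at 0.25/hour, used for one 40-hour week,
# plus 200 GB of object storage at 0.03 per GB-month.
lean_week = poc_cost(nodes=5, hourly_rate=0.25, hours=40,
                     storage_gb=200, storage_rate_gb_month=0.03)

# The same cluster left running for a 30-day month ("Leave No Trace" ignored).
forgotten_month = poc_cost(nodes=5, hourly_rate=0.25, hours=24 * 30)

print(lean_week, forgotten_month)
```

Comparing the two numbers makes the "Leave No Trace" point concrete: under these assumed rates, an idle cluster that nobody terminates costs far more than the week of actual work.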
  • 12. Establishing Realistic Expectations, Budgets, Constraints, Etc...
    ● Sleep on it
    ● Are you a solution in search of a need?
    ● Use the scientific method
    ● Involve yourself in the community
    ● Hire a consultant
  • 14. POC Example: Multiple product offers to distinct products
    http://dbs.uni-leipzig.de/file/parallel_er_with_dedoop.pdf
  • 17. Configure and Launch your cluster
    Open the AWS Web Console
  • 18. Connect to the master and Monitor Cluster
    Be sure to configure your security group settings and use a private key to log in:
      ssh -i ~/hackathon.pem hadoop@ec2-52-91-26-92.compute-1.amazonaws.com
    Let’s set up an ssh tunnel so we can see what is happening on the cluster.
    ● Hadoop, Ganglia, and other applications publish user interfaces as websites hosted on the master node. For security reasons, these websites are only available on the master node's local webserver (http://localhost:port) and are not published on the Internet.
      ssh -i ~/hackathon.pem -ND 8157 hadoop@ec2-52-91-26-92.compute-1.amazonaws.com
    ● http://ec2-52-91-26-92.compute-1.amazonaws.com
  • 19. Configure Hive to query JSON
    Set up the Hive table to query the underlying JSON files (see notes).
    A tool to automate creation of Hive JSON schemas: One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large, complex, highly nested JSON document is quite tedious. I've created a tool to automate this: https://github.com/midpeter444/hive-json-schema.
    How to get data in and out.
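To illustrate what schema generation for the openx JSON SerDe involves, here is a minimal sketch that infers Hive column types from one sample record and prints a `CREATE EXTERNAL TABLE` statement. It is not the hive-json-schema tool linked above and makes simplifying assumptions: arrays are homogeneous, one sample record is representative, and the table name `offers`, the sample fields, and the `s3://example-bucket/offers/` location are hypothetical.

```python
import json


def hive_type(value):
    """Map a parsed JSON value to a Hive column type (simplified)."""
    if isinstance(value, bool):   # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list) and value:
        # Assumes every element has the same type as the first one.
        return f"array<{hive_type(value[0])}>"
    if isinstance(value, dict) and value:
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    return "string"               # strings, nulls, and empty containers


def create_table_ddl(table, location, sample_record):
    """Build a Hive DDL statement that reads JSON files through the openx SerDe."""
    columns = ",\n".join(
        f"  {name} {hive_type(value)}" for name, value in sample_record.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{columns}\n)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}';"
    )


# Hypothetical sample record standing in for one line of the JSON input.
sample = json.loads(
    '{"id": 1, "title": "blue widget", "price": 9.99, '
    '"tags": ["a", "b"], "seller": {"name": "acme", "rating": 4}}'
)
ddl = create_table_ddl("offers", "s3://example-bucket/offers/", sample)
print(ddl)
```

The SerDe jar still has to be available to Hive (for example via `ADD JAR`) before a statement like this will run; for deeply nested or inconsistent documents, a dedicated tool that scans many records is likely the safer choice.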
  • 20. Bootstrap our cluster
    Now we can bootstrap our cluster to load additional libraries and functions on all the nodes.
    We are going to bootstrap with Python, NLP, and the Stanford library so we can pick out keywords in each record.
  • 21. Map Reduce Step
    How to write, test, and configure a map reduce step
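One way to write and test a map reduce step before paying for cluster time is to express the mapper and reducer as plain functions and simulate the shuffle locally. The sketch below counts keywords per record set. It is an assumption-laden stand-in rather than the demo's actual step: the tiny `STOPWORDS` set replaces the Stanford library, the sample records are invented, and on a real cluster (for example with Hadoop Streaming) the same logic would read lines from stdin and write tab-separated key/value pairs to stdout.

```python
import re
from itertools import groupby

# Hypothetical stopword list; a real job would rely on a proper NLP library.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def mapper(lines):
    """Emit (keyword, 1) for every non-stopword token in each record."""
    for line in lines:
        for token in re.findall(r"[a-z0-9]+", line.lower()):
            if token not in STOPWORDS:
                yield token, 1


def reducer(pairs):
    """Sum counts per keyword; expects pairs sorted by key, as after a shuffle."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return dict(reducer(sorted(mapper(lines))))


# Invented sample records for a local test run.
records = ["Blue widget for the garden", "Garden hose and blue nozzle"]
counts = run_local(records)
print(counts)
```

Keeping the logic in pure functions means the same code can be unit tested on a laptop and then wrapped in thin stdin/stdout scripts when it is configured as a step on the cluster.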
  • 22. Retrieve and Analyze the Results
    Use JDBC and R to read the results directly from RStudio.
    Plot results
  • 23. Q&A