"Full Stack" Data Science with R
for Startups: Production-Ready with Open Source Tools
#rstats #SoCalDS17 #IDEAS17
Oct 22, 2017
Ajay Gopal
#SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore
Me: (Data) Scientist, Technologist, Entrepreneur
Ajay Gopal, PhD
: ajzz : @aj2z
2017: Chief Data Scientist, SelfScore Inc
#FinTech #ML #Underwriting #Risk #rstats
2016: VP, Data Science & Growth, CARD.com
#FinTech #MktgAutomation #BehavEcon #rstats
2012: Postdoc / Staff Researcher, UCLA
#BioInformatics #GraphTheory #StatMech #Python
2005: PhD, Univ of Chicago
#SurfacePhysics #BioPhysics #StatMech #Matlab
SelfScore: Financial Education & Inclusion
Industry: FinTech Alt-Lending Startup, Menlo Park, CA
What we do: Use ML models with alternative financial signals to help deserving but underserved populations gain access to fair credit, starting with international students (2 products in market)
Differentiator: Measure borrower’s potential instead of history (e.g. without SSN / FICO etc.)
Team: ~30 (4 in Data Science + You?)
Funding: Series B, Founded in 2013
This talk
... was born on Twitter
For Startups + New Teams
1) Evolving Data Science needs
2) What’s “Full Stack” DS?
3) Why use R (or Python)?
4) Cloud R-based DS Stack
- Sample Infra
- Open Source tools
-------------------------
5) Production Mindset
6) Buy or Build?
Data Science (VC) Expectations Evolve
Innovation Vertical + Optimization Laterally
Data Science: IP, AI, Innovation, R&D
Verticals: Operations, Finance, Compliance, Technology, Product, CX, Demand Gen, Growth
Lateral: Infra Process Automation, Product Optimization, Ad / Comms Optim
Considerations:
● Disruptive if relying on resources from other verticals
● More ad-hoc work
● R&D timelines not predictable
● Faster cadence for analytics
Solution:
● “Full Stack” Infra & Teams!
● Tools & Training for others to self-serve
Data Science in Modern (Gen-AI) Startups
The “Full Stack” Analogy
Goal: Scalable, Engaging, Valuable Web Service
Layer | Function | Technology
UX | Multi-Channel Engagement | Email (SendGrid), SMS (Twilio), Push (SNS, Firebase), Msg Frmwks
Front End | Optimal Service Delivery | HTML/CSS, JS (Node, React), Bootstrap, iOS, Android, Ionic, Cordova
APIs | Platform-agnostic function & information availability | Restify, Django, Rails, ASP.net, Lambda
Back End | Business Logic | PHP, JS, Python, Ruby, ORMs, CI, Git
Data Store | Identities, Attribs, Relations | MySQL, Postgres, MongoDB, Redis, Memcached etc.
Devops | Scalable Services & Contingencies | Puppet, Chef, Ansible, AWS EC2, Docker, ECS/GCE, Heroku
“Full Stack” Data Science
Goal: Scalable, Timely, Intelligence/Economic Services
Layer | Function | Technology
UX | Multi-Channel Engagement | httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool
Front End | Optimal Service Delivery | shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox
APIs | Platform-agnostic function & information availability | Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino Data Lab
Back End | Business Logic | Your internal pkgs, RServer, CI, Git, Chron, (most R packages), SparkR
Data Store | Identities, Attribs, Relations | DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc.
Devops | Scalable Services & Contingencies | rocker, EMIs, ECS, GCE, other cloud tools
R is Sufficient For All Key Stack Functions
1) Retrieve Data
- Ad / Marketing
- Sales
- Transaction
- 3rd Party / Behavioral
2) Process (ETL)
- Fetch, clean up, store
3) Analyze
- Cross-Connectivity
- Aggregation & Features
- Algorithms
4) Predict
- Models in batch
- In-memory modeling
- REST APIs
5) Inform
- Customers (Services & API)
- Partners
Eg: Marketing, fulfillment
- Internal Stakeholders
Eg: Reporting / Dashboards
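A minimal sketch of the five stages above chained end to end. It is written in Python (which the talk names as an alternative to R) rather than R, and every function name, row, and number is hypothetical; the "model" is a deliberately naive ratio, not anything from the talk.

```python
from statistics import mean

def retrieve():
    # 1) Retrieve: stand-in for ad, sales, or transaction sources (made-up rows).
    return [
        {"channel": "search", "spend": "100.0", "sales": "300"},
        {"channel": "social", "spend": "80.0", "sales": None},
        {"channel": "search", "spend": "50.0", "sales": "100"},
    ]

def process(rows):
    # 2) Process (ETL): drop incomplete rows and cast strings to numbers.
    return [
        {"channel": r["channel"], "spend": float(r["spend"]), "sales": float(r["sales"])}
        for r in rows
        if r["spend"] is not None and r["sales"] is not None
    ]

def analyze(rows):
    # 3) Analyze: aggregate one feature (sales per unit of spend) by channel.
    ratios = {}
    for r in rows:
        ratios.setdefault(r["channel"], []).append(r["sales"] / r["spend"])
    return {channel: mean(values) for channel, values in ratios.items()}

def predict(features, planned_spend):
    # 4) Predict: naive placeholder model, expected sales = ratio * planned spend.
    return {channel: ratio * planned_spend for channel, ratio in features.items()}

def inform(predictions):
    # 5) Inform: render one plain-text report line per channel.
    return [f"{channel}: expected sales {value:.1f}"
            for channel, value in sorted(predictions.items())]

report = inform(predict(analyze(process(retrieve())), planned_spend=40.0))
print(report)
```

Keeping each stage a plain function makes it straightforward to schedule them separately or expose the last two behind an API or dashboard.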
“Full Stack” Data Science with R
Goal: Scalable, Timely, Intelligence/Economic Services
Layer | Function | Technology
UX | Multi-Channel Engagement | httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool
Front End | Optimal Service Delivery | shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox
APIs | Platform-agnostic function & information availability | Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino, Lambda
Back End | Business Logic | Your internal pkgs, RServer, CI, Git, H2O, (most R packages), Spark
Data Store | Identities, Attribs, Relations | DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), SparkR etc.
Devops | Scalable Services & Contingencies | rocker, EMIs, ECS, GCE, other cloud tools, Domino Data Lab, Azure
R is great for startups!
Top Drivers for Startups
1. Instant Reactive Web Visualizations via Shiny (Zero front-end dev)
2. Low barrier for cross-training
3. Fantastic IDE (RStudio) (single-point access to stack)
4. Large ecosystem of packages (modeling + viz + utils)
5. Great client libraries for ML frameworks
6. Statistically Trained Prospects (Python / Pandas odds good too)
Detractors
- Fewer hard-core devs
- Only handful of dev shops; no serious bandwidth for hire
- Memory mgmt (still?)
So how do we build an R based stack?
Data Science Should Be This Easy
AUTOMATION
Data Science IDE
Interactive Dashboards
Predictive Models & APIs
Alerts Notification, Files
So how do we build this in the cloud?
Assembly of Cloud Container Services
1) Bastion - to connect to external world
(small, low memory, public IP)
2) Scheduler - do things triggered by time & events
(medium, run CI tools, invoke compute slaves)
3) Workers - heavy feature computations
(highmem, multi core, stateless)
4) Storage - DBs, pipelines & message queues
(distributed storage services or internal clusters)
5) Modeler - H2O Cluster, MLlib, scikit-learn etc
(multi-node cluster, available on demand)
6) Reporter - API Service / Shiny server
(medium, autoscaled containers)
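One way to keep this layout reviewable is to write the six roles down as data. The sketch below is a Python illustration only; the `Role` class, the trait strings, and the helper name are assumptions that paraphrase the list above, not a real deployment format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    name: str
    purpose: str
    traits: tuple  # sizing / behaviour notes paraphrased from the list above

STACK = (
    Role("bastion", "connect to external world", ("small", "low memory", "public IP")),
    Role("scheduler", "do things triggered by time & events", ("medium", "runs CI tools")),
    Role("workers", "heavy feature computations", ("highmem", "multi core", "stateless")),
    Role("storage", "DBs, pipelines & message queues", ("distributed",)),
    Role("modeler", "ML cluster", ("multi-node", "on demand")),
    Role("reporter", "API service / Shiny server", ("medium", "autoscaled")),
)

def roles_with(trait):
    # Return the names of every role that carries the given trait.
    return [role.name for role in STACK if trait in role.traits]

public_facing = roles_with("public IP")
print(public_facing)
```

In this reading of the list, only the bastion is exposed publicly, and the stateless workers are the natural candidates for on-demand scaling.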
Sample AWS Infra
Choice of Tools
Sample Production Workflows
“Staging” Shiny App
1. Git Commit App to “Dev” branch
2. Jenkins Sync Repo on Commit
3. Sync triggers next Jenkins job, which creates Docker container
4. Next job: AWS cli tools deploy Docker container to ECS
5. “Dev” Shiny app live on staging
6. API call to notify Slack channel
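The “Staging” Shiny App steps can be sketched as a dry-run plan: the commands a CI job might run, plus the JSON body for a Slack incoming webhook. This is a Python illustration under assumptions. The app, registry, cluster, and service names are hypothetical, and it assumes the ECS task definition points at a mutable `dev` image tag so that forcing a new deployment picks up the fresh image. Nothing is executed here.

```python
import json

def staging_deploy_plan(app, registry, cluster, service):
    # Image tagged with the branch name; assumed to match the ECS task definition.
    image = f"{registry}/{app}:dev"
    commands = [
        # Build the container image from the synced repo.
        ["docker", "build", "-t", image, "."],
        # Publish it so ECS can pull it.
        ["docker", "push", image],
        # Ask ECS to restart the service with the current task definition.
        ["aws", "ecs", "update-service", "--cluster", cluster,
         "--service", service, "--force-new-deployment"],
    ]
    # Slack incoming webhooks accept a JSON body with a "text" field.
    slack_payload = json.dumps({"text": f"{app} is live on staging"})
    return commands, slack_payload

commands, payload = staging_deploy_plan(
    "forecast-app", "registry.example.com", "staging", "forecast-app-dev"
)
for command in commands:
    print(" ".join(command))
print(payload)
```

Returning the plan instead of running it keeps the sketch testable; a CI job would pass each command to its shell step and POST the payload to the webhook URL.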
SEM Cost Forecaster
1. Rscript fetches Adwords spend & internal sales data every 5 minutes.
2. Rscript runs existing anomaly detection & forecast model
3. When check fails, API calls from R to SMS (eg Twilio) and Email (eg: SendGrid).
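A rough sketch of the check-then-alert step in the SEM Cost Forecaster, in Python rather than R. The talk does not say which anomaly model is used, so a simple z-score test stands in for it here; the threshold, the spend figures, and the function names are all assumptions, and the alerts are returned as messages rather than sent through any SMS or email API.

```python
from statistics import mean, stdev

def spend_is_anomalous(history, latest, z_threshold=3.0):
    # Stand-in check: flag spend far from the recent mean, measured in std devs.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

def build_alerts(history, latest):
    # No alert when the check passes; one message per channel when it fails.
    if not spend_is_anomalous(history, latest):
        return []
    message = f"SEM spend alert: latest={latest:.2f}, recent mean={mean(history):.2f}"
    return [("sms", message), ("email", message)]

history = [100.0, 102.0, 98.0, 101.0, 99.0]  # made-up spend per interval
alerts = build_alerts(history, 180.0)
for channel, message in alerts:
    print(channel, message)
```

A scheduler would call `build_alerts` on each fetch and hand any returned messages to the SMS and email providers.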
Building Full-Stack Data Science Teams
People
- Data / Backend Engineer
- Data Scientist
- Modeller / Statistician
- Product Manager
- Devops Engineer
Team Output
- EDA / ad-hoc
- Scheduled Reporting
- Batch Predictions
- Stream Processing
- Real-Time Prediction APIs
Our “product” is scalable, actionable intelligence
… let’s adopt good software development practices
The Production Mindset for Data Scientists
BetteR habits:
1. Write inline and offline tests for your code (testthat, checkmate)
2. Generate informational logs so you can debug later (futile.logger)
3. Add versioning (github)
4. Save business logic as functions in package (selfscoRe)
5. Add examples (Rmd)
6. Write documentation (Rmd)
7. Create a web service (Shiny apps)
8. Put the service in a docker container
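Habits 1, 2, and 4 can be illustrated together in a few lines. The sketch below uses Python's standard library as a stand-in for the R packages named above (input checks in place of checkmate, `logging` in place of futile.logger, a plain test function in place of testthat). The `debt_to_income` function is a hypothetical piece of business logic, not something from the talk.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scoring")

def debt_to_income(monthly_debt, monthly_income):
    # Inline checks: fail fast on inputs the business logic cannot handle.
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    if monthly_debt < 0:
        raise ValueError("monthly_debt must not be negative")
    ratio = monthly_debt / monthly_income
    # Informational log so a later debugging session can trace inputs and output.
    log.info("debt_to_income debt=%s income=%s ratio=%.3f",
             monthly_debt, monthly_income, ratio)
    return ratio

def test_debt_to_income():
    # Offline test: a known value plus the failure path.
    assert debt_to_income(500, 2000) == 0.25
    try:
        debt_to_income(500, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for zero income")

test_debt_to_income()
```

Once logic like this lives in a package with its tests, wrapping it in a web service and a container (habits 7 and 8) becomes a packaging step rather than a rewrite.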
Should we buy or build?
Should my company buy the infra? VS Should my team build it?
Buy vs Build Considerations
BUY / RENT
- If no dev/tech in-house
- If time-to-market is key
requires:
- Custom Integrations
- Higher Cost Tolerance
- Niche engagements
BUILD
- If compliance is major factor
(HIPAA, PCI)
- If cost control is key
- Full Control of Features Reqd
requires:
- In-house talent
- Longer time-to-market?
- Ongoing maintenance
Thank You!
Img Credits: http://daemon.co.za/2014/04/what-does-full-stack-mean
*ML Models
Hiring Sr “Full Stack” Data Scientist
In Summary
- Data Science is Vertical + Lateral!
- Colocate data sources
- Containerize services in the cloud
- Use R’s Rich Ecosystem (or something easy to cross-train other verticals on)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

“Full Stack” Data Science with R for Startups: Production-ready with Open-Source Tools

  • 1. “Full Stack” Data Science with R for Startups: Production-Ready with Open Source Tools. Ajay Gopal, Oct 22, 2017. #rstats #SoCalDS17 #IDEAS17
  • 2. Me: (Data) Scientist, Technologist, Entrepreneur
      Ajay Gopal, PhD (ajzz, @aj2z)
      - 2017: Chief Data Scientist, SelfScore Inc (#FinTech #ML #Underwriting #Risk #rstats)
      - 2016: VP, Data Science & Growth, CARD.com (#FinTech #MktgAutomation #BehavEcon #rstats)
      - 2012: Postdoc / Staff Researcher, UCLA (#BioInformatics #GraphTheory #StatMech #Python)
      - 2005: PhD, Univ of Chicago (#SurfacePhysics #BioPhysics #StatMech #Matlab)
      #SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore
  • 3. SelfScore: Financial Education & Inclusion
      - Industry: FinTech Alt-Lending Startup, Menlo Park, CA
      - What we do: Use ML models with alternative financial signals to help deserving but underserved populations gain access to fair credit; started with international students (2 products in market)
      - Differentiator: Measure borrower’s potential instead of history (e.g., without SSN, FICO, etc.)
      - Team: ~30 (4 in Data Science + You?)
      - Funding: Series B, Founded in 2013
  • 4. This talk ... was born on Twitter. For Startups + New Teams:
      1) Evolving Data Science needs
      2) What’s “Full Stack” DS?
      3) Why use R (or Python)?
      4) Cloud R-based DS Stack
         - Sample Infra
         - Open Source tools
      5) Production Mindset
      6) Buy or Build?
  • 5. Data Science (VC) Expectations Evolve: Innovation Vertical + Optimization Laterally
      Diagram labels: Data Science (IP, AI, Innovation, R&D); Operations, Finance, Compliance, Technology, Product, CX, Demand Gen, Growth; Infra, Process Automation, Product Optimization, Ad / Comms Optim
      Considerations:
      ● Disruptive if relying on resources from other verticals
      ● More ad-hoc work
      ● R&D timelines not predictable
      ● Faster cadence for analytics
      Solution:
      ● “Full Stack” Infra & Teams!
      ● Tools & Training for others to self-serve
      Data Science in Modern (Gen-AI) Startups
  • 6. The “Full Stack” Analogy. Goal: Scalable, Engaging, Valuable Web Service
      - UX. Technology: Email (SendGrid), SMS (Twilio), Push (SNS, Firebase), messaging frameworks. Function: Multi-Channel Engagement
      - Front End. Technology: HTML/CSS, JS (Node, React), Bootstrap, iOS, Android, Ionic, Cordova. Function: Optimal Service Delivery
      - APIs. Technology: Restify, Django, Rails, ASP.net, Lambda. Function: Platform-agnostic function & information availability
      - Back End. Technology: PHP, JS, Python, Ruby, ORMs, CI, Git. Function: Business Logic
      - Data Store. Technology: MySQL, PostgreSQL, MongoDB, Redis, Memcached etc. Function: Identities, Attribs, Relations
      - Devops. Technology: Puppet, Chef, Ansible, AWS EC2, Docker, ECS/GCE, Heroku. Function: Scalable Services & Contingencies
  • 7. “Full Stack” Web Services - Technologies: same layers, technologies, and functions as slide 6. Goal: Scalable, Engaging, Valuable Web Service
  • 8. “Full Stack” Data Science. Goal: Scalable, Timely, Intelligence/Economic Services
      - UX. Technology: httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool. Function: Multi-Channel Engagement
      - Front End. Technology: shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox. Function: Optimal Service Delivery
      - APIs. Technology: Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino Data Lab. Function: Platform-agnostic function & information availability
      - Back End. Technology: Your internal pkgs, RServer, CI, Git, Chron, (most R packages), sparkR. Function: Business Logic
      - Data Store. Technology: DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc. Function: Identities, Attribs, Relations
      - Devops. Technology: rocker, EMIs, ECS, GCE, other cloud tools. Function: Scalable Services & Contingencies
  • 9. R is Sufficient For All Key Stack Functions
      1) Retrieve Data
         - Ad / Marketing
         - Sales
         - Transaction
         - 3rd Party / Behavioral
      2) Process (ETL)
         - Fetch, clean up, store
      3) Analyze
         - Cross-Connectivity
         - Aggregation & Features
         - Algorithms
      4) Predict
         - Models in batch
         - In-memory modeling
         - REST APIs
      5) Inform
         - Customers (Services & API)
         - Partners (e.g., marketing, fulfillment)
         - Internal Stakeholders (e.g., reporting / dashboards)
  • 10. “Full Stack” Data Science with R. Goal: Scalable, Timely, Intelligence/Economic Services
      - UX. Technology: httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool. Function: Multi-Channel Engagement
      - Front End. Technology: shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox. Function: Optimal Service Delivery
      - APIs. Technology: Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino, Lambda. Function: Platform-agnostic function & information availability
      - Back End. Technology: Your internal pkgs, RServer, CI, Git, H2O, (most R packages), Spark. Function: Business Logic
      - Data Store. Technology: DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), SparkR etc. Function: Identities, Attribs, Relations
      - Devops. Technology: rocker, EMIs, ECS, GCE, other cloud tools, Domino Data Lab, Azure. Function: Scalable Services & Contingencies
  • 11. R is great for startups!
      Top Drivers for Startups
      1. Instant Reactive Web Visualizations via Shiny (zero front-end dev)
      2. Low barrier for cross-training
      3. Fantastic IDE (RStudio) (single-point access to stack)
      4. Large ecosystem of packages (modeling + viz + utils)
      5. Great client libraries for ML frameworks
      6. Statistically Trained Prospects (Python / Pandas odds good too)
      Detractors
      - Fewer hard-core devs
      - Only handful of dev shops; no serious bandwidth for hire
      - Memory mgmt (still?)
      So how do we build an R based stack?
  • 12. Data Science Should Be This Easy
      Diagram labels: AUTOMATION; Data Science IDE; Interactive Dashboards; Predictive Models & APIs; Alerts, Notification, Files
      So how do we build this in the cloud?
  • 13. Assembly of Cloud Container Services
      1) Bastion - to connect to external world (small, low memory, public IP)
      2) Scheduler - do things triggered by time & events (medium, run CI tools, invoke compute slaves)
      3) Workers - heavy feature computations (highmem, multi core, stateless)
      4) Storage - DBs, pipelines & message queues (distributed storage services or internal clusters)
      5) Modeler - H2O Cluster, MLlib, Sci-Kit etc (multi-node cluster, available on demand)
      6) Reporter - API Service / Shiny server (medium, autoscaled containers)
  • 14. Sample AWS Infra
  • 15. Choice of Tools
  • 16. Sample Production Workflows
      “Staging” Shiny App
      1. Git commit of the app to the “Dev” branch
      2. Jenkins syncs the repo on commit
      3. The sync triggers the next Jenkins job, which creates a Docker container
      4. Next job: AWS CLI tools deploy the Docker container to ECS
      5. “Dev” Shiny app is live on staging
      6. API call notifies a Slack channel
      SEM Cost Forecaster
      1. Rscript fetches AdWords spend & internal sales data every 5 minutes
      2. Rscript runs the existing anomaly detection & forecast model
      3. When the check fails, API calls from R go out to SMS (e.g., Twilio) and Email (e.g., SendGrid)
  • 17. Building Full-Stack Data Science Teams
      People
      - Data / Backend Engineer
      - Data Scientist
      - Modeller / Statistician
      - Product Manager
      - Devops Engineer
      Team Output
      - EDA / ad-hoc
      - Scheduled Reporting
      - Batch Predictions
      - Stream Processing
      - Real-Time Prediction APIs
      Our “product” is scalable, actionable intelligence … let’s adopt good software development practices
  • 18. The Production Mindset for Data Scientists
      BetteR habits:
      1. Write inline and offline tests for your code (testthat, checkmate)
      2. Generate informational logs so you can debug later (futile.logger)
      3. Add versioning (GitHub)
      4. Save business logic as functions in a package (selfscoRe)
      5. Add examples (Rmd)
      6. Write documentation (Rmd)
      7. Create a web service (Shiny apps)
      8. Put the service in a Docker container
  • 19. Should we buy or build? Should my company buy the infra, or should my team build it?
  • 20. Buy vs Build Considerations
      BUY / RENT
      - If no dev/tech in-house
      - If time-to-market is key
      requires:
      - Custom Integrations
      - Higher Cost Tolerance
      - Niche engagements
      BUILD
      - If compliance is major factor (HIPAA, PCI)
      - If cost control is key
      - Full Control of Features Reqd
      requires:
      - In-house talent
      - Longer time-to-market?
      - Ongoing maintenance
  • 21. Thank You!
      In Summary
      - Data Science is Vertical + Lateral!
      - Colocate data sources
      - Containerize services in the cloud
      - Use R’s Rich Ecosystem (or something easy to cross-train other verticals on)
      Hiring Sr “Full Stack” Data Scientist
      *ML Models
      Img Credits: http://daemon.co.za/2014/04/what-does-full-stack-mean
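
The “SEM Cost Forecaster” workflow on slide 16 (fetch spend, run a check, alert when the check fails) can be sketched roughly as follows. The talk builds this in R; Python is used here only for illustration. The z-score rule is a stand-in assumption for the “existing anomaly detection & forecast model”, which the slides do not describe, and `is_anomalous`, `run_check`, and the notifier callbacks are hypothetical names. Real notifiers would wrap an SMS or email API; none is called here. The logging and the small, testable functions follow the “BetteR habits” from slide 18.

```python
import logging
import statistics
from typing import Callable, Iterable, Sequence

logger = logging.getLogger("sem_cost_check")  # hypothetical logger name


def is_anomalous(history: Sequence[float], latest: float,
                 z_threshold: float = 3.0) -> bool:
    """Return True when `latest` lies more than `z_threshold` sample
    standard deviations from the mean of `history` (assumed rule)."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Flat history: any deviation at all counts as anomalous.
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold


def run_check(history: Sequence[float], latest: float,
              notifiers: Iterable[Callable[[str], None]]) -> bool:
    """Run the check and fan the alert out to every notifier on failure."""
    logger.info("checking spend %.2f against %d historical points",
                latest, len(history))
    if not is_anomalous(history, latest):
        return False
    message = f"SEM spend anomaly: latest={latest:.2f}"
    logger.warning(message)
    for notify in notifiers:
        notify(message)  # e.g. a wrapper around an SMS or email API
    return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    sent = []  # stands in for real SMS / email delivery
    run_check([100.0, 102.0, 98.0, 101.0, 99.0], 250.0, [sent.append])
    print(sent)
```

Passing the notifiers in as plain callables keeps the check testable without network access; a scheduler (cron or a CI tool, as in the slides) would invoke `run_check` on the chosen interval.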