“Full Stack” Data Science with R for Startups: Production-ready with Open-Source Tools

Presented by Ajay Gopal, Chief Data Scientist, SelfScore Inc

  1. "Full Stack" Data Science with R for Startups: Production-Ready with Open Source Tools. #rstats #SoCalDS17 #IDEAS17. Oct 22, 2017. Ajay Gopal
  2. #SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore
     Me: (Data) Scientist, Technologist, Entrepreneur
     Ajay Gopal, PhD : ajzz : @aj2z
     - 2017: Chief Data Scientist, SelfScore Inc #FinTech #ML #Underwriting #Risk #rstats
     - 2016: VP, Data Science & Growth, CARD.com #FinTech #MktgAutomation #BehavEcon #rstats
     - 2012: Postdoc / Staff Researcher, UCLA #BioInformatics #GraphTheory #StatMech #Python
     - 2005: PhD, Univ of Chicago #SurfacePhysics #BioPhysics #StatMech #Matlab
  3. SelfScore: Financial Education & Inclusion
     - Industry: FinTech Alt-Lending Startup, Menlo Park, CA
     - What we do: Use ML models with alternative financial signals to help deserving but underserved populations gain access to fair credit, started with international students (2 products in market)
     - Differentiator: Measure borrower’s potential instead of history (e.g. without SSN / FICO etc.)
     - Team: ~30 (4 in Data Science + You?)
     - Funding: Series B, founded in 2013
  4. This talk ... was born on Twitter
     For Startups + New Teams
     1) Evolving Data Science needs
     2) What’s “Full Stack” DS?
     3) Why use R (or Python)?
     4) Cloud R-based DS Stack
        - Sample Infra
        - Open Source tools
     -------------------------
     5) Production Mindset
     6) Buy or Build?
  5. Data Science in Modern (Gen-AI) Startups
     Data Science (VC) Expectations Evolve: Innovation Vertical + Optimization Laterally
     - Data Science vertical: IP, AI, Innovation, R&D
     - Other verticals: Operations, Finance, Compliance, Technology, Product, CX, Demand Gen, Growth
     - Lateral optimization: Infra, Process Automation, Product Optimization, Ad / Comms Optim
     Considerations:
     ● Disruptive if relying on resources from other verticals
     ● More ad-hoc work
     ● R&D timelines not predictable
     ● Faster cadence for analytics
     Solution:
     ● “Full Stack” Infra & Teams!
     ● Tools & Training for others to self-serve
  6. The “Full Stack” Analogy
     Goal: Scalable, Engaging, Valuable Web Service
     - UX. Technology: Email (SendGrid), SMS (Twilio), Push (SNS, Firebase), Msg Frmwks. Function: Multi-Channel Engagement
     - Front End. Technology: HTML/CSS, JS (Node, React), Bootstrap, iOS, Android, Ionic, Cordova. Function: Optimal Service Delivery
     - APIs. Technology: Restify, Django, Rails, ASP.net, Lambda. Function: Platform-agnostic function & information availability
     - Back End. Technology: PHP, JS, Python, Ruby, ORMs, CI, Git. Function: Business Logic
     - Data Store. Technology: MySQL, Postgres, MongoDB, Redis, Memcached etc. Function: Identities, Attribs, Relations
     - Devops. Technology: Puppet, Chef, Ansible, AWS EC2, Docker, ECS/GCE, Heroku. Function: Scalable Services & Contingencies
  7. “Full Stack” Web Services - Technologies
  8. “Full Stack” Data Science
     Goal: Scalable, Timely, Intelligence/Economic Services
     - UX. Technology: httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool. Function: Multi-Channel Engagement
     - Front End. Technology: shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox. Function: Optimal Service Delivery
     - APIs. Technology: Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino Data Lab. Function: Platform-agnostic function & information availability
     - Back End. Technology: Your internal pkgs, RServer, CI, Git, Chron, (most R packages), sparkR. Function: Business Logic
     - Data Store. Technology: DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc. Function: Identities, Attribs, Relations
     - Devops. Technology: rocker, EMIs, ECS, GCE, other cloud tools. Function: Scalable Services & Contingencies
  9. R is Sufficient For All Key Stack Functions
     1) Retrieve Data
        - Ad / Marketing
        - Sales
        - Transaction
        - 3rd Party / Behavioral
     2) Process (ETL)
        - Fetch, clean up, store
     3) Analyze
        - Cross-Connectivity
        - Aggregation & Features
        - Algorithms
     4) Predict
        - Models in batch
        - In-memory modeling
        - REST APIs
     5) Inform
        - Customers (Services & API)
        - Partners, e.g. Marketing, fulfillment
        - Internal Stakeholders, e.g. Reporting / Dashboards
  10. “Full Stack” Data Science with R
      Goal: Scalable, Timely, Intelligence/Economic Services
      - UX. Technology: httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool. Function: Multi-Channel Engagement
      - Front End. Technology: shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox. Function: Optimal Service Delivery
      - APIs. Technology: Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino, Lambda. Function: Platform-agnostic function & information availability
      - Back End. Technology: Your internal pkgs, RServer, CI, Git, H2O, (most R packages), Spark. Function: Business Logic
      - Data Store. Technology: DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), SparkR etc. Function: Identities, Attribs, Relations
      - Devops. Technology: rocker, EMIs, ECS, GCE, other cloud tools, Domino Data Lab, Azure. Function: Scalable Services & Contingencies
  11. R is great for startups!
      Top Drivers for Startups
      1. Instant Reactive Web Visualizations via Shiny (zero front-end dev)
      2. Low barrier for cross-training
      3. Fantastic IDE (RStudio) (single-point access to stack)
      4. Large ecosystem of packages (modeling + viz + utils)
      5. Great client libraries for ML frameworks
      6. Statistically Trained Prospects (Python / Pandas odds good too)
      Detractors
      - Fewer hard-core devs
      - Only a handful of dev shops; no serious bandwidth for hire
      - Memory mgmt (still?)
      So how do we build an R-based stack?
  12. Data Science Should Be This Easy
      AUTOMATION
      - Data Science IDE
      - Interactive Dashboards
      - Predictive Models & APIs
      - Alerts, Notification, Files
      So how do we build this in the cloud?
  13. Assembly of Cloud Container Services
      1) Bastion - to connect to external world (small, low memory, public IP)
      2) Scheduler - do things triggered by time & events (medium, run CI tools, invoke compute slaves)
      3) Workers - heavy feature computations (highmem, multi core, stateless)
      4) Storage - DBs, pipelines & message queues (distributed storage services or internal clusters)
      5) Modeler - H2O Cluster, MLlib, scikit-learn etc. (multi-node cluster, available on demand)
      6) Reporter - API Service / Shiny server (medium, autoscaled containers)
  14. Sample AWS Infra
  15. Choice of Tools
  16. Sample Production Workflows
      “Staging” Shiny App
      1. Git commit app to “Dev” branch
      2. Jenkins syncs repo on commit
      3. Sync triggers the next Jenkins job, which creates a Docker container
      4. Next job: AWS CLI tools deploy the Docker container to ECS
      5. “Dev” Shiny app live on staging
      6. API call to notify Slack channel
      SEM Cost Forecaster
      1. Rscript fetches AdWords spend & internal sales data every 5 minutes
      2. Rscript runs existing anomaly detection & forecast model
      3. When the check fails, API calls from R to SMS (e.g. Twilio) and Email (e.g. SendGrid)
  17. Building Full-Stack Data Science Teams
      People
      - Data / Backend Engineer
      - Data Scientist
      - Modeller / Statistician
      - Product Manager
      - Devops Engineer
      Team Output
      - EDA / ad-hoc
      - Scheduled Reporting
      - Batch Predictions
      - Stream Processing
      - Real-Time Prediction APIs
      Our “product” is scalable, actionable intelligence … let’s adopt good software development practices
  18. The Production Mindset for Data Scientists
      BetteR habits:
      1. Write inline and offline tests for your code (testthat, checkmate)
      2. Generate informational logs so you can debug later (futile.logger)
      3. Add versioning (github)
      4. Save business logic as functions in package (selfscoRe)
      5. Add examples (Rmd)
      6. Write documentation (Rmd)
      7. Create a web service (Shiny apps)
      8. Put the service in a docker container
  19. Should we buy or build?
      Should my company buy the infra? VS Should my team build it?
  20. Buy vs Build Considerations
      BUY / RENT
      - If no dev/tech in-house
      - If time-to-market is key
      requires:
      - Custom Integrations
      - Higher Cost Tolerance
      - Niche engagements
      BUILD
      - If compliance is major factor (HIPAA, PCI)
      - If cost control is key
      - Full Control of Features Reqd
      requires:
      - In-house talent
      - Longer time-to-market?
      - Ongoing maintenance
  21. Thank You!
      In Summary
      - Data Science is Vertical + Lateral!
      - Colocate data sources
      - Containerize services in the cloud
      - Use R’s Rich Ecosystem (or something easy to cross-train other verticals on)
      Hiring: Sr “Full Stack” Data Scientist
      *ML Models
      Img Credits: http://daemon.co.za/2014/04/what-does-full-stack-mean
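
The SEM Cost Forecaster workflow on slide 16 (fetch data on a schedule, run a check, alert when it fails) could be sketched roughly as below. This is an illustrative Python sketch, not the speaker's implementation, which the slide describes as Rscript jobs. The z-score test is only a stand-in for the unspecified "existing anomaly detection & forecast model", and `cost_per_sale`, `is_anomalous`, `run_check`, the fetcher, and the notifier callables are all hypothetical names. No vendor API is called: SMS or email delivery would live inside whatever callables are passed as notifiers.

```python
from statistics import mean, stdev


def cost_per_sale(spend, sales):
    """Hypothetical metric: ad spend divided by sales in the same time window."""
    if sales <= 0:
        raise ValueError("sales must be positive to compute cost per sale")
    return spend / sales


def is_anomalous(history, latest, z_threshold=3.0):
    """Stand-in check (assumption): flag `latest` when it lies more than
    `z_threshold` sample standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


def run_check(fetch_windows, notifiers, z_threshold=3.0):
    """One scheduled run: fetch (spend, sales) windows, test the newest one,
    and call every notifier if the check fails. Returns True if alerts were sent."""
    metrics = [cost_per_sale(spend, sales) for spend, sales in fetch_windows()]
    *history, latest = metrics
    if not is_anomalous(history, latest, z_threshold):
        return False
    message = f"SEM cost anomaly: latest cost per sale {latest:.2f}"
    for notify in notifiers:
        notify(message)
    return True


if __name__ == "__main__":
    # Fake data and a list standing in for SMS / email senders.
    outbox = []
    fake_fetch = lambda: [(100, 10), (102, 10), (98, 10), (101, 10), (99, 10), (250, 10)]
    run_check(fake_fetch, [outbox.append])
    print(outbox)
```

Passing the fetcher and notifiers in as callables keeps the check itself free of network code, so a scheduler can run it every few minutes in production while the same function is tested offline with fake data.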

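The first few "BetteR habits" on slide 18 (inline checks, offline tests, informational logs, business logic kept as a function) might look like the following in practice. The slide names R packages (checkmate, testthat, futile.logger); this sketch uses Python standard-library analogues instead (`raise ValueError`, `unittest`, `logging`), so it shows the pattern rather than those packages' APIs. The function `debt_to_income` and the logger name `ds_package` are hypothetical examples, not taken from the talk.

```python
import logging
import unittest

logger = logging.getLogger("ds_package")  # hypothetical package name


def debt_to_income(monthly_debt, monthly_income):
    """Hypothetical business-logic function, kept in a package instead of a script."""
    # Inline checks: fail fast on inputs the logic cannot handle.
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    if monthly_debt < 0:
        raise ValueError("monthly_debt must be non-negative")
    ratio = monthly_debt / monthly_income
    # Informational log so a later debugging session can see what was computed.
    logger.info("debt_to_income debt=%s income=%s ratio=%.3f",
                monthly_debt, monthly_income, ratio)
    return ratio


class DebtToIncomeTest(unittest.TestCase):
    """Offline tests that a CI job could run on every commit."""

    def test_typical_value(self):
        self.assertAlmostEqual(debt_to_income(500, 2000), 0.25)

    def test_rejects_non_positive_income(self):
        with self.assertRaises(ValueError):
            debt_to_income(500, 0)

    def test_logs_the_computation(self):
        with self.assertLogs("ds_package", level="INFO") as captured:
            debt_to_income(500, 2000)
        self.assertEqual(len(captured.output), 1)


def run_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DebtToIncomeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run_tests()
```

Versioning, documentation, and containerization (habits 3 and 5 to 8) are process and packaging steps, so they are left out of the sketch.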