Hidden Gems found with Hadoop
Paco Nathan
Lead, Analytics team @ IMVU.com
Ask Questions Early…

‣ How do Hadoop and “Big Data” fit into the practice
  of Continuous Deployment ?

‣ Why don’t we simply load all our data into Oracle,
  then generate reports and spreadsheets as needed ?

‣ Given all the conflicting “NoSQL” options, how
  does an engineer design an effective data store ?

‣ Is there one framework we can just buy and resolve
  all these annoying data issues ?

‣ What kinds of analytics work can be performed
  using Hadoop in the cloud ?

‣ Is IMVU currently hiring ? ☺
Continuous Deployment

• IMVU: ~50 engineers work in parallel, builds push live every ~8 minutes
• depends on “immune system” regression checks, progressive roll-outs
• dedication to transparency and metrics: data-intensive company culture
• extensive use of customer experiments (A/B testing) on millions of users
• instrumentation, alerting, strict discipline on config and resource usage
• Ops excellence, plus big investment in a finely tuned production environment
  http://www.quora.com/What-are-best-examples-of-companies-using-continuous-deployment

  http://www.slideshare.net/bgdurrett/3-reasons-you-should-use-continuous-deployment

  http://www.startuplessonslearned.com/2009/06/why-continuous-deployment.html
Continuous Deployment
Data Analytics

• data usage downstream from production cluster is a lower priority
• industry truism: data usage downstream almost never trumps
  the priority of direct revenue transactions

• even so, business strategy depends on data analytics – which in
  practice, at scale, must live downstream from transactions

• however, data analytics jobs tend to undermine the extensive
   testing/monitoring work which makes continuous deployment possible:

     - mission critical code which can’t be verified readily by unit tests
     - “slow queries” trip immune system, signaling regressions
     - likewise for large data transfers within production cluster
     - tightly configured environment vs. elastic resource needs
How Did We Get Here?

• big Internet successes after 1997 holiday season…
  AMZN, EBAY, then GOOG, Inktomi (YHOO Search)

• consider how, among tech firms, this metric:
    annual revenue per customer / operational data store size
  dropped more than 100x within a few years after 1997

• “conventional wisdom” of RDBMS and BI tools became
  much less viable; however, business cadre which came of
  age when “spreadsheets were new” tends to carry along
  too much inertia to confront these issues pro-actively

• on one hand, storage and processing costs plummeted…
  on the other hand, we must now work much smarter
  to extract ROI from “Big Data”, so methods must adapt

• MapReduce and the Hadoop open source stack grew
  directly out of this context… but they only solve part
  of these problems
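
To make the MapReduce model mentioned above concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the log format (`user_id<TAB>event_name`) and the function names are assumptions made for illustration, and a real job would distribute the map, shuffle, and reduce steps across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit one (event_name, 1) pair per log line."""
    # Assumed log format: "<user_id>\t<event_name>" (hypothetical)
    _user_id, event = record.split("\t")
    yield event, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


logs = ["u1\tlogin", "u2\tlogin", "u1\tchat"]
print(run_job(logs))  # {'login': 2, 'chat': 1}
```

The point of the split is that `map_phase` and `reduce_phase` hold no shared state, which is what lets a framework run many copies of each in parallel.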
CAP Theorem

• Eric Brewer, 2000: “You can have at most two of these properties for
  any shared-data system … the choice of which feature to discard
  determines the nature of your system.”

• direct revenue apps in consumer Internet require consistency and
   partition tolerance

• data analytics jobs for business uses generally require availability and
  eventual consistency, but tend to not tolerate highly partitioned data

• ETL becomes an Achilles heel for “Lean Startup™”:
     ‣ agile/experiment-driven/scale-out, which leads to…
     ‣ provably-hard-to-detect metadata drift, which leads to…
     ‣ high-risk technical debt
  https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

  http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

  [diagram: CAP triangle with vertices C (strong consistency), A (high availability),
   P (partition tolerance); also labeled: RDBMS, eventual consistency]
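
The metadata drift mentioned above can be partly guarded against with a structural check at the ETL boundary. The sketch below is one possible approach, not something described in the talk: the schema format (field name mapped to a Python type) and the function name `detect_drift` are assumptions. It only catches the easy cases (missing, extra, or retyped fields); semantic drift, where a field keeps its name and type but changes meaning, is the hard part and is not addressed here.

```python
def detect_drift(expected, record):
    """Compare one record against an expected schema.

    expected: dict of field name -> Python type (hypothetical schema format)
    record:   dict holding one row produced by an upstream system
    Returns a sorted list of human-readable findings (empty if none).
    """
    findings = []
    for field, ftype in expected.items():
        if field not in record:
            findings.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            findings.append(
                f"type changed: {field} is {type(record[field]).__name__}, "
                f"expected {ftype.__name__}"
            )
    # Fields the producer added without telling downstream consumers
    for field in record.keys() - expected.keys():
        findings.append(f"unexpected field: {field}")
    return sorted(findings)


EXPECTED = {"user_id": int, "event": str, "ts": int}
row = {"user_id": "42", "event": "login", "session": "abc"}
for finding in detect_drift(EXPECTED, row):
    print(finding)
```

Running a check like this on a sample of each batch, and alerting when findings appear, is cheaper than discovering the drift later in a report.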
Data Access Patterns

• design patterns: originated in consensus negotiation
  for architecture, then software engineering

• consider the corollaries in large scale data wrangling…
• essential advice:
  select data frameworks based on your data access patterns

• in other words, decouple usage based on need –
  to avoid “one size fits all” blockers

• let’s review some examples…
Access Patterns ↔ Frameworks

access pattern                      framework                          CAP
financial transactions              general ledger in RDBMS            CAx
ad-hoc queries                      RDS (hosted MySQL)                 CAx
reporting, dashboards               like Pentaho                       CAx
log rotation/persistence            like Riak                          xxP
search indexes                      like Lucene, Solr                  xAP
static content, archives            S3 (durable storage)               xAP
customer facts                      like Redis, Membase                xAP
distributed counters, locks, sets   like Redis                         xAP*
data objects CRUD                   key/value – like, NoSQL on MySQL   CxP
authoritative metadata              like Zookeeper                     CxP
data prep, modeling at scale        like Hadoop/Hive/Cascading + R     CxP
graph analysis                      like Hadoop + Redis + Gephi        CxP
data marts                          like Hadoop/Hive/HBase             CxP
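
The table above can be treated as a lookup keyed by access pattern. The sketch below copies its rows into a Python dict and groups the patterns by CAP label; the names `ACCESS_PATTERNS` and `group_by_cap` are illustrative, and the labels are taken from the slide as written rather than verified against each product.

```python
# Rows copied from the slide: access pattern -> (example framework, CAP label)
ACCESS_PATTERNS = {
    "financial transactions": ("general ledger in RDBMS", "CAx"),
    "ad-hoc queries": ("RDS (hosted MySQL)", "CAx"),
    "reporting, dashboards": ("like Pentaho", "CAx"),
    "log rotation/persistence": ("like Riak", "xxP"),
    "search indexes": ("like Lucene, Solr", "xAP"),
    "static content, archives": ("S3 (durable storage)", "xAP"),
    "customer facts": ("like Redis, Membase", "xAP"),
    "distributed counters, locks, sets": ("like Redis", "xAP*"),
    "data objects CRUD": ("key/value, like NoSQL on MySQL", "CxP"),
    "authoritative metadata": ("like Zookeeper", "CxP"),
    "data prep, modeling at scale": ("like Hadoop/Hive/Cascading + R", "CxP"),
    "graph analysis": ("like Hadoop + Redis + Gephi", "CxP"),
    "data marts": ("like Hadoop/Hive/HBase", "CxP"),
}


def group_by_cap(patterns):
    """Return CAP label -> list of access patterns, in table order."""
    groups = {}
    for pattern, (_framework, cap) in patterns.items():
        groups.setdefault(cap, []).append(pattern)
    return groups


for cap, names in group_by_cap(ACCESS_PATTERNS).items():
    print(cap, "->", "; ".join(names))
```

Grouping this way shows the decoupling argument at a glance: no single label covers every row, so no single store fits every use.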
Data Prep → Modeling at Scale
Analytics jobs performed in the cloud with Hadoop, R, etc.:
 • log clean-up, sessionization
 • roll-ups, slices, sampling, data cubes, visualizations
 • language identification, key phrase extraction
 • co-occurrence analysis, topic trending
 • custom search indexes
 • random forests and other classifiers
 • connected components, effects across social graph
 • virtual economy metrics
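
As an illustration of the first job in the list, sessionization, here is a minimal in-memory sketch. It assumes events arrive as `(user_id, unix_timestamp)` pairs and that a gap of more than 30 minutes starts a new session; both the input shape and the cutoff are assumptions, not details from the talk. At scale the same logic would typically run in a reducer keyed by user.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP = 30 * 60  # seconds; assumed inactivity cutoff


def sessionize(events, gap=SESSION_GAP):
    """Split each user's events into sessions.

    events: iterable of (user_id, unix_ts) pairs, in any order
    Returns a list of (user_id, [ts, ...]) tuples, one per session.
    """
    sessions = []
    ordered = sorted(events)  # by user, then by timestamp
    for user, group in groupby(ordered, key=itemgetter(0)):
        current = []
        for _, ts in group:
            # A long enough silence closes the current session
            if current and ts - current[-1] > gap:
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions


events = [("u1", 0), ("u1", 600), ("u2", 100), ("u1", 4000)]
for user, stamps in sessionize(events):
    print(user, stamps)
```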

Business use cases:
 • customer segmentation
 • retention models
 • anti-fraud
 • content recommendation
 • ad optimization

[figure: chart labeled with customer event names, e.g. NUI:DressUpMode, Add Buddy,
 Website Login, Leave a Message, Buy an Item: web, Unique Registration]
Finding Hidden Gems…

[diagram: data flows left to right across four columns]
data objects, transactions: MySQL partitions, then ETL
cloud-based data marts: S3, Hadoop, Hive
data access patterns: RDS, Lucene / Solr, Redis, Gephi, R, plus a reporting framework and a cache
business use cases:
 - ad-hoc queries, reporting
 - search, recommenders, data services
 - graph analysis, sessionization, data services
 - predictive modeling, social graph, factor analysis, time series, data visualization
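
Graph analysis appears both in the earlier job list (connected components) and among the use cases here. The sketch below finds connected components with a depth-first traversal over an in-memory edge list. It is a single-machine illustration under assumed inputs (pairs of user names); the talk's stack of Hadoop, Redis, and Gephi would handle this at scale with different machinery.

```python
from collections import defaultdict


def connected_components(edges):
    """Return a list of sorted node lists, one per component.

    edges: iterable of (a, b) pairs describing an undirected graph
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Iterative depth-first traversal from an unvisited node
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


friendships = [("ann", "bob"), ("bob", "cat"), ("dan", "eve")]
print(connected_components(friendships))
# [['ann', 'bob', 'cat'], ['dan', 'eve']]
```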
Related Resources

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/

http://www.slideshare.net/pacoid/getting-started-on-hadoop

https://github.com/ceteri/ceteri-mapred

http://redis.io/

http://www.r-project.org/

http://gephi.org/
Analytics Team, IMVU.com

• IMVU: 90 employees in Bay Area, $40MM annual rev
• largest virtual goods catalog: +6MM items UGC
- Best Places to Work in Bay Area, 2011 & 2010
- Red Herring Global 100 Tech Startup, 2010
- Inc. 500, 2010
http://www.imvu.com/jobs/
@pacoid

 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Hidden Gems found with Hadoop

  • 1. Hidden Gems found with Hadoop Paco Nathan Lead, Analytics team @ IMVU.com
  • 2. Ask Questions Early… ‣ How do Hadoop and “Big Data” fit into the practice of Continuous Deployment ? ‣ Why don’t we simply load all our data into Oracle, then generate reports and spreadsheets as needed ? ‣ Given all the conflicting “NoSQL” options, how does an engineer design an effective data store ? ‣ Is there one framework we can just buy and resolve all these annoying data issues ? ‣ What kinds of analytics work can be performed using Hadoop in the cloud ? ‣ Is IMVU currently hiring ? ☺
  • 3. Continuous Deployment • IMVU: ~50 engineers work in parallel, builds push live every ~8 minutes • depends on “immune system” regression checks, progressive roll-outs • dedication to transparency and metrics: data-intensive company culture • extensive use of customer experiments (A/B testing) on millions of users • instrumentation, alerting, strict discipline on config and resource usage • Ops excellence, plus big investment in a finely tuned production environment http://www.quora.com/What-are-best-examples-of-companies-using-continuous-deployment http://www.slideshare.net/bgdurrett/3-reasons-you-should-use-continuous-deployment http://www.startuplessonslearned.com/2009/06/why-continuous-deployment.html
  • 5. Data Analytics • data usage downstream from production cluster is a lower priority • industry truism: data usage downstream almost never trumps the priority of direct revenue transactions • even so, business strategy depends on data analytics – which in practice, at scale, must live downstream from transactions • however, data analytics jobs tend to break that extensive work in testing/monitoring which allows for continuous deployment: - mission critical code which can’t be verified readily by unit tests - “slow queries” trip immune system, signaling regressions - likewise for large data transfers within production cluster - tightly configured environment vs. elastic resource needs
  • 6. How Did We Get Here? • big Internet successes after 1997 holiday season… AMZN, EBAY, then GOOG, Inktomi (YHOO Search) • consider how, among tech firms, this metric: (annual revenue per customer) / (operational data store size) dropped more than 100x within a few years after 1997 • “conventional wisdom” of RDBMS and BI tools became much less viable; however, business cadre which came of age when “spreadsheets were new” tends to carry along too much inertia to confront these issues pro-actively • on one hand, storage and processing costs plummeted… on the other hand, we must now work much smarter to extract ROI from “Big Data”, so methods must adapt • MapReduce and the Hadoop open source stack grew directly out of this context… but they only solve part of these problems
  • 7. CAP Theorem • Eric Brewer, 2000: “You can have at most two of these properties for any shared-data system … the choice of which feature to discard determines the nature of your system.” • direct revenue apps in consumer Internet require consistency and partition tolerance • data analytics jobs for business uses generally require availability and eventual consistency, but tend to not tolerate highly partitioned data • ETL becomes an Achilles heel for “Lean Startup™”: ‣ agile/experiment-driven/scale-out, which leads to… ‣ provably-hard-to-detect metadata drift, which leads to… ‣ high-risk technical debt https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf http://www.julianbrowne.com/article/viewer/brewers-cap-theorem [Diagram: CAP triangle with vertices C (strong consistency), A (high availability), P (partition tolerance); other labels: RDBMS, eventual consistency]
  • 8. Data Access Patterns • design patterns: originated in consensus negotiation for architecture, then software engineering • consider the corollaries in large scale data wrangling… • essential advice: select data frameworks based on your data access patterns • in other words, decouple usage based on need – to avoid “one size fits all” blockers • let’s review some examples…
  • 9. Access Patterns ↔ Frameworks (access pattern: framework, CAP properties with “x” marking the forfeit) • financial transactions: general ledger in RDBMS, CAx • ad-hoc queries: RDS (hosted MySQL), CAx • reporting, dashboards: like Pentaho, CAx • log rotation/persistence: like Riak, xxP • search indexes: like Lucene, Solr, xAP • static content, archives: S3 (durable storage), xAP • customer facts: like Redis, Membase, xAP • distributed counters, locks, sets: like Redis, xAP* • data objects CRUD: key/value, like NoSQL on MySQL, CxP • authoritative metadata: like Zookeeper, CxP • data prep, modeling at scale: like Hadoop/Hive/Cascading + R, CxP • graph analysis: like Hadoop + Redis + Gephi, CxP • data marts: like Hadoop/Hive/HBase, CxP
  • 11. Data Prep → Modeling at Scale Analytics jobs performed in the cloud with Hadoop, R, etc.: • log clean-up, sessionization • roll-ups, slices, sampling, data cubes, visualizations • language identification, key phrase extraction • co-occurrence analysis, topic trending • custom search indexes • random forests and other classifiers • connected components, effects across social graph • virtual economy metrics Business use cases: • customer segmentation • retention models • anti-fraud • content recommendation • ad optimization [Figure: visualizations labeled with client event names, e.g. NUI:DressUpMode, NUI:ChatMode, Add Buddy, Website Login]
  • 12. Finding Hidden Gems… [Architecture diagram; recoverable labels: business transactions, data objects, MySQL partitions, ETL, S3, Hadoop, Hive, RDS, Lucene / Solr, Redis, Gephi, R, cache, data services, data marts; headings: cloud-based framework, data access patterns, use cases; use cases shown: reporting, ad-hoc queries, search, recommenders, graph analysis, sessionization, predictive modeling, social graph, factor analysis, time series, data visualization]
  • 14. Analytics Team, IMVU.com • IMVU: 90 employees in Bay Area, $40MM annual rev • largest virtual goods catalog: +6MM items UGC - Best Places to Work in Bay Area, 2011 & 2010 - Red Herring Global 100 Tech Startup, 2010 - Inc. 500, 2010 http://www.imvu.com/jobs/ @pacoid
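
The advice on slides 8 and 9, to select data frameworks based on data access patterns, can be sketched as a small lookup over the slide 9 table. This is only an illustration: the `Row` type and the `rows_keeping` helper are hypothetical names, and reading "x" as the forfeited CAP property follows the editor's note for slide 9 rather than anything stated on the slide itself.

```python
from typing import NamedTuple


class Row(NamedTuple):
    pattern: str    # data access pattern, as named on slide 9
    framework: str  # example framework from the slide
    cap: str        # CAP letters kept; "x" marks a forfeited property


# Rows transcribed from slide 9. The slide marks the Redis counters row
# with an asterisk ("x A P*"); the asterisk is kept and ignored below.
TABLE = [
    Row("financial transactions", "general ledger in RDBMS", "CAx"),
    Row("ad-hoc queries", "RDS (hosted MySQL)", "CAx"),
    Row("reporting, dashboards", "Pentaho", "CAx"),
    Row("log rotation/persistence", "Riak", "xxP"),
    Row("search indexes", "Lucene, Solr", "xAP"),
    Row("static content, archives", "S3 (durable storage)", "xAP"),
    Row("customer facts", "Redis, Membase", "xAP"),
    Row("distributed counters, locks, sets", "Redis", "xAP*"),
    Row("data objects CRUD", "key/value, NoSQL on MySQL", "CxP"),
    Row("authoritative metadata", "Zookeeper", "CxP"),
    Row("data prep, modeling at scale", "Hadoop/Hive/Cascading + R", "CxP"),
    Row("graph analysis", "Hadoop + Redis + Gephi", "CxP"),
    Row("data marts", "Hadoop/Hive/HBase", "CxP"),
]


def kept(cap: str) -> set:
    """Return the CAP properties a row keeps, e.g. "CxP" -> {"C", "P"}."""
    return {ch for ch in cap if ch in "CAP"}


def rows_keeping(*props: str) -> list:
    """Return every row that keeps all of the requested CAP properties."""
    wanted = set(props)
    return [row for row in TABLE if wanted <= kept(row.cap)]


if __name__ == "__main__":
    # Which access patterns on the slide keep consistency and partition tolerance?
    for row in rows_keeping("C", "P"):
        print(f"{row.pattern}: {row.framework}")
```

Keeping the table as plain data makes the "decouple usage based on need" point concrete: each access pattern carries its own trade-off, so no single row has to serve every use.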

Editor's Notes

  1. • prior teams: Jive, ShareThis, Adknowledge, HeadCase • worked with Ray while at ShareThis on our DW and recommender systems • 5 years experience with AWS, some at firms 100% in the cloud
  2. • I’m a big believer in asking many questions up-front… • this talk examines how Hadoop fits into what IMVU is famous for: continuous deployment • we do some critical work with large data sets which makes RDBMS not a good fit
  3. • CD allows many developers to respond to immediate needs, to experiment frequently • transparency, measurement, and consistent data-driven decisions are absolutely requisite
  4. • in short, we can handle in minutes or hours, what other firms might take days, weeks, or months to do • decisions and actions are highly distributed, and engineering process is well disciplined
  5. • my team works in Analytics, and our data usage is at a different priority than our production cluster • this is generally true throughout the industry • business strategy depends on analytics • however, analytics work tends to break what we’ve so carefully instrumented
  6. • how did we reach this condition? • 1997Q4 through 1998Q1, AMZN/EBAY/GOOG/YHOO redefined data use • revenue/data size, as a metric, fell through the floor • previous practices in relational DBs and BI no longer worked so well
  7. • CAP theorem explains an inherent conflict there… • Internet transactions tend to need different kinds of data management than analytics • partitioned databases are a solution for one aspect, but in turn cause ETL to become a huge problem
  8. • fortunately, there are patterns we can use to engineer around those conflicts… • providing that you don’t buy into “one size fits all” sales rhetoric from DB vendors • design patterns help here: choose data frameworks which fit your data access patterns
  9. • hopefully, this table states the CAP forfeits correctly – email me corrections, please :) • some of these patterns migrate well to the cloud; you may miss a big opportunity if you don’t migrate them
  10. • Redis is notable; rich/flexible atomic operations lend themselves to not-shared cases • let’s drill-down into the Hadoop use cases…
  11. • here are a variety of kinds of data preparation, discovery, modeling, and visualization for which my teams have used Hadoop and AWS • generally the goal is to automate nearly all of the work, as “pipelines”, and deliver data products/data services • these visualizations are actually some recent products from my team (less a few details stripped out)… • geolocation, topic trending from text analytics, measuring effects across the social graph, and comparing features vs. retention
  12. • BTW, Redis provides an excellent “left brain” to pair with Hadoop “right brain” • this is not strictly “real-time” analytics, but cost-effective and follows guidance from CAP • in other words, scalable data frameworks based on prevalent data access patterns
  13. • here is some further reading, which I will post online…
  14. • oh, and yes we are hiring :)
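
Slide 11 lists sessionization among the data-prep jobs run as pipelines. The deck does not show how it was implemented, so the following is only a minimal in-memory sketch of the general technique: group events by user, sort by time, and start a new session whenever the gap between consecutive events exceeds a threshold. The `(user_id, timestamp)` event shape, the `sessionize` name, and the 30-minute gap are all assumptions made for illustration; in a MapReduce setting the user id would typically serve as the grouping key.

```python
from collections import defaultdict

# Assumed inactivity threshold; the deck does not specify one.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Split each user's event timestamps into sessions.

    `events` is an iterable of (user_id, unix_timestamp) pairs, a
    hypothetical shape chosen for this sketch. Returns a dict mapping
    user_id to a list of sessions, each a sorted list of timestamps.
    """
    # Group step: collect timestamps per user.
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            # A gap longer than the threshold closes the current session.
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])
            else:
                user_sessions[-1].append(ts)
        sessions[user_id] = user_sessions
    return sessions


if __name__ == "__main__":
    log = [("a", 0), ("a", 600), ("b", 100), ("a", 4000)]
    for user, user_sessions in sorted(sessionize(log).items()):
        print(user, user_sessions)
```

The per-user work is independent, which is why this kind of job tends to parallelize cleanly once events are partitioned by user.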