Hadoop vs. RDBMS for
 Big Data Analytics...
 Why Choose?



Mingsheng Hong    Field CTO, HP Vertica
Scott McClellan   VP, HP Emerging Applications
2   HP Confidential
Hadoop for Big Data Analytics
•   Scalable
•   Flexible
•   Low cost to try out
•   Strong community
•   But…
    –Batch-oriented jobs
    –Less efficient storage
    –“Programmer friendly” (improving)


Survey of Big Data Tools
[Figure: “Big Data” at the center, surrounded by Stats Programs, Hadoop, CEP Engines, and an open “?”]

Vertica Analytics RDBMS Platform
Real-time Big Data
       SPEED          SCALABILITY   SIMPLICITY

• Relational DBMS with ACID
• Real-time analytic reporting with SQL
• 50–1000x faster than traditional DBs
• High scalability, elasticity and full parallelism
• Simple install/use with auto setup and tuning
• Industry standard x86 hardware
• Advanced in-database analytics
• Extensible analytics framework

We have a Lot in Common …
•   Purpose-built from scratch for analytics
•   Commodity hardware
•   MPP infrastructure, scaling to hundreds of nodes and multiple PBs
•   Robust
•   Diverse use cases with strong market traction




… And We Have Differences
•   Interface
•   Tool chain / ecosystem
•   Storage management
•   Run time optimization
•   Automatic performance tuning




Column Store – Column-Based Disk I/O
•   Typical FinServ price per stock for 1 day

    tickstore (symbol, ..., price, ..., date):
        AAPL    ...    143.74    ...    5/05/09
        AAPL    ...    143.75    ...    5/06/09
        BBY     ...     37.03    ...    5/05/09
        BBY     ...     37.13    ...    5/06/09

    Query:
        SELECT AVG(price)
        FROM tickstore
        WHERE symbol = 'AAPL' AND date = '5/06/09'

•   Column Store - Reads 3 columns
•   Row Store - Reads all columns

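The I/O difference can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the 12 filler columns, the names `row_store` and `column_store`, and the idea of counting "cells read" are hypothetical and say nothing about how Vertica is actually implemented.

```python
# Toy model of the slide's tickstore. The query touches only symbol, price,
# and date; FILLER is an assumed number of columns the query never needs.
FILLER = 12

rows = [
    ("AAPL", 143.74, "5/05/09"),
    ("AAPL", 143.75, "5/06/09"),
    ("BBY", 37.03, "5/05/09"),
    ("BBY", 37.13, "5/06/09"),
]

# Row store: each record is stored whole, so a scan reads every cell.
row_store = [(s, p, d) + ("NYSE",) * FILLER for s, p, d in rows]

# Column store: each column is stored (and read) separately.
column_store = {
    "symbol": [r[0] for r in rows],
    "price": [r[1] for r in rows],
    "date": [r[2] for r in rows],
}
column_store.update({f"filler_{i}": ["NYSE"] * len(rows) for i in range(FILLER)})


def avg_price_row_store(symbol, date):
    cells_read, matches = 0, []
    for record in row_store:
        cells_read += len(record)  # the whole record is read
        if record[0] == symbol and record[2] == date:
            matches.append(record[1])
    return sum(matches) / len(matches), cells_read


def avg_price_column_store(symbol, date):
    needed = ("symbol", "date", "price")  # only 3 columns are touched
    cells_read = sum(len(column_store[name]) for name in needed)
    matches = [
        p
        for s, d, p in zip(*(column_store[name] for name in needed))
        if s == symbol and d == date
    ]
    return sum(matches) / len(matches), cells_read


row_result = avg_price_row_store("AAPL", "5/06/09")
col_result = avg_price_column_store("AAPL", "5/06/09")
print(row_result, col_result)
```

Both functions return the same average; only the number of cells touched differs, and the gap grows with the number of columns the query does not need.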
Column Store – Sort and Encode for Speed
               Student_ID             Name            Gender    Class      Score   Grade
                      1256678     Cappiello, Emilia     F      Sophomore    62       D
                      1254038        Dalal, Alana       F         Senior    92       A
                      1278858        Orner, Katy        F         Junior    76       C
                      1230807         Frigo, Avis       M         Senior    64       D
                      1210466     Stober, Saundra       F         Junior    90       A
                      1249290     Borba, Milagros       F       Freshman    96       A
                      1244262    Sosnowski, Hillary     F         Junior    68       D
                      1252490       Nibert, Emilia      F      Sophomore    59       F
                      1267170     Popovic, Tanisha      F       Freshman    95       A
                      1248100   Schreckengost, Max      M         Senior    76       C
                      1243483      Porcelli, Darren     M         Junior    67       D
                      1230382         Sinko, Erik       M       Freshman    91       A
                      1240224        Tarvin, Julio      M      Sophomore    85       B
                      1222781       Lessig, Elnora      F         Junior    63       D
                      1231806         Thon, Max         M      Sophomore    82       B
                      1246648    Trembley, Allyson      F         Junior    100      A




Column Store – Sort and Encode for Speed
                       Gender     Class      Grade   Score         Name            Student_ID
                         F       Sophomore     D      62       Cappiello, Emilia     1256678
                         F          Senior     A      92          Dalal, Alana       1254038
                         F          Junior     C      76          Orner, Katy        1278858
                         M          Senior     D      64           Frigo, Avis       1230807
                         F          Junior     A      90       Stober, Saundra       1210466
                         F        Freshman     A      96       Borba, Milagros       1249290
                         F          Junior     D      68      Sosnowski, Hillary     1244262
                         F       Sophomore     F      59         Nibert, Emilia      1252490
                         F        Freshman     A      95       Popovic, Tanisha      1267170
                         M          Senior     C      76     Schreckengost, Max      1248100
                         M          Junior     D      67        Porcelli, Darren     1243483
                         M        Freshman     A      91           Sinko, Erik       1230382
                         M       Sophomore     B      85          Tarvin, Julio      1240224
                         F          Junior     D      63         Lessig, Elnora      1222781
                         M       Sophomore     B      82           Thon, Max         1231806
                         F          Junior     A      100     Trembley, Allyson      1246648



                       Columns used in predicates
                       Correlated values “indexed” by preceding column values

Column Store – Sort and Encode for Speed
                       Gender     Class      Grade   Score         Name            Student_ID
                         F        Freshman     A      95       Popovic, Tanisha      1267170
                         F        Freshman     A      96       Borba, Milagros       1249290
                         F          Junior     A      90       Stober, Saundra       1210466
                         F          Junior     A      100     Trembley, Allyson      1246648
                         F          Junior     C      76          Orner, Katy        1278858
                         F          Junior     D      63         Lessig, Elnora      1222781
                         F          Junior     D      68      Sosnowski, Hillary     1244262
                         F          Senior     A      92          Dalal, Alana       1254038
                         F       Sophomore     D      62       Cappiello, Emilia     1256678
                         F       Sophomore     F      59         Nibert, Emilia      1252490
                         M        Freshman     A      91           Sinko, Erik       1230382
                         M          Junior     D      67        Porcelli, Darren     1243483
                         M       Sophomore     B      82           Thon, Max         1231806
                         M       Sophomore     B      85          Tarvin, Julio      1240224
                         M          Senior     C      76     Schreckengost, Max      1248100
                         M          Senior     D      64           Frigo, Avis       1230807



                       Columns used in predicates
                       Correlated values “indexed” by preceding column values

Column Store – Sort and Encode for Speed
                       Gender     Class      Grade   Score         Name            Student_ID
                         F        Freshman     A      95       Popovic, Tanisha      1267170
                         F        Freshman     A      96       Borba, Milagros       1249290
                         F          Junior     A      90       Stober, Saundra       1210466
                         F          Junior     A      100     Trembley, Allyson      1246648
                         F          Junior     C      76          Orner, Katy        1278858
                         F          Junior     D      63         Lessig, Elnora      1222781
                         F          Junior     D      68      Sosnowski, Hillary     1244262
                         F          Senior     A      92          Dalal, Alana       1254038
                         F       Sophomore     D      62       Cappiello, Emilia     1256678
                         F       Sophomore     F      59         Nibert, Emilia      1252490
                         M        Freshman     A      91           Sinko, Erik       1230382
                         M          Junior     D      67        Porcelli, Darren     1243483
                         M       Sophomore     B      82           Thon, Max         1231806
                         M       Sophomore     B      85          Tarvin, Julio      1240224
                         M          Senior     C      76     Schreckengost, Max      1248100
                         M          Senior     D      64           Frigo, Avis       1230807

                       1st I/O (Gender): reads entire column
                       2nd I/O (Class): offset
                       3rd I/O (Grade): offset
                       4th I/O (Score)

                       Example query: select avg( Score ) from example where
                       Class = 'Junior' and Gender = 'F' and Grade = 'A'

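The narrowing shown on this slide can be sketched as follows. It is a minimal illustration, assuming the data is sorted by (Gender, Class, Grade, Score) and using binary search as a stand-in for whatever the engine does internally; the helper `narrow` and the variable names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# (Gender, Class, Grade, Score) for the 16 students on the slide.
students = [
    ("F", "Sophomore", "D", 62), ("F", "Senior", "A", 92),
    ("F", "Junior", "C", 76), ("M", "Senior", "D", 64),
    ("F", "Junior", "A", 90), ("F", "Freshman", "A", 96),
    ("F", "Junior", "D", 68), ("F", "Sophomore", "F", 59),
    ("F", "Freshman", "A", 95), ("M", "Senior", "C", 76),
    ("M", "Junior", "D", 67), ("M", "Freshman", "A", 91),
    ("M", "Sophomore", "B", 85), ("F", "Junior", "D", 63),
    ("M", "Sophomore", "B", 82), ("F", "Junior", "A", 100),
]

# Sort once, then split into one list per column (the "column store").
students.sort()
gender, klass, grade, score = (list(col) for col in zip(*students))


def narrow(column, value, lo, hi):
    """Find the run equal to `value` inside the sorted slice column[lo:hi]."""
    return bisect_left(column, value, lo, hi), bisect_right(column, value, lo, hi)


# Each predicate shrinks the row range that the next column has to look at.
lo, hi = narrow(gender, "F", 0, len(gender))
lo, hi = narrow(klass, "Junior", lo, hi)
lo, hi = narrow(grade, "A", lo, hi)

matching_scores = score[lo:hi]  # only this slice of Score is read
average = sum(matching_scores) / len(matching_scores)
print((lo, hi), average)
```

Because each column is sorted within the range selected by the preceding columns, the final step reads only two Score values instead of all sixteen.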
Column Store – Column Based Compression
                       Compression Results (encoded data vs. raw data)

                       Data type        Compression ratio
                       Clickstream      10:1
                       Audit            10:1
                       Trading           5:1
                       SNMP             20:1
                       Network Logs     60:1
                       Marketing        20:1
                       Consumer         30:1
                       CDR               8:1

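One reason sorted columns compress well can be sketched with run-length encoding (RLE). The slides do not say which encodings produced the ratios above, so RLE here is an assumption chosen for illustration, and `rle_encode` / `rle_decode` are hypothetical helper names.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Gender column in the original (unsorted) order from the student table.
unsorted_gender = list("FFFMFFFFFMMMMFMF")
# The same column after sorting by Gender.
sorted_gender = sorted(unsorted_gender)

unsorted_runs = rle_encode(unsorted_gender)
sorted_runs = rle_encode(sorted_gender)
print(len(unsorted_runs), sorted_runs)
```

Sorting turns seven runs into two, so the same sixteen values are stored as two pairs; decoding either encoding reproduces the input column.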
Query-Driven Data Segmentation and HA
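•   RAID-like functionality within DB
•   Smart K-safety
•   Always-on loads & queries

[Figure: nodes joined by a client-facing network and a cluster network. Each node stores its own segment plus a copy of a neighbor's segment: Segment 1 with Segment N, Segment 2 with Segment 1, Segment 3 with Segment 2, ..., Segment N with Segment N-1.]
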
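The placement in the figure can be sketched as a ring in which every node also keeps a copy of the previous node's segment. This is an illustrative model only, assuming one replica per segment; `place_segments` and `readable_segments` are hypothetical names, not Vertica APIs.

```python
def place_segments(num_nodes):
    """Node i stores segment i plus a copy of segment i-1 (node 1 wraps to N)."""
    return {
        node: (node, (node - 2) % num_nodes + 1)
        for node in range(1, num_nodes + 1)
    }


def readable_segments(placement, failed_nodes):
    """Return the set of segments still available on the surviving nodes."""
    live = set()
    for node, segments in placement.items():
        if node not in failed_nodes:
            live.update(segments)
    return live


placement = place_segments(4)
after_one_failure = readable_segments(placement, failed_nodes={2})
after_two_adjacent_failures = readable_segments(placement, failed_nodes={2, 3})
print(placement, after_one_failure, after_two_adjacent_failures)
```

In this model any single node can fail without losing a segment, while losing two adjacent nodes makes one segment unavailable.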
Automatic Performance Tuning
                       •   Optimal data layout (physical schema) → optimal performance
                       •   User provides
                           –Logical schema
                           –Sample data set
                           –Typical queries

                       •   Database Designer generates data layout proposals which:
                           –Optimize query performance
                           –Optimize data loading throughput
                           –Minimize storage footprint
                       •   Workload Analyzer

Database Designer Case Studies
•    Financial Services (vs manual design)
     –Queries: 4x faster
     –Storage: 50% less
     –Design cost: 4 minutes vs months

•    Marketing & advertising
     –All queries fully optimized; storage 10% of raw data

•    Retail (vs manual design)
     –Queries 2x faster; storage 33% less

•    News media (vs manual design)
     –Queries comparable; storage 25% less

Application Integration




Analytics Feature Comparison
Vertica
•    SQL
     –Graph analytics
     –Monte Carlo simulation
     –Statistical functions
•    Extended SQL
     –Clickstream analytics (e.g. sessionization)
     –Time series analytics
     –Pattern matching
     –Event series join
•    Extensible analytics

Hadoop
•    “Everything”
•    But especially
     –HDFS for storing schema-less data
     –Parse & transform semi-structured data
     –Machine learning
     –Multi-language scripts and libraries

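Sessionization, named above as an example of clickstream analytics, groups each user's clicks into sessions separated by periods of inactivity. The sketch below shows the idea in plain Python rather than in Vertica's extended SQL; the 30-minute timeout, the function name `sessionize`, and the sample data are all assumptions.

```python
def sessionize(events, timeout_seconds=1800):
    """Number each user's sessions; a gap above the timeout starts a new one.

    `events` is an iterable of (user, timestamp_in_seconds) pairs.
    """
    last_seen = {}   # user -> timestamp of that user's previous event
    session_no = {}  # user -> current session number
    out = []
    for user, ts in sorted(events):
        if user not in last_seen:
            session_no[user] = 0
        elif ts - last_seen[user] > timeout_seconds:
            session_no[user] += 1
        last_seen[user] = ts
        out.append((user, ts, session_no[user]))
    return out


clicks = [("alice", 0), ("alice", 600), ("alice", 4000), ("bob", 100), ("bob", 200)]
sessions = sessionize(clicks)
print(sessions)
```

Here the third click by "alice" arrives 3400 seconds after her second, so it opens a new session, while both clicks by "bob" stay in one session.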
Combining the Strengths
•    Hadoop for exploratory analysis
     –Especially with existing MR, Pig scripts

•    Vertica for stylized, interactive analysis
     –For shared features, often faster than Hadoop with a fraction of hardware resources

•    Vertica’s Hadoop connector




Hadoop + Vertica Use Case Example

[Figure: Extract, Transform, Load pipeline. Data from other sources enters HDFS in Hadoop through the Flume and SQOOP connectors; the Vertica Hadoop Connector links HDFS to HP Vertica.]

More Joint Use Cases
•    Parallel import/export to HDFS
•    MR for data transformation, Vertica for optimized storage & retrieval
     –Apache log parsing
     –Convert JSON into relational tuples
     –Sentiment analysis

•    Advanced analytics
     –Filter, join and aggregation in Vertica
     –Intermediate result fed into an MR job

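The "convert JSON into relational tuples" step can be sketched as a map-style function that flattens each JSON record into a fixed column order. The target columns (user_id, event, country), the nested "geo" field, and the sample records are hypothetical; a real job would run this logic inside a MapReduce task and load the output through a connector.

```python
import json


def json_to_tuple(line):
    """Flatten one JSON record into a (user_id, event, country) tuple.

    Missing fields become None so every tuple has the same width.
    """
    record = json.loads(line)
    return (
        record.get("user_id"),
        record.get("event"),
        record.get("geo", {}).get("country"),
    )


raw_lines = [
    '{"user_id": 1, "event": "click", "geo": {"country": "US"}}',
    '{"user_id": 2, "event": "view"}',
]
tuples = [json_to_tuple(line) for line in raw_lines]
print(tuples)
```

Padding absent fields with None keeps the schema-less input compatible with a fixed relational schema, where the gaps can be loaded as NULLs.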
Vertica Extensible Analytics SDK

                       •   A framework for user-defined Functions and Transformations
                           –C++ based extensible framework
                           –Flexible: express a wide range of analytic computation
                           –In-process, fully parallel execution

Vertica Community Edition
•    Join the community: http://www.vertica.com/community
     –Fully featured, 1 TB + 3 nodes (unlimited academic use)
     –Open source analytic packages on GitHub

HP Hadoop Reference Architecture
    End-to-End Scalable Information Management Solution

    1   Scale-out ProLiant x86 hardware with large amounts of DAS storage to store and process data
        –HP ProLiant and HP Networking
    2   Hadoop core execution engine and distributed file system to run massively parallel processing tasks
        –Cloudera Distribution of Apache Hadoop (Map/Reduce and HDFS)
    3   Analytics tooling to enable users to create and run analytics jobs on unstructured data
        –Datameer
        –Karmasphere
    4   Systems management
        –HP CMU Real-time Monitoring
        –HP CMU Push Button Scale Out
        –Cloudera Enterprise
    5   Connectors to move subsets of data in and out of Hadoop
        –HP Vertica
        –Cloudera Flume
        –Cloudera SQOOP

    Operating system: e.g. RedHat, Suse, CentOS
    Structured Big Data: BI/Tableau, HP VERTICA, Application
    Also shown: RDBMS, SAP, Logs, etc.

HP Hadoop Reference Architecture: Basic Concepts

Starter Kit (Development/POC, non-production environment)
•    6 nodes (2 mgmt, 4 worker), 1 switch
•    Components: Switch; Master Node; Management Node; Hadoop Slave Nodes
•    Optimized for low cost
•    Configurations generally not fully redundant (single NW/switch)
•    Same hardware as production cluster

Starter Kit → Modest Scale
•    Typical scale configurations are up to two racks.
•    Add redundant network/switch
•    Move management nodes to separate racks

At Larger Scale – “Hundreds of nodes”
•    Typical scale configurations are beyond two racks.
•    Upgrade switches (better congestion management)
•    Add additional management nodes for scaling
     –Separate name nodes (become very busy, need lots of memory)

Scaled Deployment (Production, >2 racks)
•    Components per rack group: Switch and Second Switch; Management Node and Name Node; Job Tracker Node and Secondary Name Node; Hadoop Slave Nodes
•    Optimized for scale and resiliency
•    Same hardware as starter kit

Visualizing Cluster/Hadoop Performance: Basic Concepts

Key system statistics
                CPU     Disk Reads    # Map Tasks
     Node1      75%     300           8
       …
     Node N     65%     315           7

•    Displayed as gauges (2-dimensional): CPU, Disk Reads, Map Reduce Tasks
•    Visualized as “tubes” (where the z-axis is time): CPU, Disk Reads, Map Reduce Tasks

Visualizing Cluster/Hadoop: Normal Run – No Problems
100 Nodes – TeraSort processing on Hadoop



[Figure: tube visualization of a normal run, annotated with:]
•    Data read
•    Sort processing (CPU intensive)
•    Shuffle (move intermediate results between nodes)
•    Write results (1 copy only)
•    Many short-lived tasks: 7823 tasks (1 per block)
•    Long-lived tasks: 700 tasks (2 per core)




Visualizing Cluster/Hadoop: With Network Problems
100 Nodes – TeraSort processing on Hadoop



[Figure: tube visualization of a run with network problems, annotated with:]
•    Failing switch caused many network retries
•    Some tasks take a long time to finish (speculation kicks in)
•    Job stalls at 90% waiting for remaining tasks
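
The stall described above comes from a few slow tasks holding up the job. A rough way to spot them is to compare each running task's elapsed time with the typical duration of tasks that have already finished. This is a simplified heuristic for illustration only, not Hadoop's actual speculative-execution logic. The function name, the 1.5x factor, and the input shapes are assumptions.

```python
from statistics import median


def find_stragglers(completed_durations, running_elapsed, factor=1.5):
    """Return ids of running tasks that have run longer than factor x median.

    completed_durations: durations (seconds) of tasks that already finished.
    running_elapsed: mapping of task id -> seconds elapsed so far.
    """
    if not completed_durations:
        return []  # nothing to compare against yet
    threshold = factor * median(completed_durations)
    return sorted(task for task, elapsed in running_elapsed.items()
                  if elapsed > threshold)
```

A monitoring script could run this check periodically and flag the nodes hosting the returned tasks, which in the scenario above would cluster behind the failing switch.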




In Closing…

•    Solutions leveraging Vertica in conjunction with
     Hadoop are capable of solving a tremendous range of
     analytical challenges.
•    Hadoop is great for dealing with unstructured data,
     while Vertica is a superior platform for working
     with structured/relational data.
•    Getting them to work together is easy.

Conclusion
•    Join the community: http://www.vertica.com/community
•    Join the core team: http://www.vertica.com/about/careers/





Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?

  • 1. Hadoop vs. RDBMS for Big Data Analytics... Why Choose? Mingsheng Hong Field CTO, HP Vertica Scott McClellan VP, HP Emerging Applications
  • 2. 2 HP Confidential
  • 3. Hadoop for Big Data Analytics • Scalable • Flexible • Low cost to try out • Strong community • But… –Batch oriented jobs –Less efficient storage –“Programmer friendly” (improving) 3 HP Confidential
  • 4. Survey of Big Data Tools Stats Programs ? Hadoop Big Data CEP Engines 4 HP Confidential
  • 5. Vertica Analytics RDBMS Platform Real-time Big Data SPEED SCALABILITY SIMPLICITY • Relational DBMS with ACID • Real-time analytic reporting with SQL • 50–1000x faster than traditional DBs • High scalability, elasticity and full parallelism • Simple install/use with auto setup and tuning • Industry standard x86 hardware • Advanced in-database analytics • Extensible analytics framework 5 HP Confidential
  • 6. We have a Lot in Common … • Purpose-built from scratch for analytics • Commodity hardware • MPP infrastructure, scaling to 100’s nodes and multiple PBs • Robust • Diverse use cases with strong market traction 6 HP Confidential
  • 7. … And We Have Differences • Interface • Tool chain / ecosystem • Storage management • Run time optimization • Automatic performance tuning 7 HP Confidential
  • 8. Column Store – Column-Based Disk I/O • Typical FinServ price per stock for 1 day Column Store - Reads 3 columns SELECT AVG(price) AAPL NQDS NYSE NYSE NYSE NQDS NYSE NYSE NYSE NQDS NYSE NYSE NYSE NQDS NYSE NYSE NYSE NQDS NYSE NYSE NYSE NQDS NYSE NYSE NYSE NQDS NYSE NYSE NYSE 143.74 143.75 NQDS NYSE NYSE NYSE NQDS NYSE NYSE NYSE 5/05/09 5/06/09 FROM tickstore AAPL NQDS NQDS NQDS NQDS NQDS NQDS NQDS NQDS NQDS NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE WHERE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NQDS NQDS NQDS NQDS NQDS NQDS NQDS NQDS NQDS BBY 37.03 5/05/09 NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NQDS NQDS NQDS NQDS NQDS NQDS NQDS NQDS NQDS NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE BBY 37.13 5/06/09 NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE symbol = ‘AAPL” AND date = NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NYSE NQDS NQDS NQDS NQDS NQDS NQDS NQDS NQDS NQDS ‘5/06/09’ Row Store - Reads all columns AAPL NYASE NYAASE NYSE NYASE NGGYSE NYGGGSE NYSE NYSE NYSE 143.74 NYSE NYSE NYSE 5/05/09 AAPL NYASE NYAASE NYSE NYASE NGGYSE NYGGGSE NYSE NYSE NYSE 143.74 NYSE NYSE NYSE 5/06/09 BBY NYASE NYAASE NYSE NYASE NGGYSE NYGGGSE NYSE NYSE NYSE 37.03 NYSE NYSE NYSE 5/05/09 BBY 37.13 5/06/09 … NYASE NYAASE NYSE NYASE NGGYSE NYGGGSE NYSE NYSE NYSE NYSE NYSE NYSE 8 HP Confidential
  • 9. Column Store – Sort and Encode for Speed Student_ID Name Gender Class Score Grade 1256678 Cappiello, Emilia F Sophomore 62 D 1254038 Dalal, Alana F Senior 92 A 1278858 Orner, Katy F Junior 76 C 1230807 Frigo, Avis M Senior 64 D 1210466 Stober, Saundra F Junior 90 A 1249290 Borba, Milagros F Freshman 96 A 1244262 Sosnowski, Hillary F Junior 68 D 1252490 Nibert, Emilia F Sophomore 59 F 1267170 Popovic, Tanisha F Freshman 95 A 1248100 Schreckengost, Max M Senior 76 C 1243483 Porcelli, Darren M Junior 67 D 1230382 Sinko, Erik M Freshman 91 A 1240224 Tarvin, Julio M Sophomore 85 B 1222781 Lessig, Elnora F Junior 63 D 1231806 Thon, Max M Sophomore 82 B 1246648 Trembley, Allyson F Junior 100 A 9 HP Confidential
  • 10. Column Store – Sort and Encode for Speed Gender Class Grade Score Name Student_ID F Sophomore D 62 Cappiello, Emilia 1256678 F Senior A 92 Dalal, Alana 1254038 F Junior C 76 Orner, Katy 1278858 M Senior D 64 Frigo, Avis 1230807 F Junior A 90 Stober, Saundra 1210466 F Freshman A 96 Borba, Milagros 1249290 F Junior D 68 Sosnowski, Hillary 1244262 F Sophomore F 59 Nibert, Emilia 1252490 F Freshman A 95 Popovic, Tanisha 1267170 M Senior C 76 Schreckengost, Max 1248100 M Junior D 67 Porcelli, Darren 1243483 M Freshman A 91 Sinko, Erik 1230382 M Sophomore B 85 Tarvin, Julio 1240224 F Junior D 63 Lessig, Elnora 1222781 M Sophomore B 82 Thon, Max 1231806 F Junior A 100 Trembley, Allyson 1246648 Columns used in predicates Correlated values “indexed” by preceding column values 10 HP Confidential
  • 11. Column Store – Sort and Encode for Speed Gender Class Grade Score Name Student_ID F Freshman A 95 Popovic, Tanisha 1267170 F Freshman A 96 Borba, Milagros 1249290 F Junior A 90 Stober, Saundra 1210466 F Junior A 100 Trembley, Allyson 1246648 F Junior C 76 Orner, Katy 1278858 F Junior D 63 Lessig, Elnora 1222781 F Junior D 68 Sosnowski, Hillary 1244262 F Senior A 92 Dalal, Alana 1254038 F Sophomore D 62 Cappiello, Emilia 1256678 F Sophomore F 59 Nibert, Emilia 1252490 M Freshman A 91 Sinko, Erik 1230382 M Junior D 67 Porcelli, Darren 1243483 M Sophomore B 82 Thon, Max 1231806 M Sophomore B 85 Tarvin, Julio 1240224 M Senior C 76 Schreckengost, Max 1248100 M Senior D 64 Frigo, Avis 1230807 Columns used in predicates Correlated values “indexed” by preceding column values 11 HP Confidential
  • 12. Column Store – Sort and Encode for Speed Gender Class Grade Score Name Student_ID F Freshman A 95 Popovic, Tanisha 1267170 F Freshman offset A offset 96 Borba, Milagros 1249290 F Junior A 90 Stober, Saundra 1210466 F Junior A 100 Trembley, Allyson 1246648 F Junior C 76 Orner, Katy 1278858 F Junior D 63 Lessig, Elnora 1222781 F Junior D 68 Sosnowski, Hillary 1244262 F Senior A 92 Dalal, Alana 1254038 F Sophomore D 62 Cappiello, Emilia 1256678 F 2 nd Sophomore 3 rd F 4 th 59 Nibert, Emilia 1252490 M Freshman A 91 Sinko, Erik 1230382 M M I/O Junior Sophomore I/O D B I/O 67 82 Porcelli, Darren Thon, Max 1243483 1231806 1st I/O M M Sophomore Senior B C 85 76 Tarvin, Julio Schreckengost, Max 1240224 1248100 Reads entire M Senior D 64 Frigo, Avis 1230807 column Example query: select avg( Score ) from example where Class = ‘Junior’ and Gender = ‘F’ and Grade = 12 ‘A’ HP Confidential
  • 13. Column Store – Column Based Compression Compression Compression Results Ratio Clickstream 10:1 Audit 10:1 Trading 5:1 SNMP 20:1 Network Logs 60:1 Marketing 20:1 Consumer 30:1 CDR 8:1 0% 20% 40% 60% 80% 100% Encoded Data Raw Data 13 HP Confidential
  • 14. Query-Driven Data Segmentation and HA Segment 1 Segment N RAID-like functionality Client Facing Network • Segment 2 within DB Cluster Network Segment 1 Segment 3 • Smart K-safety Segment 2 • Always-on loads & queries Segment N Segment N-1 14 HP Confidential
  • 15. Automatic Performance Tuning • Optimal data layout (physical schema)  optimal performance • User provides –Logical schema –Sample data set –Typical queries • Database Designer generates data layout proposals which: –Optimize query performance –Optimize data loading throughput –Minimize storage footprint • Workload Analyzer 15 HP Confidential
  • 16. Database Designer Case Studies • Financial Services (vs manual design) –Queries 4x faster –Storage: 50% less –Design cost: 4 minutes vs months • Marketing & advertising –All queries fully optimized; storage 10% of raw data • Retail (vs manual design) –Queries 2x faster; storage 33% less • News media (vs manual design) –Queries comparable; storage 25% less 16 HP Confidential
  • 17. Application Integration 17 HP Confidential
  • 18. Analytics Feature Comparison • SQL • “Everything” –Graph analytics • But especially –Monte carlo simulation –HDFS for storing schema-less data –Statistical functions –Parse & transform semi-structured • Extended SQL data –Clickstream analytics (e.g. –Machine learning sessionization) –Multi-language scripts and –Time series analytics libraries –Pattern matching –Event series join • 18 Extensible analytics HP Confidential
  • 19. 19 HP Confidential
  • 20. Combining the Strengths • Hadoop for exploratory analysis –Especially with existing MR, Pig scripts • Vertica for stylized, interactive analysis –For shared features, often faster than Hadoop with a fraction of hardware resources • Vertica’s Hadoop connector 21 HP Confidential
  • 21. Hadoop + Vertica Use Case Example [Diagram: an Extract / Transform / Load pipeline with these labeled components: Other Sources, Flume Connector, SQOOP Connector, Hadoop (HDFS), Vertica Hadoop Connector, HP Vertica]
  • 22. More Joint Use Cases • Parallel import/export to HDFS • MR for data transformation, Vertica for optimized storage & retrieval –Apache log parsing –Convert JSON into relational tuples –Sentiment analysis • Advanced analytics –Filter, join and aggregation in Vertica –Intermediate result fed into an MR job
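The two transformation cases named on slide 22, Apache log parsing and converting JSON into relational tuples, can be sketched as plain per-record functions of the kind a map step might apply. This is a hedged illustration, not code from the talk: the function names, the chosen output columns, and the handling of missing fields are all assumptions, and the regular expression covers only the Apache Common Log Format.

```python
import json
import re

# Apache Common Log Format: host ident user [time] "request" status bytes
CLF = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)')


def parse_log_line(line):
    """Turn one access-log line into a fixed-width tuple, or None if malformed."""
    match = CLF.match(line)
    if match is None:
        return None
    host, _ident, user, timestamp, request, status, size = match.groups()
    # A "-" size means no body was returned; store it as 0 bytes.
    return (host, user, timestamp, request, int(status), 0 if size == "-" else int(size))


def json_to_tuple(doc, columns):
    """Project a JSON object onto a fixed column list; missing keys become None."""
    record = json.loads(doc)
    return tuple(record.get(column) for column in columns)


line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_log_line(line))
print(json_to_tuple('{"user": "a", "score": 3}', ["user", "score", "country"]))
```

Emitting fixed-width tuples with explicit nulls is what lets the output load into a relational table without a further schema-mapping step.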
  • 23. Vertica Extensible Analytics SDK • A framework for user-defined functions and transformations –C++ based extensible framework –Flexible: express a wide range of analytic computation –In-process, fully parallel execution
  • 24. Vertica Community Edition • Join the community: http://www.vertica.com/community –Fully featured, 1 TB + 3 nodes (unlimited academic use) –Open source analytic packages on GitHub
  • 25. HP Hadoop Reference Architecture: End-to-End Scalable Information Management Solution. Components called out in the diagram: • HP ProLiant and HP Networking: scale-out ProLiant x86 hardware with large amounts of DAS storage to store and process data • Operating system, e.g. Red Hat, SUSE, CentOS • Hadoop Core (Cloudera Distribution of Apache Hadoop): execution engine and distributed file system to run massively parallel processing tasks (Map/Reduce and HDFS) • Connectors to move subsets of data in and out of Hadoop (HP Vertica, Cloudera Flume, Cloudera SQOOP) • Systems management (HP CMU Real-time Monitoring, HP CMU Push Button Scale Out, Cloudera Enterprise) • Analytics tooling to enable Structured Big Data users to create and run analytics jobs on unstructured data (BI/Tableau, Datameer, Karmasphere) • Data sources: RDBMS, SAP, logs, etc.
  • 26. HP Hadoop Reference Architecture: Basic Concepts. Starter Kit (development/POC, non-production environment): 6 nodes (2 mgmt, 4 worker), 1 switch • Optimized for low cost • Configurations generally not fully redundant (single NW/switch) • Same hardware as production cluster. Scaled Deployment (Starter Kit → modest-scale production): typical scale configurations are up to two racks • Add redundant network/switch • Move management nodes to separate racks • Optimized for scale and resiliency • Same hardware as starter kit. At Larger Scale (“hundreds of nodes”): typical scale configurations are beyond two racks • Upgrade switches (better congestion management) • Add additional management nodes for scaling >2 racks –Separate name nodes (become very busy, need lots of memory). [Diagram labels: Switch, Second Switch, Master Node, Management Node, Job Tracker Node, Name Node, Secondary Name Node, Hadoop Slave Nodes]
  • 27. Visualizing Cluster/Hadoop Performance: Basic Concepts • Key system statistics (per node: CPU, disk reads, # map tasks): Node 1: 75%, 300, 8; …; Node N: 65%, 315, 7 • Displayed as gauges (2-dimensional) for CPU, disk reads, and Map/Reduce tasks • Visualized as “tubes” (where the z-axis is time)
  • 28. Visualizing Cluster/Hadoop: Normal Run – No Problems • 100 nodes, TeraSort processing on Hadoop • Phase labels: Data Read; Shuffle (move intermediate results between nodes); Sort Processing (CPU intensive); Write Results (1 copy only) • Task labels: many short-lived tasks, 7823 tasks (1 per block); long-lived tasks, 700 tasks (2 per core)
  • 29. Visualizing Cluster/Hadoop: With Network Problems • 100 nodes, TeraSort processing on Hadoop • Job stalls at 90% waiting for remaining tasks • Some tasks take a long time to finish (speculation kicks in) • Failing switch caused many network retries
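The "speculation kicks in" note on slide 29 refers to re-running slow tasks elsewhere. The sketch below shows one simple way to flag such stragglers by comparing progress rates. It is not Hadoop's scheduler logic: the median-based rule, the `slowdown` threshold, and the task names are assumptions made for illustration.

```python
from statistics import median


def find_stragglers(progress, elapsed_s, slowdown=0.5):
    """Return unfinished tasks whose progress rate is below slowdown x the median rate.

    progress maps a task id to the fraction complete (0.0 to 1.0); all tasks are
    assumed to have started at the same time, elapsed_s seconds ago.
    """
    if not progress or elapsed_s <= 0:
        return []
    rates = {task: done / elapsed_s for task, done in progress.items()}
    typical = median(rates.values())
    return sorted(
        task
        for task, rate in rates.items()
        if progress[task] < 1.0 and rate < slowdown * typical
    )


# Hypothetical snapshot 100 seconds into a job: one task is far behind the rest.
snapshot = {"t1": 1.0, "t2": 0.95, "t3": 0.9, "t4": 0.2}
print(find_stragglers(snapshot, elapsed_s=100))  # ['t4']
```

A rule like this explains the stall at 90%: when a failing switch slows a handful of tasks, the job cannot finish until those tasks, or their speculative copies, complete.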
  • 30. In Closing… • Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges. • Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data. • Getting them to work together is easy.
  • 31. Conclusion • Join the community: http://www.vertica.com/community • Join the core team: http://www.vertica.com/about/careers/