Greenplum & Hadoop
Why do such a thing?

Donald Miner
Solutions Architect
Advanced Technologies Group
Donald.Miner@emc.com




© Copyright 2012 EMC Corporation. All rights reserved.                    1
QUICK INTRODUCTION TO GREENPLUM DATABASE




GREENPLUM DATABASE

Greenplum Database Basics
Massively Parallel Processing (MPP) Database

Uses commodity hardware

Data is distributed by a user-defined “distribution key”

Master node delegates queries to segments

1:1 segment and master mirroring for redundancy

[Diagram: two Master nodes and four Segment nodes]
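
The distribution-key and query-delegation bullets above can be sketched in a few lines of Python. This is a toy model only: the md5-based hash, the four-segment count, and the `household_id` column are assumptions made for illustration, not Greenplum's actual hashing or catalog.

```python
import hashlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumption: four segments, chosen for illustration


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution-key value to a segment index with a stable hash.

    md5 is an illustrative choice; it is not Greenplum's real hash function.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute(rows, key_column, num_segments: int = NUM_SEGMENTS):
    """Group rows by the segment that owns their distribution-key value."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


def parallel_count(segments):
    """Mimic the master: every segment counts locally, the master sums."""
    return sum(len(rows) for rows in segments.values())


# Hypothetical table of 1000 rows keyed by household_id.
rows = [{"household_id": i, "income": 1000 * i} for i in range(1000)]
segments = distribute(rows, "household_id")
print({seg: len(rs) for seg, rs in sorted(segments.items())})
print(parallel_count(segments))
```

A key with many distinct values spreads rows roughly evenly, so each segment does a similar share of the work; a low-cardinality key would skew the load toward a few segments.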





Greenplum Database Features
Full SQL support based on PostgreSQL 8.2

Columnar or row-oriented storage with compression

Multi-level table partitioning with query time partition pruning

B-tree and bitmap indexes

JDBC, ODBC, OLEDB, etc. interfaces

High speed, parallel bulk ingest

Parallel query optimizer

External tables





MADlib Analytics with Greenplum

Scalable and in-database

Mathematical, statistical, machine learning

Active open source project

> SELECT householdID, variables
   FROM households
   ORDER BY RANDOM()
   LIMIT 100000;
> SELECT *
   FROM run_univariate_analysis(
      'households_training',
      'variables')
   WHERE pvalue < .01 AND r2 > .01;
> SELECT run_regression(
      'univariate_results',
      'households_training');
> SELECT householdID,
      madlib.array_dot(
         coef::REAL[],
         xmatrix::REAL[])
   FROM coefficients, households;
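
The last query scores each household by taking the dot product of the fitted coefficients with that household's feature vector. A minimal Python sketch of that scoring step follows; the coefficient values and household vectors are made up for illustration.

```python
def array_dot(a, b):
    """Dot product of two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return sum(x * y for x, y in zip(a, b))


# Hypothetical fitted coefficients and per-household feature vectors.
coef = [0.5, 2.0, -1.0]
households = {
    101: [1.0, 3.0, 2.0],
    102: [1.0, 0.0, 4.0],
}

# One score per household, as the SELECT would return one row per household.
scores = {hid: array_dot(coef, x) for hid, x in households.items()}
print(scores)
```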





MADlib In-Database Analytical Functions
    Descriptive Statistics
        Quantile
        Profile
        CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
        FM (Flajolet-Martin) Sketch-based Estimator
        MFV (Most Frequent Values) Sketch-based Estimator
        Frequency
        Histogram
        Bar Chart
        Box Plot Chart
        Latent Dirichlet Allocation Topic Modeling

    Modeling
        Correlation Matrix
        Association Rule Mining
        K-Means Clustering
        Naïve Bayes Classification
        Linear Regression
        Logistic Regression
        Support Vector Machines
        SVD Matrix Factorisation
        Decision Trees/CART





PostGIS Support in Greenplum DB
  PostGIS adds support for geographic objects in PostgreSQL (http://postgis.refractions.net/)

  Example: find all records within 25 miles of hurricane path

 select customer_id, ST_AsText(lat_lon), phone_num
 from clients
 where ST_DWithin(lat_lon, ST_GeometryFromText('LINESTRING(
 -79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6, -79.6 17.4, -79.6 16.8, -79.9 15.8, -80.2 15.8,
 -80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5, -81.1 16.7, -81.8 16.7, -82.1 16.8, -82.5 17.2,
 -83.9 17.9, -85.2 18.3, -85.5 18.4)', 4326), 25.0/3959.0 * 180.0/PI());


 customer_id | st_astext                   | phone_num
 ------------+-----------------------------+------------
 493140      | POINT(-80.040397 26.570613) | 1231231234
 192401      | POINT(-81.820933 26.242611) | 2342342345
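
The distance argument `25.0/3959.0 * 180.0/PI()` converts 25 miles into degrees, because geometries in SRID 4326 are measured in degrees of latitude and longitude. The sketch below reproduces that arithmetic; treating the Earth as a sphere of radius 3959 miles is the query's own approximation, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # radius used in the query


def miles_to_degrees(miles: float, radius_miles: float = EARTH_RADIUS_MILES) -> float:
    """Angle in degrees subtended at the sphere's center by a surface distance."""
    return math.degrees(miles / radius_miles)


tolerance = miles_to_degrees(25.0)
print(round(tolerance, 4))
```

The result is a little over a third of a degree. A degree of longitude covers fewer miles away from the equator, so a fixed tolerance in degrees is only an approximation of a 25-mile radius.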




 Solr integration with GPDB
 Solr is an open source enterprise search engine

 Enable in-database text indexing and search

select
   t.id,
   q.score,
   t.message_text
from
   message t,
   gptext.search(
    'twitter.public.message',
    '(iphone and (hate or love))',
    'author_lang:en',
    100
   ) q
where
   t.id = q.id
order by score desc;

     id     |      score       | message_text
 -----------+------------------+-------------------------------------------
   71552856 | 5.43078422546387 | Hates BB's Love IPhones!
   91373993 | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
   25444233 | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
  120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
  117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
   86416378 | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..




GREENPLUM HADOOP





Greenplum “HD”
• Bundled open source

• HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout





Greenplum “MR”
• Bundled MapR, a commercial version of Hadoop
• API compatible with traditional Hadoop
• MapR improvements over Hadoop:
        – Improved control system
        – Major portions of HDFS re-implemented in C++
        – HDFS is NFS mountable
        – Improved shuffle and sort
        – Distributed NameNode
        – Supports large number of files
        – Mirroring, snapshot capability



Why do such a thing?
Greenplum DB

[Diagram: Greenplum DB capabilities arranged along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum: SQL, RDBMS, Tables and Schemas, MADLib, Partitioning, Indexing, PostGIS, GP Solr/Lucene, Text objects, GPMapReduce]




Why do such a thing?
Hadoop

[Diagram: Hadoop capabilities arranged along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum: Hive, Pig, XML, JSON, …, Schema on load, MapReduce, Flat files]




Why do such a thing?
HBase

[Diagram: HBase capabilities arranged along a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum: Hive, Pig, Row keys, Flexible schema, HBase Tables, MapReduce]




Why do such a thing?
Hybrid architecture with all three (or two…)

[Diagram: the Greenplum DB, Hadoop, and HBase capabilities overlaid on one STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED spectrum: SQL, RDBMS, Tables and Schemas, MADLib, Partitioning, Indexing, Hive, Pig, PostGIS, Row keys, Flexible schema, HBase Tables, XML, JSON, …, GP Solr/Lucene, Text objects, GPMapReduce, Schema on load, MapReduce, Flat files]




Greenplum Unified Analytics Platform




Hadoop External Tables in GPDB
  External tables bring external data into the database.

  Native support for HDFS with parallelized loading.

  Can write to HDFS or read from HDFS.

 > CREATE EXTERNAL TABLE hdfs_document_feature (
   docid integer,
   term text,
   freq integer)
  LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*')
  FORMAT 'text' (delimiter '|');

 > SELECT COUNT(*) FROM hdfs_document_feature h, gpdb_words g
   WHERE h.term = g.word;

 -- hdfs_export must be defined with CREATE WRITABLE EXTERNAL TABLE
 > INSERT INTO hdfs_export SELECT * FROM gpdb_source;
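
To make the `FORMAT 'text' (delimiter '|')` clause concrete, the sketch below parses lines shaped like the three declared columns (docid, term, freq) and counts the rows that would survive the join against `gpdb_words`. The file contents and word list are invented for illustration, and the code reads from a string rather than from HDFS.

```python
import csv
import io

# Hypothetical contents of one part-* file: docid|term|freq
PART_FILE = "1|hadoop|3\n1|greenplum|2\n2|hadoop|1\n2|isilon|5\n"


def read_document_features(text):
    """Parse pipe-delimited lines into (docid, term, freq) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [(int(docid), term, int(freq)) for docid, term, freq in reader]


def count_matches(features, words):
    """Count feature rows whose term is in the word list.

    Equivalent to the COUNT(*) join only if each word appears once in the list.
    """
    word_set = set(words)
    return sum(1 for _, term, _ in features if term in word_set)


features = read_document_features(PART_FILE)
gpdb_words = ["hadoop", "greenplum"]  # hypothetical contents of gpdb_words
print(count_matches(features, gpdb_words))
```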




Why do such a thing?
Covers many of the same use cases as an HBase/Hadoop environment

Use Hadoop as a data groomer

Do rollups in Hadoop and store results in GPDB

Use the best tool for the job (structured vs. unstructured)

Use GPDB to host data sets in a more real-time layer for ad-hoc analytics




EMC Isilon
    Hardware appliance for scale-out network-attached storage (NAS)
    Stripes data across all nodes
    Uses InfiniBand for intra-cluster communication
    Up to 15.5PB total storage
    3 different hardware configurations to handle different workloads
    Uses “OneFS”, Isilon’s operating system and file system
    Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more.



Isilon HDFS interface
    Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
    The underlying system is OneFS, which does not follow the traditional HDFS scheme.
    Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.




Pros & Cons
    Isilon is denser
    Isilon can be mounted via a number of protocols
        – Easier ingest / egress
        – Raw data accessible by applications
    Isilon is easy to manage
    Free of certain HDFS limitations
    Isilon loses data locality (~250MB/sec throughput per node over network)
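
The locality cost can be estimated with back-of-the-envelope arithmetic. The sketch below assumes the ~250MB/sec figure applies per node, that throughput scales linearly with node count, and that 1 MB is 10**6 bytes; the node count and data size are hypothetical.

```python
PER_NODE_BYTES_PER_SEC = 250e6  # ~250 MB/sec per node, from the slide


def aggregate_throughput(num_nodes: int) -> float:
    """Total bytes/sec, assuming perfect linear scaling (an idealization)."""
    return num_nodes * PER_NODE_BYTES_PER_SEC


def scan_seconds(dataset_bytes: float, num_nodes: int) -> float:
    """Time to read the whole data set once over the network."""
    return dataset_bytes / aggregate_throughput(num_nodes)


# Hypothetical job: scan 10 TB using 8 nodes.
seconds = scan_seconds(10e12, 8)
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")
```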

Why do such a thing?
    Hadoop backup or archive
     – More dense than HDFS, more accessible than
       tape, no need for compute
    Complete HDFS replacement
     – More dense, more accessible, utilize existing
       Isilon, slower per terabyte of storage
    Hot/warm storage
     – Use HDFS as primary, but Isilon as secondary
    Storage for original content
     – Use MapReduce to extract metadata from original
       content, and leave original content in place

HBase External Tables in GPDB
  Project in development

  Load data in parallel from HBase by specifying table name and
  column qualifiers


 > CREATE EXTERNAL TABLE hbase_document_feature (
   "HBASEROWKEY" text,
   "term" text,
   "freq" integer)
  LOCATION ('gphbase://docfeatures')
  FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');

 > SELECT COUNT(*) FROM hbase_document_feature h, gpdb_words g WHERE
 h.term = g.word;




HBase External Tables in GPDB
Possible TODO list:

                 Specify range of rowkeys

                 Support writes into HBase

                 Specify filter criteria on the external table

                 select * from hbase_external where ROWKEY='abc'

                 Accumulo?




Why do such a thing?
Have HBase store semi-structured data

Exploit the strengths of each

Use HBase for really really wide tables

Use HBase as a scalable archive of raw records

Leverage existing HBase applications




Greenplum On HDFS

  Get Greenplum Database to run natively off of HDFS

  Underlying Greenplum Database data is stored in HDFS

  Unifies the two platforms further – no need for external tables

  Fully supports Greenplum’s append-only tables


  Early project in R&D

  Talk will be given by Chang Lei at Yahoo Summit




Greenplum On HDFS
[Diagram: a master host and five segment hosts (each with segments and mirror segments) joined by an interconnect. Below them, tables in an HDFS filespace: a Namenode (labeled Meta Ops) and Datanodes (labeled Read/Write), with block replication across Rack1 and Rack2.]




Why do such a thing?
Covers many of the same use cases as Hive

Run Hadoop MapReduce over data managed by Greenplum DB

Initial results show it is faster than Hive

You only have to store your data in one system




Hadoop & Greenplum: Why Do Such a Thing?

More Related Content

What's hot

MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!Vitor Oliveira
 
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11Kenny Gryp
 
Oracle RAC features on Exadata
Oracle RAC features on ExadataOracle RAC features on Exadata
Oracle RAC features on ExadataAnil Nair
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions Yugabyte
 
MySQL InnoDB Cluster - Group Replication
MySQL InnoDB Cluster - Group ReplicationMySQL InnoDB Cluster - Group Replication
MySQL InnoDB Cluster - Group ReplicationFrederic Descamps
 
The Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialThe Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialJean-François Gagné
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Everything You Need to Know About MySQL Group Replication
Everything You Need to Know About MySQL Group ReplicationEverything You Need to Know About MySQL Group Replication
Everything You Need to Know About MySQL Group ReplicationNuno Carvalho
 
MySQL Group Replication - Ready For Production? (2018-04)
MySQL Group Replication - Ready For Production? (2018-04)MySQL Group Replication - Ready For Production? (2018-04)
MySQL Group Replication - Ready For Production? (2018-04)Kenny Gryp
 
Percona Live 2022 - MySQL Architectures
Percona Live 2022 - MySQL ArchitecturesPercona Live 2022 - MySQL Architectures
Percona Live 2022 - MySQL ArchitecturesFrederic Descamps
 
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RACThe Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RACMarkus Michalewicz
 
ProxySQL High Avalability and Configuration Management Overview
ProxySQL High Avalability and Configuration Management OverviewProxySQL High Avalability and Configuration Management Overview
ProxySQL High Avalability and Configuration Management OverviewRené Cannaò
 
ProxySQL High Availability (Clustering)
ProxySQL High Availability (Clustering)ProxySQL High Availability (Clustering)
ProxySQL High Availability (Clustering)Mydbops
 
Percona XtraDB Cluster ( Ensure high Availability )
Percona XtraDB Cluster ( Ensure high Availability )Percona XtraDB Cluster ( Ensure high Availability )
Percona XtraDB Cluster ( Ensure high Availability )Mydbops
 
High Availability in MySQL 8 using InnoDB Cluster
High Availability in MySQL 8 using InnoDB ClusterHigh Availability in MySQL 8 using InnoDB Cluster
High Availability in MySQL 8 using InnoDB ClusterSven Sandberg
 
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group ReplicationPercona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group ReplicationKenny Gryp
 
[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기NHN FORWARD
 

What's hot (20)

Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
 
MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
 
Oracle RAC features on Exadata
Oracle RAC features on ExadataOracle RAC features on Exadata
Oracle RAC features on Exadata
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
MySQL InnoDB Cluster - Group Replication
MySQL InnoDB Cluster - Group ReplicationMySQL InnoDB Cluster - Group Replication
MySQL InnoDB Cluster - Group Replication
 
The Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialThe Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication Tutorial
 
Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Everything You Need to Know About MySQL Group Replication
Everything You Need to Know About MySQL Group ReplicationEverything You Need to Know About MySQL Group Replication
Everything You Need to Know About MySQL Group Replication
 
MySQL Group Replication - Ready For Production? (2018-04)
MySQL Group Replication - Ready For Production? (2018-04)MySQL Group Replication - Ready For Production? (2018-04)
MySQL Group Replication - Ready For Production? (2018-04)
 
Percona Live 2022 - MySQL Architectures
Percona Live 2022 - MySQL ArchitecturesPercona Live 2022 - MySQL Architectures
Percona Live 2022 - MySQL Architectures
 
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RACThe Top 5 Reasons to Deploy Your Applications on Oracle RAC
The Top 5 Reasons to Deploy Your Applications on Oracle RAC
 
ProxySQL High Avalability and Configuration Management Overview
ProxySQL High Avalability and Configuration Management OverviewProxySQL High Avalability and Configuration Management Overview
ProxySQL High Avalability and Configuration Management Overview
 
ProxySQL High Availability (Clustering)
ProxySQL High Availability (Clustering)ProxySQL High Availability (Clustering)
ProxySQL High Availability (Clustering)
 
Percona XtraDB Cluster ( Ensure high Availability )
Percona XtraDB Cluster ( Ensure high Availability )Percona XtraDB Cluster ( Ensure high Availability )
Percona XtraDB Cluster ( Ensure high Availability )
 
High Availability in MySQL 8 using InnoDB Cluster
High Availability in MySQL 8 using InnoDB ClusterHigh Availability in MySQL 8 using InnoDB Cluster
High Availability in MySQL 8 using InnoDB Cluster
 
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group ReplicationPercona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
 
[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기
 

Viewers also liked

Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview EMC
 
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급hslkdfjs
 
New Use Cases for DAM in the Enterprise
New Use Cases for DAM in the EnterpriseNew Use Cases for DAM in the Enterprise
New Use Cases for DAM in the EnterpriseNuxeo
 
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasRemaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasMongoDB
 
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기설리번 프로젝트
 
Tailings dump recovery concept
Tailings dump recovery conceptTailings dump recovery concept
Tailings dump recovery conceptphillip shambare
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
GIS for Infrastructure Management
GIS for Infrastructure ManagementGIS for Infrastructure Management
GIS for Infrastructure ManagementDavid Puckett
 
Real-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping ContainersReal-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping Containersbenaam
 
Designing your Product as a Platform
Designing your Product as a PlatformDesigning your Product as a Platform
Designing your Product as a PlatformMicah Laaker
 
Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services Ericsson
 
Web Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI ToolWeb Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI ToolSperasoft
 
Spend Analysis In 60 Seconds
Spend Analysis In 60 SecondsSpend Analysis In 60 Seconds
Spend Analysis In 60 SecondsClaritum
 
Surgical induced astigmatism
Surgical induced astigmatismSurgical induced astigmatism
Surgical induced astigmatismNamrata Gupta
 

Viewers also liked (20)

Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
 
New Use Cases for DAM in the Enterprise
New Use Cases for DAM in the EnterpriseNew Use Cases for DAM in the Enterprise
New Use Cases for DAM in the Enterprise
 
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasRemaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
 
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
 
Hadoop Cluster Management
Hadoop Cluster ManagementHadoop Cluster Management
Hadoop Cluster Management
 
Tailings dump recovery concept
Tailings dump recovery conceptTailings dump recovery concept
Tailings dump recovery concept
 
Polymer optical fibers
Polymer optical fibersPolymer optical fibers
Polymer optical fibers
 
SAP Cloud for Service
SAP Cloud for ServiceSAP Cloud for Service
SAP Cloud for Service
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
GIS for Infrastructure Management
GIS for Infrastructure ManagementGIS for Infrastructure Management
GIS for Infrastructure Management
 
Real-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping ContainersReal-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping Containers
 
Designing your Product as a Platform
Designing your Product as a PlatformDesigning your Product as a Platform
Designing your Product as a Platform
 
Chem Lab Report (1)
Chem Lab Report (1)Chem Lab Report (1)
Chem Lab Report (1)
 
High-Density Wireless Networks for Auditoriums
High-Density Wireless Networks for AuditoriumsHigh-Density Wireless Networks for Auditoriums
High-Density Wireless Networks for Auditoriums
 
Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services
 
Web Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI ToolWeb Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI Tool
 
Spend Analysis In 60 Seconds
Spend Analysis In 60 SecondsSpend Analysis In 60 Seconds
Spend Analysis In 60 Seconds
 
Surgical induced astigmatism
Surgical induced astigmatismSurgical induced astigmatism
Surgical induced astigmatism
 

Similar to Hadoop & Greenplum: Why Do Such a Thing?

Next generation analytics with yarn, spark and graph lab

Hadoop & Greenplum: Why Do Such a Thing?

  • 1. Greenplum & Hadoop Why do such a thing? Donald Miner Solutions Architect Advanced Technologies Group Donald.Miner@emc.com © Copyright 2012 EMC Corporation. All rights reserved. 1
  • 2. QUICK INTRODUCTION TO GREENPLUM DATABASE
  • 3. GREENPLUM DATABASE: Greenplum Database Basics
      • Massively Parallel Processing (MPP) database
      • Uses commodity hardware
      • Data is distributed by a user-defined “distribution key”
      • Master node delegates queries to segments
      • 1:1 segment and master mirroring for redundancy
      [Diagram: two Master boxes above four Segment boxes]
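The distribution-key bullet on slide 3 can be pictured with a small sketch that hashes a key column to pick a segment. The names `segment_for` and `distribute`, the CRC32 hash, and the sample rows are assumptions made for illustration; the slides do not describe the hash Greenplum actually uses.

```python
import zlib
from collections import defaultdict

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key_column, num_segments):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_column], num_segments)].append(row)
    return dict(segments)

# Hypothetical sample data: eight household rows spread over four segments.
rows = [{"householdID": i, "income": 1000 * i} for i in range(8)]
placement = distribute(rows, "householdID", num_segments=4)
```

Rows that share a key value always land on the same segment, which is why a well-chosen distribution key lets joins on that key run without moving data between segments.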
  • 4. GREENPLUM DATABASE: Greenplum Database Features
      • Full SQL support based on PostgreSQL 8.2
      • Columnar or row-oriented storage with compression
      • Multi-level table partitioning with query time partition pruning
      • B-tree and bitmap indexes
      • JDBC, ODBC, OLEDB, etc. interfaces
      • High speed, parallel bulk ingest
      • Parallel query optimizer
      • External tables
  • 5. GREENPLUM DATABASE: MADlib Analytics with Greenplum
      • Scalable and in-database
      • Mathematical, statistical, machine learning
      • Active open source project
      > SELECT householdID, variables FROM households ORDER BY RANDOM() LIMIT 100000;
      > SELECT run_univariate_analysis('households_training', 'variables'); WHERE pvalue<.01 AND r2>.01;
      > SELECT run_regression('univariate_results', 'households_training');
      > SELECT householdID, madlib.array_dot(coef::REAL[], xmatrix::REAL[]) FROM coefficients, households;
  • 6. GREENPLUM DATABASE: MADlib In-Database Analytical Functions
      Descriptive Statistics:
      • Quantile
      • Profile
      • CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
      • FM (Flajolet-Martin) Sketch-based Estimator
      • MFV (Most Frequent Values) Sketch-based Estimator
      • Frequency
      • Histogram
      • Bar Chart
      • Box Plot Chart
      Modeling:
      • Correlation Matrix
      • Association Rule Mining
      • K-Means Clustering
      • Naïve Bayes Classification
      • Linear Regression
      • Logistic Regression
      • Support Vector Machines
      • SVD Matrix Factorisation
      • Decision Trees/CART
      • Latent Dirichlet Allocation Topic Modeling
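To give a feel for what a sketch-based estimator such as the CountMin entry on slide 6 does, here is a minimal stand-alone version. It is not MADlib's implementation; the class name, the default width and depth, and the use of salted BLAKE2b hashes are all assumptions chosen for illustration.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter that never under-estimates a count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, obtained by salting with the row number.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

# Hypothetical input stream.
sketch = CountMinSketch()
for word in ["iphone", "love", "iphone", "hate", "iphone"]:
    sketch.add(word)
```

The structure uses fixed memory regardless of how many distinct items arrive, which is what makes this family of estimators attractive for in-database summaries of large tables.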
  • 7. GREENPLUM DATABASE: PostGIS Support in Greenplum DB
      • PostGIS adds support for geographic objects in PostgreSQL (http://postgis.refractions.net/)
      • Example: find all records within 25 miles of hurricane path
      select customer_id, ST_AsText(lat_lon), phone_num
      from clients
      where ST_DWithin(lat_lon,
          ST_GeometryFromText('LINESTRING(-79.3 17, -79.3 17.1, -79.3 17.3, -79.7 17.6, -79.6 17.4, -79.6 16.8, -79.9 15.8, -80.2 15.8, -80 15.7, -80 15.7, -80.2 15.9, -80.6 16.5, -81.1 16.7, -81.8 16.7, -82.1 16.8, -82.5 17.2, -83.9 17.9, -85.2 18.3, -85.5 18.4)', 4326),
          25.0/3959.0 * 180.0/PI())
      customer_id | st_astext                   | phone_num
      ------------+-----------------------------+-------------
      493140      | POINT(-80.040397 26.570613) | 1231231234
      192401      | POINT(-81.820933 26.242611) | 2342342345
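The last argument of the slide 7 query, `25.0/3959.0 * 180.0/PI()`, converts 25 miles into degrees of arc on a sphere with a radius of 3959 miles. The short sketch below reproduces that arithmetic; the function and constant names are illustrative and not part of PostGIS.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # radius value used in the slide's query

def miles_to_degrees(miles, radius_miles=EARTH_RADIUS_MILES):
    """Convert a surface distance to the angle it subtends, in degrees."""
    return math.degrees(miles / radius_miles)

# Tolerance passed to ST_DWithin in the slide, roughly 0.36 degrees.
tolerance_deg = miles_to_degrees(25.0)
```

Treating degrees as a distance unit is an approximation, since a degree of longitude covers less ground as latitude increases.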
  • 8. GREENPLUM DATABASE: Solr integration with GPDB
      • Solr is an open source enterprise search engine
      • Enable in-database text indexing and search
      select t.id, q.score, t.message_text
      from message t,
          gptext.search('twitter.public.message', '(iphone and (hate or love))', 'author_lang:en', 100) q
      where t.id=q.id
      order by score desc;
      id        | score            | message_text
      ----------+------------------+-------------------------------------------
      71552856  | 5.43078422546387 | Hates BB's Love IPhones!
      91373993  | 4.06371879577637 | Its a love hate relationship with iPhone spellcheck
      25444233  | 4.05911064147949 | #iPhone autocorrect is a love/hate relationship...
      120166038 | 3.39410924911499 | Love the new iPhone 4s, hate @ATT service #Verizonhereicome
      117498183 | 3.39181470870972 | I got a love-hate relationship for my iPhone!!!
      86416378  | 3.39180779457092 | Absolutely love the new iPhone, but Siri seems to hate me..
  • 9. GREENPLUM HADOOP
  • 10. GREENPLUM HADOOP: Greenplum “HD”
      • Bundled open source
      • HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Mahout
  • 11. GREENPLUM HADOOP: Greenplum “MR”
      • Bundled MapR, a commercial version of Hadoop
      • API compatible with traditional Hadoop
      • MapR improvements over Hadoop:
          – Improved control system
          – Major portions of HDFS re-implemented in C++
          – HDFS is NFS mountable
          – Improved shuffle and sort
          – Distributed NameNode
          – Supports large number of files
          – Mirroring, snapshot capability
  • 12. Why do such a thing? Greenplum DB
      [Diagram over a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED axis; labels: MADLib, Partitioning, GP Solr/Lucene, SQL, Indexing, Text objects, RDBMS, PostGIS, GPMapReduce, Tables and Schemas]
  • 13. Why do such a thing? Hadoop
      [Diagram over a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED axis; labels: Schema on load, MapReduce, Hive, XML, JSON, …, Flat files, Pig]
  • 14. Why do such a thing? HBase
      [Diagram over a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED axis; labels: Row keys, Hive, Flexible schema, MapReduce, HBase Tables, Pig]
  • 15. Why do such a thing? Hybrid architecture with all three (or two…)
      [Diagram over a STRUCTURED / SEMISTRUCTURED / UNSTRUCTURED axis; labels: MADLib, Partitioning, Row keys, GP Solr/Lucene, SQL, Schema on load, Indexing, Text objects, Flexible schema, MapReduce, RDBMS, Hive, PostGIS, HBase Tables, GPMapReduce, Tables and Schemas, Pig, XML, JSON, …, Flat files]
  • 16. Greenplum Unified Analytics Platform
  • 17. Hadoop External Tables in GPDB
      • External tables bring external data into the database.
      • Native support for HDFS with parallelized loading.
      • Can write to HDFS or read from HDFS.
      > CREATE EXTERNAL TABLE hdfs_document_feature (docid integer, term text, freq integer) LOCATION ('gphdfs://namenode:9000/user/don/docs/part-*') FORMAT 'text' (delimiter '|');
      > SELECT COUNT(*) FROM hdfs_document_feature h, gpdb_words g WHERE h.term = g.word;
      > INSERT INTO hdfs_export SELECT * FROM gpdb_source;
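As a rough picture of what the external table on slide 17 exposes, the sketch below parses pipe-delimited lines into `(docid, term, freq)` rows and counts how many match a word list, mirroring the slide's join. The helper names and the sample lines are hypothetical, and the code assumes the word list holds no duplicates, which the SQL join does not require.

```python
def parse_document_features(lines, delimiter="|"):
    """Turn 'docid|term|freq' text lines into typed tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        docid, term, freq = line.split(delimiter)
        rows.append((int(docid), term, int(freq)))
    return rows

def count_matches(features, words):
    """Count feature rows whose term appears in the (assumed unique) word list."""
    word_set = set(words)
    return sum(1 for _docid, term, _freq in features if term in word_set)

# Hypothetical contents of one HDFS part file and of gpdb_words.
part_file = ["1|hadoop|3", "1|greenplum|5", "2|hadoop|1"]
features = parse_document_features(part_file)
matches = count_matches(features, ["hadoop", "sql"])
```

In the database the same work is spread across segments, each reading its own share of the `part-*` files in parallel.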
  • 18. Why do such a thing?
      • Many of the same use cases as an HBase/Hadoop environment
      • Use Hadoop as a data groomer
      • Do rollups in Hadoop and store results in GPDB
      • Use the best tool for the job (structured vs. unstructured)
      • Use GPDB to host data sets in a more real-time layer for ad-hoc analytics
  • 19. EMC Isilon
      • Hardware appliance for scale-out network-attached storage (NAS)
      • Stripes data across all nodes
      • Uses InfiniBand for intra-cluster communication
      • Up to 15.5PB total storage
      • 3 different hardware configurations to handle different workloads
      • Uses “OneFS”, Isilon’s operating system and file system
      • Interfaces with iSCSI, NFS, CIFS, HTTP, HDFS, and a few more.
  • 20. Isilon HDFS interface
      • Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data.
      • Underlying system is OneFS and does not follow the traditional HDFS scheme.
      • Point HDFS clients (MapReduce, command line, etc.) to any IP in the Isilon cluster.
  • 21. Pros & Cons
      • Isilon is more dense
      • Isilon can be mounted via a number of protocols
          – Easier ingest / egress
          – Raw data accessible by applications
      • Isilon is easy to manage
      • Free of certain HDFS limitations
      • Isilon loses data locality (~250MB/sec throughput per node over network)
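The ~250MB/sec per-node figure on slide 21 allows a back-of-the-envelope estimate of how long a full scan over the network might take. The function below is only that arithmetic; it assumes throughput scales linearly with node count and ignores compute time, and the function name and example sizes are illustrative.

```python
def scan_seconds(dataset_mb, nodes, mb_per_sec_per_node=250.0):
    """Estimate seconds to read a dataset once, assuming perfect linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return dataset_mb / (nodes * mb_per_sec_per_node)

# Hypothetical example: 1 TB (taken as 1,000,000 MB) read through 4 nodes.
estimate = scan_seconds(1_000_000, nodes=4)
```

Comparing such an estimate against local-disk scan rates in a conventional HDFS cluster is one way to judge whether the loss of data locality matters for a given workload.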
  • 22. Why do such a thing?
      • Hadoop backup or archive
          – More dense than HDFS, more accessible than tape, no need for compute
      • Complete HDFS replacement
          – More dense, more accessible, utilize existing Isilon, slower per terabyte of storage
      • Hot/warm storage
          – Use HDFS as primary, but Isilon as secondary
      • Storage for original content
          – Use MapReduce to extract metadata from original content, and leave original content in place
  • 23. HBase External Tables in GPDB
      • Project in development
      • Load data in parallel from HBase by specifying table name and column qualifiers
      > CREATE EXTERNAL TABLE hbase_document_feature ("HBASEROWKEY" text, "term" text, "freq" integer) LOCATION ('gphbase://docfeatures') FORMAT 'CUSTOM' (formatter='gpdbwriteable_import');
      > SELECT COUNT(*) FROM hbase_document_feature h, gpdb_words g WHERE h.term = g.word;
  • 24. HBase External Tables in GPDB
      Possible TODO list:
      • Specify range of rowkeys
      • Support writes into HBase
      • Specify filter criteria on the external table
          select * from hbase_external where ROWKEY='abc'
      • Accumulo?
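The "range of rowkeys" item on slide 24 builds on the fact that HBase keeps rows sorted by rowkey, so a range can be located without reading the whole table. The sketch below imitates a start-inclusive, stop-exclusive scan over an in-memory sorted list; the function name and the sample keys are hypothetical, and nothing here reflects how the GPDB connector would implement it.

```python
import bisect

def rowkey_range(sorted_rowkeys, start, stop):
    """Return rowkeys in [start, stop) from a list already sorted ascending."""
    lo = bisect.bisect_left(sorted_rowkeys, start)
    hi = bisect.bisect_left(sorted_rowkeys, stop)
    return sorted_rowkeys[lo:hi]

# Hypothetical rowkeys, kept in sorted order as HBase would store them.
keys = ["abc", "abd", "b01", "c99"]
selected = rowkey_range(keys, "abc", "b")
```

Pushing such a range down to HBase would let an external table read only the matching region of the table instead of filtering every row inside the database.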
  • 25. Why do such a thing?
      • Have HBase store semi-structured data
      • Exploit the strengths of each
      • Use HBase for really really wide tables
      • Use HBase as a scalable archive of raw records
      • Leverage existing HBase applications
  • 26. Greenplum On HDFS
      • Get Greenplum Database to run natively off of HDFS
      • Underlying Greenplum Database data is stored in HDFS
      • Unifies the two platforms further: no need for external tables
      • Fully supports Greenplum’s append-only tables
      • Early project in R&D
      • Talk will be given by Chang Lei at Yahoo Summit
  • 27. Greenplum On HDFS
      [Architecture diagram; labels: Master host, Interconnect, five Segment hosts running Segment and Segment (Mirror) instances, Meta Ops, Read/Write, Tables in HDFS filespace, Namenode, Datanode replication, Datanodes in Rack1 and Rack2]
  • 28. Why do such a thing?
      • Covers many of the same use cases as Hive
      • Run Hadoop MapReduce over data managed by Greenplum DB
      • Initial results show it is faster than Hive
      • You only have to store your data in one system

Editor's Notes

  1. Greenplum HD HadoopSoftware