2. At the end of this session, participants will understand the basic concepts of:
IBM PureData System for Analytics (Netezza).
IBM PureData System models and their components.
IBM PureData System architecture.
How it works.
10/12/2014 2
3. “If you'd like to take us on, make our day.”
- Larry Ellison, Oct 2009
“Our goal is to become number one in the high-end server business for both
Online Transaction Processing and Data Warehousing, both of those
segments.”
- Larry Ellison, Dec 2010
4. One of “The five most important M&A Deals of 2010”
- Wall Street Journal
9. Why Netezza?
Seamless integration with Informatica, Business Objects, SAS, and SQL Server (SSIS packages)
Very little DDL and SQL conversion
• Used the same table structures
• Converted the primary index to a distribution column
10x to 200x performance improvements in BO reporting
Fast to deploy
Very appealing price-to-performance
Ease of use
• Administrative tasks
• DBA tasks
• Supports all DB structures (3NF, star, de-normalized tables)
10. The IBM PureData System for Analytics N1001 models
include single-rack and multi-rack configurations.
The N1001 model family is an update to the IBM Netezza
1000 model family, with the same architectural and
interface specifications.
Each N1001 storage array contains either two or four disk
enclosures, depending upon the model.
Each disk enclosure has 12 disks.
For example, an N1001-005 system has one storage array
with 48 disks.
12. The IBM® PureData™ System for Analytics N2001 family
is the latest generation of data warehouse appliances.
It increases the capacity and performance of the N1001
models.
Within each rack are numerous components that work
together to provide the asymmetric massively parallel
processing of the Netezza® architecture.
The key hardware components include:
Snippet blades (S-Blades)
Hosts
Storage arrays
13. The following figure summarizes the IBM PureData System for Analytics N2001 half-rack,
full-rack, and two-rack models.
14. The snippet processing functions are the responsibility of
the S-Blade.
The S-Blade is a specialized processing board which
combines the CPU processing power of a blade server
with the query analysis intelligence of the Netezza
Database Accelerator card.
This dual-board component resides in two slots of the S-Blade chassis.
Each chassis can contain up to seven S-Blades.
15. The Netezza Database Accelerator card contains the FPGA query engines, memory,
and I/O for processing the data from the disks where user data is stored.
16. The host server is a Linux server that runs the Netezza software
and utilities.
The host controls and coordinates the activity of the appliance.
It performs query optimization; controls table and database
operations; consolidates and returns query results; and monitors
the Netezza system components to detect and report problems.
The host is a highly redundant, highly available server.
The Netezza 1000 systems have two hosts in a highly available (HA)
configuration.
17. The storage arrays contain the disks that store the user data
and related processing files to support the query activity on
the system.
In the N2001 model family, each disk enclosure has 24
disks.
There are 12 disk enclosures in each full rack, or 6
enclosures in a half-rack model.
In the N2001 family, each rack is one storage array.
18. Netezza's appliances use a proprietary Asymmetric Massively Parallel
Processing (AMPP) architecture that combines open, blade-based
servers and disk storage with a proprietary data filtering process using
field-programmable gate arrays (FPGAs).
Netezza’s proprietary AMPP architecture is a two-tiered system
designed to quickly handle very large queries from multiple users.
The first tier is a high-performance Linux SMP host that compiles data
query tasks received from business intelligence applications, and
generates query execution plans.
It then divides a query into a sequence of sub-tasks, or snippets, that
can be executed in parallel, and distributes the snippets to the second
tier for execution.
The second tier consists of multiple snippet-processing blades,
or S-Blades, where all the primary processing work of the appliance is
executed.
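The two-tier flow on this slide, where the host splits a query into per-slice snippets and consolidates the partial results, can be sketched in Python. This is a hypothetical simulation, not Netezza code; the table data and the snippet function are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# User data spread across data slices (hypothetical rows: (id, revenue)).
data_slices = [
    [(1, 100.0), (2, 250.0)],
    [(3, 75.0), (4, 300.0)],
    [(5, 50.0), (6, 125.0)],
]

def snippet(rows, predicate):
    """One snippet: scan a slice, filter, and pre-aggregate locally."""
    return sum(rev for _, rev in rows if predicate(rev))

def run_query(predicate):
    """Tier 1 (host): distribute one snippet per slice to the 'S-Blades'
    (here a thread pool), then consolidate the partial results."""
    with ThreadPoolExecutor() as blades:
        partials = blades.map(lambda s: snippet(s, predicate), data_slices)
    return sum(partials)

# Roughly: SELECT SUM(revenue) ... WHERE revenue > 100
print(run_query(lambda rev: rev > 100))  # 675.0
```

Each snippet touches only its own slice, which is why the real appliance can run them fully in parallel.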
20. IBM PureData System for Analytics
The Simple Appliance for Serious Analytics
Built-in Expertise
No indexes or tuning
Data model agnostic
Fully parallel, optimized In Database Analytics
Integration by Design
Server, storage, and database in one easy-to-use package
Automatic parallelization and resource optimization to scale
economically
Enterprise-class security and platform management
Simplified Experience
Up and running in hours
Minimal ongoing administration
Standard interfaces to best-of-breed analytics, BI, and data integration
tools
Built-in analytics capabilities allow users to derive insight from data
quickly
Easy connectivity to other Big Data Platform components
21. Optimized exclusively for analytic data workloads
System for Analytics
Delivering data services
for analytics
Speed
10-100x faster than traditional custom systems*
Patented MPP hardware acceleration
(Massively Parallel Processing)
Simplicity
Data load ready in hours
No database indexes
No tuning
No storage administration
Scalability
Peta-scale data capacity
Smart
Designed to run complex analytics in minutes,
not hours
Richest set of in-database analytics
* Based on IBM customers' reported results. "Traditional custom systems" refers to systems that are not professionally pre-built,
pre-tested and optimized. Individual results may vary.
22. Move analytics into the data warehouse
Integrate the server, storage, and database into one optimized package.
Move complex analytics into the database.
The result: integrated, high-performance analytics within the data warehouse.
[Diagram: a stack of layers: Analytics, Database, Storage, Server]
25. Simplicity and Ease of Administration
No dbspace/tablespace sizing and configuration
No redo/physical/logical log sizing and configuration
No page/block sizing and configuration for tables
No extent sizing and configuration for tables
No temp space allocation and monitoring
No logical volume creation for files
No integration of OS kernel recommendations
No maintenance of OS-recommended patch levels
Data Experts, not Database Experts
Easy administration portal
No software installation
No indexes and tuning
No storage administration
ORACLE
CREATE TABLE "MRDWDDM"."RDWF_DDM_ROOMS_SOLD" ("ID_PROPERTY" NUMBER(5, 0)
NOT NULL ENABLE, "ID_DATE_STAY" NUMBER(5, 0) NOT NULL ENABLE,
"CD_ROOM_POOL" CHAR(4) NOT NULL ENABLE, "CD_RATE_PGM" CHAR(4) NOT NULL
ENABLE, "CD_RATE_TYPE" CHAR(1) NOT NULL ENABLE, "CD_MARKET_SEGMENT"
CHAR(2) NOT NULL ENABLE, "ID_CONFO_NUM_ORIG" NUMBER(9, 0) NOT NULL ENABLE,
"ID_CONFO_NUM_CUR" NUMBER(9, 0) NOT NULL ENABLE, "ID_DATE_CREATE"
NUMBER(5, 0) NOT NULL ENABLE, "ID_DATE_ARRIVAL" NUMBER(5, 0) NOT NULL
ENABLE, "ID_DATE_DEPART" NUMBER(5, 0) NOT NULL ENABLE, "QY_ROOMS"
NUMBER(5, 0) NOT NULL ENABLE, "CU_REV_PROJ_NET_LOCAL" NUMBER(21, 3) NOT
NULL ENABLE, "CU_REV_PROJ_NET_USD" NUMBER(21, 3) NOT NULL ENABLE,
"QY_DAYS_STAY_CUR" NUMBER(3, 0) NOT NULL ENABLE, "CD_BOOK_SOURCE" CHAR(1)
NOT NULL ENABLE) PCTFREE 5 PCTUSED 95 INITRANS 4 MAXTRANS 255
STORAGE( FREELISTS 6) TABLESPACE "DDM_ROOMS_SOLD_DATA" NOLOGGING
PARTITION BY RANGE ("ID_PROPERTY" ) (PARTITION "PART1" VALUES LESS THAN
(600) PCTFREE 5 PCTUSED 95 INITRANS 4 MAXTRANS 255 STORAGE(INITIAL
16777216 FREELISTS 6 FREELIST GROUPS 1) TABLESPACE "DDM_ROOMS_SOLD_DATA"
NOLOGGING NOCOMPRESS, PARTITION "PART2" VALUES LESS THAN (1200) PCTFREE 5
PCTUSED 95 INITRANS 4 MAXTRANS 255 STORAGE(INITIAL 16777216 FREELISTS 6
FREELIST GROUPS 1) TABLESPACE "DDM_ROOMS_SOLD_DATA" NOLOGGING NOCOMPRESS,
PARTITION "PART3" VALUES LESS THAN (1800) PCTFREE 5 PCTUSED 95 INITRANS 4
MAXTRANS 255 STORAGE(INITIAL 16777216 FREELISTS 6 FREELIST GROUPS 1)
TABLESPACE "DDM_ROOMS_SOLD_DATA" NOLOGGING NOCOMPRESS, PARTITION "PART4"
VALUES
ORACLE Indexes
CREATE INDEX "MRDWDDM"."RDWF_DDM_ROOMS_SOLD_IDX1" ON "RDWF_DDM_ROOMS_SOLD"
("ID_PROPERTY" , "ID_DATE_STAY" , "CD_ROOM_POOL" , "CD_RATE_PGM" ,
"CD_RATE_TYPE" , "CD_MARKET_SEGMENT" ) PCTFREE 10 INITRANS 6 MAXTRANS 255
STORAGE( FREELISTS 10) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING
PARALLEL ( DEGREE 4 INSTANCES 1) LOCAL(PARTITION "PART1" PCTFREE 10
INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1
MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL
DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING, PARTITION "PART2"
PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840
MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS
1 BUFFER_POOL DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING,
PARTITION "PART3" PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL
4194304 NEXT 4259840 MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0
FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE
"DDM_DATAMART_INDEX_L" NOLOGGING, PARTITION "PART4" PCTFREE 10 INITRANS 6
MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1 MAXEXTENTS
100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING, PARTITION "PART5" PCTFREE 10
INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840 MINEXTENTS 1
MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS 1 BUFFER_POOL
DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING, PARTITION "PART6"
PCTFREE 10 INITRANS 6 MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4259840
MINEXTENTS 1 MAXEXTENTS 100000 PCTINCREASE 0 FREELISTS 10 FREELIST GROUPS
1 BUFFER_POOL DEFAULT) TABLESPACE "DDM_DATAMART_INDEX_L" NOLOGGING ) ;
ORACLE Bitmap index
CREATE BITMAP INDEX "CRDBO"."SNAPSHOT_MONTH_IDX13" ON
"SNAPSHOT_OPPTY_MONTH_HIST" ("SNAPSHOT_YEAR" ) PCTFREE 10 INITRANS 2
MAXTRANS 255 STORAGE(INITIAL 4194304 NEXT 4194304 MINEXTENTS 2 MAXEXTENTS
2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL
DEFAULT) TABLESPACE "SFA_DATAMART_INDEX" NOLOGGING ;
ORACLE Table Clusters
CREATE CLUSTER "MRDW"."CT_INTRMDRY_CAL" ("ID_YEAR_CAL" NUMBER(4, 0),
"ID_MONTH_CAL" NUMBER(2, 0), "ID_PROPERTY" NUMBER(5, 0)) SIZE 16384
PCTFREE 10 PCTUSED 90 INITRANS 3 MAXTRANS 255 STORAGE(INITIAL 83886080
NEXT 41943040 MINEXTENTS 1 MAXEXTENTS 1017 PCTINCREASE 0 FREELISTS 4
FREELIST GROUPS 1 BUFFER_POOL RECYCLE) TABLESPACE "TSS_FACT" ;
Netezza
CREATE TABLE MRDWDDM.RDWF_DDM_ROOMS_SOLD (
ID_PROPERTY numeric(5, 0) NOT NULL ,
ID_DATE_STAY integer NOT NULL ,
CD_ROOM_POOL CHAR(4) NOT NULL ,
CD_RATE_PGM CHAR(4) NOT NULL ,
CD_RATE_TYPE CHAR(1) NOT NULL ,
CD_MARKET_SEGMENT CHAR(2) NOT NULL ,
ID_CONFO_NUM_ORIG integer NOT NULL ,
ID_CONFO_NUM_CUR integer NOT NULL ,
ID_DATE_CREATE integer NOT NULL ,
ID_DATE_ARRIVAL integer NOT NULL ,
ID_DATE_DEPART integer NOT NULL ,
QY_ROOMS integer NOT NULL ,
CU_REV_PROJ_NET_LOCAL numeric(21, 3) NOT NULL ,
CU_REV_PROJ_NET_USD numeric(21, 3) NOT NULL ,
QY_DAYS_STAY_CUR smallint NOT NULL ,
CD_BOOK_SOURCE CHAR(1) NOT NULL)
distribute on random;
•No indexes
•No Physical Tuning/Admin
•Stripe data randomly, or by Columns
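The distribute on clause at the end of the Netezza DDL decides which data slice each row lands on. A rough sketch of the two strategies in Python (hypothetical; Netezza's actual hash function is internal to the appliance):

```python
import random

NUM_SLICES = 8  # e.g., the data slices owned by one S-Blade

def slice_for_row(row, distribute_on=None):
    """Pick a data slice: hash a distribution column if given, else random."""
    if distribute_on is not None:
        return hash(row[distribute_on]) % NUM_SLICES
    return random.randrange(NUM_SLICES)

row = {"ID_PROPERTY": 1234, "QY_ROOMS": 2}
# DISTRIBUTE ON (ID_PROPERTY): the same key always maps to the same slice,
# so joins on ID_PROPERTY can run slice-local without redistribution.
assert slice_for_row(row, "ID_PROPERTY") == slice_for_row(row, "ID_PROPERTY")
# DISTRIBUTE ON RANDOM: rows spread evenly regardless of key skew.
print(slice_for_row(row))
```

This is why converting a primary index to a distribution column (slide 9) is usually the only physical design decision carried over.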
28. Data In
Data Integration
Ab Initio
Cloudera
Composite Software
IBM Big Insights
IBM Information Server
IBM InfoSphere Streams
Informatica
Oracle Data Integrator
Oracle GoldenGate
SAP Business Objects
SQL ODBC JDBC OLE-DB
29. Reporting and Analysis
IBM Cognos
IBM SPSS
IBM Unica
Information Builders
Kalido
KXEN
Microsoft Excel
MicroStrategy
Oracle OBIEE
SAP Business Objects
SAS
Actuate
Data Out
SQL ODBC JDBC OLE-DB
34. Commodity CPU, NIC, disk
FPGA
Can do basic filtering in hardware,
i.e., stream processing before data hits main memory
35. The four key components that make up TwinFin are: SMP
hosts; snippet blades (called S-Blades); disk enclosures and a
network fabric.
The disk enclosures contain high-density, high-performance
disks.
Each disk contains a slice of the data in the database table,
along with a mirror of the data on another disk.
The storage arrays are connected to the S-Blades via high-speed
interconnects that allow all the disks to
simultaneously stream data to the S-Blades at the fastest
rate possible.
36. The SMP hosts are high-performance Linux servers that are set up in an
active-passive configuration for high-availability.
The active host presents a standardized interface to external tools and
applications, such as BI and ETL tools and load utilities.
It compiles SQL queries into executable code segments called snippets,
creates optimized query plans and distributes the snippets to the S-Blades
for execution.
37. S-Blades are intelligent processing nodes that make up the turbocharged
MPP engine of the appliance.
Each S-Blade is an independent server that contains powerful multi-core
CPUs, Netezza's unique multi-engine FPGAs and gigabytes of RAM--all
balanced and working concurrently to deliver peak performance.
FPGAs are commodity chips that are designed to process data streams
at extremely fast rates.
[Diagram: an S-Blade pairs an IBM BladeCenter server (Intel quad-core CPU, DRAM, SAS expander module) with a Netezza DB Accelerator (dual-core FPGA, SAS expander module)]
40. Netezza uses the FPGA to do front-line processing, filtering data
from disk and applying additional logic before passing it to memory on
the SPU. The main advantages for data processing:
Parallelism and processing power are shifted away from the CPU.
An FPGA has similar dimensions to a CPU, consumes about 5 times
less power, and runs at about one-fifth the clock speed.
Filtering out unnecessary data.
Low latency, high throughput.
More caching capability.
41. Netezza is the first company to leverage the power
of FPGA to process streaming data in a data
warehouse appliance.
In traditional systems, all the data for a query is
moved and then the “where” clause is processed.
With Netezza, instead of moving a huge set of data,
the FPGA processes the “where” clause as data
streams off the disk, so only the data needed for
processing is moved to the next step.
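The contrast above can be sketched as two pipelines, with a Python generator standing in for the FPGA that filters rows as they stream off disk. The table of (id, amount) rows is hypothetical:

```python
# Rows as they stream off disk (hypothetical table of (id, amount)).
disk_rows = [(i, i * 10) for i in range(1_000)]

def traditional(rows):
    """Traditional: move ALL rows into memory, then apply the WHERE clause."""
    in_memory = list(rows)                       # full data movement first
    return [r for r in in_memory if r[1] > 9_900]

def fpga_style(rows):
    """Netezza-style: apply the WHERE clause in the stream, so only
    qualifying rows ever move to the next processing step."""
    return list(r for r in rows if r[1] > 9_900)

# Same answer, but the streaming version never materializes the
# non-qualifying rows as an intermediate copy.
assert traditional(disk_rows) == fpga_style(disk_rows)
```

In the toy example only 9 of 1,000 rows survive the predicate, which is the whole point: the expensive data movement scales with the result, not the table.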
42. As discussed earlier, each disk in the appliance is
partitioned into primary, mirror and temp or swap
partitions.
The primary partition on each disk stores user data such as database
tables. The mirror partition stores a copy of the primary partition of
another disk, so that it can be used in the event of a disk failure.
The temp/swap partition stores data temporarily, for example when the
appliance redistributes data while processing queries.
43. The logical representation of the data saved in the primary
partition of each disk is called the data slice.
When users create database tables and load data into them, the rows
are distributed across the available data slices.
The logical representation of a data slice is called a data
partition.
In TwinFin systems, each S-Blade (SPU) is connected to 8 data
partitions, though some are connected to only 6 (since some disks are
reserved for failover).
In situations such as SPU failures, an SPU can have more than 8
partitions attached to it, since it is assigned some of the data
partitions from the failed SPU.
44. The SPU 1001 is connected to 8 data partitions numbered 0 to 7.
Each data partition is connected to one data slice stored on different
disks.
For example, data partition 0 points to data slice 17 stored on the
disk with ID 1063.
Disk 1063 also stores the mirror of data slice 18, whose primary copy
is on disk 1064.
The following diagram illustrates what happens when the disk 1070 fails.
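The partition-to-slice-to-disk mapping described above can be modeled as a small lookup table. Only partition 0 uses the IDs given on this slide; the other entries and the mirror pairing follow the same pattern but are invented to fill out the sketch:

```python
# SPU 1001: data partition -> (data slice, disk holding the primary copy).
# Partition 0 -> slice 17 on disk 1063 comes from the slide; the rest
# extend the pattern hypothetically across all 8 partitions.
spu_1001 = {part: (17 + part, 1063 + part) for part in range(8)}

# Disks mirror each other's slices in pairs (illustrative values).
mirror_of = {1063: 18, 1064: 17}  # disk -> the neighboring slice it mirrors

slice_id, disk_id = spu_1001[0]
assert (slice_id, disk_id) == (17, 1063)  # partition 0 -> slice 17, disk 1063
print(spu_1001)
```

Keeping the mapping as indirection (partition pointing at slice pointing at disk) is what lets the appliance repoint a partition to a different disk after a failure without the SPU's logical view changing.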
45. Immediately after disk 1070 stops responding,
disk 1069 is used by the system to satisfy
queries that require data from data slices 23 and 24.
Disk 1069 serves these requests using the data in
both its primary and mirror partitions.
In the meantime, the contents of disk 1070 are
regenerated on one of the spare disks in the disk
array (in this case, disk 1100) using the data on
disk 1069.
Once the regeneration is complete, SPU data partition 7
is updated to point to data slice 24 on disk 1100.
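The failover sequence above can be sketched step by step in Python. The disk, slice, and partition IDs follow the example on this slide; the data structures themselves are hypothetical:

```python
# Each disk: its primary slice, the slice it mirrors, and its health.
disks = {
    1069: {"primary": 23, "mirror": 24, "failed": False},
    1070: {"primary": 24, "mirror": 23, "failed": False},
    1100: {"primary": None, "mirror": None, "failed": False},  # spare
}
partition_to_disk = {6: 1069, 7: 1070}  # SPU data partitions -> disks

def serve(slice_id):
    """Find a live disk holding slice_id in its primary or mirror partition."""
    for disk_id, d in disks.items():
        if not d["failed"] and slice_id in (d["primary"], d["mirror"]):
            return disk_id
    raise RuntimeError("data slice unavailable")

disks[1070]["failed"] = True     # disk 1070 stops responding
assert serve(24) == 1069         # 1069 serves slice 24 from its mirror copy

# Regen: rebuild 1070's contents on spare disk 1100 from 1069's copies,
# then repoint SPU data partition 7 at the rebuilt slice.
disks[1100].update(primary=24, mirror=23)
partition_to_disk[7] = 1100
```

Queries keep running throughout: during the regen, slice 24 is simply served from the surviving mirror.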
46. When an SPU fails, the appliance assigns all of
its data partitions to other SPUs in the
system.
Each pair of disks that holds the mirror copies of each
other's data slices is assigned to another SPU,
resulting in two additional data partitions
managed by the target SPU.
For example, if an SPU currently manages data
partitions 0 to 7 and the appliance reassigns two
data partitions from a failed SPU, the SPU will then
manage 10 data partitions, numbered 0 to 9.
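The reassignment and renumbering described above can be sketched as follows (hypothetical Python; the slice IDs are invented, the 8-to-10 partition count follows the slide's example):

```python
def reassign(surviving_partitions, failed_spu_partitions):
    """Give a failed SPU's partitions to a survivor, renumbering 0..N-1."""
    combined = surviving_partitions + failed_spu_partitions
    # The survivor now sees one contiguous range of data partitions.
    return {new_id: slice_id for new_id, slice_id in enumerate(combined)}

# Survivor manages the slices behind partitions 0-7; two more arrive
# from a failed SPU (a mirrored disk pair), per the example above.
survivor = [17, 18, 19, 20, 21, 22, 23, 24]  # data slices, illustrative
from_failed = [41, 42]
new_map = reassign(survivor, from_failed)
assert sorted(new_map) == list(range(10))    # now numbered 0 to 9
```

The renumbering is purely logical: no data moves, only the partition-to-slice mapping grows.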
47. Speed: hardware-based data streaming.
Scalability: true MPP offers enterprise scale-out.
Simple: black-box appliance with no tuning or storage administration.
Smart: built-in advanced analytics pushed deep into the database.
[Table: competitor ratings against the four attributes (NO NO NO NO; NO YES NO LIMITED)]
48. Teradata | Results In | IBM Netezza Client Advantage
Costs
Teradata: high initial cost; lots of professional services; lots of administration.
Results in: high cost of ownership.
Netezza advantage: low initial cost; little administration; low total cost of ownership.
Smart
Teradata: limited analytics pushdown; analytics causes resource contention.
Results in: poor analytic performance.
Netezza advantage: minimal contention due to analytics; more customers benefit from faster analytics.
Simplicity
Teradata: constant tuning for performance; needs much administration.
Results in: difficult and slow to provide business value.
Netezza advantage: true appliance; no tuning; faster time to value.
Speed
Teradata: old, inefficient legacy code; complex workload partitions.
Results in: data warehouse performance doesn't scale consistently.
Netezza advantage: designed for balance; highest and most consistent data warehouse and advanced analytics performance.
Architecture
Teradata: proprietary interconnect; virtualized MPP nodes (vAMPs); separate compute and storage.
Results in: unpredictable performance.
Netezza advantage: true MPP; FPGA acceleration; best architecture for data warehouse and advanced analytics.
49. Oracle Exadata | Results In | IBM Netezza Client Advantage
Costs
Exadata: high initial cost; lots of administration.
Results in: high total cost of ownership.
Netezza advantage: low initial cost; little administration; low total cost of ownership.
Smart
Exadata: limited analytics pushdown; inefficiency of Oracle Real Application Clusters (RAC).
Results in: poor analytic performance.
Netezza advantage: extensive analytics pushdown capabilities; fast time to insight; more users benefit from faster analytics.
Simplicity
Exadata: complexity of Oracle RAC; constant tuning for performance; complex patch process.
Results in: complex administration.
Netezza advantage: true appliance; no tuning; faster time to value.
Scalability
Exadata: no proof points on scaling; RAC scalability bottleneck.
Results in: business growth risk.
Netezza advantage: proven scalability; business growth with confidence.
Speed
Exadata: designed for OLTP; RAC is inefficient for data warehouse workloads.
Results in: poor data warehouse performance.
Netezza advantage: designed for data warehousing; highest data warehouse performance.
Architecture
Exadata: clustered SMP database layer plus shared-disk MPP storage layer.
Results in: compromised performance.
Netezza advantage: true MPP; FPGA acceleration; best architecture for data warehousing and advanced analytics.