SlideShare a Scribd company logo
1 of 26
Daniel Abadi -- Yale University
Column-Stores vs. Row-Stores:
How Different Are They
Really?
Daniel Abadi (Yale),
Samuel Madden (MIT),
Nabil Hachem (AvantGarde Consulting)
June 12th
, 2008
Row vs. Column-Stores
Last
Name
First
Name
E-
m
ai
l
Phone
#
Street
Address
Last
Name
First
Name E-mail Phone #
Street
Address
Row-Store Column-Store
− Might read in
unnecessary data
+ Only need to read
in relevant data
+ Easy to add a new record
− Tuple writes might require
multiple seeks
Column-Stores
• Really good for read-mostly data
warehouses
 Lot’s of column scans and aggregations
 Writes tend to be in batch
 [CK85], [SAB+05], [ZBN+05], [HLA+06],
[SBC+07] all verify this
 Top 3 in TPC-H rankings (Exasol, ParAccel,
and Kickfire) are column-stores
 Factor of 5 faster on performance
 Factor of 2 superior on price/performance
Data Warehouse DBMS
Software
• $4.5 billion industry (out of total $16 billion
DBMS software industry)
• Growing 10% annually
Momentum
• Right solution for growing market  $$$$
• Vertica, ParAccel, Kickfire, Calpont,
Infobright, and Exasol new entrants
• Sybase IQ’s profits rapidly increasing
• Yahoo’s world largest (multi-petabyte)
data warehouse is a column-store (from
Mahat Technologies acquisition)
Paper Looks At Key Question
• How much of the buzz around column-
stores just marketing hype?
 Do you really need to buy Sybase
IQ/Vertica/ParAccel?
 How far will your current row-store take you?
 Can you get column-store performance from a row-
store?
 Can you simulate a column-store in a row-store
Paper Methodology
• Comparing row-store vs. column-store is
dangerous/borderline meaningless
• Instead, compare row-store vs. row-store
and column-store vs. column-store
 Simulate a column-store inside of a row-store
 Remove column-oriented features from
column-store until it behaves like a row-store
Simulate Column-Store
Inside Row-Store
Last
Name
First
Name
E-
m
ai
l
Phone
#
Street
Address
Last
Name
First
Name E-mail
1
2
3
1
2
3
1
2
3
Option A:
Vertical
Partitioning
…
Option B:
Index Every
Column
Last Name
Index
First Name Index
Star Schema Benchmark
• Fact table contains 17 columns and
60,000,000 rows
• 4 dimension tables, biggest one has
80,000 rows
• Queries touch 3-4 foreign keys in fact
table, 1-2 numeric columns
SSBM Averages
0.0
50.0
100.0
150.0
200.0
250.0
Time(seconds)
Average 25.7 79.9 221.2
Normal Row-Store
Vertically Partitioned
Row-Store
Row-Store With All
Indexes
What’s Going On?
• Vertically Partitioned Case
 Tuple Sizes
 Horizontal Partitioning
• All Indexes Case
 Tuple Reconstruction
Star Schema Benchmark
• Fact table contains 17 columns and
60,000,000 rows
• 4 dimension tables, biggest one has
80,000 rows
• Queries touch 3-4 foreign keys in fact
table, 1-2 numeric columns
Tuple Size
1
2
3
Column
Data
TID
1
2
3
TID Column
Data
1
2
3
TID Column
Data
Tuple
Header
•Queries touch 3-4 foreign keys in fact table, 1-2 numeric
columns
•Complete fact table takes up ~4 GB (compressed)
•Vertically partitioned tables take up 0.7-1.1 GB
(compressed)
Horizontal Partitioning
• Fact table horizontally partitioned on year
 Year is an element of the ‘Date’ dimension
table
 Most queries in SSBM have a predicate on
year
 Since vertically partitioned tables do not
contain the ‘Date’ foreign key, row-store could
not similarly partition them
What’s Going On?
• Vertically Partitioned Case
 Tuple Sizes
 Horizontal Partitioning
• All Indexes Case
 Tuple Construction
Tuple Construction
• Pretty much all queries require a column
to be extracted (in the SELECT clause)
that has not yet been accessed, e.g.:
 SELECT store_name, SUM(revenue)
FROM Facts, Stores
WHERE fact.store_id = stores.store_id
AND stores.area = “NEW ENGLAND”
GROUP BY store_name
Tuple Construction
• Result of lower part of query plan is a set
of TIDs that passed all predicates
• Need to extract SELECT attributes at
these TIDs
 BUT: index maps value to TID
 You really want to map TID to value (i.e., a
vertical partition)
  Tuple construction is SLOW
So….
• All indexes approach is pretty obviously a
poor way to simulate a column-store
• Problems with vertical partitioning are
NOT fundamental
 Store tuple header in a separate partition
 Allow virtual TIDs
 Allow HP using a foreign key on a different VP
• So can row-stores simulate column-
stores?
Row-Store vs. Column-Store
0.0
5.0
10.0
15.0
20.0
25.0
30.0
Time(seconds)
Average 25.7 11.7 4.4
Row-Store Row-Store (M V) C-Store
Column-Store Experiments
• Start with column-store (C-Store)
• Remove column-store-specific
performance optimizations
• End with column-store that behaves like a
row-store
Compression
• Higher data value locality
in column-stores
 Better ratio  reduced I/O
• Can use schemes like
run-length encoding
 Easy to operate on directly
for improved performance
([AMF06])
Q1
Q1
Q1
Q1
Q1
Q1
Q1
Q2
Q2
Q2
Q2
…
…
Quarter
(Q1, 1, 300)
Quarter
(Q2, 301, 350)
(Q3, 651, 500)
(Q4, 1151, 600)
• Early Materialization:
create rows first. But:
 Poor memory bandwidth
utilization
 Lose opportunity for
vectorized operation
2
1
3
1
2
3
3
3
7
13
42
80
Construct
2
3
3
3
7
13
42
80
Select + Aggregate
2
1
3
1
4
4
4
4
prodID storeIDcustID price
QUERY:
SELECT custID,SUM(price)
FROM table
WHERE (prodID = 4) AND
(storeID = 1) AND
GROUP BY custID
Early vs. Late Materialization
4
4
4
4
Other Column-Store
Optimizations
• Invisible join
 Column-store specific join
 Optimizations for star schemas
 Similar to a semi-join
• Block Processing
Simplified Version of Results
0.0
10.0
20.0
30.0
40.0
50.0
Time(seconds)
Average 4.4 14.9 40.7
Original C-Store
C-Store,No
Compression
C-Store,Early
Materialization
Conclusion
• Might be possible to simulate a row-store
in a column-store, BUT:
 Need better support for vertical partitioning at
the storage layer
 Need support for column-specific
optimizations at the executer level
• Working with HP-Labs to find out
Come Join the Yale DB
Group!

More Related Content

What's hot

Mobilenetv1 v2 slide
Mobilenetv1 v2 slideMobilenetv1 v2 slide
Mobilenetv1 v2 slide威智 黃
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning Melaku Eneayehu
 
[DL輪読会]Unsupervised Cross-Domain Image Generation
[DL輪読会]Unsupervised Cross-Domain Image Generation[DL輪読会]Unsupervised Cross-Domain Image Generation
[DL輪読会]Unsupervised Cross-Domain Image GenerationDeep Learning JP
 
kaggle NFL 1st and Future - Impact Detection
kaggle NFL 1st and Future - Impact Detectionkaggle NFL 1st and Future - Impact Detection
kaggle NFL 1st and Future - Impact DetectionKazuyuki Miyazawa
 
Graph neural networks overview
Graph neural networks overviewGraph neural networks overview
Graph neural networks overviewRodion Kiryukhin
 
Noise2Score: Tweedie’s Approach to Self-Supervised Image Denoising without Cl...
Noise2Score: Tweedie’s Approach to Self-Supervised Image Denoising without Cl...Noise2Score: Tweedie’s Approach to Self-Supervised Image Denoising without Cl...
Noise2Score: Tweedie’s Approach to Self-Supervised Image Denoising without Cl...KwanyoungKim7
 
Speaker Recognition using Gaussian Mixture Model
Speaker Recognition using Gaussian Mixture Model Speaker Recognition using Gaussian Mixture Model
Speaker Recognition using Gaussian Mixture Model Saurab Dulal
 
Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithmKIRAN R
 
Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Appsilon Data Science
 
ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to ...
ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to ...ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to ...
ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to ...宏毅 李
 
DEEP LEARNING、トレーニング・インファレンスのGPUによる高速化
DEEP LEARNING、トレーニング・インファレンスのGPUによる高速化DEEP LEARNING、トレーニング・インファレンスのGPUによる高速化
DEEP LEARNING、トレーニング・インファレンスのGPUによる高速化RCCSRENKEI
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Preferred Networks
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learningKien Le
 
Pso introduction
Pso introductionPso introduction
Pso introductionrutika12345
 
YOLOv4: optimal speed and accuracy of object detection review
YOLOv4: optimal speed and accuracy of object detection reviewYOLOv4: optimal speed and accuracy of object detection review
YOLOv4: optimal speed and accuracy of object detection reviewLEE HOSEONG
 
TalkingData AdTracking Fraud Detection Challenge (1st place solution)
TalkingData AdTracking  Fraud Detection Challenge (1st place solution)TalkingData AdTracking  Fraud Detection Challenge (1st place solution)
TalkingData AdTracking Fraud Detection Challenge (1st place solution)Takanori Hayashi
 

What's hot (20)

Mobilenetv1 v2 slide
Mobilenetv1 v2 slideMobilenetv1 v2 slide
Mobilenetv1 v2 slide
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning
 
MobileNet V3
MobileNet V3MobileNet V3
MobileNet V3
 
[DL輪読会]Unsupervised Cross-Domain Image Generation
[DL輪読会]Unsupervised Cross-Domain Image Generation[DL輪読会]Unsupervised Cross-Domain Image Generation
[DL輪読会]Unsupervised Cross-Domain Image Generation
 
MobileViTv1
MobileViTv1MobileViTv1
MobileViTv1
 
kaggle NFL 1st and Future - Impact Detection
kaggle NFL 1st and Future - Impact Detectionkaggle NFL 1st and Future - Impact Detection
kaggle NFL 1st and Future - Impact Detection
 
Graph neural networks overview
Graph neural networks overviewGraph neural networks overview
Graph neural networks overview
 
Noise2Score: Tweedie’s Approach to Self-Supervised Image Denoising without Cl...
Noise2Score: Tweedie’s Approach to Self-Supervised Image Denoising without Cl...Noise2Score: Tweedie’s Approach to Self-Supervised Image Denoising without Cl...
Noise2Score: Tweedie’s Approach to Self-Supervised Image Denoising without Cl...
 
Speaker Recognition using Gaussian Mixture Model
Speaker Recognition using Gaussian Mixture Model Speaker Recognition using Gaussian Mixture Model
Speaker Recognition using Gaussian Mixture Model
 
Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithm
 
Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)Introduction to Generative Adversarial Networks (GANs)
Introduction to Generative Adversarial Networks (GANs)
 
ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to ...
ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to ...ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to ...
ICASSP 2018 Tutorial: Generative Adversarial Network and its Applications to ...
 
DEEP LEARNING、トレーニング・インファレンスのGPUによる高速化
DEEP LEARNING、トレーニング・インファレンスのGPUによる高速化DEEP LEARNING、トレーニング・インファレンスのGPUによる高速化
DEEP LEARNING、トレーニング・インファレンスのGPUによる高速化
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
Pso introduction
Pso introductionPso introduction
Pso introduction
 
YOLOv4: optimal speed and accuracy of object detection review
YOLOv4: optimal speed and accuracy of object detection reviewYOLOv4: optimal speed and accuracy of object detection review
YOLOv4: optimal speed and accuracy of object detection review
 
TalkingData AdTracking Fraud Detection Challenge (1st place solution)
TalkingData AdTracking  Fraud Detection Challenge (1st place solution)TalkingData AdTracking  Fraud Detection Challenge (1st place solution)
TalkingData AdTracking Fraud Detection Challenge (1st place solution)
 
Mask R-CNN
Mask R-CNNMask R-CNN
Mask R-CNN
 

Viewers also liked

Row or Columnar Database
Row or Columnar DatabaseRow or Columnar Database
Row or Columnar DatabaseBiju Nair
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databasesArangoDB Database
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresDaniel Abadi
 
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaBiju Nair
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload managementBiju Nair
 
Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceBiju Nair
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developersBiju Nair
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
Rise of Column Oriented Database
Rise of Column Oriented DatabaseRise of Column Oriented Database
Rise of Column Oriented DatabaseSuvradeep Rudra
 
Hbase: Introduction to column oriented databases
Hbase: Introduction to column oriented databasesHbase: Introduction to column oriented databases
Hbase: Introduction to column oriented databasesLuis Cipriani
 
MonetDB :column-store approach in database
MonetDB :column-store approach in databaseMonetDB :column-store approach in database
MonetDB :column-store approach in databaseNikhil Patteri
 
Efficient transaction processing in sap hana
Efficient transaction processing in sap hanaEfficient transaction processing in sap hana
Efficient transaction processing in sap hanaMysa Vijay
 
Conquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard queryConquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard queryJustin Swanhart
 
Beckman abadi-5min-pres
Beckman abadi-5min-presBeckman abadi-5min-pres
Beckman abadi-5min-presDaniel Abadi
 
Daniel Abadi: VLDB 2009 Panel
Daniel Abadi: VLDB 2009 PanelDaniel Abadi: VLDB 2009 Panel
Daniel Abadi: VLDB 2009 PanelDaniel Abadi
 

Viewers also liked (20)

Intro to column stores
Intro to column storesIntro to column stores
Intro to column stores
 
Row or Columnar Database
Row or Columnar DatabaseRow or Columnar Database
Row or Columnar Database
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
 
VLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-StoresVLDB 2009 Tutorial on Column-Stores
VLDB 2009 Tutorial on Column-Stores
 
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezza
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
 
Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
Rise of Column Oriented Database
Rise of Column Oriented DatabaseRise of Column Oriented Database
Rise of Column Oriented Database
 
Hbase: Introduction to column oriented databases
Hbase: Introduction to column oriented databasesHbase: Introduction to column oriented databases
Hbase: Introduction to column oriented databases
 
Data mining & column stores
Data mining & column storesData mining & column stores
Data mining & column stores
 
Session initiation protocol
Session initiation protocolSession initiation protocol
Session initiation protocol
 
MonetDB :column-store approach in database
MonetDB :column-store approach in databaseMonetDB :column-store approach in database
MonetDB :column-store approach in database
 
Efficient transaction processing in sap hana
Efficient transaction processing in sap hanaEfficient transaction processing in sap hana
Efficient transaction processing in sap hana
 
Conquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard queryConquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard query
 
Invisible loading
Invisible loadingInvisible loading
Invisible loading
 
Beckman abadi-5min-pres
Beckman abadi-5min-presBeckman abadi-5min-pres
Beckman abadi-5min-pres
 
Daniel Abadi: VLDB 2009 Panel
Daniel Abadi: VLDB 2009 PanelDaniel Abadi: VLDB 2009 Panel
Daniel Abadi: VLDB 2009 Panel
 

Similar to Column-Stores vs. Row-Stores: How Different are they Really?

2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_uploadProf. Wim Van Criekinge
 
BI Knowledge Sharing Session 2
BI Knowledge Sharing Session 2BI Knowledge Sharing Session 2
BI Knowledge Sharing Session 2Kelvin Chan
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentationMichael Keane
 
Tuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseTuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseAnil Gupta
 
Sloupcové uložení dat a použití in-memory technologií u řešení Exadata
Sloupcové uložení dat a použití in-memory technologií u řešení ExadataSloupcové uložení dat a použití in-memory technologií u řešení Exadata
Sloupcové uložení dat a použití in-memory technologií u řešení ExadataMarketingArrowECS_CZ
 
SQL Explore 2012 - Michael Zilberstein: ColumnStore
SQL Explore 2012 - Michael Zilberstein: ColumnStoreSQL Explore 2012 - Michael Zilberstein: ColumnStore
SQL Explore 2012 - Michael Zilberstein: ColumnStoresqlserver.co.il
 
Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)Julien SIMON
 
Deep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceDeep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceAmazon Web Services
 
FOUNDATION OF DATA SCIENCE SQL QUESTIONS
FOUNDATION OF DATA SCIENCE SQL QUESTIONSFOUNDATION OF DATA SCIENCE SQL QUESTIONS
FOUNDATION OF DATA SCIENCE SQL QUESTIONSHITIKAJAIN4
 
Building better SQL Server Databases
Building better SQL Server DatabasesBuilding better SQL Server Databases
Building better SQL Server DatabasesColdFusionConference
 
IT301-Datawarehousing (1) and its sub topics.pptx
IT301-Datawarehousing (1) and its sub topics.pptxIT301-Datawarehousing (1) and its sub topics.pptx
IT301-Datawarehousing (1) and its sub topics.pptxReneeClintGortifacio
 

Similar to Column-Stores vs. Row-Stores: How Different are they Really? (20)

2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload
 
2017 biological databasespart2
2017 biological databasespart22017 biological databasespart2
2017 biological databasespart2
 
2016 02 23_biological_databases_part2
2016 02 23_biological_databases_part22016 02 23_biological_databases_part2
2016 02 23_biological_databases_part2
 
Sql rally 2013 columnstore indexes
Sql rally 2013   columnstore indexesSql rally 2013   columnstore indexes
Sql rally 2013 columnstore indexes
 
BI Knowledge Sharing Session 2
BI Knowledge Sharing Session 2BI Knowledge Sharing Session 2
BI Knowledge Sharing Session 2
 
MySql
MySqlMySql
MySql
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentation
 
Tuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseTuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBase
 
Sloupcové uložení dat a použití in-memory technologií u řešení Exadata
Sloupcové uložení dat a použití in-memory technologií u řešení ExadataSloupcové uložení dat a použití in-memory technologií u řešení Exadata
Sloupcové uložení dat a použití in-memory technologií u řešení Exadata
 
Excel Tips 101
Excel Tips 101Excel Tips 101
Excel Tips 101
 
Excel tips
Excel tipsExcel tips
Excel tips
 
SQL Explore 2012 - Michael Zilberstein: ColumnStore
SQL Explore 2012 - Michael Zilberstein: ColumnStoreSQL Explore 2012 - Michael Zilberstein: ColumnStore
SQL Explore 2012 - Michael Zilberstein: ColumnStore
 
Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)Deep Dive: Amazon Redshift (March 2017)
Deep Dive: Amazon Redshift (March 2017)
 
Deep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performanceDeep Dive Redshift, with a focus on performance
Deep Dive Redshift, with a focus on performance
 
Tunning overview
Tunning overviewTunning overview
Tunning overview
 
FOUNDATION OF DATA SCIENCE SQL QUESTIONS
FOUNDATION OF DATA SCIENCE SQL QUESTIONSFOUNDATION OF DATA SCIENCE SQL QUESTIONS
FOUNDATION OF DATA SCIENCE SQL QUESTIONS
 
Excel Tips.pptx
Excel Tips.pptxExcel Tips.pptx
Excel Tips.pptx
 
Building better SQL Server Databases
Building better SQL Server DatabasesBuilding better SQL Server Databases
Building better SQL Server Databases
 
Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive
 
IT301-Datawarehousing (1) and its sub topics.pptx
IT301-Datawarehousing (1) and its sub topics.pptxIT301-Datawarehousing (1) and its sub topics.pptx
IT301-Datawarehousing (1) and its sub topics.pptx
 

More from Daniel Abadi

Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs Daniel Abadi
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
The Power of Determinism in Database Systems
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database SystemsDaniel Abadi
 
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...Daniel Abadi
 
Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Daniel Abadi
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Daniel Abadi
 
Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesDaniel Abadi
 
CAP, PACELC, and Determinism
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and DeterminismDaniel Abadi
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi
 

More from Daniel Abadi (9)

Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs Leopard: Lightweight Partitioning and Replication  for Dynamic Graphs
Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
The Power of Determinism in Database Systems
The Power of Determinism in Database SystemsThe Power of Determinism in Database Systems
The Power of Determinism in Database Systems
 
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
From HadoopDB to Hadapt: A Case Study of Transitioning a VLDB paper into Real...
 
Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13
 
Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012Boston Hadoop Meetup, April 26 2012
Boston Hadoop Meetup, April 26 2012
 
Hadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and OpportunitiesHadoop and Graph Data Management: Challenges and Opportunities
Hadoop and Graph Data Management: Challenges and Opportunities
 
CAP, PACELC, and Determinism
CAP, PACELC, and DeterminismCAP, PACELC, and Determinism
CAP, PACELC, and Determinism
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010
 

Recently uploaded

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Column-Stores vs. Row-Stores: How Different are they Really?

  • 1. Daniel Abadi -- Yale University Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12th , 2008
  • 2. Row vs. Column-Stores Last Name First Name E- m ai l Phone # Street Address Last Name First Name E-mail Phone # Street Address Row-Store Column-Store − Might read in unnecessary data + Only need to read in relevant data + Easy to add a new record − Tuple writes might require multiple seeks
  • 3. Column-Stores • Really good for read-mostly data warehouses  Lot’s of column scans and aggregations  Writes tend to be in batch  [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this  Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores  Factor of 5 faster on performance  Factor of 2 superior on price/performance
  • 4. Data Warehouse DBMS Software • $4.5 billion industry (out of total $16 billion DBMS software industry) • Growing 10% annually
  • 5. Momentum • Right solution for growing market  $$$$ • Vertica, ParAccel, Kickfire, Calpont, Infobright, and Exasol new entrants • Sybase IQ’s profits rapidly increasing • Yahoo’s world largest (multi-petabyte) data warehouse is a column-store (from Mahat Technologies acquisition)
  • 6. Paper Looks At Key Question • How much of the buzz around column- stores just marketing hype?  Do you really need to buy Sybase IQ/Vertica/ParAccel?  How far will your current row-store take you?  Can you get column-store performance from a row- store?  Can you simulate a column-store in a row-store
  • 7. Paper Methodology • Comparing row-store vs. column-store is dangerous/borderline meaningless • Instead, compare row-store vs. row-store and column-store vs. column-store  Simulate a column-store inside of a row-store  Remove column-oriented features from column-store until it behaves like a row-store
  • 8. Simulate Column-Store Inside Row-Store Last Name First Name E- m ai l Phone # Street Address Last Name First Name E-mail 1 2 3 1 2 3 1 2 3 Option A: Vertical Partitioning … Option B: Index Every Column Last Name Index First Name Index
  • 9. Star Schema Benchmark • Fact table contains 17 columns and 60,000,000 rows • 4 dimension tables, biggest one has 80,000 rows • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
  • 10. SSBM Averages 0.0 50.0 100.0 150.0 200.0 250.0 Time(seconds) Average 25.7 79.9 221.2 Normal Row-Store Vertically Partitioned Row-Store Row-Store With All Indexes
  • 11. What’s Going On? • Vertically Partitioned Case  Tuple Sizes  Horizontal Partitioning • All Indexes Case  Tuple Reconstruction
  • 12. Star Schema Benchmark • Fact table contains 17 columns and 60,000,000 rows • 4 dimension tables, biggest one has 80,000 rows • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
  • 13. Tuple Size 1 2 3 Column Data TID 1 2 3 TID Column Data 1 2 3 TID Column Data Tuple Header •Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns •Complete fact table takes up ~4 GB (compressed) •Vertically partitioned tables take up 0.7-1.1 GB (compressed)
  • 14. Horizontal Partitioning • Fact table horizontally partitioned on year  Year is an element of the ‘Date’ dimension table  Most queries in SSBM have a predicate on year  Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them
  • 15. What’s Going On? • Vertically Partitioned Case  Tuple Sizes  Horizontal Partitioning • All Indexes Case  Tuple Construction
  • 16. Tuple Construction • Pretty much all queries require a column to be extracted (in the SELECT clause) that has not yet been accessed, e.g.:  SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.area = “NEW ENGLAND” GROUP BY store_name
  • 17. Tuple Construction • Result of lower part of query plan is a set of TIDs that passed all predicates • Need to extract SELECT attributes at these TIDs  BUT: index maps value to TID  You really want to map TID to value (i.e., a vertical partition)   Tuple construction is SLOW
  • 18. So…. • All indexes approach is pretty obviously a poor way to simulate a column-store • Problems with vertical partitioning are NOT fundamental  Store tuple header in a separate partition  Allow virtual TIDs  Allow HP using a foreign key on a different VP • So can row-stores simulate column- stores?
  • 20. Column-Store Experiments • Start with column-store (C-Store) • Remove column-store-specific performance optimizations • End with column-store that behaves like a row-store
  • 21. Compression • Higher data value locality in column-stores  Better ratio  reduced I/O • Can use schemes like run-length encoding  Easy to operate on directly for improved performance ([AMF06]) Q1 Q1 Q1 Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 … … Quarter (Q1, 1, 300) Quarter (Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600)
  • 22. • Early Materialization: create rows first. But:  Poor memory bandwidth utilization  Lose opportunity for vectorized operation 2 1 3 1 2 3 3 3 7 13 42 80 Construct 2 3 3 3 7 13 42 80 Select + Aggregate 2 1 3 1 4 4 4 4 prodID storeIDcustID price QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID Early vs. Late Materialization 4 4 4 4
  • 23. Other Column-Store Optimizations • Invisible join  Column-store specific join  Optimizations for star schemas  Similar to a semi-join • Block Processing
  • 24. Simplified Version of Results 0.0 10.0 20.0 30.0 40.0 50.0 Time(seconds) Average 4.4 14.9 40.7 Original C-Store C-Store,No Compression C-Store,Early Materialization
  • 25. Conclusion • Might be possible to simulate a row-store in a column-store, BUT:  Need better support for vertical partitioning at the storage layer  Need support for column-specific optimizations at the executer level • Working with HP-Labs to find out
  • 26. Come Join the Yale DB Group!