From HadoopDB to Hadapt: A Case
Study of Transitioning a VLDB
paper into Real World Deployments
Daniel Abadi
Yale University
August 28th, 2013
Twitter: @daniel_abadi
Overview of Talk
Motivation for HadoopDB
Overview of HadoopDB
Overview of the commercialization process
Technical features missing from HadoopDB that
Hadapt needed to implement
What does this mean for tenure?
Situation in 2008
Hadoop starting to take off as a “Big Data”
processing platform
Parallel database startups such as
Netezza, Vertica, and Greenplum gaining
traction for “Big Data” analysis
2 Schools of Thought
– School 1: They are on a collision course
– School 2: They are complementary
technologies
From 10,000 feet Hadoop and Parallel
Database Systems are Quite Similar
Both are suitable for large-scale data
processing
– I.e. analytical processing workloads
– Bulk loads
– Not optimized for transactional workloads
– Queries over large amounts of data
– Both can handle both relational and nonrelational
queries (DBMS via UDFs)
SIGMOD 2009 Paper
Benchmarked Hadoop vs. 2 parallel
database systems
– Mostly focused on performance differences
– Measured differences in load and query time
for some common data processing tasks
– Used Web analytics benchmark whose goal
was to be representative of tasks that:
Both should excel at
Hadoop should excel at
Databases should excel at
Hardware Setup
100 node cluster
Each node
– 2.4 GHz Core 2 Duo Processors
– 4 GB RAM
– Two 250 GB SATA HDs (74 MB/sec sequential I/O)
Dual GigE switches, each with 50 nodes
– 128 Gbit/sec fabric
Connected by a 64 Gbit/sec ring
Join Task
[Figure: chart of time (seconds) on 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
UDF Task
[Figure: chart of time (seconds) on 10, 25, 50, and 100 nodes for DBMS and Hadoop]
DBMS clearly doesn’t scale
Calculate PageRank over a set of HTML documents
Performed via a UDF
Scalability
Except for UDFs all systems scale near
linearly
BUT: only ran on 100 nodes
As nodes approach 1000, other effects
come into play
– Faults go from being rare to not so rare
– It is nearly impossible to maintain
homogeneity at scale
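A rough illustration of why faults stop being rare as clusters grow. It assumes node failures are independent and uses a hypothetical 1% chance that any single node fails during a query; neither assumption comes from the benchmark.

```python
def prob_any_failure(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** nodes


if __name__ == "__main__":
    # p_node = 0.01 is an assumed per-node failure probability for one query.
    for n in (10, 100, 1000):
        print(f"{n:>5} nodes: {prob_any_failure(n, 0.01):.3f}")
```

Under these assumptions, a failure during a query is unlikely on 10 nodes but close to certain on 1000.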
Fault Tolerance and Cluster
Heterogeneity Results
[Figure: chart of percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS and Hadoop]
Database systems restart entire
query upon a single node
failure, and do not adapt if a
node is running slowly
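The difference between restarting a query and re-executing only the lost work can be sketched with a toy model. Everything here is an assumption for illustration: work is spread evenly over the nodes, one node fails, a restarting system throws away all progress, and a re-executing system spreads the failed node's whole share evenly over the survivors. It is not the methodology used in the benchmark.

```python
def restart_slowdown_pct(fail_at: float) -> float:
    """Extra time (%) when the whole query restarts after a failure.

    fail_at is the fraction of the query completed when the node fails;
    all of that progress is lost, so it is paid for twice.
    """
    return 100.0 * fail_at


def reexecute_slowdown_pct(nodes: int) -> float:
    """Extra time (%) when only the failed node's share is redone.

    The failed node's share (1/nodes of the work) is split evenly
    across the remaining nodes - 1 machines.
    """
    return 100.0 / (nodes - 1)


if __name__ == "__main__":
    print(restart_slowdown_pct(0.5))    # failure halfway through the query
    print(reexecute_slowdown_pct(11))   # 11-node cluster, one node lost
```

In this model the restart penalty depends on how late the failure happens, while the re-execution penalty shrinks as the cluster grows.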
Benchmark Conclusions
Hadoop had scalability advantages
– Checkpointing allows for better fault tolerance
– Runtime scheduling allows for better tolerance of
unexpectedly slow nodes
– Better parallelization of UDFs
Hadoop was consistently less efficient for
structured, relational data
– Reasons mostly non-fundamental
– Needed better support for compression and direct
operation on compressed data
– Needed better support for indexing
– Needed better support for co-partitioning of datasets
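To illustrate the co-partitioning point: when two datasets are partitioned on the join key with the same function, matching rows land on the same node, so each node can join its own partitions without moving data over the network. A minimal sketch; the tables, rows, and 4-node cluster are invented for illustration.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key: int) -> int:
    """Partitioning function shared by both tables (the co-partitioning requirement)."""
    return key % NUM_NODES


def partition(rows):
    """Group rows by the node that owns their join key (first column)."""
    parts = defaultdict(list)
    for row in rows:
        parts[node_for(row[0])].append(row)
    return parts


def local_join(left_rows, right_rows):
    """Hash join executed independently on a single node."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[0]].append(r)
    return [(l[0], l[1], r[1]) for l in left_rows for r in index.get(l[0], [])]


def copartitioned_join(left, right):
    """Join two co-partitioned tables node by node, with no cross-node lookups."""
    left_parts, right_parts = partition(left), partition(right)
    out = []
    for node in range(NUM_NODES):
        out.extend(local_join(left_parts.get(node, []), right_parts.get(node, [])))
    return sorted(out)


users = [(1, "ann"), (2, "bob"), (5, "eve")]
visits = [(1, "a.html"), (5, "b.html"), (5, "c.html"), (7, "d.html")]

if __name__ == "__main__":
    print(copartitioned_join(users, visits))
```

If the two tables used different partitioning functions, rows with the same key could sit on different nodes and the join would need a shuffle first.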
Best of Both Worlds Possible?
Connector
Problems With the Connector
Approach
Network delays and bandwidth limitations
Data silos
Multiple vendors
Fundamentally wasteful
– Very similar architectures
Both partition data across a cluster
Both parallelize processing across the cluster
Both optimize for local data processing (to
minimize network costs)
Unified System
Two options:
– Bring Hadoop technology to a parallel
database system
Problem: Hadoop is more than just technology
– Bring parallel database system technology to
Hadoop
Far more likely to have impact
Adding DBMS Technology to
Hadoop
Option 1: Keep Hadoop’s storage and build parallel
executor on top of it
Cloudera Impala (which is sort of a combination of Hadoop++
and NoDB research projects)
Need better Storage Formats (Trevni and Parquet are
promising)
Updates and Deletes are hard (Impala doesn’t support them)
Option 2: Use relational storage on each node
Accelerates “time to complete system”
We chose this option for HadoopDB
HadoopDB Architecture
SMS Planner
TPC-H Benchmark Results
UDF Task
[Figure: chart of time (seconds) on 10, 25, and 50 nodes for DBMS, Hadoop, and HadoopDB]
Fault Tolerance and Cluster
Heterogeneity Results
[Figure: chart of percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS, Hadoop, and HadoopDB]
HadoopDB Commercialization
Wanted to build a real system
Released initial prototype open source
Blog post about HadoopDB got slashdotted, led
to VC interest
– Initially reluctant to take VC money
Posted a job for an engineer to help build out
open source codebase
– Low quality of applicants
– Not enough government funding for more than 1
engineer
HadoopDB Commercialization
VC money only route to building a
complete system
– Launched with $1.5 million in seed money in
2010
– Raised an additional $8 million in 2011
– Raised an additional $6.75 million in 2012
Commercializing HadoopDB:
Where does development time go?
Work we expected to transition from
research prototype to commercial product
– SQL coverage
– Failover for high availability
– Authorization / authentication
– Error codes / messages for every situation
– Installer
– Documentation
But what about unexpected work?
Infrastructure Tools
Distributed systems are unwieldy
– For a cluster of size n, many things need to be done n times
Automated tools are critical
Just to try some new code, the following needs to
happen:
– Build product
– Provision a cluster
– Deploy build to cluster
– Install dependencies (Hadoop distro, libraries, etc)
– Install Hadapt with correct configuration parameters for that
cluster
– Generate data or copy data files to cluster for load
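The "done n times" problem in the steps above can be sketched as a small driver that runs each per-node step on every host in parallel and finishes one step everywhere before starting the next. The step names, host names, and placeholder step bodies are hypothetical; real steps would call out to the actual build, provisioning, and install tools, which the slides do not describe.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_all(hosts, step):
    """Run one per-node step on every host in parallel; results keep host order."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return list(pool.map(step, hosts))


def try_new_build(hosts, steps):
    """Apply each named step to all hosts, completing it everywhere before the next."""
    log = []
    for name, step in steps:
        log.append((name, run_on_all(hosts, step)))
    return log


# Placeholder steps that only return a message instead of doing real work.
steps = [
    ("deploy build", lambda h: f"copied build to {h}"),
    ("install dependencies", lambda h: f"installed deps on {h}"),
    ("configure", lambda h: f"wrote config for {h}"),
]
hosts = ["node1", "node2", "node3"]  # hypothetical host names
log = try_new_build(hosts, steps)

if __name__ == "__main__":
    for name, results in log:
        print(name, results)
```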
Upgrader
Start-ups need to move fast
Hadapt delivers a new release every
couple of months
Upgrade process must be easy
Downgrade (!) process must be easy
Changes in storage layout or APIs add
complexity to the process
UDF Support
HadoopDB supported both MapReduce
and SQL as interfaces
MapReduce was not a sufficient
replacement for database UDFs
Hadapt provides an “HDK” that enables
analysts to create functions that are
invokable from SQL
– Integrates with 3rd party tools
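The idea of a user-written function that is invokable from SQL can be shown with SQLite from the Python standard library. This is only a stand-in: it is not Hadapt's HDK, whose API the slides do not describe, and the `domain_of` function and `visits` table are invented for the example.

```python
import sqlite3
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """User-defined function: extract the host name from a URL."""
    return urlparse(url).netloc


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE visits (url TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?)",
    [("http://a.com/x",), ("http://a.com/y",), ("http://b.org/",)],
)

rows = conn.execute(
    "SELECT domain_of(url) AS d, COUNT(*) FROM visits GROUP BY d ORDER BY d"
).fetchall()

if __name__ == "__main__":
    print(rows)
```

The analyst writes ordinary code once, and anyone who knows SQL can then use it inside a query alongside built-in operators.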
Search
Hadoop is increasingly used as a data
landfill
– Granular data
– Messy data
– Unprocessed data
Database for Hadoop cannot assume all
data fits in rows and columns
Search support was the first thing we built
after our A round of financing
Is doing a start-up pre-tenure a
good idea?
Spinning off a company takes a ton of time
– At first, you are the ONLY person who can give a
complete description of the technical vision, so
You’re talking to all the VCs to fundraise
You’re talking to all the prospective customers
You’re talking to all the prospective employees
– Lots of travel
– Eventually, others can help with the above, but a
good CEO will not let you escape
Ups and downs can be mentally draining
If you do a start-up you will:
Publish less
Advise fewer students
Pursue fewer grants
Avoid university committees as much as
possible
Skip faculty meetings (usually because of
travel)
Attend fewer academic conferences
At the end of the day
Unless there are changes (see SIGMOD panel
from June):
– Publishing a lot is the best way to get tenure
– Spinning off a company necessarily detracts from
university measurable objectives
Doing a start-up is putting all your eggs in one
basket
– If successful, you have a lot of impact you can point to
– If not successful, you have nothing
– A lot of market forces that you have no control over
determine success

  • 6. Hardware Setup 100 node cluster Each node – 2.4 GHz Core 2 Duo Processors – 4 GB RAM – 2 × 250 GB SATA HDs (74 MB/sec sequential I/O) Dual GigE switches, each with 50 nodes – 128 Gbit/sec fabric Connected by a 64 Gbit/sec ring
  • 7. Join Task [Chart: time (seconds) on 10, 25, 50, and 100 nodes for Vertica, DBMS-X, and Hadoop]
  • 8. UDF Task [Chart: time (seconds) on 10, 25, 50, and 100 nodes for DBMS and Hadoop; annotated “DBMS clearly doesn’t scale”] Calculate PageRank over a set of HTML documents Performed via a UDF
  • 9. Scalability Except for UDFs all systems scale near linearly BUT: only ran on 100 nodes As nodes approach 1000, other effects come into play – Faults go from being rare, to not so rare – It is nearly impossible to maintain homogeneity at scale
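The point that faults stop being rare at scale follows from simple arithmetic. As a rough sketch, assume each node fails independently with the same probability during one query. The 0.1% per-node figure below is an illustrative assumption, not a number from the talk.

```python
def prob_any_failure(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` independent nodes fails.

    p_node is the assumed per-node failure probability during one query.
    """
    return 1.0 - (1.0 - p_node) ** nodes


if __name__ == "__main__":
    # Illustrative per-node failure probability (an assumption).
    for n in (10, 100, 1000):
        print(n, round(prob_any_failure(n, 0.001), 3))
```

Under that assumption, the chance that a query sees at least one failure rises from about 1% on 10 nodes to roughly 63% on 1000 nodes. A system that restarts the whole query on any failure pays that cost often at the larger size.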
  • 10. Fault Tolerance and Cluster Heterogeneity Results [Chart: percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS and Hadoop] Database systems restart entire query upon a single node failure, and do not adapt if a node is running slowly
  • 11. Benchmark Conclusions Hadoop had scalability advantages – Checkpointing allows for better fault tolerance – Runtime scheduling allows for better tolerance of unexpectedly slow nodes – Better parallelization of UDFs Hadoop was consistently less efficient for structured, relational data – Reasons mostly non-fundamental – Needed better support for compression and direct operation on compressed data – Needed better support for indexing – Needed better support for co-partitioning of datasets
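Co-partitioning, the last item above, can be illustrated with a small sketch. If two datasets are hash-partitioned on the join key with the same function, matching rows land on the same node and each node can join its own partitions without moving data. The function names and the in-memory "nodes" are hypothetical and are not taken from HadoopDB or any parallel DBMS.

```python
from collections import defaultdict


def hash_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts


def local_join(left_parts, right_parts, key, num_nodes):
    """Join each node's two partitions independently; no row crosses nodes."""
    out = []
    for node in range(num_nodes):
        # Build a per-node hash index on the right side.
        index = defaultdict(list)
        for r in right_parts.get(node, []):
            index[r[key]].append(r)
        # Probe it with the left side's rows on the same node.
        for l in left_parts.get(node, []):
            for r in index.get(l[key], []):
                out.append({**l, **r})
    return out
```

Because both inputs use the same partitioning function, the per-node joins together produce the full join result.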
  • 12. Best of Both Worlds Possible? Connector
  • 13. Problems With the Connector Approach Network delays and bandwidth limitations Data silos Multiple vendors Fundamentally wasteful – Very similar architectures Both partition data across a cluster Both parallelize processing across the cluster Both optimize for local data processing (to minimize network costs)
  • 14. Unified System Two options: – Bring Hadoop technology to a parallel database system Problem: Hadoop is more than just technology – Bring parallel database system technology to Hadoop Far more likely to have impact
  • 15. Adding DBMS Technology to Hadoop Option 1: Keep Hadoop’s storage and build parallel executor on top of it Cloudera Impala (which is sort of a combination of Hadoop++ and NoDB research projects) Need better Storage Formats (Trevni and Parquet are promising) Updates and Deletes are hard (Impala doesn’t support them) Option 2: Use relational storage on each node Accelerates “time to complete system” We chose this option for HadoopDB
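A toy sketch of Option 2: each node holds its partition in a local relational store, the same SQL is pushed to every node, and the partial results are merged afterwards. Here `sqlite3` stands in for the per-node database, and the `visits` table and helper names are assumptions made for illustration; this is not HadoopDB's actual code.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one 'node': an in-memory database holding one data partition."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    db.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return db


def run_everywhere(nodes, sql):
    """Push the same SQL to every node's local database (the 'map' side)."""
    return [db.execute(sql).fetchall() for db in nodes]


def merge_sums(partials):
    """Combine per-node (key, partial_sum) rows into global sums (the 'reduce' side)."""
    totals = defaultdict(float)
    for part in partials:
        for key, value in part:
            totals[key] += value
    return dict(totals)
```

For example, `merge_sums(run_everywhere(nodes, "SELECT url, SUM(revenue) FROM visits GROUP BY url"))` lets each local database do the filtering and aggregation it is good at, leaving only small partial results to combine.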
  • 19. UDF Task [Chart: time (seconds) on 10, 25, and 50 nodes for DBMS, Hadoop, and HadoopDB]
  • 20. Fault Tolerance and Cluster Heterogeneity Results [Chart: percentage slowdown in the fault tolerance and slowdown tolerance tests for DBMS, Hadoop, and HadoopDB]
  • 21. HadoopDB Commercialization Wanted to build a real system Released initial prototype open source Blog post about HadoopDB got slashdotted, led to VC interest – Initially reluctant to take VC money Posted a job for an engineer to help build out open source codebase – Low quality of applicants – Not enough government funding for more than 1 engineer
  • 22. HadoopDB Commercialization VC money only route to building a complete system – Launched with $1.5 million in seed money in 2010 – Raised an additional $8 million in 2011 – Raised an additional $6.75 million in 2012
  • 23. Commercializing HadoopDB: Where does development time go? Work we expected to transition from research prototype to commercial product – SQL coverage – Failover for high availability – Authorization / authentication – Error codes / messages for every situation – Installer – Documentation But what about unexpected work?
  • 24. Infrastructure Tools Distributed systems are unwieldy – For a cluster of size n, many things need to be done n times Automated tools are critical Just to try some new code, the following needs to happen: – Build product – Provision a cluster – Deploy build to cluster – Install dependencies (Hadoop distro, libraries, etc) – Install Hadapt with correct configuration parameters for that cluster – Generate data or copy data files to cluster for load
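The "n times" problem is what makes automation critical. A minimal sketch of the idea: express each step as a function of a node, run it on all nodes in parallel, and stop the pipeline if any node fails. The step names and the boolean success convention are hypothetical; this is not Hadapt's tooling.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_all(nodes, step):
    """Apply one step to every node in parallel; return {node: result}."""
    workers = max(1, min(32, len(nodes)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs nodes with results.
        return dict(zip(nodes, pool.map(step, nodes)))


def deploy(nodes, steps):
    """Run named steps in order; stop at the first step that fails anywhere."""
    completed = []
    for name, step in steps:
        results = run_on_all(nodes, step)
        failed = [node for node, ok in results.items() if not ok]
        if failed:
            raise RuntimeError(f"{name} failed on {failed}")
        completed.append(name)
    return completed
```

In practice each step would wrap something like a remote install or configuration command, which is left out here.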
  • 25. Upgrader Start-ups need to move fast Hadapt delivers a new release every couple of months Upgrade process must be easy Downgrade (!) process must be easy Changes in storage layout or APIs add complexity to the process
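One common way to keep both directions easy is to pair every upgrade step with its inverse. The sketch below assumes that approach; the version numbers, the `layout` and `api` keys, and the `migrate` helper are hypothetical and do not describe Hadapt's upgrader.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, str]


@dataclass(frozen=True)
class Migration:
    upgrade: Callable[[State], State]
    downgrade: Callable[[State], State]


# MIGRATIONS[v] moves the state from version v to v + 1, and back again.
MIGRATIONS = {
    1: Migration(lambda s: {**s, "layout": "columnar"},
                 lambda s: {**s, "layout": "row"}),
    2: Migration(lambda s: {**s, "api": "v2"},
                 lambda s: {k: v for k, v in s.items() if k != "api"}),
}


def migrate(state: State, current: int, target: int) -> State:
    """Walk the migration chain up or down from `current` to `target`."""
    while current < target:
        state = MIGRATIONS[current].upgrade(state)
        current += 1
    while current > target:
        current -= 1
        state = MIGRATIONS[current].downgrade(state)
    return state
```

A storage layout or API change then adds one entry to the chain, and a downgrade replays the inverses in reverse order.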
  • 26. UDF Support HadoopDB supported both MapReduce and SQL as interfaces MapReduce was not a sufficient replacement for database UDFs Hadapt provides an “HDK” that enables analysts to create functions that are invokable from SQL – Integrates with 3rd party tools
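To show what "invokable from SQL" means in general, the sketch below registers a Python function as a SQL function using the standard-library `sqlite3` module. This only illustrates the database UDF concept. It is not the Hadapt HDK, whose API is not described in the slides, and the `url_host` function and `visits` table are made up for the example.

```python
import sqlite3
from urllib.parse import urlparse


def url_host(url):
    """Return the host part of a URL, e.g. 'a.com' for 'http://a.com/x'."""
    return urlparse(url).hostname


db = sqlite3.connect(":memory:")
# Register the Python function under a SQL-visible name taking 1 argument.
db.create_function("url_host", 1, url_host)

db.execute("CREATE TABLE visits (url TEXT)")
db.executemany(
    "INSERT INTO visits VALUES (?)",
    [("http://a.com/x",), ("http://a.com/y",), ("http://b.org/",)],
)

# The analyst calls the function from plain SQL, alongside GROUP BY.
rows = db.execute(
    "SELECT url_host(url) AS host, COUNT(*) FROM visits "
    "GROUP BY host ORDER BY host"
).fetchall()
print(rows)
```

The function composes with the rest of the query (grouping, ordering), which is what a separate MapReduce job cannot offer to a SQL user.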
  • 27. Search Hadoop is increasingly used as a data landfill – Granular data – Messy data – Unprocessed data Database for Hadoop cannot assume all data fits in rows and columns Search support was the first thing we built after our A round of financing
  • 28. Is doing a start-up pre-tenure a good idea? Spinning off a company takes a ton of time – At first, you are the ONLY person who can give a complete description of the technical vision, so You’re talking to all the VCs to fundraise You’re talking to all the prospective customers You’re talking to all the prospective employees – Lots of travel – Eventually, others can help with the above, but a good CEO will not let you escape Ups and downs can be mentally draining
  • 29. If you do a start-up you will: Publish less Advise fewer students Pursue fewer grants Avoid university committees as much as possible Skip faculty meetings (usually because of travel) Attend fewer academic conferences
  • 30. At the end of the day Unless there are changes (see SIGMOD panel from June): – Publishing a lot is the best way to get tenure – Spinning off a company necessarily detracts from university measurable objectives Doing a start-up is putting all your eggs in one basket – If successful, you have a lot of impact you can point to – If not successful, you have nothing – A lot of market forces that you have no control over determine success