Distributed Databases - Concepts & Architectures
Daniel Marcous
What?
Introduction
A distributed database is a database
in which storage devices are not all
attached to a common processing
unit such as the CPU, controlled by a
distributed database management
system.
Definitions
● RDBMS - Relational Database Management System
● DDB - Distributed Database
● Node - a unit in a distributed system (mainly a single server)
● DDBMS - Distributed Database Management System
○ In charge of managing the different DDB nodes as one integrated system
● Centralized System - data is stored in one place
● Homogeneous system - built of parts (nodes) that all act the same way / consist of
the same hardware (opposite of heterogeneous).
Understanding the vocabulary
Basic Concepts
Distributed Database Concepts
● Number of processing elements (database nodes)
● Connection between nodes over a computer network
● Logical interrelation between different database nodes
● No homogeneity requirement - nodes need not be identical in data, hardware or software
Types of Distributed Databases
Multiprocessing Systems
● Parallel Systems
○ Shared Memory (tightly coupled) - multiple processors share the same main memory
○ Shared Disk (loosely coupled) - multiple processors share the same secondary disk storage
● Truly Distributed Systems
○ Shared Nothing - each processor has its own memory and disk;
nodes interact only through the network (no single point of failure)
● Distribution - Data and software distributed over multiple nodes
● Autonomy - Provision DBMS as one whole VS multiple standalone DBMSs
● Heterogeneity - use of different software / hardware on different nodes
Classification of Distributed Systems
Why?
The power of distribution
Reasons for choosing a distributed
database over a “plain” centralized
database.
Advantages
● More computing power
○ CPU
○ Memory
○ Storage
○ Network bandwidth
● Parallelism
○ Inter-query
○ Intra-query
Performance
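The two kinds of parallelism can be sketched on a toy table. This is only an illustration under assumptions: the `partitions` data and the function names are hypothetical, and threads stand in for database nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: one table split into three partitions, one per node.
partitions = [
    [5, 12, 7],
    [3, 9],
    [20, 1, 4, 6],
]

def partial_sum(rows):
    """Work done locally by one node on its own partition."""
    return sum(rows)

def intra_query_sum(pool):
    """Intra-query: ONE query is split into per-partition pieces run in parallel."""
    return sum(pool.map(partial_sum, partitions))

def inter_query(pool):
    """Inter-query: SEVERAL independent queries run at the same time."""
    total = pool.submit(lambda: sum(sum(p) for p in partitions))
    biggest = pool.submit(lambda: max(max(p) for p in partitions))
    return total.result(), biggest.result()

with ThreadPoolExecutor(max_workers=3) as pool:
    one_query = intra_query_sum(pool)
    two_queries = inter_query(pool)
```

Intra-query parallelism shortens a single heavy query; inter-query parallelism raises the number of queries served at once.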
Ease of use / development
● Transparency
● Geographically distributed sites
● Backups
● Elasticity
○ Growing
○ Shrinking
Challenges
● Transparency - One software (Ring) to rule them all
○ Management - one command
○ Data - one query
● Autonomy - Degree of Independence
○ Different settings / configurations / Cache size
○ “Master” node / Master Election
● Keeping track of data distribution
○ which server has the table / partition I need?
Management Challenges
● Reliability - Probability of failures
○ Does one server failure affect the whole system? (“Freeze”)
● Availability - Percentage of time when a data source is available
○ If a node goes down, is its data lost, or just unavailable until it is up again?
● Recovery
○ What is a single point of time?
○ Node clock synchronisation (NTP)
● Transaction Management - Server X must assure that the data is “safe” and no
Complex Features Implementation
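The availability definition above can be turned into a small worked calculation. This is a sketch under a loudly stated assumption: replicas are taken to fail independently, which real clusters only approximate. The function names are hypothetical.

```python
def availability(uptime_hours, total_hours):
    """Fraction of time the data source was reachable."""
    return uptime_hours / total_hours

def replicated_availability(node_availability, replicas):
    """Availability of data kept on several replicas.

    ASSUMPTION: replicas fail independently, so the data is unavailable
    only when every replica is down at the same moment.
    """
    return 1 - (1 - node_availability) ** replicas

single = availability(99, 100)                 # one node, up 99% of the time
tripled = replicated_availability(single, 3)   # same data on three such nodes
```

Under that assumption, three 99% nodes give roughly 99.9999% availability for the data they share, which is why replication is the usual answer to the "does its data get lost?" question.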
Scaling
● Synchronisation Overhead
CAP Theorem
● Eric Brewer (Berkeley->Yahoo->Google)
○ C - a read sees all previously completed writes
○ A - reads and writes always succeed
○ P - the system keeps working while the network between nodes is partitioned
● Choose 2! (2000)
● Sorry, actually only C or A… (2012)
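The "only C or A" choice during a partition can be sketched with two toy replicas. Everything here is hypothetical (the `Replica` class, the `mode` flag, the `partitioned` switch); it illustrates the trade-off and is not how any particular database implements it.

```python
class Replica:
    """Toy replica holding a single value and mirroring writes to one peer."""

    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (hypothetical setting)
        self.value = None
        self.peer = None
        self.partitioned = False    # True while the peer is unreachable

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write, give up availability.
                raise RuntimeError("unavailable: cannot reach peer")
            # Availability first: accept locally, replicas now disagree.
            self.value = value
            return
        self.value = value
        self.peer.value = value

def make_pair(mode):
    """Build two replicas that point at each other."""
    a, b = Replica(mode), Replica(mode)
    a.peer, b.peer = b, a
    return a, b
```

With an "AP" pair, a write during a partition succeeds but the peer keeps the old value; with a "CP" pair, the same write raises instead of letting the replicas diverge.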
How?
Internals
How does a distributed database
work?
● Advanced Concepts
● Architectures
Advanced Concepts
Replication
● Assumptions
○ Nodes will fail
○ Commodity Hardware - prone to failure
● Settings
○ Replication Factor
○ Data / Actions / Apply logs
○ Synchronous / Asynchronous
○ Delay
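The settings above (replication factor, synchronous vs asynchronous, delay) can be sketched with in-memory nodes. The `Node` and `ReplicatedStore` classes are hypothetical, and the `pending` list is a stand-in for an apply log shipped to replicas after a delay.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

class ReplicatedStore:
    """Toy primary/replica store; not any real product's protocol."""

    def __init__(self, nodes, replication_factor, synchronous):
        if replication_factor > len(nodes):
            raise ValueError("not enough nodes for this replication factor")
        self.primary = nodes[0]
        # Replication factor counts the primary copy as well.
        self.replicas = nodes[1:replication_factor]
        self.synchronous = synchronous
        self.pending = []  # writes not yet applied on the replicas

    def put(self, key, value):
        self.primary.data[key] = value
        if self.synchronous:
            # Synchronous: the write returns only after every copy has it.
            for replica in self.replicas:
                replica.data[key] = value
        else:
            # Asynchronous: return now, replicas catch up later (the delay).
            self.pending.append((key, value))

    def flush(self):
        """Apply the pending log on all replicas."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()
```

Synchronous replication pays for its safety with write latency; asynchronous replication is faster but a replica can serve stale data until the log is applied.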
Fragmentation
● Dividing a single data object (table / file) into multiple parts
● Types
○ Horizontal - row wise
○ Vertical - column wise (Vertica / Parquet)
○ Hybrid - both
● Advantages
○ Reports on part of the data - horizontal
○ Increased parallelism - multiple physical files
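The three fragmentation types can be sketched on a small table held as a list of dictionaries. The sample rows and the helper names `horizontal` and `vertical` are hypothetical.

```python
# Hypothetical table: one dict per row.
rows = [
    {"id": 1, "city": "TLV", "fare": 30},
    {"id": 2, "city": "NYC", "fare": 55},
    {"id": 3, "city": "TLV", "fare": 18},
]

def horizontal(table, predicate):
    """Horizontal fragment: a subset of the rows, all columns kept."""
    return [row for row in table if predicate(row)]

def vertical(table, columns):
    """Vertical fragment: all rows, only some columns.

    The key column should be included so fragments can be joined back.
    """
    return [{col: row[col] for col in columns} for row in table]

tlv_rows = horizontal(rows, lambda row: row["city"] == "TLV")
fares = vertical(rows, ["id", "fare"])
hybrid = vertical(tlv_rows, ["id", "fare"])  # both cuts applied
```

A report on TLV rides only needs to read `tlv_rows`, which is the "reports on part of the data" advantage listed above.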
Distributed Processing
● Access by key Only!
○ Using Hash Tables
■ keys are hashed and spread (=sharded) across nodes
■ result of hash tells you which node to access
■ Hash maps exist on every node / client
● Batch Processing
○ MapReduce
■ Map - partition by key
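Key-based access through hashing can be sketched in a few lines. The node names and the `node_for` helper are hypothetical. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings is randomised per process, so different clients would not agree on it.

```python
import hashlib

# Hypothetical cluster; every node and client holds the same list.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key, nodes=NODES):
    """Hash the key and map the result to the node that owns it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[bucket]

owner = node_for("user:42")
```

Any client can compute the owner locally without asking a coordinator. The weakness of plain modulo hashing is that changing the number of nodes remaps most keys; consistent hashing is the usual remedy.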
Data Locality
● Local storage (VS centralised storage controller)
○ Bring the processing to the data
○ Free bandwidth
● Smart Load Balancing
○ Route users to the “closest” node with the data (replication duh..)
● Data sorted by Key /Hash Key
○ Same / Close enough key = Same node
○ “Process” all the rides in the TLV area
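"Close enough key = same node" can be sketched as range partitioning over sorted keys. The split points, node names and key format below are hypothetical.

```python
import bisect

# Hypothetical split points over sorted keys:
#   key < "h"        -> range-node-0
#   "h" <= key < "p" -> range-node-1
#   key >= "p"       -> range-node-2
SPLIT_POINTS = ["h", "p"]
RANGE_NODES = ["range-node-0", "range-node-1", "range-node-2"]

def range_node_for(key):
    """Find the node whose key range contains this key."""
    return RANGE_NODES[bisect.bisect_right(SPLIT_POINTS, key)]

# Keys sharing a prefix sort next to each other, so they share a node.
first_ride = range_node_for("tlv:ride:1")
second_ride = range_node_for("tlv:ride:2")
other_city = range_node_for("eilat:ride:1")
```

Because all keys with the `tlv:` prefix live on one node, a job over the TLV rides can run on that node without moving data across the network.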
ACID
● Atomicity
○ Transactions
● Consistency
○ Locked until done
● Isolation
○ No interference
● Durability
○ Completed = Persistent
BASE
● Basically Available
○ Response to every request
● Soft State
○ States change, results are not deterministic
● Eventual Consistency
○ Consistent state may take time but is promised
○ (CAS - Compare & Swap operations exist)
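A compare-and-swap operation can be sketched as follows. The `Cell` class is hypothetical, and a local lock stands in for whatever atomicity the store provides on a single key; the point is the retry loop on the caller's side.

```python
import threading

class Cell:
    """Toy single-key store offering an atomic compare-and-swap."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stand-in for server-side atomicity

    def read(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Write `new` only if the current value is still `expected`."""
        with self._lock:
            if self._value != expected:
                return False  # someone else changed it first
            self._value = new
            return True

def increment(cell):
    """Lock-free style update: read, compute, CAS, retry on conflict."""
    while True:
        current = cell.read()
        if cell.compare_and_swap(current, current + 1):
            return current + 1
```

This gives a safe read-modify-write on one key without a full ACID transaction, which is why BASE stores often expose it.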
Architectures
Plain Old Centralized Database
● Oracle
● SQLServer (MS)
● DB2 (IBM)
● MySQL
● PostgreSQL
Relational (ACID) “Distributed” Database
● Oracle RAC (Real Application Clusters)
● DB2 Data Sharing
● PostgresXL
Federated Database System
● IBM IIDR
Data Warehouse
● Oracle Exadata
● Teradata
● SQL Data Warehouse (MS)
● Vertica (HP)
● Greenplum (EMC)
Interactive Massively Parallel Processing (MPP)
● Dremel (Big Query, Google)
● Redshift (Amazon)
● Presto (Facebook)
● Impala (Cloudera)
NoSQL (BASE) Shared Nothing Database
● MongoDB
● CouchBase
● Cassandra
● HBase
When?/Where?
History and Present
Where did the ideas come from and
what do we have present for use
nowadays?
The Founding Fathers
Articles
● Old School
○ Fundamentals of Database Systems (1989)
○ Principles of Distributed Database Systems (1991)
● Distributed File System
○ The Google file system (2003)
● Distributed Processing
○ MapReduce: simplified data processing on large clusters (2004)
● Interactive Querying on large scale
Adopters
● Document DB (Mostly JSON)
○ MongoDB
○ CouchBase
● Key-Value DB
○ Cassandra
○ HBase
● Graph DB
○ Neo4J
NoSQL – Database Types
Known Users
Big Guys
● Google - Inside tools
○ MapReduce
○ Dremel -> Big Query
○ Flume -> DataFlow
● Facebook - Inside tools open-sourced and modified
○ Cassandra -> HBase
○ Presto
● Yahoo - Hadoop / HBase
● IDF
● Waze
● Viber - Couchbase
● Liveperson - MongoDB, CouchBase
● SimilarWeb - HBase
Israel
Distribution is
awesome, but
requires complex
skills to do right.
Don’t overkill it.
