DATA & STORAGE in the Cloud
Outline
 Distributed file systems
 Introduction to Big Data
 Storage paradigms (RDBMS, NoSQL, and NewSQL)
 Writing an application on top of distributed storage (Cassandra)
File system
The purpose of a file system is to:
 Organize and store data
 Support sharing of data among users and applications
 Ensure persistence of data after a reboot
 Examples include FAT, NTFS, ext3, ext4, etc.
Distributed file system
 Self-explanatory: the file system is distributed across many
machines
 The DFS provides a common abstraction to the dispersed files
 Each DFS has an associated API that provides a service to
clients, which are normal file operations, such as
create, read, write, etc.
 Maintains a namespace which maps logical names to physical
names
 Simplifies replication and migration
 Examples include the Network file system (NFS), Andrew file system
(AFS), Google file system (GFS), Hadoop Distributed file system
(HDFS) etc.
Introduction to GFS
 Designed by Google to meet its massive storage needs
 Shares many goals with previous distributed file systems such as
performance, scalability, reliability, and availability
 At the same time, design driven by key observations of their
workload and infrastructure, both current and future
Design Goals
 Failure is the norm rather than the exception: The GFS must constantly
introspect and automatically recover from failure
 The system stores a fair number of large files: Optimize for large files, on
the order of GBs, but still support small files
 Most applications perform large, sequential writes that are mostly append
operations: Support small writes but do not optimize for them
 Most operations are producer-consumer queues or many-way merging:
Support concurrent reads or writes by hundreds of clients simultaneously
 Applications process data in bulk at a high rate: Favor throughput over
latency
Files
 Files are sliced into fixed-size chunks
 64MB
 Each chunk is identifiable by an immutable and globally unique
64-bit handle
 Chunks are stored by chunkservers as local Linux files
 Reads and writes to a chunk are specified by a handle and a byte
range
 Each chunk is replicated on multiple chunkservers
 3 by default
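As a back-of-the-envelope sketch, a client library under this scheme would translate a file byte offset into a chunk index before asking the master for the matching handle. The class and method names below are illustrative, not GFS code:

```java
// Minimal sketch: translating a file byte offset into a chunk index,
// as a GFS client library would before asking the master for the
// corresponding chunk handle. All names here are illustrative.
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB fixed-size chunks

    // Index of the chunk that contains the given byte offset
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    // Offset of that byte within its chunk
    static long offsetWithinChunk(long byteOffset) {
        return byteOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024; // byte 200 MB of some file
        System.out.println(chunkIndex(offset));        // 3 -> the fourth chunk
        System.out.println(offsetWithinChunk(offset)); // 8 MB into that chunk
    }
}
```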
Architecture
 Consists of a single master and
multiple chunkservers
 The system can be accessed by
multiple clients
 Both the master and
chunkservers run as user-space
server processes on commodity
Linux machines
Master
 In charge of all filesystem metadata
 Namespace, access control information, mapping between files and
chunks, and current locations of chunks
 Holds this information in memory and regularly syncs it with a log file
 Also in charge of chunk leasing, garbage collection, and chunk
migration
 Periodically sends each chunkserver a heartbeat signal to check
its state and send it instructions
 Clients interact with it to access metadata but all data-bearing
communication goes directly to the relevant chunkservers
 As a result, the master does not become a performance bottleneck
Master: Consistency Model
 All namespace mutations (such as file creation) are atomic as they
are exclusively handled by the master
 Namespace locking guarantees atomicity and correctness
 The operation log maintained by the master defines a global total
order of these operations
Mutation Operations
 Each chunk has many replicas
 The primary replica holds a lease from the master
 It decides the order of all mutations for all replicas
Write Operation
 Client obtains the location of replicas and
the identity of the primary replica from the
master
 It then pushes the data to all replica nodes
 The client issues an update request to
primary
 Primary forwards the write request to all
replicas
 It waits for a reply from all replicas before
returning to the client
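A toy sketch of this flow, with invented interfaces standing in for the real client, primary, and secondaries; the key point is that the primary alone assigns the serial order of mutations:

```java
import java.util.List;

// Toy sketch of the write path above. The class and method names are
// invented for illustration, not actual GFS interfaces.
public class WriteFlow {
    interface Secondary {
        void push(byte[] data);                // buffer pushed data
        void apply(int serialNo, byte[] data); // apply mutation in serial order
    }

    static class Primary {
        final List<Secondary> secondaries;
        int nextSerial = 1;

        Primary(List<Secondary> secondaries) { this.secondaries = secondaries; }

        // The primary assigns a serial number so every replica applies
        // mutations in the same order; it forwards the request and (in
        // real GFS) replies only after all secondaries acknowledge.
        void write(byte[] data) {
            int serial = nextSerial++;
            System.out.println("primary applies mutation #" + serial);
            for (Secondary s : secondaries) s.apply(serial, data);
        }
    }

    public static void main(String[] args) {
        Secondary s = new Secondary() {
            public void push(byte[] d) { System.out.println("buffered " + d.length + " bytes"); }
            public void apply(int n, byte[] d) { System.out.println("applied mutation #" + n); }
        };
        Primary primary = new Primary(List.of(s, s));

        byte[] data = "hello".getBytes();
        s.push(data);        // steps 1-2: client pushes data to every replica first
        primary.write(data); // step 3: client then issues the write request to the primary
    }
}
```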
Record Append Operation
 Append location chosen by the GFS and communicated to the
client
 Primary forwards the write request to all replicas
 It waits for a reply from all replicas before returning to the client
 If the record fits in the current chunk, it is written and communicated
to the client
 If it does not, the chunk is padded and the client is told to try the next
chunk
 Performed atomically
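A minimal sketch of the fit-or-pad decision described above, with made-up names:

```java
// Sketch of the append-or-pad decision (illustrative only).
public class RecordAppend {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;

    // Returns the offset GFS chose for the record, or -1 to tell the
    // client to retry on the next chunk after this one is padded.
    static long tryAppend(long usedBytes, long recordLen) {
        if (usedBytes + recordLen <= CHUNK_SIZE) {
            return usedBytes;  // record fits: written atomically at this offset
        }
        return -1;             // does not fit: pad the remainder, retry next chunk
    }

    public static void main(String[] args) {
        System.out.println(tryAppend(60L * 1024 * 1024, 1024)); // fits
        System.out.println(tryAppend(CHUNK_SIZE - 10, 1024));   // -1: chunk padded
    }
}
```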
Chunk Placement
 Put on chunkservers with below average disk space usage
 Limit number of “recent” creations on a chunkserver, to ensure that
it does not experience any traffic spike due to its fresh data
 For reliability, replicas spread across racks
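A hedged sketch of such a placement filter; the Server fields and the recent-creation threshold are invented, and the rack-spreading step is omitted:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the placement heuristic above: prefer chunkservers whose
// disk usage is below the cluster average, and skip servers with too
// many recent creations to avoid fresh-data traffic spikes.
public class Placement {
    record Server(String id, double diskUsage, int recentCreations) {}

    static List<Server> candidates(List<Server> servers, int maxRecent) {
        double avg = servers.stream().mapToDouble(Server::diskUsage).average().orElse(0);
        return servers.stream()
                .filter(s -> s.diskUsage() < avg)             // below-average disk usage
                .filter(s -> s.recentCreations() < maxRecent) // avoid fresh-data hotspots
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Server> cluster = List.of(
                new Server("cs1", 0.40, 2), new Server("cs2", 0.80, 0),
                new Server("cs3", 0.55, 9));
        System.out.println(candidates(cluster, 5)); // only cs1 qualifies
    }
}
```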
Stale Replica Detection
 Each chunk is assigned a version number
 Each time a new lease is granted, the version number is
incremented
 Stale replicas with outdated version numbers are simply garbage
collected
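The check itself is a simple version comparison; a sketch with illustrative names:

```java
// Sketch of the staleness check described above (illustrative).
public class StaleCheck {
    // Master's view: the version number is bumped each time a new lease
    // is granted, so a replica that was down during a mutation lags behind.
    static boolean isStale(int replicaVersion, int masterVersion) {
        return replicaVersion < masterVersion; // outdated -> garbage collect
    }

    public static void main(String[] args) {
        System.out.println(isStale(6, 7)); // true: replica missed a lease grant
        System.out.println(isStale(7, 7)); // false: up to date
    }
}
```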
Garbage Collection
 A lazy reclamation strategy is used: chunks are not reclaimed at delete time
 Each chunkserver communicates the subset of its current chunks
to the master in the heartbeat signal
 Master pinpoints chunks which have been orphaned
 Chunks become garbage when they are orphaned
 The chunkserver finally reclaims that space
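The orphan computation is essentially a set difference between what a chunkserver reports and what the master still maps to files; a sketch with made-up chunk handles:

```java
import java.util.*;

// Sketch of orphan detection: the master compares the handles a
// chunkserver reports in its heartbeat against its own file-to-chunk
// metadata; anything the master no longer knows about is garbage.
public class OrphanDetection {
    static Set<Long> orphaned(Set<Long> reportedByChunkserver, Set<Long> knownToMaster) {
        Set<Long> orphans = new HashSet<>(reportedByChunkserver);
        orphans.removeAll(knownToMaster); // handles with no file mapping
        return orphans;
    }

    public static void main(String[] args) {
        Set<Long> reported = Set.of(101L, 102L, 103L);
        Set<Long> known = Set.of(101L, 103L);          // 102 was deleted lazily
        System.out.println(orphaned(reported, known)); // [102] -> reclaim
    }
}
```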
Introduction to HDFS
 Open-source clone of GFS
 Comes packaged with Hadoop
 Master is called the NameNode and chunkservers are called DataNodes
 Chunks are known as blocks
 Exposes a Java API and a command-line interface
Command-line API
 Accessible through: bin/hdfs dfs -[command] [args]
 Useful commands:
cat, copyFromLocal, copyToLocal, cp, ls, mkdir, moveFromLocal, moveToLocal, mv, rm, etc*.
* http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html
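For the Java API mentioned on the previous slide, a minimal sketch using the Hadoop FileSystem classes; the NameNode address and paths are placeholders for a real cluster configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of writing a file through the HDFS Java API.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");     // placeholder path

        try (FSDataOutputStream out = fs.create(file)) {  // blocks written via DataNodes
            out.writeUTF("hello, HDFS");
        }
        System.out.println("exists: " + fs.exists(file)); // metadata query hits the NameNode
    }
}
```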
INTRO. TO BIG DATA & STORAGE PARADIGMS
Today, Government agencies at the Federal, State and Local level are
confronting the same challenge that commercial organizations have been
struggling with in recent years: how to best capture and utilize the increasing
amount of data that is coming from more sources than ever before.
Problem
 The current framework: the Web is multidisciplinary and complex
Big Data
Large datasets whose processing and storage requirements exceed the capabilities of traditional paradigms and infrastructure
3 Vs of Big Data
 The “BIG” in big data isn’t just about volume: the three Vs are Volume, Velocity, and Variety
Big data ecosystem
 Presentation layer
 Application layer: frameworks + storage
 Operating system layer
 Virtualization layer (optional)
 Network layer (intra- and inter-data center)
 Physical infrastructure
 Can roughly be called the “cloud”
More Examples of big data…
 Index 20 billion web pages a day and handle in excess of 3 billion search queries daily
 Provide email storage to 425 million Gmail users
 Serve 3 billion YouTube videos a day
 400 million Tweets every day
 In March 2012, the Obama Administration announced the Big Data Research and Development
Initiative, $200 million in new R&D investments, which will explore how Big Data could be used to address
important problems facing the government.
Why are they collecting all this data?
Target Marketing
• To send you catalogs for exactly
the merchandise you typically
purchase.
• To suggest medications that
precisely match your medical
history.
• To “push” television channels to
your set instead of your “pulling”
them in.
• To send advertisements on those
channels just for you!
Targeted Information
• To know what you need before
you even know you need it
based on past purchasing
habits!
• To notify you of your expiring
driver’s license or credit cards or
last refill on a Rx, etc.
• To give you turn-by-turn
directions to a shelter in case of
emergency.
What problems does Big Data raise?
What is the problem
 Traditionally, computation has been processor-bound
 For decades, the primary push was to increase the
computing power of a single machine – Faster
processor, more RAM
 Distributed systems evolved to allow developers to use
multiple machines for a single job – At compute
time, data is copied to the compute nodes
 Getting the data to the processors becomes the bottleneck
 Quick calculation – typical disk data transfer rate: 75 MB/sec
 Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
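(As a quick check: 100 GB = 102,400 MB, and 102,400 MB ÷ 75 MB/s ≈ 1,365 s ≈ 22.8 minutes.)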
What is the problem
 Failure of a component may cost a lot
 What do we need when a job fails? – Failure may result in a graceful degradation
of application performance, but the entire system should not completely fail –
It should not result in the loss of any data – It should not affect the outcome of
the job
RDBMS, NoSQL & NewSQL & Apps
Introduction
Data is everywhere and is the driving force behind our lives
The address book on your phone is data
So is the newspaper that you read every morning
Everything you see around you is a potential source of data which
might be useful for a certain application
We use this data to share information and make more informed decisions about different events
Datasets can easily be classified on the basis of their structure
 Structured
 Unstructured
 Semi-structured
Structured Data
 Formatted in a universally understandable and identifiable way
 In most cases, structured data is formally specified by a schema
 Your phone's address book is structured because it has a schema consisting of name, phone number, address, email address, etc.
 Most traditional databases contain structured data revolving around data laid out across columns and rows
 Each field also has an associated type
 Possible to search for items based on their data types
Unstructured Data
 Data without any conceptual definition or type
 Can vary from raw text to binary data
 Processing unstructured data requires parsing and tagging on the fly
 In most cases, consists of simple log files
Semi-structured Data
 Occupies the space between the structured and unstructured ends of the data spectrum
 For instance, while binary data has no structure, audio and video files have meta-data which has structure, such as author, time of creation, etc.
 Can also be labelled as self-describing structure
Storage
Database Management Systems (DBMS)
 Used to store and manage data
 Support for large amounts of data
 Ensure concurrency, sharing, and locking
 Security is useful too, enabling fine-grained access control
 Ability to keep working in the face of failure
Relational Database Management Systems
(RDBMS)
 The most popular and predominant storage system in use
 Data in different files is connected by using a key field
 Data is laid out in different tables, with a key field that identifies each row
 The same key field is used to connect one table to another
 For instance, a relation might have customer ID as key and her details as
data; another table might have the same key but different data, say her
purchases; yet another table with the same key might have a breakdown
of her preferences
 Examples include Oracle Database, MS SQL Server, MySQL, IBM DB2, and
Teradata
RDBMS and Structured Data
 As structured data follows a predefined schema, it naturally maps on to a
relational database system
 The schema defines the type and structure of the data and its relations
 Schema design is an arduous process and needs to be done before the database can be populated
 Another consequence of a strict schema is that it is non-trivial to extend it
 For instance, adding a new attribute to an existing row necessitates adding a new column to the entire table
 Extremely suboptimal in tables with millions of rows
RDBMS and Semi- and Un-structured Data
 Unstructured data has no notion of schema while semi-structured data
only has a weak one
 Data within such datasets also has an associated type
 In fact, types are application-centric: It might be possible to interpret a
field as a float in one application and as a string in another
 While it is possible, with human intervention, to glean structure from
unstructured data, it is an extremely expensive task
 Structureless data generated by real-time sources can change the
number of attributes and their types on the fly
 An RDBMS would require the creation of a new table each time such a
change takes place
 Therefore, unstructured and semi-structured data does not fit the
relational model
NoSQL
It’s not about saying that SQL should never be used, or that SQL is dead
NoSQL
is simply
Not Only SQL!
It’s about recognizing that for some problems
other storage solutions are better suited
NoSQL
 Database management without the relational model; schema-free
 Usually not ACID
 Eventually consistent data
 Distributed, fault-tolerant
 Large amounts of data
 Low and predictable response time (latency)
 Scalability & elasticity (at low cost!)
 High availability
 Flexible schemas / semi-structured data
Some NoSQL use cases
1. Massive data volumes
 Massively distributed architecture required to store the data
 Google, Amazon, Yahoo, Facebook – 10-100K servers
2. Extreme query workload
 Impossible to efficiently do joins at that scale with an RDBMS
3. Schema evolution
 Schema flexibility (migration) is not trivial at large scale
 Schema changes can be gradually introduced with NoSQL
Three (emerging) NOSQL categories
 Key-value stores
 Based on DHTs / Amazon's Dynamo paper
 Data model: (global) collection of K-V pairs
 Example: Dynomite, Voldemort, Tokyo Cabinet
 BigTable Clones
 Based on Google's BigTable paper
 Data model: big table, column families
 Example: HBase, Hypertable
 Document databases
 Inspired by Lotus Notes
 Data model: collections of K-V collections
 Example: CouchDB, MongoDB
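A sketch of the key-value data model these stores expose: a (global) collection of K-V pairs behind a get/put/delete interface. A real store such as Voldemort partitions and replicates this map across nodes, which is omitted here:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Single-node sketch of the key-value data model (illustrative only).
public class KvStoreSketch {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) { data.put(key, value); }
    public byte[] get(String key)             { return data.get(key); }
    public void delete(String key)            { data.remove(key); }

    public static void main(String[] args) {
        KvStoreSketch store = new KvStoreSketch();
        store.put("user:42", "alice".getBytes());
        System.out.println(new String(store.get("user:42"))); // alice
    }
}
```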
NoSQL pros/cons
NewSQL
NewSQL is a class of modern relational database management systems that
seek to provide the same scalable performance of NoSQL systems for
OLTP workloads while still maintaining the ACID guarantees of a traditional
single-node database system
NewSQL
 SQL as the primary interface
 ACID support for transactions
 Non-locking concurrency control
 High per-node performance
 Parallel, shared-nothing architecture
 Radically better scalability and performance
 A hybrid of traditional RDBMS and NoSQL
 Scalability and performance of NoSQL and ACID guarantees of RDBMS
 Use SQL as the primary language
 Ability to scale out and run over commodity hardware
 Classified into:
1. New Databases: Designed from scratch
2. New MySQL Storage Engines: Keep MySQL as the interface but replace the storage engine
3. Transparent Clustering: Add pluggable features to existing databases to ensure scalability
NewSQL World
Column Store Database
Why Column Store?
 Can be significantly faster than row stores for some applications
 Fetch only required columns for a query
 Better cache effects
 Better compression (similar attribute values within a column)
 But can be slower for other applications
 OLTP with many row inserts, etc.
 Long war between the column store and row store camps
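A small sketch of why column scans win for such queries; the data and sizes here are made up:

```java
// Sketch: summing one attribute over a million records. In a column
// layout the attribute is one contiguous array (sequential, cache-
// friendly, compressible); in a row layout the scan drags every other
// attribute through the cache as well.
public class ColumnScan {
    public static void main(String[] args) {
        int n = 1_000_000;

        int[][] rows = new int[n][3]; // row store: each row holds all 3 attributes
        int[] col0 = new int[n];      // column store: attribute 0 alone, contiguous

        long rowSum = 0, colSum = 0;
        for (int i = 0; i < n; i++) rowSum += rows[i][0]; // pointer-chasing per row
        for (int i = 0; i < n; i++) colSum += col0[i];    // sequential scan
        System.out.println(rowSum + " " + colSum);
    }
}
```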
Introduction to Cassandra
 Borrows concepts from both Dynamo and BigTable
 Originally developed by Facebook but now an Apache open source project
 Designed for Facebook's inbox search for efficiently storing, indexing, and searching messages
Design Goals
 Processing of a large amount of data
 Highly scalable
 Reliability at a massive scale
 High throughput writes without sacrificing read efficiency
Introduction…
 http://cassandra.apache.org/
• Developed by Facebook (inbox), now Apache
– Facebook now developing its own version again
• Based on Google BigTable (data model) and Amazon Dynamo (partitioning & consistency)
• P2P
– Every node is aware of all other nodes in the cluster
• Design goals
– High availability
– Eventual consistency (improves HA)
– Incremental scalability / elasticity
– Optimistic replication
Data model
– Same as BigTable
– Super Columns (nested Columns) and Super Column Families
– column order in a CF can be specified (name, time)
• Cluster membership
– Gossip – every node gossips to 1-3 other nodes about the state of the cluster (merging incoming
info with its own)
– Changes in the cluster (node in/out, failure) propagate quickly (O(log N))
– Probabilistic failure detection (sliding window, Exp(α) or N(μ, σ²))
• Dynamic partitioning
– Consistent hashing
– Ring of nodes
– Nodes can be “moved” on the ring for load balancing
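A sketch of a consistent-hashing ring over a sorted map; the hash function is a toy, and virtual nodes and replication are omitted:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the ring described above: nodes sit at positions on a ring,
// and a key is owned by the first node at or after its hash (wrapping
// around). Moving a node only remaps keys in its neighborhood.
public class Ring {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node)    { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    String ownerOf(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() // wrap around the ring
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) { return s.hashCode() & 0x7fffffff; } // toy hash

    public static void main(String[] args) {
        Ring ring = new Ring();
        ring.addNode("nodeA"); ring.addNode("nodeB"); ring.addNode("nodeC");
        System.out.println(ring.ownerOf("user:42")); // same owner every lookup
    }
}
```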
Cassandra @ Facebook
• Inbox search
• ca. 2009 - 50 TB data, 150 nodes, 2 datacenters
• Performance (production)