1. MEGASTORE: Providing Scalable, Highly Available Storage for Interactive Services
Guided By: Prof. Kong Li
Presented By: Team 1
Anumeha Shah (009423973)
Ankita Kapratwar (009413469)
Swapna Kulkarni (009264905)
2. What is Megastore
● Megastore combines the scalability and availability of a NoSQL datastore
with the ACID semantics of an RDBMS in an innovative way, so that it meets
the requirements of interactive online services. It provides both the high
consistency and the high availability that neither a NoSQL datastore nor an
RDBMS can provide alone.
● Megastore uses Paxos, a replication and consensus algorithm, to achieve
high availability with low latency.
● It partitions the data at a fine granularity and provides ACID semantics
within each partition, while replicating over wide-area networks with low
latency.
3. Why Megastore
Online interactive services require high availability as well as high
consistency.
● Online services are growing rapidly, as is the number of potential users.
● More and more desktop applications are moving to the cloud.
● Opposing storage requirements are arising, which makes storage design
challenging.
The opposing requirements are:
● Applications should be scalable.
● Services should be responsive.
● Users should have a consistent view of the data.
● Services should be highly available, staying up 24/7.
4. Approach to Provide High Availability and Consistency
Two approaches have been taken:
1. A synchronous, fault-tolerant log replicator provides availability.
2. For scalability, the data is partitioned into many small databases, each
with its own replicated log.
Replication for High Availability
Need for replication:
● Replication is needed for high availability.
● Replication within a data center overcomes host-specific failures.
● But to overcome datacenter-wide failures and regional disasters, the data
must be replicated across geographically distributed datacenters.
5. Common Replication Strategies and Issues
Asynchronous master/slave:
● Write-ahead log entries are replicated by the master node to at least one
slave.
● Log appends are acknowledged at the master in parallel with their
transmission to slaves.
● However, if the master fails, there is downtime until a slave becomes
master, and data loss can occur.
Synchronous master/slave:
● Changes on master and slaves are made synchronously: the master
acknowledges a change only once it has been mirrored to the slaves.
● This approach prevents data loss on failover from master to slave.
● However, failures need timely detection by an external system, because a
failed master can cause high latency and user-visible outages.
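The essential difference is when the master acknowledges a write. A minimal
Python sketch contrasting the two acknowledgement disciplines (illustrative
only, not Megastore code; all class and method names here are ours):

import queue
import threading

class AsyncMaster:
    """Acknowledges immediately; ships log entries to the slave in the background."""
    def __init__(self, slave_log):
        self.log = []
        self.outbox = queue.Queue()
        threading.Thread(target=self._ship, args=(slave_log,), daemon=True).start()
    def _ship(self, slave_log):
        while True:
            slave_log.append(self.outbox.get())   # slave may lag the master
    def write(self, entry):
        self.log.append(entry)
        self.outbox.put(entry)
        return "ack"   # entries still in the outbox are lost if the master dies

class SyncMaster:
    """Acknowledges only after the slave has mirrored the entry."""
    def __init__(self, slave_log):
        self.log = []
        self.slave_log = slave_log
    def write(self, entry):
        self.log.append(entry)
        self.slave_log.append(entry)   # blocking mirror before the ack
        return "ack"                   # no data loss on failover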
6. Common Replication Strategies and Issues Cont..
Optimistic Replication:
● There is no master.
● Any member can accept changes, and the changes propagate through the
group asynchronously. This approach provides high availability and
excellent latency.
● However, transactions are not possible, because the global ordering of
mutations is not known at commit time.
7. Use of Paxos for Replication
● Paxos is a fault-tolerant consensus algorithm.
● There is no master, but a group of homogeneous peers.
● A write-ahead log can be replicated over all the peers.
● Any of the peers can initiate reads and writes.
● A change is appended to the log only if a majority of the peers
acknowledge it (see the sketch below).
● The peers that did not acknowledge the change catch up eventually.
● There is no distinguished failed state.
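Real Paxos runs prepare/accept rounds with proposal numbers; the sketch
below (our names, Python, illustrative only) shows just the majority-quorum
rule the bullets describe:

class Peer:
    def __init__(self, reachable=True):
        self.reachable = reachable
        self.log = []                 # this peer's copy of the write-ahead log
    def accept(self, entry):
        if self.reachable:
            self.log.append(entry)
            return True
        return False                  # a missed entry is caught up later

def replicate(peers, entry):
    """An append commits only with acknowledgements from a majority."""
    acks = sum(p.accept(entry) for p in peers)
    return acks > len(peers) // 2     # majority quorum

peers = [Peer(), Peer(), Peer(reachable=False)]
assert replicate(peers, "write-1")    # 2 of 3 acks: the append commits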
8. Use of Paxos for Replication Cont..
Issues with the Paxos replication strategy:
● With only a single replicated log over a wide area, writes suffer high
latencies, which limits throughput.
● What if none of the replicas is up to date?
● What if a majority of the replicas does not acknowledge a write?
Solution:
● Partition the data.
● Use multiple replicated logs.
● Each partition of the data gets its own replicated log, as sketched below.
● Log replication among the data centers is synchronous.
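Building on the Peer/replicate sketch above, giving each partition its own
replica group and log keeps wide-area consensus per partition (peers below
maps a partition key to its replica group; the names are ours):

logs = {}   # partition key -> committed log entries for that partition

def commit(partition, entry, peers):
    # Synchronous quorum replication, but only within this partition's log.
    if replicate(peers[partition], entry):
        logs.setdefault(partition, []).append(entry)
        return True
    return False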
11. Partitioning for Scalability and Consistency
● Partition the data into entity groups.
● Each entity group is replicated across data centers synchronously and
independently.
● Within each datacenter, the data is stored in a NoSQL datastore.
● Within an entity group, changes are made with single-phase ACID
transactions.
● Across entity groups, changes are made with two-phase commit or, more
commonly, asynchronous messaging (sketched below).
● Entity groups involved in such operations are logically distant, not
physically distant, so operations across different entity groups are
usually local.
● The traffic between the data centers is only for synchronous replication.
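A sketch of the two write paths just described, reusing the illustrative
commit helper above (the queued-message format is an assumption of ours):

def update(group, mutation, peers):
    # Single-phase ACID write through the group's own replicated log.
    return commit(group, mutation, peers)

def cross_group_update(src_group, dst_group, mutation, peers):
    # The sending group commits a queued message in its own transaction;
    # the receiving group consumes and applies it later in a transaction
    # of its own (asynchronous messaging between entity groups).
    return commit(src_group, ("send", dst_group, mutation), peers)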
12. Physical Layout
How to select entity group boundaries:
● Groups should not be too fine-grained, as that requires excessive
cross-group operations; but a group should also not contain a large
number of unrelated entities, as their unrelated writes would then be
serialized through one log.
Physical Layout
● Google's Bigtable is used as the storage system; it is fault tolerant and
scalable.
● Applications keep the data near the users, in the region where it is
accessed the most, and keep replicas near each other to lower latency.
Data that is accessed together is kept close together, or within the same
Bigtable row.
● Caches are implemented for low latency.
13. Data Model Overview
● The data model lies between the abstract tuples of an RDBMS and
concrete row-column storage.
● Schema => set of tables => tables contain entities => entities contain
properties.
● An entity group consists of a root entity along with all the entities in child
tables that reference it, as sketched below.
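The paper's running example is a photo application. A minimal Python
rendering of it (dataclasses stand in for Megastore's SQL-like schema
language; the exact field choices are approximate):

from dataclasses import dataclass

@dataclass
class User:           # root table
    user_id: int      # the entity group key
    name: str

@dataclass
class Photo:          # child table: references the root's key
    user_id: int      # same user_id -> same entity group as the User
    photo_id: int
    time: int
    tag: str
    url: str

# One user plus all of that user's photos form a single entity group:
alice = User(1, "Alice")
p = Photo(user_id=1, photo_id=42, time=1700000000, tag="sunset", url="...")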
15. Indexes
● An index can be declared on any property.
● Local index: used to find data within an entity group.
● Global index: used to find entities without knowing in advance which
entity groups contain them (contrasted with local indexes in the sketch
below).
● STORING clause: applications store additional properties from the primary
table in the index, for faster access at read time.
● Repeated indexes: index repeated properties, with one index entry per
repeated value.
● Inline indexes: extract slices of information from child entities and store
them in the parent's data for fast access; they can implement
many-to-many links.
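A sketch of the local/global distinction (plain dicts and our own names, not
Megastore's API):

local_index = {}    # (entity group key, property value) -> [entity primary keys]
global_index = {}   # property value -> [(entity group key, entity primary key)]

def find_local(group_key, value):
    # Scoped to one entity group; read consistently with that group's log.
    return local_index.get((group_key, value), [])

def find_global(value):
    # No entity group needed up front; entries may lag recent updates,
    # since global indexes are not kept synchronously up to date.
    return global_index.get(value, [])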
16. Mapping to Bigtable
● The Bigtable column name is the Megastore table name concatenated
with the property name.
● The root entity's Bigtable row stores the transaction and replication
metadata for the group, including its log.
● Keeping the metadata in the same row allows it to be updated atomically
through a single Bigtable transaction.
● Index entry: represented as a single Bigtable row. Row key = indexed
property values + primary key of the indexed entity.
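A sketch of the two naming rules above (tuples stand in for Bigtable's
concatenated byte-string row keys; the example values echo the photo app):

def column_name(table, prop):
    # Bigtable column name = Megastore table name + property name.
    return f"{table}.{prop}"

def index_row_key(indexed_values, primary_key):
    # Index entry row key = indexed property values + primary key
    # of the indexed entity.
    return tuple(indexed_values) + tuple(primary_key)

print(column_name("Photo", "tag"))          # Photo.tag
print(index_row_key(["sunset"], [1, 42]))   # ('sunset', 1, 42)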
17. Transactions and Concurrency Control
● Each entity group functions as a mini-database.
● A transaction first writes its mutations into the group's write-ahead log;
the mutations are then applied to the data.
● Multiple values can be stored in the same row/column pair with different
timestamps.
● This provides multiversion concurrency control (MVCC).
● Readers and writers don't block each other.
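A minimal MVCC sketch (our Cell class, not Bigtable's API): each cell keeps
timestamped versions, so a reader pinned to a timestamp is unaffected by
newer writes:

class Cell:
    def __init__(self):
        self.versions = []                   # (timestamp, value), appended in order

    def write(self, ts, value):
        self.versions.append((ts, value))    # writers never overwrite old versions

    def read(self, ts):
        # Latest version at or before ts; newer, in-flight writes are invisible.
        visible = [v for t, v in self.versions if t <= ts]
        return visible[-1] if visible else None

c = Cell()
c.write(10, "a"); c.write(20, "b")
assert c.read(15) == "a"    # a reader at ts=15 never blocks on the ts=20 writer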
18. Cont..
Reads-
a. Current: ensure that all committed writes are applied first, then read
from the latest committed transaction.
b. Snapshot: read from the last committed transaction known to be fully
applied, without waiting for in-flight writes.
c. Inconsistent: ignore the state of the log and read the latest values.
Writes-
A write begins with a current read to determine the next available log
position. The commit operation gathers the mutations into a log entry,
assigns it a timestamp higher than any previous one, and appends it to the
log using Paxos.
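The three read modes, sketched against the Cell class above (GroupStub and
its attributes are assumptions standing in for a group's replication state):

class GroupStub:
    last_committed, last_applied = 7, 5      # log positions (illustrative)
    def timestamp_of(self, pos): return pos  # pretend position == timestamp
    def apply_up_to(self, pos): pass         # stand-in for catching up the log

def current_read(group, cell):
    group.apply_up_to(group.last_committed)  # apply all committed writes first
    return cell.read(group.timestamp_of(group.last_committed))

def snapshot_read(group, cell):
    # Read at the last fully applied transaction; never waits.
    return cell.read(group.timestamp_of(group.last_applied))

def inconsistent_read(cell, now_ts):
    return cell.read(now_ts)                 # ignores the log state entirely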
19. Transaction Lifecycle
1. Read: obtain the timestamp and log position of the last committed
transaction.
2. Application logic: read from Bigtable and gather writes into a log entry.
3. Commit: use Paxos to append that entry to the log.
4. Apply: write the mutations to the entities and indexes.
5. Clean up: delete data that is no longer required.
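The same five steps end to end, reusing the illustrative replicate and Cell
sketches from earlier slides (the group object and its fields remain
assumptions of ours):

def run_transaction(group, peers, cell, make_mutation):
    # 1. Read: timestamp and log position of the last committed transaction.
    pos = group.last_committed
    ts = group.timestamp_of(pos)
    # 2. Application logic: read current state, gather writes into a log entry.
    entry = make_mutation(cell.read(ts))
    # 3. Commit: append the entry at the next log position via Paxos.
    if not replicate(peers, (pos + 1, entry)):
        raise RuntimeError("lost the race for the log position; retry")
    # 4. Apply: write the mutations to the entities and indexes.
    cell.write(ts + 1, entry)
    # 5. Clean up: applied log entries can be garbage-collected later.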
20. Replication
● Reads and writes can be initiated from any replica.
● Replication is done per entity group, by synchronously replicating the
group's transaction log to a quorum of replicas.
● Read guarantees:
o A read always observes the last-acknowledged write.
o After a write has been observed, all future reads observe that write.
27. Coordinator Availability
Failure Detection
● Google's Chubby lock service is used
● Writers are insulated from coordinator failure by testing whether a
coordinator has lost its locks
Validation Races
● Races between validates for earlier writes and invalidates for later writes
are resolved by always sending the log position associated with each
action; the coordinator acts only on the freshest position it has seen, as
sketched below.
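A sketch of that log-position guard (our Coordinator class, illustrative
only): each message carries its log position, and the coordinator ignores
anything older than what it has already acted on:

class Coordinator:
    def __init__(self):
        self.highest = {}   # entity group -> highest log position acted on
        self.valid = {}     # entity group -> local replica believed up to date?

    def _fresh(self, group, pos):
        # Drop any message older than one already processed for this group.
        if pos < self.highest.get(group, -1):
            return False
        self.highest[group] = pos
        return True

    def invalidate(self, group, pos):   # from a writer whose write didn't apply here
        if self._fresh(group, pos):
            self.valid[group] = False

    def validate(self, group, pos):     # from a reader once the replica caught up
        if self._fresh(group, pos):
            self.valid[group] = True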