Copyright © 2017 Yahoo Japan Corporation. All Rights Reserved.
Sep. 19. 2017 WebDB Forum Tokyo
1
Yasuharu Goto
Dragon: A Distributed Object Storage @Yahoo! JAPAN
(English Ver.)
About me
• Yasuharu Goto
• Yahoo! JAPAN (2008-)
• Software Engineer
• Storage, Distributed Database Systems (Cassandra)
• Twitter: @ono_matope
• Lang: Go
2
Agenda
• About Dragon
• Architecture
• Issues and Future Plans
3
Dragon
Object Storage
• What is Object Storage?
• A storage architecture that manages files not as files but as objects.
• Instead of providing features like file hierarchy, it provides high availability and scalability.
• (Typically) provides a REST API, so it can be used easily by applications.
• Popular products
• AWS: Amazon S3
• GCP: Google Cloud Storage
• Azure: Azure Blob Storage
• An essential component for modern web development.
5
Dragon
• A distributed Object Storage developed at Yahoo! JAPAN.
• Design Goals:
• High { performance, scalability, availability, cost efficiency }
• Written in Go
• Released in Jan/2016 (20 months in production)
• Scale
• deployed in 2 data centers in Japan
• Stores 20 billion objects / 11 PB of data.
6
Use Cases
• 250+ users in Y!J
• Various usages
• media content
• user data, log storage
• backend for Presto (experimental)
7
• Yahoo! Auction (image)
• Yahoo! News/Topics (image)
• Yahoo! Display Ad Network (image/video)
• Yahoo! Blog (image)
• Yahoo! Smartphone Themes (image)
• Yahoo! Travel (image)
• Yahoo! Real Estate (image)
• Yahoo! Q&A (image)
• Yahoo! Reservation (image)
• Yahoo! Politics (image)
• Yahoo! Game (contents)
• Yahoo! Bookstore (contents)
• Yahoo! Box (user data)
• Netallica (image)
• etc...
S3 Compatible API
• Dragon provides an S3 compatible API
• aws-sdk, aws-cli, CyberDuck...
• Implemented
• Basic S3 API (Service, Bucket, Object, ACL...)
• SSE (Server Side Encryption)
• TODO
• Multipart Upload API (to upload large objects up to 5TB)
• and more...
8
Performance (vs. Riak CS, for reference)
• Dragon: API*1, Storage*3, Cassandra*3
• Riak CS: haproxy*1, stanchion*1, Riak (KV+CS)*3
• Same Hardware except for Cassandra and Stanchion.
9
[Charts: GET Object 10KB Throughput and PUT Object 10KB Throughput (Requests/sec vs. # of Threads), Riak CS vs. Dragon]
Why?
Why did we build a new Object Storage?
• Octagon (2011-2017)
• Our 1st Generation Object Storage
• Up to 7 PB / 7 Billion Objects / 3,000 Nodes at a time
• used for personal cloud storage service, E-Book, etc...
• Problems of Octagon
• Low performance
• Unstable
• Expensive TCO
• Hard to operate
• We started to consider alternative products.
11
Requirements
• Our requirements
• High performance, sufficient for our services
• High scalability, to respond to rapid increases in data demand
• High availability, with low operation cost
• High cost efficiency
• Mission
• To establish a company-wide storage infrastructure
12
Alternatives
• Existing Open Source Products
• Riak CS
• Some of our products introduced it, but it did not meet our performance requirement.
• OpenStack Swift
• Concerns about performance degradation as the object count increases.
• Public Cloud Providers
• cost inefficient
• We mainly provide our services from our own data centers.
We needed a highly scalable storage system that runs on-premises.
13
Alternatives
14
OK, let’s make it by ourselves!
Architecture
Architecture Overview
• Dragon consists of 3 components: API Nodes, Storage Cluster and MetaDB.
• API Node
• Provides the S3 compatible API and serves all user requests.
• Storage Node
• HTTP file servers that store BLOBs of uploaded objects.
• 3 nodes make up a VolumeGroup. BLOBs in each group are periodically synchronized.
• MetaDB
• Apache Cassandra cluster
• Stores the metadata of uploaded objects, including the location of each object's BLOB.
16
Architecture
17
[Diagram: API Nodes accept HTTP (S3 API) requests, store BLOBs in the Storage Cluster (VolumeGroup 01: StorageNodes 1-3, VolumeGroup 02: StorageNodes 4-6, each node with HDD1/HDD2) and metadata in the MetaDB]
Architecture
18
API and Storage Nodes are written in Go.
[Diagram: same architecture as the previous slide: API Nodes handle HTTP (S3 API), BLOBs go to the Storage Cluster, metadata to the MetaDB]
Architecture
19
[Diagram: API Nodes, Storage Cluster (VolumeGroups 01 and 02), MetaDB]
API Nodes periodically fetch and cache the VolumeGroup configuration from the MetaDB.

VolumeGroup configuration:
id | hosts               | volumes
01 | node1, node2, node3 | HDD1, HDD2
02 | node4, node5, node6 | HDD1, HDD2
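To make the mapping concrete, here is a minimal Go sketch (hypothetical type and function names, not Dragon's actual code) of how an API Node could cache this VolumeGroup configuration and pick a group for a new upload:

```go
package apinode

import (
	"math/rand"
	"sync"
	"time"
)

// VolumeGroup mirrors one row of the VolumeGroup configuration table.
type VolumeGroup struct {
	ID      string   // e.g. "01"
	Hosts   []string // e.g. ["node1", "node2", "node3"]
	Volumes []string // e.g. ["HDD1", "HDD2"]
}

// GroupCache is the configuration an API Node keeps in memory.
type GroupCache struct {
	mu     sync.RWMutex
	groups []VolumeGroup
}

// Refresh periodically reloads the configuration with the supplied fetch
// function (e.g. a SELECT against the MetaDB) and swaps it in atomically.
func (c *GroupCache) Refresh(fetch func() ([]VolumeGroup, error), every time.Duration) {
	for range time.Tick(every) {
		if groups, err := fetch(); err == nil {
			c.mu.Lock()
			c.groups = groups
			c.mu.Unlock()
		}
	}
}

// Pick returns a random VolumeGroup, as used when placing a new BLOB.
func (c *GroupCache) Pick() VolumeGroup {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.groups[rand.Intn(len(c.groups))]
}
```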
Upload
20
[Diagram: PUT bucket1/sample.jpg → API Nodes → ① HTTP PUT of the BLOB to the StorageNodes of VolumeGroup 01 → ② metadata (key: bucket1/sample.jpg, size: 1024 bytes, blob: volumegroup01/hdd1/...) written to the MetaDB]
1. When a user uploads an object, the API Node first randomly picks a VolumeGroup and transfers the object's BLOB to the nodes in that VolumeGroup using HTTP PUT.
2. It then stores the metadata, including the BLOB location, in the MetaDB.
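A rough Go sketch of this upload path (the APINode fields, table name, column names and path layout are illustrative assumptions; error cleanup and quorum handling are simplified):

```go
package apinode

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/gocql/gocql"
)

// APINode sketches the state an API Node needs for an upload.
type APINode struct {
	groups  *GroupCache    // cached VolumeGroup configuration (see the sketch above)
	session *gocql.Session // connection to the MetaDB (Cassandra)
}

// Upload picks a random VolumeGroup, PUTs the BLOB to its StorageNodes and
// records the metadata in the MetaDB.
func (a *APINode) Upload(bucket, key string, body []byte) error {
	vg := a.groups.Pick()
	blob := fmt.Sprintf("volumegroup%s/hdd1/%s/%s", vg.ID, bucket, key)

	// ① Transfer the BLOB to each StorageNode of the chosen VolumeGroup.
	for _, host := range vg.Hosts {
		req, err := http.NewRequest(http.MethodPut,
			"http://"+host+"/"+blob, bytes.NewReader(body))
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err // quorum handling omitted; see the Eventual Consistency slides
		}
		resp.Body.Close()
	}

	// ② Store the metadata, clustered by a time-based UUID (see the MetaDB slides).
	return a.session.Query(
		`INSERT INTO object (bucket, key, mtime, status, blob, size) VALUES (?, ?, ?, ?, ?, ?)`,
		bucket, key, gocql.TimeUUID(), "ACTIVE", blob, len(body),
	).Exec()
}
```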
Download
21
[Diagram: GET bucket1/sample.jpg → API Nodes → ① metadata (key: bucket1/sample.jpg, size: 1024 bytes, blob: volumegroup01/hdd1/...) retrieved from the MetaDB → ② HTTP GET of the BLOB from a StorageNode in VolumeGroup 01]
1. When a user downloads an object, the API Node retrieves its metadata from the MetaDB.
2. It then sends an HTTP GET to a StorageNode holding the BLOB, based on the metadata, and transfers the response to the user.
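And the matching download path, again as an illustrative sketch (the host-resolution helper and the schema names follow the assumptions above):

```go
package apinode

import (
	"fmt"
	"io"
	"net/http"

	"github.com/gocql/gocql"
)

// Download looks up the newest metadata row for (bucket, key) and streams the
// BLOB from a StorageNode to the caller.
func Download(session *gocql.Session, w http.ResponseWriter, bucket, key string) error {
	var status, blob string

	// ① The first row of the partition is the current state of the object.
	if err := session.Query(
		`SELECT status, blob FROM object WHERE bucket = ? AND key = ? LIMIT 1`,
		bucket, key,
	).Scan(&status, &blob); err != nil {
		return err
	}
	if status == "DELETED" {
		return fmt.Errorf("%s/%s is logically deleted", bucket, key)
	}

	// ② Fetch the BLOB and transfer the response to the user.
	resp, err := http.Get("http://" + storageHostFor(blob) + "/" + blob)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(w, resp.Body)
	return err
}

// storageHostFor is a placeholder; the real code would pick a healthy node of
// the VolumeGroup recorded in the BLOB location.
func storageHostFor(blob string) string { return "storagenode1" }
```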
Failure Recovery
22
[Diagram: API Nodes, MetaDB, and VolumeGroups 01 and 02 (StorageNodes 1-6, each with HDD1/HDD2)]
When a Hard Disk fails...
Failure Recovery
23
[Diagram: HDD1 of StorageNode 1 in VolumeGroup 01 has failed and is replaced]
The drive is replaced, and the data that should be on it is recovered by transferring it from the other StorageNodes in the VolumeGroup.
Scaling out
24
When you add capacity to the cluster...
[Diagram: API Nodes, MetaDB, and VolumeGroups 01 and 02 (StorageNodes 1-6)]

VolumeGroup configuration:
id | hosts               | volumes
01 | node1, node2, node3 | HDD1, HDD2
02 | node4, node5, node6 | HDD1, HDD2
Scaling out
25
• ... simply set up a new set of StorageNodes and update the VolumeGroup configuration.
[Diagram: a new VolumeGroup 03 (StorageNodes 7-9) is added alongside VolumeGroups 01 and 02]

VolumeGroup configuration:
id | hosts               | volumes
01 | node1, node2, node3 | HDD1, HDD2
02 | node4, node5, node6 | HDD1, HDD2
03 | node7, node8, node9 | HDD1, HDD2
Why not Consistent Hash?
• Dragon’s distributed architecture is based on mapping managed by the DB.
• Q. Why not Consistent Hash?
26
quoted from: http://docs.basho.com/riak/kv/2.2.3/learn/concepts/clusters/
• Consistent Hash
• Data is distributed uniformly by hash of key
• Used by many existing distributed systems
• e.g. Riak CS, OpenStack Swift
• No need for external DB to manage the map
Why not Consistent Hash?
• A. To be able to add storage capacity without rebalancing.
• Rebalancing heavily consumes disk I/O and network bandwidth, and often takes a long time.
• e.g. Adding 1 node to a fully utilized cluster of 10 nodes × 720 TB requires transferring 655 TB; at 2 Gbps that takes about 30 days.
• Scaling a hash-based cluster of such large nodes to more than 1,000 nodes is very challenging.
27
(720 TB × 10 nodes) / 11 nodes ≈ 655 TB must be transferred to the new node
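The arithmetic behind those figures, as a tiny Go program (the 2 Gbps effective transfer rate is the assumption used on the slide):

```go
package main

import "fmt"

// Rough estimate of the data movement caused by adding one node to a fully
// utilized consistent-hash cluster of 10 nodes x 720 TB, at an assumed 2 Gbps.
func main() {
	const (
		nodeTB   = 720.0
		nodes    = 10.0
		linkGbps = 2.0
	)
	moveTB := nodeTB * nodes / (nodes + 1) // data the new node must receive: ~655 TB
	bits := moveTB * 1e12 * 8              // terabytes -> bits
	seconds := bits / (linkGbps * 1e9)
	fmt.Printf("move %.0f TB, about %.0f days at %.0f Gbps\n",
		moveTB, seconds/86400, linkGbps)
	// prints: move 655 TB, about 30 days at 2 Gbps
}
```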
Other Pros/Cons
• Pros
• We can scale out MetaDB and BLOB Storage independently.
• Backend Storage Engine is pluggable.
• We can easily add or change the storage technology/class in the future
• Cons
• We need an external database to manage the map
• BLOB load would be non-uniform
• We’ll rebalance periodically.
28
Storage Node
Storage Hardware
• High density Storage Servers for cost efficiency
• We need to make use of the full potential of the hardware.
30
https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR90L.cfm
Storage Configuration
• HDDs are configured as independent logical volumes instead of RAID
• Reason 1: To reduce time to recover when HDDs fail.
31
[Diagram: a VolumeGroup of 3 StorageNodes, each with HDD1-HDD4 configured as independent volumes]
Storage Configuration
• Reason 2: RAID is slow for random access.
32
Configuration | Requests per sec
Non-RAID      | 178.9
RAID 0        | 73.4
RAID 5        | 68.6
Throughput for a random access workload, served by Nginx, 4 HDDs, file size: 500 KB.
Non-RAID is 2.4x faster than RAID 0.
File Persistence Strategy
• Dragon’s Storage Nodes use one file per BLOB.
• A strategy to increase robustness by using a stable filesystem (ext4).
• However, it is known that filesystems cannot handle large numbers of files well.
• It is reported that Swift's write performance degrades as the number of files increases.
• To overcome this problem, Dragon uses a unique technique.
33
ref.1: "Operating image storage with OpenStack Swift" (in Japanese) http://labs.gree.jp/blog/2014/12/11746/
ref.2: "A view from the image system" (CyberAgent official engineering blog, in Japanese) http://ameblo.jp/principia-ca/entry-12140148643.html
File Persistence Strategy
• Typical approach: write files evenly into directories that are created in advance
• Swift writes files in this manner.
• As the number of files increases, the number of seeks increases and the write throughput decreases.
• The cost of updating dentries increases.
34
[Diagram: a hash function distributes photo1.jpg-photo4.jpg across 256 pre-created directories (01, 02, 03, ..., fe, ff)]
Seek count and throughput when randomly writing 3 million files into 256 directories.
Implemented as a simple HTTP server; measured with ab, blktrace and seekwatcher.
Dynamic Partitioning
• Dynamic Partitioning Approach
1. Create sequentially numbered directories (partitions). API Nodes upload files into the latest directory.
2. Once the number of files in the partition reaches a threshold (1000 here), the Storage Node creates the next partition and informs the API Nodes about it.
• The number of files per directory is kept roughly constant by adding new directories as needed.
35
When the number of files per directory exceeds approximately 1000, Dragon creates the next directory and uploads there.
[Diagram: partition 0 reaches 1000 files, so partition 1 is created; when it fills up, partition 2 is created, and so on]
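A minimal Go sketch of this rollover logic on a Storage Node (the threshold comes from the slide; the code itself is illustrative, not Dragon's implementation):

```go
package storagenode

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

const filesPerPartition = 1000 // threshold from the slide

// Partitioner writes each BLOB into the latest sequentially numbered directory
// and rolls over to a new directory once the threshold is reached.
type Partitioner struct {
	mu      sync.Mutex
	root    string // e.g. "/data/hdd1"
	current int    // latest partition number
	count   int    // files written to the current partition
}

// Write stores one BLOB and returns the path it was written to.
func (p *Partitioner) Write(name string, data []byte) (string, error) {
	p.mu.Lock()
	if p.count >= filesPerPartition {
		p.current++ // create the next partition; API Nodes would be informed of this
		p.count = 0
	}
	dir := filepath.Join(p.root, fmt.Sprintf("%d", p.current))
	p.count++
	p.mu.Unlock()

	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	path := filepath.Join(dir, name)
	return path, os.WriteFile(path, data, 0o644)
}
```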
Dynamic Partitioning
36
• Comparison with the hash strategy. Green is Dynamic Partitioning.
• Even as the file count increases, the seek count does not increase and throughput remains stable.
Writing Files in Hash Based Strategy (blue) and Dynamic Partitioning (green)
Microbenchmark
Confirmed that write throughput is maintained up to 10 million files on a single HDD.
37
Write throughput when creating up to 10 million files.
We synced and dropped caches after every 100,000 files created.
Eventual Consistency
• To achieve high availability, writing to Storage Nodes uses eventual consistency with Quorum.
• Uploads succeed if writing to the majority of 3 nodes is successful.
• Anti-Entropy Repair process synchronizes failed nodes periodically.
38
[Diagram: an API Node writes a BLOB to the 3 StorageNodes of VolumeGroup 01 and returns OK once a majority of the writes succeed]
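A sketch of the quorum rule in Go (hypothetical helper; an upload is acknowledged once a majority of the three writes succeed, and failed replicas are later fixed by Anti-Entropy Repair):

```go
package apinode

import "fmt"

// putBlobQuorum sends the BLOB to every node of the VolumeGroup and succeeds
// when a majority of the writes succeed; failed nodes are repaired later.
func putBlobQuorum(hosts []string, put func(host string) error) error {
	errs := make(chan error, len(hosts))
	for _, h := range hosts {
		go func(h string) { errs <- put(h) }(h)
	}
	ok := 0
	for range hosts {
		if <-errs == nil {
			ok++
		}
	}
	if ok*2 > len(hosts) { // majority, e.g. 2 of 3
		return nil
	}
	return fmt.Errorf("quorum not reached: %d/%d writes succeeded", ok, len(hosts))
}
```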
Anti-Entropy Repair
• Anti-Entropy Repair
• A process that compares data between nodes, detects data that has not been replicated, and recovers consistency.
39
[Diagram: Nodes A, B and C should each hold file1-file4; one node is missing file3, which is detected and transferred to it]
Anti-Entropy Repair
• Detects and corrects inconsistencies between Storage Nodes at the partition level.
1. Calculate the hash of the names of the files in a partition.
2. Compare the hashes between nodes in a VolumeGroup. There are inconsistencies if the hashes do not match.
3. If the hashes do not match, compare the files in the partition and transfer the missing files.
• The comparison process is I/O efficient because the hashes can be cached and updates are concentrated in the latest partition.
40
[Diagram: node1, node2 and node3 each keep a hash per partition (01, 02, 03) of HDD2; the hashes for partition 03 do not match because node1 is missing file1002.data, so file1002.data is transferred to node1]
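A sketch of the per-partition hash from step 1, in Go (the hash function and helpers are assumptions; any stable digest of the sorted file names would do):

```go
package storagenode

import (
	"crypto/sha1"
	"fmt"
	"os"
	"sort"
)

// partitionHash returns a digest of the file names in one partition directory.
// If two nodes return the same digest for a partition, its contents match; if
// not, the file lists are compared and the missing files are transferred.
func partitionHash(dir string) (string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return "", err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	sort.Strings(names) // order-independent comparison between nodes

	h := sha1.New()
	for _, n := range names {
		h.Write([]byte(n))
		h.Write([]byte{0}) // separator so "ab"+"c" differs from "a"+"bc"
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}
```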
MetaDB
Cassandra
• Apache Cassandra
• High Availability
• Linear Scalability
• Low operation cost
• Eventual Consistency
• Cassandra does not support ACID transactions
42
Cassandra
• Tables
• VolumeGroup
• Account
• Bucket
• Object
• ObjectIndex
43
Object Table
• Object Table
• Table to retain Object Metadata
• size, BLOB location, ACL, Content-Type...
• Distributed evenly within the cluster by the partition key which is composed of (bucket, key).
44
bucket | key        | mtime    | status | metadata...
b1     | photo1.jpg | uuid(t2) | ACTIVE | {size, location, acl...}
b1     | photo2.jpg | uuid(t1) | ACTIVE | {size, location, acl...}
b3     | photo1.jpg | uuid(t3) | ACTIVE | {size, location, acl...}
((bucket, key) = Partition Key)
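For illustration, the Object table described above could be defined roughly like this in CQL, shown here as a Go constant to keep the examples in one language (the exact schema, types and names are assumptions, not Dragon's actual definition):

```go
package metadb

// objectTableCQL sketches a schema matching the slide: partition key
// (bucket, key), clustered by mtime (a time-based UUID) in descending order,
// so the newest version of an object is always the first row of its partition.
const objectTableCQL = `
CREATE TABLE object (
    bucket   text,
    key      text,
    mtime    timeuuid,
    status   text,          -- ACTIVE or DELETED
    metadata text,          -- size, BLOB location, ACL, Content-Type, ...
    PRIMARY KEY ((bucket, key), mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);
`
```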
PUT Object
• Updating metadata
• Within each partition, metadata is clustered in descending order by UUIDv1 based on creation time.
• When an object is overwritten, the metadata of the latest version is inserted at the top of the partition.
• Since we keep records of multiple versions, no inconsistency occurs even if the object is overwritten concurrently.
45
Before:
bucket | key        | mtime (Clustering Column) | status | metadata...
b1     | photo2.jpg | uuid(t1)                  | ACTIVE | {size, location, acl...}

PUT b1/photo2.jpg (time: t4)
PUT b1/photo2.jpg (time: t5)

After:
bucket | key        | mtime (Clustering Column) | status | metadata...
b1     | photo2.jpg | uuid(t5)                  | ACTIVE | {size, location, acl...}
       |            | uuid(t4)                  | ACTIVE | {size, location, acl...}
       |            | uuid(t1)                  | ACTIVE | {size, location, acl...}

photo2.jpg reaches consistency (t5 wins).
GET Object
• Retrieving Metadata
• Retrieve the first row of the partition with a SELECT query
• Since the partition is sorted by creation time, the first row always indicates the current state of the object.
46
bucket | key        | mtime    | status | metadata...
b1     | photo1.jpg | uuid(t5) | ACTIVE | {size, location, acl...}
       |            | uuid(t3) | ACTIVE | {size, location, acl...}
b1     | photo2.jpg | uuid(t1) | ACTIVE | {size, location, acl...}
((bucket, key) = Partition Key, mtime = Clustering Column)

SELECT * FROM object WHERE bucket = 'b1' AND key = 'photo1.jpg' LIMIT 1;
→ returns the uuid(t5) row (time: t5)
DELETE Object
• Requesting deletion of an object
• Insert a row with DELETED status instead of deleting the rows immediately.
47
bucket | key        | mtime    | status  | metadata...
b1     | photo1.jpg | uuid(t5) | ACTIVE  | {size, location, acl...}
       |            | uuid(t3) | ACTIVE  | {size, location, acl...}
b1     | photo2.jpg | uuid(t7) | DELETED | N/A
       |            | uuid(t1) | ACTIVE  | {size, location, acl...}
((bucket, key) = Partition Key, mtime = Clustering Column)

DELETE b1/photo2.jpg (time: t7)
GET Object (deleted)
• Retrieving metadata (when the object has been deleted)
• If the latest retrieved row has DELETED status, the object is considered logically deleted and an error is returned
48
bucket | key        | mtime    | status  | metadata...
b1     | photo1.jpg | uuid(t5) | ACTIVE  | {size, location, acl...}
       |            | uuid(t3) | ACTIVE  | {size, location, acl...}
b1     | photo2.jpg | uuid(t7) | DELETED | N/A
       |            | uuid(t1) | ACTIVE  | {size, location, acl...}
((bucket, key) = Partition Key, mtime = Clustering Column)

SELECT * FROM object WHERE bucket = 'b1' AND key = 'photo2.jpg' LIMIT 1;
→ returns the uuid(t7) DELETED row (time: t7)
Object Garbage Collection
• Garbage Collection (GC)
• Periodically deletes metadata and the linked BLOBs of overwritten or deleted Objects.
• Full scan of Object table
• The second and subsequent rows of each partition are garbage; GC deletes them.
49
bucket | key        | mtime    | status  | metadata...
b1     | photo1.jpg | uuid(t5) | ACTIVE  | {size, location, acl...}
       |            | uuid(t3) | ACTIVE  | {size, location, acl...}   <- garbage
b1     | photo2.jpg | uuid(t7) | DELETED | N/A
       |            | uuid(t3) | ACTIVE  | {size, location, acl...}   <- garbage
       |            | uuid(t1) | ACTIVE  | {size, location, acl...}   <- garbage
((bucket, key) = Partition Key, mtime = Clustering Column)

GC performs a full scan of the Object table and uploads 0-byte tombstone files to delete the BLOBs.
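A rough Go sketch of the GC pass over a single partition, following the schema sketch above (column and helper names are assumptions; the real GC scans the whole table):

```go
package metadb

import "github.com/gocql/gocql"

// collectPartition deletes every row after the first one of a (bucket, key)
// partition; the first row is the current state, the rest are garbage.
// deleteBlob stands in for uploading a 0-byte tombstone file to the StorageNodes.
func collectPartition(session *gocql.Session, bucket, key string,
	deleteBlob func(blob string) error) error {

	iter := session.Query(
		`SELECT mtime, blob FROM object WHERE bucket = ? AND key = ?`,
		bucket, key,
	).Iter()

	var (
		mtime gocql.UUID
		blob  string
		first = true
	)
	for iter.Scan(&mtime, &blob) {
		if first { // newest version: keep it
			first = false
			continue
		}
		if blob != "" {
			if err := deleteBlob(blob); err != nil {
				return err
			}
		}
		if err := session.Query(
			`DELETE FROM object WHERE bucket = ? AND key = ? AND mtime = ?`,
			bucket, key, mtime,
		).Exec(); err != nil {
			return err
		}
	}
	return iter.Close()
}
```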
Object Garbage Collection
• GC completed
50
bucket | key        | mtime    | status  | metadata...
b1     | photo1.jpg | uuid(t5) | ACTIVE  | {size, location, acl...}
b1     | photo2.jpg | uuid(t7) | DELETED | N/A
((bucket, key) = Partition Key, mtime = Clustering Column)

GC completed.
We achieved concurrency control on an eventually consistent database by using partitioning and UUID clustering.
Issues and Future Plans
ObjectIndex Table
• ObjectIndex Table
• For the ListObjects API, the objects in a bucket are stored in the ObjectIndex table sorted in ascending order by key name.
• Since a single partition would get extremely large, the objects in a bucket are split across 16 partitions.
52
ObjectIndex Table (Partition Key: (bucket, hash), Clustering Column: key)
bucket1 / hash 0: key0001, key0003, key0012, key0024, ...
bucket1 / hash 1: key0004, key0009, key0011, ...
bucket1 / hash 2: key0002, key0005, ...
...
To respond, the API Node retrieves the 16 partitions and merges them into a single key-ordered list (key0001, key0002, key0003, key0004, key0005, ...).
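A small Go sketch of that merge step: each hash partition is already sorted by key, so the 16 partial lists can be combined with a simple k-way merge (real code would stream and paginate rather than materialize everything):

```go
package apinode

// mergeSortedPartitions merges the already-sorted key lists of the 16
// ObjectIndex partitions into one list sorted by key name.
func mergeSortedPartitions(partitions [][]string) []string {
	cursors := make([]int, len(partitions)) // next unread index per partition
	var merged []string
	for {
		best := -1
		for i, p := range partitions {
			if cursors[i] >= len(p) {
				continue // this partition is exhausted
			}
			if best == -1 || p[cursors[i]] < partitions[best][cursors[best]] {
				best = i
			}
		}
		if best == -1 {
			return merged // all partitions exhausted
		}
		merged = append(merged, partitions[best][cursors[best]])
		cursors[best]++
	}
}
```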
Issues
• ObjectIndex related problems
• Some API requests cause a lot of queries to Cassandra, resulting in high load and high
latency.
• Because of a Cassandra limitation, the number of Objects per Bucket is restricted to 32 billion.
• We’d like to eliminate constraints on the number of Objects by introducing a mechanism
that dynamically divides the index partition.
53
Future Plans
• Improvement of Storage Engine
• WAL (Write Ahead Log) based Engine?
• Erasure Coding?
• Serverless Architecture
• Push notification to messaging queues such as Kafka, Pulsar
• Integration with other distributed systems
• Hadoop, Spark, Presto, etc...
54
Wrap up
• Yahoo! JAPAN is developing a large scale object storage named “Dragon”.
• “Dragon” is a highly scalable object storage platform.
• We’re going to improve it to meet our new requirements.
• Thank you!