4.
• Distributed object storage system in user space
– Manage disks and nodes
• Aggregate the capacity and the power (IOPS + throughput)
• Hide hardware failures
• Dynamically grow or shrink the scale
– Secure data
• Provide redundancy mechanisms (replication and erasure code) for high availability
• Secure the data with auto-healing and auto-rebalancing mechanisms
– Provide interfaces (in a single cluster)
• Virtual volume for QEMU VM, iSCSI TGT (best supported)
• RESTful container (OpenStack Swift and Amazon S3 compatible, in progress)
• Storage for OpenStack Cinder, Glance, Nova (in progress)
• POSIX file via NFS (in progress)
• Linux block device
What is Sheepdog
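The redundancy mechanisms above can be illustrated with a toy example. This is a minimal sketch of the erasure-code idea, not Sheepdog's actual code path (which uses a real erasure code via isa-l, as mentioned later): a single XOR parity strip lets the cluster rebuild any one lost data strip from the survivors.

```python
from functools import reduce

def make_parity(strips):
    """XOR all equal-length data strips together into one parity strip."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

def recover(strips, parity, lost_index):
    """Rebuild the single lost strip from the surviving strips plus parity."""
    survivors = [s for i, s in enumerate(strips) if i != lost_index]
    return make_parity(survivors + [parity])

# Hypothetical 3-strip object; names are illustrative only.
data = [b"obj-part-A", b"obj-part-B", b"obj-part-C"]
parity = make_parity(data)

# Simulate losing strip 1 and rebuilding it from the rest.
rebuilt = recover(data, parity, lost_index=1)
assert rebuilt == b"obj-part-B"
```

Replication simply stores full copies; erasure coding like the above trades extra CPU for much lower space overhead at the same fault tolerance.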
5.
[Diagram: three gateway + store nodes with mixed 1TB/2TB disks; one disk fails (X), a 4TB disk is hot-plugged, and a disk is auto-unplugged on EIO]
• Private hash ring: local rebalance
• Global consistent hash ring and P2P global rebalance
• No meta servers! Zookeeper handles membership management and the message queue
Disks and Nodes Management
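The "no meta servers" design above works because every gateway can compute object placement from the ring alone. A minimal sketch of consistent-hash placement (hypothetical node names; single copy; no virtual nodes or disk weights, which a real ring would add):

```python
import hashlib
from bisect import bisect

def h(key):
    """Hash a string key onto the ring's integer space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns a point on the ring, sorted by hash.
        self.points = sorted((h(n), n) for n in nodes)

    def locate(self, obj_id):
        """Walk clockwise to the first node at or after the object's hash."""
        keys = [p for p, _ in self.points]
        i = bisect(keys, h(obj_id)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
# Any gateway computes the same owner, with no metadata server lookup.
owner = ring.locate("vdi:alice/obj:0042")
assert owner == Ring(["node-a", "node-b", "node-c"]).locate("vdi:alice/obj:0042")
```

Because placement is pure computation, adding or removing a node only re-maps the objects adjacent to it on the ring, which is what keeps rebalancing local.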
10. People
• Kazutaka Morita, 2009.9: open-sourced Sheepdog
• People from Taobao, 2011.9: added features, fixed bugs, redesigned
• Christoph Hellwig from Nebula, 2012.4: stayed for around half a year
• People from Intel, 2014: added isa-l for erasure code
• People from China Mobile, 2015: making Sheepdog better
• More production uses around the world: Valerio, Andy, startups in China and Japan
11. Patches
[Bar chart: patches per year, 2009-2015]
● Patches culminated in 2012 and 2013, and have suffered a decline recently.
● It is always easy to open-source the code, but building a community is really difficult.
● China Mobile is committed to releasing all its patches to the community.
12. Comparison with Ceph and GlusterFS
Pros:
Simplicity is the biggest advantage of Sheepdog
Sheepdog: 20k+ lines in user space
Ceph: 400k+ lines in user space and 20k+ in kernel
GlusterFS: 330k+ lines in user space
Cons:
● No company behind it
● Inactive community
● Few users and few developers
But Sheepdog is not technically inferior! Simplicity doesn't mean bad!
13. Sheepdog-ng
Why?
We forked it in May because of endless crashes and panics under our stress testing. I
discussed the redesign idea, removing shared state between sheep nodes, with the NTT
guys. They asked me to fork Sheepdog instead, simply because they don't use Zookeeper,
just as they always replied to users asking about features they don't use (e.g., the
object cache).
http://lists.wpkg.org/pipermail/sheepdog/2015-May/067736.html
The technical reason:
Share nothing, or share more and more state with overwhelming complexity.
The non-technical reason:
The community is not as friendly and open as before. We want to build a real
community-based project.
Subscribe to the list: send an email to sheepdog-ng+subscribe@googlegroups.com
15. iSCSI Target Scalability
[Diagram: old STGT target serving LUN1/LUN2, sync, main thread, max in-flight requests == number of workers vs. new target, async, one thread per LUN, unlimited in-flight requests]
Problems:
● The OS tends to issue more and more requests (blk-mq, scsi-mq)
● A single LUN can saturate STGT; it does not scale at all
● STGT takes too many resources
● Multipath support is not so good
Solution – Rewrite a new target
● From sync to async: fewer threads and FDs
● Tailored for Sheepdog
● Add IO rebalance and cache support
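The sync-vs-async contrast above can be sketched as a queueing model. This is a deliberate simplification, not the new target's actual code: one worker thread drains a per-LUN request queue, so submission never blocks and a busy LUN cannot starve the others, unlike a fixed global worker pool where max in-flight requests equals the number of workers.

```python
import queue
import threading

class Lun:
    """One drain thread per LUN, mirroring the thread-per-LUN design."""
    def __init__(self, name):
        self.name = name
        self.q = queue.Queue()   # unbounded: submit() never waits on a worker
        self.done = []
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, req):
        self.q.put(req)          # async: returns to the initiator immediately

    def _drain(self):
        while True:
            req = self.q.get()
            self.done.append(req)  # stand-in for issuing the IO to sheep
            self.q.task_done()

luns = {n: Lun(n) for n in ("LUN1", "LUN2")}
# Flood one LUN; the other LUN's thread stays free the whole time.
for i in range(100):
    luns["LUN1"].submit(("write", i))
luns["LUN1"].q.join()            # wait until the per-LUN queue is drained
assert len(luns["LUN1"].done) == 100
```

In the real target the drain step would issue asynchronous IO to the sheep daemon rather than block, but the isolation property per LUN is the same.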
16. Performance Degradation
[Diagram: a node failure (X) causes IO to hang, then resume after recovery]
Problem with the default dynamic hash ring:
● If an object is in recovery, we need to wait!
● What makes it worse, recovery IO will compete with user IO for bandwidth and CPU
● Neither slow nor fast recovery is satisfactory
Solution – Static Hash Ring
A node failure won't change the hash ring. Trade
data reliability for performance! We don't recover an
object if some of its redundant copies are missing.
Useful for small clusters that mostly deal with
single-node events.
[Diagram: with the static hash ring (SHR) the IO to the failed node (X) is simply dropped, unlike the dynamic hash ring (DHR)]
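The trade-off above can be sketched with a toy placement function (an assumed simplification: plain modulo placement over a node list instead of Sheepdog's real consistent hash ring). On the dynamic ring a failure removes the node and re-maps objects, triggering recovery IO; on the static ring the mapping is unchanged and the missing copy is simply skipped.

```python
def replicas(obj_hash, nodes, copies=2):
    """Place `copies` replicas clockwise from the object's position."""
    i = obj_hash % len(nodes)
    return [nodes[(i + k) % len(nodes)] for k in range(copies)]

nodes = ["n0", "n1", "n2", "n3"]
assert replicas(5, nodes) == ["n1", "n2"]   # healthy cluster placement

# Dynamic hash ring: drop the failed node from the ring; the object is
# re-mapped to different nodes, which must now recover its copies.
dynamic = replicas(5, [n for n in nodes if n != "n1"])
assert dynamic == ["n3", "n0"]              # placement changed -> recovery IO

# Static hash ring: the ring is unchanged; the copy on the failed node
# is skipped and NOT recovered, so no recovery IO competes with user IO.
static = [n for n in replicas(5, nodes) if n != "n1"]
assert static == ["n2"]                     # degraded: one copy until n1 returns
```

The static ring stays degraded (one surviving copy here) until the failed node comes back, which is why the slide calls it trading data reliability for performance, sensible mainly for small clusters expecting single-node events.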
17. Live Patching
[Diagram: the call chain A ----> B ----> C becomes A ----> B` ----> C after patching; B` is loaded by Linux's dynamic loader on the fly]
Sheep tracer
Similar to Linux's ftrace, it virtually adds a
constructor and destructor to every function.
This mechanism relies on the 5-byte space
(a.k.a. mcount) injected by GCC beforehand.
Based on the tracer, we can replace any function
in the sheep daemon on the fly.
Useful for one-liner bug fixes, but limited to the
function level.
18. NFS Server
Current status:
Just a toy with file size < 4M; NFSv3 is not fully supported and there is virtually no
file system code (we need to implement inode, dentry and free space management)
Todos:
- finish the stubs
- add extent-based file allocation
- add a btree- or hash-based KV store to manage dentries
- implement a multi-threaded SUNRPC to replace the poorly performing glibc RPC
- implement NFSv4
19.
Cinder - Block Storage
– Supported since day 1
Glance - Image Storage
– Support merged in the Havana release
Nova - Ephemeral Storage
– Not yet started
Swift - Object Storage
– Swift-API compatible, in progress
Final Goal - Unified Storage
– Copy-on-write anywhere?
– Data dedup?
[Diagram: Cinder, Glance, Nova and Swift in OpenStack all backed by one unified storage layer of sheep nodes]
Plan to rewrite the driver with libsheepdog.so