As the amount of unstructured data has greatly exceeded a single computer's ability to process it, data has become increasingly isolated from the compute elements. The resulting haul from stores of record (e.g., SAN, NAS, S3) to transient compute (e.g., Hadoop, EC2) creates needless mechanical work and human labor. Is there a better way? In this talk, we'll explore the coming convergence of data and compute in the cloud, focusing in particular on Joyent's Manta, a new internet-facing object storage facility that features compute. We will describe the design principles for Manta, the engineering challenges in building it, and more generally, the opportunities presented by the convergence of compute and data.
ZFS & Zones: A Powerful Combination for Object Storage and In-Situ Compute
1. ZFS & Zones:
Your Compute fell into
My Data!
Bryan Cantrill
SVP, Engineering
bryan@joyent.com
@bcantrill
2. The filesystem: Some prehistory
•
When they were originally developed in the 1970s,
filesystems were designed as an abstraction over a disk
•
Over time, it became increasingly expensive to make
bigger disks — and reliability suffered
•
In the 1980s, both problems were solved by using many
hard drives instead of just larger and larger drives: a
redundant array of inexpensive disks (RAID)
•
Even though filesystems were still relatively young at the
time, it was deemed too complicated to rewrite them to
accommodate the (new) notion of many disks
•
This software problem was solved by introducing a new
layer of software: the volume manager
3. The volume management divide
•
Volume management abstracts many physical devices
into single logical volumes, allowing filesystems to retain
a one-to-one mapping with a device (a logical one)
•
This gave rise to a problematic divide:
•
The volume manager understands multiple disks, but
nothing of the higher level semantics of the filesystem
•
The filesystem understands the higher semantics of the
data, but has no physical device understanding
•
This divide became entrenched over the 1990s, and had
devastating ramifications for reliability, performance and
manageability
4. Volume management deficiencies
•
Because the volume management layer had no notion
of the transactional semantics of the filesystem, system
failure induced excruciating file system checks
•
Worse, the system was left with no protection against
many variants of device-level data corruption:
•
The only failure the volume manager can reasonably detect
is media failure that results in incorrect data on disk
•
This doesn’t account for phantom reads (i.e., the wrong disk
block is read from), phantom writes (i.e., the wrong disk
block is written to) or driver pathologies (e.g., memory errors)
•
And because filesystems did not understand more than one
device, device failure often meant filesystem failure
5. Volume management deficiencies
•
Lacking visibility into the hardware layer, the filesystem
could not effectively use the parallelism inherent in
multiple disks — and could not effectively schedule I/O
•
Spindles were underutilized (leaving bandwidth and/or
IOPS on the table) or overutilized (thrashing the device
and yielding pathological performance)
•
Management was a nightmare: filesystems could not be
expanded or shrunk — requiring every filesystem to
know in advance its intended capacity
6. The ZFS revolution
•
Starting in 2001, Sun began a revolutionary new
software effort: to unify storage and eliminate the divide
•
In this model, filesystems would lose their one-to-one
association with devices: many filesystems would be
multiplexed on many devices
•
By starting with a clean sheet of paper, ZFS opened up
vistas of innovation — and by its architecture was able
to solve many otherwise intractable problems
•
Sun shipped ZFS in 2005, and used it as the foundation
of its enterprise storage products starting in 2008
•
ZFS was open sourced in 2005; it remains the only open
source enterprise-grade filesystem
7. ZFS advantages
•
Copy-on-write design allows on-disk consistency to be
always assured (eliminating file system check)
•
Copy-on-write design allows constant-time snapshots in
unlimited quantity — and writable clones!
•
Filesystem architecture allows filesystems to be created
instantly and expanded — or shrunk! — on-the-fly
•
Integrated volume management allows for intelligent
device behavior with respect to disk failure and recovery
•
Adaptive replacement cache (ARC) allows for optimal
use of DRAM — especially on high DRAM systems
•
Support for dedicated log and cache devices allows for
optimal use of flash-based SSDs
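A few of the advantages above map onto one-line administrative commands. The following is a minimal sketch, not a recipe from the talk: the pool name `tank`, the dataset names and the device names are all hypothetical, and the `run` helper only prints each command unless `RUN=1` is set on a system that actually has ZFS and the necessary privileges.

```shell
# Hypothetical pool, dataset and device names (assumptions for illustration).
POOL=${POOL:-tank}
FS="$POOL/data"

# Dry-run helper: print the command unless RUN=1 is set explicitly.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

run zfs create "$FS"                           # filesystems are created instantly
run zfs snapshot "$FS@before"                  # constant-time snapshot
run zfs clone "$FS@before" "$POOL/data-clone"  # writable clone of that snapshot
run zpool add "$POOL" log c1t0d0               # dedicated log device (e.g., an SSD)
run zpool add "$POOL" cache c1t1d0             # dedicated cache device
```

With `RUN` unset, the script just echoes the five commands, which makes it safe to read through on any machine.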
8. ZFS at Joyent
•
Joyent was the earliest ZFS adopter: becoming (in
2005) the first production user of ZFS outside of Sun
•
ZFS is one of the four foundational technologies of
Joyent’s SmartOS, our illumos derivative
•
The other three foundational technologies in SmartOS are
DTrace, Zones and KVM
•
Search “fork yeah illumos” for the (uncensored) history of
OpenSolaris, illumos, SmartOS and derivatives
•
Joyent has extended ZFS to provide better support for
multi-tenant operation with I/O throttling
9. ZFS as the basis for object storage?
•
As we began to think about building our own
internet-facing object store in the fall of 2011, we naturally
gravitated to ZFS...
•
We view ZFS as our most foundational differentiator...
•
Could we extend ZFS in some important way that would
offer something interesting and compelling?
•
Short answer: meh
10. Aside: Virtualization in the cloud
•
Operating a public cloud has significant technological
and business challenges:
•
From a technological perspective, must deliver highly elastic
infrastructure with acceptable quality of service across a
broad class of users and applications
•
From a business perspective, must drive utilization as high
as possible while still satisfying customer expectations for
quality of service
•
These aspirations are in tension: multi-tenancy can
significantly degrade quality of service
•
The key enabling technology for multi-tenancy is
virtualization — but where in the stack to virtualize?
11. Hardware-level virtualization?
•
The historical answer — since the 1960s — has been to
virtualize at the level of the hardware:
•
A virtual machine is presented upon which each
tenant runs an operating system of their choosing
•
There are as many operating systems as tenants
•
The historical motivation for hardware virtualization
remains its advantage today: it can run entire legacy
stacks unmodified
•
However, hardware virtualization exacts a heavy toll:
operating systems are not designed to share resources
like DRAM, CPU, I/O devices or the network
•
Hardware virtualization limits tenancy and inhibits
performance!
12. Platform-level virtualization?
•
Virtualizing at the application platform layer addresses
the tenancy challenges of hardware virtualization…
•
...but at the cost of dictating abstraction to the developer
•
This creates the “Google App Engine problem”:
developers are in a straitjacket where toy programs
are easy — but sophisticated apps are impossible
•
Virtualizing at the application platform layer poses many
other challenges:
•
Security, resource containment, language specificity,
environment-specific engineering costs
13. Joyent’s solution: OS-level virtualization
•
Virtualizing at the OS level hits the sweet spot:
•
Single OS (single kernel) allows for efficient use of hardware
resources, and therefore allows load factors to be high
•
Disjoint instances are securely compartmentalized by the
operating system
•
Gives customers what appears to be a virtual machine
(albeit a very fast one) on which to run higher-level software
•
Gives customers PaaS when the abstractions work for them,
IaaS when they need more generality
•
OS-level virtualization allows for high levels of tenancy
without dictating abstraction or sacrificing efficiency
•
Zones is a bullet-proof implementation of OS-level
virtualization — and is the core abstraction in Joyent’s
SmartOS
15. Manta: ZFS + Zones!
•
Building a sophisticated distributed system on top of
ZFS and zones, we have built Manta, an internet-facing
object storage system offering in situ compute
•
That is, the description of compute can be brought to
where objects reside instead of having to backhaul
objects to transient compute
•
The abstractions made available for computation are
anything that can run on the OS...
•
...and as a reminder, the OS — Unix — was built around
the notion of ad hoc unstructured data processing, and
allows for remarkably terse expressions of computation
16. Aside: Unix
•
When Unix appeared in the early 1970s, it was not just a
new system, but a new way of thinking about systems
•
Instead of a sealed monolith, the operating system was
a collection of small, easily understood programs
•
First Edition Unix (1971) contained many programs that
we still use today (ls, rm, cat, mv)
•
Its very name conveyed this minimalist aesthetic: Unix is
a homophone of “eunuchs” — a castrated Multics
We were a bit oppressed by the big system mentality. Ken
wanted to do something simple. — Dennis Ritchie
17. Unix: Let there be light
•
In 1969, Doug McIlroy had the idea of connecting
different components:
At the same time that Thompson and Ritchie were sketching
out a file system, I was sketching out how to do data
processing on the blackboard by connecting together
cascades of processes
•
This was the primordial pipe, but it took three years to
persuade Thompson to adopt it:
And one day I came up with a syntax for the shell that went
along with the piping, and Ken said, “I’m going to do it!”
18. Unix: ...and there was light
And the next morning we had this
orgy of one-liners. — Doug McIlroy
19. The Unix philosophy
•
The pipe — coupled with the small-system aesthetic —
gave rise to the Unix philosophy, as articulated by Doug
McIlroy:
•
Write programs that do one thing and do it well
•
Write programs to work together
•
Write programs that handle text streams, because
that is a universal interface
•
Four decades later, this philosophy remains the single
most important revolution in software systems thinking!
20. Doug McIlroy v. Don Knuth: FIGHT!
•
In 1986, Jon Bentley posed the challenge that became
the Epic Rap Battle of computer science history:
Read a file of text, determine the n most frequently used
words, and print out a sorted list of those words along with
their frequencies.
•
Don Knuth’s solution: an elaborate program in WEB, a
Pascal-like literate programming system of his own
invention, using a purpose-built algorithm
•
Doug McIlroy’s solution shows the power of the Unix
philosophy:
tr -cs A-Za-z '\n' | tr A-Z a-z |
sort | uniq -c | sort -rn | sed ${1}q
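To see the pipeline at work, here is a small sketch that wraps it in a shell function and feeds it a one-line sample. The function name `wordfreq` and the sample sentence are illustrative assumptions; the stages are the ones in McIlroy's solution.

```shell
# wordfreq N: print the N most frequent words read from standard input.
wordfreq() {
    tr -cs A-Za-z '\n' |   # turn every run of non-letters into one newline
        tr A-Z a-z |       # fold to lower case
        sort |             # bring identical words together
        uniq -c |          # count each distinct word
        sort -rn |         # order by descending count
        sed "${1}q"        # stop after N lines
}

printf 'the cat and the dog and the bird\n' | wordfreq 2
```

On this input the two lines printed are the counts for "the" (3) and "and" (2); the exact padding in front of each count depends on the local `uniq` implementation.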
21. Big Data: History repeats itself?
•
The original Google MapReduce paper (Dean and Ghemawat,
OSDI ’04) poses a problem disturbingly similar to
Bentley’s challenge nearly two decades prior:
Count of URL Access Frequency: The map function processes
logs of web page requests and outputs ⟨URL, 1⟩. The
reduce function adds together all values for the same URL
and emits a ⟨URL, total count⟩ pair
•
But the solutions do not adhere to the Unix philosophy...
•
...and nor do they make use of the substantial Unix
foundation for data processing
•
e.g., Appendix A of the OSDI ’04 paper has a 71-line
word count in C++ — with nary a wc in sight
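For comparison, the URL access frequency problem can be sketched with the same Unix building blocks. This is an illustration only: the log format ("client, method, URL", one request per line) and the function name `url_counts` are assumptions, not taken from the paper or the talk.

```shell
# Assumed log format: "<client> <method> <url>", one request per line.
url_counts() {
    awk '{ print $3 }' |   # "map": emit the URL of each request
        sort |             # group identical URLs together
        uniq -c |          # "reduce": count each group
        sort -rn           # most frequently requested first
}

printf '%s\n' \
    '10.0.0.1 GET /index.html' \
    '10.0.0.2 GET /about.html' \
    '10.0.0.1 GET /index.html' | url_counts
```

The sample prints a count of 2 for `/index.html` followed by a count of 1 for `/about.html`.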
22. Manta: Unix for Big Data
•
Manta allows for an arbitrarily scalable variant of
McIlroy’s solution to Bentley’s challenge:
mfind -t o /bcantrill/public/v7/usr/man |
mjob create -o -m "tr -cs A-Za-z '\n' |
tr A-Z a-z | sort | uniq -c" -r \
"awk '{ x[\$2] += \$1 }
END { for (w in x) { print x[w] \" \" w } }' |
sort -rn | sed ${1}q"
•
This description is not only terse, it is high performing: data
is left at rest — with the “map” phase doing heavy
reduction of the data stream
•
As such, Manta — like Unix — is not merely syntactic
sugar; it converges compute and data in a new way
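The split between the two phases can be tried locally without Manta. The sketch below is a simulation under stated assumptions: two temporary files stand in for stored objects, a `for` loop stands in for Manta running the map phase once per object, and the function names `map_phase` and `reduce_phase` are hypothetical. The commands inside each function are the ones from the job above.

```shell
# Map: runs once per object and emits "count word" pairs for that object.
map_phase() {
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c
}

# Reduce: merges the per-object counts and keeps the top N ($1) words.
reduce_phase() {
    awk '{ x[$2] += $1 }
         END { for (w in x) { print x[w] " " w } }' |
        sort -rn | sed "${1}q"
}

# Two temporary files stand in for objects in the store.
dir=$(mktemp -d)
printf 'the cat and the dog\n' > "$dir/a.txt"
printf 'the bird and the fish\n' > "$dir/b.txt"

for f in "$dir"/*.txt; do
    map_phase < "$f"
done | reduce_phase 2

rm -rf "$dir"
```

Each map invocation already collapses its object to one line per distinct word, so only those summaries reach the reducer. Here the reducer prints `4 the` and `2 and`.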
23. Manta: CAP tradeoffs
•
Eventual consistency represents the wrong CAP
tradeoffs for most; we prefer consistency over
availability for writes (but still availability for reads)
•
Many more details:
http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/
•
Celebrity endorsement:
24. Manta: Other design principles
•
Hierarchical storage is an excellent idea (ht: Multics);
Manta implements proper directories, delimited with a
forward slash
•
Manta implements a snapshot/link hybrid dubbed a
snaplink; can be used to effect versioning
•
Manta has full support for CORS headers
•
Manta uses SSH-based HTTP auth for client-side
tooling (IETF draft-cavage-http-signatures-00)
•
Manta SDKs exist for node.js, Java, Ruby, Python
•
“npm install manta” for command line interface
25. Manta and the future of big data
•
We believe compute/data convergence to be the future
of big data: stores of record must support computation
as a first-class, in situ operation
•
We believe that Unix is a natural way of expressing this
computation — and that the OS is the right level at
which to virtualize to support this securely
•
We believe that ZFS is the only sane storage
underpinning for such a system
•
Manta will surely not be the only system to represent the
confluence of these — but it is the first
•
We are actively retooling our software stack in terms of
Manta — Manta is changing the way we develop
software!
26. Manta: More information
•
Product page:
http://joyent.com/products/manta
•
node.js module:
https://github.com/joyent/node-manta
•
Manta documentation:
http://apidocs.joyent.com/manta/
•
IRC, e-mail, Twitter, etc.:
#manta on freenode, manta@joyent.com, @mcavage,
@dapsays, @yunongx, @joyent