Talk delivered at High Performance Transaction Systems (HPTS) 2013
Myria is a new Big Data service being developed at the University of Washington. It features high-level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.
2. “It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
4. How can we deliver 1000 little SDSSs
to anyone who wants one?
8. Big Data in the Long Tail Workshop, 2012 (Social Sciences)
9. Maier’s 2nd Maxim
Working with scientists is like working with 7-year-olds:
they think they know everything, and they don’t have any money.
10. My Goal: Expose all the world’s science data
through declarative query interfaces
11. Problem
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
13. Maslow’s Needs Hierarchy
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow, 1943
14. A “Needs Hierarchy” of Science Data Management
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow, 1943
[Pyramid, base to top:] storage, sharing, curation, query, analytics
15. A “Needs Hierarchy” of Science Data Management
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow, 1943
[Pyramid, base to top:] storage, sharing, semantic integration, query, analytics
16. Why should you care?
Science == Data Science
18. The SQLShare workflow:
1) Upload data “as is”
Cloud-hosted; no need to install or design a database; no pre-defined schema.
2) Write SQL
Right in your browser, writing queries on top of queries on top of queries...
3) Share the results
Make them public, tag them, share with specific colleagues; anyone with access can query.

SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
19. Find all TIGRFam ids (proteins) that are missing from at least
one of three samples (relations)
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
20. Non-programmers can write very complex queries
(rather than relying on staff programmers)
We see thousands of queries written by non-programmers.
Example: computing the overlaps of two sets of BLAST results

SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
     , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
     , w.category as nc_category
     , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
              THEN x.end_bp - x.start_bp + 1
            WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
              THEN x.end_bp - w.start_bp + 1
            WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
              THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
22. SQL as a lab notebook: http://bit.ly/16Xj2JP
Steven Roberts
[Workflow diagrams: linking methylation with gene descriptions. Inputs: a GFF of methylated CG locations, a GFF of all CG locations, a GFF of all genes, and gene descriptions. One version uses Excel plus a popular service for bioinformatics workflows; the SQL version chains steps such as trim, compute, join, count, calculate # methylated CGs, calculate # all CGs, calculate methylation ratio, and reorder columns, and even records a misstep: a join with the wrong fill.]
24. Andrew White, UW Chemistry
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10-minute, 100-line script in 1 line of SQL.”
-- Andrew D. White
Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted.
25. SSDBM 2011
Scientific data management reduces to sharing views
• Integrate data from multiple sources?
– joins and unions with views
• Standardize on units, apply naming conventions?
– rename columns, apply functions with views
• Attach metadata?
– add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
– hide bad values with views
• Maintain provenance?
– inspect view dependencies
• Propagate updates?
– view maintenance
• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)
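A minimal sketch of several of these reductions at once (hypothetical table and column names, using Python's built-in sqlite3 so it is self-contained): a view that renames columns, standardizes units, and hides bad values, so every downstream query sees the cleaned data. Sharing then reduces to granting access to the view rather than the raw table.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE casts_raw (stn TEXT, depth_m REAL, temp_f REAL);
INSERT INTO casts_raw VALUES ('P12', 5.0, 53.6), ('P12', -999, 54.1);

-- Standardize units, apply naming conventions, and hide bad values, all with a view
CREATE VIEW casts AS
SELECT stn                        AS station,
       depth_m                    AS depth_meters,
       (temp_f - 32) * 5.0 / 9.0  AS temp_celsius
FROM casts_raw
WHERE depth_m <> -999;            -- quality control: drop sentinel values
""")

for row in conn.execute("SELECT * FROM casts"):
    print(row)                    # only the cleaned, renamed, unit-converted rows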
26. Two Problems with SQLShare
• No help for really big datasets
• No iteration
27. Myria is…
• A compiler framework for multiple
iterative RA-based languages
• A parallel, shared-nothing, iterative
execution engine
• A RESTful Query-as-a-Service platform
• the Greek prefix meaning “ten thousand”
28. Myria Team
Dan Suciu
Magda Balazinska
Bill Howe
Dan Halperin (postdoc, technical lead)
Victor Almeida (postdoc)
Andrew Whitaker (research scientist)
Students
Paris Koutris
Emad Soroush
Jingjing Wang
ShengLiang Xu
Jennifer Ortiz
Jeremy Hyrkas
Shumo Chu
29. Myria Architecture
[Architecture diagram:] A Web UI and language parser (MyriaL), hosted on Google App Engine, feed a logical optimizer for RA+While. The Myria compiler emits a JSON query plan to MyriaDB: a REST server in front of a coordinator and its catalog. The coordinator drives shared-nothing workers (each with a local catalog) over netty protocols, and each worker reads and writes data through JDBC to a local RDBMS, or to HDFS. An alternate back end compiles through a C compiler to Grappa.
33. Why Iteration Matters
Datalog: reachability in a graph with 1.4B unique edges takes almost 200 iterations.
[Plot annotation:] The vast majority of reachable tuples are discovered by iteration 25.
34. Why Iteration Matters (continued)
The Datalog program continues for almost 200 iterations, each almost as expensive as the early steps.
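A minimal sketch of the iteration being described (semi-naive Datalog evaluation of reachability in Python; the toy graph and start vertex are illustrative, not the 1.4B-edge dataset): each round joins only the newly discovered frontier against the edge relation, and the loop keeps running until a round discovers nothing new. The plot's point is visible even here: the frontier shrinks over time, so late rounds do little useful work unless the engine exploits the delta.

def reachability(edges, source=0):
    """reach(source); reach(y) :- reach(x), edge(x, y). Semi-naive evaluation."""
    reached = {source}
    delta = {source}          # frontier discovered in the previous round
    rounds = 0
    while delta:
        # Join only the delta (not the full result) against the edges.
        delta = {y for (x, y) in edges if x in delta} - reached
        reached |= delta
        rounds += 1
    return reached, rounds

edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
print(reachability(edges))    # ({0, 1, 2, 3}, 3): the last round finds nothing new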
38. HaLoop: Bu, Howe, Balazinska, Ernst. VLDB10, VLDBJ12, Datalog12
[MapReduce dataflow diagram: each iteration maps and shuffles R and Ai, reduces to join R with ΔA(i-1), then maps and reduces again to take the difference against Ai.]
(a) R is loop invariant, but gets loaded and shuffled on each iteration.
(b) Ai grows slowly and monotonically, but is loaded and shuffled on each iteration.
HaLoop’s Reducer Input Cache addressed (a), but did not support the append semantics needed for (b).
39. Inter-loop caching: Bu, Howe, Balazinska, Ernst. VLDB10, VLDBJ12, Datalog12
Iteration i = 0: load a distributed cache.
Iteration i > 0: [diagram] only the delta ΔA(i-1) is mapped and shuffled; the reducers join and difference it against the cached partitions of R and A instead of reloading and reshuffling them.
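A minimal single-node sketch of inter-loop caching (a plain dict stands in for HaLoop's distributed reducer input cache): the loop-invariant relation R is hashed once, and each later iteration joins only the shrinking delta against it.

from collections import defaultdict

def transitive_closure(R):
    """Iterated join where R is loop-invariant: A(i+1) = A(i) ∪ (ΔA ⋈ R)."""
    # Iteration 0: build ("cache") a hash table over R once,
    # instead of re-loading and re-shuffling R every iteration.
    r_index = defaultdict(list)
    for (x, y) in R:
        r_index[x].append(y)

    A = set(R)
    delta = set(R)
    while delta:
        # Later iterations shuffle only the (shrinking) delta.
        new = {(x, z) for (x, y) in delta for z in r_index[y]}
        delta = new - A       # "Difference" step: keep only genuinely new tuples
        A |= delta
    return A

print(sorted(transitive_closure([(0, 1), (1, 2), (2, 3)])))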
41. Specialize the Cache for Query Semantics
[Same dataflow diagram as before.] MapReduce semantics require that all keys from the cache be extracted and passed to reducers: the reducer for the join sees all tuples from the cache, alongside the join keys arriving from the mappers. But we only care about keys that join.
44. Third Optimization: Extend the Cache to Support Duplicate Elimination
[Same dataflow diagram.] The accumulated result is not loop-invariant, but it changes relatively slowly, and it is needed on every iteration to check for duplicates. Extend the cache to support append, and we can use it for dupe-elim as well: the reducer for dupe-elim checks tuples arriving from the mappers against an indexed cache of unique keys, with new tuples inserted.
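A minimal sketch of an append-able dedup cache (single-node; the real version is a distributed, indexed reducer cache): tuples arriving from the mappers are probed against the cache of everything seen so far, and only genuinely new tuples are appended and forwarded.

class AppendableDedupCache:
    """Grows monotonically across iterations; supports lookup plus append."""
    def __init__(self):
        self.seen = set()             # stand-in for an indexed, persistent cache

    def filter_new(self, tuples):
        # Probe arriving tuples against the cache; append and emit the new ones.
        new = [t for t in tuples if t not in self.seen]
        self.seen.update(new)
        return new

cache = AppendableDedupCache()
print(cache.filter_new([(0, 1), (1, 2)]))   # [(0, 1), (1, 2)]
print(cache.filter_new([(1, 2), (2, 3)]))   # [(2, 3)]: duplicate dropped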
45. Effect of the Diff Cache
[Plot: loop-body time (s) per iteration, with and without the diff cache; y-axis 0 to 100 s, x-axis iterations 0 to 50.] The diff cache gives a ~20% overall improvement. Without it, failures may be more likely due to the extra network traffic.
51. Summary
• Goal: Expose all the world’s science data through
declarative query interfaces!
• Motivated by real science
• Data and query model is iterative relational algebra
• Industrial-strength Query-as-a-Service
http://db.cs.washington.edu/myria/
http://myria-web.appspot.com/
53. [Pipeline fragment: Datalog Parser, Logical Optimizer, Myria Compiler, with an alternate path through a C Compiler to Grappa; front end on Google App Engine.]
• Hypothesis: the performance difference between hand-coded graph algorithms and relational query plans amounts to implementation details.
• Can we generate “hand-coded” plans?
55. Assume a collection edges

answers = set()
for (x, y1) in edges:
    for (y2, z) in edges:
        if y1 == y2:
            answers.add((x, z))
count = len(answers)

In an RDBMS: “Nested Loops Join”
56. Assume a collection edges, but also an index
neighbors: vertex -> [vertex]

answers = set()
for (x, y) in edges:
    for z in neighbors[y]:
        answers.add((x, z))
count = len(answers)

In an RDBMS: “Hash Join”
57. Just drop the edges collection entirely, leaving only the index
neighbors: vertex -> [vertex]

answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors[y]:
            answers.add((x, z))
count = len(answers)

In an RDBMS: still a Hash Join
58. Just drop the edges collection entirely, leaving only the index
neighbors: vertex -> [vertex]

count = 0
answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors[y]:
            answers.add(z)
    count += len(answers)   # answers holds values for only one x, so it stays small
    answers.clear()

RDBMSs don’t express this, but there’s no reason they couldn’t.
59. Or if you prefer… assume a collection of vertices, where each vertex points directly to its neighbors

count = 0
answers = set()
for x in vertices:
    for y in x.neighbors():
        for z in y.neighbors():
            answers.add(z)
    count += len(answers)   # answers holds values for only one x, so it stays small
    answers.clear()

Boils down to dereferencing a pointer vs. probing a hash table
Speaker notes:
So this is in part the motivation. There’s a group of great database researchers who work deeply with scientists: Dave Maier, my advisor; Jignesh, who left; Natassa, who left; Yannis, Alex, others. And we recently attracted some new blood to the science data arena. But this community of science databases has something in common with the HPTS community: Jim was a luminary of HPTS, and no less a luminary of science databases. The Sloan Digital Sky Survey…
To understand the problem, it’s useful to consider past successes. The Sloan Digital Sky Survey used a relational database with a carefully engineered schema, and then served the database online using a carefully engineered infrastructure. This approach requires a lot of people, expertise, money, and time: things that small and medium-sized projects don’t typically have. So the question we explore is: how can we support 1000 little SDSSs for small- and medium-sized projects? We started thinking about a new tool. The SDSS schema was designed in part by a Turing Award-winning database expert. We can’t afford to build a database plus applications from scratch for every project, and nobody wants to maintain such a system anyway. Most importantly, the data comes from all over the place instead of from a single source like SDSS: we can’t pretend the data will arrive clean and coherent.
…where I had to disguise myself as an oceanographer in order to do data science work. This is me on a research cruise in 2007.
But since joining the eScience Institute, I can mingle freely with the scientists in their natural habitat, and I sometimes get invited to their events.
In every discipline, you can play Where’s Waldo in these group photos and find me.
The problem is not only scale, and not even usually scale: it’s what Stratos called DB exploration, grubbing around in messy data with unknown quality, properties, etc. And, working…
But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that strong semantic integration is a prerequisite for query and analytics. It isn’t. It’s the final goal, not some insignificant preamble to analysis. Domain scientists know this; they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way. So one of my goals is to convince you that you can decouple declarative query from semantic integration, and doing so gives scientists a very powerful tool.
So we developed SQLShare to support a very simple workflow: you can upload data “as is” from spreadsheets or anything; it’s in the cloud, so there’s no need to install or design a database. You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries. Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers them.
Key ideas to get data in: (a) use the cloud to avoid having to install and run a database; (b) give up on the schema: just throw your data in “as is” and do “lazy integration”; (c) use some magic to automate parsing, integration, recommendations, and more.
Key ideas to get data out: (a) associate science questions (in English) with each SQL query, making them easy to understand and easy to find; (b) saving and reusing queries is a first-class requirement: given an example, it’s easy to modify it into an “adjacent” query; (c) expose the whole system through a REST API to make it easy to bring new client applications online.
Multiple input languages, multiple output languages, all RA-based. A database on every node for local processing. Everything in memory, but we can push down into the database. Push-based processing with back pressure to keep queues filled (a bit of streaming influence). Column-oriented tuple batches between workers; row-oriented on disk, typically, but it depends on the database. Support…
Four points to make: (0) this is the time for the join only, not the overall iteration time; (1) the first iteration is slower, as the cache is filled; (2) each iteration after that is about 23x faster by joining against cached results; (3) the gaps are failures, which are a reality at this scale; recovery proceeded as usual. HaLoop showed similar results, but did not evaluate complete Datalog queries.
Two points to make: (1) a 20% speedup on the overall iteration time from this specialization; this optimization violates MapReduce semantics, but is safe given our target language of Datalog. (2) The outliers represent failures, which are a reality of dealing with large-scale data, and a key reason why HaLoop is popular.
The diff cache works. Maybe ignore the failure comment, but just in case a question arises about why failures appear to be more common without the cache: the answer is we’re not sure, but we know more data is being transferred over the network without the cache.
But if we can’t express important analysis tasks, they’ll export their data and use some parallel cloudy R monstrosity.