Talk delivered at High Performance Transaction Systems (HPTS) 2013
Myria is a new Big Data service being developed at the University of Washington. It features high-level language interfaces, a hybrid graph-relational data model, database-style algebraic optimization, a comprehensive REST API, an iterative programming model suitable for machine learning and graph analytics applications, and a tight connection to new theories of parallel computation.
In this talk, we describe the motivation for another big data platform emphasizing requirements emerging from the physical, life, and social sciences.
2. “It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
4. How can we deliver 1000 little SDSSs
to anyone who wants one?
8. Big Data in the Long Tail Workshop, 2012 (Social Sciences)
9. Maier’s 2nd Maxim
Working with scientists is like working with 7-year-olds:
they think they know everything, and they don’t have any money.
10. My Goal: Expose all the world’s science data
through declarative query interfaces
11. Problem
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
13. Maslow’s Needs Hierarchy
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow, 1943
14. A “Needs Hierarchy” of Science Data Management
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow, 1943
[Pyramid, base to top:] storage, sharing, curation, query, analytics
15. A “Needs Hierarchy” of Science Data Management
“As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.”
-- Maslow, 1943
[Pyramid, base to top:] storage, sharing, semantic integration, query, analytics
16. Why should you care?
Science == Data Science
18. The SQLShare workflow:
1) Upload data “as is”
Cloud-hosted; no need to install or design a database; no pre-defined schema.
2) Write SQL
Right in your browser, writing queries on top of queries on top of queries...
3) Share the results
Make them public, tag them, share with specific colleagues; anyone with access can query.

SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
19. Find all TIGRFam ids (proteins) that are missing from at least
one of three samples (relations)
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
UNION
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
EXCEPT
SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
INTERSECT
SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
20. Non-programmers can write very complex queries
(rather than relying on staff programmers)
We see thousands of queries written by non-programmers.
Example: computing the overlaps of two sets of BLAST results

SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
     , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
     , w.category as nc_category
     , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
              THEN x.end_bp - x.start_bp + 1
            WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
              THEN x.end_bp - w.start_bp + 1
            WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
              THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
22. SQL as a lab notebook: http://bit.ly/16Xj2JP
Steven Roberts
[Workflow diagrams: linking methylation with gene descriptions. Inputs: a GFF of methylated CG locations, a GFF of all CG locations, a GFF of all genes, and gene descriptions. One version uses Excel plus a popular service for bioinformatics workflows; the SQL version chains steps such as trim, compute, join, count, calculate # methylated CGs, calculate # all CGs, calculate methylation ratio, and reorder columns, and even records a misstep: a join with the wrong fill.]
24. Andrew White, UW Chemistry
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10-minute, 100-line script in 1 line of SQL.”
-- Andrew D. White
Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted.
25. SSDBM 2011
Scientific data management reduces to sharing views
• Integrate data from multiple sources?
– joins and unions with views
• Standardize on units, apply naming conventions?
– rename columns, apply functions with views
• Attach metadata?
– add new tables with descriptive names, add new columns with views
• Data cleaning, quality control?
– hide bad values with views
• Maintain provenance?
– inspect view dependencies
• Propagate updates?
– view maintenance
• Protect sensitive data?
– expose subsets with views (assuming views carry permissions)
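A minimal sketch of several of these reductions at once (hypothetical table and column names, using Python's built-in sqlite3 so it is self-contained): a view that renames columns, standardizes units, and hides bad values, so every downstream query sees the cleaned data. Sharing then reduces to granting access to the view rather than the raw table.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE casts_raw (stn TEXT, depth_m REAL, temp_f REAL);
INSERT INTO casts_raw VALUES ('P12', 5.0, 53.6), ('P12', -999, 54.1);

-- Standardize units, apply naming conventions, and hide bad values, all with a view
CREATE VIEW casts AS
SELECT stn                        AS station,
       depth_m                    AS depth_meters,
       (temp_f - 32) * 5.0 / 9.0  AS temp_celsius
FROM casts_raw
WHERE depth_m <> -999;            -- quality control: drop sentinel values
""")

for row in conn.execute("SELECT * FROM casts"):
    print(row)                    # only the cleaned, renamed, unit-converted rows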
26. Two Problems with SQLShare
• No help for really big datasets
• No iteration
27. Myria is…
• A compiler framework for multiple
iterative RA-based languages
• A parallel, shared-nothing, iterative
execution engine
• A RESTful Query-as-a-Service platform
• the Greek prefix meaning “ten thousand”
28. Myria Team
Dan Suciu
Magda Balazinska
Bill Howe
Dan Halperin (postdoc, technical lead)
Victor Almeida (postdoc)
Andrew Whitaker (research scientist)
Students
Paris Koutris
Emad Soroush
Jingjing Wang
ShengLiang Xu
Jennifer Ortiz
Jeremy Hyrkas
Shumo Chu
29. Myria Architecture
[Architecture diagram:] A Web UI and language parser (MyriaL), hosted on Google App Engine, feed a logical optimizer for RA+While. The Myria compiler emits a JSON query plan to MyriaDB: a REST server in front of a coordinator and its catalog. The coordinator drives shared-nothing workers (each with a local catalog) over netty protocols, and each worker reads and writes data through JDBC to a local RDBMS, or to HDFS. An alternate back end compiles through a C compiler to Grappa.
33. Why Iteration Matters
Datalog: reachability in a graph with 1.4B unique edges takes almost 200 iterations.
[Plot annotation:] The vast majority of reachable tuples are discovered by iteration 25.
34. Why Iteration Matters (continued)
The Datalog program continues for almost 200 iterations, each almost as expensive as the early steps.
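A minimal sketch of the iteration being described (semi-naive Datalog evaluation of reachability in Python; the toy graph and start vertex are illustrative, not the 1.4B-edge dataset): each round joins only the newly discovered frontier against the edge relation, and the loop keeps running until a round discovers nothing new. The plot's point is visible even here: the frontier shrinks over time, so late rounds do little useful work unless the engine exploits the delta.

def reachability(edges, source=0):
    """reach(source); reach(y) :- reach(x), edge(x, y). Semi-naive evaluation."""
    reached = {source}
    delta = {source}          # frontier discovered in the previous round
    rounds = 0
    while delta:
        # Join only the delta (not the full result) against the edges.
        delta = {y for (x, y) in edges if x in delta} - reached
        reached |= delta
        rounds += 1
    return reached, rounds

edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
print(reachability(edges))    # ({0, 1, 2, 3}, 3): the last round finds nothing new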
38. HaLoop: Bu, Howe, Balazinska, Ernst. VLDB10, VLDBJ12, Datalog12
[MapReduce dataflow diagram: each iteration maps and shuffles R and Ai, reduces to join R with ΔA(i-1), then maps and reduces again to take the difference against Ai.]
(a) R is loop invariant, but gets loaded and shuffled on each iteration.
(b) Ai grows slowly and monotonically, but is loaded and shuffled on each iteration.
HaLoop’s Reducer Input Cache addressed (a), but did not support the append semantics needed for (b).
39. Inter-loop caching: Bu, Howe, Balazinska, Ernst. VLDB10, VLDBJ12, Datalog12
Iteration i = 0: load a distributed cache.
Iteration i > 0: [diagram] only the delta ΔA(i-1) is mapped and shuffled; the reducers join and difference it against the cached partitions of R and A instead of reloading and reshuffling them.
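A minimal single-node sketch of inter-loop caching (a plain dict stands in for HaLoop's distributed reducer input cache): the loop-invariant relation R is hashed once, and each later iteration joins only the shrinking delta against it.

from collections import defaultdict

def transitive_closure(R):
    """Iterated join where R is loop-invariant: A(i+1) = A(i) ∪ (ΔA ⋈ R)."""
    # Iteration 0: build ("cache") a hash table over R once,
    # instead of re-loading and re-shuffling R every iteration.
    r_index = defaultdict(list)
    for (x, y) in R:
        r_index[x].append(y)

    A = set(R)
    delta = set(R)
    while delta:
        # Later iterations shuffle only the (shrinking) delta.
        new = {(x, z) for (x, y) in delta for z in r_index[y]}
        delta = new - A       # "Difference" step: keep only genuinely new tuples
        A |= delta
    return A

print(sorted(transitive_closure([(0, 1), (1, 2), (2, 3)])))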
41. Specialize the Cache for Query Semantics
[Same dataflow diagram as before.] MapReduce semantics require that all keys from the cache be extracted and passed to reducers: the reducer for the join sees all tuples from the cache, alongside the join keys arriving from the mappers. But we only care about keys that join.
44. Third Optimization: Extend the Cache to Support Duplicate Elimination
[Same dataflow diagram.] The accumulated result is not loop-invariant, but it changes relatively slowly, and it is needed on every iteration to check for duplicates. Extend the cache to support append, and we can use it for dupe-elim as well: the reducer for dupe-elim checks tuples arriving from the mappers against an indexed cache of unique keys, with new tuples inserted.
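A minimal sketch of an append-able dedup cache (single-node; the real version is a distributed, indexed reducer cache): tuples arriving from the mappers are probed against the cache of everything seen so far, and only genuinely new tuples are appended and forwarded.

class AppendableDedupCache:
    """Grows monotonically across iterations; supports lookup plus append."""
    def __init__(self):
        self.seen = set()             # stand-in for an indexed, persistent cache

    def filter_new(self, tuples):
        # Probe arriving tuples against the cache; append and emit the new ones.
        new = [t for t in tuples if t not in self.seen]
        self.seen.update(new)
        return new

cache = AppendableDedupCache()
print(cache.filter_new([(0, 1), (1, 2)]))   # [(0, 1), (1, 2)]
print(cache.filter_new([(1, 2), (2, 3)]))   # [(2, 3)]: duplicate dropped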
45. Effect of the Diff Cache
[Plot: loop-body time (s) per iteration, with and without the diff cache; y-axis 0 to 100 s, x-axis iterations 0 to 50.] The diff cache gives a ~20% overall improvement. Without it, failures may be more likely due to the extra network traffic.
51. Summary
• Goal: Expose all the world’s science data through
declarative query interfaces!
• Motivated by real science
• Data and query model is iterative relational algebra
• Industrial-strength Query-as-a-Service
http://db.cs.washington.edu/myria/
http://myria-web.appspot.com/
53. [Pipeline fragment: Datalog Parser, Logical Optimizer, Myria Compiler, with an alternate path through a C Compiler to Grappa; front end on Google App Engine.]
• Hypothesis: the performance difference between hand-coded graph algorithms and relational query plans amounts to implementation details.
• Can we generate “hand-coded” plans?
55. Assume a collection edges

answers = set()
for (x, y1) in edges:
    for (y2, z) in edges:
        if y1 == y2:
            answers.add((x, z))
count = len(answers)

In an RDBMS: “Nested Loops Join”
56. Assume a collection edges, but also an index
neighbors: vertex -> [vertex]

answers = set()
for (x, y) in edges:
    for z in neighbors[y]:
        answers.add((x, z))
count = len(answers)

In an RDBMS: “Hash Join”
57. Just drop the edges collection entirely, leaving only the index
neighbors: vertex -> [vertex]

answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors[y]:
            answers.add((x, z))
count = len(answers)

In an RDBMS: still a Hash Join
58. Just drop the edges collection entirely, leaving only the index
neighbors: vertex -> [vertex]

count = 0
answers = set()
for x in neighbors:
    for y in neighbors[x]:
        for z in neighbors[y]:
            answers.add(z)
    count += len(answers)   # answers holds values for only one x, so it stays small
    answers.clear()

RDBMSs don’t express this, but there’s no reason they couldn’t.
59. Or if you prefer… assume a collection of vertices, where each vertex points directly to its neighbors

count = 0
answers = set()
for x in vertices:
    for y in x.neighbors():
        for z in y.neighbors():
            answers.add(z)
    count += len(answers)   # answers holds values for only one x, so it stays small
    answers.clear()

Boils down to dereferencing a pointer vs. probing a hash table
Speaker notes:
So this is in part the motivation. There’s a group of great database researchers who work deeply with scientists: Dave Maier, my advisor; Jignesh, who left; Natassa, who left; Yannis, Alex, others. And we recently attracted some new blood to the science data arena. But this community of science databases has something in common with the HPTS community: Jim was a luminary of HPTS, and no less a luminary of science databases. The Sloan Digital Sky Survey…
To understand the problem, it’s useful to consider past successes. The Sloan Digital Sky Survey used a relational database with a carefully engineered schema, and then served the database online using a carefully engineered infrastructure. This approach requires a lot of people, expertise, money, and time: things that small and medium-sized projects don’t typically have. So the question we explore is: how can we support 1000 little SDSSs for small- and medium-sized projects? We started thinking about a new tool. The SDSS schema was designed in part by a Turing Award-winning database expert. We can’t afford to build a database plus applications from scratch for every project, and nobody wants to maintain such a system anyway. Most importantly, the data comes from all over the place instead of from a single source like SDSS: we can’t pretend the data will arrive clean and coherent.
…where I had to disguise myself as an oceanographer in order to do data science work. This is me on a research cruise in 2007.
But since joining the eScience Institute, I can mingle freely with the scientists in their natural habitat, and I sometimes get invited to their events.
In every discipline, you can play Where’s Waldo in these group photos and find me.
The problem is not only scale, and not even usually scale: it’s what Stratos called DB exploration, grubbing around in messy data with unknown quality, properties, etc. And, working…
But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that strong semantic integration is a prerequisite for query and analytics. It isn’t. It’s the final goal, not some insignificant preamble to analysis. Domain scientists know this; they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way. So one of my goals is to convince you that you can decouple declarative query from semantic integration, and doing so gives scientists a very powerful tool.
So we developed SQLShare to support a very simple workflow: you can upload data “as is” from spreadsheets or anything; it’s in the cloud, so there’s no need to install or design a database. You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries. Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers them.
Key ideas to get data in: (a) use the cloud to avoid having to install and run a database; (b) give up on the schema: just throw your data in “as is” and do “lazy integration”; (c) use some magic to automate parsing, integration, recommendations, and more.
Key ideas to get data out: (a) associate science questions (in English) with each SQL query, making them easy to understand and easy to find; (b) saving and reusing queries is a first-class requirement: given an example, it’s easy to modify it into an “adjacent” query; (c) expose the whole system through a REST API to make it easy to bring new client applications online.
Multiple input languages, multiple output languages, all RA-based. A database on every node for local processing. Everything in memory, but we can push down into the database. Push-based processing with back pressure to keep queues filled (a bit of streaming influence). Column-oriented tuple batches between workers; row-oriented on disk, typically, but it depends on the database. Support…
Four points to make: (0) this is the time for the join only, not the overall iteration time; (1) the first iteration is slower, as the cache is filled; (2) each iteration after that is about 23x faster by joining against cached results; (3) the gaps are failures, which are a reality at this scale; recovery proceeded as usual. HaLoop showed similar results, but did not evaluate complete Datalog queries.
Two points to make: (1) a 20% speedup on the overall iteration time from this specialization; this optimization violates MapReduce semantics, but is safe given our target language of Datalog. (2) The outliers represent failures, which are a reality of dealing with large-scale data, and a key reason why HaLoop is popular.
The diff cache works. Maybe ignore the failure comment, but just in case a question arises about why failures appear to be more common without the cache: the answer is we’re not sure, but we know more data is being transferred over the network without the cache.
But if we can’t express important analysis tasks, they’ll export their data and use some parallel cloudy R monstrosity.