How future astronomy projects will generate enormous amounts of data, and what that means for astronomical data processing. Part of the Virtual Observatory course by Juan de Dios Santander Vela, taught for the MTAF (Métodos y Técnicas Avanzadas en Física, Advanced Methods and Techniques in Physics) Master's at the University of Granada (UGR).
2. Overview
What is, exactly, big data?
What are the dimensions of big data?
What are the big data drivers in astronomy?
How can we deal with big data?
VO tools for dealing with big data
3. What is exactly Big Data?
Data sets whose size is beyond the ability
of commonly used software tools to
capture, manage, and process the data
within a tolerable elapsed time.
WIKIPEDIA: “BIG DATA”
4. What is exactly Big Data?
Big Data is data with at least one Big dimension
Bandwidth
Number of individual assets
Size of individual assets
Response speed
…
5. [Mind map: facets of Big Data]
Size and flow
Storage: offline storage, durability, formats
Access techniques: files, parallel access, schemata, capabilities
Processing techniques: data mining, real-time event processing
Data levels: raw data (unstructured) → processed data (structured, tagged, statistics) → extracted information (value vs. tech debt)
8. The Large Synoptic Survey
Telescope Camera
Steven M. Kahn
Stanford/SLAC
(for the LSST Consortium)
9. LSST Data Rates
* 2.3 billion pixels read out in less than 2 sec, every 12 sec
* 1 pixel = 2 Bytes (raw)
* Over 3 GBytes/sec peak raw data from camera
* Real-time processing and transient detection: < 10 sec
* Dynamic range: 4 Bytes / pixel
* > 0.6 GB/sec average in pipeline
* 5000 floating point operations per pixel
* 2 TFlop/s average, 9 TFlop/s peak
* ~ 18 TBytes/night
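A quick back-of-envelope check of the figures above; all constants come from the slide, but the night length is derived, not stated:

```python
# Sanity check of the LSST data-rate figures quoted on the slide.
PIXELS = 2.3e9           # camera pixels per exposure
RAW_BYTES_PER_PX = 2     # raw readout depth
PIPE_BYTES_PER_PX = 4    # dynamic range in the pipeline
READOUT_S = 2            # full readout in under 2 seconds
CADENCE_S = 12           # one exposure every 12 seconds

# Peak raw rate out of the camera (>2.3 GB/s; the slide's "over 3 GB/s"
# presumably includes readout overheads):
peak_raw_GBps = PIXELS * RAW_BYTES_PER_PX / READOUT_S / 1e9

# Average rate into the pipeline (~0.77 GB/s, consistent with ">0.6 GB/s"):
avg_pipe_GBps = PIXELS * PIPE_BYTES_PER_PX / CADENCE_S / 1e9

# ~18 TB/night then implies roughly 6.5 hours of continuous observing:
hours_per_night = 18e12 / (avg_pipe_GBps * 1e9) / 3600
```

The numbers hang together: the quoted nightly volume corresponds to a plausible observing night at the quoted average rate.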
14. Massive Data Flow, Storage & Processing
[Diagram: antenna & front-end systems → correlation → data product generation → temporary storage → long-term storage, high-availability storage/DB, on-demand processing]
Storage? The correlator output can't be stored: one day of the stream equals 150 days of global Internet traffic.
Temporary storage: 800 PB
Long-term storage: 18 PB/year
15. Massive Data Flow, Storage & Processing
[Same pipeline diagram, annotated with processing needs]
Correlation: > 1 ExaFlop/s (on the order of 10^9 top-range PCs)
Data product generation: 30 PetaFlop/s
16. Massive Data Flow, Storage & Processing
[Same pipeline diagram, annotated with bandwidths]
Correlation: 7 PB/s
Data product generation: > 300 GB/s
A typical survey would take 5 days of read time at 10 GB/sec
17. MASSIVE DATA FLOW, STORAGE & PROCESSING
[Bar chart, bandwidth in TB/s at the correlator (scale 0–40): LOFAR and ALMA]
18. MASSIVE DATA FLOW, STORAGE & PROCESSING
[Bar charts, bandwidth in TB/s at the correlator: ASKAP and LOFAR (scale 0–70) added alongside ALMA (scale 0–40)]
25. CERN/IT/DB
[Diagram: LHC online trigger chain]
Online system: 40 MHz (40 TB/sec)
Multi-level trigger: filter out background, reduce the data volume from 40 TB/s to 100 MB/s
Level 1 (special hardware): 75 kHz (75 GB/sec)
Level 2 (embedded processors): 5 kHz (5 GB/sec)
Level 3 (PCs): 100 Hz (100 MB/sec)
→ data recording & offline analysis
26. CERN/IT/DB
Event Filter & Reconstruction (figures are for one experiment)
Data from detector, event builder switch input: 5–100 GB/sec
Capacity: 50K SI95 computer farm (~4K 1999 PCs)
Recording rate: 100 MB/sec (ALICE: 1 GB/sec)
High-speed network to tape and disk servers
Raw data: 1–1.25 PetaByte/year
Summary data: 1–500 TB/year
20,000 Redwood cartridges every year (plus a copy)
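The recording rate and the yearly raw-data volume are consistent; the ~10^7 seconds of effective data-taking per year is my assumption, not stated on the slide:

```python
# Recording rate vs. yearly raw-data volume for one LHC experiment.
RECORD_MBps = 100        # recording rate from the slide, in MB/s
LIVE_SECONDS = 1.0e7     # assumed effective data-taking time per year

# 100 MB/s for ~10^7 s is 1 PB, matching the slide's 1-1.25 PB/year.
raw_PB_per_year = RECORD_MBps * 1e6 * LIVE_SECONDS / 1e15
```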
28. Dealing with Big Data
We cannot allow arbitrary queries,
but we can have arbitrary processing instead.
We cannot allow full data dumps,
but we can generate data on the fly (see above).
29. Queries as functions
QUERY = FUNCTION(DATA)
Queries need to be precomputed.
Arbitrary queries are only possible on the precomputed, smaller data sets.
30. Queries as functions
QUERY = FUNCTION(ALL DATA)
Queries need to be precomputed.
Arbitrary queries are only possible on the precomputed, smaller data sets.
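The idea can be sketched with a toy example (the data and names are mine, not from the course): the expensive function over all data runs once, and queries then hit only the small precomputed view.

```python
# query = function(all_data): instead of running arbitrary functions over
# the full dataset at query time, precompute a smaller view and answer
# queries from that view.
all_data = [("M31", 3.4), ("M31", 3.5), ("M42", 4.0), ("M42", 3.9)]  # toy (object, magnitude) rows

# Precomputed batch view: mean magnitude per object.
totals = {}
for obj, mag in all_data:
    total, count = totals.get(obj, (0.0, 0))
    totals[obj] = (total + mag, count + 1)
view = {obj: total / count for obj, (total, count) in totals.items()}

def query_mean_magnitude(obj):
    # Arbitrary queries run against the small precomputed view,
    # never against the full dataset.
    return view[obj]
```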
31. Lambda Architecture
Speed Layer: fast, incremental algorithms; serves queries not yet covered by the batch layer; compensates for batch latency.
Serving Layer: random access to views; updated by the batch layer.
Batch Layer: stores the master dataset; computes arbitrary views.
32. Batch Layer
INMUTABLE,
CONSTANTLY
Stores master copy of the dataset GROWING
Precomputes batch views on that master dataset
INMUTABLE,
CONSTANTLY
GROWING
33. Batch Layer
[Diagram: all data plus new data feed the batch layer, which, typically via Map/Reduce, recomputes the updated views View 1 … View n]
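A minimal map/reduce sketch of how a batch layer might recompute one such view (purely illustrative; the function and dataset names are mine):

```python
# Map/Reduce recomputation of a batch view: count detections per object.
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (key, value) pairs for one record of the master dataset.
    obj, _mag = record
    yield (obj, 1)

def reduce_phase(pairs):
    # Aggregate all values sharing a key into the final view.
    out = defaultdict(int)
    for key, value in pairs:
        out[key] += value
    return dict(out)

master_dataset = [("M31", 3.4), ("M31", 3.5), ("M42", 4.0)]
view = reduce_phase(chain.from_iterable(map_phase(r) for r in master_dataset))
# view == {"M31": 2, "M42": 1}
```

The map and reduce phases are independent per key, which is what lets a real framework spread them across many machines.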
34. Serving Layer
Allows for:
batch writes of view updates
random reads on the views
Does not allow random writes
35. Speed Layer
Allows for:
incremental writes of view updates
short-term temporal queries on the views
Can be discarded!
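Slides 31–35 together imply that answering a query means merging the (stale but complete) batch view with the speed layer's recent increments; a toy sketch, with all names and numbers illustrative:

```python
# Lambda Architecture query: combine the batch view, computed hours ago,
# with the speed view holding increments since the last batch run.
batch_view = {"M31": 2, "M42": 1}   # produced by the batch layer
speed_view = {"M42": 1, "M13": 1}   # incremental updates since then

def merged_count(obj):
    # The query sees a complete, up-to-date answer without waiting
    # for the next batch recomputation.
    return batch_view.get(obj, 0) + speed_view.get(obj, 0)
```

Once the next batch run absorbs the new data, the speed view's entries can indeed be discarded, as the slide says.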
36. [Figure 2.1: The master dataset in the Lambda Architecture serves as the source of truth of your Big Data system; errors at the serving and speed layers can be corrected by recomputing from it.]
37. Computing over Big Data
Batch layer as a computational engine on data.
Need to formally specify: inputs, processes, outputs.
(That looks like a workflow, or SQL querying!)
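One way to read "formally specify inputs, processes, outputs" is a declarative workflow-step description; the structure below is my own illustration, not anything prescribed by the course:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    # A formally specified computation: named inputs and outputs,
    # plus the process mapping one to the other.
    name: str
    inputs: list
    outputs: list
    process: Callable

def run(step, data):
    # Validate the declared inputs before running, as a workflow engine would.
    missing = [k for k in step.inputs if k not in data]
    if missing:
        raise ValueError(f"{step.name}: missing inputs {missing}")
    results = step.process(*(data[k] for k in step.inputs))
    return dict(zip(step.outputs, results))

# Hypothetical calibration step for illustration.
calibrate = Step(
    name="calibrate",
    inputs=["raw_counts", "gain"],
    outputs=["calibrated"],
    process=lambda counts, gain: ([c * gain for c in counts],),
)
```

Because inputs and outputs are explicit, steps can be chained into a workflow or checked before any data is touched.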
41. [Plot: dependence of execution time on the number of pool processors (1–8). Y axis: seconds per million elements (0.4–0.8); curves for datasets of 1, 5, 10 and 20 million elements.]
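The original benchmark code is not in the slides; a sketch of the kind of measurement the plot shows, using Python's `multiprocessing.Pool` (assumes a fork-capable platform; the workload is a stand-in):

```python
# Time the same workload with pools of different sizes.
import time
from multiprocessing import Pool

def work(chunk):
    # Stand-in per-element computation.
    return sum(x * x for x in chunk)

def timed_run(n_workers, n_elements=400_000, n_chunks=40):
    chunk = list(range(n_elements // n_chunks))
    chunks = [chunk] * n_chunks
    start = time.perf_counter()
    with Pool(n_workers) as pool:
        results = pool.map(work, chunks)
    return time.perf_counter() - start, sum(results)

elapsed_1, total_1 = timed_run(1)
elapsed_4, total_4 = timed_run(4)
```

Plotting elapsed time per element against the pool size for several dataset sizes reproduces the shape of the figure above, though the exact speedup depends on the per-element cost versus the pickling overhead.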
42. Conclusions
Big data needs different approaches:
Parallelism & data-side processing
Map/Reduce as a parallelism engine
Need for ways to formally specify computations
43. References & Links
"The Fourth Paradigm: Data-Intensive Scientific Discovery", Hey, Tansley & Tolle (eds.), Microsoft Research, based on Jim Gray's vision
"MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, Google
myExperiment (workflow-sharing portal)