How future astronomy projects will generate enormous amounts of data, and what that means for astronomical data processing. Part of the Virtual Observatory course by Juan de Dios Santander Vela, taught for the MTAF (Métodos y Técnicas Avanzadas en Física, Advanced Methods and Techniques in Physics) Master's at the University of Granada (UGR).
2. Overview
What is, exactly, big data?
What are the dimensions of big data?
What are the big data drivers in astronomy?
How can we deal with big data?
VO tools for dealing with big data
3. What is exactly Big Data?
Data sets whose size is beyond the ability
of commonly used software tools to
capture, manage, and process the data
within a tolerable elapsed time.
WIKIPEDIA: “BIG DATA”
4. What is exactly Big Data?
Big Data is data with at least one Big dimension
Bandwidth
Number of individual assets
Size of individual assets
Response speed
…
5. [Mind map: facets of Big Data]
Size and flow
Storage: offline storage, durability, formats
Access techniques: files, parallel access, schemata, capabilities
Processing techniques: data mining, real-time event processing
Data levels: raw data (unstructured) → processed data (structured, tagged, statistics) → extracted information (value vs. tech debt)
8. The Large Synoptic Survey
Telescope Camera
Steven M. Kahn
Stanford/SLAC
(for the LSST Consortium)
9. LSST Data Rates
* 2.3 billion pixels read out in less than 2 sec, every 12 sec
* 1 pixel = 2 Bytes (raw)
* Over 3 GBytes/sec peak raw data from camera
* Real-time processing and transient detection: < 10 sec
* Dynamic range: 4 Bytes / pixel
* > 0.6 GB/sec average in pipeline
* 5000 floating point operations per pixel
* 2 TFlop/s average, 9 TFlop/s peak
* ~ 18 TBytes/night
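A quick back-of-envelope check of the figures above; all constants come from the slide, but the night length is derived, not stated:

```python
# Sanity check of the LSST data-rate figures quoted on the slide.
PIXELS = 2.3e9           # camera pixels per exposure
RAW_BYTES_PER_PX = 2     # raw readout depth
PIPE_BYTES_PER_PX = 4    # dynamic range in the pipeline
READOUT_S = 2            # full readout in under 2 seconds
CADENCE_S = 12           # one exposure every 12 seconds

# Peak raw rate out of the camera (>2.3 GB/s; the slide's "over 3 GB/s"
# presumably includes readout overheads):
peak_raw_GBps = PIXELS * RAW_BYTES_PER_PX / READOUT_S / 1e9

# Average rate into the pipeline (~0.77 GB/s, consistent with ">0.6 GB/s"):
avg_pipe_GBps = PIXELS * PIPE_BYTES_PER_PX / CADENCE_S / 1e9

# ~18 TB/night then implies roughly 6.5 hours of continuous observing:
hours_per_night = 18e12 / (avg_pipe_GBps * 1e9) / 3600
```

The numbers hang together: the quoted nightly volume corresponds to a plausible observing night at the quoted average rate.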
14. Massive Data Flow, Storage & Processing
[Diagram: antenna & front-end systems → correlation → data product generation → temporary storage → long-term storage, high-availability storage/DB, on-demand processing]
Storage? The correlator output can't be stored: one day of the stream equals 150 days of global Internet traffic.
Temporary storage: 800 PB
Long-term storage: 18 PB/year
15. Massive Data Flow, Storage & Processing
[Same pipeline diagram, annotated with processing needs]
Correlation: > 1 ExaFlop/s (on the order of 10^9 top-range PCs)
Data product generation: 30 PetaFlop/s
16. Massive Data Flow, Storage & Processing
[Same pipeline diagram, annotated with bandwidths]
Correlation: 7 PB/s
Data product generation: > 300 GB/s
A typical survey would take 5 days of read time at 10 GB/sec
17. MASSIVE DATA FLOW, STORAGE & PROCESSING
[Bar chart, bandwidth in TB/s at the correlator (scale 0–40): LOFAR and ALMA]
18. MASSIVE DATA FLOW, STORAGE & PROCESSING
[Bar charts, bandwidth in TB/s at the correlator: ASKAP and LOFAR (scale 0–70) added alongside ALMA (scale 0–40)]
25. CERN/IT/DB
[Diagram: LHC online trigger chain]
Online system: 40 MHz (40 TB/sec)
Multi-level trigger: filter out background, reduce the data volume from 40 TB/s to 100 MB/s
Level 1 (special hardware): 75 kHz (75 GB/sec)
Level 2 (embedded processors): 5 kHz (5 GB/sec)
Level 3 (PCs): 100 Hz (100 MB/sec)
→ data recording & offline analysis
26. CERN/IT/DB
Event Filter & Reconstruction (figures are for one experiment)
Data from detector, event builder switch input: 5–100 GB/sec
Capacity: 50K SI95 computer farm (~4K 1999 PCs)
Recording rate: 100 MB/sec (ALICE: 1 GB/sec)
High-speed network to tape and disk servers
Raw data: 1–1.25 PetaByte/year
Summary data: 1–500 TB/year
20,000 Redwood cartridges every year (plus a copy)
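The recording rate and the yearly raw-data volume are consistent; the ~10^7 seconds of effective data-taking per year is my assumption, not stated on the slide:

```python
# Recording rate vs. yearly raw-data volume for one LHC experiment.
RECORD_MBps = 100        # recording rate from the slide, in MB/s
LIVE_SECONDS = 1.0e7     # assumed effective data-taking time per year

# 100 MB/s for ~10^7 s is 1 PB, matching the slide's 1-1.25 PB/year.
raw_PB_per_year = RECORD_MBps * 1e6 * LIVE_SECONDS / 1e15
```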
28. Dealing with Big Data
We cannot allow arbitrary queries,
but we can have arbitrary processing instead.
We cannot allow full data dumps,
but we can generate data on the fly (see above).
29. Queries as functions
QUERY = FUNCTION(DATA)
Queries need to be precomputed.
Arbitrary queries are only possible on the precomputed, smaller data sets.
30. Queries as functions
QUERY = FUNCTION(ALL DATA)
Queries need to be precomputed.
Arbitrary queries are only possible on the precomputed, smaller data sets.
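The idea can be sketched with a toy example (the data and names are mine, not from the course): the expensive function over all data runs once, and queries then hit only the small precomputed view.

```python
# query = function(all_data): instead of running arbitrary functions over
# the full dataset at query time, precompute a smaller view and answer
# queries from that view.
all_data = [("M31", 3.4), ("M31", 3.5), ("M42", 4.0), ("M42", 3.9)]  # toy (object, magnitude) rows

# Precomputed batch view: mean magnitude per object.
totals = {}
for obj, mag in all_data:
    total, count = totals.get(obj, (0.0, 0))
    totals[obj] = (total + mag, count + 1)
view = {obj: total / count for obj, (total, count) in totals.items()}

def query_mean_magnitude(obj):
    # Arbitrary queries run against the small precomputed view,
    # never against the full dataset.
    return view[obj]
```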
31. Lambda Architecture
Speed Layer: fast, incremental algorithms; serves queries not yet covered by the batch layer; compensates for batch latency.
Serving Layer: random access to views; updated by the batch layer.
Batch Layer: stores the master dataset; computes arbitrary views.
32. Batch Layer
INMUTABLE,
CONSTANTLY
Stores master copy of the dataset GROWING
Precomputes batch views on that master dataset
INMUTABLE,
CONSTANTLY
GROWING
33. Batch Layer
[Diagram: all data plus new data feed the batch layer, which, typically via Map/Reduce, recomputes the updated views View 1 … View n]
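A minimal map/reduce sketch of how a batch layer might recompute one such view (purely illustrative; the function and dataset names are mine):

```python
# Map/Reduce recomputation of a batch view: count detections per object.
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (key, value) pairs for one record of the master dataset.
    obj, _mag = record
    yield (obj, 1)

def reduce_phase(pairs):
    # Aggregate all values sharing a key into the final view.
    out = defaultdict(int)
    for key, value in pairs:
        out[key] += value
    return dict(out)

master_dataset = [("M31", 3.4), ("M31", 3.5), ("M42", 4.0)]
view = reduce_phase(chain.from_iterable(map_phase(r) for r in master_dataset))
# view == {"M31": 2, "M42": 1}
```

The map and reduce phases are independent per key, which is what lets a real framework spread them across many machines.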
34. Serving Layer
Allows for:
batch writes of view updates
random reads on the views
Does not allow random writes
35. Speed Layer
Allows for:
incremental writes of view updates
short-term temporal queries on the views
Can be discarded!
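Slides 31–35 together imply that answering a query means merging the (stale but complete) batch view with the speed layer's recent increments; a toy sketch, with all names and numbers illustrative:

```python
# Lambda Architecture query: combine the batch view, computed hours ago,
# with the speed view holding increments since the last batch run.
batch_view = {"M31": 2, "M42": 1}   # produced by the batch layer
speed_view = {"M42": 1, "M13": 1}   # incremental updates since then

def merged_count(obj):
    # The query sees a complete, up-to-date answer without waiting
    # for the next batch recomputation.
    return batch_view.get(obj, 0) + speed_view.get(obj, 0)
```

Once the next batch run absorbs the new data, the speed view's entries can indeed be discarded, as the slide says.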
36. [Figure 2.1: The master dataset in the Lambda Architecture serves as the source of truth of your Big Data system; errors at the serving and speed layers can be corrected by recomputing from it.]
37. Computing over Big Data
Batch layer as a computational engine on data.
Need to formally specify: inputs, processes, outputs.
(That looks like a workflow, or SQL querying!)
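One way to read "formally specify inputs, processes, outputs" is a declarative workflow-step description; the structure below is my own illustration, not anything prescribed by the course:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    # A formally specified computation: named inputs and outputs,
    # plus the process mapping one to the other.
    name: str
    inputs: list
    outputs: list
    process: Callable

def run(step, data):
    # Validate the declared inputs before running, as a workflow engine would.
    missing = [k for k in step.inputs if k not in data]
    if missing:
        raise ValueError(f"{step.name}: missing inputs {missing}")
    results = step.process(*(data[k] for k in step.inputs))
    return dict(zip(step.outputs, results))

# Hypothetical calibration step for illustration.
calibrate = Step(
    name="calibrate",
    inputs=["raw_counts", "gain"],
    outputs=["calibrated"],
    process=lambda counts, gain: ([c * gain for c in counts],),
)
```

Because inputs and outputs are explicit, steps can be chained into a workflow or checked before any data is touched.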
41. [Plot: dependence of execution time on the number of pool processors (1–8). Y axis: seconds per million elements (0.4–0.8); curves for datasets of 1, 5, 10 and 20 million elements.]
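The original benchmark code is not in the slides; a sketch of the kind of measurement the plot shows, using Python's `multiprocessing.Pool` (assumes a fork-capable platform; the workload is a stand-in):

```python
# Time the same workload with pools of different sizes.
import time
from multiprocessing import Pool

def work(chunk):
    # Stand-in per-element computation.
    return sum(x * x for x in chunk)

def timed_run(n_workers, n_elements=400_000, n_chunks=40):
    chunk = list(range(n_elements // n_chunks))
    chunks = [chunk] * n_chunks
    start = time.perf_counter()
    with Pool(n_workers) as pool:
        results = pool.map(work, chunks)
    return time.perf_counter() - start, sum(results)

elapsed_1, total_1 = timed_run(1)
elapsed_4, total_4 = timed_run(4)
```

Plotting elapsed time per element against the pool size for several dataset sizes reproduces the shape of the figure above, though the exact speedup depends on the per-element cost versus the pickling overhead.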
42. Conclusions
Big data needs different approaches:
Parallelism & data-side processing
Map/Reduce as a parallelism engine
Need for ways to formally specify computations
43. References & Links
"The Fourth Paradigm: Data-Intensive Scientific Discovery", Hey, Tansley & Tolle (eds.), Microsoft Research, based on Jim Gray's vision
"MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, Google
myExperiment (workflow-sharing portal)