Astronomy’s Big Data
Challenges
Juan de Dios Santander Vela (IAA-CSIC)
Overview

What exactly is big data?

What are the dimensions of big data?

What are the big data drivers in astronomy?

How can we deal with big data?

VO tools for dealing with big data
What exactly is Big Data?

 Data sets whose size is beyond the ability
 of commonly used software tools to
 capture, manage, and process the data
 within a tolerable elapsed time.

                          WIKIPEDIA: “BIG DATA”
What exactly is Big Data?
Big Data is data with at least one Big dimension

  Bandwidth

  Number of individual assets

  Size of individual assets

  Response speed

  …
[Mind map: aspects of Big Data. Central node: Big Data. Node labels recoverable from the slide: Size, Flow, Storage, Access techniques, Processing techniques, Data mining, Offline, Real time, Event Processing, Processing level, Raw Data, Processed Data, Statistics, Schemata, Unstructured, Structured, Tagging, Value, Information Extracted, Tech Debt, Files, Parallel Access Capabilities, Durability, Formats.]
Next big data projects in
astronomy
Large Synoptic Survey
Telescope
The Large Synoptic Survey
    Telescope Camera


       Steven M. Kahn
       Stanford/SLAC
    (for the LSST Consortium)
LSST Data Rates


* 2.3 billion pixels read out in less than 2 sec, every 12 sec
* 1 pixel = 2 Bytes (raw)
* Over 3 GBytes/sec peak raw data from camera
* Real-time processing and transient detection: < 10 sec
* Dynamic range: 4 Bytes / pixel
* > 0.6 GB/sec average in pipeline
* 5000 floating point operations per pixel
* 2 TFlop/s average, 9 TFlop/s peak
* ~ 18 Tbytes/night
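The camera figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: it assumes decimal units (1 GB = 10^9 bytes) and treats the 2-second readout and the 12-second cadence as exact values, which the slide does not state. All constant names are hypothetical.

```python
# Back-of-the-envelope check of the LSST camera figures quoted above.
# Assumptions (not from the slide): decimal units, exact 2 s readout,
# exact 12 s cadence.
PIXELS = 2_300_000_000          # 2.3 billion pixels per readout
RAW_BYTES_PER_PIXEL = 2         # raw sample size
PIPELINE_BYTES_PER_PIXEL = 4    # dynamic range used in the pipeline
READOUT_SECONDS = 2             # upper bound on the readout time
CADENCE_SECONDS = 12            # one readout every 12 s
GB = 10**9

# Size of one raw readout.
raw_readout_gb = PIXELS * RAW_BYTES_PER_PIXEL / GB

# Reading out in exactly 2 s gives a lower bound on the peak rate;
# a faster readout pushes the peak higher.
min_peak_rate_gb_s = raw_readout_gb / READOUT_SECONDS

# Average rate once every pixel is widened to 4 bytes.
avg_pipeline_rate_gb_s = PIXELS * PIPELINE_BYTES_PER_PIXEL / GB / CADENCE_SECONDS

print(f"raw readout size:        {raw_readout_gb:.1f} GB")
print(f"peak rate (2 s readout): {min_peak_rate_gb_s:.1f} GB/s or more")
print(f"average pipeline rate:   {avg_pipeline_rate_gb_s:.2f} GB/s")
```

Under these assumptions one readout is 4.6 GB, so any readout faster than 2 s exceeds 2.3 GB/s, which is compatible with the quoted "over 3 GBytes/sec" peak. The 4-byte average of about 0.77 GB/s is likewise compatible with "> 0.6 GB/sec".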
Relative Survey Power
Square Kilometre Array
Signal Transport & Processing
  DESIGN COUNTS!
Massive Data Flow, Storage & Processing

Stages: Antenna & Front End Systems; Correlation; Data Product Generation; Temporary Storage; Long Term Storage; High Availability Storage / DB; On-Demand Processing

STORAGE?
  CAN’T STORE IT!
  1 DAY STREAM = 150 DAYS OF GLOBAL INTERNET TRAFFIC
  Temporary Storage: 800 PB
  Long Term Storage: 18 PB/YEAR
Massive Data Flow, Storage & Processing

PROCESSING NEEDS
  Correlation: > 1 EXAFLOP/S (109 TOP RANGE PCS)
  Data Product Generation: 30 PETAFLOP/S
Massive Data Flow, Storage & Processing

BANDWIDTH
  Antenna & Front End Systems to Correlation: 7 PB/S
  Correlation to Data Product Generation: > 300 GB/S
  TYPICAL SURVEY: 5 DAYS READ TIME @ 10 GB/SEC
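The bandwidth callouts translate into sizes with simple arithmetic. This is a rough sketch, assuming decimal units and assuming that the 7 PB/S label is the data rate into the correlator and the > 300 GB/S label the rate out of it (a reading of the diagram layout, not something the slide states). Variable names are hypothetical.

```python
# Rough arithmetic on the SKA bandwidth callouts.
# Assumptions: decimal units; 7 PB/s feeds the correlator and 300 GB/s
# (a lower bound on the slide) leaves it.
SECONDS_PER_DAY = 86_400
GB = 10**9
PB = 10**15

# "Typical survey, 5 days read time @ 10 GB/sec" implies this survey size.
read_rate_bytes_s = 10 * GB
read_days = 5
survey_bytes = read_rate_bytes_s * read_days * SECONDS_PER_DAY
survey_pb = survey_bytes / PB

# Data reduction across the correlator. Because 300 GB/s is a lower
# bound, this ratio is an upper bound on the reduction factor.
antenna_rate_bytes_s = 7 * PB
correlator_output_bytes_s = 300 * GB
reduction = antenna_rate_bytes_s / correlator_output_bytes_s

print(f"typical survey size: {survey_pb:.2f} PB")
print(f"correlator reduction factor: at most about {reduction:,.0f}x")
```

Under these assumptions a typical survey is about 4.32 PB, and the correlator reduces the stream by a factor of roughly 23,000 at most.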
MASSIVE DATA FLOW, STORAGE & PROCESSING
[Bar charts, built up over several slides, for the link from Antenna & Front End Systems to Correlation: bandwidth in TB/s for ALMA, LOFAR and ASKAP. Axis ranges shown: 0 to 40 TB/s and 0 to 70 TB/s. Bar values are not recoverable from the transcript.]
MASSIVE DATA FLOW, STORAGE & PROCESSING
[Bar charts, built up over several slides, for Correlation: processing in TFlop/s for VLA, ALMA, LOFAR and ASKAP. Axis ranges shown: 0 to 0.002, 0 to 120 and 0 to 350 TFlop/s. Bar values are not recoverable from the transcript.]
Comparison: LHC
CERN/IT/DB

online system
multi-level trigger
filter out background
reduce data volume from 40 TB/s to 100 MB/s

  40 MHz (40 TB/sec)
    level 1 - special hardware
  75 KHz (75 GB/sec)
    level 2 - embedded processors
  5 KHz (5 GB/sec)
    level 3 - PCs
  100 Hz (100 MB/sec)
    data recording & offline analysis
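The trigger cascade can be checked numerically. The sketch below is illustrative only: it assumes decimal units and reads each event rate and data rate pair on the slide as the stream leaving one stage and entering the next. Stage labels and variable names are hypothetical.

```python
# Event rate (Hz) and data rate (bytes/s) at each point of the LHC
# trigger chain, as read off the slide. Decimal units are assumed.
MB, GB, TB = 10**6, 10**9, 10**12

stages = [
    ("detector output", 40_000_000, 40 * TB),
    ("after level 1", 75_000, 75 * GB),
    ("after level 2", 5_000, 5 * GB),
    ("after level 3", 100, 100 * MB),
]

# Bytes per event implied by each pair of figures.
event_sizes = [rate // hz for _, hz, rate in stages]

# Overall data reduction from detector to recording.
overall_reduction = stages[0][2] // stages[-1][2]

for (name, hz, rate), size in zip(stages, event_sizes):
    print(f"{name:16s} {hz:>11,d} Hz  {rate / GB:>10,.1f} GB/s  {size / MB:.0f} MB/event")
print(f"overall reduction: {overall_reduction:,d}x")
```

Every pair of figures implies the same 1 MB per event, so under this reading the trigger reduces volume purely by discarding events, for an overall factor of 400,000.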
CERN/IT/DB
Event Filter & Reconstruction
(figures are for one experiment)

Stages: data from detector - event builder; switch; computer farm; high speed network; tape and disk servers

  input: 5-100 GB/sec
  capacity: 50K SI95 (~4K 1999 PCs)
  recording rate: 100 MB/sec (Alice – 1 GB/sec)
  raw data: + 1-1.25 PetaByte/year
    20,000 Redwood cartridges every year (+ copy)
  summary data: + 1-500 TB/year
Dealing with Big Data

We cannot allow for arbitrary queries

  We can have arbitrary processing instead

We cannot allow full data dumps

  We can generate data on the fly (see above)
Queries as functions

     QUERY = FUNCTION(ALL DATA)

        QUERIES NEED TO BE PRECOMPUTED
        ARBITRARY QUERIES ONLY POSSIBLE ON
        THE PRECOMPUTED, SMALLER DATA SETS
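A minimal sketch of the idea that a query is a function of all the data, answered from a precomputed and much smaller view. The dataset, field layout and function names are invented for illustration.

```python
from collections import Counter

# Hypothetical master dataset: (source_id, sky_tile, magnitude) detections.
all_data = [
    ("src-1", "tile-A", 18.2),
    ("src-2", "tile-A", 21.5),
    ("src-3", "tile-B", 19.9),
    ("src-4", "tile-B", 17.1),
    ("src-5", "tile-B", 22.3),
]


def precompute_counts_per_tile(data):
    """Batch job: a function of ALL the data, run ahead of time."""
    return Counter(tile for _, tile, _ in data)


# The precomputed view is small, so arbitrary lookups on it are cheap.
counts_view = precompute_counts_per_tile(all_data)


def query_detections_in_tile(view, tile):
    """Query answered from the view, never from the master dataset."""
    return view.get(tile, 0)


print(query_detections_in_tile(counts_view, "tile-B"))
```

The expensive pass over all the data happens once, in the batch job. The query itself only touches the view.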
Lambda Architecture

  Speed Layer
    FAST, INCREMENTAL ALGORITHMS
    QUERIES NOT ON BATCH LAYER
    COMPENSATES FOR LATENCY

  Serving Layer
    RANDOM ACCESS TO VIEWS
    UPDATED BY BATCH LAYER

  Batch Layer
    STORE MASTER DATASET
    COMPUTE ARBITRARY VIEWS
Batch Layer

Stores master copy of the dataset (IMMUTABLE, CONSTANTLY GROWING)

Precomputes batch views on that master dataset

Batch Layer

  All Data (IMMUTABLE, CONSTANTLY GROWING) + NEW DATA
    → Batch Layer (TYPICALLY, MAP/REDUCE)
    → UPDATED VIEWS: View 1, View 2, …, View n
Serving Layer

Allows for:

  batch writes of view updates

  random reads on the views

Does not allow random writes
Speed Layer

Allows for:

  incremental writes of view updates

  short-term temporal queries on the views

Can be discarded!
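How the three layers cooperate can be sketched in a few lines. This is a toy, in-memory illustration with hypothetical names, not a real implementation: it counts detections per sky tile, and it assumes the batch run is instantaneous, whereas a real batch run takes long enough that the speed layer must keep covering data that arrives meanwhile.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, serving and speed layers (hypothetical)."""

    def __init__(self):
        self.master = []             # batch layer: append-only master dataset
        self.batch_view = Counter()  # serving layer: replaced wholesale by batch runs
        self.speed_view = Counter()  # speed layer: incremental and disposable

    def ingest(self, tile):
        """New data goes to the master dataset and to the speed layer."""
        self.master.append(tile)
        self.speed_view[tile] += 1

    def run_batch(self):
        """Recompute the view from ALL data, then discard the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, tile):
        """Merge the precomputed view with the recent increments."""
        return self.batch_view[tile] + self.speed_view[tile]


system = LambdaSketch()
for tile in ["tile-A", "tile-A", "tile-B"]:
    system.ingest(tile)
system.run_batch()
system.ingest("tile-A")  # arrives after the batch run
print(system.query("tile-A"))
```

The speed view can be thrown away after each batch run because the master dataset still holds everything. That is what makes errors in the speed layer recoverable.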
Figure 2.1 The master dataset in the Lambda Architecture serves as the source of
truth of your Big Data system. Errors at the serving and speed layers can be
Computing over
Big Data
Batch layer as a computational engine on data

Need to formally specify

  Inputs

  Processes

  Outputs

THAT LOOKS LIKE A WORKFLOW! OR SQL QUERYING
Map/Reduce
from functools import reduce  # built in on Python 2, must be imported on Python 3
from math import sqrt
from random import normalvariate

def res2(x):
    # Squared residual of x with respect to the (global) mean
    return pow(mean_v - x, 2.)

# Random vector, mean 1, stdev 0.001
v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
mean_v = reduce(lambda x, y: x + y, v) / len(v)

res2_v = map(res2, v)

stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
print(mean_v, stdev)
                                        PARALLELISABLE!
Map/Reduce
from functools import partial, reduce
from math import sqrt
from multiprocessing import Pool
from random import normalvariate

def res2(mean_v, x):
    # Squared residual of x with respect to the mean
    return pow(mean_v - x, 2.)

if __name__ == '__main__':
    # Random vector, mean 1, stdev 0.001
    v = [normalvariate(1, 0.001) for x in range(0, 1000000)]
    mean_v = reduce(lambda x, y: x + y, v) / len(v)
    pool = Pool(processes=4)
    # The mean is passed explicitly so the workers do not rely on a global
    res2_v = pool.map(partial(res2, mean_v), v)
    pool.close()
    stdev = sqrt(reduce(lambda x, y: x + y, res2_v) / len(v))
    print(mean_v, stdev)
                             ONLY FOR MAP, BUT REDUCE
                               ALSO PARALLELISABLE
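The callout notes that the reduce step can be parallelised as well. A minimal sketch of one way to do that, assuming the combining operation is associative (here, a sum): split the input into chunks, reduce each chunk independently, then reduce the partial results. The names `chunked` and `chunked_sum` are hypothetical, and `map_fn` defaults to the sequential built-in `map`, so the example runs without worker processes.

```python
def chunked(values, n_chunks):
    """Split values into at most n_chunks contiguous slices."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def chunked_sum(values, n_chunks=4, map_fn=map):
    """Sum values as independent partial sums followed by one final sum."""
    # Each partial sum depends only on its own chunk, so the chunks
    # can be handed to separate workers.
    partials = list(map_fn(sum, chunked(values, n_chunks)))
    return sum(partials)


print(chunked_sum(list(range(1, 101))))  # prints 5050
```

Passing the `map` method of a `multiprocessing.Pool` as `map_fn` would compute the partial sums in worker processes. With floating-point data the result can differ from a sequential sum in the last digits, because the additions happen in a different order.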
Dependence of execution time on the number of pool processors
[Plot: seconds per million elements (y axis, 0.4 to 0.8) against number of pool processors (x axis, 1 to 8), with one curve each for 1, 5, 10 and 20 million elements.]
Conclusions

Big data needs different approaches

  Parallelism & data-side processing

  Map/Reduce as parallelism engine

Need for ways to formally specify computations
References & Links

“The Fourth Paradigm: Data-Intensive Scientific
Discovery”, Jim Gray, Microsoft Research

“MapReduce: Simplified Data Processing on Large 
Clusters”, Jeffrey Dean and Sanjay Ghemawat,
Google

MyExperiment

More Related Content

Viewers also liked

modern security risks for big data and mobile applications
modern security risks for big data and mobile applicationsmodern security risks for big data and mobile applications
modern security risks for big data and mobile applicationsTrivadis
 
Addressing Big Data Security Challenges: The Right Tools for Smart Protection...
Addressing Big Data Security Challenges: The Right Tools for Smart Protection...Addressing Big Data Security Challenges: The Right Tools for Smart Protection...
Addressing Big Data Security Challenges: The Right Tools for Smart Protection...Information Security Awareness Group
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Big data security challenges and recommendations!
Big data security challenges and recommendations!Big data security challenges and recommendations!
Big data security challenges and recommendations!cisoplatform
 
Big data and cyber security legal risks and challenges
Big data and cyber security legal risks and challengesBig data and cyber security legal risks and challenges
Big data and cyber security legal risks and challengesKapil Mehrotra
 
Growth Hacking - 10 Key Checklist
Growth Hacking - 10 Key Checklist Growth Hacking - 10 Key Checklist
Growth Hacking - 10 Key Checklist Bryan Ferguson
 
WSO2Con USA 2017: Geospatial Big Data – Location Intelligence in Digital Tran...
WSO2Con USA 2017: Geospatial Big Data – Location Intelligence in Digital Tran...WSO2Con USA 2017: Geospatial Big Data – Location Intelligence in Digital Tran...
WSO2Con USA 2017: Geospatial Big Data – Location Intelligence in Digital Tran...WSO2
 
Cyber Risk Management in 2017: Challenges & Recommendations
Cyber Risk Management in 2017: Challenges & RecommendationsCyber Risk Management in 2017: Challenges & Recommendations
Cyber Risk Management in 2017: Challenges & RecommendationsUlf Mattsson
 
Cyber security threats for 2017
Cyber security threats for 2017Cyber security threats for 2017
Cyber security threats for 2017Ramiro Cid
 
IoT Security Risks and Challenges
IoT Security Risks and ChallengesIoT Security Risks and Challenges
IoT Security Risks and ChallengesOWASP Delhi
 
Top ten big data security and privacy challenges
Top ten big data security and privacy challengesTop ten big data security and privacy challenges
Top ten big data security and privacy challengesBee_Ware
 
Cyber crime and security ppt
Cyber crime and security pptCyber crime and security ppt
Cyber crime and security pptLipsita Behera
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An OverviewC. Scyphers
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBernard Marr
 
SXSW 2016 takeaways
SXSW 2016 takeawaysSXSW 2016 takeaways
SXSW 2016 takeawaysHavas
 

Viewers also liked (18)

modern security risks for big data and mobile applications
modern security risks for big data and mobile applicationsmodern security risks for big data and mobile applications
modern security risks for big data and mobile applications
 
Addressing Big Data Security Challenges: The Right Tools for Smart Protection...
Addressing Big Data Security Challenges: The Right Tools for Smart Protection...Addressing Big Data Security Challenges: The Right Tools for Smart Protection...
Addressing Big Data Security Challenges: The Right Tools for Smart Protection...
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Big data security challenges and recommendations!
Big data security challenges and recommendations!Big data security challenges and recommendations!
Big data security challenges and recommendations!
 
Big data and cyber security legal risks and challenges
Big data and cyber security legal risks and challengesBig data and cyber security legal risks and challenges
Big data and cyber security legal risks and challenges
 
Growth Hacking - 10 Key Checklist
Growth Hacking - 10 Key Checklist Growth Hacking - 10 Key Checklist
Growth Hacking - 10 Key Checklist
 
WSO2Con USA 2017: Geospatial Big Data – Location Intelligence in Digital Tran...
WSO2Con USA 2017: Geospatial Big Data – Location Intelligence in Digital Tran...WSO2Con USA 2017: Geospatial Big Data – Location Intelligence in Digital Tran...
WSO2Con USA 2017: Geospatial Big Data – Location Intelligence in Digital Tran...
 
IoT - Big Data & Security
IoT - Big Data & SecurityIoT - Big Data & Security
IoT - Big Data & Security
 
Cyber Risk Management in 2017: Challenges & Recommendations
Cyber Risk Management in 2017: Challenges & RecommendationsCyber Risk Management in 2017: Challenges & Recommendations
Cyber Risk Management in 2017: Challenges & Recommendations
 
Cyber security threats for 2017
Cyber security threats for 2017Cyber security threats for 2017
Cyber security threats for 2017
 
Big Idea For Big Data
Big Idea For Big DataBig Idea For Big Data
Big Idea For Big Data
 
IoT Security Risks and Challenges
IoT Security Risks and ChallengesIoT Security Risks and Challenges
IoT Security Risks and Challenges
 
Top ten big data security and privacy challenges
Top ten big data security and privacy challengesTop ten big data security and privacy challenges
Top ten big data security and privacy challenges
 
Cyber crime and security ppt
Cyber crime and security pptCyber crime and security ppt
Cyber crime and security ppt
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An Overview
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 
SXSW 2016 takeaways
SXSW 2016 takeawaysSXSW 2016 takeaways
SXSW 2016 takeaways
 

Similar to VO Course 10: Big data challenges in astronomy

2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2Wilfried Hoge
 
Big Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyBig Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyHitachi Vantara
 
Accel Partners New Data Workshop 7-14-10
Accel Partners New Data Workshop 7-14-10Accel Partners New Data Workshop 7-14-10
Accel Partners New Data Workshop 7-14-10keirdo1
 
Protect Your Big Data with Intel<sup>®</sup> Xeon<sup>®</sup> Processors a..
Protect Your Big Data with Intel<sup>®</sup> Xeon<sup>®</sup> Processors a..Protect Your Big Data with Intel<sup>®</sup> Xeon<sup>®</sup> Processors a..
Protect Your Big Data with Intel<sup>®</sup> Xeon<sup>®</sup> Processors a..Odinot Stanislas
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Scalable Computing Labs (SCL).
Scalable Computing Labs (SCL).Scalable Computing Labs (SCL).
Scalable Computing Labs (SCL).Mindtree Ltd.
 
Good Data: Collaborative Analytics On Demand
Good Data: Collaborative Analytics On DemandGood Data: Collaborative Analytics On Demand
Good Data: Collaborative Analytics On Demandzsvoboda
 
Virtualizing Latency Sensitive Workloads and vFabric GemFire
Virtualizing Latency Sensitive Workloads and vFabric GemFireVirtualizing Latency Sensitive Workloads and vFabric GemFire
Virtualizing Latency Sensitive Workloads and vFabric GemFireCarter Shanklin
 
ScaleBase Webinar: Methods and Challenges to Scale Out a MySQL Database
ScaleBase Webinar: Methods and Challenges to Scale Out a MySQL DatabaseScaleBase Webinar: Methods and Challenges to Scale Out a MySQL Database
ScaleBase Webinar: Methods and Challenges to Scale Out a MySQL DatabaseScaleBase
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoopDataWorks Summit
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Toolsboorad
 
January 2006 Document Scanning Considerations Presentation
January 2006 Document Scanning Considerations PresentationJanuary 2006 Document Scanning Considerations Presentation
January 2006 Document Scanning Considerations PresentationJohn Wang
 
Sn wf12 amd fabric server (satheesh nanniyur) oct 12
Sn wf12 amd fabric server (satheesh nanniyur) oct 12Sn wf12 amd fabric server (satheesh nanniyur) oct 12
Sn wf12 amd fabric server (satheesh nanniyur) oct 12Satheesh Nanniyur
 
Mike Stolz Dramatic Scalability
Mike Stolz Dramatic ScalabilityMike Stolz Dramatic Scalability
Mike Stolz Dramatic Scalabilitydeimos
 

Similar to VO Course 10: Big data challenges in astronomy (20)

2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyBig Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage Strategy
 
Accel Partners New Data Workshop 7-14-10
Accel Partners New Data Workshop 7-14-10Accel Partners New Data Workshop 7-14-10
Accel Partners New Data Workshop 7-14-10
 
Protect Your Big Data with Intel<sup>®</sup> Xeon<sup>®</sup> Processors a..
Protect Your Big Data with Intel<sup>®</sup> Xeon<sup>®</sup> Processors a..Protect Your Big Data with Intel<sup>®</sup> Xeon<sup>®</sup> Processors a..
Protect Your Big Data with Intel<sup>®</sup> Xeon<sup>®</sup> Processors a..
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Scalable Computing Labs (SCL).
Scalable Computing Labs (SCL).Scalable Computing Labs (SCL).
Scalable Computing Labs (SCL).
 
Cosbench apac
Cosbench apacCosbench apac
Cosbench apac
 
Good Data: Collaborative Analytics On Demand
Good Data: Collaborative Analytics On DemandGood Data: Collaborative Analytics On Demand
Good Data: Collaborative Analytics On Demand
 
iRODS
iRODSiRODS
iRODS
 
Virtualizing Latency Sensitive Workloads and vFabric GemFire
Virtualizing Latency Sensitive Workloads and vFabric GemFireVirtualizing Latency Sensitive Workloads and vFabric GemFire
Virtualizing Latency Sensitive Workloads and vFabric GemFire
 
ScaleBase Webinar: Methods and Challenges to Scale Out a MySQL Database
ScaleBase Webinar: Methods and Challenges to Scale Out a MySQL DatabaseScaleBase Webinar: Methods and Challenges to Scale Out a MySQL Database
ScaleBase Webinar: Methods and Challenges to Scale Out a MySQL Database
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoop
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Flow
FlowFlow
Flow
 
Cosbench apac
Cosbench apacCosbench apac
Cosbench apac
 
Large Scale Data Analysis Tools
Large Scale Data Analysis ToolsLarge Scale Data Analysis Tools
Large Scale Data Analysis Tools
 
January 2006 Document Scanning Considerations Presentation
January 2006 Document Scanning Considerations PresentationJanuary 2006 Document Scanning Considerations Presentation
January 2006 Document Scanning Considerations Presentation
 
Sn wf12 amd fabric server (satheesh nanniyur) oct 12
Sn wf12 amd fabric server (satheesh nanniyur) oct 12Sn wf12 amd fabric server (satheesh nanniyur) oct 12
Sn wf12 amd fabric server (satheesh nanniyur) oct 12
 
Mike Stolz Dramatic Scalability
Mike Stolz Dramatic ScalabilityMike Stolz Dramatic Scalability
Mike Stolz Dramatic Scalability
 

More from Joint ALMA Observatory

Hablemos de ALMA — Wideband Sensitivity Upgrade
Hablemos de ALMA — Wideband Sensitivity UpgradeHablemos de ALMA — Wideband Sensitivity Upgrade
Hablemos de ALMA — Wideband Sensitivity UpgradeJoint ALMA Observatory
 
From SKA to SKAO: Early progress in SKA project construction.
From SKA to SKAO: Early progress in SKA project construction.From SKA to SKAO: Early progress in SKA project construction.
VO Course 10: Big data challenges in astronomy

  • 1. Astronomy’s Big Data Challenges Juan de Dios Santander Vela (IAA-CSIC)
  • 2. Overview: What, exactly, is big data? What are the dimensions of big data? What are the big data drivers in astronomy? How can we deal with big data? VO tools for dealing with big data
  • 3. What exactly is Big Data? “Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” (Wikipedia: “Big Data”)
  • 4. What exactly is Big Data? Big Data is data with at least one Big dimension: bandwidth, number of individual assets, size of individual assets, response speed, …
  • 5. Big Data (mind map; node labels only, hierarchy not recoverable): Data mining, Processing techniques, Offline, Storage, Size, Flow, Access techniques, Real time, Event Processing, Processing, Parallel, Files level, Access, Raw Data, Schemata, Capabilities, Unstructured, Durability, Processed, Formats, Data, Statistics, Value, Structured, Tagging, Information, Tech Debt, Extracted
  • 6. Next big data projects in astronomy
  • 8. The Large Synoptic Survey Telescope Camera Steven M. Kahn Stanford/SLAC (for the LSST Consortium)
  • 9. LSST Data Rates * 2.3 billion pixels read out in less than 2 sec, every 12 sec * 1 pixel = 2 Bytes (raw) * Over 3 GBytes/sec peak raw data from camera * Real-time processing and transient detection: < 10 sec * Dynamic range: 4 Bytes / pixel * > 0.6 GB/sec average in pipeline * 5000 floating point operations per pixel * 2 TFlop/s average, 9 TFlop/s peak * ~ 18 Tbytes/night
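The LSST figures on this slide can be cross-checked with a little arithmetic. The following is a back-of-the-envelope sketch, not LSST's own accounting: the pixel count, bytes per pixel, readout time, and cadence come from the slide, while the function names, decimal units, and the 10-hour observing night are assumptions made here for illustration.

```python
# Figures taken from the slide.
PIXELS_PER_EXPOSURE = 2.3e9      # "2.3 billion pixels"
RAW_BYTES_PER_PIXEL = 2          # "1 pixel = 2 Bytes (raw)"
PIPELINE_BYTES_PER_PIXEL = 4     # "Dynamic range: 4 Bytes / pixel"
READOUT_SECONDS = 2.0            # read out "in less than 2 sec" (upper bound)
CADENCE_SECONDS = 12.0           # one exposure "every 12 sec"

# Assumption (not on the slide): length of an observing night.
NIGHT_SECONDS = 10 * 3600


def exposure_bytes(bytes_per_pixel):
    """Size of a single exposure in bytes."""
    return PIXELS_PER_EXPOSURE * bytes_per_pixel


def rate(n_bytes, seconds):
    """Data rate in bytes per second."""
    return n_bytes / seconds


raw_exposure = exposure_bytes(RAW_BYTES_PER_PIXEL)

# Lower bound on the burst rate: the readout takes *less* than 2 s,
# so the true peak is higher than this value.
burst_rate_lower_bound = rate(raw_exposure, READOUT_SECONDS)

# Rates averaged over the full 12 s cadence.
average_raw_rate = rate(raw_exposure, CADENCE_SECONDS)
average_pipeline_rate = rate(
    exposure_bytes(PIPELINE_BYTES_PER_PIXEL), CADENCE_SECONDS
)

# Raw volume accumulated over the assumed night.
raw_bytes_per_night = average_raw_rate * NIGHT_SECONDS
```

Under these assumptions the lower bound on the burst rate is 2.3 GB/s, the pipeline average comes out above the slide's 0.6 GB/s, and the nightly raw volume is of the same order as the slide's ~18 TB/night.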
  • 12. Signal Transport & Processing
  • 13. Signal Transport & Processing: DESIGN COUNTS!
  • 14. Massive Data Flow, Storage & Processing. Pipeline: Antenna & Front End Systems, Correlation, Data Product Generation, Temporary Storage, On-Demand Processing, Long Term Storage, High Availability Storage / DB. STORAGE? CAN’T STORE IT! 1 day stream = 150 days of global internet traffic. Storage figures on the slide: 800 PB; 18 PB/year
  • 15. Massive Data Flow, Storage & Processing (same pipeline). PROCESSING NEEDS: > 1 exaflop/s, 10^9 top-range PCs; 30 petaflops/s
  • 16. Massive Data Flow, Storage & Processing (same pipeline). BANDWIDTH: 7 PB/s; > 300 GB/s; typical survey: 5 days read time @ 10 GB/s
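The "5 days read time @ 10 GB/sec" figure is a plain volume-over-bandwidth calculation. The sketch below works it both ways; the function names and the decimal unit definitions (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) are assumptions made for illustration, and the 800 PB figure is the one quoted on the storage slide.

```python
SECONDS_PER_DAY = 86_400
GB = 10**9    # decimal gigabyte (assumed)
PB = 10**15   # decimal petabyte (assumed)


def read_time_days(volume_bytes, bandwidth_bytes_per_s):
    """Days needed to read a volume sequentially at a sustained bandwidth."""
    return volume_bytes / bandwidth_bytes_per_s / SECONDS_PER_DAY


def volume_read_in(days, bandwidth_bytes_per_s):
    """Bytes that can be read in the given number of days."""
    return days * SECONDS_PER_DAY * bandwidth_bytes_per_s


# Volume implied by "5 days read time @ 10 GB/sec".
survey_bytes = volume_read_in(5, 10 * GB)

# Time to read the 800 PB quoted on the storage slide at the same rate.
days_for_800_pb = read_time_days(800 * PB, 10 * GB)
```

At 10 GB/s, five days of reading corresponds to about 4.3 PB, while a single pass over 800 PB would take more than 900 days. That is the practical reason full data dumps are ruled out later in the deck.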
  • 17-19. Massive Data Flow, Storage & Processing: Correlation. [Figures: bar charts of bandwidth in TB/s for ALMA, LOFAR, and ASKAP, on axes running from 0 to 40 and from 0 to 70.]
  • 20-23. Massive Data Flow, Storage & Processing: Correlation. [Figures: bar charts of processing in TFlops/s for VLA, ALMA, LOFAR, and ASKAP, on axes running from 0 to 0.002, from 0 to 120, and from 0 to 350.]
  • 25. CERN/IT/DB online system: multi-level trigger to filter out background and reduce data volume from 40 TB/s to 100 MB/s. 40 MHz (40 TB/sec) → level 1, special hardware → 75 KHz (75 GB/sec) → level 2, embedded processors → 5 KHz (5 GB/sec) → level 3, PCs → 100 Hz (100 MB/sec) → data recording & offline analysis
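As far as the slide can be read, the trigger chain takes the data stream from 40 TB/s at the detector to 75 GB/s after level 1, 5 GB/s after level 2, and 100 MB/s after level 3. The sketch below only computes the reduction factor contributed by each level. The stage labels and the helper names are assumptions made for illustration, and nothing here models how a real trigger selects events.

```python
# Output rate of each stage in bytes per second, as read from the slide
# (decimal units assumed).
TRIGGER_STAGES = [
    ("detector (40 MHz)", 40 * 10**12),
    ("level 1, special hardware (75 kHz)", 75 * 10**9),
    ("level 2, embedded processors (5 kHz)", 5 * 10**9),
    ("level 3, PCs (100 Hz)", 100 * 10**6),
]


def reduction_factors(stages):
    """Rate reduction achieved by each stage relative to the previous one."""
    return [
        (name, previous_rate / current_rate)
        for (_, previous_rate), (name, current_rate) in zip(stages, stages[1:])
    ]


def total_reduction(stages):
    """Overall reduction from the first stage to the last."""
    return stages[0][1] / stages[-1][1]


factors = reduction_factors(TRIGGER_STAGES)
overall = total_reduction(TRIGGER_STAGES)
```

The overall factor is 400,000, with most of the reduction happening in the first, hardware-based level.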
  • 26. CERN/IT/DB Event Filter & Reconstruction (figures are for one experiment). Data from detector into event builder switch; input: 5-100 GB/sec; computer farm capacity: 50K SI95 (~4K 1999 PCs); recording rate: 100 MB/sec (Alice: 1 GB/sec); high speed network to tape and disk servers; raw data: 1-1.25 PetaByte/year; summary data: 1-500 TB/year; 20,000 Redwood cartridges every year (+ copy)
  • 28. Dealing with Big Data: We cannot allow for arbitrary queries; we can have arbitrary processing instead. We cannot allow full data dumps; we can generate data on the fly (see above).
  • 29. Queries as functions: QUERY = FUNCTION {DATA}. Queries need to be precomputed; arbitrary queries are only possible on the precomputed, smaller data sets.
  • 30. Queries as functions: QUERY = FUNCTION {ALL DATA}. Queries need to be precomputed; arbitrary queries are only possible on the precomputed, smaller data sets.
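A minimal sketch of the "query = function(all data)" idea follows. The toy detection records, the view layout, and every function name are hypothetical, chosen only to show the shape of the approach: the expensive function runs once over the whole dataset, and interactive queries touch only the small precomputed result.

```python
from collections import Counter

# Hypothetical master dataset: one (object_id, band) tuple per detection.
ALL_DATA = [
    ("obj1", "g"),
    ("obj2", "r"),
    ("obj1", "r"),
    ("obj3", "g"),
    ("obj1", "i"),
]


def detections_per_object(all_data):
    """The query expressed as a pure function of *all* the data."""
    return Counter(object_id for object_id, _band in all_data)


def precompute_view(all_data):
    """Run the expensive function once and keep the small result."""
    return dict(detections_per_object(all_data))


def lookup(view, object_id):
    """Cheap query that only touches the precomputed view."""
    return view.get(object_id, 0)


VIEW = precompute_view(ALL_DATA)
```

Here `lookup(VIEW, "obj1")` never scans `ALL_DATA`. Arbitrary questions remain possible, but only against `VIEW`, which is far smaller than the master dataset.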
  • 31. Lambda Architecture. Speed Layer: fast, incremental algorithms; serves queries not yet on the batch layer; compensates for latency. Serving Layer: random access to views; updated by the batch layer. Batch Layer: stores the master dataset; computes arbitrary views.
  • 32. Batch Layer: stores the master copy of the dataset (immutable, constantly growing); precomputes batch views on that master dataset.
  • 33. Batch Layer: new data is appended to All Data; the batch layer (typically Map/Reduce) produces the updated views View 1, View 2, …, View n.
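The batch layer's job, recomputing views from an append-only master dataset with Map/Reduce, can be sketched in a few lines. This is a sequential, in-process illustration rather than a distributed implementation. The record layout (source identifier, flux) and all function names are assumptions made here.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, value) pairs for one master-dataset record."""
    source_id, flux = record
    yield source_id, (flux, 1)


def reduce_mean(values):
    """Reduce step: combine all (flux, count) pairs of one key into a mean."""
    total = sum(flux for flux, _count in values)
    count = sum(count for _flux, count in values)
    return total / count


def run_batch(master_dataset, mapper, reducer):
    """Recompute a view from scratch over the whole master dataset."""
    groups = defaultdict(list)
    for record in master_dataset:          # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    return {key: reducer(values) for key, values in groups.items()}


# Hypothetical master dataset: records are only ever appended.
master = [("a", 1.0), ("b", 4.0), ("a", 3.0)]
view_before = run_batch(master, map_record, reduce_mean)

master.append(("b", 2.0))                  # new data arrives
view_after = run_batch(master, map_record, reduce_mean)
```

Recomputing from scratch is deliberately simple: because the master data is never modified, a bug in `reduce_mean` can be fixed and the view regenerated without losing anything.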
  • 34. Serving Layer. Allows for: batch writes of view updates; random reads on the views. Does not allow random writes.
  • 35. Speed Layer. Allows for: incremental writes of view updates; short-term temporal queries on the views. Can be discarded!
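How the serving and speed layers combine at query time can be sketched as follows. The class and method names are hypothetical, and the views are simple per-key counters chosen for illustration. The serving layer accepts only whole-view batch writes plus random reads, while the speed layer takes incremental writes and can be thrown away once the batch layer has caught up.

```python
class ServingLayer:
    """Holds the latest batch view: batch writes and random reads only."""

    def __init__(self):
        self._view = {}

    def load_batch_view(self, view):
        # The whole view is swapped in at once; there are no random writes.
        self._view = dict(view)

    def read(self, key):
        return self._view.get(key, 0)


class SpeedLayer:
    """Holds counts for data that the batch layer has not processed yet."""

    def __init__(self):
        self._recent = {}

    def ingest(self, key, count=1):
        # Incremental write of a view update.
        self._recent[key] = self._recent.get(key, 0) + count

    def read(self, key):
        return self._recent.get(key, 0)

    def discard(self):
        # Safe once a newer batch view covers the same data.
        self._recent.clear()


def query(serving, speed, key):
    """Answer a query by merging the batch view with the recent increments."""
    return serving.read(key) + speed.read(key)
```

A typical cycle would load a batch view, ingest new events into the speed layer, answer queries with `query(...)`, then load the next batch view and call `discard()`.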
  • 36. Figure 2.1: The master dataset in the Lambda Architecture serves as the source of truth of your Big Data system. Errors at the serving and speed layers can be
  • 37. Computing over Big Data: Batch layer as a computational engine on data. Need to formally specify: inputs, processes, outputs. THAT LOOKS LIKE A WORKFLOW! OR SQL QUERYING
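One way to make "formally specify inputs, processes, and outputs" concrete is a small declarative step description plus a runner that checks it. This is a hypothetical sketch: the `Step` type, the `run_workflow` helper, and the example steps are inventions for illustration and do not correspond to any particular workflow system.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Step:
    """One workflow step with explicitly named inputs and outputs."""

    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    process: Callable[..., tuple]


def run_workflow(steps, initial_data):
    """Run steps in order, checking each declared input and output."""
    store = dict(initial_data)
    for step in steps:
        missing = [name for name in step.inputs if name not in store]
        if missing:
            raise KeyError(f"step {step.name!r} is missing inputs: {missing}")
        results = step.process(*(store[name] for name in step.inputs))
        if len(results) != len(step.outputs):
            raise ValueError(f"step {step.name!r} returned the wrong number of outputs")
        store.update(zip(step.outputs, results))
    return store


# Hypothetical two-step workflow: subtract a bias, then average.
STEPS = [
    Step(
        name="calibrate",
        inputs=("raw", "bias"),
        outputs=("calibrated",),
        process=lambda raw, bias: ([value - bias for value in raw],),
    ),
    Step(
        name="summarize",
        inputs=("calibrated",),
        outputs=("mean",),
        process=lambda calibrated: (sum(calibrated) / len(calibrated),),
    ),
]
```

Because every step names what it consumes and produces, a data centre could in principle inspect such a description before agreeing to run it next to the data, instead of accepting arbitrary code.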
  • 41. [Figure: dependence of execution time on the number of pool processors. Vertical axis: seconds per million elements (0.4 to 0.8); horizontal axis: number of pool processors (1 to 8); one curve each for 1, 5, 10, and 20 million elements.]
  • 42. Conclusions: Big data needs different approaches; parallelism & data-side processing; Map/Reduce as parallelism engine; need for ways to formally specify computations.
  • 43. References & Links: “The Fourth Paradigm: Data-Intensive Scientific Discovery”, Jim Gray, Microsoft Research; “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat, Google; MyExperiment