Empowering Transformational Science
Chelle Gentemann (Farallon Institute) twitter: @ChelleGentemann
Ryan Abernathey (Columbia / LDEO) twitter: @rabernat
Aimee Barciauskas (Development Seed) twitter: @_aimeeb
(there are lots of links in this presentation! click away!)
SWOT
NISAR
NASA Physical Oceanography Program
Communities build open science.
Open science is more efficient.
Efficient science leads to transformational results.
What impacts the velocity of science?
Data, Software, & Compute
Data: time to find, access, clean, & format data for analysis
Software: what tools are easily available?
Compute: access to compute == speed of results
80% Data Preparation (download, clean, & organize files)
10% Batch Processing
10% Think about science
Traditional methods of data access
cannot leverage large volumes of data
https://earthdata.nasa.gov/eosdis/cloud-evolution
Data, Software, Compute
Analytics Optimized Data Store (AODS)
a few examples of
AODS formats
Current method -
NetCDF files - organized into ‘reasonable’ data sizes per file, usually by orbit, granule, or
day. The filename carries information about date, sensor, and version. Reading usually involves
calculating the filename, then opening, reading, processing, and closing each file.
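The "calculating the filename" step can be sketched as follows. This is only an illustration: the sensor name, version tag, and filename pattern are made up, and each real archive defines its own convention.

```python
from datetime import date, timedelta

def daily_filenames(start, end, sensor="sensorA", version="v1"):
    """Build one hypothetical filename per day from start to end, inclusive."""
    names = []
    day = start
    while day <= end:
        # The name pattern is invented for this sketch.
        names.append(f"{sensor}_{day:%Y%m%d}_{version}.nc")
        day += timedelta(days=1)
    return names

files = daily_filenames(date(2020, 1, 30), date(2020, 2, 2))
for name in files:
    # In a real workflow each file would be opened, read, processed, and closed here.
    print(name)
```

Every new dataset needs its own variant of this bookkeeping, which is part of the data-preparation cost described earlier.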
Analytics Optimized Data Store (one example of many different formats)
Zarr - makes large datasets easily accessible to distributed computing. The data is
stored in directories, each holding chunks laid out along the dataset dimensions.
Zarr libraries read the metadata so that only the chunks needed to complete a
subsetting request are read.
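A minimal sketch of how chunk metadata lets a reader touch only the chunks a subset needs. It is plain Python rather than the zarr library itself; the array shape, chunk shape, and function names are assumptions made for illustration. The dot-separated keys follow the naming used by the Zarr version 2 layout.

```python
from itertools import product

def chunk_range(start, stop, chunk_size):
    """Indices of the chunks along one dimension that overlap [start, stop)."""
    if stop <= start:
        return range(0)
    return range(start // chunk_size, (stop - 1) // chunk_size + 1)

def chunks_for_subset(selection, chunk_shape):
    """Chunk keys needed for a subset given as (start, stop) per dimension."""
    per_dim = [chunk_range(a, b, c) for (a, b), c in zip(selection, chunk_shape)]
    # One key per combination of chunk indices, e.g. "0.1.0".
    return [".".join(str(i) for i in idx) for idx in product(*per_dim)]

# Hypothetical (time, lat, lon) array stored in chunks of (100, 90, 90).
needed = chunks_for_subset([(0, 50), (80, 100), (0, 90)], (100, 90, 90))
print(needed)
```

Only the listed chunks would have to be fetched; every other chunk in the store stays untouched.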
Technology advances -
Lazy loading - defers initialization of an object until the point at which it is needed. The
pattern is widely used for webpages, where it is sometimes called asynchronous loading. Here it
delays reading data until it is needed for compute.
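The idea of lazy loading can be shown in a few lines of plain Python. This is a toy sketch, not how Xarray implements it; the class and the stand-in loader are hypothetical.

```python
class LazyVariable:
    """Holds a recipe for loading data and runs it only on first access."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.loaded = False

    @property
    def values(self):
        # The expensive read happens here, once, and only if someone asks.
        if not self.loaded:
            self._data = self._loader()
            self.loaded = True
        return self._data

def read_from_disk():
    # Stand-in for an expensive read from files or object storage.
    return [0.1, 0.2, 0.3]

sst = LazyVariable(read_from_disk)
print(sst.loaded)        # False: creating the object read nothing
total = sum(sst.values)  # first access triggers the read
print(sst.loaded)        # True
```

Opening a dataset therefore costs almost nothing; the cost is paid only for the pieces a computation actually uses.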
Advanced OSS libraries:
Xarray - library for analyzing multi-dimensional arrays, with support for lazy loading.
Dask - breaks large computational problems into a network of smaller problems for
distribution across multiple processors
Intake - lightweight set of tools for loading and sharing data in data science projects
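To make the "network of smaller problems" idea concrete, here is a sketch that splits a mean into per-chunk partial sums and combines them. It uses only the Python standard library and is not Dask's API; the function names and chunk size are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Small problem: sum and count for one chunk."""
    return sum(chunk), len(chunk)

def chunked_mean(values, chunk_size, workers=4):
    """Mean of a non-empty sequence, computed from independent chunks."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine step: the only part that needs all the partial results.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

data = list(range(1, 101))
print(chunked_mean(data, chunk_size=10))  # 50.5
```

Because each chunk is processed independently, the same structure can be spread over many processors or machines, which is the scheduling work a library like Dask takes on.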
What does a data store look like?
NetCDF: organized so that each file can fit into RAM, usually by day, orbit, or granule
Zarr: organization and format invisible to user, data accessed by metadata
Time to access data?
https://nbviewer.jupyter.org/github/cgentemann/Biophysical/blob/master/Test_AVISO_zarr_simple_version.ipynb
Modern software tools use lazy loading
to access large datasets
Accessing netCDF data: 11 minutes (depends on computer)
1 - user creates list of filenames
2 - access dataset by reading the metadata distributed through files
Accessing Zarr data: 0.1 seconds (metadata consolidated)
1 - access dataset by reading the consolidated metadata
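Why consolidated metadata opens faster can be sketched by counting file reads. The store layout and file names below are invented for the example (Zarr's own file names differ), and the timings quoted on this slide come from the linked notebook, not from this sketch.

```python
import json
import tempfile
from pathlib import Path

def write_store(root, n_arrays):
    """Write one small metadata file per array plus one consolidated copy."""
    combined = {}
    for i in range(n_arrays):
        meta = {"name": f"var{i}", "shape": [10, 10]}
        array_dir = root / f"var{i}"
        array_dir.mkdir()
        (array_dir / "meta.json").write_text(json.dumps(meta))
        combined[f"var{i}"] = meta
    (root / "consolidated.json").write_text(json.dumps(combined))

def read_distributed(root):
    """Open every per-array metadata file; returns (metadata, files opened)."""
    paths = sorted(root.glob("var*/meta.json"))
    meta = {p.parent.name: json.loads(p.read_text()) for p in paths}
    return meta, len(paths)

def read_consolidated(root):
    """Open the single consolidated file; returns (metadata, files opened)."""
    return json.loads((root / "consolidated.json").read_text()), 1

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_store(root, n_arrays=50)
    slow, slow_opens = read_distributed(root)
    fast, fast_opens = read_consolidated(root)

print(slow_opens, fast_opens, slow == fast)  # 50 1 True
```

The two paths return the same metadata, but the consolidated one needs a single read. On remote storage, where every read is a network request, that difference dominates the time to open a dataset.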
Calculate mean over region
NetCDF - 12 minutes
Zarr - 4 seconds
My version of lazy loading before I knew python - on bedrest, pregnant with twins
STOP ------------- THIS IS DIFFERENT ------------------
1 line of code to access a 28-year, global, 25km dataset
1 line of code to select a region, calculate mean, & plot time series
in LESS than 1 minute
Data, Software, Compute
Inspiration: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
SciPy
Data, Software, Compute
Analytics Optimized Data Store (AODS)
Data Provider’s $ Data Consumer’s $
Scalable Parallel Computing Frameworks
Agency driven solutions
Grass-Roots Solutions
Pangeo Architecture
Jupyter for interactive data analysis on remote systems
Cloud / HPC
Xarray provides data structures and an intuitive interface for interacting with datasets
Parallel computing system allows users to deploy clusters of compute nodes for data processing.
Dask tells the nodes what to do.
Distributed storage
“Analytics Optimized Data Stores” stored on globally-available distributed storage.
@pangeo_data
How can data providers reduce barriers?
Reimagine how cloud data access and tools can enable transformational science
Publish cloud-optimized data
Interactive tutorials
Contribute to OSS tools
Increase user interactions/feedback
How does minimizing barriers to data change science?
Levels the playing field for all who want to contribute
Impacts: Reduce Time to Science
Traditional Project Timeline
80% Data Preparation (download, clean, & organize files)
10% Batch Processing
10% Think about science
Cloud-based Project Timeline
5% Load AODS
5% Parallel Processing
90% Think about science
Impacts: Reproducibility
Traditional Project Code
# step 1: open data (stored on local hard drive)
>>> data = open_data("/path/to/private/files")
Error: files not found
Cloud-based Project Code
# step 1: open data (globally accessible)
>>> data = open_data("http://catalog.pangeo.io/path/to/dataset")
# step 2: process data
>>> process(data)
Reproducibility in data-driven science requires more than just code!
Thank you!
Open source science
What impacts the velocity of progress?
Data, Software, & Compute
