Dan Crichton
February 2018
Jet Propulsion Laboratory
California Institute of Technology
An introduction to systems and software architecture
considerations for scaling data analysis
Introduction
• Statisticians often work in very high-level languages
• Complexities of hardware and software are hidden through “abstractions”
• Simplify programming and increase efficiency
Hardware (CPU, IO, etc)
Machine Code
Assembly Language
C, C++, etc
Python, R, etc
System Architecture: what is it?
• The fundamental organization of a system embodied in
its components, their relationships to each other, and
to the environment, and the principles guiding its
design and evolution. (ANSI/IEEE Std. 1471-2000)
[Figure: My Application, Network, Remote Archive, Observational Data]
Software System Levels of Abstraction
• Applications – Scientific software applications, visualizations, etc
• Middleware – Common software for data management, data
processing, analysis
• Compute Services – Storage, Databases, Computation, etc
(e.g., cloud computing)
• Networks – Communication (see OSI 7 layer model)
Credit: IBM
• Applications
• Middleware
• Compute Services
• Networks
The details of network communications
are highly abstracted
[Figure: My Application, Remote Archive, Observational Data]
Complexities can be highly abstracted
out (mission view)
[Figure: mission view. Labeled elements: Spacecraft and Scientific Instruments; Spacecraft / lander; Relay Satellite; Data Acquisition and Command; Mission Operations; Instrument/Sensor Operations; Science Data Processing; Science Data Archive; Data Analysis and Modeling; Science Team; External Science Community. Labeled information exchanged: Primitive Information Object, Simple Information Object, Telemetry Information Package, Science Information Package, Instrument Planning Information Object, Planning Information Object, Science Products (Information Objects).]
• Common Meta Models for Describing
Space Information Objects
• Common Data Dictionary end-to-end
• Flight Software
• Ground Data Systems
• Science Production/Processing
• Science Analysis
Flight Software (Highly Specialized)
Ground Data Systems
Science Processing
Science Analysis
At NASA, software abstractions exist across the
lifecycle
• Applications
• Middleware
• Compute Services
• Networks
Component deployment depends on fitting the principles (scalability, usability, accessibility,
etc.) and the environment (highly distributed on ground, onboard, etc.)
Analysis subject to distributed environment
[Figure: Data Capture to Data Analysis. Labeled elements: Instrument Data Systems; Airborne Data; NASA Data Archives; Other Data Systems (e.g., NOAA); Other Data Systems (In-Situ, Other Agency, etc.); Comm Network; Big Data Infrastructure (Data, Algorithms, Machines); Data Analysis (Water, Ocean, CO2, Extreme Events, Mars, etc.)]
Reducing Data Wrangling: “There is a major need for the development of software components…
that link high-level data analysis-specifications with low-level distributed systems architectures.”
Frontiers in the Analysis of Massive Data, National Research Council, 2013.
Distributed Computing: Software as
“components” and “connections”
Think of each
application as a
component that
is plugged into a
network for
communication
Institute for Software Research, UC Irvine
[Figure: an R or Python Application (Server 1) and an Archive (Server 2) connected by a Network, shown alongside My Application, Remote Archive, and Observational Data]
User
Server A: $x_1, x_2, \ldots, x_N$
Server B: $y_1, y_2, \ldots, y_N$
Simple data system:
• Servers hold data.
• User node connected by manual webpage interface.
• Middleware at each node searches for, extracts, and transmits data from servers to user.
• Search and extract by lat/lon bounding box, time, variable, etc.
• Some simple “analytics” available through web interface (e.g., simple averages, maps, time series).
Server compute capacity, (dedicated) transmission links and
their capacities, and deployment of data to servers, all fixed at
inception subject to infrastructure construction costs. Designed
exclusively for access rather than analysis.
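A minimal sketch of the kind of search-and-extract middleware described above, assuming records are held in memory as simple objects. The `Observation` type, the `search` and `simple_average` functions, and the sample records are all hypothetical and only illustrate bounding-box, time, and variable filtering followed by one "simple analytic".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    lat: float
    lon: float
    time: int        # hypothetical unit, e.g. a day index
    variable: str
    value: float


def search(records, lat_range, lon_range, time_range, variable):
    """Return records for one variable inside a lat/lon box and time window."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    t_min, t_max = time_range
    return [
        r for r in records
        if r.variable == variable
        and lat_min <= r.lat <= lat_max
        and lon_min <= r.lon <= lon_max
        and t_min <= r.time <= t_max
    ]


def simple_average(records):
    """One of the 'simple analytics': the mean of the extracted values."""
    if not records:
        raise ValueError("no records matched the query")
    return sum(r.value for r in records) / len(records)


# Made-up records standing in for data held on one server.
archive = [
    Observation(34.2, -118.1, 1, "co2", 405.0),
    Observation(34.5, -117.9, 2, "co2", 407.0),
    Observation(60.0, 10.0, 1, "co2", 401.0),
    Observation(34.3, -118.0, 1, "sst", 290.0),
]

subset = search(archive, (30.0, 40.0), (-120.0, -115.0), (1, 2), "co2")
print(len(subset), simple_average(subset))
```

The server does all filtering before anything is transmitted, which matches a system designed for access: only the matching subset (or its average) leaves the node.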
Simple analysis system for estimating correlation:
User
Server A: $x_1, x_2, \ldots, x_N$
Server B: $y_1, y_2, \ldots, y_N$
Bootstrap estimate of $\mathrm{Corr}(x, y)$:
For $b = 1, 2, \ldots, B$:
1. User node samples $n$ indices from $1, 2, \ldots, n$ with replacement. Call these $(i_{1b}, i_{2b}, \ldots, i_{nb})$.
2. Transmit indices $(i_{1b}, i_{2b}, \ldots, i_{nb})$ to Servers A and B.
3. Server A extracts $\mathbf{x}_b = (x_{i_{1b}}, x_{i_{2b}}, \ldots, x_{i_{nb}})$ and transmits to User.
4. Server B extracts $\mathbf{y}_b = (y_{i_{1b}}, y_{i_{2b}}, \ldots, y_{i_{nb}})$ and transmits to User.
5. User computes $r_b = \mathrm{Corr}(\mathbf{x}_b, \mathbf{y}_b)$.
6. $\bar{r} = \mathrm{mean}(r_1, r_2, \ldots, r_B)$ is a point estimate of $\mathrm{Corr}(x, y)$.
7. $\hat{\sigma}_r^2 = \mathrm{var}(r_1, r_2, \ldots, r_B)$ is an estimate of the variance of $\mathrm{Corr}(x, y)$.
Two-way command-line communication between servers
and user, specialized remote computation (e.g., random
sampling) dictated by type of analysis, direct
communication between servers (if needed). Constructed
subject to infrastructure cost constraints.
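The bootstrap steps above can be sketched in a single process, with two in-memory objects standing in for Servers A and B. This is an illustration under stated assumptions, not the system from the talk: the `Server` class, the `pearson` helper, and the synthetic data are hypothetical, network transmission is replaced by method calls, and indices are assumed to be drawn from all records held by the servers.

```python
import random
import statistics


class Server:
    """A data-holding node: stores one variable and returns values by index."""

    def __init__(self, data):
        self.data = list(data)

    def extract(self, indices):
        # Steps 3 and 4: pull the requested records for transmission to the user.
        return [self.data[i] for i in indices]


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5  # undefined if a resample is constant


def bootstrap_corr(server_a, server_b, n, B, rng):
    """Return (mean, variance) of B bootstrap correlation replicates."""
    N = len(server_a.data)  # assumption: indices are drawn from all N records
    replicates = []
    for _ in range(B):
        idx = [rng.randrange(N) for _ in range(n)]  # step 1: sample with replacement
        xb = server_a.extract(idx)                  # steps 2 and 3
        yb = server_b.extract(idx)                  # steps 2 and 4
        replicates.append(pearson(xb, yb))          # step 5
    # Steps 6 and 7: point estimate and variance estimate.
    return statistics.fmean(replicates), statistics.variance(replicates)


# Synthetic, hypothetical data: y is x plus independent noise.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [xi + rng.gauss(0.0, 0.5) for xi in x]

r_bar, var_r = bootstrap_corr(Server(x), Server(y), n=200, B=200, rng=rng)
print(f"point estimate: {r_bar:.3f}, bootstrap variance: {var_r:.6f}")
```

In a real deployment each `extract` call would be a network round trip, which is where the transmission cost discussed in the talk enters.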
How to choreograph this analysis to minimize uncertainty?
Maximize $B$, set $n = N$, but what about the costs of
computation and transmission?
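A rough way to see the transmission side of this tradeoff is to count what the bootstrap scheme moves per replicate: the same n indices go to two servers, and each server returns n values. The sketch below is a back-of-the-envelope model only; the byte sizes, the single shared link, and the function names are assumptions made for illustration.

```python
def transmission_bytes(n, B, bytes_per_index=8, bytes_per_value=8):
    """Total bytes moved by B bootstrap replicates of size n."""
    indices_out = 2 * n * bytes_per_index   # index list sent to Servers A and B
    values_back = 2 * n * bytes_per_value   # each server returns n values
    return B * (indices_out + values_back)


def transfer_seconds(total_bytes, link_mbps):
    """Time to move total_bytes over one link, ignoring latency and overhead."""
    return total_bytes * 8 / (link_mbps * 1_000_000)


# Hypothetical (n, B) choices on an assumed 100 Mbit/s link.
for n, B in [(1_000, 100), (1_000, 1_000), (1_000_000, 1_000)]:
    total = transmission_bytes(n, B)
    print(n, B, total, round(transfer_seconds(total, link_mbps=100.0), 3))
```

Under these assumptions the traffic grows in proportion to n times B, so pushing both toward their maxima to reduce uncertainty raises the transmission cost at the same rate.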
Example: Carbon Cycle Model-Observation Comparison
Goal:
Understand the processes that control flux of CO2 between the ocean/land and the atmosphere.
Strategy:
Experiment with flux and transport models to make modeled concentrations agree with observations of concentrations, or produce fluxes that agree with inferred fluxes.
[Figure: model-observation comparison over time steps t = 1, 2, …, T. Labeled elements: true fluxes s(u1, u2, t) (unknown); flux model and fluxes s̃(x1, x2, t); transport model; CO2 concentrations x̃(u1, u2, u3, t); true CO2 concentrations x(u1, u2, u3, t) (unknown); radiances y(u1, u2, t); retrieval algorithm, ground-based retrieval, and in situ sampling; heterogeneous, incomplete observations x̂(u1, u2, t), x̂(u1, u2, u3, t); inference (inverse problem) and inferred fluxes ŝ(u1, u2, t); two "Compare" steps.]
Issues:
1) Flux and transport models are run in different physical locations; output must be moved.
2) Observational data are heterogeneous (different footprints, measurement errors, etc.) and
must be reconciled to one another and to flux and transport model resolutions.
3) Observational data are stored in different physical locations; must be moved.
4) "Compare" = hypothesis testing; requires formal uncertainty (probabilistic) estimates
on observations.
5) "Inference" = data assimilation or direct Bayesian inference; also requires
uncertainties on model output in addition to observational uncertainties.
Data science challenge:
Address 1) - 5) to make inferences (hypothesis tests) with minimum uncertainty
and minimum movement of data. A dynamic optimization problem.
Gunson, Braverman, Bowman, Cressie
Science Goal:
- Understand processes that control CO2 flux
Strategy:
- Experiment with models to increase
agreement with observations / inferred fluxes
Analysis Challenges:
- Models and data reside at different locations
- Data are heterogeneous and must be reconciled as to
format, scope, fidelity, resolution, etc.
- Meaningful comparison requires uncertainty estimation
on both observational data and model output
Architecture Evaluation:
- Address the analysis challenges to minimize
both uncertainty and data movement.
A dynamic optimization problem.
Research Challenges
Research the relationship between architectural topology and scientific
data analysis efficiency to explore new architectural techniques for
scaling science-driven data analytics across distributed environments.
1) for a fixed system architecture, how can one optimize the movement
of data and algorithms and estimate the costs?
2) which system architectures yield the greatest efficiencies for the
types of scientific analyses we wish to support?
3) how can existing and new computational methods be designed to
better exploit the distributed architecture and increase scientific return?
How does the intersection between system topology and analysis
methodology affect the uncertainty?
The Need for Architectural Tradeoffs
Data Science Architectural Tradeoffs
• Methodology Decisions
• Software and Hardware Decisions
• Topology Decisions
– Distribution (data, computation)
– Data Accessibility
– Network Capacity
– Computational Capacity
– Analysis Choreography/Workflow
• Use Cases, Scenarios
Data Science Analytics Framework
• Data Management Capabilities
• Storage (e.g., Cloud)
• Visualization
• Algorithms/Data Processing
• Server Resources (Cloud, HPC, etc.)
• Data Movement Technologies
Data Collections
• Data Products/Objects/Files
• Metadata Products
• Data Formats, Data Size
Methods
• Data Reduction
• Feature Extraction
• Classification
• Detection
• Fusion
Output
• Uncertainty, performance, cost, and computing stack based on a set of capacities
Data, Big Data Analytics Capability
Nodes and Edges
• Consider a node to run a set
of components (applications,
middleware, services)
• Consider nodes to be
connected via edges
(networks)
– This follows a component-
connector architectural style
(C2)
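One way to picture the node-and-edge view is as a small graph structure in which each node lists the components it runs and each edge carries a network capacity. The sketch below is illustrative only; the `Node`, `Edge`, and `Topology` names, the bandwidth attribute, and the example nodes are assumptions, not part of any tool named in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    components: list = field(default_factory=list)  # applications, middleware, services


@dataclass(frozen=True)
class Edge:
    a: str
    b: str
    bandwidth_mbps: float  # assumed edge attribute: network capacity


class Topology:
    """Nodes connected by undirected network edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, name, components=()):
        self.nodes[name] = Node(name, list(components))

    def connect(self, a, b, bandwidth_mbps):
        for name in (a, b):
            if name not in self.nodes:
                raise KeyError(f"unknown node: {name}")
        self.edges.append(Edge(a, b, bandwidth_mbps))

    def neighbors(self, name):
        """Names of nodes that share an edge with the given node, sorted."""
        out = []
        for e in self.edges:
            if e.a == name:
                out.append(e.b)
            elif e.b == name:
                out.append(e.a)
        return sorted(out)


topo = Topology()
topo.add_node("user", ["R or Python application"])
topo.add_node("server_a", ["archive", "search middleware"])
topo.add_node("server_b", ["archive", "search middleware"])
topo.connect("user", "server_a", bandwidth_mbps=100.0)
topo.connect("user", "server_b", bandwidth_mbps=10.0)
print(topo.neighbors("user"))
```

Keeping components on nodes and capacities on edges separates what runs where from how it communicates, which is the distinction the component-connector style draws.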
DAWN: Modeling Software Architectures for
Scalability
DAWN (“Distributed Analytics, Workflows and Numerics”) is a model for
simulation, analysis and optimization of data science architectures
(performance, cost, and uncertainty (planned)).
Luca Cinquini, Kyo Lee, Amy Braverman, Mike Turmon, Dan Crichton; Collaboration
with David Garlan/CMU
Applications
• Data Analytic Architectures
• Science Data Processing
• Cloud Architectures
• etc
Applications of DAWN
• Can be run multiple times by changing the system parameters (number of
nodes, cores, network speed, …) to identify the resources needed to
achieve a given processing goal
• Can find bottlenecks in workflow execution to identify computations that
need to be optimized or parallelized
• Can analyze how efficiently CPUs are utilized, to minimize monetary cost
• Can compare different possible architectures (centralized, distributed,
parallel, …) to maximize efficiency
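The idea of rerunning a model while varying system parameters can be illustrated with a toy sweep. This is not DAWN and does not use its interface; the runtime formula (transfer time plus perfectly parallel compute time), the parameter values, and the function names are all assumptions chosen for the example.

```python
from itertools import product


def estimated_runtime_s(data_gb, work_core_s, cores, network_gbps):
    """Toy runtime model: move the data, then compute with ideal parallel scaling."""
    transfer_s = data_gb * 8 / network_gbps   # gigabytes to gigabits, then divide by rate
    compute_s = work_core_s / cores           # assumes no parallel overhead
    return transfer_s + compute_s


def sweep(data_gb, work_core_s, core_options, network_options, deadline_s):
    """Return (cores, network_gbps, runtime_s) for every setting that meets the deadline."""
    feasible = []
    for cores, gbps in product(core_options, network_options):
        t = estimated_runtime_s(data_gb, work_core_s, cores, gbps)
        if t <= deadline_s:
            feasible.append((cores, gbps, t))
    return feasible


# Hypothetical workload: 100 GB of input, 3600 core-seconds of work, 10-minute goal.
feasible = sweep(
    data_gb=100,
    work_core_s=3600,
    core_options=[1, 4, 16],
    network_options=[1, 10],
    deadline_s=600,
)
print(feasible)
```

Comparing the transfer and compute terms for a given setting also indicates which one is the bottleneck, for example the 1 Gbit/s link alone takes 800 seconds here, so no core count can meet the goal on that network.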
Joint Research Opportunities between
statistics and software architecture
• Future vision would be an environment where the underlying distributed
architecture provides sufficient abstraction to allow for dynamic changes
without breaking statistical analysis.
• Given this vision, and the way architectural and statistical choices for
analyzing distributed data affect scalability and uncertainty:
– Statisticians can set up the use cases, analysis, and methodology to
address uncertainty
– Software architects can set up the simulation, including the distributed
environment
– Joint research can be conducted to identify optimal architectural
approaches to analysis within these highly distributed and massive data
environments.
Summary
• Understand the “decomposition” of a node into components
– Separate to understand appropriate levels of abstraction
– Generate an architectural representation
– Determine granularity for assessing “cost” (performance, uncertainty,
etc.)
• Understand the “edges” of nodes that “connect” components
• Choreography is highly dependent on the data and topology
• Ultimately use a simulation tool like DAWN to break
chicken-and-egg questions
Acknowledgements
• Virtual Information Fabric Infrastructure (VIFI) distributed
analytics framework. NSF DIBBS. (Bill Tolone, UNC; G.
Djorgovski, Caltech; L. Cinquini, K. Lee, JPL)
• JPL Data Science Initiative
• Joint Initiative for Data Science and Technology at
Caltech and JPL
• NASA AIST and Mike Little

 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 

Recently uploaded

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 

Recently uploaded (20)

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 

CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Software Architecture Considerations for Scaling Data Analysis - Dan Crichton, Feb 12, 2018

  • 1. Dan Crichton February 2018 Jet Propulsion Laboratory California Institute of Technology An introduction to systems and software architecture considerations for scaling data analysis
  • 2. Introduction • Statisticians often work in very high-level languages • Complexities of hardware and software are hidden through “abstractions” • Simplify programming and increase efficiency [Diagram, abstraction layers listed from lowest to highest: Hardware (CPU, IO, etc); Machine Code; Assembly Language; C, C++, etc; Python, R, etc]
  • 3. System Architecture: what is it? • The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000) [Diagram labels: My Application; Network; Remote Archive; Observational Data]
  • 4. Software System Levels of Abstraction • Applications – Scientific software applications, visualizations, etc • Middleware – Common software for data management, data processing, analysis • Compute Services – Storage, Databases, Computation, etc (e.g., cloud computing) • Networks – Communication (see OSI 7 layer model)
  • 5. Credit:IBM • Applications • Middleware • Compute Services • Networks
  • 6. The details of network communications are highly abstracted My Application Remote Archive Observational Data
  • 7. Complexities can be highly abstracted out (mission view) [Diagram labels: Spacecraft and Scientific Instruments; Spacecraft / lander; Relay Satellite; Data Acquisition and Command; Mission Operations; Instrument/Sensor Operations; Science Data Processing; Science Data Archive; Data Analysis and Modeling; Science Team; External Science Community. Information objects exchanged: Primitive Information Object; Simple Information Object; Telemetry Information Package; Science Information Package; Instrument Planning Information Object; Planning Information Object; Science Products - Information Objects] • Common Meta Models for Describing Space Information Objects • Common Data Dictionary end-to-end • Flight Software • Ground Data Systems • Science Production/Processing • Science Analysis
  • 8. At NASA, software abstractions exist across the lifecycle [Diagram, lifecycle stages: Flight Software (Highly Specialized); Ground Data Systems; Science Processing; Science Analysis] • Applications • Middleware • Compute Services • Networks The component deployment depends on fitting the principles (scalability, usability, accessibility, etc.) and the environment (highly distributed on the ground, onboard, etc.)
  • 9. Analysis subject to distributed environment [Diagram labels: Data Capture; Instrument Data Systems; Airborne Data; NASA Data Archives; Other Data Systems (e.g. NOAA); Other Data Systems (In-Situ, Other Agency, etc.); Comm Network; Big Data Infrastructure (Data, Algorithms, Machines); Data Analysis (Water, Ocean, CO2, Extreme Events, Mars, etc.)] Reducing Data Wrangling: “There is a major need for the development of software components… that link high-level data analysis-specifications with low-level distributed systems architectures.” Frontiers in the Analysis of Massive Data, National Research Council, 2013.
  • 10. Distributed Computing: Software as “components” and “connections” Think of each application as a component that is plugged into a network for communication Institute for Software Research, UC Irvine
  • 11. Distributed Computing: Software as “components” and “connections” Think of each application as a component that is plugged into a network for communication Institute for Software Research, UC Irvine Network R or Python Application (Server 1) Archive (Server 2) My Application Remote Archive Observational Data
  • 12. Simple data system [Diagram: User; Server A holding $x_1, x_2, \ldots, x_N$; Server B holding $y_1, y_2, \ldots, y_N$]: • Servers hold data. • User node connected by manual webpage interface. • Middleware at each node searches for, extracts, and transmits data from servers to user. • Search and extract by lat/lon bounding box, time, variable, etc. • Some simple “analytics” available through web interface (e.g., simple averages, maps, time series). Server compute capacity, (dedicated) transmission links and their capacities, and deployment of data to servers are all fixed at inception, subject to infrastructure construction costs. Designed exclusively for access rather than analysis.
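The search-and-extract behaviour described for the middleware can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `Record` fields, the integer `day` time encoding, and the `search` function are illustrative names, not part of any system named in the slides. The box test is a plain range check and does not handle longitude wrap-around at the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    variable: str
    lat: float
    lon: float
    day: int      # hypothetical time encoding: days since some epoch
    value: float


def search(records, variable, lat_range, lon_range, day_range):
    """Return records of one variable inside a lat/lon box and time window."""
    (lat_lo, lat_hi), (lon_lo, lon_hi), (d_lo, d_hi) = lat_range, lon_range, day_range
    return [
        r for r in records
        if r.variable == variable
        and lat_lo <= r.lat <= lat_hi
        and lon_lo <= r.lon <= lon_hi
        and d_lo <= r.day <= d_hi
    ]


# Made-up archive contents, for illustration only.
archive = [
    Record("co2", 34.2, -118.1, 10, 405.1),
    Record("co2", 35.0, -117.5, 12, 406.3),
    Record("co2", 51.5, -0.1, 11, 404.8),
    Record("sst", 34.5, -118.0, 10, 291.2),
]

hits = search(archive, "co2", (30, 40), (-120, -115), (0, 30))
# A "simple average" of the kind the web interface might offer.
mean_value = sum(r.value for r in hits) / len(hits)
print(len(hits), round(mean_value, 2))
```

A system like this returns subsets and simple summaries; it has no notion of an analysis that spans both servers.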
  • 13. Simple analysis system for estimating correlation [Diagram: User; Server A holding $x_1, x_2, \ldots, x_N$; Server B holding $y_1, y_2, \ldots, y_N$]. Bootstrap estimate of $\mathrm{corr}(x, y)$: For $b = 1, 2, \ldots, B$: 1. User node samples $n$ indices from $1, 2, \ldots, n$ with replacement. Call these $(i_{1b}, i_{2b}, \ldots, i_{nb})$. 2. Transmit indices $(i_{1b}, i_{2b}, \ldots, i_{nb})$ to Servers A and B. 3. Server A extracts $X_b = (x_{i_{1b}}, x_{i_{2b}}, \ldots, x_{i_{nb}})$ and transmits to User. 4. Server B extracts $Y_b = (y_{i_{1b}}, y_{i_{2b}}, \ldots, y_{i_{nb}})$ and transmits to User. 5. User computes $r_b = \mathrm{corr}(X_b, Y_b)$. 6. $\bar{r} = \mathrm{mean}(r_1, r_2, \ldots, r_B)$ is a point estimate of $\mathrm{corr}(x, y)$. 7. $\hat{\sigma}_r^2 = \mathrm{var}(r_1, r_2, \ldots, r_B)$ is an estimate of the variance of $\mathrm{corr}(x, y)$. Two-way command-line communication between servers and user, specialized remote computation (e.g., random sampling) dictated by type of analysis, direct communication between servers (if needed). Constructed subject to infrastructure cost constraints.
  • 14. Simple analysis system for estimating correlation (same User, Server A, and Server B configuration and the same bootstrap procedure as slide 13): How to choreograph this analysis to minimize uncertainty? Maximize $B$ and set $n = N$, but what about the costs of computation and transmission?
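The bootstrap procedure on slides 13 and 14 can be sketched in Python as follows. This is a minimal single-process sketch, not the authors' implementation: the `Server` class is a hypothetical stand-in for a remote archive, the data are synthetic, `values_sent` is only a crude proxy for transmission cost, and the sketch assumes indices are drawn from the full range of stored values.

```python
import random
import statistics


def pearson_corr(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


class Server:
    """Hypothetical stand-in for a remote archive holding one variable."""

    def __init__(self, values):
        self.values = values
        self.values_sent = 0  # crude proxy for transmission cost

    def extract(self, indices):
        self.values_sent += len(indices)
        return [self.values[i] for i in indices]


def bootstrap_corr(server_a, server_b, n, B, seed=0):
    """Return (mean, variance) of B bootstrap correlation replicates."""
    rng = random.Random(seed)
    N = len(server_a.values)
    reps = []
    for _ in range(B):
        idx = [rng.randrange(N) for _ in range(n)]  # step 1 (0-based indices)
        x_b = server_a.extract(idx)                 # steps 2 and 3
        y_b = server_b.extract(idx)                 # steps 2 and 4
        reps.append(pearson_corr(x_b, y_b))         # step 5
    return statistics.fmean(reps), statistics.variance(reps)  # steps 6 and 7


# Synthetic data with correlation 0.8 by construction (illustration only).
data_rng = random.Random(42)
N = 500
x = [data_rng.gauss(0, 1) for _ in range(N)]
y = [0.8 * xi + 0.6 * data_rng.gauss(0, 1) for xi in x]

server_a, server_b = Server(x), Server(y)
r_bar, var_r = bootstrap_corr(server_a, server_b, n=N, B=200)
total_sent = server_a.values_sent + server_b.values_sent
print(f"r_bar={r_bar:.3f} var={var_r:.5f} values sent={total_sent}")
```

The number of values shipped to the user grows as 2·n·B, which is the tension slide 14 points at: raising B and n reduces uncertainty but raises computation and transmission cost.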
  • 15. Example: Carbon Cycle Model-Observation Comparison (Gunson, Braverman, Bowman, Cressie) [Figure: flow diagram over time steps t = 1, 2, …, T. Panel labels: True fluxes; Flux model; Fluxes; Transport model; True CO2 concentrations; CO2 concentrations; Radiances; Retrieval algorithm; Ground-based retrieval; In situ sampling; Heterogeneous, incomplete observations; Inference (Inverse problem); Inferred fluxes; Compare (shown twice); (unknown) (shown twice). Symbols shown: s(u1, u2, t); s̃(x1, x2, t); ŝ(u1, u2, t); x(u1, u2, u3, t); x̃(u1, u2, u3, t); x̂(u1, u2, t); x̂(u1, u2, u3, t); y(u1, u2, t)] Goal: Understand the processes that control flux of CO2 between the ocean/land and the atmosphere. Strategy: Experiment with flux and transport models to make modeled concentrations agree with observations of concentrations, or produce fluxes that agree with inferred fluxes. Issues: 1) Flux and transport models are run in different physical locations; output must be moved. 2) Observational data are heterogeneous (different footprints, measurement errors, etc) and must be reconciled to one another and to flux and transport model resolutions. 3) Observational data are stored in different physical locations; must be moved. 4) "Compare" = hypothesis testing; requires formal uncertainty (probabilistic) estimates on observations. 5) "Inference" = data assimilation or direct Bayesian inference; also requires uncertainties on model output in addition to observational uncertainties. Data science challenge: Address 1) - 5) to make inferences (hypothesis tests) with minimum uncertainty and minimum movement of data. A dynamic optimization problem.
    Science Goal: Understand processes that control CO2 flux. Strategy: Experiment with models to increase agreement with observations / inferences. Analysis Challenges: Models and data reside at different locations. Data are heterogeneous and must be reconciled as to format, scope, fidelity, resolution, etc. Meaningful comparison requires uncertainty estimation on both observational data and model output. Architecture Evaluation: Address the analysis challenges to minimize both uncertainty and data movement. A dynamic optimization problem.
  • 16. Research Challenges Research the relationship between architectural topology and scientific data analysis efficiency to explore new architectural techniques for scaling science-driven data analytics across distributed environments. 1) for a fixed system architecture, how can one optimize the movement of data and algorithms and estimate the costs? 2) which system architectures yield the greatest efficiencies for the types of scientific analyses we wish to support? 3) how can existing and new computational methods be designed to better exploit the distributed architecture and increase scientific return? How does the intersection between system topology and analysis methodology affect the uncertainty?
  • 17. The Need for Architectural Tradeoffs [Diagram: Data Science Architectural Tradeoffs. Labels: Use Cases, Scenarios; Big Data Analytics Capability. Topology Decisions: Distribution (data, computation); Data Accessibility; Network Capacity; Computational Capacity; Analysis Choreography/Workflow. Methodology Decisions, Methods: Data Reduction; Feature Extraction; Classification; Detection; Fusion. Software and Hardware Decisions: Data Science Analytics Framework; Data Management Capabilities; Storage (e.g., Cloud); Visualization; Algorithms/Data Processing; Server Resources (Cloud, HPC, etc); Data Movement Technologies. Data: Data Collections; Data Products/Objects/Files; Metadata Products; Data Formats, Data Size. Output: Uncertainty, performance, cost, and computing stack based on a set of capacities]
  • 18. Nodes and Edges • Consider a node to run a set of components (applications, middleware, services) • Consider nodes to be connected via edges (networks) – This follows a component-connector architectural style (C2)
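One way to picture the node and edge view is a small Python sketch in which nodes carry lists of components and edges carry network properties. The class names, fields, bandwidth figures, and the latency-plus-size/bandwidth transfer model are all assumptions made for illustration; the slides do not specify a representation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A machine or site that runs a set of components."""
    name: str
    components: list = field(default_factory=list)


@dataclass
class Edge:
    """A network connection between two nodes (hypothetical cost model)."""
    src: str
    dst: str
    bandwidth_mb_s: float
    latency_s: float = 0.0

    def transfer_time_s(self, size_mb):
        # Simplest possible model: fixed latency plus size over bandwidth.
        return self.latency_s + size_mb / self.bandwidth_mb_s


nodes = {
    "user": Node("user", ["R or Python application"]),
    "server_a": Node("server_a", ["archive", "search middleware"]),
    "server_b": Node("server_b", ["archive", "search middleware"]),
}
edges = [
    Edge("server_a", "user", bandwidth_mb_s=100.0, latency_s=0.05),
    Edge("server_b", "user", bandwidth_mb_s=25.0, latency_s=0.05),
]

# Time for each server to ship 1000 MB to the user; the slower link dominates.
times = {e.src: e.transfer_time_s(1000.0) for e in edges}
slowest = max(times, key=times.get)
print(times, slowest)
```

Once the topology is explicit like this, questions such as where to place a computation become comparisons of costs over edges and nodes.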
  • 19. DAWN: Modeling Software Architectures for Scalability DAWN (“Distributed Analytics, Workflows and Numerics”) is a model for simulation, analysis and optimization of data science architectures (performance, cost, and uncertainty (planned)). Luca Cinquini, Kyo Lee, Amy Braverman, Mike Turmon, Dan Crichton; Collaboration with David Garlan/CMU Applications • Data Analytic Architectures • Science Data Processing • Cloud Architectures • etc
  • 20. Applications of DAWN • Can be run multiple times by changing the system parameters (number of nodes, cores, network speed, …) to identify the resources needed to achieve a given processing goal • Can find bottlenecks in workflow execution to identify computations that need to be optimized or parallelized • Can analyze how efficiently CPUs are utilized, to minimize monetary cost • Can compare different possible architectures (centralized, distributed, parallel, …) to maximize efficiency
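The first two uses listed above (sweeping system parameters to find the resources needed for a processing goal, and spotting bottlenecks) can be illustrated with a toy cost model in Python. This is not DAWN and does not reflect its interface; the functions, parameter names, and the assumption of one shared link followed by perfectly parallel compute are all hypothetical simplifications.

```python
def estimated_runtime_s(data_gb, cpu_s_per_gb, nodes, cores_per_node, link_gb_s):
    """Toy model: serial transfer over one shared link, then perfectly parallel compute."""
    transfer_s = data_gb / link_gb_s
    compute_s = data_gb * cpu_s_per_gb / (nodes * cores_per_node)
    return transfer_s + compute_s


def smallest_cluster(deadline_s, max_nodes=64, **params):
    """Fewest nodes meeting the deadline, or None if no size up to max_nodes does."""
    for nodes in range(1, max_nodes + 1):
        if estimated_runtime_s(nodes=nodes, **params) <= deadline_s:
            return nodes
    return None


# Made-up workload: 1000 GB of data, 20 CPU-seconds per GB, 8 cores per node,
# and a 1 GB/s link into the cluster.
workload = dict(data_gb=1000.0, cpu_s_per_gb=20.0, cores_per_node=8, link_gb_s=1.0)

needed = smallest_cluster(deadline_s=2000.0, **workload)
impossible = smallest_cluster(deadline_s=900.0, **workload)
print(needed, impossible)
```

In this toy setting the second query has no solution because the transfer alone takes 1000 s: the network, not the CPU count, is the bottleneck, so adding nodes cannot help.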
  • 21. Joint Research Opportunities between statistics and software architecture • Future vision would be an environment where the underlying distributed architecture provides sufficient abstraction to allow for dynamic changes without breaking statistical analysis. • Given this vision, and the way architectural and statistical choices affect scalability and uncertainty in the analysis of distributed data, – Statisticians can set up the use cases, analysis, and methodology to address uncertainty – Software architects can set up the simulation, including the distributed environment – Joint research can be conducted to identify optimal architectural approaches for analysis within these highly distributed and massive data environments.
  • 22. Summary • Understand the “decomposition” of a node into components – Separate to understand appropriate levels of abstraction – Generate an architectural representation – Determine granularity for assessing “cost” (performance, uncertainty, etc) • Understand the “edges” of nodes that “connect” components • Choreography is highly dependent on the data and topology • Ultimately use a simulation tool like DAWN to break chicken-and-egg questions
  • 23. Acknowledgements • Virtual Information Fabric Infrastructure (VIFI) distributed analytics framework. NSF DIBBS. (Bill Tolone, UNC; G. Djorgovski, Caltech; L. Cinquini, K. Lee, JPL) • JPL Data Science Initiative • Joint Initiative for Data Science and Technology at Caltech and JPL • NASA AIST and Mike Little