CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Software Architecture Considerations for Scaling Data Analysis - Dan Crichton, Feb 12, 2018
Architectural decisions in designing data- and computation-intensive systems can have a major impact on the ability of these systems to perform statistical and other complex calculations efficiently. The storage, processing, tools, and associated databases, coupled with the networking and compute infrastructure, make some kinds of computations easier and others harder. This talk will provide an introduction to the software and data system components that are important for understanding how these choices impact data analysis uncertainties and costs, and thus for developing system and software designs best suited to statistical analyses.
1. An introduction to systems and software architecture considerations for scaling data analysis
Dan Crichton
February 2018
Jet Propulsion Laboratory, California Institute of Technology
2. Introduction
• Statisticians often work in very high-level languages
• Complexities of hardware and software are hidden through "abstractions"
• Abstractions simplify programming and increase efficiency
Layered stack (diagram): Python, R, etc. → C, C++, etc. → Assembly Language → Machine Code → Hardware (CPU, IO, etc.)
3. System Architecture: what is it?
• The fundamental organization of a system embodied in its components, their relationships to each other and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)
Diagram: My Application ↔ Network ↔ Remote Archive (observational data)
4. Software System Levels of Abstraction
• Applications – Scientific software applications, visualizations, etc.
• Middleware – Common software for data management, data processing, analysis
• Compute Services – Storage, databases, computation, etc. (e.g., cloud computing)
• Networks – Communication (see the OSI 7-layer model)
6. The details of network communications are highly abstracted
Diagram: My Application ↔ Remote Archive (observational data)
7. Complexities can be highly abstracted out (mission view)
Diagram: end-to-end mission data flow. The spacecraft/lander and its scientific instruments send telemetry information packages via a relay satellite to Data Acquisition and Command; Mission Operations and Instrument/Sensor Operations exchange planning information objects; Science Data Processing turns primitive and simple information objects into science information packages; the Science Data Archive serves science products (information objects) to Data Analysis and Modeling, the science team, and the external science community.
• Common Meta Models for Describing Space Information Objects
• Common Data Dictionary end-to-end:
  – Flight Software
  – Ground Data Systems
  – Science Production/Processing
  – Science Analysis
8. At NASA, software abstractions exist across the lifecycle
Lifecycle phases (diagram): Flight Software (highly specialized), Ground Data Systems, Science Processing, Science Analysis
Abstraction layers:
• Applications
• Middleware
• Compute Services
• Networks
The component deployment is dependent on fitting the principles (scalability, usability, accessibility, etc.) and the environment (highly distributed on the ground, onboard, etc.).
9. Analysis subject to distributed environment
Diagram: Data Capture (instrument data systems, airborne data, NASA data archives, and other data systems, e.g., NOAA, in-situ, other agencies) feeds, over a comm network, a big data infrastructure (data, algorithms, machines) supporting Data Analysis (water, ocean, CO2, extreme events, Mars, etc.).
Reducing Data Wrangling: "There is a major need for the development of software components… that link high-level data analysis specifications with low-level distributed systems architectures." Frontiers in the Analysis of Massive Data, National Research Council, 2013.
10. Distributed Computing: Software as "components" and "connections"
Think of each application as a component that is plugged into a network for communication.
(Figure: Institute for Software Research, UC Irvine)
11. Distributed Computing: Software as "components" and "connections"
Think of each application as a component that is plugged into a network for communication. (Figure: Institute for Software Research, UC Irvine)
Diagram: "My Application", an R or Python application (Server 1), connected via the network to the "Remote Archive" (Server 2) holding observational data.
12. Simple data system:
Diagram: User node connected to Server A (holding x_1, x_2, ..., x_N) and Server B (holding y_1, y_2, ..., y_N).
• Servers hold data.
• User node connected by manual webpage interface.
• Middleware at each node provides search for, extraction of, and transmission of data from servers to user.
• Search and extract by lat/lon bounding box, time, variable, etc.
• Some simple "analytics" available through the web interface (e.g., simple averages, maps, time series).
Server compute capacity, (dedicated) transmission links and their capacities, and deployment of data to servers are all fixed at inception, subject to infrastructure construction costs. Designed exclusively for access rather than analysis.
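The bounding-box search-and-extract middleware described above can be sketched in a few lines of Python. This is an illustrative sketch only; the record layout and function names (`Record`, `extract`, `subset_mean`) are assumptions of this writeup, not part of the talk.

```python
# Illustrative sketch of server-side subsetting middleware: search/extract
# by lat/lon bounding box and time window, plus one simple server-side
# "analytic" (an average) so that only one number crosses the network.
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    lat: float
    lon: float
    t: int
    value: float

def extract(records: List[Record], lat_min, lat_max,
            lon_min, lon_max, t_min, t_max) -> List[Record]:
    """Return only the records inside the bounding box and time window."""
    return [r for r in records
            if lat_min <= r.lat <= lat_max
            and lon_min <= r.lon <= lon_max
            and t_min <= r.t <= t_max]

def subset_mean(records: List[Record], **bounds) -> float:
    """A simple server-side analytic: the mean over the requested subset."""
    subset = extract(records, **bounds)
    return sum(r.value for r in subset) / len(subset)
```

In the access-only design above, only `extract` is exposed; the analysis system on the next slides moves some computation (like `subset_mean`) to the server side.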
13. Simple analysis system for estimating correlation:
Diagram: User node connected to Server A (holding x_1, x_2, ..., x_N) and Server B (holding y_1, y_2, ..., y_N).
Bootstrap estimate of corr(x, y). For b = 1, 2, ..., B:
1. User node samples n indices from {1, 2, ..., N} with replacement. Call these (i_1^b, i_2^b, ..., i_n^b).
2. Transmit indices (i_1^b, i_2^b, ..., i_n^b) to Servers A and B.
3. Server A extracts x^b = (x_{i_1^b}, x_{i_2^b}, ..., x_{i_n^b}) and transmits to User.
4. Server B extracts y^b = (y_{i_1^b}, y_{i_2^b}, ..., y_{i_n^b}) and transmits to User.
5. User computes r_b = corr(x^b, y^b).
6. r_bar = mean(r_1, r_2, ..., r_B) is a point estimate of corr(x, y).
7. s^2 = var(r_1, r_2, ..., r_B) is an estimate of the variance of corr(x, y).
Two-way command-line communication between servers and user; specialized remote computation (e.g., random sampling) dictated by the type of analysis; direct communication between servers (if needed). Constructed subject to infrastructure cost constraints.
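The bootstrap procedure above can be simulated locally in a short Python sketch. Here each "server" is stood in for by a plain list, so steps 2–4 (the network transmissions) become list extractions; function names and the `seed` parameter are illustrative additions.

```python
# Minimal local simulation of the distributed bootstrap on this slide.
# server_a and server_b play the roles of Servers A and B; n_draws is the
# resample size (n on the slide).
import random
import statistics

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def bootstrap_corr(server_a, server_b, B, n_draws, seed=0):
    rng = random.Random(seed)
    N = len(server_a)
    r = []
    for _ in range(B):
        idx = [rng.randrange(N) for _ in range(n_draws)]  # step 1: sample indices
        xb = [server_a[i] for i in idx]                   # step 3: Server A extracts
        yb = [server_b[i] for i in idx]                   # step 4: Server B extracts
        r.append(corr(xb, yb))                            # step 5: user computes r_b
    # steps 6-7: point estimate and variance estimate
    return statistics.mean(r), statistics.variance(r)
```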
14. Simple analysis system for estimating correlation (continued):
(Same diagram and bootstrap procedure as the previous slide.)
How to choreograph this analysis to minimize uncertainty? Maximize B and set n = N, but what about the costs of computation and transmission?
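The choreography question can be made concrete with a toy cost model. The linear cost form and the byte counts below are assumptions of this sketch, not from the talk: transmission cost grows linearly in B and the resample size, while the Monte Carlo variance of the bootstrap mean shrinks like 1/B.

```python
# Toy cost/uncertainty tradeoff for the bootstrap choreography (assumed
# cost form, for illustration only).
def transmission_cost(B, n_draws, bytes_per_value=8):
    # per replicate: n_draws indices out to each of two servers,
    # and n_draws data values back from each
    return B * (2 * n_draws + 2 * n_draws) * bytes_per_value

def mc_variance_of_mean(per_replicate_var, B):
    # variance of the average of B independent bootstrap replicates
    return per_replicate_var / B

# Doubling B halves the Monte Carlo variance but doubles the bytes moved:
c1, v1 = transmission_cost(100, 1000), mc_variance_of_mean(0.04, 100)
c2, v2 = transmission_cost(200, 1000), mc_variance_of_mean(0.04, 200)
```

Under this toy model the tradeoff is explicit: uncertainty falls like 1/B while cost rises like B, which is exactly the kind of balance the architecture must strike.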
15. Example: Carbon Cycle Model-Observation Comparison
Diagram: a flux model produces true fluxes s(u1, u2, t); a transport model produces true CO2 concentrations x(u1, u2, u3, t); retrieval algorithms, ground-based retrieval, and in-situ sampling yield radiances y(u1, u2, t) and heterogeneous, incomplete observations x̂(u1, u2, t), x̂(u1, u2, u3, t); inference (an inverse problem) yields inferred fluxes ŝ(u1, u2, t), which are compared with modeled fluxes s̃(x1, x2, t) and modeled concentrations x̃(u1, u2, u3, t); all quantities are time series t = 1, 2, ..., T.
Goal: Understand the processes that control the flux of CO2 between the ocean/land and the atmosphere.
Strategy: Experiment with flux and transport models to make modeled concentrations agree with observations of concentrations, or produce fluxes that agree with inferred fluxes.
Issues:
1) Flux and transport models are run in different physical locations; output must be moved.
2) Observational data are heterogeneous (different footprints, measurement errors, etc.) and must be reconciled to one another and to flux and transport model resolutions.
3) Observational data are stored in different physical locations and must be moved.
4) "Compare" = hypothesis testing; requires formal uncertainty (probabilistic) estimates on observations.
5) "Inference" = data assimilation or direct Bayesian inference; also requires uncertainties on model output in addition to observational uncertainties.
Data science challenge: Address 1)–5) to make inferences (hypothesis tests) with minimum uncertainty and minimum movement of data. A dynamic optimization problem.
(Gunson, Braverman, Bowman, Cressie)
Science Goal:
- Understand processes that control CO2 flux
Strategy:
- Experiment with models to increase agreement between observations and inferences
Analysis Challenges:
- Models and data reside at different locations
- Data are heterogeneous and must be reconciled as to format, scope, fidelity, resolution, etc.
- Meaningful comparison requires uncertainty estimation on both observational data and model output
Architecture Evaluation:
- Address the analysis challenges to minimize both uncertainty and data movement. A dynamic optimization problem.
16. Research Challenges
Research the relationship between architectural topology and scientific data analysis efficiency to explore new architectural techniques for scaling science-driven data analytics across distributed environments.
1) For a fixed system architecture, how can one optimize the movement of data and algorithms and estimate the costs?
2) Which system architectures yield the greatest efficiencies for the types of scientific analyses we wish to support?
3) How can existing and new computational methods be designed to better exploit the distributed architecture and increase scientific return?
How does the intersection between system topology and analysis methodology affect the uncertainty?
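Question 1), the movement of data versus algorithms, can be illustrated with a back-of-the-envelope comparison. The cost form and units below are assumptions of this sketch, not from the talk: shipping data scales with the data volume, while shipping the algorithm scales with (much smaller) code plus result sizes.

```python
# Toy cost comparison for research question 1 (assumed form): move the
# data to the algorithm, or move the algorithm to the data?
def move_data_cost(data_gb, net_gbps):
    # seconds to transmit the full dataset over the network
    return (data_gb * 8) / net_gbps

def move_code_cost(code_mb, result_mb, net_gbps):
    # seconds to ship the algorithm out and only the (small) result back
    return ((code_mb + result_mb) / 1024 * 8) / net_gbps

def cheaper_strategy(data_gb, code_mb, result_mb, net_gbps):
    return ("move code"
            if move_code_cost(code_mb, result_mb, net_gbps)
            < move_data_cost(data_gb, net_gbps)
            else "move data")
```

Even this crude model shows why large archives favor moving computation to the data, while small subsets can simply be transmitted.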
17. Topology Decisions
• Distribution (data, computation)
• Data Accessibility
• Network Capacity
• Computational Capacity
• Analysis Choreography/Workflow
The Need for Architectural Tradeoffs
Diagram: data science architectural tradeoffs sit at the intersection of methodology decisions, software and hardware decisions, and use cases/scenarios.
Data Science Analytics Framework (diagram):
• Data – data collections; data products/objects/files; metadata products; data formats, data size
• Methods – data reduction, feature extraction, classification, detection, fusion
• Big Data Analytics Capability – data management capabilities: storage (e.g., cloud), visualization, algorithms/data processing, server resources (cloud, HPC, etc.), data movement technologies
• Output – uncertainty, performance, cost, and computing stack based on a set of capacities
18. Nodes and Edges
• Consider a node to run a set of components (applications, middleware, services)
• Consider nodes to be connected via edges (networks)
  – This follows a component-connector architectural style (C2)
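The node/edge view can be sketched as a tiny topology model in the component-connector spirit. The class and component names below are illustrative assumptions, not an actual C2 implementation.

```python
# Minimal sketch of the node/edge view: nodes run components, edges are
# network links with capacities.
class Node:
    def __init__(self, name, components):
        self.name = name
        self.components = list(components)  # applications, middleware, services

class Topology:
    def __init__(self):
        self.nodes = {}
        self.edges = []  # (node_a, node_b, bandwidth_mbps) tuples

    def add_node(self, node):
        self.nodes[node.name] = node

    def connect(self, a, b, bandwidth_mbps):
        self.edges.append((a, b, bandwidth_mbps))

    def neighbors(self, name):
        return [b if a == name else a
                for a, b, _ in self.edges if name in (a, b)]

# The two-server example from earlier slides in this representation:
topo = Topology()
topo.add_node(Node("user", ["R/Python application"]))
topo.add_node(Node("archive", ["storage", "subsetting middleware"]))
topo.connect("user", "archive", bandwidth_mbps=100)
```

A representation like this is the starting point for the simulations on the next slides: vary the components, edges, and capacities, and measure the effect on an analysis.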
19. DAWN: Modeling Software Architectures for Scalability
DAWN ("Distributed Analytics, Workflows and Numerics") is a model for simulation, analysis, and optimization of data science architectures for performance, cost, and (planned) uncertainty.
Luca Cinquini, Kyo Lee, Amy Braverman, Mike Turmon, Dan Crichton; collaboration with David Garlan/CMU
Applications:
• Data Analytic Architectures
• Science Data Processing
• Cloud Architectures
• etc.
20. Applications of DAWN
• Can be run multiple times by changing the system parameters (number of nodes, cores, network speed, …) to identify the resources needed to achieve a given processing goal
• Can find bottlenecks in workflow execution to identify computations that need to be optimized or parallelized
• Can analyze how efficiently CPUs are utilized, to minimize monetary cost
• Can compare different possible architectures (centralized, distributed, parallel, …) to maximize efficiency
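A toy stand-in for the first kind of what-if run might look as follows. This is emphatically not the actual DAWN model; the perfectly linear speedup and single data transfer are simplifying assumptions of this sketch.

```python
# Toy what-if model: estimate workflow time under different node/core/
# network parameters, then search for the smallest cluster meeting a goal.
def workflow_time(work_core_hours, data_gb, nodes, cores_per_node, net_gbps):
    compute_h = work_core_hours / (nodes * cores_per_node)  # assumes perfect parallelism
    transfer_h = (data_gb * 8) / (net_gbps * 3600)          # move the data once
    return compute_h + transfer_h

def smallest_cluster(goal_hours, work_core_hours, data_gb,
                     cores_per_node, net_gbps, max_nodes=1024):
    """Return the fewest nodes meeting the goal, or None if none suffice."""
    for nodes in range(1, max_nodes + 1):
        if workflow_time(work_core_hours, data_gb, nodes,
                         cores_per_node, net_gbps) <= goal_hours:
            return nodes
    return None
```

A real simulator such as DAWN replaces these closed-form guesses with an executable model of the workflow, but the usage pattern (sweep parameters, compare configurations) is the same.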
21. Joint Research Opportunities between statistics and software architecture
• A future vision is an environment where the underlying distributed architecture provides sufficient abstraction to allow for dynamic changes without breaking statistical analysis.
• Considering this, and the relationships to scalability and uncertainty in the architectural and statistical choices for analysis of distributed data:
  – Statisticians can set up the use cases, analysis, and methodology to address uncertainty
  – Software architects can set up the simulation, including the distributed environment
  – Joint research can be conducted to identify optimal architectural approaches for analysis within these highly distributed and massive data environments.
22. Summary
• Understand the "decomposition" of a node into components
  – Separate to understand appropriate levels of abstraction
  – Generate an architectural representation
  – Determine the granularity for assessing "cost" (performance, uncertainty, etc.)
• Understand the "edges" of nodes that "connect" components
• Choreography is highly dependent on the data and topology
• Ultimately, use a simulation tool like DAWN to break the chicken-and-egg cycle
23. Acknowledgements
• Virtual Information Fabric Infrastructure (VIFI) distributed analytics framework, NSF DIBBS (Bill Tolone, UNC; G. Djorgovski, Caltech; L. Cinquini, K. Lee, JPL)
• JPL Data Science Initiative
• Joint Initiative for Data Science and Technology at Caltech and JPL
• NASA AIST and Mike Little