Ibmr 2014

IBM Research Brazil
Fabio Porto (fporto@lncc.br),
LNCC – MCTI
DEXL Lab (dexl.lncc.br)
Challenges in Scientific
Big Data Management

Outline
 Introduction
 Big Data in Science
 Hypothesis-Driven Research
 Hypothesis as Data – Upsilon-DB
 SimDB
 Final remarks

Laboratório Nacional de
Computação Científica (LNCC)
Petropolis, Rio de Janeiro

LNCC - MCTI
 Graduate Course in Computational Modelling
– CAPES 6
 BioInfo Laboratory
– High throughput sequencing
 Coordinator of INCT –MACC
– Medicine Supported by Computational Science
 Coordinator of SINAPAD
– HPC National System
 Thematic laboratories
– ACIMA – Augmented Reality
– MARTIN – Network and Software Engineering
– DEXL – Big Data
– COMCIDIS – Distributed Systems
– HEMOLAB – Cardio Vascular System Modelling

DEXL – Ongoing Projects
[Figure: projects of the DEXL Laboratory]
– What-if data analysis
– Astronomy data management
– Simulation data management
– Hypothesis: Upsilon-DB
– Gene regulation network
– Scientific workflow optimization
– Seismic data management (EMC)
– Bioknowlogy
– Olympic Laboratory

Objective
 To provide scientists with an in-silico cockpit
from which scientific data and metadata can
be efficiently managed

“Scientists are spending most of their time
manipulating, organizing, finding and moving
data, instead of researching. And it’s going to
get worse”
– The Office of Science Data-Management Challenge, report, DoE, 2004

Big Data in science
 An expression that reflects the data deluge
produced in science
– astronomy, astrophysics
– Biology, Neuroscience
– Sports
– Geology, Geophysics, etc.

Big Data - Dimensions
[Figure: dimensions of Big Data]
– Volume: MB, GB, TB, PB
– Velocity: batch, online, sensors and alerts, real time
– Variety: file, database
– Uncertainty: heterogeneity, evolution

A challenge on volume
 The Dark Energy Survey (DES) project expects to
produce 100 PB in 10 years (source: personal communication)
– 5000 square degrees of sky coverage, “all” objects, “perfect” accuracy
 Yahoo claims to manage 2 PB of click data in a
modified PostgreSQL
 EMBL - nucleotide database 260 Gbases
 High-throughput sequencing 454 Roche technology
– Sequence 400-600 million bases in 10 hours
– E.g. a project at the Max Planck Institute that aims to sequence the
whole Neanderthal genome, at 3 billion base pairs, is
expected to take 2 years to finish.

1D-3D coupled simulations

From Observation to Data
Analysis

BIG DATA in Science
 Scientific process is being remodelled to be developed
within an in-silico environment
 Powerful instruments:
– Digital telescopes
– DNA sequencers
– Mass spectrometers
 Huge simulations
– Weak lensing
– Human Cardio-vascular system
 Massive amounts of information streams in and out…
 Hypothesis-driven research supported by in-silico
infrastructure, methods, models…

e-Science life cycle
[Figure: experiment life cycle linking Phenomenon, Hypothesis Formulation, Modeling, Experiment and Publication]

Big Data urgent call in e-science
 Scientific life cycle metadata management;
 Scientific Hypothesis formulation and
validation;
 Scientific Data management;
 Scientific data processing architecture;

MODELLING -
HYPOTHESIS-DRIVEN BIG
DATA RESEARCH

“To see what is in front of one’s nose
needs a constant struggle”
George Orwell

To make sense of Big Data we
need models
[Peter Haas – Data is Dead without what-if models,
PVLDB 2011]
 Scientific models are formal interpretations of
phenomena
 Hypotheses are formalized as models
 The scientific life cycle is driven by hypothesis
validation

Hypothesis driven Big Data
analyses
 Scientific Hypothesis – a model for scientists’
interpretation of a phenomenon;
 Different hypotheses co-habit a scientific
domain;
 Scientific method – prove hypotheses
 Big Data analyses – hypothesis exploration
 In new Big Data predictive analyses – identify the
first principles that guide predictions (deep vs.
shallow prediction)

Big Data hypothesis-driven
life cycle
[Figure, adapted from Mattoso et al. 2010: experiment life cycle with the stages
Hypotheses and experiment goals; Experiment and workflow design; Workflow preparation;
Workflow execution; Monitoring; Post-execution analysis; Analysis results.
Supporting stores: Hypotheses database, Workflow repository, Data sources, Provenance store]

Equivalence of interpretation
[Figure: triangle relating ν (hypotheses) ≅ Models, ϕ (phenomenon) and δ (data)]

Υ-DB Conceptual Model
[Figure: conceptual model diagram.
Entities: Phenomenon, Ph_Process (Continuous, Discrete), State, Event, Space-Time Dimension,
Physical Quantities, SC Hypothesis, Scientist, Formal Representation, Formal Language,
Mathematical Model, Mathematical Formulae XML (elements: constant, function, equation, variable),
Computational Model View, Simulation, Mesh, Mesh Data View, Data View (query over data view),
Observation Element, Simulated Element, Discrete Phenomenon, Domain ontology URL.
Relationships: explains, formulatedBy, isTheBlendOf, isBasedOn, represented_as, represented with,
compared_with, modeled_as, refers-to, transforms, represents, topologically modeled by, isAuthor]
[Porto et al. ER 2008, ER 2012]

Hypotheses as Data – Upsilon DB
 From the triangular equivalence, we derive that
– Hypothesis = Model = Data
 How can we infer data from Model?
[Bernardo Gonçalves, Fabio Porto, PVLDB 2014]

Hypothesis as Models
Hypothesis: law of free fall
If a body falls from rest, its velocity at any point is
proportional to the time it has been falling.
Scientific Model:
a(t) = -g
v(t) = -g t + v_0
s(t) = -(g/2) t^2 + v_0 t + s_0
Computational Model:
for k = 0:n
    t = k * dt;
    v = -g*t + v0;
    s = -(g/2)*t^2 + v0*t + s0;
    t_plot(k+1) = t;   % indices start at 1
    v_plot(k+1) = v;
    s_plot(k+1) = s;
end

Hypothesis - From Models to Data
[Figure: Input -> SOLVER -> Output, where the solver runs the free-fall computational model loop (for k = 0:n ... end) shown on the previous slide]

Hypothesis as Data
[Figure: the law of free fall, its equations a(t), v(t) and s(t), and the computational model loop yield the relation Free_Fall]
Free_Fall
t   v      s
0   0      5000
1   -32    4984
2   -64    4936
3   -96    4856
4   -128   4744
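The step from model to data can be sketched in a few lines of Python. This is only an illustration: the parameter values g = 32, v0 = 0 and s0 = 5000 are not stated on the slide and are inferred from the Free_Fall rows, and the function name is hypothetical.

```python
def free_fall_table(g, v0, s0, dt, n):
    """Evaluate v(t) = -g*t + v0 and s(t) = -(g/2)*t**2 + v0*t + s0
    at t = 0, dt, ..., n*dt and return the rows (t, v, s)."""
    rows = []
    for k in range(n + 1):
        t = k * dt
        v = -g * t + v0
        s = -(g / 2) * t**2 + v0 * t + s0
        rows.append((t, v, s))
    return rows


# Assumed parameters (inferred from the Free_Fall table, not given on the slide).
free_fall = free_fall_table(g=32, v0=0, s0=5000, dt=1, n=4)
for row in free_fall:
    print(row)
```

Each printed tuple corresponds to one row of Free_Fall, so the relation is simply the materialized output of the computational model.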

Hypothesis as Data – Computing a
DB schema
 Mathematical models
– formalize hypotheses
– equations establish functional dependencies
from dimensions and parameters to
predicting variables
 e.g. g, t, v0 -> v
– Derive a DB schema from the FDs (functional dependencies) extracted from
the equations

Hypothesis as Data
 In the Free Fall example:
– Σ1 = { Φ -> g, v0, s0
         g, ν -> a
         g, v0, t, ν -> v
         g, v0, s0, t, ν -> s }
– e.g. g, v0, t, ν -> v is read off v(t) = -g t + v0
 Observe that Φ and ν are epistemological
variables referring to the phenomenon and
the hypothesis, respectively;
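A minimal sketch of how the dependencies in Σ1 could be read off the equations. The dictionary encoding of the equations and the names `derive_fds`, `phi` and `upsilon` are assumptions made for this illustration; they are not the Υ-DB implementation.

```python
# Each equation is described by the variable it predicts and the variables it reads.
# This encoding of the free-fall equations is an assumption of the sketch.
equations = {
    "a": ["g"],
    "v": ["g", "v0", "t"],
    "s": ["g", "v0", "s0", "t"],
}
parameters = ["g", "v0", "s0"]  # model inputs, determined by the phenomenon id


def derive_fds(equations, parameters, phenomenon="phi", hypothesis="upsilon"):
    """Return functional dependencies as (left-hand side, right-hand side) tuples."""
    fds = [((phenomenon,), tuple(parameters))]
    for predicted, inputs in equations.items():
        # The hypothesis id is added to the left-hand side of every prediction.
        fds.append((tuple(inputs) + (hypothesis,), (predicted,)))
    return fds


sigma1 = derive_fds(equations, parameters)
for lhs, rhs in sigma1:
    print(", ".join(lhs), "->", ", ".join(rhs))
```

The four printed dependencies mirror Σ1, with `phi` standing for Φ and `upsilon` for the hypothesis variable.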

Hypothesis as Data - schema
 Φ -> g, v0, s0
– defines the model parameters
– It is expected to be violated, reflecting the
uncertainty in the model input;
– Such uncertainty contributes to the quality of the
hypothesis
 From Σ1, the schema for predicting a under
hypothesis h1 would be:
– h1 (Φ, ν, a)
 From Σ1, the input parameters are defined
as (*key violation):
– h1_input(Φ, g, v0, s0)

Hypothesis as Uncertain Data
INPUT_H1
Φ   g      v0   s0
1   32     0    5000
1   32     10   5000
1   32     20   5000
1   32.2   0    5000
1   32.2   10   5000
1   32.2   20   5000
Uncertainty: 50% for g (2 alternative values), 33% for v0 (3 alternative values)

Uncertainty Introduction
 Υ-DB is a probabilistic database
[D. Suciu et al., Probabilistic Databases, 2011]
– a Y-relation includes certain and conditional columns;
– a conditional column is a pair (Vi, Di), where Vi is a
random variable and Di is one of its possible values;
– e.g.:
 create table Y_g as select U.phi, U.g
from (repair key phi in (select phi, g, count(*) as Fr
from INPUT_H1 group by phi, g) weight by Fr) as U

Hypothesis as Uncertain Data
 create table Y_g as select U.phi, U.g
from (repair key phi in (select phi, g, count(*) as Fr
from INPUT_H1 group by phi, g) weight by Fr) as U
INPUT_H1 (columns Φ and g)
Φ   g
1   32
1   32
1   32
1   32.2
1   32.2
1   32.2
Y_g
Φ   V -> D    g
1   x1 -> 1   32
1   x1 -> 2   32.2
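What the repair-key step computes can be emulated in plain Python. This is a sketch of the idea only, not of how a probabilistic DBMS evaluates it: the function name `repair_key`, the variable naming scheme (x1, x2, ...) and the explicit probability column are assumptions.

```python
from collections import Counter
from fractions import Fraction

# INPUT_H1 rows as (phi, g, v0, s0), copied from the slide.
input_h1 = [
    (1, 32, 0, 5000), (1, 32, 10, 5000), (1, 32, 20, 5000),
    (1, 32.2, 0, 5000), (1, 32.2, 10, 5000), (1, 32.2, 20, 5000),
]


def repair_key(rows, key_index, value_index):
    """For each key, turn its distinct values into alternatives of one random
    variable, weighted by how often each value occurs."""
    freq = Counter((r[key_index], r[value_index]) for r in rows)
    by_key = {}
    for (key, value), count in sorted(freq.items()):
        by_key.setdefault(key, []).append((value, count))
    result = []
    for var_no, (key, alternatives) in enumerate(by_key.items(), start=1):
        total = sum(count for _, count in alternatives)
        for dom, (value, count) in enumerate(alternatives, start=1):
            result.append((key, f"x{var_no}", dom, value, Fraction(count, total)))
    return result


y_g = repair_key(input_h1, key_index=0, value_index=1)
y_v0 = repair_key(input_h1, key_index=0, value_index=2)
for row in y_g:
    print(row)
```

For g the sketch yields two alternatives of variable x1 with probability 1/2 each, and for v0 three alternatives with probability 1/3 each, matching the 50% and 33% figures quoted for INPUT_H1.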

Synthesizing Prediction as a query
 As g, ν -> a is in Σ1, we can predict a with a query
over the uncertain relations Y_g and Y_R
– create table Y1_a as select H.phi, H.upsilon, H.a
from H1_OUTPUT_a as H, Y_R as R, Y1_g as G,
(select min(tid) as tid, phi, g from H1_INPUT group
by phi, g) as U
where H.tid=U.tid and G.phi=U.phi and G.g=U.g
and H.phi=R.phi and H.upsilon=R.upsilon

Predicted Y-DB relation Y[a]
Υ[a]
Φ   g      a      u
1   32     32     0.5
1   32.2   32.2   0.5
Υ-DB enables data-oriented uncertainty
quantification of predicting variables

Sum Up
 Υ-DB is a probabilistic database designed to
manage hypotheses as data
 In Υ-DB, both the intrinsic uncertainty of the
model, Υ[R], and that of the predicting variables
(e.g. Υ[a]) are computed automatically
 Υ-DB and Research Lattices are the basis for
managing hypotheses over Big Data in
science (and, we believe, in any domain)

Ordering Hypotheses
 Different competing hypotheses must be
placed into context according to their
phenomenon explanation capacity
– predicting capability (predicting variables);
– assumptions and constraints

Hypotheses in the Dark Energy
Survey Project
 Phenomenon
– The expansion of the universe is accelerating
 Discovered in 1998 during supernovae investigation
 Supported by redshift observations of faraway supernovae
 Hypotheses
– A new kind of energy, Dark Energy, drives the acceleration
– The density of the universe is not uniform
 Evidence
– Gravitational lenses
– Galaxy clusters

Research Lattices – structure
hypotheses of a phenomenon
[Figure: research lattice between top (⊤) and bottom (⊥) with the nodes Dark Energy, Non uniform universe, Weak lensing, Galaxy clustering and Earth special location]
[B. Gonçalves, F. Porto, Research Lattices, AMW 2013]

Research Lattices
 Each node is a hypothesis
 Given two hypotheses h1 and h2 in a research lattice (R.L.), if
h1 ≥ h2 then h1 is more general than h2;
 Top corresponds to all the knowledge of a
domain;
 Bottom is the empty representation, the lack of
knowledge;

Research Lattice: Acceleration
[Figure: research lattice between top (⊤) and bottom (⊥) with the hypotheses]
– h1: Law of free fall, d^2 s/dt^2 = 9.8
– h2: Newton's first law
– h3: Newton's second law, F = m a_g
– h4: Centripetal acceleration, a_c = 4π^2 r / T^2
– h5: Kepler's third law, r^3 / T^2 = c
– h6: Law of universal gravitation, F_g = G M m / r^2
– h7: Inverse-square law of distance, a_c ∝ 1 / r^2

Research Lattice for the Human
Cardio Vascular System

Research lattice Operations
 Add/delete hypotheses
– consistently keep the partial ordering;
– automatic placement of hypotheses in the RL
 Querying
– finding hypotheses based on “Free Fall”
hypothesis
– find competing hypotheses wrt “Dark Energy”
Hypothesis
 How to assess the predictive capacity of a
hypothesis?
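The querying operations can be pictured with a small partial-order structure. This is a sketch under loud assumptions: the class and method names are hypothetical, deletion and automatic placement are not covered, and the edges used in the example are guessed for illustration rather than taken from the Dark Energy lattice figure.

```python
class ResearchLattice:
    """Hypotheses ordered by 'is more general than', stored as direct edges."""
    TOP, BOTTOM = "top", "bottom"

    def __init__(self):
        self.below = {self.TOP: {self.BOTTOM}, self.BOTTOM: set()}

    def add(self, hypothesis, parents):
        """Place a hypothesis directly under the given, already present, parents."""
        self.below.setdefault(hypothesis, set()).add(self.BOTTOM)
        for parent in parents:
            self.below[parent].add(hypothesis)

    def geq(self, h1, h2):
        """True if h1 >= h2, i.e. h2 is reachable from h1 (reflexive)."""
        stack, seen = [h1], set()
        while stack:
            node = stack.pop()
            if node == h2:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.below[node])
        return False

    def competing(self, h):
        """Hypotheses that are incomparable with h."""
        return sorted(
            x for x in self.below
            if x not in (self.TOP, self.BOTTOM, h)
            and not self.geq(x, h) and not self.geq(h, x)
        )


lattice = ResearchLattice()
# Hypothetical edges, for illustration only.
lattice.add("Dark Energy", [ResearchLattice.TOP])
lattice.add("Non uniform universe", [ResearchLattice.TOP])
lattice.add("Weak lensing", ["Dark Energy"])
lattice.add("Galaxy clustering", ["Dark Energy"])
lattice.add("Earth special location", ["Non uniform universe"])
print(lattice.competing("Dark Energy"))
```

In this reading, "competing" means incomparable in the partial order; other definitions are possible.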

Sum up
 A Research Lattice enables a formal yet bounded
representation of a research domain
 Hypotheses encode scientists’ interpretation
of the phenomena they study

MULTIDIMENSIONAL
REPRESENTATION

Dealing with Space-Time
dimension on Hypotheses
 Most phenomena occur in space-time;
 In computational models, simulations use
meshes to model the physical domain;
 Predicting variables are computed at a point
of a multidimensional space
– 3D, 1D, 4D, etc.
 Data representation is a multidimensional
matrix [ArrayDB, SciDB, …]

1D and 3D Meshes representations
of human artery

Multidimensional Array
Representation
Is it efficient for processing queries over meshes such as those of the HCVS (Human Cardio Vascular System)?

SciDB in 20 sec
 The unit of representation is the multidimensional
array
 each dimension has a name and a size
 a reference to all dimensions in an array leads to
a cell
 a cell has many attributes – columnar store
 An array may be partitioned along its dimensions
 Two query languages: AQL and AFL

Loading Simulation data
[Figure: Simulation output -> Wrapper -> unidimensional array Geometry3d_raw -> redimension_store -> multidimensional array Geometry3d]

A multidimensional array in SciDB
CREATE ARRAY Geometry3d
< velocity_x:double, velocity_y:double, velocity_z:double,
pressure:double, displacement_x:double,
displacement_y:double,displacement_z:double >
[simulation_number=0:9,1,0, time_step=0:30720,1920,0,
x_axis=0:39,40,0, y_axis=0:39,40,0, z_axis=0:39,40,0]
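To see how this declaration partitions the data, the chunk layout can be computed by hand. The sketch below assumes that each dimension is written as low:high,chunk_length,overlap and that the bounds are inclusive; the helper name `chunk_counts` is hypothetical.

```python
import math

# (low, high, chunk_length) copied from the CREATE ARRAY statement; overlap is 0 everywhere.
dimensions = {
    "simulation_number": (0, 9, 1),
    "time_step": (0, 30720, 1920),
    "x_axis": (0, 39, 40),
    "y_axis": (0, 39, 40),
    "z_axis": (0, 39, 40),
}


def chunk_counts(dims):
    """Number of chunks along each dimension, assuming inclusive bounds."""
    return {
        name: math.ceil((high - low + 1) / chunk)
        for name, (low, high, chunk) in dims.items()
    }


counts = chunk_counts(dimensions)
total_chunks = math.prod(counts.values())
cells_per_chunk = math.prod(chunk for _, _, chunk in dimensions.values())
print(counts, total_chunks, cells_per_chunk)
```

Under these assumptions time_step has 30721 positions, one more than 16 × 1920, so a 17th chunk holds a single time step per simulation; declaring the range as 0:30719 would avoid that nearly empty chunk.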

Challenge to map an irregular mesh
into a regular array structure

SimDB
 We developed SimDB: A layer on top of
SciDB to map irregular meshes into a regular
array data representation
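The slides do not describe how SimDB performs this mapping. Purely as an illustration of the problem, the sketch below snaps irregular mesh nodes onto a regular grid by uniform binning over the bounding box; the function names, the 40-cell resolution and the sample points are all assumptions, and this should not be read as SimDB's algorithm.

```python
def to_grid_index(point, lower, upper, cells_per_axis):
    """Map a 3D coordinate to integer cell indices of a regular grid."""
    index = []
    for value, lo, hi in zip(point, lower, upper):
        if hi == lo:
            index.append(0)  # degenerate axis: everything falls in cell 0
            continue
        scaled = (value - lo) / (hi - lo) * cells_per_axis
        index.append(min(int(scaled), cells_per_axis - 1))
    return tuple(index)


def map_mesh(points, cells_per_axis=40):
    """Group mesh node ids by the regular-grid cell they fall into."""
    lower = [min(p[i] for p in points) for i in range(3)]
    upper = [max(p[i] for p in points) for i in range(3)]
    grid = {}
    for node_id, p in enumerate(points):
        cell = to_grid_index(p, lower, upper, cells_per_axis)
        grid.setdefault(cell, []).append(node_id)
    return grid


# Invented sample nodes; nodes 2 and 3 are close enough to share a cell.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 4.0), (0.5, 1.0, 2.0), (0.51, 1.0, 2.0)]
grid = map_mesh(points)
print(grid)
```

The example shows the two difficulties of a naive mapping: several mesh nodes can collide in one cell, and most cells of the regular array stay empty.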

Experiment Results

Experiments set-up
• 4 Servers and 16 VMs
• 4 Queries
VMs \ Servers   1        2              4
1               10GB     -              -
2               5GB      5GB / 10GB     -
4               2.5GB    2.5GB / 5GB    2.5GB / 5GB / 10GB
8               -        2.5GB          2.5GB / 5GB
16              -        -              2.5GB

Experiments
• 8 SciDB instances per VM
• 1, 2, 4, 8, 16 VMs
• 7 queries
• 2 arrays: (S, T) and (T, S)
• 30 executions

Results
[Figure: execution time (mm:ss, axis from 00:00 to about 04:19) of Query 1 to Query 4 for each VM (Server) configuration: 1(1), 2(1), 2(2), 2(2), 4(1), 4(2), 4(2), 4(4), 4(4), 4(4), 8(2), 8(4), 8(4), 16(4)]
Queries
1. select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
2. Geometry3d group by simulation_number
3. Geometry3d group by time_step
4. Geometry3d where simulation_number < 2 group by time_step

Results
Queries
1. select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
2. group by simulation_number
3. group by time_step
4. where simulation_number < 2 group by time_step
5. where simulation_number = 0 and time_step < 1920 group by time_step
6. where simulation_number < 0 and time_step < 0 group by time_step
7. where (time_step % 512) = 0 group by time_step
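For readers unfamiliar with array queries, the semantics of query 1 can be sketched over a toy stand-in for Geometry3d. The dictionary layout and the pressure values are invented for illustration, and the function name is hypothetical; this shows what the query computes, not how SciDB executes it.

```python
from collections import defaultdict

# Toy stand-in for Geometry3d: {(simulation_number, time_step, x, y, z): pressure}.
# The values are invented for illustration only.
cells = {
    (0, 0, 0, 0, 0): 10.0,
    (0, 1, 0, 0, 0): 14.0,
    (0, 1920, 0, 0, 0): 99.0,  # filtered out by time_step < 1920
    (1, 0, 0, 0, 0): 20.0,
    (1, 5, 1, 0, 0): 22.0,
}


def avg_pressure_by_simulation(cells, max_time_step):
    """avg(pressure) where time_step < max_time_step, grouped by simulation_number."""
    sums = defaultdict(lambda: [0.0, 0])
    for (sim, step, *_), pressure in cells.items():
        if step < max_time_step:
            sums[sim][0] += pressure
            sums[sim][1] += 1
    return {sim: total / count for sim, (total, count) in sums.items()}


result = avg_pressure_by_simulation(cells, max_time_step=1920)
print(result)
```

With the chunking of Geometry3d, the predicate time_step < 1920 touches only the first time_step chunk of each simulation, which is presumably why it was chosen as a benchmark filter.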

Final Remarks
 We argue that “Big Data analytics requires a scientific
approach based on hypothesis formulation and
follow-up”
 Υ-DB is an innovative approach for Big Data
management;
– Reflects the hypothesis-as-data principle
– Is formal and guards the equivalence between data and models
– Models uncertainty in the model and in the data
– Must be extended
 to cope with observation validation (Bayesian model)
 to support multidimensional representation
– Read our paper at VLDB 2014
 SimDB is an extension to SciDB to efficiently store
irregular meshes on multidimensional array systems

Final Remarks
• space-time models require a
multidimensional data model for hypothesis
as data management;
• SciDB is a parallel multidimensional array
DBMS
• We want to extend Υ-DB to multidimensional
arrays
• still very immature

This is DEXL team work
PhD candidate Bernardo Gonçalves
IBM PhD Fellowship 2013-2014
(bgonc@lncc.br)
Dr. Ramon Gomes Costa
(ramongc@lncc.br)
MSc student Hermano Lustosa
(hllustosa@gmail.com)

Thank you!
http://dexl.lncc.br

EMC Summer School 2013
Olympic Laboratory
 Objective
– To study high performance sports as a science discipline
– To build the first sports laboratory in South America
 US$ 10M project sponsored by FINEP (funding
agency)
 Departments:
– Biochemistry, physiology, genetics, nutrition, computational
modeling, computer science

Our task
 To support athletes’ follow-up data
– Athletes’ training
– Variation in biochemical elements
– Variation in biometric variables
 More recently
– For some modalities, integrate meteorological
conditions

Analyses Board

Athletes follow-up database
 Athletes follow-up data modeled as trajectories
– Register measurements from athletes in different training
states
 Trajectory model
– Ordered set of measurements
– Division of time in training states
– Materialized view limited in time-range
– Imprecise measurements
 Not detected =0
 < x -> ]0,x[
 y , y  x
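One way to picture imprecise measurements is as intervals. The sketch below covers only the two rules that are legible on the slide ("not detected" and "< x"); treating an exact reading as a single-point interval is an assumption added here, and the names `Interval` and `parse_reading` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    low: float
    high: float
    low_open: bool = False
    high_open: bool = False

    def contains(self, value):
        above = value > self.low if self.low_open else value >= self.low
        below = value < self.high if self.high_open else value <= self.high
        return above and below


def parse_reading(text):
    """Turn a lab reading into an interval of possible values."""
    text = text.strip()
    if text.lower() == "not detected":
        return Interval(0.0, 0.0)  # rule from the slide: not detected = 0
    if text.startswith("<"):
        # rule from the slide: "< x" means the open interval ]0, x[
        return Interval(0.0, float(text[1:]), low_open=True, high_open=True)
    value = float(text)  # assumption: an exact reading is a single point
    return Interval(value, value)


below_limit = parse_reading("< 2.5")
print(below_limit, below_limit.contains(1.0))
```

Keeping the interval instead of a single number lets later queries decide how to treat readings below the detection limit.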

More on Athlete’s Trajectories
 Stops – modelled as measurements
– Qualified according to the athlete’s training state
– Training states (recovery, training, rest, …)
 Moves – extrapolation between two stops
 Trajectory – the set of measurements,
ordered in time, and limited in time according
to some criterion (e.g. a training program).
– Measurements of the same observable element
– Measurements of the same athlete
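The stop/move vocabulary above can be sketched as a small data structure. The names `Stop`, `build_trajectory` and `move`, the sample values, and the choice of linear interpolation for a move are assumptions made for this illustration; the slides do not say how moves are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stop:
    day: float   # time of the measurement
    value: float # measured quantity for one athlete and one observable element
    state: str   # training state, e.g. "rest", "training", "recovery"


def build_trajectory(stops, start, end):
    """Measurements inside [start, end], ordered in time."""
    return sorted((s for s in stops if start <= s.day <= end), key=lambda s: s.day)


def move(trajectory, day):
    """Estimate the value between two consecutive stops (linear, by assumption)."""
    for left, right in zip(trajectory, trajectory[1:]):
        if left.day <= day <= right.day:
            if right.day == left.day:
                return left.value
            fraction = (day - left.day) / (right.day - left.day)
            return left.value + fraction * (right.value - left.value)
    return None  # outside the time range of the trajectory


# Invented sample measurements.
stops = [
    Stop(6, 6.0, "recovery"), Stop(0, 4.0, "rest"),
    Stop(2, 8.0, "training"), Stop(30, 5.0, "rest"),
]
trajectory = build_trajectory(stops, start=0, end=10)
print([s.day for s in trajectory], move(trajectory, 1), move(trajectory, 4))
```

The time window passed to `build_trajectory` plays the role of the limiting criterion, for example the duration of a training program.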

Metaphoric Trajectory

Challenges
 Integrating athlete’s trajectory with weather
information
 How to efficiently store metaphoric
trajectories?
– TrajStore [Cudré-Mauroux et al., ICDE 2010]
– SciDB
 How to express and efficiently process
similar-trajectory queries
