Ibmr 2014
1. IBM Research Brazil
Fabio Porto (fporto@lncc.br),
LNCC – MCTI
DEXL Lab (dexl.lncc.br)
Challenges in Scientific
Big Data Management
2. Outline
Introduction
Big Data in Science
Hypothesis-Driven Research
Hypothesis as Data – Upsilon-DB
SimDB
Final remarks
4. LNCC - MCTI
Graduate Course in Computational Modelling
– CAPES 6
BioInfo Laboratory
– High throughput sequencing
Coordinator of INCT-MACC
– Medicine Supported by Computational Science
Coordinator of SINAPAD
– HPC National System
Thematic laboratories
– ACIMA – Augmented Reality
– MARTIN – Network and Software Engineering
– DEXL – Big Data
– COMCIDIS – Distributed Systems
– HEMOLAB – Cardio Vascular System Modelling
5. DEXL – Ongoing Projects
DEXL Laboratory
– What-if data analysis
– Astronomy Data Management
– Simulation Data Management
– Hypothesis Upsilon-DB
– Gene regulation Network
– Scientific workflow optimization
– Seismic Data Management (EMC)
– Bioknowlogy
– Olympic Laboratory
6. Objective
To provide scientists with an in-silico cockpit
from which scientific data and metadata can
be efficiently managed
7. IBM Research Brazil
“Scientists are spending most of their time
manipulating, organizing, finding and moving
data, instead of researching. And it’s going to
get worse”
– Office of Science, Data-Management Challenge
Report, DoE, 2004
8. Big Data in science
An expression that reflects the data deluge
produced in science
– Astronomy, Astrophysics
– Biology, Neuroscience
– Sports
– Geology, Geophysics, etc.
9. Big Data - Dimensions
[Figure: the three dimensions of Big Data]
– Volume: MB, GB, TB, PB
– Velocity: batch, online, sensors and alerts, real time
– Variety: file, database; uncertainty, heterogeneity, evolution
10. A challenge on volume
The Dark Energy Survey (DES) project expects to
produce 100 PB in 10 years (source: personal comm.)
– 5000 deg² sky coverage, “all” objects, “perfect” accuracy
Yahoo claims to manage 2 PB of click data in a
modified PostgreSQL
EMBL nucleotide database: 260 Gbases
High-throughput sequencing with Roche 454 technology
– Sequences 400-600 million bases in 10 hours
– E.g. a project at the Max Planck Institute aims at sequencing the
whole genome of the Neanderthal; at 3 billion base pairs, it is
expected to take 2 years to finish.
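To put the volumes quoted on this slide into daily terms, the short calculation below converts them into rates. It is only a back-of-the-envelope sketch: the decimal petabyte, the 365-day year and the idea of a single sequencing pass are assumptions made here, not figures from the talk.

```python
# Back-of-the-envelope rates for the volumes quoted on the slide.
# Assumptions (not from the slide): decimal units and a 365-day year.
PB = 10**15  # bytes in a decimal petabyte
TB = 10**12  # bytes in a decimal terabyte

des_total_bytes = 100 * PB          # DES: 100 PB ...
des_days = 10 * 365                 # ... over 10 years
des_tb_per_day = des_total_bytes / des_days / TB

# Roche 454: 400-600 million bases per 10-hour run.
genome_bases = 3_000_000_000        # 3 billion base pairs
runs_low = genome_bases / 600_000_000    # best case, one pass over the genome
runs_high = genome_bases / 400_000_000   # worst case, one pass over the genome

print(f"DES average ingest: about {des_tb_per_day:.1f} TB/day")
print(f"10-hour runs for one pass over 3 Gbases: {runs_low:.1f} to {runs_high:.1f}")
```

The run count covers a single pass only. Real sequencing projects read each region many times over, so it is a lower bound and not an estimate of project duration.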
13. BIG DATA in Science
The scientific process is being remodelled to take place
within an in-silico environment
Powerful instruments:
– Digital telescopes
– DNA sequencers
– Mass spectrometers
Huge simulations
– Weak lensing
– Human cardiovascular system
Massive amounts of information streams in and out…
Hypothesis-driven research supported by in-silico
infrastructure, methods, models…
15. Big Data urgent call in e-science
Scientific life cycle metadata management;
Scientific Hypothesis formulation and
validation;
Scientific Data management;
Scientific data processing architecture;
17. “To see what is in front of one’s nose
needs a constant struggle”
George Orwell
18. To make sense of Big Data we
need models
[Peter Haas – Data is Dead without what-if models,
PVLDB 2011]
Scientific models are formal interpretations of
phenomena
Hypotheses are formalized as models
The scientific life cycle is driven by hypothesis
validation
19. Hypothesis-driven Big Data
analyses
Scientific hypothesis – a model of the scientists’
interpretation of a phenomenon;
Different hypotheses coexist in a scientific
domain;
Scientific method – prove hypotheses
Big Data analyses – explore hypotheses
In new Big Data prediction analyses – identify the
first principles that guide predictions – deep vs
shallow prediction
20. Big Data hypothesis-driven
life cycle
[Figure: hypothesis-driven life cycle, adapted from [Mattoso et al. 2010]. Stages: Hypotheses and experiment goals; Experiment and workflow design; Workflow preparation; Workflow execution; Post-execution analysis. Supporting components: Workflow repository, Data sources, Provenance store, Monitoring, Hypotheses database, Analysis results.]
23. Υ-DB Conceptual Model
[Figure: entity-relationship diagram of the Υ-DB conceptual model. Entities: Phenomenon, Continuous Ph_Process, Discrete Ph_Process, State, Event, Space-Time Dimension, Physical Quantities, Phenomenon physical quantities, Hypothesis, Scientist, Mathematical Model, Mathematical Formulae (XML), Formal Representation, Formal Language, elements (constant, function, equation, variable), Observation Element, Simulated Element, Data View (query over data view), Computational Model View, Mesh, Mesh Data view, Discrete Phenomenon Simulation, Domain ontology URL. Relationship labels: explains, formulatedBy, isTheBlendOf, isBasedOn, represented_as, compared_with, represented with, modeled_as, refers-to, transforms, represents, topologically modeled by, isAuthor.]
[Porto et al. ER 2008, ER 2012]
24. Hypotheses as Data – Upsilon DB
From the triangular equivalence, we derive that
– Hypothesis = Model = Data
How can we infer data from a model?
[Bernardo Gonçalves, Fabio Porto, PVLDB 2014]
25. Hypothesis as Models
Hypothesis: Law of free fall
If a body falls from rest, its velocity at any point is
proportional to the time it has been falling.
Scientific Model:
a(t) = -g
v(t) = -g t + v0
s(t) = -(g/2) t^2 + v0 t + s0
Computational Model:
for k = 0:n
    t = k * dt;                    % elapsed time at step k
    v = -g*t + v0;                 % velocity
    s = -(g/2)*t^2 + v0*t + s0;    % position
    t_plot(k+1) = t;               % +1 because indices start at 1
    v_plot(k+1) = v;
    s_plot(k+1) = s;
end
26. Hypothesis - From Models to Data
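for k = 0:n
    t = k * dt;                    % elapsed time at step k
    v = -g*t + v0;                 % velocity
    s = -(g/2)*t^2 + v0*t + s0;    % position
    t_plot(k+1) = t;               % +1 because indices start at 1
    v_plot(k+1) = v;
    s_plot(k+1) = s;
end
[Figure: the computational model is handed to a SOLVER, which maps Input to Output]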
27. Hypothesis as Data
Hypothesis: Law of free fall
If a body falls from rest, its velocity at any point is
proportional to the time it has been falling.
a(t) = -g
v(t) = -g t + v0
s(t) = -(g/2) t^2 + v0 t + s0
for k = 0:n
    t = k * dt;                    % elapsed time at step k
    v = -g*t + v0;                 % velocity
    s = -(g/2)*t^2 + v0*t + s0;    % position
    t_plot(k+1) = t;               % +1 because indices start at 1
    v_plot(k+1) = v;
    s_plot(k+1) = s;
end
Free_Fall
t    v      s
0    0      5000
1    -32    4984
2    -64    4936
3    -96    4856
4    -128   4744
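The step from the computational model to the Free_Fall relation can be reproduced with a few lines of Python. This is a minimal sketch: the function name and the parameter values g = 32, v0 = 0, s0 = 5000, dt = 1 are assumptions chosen because they match the tabulated rows; the slide itself does not state them.

```python
def free_fall_table(g, v0, s0, dt, n):
    """Evaluate the free-fall model at t = 0, dt, ..., n*dt and return (t, v, s) rows."""
    rows = []
    for k in range(n + 1):
        t = k * dt
        v = -g * t + v0                      # v(t) = -g t + v0
        s = -(g / 2) * t**2 + v0 * t + s0    # s(t) = -(g/2) t^2 + v0 t + s0
        rows.append((t, v, s))
    return rows

# Assumed inputs (not stated on the slide, chosen to match its table).
rows = free_fall_table(g=32, v0=0, s0=5000, dt=1, n=4)
for t, v, s in rows:
    print(t, v, s)
```

Each returned tuple is one row of the relation, so the hypothesis is materialised as data once the input parameters are fixed.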
28. Hypothesis as Data – Computing a
DB schema
Mathematical Models
– formalize hypotheses
– equations establish a functional dependency
from dimensions and parameters to
predicting variables
e.g. g, t, v0 -> v
– Derive a DB schema from the functional dependencies (FDs) extracted from
equations
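One way to picture the extraction of functional dependencies from equations is to read every symbol on the right-hand side as a determinant of the variable on the left. The sketch below does this with Python's standard `ast` module. It is an illustration only and not the Υ-DB algorithm; the function name and the `upsilon` label (standing for the hypothesis identifier ν) are assumptions introduced here.

```python
import ast

def functional_dependency(equation, hypothesis_id="upsilon"):
    """Read 'lhs = rhs' and return (determinants, dependent variable).

    Every name on the right-hand side, plus the hypothesis identifier,
    is taken to determine the left-hand side.
    """
    assign = ast.parse(equation).body[0]
    dependent = assign.targets[0].id
    rhs_names = {node.id for node in ast.walk(assign.value)
                 if isinstance(node, ast.Name)}
    return frozenset(rhs_names | {hypothesis_id}), dependent

# The free-fall equations, written as Python assignments.
equations = ["a = -g", "v = -g*t + v0", "s = -(g/2)*t**2 + v0*t + s0"]
sigma = [functional_dependency(eq) for eq in equations]

for determinants, dependent in sigma:
    print(", ".join(sorted(determinants)), "->", dependent)
```

The dependency that assigns the input parameters to the phenomenon identifier does not come from an equation, so it would have to be added separately.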
29. Hypothesis as Data
In the Free Fall example:
– Σ1 = { Φ -> g, v0, s0
         g, ν -> a
         g, v0, t, ν -> v      (from v(t) = -g t + v0)
         g, v0, s0, t, ν -> s }
Observe that Φ and ν are epistemological
variables referring to the phenomenon and
the hypothesis, respectively.
30. Hypothesis as Data - schema
Φ -> g, v0, s0
– defines the model parameters
– is expected to be violated, which reproduces the
uncertainty in the model input
– such uncertainty contributes to the quality of the
hypothesis
From Σ1, the schema for predicting a under
hypothesis h1 would be:
– h1(Φ, ν, a)
From Σ1, the input parameters are defined
as (with key violation on Φ):
– h1_input(Φ, g, v0, s0)
31. Hypothesis as Uncertain Data
INPUT_H1
Φ    g      v0    s0
1    32     0     5000
1    32     10    5000
1    32     20    5000
1    32.2   0     5000
1    32.2   10    5000
1    32.2   20    5000
Uncertainty: 50% (g takes two values)
Uncertainty: 33% (v0 takes three values)
32. Uncertainty Introduction
Υ-DB is a probabilistic database
[D. Suciu et al., Probabilistic Databases, 2011]
– a Y-relation includes certain and conditional columns;
– a conditional column is a pair (Vi, Di), where Vi is a
random variable and Di is one of its possible values;
– e.g.:
create table Y_g as select U.phi, U.g
from (repair key phi in (select phi, g, count(*) as Fr
from INPUT_H1 group by phi, g) weight by Fr) as U;
33. Hypothesis as Uncertain Data
create table Y_g as select U.phi, U.g
from (repair key phi in (select phi, g, count(*) as Fr
from INPUT_H1 group by phi, g) weight by Fr) as U;
INPUT_H1
Φ    g
1    32
1    32
1    32
1    32.2
1    32.2
1    32.2
Y_g
Φ    V -> D     g
1    x1 -> 1    32
1    x1 -> 2    32.2
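The effect of the repair-key step on INPUT_H1 can be mimicked in plain Python: group by (Φ, g), count the rows, and normalise the counts per Φ into probabilities. This is a sketch of the semantics under that reading, not the probabilistic database engine itself; the function name is an assumption.

```python
from collections import Counter

# The (phi, g) projection of INPUT_H1 as shown on the slide.
input_h1 = [(1, 32.0), (1, 32.0), (1, 32.0), (1, 32.2), (1, 32.2), (1, 32.2)]

def repair_key(rows):
    """Turn duplicate-weighted (phi, g) rows into alternatives with probabilities.

    For every phi, each distinct g becomes one alternative whose probability
    is its row count divided by the total row count for that phi.
    """
    counts = Counter(rows)
    totals = Counter()
    for (phi, _g), n in counts.items():
        totals[phi] += n
    return {(phi, g): n / totals[phi] for (phi, g), n in counts.items()}

y_g = repair_key(input_h1)
for (phi, g), p in sorted(y_g.items()):
    print(phi, g, p)
```

Both values of g appear three times, so each alternative receives probability 0.5, in line with the 50% uncertainty shown for INPUT_H1.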
34. Synthesizing Prediction as a query
As g, ν -> a in Σ1, we can predict a as a query
on the uncertain relations Y_g and Y_R
– create table Y1_a as select H.phi, H.upsilon, H.a
from H1_OUTPUT_a as H, Y_R as R, Y1_g as G,
(select min(tid) as tid, phi, g from H1_INPUT group
by phi, g) as U
where H.tid=U.tid and G.phi=U.phi and G.g=U.g
and H.phi=R.phi and H.upsilon=R.upsilon
35. Predicted Y-DB relation Y[a]
Υ[a]
Φ    g      a      u
1    32     32     0.5
1    32.2   32.2   0.5
Υ-DB enables data-oriented uncertainty
quantification of predicting variables
36. Sum Up
Υ-DB is a probabilistic database designed to
manage hypotheses as data
In Υ-DB, both the intrinsic uncertainty of the
model, Υ[R], and that of the predicting variables
(e.g. Υ[a]) are automatically computed
Υ-DB and Research Lattices are the basis for
managing hypotheses over Big Data in
science (and, we believe, in any domain)
38. Ordering Hypotheses
Different competing hypotheses must be
placed into context according to their
phenomenon explanation capacity
– predicting capability (predicting variables);
– assumptions and constraints
39. Hypotheses in the Dark Energy
Survey Project
Phenomenon
– The expansion of the universe is accelerating
Discovered in 1998 during supernovae investigation
Supported by redshift observations of faraway supernovae
Hypotheses
– A new behaviour, Dark Energy, pushes the acceleration
– The density of the Universe is not uniform
Evidence
– Gravitational lenses
– Galaxy clusters
40. Research Lattices – structure
hypotheses of a phenomenon
[Figure: research lattice between a top element ⊤ and a bottom element ⊥, with nodes: Dark Energy, Non-uniform universe, Weak lensing, Galaxy clustering, Earth special location]
[B. Gonçalves, F. Porto, Research Lattices, AMW 2013]
41. Research Lattices
Each node is a hypothesis
Given two hypotheses h1 and h2 in a research lattice (RL), if
h1 ≥ h2 then h1 is more general than h2;
Top corresponds to all knowledge of a
domain;
Bottom is the empty representation, i.e. lack of
knowledge;
42. Research Lattice: Acceleration
[Figure: research lattice for acceleration, between top ⊤ and bottom ⊥, with nodes:]
– h1: Law of free fall, d^2s/dt^2 = 9.8
– h2: Newton's first law
– h3: Newton's second law, F = m a_g
– h4: Centripetal acceleration, a_c = 4π^2 r / T^2
– h5: Kepler's third law, r^3 / T^2 = c
– h6: Law of universal gravitation, F_g = G M m / r^2
– h7: Inverse-square law of distance, a_c ∝ 1 / r^2
44. Research lattice Operations
Add/delete hypotheses
– consistently keep the partial ordering;
– automatic placement of hypotheses in the RL
Querying
– finding hypotheses based on “Free Fall”
hypothesis
– find competing hypotheses wrt “Dark Energy”
Hypothesis
How to assess the predictive capacity of a
hypothesis?
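The add and query operations listed on this slide can be sketched over an explicit cover relation, where an edge (general, specific) means general ≥ specific. Everything in the block is an assumption made for illustration: the edges are not read off the lattice figures, the function names are hypothetical, and "competing" is interpreted here as "incomparable in the partial order", which the slides do not define.

```python
def more_general(h1, h2, covers):
    """True if h1 >= h2: h2 equals h1 or is reachable from it via cover edges."""
    if h1 == h2:
        return True
    stack, seen = [h1], {h1}
    while stack:
        node = stack.pop()
        for general, specific in covers:
            if general == node and specific not in seen:
                if specific == h2:
                    return True
                seen.add(specific)
                stack.append(specific)
    return False

def add_hypothesis(covers, new, parents, children):
    """Place `new` below `parents` and above `children`, keeping the order consistent."""
    for p in parents:
        for c in children:
            if not more_general(p, c, covers):
                raise ValueError(f"{p} is not more general than {c}")
    direct = {(p, c) for p in parents for c in children}   # now implied via `new`
    added = {(p, new) for p in parents} | {(new, c) for c in children}
    return (covers - direct) | added

def competing(h, covers):
    """Hypotheses that are neither more general nor more specific than h."""
    nodes = {n for edge in covers for n in edge}
    return sorted(x for x in nodes
                  if not more_general(h, x, covers) and not more_general(x, h, covers))

# Illustrative lattice: two alternative hypotheses between top and bottom.
covers = {("top", "Dark Energy"), ("top", "Non uniform universe"),
          ("Dark Energy", "bottom"), ("Non uniform universe", "bottom")}
covers = add_hypothesis(covers, "h_new", parents={"Dark Energy"}, children={"bottom"})

print(competing("Dark Energy", covers))
```

A real research lattice would derive the placement automatically from the hypotheses themselves; here the caller supplies parents and children by hand.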
45. Sum up
A research lattice enables a formal yet bounded
representation of a research domain
Hypotheses are scientists’ encodings of their
interpretation of the studied phenomenon
47. Dealing with Space-Time
dimension on Hypotheses
Most phenomena occur in space-time;
In computational models, simulations use
meshes to model the physical domain;
Predicting variables are computed at a point
of a multidimensional space
– 3D, 1D, 4D, etc.
Data representation is a multidimensional
matrix [ArrayDB, SciDB, …]
48. 1D and 3D Meshes representations
of human artery
50. SciDB in 20 sec
The unit of representation is the multidimensional
array
– each dimension has a name and a size
– a reference to all dimensions in an array leads to
a cell
– a cell has many attributes (columnar store)
An array may be partitioned along its dimensions
Two query languages: AQL and AFL
53. A multidimensional array in SciDB
CREATE ARRAY Geometry3d
< velocity_x:double, velocity_y:double, velocity_z:double,
pressure:double, displacement_x:double,
displacement_y:double,displacement_z:double >
[simulation_number=0:9,1,0, time_step=0:30720,1920,0,
x_axis=0:39,40,0, y_axis=0:39,40,0, z_axis=0:39,40,0]
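To get a feel for the size of this array, the sketch below reads each dimension of the declaration as (low, high, chunk length, overlap) and counts cells and chunks. It is plain arithmetic on the declaration, with two assumptions made here: every cell is populated, and each double takes 8 bytes with no compression.

```python
import math

# (name, low, high, chunk_length, overlap), copied from CREATE ARRAY Geometry3d.
dimensions = [
    ("simulation_number", 0, 9, 1, 0),
    ("time_step", 0, 30720, 1920, 0),
    ("x_axis", 0, 39, 40, 0),
    ("y_axis", 0, 39, 40, 0),
    ("z_axis", 0, 39, 40, 0),
]
n_attributes = 7        # velocity_x/y/z, pressure, displacement_x/y/z
bytes_per_double = 8    # assumption: uncompressed IEEE 754 doubles

# Number of coordinate values along each dimension.
sizes = [high - low + 1 for _, low, high, _, _ in dimensions]

total_cells = math.prod(sizes)
total_chunks = math.prod(
    math.ceil(size / chunk) for size, (_, _, _, chunk, _) in zip(sizes, dimensions)
)
cells_per_full_chunk = math.prod(
    min(chunk, size) for size, (_, _, _, chunk, _) in zip(sizes, dimensions)
)
dense_bytes = total_cells * n_attributes * bytes_per_double

print(f"cells: {total_cells:,}")
print(f"chunks per attribute: {total_chunks}")
print(f"cells in a full chunk: {cells_per_full_chunk:,}")
print(f"dense upper bound: {dense_bytes / 10**12:.2f} TB")
```

Because time_step runs from 0 to 30720 inclusive, it has 30721 values, so the last of its 17 chunks holds a single time step.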
54. Challenge to map an irregular mesh
into a regular array structure
55. SimDB
We developed SimDB: A layer on top of
SciDB to map irregular meshes into a regular
array data representation
59. Experiments
• 8 SciDB instances per VM
• 1, 2, 4, 8, 16 VMs
• 7 queries
• 2 arrays, (S, T) and (T, S)
• 30 executions
60. Results
[Figure: execution time (axis from 00:00.00 to 04:19.20) of Queries 1 to 4 per VM (Server) configuration: 1(1), 2(1), 2(2), 2(2), 4(1), 4(2), 4(2), 4(4), 4(4), 4(4), 8(2), 8(4), 8(4), 16(4)]
Queries
1. select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
2. select avg(pressure) from Geometry3d group by simulation_number
3. select avg(pressure) from Geometry3d group by time_step
4. select avg(pressure) from Geometry3d where simulation_number < 2 group by time_step
61. Results
Queries
1. select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
2. select avg(pressure) from Geometry3d group by simulation_number
3. select avg(pressure) from Geometry3d group by time_step
4. select avg(pressure) from Geometry3d where simulation_number < 2 group by time_step
5. select avg(pressure) from Geometry3d where simulation_number = 0 and time_step < 1920 group by time_step
6. select avg(pressure) from Geometry3d where simulation_number < 0 and time_step < 0 group by time_step
7. select avg(pressure) from Geometry3d where (time_step % 512) = 0 group by time_step
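The benchmark queries all share one shape: filter cells on dimension values, group by one dimension, and average the pressure attribute. The pure-Python sketch below shows that shape on a tiny made-up dataset. The data values and the helper name are assumptions for illustration; nothing here reproduces the SciDB execution or the measured timings.

```python
from collections import defaultdict

# Tiny synthetic stand-in for Geometry3d: (simulation_number, time_step, pressure).
cells = [
    (0, 0, 10.0), (0, 1920, 20.0), (0, 3840, 30.0),
    (1, 0, 40.0), (1, 1920, 50.0), (1, 3840, 60.0),
]

def avg_pressure(cells, group_by, where=lambda sim, ts: True):
    """Average pressure per group, keeping only cells for which `where` holds."""
    sums = defaultdict(lambda: [0.0, 0])   # key -> [running total, count]
    for sim, ts, pressure in cells:
        if where(sim, ts):
            key = sim if group_by == "simulation_number" else ts
            sums[key][0] += pressure
            sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

# Query 1: time_step < 1920, grouped by simulation_number.
q1 = avg_pressure(cells, "simulation_number", where=lambda sim, ts: ts < 1920)
# Query 3: no filter, grouped by time_step.
q3 = avg_pressure(cells, "time_step")
# Query 7: (time_step % 512) = 0, grouped by time_step.
q7 = avg_pressure(cells, "time_step", where=lambda sim, ts: ts % 512 == 0)

print(q1, q3, q7)
```

Queries 1 and 4 filter on one dimension and group by the other, which is presumably why the experiments compare the two dimension orders (S, T) and (T, S).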
62. Final Remarks
We argue that “Big Data Analytics requires a scientific
approach based on hypothesis formulation and
follow-up”
Υ-DB is an innovative approach to Big Data
management:
– Reflects the hypothesis-as-data principle
– Is formal and guards equivalence between data and models
– Models uncertainty in the model and in the data
– Must be extended
to cope with observation validation (Bayesian model)
to support multidimensional representation
– Read our paper at VLDB 2014
SimDB is an extension to SciDB to efficiently store
irregular meshes on multidimensional array systems
63. Final Remarks
• space-time models require a
multidimensional data model for hypothesis
as data management;
• SciDB is a parallel multidimensional array
DBMS
• We want to extend Υ-DB to multidimensional
arrays
• still very immature
64. This is DEXL team work
PhD Candidate Bernardo Gonçalves
IBM PhD Fellowship 2013-2014
(bgonc@lncc.br)
Dr Ramon Gomes Costa
(ramongc@lncc.br)
MSc student Hermano Lustosa
(hllustosa@gmail.com)
67. Olympic Laboratory
EMC Summer School 2013
Objective
– To study high performance sports as a science discipline
– To build the first sports laboratory in South America
US$ 10M project sponsored by FINEP (funding
agency)
Departments:
– Biochemistry, physiology, genetics, nutrition, computational
modeling, computer science
68. Our task
To support athlete’s follow-up data
– Athlete’s training
– Variation on biochemical elements
– Variation on biometric variables
More recently
– For some modalities, integrate meteorological
conditions
70. Athletes follow-up database
Athletes follow-up data modeled as trajectories
– Register measurements from athletes in different training
states
Trajectory model
– Ordered set of measurements
– Division of time in training states
– Materialized view limited in time-range
– Imprecise measurements
Not detected =0
< x -> ]0,x[
y , y x
71. More on Athlete’s Trajectories
Stops – modelled as measurements
– Qualified according to the athlete’s training state
– Training states (recovery, training, rest,…)
Moves – extrapolation between two stops
Trajectory – the set of measurements,
ordered in time, and limited in time according
to some criteria (eg. A training program).
– Measurements of the same observable element
– Measurements of the same athlete
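The trajectory notions above can be sketched as a small data structure: each measurement (a stop) carries a value interval so that imprecise readings fit the same shape, and a trajectory is the time-ordered, time-limited selection for one athlete and one observable element. All names, the "ND" marker for "not detected", the time unit and the sample data are assumptions made for illustration; the talk does not show the actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    athlete: str
    element: str    # observable element, e.g. a biochemical marker
    time: float     # assumed unit: days since the start of the programme
    state: str      # training state: recovery, training, rest, ...
    low: float      # lower bound of the measured value
    high: float     # upper bound of the measured value

def parse_value(raw):
    """Map a raw reading to a (low, high) interval.

    'ND' (not detected) -> (0, 0); '<x' -> (0, x), standing for the open
    interval ]0, x[; anything else is taken as an exact value.
    """
    raw = raw.strip()
    if raw == "ND":
        return (0.0, 0.0)
    if raw.startswith("<"):
        return (0.0, float(raw[1:]))
    value = float(raw)
    return (value, value)

def trajectory(measurements, athlete, element, start, end):
    """Measurements of one athlete and element within [start, end], ordered in time."""
    selected = [m for m in measurements
                if m.athlete == athlete and m.element == element
                and start <= m.time <= end]
    return sorted(selected, key=lambda m: m.time)

# Hypothetical sample data.
raw_rows = [("A1", "lactate", 3.0, "training", "<5"),
            ("A1", "lactate", 1.0, "rest", "ND"),
            ("A1", "lactate", 9.0, "recovery", "3.2"),
            ("A2", "lactate", 2.0, "rest", "1.1")]
measurements = [Measurement(a, e, t, s, *parse_value(v)) for a, e, t, s, v in raw_rows]
traj = trajectory(measurements, "A1", "lactate", start=0.0, end=5.0)
print([(m.time, m.state, m.low, m.high) for m in traj])
```

Moves, described on the slide as extrapolation between two stops, are left out of this sketch.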
75. Challenges
Integrating athlete’s trajectory with weather
information
How to efficiently store metaphoric
trajectories?
– Trajstore [Cudre-Mauroux et al ICDE 2010]
– SciDB
How to express and efficiently process
similar trajectories?