IBM Research Brazil
Fabio Porto (fporto@lncc.br),
LNCC – MCTI
DEXL Lab (dexl.lncc.br)
Challenges in Scientific
Big Data Management
Outline
 Introduction
 Big Data in Science
 Hypothesis-Driven Research
 Hypothesis as Data – Upsilon-DB
 SimDB
 Final remarks
Laboratório Nacional de
Computação Científica (LNCC)
Petropolis, Rio de Janeiro
LNCC - MCTI
 Graduate Course in Computational Modelling
– CAPES 6
 BioInfo Laboratory
– High throughput sequencing
 Coordinator of INCT –MACC
– Medicine Supported by Computational Science
 Coordinator of SINAPAD
– HPC National System
 Thematic laboratories
– ACIMA – Augmented Reality
– MARTIN – Network and Software Engineering
– DEXL – Big Data
– COMCIDIS – Distributed Systems
– HEMOLAB – Cardio Vascular System Modelling
DEXL Laboratory – Ongoing Projects
 What-if data analysis
 Astronomy Data Management
 Simulation Data Management
 Hypothesis – Upsilon-DB
 Gene Regulation Network
 Scientific Workflow Optimization
 Seismic Data Management (EMC)
 Bioknowlogy
Olympic
Laboratory
Objective
 To provide scientists with an in-silico cockpit
from which scientific data and metadata can
be efficiently managed
“Scientists are spending most of their time
manipulating, organizing, finding and moving
data, instead of researching. And it’s going to
get worse”
– Office of Science, Data-Management Challenge
Report, DoE, 2004
Big Data in science
 An expression that reflects the data deluge
produced in science
– astronomy, astrophysics
– Biology, Neuroscience
– Sports
– Geology, Geophysics, etc.
Big Data - Dimensions
 Volume: MB, GB, TB, PB
 Velocity: batch, online, real time (sensors, alerts)
 Variety: file, database; heterogeneity, evolution
 Uncertainty
A challenge on volume
 Dark Energy Survey (DES) project expects to
produce 100 PB in 10 years (source: personal communication);
– 5000 square degrees of sky coverage, “all” objects, “perfect” accuracy
 Yahoo claims to manage 2 PB of click data in a
modified PostgreSQL
 EMBL - nucleotide database 260 Gbases
 High-throughput sequencing 454 Roche technology
– Sequence 400-600 million bases in 10 hours
– E.g., a project at the Max Planck Institute aims at sequencing the
whole genome of the Neanderthal, at 3 billion base pairs, and is
expected to take 2 years to finish.
1D-3D coupled simulations
From Observation to Data
Analysis
BIG DATA in Science
 Scientific process is being remodelled to be developed
within an in-silico environment
 Powerful instruments:
– Digital telescopes
– DNA sequencers
– Mass spectrometers
 Huge simulations
– Weak lensing
– Human Cardio-vascular system
 Massive amounts of information streams in and out…
 Hypothesis-driven research supported by in-silico
infrastructure, methods, models…
e-Science life cycle
[Figure: life cycle connecting Phenomenon, Hypothesis Formulation, Modeling, Experiment, and Publication]
Big Data urgent call in e-science
 Scientific life cycle metadata management;
 Scientific Hypothesis formulation and
validation;
 Scientific Data management;
 Scientific data processing architecture;
MODELLING -
HYPOTHESIS-DRIVEN BIG
DATA RESEARCH
“To see what is in front of one’s nose
needs a constant struggle”
George Orwell
To make sense of Big Data we
need models
[Peter Haas – Data is Dead without what-if models,
PVLDB 2011]
 Scientific models are formal interpretations of
phenomena
 Hypotheses are formalized as models
 The scientific life cycle is driven by hypothesis
validation
Hypothesis driven Big Data
analyses
 Scientific Hypothesis – a model for scientists’
interpretation of a phenomenon;
 Different hypotheses co-habit a scientific
domain;
 Scientific method – prove hypotheses
 Big Data analyses – hypotheses exploration
 In new Big Data prediction analysis – identify
first principles that guide predictions – deep vs
shallow prediction
Big Data Hypothesis-Driven Life Cycle
[Figure: life cycle with the stages Hypotheses and Experiment Goals; Experiment and Workflow Design; Workflow Preparation; Workflow Execution; Post-Execution Analysis. Supporting components: Workflow Repository, Data Sources, Provenance Store, Monitoring, Hypotheses Database, Analysis Results. Adapted from [Mattoso et al. 2010]]
Equivalence of interpretation
[Figure: triangle relating ν (hypotheses) ≅ Models, ϕ (phenomenon), and δ (data)]
HYPOTHESIS AS DATA
Υ-DB Conceptual Model
[Figure: conceptual model diagram. Entities: Phenomenon, Hypothesis, Scientist, Mathematical Model, Computational Model View, Simulation, Formal Representation, Formal Language, Mathematical Formulae (XML), Physical Quantities, Phenomenon Physical Quantities, Ph_Process (Continuous, Discrete), Discrete Phenomenon, State, Event, Space-Time Dimension, Mesh, Mesh Data View, Data View (query over data view), Observation Element, Simulated Element, Domain Ontology (URL); model elements: constant, variable, function, equation. Relationships: explains, formulatedBy, isTheBlendOf, isBasedOn, represented_as, represented_with, compared_with, modeled_as, refers-to, transforms, represents, topologically modeled by, isAuthor.]
[Porto et al. ER 2008, ER 2012]
Hypotheses as Data – Upsilon DB
 From the triangular equivalence, we derive that
– Hypothesis = Model = Data
 How can we infer data from Model?
[Bernardo Gonçalves, Fabio Porto, PVLDB 2014]
Hypothesis as Models
Hypothesis: Law of free fall
If a body falls from rest, its velocity at any point is
proportional to the time it has been falling.
Scientific Model:
a(t) = -g
v(t) = -g t + v_0
s(t) = -(g/2) t^2 + v_0 t + s_0
Computational Model:
for k = 0:n
    t = k * dt;
    v = -g*t + v0;
    s = -(g/2)*t^2 + v0*t + s0;
    % indices start at 1, so sample k is stored at position k+1
    t_plot(k+1) = t;
    v_plot(k+1) = v;
    s_plot(k+1) = s;
end
Hypothesis - From Models to Data
[Figure: the computational model of the free-fall law is given to a SOLVER, which turns its input into output data]
Hypothesis as Data
[Figure: the law of free fall and its computational model, as on the previous slides, produce the relation Free_Fall]
Free_Fall
t    v      s
0    0      5000
1    -32    4984
2    -64    4936
3    -96    4856
4    -128   4744
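The Free_Fall relation can be reproduced by evaluating the model on a time grid. The following is a minimal Python sketch, assuming g = 32, v0 = 0, s0 = 5000 and dt = 1; these values are inferred from the table rather than stated on the slide, and the function name `free_fall_rows` is hypothetical.

```python
def free_fall_rows(g, v0, s0, dt, n):
    """Evaluate v(t) and s(t) of the free-fall model at t = 0, dt, ..., n*dt."""
    rows = []
    for k in range(n + 1):
        t = k * dt
        v = -g * t + v0
        s = -(g / 2) * t**2 + v0 * t + s0
        rows.append((t, v, s))
    return rows


# Parameter values assumed from the table: g = 32, v0 = 0, s0 = 5000.
for t, v, s in free_fall_rows(g=32, v0=0, s0=5000, dt=1, n=4):
    print(t, v, s)
```

Each tuple corresponds to one row (t, v, s) of the relation, so the hypothesis is materialized as plain data.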
Hypothesis as Data – Computing a
DB schema
 Mathematical Models
– formalize hypotheses
– each equation establishes a functional dependency (FD)
from dimensions and parameters to
predicting variables
 e.g.: g, t, v0 -> v
– a DB schema is derived from the FDs extracted from the
equations
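As an illustration of how a functional dependency could be read off an equation, the sketch below collects the variable names on the right-hand side with Python's standard `ast` module. It is an assumption about one possible extraction step, not the procedure used by Υ-DB; the function name `functional_dependency` and the textual equation format are hypothetical, and the epistemological variables Φ and ν are not handled here.

```python
import ast


def functional_dependency(equation):
    """Return (determinants, dependent) for an equation written as 'lhs = rhs'."""
    lhs, rhs = equation.split("=", 1)
    tree = ast.parse(rhs.strip(), mode="eval")
    # Every variable name on the right-hand side determines the left-hand side.
    determinants = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return determinants, lhs.strip()


equations = ["a = -g", "v = -g*t + v0", "s = -(g/2)*t**2 + v0*t + s0"]
for eq in equations:
    lhs_vars, dependent = functional_dependency(eq)
    print(sorted(lhs_vars), "->", dependent)
```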
Hypothesis as Data
 In the Free Fall example:
– Σ1 = { Φ -> g, v0, s0;
         g, ν -> a;
         g, v0, t, ν -> v;
         g, v0, s0, t, ν -> s }
 Observe that Φ and ν are epistemological
variables referring to the phenomenon and
the hypothesis, respectively;
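To see what Σ1 implies, one can compute the closure of a set of attributes under the dependencies. This is the textbook attribute-closure algorithm, shown as a small Python sketch; the names `phi` and `upsilon` stand for Φ and ν, and the encoding of Σ1 as pairs of sets is an assumption made for illustration.

```python
def closure(attributes, fds):
    """Return all attributes functionally determined by `attributes` under `fds`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for determinants, dependents in fds:
            if determinants <= result and not dependents <= result:
                result |= dependents
                changed = True
    return result


# Sigma_1 from the slide; "phi" and "upsilon" stand for the phenomenon and hypothesis.
SIGMA_1 = [
    ({"phi"}, {"g", "v0", "s0"}),
    ({"g", "upsilon"}, {"a"}),
    ({"g", "v0", "t", "upsilon"}, {"v"}),
    ({"g", "v0", "s0", "t", "upsilon"}, {"s"}),
]

print(sorted(closure({"phi", "upsilon"}, SIGMA_1)))
print(sorted(closure({"phi", "upsilon", "t"}, SIGMA_1)))
```

Under this encoding, the phenomenon and hypothesis alone determine the parameters and a, while v and s additionally require the time dimension t.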
v(t) = -gt + vo
Hypothesis as Data - schema
 Φ -> g, v0, s0
– defines the model parameters;
– is expected to be violated, which reproduces the
uncertainty in the model input;
– such uncertainty contributes to the quality of the
hypothesis
 From Σ1, the schema for predicting a under
hypothesis h1 would be:
– h1(Φ, ν, a)
 From Σ1, the input parameters are defined
as (* key violation on Φ):
– h1_input(Φ, g, v0, s0)
Hypothesis as Uncertain Data
INPUT_H1
Φ   g      v0   s0
1   32     0    5000
1   32     10   5000
1   32     20   5000
1   32.2   0    5000
1   32.2   10   5000
1   32.2   20   5000
Uncertainty: 50% for g (2 candidate values), 33% for v0 (3 candidate values)
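The rows of INPUT_H1 look like every combination of the candidate parameter values. The sketch below rebuilds the table that way in Python; treating the table as a Cartesian product and treating each candidate value as equally likely are both assumptions read into the slide, and the variable names are hypothetical.

```python
from itertools import product

# Candidate values per parameter, read off the INPUT_H1 table.
candidates = {"g": [32, 32.2], "v0": [0, 10, 20], "s0": [5000]}

# One row per combination, all for phenomenon phi = 1.
input_h1 = [(1, g, v0, s0) for g, v0, s0 in product(*candidates.values())]

# Assumption: each candidate value of a parameter is equally likely.
uncertainty = {name: 1 / len(values) for name, values in candidates.items()}

for row in input_h1:
    print(row)
print(uncertainty)
```

All six rows share Φ = 1, which is exactly the violation of Φ -> g, v0, s0 that carries the input uncertainty.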
Uncertainty Introduction
 Υ-DB is a probabilistic database
[D. Suciu et al., Probabilistic Databases, 2011]
– a Y-relation includes certain and conditional columns;
– a conditional column is a pair (Vi, Di), where Vi is a
random variable and Di is one of its possible values;
– e.g.:
 create table Y_g as select U.phi, U.g
from (repair key phi in (select phi, g, count(*) as Fr
from INPUT_H1 group by phi, g) weight by Fr) as U
Hypothesis as Uncertain Data
 create table Y_g as select U.phi, U.g
from (repair key phi in (select phi, g, count(*) as Fr
from INPUT_H1 group by phi, g) weight by Fr) as U
INPUT_H1
Φ   g
1   32
1   32
1   32
1   32.2
1   32.2
1   32.2
Y_g
Φ   V -> D    g
1   x1 -> 1   32
1   x1 -> 2   32.2
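The effect of the repair-key query on the (Φ, g) projection can be imitated outside the database. The Python sketch below is only an emulation under the assumption that alternatives are weighted by their frequency; it does not reproduce the engine's random-variable bookkeeping (the x1 column), and the function name `repair_key` is hypothetical.

```python
from collections import Counter, defaultdict


def repair_key(rows, key_index, value_index):
    """Emulate a frequency-weighted key repair: for each key, return the
    distinct values with probabilities proportional to their counts."""
    counts = Counter((row[key_index], row[value_index]) for row in rows)
    totals = defaultdict(int)
    for (key, _value), freq in counts.items():
        totals[key] += freq
    return [(key, value, freq / totals[key]) for (key, value), freq in counts.items()]


# The (phi, g) projection of INPUT_H1 from the slide.
input_h1 = [(1, 32), (1, 32), (1, 32), (1, 32.2), (1, 32.2), (1, 32.2)]

for phi, g, prob in repair_key(input_h1, key_index=0, value_index=1):
    print(phi, g, prob)
```

The six input rows collapse into two alternatives for g, each with probability 0.5, which matches the two rows of Y_g.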
Synthesizing Prediction as a query
 since g, ν -> a is in Σ1, we can predict a with a query
on the uncertain relations Y_g and Y_R
– create table Y1_a as select H.phi, H.upsilon, H.a
from H1_OUTPUT_a as H, Y_R as R, Y1_g as G,
(select min(tid) as tid, phi, g from H1_INPUT group
by phi, g) as U
where H.tid=U.tid and G.phi=U.phi and G.g=U.g
and H.phi=R.phi and H.upsilon=R.upsilon
Predicted Y-DB relation Y[a]
Υ[a]
Φ   g      a      u
1   32     32     0.5
1   32.2   32.2   0.5
Υ-DB enables data-oriented uncertainty quantification of
predicting variables
Sum Up
 Υ-DB is a probabilistic database designed to
manage hypothesis as data
 In Υ-DB, both the intrinsic uncertainty of the
model Υ[R] and those of predicting variables
(eg. Υ[a]) are automatically computed
 Υ-DB and Research Lattices are the basis for
managing Hypotheses over Big Data in
science (and we believe in any domain)
ORDERING HYPOTHESES
Ordering Hypotheses
 Different competing hypotheses must be
placed into context according to their
phenomenon explanation capacity
– predicting capability (predicting variables);
– assumptions and constraints
Hypotheses in the Dark Energy
Survey Project
 Phenomenon
– The expansion of the universe is accelerating
 Discovered in 1998 during supernova investigations
 Supported by redshift observations of faraway supernovae
 Hypotheses
– A new behaviour, Dark Energy, drives the acceleration
– The density of the Universe is not uniform
 Evidence
– Gravitational lenses
– Galaxy clusters
Research Lattices – structure
hypotheses of a phenomenon
[Figure: research lattice between a top and a bottom element, with the nodes Dark Energy, Non-uniform universe, Weak lensing, Galaxy clustering, and Earth special location]
[B. Gonçalves, F. Porto, Research Lattices, AMW 2013]
Research Lattices
 Each node is a hypothesis;
 Given two hypotheses h1 and h2 in a research lattice (R.L.), if
h1 ≥ h2 then h1 is more general than h2;
 Top corresponds to all knowledge of a
domain;
 Bottom is the empty representation, i.e., the lack of
knowledge;
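The ordering can be sketched as reachability in a directed graph whose edges go from a more general hypothesis to a more specific one. The Python example below is illustrative only: the edges are hypothetical (the slides do not list them), and defining "competing" as "neither hypothesis is more general than the other" is an assumption.

```python
def descendants(lattice, node):
    """All hypotheses strictly more specific than `node` (reachable downward)."""
    seen, stack = set(), list(lattice.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lattice.get(current, ()))
    return seen


def more_general(lattice, h1, h2):
    """True when h1 >= h2, i.e. h1 equals h2 or h2 is reachable from h1."""
    return h1 == h2 or h2 in descendants(lattice, h1)


def competing(lattice, h1, h2):
    """Assumption: two hypotheses compete when neither is more general."""
    return not more_general(lattice, h1, h2) and not more_general(lattice, h2, h1)


# Hypothetical edges (general -> more specific), for illustration only.
lattice = {
    "TOP": ["dark_energy", "non_uniform_universe"],
    "dark_energy": ["BOTTOM"],
    "non_uniform_universe": ["BOTTOM"],
    "BOTTOM": [],
}

print(more_general(lattice, "TOP", "dark_energy"))
print(competing(lattice, "dark_energy", "non_uniform_universe"))
```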
Research Lattice: Acceleration
[Figure: lattice between a top and a bottom element containing the hypotheses below]
h1: Law of free fall, d^2s/dt^2 = 9.8
h2: Newton's first law
h3: Newton's second law, F = m a_g
h4: Centripetal acceleration, a_c = 4π^2 r / T^2
h5: Kepler's third law, r^3 / T^2 = c
h6: Law of universal gravitation, F_g = G M m / r^2
h7: Inverse-square law of distance, a_c ∝ 1 / r^2
Research Lattice for the Human
Cardio Vascular System
Research lattice Operations
 Add/delete hypotheses
– consistently keep the partial ordering;
– automatic placement of hypotheses in the RL
 Querying
– finding hypotheses based on “Free Fall”
hypothesis
– find competing hypotheses wrt “Dark Energy”
Hypothesis
 How to assess the predictive capacity of a
hypothesis?
Sum up
 A research lattice enables a formal yet bounded
representation of a research domain
 Hypotheses encode scientists' interpretation of the
phenomenon they study
MULTIDIMENSIONAL
REPRESENTATION
Dealing with Space-Time
dimension on Hypotheses
 Most phenomena occur in space-time;
 In computational models, simulations use
meshes to model the physical domain;
 Predicting variables are computed at a point
of a multidimensional space
– 1D, 3D, 4D, etc.
 The data representation is a multidimensional
matrix [ArrayDB, SciDB, …]
1D and 3D Meshes representations
of human artery
Multidimensional Array
Representation
Is it efficient for processing queries over meshes such as
those of the HCVS (Human Cardio-Vascular System)?
SciDB in 20 sec
 The unit of representation is the multidimensional
array
 each dimension has a name and a size
 a reference to all dimensions in an array leads to
a cell
 a cell has many attributes – columnar store
 an array may be partitioned along its dimensions
 Two query languages: AQL and AFL
Loading Simulation data
[Figure: Simulation output → Wrapper → unidimensional array Geometry3d_raw → redimension_store → multidimensional array Geometry3d]
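The loading step turns flat records into cells addressed by their dimension values. The Python sketch below imitates that idea with a dictionary keyed by coordinates; it is a pure-Python illustration, not SciDB's implementation, and the sample records, the attribute subset and the function name `redimension` are hypothetical.

```python
def redimension(raw_cells, dimensions):
    """Turn flat records into a mapping from dimension coordinates to attributes."""
    array = {}
    for record in raw_cells:
        coords = tuple(record[d] for d in dimensions)
        attributes = {k: v for k, v in record.items() if k not in dimensions}
        if coords in array:
            raise ValueError(f"two records map to the same cell {coords}")
        array[coords] = attributes
    return array


# Hypothetical wrapper output: one record per mesh point and time step.
geometry3d_raw = [
    {"simulation_number": 0, "time_step": 0, "x_axis": 1, "y_axis": 2, "z_axis": 3, "pressure": 80.5},
    {"simulation_number": 0, "time_step": 1, "x_axis": 1, "y_axis": 2, "z_axis": 3, "pressure": 81.0},
]
dims = ["simulation_number", "time_step", "x_axis", "y_axis", "z_axis"]

geometry3d = redimension(geometry3d_raw, dims)
print(geometry3d[(0, 1, 1, 2, 3)])
```

The collision check matters for meshes: two mesh points that map to the same coordinates cannot both occupy one cell.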
A multidimensional array in SciDB
CREATE ARRAY Geometry3d
< velocity_x:double, velocity_y:double, velocity_z:double,
pressure:double, displacement_x:double,
displacement_y:double,displacement_z:double >
[simulation_number=0:9,1,0, time_step=0:30720,1920,0,
x_axis=0:39,40,0, y_axis=0:39,40,0, z_axis=0:39,40,0]
Challenge to map an irregular mesh
into a regular array structure
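The partitioning implied by the Geometry3d schema can be worked out with a few lines of Python. The sketch assumes each dimension is declared as name=low:high,chunk_length,overlap, which is how the CREATE ARRAY statement is read here; the helper names are hypothetical.

```python
from math import ceil, prod

# (name, low, high, chunk_length) copied from the CREATE ARRAY statement.
DIMENSIONS = [
    ("simulation_number", 0, 9, 1),
    ("time_step", 0, 30720, 1920),
    ("x_axis", 0, 39, 40),
    ("y_axis", 0, 39, 40),
    ("z_axis", 0, 39, 40),
]


def chunks_per_dimension(low, high, chunk_length):
    """Number of chunks needed to cover the inclusive range low..high."""
    return ceil((high - low + 1) / chunk_length)


def chunk_of(cell):
    """Chunk coordinates of a cell given as one index per dimension."""
    return tuple((i - low) // length for i, (_, low, _, length) in zip(cell, DIMENSIONS))


counts = [chunks_per_dimension(low, high, length) for _, low, high, length in DIMENSIONS]
print(counts, prod(counts))
print(chunk_of((3, 2000, 10, 20, 30)))
```

Under this reading, time_step spans 30721 values, so the last of its 17 chunks holds a single time step, and each chunk covers the whole 40 x 40 x 40 spatial grid of one simulation.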
SimDB
 We developed SimDB: A layer on top of
SciDB to map irregular meshes into a regular
array data representation
Experiment Results
Experiments set-up
• 4 Servers and 16 VMs
• 4 Queries
VMs   1 Server   2 Servers     4 Servers
1     10GB       -             -
2     5GB        5GB / 10GB    -
4     2.5GB      2.5GB / 5GB   2.5GB / 5GB / 10GB
8     -          2.5GB         2.5GB / 5GB
16    -          -             2.5GB
Experiments
• 8 SciDB instances per VM
• 1, 2, 4, 8, 16 VMs
• 7 queries
• 2 arrays: (S, T) and (T, S)
• 30 executions
Results
[Figure: execution time (axis from 00:00.00 to 04:19.20) of Queries 1 to 4 for each VM (Server) configuration: 1(1), 2(1), 2(2), 2(2), 4(1), 4(2), 4(2), 4(4), 4(4), 4(4), 8(2), 8(4), 8(4), 16(4)]
Queries
1. Select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
2. Select avg(pressure) from Geometry3d group by simulation_number
3. Select avg(pressure) from Geometry3d group by time_step
4. Select avg(pressure) from Geometry3d where simulation_number < 2 group by time_step
Results
Queries
1. Select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
2. Select avg(pressure) from Geometry3d group by simulation_number
3. Select avg(pressure) from Geometry3d group by time_step
4. Select avg(pressure) from Geometry3d where simulation_number < 2 group by time_step
5. Select avg(pressure) from Geometry3d where simulation_number = 0 and time_step < 1920 group by time_step
6. Select avg(pressure) from Geometry3d where simulation_number < 0 and time_step < 0 group by time_step
7. Select avg(pressure) from Geometry3d where (time_step % 512) = 0 group by time_step
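All of the benchmark queries share one shape: filter cells, group by one dimension, and average the pressure. The Python sketch below captures that shape on a handful of made-up cells; the sample data and the function name `avg_pressure` are hypothetical, and the code says nothing about how SciDB evaluates the queries.

```python
from collections import defaultdict


def avg_pressure(cells, where, group_by):
    """Average the pressure of the cells passing `where`, grouped by one dimension."""
    sums = defaultdict(lambda: [0.0, 0])
    for cell in cells:
        if where(cell):
            acc = sums[cell[group_by]]
            acc[0] += cell["pressure"]
            acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical sample cells; real arrays hold one cell per mesh point and time step.
cells = [
    {"simulation_number": 0, "time_step": 0, "pressure": 80.0},
    {"simulation_number": 0, "time_step": 512, "pressure": 90.0},
    {"simulation_number": 1, "time_step": 0, "pressure": 100.0},
    {"simulation_number": 1, "time_step": 2048, "pressure": 120.0},
]

# Query 1: time_step < 1920, grouped by simulation_number.
print(avg_pressure(cells, lambda c: c["time_step"] < 1920, "simulation_number"))
# Query 7: time_step % 512 == 0, grouped by time_step.
print(avg_pressure(cells, lambda c: c["time_step"] % 512 == 0, "time_step"))
```

Queries that filter on time_step and group by simulation_number (or the reverse) touch different chunk layouts, which is presumably why the experiments compare two dimension orders.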
Final Remarks
 We argue that “Big Data Analytics require a scientific
approach based on hypotheses formulation and
follow-up”
 Υ-DB is an innovative approach for Big Data
management;
– Reflects Hypothesis as data principle
– Is formal and guards equivalence between data and models
– Models uncertainty in the model and in the data
– must be extended
 to cope with observation validation (Bayesian Model)
 to support multidimensional representation
– read our paper at VLDB 2014
 SimDB is an extension to SciDB to efficiently store
irregular meshes on multidimensional array systems
Final Remarks
• space-time models require a
multidimensional data model for hypothesis
as data management;
• SciDB is a parallel multidimensional array
DBMS
• We want to extend Υ-DB to multidimensional
array
• still very immature
This is a DEXL Team work
PhD Candidate Bernardo Gonçalves
IBM PhD Fellowship 2013-2014
(bgonc@lncc.br)
Dr Ramon Gomes Costa
(ramongc@lncc.br)
Msc student Hermano Lustosa
(hllustosa@gmail.com)
Thank you!
http://dexl.lncc.br
LNCC Meeting
2012
EMC Summer School 2013
Olympic Laboratory
 Objective
– To study high performance sports as a science discipline
– To build the first sports laboratory in South America
 US$ 10M project sponsored by FINEP (funding
agency)
 Departments:
– Biochemistry, physiology, genetics, nutrition, computational
modeling, computer science
Our task
 To support athlete’s follow-up data
– Athlete’s training
– Variation on biochemical elements
– Variation on biometric variables
 More recently
– For some modalities, integrate meteorological
conditions
Analyses Board
Athletes follow-up database
 Athletes follow-up data modeled as trajectories
– Register measurements from athletes in different training
states
 Trajectory model
– Ordered set of measurements
– Division of time in training states
– Materialized view limited in time-range
– Imprecise measurements
 Not detected = 0
 < x -> ]0, x[
 y, y ≥ x
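The three cases of imprecise measurements can be represented uniformly as intervals. The Python sketch below is one possible encoding; the textual format of the readings ("not detected", "< 0.5", "1.7") and the names `Measurement` and `parse_reading` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    low: float
    high: float
    exact: bool  # False when only the open interval ]low, high[ is known


def parse_reading(text):
    """Map a lab reading to a Measurement, following the three cases on the slide."""
    text = text.strip()
    if text.lower() == "not detected":
        return Measurement(0.0, 0.0, True)
    if text.startswith("<"):
        return Measurement(0.0, float(text[1:]), False)
    value = float(text)
    return Measurement(value, value, True)


for reading in ["not detected", "< 0.5", "1.7"]:
    print(parse_reading(reading))
```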
More on Athlete’s Trajectories
 Stops – modelled as measurements
– Qualified according to the athlete’s training state
– Training states (recovery, training, rest, …)
 Moves – extrapolation between two stops
 Trajectory – the set of measurements,
ordered in time, and limited in time according
to some criteria (e.g., a training program).
– Measurements of the same observable element
– Measurements of the same athlete
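A trajectory of stops and moves can be sketched as a time-ordered list of measurements plus a rule for estimating values between them. The Python example below assumes a move is a straight line between two consecutive stops, which is only one possible reading of the slide; the sample data and the function name `value_at` are hypothetical.

```python
from bisect import bisect_right


def value_at(stops, t):
    """Estimate a value at time t from time-ordered (time, value) stops.
    Assumption: a move is a straight line between two consecutive stops."""
    times = [time for time, _ in stops]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the trajectory's time range")
    i = bisect_right(times, t)
    if i == len(stops):
        return stops[-1][1]
    (t0, v0), (t1, v1) = stops[i - 1], stops[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical measurements of one observable element for one athlete (day, value).
stops = [(0, 4.0), (2, 6.0), (5, 3.0)]
print(value_at(stops, 1))
print(value_at(stops, 5))
```

Restricting the list to one observable element and one athlete mirrors the two conditions listed for a trajectory.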
Metaphoric Trajectory
Challenges
 Integrating athlete’s trajectory with weather
information
 How to efficiently store metaphoric
trajectories ?
– Trajstore [Cudre-Mauroux et al ICDE 2010]
– SciDB
 How to express and efficiently process
similar trajectories?
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 

Recently uploaded (20)

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 

Ibmr 2014

  • 1. Challenges in Scientific Big Data Management. Fabio Porto (fporto@lncc.br), LNCC – MCTI, DEXL Lab (dexl.lncc.br). IBM Research Brazil
  • 2. Outline
      - Introduction
      - Big Data in Science
      - Hypothesis-Driven Research
      - Hypothesis as Data – Upsilon-DB
      - SimDB
      - Final remarks
  • 3. Laboratório Nacional de Computação Científica (LNCC), Petropolis, Rio de Janeiro
  • 4. LNCC - MCTI
      - Graduate Course in Computational Modelling
          – CAPES 6
      - BioInfo Laboratory
          – High-throughput sequencing
      - Coordinator of INCT-MACC
          – Medicine Supported by Computational Science
      - Coordinator of SINAPAD
          – HPC National System
      - Thematic laboratories
          – ACIMA – Augmented Reality
          – MARTIN – Network and Software Engineering
          – DEXL – Big Data
          – COMCIDIS – Distributed Systems
          – HEMOLAB – Cardiovascular System Modelling
  • 5. DEXL – Ongoing Projects (diagram centred on the DEXL Laboratory)
      - What-if data analysis
      - Astronomy Data Management
      - Simulation Data Management
      - Hypothesis Upsilon-DB
      - Gene regulation Network
      - Scientific workflow optimization
      - Seismic Data Mngmt (EMC)
      - Bioknowlogy
      - Olympic Laboratory
  • 6. Objective
      - To provide scientists with an in-silico cockpit from which scientific data and metadata can be efficiently managed
  • 7. “Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it’s going to get worse” – Office of Science, Data-Management Challenge Report, DoE, 2004
  • 8. Big Data in science
      - An expression that reflects the data deluge produced in science
          – Astronomy, Astrophysics
          – Biology, Neuroscience
          – Sports
          – Geology, Geophysics, etc.
  • 9. Big Data - Dimensions (diagram)
      - Volume: MB, GB, TB, PB
      - Velocity: batch, online, sensors and alerts, real time
      - Variety: file, database, uncertainty, heterogeneity, evolution
  • 10. A challenge on volume
      - The Dark Energy Survey (DES) project expects to produce 100 PB in 10 years (source: personal comm.)
          – 5000 deg² sky coverage, “all” objects, “perfect” accuracy
      - Yahoo claims to manage 2 PB of click data in a modified PostgreSQL
      - EMBL nucleotide database: 260 Gbases
      - High-throughput sequencing with 454 Roche technology
          – Sequences 400-600 million bases in 10 hours
          – E.g. a project at the Max Planck Institute aims at sequencing the whole Neanderthal genome (3 billion base pairs) and is expected to take 2 years to finish
  • 11. 1D-3D coupled simulations
  • 12. From Observation to Data Analysis
  • 13. BIG DATA in Science
      - The scientific process is being remodelled to be developed within an in-silico environment
      - Powerful instruments:
          – Digital telescopes
          – DNA sequencers
          – Mass spectrometers
      - Huge simulations
          – Weak lensing
          – Human cardiovascular system
      - Massive amounts of information stream in and out…
      - Hypothesis-driven research supported by in-silico infrastructure, methods, models…
  • 15. Big Data urgent call in e-science
      - Scientific life cycle metadata management
      - Scientific hypothesis formulation and validation
      - Scientific data management
      - Scientific data processing architecture
  • 17. “To see what is in front of one’s nose needs a constant struggle” George Orwell
  • 18. To make sense of Big Data we need models [Peter Haas – Data is Dead without what-if models, PVLDB 2011]
      - Scientific models are formal interpretations of phenomena
      - Hypotheses are formalized as models
      - The scientific life cycle is driven by hypothesis validation
  • 19. Hypothesis-driven Big Data analyses
      - Scientific hypothesis – a model for the scientists’ interpretation of a phenomenon
      - Different hypotheses cohabit a scientific domain
      - Scientific method – prove hypotheses
      - Big Data analyses – hypotheses exploration
      - In new Big Data prediction analysis
          – identify first principles that guide predictions
          – deep vs shallow prediction
  • 20. Big Data hypothesis-driven life cycle (diagram, adapted from [Mattoso et al. 2010])
      - Stages: Hypotheses and experiment goals; Experiment and workflow design; Workflow preparation; Workflow execution; Post-execution analysis
      - Supporting components: Workflow repository; Data sources; Provenance store; Monitoring; Hypotheses database; Analysis results
  • 21. Equivalence of interpretation (triangle diagram): ν (hypotheses) ≅ Models; ϕ (phenomenon); δ (data)
  • 23. Υ-DB Conceptual Model (entity-relationship diagram) [Porto et al. ER 2008, ER 2012]
      - Entities: Phenomenon, SC Hypothesis, Scientist, Mathematical Model, Mathematical Formulae (XML), Formal Representation, Formal Language, Physical Quantities, Ph_Process (Continuous, Discrete), State, Event, Space-Time Dimension, Observation Element, Simulated Element, Data View (query over Data view), Computational Model, Simulation, Mesh, Mesh Data view, Domain ontology URL
      - Formal representation elements: constant, function, equation, variable
      - Relationships: explains, formulatedBy, isTheBlendOf, isBasedOn, represented_as, compared_with, modeled_as, refers-to, transforms, represents, topologically modeled by, isAuthor
  • 24. Hypotheses as Data – Upsilon DB
      - From the triangular equivalence, we derive that
          – Hypothesis = Model = Data
      - How can we infer data from a model? [Bernardo Gonçalves, Fabio Porto, PVLDB 2014]
  • 25. Hypothesis as Models
      - Hypothesis (law of free fall): if a body falls from rest, its velocity at any point is proportional to the time it has been falling.
      - Scientific model:
          $a(t) = -g$
          $v(t) = -g t + v_0$
          $s(t) = -\frac{g}{2} t^2 + v_0 t + s_0$
      - Computational model:
          for k = 0:n
              t = k * dt;
              v = -g*t + v0;
              s = -(g/2)*t^2 + v0*t + s0;
              % k starts at 0, so shift by one for 1-based array indexing
              t_plot(k+1) = t;
              v_plot(k+1) = v;
              s_plot(k+1) = s;
          end
  • 26. Hypothesis - From Models to Data (diagram): Input → SOLVER (running the free-fall computational model) → Output
  • 27. Hypothesis as Data: the law of free fall, its scientific model and its computational model yield the relation Free_Fall
          t | v    | s
          0 | 0    | 5000
          1 | -32  | 4984
          2 | -64  | 4936
          3 | -96  | 4856
          4 | -128 | 4744
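
The step from computational model to data can be sketched in Python. This is only an illustration, not the authors' solver: the function name `free_fall_table` is hypothetical, and the parameter values g = 32, v0 = 0, s0 = 5000, dt = 1 are assumptions chosen because they reproduce the Free_Fall rows shown on the slide.

```python
def free_fall_table(g, v0, s0, dt, n):
    """Evaluate the free-fall model at t = 0, dt, ..., n*dt.

    Returns a list of (t, v, s) tuples, one per time step.
    """
    rows = []
    for k in range(n + 1):
        t = k * dt
        v = -g * t + v0                     # v(t) = -g t + v0
        s = -(g / 2) * t**2 + v0 * t + s0   # s(t) = -(g/2) t^2 + v0 t + s0
        rows.append((t, v, s))
    return rows


# Assumed parameters (not stated on the slide, inferred from its table).
rows = free_fall_table(g=32, v0=0, s0=5000, dt=1, n=4)
for t, v, s in rows:
    print(t, v, s)
```

Each returned tuple corresponds to one row of a relation with attributes (t, v, s), which is the sense in which the hypothesis is stored as data.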
  • 28. Hypothesis as Data – Computing a DB schema
      - Mathematical models
          – formalize hypotheses
          – equations establish a functional dependency from dimensions and parameters to predicting variables, e.g. g, t, v0 → v
          – a DB schema is derived from the functional dependencies (FDs) extracted from the equations
  • 29. Hypothesis as Data
      - In the free-fall example (for instance, $v(t) = -g t + v_0$ gives the third dependency):
          – Σ1 = { Φ → g, v0, s0;  g, ν → a;  g, v0, t, ν → v;  g, v0, s0, t, ν → s }
      - Observe that Φ and ν are epistemological variables referring to the phenomenon and the hypothesis, respectively
  • 30. Hypothesis as Data - schema
      - Φ → g, v0, s0
          – defines the model parameters
          – is expected to be violated, reproducing the uncertainty in the model input
          – such uncertainty contributes to the quality of the hypothesis
      - From Σ1, the schema for predicting a under hypothesis h1 would be:
          – h1(Φ, ν, a)
      - From Σ1, the input parameters are defined as (*key violation):
          – h1_input(Φ, g, v0, s0)
  • 31. Hypothesis as Uncertain Data: relation INPUT_H1 (uncertainty: 50% for g, 33% for v0)
          Φ | g    | v0 | s0
          1 | 32   | 0  | 5000
          1 | 32   | 10 | 5000
          1 | 32   | 20 | 5000
          1 | 32.2 | 0  | 5000
          1 | 32.2 | 10 | 5000
          1 | 32.2 | 20 | 5000
  • 32. Uncertainty Introduction
      - Υ-DB is a probabilistic database [D. Suciu et al, Probabilistic Databases, 2011]
          – a Y-relation includes certain and conditional columns
          – a conditional column is a pair (Vi, Di), where Vi is a random variable and Di is one of its possible values
          – example:
              create table Y_g as
                select U.phi, U.g
                from (repair key phi in
                        (select phi, g, count(*) as Fr from INPUT_H1 group by phi, g)
                      weight by Fr) as U
  • 33. Hypothesis as Uncertain Data
      - Query:
              create table Y_g as
                select U.phi, U.g
                from (repair key phi in
                        (select phi, g, count(*) as Fr from INPUT_H1 group by phi, g)) as U
      - Input, INPUT_H1 projected on (Φ, g):
          Φ | g
          1 | 32
          1 | 32
          1 | 32
          1 | 32.2
          1 | 32.2
          1 | 32.2
      - Result, Y_g:
          Φ | V → D | g
          1 | x1 → 1 | 32
          1 | x1 → 2 | 32.2
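
The effect of the frequency-weighted key repair can be illustrated outside the database. The sketch below is a plain-Python illustration, not Υ-DB or MayBMS code: the helper name `alternatives` is hypothetical, and it assumes that each alternative value for a key receives a probability equal to its relative frequency among the rows sharing that key.

```python
from collections import Counter
from fractions import Fraction

# INPUT_H1 as shown on the slide: (phi, g, v0, s0)
INPUT_H1 = [
    (1, 32, 0, 5000),
    (1, 32, 10, 5000),
    (1, 32, 20, 5000),
    (1, 32.2, 0, 5000),
    (1, 32.2, 10, 5000),
    (1, 32.2, 20, 5000),
]


def alternatives(rows, key_idx, value_idx):
    """Map each (key, value) pair to its relative frequency within the key.

    Mirrors the idea of repairing a violated key with frequency weights:
    every distinct value seen for a key becomes one weighted alternative.
    """
    counts = Counter((row[key_idx], row[value_idx]) for row in rows)
    totals = Counter()
    for (key, _value), count in counts.items():
        totals[key] += count
    return {
        (key, value): Fraction(count, totals[key])
        for (key, value), count in counts.items()
    }


p_g = alternatives(INPUT_H1, key_idx=0, value_idx=1)
p_v0 = alternatives(INPUT_H1, key_idx=0, value_idx=2)
print(p_g)   # two alternatives for g
print(p_v0)  # three alternatives for v0
```

With this input, the two values of g each get probability 1/2 and the three values of v0 each get 1/3, which agrees with the 50% and 33% annotations on INPUT_H1.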
  • 34. Synthesizing Prediction as a query
      - Since g, ν → a is in Σ1, we can predict a as a query on the uncertain relations Y_g and Y_R:
              create table Y1_a as
                select H.phi, H.upsilon, H.a
                from H1_OUTPUT_a as H, Y_R as R, Y1_g as G,
                     (select min(tid) as tid, phi, g from H1_INPUT group by phi, g) as U
                where H.tid = U.tid and G.phi = U.phi and G.g = U.g
                  and H.phi = R.phi and H.upsilon = R.upsilon
  • 35. Predicted Y-DB relation Υ[a]
          Φ | g    | a    | u
          1 | 32   | 32   | 0.5
          1 | 32.2 | 32.2 | 0.5
      - Υ-DB enables data-oriented uncertainty quantification of predicting variables
  • 36. Sum Up
      - Υ-DB is a probabilistic database designed to manage hypotheses as data
      - In Υ-DB, both the intrinsic uncertainty of the model, Υ[R], and that of the predicting variables (e.g. Υ[a]) are automatically computed
      - Υ-DB and Research Lattices are the basis for managing hypotheses over Big Data in science (and, we believe, in any domain)
  • 38. Ordering Hypotheses
      - Different competing hypotheses must be placed into context according to their capacity to explain the phenomenon
          – predicting capability (predicting variables)
          – assumptions and constraints
  • 39. Hypotheses in the Dark Energy Survey Project
      - Phenomenon
          – The expansion of the universe is accelerating
              · Discovered in 1998 during supernovae investigation
              · Supported by redshift observations of faraway supernovae
      - Hypotheses
          – A new behaviour, Dark Energy, pushes the acceleration
          – The density of the Universe is not uniform
      - Evidence
          – Gravitational lenses
          – Galaxy clusters
  • 40. Research Lattices – structure the hypotheses of a phenomenon (lattice diagram with nodes ⊤, Dark Energy, Non-uniform universe, Weak lensing, Galaxy clustering, Earth special location, ⊥) [B. Gonçalves, F. Porto, Research Lattices, AMW 2013]
  • 41. Research Lattices
      - Each node is a hypothesis
      - Given two hypotheses h1 and h2 in a research lattice, if h1 ≥ h2 then h1 is more general than h2
      - Top corresponds to all knowledge of a domain
      - Bottom is the empty representation, i.e. the lack of knowledge
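
The ordering h1 ≥ h2 can be sketched as reachability in a graph of cover relations. Everything in the block is an assumption made for illustration: the node names come from the Dark Energy Survey lattice slide, but the edges in `COVERS` are hypothetical (the diagram's actual edges are not recoverable from the transcript), and treating "competing" as "incomparable in the order" is only one possible reading.

```python
# Hypothetical cover relations: each node maps to the nodes directly below it.
COVERS = {
    "TOP": ["Dark Energy", "Non-uniform universe"],
    "Dark Energy": ["Weak lensing", "Galaxy clustering"],
    "Non-uniform universe": ["Earth special location"],
    "Weak lensing": ["BOTTOM"],
    "Galaxy clustering": ["BOTTOM"],
    "Earth special location": ["BOTTOM"],
    "BOTTOM": [],
}


def more_general(h1, h2, covers=COVERS):
    """Return True if h1 >= h2, i.e. h2 is reachable from h1 going downwards."""
    stack, seen = [h1], set()
    while stack:
        node = stack.pop()
        if node == h2:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(covers[node])
    return False


def competing(h, covers=COVERS):
    """Hypotheses that are neither above nor below h (assumed meaning of 'competing')."""
    return sorted(
        other for other in covers
        if other != h
        and not more_general(h, other, covers)
        and not more_general(other, h, covers)
    )


print(more_general("TOP", "Weak lensing"))
print(competing("Dark Energy"))
```

A depth-first search is enough here because a lattice of this size is tiny; a production system would more likely store the transitive closure or use interval labelling.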
  • 42. Research Lattice: Acceleration (lattice diagram between ⊤ and ⊥)
      - h1: Law of free fall, $d^2 s / dt^2 = 9.8$
      - h2: Newton's first law
      - h3: Newton's second law, $F = m a_g$
      - h4: Centripetal acceleration, $a_c = 4\pi^2 r / T^2$
      - h5: Kepler's third law, $r^3 / T^2 = c$
      - h6: Law of universal gravitation, $F_g = G M m / r^2$
      - h7: Inverse-square law of distance, $a_c \propto 1 / r^2$
  • 43. Research Lattice for the Human Cardiovascular System
  • 44. Research lattice operations
      - Add/delete hypotheses
          – consistently keep the partial ordering
          – automatic placement of hypotheses in the research lattice
      - Querying
          – find hypotheses based on the “Free Fall” hypothesis
          – find competing hypotheses with respect to the “Dark Energy” hypothesis
      - How to assess the predictive capacity of a hypothesis?
  • 45. Sum up
      - A research lattice enables a formal yet bounded representation of a research domain
      - Hypotheses are the scientists' encoding of their interpretation of the phenomenon under study
  • 47. Dealing with the Space-Time dimension on Hypotheses
      - Most phenomena occur in space-time
      - In the computational model, simulations use meshes to model the physical domain
      - Predicting variables are computed at a point of a multidimensional space
          – 3D, 1D, 4D, etc.
      - The data representation is a multidimensional matrix [ArrayDB, SciDB, …]
  • 48. 1D and 3D mesh representations of a human artery
  • 49. Multidimensional Array Representation: is it efficient for processing queries over meshes such as the ones of HCVS?
  • 50. SciDB in 20 sec
      - The units of representation are multidimensional arrays
      - Each dimension has a name and a size
      - A reference to all dimensions in an array leads to a cell
      - A cell has many attributes – columnar store
      - An array may be partitioned along its dimensions
      - Two query languages: AQL and AFL
  • 52. Loading Simulation data (pipeline diagram): Simulation output → Wrapper → Unidimensional array (Geometry3d_raw) → redimension_store → Multidimensional array (Geometry3d)
  • 53. A multidimensional array in SciDB
          CREATE ARRAY Geometry3d
            < velocity_x:double, velocity_y:double, velocity_z:double,
              pressure:double,
              displacement_x:double, displacement_y:double, displacement_z:double >
            [ simulation_number=0:9,1,0,
              time_step=0:30720,1920,0,
              x_axis=0:39,40,0,
              y_axis=0:39,40,0,
              z_axis=0:39,40,0 ]
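
To get a feel for the size of this array, the dimension list can be read as (name, low, high, chunk length) tuples, leaving out the trailing overlap value of 0, and the chunk grid computed from it. The block is a back-of-the-envelope sketch in plain Python, not SciDB code; the helper names are hypothetical, and it counts logical cells only, saying nothing about how SciDB stores empty cells.

```python
import math

# (name, low, high, chunk_length), copied from the CREATE ARRAY statement.
DIMENSIONS = [
    ("simulation_number", 0, 9, 1),
    ("time_step", 0, 30720, 1920),
    ("x_axis", 0, 39, 40),
    ("y_axis", 0, 39, 40),
    ("z_axis", 0, 39, 40),
]


def chunks_per_dimension(dims):
    """Number of chunks needed along each dimension (last chunk may be partial)."""
    return {
        name: math.ceil((high - low + 1) / chunk)
        for name, low, high, chunk in dims
    }


def total_cells(dims):
    """Number of logical cells in the whole array."""
    return math.prod(high - low + 1 for _name, low, high, _chunk in dims)


def cells_per_chunk(dims):
    """Maximum number of logical cells in one chunk."""
    return math.prod(chunk for *_rest, chunk in dims)


per_dim = chunks_per_dimension(DIMENSIONS)
print(per_dim)
print("chunks:", math.prod(per_dim.values()))
print("cells:", total_cells(DIMENSIONS))
print("cells per chunk:", cells_per_chunk(DIMENSIONS))
```

Note that time_step spans 30721 values (0 to 30720 inclusive), so a chunk length of 1920 needs 17 chunks along that dimension, the last one holding a single time step.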
  • 54. Challenge: mapping an irregular mesh into a regular array structure
  • 55. SimDB
      - We developed SimDB: a layer on top of SciDB that maps irregular meshes into a regular array data representation
  • 58. Experiments set-up
      - 4 servers and 16 VMs
      - 4 queries
      - Configurations (rows: number of VMs; columns: number of servers):
          VMs | 1 server | 2 servers    | 4 servers
          1   | 10GB     |              |
          2   | 5GB      | 5GB / 10GB   |
          4   | 2.5GB    | 2.5GB / 5GB  | 2.5GB / 5GB / 10GB
          8   |          | 2.5GB        | 2.5GB / 5GB
          16  |          |              | 2.5GB
  • 59. Experiments
      - 8 SciDB instances per VM
      - 1, 2, 4, 8, 16 VMs
      - 7 queries
      - 2 arrays: (S, T) and (T, S)
      - 30 executions
  • 60. Results (chart: execution time of Queries 1 to 4 for each VM (Server) configuration, from 1(1) to 16(4))
      - Queries
          1. select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
          2. select avg(pressure) from Geometry3d group by simulation_number
          3. select avg(pressure) from Geometry3d group by time_step
          4. select avg(pressure) from Geometry3d where simulation_number < 2 group by time_step
  • 61. Results
      - Queries
          1. select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
          2. select avg(pressure) from Geometry3d group by simulation_number
          3. select avg(pressure) from Geometry3d group by time_step
          4. select avg(pressure) from Geometry3d where simulation_number < 2 group by time_step
          5. select avg(pressure) from Geometry3d where simulation_number = 0 and time_step < 1920 group by time_step
          6. select avg(pressure) from Geometry3d where simulation_number < 0 and time_step < 0 group by time_step
          7. select avg(pressure) from Geometry3d where (time_step % 512) = 0 group by time_step
  • 62. Final Remarks
      - We argue that “Big Data Analytics require a scientific approach based on hypotheses formulation and follow-up”
      - Υ-DB is an innovative approach for Big Data management
          – Reflects the hypothesis-as-data principle
          – Is formal and guards the equivalence between data and models
          – Models uncertainty in the model and in the data
          – Must be extended
              · to cope with observation validation (Bayesian model)
              · to support multidimensional representation
          – Read our paper at VLDB 2014
      - SimDB is an extension to SciDB to efficiently store irregular meshes on multidimensional array systems
  • 63. Final Remarks
      - Space-time models require a multidimensional data model for hypothesis-as-data management
      - SciDB is a parallel multidimensional array DBMS
      - We want to extend Υ-DB to multidimensional arrays
          – still very immature
  • 64. This is DEXL team work
      - PhD Candidate Bernardo Gonçalves, IBM PhD Fellowship 2013-2014 (bgonc@lncc.br)
      - Dr Ramon Gomes Costa (ramongc@lncc.br)
      - MSc student Hermano Lustosa (hllustosa@gmail.com)
  • 67. Olympic Laboratory (EMC Summer School 2013)
      - Objective
          – To study high-performance sports as a science discipline
          – To build the first sports laboratory in South America
      - US$ 10M project sponsored by FINEP (funding agency)
      - Departments:
          – Biochemistry, physiology, genetics, nutrition, computational modeling, computer science
  • 68. Our task
      - To support athletes' follow-up data
          – Athlete’s training
          – Variation in biochemical elements
          – Variation in biometric variables
      - More recently
          – For some modalities, integrate meteorological conditions
  • 69. Analyses Board
  • 70. Athletes follow-up database
      - Athletes' follow-up data are modeled as trajectories
          – Register measurements from athletes in different training states
      - Trajectory model
          – Ordered set of measurements
          – Division of time into training states
          – Materialized view limited in time range
          – Imprecise measurements
              · Not detected = 0
              · < x → ]0, x[
              · y, y ≥ x
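
The three cases of imprecise measurements can be captured by storing every reading as an interval. The sketch below is an assumption-laden illustration, not the project's data model: the `Measurement` class, the `parse_measurement` helper and the raw string encodings ("ND" for not detected, "<x" for below the detection limit x) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A reading stored as an interval; exact values have low == high."""
    low: float
    high: float
    low_open: bool = False
    high_open: bool = False


def parse_measurement(raw: str) -> Measurement:
    """Turn a raw lab value into an interval.

    Assumed encodings: "ND" = not detected, "<x" = below detection limit x,
    anything else = an exact numeric value y.
    """
    text = raw.strip()
    if text.upper() == "ND":
        return Measurement(0.0, 0.0)              # not detected = 0
    if text.startswith("<"):
        limit = float(text[1:])
        return Measurement(0.0, limit, low_open=True, high_open=True)  # ]0, x[
    value = float(text)
    return Measurement(value, value)              # exact value y


print(parse_measurement("ND"))
print(parse_measurement("< 5"))
print(parse_measurement("7.2"))
```

Keeping the open/closed flags explicit lets later queries distinguish "somewhere below the detection limit" from a genuine zero.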
  • 71. More on Athlete’s Trajectories
      - Stops – modelled as measurements
          – Qualified according to the athlete’s training state
          – Training states (recovery, training, rest, …)
      - Moves – extrapolation between two stops
      - Trajectory – the set of measurements, ordered in time and limited in time according to some criterion (e.g. a training program)
          – Measurements of the same observable element
          – Measurements of the same athlete
  • 75. Challenges
      - Integrating athletes' trajectories with weather information
      - How to efficiently store metaphoric trajectories?
          – Trajstore [Cudre-Mauroux et al ICDE 2010]
          – SciDB
      - How to express and efficiently process similar trajectories?