6. Data deluge
• At end of 2011 – info created and replicated > 1.8 zettabytes
• 90% of all data was created in the last 2 years
• 5 hour flight – 240 Tbytes
• Facebook – 200 million users, >70 languages
• Each person in England is filmed 300 times/day
• Teenagers in the US send an average of 110 text messages a day
=> We need to build arks during the deluge - PRESERVATION
7. Outline
• Why preserve?
• What to preserve?
• How to preserve?
• Where to preserve?
And a few associated challenges
9. WHY PRESERVE
• Costly to produce
• Contribute to progress of science
• Intrinsic value
culture/science/sustainability
10. WHY PRESERVE
• Costly to produce
– Infrastructure, power, software, models, visualization,
people
– Hardware, Software, Peopleware
• Contribute to progress of science
– Reproducibility and reusability
– Publication and sharing
– Quality
• Intrinsic value culture/science/sustainability
– Digital humanities
– Domesday project
– Fonoteca Neotropical Jacques Vieillard
13. The Domesday Project 1086-1986
• Digital decay
• Equipment obsolescence
• Software obsolescence
21. What to preserve?
• Data
• BUT what is “data”?
– Files and records
– Models, documentation, annotations, sketches,
experiments, recordings
• Only data?
– How it was produced – workflows, devices,
methodologies, materials and methods,
reasoning, logs --- provenance
22. What to preserve?
• Data
• Environment in which it was produced
• The data needed for preservation occupies more
space than the data itself
• Preservation means storing more than the object
itself
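A minimal sketch of the point above, assuming a preserved object is stored together with a small provenance record. The class names, field names and the sample reading are all hypothetical, chosen only for illustration:

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Provenance:
    """How the data was produced (field names are illustrative only)."""
    workflow: str
    device: str
    method: str
    log: List[str] = field(default_factory=list)


@dataclass
class PreservedObject:
    """A data object kept together with its production context."""
    payload: bytes
    provenance: Provenance

    def manifest(self) -> str:
        """Serialize the context plus a checksum of the payload as JSON."""
        record = {
            "sha256": hashlib.sha256(self.payload).hexdigest(),
            "size_bytes": len(self.payload),
            "provenance": asdict(self.provenance),
        }
        return json.dumps(record, sort_keys=True)


# Hypothetical example: a single temperature reading and its context.
reading = PreservedObject(
    payload=b"21.5",
    provenance=Provenance(
        workflow="hypothetical-calibration-v1",
        device="hypothetical thermometer",
        method="single reading, degrees Celsius",
        log=["sensor warmed up", "reading taken"],
    ),
)
manifest = reading.manifest()

# Compare the size of the object with the size of its context record.
print(len(reading.payload), len(manifest.encode("utf-8")))
```

Here the context record is far larger than the four-byte reading it describes, which is the situation the slide warns about.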
23. What about our research data?
(slide adapted from Jim Gray)
[Diagram; labels: Experiments, Instruments, Simulations, Models, Files, Papers, Questions, Answers, DATA]
Data-driven science “Collaboratory”
24. Data sources?
Table of Product Characteristics
id          Property name   Value
MilkProd    productsrep     MilkA
MilkProd    quantity        10000
MilkProd    validity date   10/06/2006
CheeseProd  productsrep     Minas
CheeseProd  quantity        2000
CheeseProd  validity date   12/02/2006
CheeseProd  shape           Circular
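The table stores one (id, property name, value) triple per row, so each product can carry a different set of properties. As a sketch, assuming the wrapped cells in the cheese rows read `productsrep` and `quantity` like the milk rows, the triples can be regrouped into one record per product. The `pivot` helper is hypothetical:

```python
from collections import defaultdict

# Rows from the slide as (id, property name, value) triples.
# Assumption: the wrapped cells read "productsrep" and "quantity".
rows = [
    ("MilkProd", "productsrep", "MilkA"),
    ("MilkProd", "quantity", "10000"),
    ("MilkProd", "validity date", "10/06/2006"),
    ("CheeseProd", "productsrep", "Minas"),
    ("CheeseProd", "quantity", "2000"),
    ("CheeseProd", "validity date", "12/02/2006"),
    ("CheeseProd", "shape", "Circular"),
]


def pivot(triples):
    """Group property/value pairs under the id they describe."""
    records = defaultdict(dict)
    for entity, prop, value in triples:
        records[entity][prop] = value
    return dict(records)


products = pivot(rows)
print(products["CheeseProd"]["shape"])   # Circular
print("shape" in products["MilkProd"])   # False
```

Only the cheese record has a `shape`, which a fixed-column table could express only with an empty cell for milk.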
30. How to preserve?
How to construct the ark during the
deluge?
Praeservare, Manutenere and Share
31. How to preserve?
• To ensure retrievability and sharing
– Index structures
– Ontologies, metadata, keywords, standards
– Workflows
• To ensure longevity
– Media decay, software decay, hardware decay
• To ensure quality
– Curation procedures
• To afford maintenance costs
– Cloud? CAP theorem?
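One way to picture how index structures and keywords support retrievability is an inverted index that maps each keyword to the records tagged with it. This is only a sketch: the record ids, the keywords and both helper functions are made up for illustration.

```python
from collections import defaultdict


def build_index(catalog):
    """Map each lower-cased keyword to the set of record ids tagged with it."""
    index = defaultdict(set)
    for record_id, keywords in catalog.items():
        for keyword in keywords:
            index[keyword.lower()].add(record_id)
    return index


def search(index, *keywords):
    """Return the ids of records tagged with every given keyword."""
    hits = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*hits) if hits else set()


# Hypothetical catalog: record id -> descriptive keywords (metadata).
catalog = {
    "rec-001": ["frog", "call", "Brazil"],
    "rec-002": ["bird", "song", "Brazil"],
    "rec-003": ["frog", "call", "Peru"],
}
index = build_index(catalog)
print(sorted(search(index, "frog", "brazil")))  # ['rec-001']
```

A record with no keywords never appears in the index, so richer metadata directly widens what can be retrieved.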
36. How to preserve?
• To ensure retrievability and sharing
– Index structures
– Ontologies, metadata, keywords, standards
– Workflows
• To ensure longevity
– Media decay, software decay, hardware decay
– PEOPLE DECAY
• To ensure quality
– Curation procedures, metadata, standards
• To afford maintenance costs
– Cloud? CAP theorem? => WHERE
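Media decay can at least be detected by recording a checksum when an object enters the archive and recomputing it later. The sketch below assumes SHA-256 as the checksum; the function names and the sample bytes are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded: str) -> bool:
    """Check the data against the fingerprint recorded at archiving time."""
    return fingerprint(data) == recorded


original = b"archived recording"
recorded = fingerprint(original)

# Simulate media decay by flipping a single bit in the first byte.
decayed = bytes([original[0] ^ 0x01]) + original[1:]

print(is_intact(original, recorded))  # True
print(is_intact(decayed, recorded))   # False
```

A checksum only reports that bits have changed. It does not repair them, and it does nothing against software, hardware or people decay, which need the preserved context and documentation.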
37. Sharing and open access
NSF Data Management Policy
Paper and data publication
39. Sharing of Data Leads to Progress on Alzheimer’s
By Gina Kolata, The New York Times, published August 12, 2010
In 2003, a group of scientists and executives from the National Institutes of Health, the Food and
Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups
joined in a project that experts say had no precedent: a collaborative effort to find the biological
markers that show the progression of Alzheimer’s disease in the human brain.
“... share all the data, making every single finding public immediately,
available to anyone with a computer anywhere in the world”
=> AVAILABILITY and REUSE
40. • Data must be properly curated throughout its
life-cycle and released with the appropriate
high-quality metadata.
• Medical Research Council UK
41. • Research data should be made available for
use by other researchers. Researchers must
retain research data, including electronic data,
in a durable, indexed and retrievable form.
• Australian Government National Health and
Medical Research Council
44. • Citing data is as important as citing papers
• For researchers, publishers, data centers
• Over 1M DOI, several major national research
libraries
– Germany, France, Korea, Netherlands, Australia,
USA...
• Present manager – German National Library of
Science and Technology
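As an illustration of citing data like a paper, the sketch below formats a citation string from a dataset's metadata and its DOI. The `Dataset` fields, the citation layout and the sample values (including the DOI) are hypothetical; only the `https://doi.org/` resolver prefix is a real convention.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Dataset:
    """Minimal metadata needed to cite a dataset (fields are illustrative)."""
    creators: Tuple[str, ...]
    year: int
    title: str
    publisher: str
    doi: str


def cite(ds: Dataset) -> str:
    """Format a citation as: creators (year): title. publisher. DOI link."""
    authors = "; ".join(ds.creators)
    return (
        f"{authors} ({ds.year}): {ds.title}. "
        f"{ds.publisher}. https://doi.org/{ds.doi}"
    )


# Hypothetical dataset; the DOI below is a placeholder, not a real one.
example = Dataset(
    creators=("Doe, J.", "Roe, R."),
    year=2012,
    title="Example field recordings",
    publisher="Example Data Center",
    doi="10.1234/example",
)
print(cite(example))
```

Because the DOI stays fixed even if the data moves, the same citation string keeps resolving to the dataset.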
45. Publish on the Cloud
Add metadata
Pre-print sharing
46. FNJV
proj.lis.ic.unicamp.br/fnjv
• Sharing by publishing on the Web
• Retrievability by extending metadata
54. Outline
• Why preserve?
• What to preserve?
• How to preserve?
• Where to preserve?
And a few associated challenges
PRE-SAVE and MANU-TENERE
55. Outline
• Why preserve?
– Costly to produce (hardware, software, peopleware)
– Contribute to progress of science
– Value – culture, science, sustainability
• What to preserve?
– Data [WHAT IS DATA?]
– Context of production and use
• How to preserve?
– Accessibility and sharing – standards, metadata,
ontologies
– Integrity and quality – context to use (hw, sw),
standards
57. References
NSF – CISE Data management policy
The Domesday Project
http://www.atsf.co.uk/dottext/domesday.html
The CLARIN Project (languages)
Eigenfactor.org
Altmetrics movement