Presentation by Bill Michener asking whether institutional and subject-specific data repositories can co-exist, given as a 'provocation' in the final panel session at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK.
Michener-institutional and subject-specific data repositories-nfdp13
1. Can Institutional and Subject-Specific Data Repositories Co-Exist?
William Michener
University Libraries
University of New Mexico
22 May 2013
2. The Long Tail of Orphan Data
Volume
Rank frequency of datatype
Well-curated/-preserved
Orphan data
(B. Heidorn)
Characteristics
• Big Science
• Large Volume
• Automated sensors
• Well described
• Well curated
• Easily Discovered
• Small Science
• Small Volume
• Poorly described
• Rarely Indexed
• Invisible to scientists
• Rarely Used
• Dark Data
• High spatial resolution
• Process based
• Theory Development
• Model Development
• Benchmarking
3. The Long Tail of Orphan Data
Volume
Rank frequency of datatype
Subject repositories
Institutional repositories
(B. Heidorn)
No repositories
5. DataONE: Federating Data
Providing universal access to data about life on earth and the environment that sustains it
1. Building community
2. Developing sustainable data discovery and interoperability solutions
3. Enabling science through tools and services
6. Metadata Interoperability
KNB
LTER
ORNL DAAC
Internal Metadata Index
CDL
Coordinating Nodes
Metadata Extraction
• Virtual Portals
• Numerous search capabilities
• Metadata has link to data, which reside at Member Nodes
USGS CSAS
DSpace,
iRODS …
EML, ISO
FGDC
FGDC, ISO
EML
FGDC
Dublin Core
Darwin Core
…
FGDC, ISO
Member Nodes
*Others
Editor's Notes
There is widely used infrastructure for certain well-defined “easy” biological datatypes like DNA sequences and protein structures. But these repositories are not adequate to capture the many datasets that require more context to be reusable. Our civilization is not wealthy enough to ever support the variety of specialized repositories that would be needed, or the curation that would be needed to standardize these data. Big science: large-volume data sets from sensors (NEON, remote sensing). Small science: orphan data, dark data, higher resolution, poorly described.