“Hot Topics: The DuraSpace Community Webinar Series,” Series Six: “Research Data in Repositories.” Curated by David Minor, Research Data Curation Program, UC San Diego Library. Webinar 3: “Researcher Perspectives of Data Curation”
Presented by: David Minor, Research Data Curation Program, UC San Diego Library; Dick Norris, Professor, Scripps Institution of Oceanography; and Rick Wagner, Data Scientist, San Diego Supercomputer Center.
10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides
1. Hot Topics Web Seminar Series: Research Data in Repositories
The UC San Diego Experience
Third Webinar: The Researcher Perspective
2. Reminder: General Series Info
• First webinar: Intro and Framing: UC San Diego decisions and planning
• Second Webinar: Deep dive into technology and metadata
• Third Webinar: The perspective from researchers, next steps
3. Reminder: General Series Info
Slides and presentations from previous
webinars are available for download!
http://www.duraspace.org/hot-topics
4. Your esteemed presenters …
First webinar:
David Minor – Program Director, Research Data Curation
Declan Fleming - Chief Technology Strategist
Second webinar:
Declan Fleming - Chief Technology Strategist
Arwen Hutt - Metadata Librarian
Matt Critchlow - Manager of Development and Web Services
Third webinar:
David Minor – Program Director, Research Data Curation
Dick Norris – Professor, Scripps Institution of Oceanography
Rick Wagner – Data Scientist at San Diego Supercomputer Center
5. Today we will …
Discuss how researchers have approached
curation and data management
6. Reminder: UCSD Research Data Curation Pilots
• The Brain Observatory
• NSF OpenTopography Facility
• Levantine Archaeology Laboratory
• Scripps Institution of Oceanography Geological Collections
• The Laboratory for Computational Astrophysics
9. Rick Wagner
High Performance Computing Manager at the San Diego Supercomputer Center
Ph.D. Candidate within The Laboratory for Computational Astrophysics
10. SIO Geological Collections
Curator: Dick Norris
Collections Manager: Alexandra Hangsterfer
• Part of the International Marine and Lacustrine Geological Collections
• With collections at Oregon State, Columbia, Woods Hole, USGS and more
11. Our Collection: Sediment cores and rocks
recovered from the oceans & long-lived lakes
Reef sediment-Panama
Salton Sea-CA
12. How we get them….
Mostly by Sea
(Ship, Cruise, Leg)
But also by Land
Country, Locality, Lat/Long
14. A collection event is an Object and includes:
• Specimen(s) (Latitude/Longitude)
• Ship name and cruise number
• Text descriptions
• Thin-sections
• Images, field notes, publications
• Location in the repository
• International GeoSample Number (IGSN)
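As a rough illustration of the "collection event as an Object" idea, the list above could be modeled as a single record type. This is only a sketch: the class name, field names, and validation rules are assumptions made for the example and are not taken from the UC San Diego repository schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CollectionEvent:
    """One sampling event treated as a single object (hypothetical field names)."""

    igsn: str                  # International GeoSample Number
    ship: str                  # ship name
    cruise: str                # cruise number or identifier
    latitude: float            # decimal degrees, north positive
    longitude: float           # decimal degrees, east positive
    description: str = ""      # free-text description
    repository_location: str = ""  # where the specimen is stored
    thin_sections: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)  # images, field notes, publications

    def __post_init__(self) -> None:
        # Reject coordinates that cannot be real positions.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")
```

Keeping every associated item on one object means a single identifier (here the IGSN) can link the specimen to its images, notes, and storage location.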
15. The Sediment Core Collection
Archive and Working halves of ~7000 cores from the world’s oceans
Typically 3-5 sections/core
+ core photos, chemical data and sampling history
The IODP Core collection, Bremen, Germany
16. The Marine Rock collection…
• ~4000 dredge sites worldwide
• In an 8000 sq ft building
• Volcanic rocks, manganese nodules, reef rock
17. Our data resides with NGDC…
• NOAA’s National Geophysical Data Center
• And IGSNs with Lamont’s SESAR
18. NGDC searches on ships, repositories,
sampling systems, and locations
But no keyword search, automated data input, ways to link associated
data, returns on nearest search terms, sampling history, etc….
19. What the Community Wants
• A unified National geo-referenced system
• Exploratory search by nearest word and map-based system
• Links to associated data types (images, text,
data, references…)
• All data types linked by IGSNs
• Data entry through web forms with
publication by curators
20. What we did with RCI
• Identified one type of object
– Based on sampling events
– Ship-Cruise-Sampling device-Sample number
– Geo-referenced
– Includes associated materials: text description,
images, chemical data, references, records of
sampling event, sampling records, storage location
• NGDC records imported into UC Library
system
• Records searchable by any word in a record
21. What’s next?
• NSF-sponsored SESAR (System for Earth
Sample Registration)
– Created the International GeoSample Number
– http://www.geosamples.org/
• NSF-sponsored workshop:
– Digital Environment for Sample Curation (June
2013)
– http://www.geosamples.org/news/descwebinarmaterials
• NSF “EarthCube” initiative
22. CyberInfrastructure needs (from DESC)
• Offline data entry at sea or in the field
• DESC should respect data moratoriums (typically 2
years, if collected with NSF grants)
• Automated release to public at close of moratorium
• Secure login-based data serving for project scientists
• Flexible search and access for users to view public
archive (view by location name, type, bounding
region) and associated data
• Flexible sample request submission
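The moratorium and automated-release needs above amount to a date rule. A minimal sketch follows, assuming the "typically 2 years" figure from the slide as a default; the function names are hypothetical, and a real system would take the moratorium length from the grant or project record.

```python
from datetime import date

MORATORIUM_YEARS = 2  # assumption: the "typically 2 years" figure, not a fixed rule


def release_date(collected: date, years: int = MORATORIUM_YEARS) -> date:
    """Return the date on which a record collected on `collected` becomes public."""
    try:
        return collected.replace(year=collected.year + years)
    except ValueError:
        # Collected on Feb 29 and the target year is not a leap year: use Feb 28.
        return collected.replace(year=collected.year + years, day=28)


def is_public(collected: date, today: date, years: int = MORATORIUM_YEARS) -> bool:
    """True once the moratorium has closed, so the record can be released automatically."""
    return today >= release_date(collected, years)
```

A scheduled job could call `is_public` for each record and flip its visibility, which would cover the "automated release to public at close of moratorium" item without manual curator action.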
23. More cyberInfrastructure needs
• Display stored datasets and images hosted on other
servers (as in other repositories)
• Connections with standard visualization tools such
as Corelyzer, Correlator, PSICAT, CoreRef, GMT,
GeoMapApp
• Sampling database should be easily accessible by
researchers to submit requests
• Automatically updated by repository (personnel) to
reflect samples sent to the researchers
• Way of entering historical sampling information
24. These are general issues for
Natural History Collections
• Most museums have similar issues to us
– Geo-Referenced collections
– Mix of physical specimens, images, text
descriptions, sampling data, and affiliated data
files
– Many have home-grown databases that are not
interoperable with other museums
Fish from the SIO Marine
Vertebrates Collection
25. Natural History Collections
• Need controlled vocabularies but flexibility to
search on variants
– Since nobody agrees on common vocabularies
• Value in cross-referencing to related
collections
– Such as samples (geology, biology, water)
collected on a cruise with ship track, sea floor
maps…
– Presently working on “Rolling deck to Repository”
NSF project
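One way to combine a controlled vocabulary with "flexibility to search on variants" is a lookup table that maps each known variant to a preferred term. The sketch below is illustrative only: the terms and variants are invented examples, not an agreed community vocabulary.

```python
from typing import Dict, List, Optional

# Hypothetical vocabulary: preferred term -> known variants (invented examples).
VARIANTS: Dict[str, List[str]] = {
    "manganese nodule": ["mn nodule", "manganese nodules", "ferromanganese nodule"],
    "piston core": ["piston corer", "piston cores"],
}


def normalize(term: str) -> str:
    """Lower-case the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def build_lookup(variants: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every preferred term and every variant to its preferred term."""
    lookup: Dict[str, str] = {}
    for preferred, alternates in variants.items():
        lookup[normalize(preferred)] = preferred
        for alternate in alternates:
            lookup[normalize(alternate)] = preferred
    return lookup


def canonical(term: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the preferred term for a search word, or None if it is unknown."""
    return lookup.get(normalize(term))
```

Records stay indexed under one preferred term, while users can still search with whichever variant they happen to use.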
27. Research group focusing on numerical modeling of complex astrophysical
processes: cosmology, galaxy formation, turbulence, radiation hydrodynamics,
magneto-hydrodynamics, …
Image credit: NASA, IoA, A. Fabian et al.
28. Our simulations are large, based on the current definition of “large” (we grow
with the technology). Typical results are 1-100 TB.
29. This work is costly in terms of both the computer time and human effort, and we
see a benefit to the science community in sharing. (Citations are nice, too.)
http://bit.ly/sB30f1
http://bit.ly/IzTVV2
http://bit.ly/IE4iFd
http://bit.ly/HFYLQJ
30. Prior Sharing Efforts
Participation in the Virtual
Observatory
• Standards for simulation
metadata, search, and retrieval
• An odd fit beside the “pure”
astronomy projects and data
centers
• But, it meant we weren’t starting
from scratch in terms of describing
our data
Started the curation effort very
curious about how much of this
previous work would translate to
library space
Also wanted stable platform for data
hosting (e.g., not a closet server)
31. Curation Process
By E.gordienko (Own work) [CC-BY-SA-3.0
(http://creativecommons.org/licenses/by-sa/3.0) or GFDL
(http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
Several steps:
• Choosing the pilot dataset
• Cleaning up simulation cruft
• Identifying related publications
• Adding historical documents
(proposals, reports, etc.)
• Organizing the various data groups
• Simulations are collections of
datasets from various points in
time, so we needed a description for
each type of digital object in each
dataset
• Bundle, checksum, and handoff
Decided near the end to replicate
the metadata record to a second site
as a test of its portability
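The "bundle, checksum, and handoff" step can be sketched as building a manifest of file checksums that the receiving side recomputes after transfer. This is a generic illustration under assumed conventions (SHA-256, relative POSIX paths as keys); the slides do not say which checksum algorithm or packaging format was actually used.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-terabyte outputs never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: Path, manifest: Dict[str, str]) -> bool:
    """True if the files under root match the manifest exactly (no changes, extras, or gaps)."""
    return build_manifest(root) == manifest
```

The sender would run `build_manifest` before handoff and ship the result alongside the data; the receiver runs `verify_manifest` on its copy to confirm nothing was lost or altered in transit.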
33. Final result:
• Datasets from a high-resolution
cosmology simulation held at
UCSD
• Viewable both at UCSD, and via
the Online Archive of California
• Raw simulation data and various
analysis results accessible over
HTTP
34. Some thoughts:
• When it comes to metadata formats, libraries are like any other science
domain and speak their own language
• If you have a highly-specialized domain-specific metadata dialect or
language, you may need an additional discovery service
• If not, it’s a good starting point
• We’re working on repeating this process on our own for another simulation
35. Next steps at UC San Diego
Move from pilot services to a scalable series of processes.
Work with additional researchers in same domains.
Work with new domains.
Broaden lifecycle management
mindset on campus.
36. Questions?
Rick Wagner - rpwagner@sdsc.edu
Richard Norris - rnorris@ucsd.edu
David Minor - dminor@ucsd.edu
http://www.duraspace.org/hot-topics