Presented by Declan Fleming, Arwen Hutt, and Matt Critchlow. The second in a three-part webinar series on Research Data Curation at UC San Diego, part of the larger Research Cyberinfrastructure initiative.
Duraspace Hot Topics Series 6: Metadata and Repository Services
1. Hot Topics Web Seminar Series: Research
Data in Repositories
The UC San Diego Experience
Second Webinar: Metadata and Repository Services
for Research Data Curation
2. General Series Intro
• First webinar: Intro and Framing: UC San Diego
decisions and planning
• Second webinar: Deep dive into technology and
metadata
• Third webinar: The perspective from researchers,
next steps
3. Your esteemed presenters …
First webinar:
David Minor – Program Director, Research Data Curation
Declan Fleming - Chief Technology Strategist
Second webinar:
Declan Fleming - Chief Technology Strategist
Arwen Hutt - Metadata Librarian
Matt Critchlow - Manager of Development and Web Services
Third webinar:
Dick Norris – Professor, Scripps Institution of Oceanography
Rick Wagner – Data Scientist at San Diego Supercomputer Center
4. Today we will …
• Discuss real-world researcher interaction
• Document how metadata and files combine to make
digital objects
• Describe the DAMS data model and how it supports
complex research objects
• Detail the technology driving the DAMS
• Point to the future
5. Working with Researchers: Pilots
• The Brain Observatory
• NSF OpenTopography Facility
• Levantine Archaeology Laboratory
• Scripps Institution of Oceanography
Geological Collections
• The Laboratory for Computational
Astrophysics
6. Working with Researchers: Process
• Introductory meeting
• Metadata point person
• Ongoing discussions
• One on one work
Iterative, collaborative, customized, experimental…pilot!
8. Working with Researchers: What is an object?
• What are the boundaries on a discrete set or
subset of data? What is required to make the
data intelligible, usable and reusable?
• What needs to be preserved?
• What do they want to display and/or share?
• What do they want to be able to refer to or
cite?
10. Working with Researchers: Take Aways
They are the subject experts
There are many broad-level similarities
But no such thing as one size fits all
11. We want a new data model…
• One that is flexible and accommodates disparate
metadata from a variety of sources
• While promoting consistency within the data store
• One that supports relationships within and between
objects
• One that is more community engaged, both sharing
our vocabularies and technology and using others'
shared vocabularies and technologies
• One that supports improved management of objects
and metadata
12. DAMS Data Model Development Process
• Five people, in a room, 16 hours a week for 4
months
• Worked through existing data, use case scenarios,
known data requirements, investigated known
ontologies, etc.
• Lots and lots and lots of discussion
• Utilizes MADS (Metadata Authority Description
Schema)
• Results = a data dictionary and an OWL ontology
• Living document
13. DAMS Data Model: Flexibility
• The data model provides enough flexibility
that we can accommodate a wide variety of
data within the schema
– Vocabularies
– Use of “types” or “display labels” to distinguish
specific subtypes of a data field
– Flexible structures and relationships
– Extensible
14. DAMS Data Model: Consistency
• But enough consistency that searching and
display rules do not need to be customized for
each individual collection of material
– Rules can be applied at the level of the broader
concept
• As well as establishing the organizational
structure necessary for maintaining
consistency over time
– Evaluation and approval of modifications
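The combination of flexible subtypes and concept-level rules can be sketched roughly as follows. This is an illustration only, not the DAMS implementation: the field names (`concept`, `type`, `display_label`) and the sample values are assumptions made for the example.

```python
# Hypothetical metadata fields from one object. Each field belongs to a
# broad concept ("note", "date") and carries a more specific type and,
# optionally, a collection-specific display label.
fields = [
    {"concept": "note", "type": "methods",
     "display_label": "Sampling methods", "value": "Piston core, 10 cm intervals"},
    {"concept": "note", "type": "description",
     "value": "Hypothetical sediment core record"},
    {"concept": "date", "type": "collected",
     "display_label": "Date collected", "value": "2012-06-01"},
]


def display_rows(fields):
    """Build (concept, label, value) rows using one rule for every collection.

    The label comes from the display label when present, otherwise from the
    type, so no per-collection display code is needed.
    """
    rows = []
    for f in fields:
        label = f.get("display_label") or f["type"].capitalize()
        rows.append((f["concept"], label, f["value"]))
    return rows


def searchable_values(fields, concept):
    """Apply a search rule at the level of the broad concept, ignoring subtype."""
    return [f["value"] for f in fields if f["concept"] == concept]


rows = display_rows(fields)
notes = searchable_values(fields, "note")
```

A new collection can introduce its own types or display labels without any change to `display_rows` or `searchable_values`, which is the point of applying rules at the broader concept.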
15. DAMS Data Model: Relationships
• It allows us to create a number
of different relationships
– Collections and sub-collections
– Collections and objects
– Objects and components
(complex hierarchical objects)
– Other related resources internal
or external to the DAMS
[Figure: complex object example]
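The relationships listed on this slide can be pictured as a small set of nested structures. The sketch below is a hypothetical simplification for illustration; the class and attribute names are assumptions and do not reproduce the DAMS ontology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A part of an object; components may nest to form a hierarchy."""
    title: str
    files: List[str] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)


@dataclass
class DigitalObject:
    """An object made of components, with links to related resources."""
    title: str
    components: List[Component] = field(default_factory=list)
    related_resources: List[str] = field(default_factory=list)  # internal or external URIs


@dataclass
class Collection:
    """A collection holds objects and, optionally, sub-collections."""
    title: str
    sub_collections: List["Collection"] = field(default_factory=list)
    objects: List[DigitalObject] = field(default_factory=list)


def count_files(node) -> int:
    """Count files under an object or component by walking the hierarchy."""
    own = len(getattr(node, "files", []))
    return own + sum(count_files(child) for child in node.components)


# Hypothetical complex object: a core with sections and a nested sample.
core = DigitalObject(
    title="Sediment core (hypothetical)",
    components=[
        Component("Section 1", files=["section1.tif"], components=[
            Component("Sample 1a", files=["sample1a.csv", "sample1a.jpg"]),
        ]),
        Component("Section 2", files=["section2.tif"]),
    ],
    related_resources=["https://example.org/related-dataset"],
)
collection = Collection("Pilot collection (hypothetical)", objects=[core])

print(count_files(core))  # 4
```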
16. DAMS Data Model: Vocabularies
• Allow management of local & community
vocabularies
– Vocabulary terms as entities
– Ability to encode authority data (vocabulary
source, value uri, etc.) as well as sameAs
relationships between the same term expressed in
multiple sources
– Ability to update authority records as community
vocabularies become more formalized.
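Treating vocabulary terms as entities means that objects point at a term record rather than copying a string, so an update to the authority record reaches every object that uses it. The following is a minimal sketch under that assumption; the class, the `formalize` helper, and the example.org URIs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class VocabularyTerm:
    """A vocabulary term managed as its own entity."""
    label: str
    source: Optional[str] = None     # vocabulary source, e.g. "local"
    value_uri: Optional[str] = None  # URI of the term within that source
    same_as: Set[str] = field(default_factory=set)  # same term in other sources


def formalize(term: VocabularyTerm, source: str, value_uri: str) -> VocabularyTerm:
    """Point the term at a newly formalized community vocabulary.

    The previous URI is kept as a sameAs link rather than discarded.
    """
    if term.value_uri:
        term.same_as.add(term.value_uri)
    term.source = source
    term.value_uri = value_uri
    return term


# Two hypothetical object records share one term entity.
term = VocabularyTerm("Foraminifera", source="local",
                      value_uri="https://example.org/local/terms/1")
records = [{"title": "Object A", "subject": term},
           {"title": "Object B", "subject": term}]

formalize(term, "community-vocabulary",
          "https://example.org/community/foraminifera")

# Both records now see the updated authority data.
uris = {r["subject"].value_uri for r in records}
```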
17. DAMS Data Model: Management
• One that supports improved management of
objects and metadata
– Authority management of vocabulary terms
– Event metadata!
19. Preservation: Chronopolis
Current DAMS Process
1. Create BagIt bags for all objects
2. Host via HTTP(S)
3. Bags are retrieved and ingested into Chronopolis
DAMS4 Process
1. Create BagIt bags for Δ (changed) objects using Event metadata
2. Host via HTTP(S) or enqueue on messaging queue for
ingestion
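The difference between the two processes is that DAMS4 bags only the objects that changed since the last run. A rough sketch of that idea follows. The event record layout and both function names are assumptions, and the bag writer produces only the minimal BagIt layout (declaration file, `data/` payload, checksum manifest); a real workflow would more likely rely on an established BagIt library.

```python
import hashlib
import tempfile
from datetime import datetime
from pathlib import Path


def changed_object_ids(events, since):
    """Return sorted ids of objects with at least one event after `since`."""
    return sorted({e["object_id"] for e in events if e["date"] > since})


def write_bag(bag_dir: Path, payload: dict) -> Path:
    """Write a minimal BagIt-style bag: bagit.txt, data/ files, SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    manifest = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest) + "\n", encoding="utf-8"
    )
    return bag_dir


# Hypothetical event metadata recorded against objects.
events = [
    {"object_id": "obj1", "type": "created", "date": datetime(2013, 9, 1)},
    {"object_id": "obj2", "type": "created", "date": datetime(2013, 9, 1)},
    {"object_id": "obj2", "type": "modified", "date": datetime(2013, 10, 5)},
    {"object_id": "obj3", "type": "created", "date": datetime(2013, 10, 7)},
]
last_run = datetime(2013, 10, 1)

delta = changed_object_ids(events, last_run)  # ['obj2', 'obj3']

with tempfile.TemporaryDirectory() as tmp:
    for object_id in delta:
        write_bag(Path(tmp) / object_id, {"metadata.xml": b"<placeholder/>"})
    bag_names = sorted(p.name for p in Path(tmp).iterdir())
    manifest_text = (Path(tmp) / "obj2" / "manifest-sha256.txt").read_text()
```

Only `obj2` and `obj3` are bagged here; `obj1` has no events after the last run and is skipped.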
21. Storage: EMC Isilon 72NL
Storage For Library Collections
1 cluster of 5 Nodes
1 Node = 36 x 2TB Drives
Total Current Usable Storage of 320TB
OneFS 7.0.2.1
22. Storage: OpenStack
Storage For Research Data Collections
Testing:
• Performance versus Local Storage
• Large Files (up to 1TB)
– Segmenting files > 5GB
– Lexical order bug fix: 1,10,2 -> 0001,0002,…0010
• Rackspace CloudFiles API vs. OpenStack REST API
Testing Notes:
https://libraries.ucsd.edu/blogs/dams/openstack-testing-notes/
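The lexical order bug noted above arises when segment names are plain integers and are sorted as strings. The sketch below shows the problem and the zero-padding fix. The function names, the object name, and the four-digit width are assumptions for illustration; the 5GB segment size comes from the slide.

```python
GB = 1024 ** 3


def plan_segments(size, segment_size):
    """Return (offset, length) pairs that cover `size` bytes."""
    return [
        (offset, min(segment_size, size - offset))
        for offset in range(0, size, segment_size)
    ]


def segment_names(base, count, width=4):
    """Name segments with zero-padded numbers so string order matches numeric order."""
    return [f"{base}/{i:0{width}d}" for i in range(1, count + 1)]


# Unpadded names sort incorrectly as strings.
unpadded = sorted(["1", "2", "10"])
print(unpadded)  # ['1', '10', '2']

# A hypothetical 12GB file split into 5GB segments.
segments = plan_segments(12 * GB, 5 * GB)
names = segment_names("bigfile.dat", len(segments))
print(names)  # ['bigfile.dat/0001', 'bigfile.dat/0002', 'bigfile.dat/0003']

# Padded names keep their order even past nine segments.
many = segment_names("bigfile.dat", 10)
```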
41. Next Steps
Beta Release: Late October
Production Release: January
Future:
• Sufia/Curate Integration for administrative functionality
• Additional Linked Data Integration and Crosswalks
– Schema.org, OpenURL, Dublin Core, ResourceSync
• Fedora4
42. More Information
DAMS Overview
https://github.com/ucsdlib/dams/wiki/DAMS-Manual
DAMS Hydra Head
https://github.com/ucsdlib/damspas
DAMS Ontology
https://github.com/ucsdlib/dams/tree/master/ontology
DAMS REST API
https://github.com/ucsdlib/dams/wiki/REST-API
Hot Topics Series 3: Get a Head on the Repository with Hydra
http://duraspace.org/hot-topics
Hydra Technical Overview
https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts
OneFS Technical Overview
http://www.emc.com/collateral/hardware/white-papers/h10719-isilon-onefs-technical-overview-wp.pdf
Isilon Overview
http://www.emc.com/collateral/software/data-sheet/h10541-ds-isilon-platform.pdf
43. Coming Up Next
Final Webinar (October 31)
The researcher perspective from two of our pilot
participants
Dick Norris – Professor, Scripps Institution of
Oceanography
Rick Wagner – Data Scientist at San Diego
Supercomputer Center