While the generation or collection of large, complex research datasets is becoming easier and less expensive all the time, researchers often lack the knowledge and skills that are necessary to properly manage them. Having these skills is paramount in ensuring data quality, integrity, discoverability, integration, reproducibility, and reuse over time. Librarians have been preserving, managing and disseminating information for thousands of years. As scholarly research is increasingly carried out digitally, and products of research have expanded from primarily text-based manuscripts to include datasets, metadata, maps, software code etc., it is a natural expansion of scope for libraries to be involved in the stewardship of these materials as well. This kind of evolution requires that libraries bring in faculty with new skills and collaborate more intimately with researchers during the research data lifecycle, and this is exactly what is happening in academic libraries across the country. In this webinar, two researchers-turned-data-specialists, both based in academic libraries, will share their experiences and perspectives on the development of research data services at their respective institutions. Each will share their perspective on the important role that libraries can play in helping researchers manage, preserve, and share their data.
Developing data services: a tale from two Oregon universities
1. Developing data services: a tale from two Oregon universities
NN/LM, Pacific Northwest Region
PNR Rendezvous | 18 June 2014
Melissa Haendel
OHSU Library
Amanda Whitmire
OSU Libraries
2. B.S. in Aquatic Biology, 2000
Worked in a bioluminescence laboratory
Ph.D. in Oceanography, emphasis in biological
oceanography, 2008
Dissertation study area: bio-optics; using optical tools
to study ocean ecology (N. California Current)
Post-doc in Oceanography, emphasis in biological
oceanography, 2008-2012
Study area: bio-optics; using optical tools to study
ocean ecology in low oxygen zones (N. Chile)
Assistant Professor, Data Management
Specialist, Sept. 2012 - present
About Amanda…
Not a
librarian.
3. B.A. in Chemistry, 1990
Modeled drug-receptor ligand binding
Ph.D. in Neuroscience, 1999,
Dissertation study area: Identification of novel genes
involved in neural development in the mouse
Post-doc, 2002-2004
Study area: Toxic effects of biocides in zebrafish and
salmon
Assistant Professor, Library, 2010 – present
Lead semantic research team
About Melissa…
Not a
librarian.
Post-doc, 2000-2002,
Study area: Role of thyroid hormone during neural
cell death in zebrafish
Post-doc, 2002-2004
Study area: Ontologies, data models, gene
nomenclature, biocuration
4. Do you have any data-related tasks or
responsibilities in your job description
or duties?
[Yes/No]
What role do you believe metadata
plays in the modern research cycle?
[big, small, none, other]
Questions
5. Why data management?
The researcher perspective
Why libraries?
Why bring in non-librarians?
Amanda & Melissa share their experiences
Wrap-up
image credit: http://www.flickr.com/photos/54803625@N08/8296296949/
6. “…the recorded factual material
commonly accepted in the
scientific community as
necessary to validate research
findings.”
Research data is:
U.S. Office of Management and Budget, Circular A-110
7. “Unlike other types of information, research
data are collected, observed, or created, for
the purposes of analysis to produce and
validate original research results.”
What is research data?
University of Edinburgh
MANTRA Research Data Management Training,
‘Research Data Explained’
8. Actions that contribute to effective
storage, use, preservation, and reuse
of data and documentation throughout
the research lifecycle.
Data management:
11. Data deluge
Data is collected from sensors, sensor networks, remote sensing, observations, and more - this calls for increased attention to data management and stewardship
Photo courtesy of www.carboafrica.net
Photo courtesy of http://modis.gsfc.nasa.gov/
Photo courtesy of http://www.futurlec.com
CC image by tajai on Flickr
CC image by CIMMYT on Flickr
Image collected by Viv Hutchinson
Slide credit: http://www.dataone.org/education-modules
12. Federal movement toward open data
1985: National Research Council
1999: OMB Circular A-110 revisions
2003: NIH Data Sharing Policy
2008: NIH Public Access Policy
2011: NSF DMP requirement
2012: NEH, Office of Digital Humanities DMP requirement
2013: NSF biosketch change
2013: OSTP memo on public access to results of federally funded data
14. The memorandum states that, “digitally formatted scientific data resulting from
unclassified research supported wholly or in part by Federal funding should be stored
and publicly accessible to search, retrieve, and analyze.” To this end, federal agencies
must create a public access plan that includes the following mandates:
• Maximize public access to data while protecting personal privacy and
confidentiality, intellectual property, and balancing costs with long-term benefits;
• Ensure that investigators create data management plans that describe strategies for
long-term preservation of and access to data;
• Allow the costs of data management to be included in proposal budgets;
• Ensure that the merits of data management plans are properly evaluated;
• Implement mechanisms to ensure that investigators comply with their data
management plans and policies;
• Promote deposition of data into publicly accessible repositories;
• Encourage private and public cooperation to improve data access and
interoperability;
• Develop and standardize approaches to data citation/attribution;
• Support training in data management best practices;
• Assess needs and strategies for the long-term preservation of data.
18. Assertion:
“β amyloid, known for its role in
injuring brain in Alzheimer’s
disease, is also produced by and
injures skeletal muscle fibres in the
muscle disease sporadic inclusion
body myositis.”
Greenberg 2009
20. How do we believe what we think we
know?
Is it true or do we just believe it because
everyone else does?
How do we transcend “follow the leader”? What
tools can we build to help us?
21. How reproducible is science?
Let’s start simple.
Do we know what the ingredients were?
22. Journal guidelines for methods are often poor and
space is limited
“All companies from which materials were obtained should
be listed.” - A well-known journal
Reproducibility is dependent at a minimum, on
using the same resources. But…
23. How identifiable are resources in the
published literature?
An experiment in reproducibility
Gather journal
articles
5 domains:
Immunology
Cell biology
Neuroscience
Developmental biology
General biology
3 impact factors:
High
Medium
Low
84 Journals
248 papers
707 antibodies
104 cell lines
258 constructs
210 knockdown
reagents
437 model
organisms
24. Only ~50% of resources were identifiable
Vasilevsky et al, 2013, PeerJ
25. There is no correlation between impact factor and resource identification
[Figure: fraction of resources identified (0.0 to 1.0) plotted against journal impact factor (0 to 40), with separate series for antibodies, cell lines, constructs, knockdown reagents, and organisms.]
29. Of 9 antibodies published in 5 articles, only 44% were identifiable
[Figure: percent identifiable (0% to 100%) for four categories: commercial Ab identifiable, catalog number reported, source organism reported, target uniquely identifiable.]
30. Resource information is not adequately
getting into the literature, EVEN
THOUGH IT IS READILY AVAILABLE
The problem is a lack of standards,
review, and tools
LIBRARIES CAN HELP!!!!!!
32. Publishing Workflow
Sample citation: Polyclonal rabbit anti-MAPK3 antibody, Abgent, Cat# AP7251E, RRID:AB_2140114
1. Researcher submits a manuscript for publication
2. Editor or Publisher OR LIBRARIAN! asks for inclusion of RRID
3. Author goes to Resource Identification Portal to locate RRID
4. RRID is included in Methods section and as Keyword
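The check an editor or librarian performs in step 2 of this workflow could be partly automated. The sketch below is only an illustration: the function name `find_rrids` and the regular expression are assumptions, not part of any official RRID tooling, and the pattern is deliberately loose because identifiers from different registries take different shapes.

```python
import re

# Assumed, simplified pattern: "RRID:" + registry prefix + "_" + identifier.
# This is a hypothetical pattern for illustration, not an official RRID grammar.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z]+)_([A-Za-z0-9:_-]+)")


def find_rrids(text):
    """Return every RRID-like token found in a passage of methods text."""
    return [f"RRID:{prefix}_{ident}" for prefix, ident in RRID_PATTERN.findall(text)]


# The sample citation from the slide.
methods = ("Polyclonal rabbit anti-MAPK3 antibody, Abgent, "
           "Cat# AP7251E, RRID:AB_2140114")

print(find_rrids(methods))
print(find_rrids("Polyclonal rabbit anti-MAPK3 antibody, Abgent"))
```

A passage that returns an empty list is a candidate for asking the author to add an identifier; a match only shows that something RRID-shaped is present, not that it resolves to the right resource.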
34. $1.3 million grant from the Laura and John
Arnold Foundation to validate 50 landmark
cancer biology studies
Partnership between
Science Exchange,
PLoS, FigShare,
Mendeley, and some of
us scientists
35. Librarians can help researchers
understand:
How to be critical of data and where it came from
Data provenance and meeting data standards
That there is a need to reinterpret data when new
information comes to light
That reproducibility depends on many things, including
very basic things
Why both retrospective and prospective efforts are
needed to ensure data quality, consistency, and utility
36. Amanda’s dissertation
The spectral backscattering properties of marine particles
Observations
ship-based sampling &
moored instruments
Simulation
results
scattering &
absorption of light
Experimental
optical properties of
phytoplankton cultures
Derived
variables
endless things
Compiled
observations
global oceanic bio-
optical observations
[self + from peers]
Reference
global oceanic bio-
optical observations
[NASA]
41. http://www.ala.org/acrl/sites/ala.org.acrl/files/content/publications/whitepapers/Tenopir_Birch_Allard.pdf
“Only a small minority of academic
libraries in the United States and Canada
currently offer research data services
(RDS), but a quarter to a third of all
academic libraries are planning to offer
some services within the next two years.”
“Few academic libraries are responsible for
developing research data policies. Being
able to serve as a clearinghouse of ideas
and to provide expertise to build these
policies is an opportunity for libraries to be
members of the knowledge creation
process.”
“Reassigning existing library staff is the
most common tactic for offering RDS.”
43. Timeline of data services at OSU
late 2011: UL & library admin. recognize need for role of RDS on campus that requires a dedicated FTE
Sept. 2012: Data Management Specialist starts
Jan. 2013: Strategic Agenda in place*
Oct. 2013: Data survey launches
Jan. 2014: GRAD 521 launches
ESI
*Sutton, Shan; Barber, David; Whitmire, Amanda L. (2013): Oregon State University Libraries and Press Strategic Agenda for Research Data Services. Oregon State University Libraries. http://hdl.handle.net/1957/38794.
45. Responses to the question, “Please indicate whether or not you generate each of
the following data format(s) as a part of your research process. Select Yes or No for
each.” Color scale indicates what percentage of respondents in each college or unit
selected ‘Yes’ for each data type. The number in each tile shows the number of
faculty responses for that data type and college/unit.
47. Research
Analysis of data management plans as a means to inform and empower
academic librarians in providing research data support. National Leadership
Grant LG-07-13-0328, Oct 2014 – Sept 2015
Data management plans As a Research Tool: The DART Project
49. Teaching: GRAD 521
Logistical Details
• http://bit.ly/GRAD521
• All course materials on figshare
• 2 credits
• Discipline-agnostic
• Offered annually, winter quarter
Topics covered
• Overview of RDM
• Types, formats & stages of data
• RDM planning
• Storage, backup & security
• Documentation & metadata
• Legal & ethical considerations
• Sharing & reuse
• Archive and preservation
50. Timeline of data activities at OHSU
late 2009: OHSU library awarded eagle-i
Sept. 2012: Monarch Initiative awarded
April 2013: Beyond the PDF 1K challenge award
Oct. 2013: Data survey launches
Now: OHSU hiring CRIO position
ESI
NIH BD2K program
52. How do you reference your data when you publish, either in the context of a journal publication, or by direct publication of data sets?
[Figure: bar chart, percent of respondents (0% to 60%) choosing each option: Specific Uniform Resource Identifier (URI) or other URL where data is held; Contact information of the data steward; Reference to a public repository where the data is held; Provide supplementary data to the journal; SPARQL endpoint and/or Linked Open Data; Digital Object Identifier (DOI); I don't know; Other (please specify).]
53. Are there any professional community standards in your research area regarding data management, sharing, storage, archiving, and/or producing metadata or other descriptive information that would apply to your research data?
Answer | Instructor | Assistant Professor, Research Assistant Professor, or Assistant Scientist | Associate Professor or Associate Scientist | Professor or Senior Scientist | Director, Division Head, Department Head | PostDoc/ResAssoc/PhD
Yes | 1 | 9 | 5 | 16 | 6 | 13
No | 1 | 8 | 9 | 15 | 1 | 10
I don't know | 1 | 19 | 13 | 14 | 4 | 19
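The counts in this table can be totalled to see how the three answers compare overall. The following is a small sketch using only the numbers shown on the slide; the function name `share_by_answer` is made up for this illustration, and the calculation assumes each respondent appears in exactly one column.

```python
# Counts copied from the slide's table, one list per answer, in column order
# (Instructor, Assistant, Associate, Professor, Director/Head, PostDoc/ResAssoc/PhD).
counts = {
    "Yes": [1, 9, 5, 16, 6, 13],
    "No": [1, 8, 9, 15, 1, 10],
    "I don't know": [1, 19, 13, 14, 4, 19],
}


def share_by_answer(table):
    """Return each answer's total and its percentage of all responses."""
    totals = {answer: sum(values) for answer, values in table.items()}
    grand_total = sum(totals.values())
    shares = {answer: round(100 * total / grand_total, 1)
              for answer, total in totals.items()}
    return totals, shares


totals, shares = share_by_answer(counts)
for answer in counts:
    print(f"{answer}: {totals[answer]} responses ({shares[answer]}%)")
```

On these counts, "I don't know" is the most common answer, which fits the talk's argument that researchers need help locating the standards relevant to their data.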
54. Scope of Data Services at OHSU
Open houses,
Lib Guides, NIH proposals to
improve data education,
hosting fellows
New IR,
research
profiling tools
Participation in
national efforts:
BD2K, Force11, Galaxy,
Biocuration Society
Data consults,
collaborations
56. NIH Big Data to Knowledge Initiative
http://bd2k.nih.gov/
57. 1 | Can facilitate the creation of a smarter body
of literature for future research
2 | Train researchers to utilize metadata
standards to enable data reuse
3 | Facilitate researchers' understanding of
available resources
Libraries, in summary…
58. Members from:
Oregon Health & Science University
Oregon State University
University of Oregon
University of Idaho
University of Washington
Portland State University
Reed College
Join us @ bit.ly/pnwdatalibs
Also we need a logo:
Free data science training for good suggestions!
PNW Research Data Geeks
Group
http://commons.wikimedia.org/wiki/File:DARPA_Big_Data.jpg
59. How do you think libraries
can best facilitate best
practices in data
management?
Editor's Notes
National Network of Libraries of Medicine, Pacific Northwest Region
PNR Rendezvous
Here is the link to the recording of the presentation: https://webmeeting.nih.gov/p8swadmbzpo/ and to our PNR Rendezvous webpage where the recording is posted: http://nnlm.gov/pnr/training/RMLrendezvous.html
Adobe Connect instant polling to poll attendees (N=37).
Responses:
45% - yes, have data-related tasks or duties;
90% - metadata plays a big role in the modern research cycle
Does not include, “any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples).” This narrow definition mostly takes a retrospective view of your dataset, in that it does not account for raw and intermediate data that may be critical to the research process but that don’t become part of the ‘final’ dataset.
Data could be:
Observational
Experimental
Simulated
Derived
Data management is a verb – it involves intentional effort and activity.
The main goals of DM are preservation and reuse, for you and for others.
Covers all aspects of the data lifecycle from planning digital data capture methods, whittling down, ingestion to databases, providing for access and reuse, to transformation.
image: Microsoft clipart
Let’s look at one important area of scientific inquiry: climate change. What scale of data integration is necessary to study global trends over geologic timescales?
Slide credit: DataONE Education Module 1. http://www.dataone.org/education-modules
Data are being generated in massive quantities daily. Improvements in technology enable higher precision and coverage in data acquisition, and higher-capacity systems store and migrate more data, increasing the importance of managing, integrating, and re-using data. In order to integrate these diverse datasets to answer questions of global significance, the data have to be well organized, well documented and described, preserved and accessible. It all depends on effective management of the data.
22 February 2013: The Office of Science and Technology Policy in the White House released a memorandum about expanding public access to the results of federally funded research. In addition to scholarly publications, federal agencies are making serious efforts to increase the sharing of research data.
All federal agencies with more than $100M in R&D expenditures are subject to this memo.
http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
This is going to place huge additional demands on faculty who submit and review proposals – they overwhelmingly have NO IDEA what constitutes a good DMP.
“PLOS is now releasing a revised Data Policy that will come into effect on March 1, 2014, in which authors will be required to include a data availability statement in all research articles published by PLOS journals … {policy language: PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. When submitting a manuscript online, authors must provide a Data Availability Statement describing compliance with PLOS’s policy. The data availability statement will be published with the article if accepted.}”
Since the policy was updated in March 2014: “…more than 16,000 sets of authors have included information about data availability with their submission. We have had fewer than 10 enquiries per week to data@plos.org from authors who need advice about ‘edge cases’ of data handling and availability – fewer than 1% of authors.”
http://blogs.plos.org/biologue/2014/05/30/plos-data-policy-update/
Citations on statement that accumulation of β amyloid “precedes” other abnormalities in inclusion body myositis muscle.
Statement as fact is supported through citation to papers that only state it as hypothesis
The four most authoritative papers were from the same lab, two potentially used the same data, and all lacked quantitative data on how many affected muscle fibres were seen and on the specificity of reagents for distinguishing β amyloid protein from β amyloid precursor protein.
MH - notes
We are working on determining how to deal with this longer term: is this a new data citation that goes alongside the paper? It needs to be in the keywords so it is mineable. Trying to figure out how to deal with this in the long run.
“When an official at America’s National Institutes of Health (NIH) reckons, despairingly, that researchers would find it hard to reproduce at least three-quarters of all published biomedical findings, the public part of the process seems to have failed.”
Give background about the reproducibility initiative.
Talk about example of replication: scientific reproducibility experiment with leishmania and it being a different strain, different amidation, etc.
Most research projects use and create multiple data types & formats, and produce many, many files. My own dissertation work included the generation or use of all of the data types shown here (which might help to explain why it took me 7 years to earn a Ph.D.). http://hdl.handle.net/1957/9088 This data was collected over the course of 5 years, at locations all over the Pacific and Atlantic Oceans. I never received ANY formal training in how to organize and manage all of this data. Where is all of this data now? On an external hard drive sitting in my desk.
Image credit: Document by Piotrek Chuchla from The Noun Project
Librarians have been preserving, managing and disseminating information for thousands of years, going all the way back to Alexandria.
As scholarly research is increasingly carried out digitally, and products of research have expanded from primarily text-based manuscripts to include datasets, metadata, maps, software code etc., it is a natural expansion of scope for libraries to be involved in the stewardship of these materials, too.
This kind of evolution requires that libraries bring in faculty with new skills, and that’s exactly what’s happening in academic libraries across the country.
Data management is something that faculty all over campus have become aware of.
As a neutral entity, the library is well positioned to address campus-wide needs, like data management. It makes sense, under the economy of scale, for a centralized unit to address a campus-wide need.
We recognize that individual colleges and departments have computer support personnel and resources, and we aim to complement those resources (not duplicate them).
(Switzerland metaphor swiped from the incomparable Jackie Wirz at OHSU)
We aren’t here to replace the external resources that already exist to support you – we are here to act as a conduit to these resources.
Our goal is to help you effectively discover, navigate and utilize these resources where appropriate, in the same way that the library has been providing this kind of support for decades.
SO, what’s going on with academic libraries and data services? This ACRL white paper (2012) provides some context.
I spent my first year here getting my feet under me:
Participating in the DuraSpace/ARL/DLF E-Science Institute, which involved doing an environmental scan and engaging faculty and administrators in interviews. Strengthened an existing relationship with campus Information Services (IS). This experience resulted in the creation of our Strategic Agenda for Research Data Services, which really laid out my priority tasks and areas of emphasis. http://hdl.handle.net/1957/38794
Submitting an IMLS National Leadership Grant with 4 co-PIs
Developing a collaboration with the Graduate School to create a credit-bearing course for graduate students in research data management (http://bit.ly/GRAD521)
Trying (with limited success) to advertise the existence of library-based data services for faculty & grad students
Creating a data services web site (via LibGuides, http://bit.ly/OSUData)
Curating the limited number of datasets in our IR; updating metadata practices
The first ¼ of 2014:
All GRAD 521, all the time
And, some grant stuff.
Response rate was 23%, 451 completed surveys across all colleges and ranks surveyed. The goal was to get a feel for how much and what types of data are being produced on campus, what faculty are doing with it, and figure out where they need more support.
Example question and responses to the OSU faculty data stewardship survey (figure created in R).
What do faculty find more difficult: metadata creation, version control, finding and accessing data created by others, long-term storage, and sharing their own data.
What am I going to do with the survey results? I’m working on a report, which I will share with faculty and OSU administration. Am hoping that it leads to a campus-wide conversation about data stewardship.
The OSUL&P Research Data Services model.
Data planning & consultation
DMPs/Planning
Storage & backup
File organization & naming
Documentation & metadata
Legal/ethical considerations
Sharing & reuse
Archiving & preservation
Data access & preservation infrastructure
Data curation in our IR
We offer DOIs for datasets via membership in EZID (CDL)
Recommend using ORCID iDs but haven’t had much traction on this yet. NIH mandate will change this.
Data management training
90-minute workshops, mostly grad students, some faculty
2-credit course launched in January 2014. GRAD 521. http://bit.ly/GRAD521
presentations at faculty/staff mtgs;
invited lectures in classes
Open data consortia & collaborations
CUAHSI – implemented, in partnership with OSU faculty in CEOAS and Institute for Natural Resources
DataONE & DataFOUR are under consideration or development
Periodic surveys can be used to identify service needs on campus, but depend on useful response rates. We suggest that regular reviews of DMPs can also be a legitimate source of information regarding what researchers are up to, and where they may need support. This project aims to provide a tool for librarians to facilitate consistent, quality reviews of DMPs.
Project in a nutshell:
Develop a rubric for consistent evaluation of NSF DMPs
Multi-university study of DMPs
Identify common gaps in knowledge, skills and practice
Target data support services to ameliorate gaps
Website with more info. is under development. Contact Amanda or DMPResearch@oregonstate.edu with questions.
Graduate students (like to meet in person) or faculty (most prefer email)
Generally project or task-specific
Examples:
coming up with a file-naming convention and data organization strategy for a project
reviewing a data management plan for a grant proposal
how to share data in support of a submitted manuscript
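To make the first consultation example concrete, here is one way a file-naming convention could be enforced in code. The pattern `project_YYYYMMDD_description_vNN.ext` and the helper `build_filename` are hypothetical choices for illustration, not a convention prescribed in the talk; any consistent scheme agreed on by a research group would serve.

```python
import re
from datetime import date


def build_filename(project, collected, description, version, ext):
    """Assemble 'project_YYYYMMDD_description_vNN.ext' (hypothetical convention)."""
    def slug(text):
        # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    extension = ext.lstrip(".").lower()
    return (f"{slug(project)}_{collected:%Y%m%d}_"
            f"{slug(description)}_v{version:02d}.{extension}")


# Example with made-up project and file details.
name = build_filename("Bio-Optics", date(2014, 6, 18),
                      "backscatter profiles", 1, ".CSV")
print(name)  # bio-optics_20140618_backscatter-profiles_v01.csv
```

Putting the date in year-month-day order means files sort chronologically in a directory listing, and a zero-padded version number keeps v02 ahead of v10.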
Midterm assignment: a scaled-back Data Curation Profile
Final assignment: a data management plan
First cohort: 11 students, including 3 faculty members; degree ranges from non-thesis MS to PhD; many disciplines
Whitmire, Amanda (2014): GRAD 521 Research Data Management Syllabus and Lesson Plans. figshare. http://dx.doi.org/10.6084/m9.figshare.1003834
Whitmire, Amanda (2014): GRAD 521 Research Data Management Course Assignments. figshare. http://dx.doi.org/10.6084/m9.figshare.1003852
Whitmire, Amanda (2014): GRAD 521 Research Data Management Lectures. figshare. http://dx.doi.org/10.6084/m9.figshare.1003835