Digital Repositories: Essential Information for Academic Librarians
This presentation provides essential information for academic librarians about digital repositories. It describes institutional, disciplinary, and data repositories and gives examples of each. The presentation also looks at the current state of access, focusing on OAI-PMH, and it examines digital preservation for IRs. Academic libraries that host repositories essentially become publishers, and this responsibility has many implications for libraries. The talk closes with a brief look at the proposed "all-scholarship repository" (ASR).
An institutional repository is an OA repository that is sponsored by an institution, usually a university or college. Most of its content is open access, but some may be embargoed and some content may be dark archived.
Green open-access refers to author self-archiving of a post-print of a published work (published in a toll-access journal) in an open-access repository. The repository can be institutional or disciplinary. The advantage to the author is that he or she gets to publish in a top toll-access journal and at the same time the content is freely available through the repository. There are many disadvantages to green OA. Because you sign over copyright to the publisher, you need their permission to post the content in the repository. If they grant this permission, they only grant it for the Word version, which is not the version that they copyedit and not the version for which they enhance the images, tables, etc. Many also impose embargoes before the author can post the document: six months, one year, or even two years. Some publishers only allow green OA for institutional repositories; that is, disciplinary repositories are excluded.
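To make the embargo arithmetic concrete, here is a minimal Python sketch that works out the first day a post-print could be posted for each of the embargo lengths mentioned above. The publication date and the function names are hypothetical, chosen only for illustration, and real publisher policies may count the embargo from a different starting date.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` months after `start`.

    If the target month is shorter than the start day (for example
    31 August plus 6 months), the day is clamped to the month's last day.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def embargo_end(published: date, embargo_months: int) -> date:
    """Date on which an embargo of `embargo_months` months ends."""
    return add_months(published, embargo_months)


if __name__ == "__main__":
    published = date(2015, 8, 31)  # hypothetical publication date
    for months in (6, 12, 24):     # the embargo lengths named in the talk
        print(f"{months:>2}-month embargo ends {embargo_end(published, months)}")
```

A repository could store the computed date alongside the deposited file and keep the item dark until that date arrives.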
A post-print is the author’s last version of the paper that he or she sends to the journal. It is usually a Word document and incorporates all the changes suggested by peer reviewers. The term author’s accepted manuscript (AAM) is synonymous.
The SPARC author addendum “The form provides a templated request by authors to add to the copyright transfer agreement which the publisher sends to the author upon acceptance of their work for publication. Authors which use the form typically retain the rights to use their own work without restriction, receive attribution, and to self-archive. The form gives the publisher the right to obtain a non-exclusive right to distribute a work for profit and to receive attribution as the journal of first publication” From Wikipedia.
arXiv is a preprint server. This tradition started in the particle physics field. In the pre-internet days, because of the long lag time between submitting a manuscript and its eventual publication in a journal, physicists would create mimeographed copies of their manuscripts or preprints and share them with colleagues via the mail or at conferences. Eventually these became photocopies, and eventually they became available through telnet and gopher. I can remember helping set up a database at Harvard in 1991 or 1992 that was called the Physics Preprint database, and it was metadata for all the preprints. Then the internet came and changed everything. Today the physics preprint database is known as arXiv, and it’s still called a preprint server, but many people are submitting papers to it and then never submitting them to any journal. So it’s morphed into a type of publisher. Similar initiatives are being started in other fields. The problem is that much of the content is not peer-reviewed. We know that the major publishers make articles available soon after they are accepted, generally using names like “articles in press” or something like that, and this is an attempt to compete with preprint servers.
Sherpa Romeo is a free database that collects green OA policy statements for journals. Authors can use it to determine what they can do with their post-prints.
A dark archive is one that is generally not accessible at all, and may include embargoed material or material being stored for cooperative preservation.
First we’ll talk about institutional repositories. They are often referred to as IRs. OpenDOAR is a directory of them.
To give some local context, I gathered information about IRs in this region.
Here are some of the principal IR companies. Explain hosted versus software. Some of these are open source. Explain TIND.
There are two cooperatives for digital preservation for institutional repositories. Basically they work by having several other libraries host all your content in a dark archive on their servers, and you do the same in return. Academic Preservation Trust is based at UVA. Its members include: Columbia University, Indiana University, Johns Hopkins University, North Carolina State University, Penn State University, Syracuse University, University of Chicago, University of Cincinnati, University of Connecticut, University of Maryland, University of Miami, University of Michigan, University of North Carolina, University of Notre Dame, University of Virginia, Virginia Tech.
The Digital Preservation Network does not indicate where it is based, but it gives a 434 area code for its telephone number, which is Lynchburg, Virginia, so it looks like Virginia is the hotspot for digital preservation. It has these members:
Arizona State University, Brigham Young University, Brown University, California Institute of Technology, Columbia University, Cornell University, Dartmouth College, Duke University, Emory University, Harvard University, Indiana University, Iowa State University, Johns Hopkins University, Kansas State University, Massachusetts Institute of Technology, Michigan State University, New York University, Northwestern University, North Carolina State University, Ohio State University, Pennsylvania State University, Princeton University, Purdue University, Rutgers University, Stanford University, Syracuse University, Texas A&M, Texas Tech University, Tufts University, Tulane University, University of Alabama, University of Arizona, University of Buffalo, University of California San Diego, University of Chicago, University of Florida, University of Illinois at Chicago, University of Illinois at Urbana-Champaign, University of Iowa, University of Kansas, University of Kentucky, University of Maryland, University of Miami, University of Michigan, University of Minnesota, University of Nebraska, University of New Mexico, University of North Carolina, University of Notre Dame, University of Tennessee, University of Texas, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Utah State University, Vanderbilt University, Virginia Polytechnic Institute and State University, Yale University, Texas Digital Library, California Digital Library, John D. Evans Foundation, American Council on Education.
Figshare is unique because it markets to individual scholars. It does also market to institutions. It’s owned by Digital Science, which is owned by Macmillan Publishers Limited.
There is an organization called DataCite that focuses on citing digital objects. They have something called the “Metadata Store” where you can buy DOIs and assign them to the digital objects in your repository. Increasingly, the quality of a repository will be judged by whether it provides DOIs for its objects and digital preservation for its content. The sponsors of repositories essentially become publishers, and publishers have responsibilities. Publishing is much more than just mounting PDFs or images on the internet; there are many activities that must be carried out to support publishing, if you want to do it right.
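As a small illustration of what a DOI gives you in practice, the Python sketch below turns a DOI into its resolvable web address. The DOI shown is a made-up example, the function name is hypothetical, and the pattern is only a loose sanity check that I am assuming for illustration, not the formal definition of DOI syntax.

```python
import re
from urllib.parse import quote

# Loose sanity check (an assumption for illustration): "10.", a numeric
# registrant prefix, a slash, then a non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ address for a DOI string."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path,
    # keeping the slash that separates prefix and suffix.
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    # Hypothetical DOI for an object in a repository.
    print(doi_to_url("10.5072/example-1234"))
```

The point for a repository is that the DOI stays the same even if the object later moves to a different server; only the address registered behind the DOI has to be updated.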
Now let’s talk about disciplinary repositories. There is one directory of them that I know of, and it covers most fields, and it’s hosted on the Simmons College OA wiki. Some of the major subject repositories include these.
Here are screenshots of SSRN and RePEc, which I think is pronounced REE Peck. I don’t completely understand SSRN. It is starting to act more like a business than a repository. Indeed it’s owned by a company called Social Science Electronic Publishing, Inc. It may also do some publishing. It also hosts preprints. It uses number of downloads as a metric to measure individual researchers. RePEc is sponsored by the Research Division of the Federal Reserve Bank of St. Louis.
The basic difference is that PubMed is a database of metadata, and PMC is a database of full-text scholarly articles. The two databases are often confused. PMC has an HTML “reader” and a classic reader and in many cases the publisher’s PDFs are also available.
Both PubMed and PMC are made available by the National Center for Biotechnology Information, NCBI, which is part of the U.S. National Library of Medicine. A lot of funding agencies in the bio-medical sciences require that research completed using their funding be made freely available, and PMC is one place where this is often done.
Data repositories publish much more than just numerical or statistical data. They also publish genomic data, structured textual data, image data, and more.
Mention CC 0 license. Started in North Carolina with grant funding. One of the ideas is that people can use the published data to generate new research. They can also re-do the experiments and see if they get the same results.
It started at the University of Michigan. It doesn’t work well for items that are removed. ResourceSync is a prototype replacement. It aims to synchronize metadata with the objects they describe.
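To show roughly what OAI-PMH harvesting involves, here is a minimal Python sketch. It builds a ListRecords request and reads a response, separating live records from ones the repository has flagged as deleted. The base URL, the identifiers, and the sample response are all hypothetical, and the sketch parses a canned XML string rather than contacting a real repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    When continuing a harvest, the resumptionToken is sent on its own,
    without the metadataPrefix argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def summarize_response(xml_text):
    """Return (live identifiers, deleted identifiers, next token or None)."""
    root = ET.fromstring(xml_text)
    live, deleted = [], []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        # A record is marked as removed with status="deleted" on its header.
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            live.append(identifier)
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return live, deleted, ((token or "").strip() or None)


# Hypothetical response from a hypothetical repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu:1</identifier>
        <datestamp>2015-01-01</datestamp>
      </header>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:repository.example.edu:2</identifier>
        <datestamp>2015-02-01</datestamp>
      </header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://repository.example.edu/oai"))
    print(summarize_response(SAMPLE))
```

Support for deleted records is optional in the protocol, so a harvester only learns about a removal if the repository chooses to keep and expose that status. Where a repository does not, removed items simply stop appearing, and the harvester's copy of the metadata goes stale.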
4th bullet point: I’ve heard the term “publications ghetto” used to refer to institutional repositories, specifically referring to green open access articles, which are Word versions of documents or a PDF derivative of such.
This is an initiative of the National Science Communication Institute. It would be centralized and would make things like OAIster obsolete. In other words, it would centralize all IR content rather than just the metadata.
• Institutional Repositories
• IRs in Colorado
• IR software
• Standard identifiers for digital objects in repositories
• Digital preservation for IRs
• Disciplinary repositories
• Data repositories
• The future
• Institutional repository (IR)
• Disciplinary repository (Subject repository)
• Green open-access
• Author's accepted manuscript (AAM)
• SPARC author addendum
• Embargo period
• Pre-print server
• Sherpa Romeo
• Dark archive
Institutional Repositories : Local instances
• Colorado / Wyoming Institutional Repositories (selected)
• University of Colorado Boulder, University of Colorado Colorado Springs, Anschutz
Medical Campus, Colorado School of Mines, Colorado Mesa University, and Colorado
State University are still using Digital Collections of Colorado
• Wyoming Scholars Repository (Digital Commons)
• University of Northern Colorado, University of Denver, Colorado College, and others
use the Colorado Alliance's repository service, which is an Islandora implementation.
• Fort Lewis College has Fort Works, an Eprints implementation
Institutional Repositories :
"The Academic Preservation Trust (APTrust) is
committed to the creation and management of a
sustainable environment for digital preservation.
APTrust’s aggregated repository will solve one of
the greatest challenges facing research libraries
and their parent institutions – preventing the
permanent loss of scholarship and cultural records
being produced today."
"The Digital Preservation Network (DPN) was formed to
ensure that the complete scholarly record is preserved
for future generations. DPN uses a federated approach
to preservation. The higher education community has
created many digital repositories to provide long-term
preservation and access. By replicating multiple dark
copies of these collections in diverse nodes, DPN
protects against the risk of catastrophic loss due to
technology, organizational or natural disasters."
• Directory of disciplinary repositories (Simmons College) =
• Some major disciplinary repositories:
• SSRN (Social Science Research Network)
• RePEc (Research Papers in Economics)
• E-LIS (Eprints in Library and Information Science)
• PMC (PubMed Central)
• Ag Econ Search (University of Minnesota)
Focus: PubMed Central (PMC)
“PMC (PubMed Central) launched in 2000 as a free archive for full-text biomedical
and life sciences journal articles. PMC serves as a digital counterpart to the NLM
extensive print journal collection; it is a repository for journal literature deposited by
participating publishers, as well as for author manuscripts that have been submitted
in compliance with the NIH Public Access Policy and similar policies of other
research funding agencies. Some PMC journals are also MEDLINE journals. For
publishers, there are a number of ways to participate and deposit their content in this
archive, explained on the NLM Web pages Add a Journal to PMC and PMC
Policies. Journals must be in scope according to the NLM Collection Development
Manual. Although free access is a requirement for PMC deposit, publishers and
individual authors may continue to hold copyright on the material in PMC and
publishers can delay the release of their material in PMC for a short period after
publication. There are reciprocal links between the full text in PMC and
corresponding citations in PubMed. PubMed citations are created for content not
already in the MEDLINE database. Some PMC content, such as book reviews, is
not cited in PubMed.”
What is the Difference between PubMed
Central and PubMed?
Directories of Data Repositories
• Data repositories (Simmons College, OA Directory)
• Registry of Research Data Repositories
• Databib: "Databib is a searchable catalog registry / directory /
bibliography of research data repositories."
Focus: Dryad Digital Repository
• Works with journals
• Requires use of the CC 0 license
• Located at http://datadryad.org/
• Costs $90
“DataDryad.org is a curated general-purpose
repository that makes the data underlying
scientific publications discoverable, freely
reusable, and citable. Dryad has integrated data
submission for a growing list of journals;
submission of data from other publications is also
welcome” -- http://datadryad.org/
• A collection of software repositories
• Used for sharing code, programs, software
• Has paid and free options; free option used for open source
“GitHub is the largest code host on the planet with
over 19.4 million repositories. Large or small, every
repository comes with the same powerful tools.
These tools are open to the community for public
projects and secure for private projects.”
DMP = Data management plan
From the Wikipedia article, "Data management plan"
• Description of the data
• How / When / Where data will be acquired
• How the data will be processed
• What file formats the data will be in, naming conventions
• Version control
• Policies for access, sharing, and re-use
• Long-term storage and data management
Review of OAI-PMH
• Open-Archives Initiative Protocol for Metadata Harvesting
• Provides a way to create a "union catalog" of resources in digital
• The metadata is indexed in WorldCat (including WCL), updated
• Institutional repositories convert libraries into publishers, and this has
many long-term legal, ethical, and financial implications.
• Repositories exist in sort of a digital version of the Wild West
• Repositories with strong digital preservation practices and that use
and maintain standard identifiers for the digital objects they publish
will stand out from others.
• Most repositories will contain material of secondary or local-only
importance, but a few “gems” will exist here and there.
• Libraries are competing with scholarly publishers (Odlyzko, 2013).
“Investigate the possibility of constructing the world’s first all-
scholarship repository (ASR). [...] Conversations are currently ongoing
on this matter. The Department of Energy has authorized the Los
Alamos National Laboratory (LANL) to build the prototype ASR.” SOURCE