Organizing Digital Stuff

Weekly reflection
• What digital “stuff” do you have? Where
do you put it? How do you organize it, if
you do? How do you find it when you
need it?

Tool of the week: Self-efficacy
• In the course of your career, you will have to
do things you don’t entirely know how to do.
• Technical and non-!
• Without training, guidance, or clear instructions.
• No, of course we don’t teach you everything in library
school!
• Learn to dive in despite imperfect knowledge.
• Use your common sense.
• Trust that those around you want you to succeed.
• If you need to, research! Always be ready to learn.
• Mentors are great... but they’re not babysitters.
• Accept imperfection.
• Please model these behaviors in my class!

Tip of the week: Staying informed
• Weblogs and newsfeeds are your friends.
• If you are not reading at least a few librarian blogs,
you are not staying informed.
• Can’t hurt to pick up some journal TOCs too.
• Blogs are faster than the published literature! And
often written by the same people.
• For (library) tech:
• Librarian in Black
• Planet Code4Lib
• librarian.net
• Lifehacker, Gizmodo, Engadget
• Roy Tennant’s LJ columns

What is metadata?
• Heck, I dunno. I’m not sure that’s even a
useful question.
• This is one reason I’m not a library-school
professor. Definitional pilpul bores me.
• Operationally: when we collect stuff, we
take notes on it so we can organize it,
inventory it, find it later, etc. Those
notes are metadata.
• Is MARC metadata? Well, of course!
• But many librarians don’t think about it that way.

Why are there so many metadata standards?
• Different things described
• For an image, you want to know its bit depth and
colorspace. This has no meaning for a finding aid.
• Several targeted standards vastly easier to cope
with than one supposedly universal standard.
• Different purposes
• More on this in a moment
• Different provider and user communities
• Level of detail/specificity
• Wheel (or toothbrush) reinvention

Metadata file formats
• You can express metadata in an Excel
spreadsheet, a MARC record, XML, RDF...
• But some expressions are more readable, useful,
and reusable than others!
• Metadata librarians spend a lot of time fixing and
transforming Other People’s Metadata, in as
automated a fashion as possible.
• Large majority of modern metadata
standards expressed in XML.
• Though RDF wants to be a contender, and XML is
only one way of several to express RDF.
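The "fixing and transforming" work above can be sketched in a few lines. This is a minimal illustration, not a production converter: the spreadsheet contents and column names are invented, and it assumes every column heading is already a legal XML element name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet export; the column names and values are invented.
SPREADSHEET = """title,creator
Example Photograph , Jane Doe
Another Example,John Roe
"""

def rows_to_xml(csv_text):
    """Turn each spreadsheet row into a <record> with one child per column."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # Assumes the column heading is a valid XML element name.
            field = ET.SubElement(record, column)
            # Tidy stray whitespace, a typical cleanup step.
            field.text = value.strip()
    return root

xml_root = rows_to_xml(SPREADSHEET)
print(ET.tostring(xml_root, encoding="unicode"))
```

Real cleanup jobs add steps this sketch omits: normalizing dates, splitting multi-valued cells, and checking names against an authority file.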

So what’s this RDF thing all the cool kids are talking about?
• Resource Description Framework
• by the W3C
• Like XML, RDF is more or less friendly to
whatever kind of metadata you want to
throw at it.
• Unlike XML, RDF is a data model designed for integrating
information from different metadata vocabularies, and
expressing how items and metadata records relate to one
another. Links and linking!
• (Also, XML works for content, e.g. TEI. RDF doesn’t.)

(very) Basic RDF
• “Triple:” subject, property, value
• A little like subject, verb, object in English.
• Dorothea Salo is the author of “Innkeeper
at the Roach Motel.”
• Subject: either me or the article (works either way,
depending on property chosen)
• Property: authorship (“isAuthorOf” or “isBy”); often
comes from a controlled vocabulary like Dublin Core
• Value: either the article or me, depending
• One annoying thing: URIs as identifiers
• What is my URI? Or the article’s (several versions)?
• Several other annoying things about RDF, but they’re
super-nerdy.
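The triple idea above can be sketched with plain Python tuples. This is not a real RDF library, just the data model in miniature. The two example.org URIs are hypothetical placeholders (which is exactly the "what is my URI?" problem); the property URIs are Dublin Core Terms.

```python
# Hypothetical identifiers: neither of these URIs is a real, assigned identifier.
ARTICLE = "http://example.org/articles/innkeeper-at-the-roach-motel"
AUTHOR = "http://example.org/people/dorothea-salo"

# Properties taken from the Dublin Core Terms vocabulary.
DC_CREATOR = "http://purl.org/dc/terms/creator"
DC_TITLE = "http://purl.org/dc/terms/title"

# Each triple is (subject, property, value).
triples = [
    (ARTICLE, DC_CREATOR, AUTHOR),
    (ARTICLE, DC_TITLE, "Innkeeper at the Roach Motel"),
]

def values_of(subject, prop, store):
    """Return every value recorded for this subject and property."""
    return [v for s, p, v in store if s == subject and p == prop]

def subjects_with(prop, value, store):
    """Go the other way: find every subject that has this property and value."""
    return [s for s, p, v in store if p == prop and v == value]

print(values_of(ARTICLE, DC_TITLE, triples))
print(subjects_with(DC_CREATOR, AUTHOR, triples))
```

Here the article is the subject and the author is the value. Choosing an "isAuthorOf"-style property instead would flip the two.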

Linked data
• As the web linked documents and people,
it’s now time (say some) to link data.
• Not a simple proposition!
• RDF is hard. Calling it linked data doesn’t make it easier.
• Data modeling is hard.
• Data integration is hard. RDF makes it easier... up to a
point. Still HUGE problems around people using the
same term differently, other unexamined assumptions.
• Idea gaining traction among governments,
other big data providers.
• So we probably need to keep our eye on it.
• ALWAYS a good idea to think about how
other people might use your metadata.

Kinds of metadata
• Descriptive (“bibliographic”)
• Who made this? When? Where? What’s it about? Etc.
• Technical
• What is this? What is its format? What made it? Etc.
• Administrative
• Who owns this? Who’s changed it? Who has what IP
rights over it? Who can see it? Etc.
• Structural
• How is this thing put together?
• In practice, the landscape is muddier.
• Most standards have bits of two or more types.
• Also, “relationship” metadata coming to the fore.

Descriptive metadata: MODS
• Metadata Object Description Schema
• Maintained by Library of Congress
• Stripped-down, human-readable MARC
in XML
• http://www.loc.gov/standards/mods/
• Sample: http://www.loc.gov/standards/mods/v3/mods99042030.xml
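To see what "human-readable MARC in XML" means in practice, here is a minimal sketch that reads a title and a name out of a MODS record. The record is hand-written and heavily simplified for illustration; it is not the Library of Congress sample, and real records carry many more elements.

```python
import xml.etree.ElementTree as ET

# The MODS version 3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hand-written, simplified record for illustration only.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Innkeeper at the Roach Motel</title></titleInfo>
  <name type="personal"><namePart>Salo, Dorothea</namePart></name>
</mods>"""

record = ET.fromstring(SAMPLE)

# Element names say what they hold, unlike MARC's numeric field tags.
title = record.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
names = [part.text for part in record.findall("mods:name/mods:namePart", MODS_NS)]

print(title)
print(names)
```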

Technical metadata: MIX
• Metadata for Images in XML
• By Library of Congress, NISO
• Captures information about an image’s file
format and other technical characteristics
• Why? Think about file-format
obsolescence.
• http://www.loc.gov/standards/mix/
• Sample document: http://www.loc.gov/standards/mix/instances/test_mix10.xml

Administrative metadata: PREMIS
• PREservation Metadata: Implementation Strategies
• who comes up with these acronyms?
• Library of Congress, again (the PREMIS Maintenance Activity)
• Designed to track digital preservation
activity across an object’s lifecycle
• http://www.loc.gov/standards/premis/
• Samples: look in http://www.dlib.org/dlib/september08/dappert/09dappert.html
• But be aware that PREMIS is usually embedded in
other metadata, like METS.

Structural metadata: METS
• Metadata Encoding and Transmission
Standard
• By... guess who?
• Wrapper for other kinds of metadata;
delineates the structure of a complex
digital object
• http://www.loc.gov/standards/mets/
• Samples: http://www.loc.gov/standards/mets/mets-examples.html

Metadata spaghetti: TEI
• Text Encoding Initiative
• by the TEI Consortium
• For digital transcriptions of books,
manuscripts, dictionaries, etc. etc.
• Content standard, not metadata standard!
But contains its own “metadata header”
• This header sometimes reused in other contexts
• Moral: Sometimes content “embeds”
metadata.
• This is OK, but should every content standard roll its
own internal metadata?

Where does metadata come from?
• Human data entry
• Slow, expensive, error-prone
• Often semi-automatable (80/20 point)
• If you can automate, DO IT. Do not waste keystrokes!
• Auto-extracting from a content object
• Common for technical metadata
• Auto-capture by preservation system
• Common for some administrative metadata
• Grabbing from elsewhere
• From other metadata: “crosswalking”
• HTML screenscraping, Excel spreadsheets
• Issues: authority control? granularity? accuracy?
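Crosswalking, as mentioned above, boils down to a field-to-field mapping plus a plan for whatever does not map. A minimal sketch follows. The local column names and the sample record are invented; the target names are Dublin Core elements.

```python
# Hypothetical mapping from local spreadsheet columns to Dublin Core elements.
CROSSWALK = {
    "Photographer": "creator",
    "Caption": "title",
    "Date taken": "date",
}

def crosswalk(record, mapping):
    """Convert one record; return (converted fields, fields with no mapping)."""
    converted, unmapped = {}, {}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            # Keep leftovers visible instead of silently dropping them.
            unmapped[field] = value
        else:
            # Target fields may repeat, so collect values in a list.
            converted.setdefault(target, []).append(value.strip())
    return converted, unmapped

# Invented source record.
source = {"Photographer": "Jane Doe ", "Caption": "Harbor at dusk", "Shelf": "B12"}
converted, unmapped = crosswalk(source, CROSSWALK)
print(converted)
print(unmapped)
```

The unmapped pile is where the granularity and authority-control issues usually surface.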

Subject metadata, specifically
• What is this thing about?
• Plenty of variation in sources
• Author’s keyword vs. indexer’s descriptor
• Controlled vocabulary vs. free-form keywording
• Community tagging/“folksonomy”
• Mechanically-extracted keywords
• All of this matters if you’re searching!

Where does metadata live?
• In XML files (or MARC files, or...)
• In relational databases
• In RDF “triple stores” (special databases)
• In content objects (as with TEI)
• Or some combination of the above!
• E.g. DSpace: can accept metadata in an XML file; stores
all metadata in relational database
• Next trick: associating content with its
metadata!
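As a rough picture of metadata living in a relational database and staying tied to its content, here is a sketch using Python's built-in sqlite3 module. The table layout is an assumption made for illustration; it is not DSpace's actual schema, and the second item is invented.

```python
import sqlite3

# In-memory database for illustration; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (item_id INTEGER, field TEXT, value TEXT)")

# One row per metadata statement; item_id ties the metadata to its content object.
rows = [
    (1, "title", "Innkeeper at the Roach Motel"),
    (1, "creator", "Salo, Dorothea"),
    (2, "title", "An invented second item"),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)

def fields_for(item_id):
    """Return all (field, value) pairs for one item, sorted by field name."""
    cursor = conn.execute(
        "SELECT field, value FROM metadata WHERE item_id = ? ORDER BY field",
        (item_id,),
    )
    return cursor.fetchall()

print(fields_for(1))
```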

What is done with metadata?
• To search against it or use it to browse,
you need to “index” it first.
• Turn it inside-out: records containing terms --> list
of terms and the records they appear in
• It’s all more complicated: stemming, phrases,
variant spellings, languages, stopwords, etc.
• The hot new indexing software is Apache “Solr.” It underlies
Blacklight (from UVa), which underlies Forward.
• Full-text search works the same way!
• Google’s index: MASSIVE database of words with
the web pages they appear in.
• Spider/crawler: program that follows links across
the web and indexes page content

Relevance ranking
• You have a bunch of words and the records
or documents they appear in. How do you
decide which records/pages to display first?
• Traditionally in libraries: last-in-first-out. Awful.
• Using document structure and metadata
• If the word’s in a title, heading, or subject field, take it
more seriously than if it’s just in ordinary text.
• TF/IDF
• Term frequency: how often the search term shows up in
a given record/document
• Inverse document frequency: how rare the search term
is in the whole mass of records/documents.

[Figure: TF (one record) vs. IDF (whole corpus)]
• High TF, rare term: super-relevant!
• Low TF, rare term: record not “about” this term
• High TF, common term: overused word or stopword
• Low TF, common term: irrelevant
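TF/IDF is easier to trust once you have computed it by hand. The sketch below uses one common weighting (term share of the document times the natural log of corpus size over documents containing the term). Real search engines use tuned variants, and the three documents are invented.

```python
import math
from collections import Counter

# Invented mini-corpus: id -> text.
DOCS = {
    "a": "metadata metadata standards",
    "b": "metadata for images",
    "c": "image colorspace and colorspace profiles",
}

def tf_idf(term, doc_id, docs):
    """Score one term against one document."""
    words = docs[doc_id].lower().split()
    # Term frequency: share of this document's words that are the term.
    tf = Counter(words)[term] / len(words)
    # Document frequency: how many documents contain the term at all.
    containing = sum(1 for text in docs.values() if term in text.lower().split())
    if containing == 0:
        return 0.0
    # Inverse document frequency: rarer terms get a bigger weight.
    idf = math.log(len(docs) / containing)
    return tf * idf

for term, doc_id in [("colorspace", "c"), ("metadata", "a"), ("metadata", "b")]:
    print(term, doc_id, round(tf_idf(term, doc_id, DOCS), 3))
```

"colorspace" in document c outscores "metadata" in document a. Both terms appear twice in their document, but "colorspace" is rarer across the corpus.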

What other information can be used to gauge relevance?
• People pointing
• Google: PageRank, based on counting links to a
document
• Scholarly communication: many metrics based on
later citation of articles
• People choosing
• Google also up-votes pages based on people
clicking on them in search results.
• Individual or social history of interests
• Amazon, Netflix
• Notice who’s doing this and who isn’t.
• Serious question: what about privacy?

Search engine optimization
• Making sure that your page turns up in
searches for relevant terms.
• Done maliciously, this amounts to spam. Google
spends LOTS of effort despamming its index.
• Clean markup helps. So does putting
highly relevant terms in highly visible/
important locations.
• Also, don’t overload pages! Dilutes vocabulary.

What else can you do with relevance information?
• Point people to PEOPLE and SERVICES,
not just search results!
• Point people to context that will help
them evaluate search results.
• We know people just throw search terms at boxes.
We might as well work with that.
• This may well be the best work Forward
is doing.

A word about GIS
• “Geographic Information Systems”
• It’s metadata all the way down! Metadata
about places.
• Also a lot about how to represent and visualize that
metadata.
• And how to mash it up with other data.
• Heavily based on relational-database
technology.
• HOT JOB MARKET. If you can get trained, do.

Finding and using metadata standards
• Nobody knows every metadata
standard out there. I sure don’t.
• But faced with a new standard, I may
have to get up to speed fast.
• I may even be making adoption decisions.
• So here’s how I do it.

Getting up to speed
• Find its website. If it doesn’t have a
website, you don’t want to use it.
• Is the website current? Is there recent activity?
• Is there a list of who’s using this standard?
• Find a sample record.
• How is this standard expressed? XML, RDF, what?
• Does it pass a sniff test?
• Find the documentation and community.
• “Tag libraries” and “data dictionaries” especially helpful.
• Primers, “getting started” documents also nice.
• Look for tools.
• Authoring/crosswalk tools (and programming libraries)
• Validation tools
