Metadata
25 October 2010
Weekly reflection
• What digital “stuff” do you have? Where
do you put it? How do you organize it, if
you do? How do you find it when you
need it?
Tool of the week: Self-efficacy
• In the course of your career, you will have to
do things you don’t entirely know how to do.
• Technical and non-!
• Without training, guidance, or clear instructions.
• No, of course we don’t teach you everything in library
school!
• Learn to dive in despite imperfect knowledge.
• Use your common sense.
• Trust that those around you want you to succeed.
• If you need to, research! Always be ready to learn.
• Mentors are great... but they’re not babysitters.
• Accept imperfection.
• Please model these behaviors in my class!
Tip of the week: Staying informed
• Weblogs and newsfeeds are your friends.
• If you are not reading at least a few librarian blogs,
you are not staying informed.
• Can’t hurt to pick up some journal TOCs too.
• Blogs are faster than the published literature! And
often written by the same people.
• For (library) tech:
• Librarian in Black
• Planet Code4Lib
• librarian.net
• Lifehacker, Gizmodo, Engadget
• Roy Tennant’s LJ columns
What is metadata?
• Heck, I dunno. I’m not sure that’s even a
useful question.
• This is one reason I’m not a library-school
professor. Definitional pilpul bores me.
• Operationally: when we collect stuff, we
take notes on it so we can organize it,
inventory it, find it later, etc. Those
notes are metadata.
• Is MARC metadata? Well, of course!
• But many librarians don’t think about it that way.
Why are there so many
metadata standards?
• Different things described
• For an image, you want to know its bit depth and
colorspace. This has no meaning for a finding aid.
• Several targeted standards vastly easier to cope
with than one supposedly universal standard.
• Different purposes
• More on this in a moment
• Different provider and user communities
• Level of detail/specificity
• Wheel (or toothbrush) reinvention
Metadata file formats
• You can express metadata in an Excel
spreadsheet, a MARC record, XML, RDF...
• But some expressions are more readable, useful,
and reusable than others!
• Metadata librarians spend a lot of time fixing and
transforming Other People’s Metadata, in as
automated a fashion as possible.
• Large majority of modern metadata
standards expressed in XML.
• Though RDF wants to be a contender, and XML is
only one way of several to express RDF.
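A minimal sketch of the "fixing and transforming" work mentioned above, using only Python's standard library. The spreadsheet columns and the `records`/`record` element names are made up for illustration; they do not follow any particular metadata standard.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet export: one row per item, one column per field.
SPREADSHEET = """title,creator,language
Innkeeper at the Roach Motel,Dorothea Salo,en
"""

def rows_to_xml(csv_text):
    """Turn each spreadsheet row into a <record> with one child element per column."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            ET.SubElement(record, field).text = value.strip()
    return root

xml_text = ET.tostring(rows_to_xml(SPREADSHEET), encoding="unicode")
print(xml_text)
```

The same information lives in both expressions; the XML version is simply easier for other software to validate and reuse.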
So what’s this RDF thing all the
cool kids are talking about?
• Resource Description Framework
• by the W3C
• Like XML, RDF is more or less friendly to
whatever kind of metadata you want to
throw at it.
• Unlike XML, RDF is a data model designed for integrating
information from different metadata vocabularies, and
expressing how items and metadata records relate to one
another. Links and linking!
• (Also, XML works for content, e.g. TEI. RDF doesn’t.)
(very) Basic RDF
• “Triple:” subject, property, value
• A little like subject, verb, object in English.
• Dorothea Salo is the author of “Innkeeper
at the Roach Motel.”
• Subject: either me or the article (works either way,
depending on property chosen)
• Property: authorship (“isAuthorOf” or “isBy”); often
comes from a controlled vocabulary like Dublin Core
• Value: either the article or me, depending
• One annoying thing: URIs as identifiers
• What is my URI? Or the article’s (several versions)?
• Several other annoying things about RDF, but they’re
super-nerdy.
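The triple above can be sketched in plain Python, without any RDF library. The `example.org` URIs and the `isAuthorOf` property are placeholders invented for this sketch (which is exactly the "what is my URI?" problem); `http://purl.org/dc/terms/creator` is the Dublin Core "creator" term. The output format is N-Triples, one of the simplest ways to write RDF down.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str  # URI of the thing being described
    prop: str     # URI of the property, ideally from a shared vocabulary
    value: str    # URI of the thing the property points to

# Placeholder identifiers, assumed for illustration only.
ME = "http://example.org/people/dorothea-salo"
ARTICLE = "http://example.org/articles/innkeeper-at-the-roach-motel"

# The same fact, stated in either direction depending on the property chosen.
article_first = Triple(ARTICLE, "http://purl.org/dc/terms/creator", ME)
person_first = Triple(ME, "http://example.org/vocab/isAuthorOf", ARTICLE)

def to_ntriples(triple):
    """Serialize a URI-only triple as one N-Triples line."""
    return f"<{triple.subject}> <{triple.prop}> <{triple.value}> ."

for t in (article_first, person_first):
    print(to_ntriples(t))
```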
Linked data
• As the web linked documents and people,
it’s now time (say some) to link data.
• Not a simple proposition!
• RDF is hard. Calling it linked data doesn’t make it easier.
• Data modeling is hard.
• Data integration is hard. RDF makes it easier... up to a
point. Still HUGE problems around people using the
same term differently, other unexamined assumptions.
• Idea gaining traction among governments,
other big data providers.
• So we probably need to keep our eye on it.
• ALWAYS a good idea to think about how
other people might use your metadata.
Kinds of metadata
• Descriptive (“bibliographic”)
• Who made this? When? Where? What’s it about? Etc.
• Technical
• What is this? What is its format? What made it? Etc.
• Administrative
• Who owns this? Who’s changed it? Who has what IP
rights over it? Who can see it? Etc.
• Structural
• How is this thing put together?
• In practice, the landscape is muddier.
• Most standards have bits of two or more types.
• Also, “relationship” metadata coming to the fore.
Descriptive metadata:
MODS
• Metadata Object Description Schema
• Maintained by Library of Congress
• Stripped-down, human-readable MARC
in XML
• http://www.loc.gov/standards/mods/
• Sample: http://www.loc.gov/standards/mods/v3/mods99042030.xml
Technical metadata: MIX
• Metadata for Images in XML
• By Library of Congress, NISO
• Captures information about an image’s file
format and other technical characteristics
• Why? Think about file-format
obsolescence.
• http://www.loc.gov/standards/mix/
• Sample document: http://www.loc.gov/standards/mix/instances/test_mix10.xml
Administrative
metadata: PREMIS
• PREservation Metadata: Implementation
Strategies
• who comes up with these acronyms?
• Library of Congress, again
• Designed to track digital preservation
activity across an object’s lifecycle
• http://www.loc.gov/standards/premis/
• Samples: look in http://www.dlib.org/dlib/september08/dappert/09dappert.html
• But be aware that PREMIS is usually embedded in
other metadata, like METS.
Structural metadata:
METS
• Metadata Encoding and Transmission
Standard
• By... guess who?
• Wrapper for other kinds of metadata;
delineates the structure of a complex
digital object
• http://www.loc.gov/standards/mets/
• Samples: http://www.loc.gov/standards/mets/mets-examples.html
Metadata spaghetti: TEI
• Text Encoding Initiative
• by the TEI Consortium
• For digital transcriptions of books,
manuscripts, dictionaries, etc. etc.
• Content standard, not metadata standard!
But contains its own “metadata header”
• This header sometimes reused in other contexts
• Moral: Sometimes content “embeds”
metadata.
• This is OK, but should every content standard roll its
own internal metadata?
Where does metadata
come from?
• Human data entry
• Slow, expensive, error-prone
• Often semi-automatable (80/20 point)
• If you can automate, DO IT. Do not waste keystrokes!
• Auto-extracting from a content object
• Common for technical metadata
• Auto-capture by preservation system
• Common for some administrative metadata
• Grabbing from elsewhere
• From other metadata: “crosswalking”
• HTML screenscraping, Excel spreadsheets
• Issues: authority control? granularity? accuracy?
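Crosswalking, roughly sketched: a lookup table says which source field feeds which target element, and anything the table does not cover is set aside for a human. The mapping below is a deliberately tiny, simplified MARC-to-Dublin-Core example, and the sample record is invented; a real crosswalk covers far more fields and edge cases.

```python
# Simplified, illustrative mapping: (MARC tag, subfield code) -> Dublin Core element.
CROSSWALK = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("650", "a"): "subject",
    ("260", "b"): "publisher",
}

def crosswalk(marc_fields):
    """Map (tag, subfield, value) tuples to a dict of Dublin Core elements.

    Returns the mapped record plus the fields the crosswalk could not place.
    """
    dc = {}
    unmapped = []
    for tag, code, value in marc_fields:
        element = CROSSWALK.get((tag, code))
        if element is None:
            unmapped.append((tag, code, value))
            continue
        # Strip trailing cataloging punctuation. Crude: this would also eat
        # the period in a name ending in "Jr.", which is an accuracy issue.
        dc.setdefault(element, []).append(value.rstrip(" /:,.;"))
    return dc, unmapped

# Invented sample record, for illustration only.
sample = [
    ("245", "a", "Metadata basics /"),
    ("100", "a", "Example, Author."),
    ("650", "a", "Metadata."),
    ("300", "a", "29 slides"),
]
dc_record, leftovers = crosswalk(sample)
print(dc_record)
print(leftovers)
```

The leftovers list is where the granularity question shows up: the source had detail the target has no slot for.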
Subject metadata,
specifically
• What is this thing about?
• Plenty of variation in sources
• Author’s keyword vs. indexer’s descriptor
• Controlled vocabulary vs. free-form keywording
• Community tagging/“folksonomy”
• Mechanically-extracted keywords
• All of this matters if you’re searching!
Where does metadata live?
• In XML files (or MARC files, or...)
• In relational databases
• In RDF “triple stores” (special databases)
• In content objects (as with TEI)
• Or some combination of the above!
• E.g. DSpace: can accept metadata in an XML file; stores
all metadata in relational database
• Next trick: associating content with its
metadata!
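One way to picture metadata in a relational database, and the link between content and its metadata: one table for the content files, one for field/value notes that point back at them. This is a hypothetical two-table layout using Python's built-in `sqlite3`, not the actual schema of DSpace or any other system; the file path is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, file_path TEXT NOT NULL);
    CREATE TABLE metadata (
        item_id INTEGER NOT NULL REFERENCES item(id),
        field   TEXT NOT NULL,
        value   TEXT NOT NULL
    );
""")

# The content object (here just a file path) and the notes taken about it.
conn.execute("INSERT INTO item (id, file_path) VALUES (1, 'files/roach-motel.pdf')")
conn.executemany(
    "INSERT INTO metadata (item_id, field, value) VALUES (?, ?, ?)",
    [
        (1, "title", "Innkeeper at the Roach Motel"),
        (1, "creator", "Dorothea Salo"),
    ],
)

def find_files(field, value):
    """Follow the metadata back to the content it describes."""
    rows = conn.execute(
        "SELECT item.file_path FROM item "
        "JOIN metadata ON metadata.item_id = item.id "
        "WHERE metadata.field = ? AND metadata.value = ?",
        (field, value),
    )
    return [path for (path,) in rows]

print(find_files("creator", "Dorothea Salo"))
```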
What is done with metadata?
• To search against it or use it to browse,
you need to “index” it first.
• Turn it inside-out: records containing terms --> list
of terms and the records they appear in
• It’s all more complicated: stemming, phrases,
variant spellings, languages, stopwords, etc.
• The hot new indexing software is “Solr,” an Apache project.
Underlies Blacklight (from UVa), which underlies Forward.
• Full-text search works the same way!
• Google’s index: MASSIVE database of words with
the web pages they appear in.
• Spider/crawler: program that follows links across
the web and indexes page content
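The "turn it inside-out" step can be sketched in a few lines. The records and the stopword list here are invented for illustration, and the tokenizer ignores stemming, phrases, and variant spellings entirely; real indexing software handles all of that.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real lists are longer and language-specific.
STOPWORDS = {"the", "of", "for", "and", "a"}

# Invented records: id -> text to be indexed.
RECORDS = {
    1: "Metadata for digital images",
    2: "Preserving digital images and text",
    3: "The metadata of the text",
}

def tokenize(text):
    """Lowercase, split into words, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def build_index(records):
    """Records containing terms -> terms and the records they appear in."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return index

def search(index, query):
    """Return ids of records that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(RECORDS)
print(sorted(index["metadata"]))
print(sorted(search(index, "digital images")))
```

Searching now means looking up a few terms in the index instead of reading every record.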
Relevance ranking
• You have a bunch of words and the records
or documents they appear in. How do you
decide which records/pages to display first?
• Traditionally in libraries: last-in-first-out. Awful.
• Using document structure and metadata
• If the word’s in a title, heading, or subject field, take it
more seriously than if it’s just in ordinary text.
• TF/IDF
• Term frequency: how often the search term shows up in
a given record/document
• Inverse document frequency: how rare the search term
is in the whole mass of records/documents.
TF (one record) vs. IDF (whole corpus):
• High TF, rare term: Super-relevant!
• Low TF, rare term: Record not “about” this term
• High TF, common term: Overused word or stopword
• Low TF, common term: Irrelevant
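A rough sketch of TF/IDF scoring, under stated assumptions: the three "documents" are invented word lists, TF is the term's share of the words in one document, and IDF is the natural log of (number of documents / number of documents containing the term). This is only one common textbook variant; real search engines use smoothed and tuned versions.

```python
import math

# Invented mini-corpus: document id -> list of words.
DOCS = {
    "a": "the colorspace of the colorspace image".split(),
    "b": "the finding aid of the archive".split(),
    "c": "the image of the archive".split(),
}

def tf(term, words):
    """Term frequency: how much of this one document is the term?"""
    return words.count(term) / len(words)

def idf(term, docs):
    """Inverse document frequency: how rare is the term across the corpus?"""
    containing = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(term, doc_id, docs):
    return tf(term, docs[doc_id]) * idf(term, docs)

def rank(term, docs):
    """Document ids with a positive score, best first."""
    scores = {doc_id: tf_idf(term, doc_id, docs) for doc_id in docs}
    hits = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(hits, key=lambda doc_id: scores[doc_id], reverse=True)

print(rank("colorspace", DOCS))  # rare term, frequent in one document
print(rank("the", DOCS))         # appears everywhere, so IDF is zero
```

A word that shows up in every document gets an IDF of zero, which is why stopwords drop out of the ranking without any special handling.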
What other information can
be used to gauge relevance?
• People pointing
• Google: PageRank, based on counting links to a
document
• Scholarly communication: many metrics based on
later citation of articles
• People choosing
• Google also up-votes pages based on people
clicking on them in search results.
• Individual or social history of interests
• Amazon, Netflix
• Notice who’s doing this and who isn’t.
• Serious question: what about privacy?
http://xkcd.com/522
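To make "people pointing" concrete, here is a simplified sketch of the published PageRank idea: every page repeatedly passes a share of its score along its outgoing links. The link graph is invented, the damping factor of 0.85 and the 50 iterations are conventional illustrative choices, and none of this reflects how Google's production ranking actually works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline, then shares arrive via links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Invented link graph, for illustration only.
LINKS = {
    "home": ["catalog", "blog"],
    "blog": ["catalog"],
    "catalog": ["home"],
    "orphan": ["catalog"],
}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The page nobody links to ends up at the bottom no matter what it says about itself, which is the whole point of counting other people's links.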
Search engine
optimization
• Making sure that your page turns up in
searches for relevant terms.
• Done maliciously, this amounts to spam. Google
spends LOTS of effort despamming its index.
• Clean markup helps. So does putting
highly relevant terms in highly visible/
important locations.
• Also, don’t overload pages! Dilutes vocabulary.
What else can you do with
relevance information?
• Point people to PEOPLE and SERVICES,
not just search results!
• Point people to context that will help
them evaluate search results.
• We know people just throw search terms at boxes.
We might as well work with that.
• This may well be the best work Forward
is doing.
A word about GIS
• “Geographic Information Systems”
• It’s metadata all the way down! Metadata
about places.
• Also a lot about how to represent and visualize that
metadata.
• And how to mash it up with other data.
• Heavily based on relational-database
technology.
• HOT JOB MARKET. If you can get trained, do.
Finding and using
metadata standards
• Nobody knows every metadata
standard out there. I sure don’t.
• But faced with a new standard, I may
have to get up to speed fast.
• I may even be making adoption decisions.
• So here’s how I do it.
Getting up to speed
• Find its website. If it doesn’t have a
website, you don’t want to use it.
• Is the website current? Is there recent activity?
• Is there a list of who’s using this standard?
• Find a sample record.
• How is this standard expressed? XML, RDF, what?
• Does it pass a sniff test?
• Find the documentation and community.
• “Tag libraries” and “data dictionaries” especially helpful.
• Primers, “getting started” documents also nice.
• Look for tools.
• Authoring/crosswalk tools (and programming libraries)
• Validation tools

More Related Content

What's hot

The Buzz About BIBFRAME, by Angela Kroeger
The Buzz About BIBFRAME, by Angela KroegerThe Buzz About BIBFRAME, by Angela Kroeger
The Buzz About BIBFRAME, by Angela KroegerAngela Kroeger
 
The liaison librarian: connecting with the qualitative research lifecycle
The liaison librarian: connecting with the qualitative research lifecycleThe liaison librarian: connecting with the qualitative research lifecycle
The liaison librarian: connecting with the qualitative research lifecycleCelia Emmelhainz
 
Library Language: Vocabulary for the Modern Librarian
Library Language: Vocabulary for the Modern LibrarianLibrary Language: Vocabulary for the Modern Librarian
Library Language: Vocabulary for the Modern LibrarianLibraries Thriving
 
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅kulibrarians
 
EDUC 601 Library Presentation
EDUC 601 Library PresentationEDUC 601 Library Presentation
EDUC 601 Library Presentationkmokopp
 
Referencing methods and approaches
Referencing methods and approachesReferencing methods and approaches
Referencing methods and approachesKaren McAulay
 
Annotated bib and research strategies
Annotated bib and research strategiesAnnotated bib and research strategies
Annotated bib and research strategiesTraciwm
 
Ws spring 2014 rogers
Ws spring 2014 rogersWs spring 2014 rogers
Ws spring 2014 rogersTraciwm
 
Searching for MAED Research Articles
Searching for MAED Research ArticlesSearching for MAED Research Articles
Searching for MAED Research Articlesviterbolibrary
 
Finding the annotation needs of the botanical community in a digital library
Finding the annotation needs of the botanical community in a digital libraryFinding the annotation needs of the botanical community in a digital library
Finding the annotation needs of the botanical community in a digital libraryWilliam Ulate
 
CWIN17 Frankfurt / talend_nlp
CWIN17 Frankfurt / talend_nlpCWIN17 Frankfurt / talend_nlp
CWIN17 Frankfurt / talend_nlpCapgemini
 
Searching of Web and Electronic Resources
Searching of Web and Electronic Resources Searching of Web and Electronic Resources
Searching of Web and Electronic Resources Bramesha B
 
Research Strategies
Research StrategiesResearch Strategies
Research StrategiesTraciwm
 
Digital Medieval Manuscripts
Digital Medieval ManuscriptsDigital Medieval Manuscripts
Digital Medieval Manuscriptsblalbritton
 
Writing Seminar Moore
Writing Seminar Moore Writing Seminar Moore
Writing Seminar Moore Traciwm
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communicationSören Auer
 
Engl 1221 bauer spring 2014
Engl 1221 bauer spring 2014Engl 1221 bauer spring 2014
Engl 1221 bauer spring 2014Traciwm
 

What's hot (20)

The Buzz About BIBFRAME, by Angela Kroeger
The Buzz About BIBFRAME, by Angela KroegerThe Buzz About BIBFRAME, by Angela Kroeger
The Buzz About BIBFRAME, by Angela Kroeger
 
The liaison librarian: connecting with the qualitative research lifecycle
The liaison librarian: connecting with the qualitative research lifecycleThe liaison librarian: connecting with the qualitative research lifecycle
The liaison librarian: connecting with the qualitative research lifecycle
 
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti... NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 
Library Language: Vocabulary for the Modern Librarian
Library Language: Vocabulary for the Modern LibrarianLibrary Language: Vocabulary for the Modern Librarian
Library Language: Vocabulary for the Modern Librarian
 
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
20170222 ku-librarians勉強会 #211 :海外研修報告:英国大学図書館を北から南へ巡る旅
 
EDUC 601 Library Presentation
EDUC 601 Library PresentationEDUC 601 Library Presentation
EDUC 601 Library Presentation
 
Referencing methods and approaches
Referencing methods and approachesReferencing methods and approaches
Referencing methods and approaches
 
Annotated bib and research strategies
Annotated bib and research strategiesAnnotated bib and research strategies
Annotated bib and research strategies
 
Ws spring 2014 rogers
Ws spring 2014 rogersWs spring 2014 rogers
Ws spring 2014 rogers
 
Searching for MAED Research Articles
Searching for MAED Research ArticlesSearching for MAED Research Articles
Searching for MAED Research Articles
 
Finding the annotation needs of the botanical community in a digital library
Finding the annotation needs of the botanical community in a digital libraryFinding the annotation needs of the botanical community in a digital library
Finding the annotation needs of the botanical community in a digital library
 
CWIN17 Frankfurt / talend_nlp
CWIN17 Frankfurt / talend_nlpCWIN17 Frankfurt / talend_nlp
CWIN17 Frankfurt / talend_nlp
 
Searching of Web and Electronic Resources
Searching of Web and Electronic Resources Searching of Web and Electronic Resources
Searching of Web and Electronic Resources
 
Research Strategies
Research StrategiesResearch Strategies
Research Strategies
 
Digital Medieval Manuscripts
Digital Medieval ManuscriptsDigital Medieval Manuscripts
Digital Medieval Manuscripts
 
Writing Seminar Moore
Writing Seminar Moore Writing Seminar Moore
Writing Seminar Moore
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communication
 
Databases mtcp4
Databases mtcp4Databases mtcp4
Databases mtcp4
 
Embedding Linked Data Invisibly into Web Pages: Strategies and Workflows for ...
Embedding Linked Data Invisibly into Web Pages: Strategies and Workflows for ...Embedding Linked Data Invisibly into Web Pages: Strategies and Workflows for ...
Embedding Linked Data Invisibly into Web Pages: Strategies and Workflows for ...
 
Engl 1221 bauer spring 2014
Engl 1221 bauer spring 2014Engl 1221 bauer spring 2014
Engl 1221 bauer spring 2014
 

Viewers also liked

Manufacturing Serendipity
Manufacturing SerendipityManufacturing Serendipity
Manufacturing SerendipityDorothea Salo
 
So are we winning yet?
So are we winning yet?So are we winning yet?
So are we winning yet?Dorothea Salo
 
Open Sesame (and other open movements)
Open Sesame (and other open movements)Open Sesame (and other open movements)
Open Sesame (and other open movements)Dorothea Salo
 
Preservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesPreservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesDorothea Salo
 
Soylent Semantic Web Is People! (with notes)
Soylent Semantic Web Is People! (with notes)Soylent Semantic Web Is People! (with notes)
Soylent Semantic Web Is People! (with notes)Dorothea Salo
 
RDF, RDA, and other TLAs
RDF, RDA, and other TLAsRDF, RDA, and other TLAs
RDF, RDA, and other TLAsDorothea Salo
 
I own copyright, so I pwn you!
I own copyright, so I pwn you!I own copyright, so I pwn you!
I own copyright, so I pwn you!Dorothea Salo
 
So are we winning yet?
So are we winning yet?So are we winning yet?
So are we winning yet?Dorothea Salo
 
Even the Loons are Licensed
Even the Loons are LicensedEven the Loons are Licensed
Even the Loons are LicensedDorothea Salo
 
Solving Problems with Web 2.0
Solving Problems with Web 2.0Solving Problems with Web 2.0
Solving Problems with Web 2.0Dorothea Salo
 
A Successful Failure: Community Requirements Gathering for DSpace
A Successful Failure: Community Requirements Gathering for DSpaceA Successful Failure: Community Requirements Gathering for DSpace
A Successful Failure: Community Requirements Gathering for DSpaceDorothea Salo
 
Lipstick on a Pig: Integrated Library Systems
Lipstick on a Pig: Integrated Library SystemsLipstick on a Pig: Integrated Library Systems
Lipstick on a Pig: Integrated Library SystemsDorothea Salo
 
Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Dorothea Salo
 
So you think you know libraries
So you think you know librariesSo you think you know libraries
So you think you know librariesDorothea Salo
 
Save the Cows! Cyberinfrastructure for the Rest of Us
Save the Cows! Cyberinfrastructure for the Rest of UsSave the Cows! Cyberinfrastructure for the Rest of Us
Save the Cows! Cyberinfrastructure for the Rest of UsDorothea Salo
 
Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Dorothea Salo
 

Viewers also liked (20)

Manufacturing Serendipity
Manufacturing SerendipityManufacturing Serendipity
Manufacturing Serendipity
 
So are we winning yet?
So are we winning yet?So are we winning yet?
So are we winning yet?
 
Open Sesame (and other open movements)
Open Sesame (and other open movements)Open Sesame (and other open movements)
Open Sesame (and other open movements)
 
Occupy Copyright!
Occupy Copyright!Occupy Copyright!
Occupy Copyright!
 
Preservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesPreservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanities
 
Soylent Semantic Web Is People! (with notes)
Soylent Semantic Web Is People! (with notes)Soylent Semantic Web Is People! (with notes)
Soylent Semantic Web Is People! (with notes)
 
RDF, RDA, and other TLAs
RDF, RDA, and other TLAsRDF, RDA, and other TLAs
RDF, RDA, and other TLAs
 
I own copyright, so I pwn you!
I own copyright, so I pwn you!I own copyright, so I pwn you!
I own copyright, so I pwn you!
 
Encryption
EncryptionEncryption
Encryption
 
So are we winning yet?
So are we winning yet?So are we winning yet?
So are we winning yet?
 
Even the Loons are Licensed
Even the Loons are LicensedEven the Loons are Licensed
Even the Loons are Licensed
 
Escaping Datageddon
Escaping DatageddonEscaping Datageddon
Escaping Datageddon
 
Solving Problems with Web 2.0
Solving Problems with Web 2.0Solving Problems with Web 2.0
Solving Problems with Web 2.0
 
A Successful Failure: Community Requirements Gathering for DSpace
A Successful Failure: Community Requirements Gathering for DSpaceA Successful Failure: Community Requirements Gathering for DSpace
A Successful Failure: Community Requirements Gathering for DSpace
 
Who owns our work?
Who owns our work?Who owns our work?
Who owns our work?
 
Lipstick on a Pig: Integrated Library Systems
Lipstick on a Pig: Integrated Library SystemsLipstick on a Pig: Integrated Library Systems
Lipstick on a Pig: Integrated Library Systems
 
Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Grab a bucket! It's raining data!
Grab a bucket! It's raining data!
 
So you think you know libraries
So you think you know librariesSo you think you know libraries
So you think you know libraries
 
Save the Cows! Cyberinfrastructure for the Rest of Us
Save the Cows! Cyberinfrastructure for the Rest of UsSave the Cows! Cyberinfrastructure for the Rest of Us
Save the Cows! Cyberinfrastructure for the Rest of Us
 
Grab a bucket! It's raining data!
Grab a bucket! It's raining data!Grab a bucket! It's raining data!
Grab a bucket! It's raining data!
 

Similar to Organizing Digital Stuff

Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information ArchitectureRob Bogue
 
Intro to the semantic web (for libraries)
Intro to the semantic web (for libraries) Intro to the semantic web (for libraries)
Intro to the semantic web (for libraries) robin fay
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information ArchitectureRob Bogue
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genrobin fay
 
Online Citation Tools
Online Citation ToolsOnline Citation Tools
Online Citation Toolswill wade
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
INFORMATION SKILLS: NAVIGATING RESEARCH IN LIBRARY
INFORMATION SKILLS: NAVIGATING RESEARCH IN LIBRARYINFORMATION SKILLS: NAVIGATING RESEARCH IN LIBRARY
INFORMATION SKILLS: NAVIGATING RESEARCH IN LIBRARYChris Okiki
 
Linked data and the future of libraries
Linked data and the future of librariesLinked data and the future of libraries
Linked data and the future of librariesRegan Harper
 
An introduction to Metadata Application Profiles
An introduction to Metadata Application ProfilesAn introduction to Metadata Application Profiles
An introduction to Metadata Application Profileskcoylenet
 
Semantic web xml-rdf-dom parser
Semantic web xml-rdf-dom parserSemantic web xml-rdf-dom parser
Semantic web xml-rdf-dom parserSerdar Sönmez
 
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics HackathonxAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics HackathonRussell Duhon
 
Summary of Trends in Cataloging
Summary of Trends in CatalogingSummary of Trends in Cataloging
Summary of Trends in CatalogingWilliam Worford
 
Write a better FM
Write a better FMWrite a better FM
Write a better FMRich Bowen
 
Publishing and Using Linked Open Data - Day 4
Publishing and Using Linked Open Data - Day 4Publishing and Using Linked Open Data - Day 4
Publishing and Using Linked Open Data - Day 4Richard Urban
 

Similar to Organizing Digital Stuff (20)

Library Linked Data
Library Linked DataLibrary Linked Data
Library Linked Data
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information Architecture
 
Intro to the semantic web (for libraries)
Intro to the semantic web (for libraries) Intro to the semantic web (for libraries)
Intro to the semantic web (for libraries)
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information Architecture
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services gen
 
Online Citation Tools
Online Citation ToolsOnline Citation Tools
Online Citation Tools
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
A theory of Metadata enriching & filtering
A theory of  Metadata enriching & filteringA theory of  Metadata enriching & filtering
A theory of Metadata enriching & filtering
 
INFORMATION SKILLS: NAVIGATING RESEARCH IN LIBRARY
INFORMATION SKILLS: NAVIGATING RESEARCH IN LIBRARYINFORMATION SKILLS: NAVIGATING RESEARCH IN LIBRARY
INFORMATION SKILLS: NAVIGATING RESEARCH IN LIBRARY
 
Linked data and the future of libraries
Linked data and the future of librariesLinked data and the future of libraries
Linked data and the future of libraries
 
Beyond gsafd
Beyond gsafdBeyond gsafd
Beyond gsafd
 
An introduction to Metadata Application Profiles
An introduction to Metadata Application ProfilesAn introduction to Metadata Application Profiles
An introduction to Metadata Application Profiles
 
Code4Lib Keynote 2011
Code4Lib Keynote 2011Code4Lib Keynote 2011
Code4Lib Keynote 2011
 
Schema and Identity for Linked Data
Schema and Identity for Linked DataSchema and Identity for Linked Data
Schema and Identity for Linked Data
 
Semantic web xml-rdf-dom parser
Semantic web xml-rdf-dom parserSemantic web xml-rdf-dom parser
Semantic web xml-rdf-dom parser
 
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics HackathonxAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
 
Summary of Trends in Cataloging
Summary of Trends in CatalogingSummary of Trends in Cataloging
Summary of Trends in Cataloging
 
Write a better FM
Write a better FMWrite a better FM
Write a better FM
 
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
 
Publishing and Using Linked Open Data - Day 4
Publishing and Using Linked Open Data - Day 4Publishing and Using Linked Open Data - Day 4
Publishing and Using Linked Open Data - Day 4
 

More from Dorothea Salo

Soylent SemanticWeb Is People!
Soylent SemanticWeb Is People!Soylent SemanticWeb Is People!
Soylent SemanticWeb Is People!Dorothea Salo
 
Privacy and libraries
Privacy and librariesPrivacy and libraries
Privacy and librariesDorothea Salo
 
The Canonically Bad (Digital) Humanities Proposal (and how to avoid it)
The Canonically Bad (Digital) Humanities Proposal (and how to avoid it)The Canonically Bad (Digital) Humanities Proposal (and how to avoid it)
The Canonically Bad (Digital) Humanities Proposal (and how to avoid it)Dorothea Salo
 
Is this BIG DATA which I see before me?
Is this BIG DATA which I see before me?Is this BIG DATA which I see before me?
Is this BIG DATA which I see before me?Dorothea Salo
 
Research Data and Scholarly Communication
Research Data and Scholarly CommunicationResearch Data and Scholarly Communication
Research Data and Scholarly CommunicationDorothea Salo
 
Research Data and Scholarly Communication (with notes)
Research Data and Scholarly Communication (with notes)Research Data and Scholarly Communication (with notes)
Research Data and Scholarly Communication (with notes)Dorothea Salo
 
Librarians love data!
Librarians love data!Librarians love data!
Librarians love data!Dorothea Salo
 
Taming the Monster: Digital Preservation Planning and Implementation Tools
Taming the Monster: Digital Preservation Planning and Implementation ToolsTaming the Monster: Digital Preservation Planning and Implementation Tools
Taming the Monster: Digital Preservation Planning and Implementation ToolsDorothea Salo
 
Avoiding the Heron's Way
Avoiding the Heron's WayAvoiding the Heron's Way
Avoiding the Heron's WayDorothea Salo
 
Manufacturing Serendipity
Manufacturing SerendipityManufacturing Serendipity
Manufacturing SerendipityDorothea Salo
 
Databases, Markup, and Regular Expressions
Databases, Markup, and Regular ExpressionsDatabases, Markup, and Regular Expressions
Databases, Markup, and Regular ExpressionsDorothea Salo
 

Organizing Digital Stuff

  • 2. Weekly reflection • What digital “stuff” do you have? Where do you put it? How do you organize it, if you do? How do you find it when you need it?
  • 3. Tool of the week: Self-efficacy • In the course of your career, you will have to do things you don’t entirely know how to do. • Technical and non-! • Without training, guidance, or clear instructions. • No, of course we don’t teach you everything in library school! • Learn to dive in despite imperfect knowledge. • Use your common sense. • Trust that those around you want you to succeed. • If you need to, research! Always be ready to learn. • Mentors are great... but they’re not babysitters. • Accept imperfection. • Please model these behaviors in my class!
  • 4. Tip of the week: Staying informed • Weblogs and newsfeeds are your friends. • If you are not reading at least a few librarian blogs, you are not staying informed. • Can’t hurt to pick up some journal TOCs too. • Blogs are faster than the published literature! And often written by the same people. • For (library) tech: • Librarian in Black • Planet Code4Lib • librarian.net • Lifehacker, Gizmodo, Engadget • Roy Tennant’s LJ columns
  • 5. What is metadata? • Heck, I dunno. I’m not sure that’s even a useful question. • This is one reason I’m not a library-school professor. Definitional pilpul bores me. • Operationally: when we collect stuff, we take notes on it so we can organize it, inventory it, find it later, etc. Those notes are metadata. • Is MARC metadata? Well, of course! • But many librarians don’t think about it that way.
  • 6. Why are there so many metadata standards? • Different things described • For an image, you want to know its bit depth and colorspace. This has no meaning for a finding aid. • Several targeted standards vastly easier to cope with than one supposedly universal standard. • Different purposes • More on this in a moment • Different provider and user communities • Level of detail/specificity • Wheel (or toothbrush) reinvention
  • 7. Metadata file formats • You can express metadata in an Excel spreadsheet, a MARC record, XML, RDF... • But some expressions are more readable, useful, and reusable than others! • Metadata librarians spend a lot of time fixing and transforming Other People’s Metadata, in as automated a fashion as possible. • Large majority of modern metadata standards expressed in XML. • Though RDF wants to be a contender, and XML is only one way of several to express RDF.
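As a rough illustration of "transforming Other People's Metadata" automatically, here is a minimal Python sketch that turns spreadsheet rows into XML. The column names, the second row, and the `records`/`record` element names are invented for the example; they do not come from any real standard.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet export: one row per item, one column per field.
SPREADSHEET = """title,creator
Innkeeper at the Roach Motel,Dorothea Salo
Sample Photograph,
"""

def rows_to_xml(csv_text):
    """Turn each spreadsheet row into a <record> with one child per filled cell."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            if value:  # skip empty cells instead of emitting empty elements
                ET.SubElement(record, field).text = value.strip()
    return root

root = rows_to_xml(SPREADSHEET)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Real cleanup work is messier (inconsistent headers, merged cells, several values crammed into one cell), but the shape is the same: read rows, normalize, write out something more reusable.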
  • 8. So what’s this RDF thing all the cool kids are talking about? • Resource Description Framework • by the W3C • Like XML, RDF is more or less friendly to whatever kind of metadata you want to throw at it. • Unlike XML, RDF is a data model designed for integrating information from different metadata vocabularies, and expressing how items and metadata records relate to one another. Links and linking! • (Also, XML works for content, e.g. TEI. RDF doesn’t.)
  • 9. (very) Basic RDF • “Triple:” subject, property, value • A little like subject, verb, object in English. • Dorothea Salo is the author of “Innkeeper at the Roach Motel.” • Subject: either me or the article (works either way, depending on property chosen) • Property: authorship (“isAuthorOf” or “isBy”); often comes from a controlled vocabulary like Dublin Core • Value: either the article or me, depending • One annoying thing: URIs as identifiers • What is my URI? Or the article’s (several versions)? • Several other annoying things about RDF, but they’re super-nerdy.
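To make the triple idea concrete, here is a small Python sketch that stores triples as plain tuples. The `example.org` URIs are made up (which is exactly the "what is my URI?" annoyance); the two property URIs are Dublin Core terms. This is only a toy, not how a real RDF library or triple store works internally.

```python
# Hypothetical identifiers: neither the author nor the article really has these URIs.
AUTHOR = "http://example.org/people/dorothea-salo"
ARTICLE = "http://example.org/articles/innkeeper-at-the-roach-motel"

# Properties borrowed from a controlled vocabulary (Dublin Core terms).
DC_CREATOR = "http://purl.org/dc/terms/creator"
DC_TITLE = "http://purl.org/dc/terms/title"

# Each triple is (subject, property, value).
triples = [
    (ARTICLE, DC_CREATOR, AUTHOR),
    (ARTICLE, DC_TITLE, "Innkeeper at the Roach Motel"),
]

def values_of(subject, prop, store):
    """Return every value recorded for this subject and property."""
    return [v for (s, p, v) in store if s == subject and p == prop]

print(values_of(ARTICLE, DC_CREATOR, triples))
```

Here the article is the subject and the author is the value. Choosing an "isAuthorOf"-style property instead would flip the two, as the slide notes.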
  • 10. Linked data • As the web linked documents and people, it’s now time (say some) to link data. • Not a simple proposition! • RDF is hard. Calling it linked data doesn’t make it easier. • Data modeling is hard. • Data integration is hard. RDF makes it easier... up to a point. Still HUGE problems around people using the same term differently, other unexamined assumptions. • Idea gaining traction among governments, other big data providers. • So we probably need to keep our eye on it. • ALWAYS a good idea to think about how other people might use your metadata.
  • 11. Kinds of metadata • Descriptive (“bibliographic”) • Who made this? When? Where? What’s it about? Etc. • Technical • What is this? What is its format? What made it? Etc. • Administrative • Who owns this? Who’s changed it? Who has what IP rights over it? Who can see it? Etc. • Structural • How is this thing put together? • In practice, the landscape is muddier. • Most standards have bits of two or more types. • Also, “relationship” metadata coming to the fore.
  • 12. Descriptive metadata: MODS • Metadata Object Description Schema • Maintained by Library of Congress • Stripped-down, human-readable MARC in XML • http://www.loc.gov/standards/mods/ • Sample: http://www.loc.gov/standards/mods/v3/mods99042030.xml
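For a feel of what "human-readable MARC in XML" looks like in practice, the sketch below parses a tiny MODS-style record with Python's standard library. The record is a hand-written fragment made up for illustration (not the Library of Congress sample linked above), and it uses only a couple of MODS elements.

```python
import xml.etree.ElementTree as ET

# MODS version 3 namespace, bound to a prefix for use in path expressions.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hand-written illustrative fragment; real MODS records carry far more detail.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Innkeeper at the Roach Motel</title></titleInfo>
  <name type="personal"><namePart>Salo, Dorothea</namePart></name>
</mods>"""

record = ET.fromstring(SAMPLE)
title = record.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
names = [n.text for n in record.findall("mods:name/mods:namePart", MODS_NS)]

print(title)
print(names)
```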
  • 13. Technical metadata: MIX • Metadata for Images in XML • By Library of Congress, NISO • Captures information about an image’s file format and other technical characteristics • Why? Think about file-format obsolescence. • http://www.loc.gov/standards/mix/ • Sample document: http://www.loc.gov/standards/mix/instances/test_mix10.xml
  • 14. Administrative metadata: PREMIS • PREservation Metadata: Implementation Strategies • who comes up with these acronyms? • Maintained through the PREMIS Maintenance Activity at the Library of Congress, again • Designed to track digital preservation activity across an object’s lifecycle • http://www.loc.gov/standards/premis/ • Samples: look in http://www.dlib.org/dlib/september08/dappert/09dappert.html • But be aware that PREMIS is usually embedded in other metadata, like METS.
  • 15. Structural metadata: METS • Metadata Encoding and Transmission Standard • By... guess who? • Wrapper for other kinds of metadata; delineates the structure of a complex digital object • http://www.loc.gov/standards/mets/ • Samples: http://www.loc.gov/standards/mets/mets-examples.html
  • 16. Metadata spaghetti: TEI • Text Encoding Initiative • by the TEI Consortium • For digital transcriptions of books, manuscripts, dictionaries, etc. etc. • Content standard, not metadata standard! But contains its own “metadata header” • This header sometimes reused in other contexts • Moral: Sometimes content “embeds” metadata. • This is OK, but should every content standard roll its own internal metadata?
  • 17. Where does metadata come from? • Human data entry • Slow, expensive, error-prone • Often semi-automatable (80/20 point) • If you can automate, DO IT. Do not waste keystrokes! • Auto-extracting from a content object • Common for technical metadata • Auto-capture by preservation system • Common for some administrative metadata • Grabbing from elsewhere • From other metadata: “crosswalking” • HTML screenscraping, Excel spreadsheets • Issues: authority control? granularity? accuracy?
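Crosswalking can be sketched as a lookup table from one standard's fields to another's. The Python below maps three MARC tags to Dublin Core elements; the dictionary-of-lists record is a drastic simplification of real MARC (no indicators or subfields), and the sample values are hypothetical.

```python
# Simplified crosswalk table: MARC tag -> Dublin Core element.
MARC_TO_DC = {
    "100": "creator",  # main entry, personal name
    "245": "title",    # title statement
    "650": "subject",  # topical subject heading
}

def crosswalk(marc_fields):
    """Map a {tag: [values]} record to {dc_element: [values]}."""
    dc = {}
    for tag, values in marc_fields.items():
        element = MARC_TO_DC.get(tag)
        if element is None:
            continue  # unmapped field is silently dropped: crosswalks are lossy
        dc.setdefault(element, []).extend(values)
    return dc

# Hypothetical record for illustration.
marc_record = {
    "100": ["Salo, Dorothea"],
    "245": ["Innkeeper at the Roach Motel"],
    "650": ["Institutional repositories", "Open access"],
    "999": ["local note"],
}
dc_record = crosswalk(marc_record)
print(dc_record)
```

The dropped `999` field shows the granularity issue from the slide: whatever the target standard cannot hold simply disappears unless you decide where to put it.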
  • 18. Subject metadata, specifically • What is this thing about? • Plenty of variation in sources • Author’s keyword vs. indexer’s descriptor • Controlled vocabulary vs. free-form keywording • Community tagging/“folksonomy” • Mechanically-extracted keywords • All of this matters if you’re searching!
  • 19. Where does metadata live? • In XML files (or MARC files, or...) • In relational databases • In RDF “triple stores” (special databases) • In content objects (as with TEI) • Or some combination of the above! • E.g. DSpace: can accept metadata in an XML file; stores all metadata in relational database • Next trick: associating content with its metadata!
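A relational database can hold metadata as rows of (item, element, value). The sketch below uses Python's built-in sqlite3 module with a made-up single-table layout. It is not DSpace's actual schema, and the rows are hypothetical.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (item_id INTEGER, element TEXT, value TEXT)")

# Hypothetical rows: one row per metadata statement about an item.
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        (1, "title", "Innkeeper at the Roach Motel"),
        (1, "creator", "Dorothea Salo"),
        (2, "title", "Sample Photograph"),
    ],
)

# Which items does this creator have?
found = conn.execute(
    "SELECT item_id FROM metadata WHERE element = ? AND value = ?",
    ("creator", "Dorothea Salo"),
).fetchall()
print(found)
```

The `item_id` column is what does the "next trick" on the slide: it is the link between a metadata row and the content object it describes.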
  • 20. What is done with metadata? • To search against it or use it to browse, you need to “index” it first. • Turn it inside-out: records containing terms --> list of terms and the records they appear in • It’s all more complicated: stemming, phrases, variant spellings, languages, stopwords, etc. • The hot new indexing software is “Solr,” an Apache project built on Lucene. It underlies Blacklight (from UVa), which underlies Forward. • Full-text search works the same way! • Google’s index: MASSIVE database of words with the web pages they appear in. • Spider/crawler: program that follows links across the web and indexes page content
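The "turn it inside-out" step can be sketched in a few lines of Python. This is a toy inverted index with a tiny hand-picked stopword list and no stemming or phrase handling; Solr and its relatives do far more. The three "records" are just standard names from this deck.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real systems use longer, language-specific ones.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def build_index(records):
    """Invert {record_id: text} into {term: set of record_ids}."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return index

def search(index, query):
    """Return ids of records containing every non-stopword query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

records = {
    1: "Metadata for Images in XML",
    2: "Metadata Encoding and Transmission Standard",
    3: "Text Encoding Initiative",
}
index = build_index(records)
print(search(index, "metadata"))
print(search(index, "encoding metadata"))
```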
  • 21. Relevance ranking • You have a bunch of words and the records or documents they appear in. How do you decide which records/pages to display first? • Traditionally in libraries: last-in-first-out. Awful. • Using document structure and metadata • If the word’s in a title, heading, or subject field, take it more seriously than if it’s just in ordinary text. • TF/IDF • Term frequency: how often the search term shows up in a given record/document • Inverse document frequency: how rare the search term is in the whole mass of records/documents.
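  • 22. [Figure: TF/IDF relevance grid. Axes: TF (one record), high to low; IDF (whole corpus), rare term to common term. High TF + rare term: super-relevant! High TF + common term: overused word or stopword. Low TF + rare term: record not “about” this term. Low TF + common term: irrelevant.]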
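The TF/IDF idea can be sketched as a single score per term and record. The Python below uses one common textbook formulation (raw term frequency times the log of inverse document frequency); real search engines use smoothed and normalized variants, so treat this as an illustration rather than what any particular system computes. The three token lists are invented.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document against the whole corpus."""
    if not doc_tokens:
        return 0.0
    # Term frequency: share of this document's tokens that are the term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Inverse document frequency: rarer across the corpus means a bigger weight.
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf

# Hypothetical, already-tokenized records.
corpus = [
    ["metadata", "for", "images"],
    ["metadata", "encoding", "standard"],
    ["metadata", "about", "metadata"],
]

common = tf_idf("metadata", corpus[2], corpus)  # frequent here, but in every record
rare = tf_idf("images", corpus[0], corpus)      # appears in only one record
print(common, rare)
```

A term that shows up in every record scores zero no matter how often it repeats, which is the "overused word or stopword" corner of the grid.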
  • 23. What other information can be used to gauge relevance? • People pointing • Google: PageRank, based on counting links to a document • Scholarly communication: many metrics based on later citation of articles • People choosing • Google also up-votes pages based on people clicking on them in search results. • Individual or social history of interests • Amazon, Netflix • Notice who’s doing this and who isn’t. • Serious question: what about privacy?
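"People pointing" can be sketched with a simplified PageRank-style calculation in Python. This follows the commonly published power-iteration description, with a damping factor of 0.85 assumed; it is certainly not Google's production algorithm. The four-page link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages: a link from a well-ranked page counts for more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over everyone.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page `c` wins because three pages point at it, and `a` outranks `b` and `d` because its single inbound link comes from the highly ranked `c`. Citation-based metrics in scholarly communication rest on the same intuition.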
  • 25. Search engine optimization • Making sure that your page turns up in searches for relevant terms. • Done maliciously, this amounts to spam. Google spends LOTS of effort despamming its index. • Clean markup helps. So does putting highly relevant terms in highly visible/important locations. • Also, don’t overload pages! Dilutes vocabulary.
  • 26. What else can you do with relevance information? • Point people to PEOPLE and SERVICES, not just search results! • Point people to context that will help them evaluate search results. • We know people just throw search terms at boxes. We might as well work with that. • This may well be the best work Forward is doing.
  • 27. A word about GIS • “Geographic Information Systems” • It’s metadata all the way down! Metadata about places. • Also a lot about how to represent and visualize that metadata. • And how to mash it up with other data. • Heavily based on relational-database technology. • HOT JOB MARKET. If you can get trained, do.
  • 28. Finding and using metadata standards • Nobody knows every metadata standard out there. I sure don’t. • But faced with a new standard, I may have to get up to speed fast. • I may even be making adoption decisions. • So here’s how I do it.
  • 29. Getting up to speed • Find its website. If it doesn’t have a website, you don’t want to use it. • Is the website current? Is there recent activity? • Is there a list of who’s using this standard? • Find a sample record. • How is this standard expressed? XML, RDF, what? • Does it pass a sniff test? • Find the documentation and community. • “Tag libraries” and “data dictionaries” especially helpful. • Primers, “getting started” documents also nice. • Look for tools. • Authoring/crosswalk tools (and programming libraries) • Validation tools