Classification, Tagging & Search

This is a high-level summary of three important ways to help people find information. The slides were presented at Vera Rhoades' information architecture class at the University of Maryland.

  1. Classification, Tagging & Search. James Melzer, August 14, 2007
  2. Where does this fit in my project? [Diagram labels: Search; Sitemap; Portals; Content Integration & Aggregation (Yahoo, Lexis/Nexis); Discover; Metadata; Navigation; Filters; Classify/Organize; Content Assets; Create]
  3. Paradigms of Information Organization
      • Classification is the process of organizing a domain of items into a systematic scheme. Items (instances) are classified into categories (classes) by a person or system. All parties share a common scheme for consistency and clarity, which makes it ideal for high-value information.
      • Tagging is the process of assigning a term or phrase to an item. Every person uses their own non-systematic tag scheme, which they generally make up as they go along. Tagging is most effective with personal collections of information.
      • Search is the multi-stage process of reducing a group of unstructured documents into structured data and then matching that data to a human’s query. Search is most effective in huge collections of disorganized information.
  4. Classification
  5. Attributes of Classical Classification
      • Taxonomy
          • Strict concept hierarchy
          • Mutual exclusivity
          • Comprehensiveness
          • Inheritance
      • Thesaurus
          • Has all the attributes of a taxonomy, plus:
          • Synonyms and alternate forms
          • Related terms
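The taxonomy and thesaurus attributes above can be sketched in a few lines. This is only an illustration under assumed data: the vehicle classes, the synonym table, and the helper names (`preferred`, `ancestors`, `is_a`) are hypothetical and do not come from the slides.

```python
# Hypothetical toy taxonomy: every class has exactly one parent (strict hierarchy).
PARENT = {
    "vehicle": None,
    "car": "vehicle",
    "truck": "vehicle",
    "sedan": "car",
    "pickup": "truck",
}

# Hypothetical thesaurus layer: alternate forms point to a preferred term.
SYNONYMS = {"automobile": "car", "saloon": "sedan"}


def preferred(term):
    """Resolve a synonym or alternate form to its preferred term."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def ancestors(term):
    """Return the chain of broader classes, nearest first."""
    chain = []
    node = PARENT[preferred(term)]
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def is_a(term, broader):
    """Inheritance: an item in a class also belongs to every broader class."""
    term, broader = preferred(term), preferred(broader)
    return term == broader or broader in ancestors(term)
```

For instance, `is_a("saloon", "automobile")` resolves both synonyms and then walks up the hierarchy, so an item cataloged with the most specific class can still be found under any broader one.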
  6. Classification’s Strengths
      • Descriptive of a domain
      • Exhaustive within a domain
      • Provides colocation of similar items
      • Unlimited domain scale
  7. Classification’s Weaknesses
      • Expensive to create or maintain
      • Rigid in perspective and application
      • Slow to grow or change
      • Hard to use *
      * May require trained experts, leading to scalability issues
  8. Fancy Classification: Facets
      • A taxonomy usually classifies everything in its domain along a single axis.
      • A polyhierarchy is made up of multiple mutually-exclusive taxonomies. A class from each taxonomy is applied to every item.
      • Each taxonomy in a polyhierarchy is called a facet (like the many sides of a diamond).
      • Automobile example...
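One way to picture facets is as independent axes that each item must be placed on. The sketch below is an assumption-laden illustration: the facet names (`body`, `fuel`, `color`), the inventory records, and the functions `validate` and `filter_by` are all hypothetical, chosen only to echo the automobile theme.

```python
# Hypothetical facets for a car inventory; each item gets one class per facet.
FACETS = {
    "body": {"sedan", "coupe", "pickup"},
    "fuel": {"gasoline", "diesel", "electric", "hybrid"},
    "color": {"red", "blue", "black"},
}

INVENTORY = [
    {"id": 1, "body": "sedan", "fuel": "hybrid", "color": "blue"},
    {"id": 2, "body": "pickup", "fuel": "diesel", "color": "red"},
    {"id": 3, "body": "sedan", "fuel": "electric", "color": "red"},
]


def validate(item):
    """Check that the item carries one known class from every facet."""
    return all(item.get(facet) in classes for facet, classes in FACETS.items())


def filter_by(items, **selected):
    """Keep items matching every selected facet value (facets combine with AND)."""
    return [
        item for item in items
        if all(item[facet] == value for facet, value in selected.items())
    ]
```

Calling `filter_by(INVENTORY, body="sedan", color="red")` narrows along two axes at once, which a single-axis taxonomy cannot do without duplicating branches.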
  9. Standard Facets
      • Subject
      • Asset
      • Use
      • Relation
  10. Facet: Subject
      • Extrinsic properties
      • Examples:
          • Subjects discussed
          • Geographic coverage
          • Companies mentioned
      • Questions Answered:
          • What is it about?
          • Where does it fit in similar discourse?
  11. Facet: Asset
      • Intrinsic properties
      • Examples:
          • Author
          • Title
          • Language
          • Social Security Number
          • Bar Code/UPC
      • Questions Answered:
          • What makes it unique?
          • What authority does it have?
  12. Facet: Use
      • Permissions and audience
      • Examples:
          • Intended for children ages 9-12
          • Intended for professionals with advanced degrees
          • Restricted to senior management and auditors
          • Restricted to subscribers
          • Non-subscribers can access only the abstract
      • Questions Answered:
          • Who can use it?
          • Who should use it?
  13. Facet: Relation
      • Connections to other objects
      • Examples:
          • People who bought this book also bought these other books
          • If you buy this grill, you may also want to buy these tongs and apron
          • If you are researching electric cars, you may also want to look into hybrid cars
          • This top will go with that skirt really well
      • Questions Answered:
          • What commonly goes with this?
          • What other objects would help me use this object more effectively?
  14. Classification Applied
      • Where is classification used? Anywhere people need excellent colocation of similar items, and can afford to ensure it with professional cataloging
          • Research
          • Business
          • Government
          • Libraries
  15. Classification Guidelines
      • Specificity rule: Apply the most specific terms when tagging assets. Specific terms can always be generalized, but generic terms cannot be specialized.
      • Repeatable rule: All attributes should be repeatable. Use as many terms as necessary to describe what the asset is about and why it is important. Storage is cheap. Re-creating content is expensive.
      • Appropriateness rule: Not all attributes apply to all assets. Only supply values for attributes that make sense.
      • Usability rule: Anticipate how the asset will be searched for in the future, and how to make it easy to find. Remember that search engines can only operate on explicit information.
  16. Tagging
  17. Tagging’s Attributes
      • No controlled vocabulary (it’s NOT classification)
      • Informal
      • Personal, although sometimes social
      • Messy
  18. Tagging Conceptual Model
  19. Social Tagging
      • Many tagging systems (particularly Yahoo! properties and ) allow people to see other people’s tags and items.
      • The way tags, items and people are shared influences people’s behavior
          • Self-conscious tagging
          • Intentional group tags
          • Copying another’s tags
          • Tag spamming
  20. Types of Tags *
      • Description
      • Categorization
      • Opinion
      • Action
      • Relation
      • Insider reference
      • Spam
      * According to Rashmi Sinha, Uzanto
  21. Tagging’s Strengths
      • Easy and cheap
      • Personalized
      • Rapid adaptation
      • Infinite scalability
      • Readily multilingual
  22. Tagging’s Weaknesses
      • Not necessarily exhaustive within domain
      • No inheritance
      • Weak colocation of similar items
      • Variable quality
  23. Tagging Applied
      • Where is tagging used? By amateurs with stuff to organize and not a lot of time on their hands.
          • Serendipitous searching or browsing
          • Re-finding items (refindability)
          • Personal or insider collections
          • Shared resources
  24. Search
  25. Search Basics
      • A search engine mediates between a user’s query and metadata surrogates for documents
      • Documents are reduced to metadata
      • The user’s need is translated into a query
      • Query terms are used to find matching metadata terms
      • Lots and lots of room for error...
  26. Search Process
      • Crawl content for metadata
      • Index document terms into an inverted file; an inverted file is very fast to search
      • Search the index to identify the result set; search the index, not the documents
      • Rank the results for display; ranking is the hardest part
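The index and search steps can be illustrated with a tiny inverted file. This is a minimal sketch, assuming whitespace tokenization and AND semantics for multi-term queries; the sample documents and the names `build_index` and `search` are hypothetical, and crawling and ranking are left out.

```python
from collections import defaultdict

# Hypothetical stand-in for crawled content: document id -> text.
DOCS = {
    "d1": "the jaguar is a big cat",
    "d2": "the jaguar is a fast car",
    "d3": "a cat sat on the car",
}


def build_index(docs):
    """Index: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search the index, not the documents: intersect the posting sets."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


INDEX = build_index(DOCS)
```

A query such as `search(INDEX, "jaguar car")` only touches two posting sets, which is why lookup stays fast no matter how long the documents are.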
  27. Search Algorithm 1
      • Term-based ranking (tf/idf)
      • tf = term frequency: documents that use the query terms most are presumed to be most relevant
      • idf = inverse document frequency: terms that are rarer are better indicators of relevance
      • Assumption: relevance can be measured with document terms
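A small worked example may make the two factors concrete. The sketch below assumes one common formulation (term count divided by document length for tf, natural log of total documents over matching documents for idf); real engines use many variants, and the sample documents and function names here are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-collection: document id -> text.
DOCS = {
    "d1": "jaguar jaguar cat habitat",
    "d2": "jaguar car dealer",
    "d3": "car car car dealer price",
}


def tf_idf_scores(docs, query):
    """Score each document by summing tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, terms in tokenized.items():
        counts = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for t in tokenized.values() if term in t)
            if df == 0:
                continue
            tf = counts[term] / len(terms)
            idf = math.log(n_docs / df)
            score += tf * idf
        scores[doc_id] = score
    return scores


def rank(docs, query):
    """Return ids of documents with a positive score, best first."""
    scores = tf_idf_scores(docs, query)
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])
```

With this formulation a term that appears in every document gets an idf of zero, so it contributes nothing to ranking, which matches the idea that rare terms are the better relevance signal.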
  28. Search Algorithm 2
      • PageRank (Google)
      • The relevant set is still identified by term matching
      • A revolution in ranking: based on linking between documents
      • Assumptions:
          1) important sites link to other important sites
          2) if many people link to a site, it is important
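The link-based idea can be sketched as a simple iterative calculation. This follows the commonly published simplified PageRank formulation, not any production ranking system; the damping value of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively spread each page's rank across its outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
RANKS = pagerank(LINKS)
```

In this graph page `c` collects links from three pages and ends up with the highest score, and `a` outranks `b` and `d` because its single inbound link comes from the important page `c`.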
  29. Improving Search
      • Best Bets
      • Relevance Feedback
      • Search + Classification
  30. Best Bets
      • A best bet is a manually selected search result, tied to specific query terms or phrases
      • User-driven phrases: select the most-used phrases from search traffic; go for easy wins, because returns diminish sharply
      • Business-driven phrases: select phrases important to the business, such as product names or office locations
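A best-bet table can be as simple as a lookup consulted before the engine's own results are shown. The sketch below is hypothetical throughout: the phrases, the paths, and the `organic_search` callable stand in for whatever a real site would use.

```python
# Hypothetical best-bet table: normalized query phrase -> hand-picked result.
BEST_BETS = {
    "vacation policy": "/hr/policies/vacation",
    "chicago office": "/locations/chicago",
}


def normalize(query):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(query.lower().split())


def search_with_best_bets(query, organic_search):
    """Put the manually selected result first, then the engine's own results."""
    results = list(organic_search(query))
    best = BEST_BETS.get(normalize(query))
    if best is None:
        return results
    # Avoid showing the best bet twice if the engine also returned it.
    return [best] + [r for r in results if r != best]
```

Because the table is keyed on exact phrases, it only pays off for the handful of queries that dominate search traffic or matter most to the business.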
  31. Relevance Feedback
      • “More like this”: use one document’s metadata as a query to find others
      • Cluster results, so users can filter by a cluster (e.g. Jaguar the car vs. jaguar the cat)
      • Structured search guesses (e.g. is it a zip code? a product name?)
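The “more like this” idea can be sketched by treating one document's metadata terms as the query. Jaccard overlap is used here purely as an assumed similarity measure (the slides do not name one), and the metadata sets and function names are hypothetical.

```python
# Hypothetical metadata: document id -> set of descriptive terms.
METADATA = {
    "d1": {"jaguar", "cat", "rainforest", "predator"},
    "d2": {"jaguar", "car", "luxury", "sedan"},
    "d3": {"leopard", "cat", "predator", "savanna"},
    "d4": {"sedan", "car", "hybrid"},
}


def jaccard(a, b):
    """Overlap between two term sets: shared terms divided by all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def more_like_this(doc_id, metadata, limit=3):
    """Use one document's terms as the query and rank the others by overlap."""
    seed = metadata[doc_id]
    scored = [
        (jaccard(seed, terms), other)
        for other, terms in metadata.items()
        if other != doc_id
    ]
    scored = [(score, other) for score, other in scored if score > 0]
    # Highest overlap first; ties broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:limit]]
```

Starting from the jaguar-the-cat document `d1`, the leopard document ranks above the jaguar-the-car document because it shares two terms rather than one.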
  32. Combining Search and Classification
      • Lead-in synonyms: enter “fridge”, get “refrigerator” instead; best if the collection is well-cataloged
      • Term-expansion synonyms: enter “refrigerator”, get “fridge” too; best if the collection is not well-cataloged
      • Spell check on query phrases
      • Classifying documents with additional metadata (even tagging)
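The difference between the two synonym strategies is easiest to see side by side. In this sketch the synonym rings and the function names `lead_in` and `expand` are hypothetical; the first entry of each ring is assumed to be the preferred, cataloged term.

```python
# Hypothetical synonym rings; the first entry is the preferred (cataloged) term.
SYNONYM_RINGS = [
    ["refrigerator", "fridge", "icebox"],
    ["sofa", "couch"],
]

# Lead-in table: alternate form -> preferred term.
LEAD_IN = {alt: ring[0] for ring in SYNONYM_RINGS for alt in ring[1:]}
# Expansion table: any term -> every term in its ring.
EXPANSION = {term: set(ring) for ring in SYNONYM_RINGS for term in ring}


def lead_in(query_terms):
    """Replace each entry term with its preferred term."""
    return [LEAD_IN.get(t, t) for t in query_terms]


def expand(query_terms):
    """For each term, list every known synonym to be ORed together."""
    return [sorted(EXPANSION.get(t, {t})) for t in query_terms]
```

Lead-in rewriting relies on catalogers having applied the preferred term consistently, while expansion casts a wider net over text that nobody normalized.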
  33. Wrap Up
      • Classification
      • Tagging
      • Search
  34. Questions?
      • James Melzer
      • Senior Information Architect
      • SRA International, Inc.
      • [email_address]