SlideShare a Scribd company logo
1 of 69
INFORMATION RETRIEVAL A Look into the Science of Web Search Engines 1 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk Muhammad AtifQureshi
Contents Story Mode Learning Learning by Imagination Appendix 2 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Story Mode Learning (Borrowed from Prof. Jimmy Lin, University of Maryland, Scientist in Twitter) 3 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Information Retrieval Systems Information What is “information”? Retrieval What do we mean by “retrieval”? What are different types information needs? Systems How do computer systems fit into the human information seeking process? 4 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is Information? What do you think? There is no “correct” definition Cookie Monster’s definition:  “news or facts about something” Different approaches: Philosophy Psychology Linguistics Electrical engineering Physics Computer science Information science 5 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Dictionary says… Oxford English Dictionary information: informing, telling; thing told, knowledge, items of knowledge, news knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known Random House Dictionary information: knowledge communicated or received concerning a particular fact or circumstance; news 6 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Intuitive Notions Information must Be something, although the exact nature (substance, energy, or abstract concept) is not clear; Be “new”: repetition of previously received messages is not informative Be “true”: false or counterfactual information is “mis-information” Be “about” something Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269. 7 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Three Views of Information Information as process Information as communication Information as message transmission and reception 8 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
One View Information = characteristics of the output of a process Tells us something about the process and the input Information-generating process do not occur in isolation Input Output Process Input Output Input Output Process1 Process2 Input Output … 9 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Where’s the human? If a tree falls in the forest, and no one is around to hear it, is information transmitted? In the “information as process”: Yes, but that’s not very interesting to us We’re concerned about information for human consumption Transmission of information from one person to another Recording of information Reconstruction of stored information 10 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Another View Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient” This implies that the sender has knowledge of the recipient's structure Text = “a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient” Information = “the structure of any text which is capable of changing the image-structure of a recipient” 11 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Transfer of Information Communication = transmission of information Thoughts Thoughts Telepathy? Words Words Writing Sounds Sounds Speech Encoding Decoding 12 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Information Theory Better called “communication theory” Developed by Claude Shannon in 1940’s Concerned with the transmission of electrical signals over wires How do we send information quickly and reliably? Underlies modern electronic communication: Voice and data traffic… Over copper, fiber optic, wireless, etc. Famous result: Channel Capacity Theorem Formal measure of information in terms of entropy Information = “reduction in surprise” 13 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The Noisy Channel Model Communication = producing the same message at the destination that was sent at the source The message must be encoded for transmission across a medium (called channel) But the channel is noisy and can distort the message Semantics (meaning) is irrelevant channel Receiver message Transmitter noise Source Destination message 14 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
A Synthesis Information retrieval as communication over time and space, across a noisy channel Sender Recipient Encoding Decoding Transmitter Receiver channel storage message message indexing/writing retrieval/reading noise Source Destination message message noise 15 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
“Retrieval?” “Fetch something” that’s been stored Recover a stored state of knowledge Search through stored messages to find some messages relevant to the task at hand Encoding Decoding storage Sender Recipient message message indexing/writing Retrieval/reading noise 16 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is IR? Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143. 17 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Types of Information Needs Retrospective “Searching the past” Different queries posed against a static collection Time invariant Prospective “Searching the future” Static query posed against a dynamic collection Time dependent 18 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Retrospective Searches (I) Ad hoc retrieval: find documents “about this” Known item search Directed exploration Identify positive accomplishments of the Hubble telescope since it was launched in 1991. Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them. Find Jimmy Lin’s homepage. What’s the ISBN number of “Modern Information Retrieval”? Who makes the best chocolates? What video conferencing systems exist for digital reference desk services? 19 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Retrospective Searches (II) Question answering Who discovered Oxygen? When did Hawaii become a state? Where is Ayer’s Rock located? What team won the World Series in 1992? “Factoid” What countries export oil? Name U.S. cities that have a “Shubert” theater. “List” Who is Aaron Copland? What is a quasar? “Definition” 20 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Prospective “Searches” Filtering Make a binary decision about each incoming document Routing Sort incoming documents into different bins? Spam or not spam? Categorize news headlines: World? Nation? Metro? Sports? 21 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What types of information? Text (Documents and portions thereof) XML and structured documents Images Audio (sound effects, songs, etc.)  Video Source code Applications/Web services 22 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Content-Based Search This is a relative new concept! What else would you search on? What’s more effective? Why is this hard in many applications? 23 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Interesting Examples Google image search Google video search Query by humming http://images.google.com/ http://video.google.com/ http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html 24 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What about databases? What are examples of databases? Banks storing account information Retailers storing inventories Universities storing student grades What exactly is a (relational) database? Think of them as a collection of tables They model some aspect of “the world” 25 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
A (Simple) Database Example Student Table Department Table Course Table Enrollment Table 26 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Database Queries What would you want to know from a database? What classes is John Arrow enrolled in? Who has the highest grade in LBSC 690? Who’s in the history department? Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters and were born on a Monday, who has the longest email address? 27 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Databases vs. IR 28 IR Databases What we’re retrieving Mostly unstructured.  Free text with some metadata. Structured data. Clear semantics based on a formal model. Queries we’re posing Vague, imprecise information needs (often expressed in natural language). Formally (mathematically) defined queries.  Unambiguous. Results we get Sometimes relevant, often not. Exact.  Always correct in a formal sense. Interaction with system Interaction is important. One-shot queries. Other issues Issues downplayed. Concurrency, recovery, atomicity are all critical. Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The Big Picture The four components of the information retrieval environment: User Process System Collection What computer geeks care about! What we care about! 29 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The Information Retrieval Cycle Resource Query Ranked List Documents query reformulation, vocabulary learning, relevance feedback Documents source reselection Source Selection Query Formulation Search Selection Examination Delivery 30 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Supporting the Search Process Source Selection Resource Query Formulation Query Search Ranked List Selection Indexing Documents Index Examination Acquisition Documents Collection Delivery 31 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Simplification? Resource Query Ranked List Documents query reformulation, vocabulary learning, relevance feedback Documents source reselection Source Selection Is this itself a vast simplification? Query Formulation Search Selection Examination Delivery 32 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Tackling the IR Challenge Divide and conquer! Strategy: use encapsulation to limit complexity Approach: Define interfaces (input and output) for each component Define the functions performed by each component Study each component in isolation Repeat the process within components as needed Make sure that this decomposition makes sense Result: a hierarchical decomposition 33 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Where do we make the cut? Study the IR black box in isolation Simple behavior: in goes query, out comes documents Optimize the quality of documents that come out Study everything else around the black box Put the human back in the loop! Search Query Ranked List 34 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The IR Black Box Documents Query Hits 35 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Inside The IR Black Box Documents Query Representation Function Representation Function Query Representation Document Representation Index Comparison Function Hits 36 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The Central Problem in IR Information Seeker Authors Concepts Concepts Query Terms Document Terms Do these represent the same concepts? 37 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Learning by Imagination 38 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Imagine a System We have 1000s of web pages, what make these web pages different? May be different key terms or key words occurring in different web pages (e.g., sports, education, video sharing) 39 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Realize Query Needs What do we expect when query? Query can be single word (no order), collection of words i.e., free sentence (order does not matter) or strict phrase (order matters e.g., "I love Pakistan") How to manage data of web pages Bag of words data structure with/without position of words/terms (simply, posting list of words/terms) What’s the best match? We have many matching results, but what’s the order? 40 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Order of Matching Results How could we rank web pages?  Via query content matching score against web pages i.e., content based methods  Via importance of web pages i.e., link based methods 41 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What does Content Tell? Content Information: Rare terms give more information than frequent terms as common terms do not differentiate well between the content of documents (Information entropy) So what does common words make?  Stop words (extreme case, e.g., it, a, the) or words with lesser importance (e.g., word science inside scientific documents) 42 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Ranking Methods Content based methods: Examples: Tf-idf with cosine similarity, bm25, etc. Link based methods: Examples: PageRank, HITS, etc. 43 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is More in Ranking? What other measures we can take for ranking better? Combining content based methods with link based methods How about learning to rank by user click through data (apply machine learning) How about learning from social web (apply social science theories) 44 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Lots of Web Pages How about scalability?  We have too many words, can we limit them?  Example: Is Studying conceptually different from study or studies? may be not (concept called stemming could simply everything to simple concept study) Stemming may not be sufficient then how about clustering web pages into topics i.e., (terms study, science, arts, university, school, college would single concept or a topic may be called as topic education) 45 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Is it sufficient? Can we feel confident about how Web Search Engine works? No, it was just a summary for the day 46 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Guess! what next you would see? ? 47 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Our search engine Yes we are making it 48 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Appendix 49 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Outline What is Research? How to prepare yourself for IR research? 50 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is Research? 51 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is Research? Research Discover new knowledge  Seek answers to questions Basic research Goal: Expand man’s knowledge (e.g., which genes control social behavior of honey bees? ) Often driven by curiosity (but not always) High impact examples: relativity theory, DNA, …  Applied research Goal: Improve human condition (i.e., improve the world) (e.g., how to cure cancers?) Driven by practical needs High impact examples: computers, transistors, vaccinations, … The boundary is vague; distinction isn’t important 52 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Why Research? Funding Curiosity Utility of  Applications Advancement of  Technology Amount of  knowledge Application Development Applied Research Basic Research 53 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Where’s IR Research? Information Science Funding Quality of  Life Utility of  Applications Advancement of  Technology Amount of  knowledge Computer Science Application Development Applied Research Basic Research 54 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Research Process Identification of the topic (e.g., Web search) Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art) Experiment design (measures, data, etc) (e.g., retrieval accuracy on a sample of web data) Test hypothesis (e.g., compare X and Y on the data) Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?) 55 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Typical IR Research Process Look for a high-impact topic (basic or applied) New problem: define/frame the problem  Identify weakness of existing solutions if any Propose new methods  Choose data sets (often a main challenge) Design evaluation measures (can be very difficult) Run many experiments (need to have clear research hypotheses) Analyze results and repeat the steps above if necessary Publish research results 56 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Research Methods Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”) Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”) Empirical research: evaluate and compare existing solutions  (e.g., “a comparative evaluation of link analysis methods for web search”)  The “E-C-E cycle”: exploratoryconstructiveempiricalexploratory… 57 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Types of Research Questions and Results Exploratory (Framework): What’s out there?  Descriptive (Principles): What does it look like? How does it work? Evaluative (Empirical results): How well does a method solve a problem?  Explanatory (Causes): Why does something happen the way it happens?  Predictive (Models): What would happen if xxx ? 58 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Solid and High Impact Research Solid work:  A clear hypothesis (research question) with conclusive result (either positive or negative) Clearly adds to our knowledge base (what can we learn from this work?) Implications: a solid, focused contribution is often better than a non-conclusive broad exploration High impact = high-importance-of-problem * high-quality-of-solution high impact = open up an important problem high impact = close a problem with the best solution high impact = major milestones in between Implications: question the importance of the problem and don’t just be satisfied with a good solution, make it the best  59 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
How to Prepare Yourself for IR Research? 60 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What it Takes to do Research? Curiosity: allow you to ask questions Critical thinking: allow you to challenge assumptions Learning: take you to the frontier of knowledge Persistence: so that you don’t give up Respect data and truth: ensure your research is solid Communication: allow you to publish your work … 61 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Learning about IR (1/2) Start with an IR text book (e.g., Manning et al., Grossman & Frieder, a forth-coming book from UMass,…) Then read “Readings in IR” by Karen Sparck Jones,  Peter Willett  And read papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf Read other papers published in recent IR/IR-related conferences 62 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Learning about IR (2/2) Getting more focused  Choose your favorite sub-area (e.g., retrieval models) Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization) Stay in frontier: Keep monitoring literature in both IR and related areas Broaden your view: Keep an eye on  Industry activities  Read about industry trends Try out novel prototype systems Funding trends Read request for proposals 63 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Critical Thinking Develop a habit of asking questions, especially why questions Always try to make sense of what you have read/heard; don’t let any question pass by Get used to challenging everything Practical advice Question every claim made in a paper or a talk (can you argue the other way?)  Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it) Force yourself to challenge one point in every talk that you attend and raise a question 64 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Respect Data and Truth Be honest with the experiment results  Don’t throw away negative results!  Try to learn from negative results Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data Be objective in data analysis and interpretation; don’t mislead readers  Aim at understanding/explanation instead of just good results Be careful not to over-generalize (for both good and bad results); you may be far from the truth 65 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Communications General communication skills:  Oral and written Formal and informal Talk to people with different level of backgrounds Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction)  English proficiency Get used to talking to people from different fields 66 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Persistence Work only on topics that you are passionate about Work only on hypotheses that you believe in Don’t draw negative conclusions prematurely and give up easily positive results may be hidden in negative results In many cases, negative results don’t completely reject a hypothesis  Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper) Think of possibilities of repositioning a work 67 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Optimize Your Training Know your strengths and weaknesses strong in math vs. strong in system development creative vs. thorough … Train yourself to fix weaknesses Find strategic partners Position yourself to take advantage of your strengths 68 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Thank You Reach me on Twitter: @matifq Email me: maqureshi@iba.edu.pk 69 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk

More Related Content

What's hot

Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
9866825059
 
METS(Metadata Encoding and Transmission Standard )
METS(Metadata Encoding and Transmission Standard )METS(Metadata Encoding and Transmission Standard )
METS(Metadata Encoding and Transmission Standard )
Manu K M
 
ellis model of information seeking behaviour
ellis model of information seeking behaviourellis model of information seeking behaviour
ellis model of information seeking behaviour
natashagandhi11
 

What's hot (20)

Design and development of subject gateways with special reference to lisgateway
Design and development of subject  gateways with special reference to lisgatewayDesign and development of subject  gateways with special reference to lisgateway
Design and development of subject gateways with special reference to lisgateway
 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
 
Indexing language concept types and characteristics
Indexing language concept types and characteristicsIndexing language concept types and characteristics
Indexing language concept types and characteristics
 
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
 
METS(Metadata Encoding and Transmission Standard )
METS(Metadata Encoding and Transmission Standard )METS(Metadata Encoding and Transmission Standard )
METS(Metadata Encoding and Transmission Standard )
 
Modes of formation of subject
Modes of formation of subjectModes of formation of subject
Modes of formation of subject
 
Webometrics
WebometricsWebometrics
Webometrics
 
Web scale discovery service
Web scale discovery serviceWeb scale discovery service
Web scale discovery service
 
Precis
PrecisPrecis
Precis
 
Informatio retrival evaluation
Informatio retrival evaluationInformatio retrival evaluation
Informatio retrival evaluation
 
ellis model of information seeking behaviour
ellis model of information seeking behaviourellis model of information seeking behaviour
ellis model of information seeking behaviour
 
Dublin Core Intro
Dublin Core IntroDublin Core Intro
Dublin Core Intro
 
Web OPAC
Web OPAC Web OPAC
Web OPAC
 
Marc 21
Marc 21Marc 21
Marc 21
 
Post coordinate indexing .. Library and information science
Post coordinate indexing .. Library and information sciencePost coordinate indexing .. Library and information science
Post coordinate indexing .. Library and information science
 
Lec1,2
Lec1,2Lec1,2
Lec1,2
 
Evaluation of library automation software
Evaluation of library automation softwareEvaluation of library automation software
Evaluation of library automation software
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Automatic indexing
Automatic indexingAutomatic indexing
Automatic indexing
 
Use and user study
Use and user study Use and user study
Use and user study
 

Similar to Information Retrieval

Big, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near YouBig, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near You
Biplav Srivastava
 
COSC 111 Searching Spring 2011
COSC 111 Searching Spring 2011COSC 111 Searching Spring 2011
COSC 111 Searching Spring 2011
Laksamee Putnam
 
The Generation Game He Forum
The Generation Game He ForumThe Generation Game He Forum
The Generation Game He Forum
HAROLDFRICKER
 
Let the trumpet sound 2007
Let the trumpet sound 2007Let the trumpet sound 2007
Let the trumpet sound 2007
Johan Koren
 
From Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
From Hyperlinks to Semantic Web Properties using Open Knowledge ExtractionFrom Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
From Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
STLab
 

Similar to Information Retrieval (20)

Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
Unesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-finalUnesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-final
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysis
 
Open Web Data for Education - Linked Data technologies for connecting open ed...
Open Web Data for Education - Linked Data technologies for connecting open ed...Open Web Data for Education - Linked Data technologies for connecting open ed...
Open Web Data for Education - Linked Data technologies for connecting open ed...
 
Big, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near YouBig, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near You
 
COSC 111 Searching Spring 2011
COSC 111 Searching Spring 2011COSC 111 Searching Spring 2011
COSC 111 Searching Spring 2011
 
Conference Report Final 11.18
Conference Report Final 11.18Conference Report Final 11.18
Conference Report Final 11.18
 
The Internet, Part II: The Internet, Higher Education, and the Next 25 Years
The Internet, Part II:  The Internet, Higher Education, and the Next 25 YearsThe Internet, Part II:  The Internet, Higher Education, and the Next 25 Years
The Internet, Part II: The Internet, Higher Education, and the Next 25 Years
 
The Generation Game He Forum
The Generation Game He ForumThe Generation Game He Forum
The Generation Game He Forum
 
Report on hacking crime and workable solution
Report on hacking crime and workable solutionReport on hacking crime and workable solution
Report on hacking crime and workable solution
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
The End(s) of e-Research
The End(s) of e-ResearchThe End(s) of e-Research
The End(s) of e-Research
 
Let the trumpet sound 2007
Let the trumpet sound 2007Let the trumpet sound 2007
Let the trumpet sound 2007
 
Connecting Museums with Linked Data
Connecting Museums with Linked DataConnecting Museums with Linked Data
Connecting Museums with Linked Data
 
From Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
From Hyperlinks to Semantic Web Properties using Open Knowledge ExtractionFrom Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
From Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
 
Kick-off meeting Linkflows project
Kick-off meeting Linkflows projectKick-off meeting Linkflows project
Kick-off meeting Linkflows project
 
Institutional Repositories (NLA 2011)
Institutional Repositories (NLA 2011)Institutional Repositories (NLA 2011)
Institutional Repositories (NLA 2011)
 
Word Essay Professional Writin. Online assignment writing service.
Word Essay Professional Writin. Online assignment writing service.Word Essay Professional Writin. Online assignment writing service.
Word Essay Professional Writin. Online assignment writing service.
 
Ensuring Continuity of Access To Our Published Heritage
Ensuring Continuity of Access To Our Published HeritageEnsuring Continuity of Access To Our Published Heritage
Ensuring Continuity of Access To Our Published Heritage
 

More from Web Science Research Group at Institute of Business Administration, Karachi, Pakistan

More from Web Science Research Group at Institute of Business Administration, Karachi, Pakistan (8)

ReThinking CS Curriculum for Pakistan
ReThinking CS Curriculum for PakistanReThinking CS Curriculum for Pakistan
ReThinking CS Curriculum for Pakistan
 
Social Media Mining and Analytics
Social Media Mining and AnalyticsSocial Media Mining and Analytics
Social Media Mining and Analytics
 
CSE509 Lecture 6
CSE509 Lecture 6CSE509 Lecture 6
CSE509 Lecture 6
 
CSE509 Lecture 5
CSE509 Lecture 5CSE509 Lecture 5
CSE509 Lecture 5
 
CSE509 Lecture 4
CSE509 Lecture 4CSE509 Lecture 4
CSE509 Lecture 4
 
CSE509 Lecture 3
CSE509 Lecture 3CSE509 Lecture 3
CSE509 Lecture 3
 
CSE509 Lecture 2
CSE509 Lecture 2CSE509 Lecture 2
CSE509 Lecture 2
 
CSE509 Lecture 1
CSE509 Lecture 1CSE509 Lecture 1
CSE509 Lecture 1
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Information Retrieval

  • 1. INFORMATION RETRIEVAL A Look into the Science of Web Search Engines 1 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk Muhammad AtifQureshi
  • 2. Contents Story Mode Learning Learning by Imagination Appendix 2 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 3. Story Mode Learning (Borrowed from Prof. Jimmy Lin, University of Maryland, Scientist in Twitter) 3 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 4. Information Retrieval Systems Information What is “information”? Retrieval What do we mean by “retrieval”? What are different types information needs? Systems How do computer systems fit into the human information seeking process? 4 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 5. What is Information? What do you think? There is no “correct” definition Cookie Monster’s definition: “news or facts about something” Different approaches: Philosophy Psychology Linguistics Electrical engineering Physics Computer science Information science 5 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 6. Dictionary says… Oxford English Dictionary information: informing, telling; thing told, knowledge, items of knowledge, news knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known Random House Dictionary information: knowledge communicated or received concerning a particular fact or circumstance; news 6 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 7. Intuitive Notions Information must Be something, although the exact nature (substance, energy, or abstract concept) is not clear; Be “new”: repetition of previously received messages is not informative Be “true”: false or counterfactual information is “mis-information” Be “about” something Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269. 7 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 8. Three Views of Information Information as process Information as communication Information as message transmission and reception 8 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 9. One View Information = characteristics of the output of a process Tells us something about the process and the input Information-generating process do not occur in isolation Input Output Process Input Output Input Output Process1 Process2 Input Output … 9 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 10. Where’s the human? If a tree falls in the forest, and no one is around to hear it, is information transmitted? In the “information as process”: Yes, but that’s not very interesting to us We’re concerned about information for human consumption Transmission of information from one person to another Recording of information Reconstruction of stored information 10 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 11. Another View Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient” This implies that the sender has knowledge of the recipient's structure Text = “a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient” Information = “the structure of any text which is capable of changing the image-structure of a recipient” 11 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 12. Transfer of Information Communication = transmission of information Thoughts Thoughts Telepathy? Words Words Writing Sounds Sounds Speech Encoding Decoding 12 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 13. Information Theory Better called “communication theory” Developed by Claude Shannon in 1940’s Concerned with the transmission of electrical signals over wires How do we send information quickly and reliably? Underlies modern electronic communication: Voice and data traffic… Over copper, fiber optic, wireless, etc. Famous result: Channel Capacity Theorem Formal measure of information in terms of entropy Information = “reduction in surprise” 13 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 14. The Noisy Channel Model Communication = producing the same message at the destination that was sent at the source The message must be encoded for transmission across a medium (called channel) But the channel is noisy and can distort the message Semantics (meaning) is irrelevant channel Receiver message Transmitter noise Source Destination message 14 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 15. A Synthesis Information retrieval as communication over time and space, across a noisy channel Sender Recipient Encoding Decoding Transmitter Receiver channel storage message message indexing/writing retrieval/reading noise Source Destination message message noise 15 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 16. “Retrieval?” “Fetch something” that’s been stored Recover a stored state of knowledge Search through stored messages to find some messages relevant to the task at hand Encoding Decoding storage Sender Recipient message message indexing/writing Retrieval/reading noise 16 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 17. What is IR? Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143. 17 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 18. Types of Information Needs Retrospective “Searching the past” Different queries posed against a static collection Time invariant Prospective “Searching the future” Static query posed against a dynamic collection Time dependent 18 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 19. Retrospective Searches (I) Ad hoc retrieval: find documents “about this” Known item search Directed exploration Identify positive accomplishments of the Hubble telescope since it was launched in 1991. Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them. Find Jimmy Lin’s homepage. What’s the ISBN number of “Modern Information Retrieval”? Who makes the best chocolates? What video conferencing systems exist for digital reference desk services? 19 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 20. Retrospective Searches (II) Question answering Who discovered Oxygen? When did Hawaii become a state? Where is Ayer’s Rock located? What team won the World Series in 1992? “Factoid” What countries export oil? Name U.S. cities that have a “Shubert” theater. “List” Who is Aaron Copland? What is a quasar? “Definition” 20 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 21. Prospective “Searches” Filtering Make a binary decision about each incoming document Routing Sort incoming documents into different bins? Spam or not spam? Categorize news headlines: World? Nation? Metro? Sports? 21 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 22. What types of information? Text (Documents and portions thereof) XML and structured documents Images Audio (sound effects, songs, etc.) Video Source code Applications/Web services 22 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 23. Content-Based Search This is a relative new concept! What else would you search on? What’s more effective? Why is this hard in many applications? 23 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 24. Interesting Examples Google image search Google video search Query by humming http://images.google.com/ http://video.google.com/ http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html 24 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 25. What about databases? What are examples of databases? Banks storing account information Retailers storing inventories Universities storing student grades What exactly is a (relational) database? Think of them as a collection of tables They model some aspect of “the world” 25 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 26. A (Simple) Database Example Student Table Department Table Course Table Enrollment Table 26 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 27. Database Queries What would you want to know from a database? What classes is John Arrow enrolled in? Who has the highest grade in LBSC 690? Who’s in the history department? Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters and were born on a Monday, who has the longest email address? 27 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 28. Databases vs. IR 28 IR Databases What we’re retrieving Mostly unstructured. Free text with some metadata. Structured data. Clear semantics based on a formal model. Queries we’re posing Vague, imprecise information needs (often expressed in natural language). Formally (mathematically) defined queries. Unambiguous. Results we get Sometimes relevant, often not. Exact. Always correct in a formal sense. Interaction with system Interaction is important. One-shot queries. Other issues Issues downplayed. Concurrency, recovery, atomicity are all critical. Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 29. The Big Picture The four components of the information retrieval environment: User Process System Collection What computer geeks care about! What we care about! 29 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 30. The Information Retrieval Cycle Resource Query Ranked List Documents query reformulation, vocabulary learning, relevance feedback Documents source reselection Source Selection Query Formulation Search Selection Examination Delivery 30 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 31. Supporting the Search Process Source Selection Resource Query Formulation Query Search Ranked List Selection Indexing Documents Index Examination Acquisition Documents Collection Delivery 31 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 32. Simplification? Resource Query Ranked List Documents query reformulation, vocabulary learning, relevance feedback Documents source reselection Source Selection Is this itself a vast simplification? Query Formulation Search Selection Examination Delivery 32 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 33. Tackling the IR Challenge Divide and conquer! Strategy: use encapsulation to limit complexity Approach: Define interfaces (input and output) for each component Define the functions performed by each component Study each component in isolation Repeat the process within components as needed Make sure that this decomposition makes sense Result: a hierarchical decomposition 33 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 34. Where do we make the cut? Study the IR black box in isolation Simple behavior: in goes query, out comes documents Optimize the quality of documents that come out Study everything else around the black box Put the human back in the loop! Search Query Ranked List 34 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 35. The IR Black Box Documents Query Hits 35 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 36. Inside The IR Black Box Documents Query Representation Function Representation Function Query Representation Document Representation Index Comparison Function Hits 36 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 37. The Central Problem in IR Information Seeker Authors Concepts Concepts Query Terms Document Terms Do these represent the same concepts? 37 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 38. Learning by Imagination 38 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 39. Imagine a System We have 1000s of web pages, what make these web pages different? May be different key terms or key words occurring in different web pages (e.g., sports, education, video sharing) 39 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 40. Realize Query Needs What do we expect when query? Query can be single word (no order), collection of words i.e., free sentence (order does not matter) or strict phrase (order matters e.g., "I love Pakistan") How to manage data of web pages Bag of words data structure with/without position of words/terms (simply, posting list of words/terms) What’s the best match? We have many matching results, but what’s the order? 40 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 41. Order of Matching Results How could we rank web pages? Via query content matching score against web pages i.e., content based methods Via importance of web pages i.e., link based methods 41 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 42. What does Content Tell? Content Information: Rare terms give more information than frequent terms as common terms do not differentiate well between the content of documents (Information entropy) So what does common words make? Stop words (extreme case, e.g., it, a, the) or words with lesser importance (e.g., word science inside scientific documents) 42 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 43. Ranking Methods Content based methods: Examples: Tf-idf with cosine similarity, bm25, etc. Link based methods: Examples: PageRank, HITS, etc. 43 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 44. What is More in Ranking? What other measures we can take for ranking better? Combining content based methods with link based methods How about learning to rank by user click through data (apply machine learning) How about learning from social web (apply social science theories) 44 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 45. Lots of Web Pages How about scalability? We have too many words, can we limit them? Example: Is Studying conceptually different from study or studies? may be not (concept called stemming could simply everything to simple concept study) Stemming may not be sufficient then how about clustering web pages into topics i.e., (terms study, science, arts, university, school, college would single concept or a topic may be called as topic education) 45 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 46. Is it sufficient? Can we feel confident about how Web Search Engine works? No, it was just a summary for the day 46 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 47. Guess! what next you would see? ? 47 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 48. Our search engine Yes we are making it 48 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 49. Appendix 49 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 50. Outline What is Research? How to prepare yourself for IR research? 50 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 51. What is Research? 51 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 52. What is Research? Research Discover new knowledge Seek answers to questions Basic research Goal: Expand man’s knowledge (e.g., which genes control social behavior of honey bees? ) Often driven by curiosity (but not always) High impact examples: relativity theory, DNA, … Applied research Goal: Improve human condition (i.e., improve the world) (e.g., how to cure cancers?) Driven by practical needs High impact examples: computers, transistors, vaccinations, … The boundary is vague; distinction isn’t important 52 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 53. Why Research? Funding Curiosity Utility of Applications Advancement of Technology Amount of knowledge Application Development Applied Research Basic Research 53 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 54. Where’s IR Research? Information Science Funding Quality of Life Utility of Applications Advancement of Technology Amount of knowledge Computer Science Application Development Applied Research Basic Research 54 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 55. Research Process Identification of the topic (e.g., Web search) Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art) Experiment design (measures, data, etc) (e.g., retrieval accuracy on a sample of web data) Test hypothesis (e.g., compare X and Y on the data) Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?) 55 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 56. Typical IR Research Process Look for a high-impact topic (basic or applied) New problem: define/frame the problem Identify weakness of existing solutions if any Propose new methods Choose data sets (often a main challenge) Design evaluation measures (can be very difficult) Run many experiments (need to have clear research hypotheses) Analyze results and repeat the steps above if necessary Publish research results 56 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 57. Research Methods Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”) Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”) Empirical research: evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”) The “E-C-E cycle”: exploratoryconstructiveempiricalexploratory… 57 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 58. Types of Research Questions and Results Exploratory (Framework): What’s out there? Descriptive (Principles): What does it look like? How does it work? Evaluative (Empirical results): How well does a method solve a problem? Explanatory (Causes): Why does something happen the way it happens? Predictive (Models): What would happen if xxx ? 58 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 59. Solid and High Impact Research Solid work: A clear hypothesis (research question) with conclusive result (either positive or negative) Clearly adds to our knowledge base (what can we learn from this work?) Implications: a solid, focused contribution is often better than a non-conclusive broad exploration High impact = high-importance-of-problem * high-quality-of-solution high impact = open up an important problem high impact = close a problem with the best solution high impact = major milestones in between Implications: question the importance of the problem and don’t just be satisfied with a good solution, make it the best 59 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 60. How to Prepare Yourself for IR Research? 60 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 61. What it Takes to do Research? Curiosity: allow you to ask questions Critical thinking: allow you to challenge assumptions Learning: take you to the frontier of knowledge Persistence: so that you don’t give up Respect data and truth: ensure your research is solid Communication: allow you to publish your work … 61 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 62. Learning about IR (1/2) Start with an IR text book (e.g., Manning et al., Grossman & Frieder, a forth-coming book from UMass,…) Then read “Readings in IR” by Karen Sparck Jones, Peter Willett And read papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf Read other papers published in recent IR/IR-related conferences 62 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 63. Learning about IR (2/2) Getting more focused Choose your favorite sub-area (e.g., retrieval models) Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization) Stay in frontier: Keep monitoring literature in both IR and related areas Broaden your view: Keep an eye on Industry activities Read about industry trends Try out novel prototype systems Funding trends Read request for proposals 63 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 64. Critical Thinking Develop a habit of asking questions, especially why questions Always try to make sense of what you have read/heard; don’t let any question pass by Get used to challenging everything Practical advice Question every claim made in a paper or a talk (can you argue the other way?) Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it) Force yourself to challenge one point in every talk that you attend and raise a question 64 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 65. Respect Data and Truth Be honest with the experiment results Don’t throw away negative results! Try to learn from negative results Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data Be objective in data analysis and interpretation; don’t mislead readers Aim at understanding/explanation instead of just good results Be careful not to over-generalize (for both good and bad results); you may be far from the truth 65 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 66. Communications General communication skills: Oral and written Formal and informal Talk to people with different level of backgrounds Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction) English proficiency Get used to talking to people from different fields 66 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 67. Persistence Work only on topics that you are passionate about Work only on hypotheses that you believe in Don’t draw negative conclusions prematurely and give up easily positive results may be hidden in negative results In many cases, negative results don’t completely reject a hypothesis Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper) Think of possibilities of repositioning a work 67 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 68. Optimize Your Training Know your strengths and weaknesses strong in math vs. strong in system development creative vs. thorough … Train yourself to fix weaknesses Find strategic partners Position yourself to take advantage of your strengths 68 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 69. Thank You Reach me on Twitter: @matifq Email me: maqureshi@iba.edu.pk 69 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk