This presentation shows three applications that successfully combine semantics and statistics for text mining and interactive search.
1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies).
2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse.
3) We discover user intent for interactive retrieval. User intent is defined as a latent state. The observations of this latent state are the sequence of reformulated queries and the retrieved documents, together with the positive or negative feedback provided by the user. The demo shows how the system recognizes a user's intent in health care search.
Dynamic Search Using Semantics & Statistics
1. Text Mining - Bayesian Topic Modeling for Interactive Retrieval at SAP and Cisco Ram Akella University of California and Stanford With Karla Caballero, Maria Daltayanni, Chunye Wang - UCSC and Paul Hofmann SAP Labs October 6, 2011 SAP
2. Outline Motivation Statistical Topic Modeling - SAP & Saffron Knowledge Extraction and Reuse at Cisco Interactive Retrieval Interactive Retrieval Demo
3. Outline Motivation Statistical Topic Modeling - SAP & Saffron Knowledge Extraction and Reuse in Cisco Interactive Retrieval Interactive Retrieval Demo
4. Motivation The user expects to find more relevant results each time she interacts with the system. Relevance of the presented documents depends on user context. SEARCH: q1: elderly depression; q2: depression symptoms; q3: symptoms and treatment. DOCTOR: Depression treatment of patients… SOCIAL SCIENTIST: Depression influence on family relationships…
5. Interactive Retrieval Model Query Interactive Retrieval System User Feedback Document Collection Metadata Generation System Information need Update Feedback and propagation to similar documents
6. Interactive Retrieval Model Query Interactive Retrieval System User Feedback Document Collection Metadata Generation System Add to each document metadata that facilitates the retrieval process. This metadata consists of: Statistical Topic Mixture; Knowledge Extraction based on business process (problem, cause, solution). Information need Update Feedback and propagation to similar documents
7. Outline Motivation Statistical Topic Modeling - SAP & Saffron Motivation Related Work Proposed Approach Topic Modeling and Entity Association Knowledge Extraction and Reuse at Cisco Interactive Retrieval Interactive Retrieval Demo
8. Topic Modeling: Motivation Given a set of documents, we want to identify the main areas or topics discussed in an unsupervised manner. We take advantage of the semantic associations between words across the documents: if two words appear in the same document, they are likely to be related. Each topic has its own distribution over words, and each document may contain material about a variety of topics. [Figure labels: Music (notes, instrument); Sports (net, ball, racquet, play, game); Topic 1 (80%), Topic 2 (5%), Topic 3 (20%); Common Words]
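The idea that each topic has its own word distribution and each document mixes several topics can be sketched as a tiny generative sampler. This is only an illustration: the topic names, vocabularies and probabilities below are invented, and the functions `sample_document` and `word_probability` are hypothetical helpers, not part of the model presented in the slides.

```python
import random

# Hypothetical topics: each maps words to probabilities (illustrative values only).
TOPICS = {
    "sports": {"ball": 0.4, "net": 0.3, "racquet": 0.3},
    "music": {"notes": 0.5, "instrument": 0.5},
}


def sample_document(topic_mixture, length, rng):
    """Draw `length` words: pick a topic from the mixture, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()), k=1)[0]
        words.append(word)
    return words


def word_probability(word, topic_mixture):
    """Marginal probability of a word under a document's topic mixture."""
    return sum(weight * TOPICS[t].get(word, 0.0) for t, weight in topic_mixture.items())


# A document that is mostly about sports with a little music.
mixture = {"sports": 0.8, "music": 0.2}
doc = sample_document(mixture, length=10, rng=random.Random(0))
```

Topic modeling runs this process in reverse: given only the words, it infers the topics and each document's mixture.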
10. Our Approach The higher probability mass is concentrated in the upper part of the tree, which facilitates truncation and a reduction in the number of topics. This lets us determine a suitable number of topics for a particular dataset without training the model several times (once for each candidate number of topics). [Figure: tree with values 0.0851, 0.0660, 0.0310, 0.0096, 0.0146]
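One way to picture truncation is to keep only the heaviest topics until they cover most of the probability mass. The sketch below flattens the tree into a plain list of topic weights and uses a cumulative-mass threshold; both choices are assumptions made for illustration, and `truncate_topics` is a hypothetical helper rather than the procedure used in the presented model.

```python
def truncate_topics(weights, mass=0.9):
    """Return indices of the fewest highest-weight topics whose share of the total is >= mass."""
    if not weights:
        return []
    total = sum(weights)
    # Visit topics from heaviest to lightest.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= mass:
            break
    return kept


# Illustrative weights only (not taken from the slides).
example_weights = [0.5, 0.05, 0.3, 0.1, 0.05]
kept = truncate_topics(example_weights, mass=0.85)
```

With these made-up weights, three of the five topics are enough to cover 85% of the mass, so the remaining two could be dropped.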
11. Experimental Setup The datasets are of two types: scientific articles (NIPS), with longer documents; and news data (NYT, APW, XIE), with shorter documents and a more diverse vocabulary. We compare the performance of the algorithm against three approaches from the literature: LDA, CTM and Pachinko. We test our model using empirical likelihood. This method estimates how likely it is that a test document was generated by the estimated model. We want this value to be high (better generalization and applicability to unseen documents).
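A simplified sketch of an empirical-likelihood estimate follows: sample topic mixtures from the model, score the test document under each, and average. The symmetric Dirichlet prior, the tiny smoothing constant for unseen words, and the names `sample_mixtures` and `log_empirical_likelihood` are assumptions for illustration; the evaluated model would supply its own prior and word distributions.

```python
import math
import random


def sample_mixtures(num_topics, num_samples, rng, alpha=1.0):
    """Draw topic mixtures from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    mixtures = []
    for _ in range(num_samples):
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
        total = sum(draws)
        mixtures.append([d / total for d in draws])
    return mixtures


def log_empirical_likelihood(doc, topic_word, sampled_mixtures, unseen=1e-12):
    """Log of the document likelihood averaged over the sampled topic mixtures."""
    log_likelihoods = []
    for theta in sampled_mixtures:
        log_p = 0.0
        for word in doc:
            # Word probability under this mixture; `unseen` avoids log(0) (an assumption).
            p_word = sum(theta[k] * topic_word[k].get(word, unseen) for k in range(len(theta)))
            log_p += math.log(p_word)
        log_likelihoods.append(log_p)
    # Log-mean-exp keeps the average numerically stable for long documents.
    peak = max(log_likelihoods)
    mean = sum(math.exp(v - peak) for v in log_likelihoods) / len(log_likelihoods)
    return peak + math.log(mean)


# Illustrative two-topic model (values invented for the example).
topic_word = [{"ball": 0.5, "game": 0.5}, {"notes": 0.5, "piano": 0.5}]
mixtures = sample_mixtures(num_topics=2, num_samples=200, rng=random.Random(0))
score = log_empirical_likelihood(["ball", "game", "ball"], topic_word, mixtures)
```

Higher (less negative) scores mean the model assigns more probability to the held-out document.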
12. Results: NYT Dataset We obtain the topic mixture for the NYT Dataset using K=20 topics.
14. Results: Running Time [Figure: running time in minutes on the APW, NIPS, XIE and NYT datasets; legend entry: Our Model]
15. Illustrative Example: NYT Dataset 10/6/2011 NORTHRIDGE TAUGHT A LESSON LOS ANGELES _ School has been out at Cal State Northridge since the week before Christmas, but since you can learn something everyday, Mississippi State's women's basketball team gave a lesson. Northridge has talked about taking its game to the next level. The 21st-ranked Bulldogs _ the first nationally ranked team to play here in Northridge's Division I era _ gave a glimpse of that level in a 98-64 nonconference victory before a crowd of 165 Friday night.
19. Topic Modeling & Entity Association We would like to know which actors were involved in a particular action that led to the failure of Lehman Brothers. Source: the Valukas Report on why Lehman Brothers failed (6 volumes). Saffron Associative Memory creates associations among entities and topics. [Diagram labels: Source; Text Data to be monitored; SAP Business Objects Entity Extractor (Entities); UCSC Topic Mining System (Topics); Saffron Associative Memory Base; Base knowledge; Query] This work was presented at SAPPHIRE NOW 2010.
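Associating entities with topics can be pictured as counting how often they occur together. The sketch below is a plain co-occurrence counter written for illustration; it is not Saffron's product or API, and the passages, entity names and topic labels are placeholders rather than content from the report.

```python
from collections import Counter
from itertools import product


def associate(passages):
    """Count how often each (entity, topic) pair occurs in the same passage."""
    counts = Counter()
    for entities, topics in passages:
        for pair in product(set(entities), set(topics)):
            counts[pair] += 1
    return counts


def top_entities(counts, topic, n=3):
    """Entities most often seen with `topic`, strongest first (ties broken by name)."""
    scored = [(entity, c) for (entity, t), c in counts.items() if t == topic]
    return sorted(scored, key=lambda item: (-item[1], item[0]))[:n]


# Each passage: (entities found by an extractor, topics assigned by a topic model).
passages = [
    (["Entity A", "Entity B"], ["liquidity"]),
    (["Entity A"], ["liquidity", "accounting"]),
    (["Entity C"], ["accounting"]),
]
counts = associate(passages)
ranked = top_entities(counts, "liquidity")
```

Querying by topic then returns the entities most strongly tied to it, which is the kind of question posed on this slide.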
20. Outline Motivation Statistical Topic modeling - SAP & Saffron Knowledge Extraction and Reuse in Cisco Knowledge Extraction System System Architecture Domain Knowledge Improving Productivity Performance of Service Request Recommender Interactive Retrieval Interactive Retrieval Demo
21. Knowledge Extraction System at Cisco Service Request Database → Service Request Text Mining System → Knowledge Database → Applications such as retrieval. The system turns unstructured text into knowledge: Problem (What was the problem?), Cause (Why did it occur?), Solution (How was it solved?), and Irrelevant Content. Finding different solutions to the same problem: comparing Document 1 and Document 2, Problem vs. Problem similarity is high, Cause vs. Cause similarity is high, and Solution vs. Solution similarity is low.
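The "different solutions to the same problem" pattern can be sketched as a comparison of extracted fields. The slides do not state which similarity measure is used, so the Jaccard word overlap, the thresholds and the sample service requests below are all assumptions for illustration.

```python
def jaccard(text_a, text_b):
    """Word-overlap similarity between two texts, in [0, 1]."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def alternative_solution(doc_a, doc_b, high=0.5, low=0.3):
    """True when two requests share a problem and a cause but were solved differently."""
    return (
        jaccard(doc_a["problem"], doc_b["problem"]) >= high
        and jaccard(doc_a["cause"], doc_b["cause"]) >= high
        and jaccard(doc_a["solution"], doc_b["solution"]) <= low
    )


# Invented service requests, already split into problem / cause / solution.
doc1 = {
    "problem": "router drops packets under load",
    "cause": "outdated firmware on router",
    "solution": "upgrade firmware to latest release",
}
doc2 = {
    "problem": "router drops packets under heavy load",
    "cause": "outdated firmware on the router",
    "solution": "replace the line card",
}
found = alternative_solution(doc1, doc2)
```

Pairs flagged this way are candidates for a knowledge article that lists more than one fix for the same issue.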
22. System Architecture Features from Expertise Service Request Preprocessor Bag-of-words Feature Generator Hierarchical Classifier Expertise Domain Knowledge Labeled Paragraphs Service Request Recommender User Legend Data flow of Analyzer Data flow of Recommender Data output for User
25. Improving Productivity Compare the time engineers spend reading service requests before and after using our system. Browse a service request Time to assess relevance N Relevant? Y Read and understand thoroughly Time to extract knowledge Read enough? N Y Create knowledge article
27. Result 2: Using domain knowledge further improves retrieval results.
29. Interactive Retrieval Model the user intent to retrieve relevant documents. Identify the trade-off between retrieval accuracy (how accurate does the user need the results to be?) and interaction time (how much time is the user willing to spend on interaction?). Applied to: medical document retrieval, e.g., search for past patient cases with similar symptoms; resume retrieval in a labor marketplace, e.g., search for Python developers who work in machine learning. [Figure labels: MORE IMPORTANT, LESS IMPORTANT]
30. Problem What is the best path to choose? [Diagram: time steps t1, t2, t3, …, tn, each with a User Intent and a Set of Relevant Documents; methods: Dynamic Programming, Reinforcement Learning; labels: Myopic, Dynamic, Static, Dynamic]
31. Reinforcement Learning formulation of IIR Agent: IIR system. Environment: User. Action: Ranking Rk. Objective: Max. sum of rewards. Reward: Improvement v(Rk) - v(Rk-1) (as observed from user feedback). Intent: Best guess for user intent or need (expressed in query terms).
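The reward defined above, the improvement v(Rk) - v(Rk-1) between consecutive rankings, can be computed as in the sketch below. The form of the value function (a position-weighted count of relevant documents), the weights and the sample session are assumptions for illustration; `ranking_value` and `session_rewards` are hypothetical names.

```python
def ranking_value(ranking, relevant, weights):
    """v(R): weighted count of relevant documents, one weight per rank position."""
    return sum(w for doc, w in zip(ranking, weights) if doc in relevant)


def session_rewards(rankings, relevant, weights):
    """Reward at step k is the improvement v(R_k) - v(R_{k-1})."""
    values = [ranking_value(r, relevant, weights) for r in rankings]
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Invented three-step session: the rankings improve as feedback arrives.
relevant = {"d1", "d4"}
weights = [1.0, 0.5, 0.25]
rankings = [
    ["d2", "d3", "d1"],
    ["d1", "d2", "d3"],
    ["d1", "d4", "d2"],
]
rewards = session_rewards(rankings, relevant, weights)
```

Because the rewards telescope, their sum equals the value of the final ranking minus the value of the first, so maximizing the sum of rewards pushes the system toward the best final ranking.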
32. Experiments Set-Up Dataset: TREC-9 OHSUMED, 348,566 medical documents with a list of relevance judgments. 65 user queries (query title: 2 to 5 words; query description: 5 to 10 words). Interactive sessions of 3 to 5 steps. Relevance function is binary. Value of results (with appropriate weights wi). Precision @10: percentage of relevant documents in the top-10 results. We compare our results with pseudo-relevance feedback.
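Precision @10 with binary relevance is straightforward to compute. The sketch below is a minimal version; the document identifiers and relevance judgments are invented for the example.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k results that are relevant (binary relevance)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranking[:k]
    return sum(1 for doc in top if doc in relevant) / k


# Twelve ranked documents; d12 is relevant but falls outside the top 10.
ranking = [f"d{i}" for i in range(1, 13)]
relevant = {"d1", "d3", "d10", "d12"}
p10 = precision_at_k(ranking, relevant)
```

Here three of the top ten documents are relevant, so `p10` is 0.3.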
34. How much feedback is needed? Experiments run on the OHSUMED medical dataset (348,566 documents), TREC 2002
35. Interactive Retrieval with Topic Modeling Topics help us to narrow the search. They add context to the query: some terms that are important for describing the user's intent may not be included in the query. Topics are calculated a priori and added to each document as metadata. Meta-query (combination of user inputs): Topic Mixture of Relevant Docs; Topic Mixture of Non-Relevant Docs; Combination of terms and topic relevance scores. The meta-query is updated each time the user provides feedback (clicks) or additional information to the system (query redefinition).
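One possible way to combine term and topic relevance scores is a weighted blend, sketched below. The cosine similarity, the linear combination, the weight `lam` and the sample topic mixtures are all assumptions for illustration; the slides do not specify the scoring formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_mixture(mixtures):
    """Average topic mixture of a set of documents."""
    return [sum(col) / len(mixtures) for col in zip(*mixtures)]


def combined_score(term_score, doc_topics, relevant_mix, nonrelevant_mix, lam=0.7):
    """Blend a term-matching score with topic closeness to relevant vs. non-relevant feedback."""
    topic_score = cosine(doc_topics, relevant_mix) - cosine(doc_topics, nonrelevant_mix)
    return lam * term_score + (1.0 - lam) * topic_score


# Topic mixtures of documents the user marked relevant / non-relevant (invented).
relevant_mix = mean_mixture([[0.8, 0.2], [0.6, 0.4]])
nonrelevant_mix = mean_mixture([[0.1, 0.9]])

# Two candidates with the same term score but different topic mixtures.
score_a = combined_score(0.5, [0.9, 0.1], relevant_mix, nonrelevant_mix)
score_b = combined_score(0.5, [0.2, 0.8], relevant_mix, nonrelevant_mix)
```

The candidate whose topics resemble the relevant feedback ranks higher even though both match the query terms equally well, and recomputing the two mixtures after each click updates the meta-query.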
36. Proposed Dataset We test our approach using the HARD TREC queries. The collection consists of 851,018 news documents from the NYT, APW and XIE agencies. Each document has an average length of 305 terms. There are 496,779 unique terms. We infer the topic information of the corpus using 75 topics. For testing purposes we use m=3 interactions. We use 30 test queries. We compare our algorithm with mixture relevance feedback.
38. Outline Motivation Statistical Topic Modeling - SAP & Saffron Knowledge Extraction and Reuse at Cisco Interactive Retrieval Interactive Retrieval Demo
39. Example User intent: young female with fevers and increased CPK (creatine phosphokinase). CPK: enzyme, may cause heart attack or severe muscle breakdown if increased. Neuroleptic malignant syndrome (life-threatening neurological disorder): associated with CPK; symptoms: muscular cramps, fever, unstable blood pressure, changes in cognition, including agitation, delirium and coma. Differential diagnosis: list symptoms; list causes of the symptoms; prioritize by the most dangerous; treat. Treatment.
40. Relevant Documents Non-relevant documents: Doc 1: Significance of elevated levels of CPK in febrile diseases: a prospective study. The incidence and significance of elevated serum levels of (CPK) in febrile diseases were studied prospectively in all patients admitted with fever to a department of medicine during 1 year. Doc 2: Metoclopramide-induced neuroleptic malignant syndrome….Symptoms of NMS include rigidity, hyperpyrexia, altered consciousness, and autonomic instability. This syndrome is generally associated with neuroleptic medications used to treat psychotic and major depressive illnesses… Relevant document: Doc 3: Neuroleptic malignant syndrome: guidelines for treatment and reinstitution of neuroleptics… Cardinal symptoms include fever, muscular rigidity, an elevated serum level of creatine phosphokinase, changes in mental status, and autonomic dysfunction…