This proposal outlines the development of a comprehensive information retrieval portal for Canadian scientific researchers. The portal would aggregate content from various sources and use techniques like collaborative filtering and content analysis to provide personalized search and recommendations. It would include features for user profiling, concept discovery, and interactive visualization of results. The proposal discusses forming partnerships with organizations to incorporate additional content and conducting a pilot program to evaluate the portal's usability and ability to improve search satisfaction and reuse.
2. Overview
Context: CISTI Strategic Plan
Proposal Statement
System Architecture
Proposal Components
Partnerships
Outcomes and Draft Workplan
Andre’s Relevant Experience
3. Holy Grail
“It’s easy to say what would be the ideal
online resource for scholars and scientists: all
papers in all fields, systematically
interconnected, effortlessly accessible and
rationally navigable from any researcher’s
desk, worldwide, for free”
Stevan Harnad, 1999
Professor of Cognitive Science
University of Southampton
4. Excerpts from CISTI Strategic Plan
“Goal 1: Provide universal, seamless, and permanent
access to information for Canadian research and
innovation.”
“Canadians look to CISTI to deliver distilled,
aggregated, and validated information that is relevant
to their research and innovation activities.”
“Available at the client’s desktop, these services are
provided through a technologically sophisticated
infrastructure.”
“[All users] will have electronic access at their desktop
to a wealth of national and international STM
information resources, supported by intelligent search
and analysis tools and expert advice.”
5. Proposal Vision
To develop a web-based information
portal that offers universal, seamless
access to highly relevant, distilled and
aggregated STM information using
intelligent search and analysis tools that
support scientific innovation.
6. High Level Functional Architecture
[Architecture diagram. Recoverable labels: Personalized Scientific Literature Research Portal; Web Application Server; Content Aggregator; OpenURL Resolver; Content Analysis (LitMiner); Personalization Engine; User Agents; Collaborative Filtering (Taste, open source); content sources: Commercial Science Publishers, CISTI, University Libraries.]
8. User Needs
Customers of CISTI services and content are elite –
highly educated and exacting in their requirements;
Compared to mass-market or intranet commercial
search-portals, the number of CISTI end-users is
small (30,000 – 100,000);
User needs are (likely) varied but focused: e.g.
bibliographic literature searches / peer reviews /
competitive analysis / historical research;
Contribution to “innovation” can be measured (in the
short term) by asking the user directly.
9. User Profiling
Enables
Customized services
Alerts / Notifications
Higher precision search results
Greater user satisfaction
Item and User based recommender system
Broadens scope of search to semantically
cognate but otherwise disparate domains
10. Content Aggregation
Most end users will (likely) not care where the
information they seek resides;
Results for a search should show that many
sources are available and provide links to
these sources (Open Access / Commercial /
Academic / Government);
Requires partnerships with content providers
and search engines.
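A minimal sketch of the aggregation idea above, assuming each source returns records with a title, a link and a source category. The `Hit` record, the `aggregate` function and the example URLs are hypothetical and not part of the proposal; matching on a normalized title stands in for a real identifier such as a DOI.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    title: str
    url: str
    source_type: str  # e.g. "Open Access", "Commercial", "Academic", "Government"


def normalize(title: str) -> str:
    """Crude matching key: lowercase, keep only letters, digits and single spaces."""
    kept = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def aggregate(hits):
    """Group hits for the same item so that one result lists every source."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[normalize(hit.title)].append(hit)
    results = []
    for group in grouped.values():
        links = {h.source_type: h.url for h in group}
        results.append({"title": group[0].title, "links": links})
    return results
```

The user sees one entry per article with a link for each source type, which matches the point that end users need not care where the item resides.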
11. Collaborative Filtering
Monitors user’s browsing behaviour (and / or explicit
feedback) to build a profile of the user’s choices;
Other users with “similar” profiles can share
(anonymously) their opinions (e.g. on the value or
usefulness of an article or book) with others (e.g. “People
who ordered article X also ordered article Y”);
Enables serendipitous recommendations (options
that the “active user” might not have considered
otherwise)
May stimulate “innovation”;
May complement citation indexing as a relevance criterion;
Untested technology in the scientific information
retrieval community;
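The “people who ordered article X also ordered article Y” idea can be sketched with simple co-occurrence counts. This is an illustration only, under the assumption that order histories are available as a mapping from an anonymous user id to article ids. It does not use the API of Taste or any other package, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(orders):
    """Count, for each article, how many users also ordered each other article."""
    co = defaultdict(Counter)
    for articles in orders.values():
        for a, b in permutations(set(articles), 2):
            co[a][b] += 1
    return co


def recommend(co, user_articles, top_n=3):
    """Rank unseen articles by how often they co-occur with the user's articles."""
    seen = set(user_articles)
    scores = Counter()
    for article in seen:
        for other, count in co[article].items():
            if other not in seen:
                scores[other] += count
    # Sort by score, then by identifier so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]
```

A production recommender would normalize for article popularity and combine this with explicit feedback; the counts here only show where the serendipitous suggestions come from.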
12. Content Mining
Concept discovery using:
Automatic Classification (Categorization)
Named Entity Tagging
Document meta-tagging w/ Concepts
Value:
Improved Precision in Search Results
May add dimensions to meta-data about content
“Related Articles” feature in Google Scholar
Enables novel visualization of results
13. Entrust Toolkit
[Diagram of the Entrust Content Analysis Toolkit. Recoverable labels: inputs: DB, File System, Document Content; components: Categorizer, Concepts, Summarizer, Search; outputs: Categories; Concepts, Meta-Data; Summaries, Ranked Phrases; Hits, Locations.]
15. Results Visualization
Content Analysis and
Personalization
May allow different
display paradigms for
“more documents like
this” or “similar articles”
Interactive Visualization of Multiple
Query Results – Battelle
Feedback on relevance
of the query terms to the
selected item.
Using Visualisation to Interpret Search
Engine Results – Wolverhampton
16. Partners
Google (Books / Scholar)
http://scholar.google.com/
Online Computer Library Center - WorldCat
http://www.worldcat.org/
Public Library of Science
http://www.plos.org
Science.gov
http://www.science.gov/
International Association of STM Publishers
http://www.stm-assoc.org/
Annual Reviews
http://www.annualreviews.org/
BioMed Central (UK)
http://www.biomedcentral.com/
17. Related Areas of Research
Digital Archiving
Mechanisms for preserving digital objects (multi-media)
Valuation and payment models for Digital Objects
To decide what to preserve / for how long / how much to
charge
Application of Metadata Standards
Dublin core / Semantic Web Ontologies (OWL)
Digital Rights Management & Security
Access control / Intellectual Property protection
18. Project Phases & Outcomes
Project Phases
Requirements / Research Phase
Analysis / Design Phase
Development / Test Phase
Outcomes
Develop prototype of content-aggregation search portal
with collaborative filtering and content analysis engine
Establish partnerships with content providers and search
engine organizations
Test user satisfaction and "return use" improvements on a
sample population
Publish results
19. Requirements / Research Phase
User Requirements
Find out what classes of users there are and what
features users want in an information portal that
would help them innovate;
Technology Literature Review
Content Aggregation
Visualization
Categorization
Personalization / Collaborative Filtering
20. Analysis / Design Phase
Use-Cases
For each category of user, enumerate the use-
cases (behavioural scenarios).
User Interface Design
Design the interface for query, query-refinement,
results visualization and recommendations.
Software Evaluation
Portal web-application components
Collaborative Filtering packages
Categorization / LitMiner interfaces
21. Development / Test Phase
Prototype Information Portal
Develop Content Aggregator
Personalization / Recommendation agents
Integrate Content Analysis
LitMiner or Categorization / Concept Tagging
toolkits
Test and Evaluate in a Pilot program.
Experiments with test group to determine
Measure of user acceptance
Rates of Return Usage
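One way the “rate of return usage” could be computed from pilot logs, as a sketch only. It assumes a log of (user id, day) pairs and defines a returning user as one seen on more than one distinct day; both the log format and that definition are assumptions, not part of the proposal.

```python
from collections import defaultdict


def return_usage_rate(sessions):
    """Fraction of pilot users who used the portal on more than one distinct day."""
    days_by_user = defaultdict(set)
    for user, day in sessions:
        days_by_user[user].add(day)
    if not days_by_user:
        return 0.0
    returning = sum(1 for days in days_by_user.values() if len(days) > 1)
    return returning / len(days_by_user)
```

Comparing this rate between a group with recommendations enabled and a group without would be one way to test the “return use” improvement named in the outcomes.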
23. Andre Vellino – Relevant Experience
Entrust
Content Analysis Policy Architect - Concept extraction and automatic categorization.
imGenie – startup
Systems architect for a wireless, bi-modal (voice / text), personalized information
retrieval and groupware application.
National Research Council
Research Scientist, IIT – Information Retrieval on small-format displays.
Nortel Networks
Senior Systems Architect, Disruptive Network Solutions - Personal Identity
Management for intelligent mediation of content-delivery in the network.
Carleton University
Cognitive Science Ph.D. program, Adjunct Research Professor
NCF Internet
Server-side Web architect for new NCF web-portal – registration, payment,
single sign-on to integrated applications.
University of Georgia / Environmental Protection Agency
Research Associate, Advanced Computational Methods Center - development of
expert system for predicting chemical reactivity from chemical structure.
Editor's Notes
This quote is from Stevan Harnad, a professor of cognitive science and advocate of Open Access. He is an especially strong believer in self-archiving as a method for increasing the accessibility of scholarly work. This vision is similar, in several respects, to the one offered by the CISTI Strategic Plan (2005-2010).
Excerpts from the CISTI 2005-2010 Strategic Plan
In a sentence, my proposal is: to develop a web-based information portal that offers universal, seamless access to highly relevant, distilled and aggregated information using intelligent search and analysis tools that support scientific innovation.
This picture illustrates the overall functional architecture of this proposal.
The proposal has 6 principal components.
The specificity of scientific and technical researchers provides both a challenge and an opportunity. The challenge is that the users’ requirements are much more stringent; the opportunity is that the users’ needs are much more focused than those of the typical Google user.
If we know who the users are and we keep track of the users’ behaviours, we can provide them with value-added services (alerts / notifications), better quality search results and novel capabilities that stimulate scientific innovation (recommender services).
One objective of this proposal is to provide a single point of access for Canadians to access a variety of STM content sources.
This is one of the core technology components – a recommendation service built on collaborative filtering technology.
The other core technology component is content analysis (classification / named-entity tagging). This will facilitate a better user search experience and enable novel ways of visualizing results.
One (commercial) candidate for content analysis is the Entrust Content Analysis Toolkit, which offers a mixed-paradigm method of analyzing text content.
One example that I developed for Entrust is this concept-hierarchy for detecting medical concepts. For example the concept “ICD-9” contains several thousand scored search terms. In combination, they can detect the presence of medical information in an e-mail or text document.
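The scored-term idea in this note can be illustrated as follows. This is a sketch under stated assumptions, not the Entrust toolkit's API or its actual scoring method: the terms, scores, threshold and function names are all invented for illustration.

```python
import re


def concept_score(text, scored_terms):
    """Sum the scores of every concept term that occurs in the text."""
    lowered = text.lower()
    total = 0.0
    for term, score in scored_terms.items():
        # Whole-word (or whole-phrase) match, case-insensitive.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            total += score
    return total


def detect_concepts(text, concepts, threshold=1.0):
    """Return the concepts whose combined term score reaches the threshold."""
    return [
        name
        for name, terms in concepts.items()
        if concept_score(text, terms) >= threshold
    ]
```

No single weak term triggers a concept on its own; several terms in combination push the score over the threshold, which is the behaviour the note describes.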
These are some possibilities for search-result visualization that may be considered for this project.
Partnerships with content publishers, whether Open Access or commercial, will have to be developed to achieve the goal of “seamless” and “comprehensive” access to STM information. This is a partial list of content / search engine providers with which CISTI could form partnerships.
There are other valid areas of research, such as Digital Archiving and Metadata Standards which may contribute to the objectives of this proposed research, but in this proposal I focus on the work that would best suit me and to which I have the most to contribute.
There are 3 principal project phases and 4 main outcomes of this work.
User Requirements: To develop an effective IR portal, we need to find out what features scientists of various sorts would want in such a portal that would help them with their task. This phase would allow us to better understand the different categories of users and the varieties of tasks that they are attempting to achieve when using the services of an information portal. Technology Literature Review: This phase will review the computer science and cognitive science literature in the 4 major technologies that need to be integrated in this portal.
Use Cases: From the user requirements, we can abstract out “Use Cases” – typical scenarios of usage that cover the range of user requirements. For example, one use-case might be that of an industrial chemist doing a search for “prior art” for a patent application. Design the UI: The use-cases enumerated in this phase define some of the constraints for the User Interface. If users typically just want to enter search terms “Google” style and then sort through the results and refine the search, that will dictate some aspects of the UI. If users typically know which sources of information they wish to search, that will call for a different UI paradigm. Software Evaluation: Existing off-the-shelf software (commercial and open source) will be assessed for its suitability in this project.
Prototyping the portal will have several components:
* User authentication / login (for personalization features to be active, such as “high precision search”, “recommendation”, “notification alerts”, etc.).
* Personalization based on a one-time registration profile (profession / interests / contact information for alerts).
Which of the Content Analysis toolkits is integrated will depend in part on the application interfaces that are discovered in the software evaluation phase. How the whole application performs, from the point of view of user acceptance, will have to be determined experimentally in a pilot program.
This is the Gantt chart of the work plan outlined in the accompanying paper.
Extracted from my Curriculum Vitae, this is the strength and depth of experience I bring to this project.