From Bonnie Shucha at U Wisc. Show of hands:
Of all the time you spend on legal research, how much of it is spent on the Web? Less than 25%, 25-50%, 50-75%, more than 75%?
What search tools do you prefer? Google, Westlaw, Lexis, others?
Generally, how satisfied are you with your search experiences? Very, Fairly, Somewhat, Not Very, Not at All?
NEED AN INTRO HERE OR ON THE SLIDE BEFORE THIS ONE.
A lot has been written, and continues to be written, about the Invisible Web. Here are just some of the sources I used when writing this presentation.
Dr. Jill Ellsworth coined the term “Invisible Web” in 1996. It is also called the “Deep Web,” “Hidden Web,” and “Dark Web.” The Invisible Web consists of the data you cannot retrieve using a keyword search in a general search engine. I and others also include any results from a general search engine that are not on the first couple of pages. If one isn’t going to look past those first few pages, then the rest of the results are effectively “invisible.” Due to the dynamic nature of the Web, what is “invisible” today may be “visible” tomorrow.
The Visible Web is what you see in the results pages generated by general Web search engines such as Yahoo!, Google, or Bing. It is also called the Surface Web.
In order to understand the concept of the “invisible Web,” it may be helpful to first explore the nature of the “visible Web.”
- A visible Web page exists in “static,” or unchanging, form.
- It exists as a “physical” file on a computer.
- Most are in .htm or .html format.
- This is similar to a word-processed document in .doc or .wpd format.
- Static Web pages are considered “visible” because standard search engines can index them and display them as search results.
The Surface Web is just the tip of the iceberg.
What is the Invisible Web?
In late 2001 a search company called BrightPlanet made several points in a now famous white paper on the invisible Web (The Deep Web: Surfacing Hidden Value):
- It is as much as five hundred times larger than the visible Web.
- It is the fastest-growing category of new information on the Internet.
- It contains sites that tend to be narrower, with deeper content, than surface sites.
- Its content is highly relevant to every information need, market, and domain.
- It consists primarily of publicly accessible information, not subject to fees or subscriptions.
Chris Sherman and Gary Price, Internet search gurus and authors of The Invisible Web (2001), disagree with BrightPlanet’s sizing of the invisible Web. They estimate it to be somewhere between 2 and 50 times larger than the visible Web. Search engines such as Google are making strides to index deep Web content, so this number is constantly changing. Nonetheless, the invisible Web still accounts for the majority of information available on the Internet.
Karen explained earlier this morning how search engines create their indexes. Understanding this process explains why so much information is not retrieved by a search engine search. I decided to use the word “content” here rather than “sites.” The main page of a Deep Web site is usually easy to find and is in the search engine’s index. It is the rest of the site (other Web pages and other content within the site) that may be hidden.
Search engines do not index certain Web content (making it invisible) mainly for the following reasons:
1. The search engine does not know about the page. No one has submitted the URL to the search engine, and/or no pages currently covered by the search engine have linked to it. The search engine is assuming in this case that hardly anyone cares about this page, so you probably don’t either.
2. The search engines have decided not to index the content:
- because it is too deep in the site (probably less useful);
- because the page changes frequently and indexing the content would be somewhat meaningless (for example, news pages);
- because the page is generated dynamically: it exists only for a moment in time, as the result of a query/search.
3. The search engine has been asked not to index the content. A robots.txt file is present on the site. This file asks the search engine not to index the site, or not to index specific pages or particular parts of the site. (The website creator has determined that this information is Nobody Else’s Business.)
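For anyone who wants to see what this looks like in practice, here is a rough sketch of how a well-behaved crawler might consult a robots.txt file before indexing a page. It uses Python's standard urllib.robotparser module. The robots.txt contents, the example.com addresses, and the "ExampleBot" crawler name are all made up for illustration; real search engines have their own, more elaborate handling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; the paths are invented.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/report.html
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/index.html",
        "https://www.example.com/private/memo.html",
        "https://www.example.com/drafts/report.html",
    ):
        status = "may be indexed" if is_crawlable(SAMPLE_ROBOTS_TXT, url) else "asked not to index"
        print(f"{url}: {status}")
```

Note that robots.txt is only a request. The pages it names are still on the Web and can be reached directly; they are simply left out of the index of any search engine that honors the file.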
Web pages are created, or coded, using HTML. This is what the search engines’ spiders are crawling for content. Files such as images and videos are not HTML. This category used to include file types such as PDF, Excel, Word, and others, but these can now be indexed. Also, because of the increasing amount of HTML-readable data attached to these visual files, they are becoming more and more indexable and thus findable by search engines.
The search engine cannot get to the pages to index them because it encounters a request for a password, or the site has a search box that must be filled out in order to get to the content. The main example of this type of content is information in databases.
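The reasons above mostly come down to one fact: a crawler finds pages by following links. Here is a toy sketch of that idea. The site map is entirely hypothetical (the page names are invented), and a real crawler is far more sophisticated, but it shows why an unlinked page, or a results page that exists only after someone fills in a search box, never makes it into the index.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "search-form"],
    "about": ["home"],
    "search-form": [],        # the crawler sees the form but cannot type a query
    "orphan-report": [],      # no page links here
    "results?q=dockets": [],  # generated on demand from the search form
}


def crawl(start: str, links: dict[str, list[str]]) -> set[str]:
    """Breadth-first crawl: return every page reachable by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def invisible_pages(start: str, links: dict[str, list[str]]) -> list[str]:
    """Pages that exist on the site but that the crawl never reaches."""
    return sorted(set(links) - crawl(start, links))


if __name__ == "__main__":
    print("Indexed:  ", sorted(crawl("home", LINKS)))
    print("Invisible:", invisible_pages("home", LINKS))
```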
This chart is another way to look at the types of sites and content that are in the Invisible Web. It breaks the Invisible Web into 4 main sub-types based on the categories identified by Sherman and Price. These are:
- The Opaque Web
- The Private Web
- The Proprietary Web
- The Truly Invisible Web
Highlight some of these depending on time:
- Maximum number of viewable results: only the first 10,000 or so are actually viewable (still way more than anyone would actually look at)
- Password protected
- Information in databases
- Dynamically generated
Corresponds to paper, bottom of pg 2 and top of pg 3.
Remember that the Invisible Web exists, AND then use the other tips in this presentation to find valuable information that is “hidden” there. If you don’t know it exists, then you won’t go looking for it and will miss out.
Devine & Egger-Sider noted in their book Going Beyond Google: The Invisible Web in Learning and Teaching that the object in teaching about, and remembering, the Invisible Web is NOT to replace general-purpose search engines but to show how Invisible Web resources complement search engine results.
Knowing about the Invisible Web helps close what Alex Salkever called the “Google-Gap,” or the difference between what people think Google can retrieve and what it actually can. (From his article “The Web According to Google” in Business Week Online 10, no. 23, June 2003.)
One basic thing to do is use a search engine such as Google to find a database that includes the content you are looking for. Enter some keywords and then the word DATABASE. Once you find a database, do another search using its own search function to find the content you need. This is also referred to as “split-level searching.”
“The point is that often the key to the answer is not locating the answer itself as the first step, but locating the right database in which to search for it.”
Diana Botluk, Mining Deeper into the Invisible Web, http://www.llrx.com/features/mining.htm
One basic thing to do is use a search engine such as Google to find a database that includes the content you are looking for. Enter some keywords and then the word DATABASE.
For example, if we want to find Rhode Island dockets, enter the keywords rhode island dockets AND then the word database.
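For those who like to see the mechanics, the first step of this kind of search can be sketched in a few lines of Python. The function names are my own, invented for illustration; the only thing taken from the example above is the idea of adding the word "database" to the keywords. The second step (searching inside the database you find) depends entirely on that database's own search form, so it is not shown.

```python
from urllib.parse import urlencode


def database_query(keywords: str) -> str:
    """Step 1 of a split-level search: the keywords plus the word 'database'."""
    return f"{keywords.strip()} database"


def google_search_url(query: str) -> str:
    """Build an ordinary Google search address for the query."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    query = database_query("rhode island dockets")
    print(query)
    print(google_search_url(query))
```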
Clicking on the link leads us to Justia’s Dockets & Filings page. This is one of the resources covered in my paper, and I will be talking more about Justia in my presentation on Government Resources. It is a legal portal whose mission, according to its website, is to advance the availability of legal resources for the benefit of society. They are especially focused on making primary legal materials and community resources free and easy to find on the Internet. This site is independently owned by two individuals with legal and research backgrounds and is not affiliated with any publisher.
Justia is a very deep website, and in this section you can search dockets from different jurisdictions, so I’ve chosen Rhode Island from the drop-down menu and clicked on SEARCH to get these results: Cases filed in Rhode Island.
Justia also provides Internet users with free case law, codes, regulations, legal articles, and legal blog and Twitterer databases, as well as additional community resources. Justia works with educational, public interest, and other socially focused organizations to bring legal and consumer information to the online community. Justia provides premium Web site, blogging, and online marketing solutions to help law firms optimize their marketing budget and provide their clients with an increased level of information and service.
Actual question from our CFO (my boss).
Tip #2 Example: Finding Statistics: No. of Income Taxes Filed in RI.
Here’s the first page from a general keyword Google search for our question. The first hits are just for places with our keyword. So, our first real hit is at the site rhodeislandwork.us. The next 4 results are all for tax forms or instructions for forms.
The second real hit is to an article in the Providence Journal. The rest of the results on this page are for more forms and instructions.
Screen capture of real hit No. 1 from the Google general search for “number of tax returns filed in Rhode Island.” This leads to a page or section of the website that is no longer available.
Google’s cached version of real hit No. 2. This is why the page is included in our results. Our keywords are highlighted.
Screen capture of real hit No. 2 from the Google general search for “number of tax returns filed in Rhode Island.” It still doesn’t have the information/data we are looking for. So, the first page of Google results does not lead us to an answer for our question.
Next, use Google to locate a site that may have this information. I thought that the page for the RI Division of Taxation might have this information.
Website for the Division of Taxation. I didn’t see anything at first glance. I did a search for returns and it took me to the Reports page. Originally I just looked at the links on the left of the page and chose Reports, thinking that the data I am looking for might be in a report.
Reports page. I chose 2009 Resident because it was the most recent and I am looking for personal, not corporate, information.
Part of the report page with the answer I am looking for
Use as examples: DRAGNET, CompletePlanet, Public Library of Law.
Created by the Mendik Library at New York Law School. Winner of the AALL 2011 Law Library Publications Award, Nonprint Division. The award honors achievement in creating in-house library materials that are outstanding in quality and significance.
DRAGNET stands for Database Retrieval Access Using Google’s New Electronic Technology. It is a Google Custom Search engine that searches only 100 highly recommended FREE legal databases and websites. It includes a mix of governmental and organizational sites, with an emphasis on New York.
In addition to this DRAGNET search of free legal databases, there is also a DRAGNET search of 150 law journals with FREE online content (and a third DRAGNET that searches the constitutions and codes of the 50 states and the federal government).
One of the best Invisible Web portals is CompletePlanet. It is not included in my paper because during the few days I was writing it I could not get into the site, and I didn’t want to include it without verifying that it was up and running. I’ve since had no problem accessing it, so I am including it in this presentation.
It is a comprehensive listing of dynamic searchable databases. You can search by keyword or drill down by subject. Here I am going to start by clicking on the LAW subject classification.
Now I am at a page displaying databases classified in the LAW category. I can drill down further by more specific subject subcategories. Here I’ve chosen Courts.
Notice the “breadcrumb” trail near the top of the page, circled in yellow. You can use this to go back to earlier pages, or just to see where you are in the subject hierarchy. While the results number in the thousands, that is far fewer, and much more relevant, than a general search engine search, which would retrieve results in the millions.
The Public Library of Law is another Legal Web Portal which I have recommended in my paper.They bill themselves as the best starting place to find law on the Web. It’s easy to use and contains links to case law, federal law, legislative information, and more.
Their Advanced Search lets you search by keyword, date range, and jurisdiction underneath each of the tabs for Case Law, Statutes, Regulations, Court Rules, Constitutions, and Legal Forms.
There are also links to the for-fee information available with a Fastcase subscription, because this site is affiliated with the publisher Fastcase.
Ask a web research specialist such as a Law Librarian or other Information Professional.
If you have a Librarian at your firm, you could consult him or her. Or consult the Librarians at your Public Library, or my fellow presenter Karen at the RI State Law Library. You could also hire an Independent Information Professional to do a research project for you. (No, I didn’t color coordinate my presentation to these websites!)
You can limit your results using the categories in the left-hand sidebar. For example, if you are looking just for images, click on Images to limit by that content type. If you click on More you can also limit by BLOGS,
Now we have only Images in our results list. We don’t have to scroll through pages of results to get to the images. You can further limit by size of image.
Karen mentioned the search engine Yippy earlier this morning. Using the left sidebar on Yippy you can limit your search results by subject areas (they call them Clouds), and you can drill down many layers. Here I limited by Law – Immovable, Common law – and under that Real Estate, etc. You can also limit by type of site.
I saw something this morning about Roundup Weed Killer and wanted to see what people were saying about it, or what news sources they were linking to. To do this I used Twitter Search. This is obviously a hot topic, since 3 more results were found since I started searching just seconds earlier.
Lastly, keep your eyes open for useful websites that are reviewed in the newspapers or your professional newsletters, or that are recommended at seminars such as this. I’ve listed many of these at the end of my paper and will be covering others in my next section on Government Resources. Our other speakers today have also mentioned, and will be mentioning, more useful sites.
In February of this year Marcus Zillman, an expert on the deep Web, published this paper on LLRX.com. It includes many sections; one of the most useful is the list of Deep Web Research Resources.
Related to information hidden on the Web is the issue of what happens to websites, or information contained in those sites, that used to be on the Internet. Perhaps some of you have tried to find a Web page again and found that it was no longer there or that it had changed its appearance. The Wayback Machine, part of the Internet Archive initiative, is one source for archived Internet information. It is an archive of billions of Web pages, with new sites and new versions of sites added regularly.
They’ve just released a new interface, which is pictured here. Notice it’s very clean and simple, much like Google.
For example, if I wanted to look at the website for Partridge Snow & Hahn from 2006, I would enter the URL for the website in the search box and click on SHOW ALL.
I selected the year 2006. The dates on which this URL was crawled by the Wayback Machine are circled in blue. Note that the calendar maps the number of times the site was crawled, not how many times the site was actually updated.
Selecting the Feb 12, 2003 date leads to this version of the website. I can even click on a subpage and bring up this list of attorneys as of that date in 2003.
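The same lookup can also be done by typing an address directly. As a rough sketch, archived pages on the Wayback Machine are commonly addressed as web.archive.org/web/, then a timestamp, then the original URL. The function name, the example.com address, and the date below are placeholders of my own, and I am assuming that a date-only timestamp is accepted and resolves to the nearest capture, so treat this as an illustration rather than a guaranteed recipe.

```python
from datetime import date

# Address prefix commonly used for archived pages on the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, when: date) -> str:
    """Build an address asking for the capture of page_url nearest to 'when'.

    Assumption: a date-only timestamp (YYYYMMDD) is enough; the archive is
    expected to redirect to the closest snapshot it actually holds.
    """
    timestamp = when.strftime("%Y%m%d")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Placeholder site and date, not the firm's real address.
    print(wayback_url("http://www.example.com/", date(2006, 2, 12)))
```

Whether a capture exists near the date you ask for depends on when the site was crawled, which is what the calendar view shows.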
You can also find earlier versions of some Web pages on Google. When Google indexes a Web page, it takes a “snapshot” of it at that time to create its index. Most of these are cached and can be accessed by clicking on the word Cache under each hit in the list. This isn’t a solution for very old information, though, since each time a site is re-indexed (usually at least once a month) a new snapshot is taken, overwriting the older one.
Let’s review when and why you need to use invisible Web resources. Use these resources when you are looking for a precise answer, when you want authoritative, exhaustive results, when timeliness of the content is important, and when you are looking for information in a specific subject area. Resources in the invisible Web are usually more authoritative, more comprehensive, and more limited in scope, so your search results are more precise and relevant. I encourage all of you to use the skills you learn today to search smart, keep up to date and know what is out there and how to find it. Knowing what the invisible Web is and how to find information in it is an essential step to becoming a savvy Web searcher.
I have posted these slides on my website; feel free to call me or contact me by e-mail if you have any questions. I’ve also included my LinkedIn contact info and my Twitter name.
This chart is another way to look at the types of sites and content that are in the Invisible Web. It breaks the Invisible Web into 4 main sub-types: the Opaque Web, the Private Web, the Proprietary Web, and the Truly Invisible Web.