Wednesday, December 06, 2006

New Frontiers of Internet BS -- The Deep Web


The deep web (or invisible web or hidden web) is the name given to pages on the World Wide Web that are not part of the surface web that is indexed by common search engines. It consists of pages which are not linked to by other pages (e.g., dynamic pages which are returned in response to a submitted query). The deep web also includes sites that require registration or otherwise limit access to their pages (e.g., using the Robots Exclusion Standard), prohibiting search engines from browsing them and creating cached copies. Pages that are only accessible through links produced by JavaScript and Flash also often reside in the deep web since most search engines are unable to properly follow these links.

It is estimated that the deep web is several orders of magnitude larger than the surface web (Bergman, 2001).

More.

I just heard someone say that the "deep web" is one of the future frontiers of web searching. I think this is B.S. The reason the Robots Exclusion Standard is in place is to keep web crawlers from going where they are not supposed to go and messing up your (dynamic, database-driven) site by requesting non-static content.
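For anyone who has not seen the Robots Exclusion Standard in action, here is a minimal sketch of how a well-behaved crawler checks it, using Python's standard `urllib.robotparser`. The robots.txt rules, the `example.com` URLs, the `ExampleBot` user agent, and the `allowed` helper are all made up for illustration; they are assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a database-driven site (illustration only).
# It asks every crawler to stay away from the dynamic search pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cgi-bin/
"""


def allowed(url, agent="ExampleBot"):
    """Return True if the hypothetical rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network request is made here.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A static page versus a dynamic, query-driven page.
    print(allowed("http://example.com/about.html"))
    print(allowed("http://example.com/search?q=deep+web"))
```

Note that the standard is purely advisory: the site publishes the rules and the crawler chooses to honor them. Nothing in the sketch above actually blocks a request.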

The only point I can see is being able to search non-public (i.e., paid-subscription) content, but if one has to pay to access it, what's the point?

More BS here:

  • Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
  • The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of information in the surface Web.
  • The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.
  • More than 200,000 deep Web sites presently exist.
  • Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web forty times.
  • On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
  • The deep Web is the fastest-growing category of new information on the Internet.
  • Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
  • Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
  • Deep Web content is highly relevant to every information need, market, and domain.
  • More than half of the deep Web content resides in topic-specific databases.
  • A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions.

Since they are missing the deep Web when they use such search engines, Internet searchers are therefore searching only 0.03% -- or one in 3,000 -- of the pages available to them today.
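A quick sanity check of the arithmetic, using only the numbers quoted in the list above. This is a rough sketch, and the variable names are my own. Going by the quoted document counts, the surface web works out to roughly 0.18% of all pages (about one in 551), which does not match the 0.03% (one in 3,000) figure; that figure presumably rests on some other count that the excerpt does not give.

```python
# Figures exactly as quoted in the list above (attributed to Bergman, 2001).
deep_tb = 7500          # terabytes in the deep Web
surface_tb = 19         # terabytes in the surface Web
deep_docs = 550e9       # documents in the deep Web
surface_docs = 1e9      # documents in the surface Web
top60_tb = 750          # terabytes held by the sixty largest deep-Web sites

# How many times larger the deep Web is, by size and by document count.
size_ratio = deep_tb / surface_tb
doc_ratio = deep_docs / surface_docs

# How many times the sixty largest sites exceed the surface Web in size.
top60_ratio = top60_tb / surface_tb

# Share of all documents that a surface-only search can reach.
share_by_docs = surface_docs / (deep_docs + surface_docs)

# The share implied by the quoted "one in 3,000" claim.
share_claimed = 1 / 3000

if __name__ == "__main__":
    print(f"size ratio:           {size_ratio:.1f}x")
    print(f"document ratio:       {doc_ratio:.0f}x")
    print(f"top-60 vs surface:    {top60_ratio:.1f}x")
    print(f"share by documents:   {share_by_docs:.4%}")
    print(f"share as claimed:     {share_claimed:.4%}")
```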
