Why we need an independent index of the Web

Dirk Lewandowski, http://www.bui.haw-hamburg.de/lewandowski.html, @Dirk_Lew. Society of the Query Conference, Amsterdam, 7/11/2013

The “local copy” of the Web

•  Web Indexing

–  New, changed, and deleted documents

–  “Holy grail” of keeping the index complete and current

Risvik, K. M., & Michelsen, R. (2002). Search engines and web dynamics. Computer Networks, 39(3), 289–302.
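
A minimal sketch of the bookkeeping behind this slide, assuming the index stores one content hash per URL. The function names `content_hash` and `diff_crawl` and the sample URLs are hypothetical and are not taken from Risvik & Michelsen (2002). A real crawler also has to decide when to revisit each page, which this sketch ignores.

```python
import hashlib


def content_hash(body: str) -> str:
    """Fingerprint a document body so that changes can be detected cheaply."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def diff_crawl(index: dict[str, str], crawl: dict[str, str]) -> dict[str, set[str]]:
    """Compare indexed hashes (url -> hash) with a fresh crawl (url -> body)."""
    fresh = {url: content_hash(body) for url, body in crawl.items()}
    return {
        "new": set(fresh) - set(index),
        "deleted": set(index) - set(fresh),
        "changed": {u for u in fresh.keys() & index.keys() if fresh[u] != index[u]},
    }


# Hypothetical example data: the "local copy" versus what the crawler just saw.
index = {
    "http://example.org/a": content_hash("old text"),
    "http://example.org/b": content_hash("same text"),
    "http://example.org/d": content_hash("page that will vanish"),
}
crawl = {
    "http://example.org/a": "new text",
    "http://example.org/b": "same text",
    "http://example.org/c": "brand new page",
}
result = diff_crawl(index, crawl)
```

Here `result` separates the three cases named on the slide; every URL that falls into one of them marks a point where the local copy is out of date until the index is updated.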

Representation of documents in a search engine

Referring documents → Document → Metadata (examples)

[Diagram: a document containing heading 1 and heading 2; three referring documents, each contributing anchor text]

From the source code: Title, Description, Keywords, Author

From the document (document info): Length, Date, Decay, Name of the author

From the Web: PageRank, Number of citations
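
The three sources of information listed above could be held in one record per document. The following is a sketch only: the class name `DocumentRepresentation`, the field names, and the `searchable_text` helper are assumptions for illustration, not the data model of any actual search engine. "Decay" is left out because the slide does not define it.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DocumentRepresentation:
    url: str
    # From the source code
    title: str = ""
    description: str = ""
    keywords: list[str] = field(default_factory=list)
    author: str = ""
    # From the document itself (document info)
    length: int = 0
    last_modified: Optional[date] = None
    # From the Web
    anchor_texts: list[str] = field(default_factory=list)
    pagerank: float = 0.0
    citation_count: int = 0

    def searchable_text(self) -> str:
        """Join on-page fields with anchor text taken from referring documents."""
        parts = [self.title, self.description, *self.keywords, *self.anchor_texts]
        return " ".join(p for p in parts if p)


# Hypothetical example document.
doc = DocumentRepresentation(
    url="http://example.org/",
    title="Example",
    anchor_texts=["an example site", "sample page"],
)
text = doc.searchable_text()
```

The point of the sketch is that part of the representation (anchor text, link-based scores) does not come from the document at all, so it cannot be rebuilt from the page alone.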

The User’s Perspective

•  Everyone uses search engines (Purcell, Brenner & Rainie, 2012; van Eimeren & Frees, 2012)

•  Market is dominated by Google (ComScore data)

•  Users rely on

–  Google’s method of ordering results

–  Google’s method of collecting data

→ If Google has not seen and indexed a document, or has not kept its copy up to date, the document cannot be found with a search query.

Freshness of Web search engines (see Lewandowski, Wahlig & Meyer-Bautor, 2006; Lewandowski, 2008)

[Screenshots: the original page (as of yesterday) next to Google's copy (as of yesterday)]

What about the alternatives to Google?

•  Many “seems to be” search engines

–  Accessing the data of another search engine

–  Representing nothing more than an alternative user interface to one of the more well-known engines

–  In many cases, that turns out to be Google

–  E.g., in Germany, we can see that the major internet portals T-Online, GMX, AOL, and web.de all display results obtained from Google

Why is one search engine not enough?

•  We need more than one search engine to ensure that a broad range of opinions is represented in the search market.

•  Users should have the choice between different worldviews, each of which is a product of algorithm-based search result generation

•  Ideology-free search algorithms are simply not possible

Alternative Search Engine Indexes

•  There are only a handful of search engines that operate their own indexes, due to costs and technical complexity

•  Search engine start-ups

–  Use an existing external index

–  Focus on a specialised topic (which requires only a small index)

–  Aggregate data from different search engines (meta search engine)

•  Genuine search engine start-ups like Blekko and DuckDuckGo are the exception rather than the rule

Partner model

•  “Real” search engine providers such as Google and Bing operate their own search engines but also provide their search results to partners

•  All the major web portals have now embraced this model.

•  Income through ads; revenue-sharing

•  Attractiveness of the model

–  The search engine provider encounters only minimal costs

–  The operator of the portal no longer needs to go to the great expense of running its own search engine.

–  The partner index model has served to thin out the competition in the search industry.

Access to Search Engine Indexes

•  Application programming interfaces (APIs)

–  No direct access to the search engine index

–  Limited number of top results which have already been ranked by the search engine provider

–  Access via APIs resembles what meta search engines do: they also see only result lists, not the index

–  The representation of the document in the source search engine is also not included
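
The difference between API access and index access can be illustrated with a toy inverted index. Everything here is hypothetical: the corpus, the term-frequency ranking, and the function names `api_search` and `index_lookup` stand in for a provider's real systems, which are not public.

```python
from collections import defaultdict

# Toy corpus; URLs and texts are made up for illustration.
DOCS = {
    "u1": "open web index for europe",
    "u2": "web search engine index",
    "u3": "search engine market share",
    "u4": "independent web index proposal",
}


def build_index(docs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Inverted index: term -> {url: term frequency}."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for url, text in docs.items():
        for term in text.split():
            index[term][url] = index[term].get(url, 0) + 1
    return index


INDEX = build_index(DOCS)


def api_search(query: str, top_k: int = 2) -> list[str]:
    """What a partner sees: at most top_k URLs, already ranked by the provider."""
    scores: dict[str, int] = defaultdict(int)
    for term in query.split():
        for url, tf in INDEX.get(term, {}).items():
            scores[url] += tf
    # Provider-side ranking: higher score first, URL as a tie-breaker.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return ranked[:top_k]


def index_lookup(term: str) -> dict[str, int]:
    """What direct index access would allow: the full posting list for a term."""
    return dict(INDEX.get(term, {}))


api_results = api_search("web index")
postings = index_lookup("index")
```

In this sketch `api_results` contains two URLs although three documents match, and the caller has no way to apply its own ranking to the third. `postings` exposes all matches together with the stored term data, which is the kind of access the API model withholds.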

Alternative Search Engines

•  What constitutes an “alternative search engine”?

–  All search engines that are not Google? (“Google Killers”, e.g., Cuil)

–  Some alternatives are not perceived as such because they are considered to be simply the same as Google (e.g., Bing)

–  Search engines which explicitly position themselves as an alternative to Google through a regional approach (e.g., Seekport)

–  New approaches to search / “Real alternatives”: Alternative approaches to gathering and representing web content

Public Support for Search Engine Technology?

•  Quaero/Theseus: Funding a “Google Killer”?

–  Quaero: Technologies for multimedia searching.

–  Theseus: Semantic technologies for business-to-business applications (without focusing exclusively on search).

•  The proposal to provide government funding for search engine technology has been subject to intense criticism in the past

•  Establish a single alternative?

•  A number of factors could cause it to fail

–  Poor marketing

–  Graphic design of the user interface

–  ...

•  Regardless of the reason, a failure of the new search engine would result in the entire publicly funded initiative failing.

Economic perspective

•  Only the largest internet companies are able to afford large indexes.

•  Microsoft is the only company besides Google to possess a comprehensive search engine index.

•  Yahoo gave up on its own index several years ago

•  It appears as though operating a dedicated index is attractive to practically no one, and there are hardly any candidates with the necessary financial resources in any case

The Solution

•  Create the conditions that will make establishing alternative search engines possible

•  We can expect that the possibilities an openly accessible index presents would benefit a number of different companies, individuals, and institutions.

•  The result will be fair competition to develop the best concepts for using the data provided by the index.

Vision

•  “An index of the web that can be accessed at fair conditions for everyone”

–  “Everyone” means that anyone who is interested can access the index.

–  “Fair conditions” does not mean that access to the index must be free of charge for everyone. A certain number of document requests per day should be available at no cost in order to promote non-profit projects.

–  “Access” to the index can be defined as the ability to automatically query the index with ease.

–  The concept “index of the web” is intended to cover as much of the web as possible
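
The “fair conditions” item mentions a number of free document requests per day. One way such a free tier could be tracked is sketched below. The class name `QuotaTracker`, the client identifier, and the limit of 1000 requests are assumptions; the slides name no number and describe no mechanism.

```python
from collections import defaultdict
from datetime import date

FREE_REQUESTS_PER_DAY = 1000  # assumed value; the slides give no figure


class QuotaTracker:
    """Counts document requests per client and calendar day."""

    def __init__(self, free_per_day: int = FREE_REQUESTS_PER_DAY):
        self.free_per_day = free_per_day
        self._used: dict[tuple[str, date], int] = defaultdict(int)

    def record_request(self, client_id: str, day: date) -> bool:
        """Count one request; return True if it falls within the free tier."""
        self._used[(client_id, day)] += 1
        return self._used[(client_id, day)] <= self.free_per_day


# Hypothetical usage with a tiny limit so the effect is visible.
tracker = QuotaTracker(free_per_day=2)
first_day = [tracker.record_request("ngo", date(2013, 11, 7)) for _ in range(3)]
next_day = tracker.record_request("ngo", date(2013, 11, 8))
```

Requests beyond the free tier are still counted rather than rejected, which matches the idea that access need not be free of charge for everyone; how they would be billed is outside this sketch.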

Funding and operation

•  Funding

–  This type of project cannot be supported by any one country alone. The only feasible option is a pan-European initiative.

•  Who would operate the index?

–  Existing research institution or newly-founded institution

–  The operator of the index should not obtain the exclusive right to determine the way in which the documents are used or made available (→ Board of trustees)

Conclusion: Advantages of an independent index of the web

•  Motivate companies, institutions, and developers pursuing personal projects to create their own search applications.

•  The data available on the web is so boundless that it lends itself to countless applications in a broad range of fields.

•  Enable applications we are not yet capable of even imagining.

•  An open structure, transparency with respect to access, and the assurance of permanent availability thanks to state sponsorship would lay the groundwork for innovation.

Thank you

Prof. Dr. Dirk Lewandowski, Hochschule für Angewandte Wissenschaften Hamburg, dirk.lewandowski@haw-hamburg.de, Twitter: Dirk_Lew, http://www.bui.haw-hamburg.de/lewandowski.html, http://www.searchstudies.org
