Why we need an independent index of the Web Dirk Lewandowski [email protected] http://www.bui.haw-hamburg.de/lewandowski.html @Dirk_Lew Society of the Query Conference, Amsterdam, 7/11/2013


Page 1: Why we need an independent index of the Web

Why we need an independent index of the Web

Dirk Lewandowski [email protected] http://www.bui.haw-hamburg.de/lewandowski.html @Dirk_Lew Society of the Query Conference, Amsterdam, 7/11/2013

Page 2: Why we need an independent index of the Web

The “local copy” of the Web

•  Web indexing

–  New, changed, deleted documents

–  “Holy grail” of keeping the index complete and current

Risvik, K. M., & Michelsen, R. (2002). Search engines and web dynamics. Computer Networks, 39(3), 289–302.
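To make the three cases on this slide concrete, here is a minimal sketch of how a crawler might sort URLs into new, changed, and deleted documents by comparing a fresh crawl against the stored index. The function names, the use of SHA-256 fingerprints, and the example URLs are illustrative assumptions, not something taken from Risvik & Michelsen (2002).

```python
import hashlib


def fingerprint(content: str) -> str:
    """Hash the document text so that changes can be detected cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_snapshots(index: dict, crawl: dict) -> dict:
    """Compare the stored index (url -> fingerprint) with a fresh crawl (url -> content)."""
    new = sorted(url for url in crawl if url not in index)
    deleted = sorted(url for url in index if url not in crawl)
    changed = sorted(
        url for url in crawl
        if url in index and fingerprint(crawl[url]) != index[url]
    )
    return {"new": new, "changed": changed, "deleted": deleted}


# Hypothetical data: what the index holds vs. what the crawler just fetched.
index = {
    "http://example.org/a": fingerprint("old text"),
    "http://example.org/b": fingerprint("same text"),
    "http://example.org/d": fingerprint("gone soon"),
}
crawl = {
    "http://example.org/a": "new text",
    "http://example.org/b": "same text",
    "http://example.org/c": "fresh page",
}
result = diff_snapshots(index, crawl)
```

The sketch only classifies documents that the crawler has actually revisited; deciding which pages to revisit, and how often, is the hard part that makes completeness and currency a "holy grail".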

Page 3: Why we need an independent index of the Web

Representation of documents in a search engine

Referring documents → Document → Metadata (examples)

[Diagram: referring documents contribute anchor text; the document itself contributes heading 1 and heading 2]

•  From the source code
–  Title
–  Description
–  Keywords
–  Author

•  From the document (document info)
–  Length
–  Date
–  Decay
–  Name of the author

•  From the Web
–  PageRank
–  Number of citations
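The document representation on this slide could be sketched as a simple record type. This is only an illustration: the class name, field names, types, and the `searchable_text` helper are assumptions, and a real search engine stores far more than this.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentRepresentation:
    """Hypothetical record of what an index might store per document."""
    url: str
    # From the source code of the page
    title: str = ""
    description: str = ""
    keywords: List[str] = field(default_factory=list)
    author: str = ""
    # From the document itself (document info)
    length: int = 0
    date: Optional[str] = None
    # From referring documents
    anchor_texts: List[str] = field(default_factory=list)
    # From the Web as a whole
    pagerank: float = 0.0
    citations: int = 0

    def searchable_text(self) -> str:
        """Join the textual fields that a query could be matched against."""
        parts = [self.title, self.description, *self.keywords, *self.anchor_texts]
        return " ".join(p for p in parts if p)


doc = DocumentRepresentation(
    url="http://example.org/page",
    title="Search studies",
    anchor_texts=["Dirk's page"],
)
text = doc.searchable_text()
```

Note that anchor text comes from other pages, so part of the representation cannot be rebuilt from the document alone; it requires a crawl of the referring documents as well.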

Page 4: Why we need an independent index of the Web

The User’s Perspective

•  Everyone uses search engines (Purcell, Brenner & Rainie, 2012; van Eimeren & Frees, 2012)

•  Market is dominated by Google (ComScore data)

•  Users rely on

–  Google’s method of ordering results

–  Google’s method of collecting data

→ If Google hasn’t seen and indexed a document, or hasn’t kept its copy up to date, the document can’t be found with a search query.

Page 5: Why we need an independent index of the Web

Freshness of Web search engines (see Lewandowski, Wahlig & Meyer-Bautor, 2006; Lewandowski, 2008)

[Figure: original page (as of yesterday) compared with Google’s copy (as of yesterday)]

Page 6: Why we need an independent index of the Web

What about the alternatives to Google?

•  Many sites only seem to be search engines

–  Accessing the data of another search engine

–  Representing nothing more than an alternative user interface to one of the more well-known engines

–  In many cases, that turns out to be Google

–  E.g., in Germany, we can see that the major internet portals T-Online, GMX, AOL, and web.de all display results obtained from Google

Page 7: Why we need an independent index of the Web

Why is one search engine not enough?

•  We need more than one search engine to ensure that a broad range of opinions is represented in the search market.

•  Users should be able to choose between the different worldviews that algorithm-based result generation produces.

•  Ideology-free search algorithms are simply not possible

Page 8: Why we need an independent index of the Web

Alternative Search Engine Indexes

•  There are only a handful of search engines that operate their own indexes, due to costs and technical complexity

•  Search engine start-ups

–  Use an existing external index

–  Focus on a specialised topic (which requires only a small index)

–  Aggregate data from different search engines (meta search engine)

•  Genuine search engine start-ups such as Blekko and DuckDuckGo are the exception rather than the rule

Page 9: Why we need an independent index of the Web

Partner model

•  “Real” search engine providers such as Google and Bing operate their own search engines but also provide their search results to partners

•  All the major web portals have now embraced this model.

•  Income through ads; revenue-sharing

•  Attractiveness of the model

–  The search engine provider encounters only minimal costs

–  The operator of the portal no longer needs to go to the great expense of running its own search engine.

–  The partner index model has served to thin out the competition in the search industry.

Page 10: Why we need an independent index of the Web

Access to Search Engine Indexes

•  Application programming interfaces (APIs)

–  No direct access to the search engine index

–  Limited number of top results which have already been ranked by the search engine provider

–  Access via APIs is similar to what meta search engines do

–  The representation of the document in the source search engine is also not included

Page 11: Why we need an independent index of the Web

Alternative Search Engines

•  What constitutes an “alternative search engine”?

–  All search engines that are not Google? (“Google Killers”, e.g., Cuil)

–  Some alternatives are not perceived as such because they are considered to be simply the same as Google (e.g., Bing)

–  Search engines which explicitly position themselves as an alternative to Google through a regional approach (e.g., Seekport)

–  New approaches to search / “Real alternatives”: Alternative approaches to gathering and representing web content

Page 12: Why we need an independent index of the Web

Public Support for Search Engine Technology?

•  Quaero/Theseus: Funding a “Google Killer”?

–  Quaero: Technologies for multimedia searching.

–  Theseus: Semantic technologies for business-to-business applications (without focusing exclusively on search).

•  The proposal to provide government funding for search engine technology has been subject to intense criticism in the past

•  Establish a single alternative?

•  A number of factors could cause it to fail

–  Poor marketing

–  Graphic design of the user interface

–  ...

•  Regardless of the reason, a failure of the new search engine would result in the entire publicly funded initiative failing.

Page 13: Why we need an independent index of the Web

Economic perspective

•  Only the largest internet companies are able to afford large indexes.

•  Microsoft is the only company besides Google to possess a comprehensive search engine index.

•  Yahoo gave up on its own index several years ago

•  Operating a dedicated index appears to be attractive to practically no one, and there are hardly any candidates with the necessary financial resources in any case

Page 14: Why we need an independent index of the Web

The Solution

•  Create the conditions that will make establishing alternative search engines possible

•  We can expect that the possibilities an independent index presents would benefit a number of different companies, individuals, and institutions.

•  The result will be fair competition to develop the best concepts for using the data provided by the index.

Page 15: Why we need an independent index of the Web

Vision

•  “An index of the web that can be accessed at fair conditions for everyone”

–  “Everyone” means that anyone who is interested can access the index.

–  “Fair conditions” does not mean that access to the index must be free of charge for everyone. A certain number of document requests per day should be available at no cost in order to promote non-profit projects.

–  “Access” to the index can be defined as the ability to automatically query the index with ease.

–  The concept “index of the web” is intended to cover as much of the web as possible
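The "fair conditions" point (a certain number of requests per day at no cost) could be sketched as a per-client daily allowance. This is a hypothetical illustration only: the class name, the client identifier, and the allowance of 2 requests are assumptions made for the example, and the talk does not specify any concrete quota or billing scheme.

```python
from collections import defaultdict
from datetime import date


class DailyQuota:
    """Track how many requests each client has made per day."""

    def __init__(self, free_requests_per_day: int):
        self.free_requests_per_day = free_requests_per_day
        self.used = defaultdict(int)  # (client, day) -> number of requests

    def record(self, client: str, day: date) -> bool:
        """Count one request; return True if it falls within the free allowance."""
        key = (client, day)
        self.used[key] += 1
        return self.used[key] <= self.free_requests_per_day


# Hypothetical allowance of 2 free requests per day.
quota = DailyQuota(free_requests_per_day=2)
first_day = [quota.record("nonprofit-project", date(2013, 11, 7)) for _ in range(3)]
next_day = quota.record("nonprofit-project", date(2013, 11, 8))
```

Requests beyond the allowance are not rejected here, only flagged, which matches the idea that access need not be free of charge for everyone but remains open to everyone.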

Page 16: Why we need an independent index of the Web

Funding and operation

•  Funding

–  This type of project cannot be supported by any one country alone. The only feasible option is a pan-European initiative.

•  Who would operate the index?

–  Existing research institution or newly-founded institution

–  The operator of the index should not obtain the exclusive right to determine the way in which the documents are used or made available (→ board of trustees)

Page 17: Why we need an independent index of the Web

Conclusion: Advantages of an independent index of the web

•  Motivate companies, institutions, and developers pursuing personal projects to create their own search applications.

•  The data available on the web is so boundless that it lends itself to countless applications in a broad range of fields.

•  Enable applications we are not yet capable of even imagining.

•  An open structure, transparency with respect to access, and the assurance of permanent availability thanks to state sponsorship would lay the groundwork for innovation.

Page 18: Why we need an independent index of the Web

Thank you

Prof. Dr. Dirk Lewandowski
Hochschule für Angewandte Wissenschaften Hamburg
dirk.lewandowski@haw-hamburg.de
Twitter: Dirk_Lew
http://www.bui.haw-hamburg.de/lewandowski.html
http://www.searchstudies.org