Steve Cassidy Computing at Macquarie No 1 Searching The Web Steve Cassidy Centre for Language Technology Department of Computing Macquarie University


Steve Cassidy Computing at Macquarie No 1

Searching The Web

Steve Cassidy
Centre for Language Technology
Department of Computing
Macquarie University


Steve Cassidy Computing at Macquarie No 2

The First Web Page


Steve Cassidy Computing at Macquarie No 3

What is the Web?

• Documents, text, images, sound
• A web of hyperlinks
  – Link one (text) document to others
• Easy to join
  – Any Internet user can be a publisher
• Anarchic
  – No-one is in charge
• Very big


Steve Cassidy Computing at Macquarie No 4

The Problem

• Much of the information available is text-based
• Text is difficult to process by computers
• The popular use of computers and the Internet has increased the availability of text-based information
• Information Overload


Steve Cassidy Computing at Macquarie No 5

The Solution?

Only one of the top four commercial search engines finds itself

The best navigation should make it easy to find almost anything on the web (once all the data is entered)

The Web 1997


Steve Cassidy Computing at Macquarie No 6

How do they work?

• Two major steps
  – Build an inverted index
  – Match query terms in the index
• Problems
  – The web is very big
  – Finding relevant documents
  – Avoiding false hits


Steve Cassidy Computing at Macquarie No 7

Inverted Index

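[Figure: three documents and the inverted index built from them]

Documents:
D1: computer, software, information, language
D2: computer, document, library, retrieval
D3: computer, information, retrieval, filtering

Index:
computer → D1 D2 D3
software → D1
information → D1 D3
language → D1
document → D2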
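The figure's mapping from terms to documents can be sketched in a few lines of Python. This is only an illustration: the three toy documents follow the slide as far as it can be read (the placement of "document" in D2 is an assumption), and the function name `build_inverted_index` is made up for this example.

```python
from collections import defaultdict

# Toy documents based on the slide's example; real pages would first be
# split into words. The contents of D2 are partly an assumption.
documents = {
    "D1": ["computer", "software", "information", "language"],
    "D2": ["computer", "document", "library", "retrieval"],
    "D3": ["computer", "information", "retrieval", "filtering"],
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(documents)
print(index["computer"])     # ['D1', 'D2', 'D3']
print(index["information"])  # ['D1', 'D3']
```

The index is "inverted" because it is keyed by term rather than by document, so looking up a word does not require scanning every page.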


Steve Cassidy Computing at Macquarie No 8

Building the Index

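[Figure: the crawl loop. List of web addresses → Download web page → Parse web page → Index. Parsing produces new links, which go back onto the list of web addresses, and web page text, which goes into the index.]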
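The loop in the diagram might look roughly like the sketch below. To keep it runnable without a network, downloading is replaced by a lookup in a small hypothetical dictionary (`FAKE_WEB`, with made-up example URLs); the class and function names are also illustrative and not taken from any real search engine.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical stand-in for the web: URL -> HTML source.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">computer software</a>',
    "http://b.example/": '<p>information retrieval</p>'
                         '<a href="http://a.example/">back</a>',
}


class PageParser(HTMLParser):
    """Collect outgoing links and visible words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seeds, fetch):
    """Download each page, parse it, index its text, queue its new links."""
    queue = deque(seeds)          # the list of web addresses
    seen = set(seeds)
    index = {}                    # word -> set of URLs containing it
    while queue:
        url = queue.popleft()
        html = fetch(url)         # "download web page"
        if html is None:
            continue
        parser = PageParser()     # "parse web page"
        parser.feed(html)
        parser.close()
        for word in parser.words:         # web page text goes to the index
            index.setdefault(word, set()).add(url)
        for link in parser.links:         # new links go back on the list
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl(["http://a.example/"], FAKE_WEB.get)
print(sorted(index["retrieval"]))  # ['http://b.example/']
```

A real crawler would also need politeness delays, error handling and URL normalisation, all of which are left out here.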


Steve Cassidy Computing at Macquarie No 9

Building the Index

[Figure: the same crawl loop, shown with an example of the HTML source of a downloaded web page:]

<table width="70%" border="0" cellspacing="0" cellpadding="4"> <tr> <td style="background-color: #f1e1f1"> <a name="works"><b><font face="Arial, sans-serif">How Google Works </font></b></a></td> </tr>

</table>

<p><a name="howGoogleWorks">If you aren't interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price's wonderful description of How Search Engines Work in Chapter 2 of <a ref="http://www.amazon.com/exec/obidos/tg/detail/-/091096551X/002-5190375-1505602">The Invisible Web</a> (CyberAge Books, 2001).</a>
<p><a name="fast"><a name="index">Google consists of three distinct parts, each of which is run on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing.</a></a>


Steve Cassidy Computing at Macquarie No 10

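Using the Index

Query: computer software information

[Figure: the query terms are looked up in the inverted index]

Index:
computer → D1 D2 D3
software → D1
information → D1 D3
language → D1
document → D2

Document lists matched by the query:
D1 D2 D3 (computer)
D1 D3 (information)
D1 (software)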
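Matching the query against the index can be sketched as a set intersection. The sketch assumes that every query term must appear in a result (AND semantics), which the slide suggests but does not state; the function name `search` is illustrative.

```python
# The index from the slide, with each term mapped to a set of document ids.
index = {
    "computer": {"D1", "D2", "D3"},
    "software": {"D1"},
    "information": {"D1", "D3"},
    "language": {"D1"},
    "document": {"D2"},
}


def search(index, query):
    """Return the documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Unknown terms match nothing, so they get an empty document set.
    postings = [index.get(term, set()) for term in terms]
    # Start from the shortest list so the running result stays small.
    postings.sort(key=len)
    result = set(postings[0])
    for docs in postings[1:]:
        result &= docs
    return sorted(result)


print(search(index, "computer software information"))  # ['D1']
```

Each extra term can only shrink the result, which is why adding query terms narrows a search.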


Steve Cassidy Computing at Macquarie No 11

Server Farm

http://www.microsoft.com/technet/archive/windows2000serv/plan/hiavsys.mspx

Over 10,000 computers
Each with a copy of the index


Steve Cassidy Computing at Macquarie No 12

Relevance

• Finding pages with search terms is easy
• Which ones are the best?
• Google:
  – Text in titles, headings is important
  – Text earlier in the page is important
  – Text of links to this page is important
  – Important pages link to other important pages
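The last point, that important pages link to other important pages, is the idea behind the link-analysis method usually called PageRank. The sketch below is a simplified textbook version, not Google's actual ranking: the four-page link graph is invented, and the damping factor of 0.85 is the value conventionally quoted for PageRank.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank everywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best)  # C
```

Page C comes out on top because three pages link to it, and A ranks second because the important page C links to it.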


Steve Cassidy Computing at Macquarie No 13

Making the Most of Search Engines

• Use words likely to appear in the pages you want
• Use more query terms to narrow your result
• Be brief
• Don’t worry about spelling
• Use “words in quotes” to search for phrases
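One way a phrase search could work is to store word positions in the index and require the words to appear next to each other. This is a sketch of that general technique under stated assumptions, not a description of how any particular search engine does it; the two example documents and the function names are invented.

```python
def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = {}
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return documents where the words of the phrase occur consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit exactly i places after the first.
            if all(start + i in index[word].get(doc_id, [])
                   for i, word in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    "D1": "information retrieval is fun",
    "D2": "retrieval of information",
}
index = build_positional_index(docs)
print(phrase_search(index, "information retrieval"))  # ['D1']
```

Both documents contain both words, so an ordinary query would return D1 and D2; the quoted phrase matches only D1, where the words are adjacent and in order.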


Steve Cassidy Computing at Macquarie No 14

Other Search Engines

• www.teoma.com
  – Offers ‘refine your search’
  – Subject specific popularity
• www.ask.com
  – Natural language questions
• search.yahoo.com


Steve Cassidy Computing at Macquarie No 15

The Future

• Information Extraction
  – Find all the details of this conference for my diary
• Question Answering
  – When did Armstrong land on the moon?
• The Semantic Web
  – Exchanging machine readable data


Steve Cassidy Computing at Macquarie No 16

Language Technology

• SLP148 Language, Logic and Computation
• COMP248 Language Technology
• COMP249 Web Technology
• COMP348 Document Processing and the Semantic Web
• COMP349 Spoken Language Dialogue Systems


Steve Cassidy Computing at Macquarie No 17

Questions?