Introduction to Data Structures Vamshi Ambati [email protected]


Overview

- Java you need for the Project
- Search Engine and Data Structures
- THIS
- Code Structure
- On the Data Structure front
  - Dictionaries (Dictionary Structures)
  - Java Collections
  - Linked List
  - Queue

[c] Vamshi Ambati 2

Java you will need for the Project

- Core Programming + I/O and Files
- OOP
  - Inheritance
  - Packages
  - Encapsulation
- Java API
  - Collections

What is a Search Engine?

- A sophisticated tool for finding information on the web
- An index for the World Wide Web
- Analogous to the index of a textbook
- Just imagine a world without search engines!

Why Index in the first place?

Which list is easier to search?

  sow fox pig eel yak hen ant cat dog hog
  ant cat dog eel fox hen hog pig sow yak

- A sorted list always helps
- It permits binary search: about log2(n) probes into the list
- log2(1 billion) ~ 30
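The probe count can be checked with a small sketch. The class name `BinarySearchDemo` and the `probes` counter are illustrative assumptions, not part of the project code; the word list is the sorted list from the slide.

```java
// A minimal sketch of binary search over a sorted array of words.
// BinarySearchDemo and the probes counter are hypothetical names for illustration.
class BinarySearchDemo {
    /** Number of probes made by the most recent call to indexOf. */
    static int probes;

    /** Returns the index of target in the sorted array, or -1 if it is absent. */
    static int indexOf(String[] sorted, String target) {
        probes = 0;
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2; // avoids overflow of (lo + hi)
            probes++;
            int cmp = target.compareTo(sorted[mid]);
            if (cmp == 0) {
                return mid;
            }
            if (cmp < 0) {
                hi = mid - 1; // target sorts before the middle word
            } else {
                lo = mid + 1; // target sorts after the middle word
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] words = {"ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"};
        System.out.println("pig -> index " + indexOf(words, "pig") + ", probes " + probes);
        System.out.println("emu -> index " + indexOf(words, "emu") + ", probes " + probes);
    }
}
```

Each probe halves the remaining range, which is why the number of probes grows with log2(n) rather than with n.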

How search engines work

- Search engines maintain data about web sites in their databases.
- They use programs (often referred to as "spiders" or "robots") to collect information.
- The collected information is then indexed by the search engine.
- Users can then look up words, or combinations of words, found in the index.

Inverted Files

FILE (position markers in the original figure: 1, 10, 20, 30, 36)

  A file is a list of words and this file contains words at various positions. Each entry of the word is associated with a position.

INVERTED FILE

  a (1, 4, 24…)
  entry (17…)
  file (2, 10)
  contains (11,….)
  position (25…)
  positions (15…)
  word (20….)
  words (7, 12..)
  ..
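Building such an inverted file from the sample text can be sketched as follows. This is only an illustration: `InvertedFileDemo` is a hypothetical name, positions are counted in words starting at 1, and `java.util` maps and lists are used for brevity even though the project expects hand-coded data structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A minimal sketch, not the project's Indexer. Names here are hypothetical.
class InvertedFileDemo {
    /** Maps each lower-cased word to the 1-based word positions where it occurs. */
    static Map<String, List<Integer>> invert(String text) {
        Map<String, List<Integer>> index = new TreeMap<>(); // sorted by word
        int position = 0;
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            position++;
            index.computeIfAbsent(token.toLowerCase(), k -> new ArrayList<>()).add(position);
        }
        return index;
    }

    public static void main(String[] args) {
        String file = "A file is a list of words and this file contains words at various positions. "
                + "Each entry of the word is associated with a position.";
        for (Map.Entry<String, List<Integer>> e : invert(file).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

With this counting scheme, `file` maps to positions 2 and 10 and `a` to positions 1, 4 and 24, matching the figure.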

Inverted Files for Multiple Documents

LEXICON (PTR points to the word's rows in the WORD INDEX)

  WORD         NDOCS   PTR
  jezebel      20
  jezer        3
  jezerit      1
  jeziah       1
  jeziel       1
  jezliah      1
  jezoar       1
  jezrahliah   1
  jezreel      39

WORD INDEX

  DOCID   OCCUR   POS 1  POS 2  . . .
  107     4       322 354 381 405
  232     6       15 195 248 1897 1951 2192
  677     1       481
  713     3       42 312 802
  . . .
  34      6       1 118 2087 3922 3981 5002
  44      3       215 2291 3010
  56      4       5 22 134 992
  . . .
  566     3       203 245 287
  67      1       132
  . . .

“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .

A comprehensive form of Inverted Index

SOURCE: http://www.searchtools.com/slides/bestsearch/bls-24.html
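The lexicon and word index can be modelled roughly as below. This is a sketch under assumptions: `PostingsDemo`, `Posting`, `ndocs` and `occurrences` are hypothetical names, and `java.util` collections stand in for the hand-coded structures the project asks for. Only the three “jezebel” rows quoted on the slide are loaded, so the computed NDOCS is 3 rather than the full 20.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// A minimal sketch of a lexicon whose entries point to postings lists.
class PostingsDemo {
    /** One row of the word index: a document and the word's positions in it. */
    static class Posting {
        final int docId;
        final int[] positions;

        Posting(int docId, int... positions) {
            this.docId = docId;
            this.positions = positions;
        }
    }

    /** Lexicon: word -> postings list (the role of the PTR column). */
    final TreeMap<String, List<Posting>> lexicon = new TreeMap<>();

    void add(String word, int docId, int... positions) {
        lexicon.computeIfAbsent(word, k -> new ArrayList<>()).add(new Posting(docId, positions));
    }

    /** NDOCS: number of documents that contain the word. */
    int ndocs(String word) {
        List<Posting> postings = lexicon.get(word);
        return postings == null ? 0 : postings.size();
    }

    /** OCCUR: number of times the word occurs in the given document. */
    int occurrences(String word, int docId) {
        List<Posting> postings = lexicon.get(word);
        if (postings != null) {
            for (Posting p : postings) {
                if (p.docId == docId) {
                    return p.positions.length;
                }
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        PostingsDemo index = new PostingsDemo();
        index.add("jezebel", 34, 1, 118, 2087, 3922, 3981, 5002);
        index.add("jezebel", 44, 215, 2291, 3010);
        index.add("jezebel", 56, 5, 22, 134, 992);
        System.out.println("docs loaded: " + index.ndocs("jezebel"));
        System.out.println("occurrences in doc 34: " + index.occurrences("jezebel", 34));
    }
}
```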

THIS

- Search engine for the website http://www.hinduonnet.com/
- Website of the newspaper The Hindu
- Not for the entire web
- Results are confined to only one web site

Index Structure for our Project (THIS)

Keywords in the index (each keyword points to a list of "URL :: number" entries):

  India, ManMohan, Cricket, Bollywo, Sharukh, Sachin, ….

Entry lists:

  http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
  http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
  ..

  http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
  http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
  ..

  ….

  http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
  http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
  ..

Search Engines

Search Engine Differences

- Coverage (What part of the web do they really cover?)
- Crawling algorithms
  - Frequency of crawl
  - Depth of visits
    - http://www.msitprogram.net/ (depth 0)
    - http://www.msitprogram.net/admissions.html/ (depth 1)
- Indexing policies
  - Data Structures
  - Representation
- Search interfaces
- Ranking
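Depth of visit can be illustrated with a breadth-first traversal over an in-memory link graph. This sketch does no network access; `CrawlDepthDemo`, the `links` map and the third URL are hypothetical, and `java.util` collections are used only to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of a depth-limited, breadth-first crawl over a fake link graph.
class CrawlDepthDemo {
    /** Returns every URL reached from seed within maxDepth links, with its depth. */
    static Map<String, Integer> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        Map<String, Integer> depthOf = new LinkedHashMap<>(); // remembers visiting order
        ArrayDeque<String> frontier = new ArrayDeque<>();     // FIFO queue of URLs to visit
        depthOf.put(seed, 0);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            int depth = depthOf.get(url);
            if (depth == maxDepth) {
                continue; // do not follow links beyond the depth limit
            }
            for (String next : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (!depthOf.containsKey(next)) { // skip pages already seen
                    depthOf.put(next, depth + 1);
                    frontier.add(next);
                }
            }
        }
        return depthOf;
    }

    public static void main(String[] args) {
        String home = "http://www.msitprogram.net/";
        String admissions = "http://www.msitprogram.net/admissions.html/";
        String deeper = "http://www.msitprogram.net/hypothetical-page.html"; // made up for the demo
        Map<String, List<String>> links = new HashMap<>();
        links.put(home, Arrays.asList(admissions));
        links.put(admissions, Arrays.asList(deeper, home));
        System.out.println(crawl(links, home, 1));
    }
}
```

With a depth limit of 1 the home page is visited at depth 0 and the admissions page at depth 1, while the hypothetical deeper page is never reached.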

[Figure: Search Engine, Index]

[Figure: Crawl, Search, Index]

[Figure: search engine architecture. Crawl side: TheWeb, Spider (crawl), Parser (parse), URLList (getNextUrl, addUrls), Indexer (addPage), Index (store). Search side: Query, Index (retrieve), ResultSet, Sort by Rank, FinalResult, makePage, ResultPage.]

Where do our data structures and algorithms fit in?

[Figure: the architecture diagram repeated, annotated with the data structures and algorithms it uses: Queue, Priority Queue, Hashtable, BinaryTree, LinkedList, MergeSort & InsertionSort.]

Code Structure (THIS)

[Figure: class diagram. Classes: PageElement, PageImg, PageHref, PageWord, Spider, WebSpider, Queue, PageLexer, HttpTokenizer, URLTextReader, CrawlerDriver, SearchDriver, Query, Indexer, Index, DictionaryInterface, ListDictionary, TreeDictionary, HashDictionary, DictionaryDriver. Operations shown: Crawl, Parse, addPage, Index, Save, Restore. Legend: Inheritance, Uses, Calls.]

Dictionary Structures (Lexicon)

- A Dictionary is an unordered container that contains key-element pairs
- An Ordered Dictionary keeps its elements in sorted order
- Keys are unique, but the values can be anything

Dictionary ADT

- size(): Returns the number of items in D. Output: Integer
- isEmpty(): Tests whether D is empty. Output: Boolean
- elements(): Returns the elements stored in D. Output: iterator of elements (objects)
- keys(): Returns the keys stored in D. Output: iterator of keys (objects)
- findElement(k): If D contains an item with key == k, returns the element of that item, else returns NO_SUCH_KEY. Output: Object
- findAllElements(k): Output: iterator of elements with key k
- insertItem(k,e): Inserts an item with element e and key k into D.
- removeElement(k): Removes an item with key == k and returns it. If there is no such element, returns NO_SUCH_KEY. Output: Object (element)
- removeAllElements(k): Removes from D the items with key == k. Output: iterator of elements

Also see the Java Standard API for Dictionary http://java.sun.com/j2se/1.4.2/docs/api/java/util/Dictionary.html
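The ADT operations can be written down as a Java interface. The sketch below is one possible reading, not the project's DictionaryInterface: the names `DictionaryADT` and `LinkedDictionary` are hypothetical, the list-based storage is an assumption, and the iterators are built from `java.util.ArrayList` purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** The Dictionary ADT operations, written as a Java interface (hypothetical name). */
interface DictionaryADT {
    Object NO_SUCH_KEY = new Object(); // sentinel returned when a key is absent

    int size();
    boolean isEmpty();
    Iterator<Object> elements();
    Iterator<Object> keys();
    Object findElement(Object k);
    Iterator<Object> findAllElements(Object k);
    void insertItem(Object k, Object e);
    Object removeElement(Object k);
    Iterator<Object> removeAllElements(Object k);
}

/** Unordered dictionary stored as a singly linked list of (key, element) items. */
class LinkedDictionary implements DictionaryADT {
    private static class Node {
        final Object key;
        final Object element;
        Node next;
        Node(Object key, Object element, Node next) {
            this.key = key; this.element = element; this.next = next;
        }
    }

    private Node head;
    private int size;

    public int size() { return size; }
    public boolean isEmpty() { return size == 0; }

    public Iterator<Object> elements() {
        List<Object> out = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) out.add(n.element);
        return out.iterator();
    }

    public Iterator<Object> keys() {
        List<Object> out = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) out.add(n.key);
        return out.iterator();
    }

    public Object findElement(Object k) {
        for (Node n = head; n != null; n = n.next) {
            if (n.key.equals(k)) return n.element;
        }
        return NO_SUCH_KEY;
    }

    public Iterator<Object> findAllElements(Object k) {
        List<Object> out = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) {
            if (n.key.equals(k)) out.add(n.element);
        }
        return out.iterator();
    }

    public void insertItem(Object k, Object e) {
        head = new Node(k, e, head); // insert at the front: O(1)
        size++;
    }

    public Object removeElement(Object k) {
        Node prev = null;
        for (Node n = head; n != null; prev = n, n = n.next) {
            if (n.key.equals(k)) {
                if (prev == null) head = n.next; else prev.next = n.next;
                size--;
                return n.element;
            }
        }
        return NO_SUCH_KEY;
    }

    public Iterator<Object> removeAllElements(Object k) {
        List<Object> removed = new ArrayList<>();
        Object e;
        while ((e = removeElement(k)) != NO_SUCH_KEY) removed.add(e);
        return removed.iterator();
    }

    public static void main(String[] args) {
        LinkedDictionary d = new LinkedDictionary();
        d.insertItem("cricket", "page1");
        d.insertItem("cricket", "page2");
        System.out.println(d.size() + " items; missing key found: "
                + (d.findElement("india") != NO_SUCH_KEY));
    }
}
```

An unordered list makes insertItem constant-time but every lookup linear; the tree and hash variants named in the code structure trade that off differently.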

Dictionary ADT in THIS Project

- size(): Returns the number of items in D. Output: Integer
- isEmpty(): Tests whether D is empty. Output: Boolean
- getKeys(): Returns all the keys of the elements stored in D. Output: String array (ideally it should be a Vector!)
- getValue(k): If D contains an item with key == k, returns the element of that item, else returns NULL. Output: Object
- insertItem(k,e): Inserts an item with element e and key k into D.
- remove(k): Removes the item with key k from D.

We have customized the Dictionary a bit, since we only insert items of type <String,Object>.
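A hash-based version of this customized dictionary could look like the sketch below. It is not the project's HashDictionary: the class name `SketchHashDictionary`, the fixed table size, separate chaining, and the choice to overwrite the value when a key is inserted twice are all assumptions made for illustration.

```java
// A minimal sketch of a <String,Object> dictionary using hashing with separate chaining.
class SketchHashDictionary {
    private static class Entry {
        final String key;
        Object value;
        Entry next;
        Entry(String key, Object value, Entry next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private final Entry[] buckets = new Entry[64]; // fixed size keeps the sketch short
    private int size;

    private int bucketOf(String k) {
        return (k.hashCode() & 0x7fffffff) % buckets.length; // clear the sign bit first
    }

    public int size() { return size; }
    public boolean isEmpty() { return size == 0; }

    /** Returns all keys, in no particular order. */
    public String[] getKeys() {
        String[] keys = new String[size];
        int i = 0;
        for (Entry head : buckets) {
            for (Entry e = head; e != null; e = e.next) keys[i++] = e.key;
        }
        return keys;
    }

    /** Returns the value stored under k, or null if k is absent. */
    public Object getValue(String k) {
        for (Entry e = buckets[bucketOf(k)]; e != null; e = e.next) {
            if (e.key.equals(k)) return e.value;
        }
        return null;
    }

    public void insertItem(String k, Object v) {
        int b = bucketOf(k);
        for (Entry e = buckets[b]; e != null; e = e.next) {
            if (e.key.equals(k)) { e.value = v; return; } // keys are unique: overwrite
        }
        buckets[b] = new Entry(k, v, buckets[b]);
        size++;
    }

    public void remove(String k) {
        int b = bucketOf(k);
        Entry prev = null;
        for (Entry e = buckets[b]; e != null; prev = e, e = e.next) {
            if (e.key.equals(k)) {
                if (prev == null) buckets[b] = e.next; else prev.next = e.next;
                size--;
                return;
            }
        }
    }

    public static void main(String[] args) {
        SketchHashDictionary d = new SketchHashDictionary();
        d.insertItem("India", "list of URLs for India");
        d.insertItem("Cricket", "list of URLs for Cricket");
        System.out.println(d.size() + " keys, India -> " + d.getValue("India"));
    }
}
```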

Java Collections

- java.util.* (a quite helpful library)
- Has implementations for most of the data structures
- They make life really easy
- You cannot use the built-in data structures unless specified (e.g. Task1 Tasklet-A)
- Use them for non-data-structural purposes
  - Collections, e.g. Arrays, Vectors, Iterators, Lists, Sets etc.
- You will definitely be using "Iterator" at least, as you will be dealing with many objects at a time!
  http://java.sun.com/j2se/1.4.2/docs/api/java/util/Iterator.html

See: http://java.sun.com/docs/books/tutorial/collections/
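As an illustration of how a hand-coded container can still offer an Iterator, here is a small sketch. `WordBag` is a hypothetical class, not part of the provided framework; only `java.util.Iterator` and `java.util.NoSuchElementException` come from the library.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// A minimal sketch: a hand-coded growable array that exposes java.util.Iterator.
class WordBag implements Iterable<String> {
    private String[] items = new String[4];
    private int count;

    void add(String word) {
        if (count == items.length) { // array is full: grow by doubling
            String[] bigger = new String[items.length * 2];
            System.arraycopy(items, 0, bigger, 0, count);
            items = bigger;
        }
        items[count++] = word;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int next = 0; // index of the next word to hand out

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return items[next++];
            }
        };
    }

    public static void main(String[] args) {
        WordBag bag = new WordBag();
        bag.add("India");
        bag.add("Cricket");
        bag.add("Sachin");

        // Explicit iterator use
        Iterator<String> it = bag.iterator();
        while (it.hasNext()) {
            System.out.println(it.next());
        }

        // The for-each loop uses the same iterator behind the scenes
        for (String word : bag) {
            System.out.println(word);
        }
    }
}
```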

Other Data Structures

- Queue
- LinkedList
  - Beware! There are no pointers in Java
  - However, there are "references"
  - Learn more about references in Java
- Do not use the java.util package for data structures or sorting algorithms! You are expected to code them
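A queue built from linked nodes shows how references take the place of pointers. This is a sketch only: `LinkedQueue` and its method names are assumptions, not the Queue class from the provided framework.

```java
// A minimal sketch of a FIFO queue coded by hand with object references.
class LinkedQueue {
    private static class Node {
        final Object item;
        Node next; // a reference to the following node, or null at the end
        Node(Object item) {
            this.item = item;
        }
    }

    private Node head; // next node to leave the queue
    private Node tail; // most recently added node
    private int size;

    void enqueue(Object item) {
        Node n = new Node(item);
        if (tail == null) {
            head = n;      // queue was empty
        } else {
            tail.next = n; // link the old tail to the new node
        }
        tail = n;
        size++;
    }

    Object dequeue() {
        if (head == null) {
            throw new IllegalStateException("queue is empty");
        }
        Object item = head.item;
        head = head.next;  // the old head becomes unreachable and is garbage collected
        if (head == null) {
            tail = null;
        }
        size--;
        return item;
    }

    boolean isEmpty() {
        return head == null;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        LinkedQueue urls = new LinkedQueue();
        urls.enqueue("http://www.hinduonnet.com/");
        urls.enqueue("http://www.hindu.com/2004/10/09/stories/2004100904051900.htm");
        while (!urls.isEmpty()) {
            System.out.println(urls.dequeue());
        }
    }
}
```

Assigning `head = head.next` only copies a reference; no memory is freed explicitly, which is the main practical difference from pointer-based code.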

Summary

- Learn data structures by implementing
- THIS
  - Mini version of a real search engine
  - Framework is provided
- More details in the next video

THANK YOU
