LECTURE 21: INDEXED FILES CSC 213 – Large Scale Programming


Page 2: Lecture 21: Indexed Files

Today’s Goals

  • Look at how Dictionaries are used in the real world
      • Where this would occur & why they are used there
      • In a real-world setting, what problems can/do occur
  • Indexed file usage presented and shown
      • How & why we split index & data files
      • Formatting of each file and how they get used
      • Describe what problems are solved using indexed files
      • Java coding techniques that simplify using these files
      • Idea needed when using multiple indexes shown

Page 3: Lecture 21: Indexed Files

Dictionaries in Real World

  • Often need large database on many machines
      • Split search terms across machines
      • Updating & searching work split between machines
      • Database way too large for any single machine
  • If you think about it, this is incredibly common
      • Where?

Page 4: Lecture 21: Indexed Files

Split Dictionaries


Page 6: Lecture 21: Indexed Files

Splitting Keys From Values

  • In the real world, we often have many indices
      • Simple units measure where we can find values
      • Values could be searched for in multiple ways


Page 8: Lecture 21: Indexed Files

Index & Data Files

  • Split information into two (or more) files
      • Data file uses fixed-size records to store data
      • Index files contain search terms & data locations
  • Fixed-size records usually used in data file
      • Each record will use exactly that much space
      • Extra space wasted if the value is smaller
      • But limits data size, cannot get more space
      • Makes it far easier to reuse space & rebuild index
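The fixed-size record idea can be sketched as follows. This is only an illustration: the 16-byte layout (a 12-byte padded name plus a 4-byte int), the class name FixedRecordSketch, and its methods are assumptions made for the example, not the lecture's format.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical layout: 12 bytes of space-padded name followed by a 4-byte int.
final class FixedRecordSketch {
    static final int NAME_SZ = 12;
    static final int VALUE_SZ = 4;
    static final int RECORD_SZ = NAME_SZ + VALUE_SZ; // every record uses exactly this much

    // Write one record into the given slot; names that do not fit are rejected.
    static void write(RandomAccessFile file, int slot, String name, int value) throws IOException {
        byte[] raw = name.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > NAME_SZ) {
            throw new IllegalArgumentException("name does not fit in the fixed-size field");
        }
        byte[] padded = new byte[NAME_SZ];
        Arrays.fill(padded, (byte) ' ');            // unused bytes are the wasted space
        System.arraycopy(raw, 0, padded, 0, raw.length);
        file.seek((long) slot * RECORD_SZ);         // a slot's position is simple arithmetic
        file.write(padded);
        file.writeInt(value);
    }

    static String readName(RandomAccessFile file, int slot) throws IOException {
        byte[] padded = new byte[NAME_SZ];
        file.seek((long) slot * RECORD_SZ);
        file.readFully(padded);
        return new String(padded, StandardCharsets.US_ASCII).trim();
    }

    static int readValue(RandomAccessFile file, int slot) throws IOException {
        file.seek((long) slot * RECORD_SZ + NAME_SZ);
        return file.readInt();
    }

    // Runs the sketch on a temporary file and reports slot 1 plus the file length.
    static String demo() {
        try {
            File tmp = File.createTempFile("records", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                write(file, 0, "IBM", 106);
                write(file, 1, "Ford", 12);
                write(file, 0, "AT&T", 23);         // slot 0 is reused; slot 1 does not move
                return readName(file, 1) + "=" + readValue(file, 1) + " size=" + file.length();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because every slot has the same size, overwriting slot 0 never disturbs slot 1, which is the "easier to reuse space" point above.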

Page 9: Lecture 21: Indexed Files

Index File Format

  • No standard format – depends on type of data
      • Often variable sized, but this is not a specific requirement
      • Each entry in index file begins with exact search term
      • Followed by position containing matching data
      • As a result, often find indexes smushed together
  • Can read indexes at start of program execution
      • Reasonably assumes index file smaller than data file
      • Changes written immediately, however
  • When program starts, do NOT read data file
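Reading an index at program start might look like the sketch below. The entry format is an assumption, since the lecture stresses there is no standard one: each entry is taken to be writeUTF(term) followed by writeLong(position). In-memory streams stand in for the index file, and the name IndexLoaderSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Assumed entry format (not from the lecture): writeUTF(term), then writeLong(position).
final class IndexLoaderSketch {

    // Read (term, position) entries until the stream runs out.
    static TreeMap<String, Long> load(InputStream in) throws IOException {
        TreeMap<String, Long> index = new TreeMap<>();
        DataInputStream data = new DataInputStream(in);
        while (true) {
            String term;
            try {
                term = data.readUTF();          // each entry begins with the search term
            } catch (EOFException endOfIndex) {
                break;                          // no more entries
            }
            index.put(term, data.readLong());   // followed by the position of the data
        }
        return index;
    }

    // Write the entries back to back in the same assumed format.
    static byte[] save(Map<String, Long> index) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (Map.Entry<String, Long> entry : index.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeLong(entry.getValue());
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Round-trips the ticker index used in the lecture's figures.
    static TreeMap<String, Long> demo() {
        try {
            TreeMap<String, Long> original = new TreeMap<>();
            original.put("F", 224L);
            original.put("IBM", 0L);
            original.put("T", 112L);
            return load(new ByteArrayInputStream(save(original)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the small index is loaded into memory; the data file is never touched here.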

Page 10: Lecture 21: Indexed Files

Never Read Entire Data File

Page 11: Lecture 21: Indexed Files

Indexed Files

  • Enables splitting search terms across computers
  • Alphabetical split makes searches faster across many servers

[Figure: servers labeled by alphabetical range: A-C, D-E, F-H, I-P, Q-R, S-T, U-X, Y-Z]
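One way to picture the alphabetical split is a lookup from a term's first letter to the server holding that range. The sketch below uses the ranges from the figure; the server names and the class ShardRouterSketch are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Maps the first letter of each range to a hypothetical server name.
final class ShardRouterSketch {
    private final TreeMap<Character, String> servers = new TreeMap<>();

    ShardRouterSketch() {
        servers.put('A', "server-1"); // A-C
        servers.put('D', "server-2"); // D-E
        servers.put('F', "server-3"); // F-H
        servers.put('I', "server-4"); // I-P
        servers.put('Q', "server-5"); // Q-R
        servers.put('S', "server-6"); // S-T
        servers.put('U', "server-7"); // U-X
        servers.put('Y', "server-8"); // Y-Z
    }

    // Pick the server whose range starts at or before the term's first letter.
    String serverFor(String term) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("empty search term");
        }
        char first = Character.toUpperCase(term.charAt(0));
        Map.Entry<Character, String> owner = servers.floorEntry(first);
        if (owner == null || first > 'Z') {
            throw new IllegalArgumentException("no server covers: " + term);
        }
        return owner.getValue();
    }

    public static void main(String[] args) {
        ShardRouterSketch router = new ShardRouterSketch();
        System.out.println(router.serverFor("IBM"));
        System.out.println(router.serverFor("Ford"));
    }
}
```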

Page 12: Lecture 21: Indexed Files

Indexed Files

  • Enables splitting search terms across computers
  • Create indexes for different types of searching

[Figure: two indexes over the same data, one by song name and one by song length]

Page 13: Lecture 21: Indexed Files

How Does This Work?

  • Using index files simplified using positions
      • Look in index structure to find position of data in file
      • With this position can then seek to specific record
      • Create instance & initialize by reading data from file
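The three steps above can be sketched as below. The record contents (a single int price at each position) and the names IndexedLookupSketch, Quote, and find are assumptions for illustration; the positions 0, 112, and 224 are borrowed from the lecture's figures.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class IndexedLookupSketch {
    // Plain value object built from the bytes found at a record's position.
    static final class Quote {
        final String ticker;
        final int price;

        Quote(String ticker, int price) {
            this.ticker = ticker;
            this.price = price;
        }
    }

    static Quote find(Map<String, Long> index, RandomAccessFile data, String ticker)
            throws IOException {
        Long position = index.get(ticker);   // 1. look in the index structure
        if (position == null) {
            return null;                     // term is not indexed
        }
        data.seek(position);                 // 2. seek to that specific record
        int price = data.readInt();          // 3. read the data ...
        return new Quote(ticker, price);     //    ... and create the instance
    }

    // Builds a small temporary data file and index, then looks up one ticker.
    static int demo(String ticker) {
        try {
            File tmp = File.createTempFile("stocks", ".dat");
            tmp.deleteOnExit();
            Map<String, Long> index = new HashMap<>();
            try (RandomAccessFile data = new RandomAccessFile(tmp, "rw")) {
                long[] positions = {0L, 112L, 224L};
                String[] tickers = {"IBM", "T", "F"};
                int[] prices = {106, 23, 12};
                for (int i = 0; i < tickers.length; i++) {
                    data.seek(positions[i]);
                    data.writeInt(prices[i]);
                    index.put(tickers[i], positions[i]);
                }
                Quote found = find(index, data, ticker);
                return found == null ? -1 : found.price;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("T"));
    }
}
```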

Page 14: Lecture 21: Indexed Files

Starting with Indexed Files

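[Figure: two index files pointing into one data file]

  Index by company name (search term, position):
      American Telephone & Telegraph     112
      International Business Machines    0
      Ford Motorcars, Inc.               224

  Index by ticker symbol (search term, position):
      F      224
      IBM    0
      T      112

  Data file records (name, price, ticker):
      IBM       106    IBM
      AT & T    23     T
      Ford      2      F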

Page 15: Lecture 21: Indexed Files

Where Was "Searching" Used?

  • Indexed files used in Maps and Dictionaries
      • Read data into searchable object after opening file
      • For each record, Entry uses indexed data as its key
  • Single data file has multiple indexes to search it
      • Not a problem, each index has own Collection
      • Cannot have multiple instances for each data item
      • Cannot have single instance for each data item
  • Then how can we construct each Entry's value?

Page 16: Lecture 21: Indexed Files

Proxy Pattern For The Win!


  • Create proxy instances to use as Entry's value
      • Proxy pretends it has the data by defining getters & setters
      • Data's position & file are the only fields these objects have
      • Whenever a method is called, it looks up & returns the data
  • Other classes will think proxy has the fields declared
      • Simplifies using class & ensures up-to-date data used
      • But little memory needed, since data resides on disk!

Page 18: Lecture 21: Indexed Files

Starting with Indexed Files

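[Figure: two index files pointing into one data file]

  Index by company name (search term, position):
      American Telephone & Telegraph     112
      International Business Machines    0
      Ford Motorcars, Inc.               224

  Index by ticker symbol (search term, position):
      F      224
      IBM    0
      T      112

  Data file records (name, price, ticker):
      IBM       106    IBM
      AT & T    23     T
      Ford      12     F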

Page 19: Lecture 21: Indexed Files

Coding

    import java.io.RandomAccessFile;

    public class Stock {
        // *_OFF: offset in record to field start
        // *_SZ:  fixed max. size of each field
        private static final int NAME_OFF = 0;
        private static final int NAME_SZ = 50;
        private static final int PRC_OFF = NAME_OFF + NAME_SZ;
        private static final int PRC_SZ = 4;
        private static final int TICK_OFF = PRC_OFF + PRC_SZ;
        private static final int TICK_SZ = 6;
        // Fixed size of a record in data file
        private static final int SIZE = TICK_OFF + TICK_SZ;

        // The proxy's only fields: where the record lives and which file holds it
        private long position;
        private RandomAccessFile theFile;

        public Stock(long pos, RandomAccessFile file) {
            position = pos;
            theFile = file;
        }

Page 22: Lecture 21: Indexed Files

Coding

    import java.io.IOException;

    public class Stock { // Continues from last time

        // Each accessor seeks to the field inside this record, then reads or writes it.
        public int getStockPrice() throws IOException {
            theFile.seek(position + PRC_OFF);
            return theFile.readInt();
        }

        public void setStockPrice(int price) throws IOException {
            theFile.seek(position + PRC_OFF);
            theFile.writeInt(price);
        }

        public void setTickerSymbol(String sym) throws IOException {
            theFile.seek(position + TICK_OFF);
            theFile.writeUTF(sym);
        }
        // More getters & setters from here…
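To see the proxy behave as described, here is a small self-contained sketch. It reuses the lecture's field sizes but trims the class to the price accessors; the wrapper StockProxyDemo, the temporary file, and the demo method are assumptions added so the example runs on its own.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

final class StockProxyDemo {
    // Trimmed-down proxy: it stores only a position and a file, never the price itself.
    static final class Stock {
        private static final int NAME_SZ = 50;
        private static final int PRC_OFF = NAME_SZ;
        static final int SIZE = NAME_SZ + 4 + 6;    // name + price + ticker

        private final long position;
        private final RandomAccessFile theFile;

        Stock(long pos, RandomAccessFile file) {
            position = pos;
            theFile = file;
        }

        int getStockPrice() throws IOException {
            theFile.seek(position + PRC_OFF);
            return theFile.readInt();
        }

        void setStockPrice(int price) throws IOException {
            theFile.seek(position + PRC_OFF);
            theFile.writeInt(price);
        }
    }

    // Two proxies for the same record: a write through one is visible through the other.
    static int demo() {
        try {
            File tmp = File.createTempFile("proxy", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                file.setLength(Stock.SIZE);         // room for one record
                Stock first = new Stock(0, file);
                Stock second = new Stock(0, file);
                first.setStockPrice(106);
                return second.getStockPrice();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Since neither proxy caches the price, callers always see what is currently on disk.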

Page 23: Lecture 21: Indexed Files

Visualizing Indexed Files

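[Figure: two index files pointing into one data file]

  Index by company name (search term, position):
      American Telephone & Telegraph     112
      International Business Machines    0
      Ford Motorcars, Inc.               224

  Index by ticker symbol (search term, position):
      F      224
      IBM    0
      T      112

  Data file records (name, price, ticker):
      IBM       106    IBM
      AT & T    23     T
      Ford      12     F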

Page 24: Lecture 21: Indexed Files

How Do We Add Data?

  • Adding new records takes only a few steps
      • Add space for record with setLength on data file
      • Update index structure(s) to include new record
      • Records in data file updated at each change
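A minimal sketch of these steps follows. The 112-byte record size is an assumption chosen only because the lecture's figures place records 112 bytes apart, the record body is reduced to a single int price, and AppendRecordSketch with its methods is a hypothetical name.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

final class AppendRecordSketch {
    static final int RECORD_SZ = 112; // assumed size; matches the spacing in the figures

    static long add(RandomAccessFile data, Map<String, Long> index, String key, int price)
            throws IOException {
        long position = data.length();          // the new record starts at the current end
        data.setLength(position + RECORD_SZ);   // 1. add space for the record
        index.put(key, position);               // 2. update the in-memory index
        data.seek(position);
        data.writeInt(price);                   // 3. write the record's data right away
        return position;
    }

    // Adds four records to a temporary file and reports where the last one landed.
    static String demo() {
        try {
            File tmp = File.createTempFile("append", ".dat");
            tmp.deleteOnExit();
            Map<String, Long> index = new TreeMap<>();
            try (RandomAccessFile data = new RandomAccessFile(tmp, "rw")) {
                add(data, index, "IBM", 106);
                add(data, index, "T", 23);
                add(data, index, "F", 12);
                long citibank = add(data, index, "C", -2);
                return "C@" + citibank + " length=" + data.length();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```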

Page 25: Lecture 21: Indexed Files

Adding New Data To The Files

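[Figure: space for a new record added to the data file; both indexes updated]

  Index by ticker symbol (search term, position):
      C      336
      F      224
      IBM    0
      T      112

  Index by company name (search term, position):
      American Telephone & Telegraph     112
      Citibank                           336
      International Business Machines    0
      Ford Motorcars, Inc.               224

  Data file records (name, price, ticker):
      IBM       106    IBM
      AT & T    23     T
      Ford      12     F
      (new record, still empty: 0 Ø)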

Page 26: Lecture 21: Indexed Files

Adding New Data To The Files

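[Figure: the new record filled in with Citibank's data]

  Index by ticker symbol (search term, position):
      C      336
      F      224
      IBM    0
      T      112

  Index by company name (search term, position):
      American Telephone & Telegraph     112
      Citibank                           336
      International Business Machines    0
      Ford Motorcars, Inc.               224

  Data file records (name, price, ticker):
      IBM         106    IBM
      AT & T      23     T
      Ford        12     F
      Citibank    -2     C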

Page 27: Lecture 21: Indexed Files

How Does This Work?

  • Removing records even easier
      • To prevent using record, remove items from indexes
      • Do NOT update index file(s) until program completes
      • Use impossible magic numbers for record in data file
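Removal along these lines might look like the sketch below. The magic number (Integer.MIN_VALUE stored in the price field), the 4-byte records, and the name RemoveRecordSketch are all assumptions for illustration; the lecture does not say which values it uses.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

final class RemoveRecordSketch {
    // Assumed magic number: no real stock price would ever have this value.
    static final int REMOVED = Integer.MIN_VALUE;

    static boolean remove(RandomAccessFile data, Map<String, Long> index, String key)
            throws IOException {
        Long position = index.remove(key);  // in-memory index only; index file waits until exit
        if (position == null) {
            return false;                   // nothing to remove
        }
        data.seek(position);
        data.writeInt(REMOVED);             // mark the slot as unused in the data file
        return true;
    }

    // Removes "F" from a two-record file and reports the index keys and the freed slot.
    static String demo() {
        try {
            File tmp = File.createTempFile("remove", ".dat");
            tmp.deleteOnExit();
            Map<String, Long> index = new TreeMap<>();
            try (RandomAccessFile data = new RandomAccessFile(tmp, "rw")) {
                data.writeInt(106);         // record at position 0
                data.writeInt(12);          // record at position 4
                index.put("IBM", 0L);
                index.put("F", 4L);
                remove(data, index, "F");
                data.seek(4L);
                return index.keySet() + " slot=" + data.readInt();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A later rebuild of the index can skip any record whose price field holds the magic number, and the slot can be reused for a new record.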

Page 28: Lecture 21: Indexed Files

Removing Data As We Go

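[Figure: indexes and data file before Ford is removed]

  Index by ticker symbol (search term, position):
      C      336
      F      224
      IBM    0
      T      112

  Index by company name (search term, position):
      American Telephone & Telegraph     112
      Citibank                           336
      International Business Machines    0
      Ford Motorcars, Inc.               224

  Data file records (name, price, ticker):
      IBM         106    IBM
      AT & T      23     T
      Ford        12     F
      Citibank    -2     C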

Page 29: Lecture 21: Indexed Files

Removing Data As We Go

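[Figure: Ford removed from both indexes; its record in the data file overwritten]

  Index by ticker symbol (search term, position):
      C      336
      IBM    0
      T      112

  Index by company name (search term, position):
      American Telephone & Telegraph     112
      Citibank                           336
      International Business Machines    0

  Data file records (name, price, ticker):
      IBM         106    IBM
      AT & T      23     T
      (removed record: 0 Ø)
      Citibank    -2     C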

Page 30: Lecture 21: Indexed Files

Using Multiple Indexes

  • Multiple indexes for a data file are very often needed
      • Provides many ways of searching for important data
  • Since each index file is read individually, this could also create a problem
      • Multiple proxy instances for the same data could be created
      • Duplicates of an instance are created for each index
      • Makes removing them all difficult, since they are not linked
  • Very easy to solve: use a Map while loading each index
      • Converts positions in file to proxy instances to solve this

Page 31: Lecture 21: Indexed Files

Linking Multiple Indexes

  • Use one Map instance while reading all indexes
      • For each position in file, check if already in Map
      • Use existing proxy instance, if position already in Map
      • If a search in Map returns null, create new instance
      • Make sure to call put() when we must create proxy
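The steps above can be sketched as follows. The Proxy class here is a stand-in holding only a position, and SharedProxySketch, link, and byPosition are hypothetical names; the index contents come from the lecture's figures.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class SharedProxySketch {
    // Stand-in for a real proxy such as Stock; only the position matters here.
    static final class Proxy {
        final long position;

        Proxy(long position) {
            this.position = position;
        }
    }

    // Turn one raw index (term -> position) into term -> proxy, sharing proxies via byPosition.
    static Map<String, Proxy> link(Map<String, Long> rawIndex, Map<Long, Proxy> byPosition) {
        Map<String, Proxy> linked = new TreeMap<>();
        for (Map.Entry<String, Long> entry : rawIndex.entrySet()) {
            Proxy proxy = byPosition.get(entry.getValue());  // already created for this position?
            if (proxy == null) {
                proxy = new Proxy(entry.getValue());         // no: create it ...
                byPosition.put(entry.getValue(), proxy);     // ... and remember to put() it
            }
            linked.put(entry.getKey(), proxy);
        }
        return linked;
    }

    // Loads two indexes over the same records and checks that they share proxy instances.
    static boolean demo() {
        Map<String, Long> names = new TreeMap<>();
        names.put("International Business Machines", 0L);
        names.put("Ford Motorcars, Inc.", 224L);
        Map<String, Long> tickers = new TreeMap<>();
        tickers.put("IBM", 0L);
        tickers.put("F", 224L);

        Map<Long, Proxy> byPosition = new HashMap<>();       // one Map for ALL indexes
        Map<String, Proxy> byName = link(names, byPosition);
        Map<String, Proxy> byTicker = link(tickers, byPosition);

        return byName.get("International Business Machines") == byTicker.get("IBM")
                && byName.get("Ford Motorcars, Inc.") == byTicker.get("F")
                && byPosition.size() == 2;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```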

Page 32: Lecture 21: Indexed Files

What to Study for Midterm

  • Study your Maps and Dictionaries
      • When would we use each of the ADTs? Why?
      • What do their methods do? Why do they differ?
  • Consider each implementation of these ADTs
      • Explain why method has its given big-Oh complexity
      • Why use an implementation? Where is it used?
      • What are negatives or limitations of implementation?
      • What fields needed by implementation? Why is this?


Page 34: Lecture 21: Indexed Files

What to Study for Midterm

  • List-based approaches – Why? When?
  • Hash tables
      • How do hash functions work? What does mod do?
      • How do we add & remove data from hash table?
      • What are collisions & how do we handle them?
      • What is real & pretend big-Oh complexity? Why?
  • Binary Search Trees
      • How do we add, remove, & search in these trees?
      • How are data in BSTs organized? Tricks to their use?
      • How do we code & use BSTs? What methods exist?

Page 35: Lecture 21: Indexed Files

What to Study for Midterm

  • AVL Trees
      • How do we add, remove, & search in these trees?
      • How are data in them organized? Tricks to their use?
      • When must we reorganize tree? How is this done?
  • Splay Trees
      • How do we add, remove, & search in these trees?
      • For each method is node splayed & which one?
      • How to chain splayings together? When do we stop?

Page 36: Lecture 21: Indexed Files

What to Study for Midterm

  • Class selection & design
      • Where do classes come from? How do we know?
      • When to use each connection between classes?
      • How to list methods & fields in UML class diagram?
  • Comments & Outlines
      • When, where, and how much?
      • What should & should not be included?

Page 37: Lecture 21: Indexed Files

Midterm Process

  • Open-book & open-note test; do not memorize
      • But have methods & information at your fingertips
  • Use my slides ONLY with note(s) on that day's slides
      • Cannot use daily or weekly activities
      • Must submit all printed pages along with test
  • Problems resemble the tone of those already seen
      • All new problems, however; do not memorize answers
      • Includes tracing, showing state of ADT, method returns
      • Coding, big-Oh analysis, and more can be asked

Page 38: Lecture 21: Indexed Files

For Next Lecture

  • Midterm #1 in class a week from Friday
  • Project #2 available on Angel on Friday, too
  • Lab phase #2 due on Friday at midnight
      • I still will be out of town, but lab activity will be posted
      • Due a week from Friday; chance to use indexed files
  • No class on Monday; take some time to relax
      • I will be out of town serving on an NSF grant panel
      • Updated schedule on Angel accounts for change