File Processing - Hash File Considerations MVNC 1

Hash File

Considerations

Hashing - Hash File Considerations

Statistical Considerations
» Record distribution is important
» Ideal - one record per location
» Load Factor - how full the file is
  – Load Factor = r / (b * m)
  – r - number of records stored
  – b - bucket size
  – m - number of addresses
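
The load factor formula can be checked with a few lines of Python. This is a minimal sketch; the function name and the sample figures (750 records, 5 records per bucket, 200 addresses) are assumptions chosen for illustration, not values from the slides.

```python
def load_factor(r: int, b: int, m: int) -> float:
    """Fraction of the file's record slots in use: r / (b * m)."""
    if b <= 0 or m <= 0:
        raise ValueError("bucket size and number of addresses must be positive")
    return r / (b * m)

# Hypothetical file: 750 records, bucket size 5, 200 addresses.
print(load_factor(750, 5, 200))  # 0.75
```

Note that b * m is the total capacity of the file, so the division must apply to the whole product.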

Hashing - Statistical Considerations

Graphing Record Distribution
» Frequency Distribution Graph
  – Y axis - records per address
  – X axis - RRP
» Alternate Frequency Distribution Graph
  – Y axis - number of addresses with x records
  – X axis - x records assigned

Example - (x DIV 5) MOD 4
» Data: 22, 1, 14, 56, 25, 13, 43, 62, 11
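
The example can be worked through in code. The sketch below tabulates both distributions for the slide's hash function and data; the variable and function names are assumptions introduced here.

```python
from collections import Counter

def home_address(x: int) -> int:
    """The slide's example hash: (x DIV 5) MOD 4."""
    return (x // 5) % 4

keys = [22, 1, 14, 56, 25, 13, 43, 62, 11]

# Frequency distribution: records per address (addresses 0..3).
records_per_address = Counter(home_address(k) for k in keys)

# Alternate distribution: number of addresses holding exactly x records.
addresses_with_x = Counter(records_per_address.get(a, 0) for a in range(4))

for a in range(4):
    print(a, records_per_address[a])
# 0 4
# 1 1
# 2 3
# 3 1
```

Address 0 receives four of the nine keys and address 2 receives three, so this function spreads this data set unevenly. In the alternate view, two addresses hold one record, one holds three, and one holds four.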

Hashing - Overall Guidelines

If possible, use uniformly distributed keys

Use a carefully designed hashing scheme
» Do statistical studies if possible
» Monitor performance
» Should be computationally efficient

Tailor bucket size and load factor to the particular I/O device
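
One way to tailor these parameters is sketched below, under the assumption that one bucket occupies exactly one device block. The block size, record size, record count, and target load factor are all hypothetical figures, and the function names are introduced here for illustration.

```python
import math

def bucket_size_for_block(block_bytes: int, record_bytes: int) -> int:
    """Records that fit in one block, used here as the bucket size b."""
    return block_bytes // record_bytes

def addresses_needed(records: int, bucket_size: int, target_load: float) -> int:
    """Smallest m such that records / (bucket_size * m) <= target_load."""
    return math.ceil(records / (bucket_size * target_load))

# Assumed device and file: 4096-byte blocks, 120-byte records,
# 10,000 expected records, target load factor 0.7.
b = bucket_size_for_block(4096, 120)
m = addresses_needed(10_000, b, 0.7)
print(b, m)  # 34 421
```

Matching the bucket to the block means one read retrieves every record that shares a home address.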

Hashing - Advantages

Flexibility
» Adaptable to a variety of situations
» Useful for both disk and memory based retrieval

Efficiency of record access
» Can achieve O(1) access times

Hashing - Disadvantages

No ordered record access by PK

Data (key set) dependency
» Must be specifically tailored for each key distribution and form
» If characteristics change, hashing scheme may need to change

Fixed upper limit on file size
» Size determined at creation time
» Must "rehash" to larger file if expansion needed
» May need to redesign hash algorithm as well
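
The cost of rehashing to a larger file can be seen in a small in-memory sketch. It assumes a plain division hash (key MOD m) and unbounded bucket lists, neither of which is prescribed by the slides; the function names are hypothetical.

```python
def build_table(keys, m):
    """Place each key in the bucket for address key MOD m."""
    table = [[] for _ in range(m)]
    for k in keys:
        table[k % m].append(k)
    return table

def rehash(table, new_m):
    """Rebuild into a larger file: every record gets a new home address."""
    return build_table((k for bucket in table for k in bucket), new_m)

small = build_table([22, 1, 14, 56, 25, 13, 43, 62, 11], 5)
large = rehash(small, 11)
```

Because the address depends on the file size, no record can stay where it was: the whole file is read and rewritten.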

Hashing Considerations

Static vs. Dynamic Files
» Static files
  – Fixed key data
  – Entire domain of keys known a priori (key set)
  – By experimentation, may be able to find a collision-free solution
  – Examples
      Assembler OP code table
      FAX group three compression table
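
For a static key set, the experimentation can be automated. The sketch below searches for the smallest table size at which a simple string hash maps every key to a distinct address. The opcode list and the base-31 polynomial hash are assumptions made for illustration, not a real assembler's table or the technique used in the course.

```python
from itertools import count

def raw_hash(key: str) -> int:
    """Base-31 polynomial over character codes (an assumed hash)."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h

def find_collision_free_size(keys):
    """Smallest m >= len(keys) where raw_hash(k) MOD m is collision-free."""
    raws = [raw_hash(k) for k in keys]
    if len(set(raws)) != len(raws):
        raise ValueError("two keys share a raw hash; no table size can help")
    for m in count(len(keys)):
        if len({r % m for r in raws}) == len(raws):
            return m

# Hypothetical opcode table, known completely in advance.
opcodes = ["ADD", "SUB", "MUL", "DIV", "LOAD", "STORE", "JMP", "HALT"]
m = find_collision_free_size(opcodes)
table = {raw_hash(op) % m: op for op in opcodes}
```

The search always terminates when the raw hashes are distinct, since any m larger than the largest raw hash leaves them unchanged. The resulting table may be sparser than a hand-tuned solution.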

Hashing Considerations

Static vs. Dynamic Files
» Dynamic files
  – Key set not known in advance
  – Patterns/samples of keys may be known
  – Collision-free solution not generally possible
  – Experimentation may be used to find a good hash algorithm and configuration
      Hash algorithm technique
      File size
      Bucket size
      Overflow strategy

Hashing Considerations

Static vs. Dynamic Hashing
» Static Hashing
  – File size fixed over life of file
  – Must rebuild to make larger
» Dynamic Hashing
  – File may expand and contract over time
  – Called extendible hashing

Hashing Considerations

Distribution of keys
» May know some information about key distribution in advance
  – Complete set
  – Patterns are predictable
  – Completely unpredictable

Hashing Considerations

Files versus arrays
» Hashing suitable for both primary and secondary retrieval purposes.
» Primary memory based systems
  – I/O time not a consideration
      Buckets not really helpful
  – Other factors gain in importance
      Hash algorithm complexity
      Overflow technique

Hashing Considerations

Hash Algorithms - general forms
» Division
  – Division remainder scheme is an example.
  – Choice of divisor is important
      Should be prime relative to the file size.
      Should not be a power of two.
      Bad choices result in simple truncation, thus part of the key is simply discarded.
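
The truncation effect of a power-of-two divisor is easy to demonstrate. In this sketch the four keys are hypothetical and were chosen to share their low-order 8 bits; 251 is simply a prime close to 256.

```python
def division_hash(key: int, divisor: int) -> int:
    """Division remainder scheme: key MOD divisor."""
    return key % divisor

# Hypothetical keys that differ only in their high-order bits.
keys = [0x1A30, 0x2B30, 0x3C30, 0x4D30]

# Divisor 256 = 2**8 keeps only the low 8 bits, so every key collides.
print([division_hash(k, 256) for k in keys])  # [48, 48, 48, 48]

# A prime divisor lets the high-order bits influence the address.
print([division_hash(k, 251) for k in keys])  # [178, 12, 97, 182]
```

With divisor 256 the high-order part of each key is discarded outright, which is the "simple truncation" the slide warns about.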

Hashing Considerations

Hash Algorithms - general forms
» Multiplication
  – Multiplicative techniques tend to use ALL of the information in the key (no truncation)
  – Mid-square technique is an example.
» Compression, extraction, folding
  – Useful for large keys
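
Mid-square and folding can each be sketched in a few lines. Conventions vary on which digits count as the "middle" and on how a key is split for folding, so the choices below (decimal digits, three-digit groups, left-to-right chunks) are assumptions for illustration.

```python
def mid_square(key: int, digits: int = 3) -> int:
    """Square the key and extract `digits` decimal digits from the middle."""
    squared = str(key * key).zfill(digits)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])

def fold(key: int, chunk: int = 3, m: int = 1000) -> int:
    """Split the key's decimal digits into chunks, add them, reduce MOD m."""
    text = str(key)
    parts = [int(text[i:i + chunk]) for i in range(0, len(text), chunk)]
    return sum(parts) % m

print(mid_square(1234))  # 1234**2 = 1522756, middle digits -> 227
print(fold(123456789))   # 123 + 456 + 789 = 1368, MOD 1000 -> 368
```

In both cases every digit of the key contributes to the result, unlike a truncating divisor.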

Hashing Considerations

Hash Algorithms - general forms
» Double Hashing
  – Rather than progressive overflow on collision, use a secondary hash function to generate a step length for the next probe
  – Helps reduce secondary clustering of linear probing with step size greater than one.
  – Non-linear, or random probing
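
A minimal sketch of double hashing follows. The two hash functions are common textbook choices rather than anything specified on the slide, and the table size of 11 is assumed to be prime so that every step length eventually visits every address.

```python
def h1(key: int, m: int) -> int:
    """Primary hash: home address."""
    return key % m

def h2(key: int, m: int) -> int:
    """Secondary hash: step length in 1..m-1, never zero."""
    return 1 + key % (m - 1)

def insert(table: list, key: int) -> int:
    """Store key using double hashing; return the address used."""
    m = len(table)
    addr, step = h1(key, m), h2(key, m)
    for _ in range(m):
        if table[addr] is None:
            table[addr] = key
            return addr
        addr = (addr + step) % m  # next probe uses the key's own step
    raise RuntimeError("table is full")

table = [None] * 11
# 22, 33 and 44 all have home address 0.
placements = [insert(table, k) for k in (22, 33, 44)]
print(placements)  # [0, 4, 5]
```

Progressive overflow would place these three synonyms at 0, 1 and 2. Here each key follows its own probe sequence, so synonyms do not pile up along a single shared path.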

Hashing Considerations

Hash Algorithms - general forms
» Multi-Attribute Hashing
  – Base the calculation for home address on more than the primary key attribute.
  – Useful if the primary key exhibits certain bad hashing attributes (clustering, etc.)
  – Example - use part number (PK) and distributor fields.
» Extendible Hashing
  – See text
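
The multi-attribute example above (part number plus distributor) might look like the sketch below. The way the two fields are combined, the string-folding helper, and the sample records are all assumptions; the slides give only the idea.

```python
def fold_text(text: str) -> int:
    """Reduce a string field to an integer (assumed helper)."""
    h = 0
    for ch in text:
        h = (h * 31 + ord(ch)) % 1_000_003
    return h

def multi_attribute_hash(part_number: int, distributor: str, m: int) -> int:
    """Home address from the primary key and a second attribute."""
    return (part_number * 31 + fold_text(distributor)) % m

# Hypothetical records: part numbers cluster on multiples of 100.
records = [(100, "ACME"), (200, "BOLT"), (300, "CORE")]
m = 100

pk_only = [part % m for part, _ in records]
combined = [multi_attribute_hash(part, dist, m) for part, dist in records]
print(pk_only)   # [0, 0, 0]
print(combined)
```

Hashing on the part number alone sends every record to address 0, while the distributor field breaks up the cluster. The trade-off is that retrieval now needs both attributes to compute the home address.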