File Processing - Hash File Considerations MVNC 1
Hash File Considerations
Hashing - Hash File Considerations
Statistical Considerations
  » Record distribution is important
  » Ideal - one record per location
  » Load Factor - how full the file is
    – Load Factor = r / (b * m)
    – r - number of records stored
    – b - bucket size
    – m - number of addresses
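A minimal sketch of the load factor calculation, using the symbols r, b, and m defined above. The function name and the sample figures (750 records, buckets of 5, 200 addresses) are illustrative assumptions, not values from the slides.

```python
def load_factor(r: int, b: int, m: int) -> float:
    """Return r / (b * m): records stored divided by total record capacity."""
    if b <= 0 or m <= 0:
        raise ValueError("bucket size and number of addresses must be positive")
    return r / (b * m)


# Hypothetical file: 200 addresses, 5 records per bucket, 750 records stored.
print(load_factor(750, 5, 200))  # 0.75
```

A value above 1.0 means more records than home slots, so some records must be placed in overflow.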
Hashing - Statistical Considerations
Graphing Record Distribution
  » Frequency Distribution Graph
    – Y axis - records per address
    – X axis - RRP
  » Alternate Frequency Distribution Graph
    – Y axis - number of addresses with x records
    – X axis - x records assigned
Example - hash function (x DIV 5) MOD 4
  » Data: 22, 1, 14, 56, 25, 13, 43, 62, 11
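The example above can be tabulated with a short script. This is a sketch that assumes DIV means integer division and that the four addresses are numbered 0 to 3; the variable names are made up for illustration.

```python
from collections import Counter


def home_address(x: int) -> int:
    """Hash function from the slide: (x DIV 5) MOD 4."""
    return (x // 5) % 4


keys = [22, 1, 14, 56, 25, 13, 43, 62, 11]

# Frequency distribution: records per address.
counts = Counter(home_address(k) for k in keys)
per_address = [counts[a] for a in range(4)]

# Alternate distribution: number of addresses holding exactly x records.
addresses_with_x = Counter(per_address)

for address, n in enumerate(per_address):
    print(f"address {address}: {'*' * n} ({n})")
for x in sorted(addresses_with_x):
    print(f"{addresses_with_x[x]} address(es) hold {x} record(s)")
```

For this data, address 0 receives four records (22, 1, 43, 62), address 2 receives three (14, 13, 11), and addresses 1 and 3 receive one each, which is far from the ideal of one record per location.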
Hashing - Overall Guidelines
If possible, use uniformly distributed keys
Use a carefully designed hashing scheme
  » Do statistical studies if possible
  » Monitor performance
  » Should be computationally efficient
Tailor bucket size and load factor to the particular I/O device
Hashing - Advantages
Flexibility
  » Adaptable to a variety of situations
  » Useful both for disk and memory based retrieval
Efficiency of record access
  » Can achieve O(1) access times
Hashing - Disadvantages
No ordered record access by PK
Data (key set) dependency
  » Must be specifically tailored for each key distribution and form
  » If characteristics change, hashing scheme may need to change
Fixed upper limit on file size
  » Size determined at creation time
  » Must "rehash" to a larger file if expansion is needed
  » May need to redesign hash algorithm as well
Hashing Considerations
Static vs. Dynamic Files
  » Static files
    – Fixed key data
    – Entire domain of keys known a priori (key set)
    – By experimentation, may be able to find a collision-free solution
    – Examples
        Assembler OP code table
        FAX group three compression table
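The "by experimentation" point for static files can be sketched as a brute-force search over table sizes. Everything here is an assumption for illustration: the seven mnemonics are a made-up opcode set, and the position-weighted character sum is just one simple hash to experiment with, not a method from the slides.

```python
def weighted_hash(key: str, m: int) -> int:
    """Sum of character codes weighted by position, reduced modulo m."""
    return sum((i + 1) * ord(c) for i, c in enumerate(key)) % m


def find_collision_free_size(keys, max_m=512):
    """Return the smallest table size m with no collisions, or None."""
    for m in range(len(keys), max_m + 1):
        if len({weighted_hash(k, m) for k in keys}) == len(keys):
            return m
    return None


# Hypothetical, fixed key set known in advance.
OPCODES = ["ADD", "SUB", "MUL", "DIV", "JMP", "LDA", "STA"]

m = find_collision_free_size(OPCODES)
if m is not None:
    for op in OPCODES:
        print(op, weighted_hash(op, m))
```

The search only works because the key set never changes; adding a single new key could invalidate the result.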
Hashing Considerations
Static vs. Dynamic Files
  » Dynamic files
    – Key set not known in advance
    – Patterns/samples of keys may be known
    – Collision-free solution not generally possible
    – Experimentation may be used to find a good hash algorithm and configuration
        Hash algorithm technique
        File size
        Bucket size
        Overflow strategy
Hashing Considerations
Static vs. Dynamic Hashing
  » Static Hashing
    – File size fixed over life of file
    – Must rebuild to make larger
  » Dynamic Hashing
    – File may expand and contract over time
    – Called extensible hashing
Hashing Considerations
Distribution of keys
  » May know some information about key distribution in advance
    – Complete set
    – Patterns are predictable
    – Completely unpredictable
Hashing Considerations
Files versus arrays
  » Hashing suitable for both primary and secondary retrieval purposes
  » Primary memory based systems
    – I/O time not a consideration
        Buckets not really helpful
    – Other factors gain in importance
        Hash algorithm complexity
        Overflow technique
Hashing Considerations
Hash Algorithms - general forms
  » Division
    – Division remainder scheme is an example
    – Choice of divisor is important
        Should be prime relative to the file size
        Should not be a power of two
        Bad choices result in simple truncation, so part of the key is simply discarded
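The truncation effect of a power-of-two divisor can be seen with a small experiment. This sketch assumes integer keys that happen to be multiples of 16; the key list and the divisors 16 and 17 are illustrative choices only.

```python
def division_hash(key: int, divisor: int) -> int:
    """Division remainder scheme: home address = key MOD divisor."""
    return key % divisor


# Hypothetical keys that differ only in their high-order bits.
keys = [16, 32, 48, 64, 80, 96]

# Divisor 16 (a power of two) keeps only the low 4 bits of each key.
power_of_two = [division_hash(k, 16) for k in keys]

# Divisor 17 (a prime) lets every bit of the key influence the address.
prime = [division_hash(k, 17) for k in keys]

print(power_of_two)  # [0, 0, 0, 0, 0, 0]
print(prime)         # [16, 15, 14, 13, 12, 11]
```

With divisor 16 every key collides at address 0 because the high-order bits are discarded; with divisor 17 the same keys land on six distinct addresses.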
Hashing Considerations
Hash Algorithms - general forms
  » Multiplication
    – Multiplicative techniques tend to use ALL of the information in the key (no truncation)
    – Mid-square technique is an example
  » Compression, extraction, folding
    – Useful for large keys
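A rough sketch of the mid-square and folding techniques named above, working on decimal digits. The function names, the choice of how many digits to extract, and the sample keys are assumptions; real implementations often work on bits instead of decimal digits.

```python
def mid_square(key: int, digits: int, m: int) -> int:
    """Square the key, take `digits` digits from the middle, reduce modulo m."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    return int(squared[start:start + digits]) % m


def fold(key: int, chunk: int, m: int) -> int:
    """Split the key's digits into chunks, add the chunks, reduce modulo m."""
    text = str(key)
    total = sum(int(text[i:i + chunk]) for i in range(0, len(text), chunk))
    return total % m


# 1234 squared is 1522756; the middle three digits are 227.
print(mid_square(1234, 3, 1000))  # 227

# 123 + 456 + 789 = 1368, and 1368 MOD 1000 = 368.
print(fold(123456789, 3, 1000))   # 368
```

Folding compresses a long key into a small number while still letting every digit contribute, which is why it suits large keys.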
Hashing Considerations
Hash Algorithms - general forms
  » Double Hashing
    – Rather than progressive overflow on collision, use a secondary hash function to generate a step length for the next probe
    – Helps reduce secondary clustering of linear probing with step size greater than one
    – Non-linear, or random probing
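Double hashing as described above might look like the following sketch. It assumes an in-memory table of 11 slots (a prime, so every step length visits all slots), bucket size 1, no deletions, and two simple hash functions chosen for illustration; none of these specifics come from the slides.

```python
M = 11  # assumed table size; prime so any step in 1..M-1 covers the table


def h1(key: int) -> int:
    """Primary hash: home address."""
    return key % M


def h2(key: int) -> int:
    """Secondary hash: step length in 1..M-1, never zero."""
    return 1 + key % (M - 1)


def insert(table: list, key: int) -> int:
    """Store key and return the slot used, probing by h2(key) on collision."""
    addr, step = h1(key), h2(key)
    for _ in range(M):
        if table[addr] is None or table[addr] == key:
            table[addr] = key
            return addr
        addr = (addr + step) % M
    raise RuntimeError("table is full")


def lookup(table: list, key: int):
    """Return the slot holding key, or None if it is absent."""
    addr, step = h1(key), h2(key)
    for _ in range(M):
        if table[addr] is None:
            return None
        if table[addr] == key:
            return addr
        addr = (addr + step) % M
    return None


table = [None] * M
slots = [insert(table, k) for k in (22, 33, 44)]  # all three hash to address 0
print(slots)  # [0, 4, 5]
```

The three keys share a home address but get different step lengths (3, 4, and 5), so their probe sequences diverge instead of piling up along one path.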
Hashing Considerations
Hash Algorithms - general forms
  » Multi-Attribute hashing
    – Base the calculation for home address on more than the primary key attribute
    – Useful if the primary key exhibits certain bad hashing attributes (clustering, etc.)
    – Example - use part number (PK) and distributor fields
  » Extendible Hashing
    – See text
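For the multi-attribute hashing example above (part number plus distributor), one possible sketch is shown here. The use of CRC-32 as the mixing function, the separator character, and the sample field values are all assumptions made for illustration; the slides do not specify how the fields are combined.

```python
import zlib


def home_address(part_number: str, distributor: str, m: int) -> int:
    """Hash two fields together so both influence the home address."""
    # A separator keeps ("AB", "C") distinct from ("A", "BC").
    combined = f"{part_number}|{distributor}".encode("utf-8")
    return zlib.crc32(combined) % m


# Hypothetical part numbers that cluster when hashed on their own.
for part, dist in [("P-1000", "ACME"), ("P-1001", "ACME"), ("P-1000", "GLOBEX")]:
    print(part, dist, home_address(part, dist, 101))
```

A lookup then needs both field values, so this only helps when the distributor is known at retrieval time.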