File Structures SNU-OOPSLA Lab. 1
Chapter 6. Organizing Files for Performance
Seoul National University, School of Computer Science and Engineering, Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB)
Prof. Hyoung-Joo Kim
File Structures by Folk, Zoellick, and Riccardi

Chapter Objectives (1)
- Look at several approaches to data compression
- Look at storage compaction as a simple way of reusing space in a file
- Develop a procedure for deleting fixed-length records that allows vacated file space to be reused dynamically
- Illustrate the use of linked lists and stacks to manage an avail list
- Consider several approaches to the problem of deleting variable-length records
- Introduce the concepts associated with the terms internal fragmentation and external fragmentation

Chapter Objectives (2)
- Outline some placement strategies associated with the reuse of space in a variable-length record file
- Provide an introduction to the idea underlying a binary search
- Undertake an examination of the limitations of binary searching
- Develop a keysort procedure for sorting larger files; investigate the costs associated with keysort
- Introduce the concept of a pinned record

Contents
6.1 Data compression
6.2 Reclaiming space in files
6.3 Finding things quickly: An Introduction to internal sorting and binary searching
6.4 Keysorting

6.1 Data Compression

Data Compression (1)
Reasons for data compression
- less storage
- faster transmission, which decreases access time
- faster sequential processing

Data Compression (2): Using a different notation
- Fixed-length fields are good candidates
- Decrease the number of bits by finding a more compact notation
  ex) the original state field takes 16 bits, but since there are only 50 states, a 6-bit code is enough (2^6 = 64 >= 50)
- Cons.
  - unreadable by humans
  - cost in encoding time
  - decoding modules increase the complexity of the software
  => used only for particular applications
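The 6-bit notation can be sketched as follows. This is only an illustration, assuming each state has already been mapped to a number from 0 to 49; the function name `pack6` and the most-significant-bit-first layout are choices made here, not something the slides specify.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pack 6-bit codes (values 0..63) back to back into bytes,
// most significant bit first (hypothetical layout).
std::vector<std::uint8_t> pack6(const std::vector<std::uint8_t>& codes) {
    std::vector<std::uint8_t> out;
    std::uint32_t acc = 0;  // bit accumulator; only the low nbits bits matter
    int nbits = 0;          // number of pending bits in acc
    for (std::uint8_t c : codes) {
        assert(c < 64);     // each code must fit in 6 bits
        acc = (acc << 6) | c;
        nbits += 6;
        while (nbits >= 8) {  // emit every complete byte
            out.push_back(static_cast<std::uint8_t>((acc >> (nbits - 8)) & 0xFF));
            nbits -= 8;
        }
    }
    if (nbits > 0)  // flush the leftover bits, padded with zeros on the right
        out.push_back(static_cast<std::uint8_t>((acc << (8 - nbits)) & 0xFF));
    return out;
}
```

Four state codes fit in 3 bytes this way, where the 16-bit notation would need 8 bytes. The price is the one listed under Cons.: the bytes are unreadable without a decoder.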

Data Compression (3): Suppressing repeating sequences
Run-length encoding algorithm
- read through the pixels, copying pixel values to the file in sequence, except where the same pixel value occurs more than once in succession
- where the same value occurs more than once in succession, substitute the following three bytes:
  - a special run-length code indicator (ex: ff)
  - the pixel value that is repeated
  - the number of times that the value is repeated
ex) 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
 => 22 23 ff 24 07 25 ff 26 06 25 24
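A minimal sketch of the run-length encoder described on the slide. It assumes one byte per pixel, that the indicator value (0xff here) never appears as a real pixel value, and that runs longer than 255 are split; none of these details are fixed by the slide, and the name `rleEncode` is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> rleEncode(const std::vector<std::uint8_t>& pixels,
                                    std::uint8_t indicator = 0xFF) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < pixels.size()) {
        std::uint8_t value = pixels[i];
        assert(value != indicator);  // assumption: indicator is not a pixel value
        std::size_t run = 1;         // length of the run starting at i
        while (i + run < pixels.size() && pixels[i + run] == value && run < 255)
            ++run;
        if (run > 1) {               // repeated value: indicator, value, count
            out.push_back(indicator);
            out.push_back(value);
            out.push_back(static_cast<std::uint8_t>(run));
        } else {                     // single value: copy as is
            out.push_back(value);
        }
        i += run;
    }
    return out;
}
```

Applied to the slide's example (read as hexadecimal bytes), this produces 22 23 ff 24 07 25 ff 26 06 25 24. Following the slide literally, a run of only two pixels is replaced by three bytes, so such runs make the output longer.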

Representation of an Image in a Computer
[Figure: the screen is divided into pixels. For each pixel, the light intensity arrives as an electrical signal (analog) and is converted to a number (digital).]

Representation of an Image in a Computer (cont'd)
[Figure: the screen stored as a matrix of pixel values]
12  8 12 33 99 12
56  7 13 44 66 23
12  4 34 57 99 12
...

Representation of a Color Image in a Computer
[Figure: the screen stored as three overlapping matrices of pixel values, each like the following]
12  8 12 33 99 12
56  7 13 44 66 23
12  4 34 57 99 12
...
** Video: the still image is replaced 25 to 30 times per second

Data Compression (3): Suppressing repeating sequences (cont'd)
Run-length encoding
- an example of redundancy reduction
- cons.
  - does not guarantee any particular amount of space savings
  - under some circumstances, the compressed image is larger than the original image
    Why? Can you prevent this?

Data Compression (4): Assigning variable-length codes
- Morse code: the oldest and most common variable-length code
- Some values occur more frequently than others
  - those values should take the least amount of space
- Huffman coding
  - based on probability of occurrence
  - determine the probability of each value occurring
  - build a binary tree with a search path for each value
  - more frequently occurring values are given shorter search paths in the tree

Data Compression (5): Assigning variable-length codes
Huffman coding
Letter:  a    b    c    d     e     f     g
Prob:    0.4  0.1  0.1  0.1   0.1   0.1   0.1
Code:    1    010  011  0000  0001  0010  0011
ex) the string "abde" is encoded as 1 010 0000 0001 = 101000000001
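The code table from the slide can be exercised with a small encoder and decoder. This is a sketch that stores bits as the characters '0' and '1' for readability; a real implementation would pack them into bytes. The names `kCodes`, `encode`, and `decode` are invented for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Code table copied from the slide.
const std::map<char, std::string> kCodes = {
    {'a', "1"},    {'b', "010"},  {'c', "011"}, {'d', "0000"},
    {'e', "0001"}, {'f', "0010"}, {'g', "0011"}};

std::string encode(const std::string& text) {
    std::string bits;
    for (char ch : text) {
        auto it = kCodes.find(ch);
        assert(it != kCodes.end());  // only the letters a..g are defined
        bits += it->second;
    }
    return bits;
}

// Because no code is a prefix of another, the first code that matches
// the bits collected so far is the only possible one.
std::string decode(const std::string& bits) {
    std::string text, current;
    for (char bit : bits) {
        current += bit;
        for (const auto& entry : kCodes) {
            if (entry.second == current) {
                text += entry.first;
                current.clear();
                break;
            }
        }
    }
    assert(current.empty());  // input must end on a code boundary
    return text;
}
```

`encode("abde")` yields the 12-bit string shown on the slide, compared with 32 bits for four 8-bit characters.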

Huffman Tree
[Figure: the binary tree for the codes above, shown here by code prefix]
root
|- 0
|  |- 00
|  |  |- 000: d(0000), e(0001)
|  |  |- 001: f(0010), g(0011)
|  |- 01: b(010), c(011)
|- 1: a(1)
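One way to build such a tree is the standard Huffman procedure: repeatedly merge the two least frequent nodes. The sketch below returns only the code length of each symbol; the function name and the use of integer weights (for instance the probabilities times 10) are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns the Huffman code length of each symbol, given its weight.
std::vector<int> huffmanCodeLengths(const std::vector<long>& weights) {
    assert(!weights.empty());
    struct Node { long weight; int left, right; };  // children are node indices, -1 for a leaf
    std::vector<Node> nodes;
    using Item = std::pair<long, int>;              // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;  // min-heap
    for (std::size_t i = 0; i < weights.size(); ++i) {
        nodes.push_back({weights[i], -1, -1});
        pq.push({weights[i], static_cast<int>(i)});
    }
    while (pq.size() > 1) {  // merge the two lightest nodes under a new parent
        Item a = pq.top(); pq.pop();
        Item b = pq.top(); pq.pop();
        nodes.push_back({a.first + b.first, a.second, b.second});
        pq.push({a.first + b.first, static_cast<int>(nodes.size()) - 1});
    }
    // A parent always has a larger index than its children, so walking
    // from the root (last node) downward sets each depth before it is used.
    std::vector<int> depth(nodes.size(), 0);
    for (int i = static_cast<int>(nodes.size()) - 1; i >= 0; --i) {
        if (nodes[i].left >= 0) {
            depth[nodes[i].left] = depth[i] + 1;
            depth[nodes[i].right] = depth[i] + 1;
        }
    }
    depth.resize(weights.size());  // keep only the leaves (the symbols)
    return depth;
}
```

With weights 4, 1, 1, 1, 1, 1, 1 for a to g, ties between equal weights can be broken in more than one way, so the individual lengths may differ from the slide's table. The total weighted length is the same for every Huffman tree, here 26, which corresponds to 2.6 bits per letter on average.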

Data Compression (6): Irreversible compression techniques
- Some information can be sacrificed
- Less common in data files
- Shrinking a raster image
  - 400-by-400 pixels to 100-by-100 pixels: 1 pixel for every 16 pixels
- Speech compression
  - voice coding (the lost information is of little or no value)

Compression in UNIX
- System V: pack & unpack
  - use Huffman codes
  - after compressing a file, pack appends ".z" to the name of the packed file
- Berkeley UNIX: compress & uncompress
  - use the Lempel-Ziv method
  - after compressing a file, compress appends ".Z" to the name of the compressed file

6.2 Reclaiming Space in Files

Record Deletion and Storage Compaction
Storage compaction
- record deletion: just mark each deleted record
- reclamation: the space of all deleted records is reclaimed later, at once
  => pros: delete/undelete operations take little effort
Ex)
Original file:
  Ames|123|OK|......|Morrison|9035|OK|Brown|625|IA|......|
After deleting the second record (its first bytes are overwritten with "*|"):
  Ames|123|OK|...|*|rrison|9035|OK|Brown|625|IA|...|
After compaction:
  Ames|123|OK|...|Brown|625|IA|...|
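The mark-then-compact idea can be sketched in memory as follows. The file is modeled as a vector of record strings, and the "*|" marker follows the example above; the function names are hypothetical, and a real compaction pass would copy records from one disk file to another.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A record counts as deleted if it starts with the '*' marker.
bool isDeleted(const std::string& record) {
    return !record.empty() && record[0] == '*';
}

// Deleting only overwrites the first two bytes, so the rest of the
// record stays in place until compaction.
void markDeleted(std::string& record) {
    assert(record.size() >= 2);
    record[0] = '*';
    record[1] = '|';
}

// Compaction: copy every record that is not marked as deleted.
std::vector<std::string> compact(const std::vector<std::string>& file) {
    std::vector<std::string> out;
    for (const std::string& rec : file)
        if (!isDeleted(rec)) out.push_back(rec);
    return out;
}
```

Deleting is cheap because nothing moves; the cost is paid later, all at once, when `compact` rewrites the whole file.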

Deleting Fixed-length Records for Reclaiming Space Dynamically (1)
- Reuse the space from deleted records as soon as possible
  - deleted records must be marked in a special way
  - we must be able to find the deleted space
- To reuse records quickly, we need
  - a way to know immediately whether there are empty slots in the file
  - a way to jump directly to one of those slots if they exist
  => linked list or stack for the avail list
* avail list: a list made up of the deleted records

Deleting Fixed-length Records for Reclaiming Space Dynamically (2)
[Figure: the linked list]
  Head pointer -> ptr -> ptr -> ptr -> ptr -> -1
[Figure: the stack]
  (a) Head pointer -> RRN 5 -> RRN 2 -> -1
  (b) after pushing the record with RRN 3: Head pointer -> RRN 3 -> RRN 5 -> RRN 2 -> -1

Deleting Fixed-length Records for Reclaiming Space Dynamically (3)
Linking and stacking deleted records
- arranging and rearranging links makes one available record slot point to the next
- the second field of a deleted record points to the next deleted record

Sample file showing linked list of deleted records
RRN:  0           1            2         3            4           5            6
(a) List head (first available record) = 5, after deleting records 3 and 5
      Edwards...  Betas...     Wills...  *-1          Masters...  *3           Chavez...
(b) List head (first available record) = 1, after deleting record 1
      Edwards...  *5           Wills...  *-1          Masters...  *3           Chavez...
(c) List head (first available record) = -1, after inserting three new records
      Edwards...  1st new rec  Wills...  3rd new rec  Masters...  2nd new rec  Chavez...
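The sample file can be reproduced with a small in-memory model of the avail list kept as a stack. This is a sketch under assumptions: records are strings, a deleted slot stores "*" followed by the RRN of the next free slot, and the struct and member names are invented here.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct FixedFile {
    std::vector<std::string> slots;  // slots[i] holds the record with RRN i
    int availHead = -1;              // RRN of the first available slot, -1 if none

    // Delete: push the slot onto the avail stack.
    void remove(int rrn) {
        assert(rrn >= 0 && rrn < static_cast<int>(slots.size()));
        slots[rrn] = "*" + std::to_string(availHead);  // link to the old head
        availHead = rrn;
    }

    // Insert: pop a slot from the avail stack, or append if the stack is empty.
    // Returns the RRN where the record was stored.
    int insert(const std::string& record) {
        if (availHead == -1) {
            slots.push_back(record);
            return static_cast<int>(slots.size()) - 1;
        }
        int rrn = availHead;
        availHead = std::stoi(slots[rrn].substr(1));  // next free slot becomes the head
        slots[rrn] = record;
        return rrn;
    }
};
```

Deleting records 3, 5, and 1 in that order and then inserting three records fills slots 1, 5, and 3, matching parts (a) to (c) of the figure: the most recently deleted slot is reused first.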

Deleting Variable-length Records
- Avail list of variable-length records
  - each record has a byte count at its beginning
  - links use byte offsets instead of RRNs
- Adding and removing records
  - when adding a record, search through the avail list for a slot of the right size (=> big enough)

Removal of a record from an avail list with variable-length records
(a) Before removal: head -> size 47 -> size 38 -> size 72 -> size 68 -> -1
(b) After removal: the removed record is unlinked, and a new link connects its predecessor directly to its successor
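Searching the avail list for a slot that is big enough and unlinking it might look like the sketch below. The avail list is modeled with `std::list` instead of byte-offset links stored in the file, and the offsets used in the usage note are made-up values.

```cpp
#include <cassert>
#include <list>

struct Slot {
    long offset;  // byte offset of the deleted record in the file
    int size;     // byte count of the deleted record
};

// Remove and return the first slot with at least `needed` bytes.
// Returns a slot with offset -1 if no slot is big enough.
Slot takeFirstFit(std::list<Slot>& avail, int needed) {
    assert(needed > 0);
    for (auto it = avail.begin(); it != avail.end(); ++it) {
        if (it->size >= needed) {
            Slot found = *it;
            avail.erase(it);  // the predecessor now links to the successor
            return found;
        }
    }
    return Slot{-1, 0};  // caller appends the record at the end of the file
}
```

For an avail list with sizes 47, 38, 72, 68 and a new record of 55 bytes, the search skips 47 and 38 and removes the 72-byte slot, leaving 47, 38, 68 on the list.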

Storage Fragmentation
- Internal fragmentation (in fixed-length records)
  - wasted space within a record
  - variable-length records minimize wasted space by doing away with internal fragmentation
- External fragmentation (in variable-length records)
  - unused space outside or between individual records
  - three possible solutions
    - storage compaction
    - coalescing the holes: combine adjacent holes into a single, larger record slot
    - minimizing fragmentation by adopting a placement strategy

Internal Fragmentation in Fixed-length Records
64-byte fixed-length records; the dots mark unused space (internal fragmentation)
Ames | John | 123 Maple | Stillwater | OK | 740751 |...................................
Morrison | Sebastian | 9035 South Hillcrest | Forest Village | OK | 74820 |
Brown | Martha | 625 Kimbark | Des Moines | IA | 50311 | .........................

External Fragmentation in Variable-length Records
Each record begins with its record length:
Record[1]: 40 Ames | John | 123 Maple | Stillwater | OK | 740751 |
Record[2]: 64 Morrison | Sebastian | 9035 South Hillcrest | Forest Village | OK | 74820 |
Record[3]: 45 Brown | Martha | 625 Kimbark | Des Moines | IA | 50311 |
ex) Delete Record[2] and insert a new Record[i] in its place:
Record[i]: 52 Adams | Kits | 3301 Washington D.C | Forest Village | IA | 43563 |
The new record uses 52 of the 64 freed bytes; the remaining 12 bytes are unused space (external fragmentation)

Placement Strategies
- First-fit
  - select the first available record slot that is big enough
  - suitable when lost space is due to internal fragmentation
- Best-fit
  - select the available record slot closest in size
  - keep the avail list in ascending order of size
  - suitable when lost space is due to internal fragmentation
- Worst-fit
  - select the largest record slot
  - keep the avail list in descending order of size
  - suitable when lost space is due to external fragmentation
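The three strategies differ only in which slot they pick. The sketch below selects from an unsorted vector of slot sizes for clarity; keeping the avail list sorted, as the slide suggests, would let best-fit and worst-fit stop at the first suitable slot. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each function returns the index of the chosen slot, or -1 if none fits.

int firstFit(const std::vector<int>& sizes, int needed) {
    assert(needed > 0);
    for (std::size_t i = 0; i < sizes.size(); ++i)
        if (sizes[i] >= needed) return static_cast<int>(i);
    return -1;
}

int bestFit(const std::vector<int>& sizes, int needed) {
    assert(needed > 0);
    int best = -1;  // smallest slot seen so far that is still big enough
    for (std::size_t i = 0; i < sizes.size(); ++i)
        if (sizes[i] >= needed && (best == -1 || sizes[i] < sizes[best]))
            best = static_cast<int>(i);
    return best;
}

int worstFit(const std::vector<int>& sizes, int needed) {
    assert(needed > 0);
    int worst = -1;  // largest slot seen so far
    for (std::size_t i = 0; i < sizes.size(); ++i)
        if (worst == -1 || sizes[i] > sizes[worst])
            worst = static_cast<int>(i);
    if (worst != -1 && sizes[worst] < needed) return -1;  // even the largest is too small
    return worst;
}
```

With slots of 47, 38, 72, and 68 bytes and a 60-byte record, first-fit and worst-fit both pick the 72-byte slot (12 bytes left over), while best-fit picks the 68-byte slot (8 bytes left over).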

6.3 Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching

Finding Things Quickly (1)
- Goal: minimize the number of disk accesses
- Finding things in simple field and record files may require many seeks
- Binary search algorithm for fixed-sized records

    int BinarySearch(FixedRecordFile &file, RecType &obj, KeyType &key)
    // binary search for key; the file must be sorted in ascending key order
    {
        int low = 0;
        int high = file.NumRecs() - 1;
        while (low <= high)
        {
            int guess = low + (high - low) / 2;    // midpoint of the current range
            file.ReadByRRN(obj, guess);
            if (obj.Key() == key) return 1;        // record found
            if (obj.Key() < key) low = guess + 1;  // key is larger: search after guess
            else high = guess - 1;                 // key is smaller: search before guess
        }
        return 0; // loop ended without finding key
    }

Classes and Methods for Binary Search

    class KeyType {
    public:
        int operator == (KeyType &);
        int operator < (KeyType &);
    };

    class RecType {
    public:
        KeyType Key();
    };

    class FixedRecordFile {
    public:
        int NumRecs();
        int ReadByRRN (RecType & Record, int RRN);
    };
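The class declarations above have no bodies, so the search cannot be run as shown. The sketch below fills them in with hypothetical in-memory stand-ins (string keys, a vector in place of the disk file, and a `reads` counter for simulated disk accesses) purely to make the algorithm executable; none of these implementation details come from the slides.

```cpp
#include <cassert>
#include <string>
#include <vector>

class KeyType {
public:
    explicit KeyType(const std::string& v = "") : value(v) {}
    int operator==(const KeyType& other) const { return value == other.value; }
    int operator<(const KeyType& other) const { return value < other.value; }
private:
    std::string value;
};

class RecType {
public:
    explicit RecType(const std::string& k = "") : key(k) {}
    KeyType Key() const { return KeyType(key); }
private:
    std::string key;
};

// In-memory stand-in for a file of fixed-length records sorted by key.
class FixedRecordFile {
public:
    explicit FixedRecordFile(const std::vector<std::string>& sortedKeys) : keys(sortedKeys) {}
    int NumRecs() const { return static_cast<int>(keys.size()); }
    int ReadByRRN(RecType& record, int rrn) {
        assert(rrn >= 0 && rrn < NumRecs());
        record = RecType(keys[rrn]);
        ++reads;  // each call stands for one disk access
        return 1;
    }
    int reads = 0;
private:
    std::vector<std::string> keys;
};

// Returns 1 if key is found (obj then holds the record), 0 otherwise.
int BinarySearch(FixedRecordFile& file, RecType& obj, const KeyType& key) {
    int low = 0, high = file.NumRecs() - 1;
    while (low <= high) {
        int guess = low + (high - low) / 2;
        file.ReadByRRN(obj, guess);
        if (obj.Key() == key) return 1;
        if (obj.Key() < key) low = guess + 1;  // key lies after guess
        else high = guess - 1;                 // key lies before guess
    }
    return 0;
}
```

Searching a four-record file for HARRISON reads RRN 1 and then RRN 2, so the record is found after two simulated disk accesses.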

Finding Things Quickly (2)
Binary search vs. sequential search
- binary search
  - O(log n)
  - the list must be sorted by key
- sequential search
  - O(n)
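To make the O(log n) versus O(n) difference concrete, the worst-case number of records read can be computed directly. This is a small sketch; the function names are invented, and the counts are worst cases, not averages.

```cpp
#include <cassert>

// Worst-case reads for binary search over n sorted records:
// the range is halved on every read, giving floor(log2 n) + 1 reads.
int maxBinaryReads(long n) {
    assert(n >= 0);
    int reads = 0;
    while (n > 0) {
        ++reads;
        n /= 2;
    }
    return reads;
}

// Worst-case reads for sequential search: every record is examined.
long maxSequentialReads(long n) {
    assert(n >= 0);
    return n;
}
```

For a file of 1,000 records, binary search reads at most 10 records, while sequential search may read all 1,000.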

Finding Things Quickly (3)
- Sorting a disk file in RAM
  - read the entire file from disk into memory
  - use an internal sort (= sort in memory)
  - the UNIX sort utility uses an internal sort
- Limitations of binary search & internal sort
  - binary search requires more than one or two accesses
    c.f.) single access by RRN
  - keeping a file sorted is very expensive
  - an internal sort works only on small files

Internal Sort
[Figure: unsorted file (disk) --read the entire file--> unsorted file (memory) --sort in memory--> sorted file]

6.4 Keysorting

Key Sorting & Its Limitations
- Also called "tag sort": only the keys are sorted
- Sorting procedure
  - read only the keys into memory
  - sort the keys
  - rearrange the records in the file according to the sorted keys
- Advantage
  - needs less RAM than an internal sort
- Disadvantages (= limitations)
  - the records must be read from disk twice
  - a lot of seeking for records is needed to construct the new (sorted) file

[Figure: conceptual view of the KEYNODES array in RAM and the records on secondary storage]
Before sorting:
  In RAM (KEYNODES array)    On secondary storage
  KEY        RRN             Records
  HARRISON   1               Harrison|Susan|387 Eastern....
  KELLOG     2               Kellog|Bill|17 Maple....
  HARRIS     3               Harris|Margaret|4343 West....
  ...        ...             ...
  BELL       k               Bell|Robert|8912 Hill....
After sorting keys in RAM (the records on secondary storage are unchanged):
  KEY        RRN
  BELL       k
  HARRIS     3
  HARRISON   1
  KELLOG     2
  ...        ...

Pseudocode for keysort (1)

    Program: keysort
        open input file as IN_FILE
        create output file as OUT_FILE
        read header record from IN_FILE and write a copy to OUT_FILE
        REC_COUNT := record count from header record
        /* read in records; set up KEYNODES array */
        for i := 1 to REC_COUNT
            read record from IN_FILE into BUFFER
            extract canonical key and place it in KEYNODES[i].KEY
            KEYNODES[i].RRN := i
        (continued....)

Pseudocode for keysort (2)

        /* sort KEYNODES[].KEY, thereby ordering RRNs correspondingly */
        sort(KEYNODES, REC_COUNT)
        /* read in records according to sorted order, and write them out in this order */
        for i := 1 to REC_COUNT
            seek in IN_FILE to record with RRN of KEYNODES[i].RRN
            read the record into BUFFER from IN_FILE
            write BUFFER contents to OUT_FILE
        close IN_FILE and OUT_FILE
    end PROGRAM
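The keysort pseudocode can be sketched in C++ as follows. This is an in-memory model, not the textbook's implementation: the input and output files are vectors of record strings, RRNs start at 0 rather than 1, there is no header record, and the canonical key is assumed to be the first field in upper case.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct KeyNode {
    std::string key;  // canonical key
    std::size_t rrn;  // position of the record in the input file
};

// Assumption: the canonical key is the first '|'-delimited field, upper-cased.
std::string canonicalKey(const std::string& record) {
    std::string key = record.substr(0, record.find('|'));
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return key;
}

std::vector<std::string> keysort(const std::vector<std::string>& inFile) {
    // First pass: read every record, but keep only (key, RRN) in memory.
    std::vector<KeyNode> keynodes;
    for (std::size_t i = 0; i < inFile.size(); ++i)
        keynodes.push_back({canonicalKey(inFile[i]), i});

    // Sort the keys; the RRNs travel with them.
    std::sort(keynodes.begin(), keynodes.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });

    // Second pass: fetch each record by RRN (one seek per record on disk)
    // and write it to the output in sorted order.
    std::vector<std::string> outFile;
    for (const KeyNode& node : keynodes) {
        assert(node.rrn < inFile.size());
        outFile.push_back(inFile[node.rrn]);
    }
    return outFile;
}
```

The second loop is where the cost noted under Disadvantages appears: on disk, `inFile[node.rrn]` is a random seek for every record, and each record is read a second time.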

Two Solutions: why bother to write the file back?
- Write out the sorted KEYNODES[] array without writing the records back in sorted order
- Use the KEYNODES[] array as an index file

Relationship between the index file and the data file
  Index file (sorted by key)     Original file (unchanged order)
  KEY        RRN                 RRN   Records
  BELL       k                   1     Harrison|Susan|387 Eastern....
  HARRIS     3                   2     Kellog|Bill|17 Maple....
  HARRISON   1                   3     Harris|Margaret|4343 West....
  KELLOG     2                   ...   ...
  ...        ...                 k     Bell|Robert|8912 Hill....
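Using the sorted KEYNODES array as an index means a lookup becomes a binary search over the index followed by a single read by RRN in the data file. A minimal sketch, assuming the index fits in memory and using made-up names and zero-based RRNs:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;
    int rrn;  // position of the record in the unsorted data file
};

// Returns the RRN for `key`, or -1 if the key is not in the index.
// The index must be sorted by key.
int lookup(const std::vector<IndexEntry>& index, const std::string& key) {
    assert(std::is_sorted(index.begin(), index.end(),
                          [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; }));
    auto it = std::lower_bound(index.begin(), index.end(), key,
                               [](const IndexEntry& e, const std::string& k) { return e.key < k; });
    if (it != index.end() && it->key == key) return it->rrn;
    return -1;
}
```

The data file is never rearranged, so its records keep their original positions, and only the much smaller index has to be kept sorted.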

Pinned records (1)
- Pinned records are records whose physical location is referenced by other records
- Their physical location cannot be changed freely, or dangling references would result
- Pinned records make sorting more difficult and sometimes impossible
  - solution: use an index file, while keeping the actual data file in its original order

Pinned records (2)
[Figure: file with pinned records. Record(i) and Record(i+1) each point to a pinned record; deleting a pinned record leaves a dangling pointer.]

Let's Review !!!
6.1 Data compression
6.2 Reclaiming space in files
6.3 Finding things quickly: An Introduction to internal sorting and binary searching
6.4 Keysorting