File Processing - Fundamental concepts MVNC 1
Fundamental File Structure Concepts
Chapter 4
Record and Field Structure
A record is a collection of fields.
A field is used to store information about some attribute.
The question: when we write records, how do we organize the fields in the records
» so that the information can be recovered
» so that we save space
» so that we can process efficiently
» to maximize record structure flexibility
Field Structure issues
What if
» Field values vary greatly
» Fields are optional
Field Delineation methods
Fixed length fields
Include length with field
Separate fields with a delimiter
Include keyword expression to identify each field
Fixed length fields
Easy to implement - use language record structures (no parsing)
Fields must be declared at maximum length needed
Field:  last  first  address  city  state  zip
Width:  10    10     15       15    2      9
"Yeakus    Bill      123 Pine       Utica          OH43050    "
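As a rough illustration of fixed length fields, the sketch below pads each value to its declared width and finds a field again by summing the widths before it. The function names packFixed and fieldAt are hypothetical helpers, not part of the course code, and truncating over-long values is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Pad (or truncate) each value to its declared width and concatenate the results.
// Truncating an over-long value is an assumption made for this sketch.
std::string packFixed(const std::vector<std::string>& values,
                      const std::vector<std::size_t>& widths) {
    std::string record;
    for (std::size_t i = 0; i < widths.size(); ++i) {
        std::string field = i < values.size() ? values[i] : "";
        field.resize(widths[i], ' ');  // cuts if too long, pads with blanks if short
        record += field;
    }
    return record;
}

// Extract field number `index` by summing the widths of the fields before it.
std::string fieldAt(const std::string& record,
                    const std::vector<std::size_t>& widths, std::size_t index) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < index; ++i) offset += widths[i];
    return record.substr(offset, widths[index]);
}
```

With the widths 10, 10, 15, 15, 2, 9 every record is 61 bytes long, and no parsing is needed to locate a field.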
Include length with field
Begin field with length indicator
If maximum field length <256, a byte can be used for length
Fields: last first address city state zip
Example values: Yeakus Bill 123 Pine
Stored bytes in hex (each field begins with its length byte):
06 59 65 61 6B 75 73 04 42 69 6C 6C 08 31 32 33 20 50 69 6E 65 . .
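A minimal sketch of length-prefixed fields, assuming one length byte per field as described above. The helpers appendField and splitFields are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Append one field as a single length byte followed by the field's bytes.
// Returns false (and appends nothing) if the length does not fit in one byte.
bool appendField(std::string& buffer, const std::string& field) {
    if (field.size() > 255) return false;
    buffer.push_back(static_cast<char>(field.size()));
    buffer += field;
    return true;
}

// Walk the buffer, reading a length byte and then that many data bytes.
std::vector<std::string> splitFields(const std::string& buffer) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos < buffer.size()) {
        std::size_t len = static_cast<unsigned char>(buffer[pos++]);
        if (pos + len > buffer.size()) break;  // truncated field: stop
        fields.push_back(buffer.substr(pos, len));
        pos += len;
    }
    return fields;
}
```

Packing "Yeakus", "Bill" and "123 Pine" this way produces 21 bytes, with the length bytes 06, 04 and 08 at positions 0, 7 and 12.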
Separate fields with a delimiter
Use a special character not used in data
» space, comma, tab
» Also special ASCII characters: Field Separator (fs), hex 1C
» Here we use "|"
Also need an end-of-record delimiter: "#"
"Yeakus|Bill|123 Pine|Utica|OH|43050#"
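One way the parsing of such a record could look, as a sketch only: the function parseDelimited is a hypothetical name, and it assumes "|" between fields and "#" at the end of the record, as in the example.

```cpp
#include <string>
#include <vector>

// Split one record of the form "f1|f2|...|fn#" into its fields.
// Adjacent field delimiters yield an empty field.
std::vector<std::string> parseDelimited(const std::string& record,
                                        char fieldDelim = '|',
                                        char recordDelim = '#') {
    std::vector<std::string> fields;
    std::string current;
    for (char c : record) {
        if (c == fieldDelim) {
            fields.push_back(current);
            current.clear();
        } else if (c == recordDelim) {
            fields.push_back(current);
            return fields;
        } else {
            current.push_back(c);
        }
    }
    return fields;  // no end-of-record delimiter seen: the trailing partial field is dropped
}
```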
Include keyword expression
Keywords label each field
A self-describing structure
Allows LOTS of flexibility
Uses lots of space
"LAST=Yeakus|FIRST=Bill|ADDRESS=123 Pine|CITY=Utica|STATE=OH|ZIP=43050#"
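A keyword record can be read into a keyword-to-value map, which also shows why a missing field costs nothing: it is simply absent from the map. This is a sketch under the assumptions of the example ("=" between keyword and value, "|" between pairs, "#" ending the record); parseKeyword is a hypothetical name.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Parse "KEY=value|KEY=value|...#" into a keyword -> value map.
std::map<std::string, std::string> parseKeyword(const std::string& record) {
    std::map<std::string, std::string> fields;
    std::size_t pos = 0;
    while (pos < record.size() && record[pos] != '#') {
        std::size_t end = record.find_first_of("|#", pos);
        if (end == std::string::npos) end = record.size();
        std::string pair = record.substr(pos, end - pos);
        std::size_t eq = pair.find('=');
        if (eq != std::string::npos)
            fields[pair.substr(0, eq)] = pair.substr(eq + 1);
        if (end >= record.size() || record[end] == '#') break;  // end of record
        pos = end + 1;
    }
    return fields;
}
```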
Optional Fields
Fixed length
» Leave blank
Field length
» zero length field
Delimiter
» Adjacent delimiters
Keywords
» Just leave out
Reading a stream of fields
Need to break record into fields
Fixed length can simply be read into record structure
Others must be "parsed" with a parse algorithm
Record Structures
How do we organize records in a file?
Records can be fixed length or variable length
» Fixed length allows simple direct access lookup
» Fixed may waste space
» Variable - how do we find a record's position?
Record Structures
Fixed Length Records
Fixed number of fields in records
Variable length
» prefix each record with a length
» Use a second file to keep track of record start positions
» Place delimiter between records
Fixed Length Records
All records same length
Record positions can be calculated for direct access reads.
Does not imply that the sizes or number of fields are fixed.
Variable length data stored in fixed length records leads to unused space.
Fixed number of fields in records
Field size could be fixed or variable
Fixed sized fields
» results in fixed size records
» simply read directly into "struct"
Variable sized fields
» delimited or field lengths
» Simply count fields while parsing
Variable length Records
prefix each record with a length
Use a second file to keep track of record start positions
Place delimiter between records
Prefix records with a length
Allows true variable length records
Form of prefix:
» Character number (fixed length)
» Binary number (write integer without conversion)
» Must consider Maximum length
No direct access (great for sequential access)
Index of record start addresses
A second file is simply a list of offsets to successive records
Since the offsets are fixed length, this file allows direct access, thereby allowing direct access to the main file.
Problem
» Maintaining file (adding and deleting records)
» Cost of index
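To make the idea concrete, the sketch below builds the list of start offsets for newline-delimited records and then fetches a record by number. It keeps both the data and the index in memory rather than in two files, which is a simplification; buildIndex and recordAt are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Scan the data once and note where each delimited record starts.
std::vector<std::size_t> buildIndex(const std::string& data,
                                    char recordDelim = '\n') {
    std::vector<std::size_t> offsets;
    std::size_t start = 0;
    while (start < data.size()) {
        offsets.push_back(start);
        std::size_t end = data.find(recordDelim, start);
        if (end == std::string::npos) break;
        start = end + 1;
    }
    return offsets;
}

// Fetch record `n` directly by looking up its start offset in the index.
std::string recordAt(const std::string& data,
                     const std::vector<std::size_t>& offsets,
                     std::size_t n, char recordDelim = '\n') {
    std::size_t start = offsets[n];
    std::size_t end = data.find(recordDelim, start);
    if (end == std::string::npos) end = data.size();
    return data.substr(start, end - start);
}
```

The maintenance problem is visible here as well: inserting or deleting a record shifts every later offset, so the index must be rebuilt or patched.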
Place delimiter between records
Special character not used in record
Allows efficient variable size
No direct access
Bible files - use '\n' as delimiter
Binary data in files
Binary reals and integers can be written, and read, from a file:
» Need to know byte size of variables used.
» "sizeof" operator returns data size
Binary data in files
#include <string.h> // strcpy, strlen
#include <unistd.h> // read, write

int rsize;
char rec_buf[MAX];
...
strcpy(rec_buf, "this is a test record");
rsize = strlen(rec_buf);
write(my_fd, &rsize, sizeof(int)); // write the size
write(my_fd, rec_buf, rsize); // write the record
...
read(my_fd, &rsize, sizeof(int)); // read the size
read(my_fd, rec_buf, rsize); // read the record
Viewing Binary file data
Use the file dump utility (od - octal dump)
» od -xc <filename>
» x - hex output
» c - character output
Useful for viewing what is actually in file
Using Classes to Manipulate Buffer
Three Classes
» Delimited fields
» Length-based fields
» Fixed length fields
Class for Delimited fields
Consider a class to manage delimited text buffers
» Allows reading and writing of delimited records
» Allows packing and unpacking
Class for Delimited fields
class Person
{
public:
// fields
char LastName [11];
char FirstName [11];
char Address [16];
char City [16];
char State [3];
char ZipCode [10];
// Methods next ...
};
Class for Delimited fields
class DelimTextBuffer
{ public:
DelimTextBuffer (char Delim = '|', int maxBytes = 1000);
void Clear (); // reset the buffer; called by Read and by Person::Pack
int Read (istream &);
int Write (ostream &) const;
int Pack (const char *, int size = -1);
int Unpack (char *);
private:
char Delim;
char DelimStr[2]; // zero terminated string for Delim
char * Buffer; // character array to hold field values
int BufferSize; // size of packed fields
int MaxBytes; // maximum number of characters in the buffer
int NextByte; // packing/unpacking position in buffer
};
Class for Delimited fields
Packing a buffer
Person Bill_Yeakus;
DelimTextBuffer buffer;
buffer.Pack(Bill_Yeakus.LastName);
buffer.Pack(Bill_Yeakus.FirstName);
…
buffer.Pack(Bill_Yeakus.ZipCode);
buffer.Write(stream);
Class for Delimited fields
int DelimTextBuffer :: Pack (const char * str, int size)
// set the value of the next field of the buffer;
// if size = -1 (default) use strlen(str) as length of field
{
    int len; // length of string to be packed
    if (size >= 0) len = size;
    else len = strlen (str);
    if (len > (int) strlen(str)) // str is too short!
        return FALSE;
    int start = NextByte; // first character to be packed
    if (start + len + 1 > MaxBytes) return FALSE; // no room; NextByte is left unchanged
    NextByte += len + 1;
    memcpy (&Buffer[start], str, len);
    Buffer [start+len] = Delim; // add delimiter
    BufferSize = NextByte;
    return TRUE;
}
Class for Delimited fields
int DelimTextBuffer :: Write (ostream & stream) const
{
stream . write ((char*)&BufferSize, sizeof(BufferSize));
stream . write (Buffer, BufferSize);
return stream . good ();
}
Class for Delimited fields
int DelimTextBuffer :: Read (istream & stream)
{
Clear ();
stream . read ((char*)&BufferSize, sizeof(BufferSize));
if (stream.fail()) return FALSE;
if (BufferSize > MaxBytes) return FALSE; // buffer overflow
stream . read (Buffer, BufferSize);
return stream . good ();
}
Class for Delimited fields
int DelimTextBuffer :: Unpack (char * str)
// extract the value of the next field of the buffer
{
int len = -1; // length of packed string
int start = NextByte; // first character to be unpacked
for (int i = start; i < BufferSize; i++)
if (Buffer[i] == Delim)
{len = i - start; break;}
if (len == -1) return FALSE; // delimiter not found
NextByte += len + 1;
if (NextByte > BufferSize) return FALSE;
strncpy (str, &Buffer[start], len);
str [len] = 0; // zero termination for string
return TRUE;
}
Class for Delimited fields
Class Person can be extended to provide specialized packing functions
Class for Delimited fields
int Person::Pack (DelimTextBuffer & Buffer) const
{// pack the fields into a DelimTextBuffer, return TRUE if all succeed, FALSE o/w
int result;
Buffer . Clear ();
result = Buffer . Pack (LastName);
result = result && Buffer . Pack (FirstName);
result = result && Buffer . Pack (Address);
result = result && Buffer . Pack (City);
result = result && Buffer . Pack (State);
result = result && Buffer . Pack (ZipCode);
return result;
}
Class for Delimited fields
int Person::Unpack (DelimTextBuffer & Buffer)
{
int result;
result = Buffer . Unpack (LastName);
result = result && Buffer . Unpack (FirstName);
result = result && Buffer . Unpack (Address);
result = result && Buffer . Unpack (City);
result = result && Buffer . Unpack (State);
result = result && Buffer . Unpack (ZipCode);
return result;
}
Class for Fixed Length fields
int FixedTextBuffer :: AddField (int fieldSize)
{
if (NumFields == MaxFields) return FALSE;
if (BufferSize + fieldSize > MaxChars) return FALSE;
FieldSize[NumFields] = fieldSize;
NumFields ++;
BufferSize += fieldSize;
return TRUE;
}
Class for Fixed Length fields
int FixedTextBuffer :: Read (istream & stream)
{
stream . read (Buffer, BufferSize);
return stream . good ();
}
Class for Fixed Length fields
int FixedTextBuffer :: Write (ostream & stream)
{
stream . write (Buffer, BufferSize);
return stream . good ();
}
Class for Fixed Length fields
int FixedTextBuffer :: Pack (const char * str)
// set the value of the next field of the buffer;
{
    if (NextField == NumFields || !Packing) // buffer is full or not packing mode
        return FALSE;
    int len = strlen (str);
    int start = NextCharacter; // first byte to be packed
    int packSize = FieldSize[NextField]; // number bytes to be packed
    strncpy (&Buffer[start], str, packSize);
    NextCharacter += packSize;
    NextField ++;
    // if len < packSize, pad the rest of the field with blanks
    for (int i = start + len; i < NextCharacter; i ++)
        Buffer[i] = ' ';
    Buffer [NextCharacter] = 0; // make buffer look like a string
    if (NextField == NumFields) // buffer is full
    {
        Packing = FALSE;
        NextField = NextCharacter = 0;
    }
    return TRUE;
}
Class for Fixed Length fields
int FixedTextBuffer :: Unpack (char * str)
// extract the value of the next field of the buffer
{
    if (NextField == NumFields || Packing) // buffer is full or not unpacking mode
        return FALSE;
    int start = NextCharacter; // first byte to be unpacked
    int packSize = FieldSize[NextField]; // number bytes to be unpacked
    strncpy (str, &Buffer[start], packSize);
    str [packSize] = 0; // terminate string with zero
    NextCharacter += packSize;
    NextField ++;
    if (NextField == NumFields) Clear (); // all fields unpacked
    return TRUE;
}
Class for Fixed Length fields
void FixedTextBuffer :: Print (ostream & stream)
{
stream << "Buffer has max fields "<<MaxFields<<" and actual "<<NumFields<<endl
<<"max bytes "<<MaxChars<<" and Buffer Size "<<BufferSize<<endl;
for (int i = 0; i < NumFields; i++)
stream <<"\tfield "<<i<<" size "<<FieldSize[i]<<endl;
if (Packing) stream <<"\tPacking\n";
else stream <<"\tnot Packing\n";
stream <<"Contents: '"<<Buffer<<"'"<<endl;
}
Class for Fixed Length fields
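class FixedTextBuffer
{ public:
FixedTextBuffer (int maxFields, int maxChars = 1000);
void Clear (); // called by Unpack and by Person::Pack
int AddField (int fieldSize);
int Read (istream &);
int Write (ostream &);
int Pack (const char *);
int Unpack (char *);
void Print (ostream &);
private:
char * Buffer; // character array to hold field values
int BufferSize; // sum of the sizes of declared fields
int * FieldSize; // array to hold field sizes
int MaxFields; // maximum number of fields
int MaxChars; // maximum number of characters in the buffer
int NumFields; // actual number of declared fields
int NextField; // index of the next field to be packed/unpacked
int NextCharacter; // packing/unpacking position in buffer
int Packing; // TRUE in packing mode, FALSE in unpacking mode
};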
Class for Fixed Length fields
int Person::Pack (FixedTextBuffer & Buffer) const
{// pack the fields into a FixedTextBuffer, return TRUE if all succeed, FALSE o/w
int result;
Buffer . Clear ();
result = Buffer . Pack (LastName);
result = result && Buffer . Pack (FirstName);
result = result && Buffer . Pack (Address);
result = result && Buffer . Pack (City);
result = result && Buffer . Pack (State);
result = result && Buffer . Pack (ZipCode);
return result;
}
Class for Fixed Length fields
int Person::Unpack (FixedTextBuffer & Buffer)
{
Clear ();
int result;
result = Buffer . Unpack (LastName);
result = result && Buffer . Unpack (FirstName);
result = result && Buffer . Unpack (Address);
result = result && Buffer . Unpack (City);
result = result && Buffer . Unpack (State);
result = result && Buffer . Unpack (ZipCode);
return result;
}
Record Access - Keys
Attribute used to identify records
Often used to find records
Standard or canonical form
» rules which keys must conform to
» prevents missing a record because the key is in a different form
» Example:
– all capitals
– Phone in form (nnn) nnn-nnnn
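A small sketch of the "all capitals" rule: both the stored key and the search key are passed through the same function before they are compared. The name canonicalKey is hypothetical, and trimming surrounding blanks is an added assumption, not a rule from the slide.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Put a name key into canonical form: drop surrounding blanks, use all capitals.
std::string canonicalKey(const std::string& raw) {
    std::size_t first = raw.find_first_not_of(' ');
    if (first == std::string::npos) return "";  // key was empty or all blanks
    std::size_t last = raw.find_last_not_of(' ');
    std::string key = raw.substr(first, last - first + 1);
    for (char& c : key)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return key;
}
```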
Record Access - Keys
Keys can be distinct - uniquely identify records
» Primary keys
» one-to-one relationship between key value and possible entities represented
» SSN, Student ID
Keys can identify a collection of records
» Secondary keys
» one-to-many relationship
» City, position, department
Record Access - Keys
Primary key desired characteristics
» unique among collection of entities
» dataless - what if some entities have no value of this type (e.g. SSN)
» unchanging
Record access
Performance of access method
» how do we compare techniques?
» Must be careful what events we count.
» "big-oh" notation gives us a way to factor out all but the most significant factors
Record Access - timing
Sequential searching
» Consider file of 4000 records
» What if no blocking done, and one record per block? (500 byte records, 512 byte blocks)
» What if cluster size set to 8?
» always requires O(n), but search is faster by a constant factor
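The arithmetic can be sketched by counting read operations. This assumes that a cluster size of 8 means eight one-record blocks are fetched per read, and that the target is equally likely to be any record; the function names are hypothetical.

```cpp
#include <cstddef>

// Number of read operations needed to scan the whole file, rounding up.
std::size_t worstCaseReads(std::size_t records, std::size_t recordsPerRead) {
    return (records + recordsPerRead - 1) / recordsPerRead;
}

// Roughly half the worst case when every record is an equally likely target.
std::size_t averageReads(std::size_t records, std::size_t recordsPerRead) {
    return worstCaseReads(records, recordsPerRead) / 2;
}
```

For 4000 records this gives 4000 reads in the worst case (about 2000 on average) with one record per read, and 500 (about 250 on average) with eight records per read. Both grow linearly with the file size; only the constant changes.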
Sequential searching
Usually NOT the best method
Sometimes it is best:
» Searching for some ASCII pattern (grep)
» Small files
» Files rarely searched
» Searching on secondary key, and a large percentage of records match (say 25%)
Unix Tools for sequential file processing
cat - display a file
wc - count lines, words, and characters
grep - find lines in file(s) which match regular expression.
Direct Access
Move "directly" to record without scanning preceding data
Different languages/OS's support different models:
» Byte offset model
– Programmer must specify offset to record, and record size to read.
– Supports variable size records, skip sequential processing
» Relative Record Number (RRN) model
– File has a fixed record size (declared at creation time)
– Records are specified by a record number
– File modeled as a collection of components
– Higher level of abstraction
Direct Access
Different language support
» RRN support
– PL/I
– COBOL
– Pascal (files are modeled as a collection of components (records))
– FORTRAN
» Byte offset
– C
Choosing Record Sizes for Direct Access
Fixed Length Fields
» Very easy to parse records - just read into record structure!
» Each field must be maximum length needed!
– Thus record must be as long as all the maximum fields combined
Field:  last  first  address  city  state  zip
Width:  10    10     15       15    2      9
"Yeakus    Bill      123 Pine       Utica          OH43050    "
Choosing Record Sizes for Direct Access
Variable length fields
» Each field can be any length
» since some can be long, others short, overall record size may be shorter.
» This gives more flexibility to field lengths
» Records must be parsed, space wasted for delimiter or length bytes.
Yeakus|Bill|123 Pine|Utica|OH|43050
Snivenloppinsky|Helmut|12232 Galmentary Avenue|Spotsdale|NY|11232
Header Records
The first record in a direct file may be used to store special information
» Number of records used.
» Location of first record in key order sequence.
» Location of first empty record
» File record structure (meta-data)
In languages with the RRN model, such as Pascal, the variant record facility must be used
In C, the header record can be of a different size from the rest of the file records.
Header Records
Consider "update.c" in the text.
Header record contains a 2 byte record count.
Header size is 32, record size is 64
static struct { short rec_count; char fill[30]; } head;
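With a header in front, every data record is shifted by the header size. A sketch of the offset arithmetic for the sizes given above, assuming record numbers start at 0; the constant and function names are hypothetical.

```cpp
#include <cstddef>

const std::size_t kHeaderSize = 32;  // header size from the slide
const std::size_t kRecordSize = 64;  // data record size from the slide

// Byte offset of data record `rrn` (0-based), skipping the header record.
std::size_t dataRecordOffset(std::size_t rrn) {
    return kHeaderSize + rrn * kRecordSize;
}

// Expected file length for a header whose record count is `recCount`.
std::size_t expectedFileSize(std::size_t recCount) {
    return kHeaderSize + recCount * kRecordSize;
}
```

Comparing expectedFileSize against the actual file length is one possible consistency check on the stored record count.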
Header Records
Must be written when file created
Must be rewritten when file changed
Must be read when file is opened
File Access and Organization
File Organization
» Variable Length Records
» Fixed Length Records
» Field Structures (size bytes, delimiters, fixed)
File Access
» Sequential access
» Direct access
» Indexed access
File Access and Organization
Interaction between organization and access
» Can the file be divided into fields?
» Is there a higher level of organization to the file (meta data)?
» Do all records have to have the same number of fields, bytes?
» How do we distinguish one record from the next?
» How do we recognize if a fixed length record holds real data or not?
File Access and Organization
There is often a trade-off between space and time
» Fixed length records - allow direct access, waste space
» Variable length records require sequential search
We also must consider the typical use of the file - what are the desired access patterns
Selection of a particular organization has implications on the allowable types of access
Portability and Standardization
Differences among Languages
» Fixed sized records versus byte addressable access
Differences among Machine Architectures
» Byte order of binary data
» May be high order or low order byte first
Byte order of binary data
High order first (Big Endian)
» A long int: say 45 is stored in memory.
» It is stored as: 00 00 00 2D
» Suns, network protocols
Low order first (Little Endian)
» A long int: say 45 is stored in memory.
» It is stored as: 2D 00 00 00
» PCs, VAXes
Byte order of binary data
If binary data is written to a file, it is written in the order stored in memory
If the data is later read by a system with a different ordering, the number will be incorrect!
For the sake of portability, files should be written in an agreed upon format (probably Big Endian)
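One common way to get a fixed on-disk byte order is to build the bytes with shifts instead of writing the integer's memory image. The sketch below assumes a 32-bit value stored high order byte first; toBigEndian and fromBigEndian are hypothetical names.

```cpp
#include <cstdint>
#include <string>

// Encode a 32-bit value as four bytes, high order byte first, on any host.
std::string toBigEndian(std::uint32_t value) {
    std::string bytes(4, '\0');
    for (int i = 0; i < 4; ++i)
        bytes[i] = static_cast<char>((value >> (8 * (3 - i))) & 0xFF);
    return bytes;
}

// Decode four big endian bytes back into a 32-bit value.
// Assumes `bytes` holds at least four bytes.
std::uint32_t fromBigEndian(const std::string& bytes) {
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i)
        value = (value << 8) | static_cast<unsigned char>(bytes[i]);
    return value;
}
```

Because the shifts work on the value rather than on its layout in memory, the value 45 is written as 00 00 00 2D on both kinds of machine.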