As we saw in Section 7.4, all electronic information, regardless of the format, is ultimately stored in a binary form, as a series of bits (zeroes and ones). However, the same value can be recorded as a binary value in a number of different ways.
For example, given the number 12345, we could store it as individual characters 1, 2, 3, 4, and 5, using one byte for each character:
00110001 00110010 00110011 00110100 00110101
Alternatively, we could store the number as a four-byte integer (see Section 7.4.3):
00111001 00110000 00000000 00000000
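The two byte patterns above can be reproduced with a short Python sketch. This is only an illustration: it assumes an ASCII encoding for the text version and a little-endian byte order (least significant byte first) for the integer version, which is the order that the four-byte pattern shown above corresponds to. The helper name `to_bits` is made up for this example.

```python
import struct

def to_bits(data: bytes) -> str:
    """Format each byte as eight binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)

number = 12345

# Plain text: one byte per character (ASCII encoding assumed).
as_text = str(number).encode("ascii")

# Binary: one four-byte integer, least significant byte first
# ("<i" means little-endian, four-byte signed integer).
as_int = struct.pack("<i", number)

print(to_bits(as_text))
print(to_bits(as_int))
```

The text version needs five bytes and grows with the number of digits, while the integer version always occupies four bytes.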
When we store information as individual one-byte characters, the result is a plain text file. This is usually a less efficient method because it consumes more memory, but it has the advantage that the file has a very simple structure. This means that it is very simple to write software to read the file because we know that each byte just needs to be converted to a character. There may be problems determining data values from the individual characters (see Section 7.5), but the process of reading the basic unit of information (a character) from the file is straightforward.
For the purposes of this book, a binary format is just any format that is not plain text.
The characteristic feature of a binary format is that there is no simple rule for determining how many bits or how many bytes constitute a basic unit of information. Given a series of, say, four bytes, we cannot assume that these correspond to four characters, or a single four-byte integer, or half of an eight-byte floating-point value (see Section 7.4.3). There must be a description of the rules for the format (we will look at one example soon) that states what information is stored and how many bits or bytes are used for each piece of information.
Binary formats are consequently much harder to write software for, which results in there being less software available to do the job.
However, some binary formats are easier to read than others. Given that a description is necessary to have any chance of reading a binary file, proprietary formats, where the file format description is kept private, are extremely difficult to deal with. Open standards become more important than ever.
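To make the ambiguity concrete, the following Python sketch reads the same four bytes in the three ways just mentioned. The byte values are arbitrary and chosen only for illustration. The little-endian byte order and the four zero bytes used to pad out the floating-point value are assumptions of this example, not part of any particular format.

```python
import struct

# Four arbitrary bytes (chosen for illustration only).
four_bytes = bytes([0x31, 0x32, 0x33, 0x34])

# Interpretation 1: four one-byte ASCII characters.
as_chars = four_bytes.decode("ascii")

# Interpretation 2: one four-byte little-endian integer.
(as_int,) = struct.unpack("<i", four_bytes)

# Interpretation 3: the first half of an eight-byte floating-point value.
# The result depends on four further bytes (zeroes here, as an assumption).
(as_double,) = struct.unpack("<d", four_bytes + bytes(4))

print(as_chars)
print(as_int)
print(as_double)
```

All three calls succeed and give unrelated results, so nothing in the bytes themselves says which reading is the intended one. Only the format description can settle that.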
7.7.1 Binary file structure
One of the advantages of binary files is that they are more efficient.
In terms of memory, storing values using numeric formats such as IEEE 754, rather than as text characters, tends to use less memory.
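As a rough illustration of the difference, the following Python sketch stores the same three values once as comma-separated text and once as eight-byte IEEE 754 doubles. The values are arbitrary examples, and the comparison only reflects the word "tends": a value with few digits, such as 1, takes fewer bytes as text than as an eight-byte double.

```python
import struct

# Arbitrary example values with many significant digits.
values = [3.141592653589793, 2.718281828459045, 1.4142135623730951]

# Plain text: every digit, decimal point, and separator costs one byte.
text_bytes = ",".join(repr(v) for v in values).encode("ascii")

# Binary: each value is one eight-byte IEEE 754 double (little-endian assumed).
binary_bytes = struct.pack(f"<{len(values)}d", *values)

print(len(text_bytes), len(binary_bytes))
```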
In addition, binary formats also offer advantages in terms of speed of access. While the basic unit of information is very straightforward in a plain text file (one byte equals one character), finding the actual data values is often much harder. For example, in order to find the third data value on the tenth row of a CSV file, the reader software must keep reading bytes until nine end-of-line characters have been found and then two delimiter characters have been found. This means that, with text files, it is usually necessary to read everything that precedes a value, and in the worst case the entire file, in order to find it.
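The search just described can be sketched in Python as follows. This is a simplified illustration, not a real CSV reader: it assumes single-byte newline and comma characters and ignores quoted fields. The function name `find_csv_value` and the generated test data are made up for the example.

```python
import io

def find_csv_value(stream, row, column, delimiter=b",", newline=b"\n"):
    """Return (value, offset) for field `column` of line `row` (both 1-based).

    `offset` is the number of bytes that had to be read before the value began.
    """
    newlines_needed = row - 1
    delimiters_needed = column - 1
    offset = 0
    # Skip earlier rows, then earlier fields, one byte at a time.
    while newlines_needed > 0 or delimiters_needed > 0:
        byte = stream.read(1)
        if not byte:
            raise ValueError("value not found")
        offset += 1
        if newlines_needed > 0:
            if byte == newline:
                newlines_needed -= 1
        elif byte == delimiter:
            delimiters_needed -= 1
        elif byte == newline:
            raise ValueError("row has too few fields")
    # Collect bytes until the next delimiter, end of line, or end of file.
    value = b""
    while True:
        byte = stream.read(1)
        if not byte or byte in (delimiter, newline):
            break
        value += byte
    return value.decode("ascii"), offset

# Twelve rows of four fields each, labelled by row and column.
rows = [",".join(f"r{r}c{c}" for c in range(1, 5)) for r in range(1, 13)]
data = ("\n".join(rows) + "\n").encode("ascii")

value, offset = find_csv_value(io.BytesIO(data), row=10, column=3)
print(value, offset)
```

The returned offset shows how many bytes had to be examined before the requested value could even be located. That cost grows with the position of the value in the file.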
For binary formats, some sort of format description, or map, is required to be able to find the location (and meaning) of any value in the file. However, the advantage of having such a map is that any value within the file can be found without having to read the entire file.
As a typical example, a standard feature of binary files is the inclusion of some sort of header information, both for the overall file and for subsections within the file. This header information contains information such as the byte location within the file where a set of values begins (a pointer), the number of bytes used for each data value (the data size), plus the number of data values. It is then very simple to find, for example, the third data value within a set of values, which is at byte location pointer + 2 × size.
More information is required in order to locate values within a binary format, but once that information is available, navigation within the file is faster and more flexible.
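The pointer arithmetic can be sketched in Python using a made-up file layout. The layout is purely hypothetical and does not correspond to any real format: a 12-byte header holding three four-byte little-endian integers (pointer, data size, number of values), followed by the values as eight-byte doubles. The names `HEADER`, `write_example`, and `read_value` are likewise invented for this illustration.

```python
import io
import struct

# Hypothetical header layout: pointer, data size, number of values.
HEADER = struct.Struct("<III")

def write_example(values):
    """Build an in-memory file: a 12-byte header followed by 8-byte doubles."""
    size = 8
    pointer = HEADER.size  # the values start right after the header
    body = b"".join(struct.pack("<d", v) for v in values)
    return io.BytesIO(HEADER.pack(pointer, size, len(values)) + body)

def read_value(stream, index):
    """Read the value at 0-based `index` without reading the earlier values."""
    stream.seek(0)
    pointer, size, count = HEADER.unpack(stream.read(HEADER.size))
    if not 0 <= index < count:
        raise IndexError("no such value")
    # Jump straight to the value; for the third value this is pointer + 2 * size.
    stream.seek(pointer + index * size)
    (value,) = struct.unpack("<d", stream.read(size))
    return value

stream = write_example([1.5, 2.5, 3.5, 4.5])
third = read_value(stream, 2)
print(third)
```

Only the header and the eight bytes of the requested value are read, no matter how many values precede it.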
7.7.2 NetCDF
Paul Murrell
This document is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.