
[Figure: a directory entry pointing into the file allocation table, whose chained entries list the blocks of each file]

Exercise 127 Repeat Ex. 121 for the Unix inode and FAT structures: what happens if you seek beyond the current end of the file, and then write some data?

Exercise 128 What are the pros and cons of the Unix inode structure vs. the FAT structure? Hint: consider the distribution of file sizes shown above.

FAT’s Success and Legacy Problems

The FAT file system was originally designed for storing data on floppy disks. Two leading considerations were therefore simplicity and saving space. As a result file names were limited to the 8.3 format, where the name is no more than 8 characters, followed by an extension of 3 characters. In addition, the pointers were 2 bytes, so the table size was limited to 64K entries.

The problem with this structure is that each table entry represents an allocation unit of disk space. For small disks it was possible to use an allocation unit of 512 bytes. In fact, this is OK for disks of up to 512 × 64K = 32 MB. But when bigger disks became available, they too had to be divided into the same 64K allocation units. As a result the allocation units grew considerably: for example, a 256 MB disk was allocated in units of 4 KB. This led to inefficient disk space usage, because even small files had to allocate at least one unit. But the design was hard to change, because so many systems and so much software depended on it.
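
To make this arithmetic concrete, here is a minimal sketch in C (the function name and the power-of-two rounding are illustrative assumptions, not actual FAT code) that computes the smallest allocation unit a 16-bit FAT can use for a given disk size.

    #include <stdio.h>

    /* A 16-bit FAT can index at most 64K allocation units (clusters).
       Given the disk size, find the smallest power-of-two cluster size,
       starting from one 512-byte sector, that covers the whole disk. */
    static unsigned min_cluster_size(unsigned long long disk_bytes)
    {
        unsigned long long max_clusters = 1ULL << 16;   /* 64K FAT entries */
        unsigned cluster = 512;                         /* one sector */
        while ((unsigned long long)cluster * max_clusters < disk_bytes)
            cluster *= 2;
        return cluster;
    }

    int main(void)
    {
        /* 512 * 64K = 32 MB, so a 32 MB disk still works with 512-byte units,
           but a 256 MB disk needs 4 KB units, wasting space on small files. */
        printf("%u\n", min_cluster_size(32ULL << 20));   /* prints 512 */
        printf("%u\n", min_cluster_size(256ULL << 20));  /* prints 4096 */
        return 0;
    }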

6.4.2 Data Layout on the Disk

The structures described above allow one to find the blocks that were allocated to a file. But which blocks should be chosen for allocation? Obviously blocks can be chosen at random — just take the first one you find that is free. But this can have adverse effects on performance. To understand why, we first need some background on how disks work.

Background: Mechanics of Disk Access

The various aspects of controlling disk operations are a good example of the shifting roles of operating systems and hardware. In the past, most of these things were done by the operating system, and thereby became standard material in the operating systems syllabus. But in modern systems, most of it is done by the disk controller. The operating system no longer needs to bother with the details.

To read more: A good description of the workings of modern disk drives is given by Ruemmler and Wilkes [13]. An update for even more advanced disk drives that have their own caches and reorder requests is given by Shriver et al. [14].

Addressing disk blocks is based on disk anatomy

A modern disk typically has multiple platters (1–12 of them) which rotate together on a common spindle at 5400 or 7200 RPM. Data is stored on both surfaces of each platter. Each surface has its own read/write head, and they are all connected to the same arm and move in unison. However, typically only one head is used at a time, because it is too difficult to align all of them at once.

The data is recorded in concentric tracks (about 1500–2000 of them). The set of tracks on the different platters that have the same radius is called a cylinder. This concept is important because accessing tracks in the same cylinder just requires the heads to be re-aligned, rather than being moved. Each track is divided into sectors, which define the minimal unit of access: each sector is 256–1024 data bytes plus error correction codes and an inter-sector gap, and there are 100–200 of them per track. Note that tracks near the rim of the disk are much longer than tracks near the center, and can therefore store much more data. This is exploited by dividing the radius of the disk into 3–20 zones, with the tracks in each zone divided into a different number of sectors. Thus tracks near the rim have more sectors, and store more data.

[Figure: disk anatomy — spindle, platters, arm, read/write heads, tracks, cylinders, sectors, and zones]

In times gone by, addressing a block of data on the disk was accomplished by specifying the surface, track, and sector. Contemporary disks, and in particular those with SCSI controllers, present an interface in which all blocks appear in one logical sequence. This allows the controller to hide bad blocks, and is easier to handle. However, it prevents certain optimizations, because the operating system does not know which blocks are close to each other. For example, the operating system cannot specify that certain data blocks should reside in the same cylinder.
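
To make the contrast concrete, here is a minimal sketch of the classical conversion from a (cylinder, head, sector) address to a position in a single logical sequence of blocks. It assumes an idealized fixed geometry (the same number of sectors on every track), which is precisely what zoned recording and bad-block remapping in modern disks invalidate; the structure and function names are illustrative.

    /* Classical addressing by disk anatomy, assuming a fixed geometry:
       every cylinder has the same number of heads (surfaces), and every
       track has the same number of sectors. Sectors are traditionally
       numbered from 1; cylinders and heads from 0. */
    struct geometry {
        unsigned heads;               /* surfaces per cylinder */
        unsigned sectors_per_track;
    };

    static unsigned long chs_to_logical(struct geometry g, unsigned cylinder,
                                        unsigned head, unsigned sector)
    {
        return ((unsigned long)cylinder * g.heads + head) * g.sectors_per_track
               + (sector - 1);
    }

A modern disk exposes only the logical block number on the left-hand side of this mapping; the geometry on the right-hand side is hidden, which is why the operating system can no longer tell which blocks share a cylinder.
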
Scheduling I/O requests affects performance

Getting the read/write head to the right track and sector involves mechanical motion, and takes time. Therefore reordering the I/O operations so as to reduce head motion improves performance. The most common optimization algorithms involve reordering the requests according to their tracks, to minimize the movement of the heads along the radius of the disk. Modern controllers also take the rotational position into account.

The base algorithm is FIFO (first in first out), which just services the requests in the order that they arrive. The most common improvement is to use the SCAN algorithm, in which the head moves back and forth across the tracks and services requests in the order that tracks are encountered. A variant of this is C-SCAN (circular SCAN), in which requests are serviced only while moving in one direction, and then the head returns as fast as possible to the origin. This improves fairness by reducing the maximal time that a request may have to wait.

[Figure: FIFO and C-SCAN servicing the same queued requests, plotted as head position across the tracks over time; the slope reflects the rate of head movement, the rotational delay, and the transfer time]

As with addressing, in the past it was the operating system that was responsible for scheduling the disk operations, and the disk accepted such operations one at a time. Contemporary disks with SCSI controllers are willing to accept multiple outstanding requests, and do the scheduling themselves.
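
As a sketch of the idea (hypothetical code, not taken from any particular driver), the core of C-SCAN can be expressed as choosing, among the pending requests, the one on the lowest track at or ahead of the current head position, and wrapping around to the lowest requested track when none remains.

    /* C-SCAN selection among pending requests: service the request on the
       lowest track that is at or ahead of the current head position; if
       there is none, wrap around to the lowest requested track. */
    static int next_cscan(const unsigned track[], int nreq, unsigned head_pos)
    {
        int ahead = -1, lowest = -1;

        for (int i = 0; i < nreq; i++) {
            if (track[i] >= head_pos &&
                (ahead == -1 || track[i] < track[ahead]))
                ahead = i;                    /* closest track ahead of the head */
            if (lowest == -1 || track[i] < track[lowest])
                lowest = i;                   /* lowest track, used when wrapping */
        }
        return ahead != -1 ? ahead : lowest;  /* -1 only if no requests pending */
    }

FIFO, by contrast, would simply service the oldest pending request regardless of its track.
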
Performance is optimized by placing related blocks in close proximity

The conventional Unix file system is composed of 3 parts: a superblock (which contains data about the size of the system and the location of free blocks), inodes, and data blocks. The conventional layout is as follows: the superblock is the first block on the disk, because it has to be at a predefined location. Next come all the inodes — the system can know how many there are, because this number appears in the superblock. All the rest are data blocks.

The problem with this layout is that it entails much seeking. Consider the example of opening a file named /a/b/c. To do so, the file system must access the root inode, to find the blocks used to implement it. It then reads these blocks to find which inode has been allocated to directory a. It then has to read the inode for a to get to its blocks, read the blocks to find b, and so on. If all the inodes are concentrated at one end of the disk, while the blocks are dispersed throughout the disk, this means repeated seeking back and forth.

A possible solution is to try and put inodes and related blocks next to each other, in the same set of cylinders, rather than concentrating all the inodes in one place. This was done in the Unix fast file system. However, such optimizations depend on the ability of the system to know the actual layout of data on the disk, which tends to be hidden by modern disk controllers.

Exercise 129 The superblock contains the data about all the free blocks, so every time a new block is allocated we need to access the superblock. Does this entail a disk access and seek as well? How can this be avoided? What are the consequences?

Details: The Special Case of Indirect Blocks

While placing an inode and the blocks it points to together reduces seeking, it may also cause problems. Specifically, a large file may monopolize all the blocks in the set of cylinders, not leaving any for other inodes in the set. Luckily, the list of file blocks is not all contained in the inode: for large files, most of it is in indirect blocks. The fast file system therefore switches to a new set of cylinders whenever a new indirect block is allocated, choosing a set that is less loaded than the average. Thus large files are indeed spread across the disk. The extra cost of the seek is relatively low in this case, because it is amortized against the accesses to all the data blocks listed in the indirect block.

However, this solution is also problematic. Assuming 12 direct blocks of size 8 KB each, the first indirect block is allocated when the file size reaches 96 KB. Having to perform a seek at this relatively small size is not amortized, and leads to a substantial reduction in the achievable bandwidth for medium-size files in the range of ∼100 KB. The solution to this is to make the first indirect block a special case, which stays in the same set of cylinders as the inode [15].

While this solution improves the achievable bandwidth for intermediate-size files, it does not necessarily improve things for the whole workload. The reason is that large files indeed tend to crowd out other files, so leaving their blocks in the same set of cylinders causes other small files to suffer. More than teaching us about disk block allocation, this then provides testimony to the complexity of analyzing performance implications, and the need to take a comprehensive approach.

Another optimization is to place consecutive logical blocks a certain distance from each other along the track, called the track skew. The idea is that sequential access is common, so it should be optimized. However, the operating system and disk controller need some time to handle each request. If we know how much time this is, and the speed at which the disk is rotating, we can calculate how many sectors to skip to account for this processing. Then the request will be handled exactly when the requested block arrives under the heads.

To read more: The Unix fast file system was originally described by McKusick and friends [8].

Log structured file systems reduce seeking

The use of a large buffer cache and aggressive prefetching can satisfy most read requests from memory, saving the overhead of a disk access. The next performance bottleneck is then the implementation of small writes, because they require much seeking to get to the right block. This can be solved by not writing the modified blocks in place, but rather writing a single continuous log of all changes to all files and metadata.

Of course, this complicates the system’s internal data structures. When a disk block is modified and written in the log, the file’s inode needs to be modified to reflect the new location of the block. So the inode also has to be written to the log. But now the location of the inode has also changed, so this too has to be updated and recorded. To reduce overhead, metadata is not written to disk immediately every time it is modified, but only after some time or when a number of changes have accumulated. Thus some data loss is possible if the system crashes, but this is the case anyway.

Another problem is that eventually the whole disk will be filled with the log, and no more writing will be possible. The solution is to perform garbage collection all the time: we write new log records at one end, and delete old ones at the other. In many cases, the old log data can simply be discarded, because it has since been overwritten and therefore exists somewhere else in the log. Pieces of data that are still valid are simply re-written at the end of the log.

To read more: Log structured file systems were introduced by Rosenblum and Ousterhout [12].
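
The cascade of updates caused by a single write can be sketched as follows. This is a conceptual illustration with made-up structure names; real log-structured file systems add segments, checkpoints, and an inode map to keep the chain from growing indefinitely.

    /* Conceptual sketch: in a log-structured file system everything is
       appended at the end of the log, so each append changes the location
       of the structure that points to it. Names are illustrative. */
    struct inode {
        unsigned long block_addr[12];     /* where the file's blocks now live */
    };

    static unsigned long log_end;         /* next free position in the log */

    static unsigned long append_to_log(const void *data, unsigned long size)
    {
        unsigned long addr = log_end;
        (void)data;                       /* the bytes would be written at addr */
        log_end += size;
        return addr;
    }

    /* Modifying one data block forces the inode to be rewritten as well,
       and, in a real system, an inode-map entry pointing at the inode,
       which is itself flushed to the log only periodically. */
    static void write_block(struct inode *ino, int blkno,
                            const void *data, unsigned long blksize)
    {
        unsigned long new_inode_addr;

        ino->block_addr[blkno] = append_to_log(data, blksize);  /* new data copy */
        new_inode_addr = append_to_log(ino, sizeof *ino);       /* new inode copy */
        (void)new_inode_addr;             /* would be recorded in the inode map */
    }
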
Logical volumes avoid disk size limitations

The discussion so far has implicitly assumed that there is enough space on the disk for the desired files, and even for the whole file system. With the growing size of data sets used by modern applications, this can be a problematic assumption. The solution is to use another layer of abstraction: logical volumes. A logical volume is an abstraction of a disk. A file system is created on top of a logical volume, and uses its blocks to store metadata and data. In many cases, the logical volume is implemented by direct mapping to a physical disk or a disk partition (a part of the disk that is disjoint from other parts that are used for other purposes). But it is also possible to create a large logical volume out of several smaller disks. This just requires an additional level of indirection, which maps the logical volume blocks to the blocks of the underlying disks.
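
A minimal sketch of this extra level of indirection, assuming the simplest case where the logical volume is just the concatenation of several disks (names are illustrative; real volume managers also support striping and mirroring):

    /* A logical volume built by concatenating several disks: logical block
       numbers run consecutively across the disks, so translating an address
       means finding which disk the block falls on, and where on that disk.
       Names are illustrative. */
    struct disk {
        unsigned long nblocks;            /* size of this disk, in blocks */
    };

    struct logical_volume {
        int ndisks;
        struct disk *disks;
    };

    struct physical_addr {
        int disk;                         /* which underlying disk */
        unsigned long block;              /* block number on that disk */
    };

    static struct physical_addr map_block(const struct logical_volume *lv,
                                          unsigned long logical_block)
    {
        struct physical_addr pa = { -1, 0 };

        for (int i = 0; i < lv->ndisks; i++) {
            if (logical_block < lv->disks[i].nblocks) {
                pa.disk = i;
                pa.block = logical_block;
                return pa;
            }
            logical_block -= lv->disks[i].nblocks;
        }
        return pa;                        /* disk == -1: beyond the volume's end */
    }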

6.4.3 Reliability