Up to 10 points can be gained towards your final score by completing the in-class assignment on Friday.
This week's self-preparation assignment is part practical preparation and part study.
First, install a command-line environment on your computer that you can use to complete the next few weeks of this course. (Linux and Mac users already have a suitable command-line environment; there is nothing to do. Windows users have several options; please follow that link and install one of the options on your laptop computer.)
Second, Review the notes on this page before coming to class. Use Internet search engines to find online tutorials, Wikipedia articles, etc., for any additional information (or explanation) that you might need.
On Thursday evening, press 'submit'. In class on Friday we will check which topics were difficult for everyone.
To succeed at the in-class assignment for this class you should understand the topics outlined in the next section and further explained in the “Notes” section.
A filesystem (or file system) manages the storage of data on a device (such as a solid-state disk, hard disk drive, or USB flash memory drive). Many hundreds of different filesystems exist, each one providing different tradeoffs between speed, reliability, safety, security, etc. Almost all of them manage data by splitting it into files (sequences of bytes) placed within a hierarchy of directories (which map symbolic names to individual files). In most situations a filesystem refers to a single logical repository of files residing on a single physical medium (e.g., SSD or HDD).
Sizes of files (and filesystems, and the physical media on which they exist) are measured in bytes. One byte contains eight bits, where a bit is a single binary digit (0 or 1). A bit is the smallest unit of information possible. A byte contains eight bits because that is sufficient to store a single character in English. For English and many European languages, a single character in a document will be represented as a single byte. Asian characters are typically larger. In Japanese, common characters are represented using two bytes but uncommon characters can require up to four bytes of storage.
storage sizes (traditional units) | |||
---|---|---|---|
unit | name | size (decimal SI1 units) | size (binary JEDEC2 units) |
1 kB | kilobyte | 103 = 1,000 bytes | 210 = 1,024 bytes |
1 MB | megabyte | 106 = 1,000,000 bytes | 220 = 1,048,576 bytes |
1 GB | gigabyte | 109 = 1,000,000,000 bytes | 230 = 1,073,741,824 bytes |
1 TB | terabyte | 1012 = 1,000,000,000,000 bytes | 240 = 1,099,511,627,776 bytes |
storage sizes (IEC binary prefixes) | ||
---|---|---|
unit | name | size |
1 KiB | kibibyte | 210 = 1,024 bytes |
1 MiB | mebibyte | 220 = 1,048,576 bytes |
1 GiB | gibibyte | 230 = 1,073,741,824 bytes |
1 TiB | tebibyte | 240 = 1,099,511,627,776 bytes |
A byte is too small a quantity to be useful when discussing an entire filesystem or disk. SI prefixes are often applied to storage sizes, as shown in the table on the right. Permanent storage (such as disk drives) usually use the SI (metric) prefixes, which increase by factors of 1000. Volatile storage (such as computer memory) usually use the JEDEC (binary) prefixes, which increase by factors of 1024.
Available disk space reported by the OS might use either system leading to discrepancies between advertised (by the manufacturer) and actual (available to the computer user) storage space. For example, a drive advertised in SI units as containing 1 TB of space would be reported as providing only 931 MB of space by an OS using JEDEC units. This can be a cause for confusion, or even legal action.
A new set of binary prefixes was standardised by the IEC, corresponding to the JEDEC binary sizes but using a different spelling and (somewhat silly-sounding) names. The prefixes are formed from the SI/JEDEC ones by inserting an 'i' between the multiplier (k, M, G, etc.) and the unit (B). Their names are formed from the first two letters of the corresponding SI/JEDEC prefix (ki, me, gi, te, etc.) and the first two letters of the word 'binary' (bi); hence kibibyte, mebibyte, and so on.
Linux and recent versions of MacOS have adopted these prefixes for displaying information that uses 1024-based scales (such as memory statistics). Windows and the mass media have so far ignored them.
Filesystems are used to store:
C:\Program Files
' to find lots of applications.
On MacOS, look in '/Applications
'.
On Linux, look in '/usr/bin
'.C:\pagefile.sys
'.
On MacOS look in '/private/var/vm
'.
On Linux you probably have part of the physical disk dedicated to this task, using its own filesystem
that is independent of OS and user files.
No matter what filesystem is being used, the application sees a very simple model of storage:
A directory maps names onto files.
In the example on the right, 'Documents
' is a directory containing two entries.
One entry points to a regular file called 'local
' (containing a document represented as a sequence of bytes).
The other entry points to another directory called 'MobaXterm
'.
Using a 'family tree' metaphor, 'Documents
' is called the parent directory of 'MobaXterm
'.
The root directory, which is at the top of the tree structure, has no name. (File and directory names are stored in their parent directory. Since the root directory has no parent directory, there is simply nowhere to store a name for it.)
To specify a particular file or directory, start at the root and describe the path that must be followed to find that file or directory.
For example, the 'Documents
' directory has the following path:
Users
' directory piumarta
' directory Documents
' directory
Each element in the path is separated by a “/” character (or “" on Windows).
The root directory has no name so we start with an empty name, then the separator, then 'Users
', another separator, and so on.
The final path is therefore: '/Users/piumarta/Documents
' (or '\Users\piumarta\Documents
' on Windows).
On most computers there is only one root directory.
For historical reasons Windows is an exception and has one root directory per filesystem, or 'volume' in Windows terminology.
Each volume is named by a single letter followed by a colon, in this case 'C:
'.
Usually the volume name is prepended to paths (i.e., added at the beginning of the path).
The correct path to the 'Documents
' directory in Windows would therefore be: 'C:\Users\piumarta\Documents
'.
Windows explorer shows you the path to the current directory above the list of files. If you click in it you will see it written in the notation shown above. You can also type into the location bar, or copy/paste from/to it, using the same notation.
Every directory contains two special entries whose names are '.
' and '..
'.
The name '.
' points to the inode of the directory itself
(so the path '/Users/././././.
' refers still to the '/Users
' directory).
The name '..
' points to the inode of the parent directory
(so the path '/Users/Administrator/../piumarta/.
' refers to my account's directory).
The only exception is the root directory, which has no parent, and so for it the name
'..
' points back to the root directory again.
When you open a folder in Mac Finder or Windows Explorer, or type 'ls
' in a command line window, you are looking at a list of directory entries.
Every entry in the directory has a unique (file or directory) name
and associates that name with some storage on the disk where the contents of the file or directory are stored.
The structure describing where the contents are stored is called an index node, or inode for short.
Inodes are not stored in the directories, but in a separate table on the disk.
There is exactly one inode specifying where the contents for any given file or directory are stored on the disk.
However, more than one name can be associated with a given inode (and therefore file) by having more than one directory entry specify the same inode
as the location of the file's contents.
An inode contains all the information relating to a file's or directory's contents, including:
The inode also contains a list of blocks on the disk where the contents of the file/directory are actually stored. Each block has a fixed sized, typically 4096 bytes, and is identified by a number (unique within the filesystem) that can easily be converted into the physical location on the disk where the block is stored.
(Windows, just to be difficult, calls its inodes 'MFT records' where MFT stands for 'master file table'.)
A disk drive, whether solid-state or rotating, stores information in a fixed number of fixed-sized sectors. A sector is typically 512 bytes long, and so a 1TB disk would contain 2,147,483,648 sectors of data. A sector is the smallest unit of data that can be transferred to or from the disk.
While sectors can be addressed directly, most filesystems do not do that for reasons of efficiency. Instead they combine several sectors into a block and treat that as the smallest unit of data when managing space on the disk. A typical block size is 8 sectors, or 4096 bytes. (Of course, Windows has to be different to everyone else and uses the terms 'allocation unit' or 'cluster'.)
Each block has a unique number, and data is always read or written to the disk in multiples of the block size.
Why am I bothering to tell you about block (allocation unit size) size? Efficiency.
Before you can store information on a disk you have to format it. Formatting writes all the data structures that are needed to describe an empty filesystem onto the disk. It puts in place the framework into which you can start creating new files and directories.
While you are formatting a disk you will likely be given the choice of what block (allocation unit) size to use. The default on many filesystems (including Windows, surprisingly) is 4096 bytes. You can almost certainly increase or decrease this size (by factors of a power of 2).
The smallest unit of information that can be allocated is one block. A one-byte file therefore consumes an entire block for its contents, no matter what block size is chosen. The rest of the block is wasted. This is called internal fragmentation of the storage and there is nothing you can do about it (other than reduce the block size).
On the other hand, a huge file will contain very many blocks of data and will therefore have a very large block list in its inode. This wastes space in the inode and potentially makes accessing the file less efficient, because the contents of the file are distributed over many different blocks located far apart on the surface of the disk. (This is what most people mean when they say 'fragmentation' in the context of disk storage. It is also the problem that is solved by the 'disk defragmenter' tool in Windows which attempts to rearrange the blocks belonging to each file to keep them closer together on the disk, ideally in a single contiguous sequence of blocks.)
If you expect your disk to contain mostly very small files (common in big data analytics) then a small block (allocation unit) size will perform better. If you expect your disk to contain mostly very large files (common in audio/video media editing) then a large block (allocation unit) size will perform better. If you have no idea what to expect then the default block (allocation unit) size is probably going to be fine.
In your programs you will use functions such as open()
, read()
, write()
, and close()
to access and modify the contents of files.
Those functions directly manipulate the directories, inodes, and disk blocks described above.
The filesystem's job is to make sure that happens safely and efficiently, and that you never notice all of the underlying complexity.