Review
======

Files and directories form a //tree// structure.
The topmost directory is called the //root// and it has no name.

File and directory names that start with '/' are called //absolute// and they specify the //path// to the file/directory starting from the root directory.
Elements in a path are separated by the '/' character.
Each element names a directory in which the next element in the path can be found.
The final element can name either a directory or a file.

Your shell (command line) has a //current working directory//.
File and directory names that do not start with a '/' are called //relative// and they specify a path to a file/directory starting from the current working directory.

Every directory has entries for '.' and '..'.
The entry '.' points to the directory itself.
The entry '..' points to the parent directory, the one 'above it' in the tree (one step closer to the root).
If your current working directory is '/home/user' then the paths '/home' and '..' both refer to the same directory.

The 'pwd' command prints the current working directory.
The 'cd' command changes your current working directory.
Without an argument it takes you to your home directory, where all your files and directories are stored.
With a directory name as argument it changes to that directory.
With the special argument '-' it changes to the directory you were in before the current one.

The 'mkdir' command creates a new directory.
The 'cp' command copies files.
The 'ls' command lists the contents of directories.
The 'cat' command concatenates files and prints the result.
The 'grep' command searches for patterns in files.
The 'wc' command counts words (and/or lines and/or characters).
The 'cut' command cuts specified fields from lines of text and then prints them.

Type 'Control-C' to terminate a command that is still running (hold down 'Control' while typing 'C').
Or, type 'Control-D' to simulate end-of-file, which will terminate a command that is waiting for input from the keyboard.

Use 'command > filename' to send the output of 'command' to a file called 'filename'.
Use 'command < filename' to read the input of 'command' from a file called 'filename'.
Use 'command1 | command2' to send the output of 'command1' to the input of 'command2'.

Challenge review: finding files and directories by type
=======================================================

The 'find' command finds files and directories by name or by property.
The default action of 'find' is to print the path of the files/directories it finds.
The general form of the command is: 'find directory -property value'

One option for -property is '-type' which understands a value of 'd' or 'f'.
So, 'find . -type d' looks in the current directory (.) and everything below it and finds all directories (-type d), and 'find . -type f' looks in the same place and finds all regular files (-type f).

Assume your current working directory is your home directory.

** Q.11 What command pipeline will count the number of directories under '/usr/lib' that have the digit '2' somewhere in their name?

Moving and renaming files
=========================

The command 'mv oldfile newfile' renames oldfile to newfile.
If newfile already exists then it will be overwritten and its old contents lost, so please be careful.
It is safer to use the '-i' option, which asks for confirmation before overwriting an existing newfile.
Note that 'mv' can be used to rename both files and directories.

The command 'mv files... directory' moves one or more files (or directories) into an existing directory.

** Q.1  Write down the commands needed to perform the following operations:

   (a) Create a file called 'test'.
       $ ________________
   (b) Create a directory called 'temp'.
       $ ________________
   (c) Rename 'test' to 'tested'.
       $ ________________
   (d) Move 'tested' into the directory 'temp'.
       $ ________________
   (e) Rename 'temp' to 'temped'.
       $ ________________
   (f) Remove 'temped' and all its contents.
       $ ________________

Downloading and unpacking an archive
====================================

Download the file 'metar-2019-10.tgz' into your home directory.

    $ cd
    $ curl -O https://kuas.org/tmp/metar-2019-10.tgz

Make sure the file arrived intact:

    $ ls -l metar-2019-10.tgz
    -rw-r--r-- 1 user UsersGrp 1560043 Oct 29 18:17 metar-2019-10.tgz

The file is an archive made by the 'tar' program (which is like 'zip', but is better at compressing large files).
To extract the archive, use the 'tar' program like this:

    $ tar -xzf metar-2019-10.tgz

The first argument gives three options (combined into one argument) to the 'tar' program:

    '-x'  eXtract an archive
    '-z'  decompress (unZip) the archive
    '-f'  the Filename of the archive is given by the next argument

The second argument 'metar-2019-10.tgz' is the filename needed by the '-f' option.
(To see the complete instructions for using 'tar' you can run 'tar --help'.)

Nothing will appear to happen, but if you list your directory you should see that a new directory entry has been created.

    $ ls -l
    total 956
    drwxr-xr-x 1 user UsersGrp       0 Oct 29 18:20 metar-2019
    -rw-r--r-- 1 user UsersGrp 1560043 Oct 29 18:19 metar-2019-10.tgz

** Q.2  What command will show you how many files are in your new directory?

    $ ________________ | _____

Important: If you do not have 742 files, ask for help now!

Inspecting data files
=====================

The data files contain hourly weather data for October 2019 collected from sensors installed at 99 airports around Japan.
Let's take a look at one of them.
Pick one of the files (it does not matter which one) and use the 'cat' program to send its contents to the screen.

    $ cat metar-2019/2019-10-01T00:52:40-japan.txt

There is much more content than will fit on one screen.
You can scroll back in the terminal to see the content near the start of the file.
Or you could use the 'less' command.

    $ less metar-2019/2019-10-01T00:52:40-japan.txt

The 'less' command shows you the contents of a file one page at a time while letting you use the keyboard to navigate.  Press:

    RETURN  to move one line down,
    SPACE   to move one page down,
    b       to move one page up,
    G       to move to the last page,
    g       to move back to the first page, or
    q       to quit.

The 'less' program (without a filename argument) is very useful as the last command in a pipeline.

Use the 'head' command to see just the start of a file.

    $ head metar-2019/2019-10-01T00:52:40-japan.txt

With no options, 'head' shows up to the first ten lines of a file (or whatever input is piped into it).
To see a different number of lines, use the '-n number' option.

    $ head -n 5 metar-2019/2019-10-01T00:52:40-japan.txt

The 'tail' command does the opposite of 'head': it shows the last few (default ten) lines of a file.

    $ tail metar-2019/2019-10-01T00:52:40-japan.txt
    $ tail -n 1 metar-2019/2019-10-01T00:52:40-japan.txt

Of course, the 'head' and 'tail' commands can be used anywhere in a pipeline along with other commands.

** Q.3  (a) What command shows the first 5 lines of your data file?
            $ ____________
        (b) What two-command pipeline shows you only the 5th line of the file?
            $ ____________

Sorting data
============

The 'sort' command puts lines of text into alphabetic order.
With the option '-n' it puts lines of text into numeric order.

Normally 'sort' considers entire lines when sorting.
Two options, '-t' and '-k', change this behaviour.
Similar to the 'cut' command, 'sort -k N' tells sort to use the Nth field (numbered from 1).
The '-t char' option sets the field separator to char, instead of the default blank (space or tab) character.
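
As a small illustration of '-t' and '-k', here is a sketch that sorts three colon-separated lines on their second field, first as text and then as numbers.
The sample data and the variable names ('data', 'by_text', 'by_number') are made up for this illustration and are not part of the exercise files.

```shell
# Hypothetical sample data in the form name:quantity (made up for this sketch).
data='pear:10
apple:9
fig:100'

# Sort on the 2nd colon-separated field, comparing the field as text.
by_text=$(printf '%s\n' "$data" | sort -t : -k 2)

# Sort on the same field, but compare it as a number because of -n.
by_number=$(printf '%s\n' "$data" | sort -n -t : -k 2)

printf 'text order:\n%s\n' "$by_text"
printf 'numeric order:\n%s\n' "$by_number"
```

Compared as text, '100' comes before '9' because the character '1' sorts before '9'; with '-n' the same values are compared as numbers, so 9 comes first.
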
The 'sort' command can therefore reorder the lines of a CSV file directly:

    'sort -t , -k 6'      sorts input lines alphabetically by their 6th comma-separated field
    'sort -n -t , -k 3'   sorts input lines numerically by the 3rd comma-separated field

'sort' also understands the option '-r' which reverses the order of the sort (similar to 'ls').

Consider the telephone directory from the self-preparation exercises:

    name,given,office,phone,lab,phone
    Adams,Douglas,042,0042,092,0092
    Kay,Alan,301,3001,351,3051
    Knuth,Donald,201,2001,251,2051
    Lee,Tim,404,4004,454,4054
    McCarthy,John,202,2002,252,2052
    Shannon,Claude,304,3004,351,3051
    Vinge,Vernor,302,3003,352,3053

If this is stored in a file called 'directory.txt' then

    tail -n +2 directory.txt

will print the contents of the file without the 'header' line.
This can be piped into a 'sort' command to reorder the lines.
For (a very silly) example,

    tail -n +2 directory.txt | sort -r -n -t , -k 6

will print the directory sorted in reverse numeric order of lab telephone number (the 6th field).

** Q.4  What command will sort this file into alphabetic order of given name (not family name)?

** Q.5  What command will sort this file into numeric order of office number?

** Q.6  What command will sort this file into reverse numeric order of lab number?

Analysing data files
====================

Let's do some simple analysis on the data: What were the hottest and coldest temperatures in October 2019?

As a first step, we can collect what we know about the data.

1. What is in each file?
   Each file contains one reading per sensor station for a particular hour.

2. Do the files contain only data?
   Each file has six lines of header information followed by the data.

3. How is the data formatted?
   Each line of data contains all the information for one station.
   The information is a comma-separated list of fields, with data in predictable places.

       text,station_id,time,latitude,longitude,temp_c,dewpoint_c,wind_dir_deg,wind_speed_kt,...

4. Where is the temperature?
   The 6th field on each line contains the temperature as a number (degrees Celsius).

Next is to write, in natural language, the steps needed to obtain the information we want.

1. From the data files, extract just the lines containing the sensor readings.

       ROTM 301533Z...,ROTM,2019-09-30T15:33:00Z,26.27,127.75,27.2,25.6,120,12,...
       ROAH 301530Z...,ROAH,2019-09-30T15:30:00Z,26.17,127.65,28.0,26.0,120,13,...
       RJAA 301530Z...,RJAA,2019-09-30T15:30:00Z,35.77,140.37,20.0,19.0,10,7,...

2. From each line of sensor readings, extract just the temperature reading.

       27.2
       28.0
       20.0

3. Put the readings into a useful order, e.g., smallest to largest.

       20.0
       27.2
       28.0

4. From the ordered temperature readings, extract the first = the smallest.

       20.0

5. From the ordered temperature readings, extract the last = the largest.

       28.0

Finally, turn each of these steps into one stage of a command pipeline.

1. From the data files, extract just the lines containing the sensor readings.
   => What is special about the data lines, compared to the header lines, that identifies them?

       grep R metar-2019/2*

2. From each line of sensor readings, extract just the temperature reading.

       cut -d , -f 6

3. Put the temperatures into a useful order.

       sort -n

4. From the temperature readings, extract the first = smallest.

       head -n 1

5. From the temperature readings, extract the last = largest.

       tail -n 1

Putting that all together:

    grep R metar-2019/2* | cut -d , -f 6 | sort -n | head -n 1
    grep R metar-2019/2* | cut -d , -f 6 | sort -n | tail -n 1

Where and when were the lowest and highest temperatures?
========================================================

The 'cut -d , -f 6' command cuts out just the temperature field from each line.
Changing that to 'cut -d , -f 2,6' will print the station_id (location) as well as the temp_c column.
(The two will be separated by a comma.)
The output lines from 'cut' will now look like this:

    ROTM,27.2
    ROAH,28.0
    RJAA,20.0

** Q.7  What 'sort' command is now needed to sort lines numerically on the temperature field?

    grep R metar-2019/2* | cut -d , -f 2,6 | _________________ | head -n 1

** Q.8  What 'cut' command will print the station_id (2nd), time (3rd), and temp_c (6th) columns?

    grep R metar-2019/2* | _________________

** Q.9  What 'sort' command is now needed to sort lines numerically on the temperature field?

    _________________

For various reasons the sensor reports from stations are not 100% reliable.
Stations are sometimes missing from the hourly reports.

Recall that the 'uniq -c' command removes (adjacent) duplicate lines and prints a count of how many times each line occurs.
We can use this to count how many times each station reported its sensor values during the month.

** Q.10 Write a pipeline that prints each station_id prefixed by how many reports it generated during the month.

    ________________________ | _____________ | ____ | _______

(Hint: you should find RJAA produced 742 reports but RJAI only produced 464.)

Bonus
=====

** Q.11 (a) Which was the least reliable station?   ____
        (b) How many reports did it generate?       ___
        (c) How many stations were 100% reliable?   __