~~NOTOC~~

{{page>css&nodate&noeditbtn&nofooter}}

<WRAP syllabus round>

===== Week 07 — Command sequencing =====

Note: This week we will practice and learn more about working with multiple files and directories, and about how text files can be used to store simple databases.
In the two weeks following this one we will study command sequencing (scripts), control, and shell variables.

==== Evaluation ====

Up to 10 points can be gained towards your final score by completing the **in-class assignment** on Friday.

==== Preparation ====

== 1. Complete the self-preparation assignment at home before next class ==

This week's self-preparation assignment is mostly practice with some familiar shell commands and some new ones.
The new commands are explained in the
[[#Notes|Notes]]
section below, which also contains the practice exercises.
(If you are already familiar with the command line, please at least skim the notes to make sure there are no small details that are new to you.)
These commands and concepts will be further explored in the in-class assignment on Friday.

== 2. Check your understanding of command concepts using the self-assessment questionnaire ==

  - Answer each question in the self-assessment questionnaire as honestly as you can.
  - Revise the topics having the lowest scores, update your scores.
  - Repeat the previous step until you feel comfortable with most (or all) of the topics.

On Thursday evening, press 'submit'.
In class on Friday we will check which topics were difficult for everyone.

To succeed at the in-class assignment for this class you should understand the topics outlined in the "Notes" section.

==== What you will learn from this class ====

/**************************************************************** TODO

!!!! DELETING DIRECTORIES AND FILES !!!!

==== Environment variables ====
==== Path name expansion ====
==== Quoting ====
==== Interactive history ====
==== Command and filename completion ====
==== Command substitution ====
==== Arithmetic substitution ====
===== Looking for help: man and help =====
===== Shell scripts =====

  * how to scale performing one operation on one file to performing bulk operations on multiple files

****************************************************************/

  * How to create new directories and files.
  * How to move (rename) and copy (duplicate) files and directories.
  * How to delete directories.
  * How to use //wildcards// to specify a pattern that expands to many file names.
  * How to use //brace expressions// to generate new file and directory names.
  * How to skip over the first part a file using ''tail''.
  * How to use ''cut'' to extract fields from a simple database stored as a text file.


/*
++++ Glossary of |

  ; entry
  : description

++++

*/

/* ==== Further reading ==== */

/****************************************************************/

==== Notes ====

The notes below include several exercises with answers that introduce new concepts.
Many of these concepts will be used in this week's in-class assignment.

Read the notes and try to complete each of the
exercises //without// looking at the sample answer.
If you cannot complete an exercise using a few short commands then
read the sample answer, practice it, and make sure you
understand it //before// continuing.

=== Review ===

First make sure you understand the important topics from the previous two weeks.

++++ Review of previous weeks |

Review of important concepts:

  * The file system manages the storage of data on the disk.
  * Files contain data.
  * Directories contain files or other directories, forming a directory tree.
  * ''cd //path//'' changes the current working directory.
  * ''ls //path//'' lists information about a file or directory.  With no argument, ''ls'' lists the files in the current working directory.
  * ''pwd'' prints the current working directory.
  * ''/'' at the start of a path means the //root// directory at the 'top' of the filesystem.
  * An //absolute path// specifies a location starting from the root directory (and therefore always begins with ''/'').
  * A //relative path// specifies a location starting from the current working directory.
  * Directory names in a path are separated by ''/'' characters.
  * "''..''" is the name of the parent directory;
    "''.''" is the name the current directory.

Files and directories form a //tree// structure.
The topmost directory is called the //root// and it has no name.

File and directory names that start with ''/'' are called //absolute// and they specify the //path// to the file/directory starting from the root directory.
Elements in a path are separated by the ''/'' character.
Each element names a directory in which the next element in the path can be found.
The final element can name either a directory or a file.

Your shell (command line) has a //current working directory//.
File and directory names that do not start with a ''/'' are called //relative// and they specify a path to a file/directory starting from the current working directory.

Each user (account) of a computer has their own //home directory// where their files and and directories are stored.
The home directories for all the user accounts are stored in a directory called ''/home'' or ''/Users'', depending on the OS.
Each user's directory is named after their account.
The home directory for a user account called ''piumarta'' will therefore be ''/home/piumarta'' or ''/Users/piumarta''.
When you start the shell your working directory will be set to your home directory.

Every directory has two entries for directories called ''.'' and ''..''.
The entry ''.'' points to the directory itself.
The entry ''..'' points to the parent directory, the one 'above it' in the tree (one step closer to the root).

If your current working directory is ''/home/user'' then the paths ''/home'' and ''..'' both refer to the same directory.

The ''pwd'' command prints the current working directory.

The ''cd'' command changes your current working directory.
  * Without an argument, ''cd'' takes you to your home directory, where all your files and directories are stored.
  * With a directory name as argument, ''cd //dir//'' changes to the directory ''//dir//''.
  * With the special argument ''-'', ''cd -'' changes to the directory you were in before the current one.

The ''mkdir //dir//'' command creates a new directory called ''//dir//''.

The ''cp'' command copies files.
  * ''cp //file1// //file2//'' copies //file1// to //file2//.
  * ''cp //files...// //dir//'' copies one or more //files...// into the directory //dir// (which can be ''.'' to copy the files into your working directory).
Useful options for ''cp'' include:
  * ''-i'' (interactive) which prompts you before overwriting an existing file.
  * ''-v'' (verbose) which shows files as they are copied.

The ''ls'' command lists the contents of directories.
  * Without an argument, ''ls'' lists the contents of the working directory.
  * With one or more arguments ''ls //paths...//'' lists information about each //path//.
Useful options for ''ls'' include:
  * ''-a'' (all) shows information about hidden files, whose names start with ''.''.
  * ''-d'' (directory) shows information about directory entries instead of about the files that the directory contains.
  * ''-l'' (long) shows the access permissions, link count, owner, group, size, and last modification time of the file(s).

The ''cat'' command //concatenates// files and prints the result.
  * Without an argument, ''cat'' reads from the default input (keyboard) one line at a time and sends each line to the default output (screen).
  * With one or more arguments, ''cat //files...//'' reads from each //file// and sends their contents to the default output (screen).
Useful options for ''cat'' include:
  * ''-n'' (number) adds line numbers to the output.

The ''grep //pattern//'' command searches files for lines that contain //pattern//.
  * Without additional arguments, ''grep //pattern//'' reads from the default input (keyboard) one line at a time and sends lines that match //pattern// to the default output (screen).
  * With one or more additional arguments, ''grep //pattern// //files...//'' searches each //file// and sends lines that match //pattern// to the screen.  If
    two or more //files// are given then each line of output is preceded with the name of the file where the line was found.
Useful options for the ''grep'' command include:
  * ''-i'' (ignore case) ignores upper and lower case differences when matching letters.
  * ''-v'' (invert) negates the matching: only lines that do not match //pattern// are sent to the output.

The ''wc'' command counts words, lines, and characters.  For each input it prints the number of lines, then words, then characters.
  * Without an argument, ''wc'' reads from the default input (keyboard) and summarises the results when the input ends.
  * With one or more arguments, ''wc //files...//'' reads from each //file// and summarises the results for each //file// as well as the total result for all //files//.
Useful options for ''wc'' include:
  * ''-l'' (lines) print only the number of lines.
  * ''-w'' (words) print only the number of words.
  * ''-c'' (characters) print only the number of characters.

The ''rm //files...//'' command removes (deletes) one or more //files//.
Useful options for the ''rm'' command include:
  * ''-r'' (recursive) deletes the contents of a directory before deleting the directory itself.

The ''rmdir //directories...//'' command removes (deletes) one or more //directories//.
The //directories// must already be empty (no entries other than ''.'' and ''..''), otherwise ''rmdir'' will refuse to remove them.

By default, commands send their output to the screen.
To change this, use ''//command// > //file//'' to send the output of //command// to //file// instead of the screen.

By default, commands read their input from the keyboard.
To change this, use ''//command// < //file//'' to read the input of //command// from //file// instead of the keyboard.

The output of a command can be sent directory to the input of another command using a pipe ''|''.
''//command1// | //command2//'' connects the output of //command1// to the input of //command2//.
This can be repeated to form long //pipelines// of commands.

Press ''Control-C'' (hold down the ''Control'' key while typing ''C'') to terminate a program.

Press ''Control-D'' (hold down the ''Control'' key while typing ''D'') to simulate 'end of file' when input is being read from the keyboard.

++++


/**** TODO: Filesystem layout: standard directories ****/


=== Copying directories ===

The command ''cp //files//... //directory//'' copies one or more //files// into //directory//.
If any of the ''files'' happen to be directories then the ''cp'' command will fail.

To copy an entire directory (recursively) use ''cp'' with the ''-r'' option.

The ''cp -r //files//... //directory//'' command copies one or more //files// into //directory//.
If any of the ''files'' are directories then first the directory is copied along with
all of its contents.

Let's practice on a simple directory hierarchy.

<wrap exercise>
Use the ''mkdir'' and ''echo'' commands to recreate the ''dir1'' directory
and its three files as shown in the diagram.
The content of the three files is not important.
</wrap>

{{  07-dir1-bb.png?473  }}

<WRAP shell>
$ **cd /tmp**
$ **mkdir dir1**
$ **echo 1 > dir1/file1**
$ **echo 2 > dir1/file2**
$ **echo 3 > dir1/file3**
$ **ls -lR dir1**
dir1:
total 48
-rw-r--r-- 1 piumarta dialout 2 Oct 26 05:15 file1.txt
-rw-r--r-- 1 piumarta dialout 2 Oct 26 05:15 file2.txt
-rw-r--r-- 1 piumarta dialout 2 Oct 26 05:15 file3.txt
</WRAP>

<wrap exercise>
Use ''cp -rv'' (**r**ecursive and **v**erbose)
to copy the entire directory ''dir1'' to a new directory tree called ''dir2''.
</wrap>

<WRAP shell>
$ **cp -rv dir1 dir2**
'dir1' -> 'dir2'
'dir1/file3.txt' -> 'dir2/file3.txt'
'dir1/file2.txt' -> 'dir2/file2.txt'
'dir1/file1.txt' -> 'dir2/file1.txt'
</WRAP>

Because ''dir2'' does not yet exist, it is first created in the current directory and then the contents of ''dir1'' are copied to ''dir2''.
The ''-v'' option shows you the directory being created and the files being copied.

<wrap exercise>
What will happen if you run the same ''cp -rv dir1 dir2'' command again?
</wrap>

<WRAP shell>
$ **cp -rv dir1 dir2**
'dir1' -> 'dir2/dir1'
'dir1/file3.txt' -> 'dir2/dir1/file3.txt'
'dir1/file2.txt' -> 'dir2/dir1/file2.txt'
'dir1/file1.txt' -> 'dir2/dir1/file1.txt'
$ **ls -lR dir2**
dir2:
total 64
drwxr-xr-x 2 piumarta dialout 170 Oct 26 05:57 dir1
-rw-r--r-- 1 piumarta dialout   2 Oct 26 05:54 file1.txt
-rw-r--r-- 1 piumarta dialout   2 Oct 26 05:54 file2.txt
-rw-r--r-- 1 piumarta dialout   2 Oct 26 05:54 file3.txt
<html> </html>
dir2/dir1:
total 48
-rw-r--r-- 1 piumarta dialout 2 Oct 26 05:57 file1.txt
-rw-r--r-- 1 piumarta dialout 2 Oct 26 05:57 file2.txt
-rw-r--r-- 1 piumarta dialout 2 Oct 26 05:57 file3.txt
</WRAP>

Because ''dir2'' already exists, ''dir1'' is copied into ''dir2'';
the new copy of ''dir1'' does not replace ''dir2''.

=== Removing directories ===

The ''rmdir //dir//'' command removes the directory //dir//.

<wrap exercise>Try removing ''dir1''.</wrap>

<WRAP shell>
$ **rmdir dir1**
rmdir: failed to remove 'dir1': Directory not empty
</WRAP>

A directory must be empty before it can be removed.

You could remove the files ''dir1/file1.txt'', ''dir1/file2.txt'', and ''dir1/file3.txt''
one at a time but that would be tedious.
Instead, remove all three at the same time using a //wildcard//.
The path ''dir1/*'' expands to all three of the files in ''dir1''.
If you use ''rm -v dir1/*'' (''-v'' for **v**erbose)
then each name will be printed as it is removed.
Once the three files are removed you will he able to remove their parent directory ''dir1''.

<wrap exercise>
Use ''rm -v dir1/*'' to remove all the files in ''dir1''.
</wrap>

<WRAP shell>
$ **ls dir1**
file1.txt  file2.txt  file3.txt
$ **rm -v dir1/* **
removed 'dir1/file1.txt'
removed 'dir1/file2.txt'
removed 'dir1/file3.txt'
$ **rmdir dir1**
$ **ls dir1**
ls: cannot access 'dir1': No such file or directory
</WRAP>

We still have ''dir2'' which contains three files and a copy of the original
''dir1'' (with three more files inside that directory).
The ''*'' wildcard is less useful when removing this many files.
Instead you can use ''rm -r'' (''-r'' for **r**ecursive) which
will remove the contents of a directory before removing the directory itself.

<wrap exercise>
Use ''rm -r dir2'' to remove ''dir2'' and all of its contents.
</wrap>

<WRAP shell>
$ **ls -F dir2**
dir1/  file1.txt  file2.txt  file3.txt
$ **rm -r dir2**
$ **ls dir2**
ls: cannot access 'dir2': No such file or directory
</WRAP>

<WRAP danger><html><label>WARNING!</label></html>
When you delete a file from the command line it is gone //forever//.
There is no 'trash can' that collects deleted files.
There is no way to restore a deleted file later if you change your mind.</WRAP>

=== Wildcards ===

In the exercises above the argument ''dir2/*'' matched all the filenames in ''dir2''.
The shell //expanded// the pattern ''dir2/*'' into three separate arguments: ''dir2/file1'',  ''dir2/file2'', and ''dir2/file3''.

The ''*'' character actually matches any sequence of characters (zero or more) except ''/''.
You can use it to match 'anything' in a part of a filename.
You can also use it more than once to match 'anything' in several different parts of a filename.

<wrap exercise>
List all files in ''/etc'' that begin with ''b'', that end with ''.conf'', or that have a ''.'' anywhere in their name.
</wrap>

<WRAP shell>
$ **ls /etc/b* **
/etc/baseprofile      /etc/bash_completion
$ **ls /etc/*.conf**
/etc/nsswitch.conf
$ **ls -d /etc/*.* **
/etc/init.d              /etc/nsswitch.conf       /etc/rebase.db.i386      /etc/vimrc.less
/etc/minirc.dfl          /etc/persistprofile.sh   /etc/sessionsaliases.sh  /etc/xmodmap.esc
</WRAP>

Another useful wildcard character is ''?'' which matches exactly one of any character (except ''/'').

<wrap exercise>
List all files in ''/etc'' that have an ''o'' and an ''f'' in their name separated by exactly one other character (it does not matter which character).
</wrap>

<WRAP shell>
$ **ls /etc/*o?f* **
/etc/nsswitch.conf  /etc/ssh_config
</WRAP>

One more useful wildcard pattern is ''[//chars//]'' which matches exactly one of any of the //chars// listed between the square brackets.

<wrap exercise>
List all files in ''/etc'' that have a two consecutive vowels ('a', 'e', 'i', 'o', or 'u') in their name.
</wrap>

<WRAP shell>
$ **ls -d /etc/*[aeiou][aeiou]* **
/etc/bash_completion     /etc/defaults            /etc/screenrc            /etc/version
/etc/bash_completion.d   /etc/group               /etc/sessionsaliases.sh
</WRAP>

When the //chars// contains a range of consecutive characters, you can specify the entire range using "''//first//-//last//''".

<wrap exercise>
Use the "''[//first//-//last//]''" pattern to list all files in ''/etc'' whose name contains at least one digit.
</wrap>

<WRAP shell>
$ **ls -d /etc/*[0-9]* **
/etc/X11             /etc/at-spi2         /etc/dbus-1
/etc/gtk-3.0         /etc/pkcs11          /etc/rebase.db.i386
</WRAP>

The wildcard patterns explained above are expanded by the shell according to the files that actually exist in the filesystem.
What happens if you use a wildcard pattern that does not match any files?

<wrap exercise>
Try to delete some non-existent 'log' files: ''dir1/*.log''.
</wrap>

<WRAP shell>
$ **rm dir/*.log**
rm: can't remove 'dir/<nowiki>*</nowiki>.log': No such file or directory
</WRAP>

If the wildcard pattern does not match any files, it is simply left //unexpanded//.
When the command tries to access a file named by a wildcard expression, the file does not exist and an error message is generated.


=== Dry runs: using "echo" to preview commands ===

A 'dry run' is a rehearsal or practice that takes place before the real performance.
In computing, a dry run shows you what a command //would// do but without actually doing it.
One example of how useful they are is to see what files would be matched by wildcard patterns, for example before actually removing them.

For the next exercise, set up your ''dir1'' directory as above, containing six files:
  * three text files ''file1.txt'',  ''file2.txt'', and ''file3.txt'', containing the words ''think'',  ''for'',  and ''yourself'';
  * three data files ''file1.dat'',  ''file2.dat'', and ''file3.dat'', containing the number of characters in the corresponding .''txt'' files.

<WRAP shell>
$ **mkdir dir1**
$ **echo think    > dir1/file1.txt**
$ **echo for      > dir1/file2.txt**
$ **echo yourself > dir1/file3.txt**
$ **wc -c dir1/file1.txt > dir1/file1.dat**
$ **wc -c dir1/file2.txt > dir1/file2.dat**
$ **wc -c dir1/file3.txt > dir1/file3.dat**
$ **ls -l dir1**
total 3
-rw-r--r--    1 user     UsersGrp        17 Oct 26 16:51 file1.dat
              -rw-r--r--    1 user     UsersGrp         6 Oct 26 16:51 file1.txt
-rw-r--r--    1 user     UsersGrp        17 Oct 26 16:51 file2.dat
-rw-r--r--    1 user     UsersGrp         4 Oct 26 16:51 file2.txt
-rw-r--r--    1 user     UsersGrp        17 Oct 26 16:51 file3.dat
-rw-r--r--    1 user     UsersGrp         9 Oct 26 16:51 file3.txt
</WRAP>

<wrap exercise>Use</wrap> the ''echo'' command to perform a dry-run of removing:
  * all the ''.txt'' files in ''dir1'',
  * all the ''.dat'' files in ''dir1'',
  * the ''.txt'' and ''.dat'' files for only ''file2'' (two files in total),
  * the ''.txt'' and ''.dat'' files for ''file1'' and''file3'' (four files in total).

<WRAP shell>
$ **echo rm dir1/*.txt **
rm dir1/file1.txt dir1/file2.txt dir1/file3.txt
$ **echo rm dir1/*.dat **
rm dir1/file1.dat dir1/file2.dat dir1/file3.dat
$ **echo rm dir1/file2.* **
rm dir1/file2.dat dir1/file2.txt
$ **echo rm dir1/file[13].* **
rm dir1/file1.dat dir1/file1.txt dir1/file3.dat dir1/file3.txt
</WRAP>

++++ Why is it called a 'dry run'? |
<WRAP right 14%>{{07-keep-calm-dry-run.png}}</WRAP>
<WRAP right 25%>{{07-fire-department-dry-run.jpg}}</WRAP>
Fire departments run practice sessions in which fire engines are dispatched, fire hoses are deployed, but water is not actually pumped onto a fire.
Since the exercise performs all the actions of fire-fighting //except// pumping water onto a fire, it is literally a 'dry' run.
On the right: fire fighters perform a 'dry run' of rescuing people from a frozen lake whose surface has started to break up.
<WRAP clear></WRAP>
++++

=== Creating files and updating timestamps ===

The ''touch'' command updates the last modification time of an existing file to be the current date and time.
If the file does not exist, an empty file is created.

<wrap exercise>
Create two empty files called ''file1'' and ''file2''.
</wrap>

<WRAP shell>
$ **cd dir1**
$ **ls -lt file[12]**
ls: file[12]: No such file or directory
$ touch file1 file2
$ **ls -lt file[12]**
-rw-r--r--    1 user     UsersGrp         0 Oct 26 18:33 file1
-rw-r--r--    1 user     UsersGrp         0 Oct 26 18:33 file2
$ touch file2
$ **ls -lt file[12]**
-rw-r--r--    1 user     UsersGrp         0 Oct 26 18:33 file2
-rw-r--r--    1 user     UsersGrp         0 Oct 26 18:33 file1
$ touch file1
$ **ls -lt file[12]**
-rw-r--r--    1 user     UsersGrp         0 Oct 26 18:33 file1
-rw-r--r--    1 user     UsersGrp         0 Oct 26 18:33 file2
</WRAP>


Note how ''touch''ing a file moves it to the top of the 'most recent' list (''ls -t'').

=== Generating path names using brace expressions ===

Wildcards are used to match existing file names.
They cannot be used to generate file names for non-existent files or directories, for example, to create a set of needed files or directories.

<wrap exercise>
Try using a wildcard to create ten empty files called ''test0'', ''test1'', ''test2'', ..., ''test9''.
</wrap>

<WRAP shell>
$ **touch test[0123456789]**
$ **ls test* **
test[0123456789]
</WRAP>

Creating a single file called ''test[0123456789]'' is not what you intended.
That is what happened because the shell could not find any existing file to match
the pattern ''test[0123456789]'' and so did not expand it in the command line.

A //brace expression// will generate multiple //words// based on a list or sequence of values.
The list of values to generate is written between curly braces ''{'' and ''}''
with items in the list separated by commas.
For example, the expression ''{a,b,c}'' generates three separate words ''a'', ''b'', and ''c''.
The brace expression can appear in a larger pattern,
for example, the expression ''p{a,b,c}q'' generates three separate words
''paq'', ''pbq'', and ''pcq''.

<wrap exercise>
Use a brace expression to generate the command needed to create the five files
''test0.txt'' to ''test4.txt''.
</wrap>

<WRAP shell>
$ **touch test{0,1,2,3,4}.txt**
$ **ls test* **
test0.txt  test1.txt  test2.txt  test3.txt  test4.txt
</WRAP>

When a //sequence// of numbers or letters are needed then the list can contain
just the first and last values separated by ''..''.
This is called a //sequence expression//.
For example, the sequence expression ''p{a..z}q'' generates a list of 26 words,
starting with ''paq'' and ''pbq'', and ending with ''pyq'' and ''pzq''.

<wrap exercise>
Use a brace expression to generate the command needed to create the five files
''test5.txt'' to ''test9.txt''.
</wrap>

<WRAP shell>
$ **touch test{5..9}.txt**
$ **ls test* **
test0.txt  test1.txt  test2.txt  test3.txt  test4.txt
test5.txt  test6.txt  test7.txt  test8.txt  test9.txt
</WRAP>

In a sequence expression that generates numbers, the first value in the sequence
sets the minimum width of the generated numbers.
This is useful if leading ''0''s are needed.
For example, the following sequence expressions generate lists of 100 words:
  * ''test{0..99}'' generates ''test0'', ''test1'', ... , ''test98'', ''test99'', and
  * ''tt{000..099}'' generates ''tt000'', ''tt001'', ... , ''tt098'', ''tt099'', and
  * ''t{00000..99}'' generates ''t00000'', ''t00001'', ... , ''t00098'', ''t00099''.


=== CSV files and the "cut" command ===

Text files are often used as simple 'databases' for storing captured sensor data, the results of data processing, etc.
The shell provides several commands for manipulating data stored in this kind of text file.

A comma-separated value (CSV) file is one example of this kind of text file database.
Each line is a record and each field in that record is separated from the next with a specified delimiter character.
In a CSV file the delimiter is a comma, "'',''".

The ''cut'' command selects and prints fields from exactly this kind of text file.
By default it uses a 'tab' character to separated fields (just as a copy-paste operation between Excel and a text editor does) but this can be changed using a command line option.
''cut'' has the following command line options:
  * ''-d //character//'' specifies the delimiter //character//.  To manipulate CSV files, use: "''cut -d ,''"
  * ''-f //fields//'' tells ''cut'' which of the fields you want to print.  Fields are numbered, starting at 1, and //fields// can contain multiple fields separated by commas.

Create a CSV file called ''directory.txt'' that contains the following data.
(The easiest way is to copy the text it from this web page and paste it into a text editor,
or into "''cat > directory.txt''" followed by <wrap key>Control</wrap>+<wrap key>D</wrap> to simulate end-of-file.)

  name,given,office,phone,lab,phone
  Adams,Douglas,042,0042,092,0092
  Kay,Alan,301,3001,351,3051
  Knuth,Donald,201,2001,251,2051
  Lee,Tim,404,4004,454,4054
  McCarthy,John,202,2002,252,2052
  Shannon,Claude,304,3004,351,3051
  Vinge,Vernor,302,3003,352,3053

<wrap exercise>
Use the ''cut'' command to extract just the "office" column from the data.
</wrap>

<WRAP shell>
$ **cut -d , -f 3 directory.txt**
office
042
301
201
404
202
304
302
</WRAP>

The ''tail'' command has an option to print a file starting at a specific line number.
The syntax is: "''tail -n +//number//''".
For example, "''tail -n +5 //file//''" will print the contents of //file// starting from the 5th line in the file.

<wrap exercise>
Pipe (''|'') the output from the previous command into ''tail''.
Use the ''tail -n +//number//'' option to print the input starting at line number 2.
</wrap>

<WRAP shell>
$ **cut -d , -f 3 directory.txt | tail -n +2**
042
301
201
404
202
304
302
</WRAP>

The ''grep'' command understands the similar wildcard patterns to the shell.
(The shell uses them to filter file names and ''grep'' uses them to filter or select lines of text.)

Each office number in our sample data is three digits long.
The first digit says which floor the office is on.
One way to extract just the office numbers on the second floor is to use ''grep'' to search for numbers matching the pattern "''2[0-9][0-9]''".
You can then count how many offices are on the second floor using "''wc -l''".

<wrap exercise>
Write a pipeline of commands that prints how many offices are located on the third floor.
Try very hard to do this without looking at the sample answer.
If you cannot find the solution, click on the link below to view the answer.
</wrap>

++++ Sample answer |

<WRAP shell>
$ **cut -d , -f 3 directory.txt | tail -n +2 | grep '3[0-9][0-9]' | wc -l**
3
</WRAP>

If this does not make sense, look at the output from each stage of the pipeline.

<WRAP shell>
$ **cut -d , -f 3 directory.txt**
office
042
301
201
404
202
304
302
$ **cut -d , -f 3 directory.txt  | tail -n +2**
042
301
201
404
202
304
302
$ **cut -d , -f 3 directory.txt  | tail -n +2 | grep '3[0-9][0-9]'**
301
304
302
$ **cut -d , -f 3 directory.txt  | tail -n +2 | grep '3[0-9][0-9]' | wc -l**
3
</WRAP>

++++

=== Summary ===

  * ''echo > //file//''               can be used to create a //file// containing a line of data.
  * ''touch //file//''                can be used to create an empty //file// or to update its modification time to 'now'.
  * ''mkdir //directory//''           creates a new //directory//.
  * ''cp //oldfile// //newfile//''    copies (duplicates) //oldfile// to //newfile//.
  * ''mv //oldfile// //newfile//''    moves (renames) a file or directory.
  * ''cp //files...// //directory//'' copies one or more //files// (or directories) into an existing //directory//.
  * ''mv //files...// //directory//'' moves one or more //files// (or directories) into an existing //directory//.
  * ''rm //files...//''               removes (deletes) //files//.
  * ''rmdir //directory//''           removes (deletes) a //directory// which **must** be empty.
  * ''rm -r //directory//''           removes (deletes) a //directory// and all its contents, recursively.
  * "''*''"                           in a file name matches zero or more characters, so "''*.txt''" matches all files ending in "''.txt''".
  * "''?''                            in a file name matches any single character, so "''?.txt''" matches "''a.txt''"" but //not// "''any.txt''".
  * "''[//characters//']''            in a file name matches any one of the //characters//, so "''[aeiou].txt''" matches "''a.txt''"" but //not// "''b.txt''".
  * "''[//first//-//last//']''        in a file name matches any character in the range //first// to //last//, so "''*[a-m].txt''" matches "''boa.txt''"" but //not// "''constrictor.txt''".
  * Wildcards (''*'', ''?'', ''[]'')  are expanded by the shell to match files that //already exist//.  They cannot generate new (non-existent) file names.
  * ''{a,b,c}''                       expands to three words: ''a'', ''b'', and ''c''.
  * ''p{a,b,c}q{x,y,z}r''             expands to nine words: ''paqxr paqyr paqzr pbqxr pbqyr pbqzr pcqxr pcqyr pcqzr''
  * ''{000..5}.txt''                  expands to six words: ''000.txt 001.txt 002.txt 003.txt 004.txt 005.txt''
  * ''tail -n +//number//''           displays input starting at line //number// (and continuing until the last line).
  * There is no 'trash': when a file or directory is deleted it is gone immediately and forever.
  * ''cut -d //char// -f //fields//'' prints the given //fields// from its input lines using //char// as the field delimiter.
    The //fields// are numbered from 1 and multiple field numbers are separated by commas.

/* ---------------- IN CLASS ----------------


doing wc -l on data files

saving output in lengths.txt

see lengths.txt page by page

analyse lengths using sort -n

use head and tail to find longest and shortest files


organising files into folders


what happens if you redirect output to a file being used as input?
sort -n lengths.txt > lengths.txt


using >> to append output to a file


pipelines and understanding what text flows through each step of the pipeline


extensions don't mean anything -- .txt is just a convention


combining cut sort uniq wc


checking quality of data, removing damaged files


  * Most files’ names are something.extension. The extension isn’t required, and doesn’t guarantee anything, but is normally used to indicate the type of data in the file.
  * command >> [file] appends a command’s output to a file.
  * [first] | [second] is a pipeline: the output of the first command is used as the input to the second.
  * The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).


*/

/*
----------------------------------------------------------------
			NEXT
----------------------------------------------------------------


=== Translating characters with the "tr" command ===


LOOPS

performing the same action on many different files

for thing in list of things
do
    operation on $thing
done

how the prompt changes when waiting for additional input

variables, word vs $word vs ${word}

using variables in loops -- operation (ls)

using > inside a loop vs >> vs > after the loop


quoting to allow spaces and other funny characters in filenames


cp file-*.txt backup-file-*.txt
  !=
for i in files-*; do cp $i backup-$i.txt; done


using ECHO to understand what a loop is going to do before running it for real


loop visualised as flowchart ; each execution of loop body visualised as echo process


using semicolons to separated the parts of a command instead of newlines


editing longer lines: Control + A E B F P N or arrow keys

repeating earlier commands
history | less
history | grep
!<number> to rerun a command

Control + R to reverse search

!! runs the previous command

!$ is last word of previous command

ESC-. inserts the last word of the previous command


using ECHO to do a "dry run"

protecting > and >> using "" in an ECHO argument


nested loops


  * A for loop repeats commands once for every thing in a list.
  * Every for loop needs a variable to refer to the thing it is currently operating on.
  * Use $name to expand a variable (i.e., get its value). ${name} can also be used.
  * Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as it complicates variable expansion.
  * Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
  * Use the up-arrow key to scroll up through previous commands to edit and repeat them.
  * Use Ctrl+R to search through the previously entered commands.
  * Use history to display recent commands, and ![number] to repeat a command by number.

----------------------------------------------------------------

shell scripts

get middle of file using head and tail

save the commands in a file.sh

run the 'bash file.sh'

replace the built-in file name with "$1" (double quotes around arguments to protect spaces)

$@ means all of the arguments, and "$@" means all of the arguments, each one inside implicit double quotes

bash options for debugging: -x

  * Save commands in files (usually called shell scripts) for re-use.
  * bash [filename] runs the commands saved in a file.
  * $@ refers to all of a shell script’s command-line arguments.
  * $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.
  * Place variables in quotes if the values might have spaces in them.
  * Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.

----------------------------------------------------------------

finding things: grep and find

grep options: -i -E -w

anchoring expressions: ^ and $

command substitution: $()

wc -l $(find . -name "*.txt")

grep PATTERN $(find .. -name "*.txt")

inverting matches: grep -v

  * find finds files with specific properties that match patterns.
  * grep selects lines in files that match patterns.
  * --help is an option supported by many bash commands, and programs that can be run from within Bash, to display more information on how to use these commands or programs.
  * man [command] displays the manual page for a given command.
  * $([command]) inserts a command’s output in place.

----------------------------------------------------------------

=== Working with multiple files ===

Let's practice manipulating large numbers of data files using the shell.

From the course web site you can download a archive file called ''metars-2019.tgz''.

++++ What is an archive? |
An archive is a file that contains other files.
You have probably already used a ''.zip'' file, which is popular on Windows.
Another popular format is ''.tar'' and the compressed version ''.tgz'' (for '**t**ar' compressed with '**gz**ip').
++++

The ''metars-2019.tgz'' archive contains aviation weather data for Japan.
??? HOW TO DOWLOAD THE FILE TO COMMAND LINE DIRECTORY?
Download the file and then extract the files inside it with the command ''tar xf metars-2019.tgz''.

++++ How the ''tar'' command works |
The command ''tar'' is short for **t**ape **ar**chive.
We don't use magnetic tape to store data any more, but the program is still a popular alternative to ''.zip'' files.
The first argument to ''tar'' tells it what to do.
In this case ''x'' means e**x**tract files from an archive, and
''f'' means the archive should be read from a **f**ile whose name appears in the next argument.
If you add ''v'' for **v**erbose it will also print each filename as it is extracted.
++++

Readings are collected from automated weather stations installed at Japanese airports.
Every hour the data from these weather stations is collected and stored in a file.
There are 8753 files in the archive (365 days x 24 hours per day = 8760, with a few omissions because of downtime).
Each file is named according to the date and time that the data were collected.
For example the file ''2019-01-01T00:53:57-japan.txt'' contains the data recorded on 2019/01/01 at 00:53:57 JST.

What is the structure of each file?

The 'structure' of a file means how the data is arranged within it.
Let's look in file ''2019-01-01T01:53:58-japan.txt'' to see what the structure looks like.
You can use ''cat 2019-01-01T01:53:58-japan.txt'' to do this, but the file is long (85 lines).
To see it one 'page' at a time use the command ''less 2019-01-01T00:53:57-japan.txt''.


+ is every file of the correct structure?

+ are there any unreliable data files due to network or system failures?

+ how many stations are there?

+ how can we reorganise the data to make it more useful?

+ how can we simplify the data for analysis?


Challenge: finding files and directories by type
================================================

The 'find' command finds files and directories by name or by property.
The default action of 'find' is to print the path of the files/directories it finds.

The general form of the command is: 'find directory -property value'

One option for -property is '-type' which understands a value of 'd' or 'f'.
So, 'find . -type d' looks in the current directory (.) and finds all directories (-type d)
and 'find . -type f' looks in the current directory (.) and finds all regular files (-type f).

Assume your current working directory is your home directory.

** Q.11 What command pipeline will count the number of directories under
   	'/usr/lib' that have the digit '2' somewhere in their name?

        ________________________________________________________________


*/


/*
=== Resources ===

Software Carpentry, etc.

Using the command line puts you in control at the level of the operating system and other fundamental processes that make it work.
Many operations and options that are not accessible using a graphical interface (Windows Explorer, Mac Finder, etc.) become accessible to you
on the command line.

Developers, engineers, scientists, and researchers all use the command line to make themselves faster and more effective (and happier)
than would be using only graphical interfaces.

Developers love command lines because they can invoke their tools directly with no options hidden from them by a 'development environment'.
Full access to debuggers, debugging facilities, and debugging information is possible.
Many low-level tools (e.g., for simulation) don't even have a graphical interface, and so using the command line is //the// way to interact with them.

Scientists and researchers use the command line to help manage and interact with data,
even if that involves hundreds of thousands of files and directories.
The facilities of the command line, such as bulk operations on files, make it easy to manage and work with that amount of data.

Command line tools can very quickly and easily be customised and combined to make your own new tools that perform powerful operations.
Converting data between different formats and representations is much easier on the command line than in a 'pointy-cicky' interface.

Command line users are faster, more efficient, more effective, and as a result //happier// than GUI users.
*/


</WRAP> /* syllabus */

/*
 * Local Variables:
 * eval: (flyspell-mode)
 * eval: (ispell-change-dictionary "british")
 * eval: (flyspell-buffer)
 * End:
 */