LPI Linux Essentials

3.2 Searching and Extracting Data from Files

August 16, 2020 (updated September 12th, 2022)

In this blog we step you through the process of searching and extracting data from files at the Linux command line. The Linux command line is built around many small commands, and the output from one command can be sent to the input of another. Piping is sending the output from one command to the input of another command. Redirection is similar but works with files rather than commands. Along with these, we take the time to visit a little of the menagerie of file reporting tools that Linux supplies.

Tools Used For Searching and Extracting Data from Files

There are many tools at the Linux command line aimed directly at extracting data from files and searching that data. We can start by seeing how the modular nature of the UNIX and Linux command line shells allows for bespoke applications using pipes.

Piping

As mentioned in Linux Essentials objective 2.4, we have two types of pipes: unnamed and named. Mainly we see unnamed pipes, but named pipes are commonly used between processes on your PC, allowing one application to talk to another. To make use of an unnamed pipe we place the vertical bar between two commands, as shown below.

$ ls -l | wc -l

This is a command pipeline in its very simplest form: just two commands. The output of ls is sent to the input of the command wc. The pipeline counts the number of lines of output from ls. Because ls -l prints a leading total line before the entries, the result is one more than the number of entries in the current directory; ls | wc -l gives the entry count itself (hidden files are left out unless you add -A). Yes, we have extracted data: the number of files in a directory.
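A quick way to see what the pipeline is counting is to run it against a directory with a known number of files. The following is a minimal sketch; the scratch directory and the file names one, two and three are placeholders chosen for illustration.

```shell
# Build a scratch directory holding exactly three files.
workdir=$(mktemp -d)
touch "$workdir/one" "$workdir/two" "$workdir/three"

# Plain ls writes one name per line when its output goes to a pipe.
entries=$(ls "$workdir" | wc -l)

# ls -l adds a leading "total" line, so the count is one higher.
long_lines=$(ls -l "$workdir" | wc -l)

echo "entries: $entries, lines from ls -l: $long_lines"

# Clean up the scratch directory.
rm -r "$workdir"
```

The difference of one between the two counts is the total line printed by ls -l.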

We call this an unnamed pipe as it is created on the fly without the existence of a pipe file. This is very convenient for us on the command line but not so convenient for applications that need to communicate with each other.

In this situation the caped crusader appears in the form of named pipes to save the day. An application will often create special files of the pipe type that can be used to link two commands together. The pipe file never stores data itself but marshals the data (controls the data movement) from one application to another. We can create our own named pipes using the command mkfifo (/usr/bin/mkfifo).

$ which mkfifo
/usr/bin/mkfifo

These special files are known as named pipes as they are represented by files of type PIPE in the file system and as such have a name. You can search for these file types on any Linux system using the find command. This is a very powerful tool and I would encourage you to practice with the command. The output from your system may differ:

$ find / -type p 2>/dev/null
/run/dmeventd-client
/run/dmeventd-server
/run/systemd/inhibit/14.ref
/run/systemd/inhibit/13.ref
/run/systemd/sessions/211.ref
/run/systemd/initctl/fifo
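To see a named pipe at work without touching system files, we can create one in a scratch directory and pass a line of text through it. This is only a sketch: the name demo.fifo and the message are made up for the example.

```shell
# Create a named pipe inside a scratch directory.
workdir=$(mktemp -d)
mkfifo "$workdir/demo.fifo"

# A writer blocks until a reader opens the pipe, so run it in the background.
echo "hello through the pipe" > "$workdir/demo.fifo" &

# The reader receives whatever the writer sent.
received=$(cat "$workdir/demo.fifo")
echo "$received"

# find reports the file as type p, just like the system pipes listed above.
pipes_found=$(find "$workdir" -type p)
echo "$pipes_found"

# Wait for the background writer to finish, then clean up.
wait
rm -r "$workdir"
```

The writer and the reader could just as easily be two separate applications; the pipe file only gives them a name to meet at.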

Redirection

Unlike piping, redirection takes the output from a command and sends that output to a file. Alternatively, a command may have a file redirected to its input, using the file's content as input.

In the previous find command, we redirected any error output, denoted by the number 2, to the special file /dev/null. In this way we do not list all of the errors that are produced when we cannot access a file due to limited permissions.

Each command has three channels that can be used for redirection:

  • Standard Input : Channel 0
  • Standard Output : Channel 1
  • Error Output : Channel 2

We only need to use the channel number when redirecting error output; the symbol < without a number means standard input, and > without a number means standard output. As such:

  • cat < file1 : file1 is read into standard input for the command cat
  • ls /etc > file1 : the standard output of ls is sent to file1, errors are shown on the screen and not redirected
  • ls /etc 2> file1 : Standard output is shown to the screen but errors are written to file1
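The channels listed above can be separated in a small experiment. In this sketch the function report is a hypothetical helper that writes one line to each output channel; the file names out.txt and err.txt are also just for illustration.

```shell
# Hypothetical helper that writes to both output channels.
report() {
    echo "normal output"          # channel 1, standard output
    echo "error output" >&2       # channel 2, error output
}

workdir=$(mktemp -d)

# Send each channel to its own file.
report > "$workdir/out.txt" 2> "$workdir/err.txt"

# Use < to feed each file back in as standard input for cat.
out_text=$(cat < "$workdir/out.txt")
err_text=$(cat < "$workdir/err.txt")

echo "out.txt holds: $out_text"
echo "err.txt holds: $err_text"

rm -r "$workdir"
```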

We can use the >> symbols to append to files, creating the files if they do not exist. With the single greater-than symbol > we create the file, or overwrite it if it exists. If you are concerned about overwriting existing files in error you may set the shell option noclobber. When it is set, new files can be created, but > will not overwrite a file that already exists. Using >| allows the file to be overwritten regardless. The noclobber option is usually set in a login script or from the command line.

$ set -o noclobber

The -o option sets the option, or turns the option ON.

$ set +o noclobber

The option +o turns the option OFF. To view the current settings you can use the command:

$ set -o

The above command will show all settings and their current state. Taking what we have learned about piping, we can pipe the output of set (a shell built-in command) to grep (/bin/grep), which can then search for the particular option we wish to view.

$ set -o | grep noclobber
noclobber off

Currently the option is disabled on my system as we can see from the above output.
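The effect of noclobber is easiest to see by trying each redirection symbol against the same file. The sketch below works in a scratch directory; notes.txt is a placeholder name, and the option is switched off again at the end so the rest of the session is unaffected.

```shell
workdir=$(mktemp -d)
target="$workdir/notes.txt"

# A plain > creates the file.
echo "first" > "$target"

# Turn the protection on.
set -o noclobber

# A plain > against an existing file is now refused by the shell.
# The braces let us hide the shell's error message for this demo.
if { echo "second" > "$target"; } 2>/dev/null; then
    result="overwritten"
else
    result="refused"
fi

# >| overrides noclobber, and >> may still append.
echo "third" >| "$target"
echo "fourth" >> "$target"

# Turn the protection off again.
set +o noclobber

contents=$(cat "$target")
echo "plain > was $result"
echo "$contents"

rm -r "$workdir"
```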

Settings that we make at the command line with the command set are transient. They apply to that shell and that shell only: a new shell will not have the same setting, and the setting is lost on logout. To make settings permanent, add the commands to your personal startup script, .bashrc in your home directory. The file /home/<user>/.bashrc is read each time that user starts an interactive bash shell.

Searching and Extracting Data from Files using Grep and Regular Expressions

The command grep is a simple tool that we can make use of both in everyday Linux usage and here in the course to help demonstrate regular expressions. To test regular expressions fully we may want to use egrep (/bin/egrep) or, more simply, grep -E to allow for extended regular expression matches.

In the previous example we searched for the text string noclobber in the output of the set command. We literally search for the string noclobber. We may think we are looking for the word noclobber, but computers think differently to us. Consider the following text file, test.txt, shown here using the command cat.

$ cat test.txt 
no color
color
colour
coloured
colored

To create the file use your favourite text editor such as nano or vim, or simply copy this code to the command line of your system and practice some redirection!

$ cat > test.txt <<END
no color
color
colour
coloured
colored
END

If we use grep to search for the string color, 3 lines that contain color are returned:

  1. no color

  2. color

  3. colored

$ grep color test.txt 
no color
color
colored

The command grep always returns complete matching lines, but people are often surprised that the line colored is returned. We have not asked to search for words, so a plain string match applies. If we need to search for the word color then we can use the \b operator in the regular expression to mark word boundaries. The boundaries surround the word, so we place \b on each side of it. We use the -E option with grep for extended regular expressions; \b itself is a GNU grep extension rather than part of the POSIX syntax.

$ grep -E '\bcolor\b' test.txt 
no color
color

To search for color at the end of the line we can use the character $, the end-of-line marker. Note that we do not need extended regular expressions for this search:

$ grep color$ test.txt 
no color
color

Reversing this a little, we can use the caret, ^, to search for lines beginning with color:

$ grep ^color test.txt 
color
colored

Again we can use the word boundary with an extended search to exclude the additional line:

$ grep -E '^color\b' test.txt
color

But what if we want both color and colour, the US and UK spellings? We can make the u optional by using the ? meta-character in the regular expression. The ? applies to the previous character, making it optional:

$ grep -E '\bcolou?r\b' test.txt
no color
color
colour

Should we want to match one of a set or range of characters in the regular expression, we can use square brackets. If we want to search for lines that begin with n or N then we could use the -i option with grep for a case-insensitive search; alternatively:

$ grep ^[nN] test.txt
no color

But be careful, a misplaced ^ can easily reverse the search. A caret as the first character inside the brackets negates the set, matching any character that is not n or N. We could also use the -v option to grep to invert the complete search. In the example, note that we use the ^ outside the brackets as before, denoting that the line starts with, and then the caret inside, denoting not n or N. So the line must start with a character other than n or N.

$ grep ^[^nN] test.txt
color
colour
coloured
colored
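The -i and -v alternatives mentioned above can be compared side by side. In this sketch the patterns are placed in single quotes, which keeps the shell from treating the square brackets as a file name pattern before grep sees them.

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'no color' 'color' 'colour' 'coloured' 'colored' > "$workdir/test.txt"

# Case-insensitive search for lines starting with n or N.
insensitive=$(grep -i '^n' "$workdir/test.txt")

# Negated set: the first character must exist and must not be n or N.
negated=$(grep '^[^nN]' "$workdir/test.txt")

# Inverted search: every line that does NOT start with n or N.
inverted=$(grep -v '^[nN]' "$workdir/test.txt")

echo "$insensitive"
echo "$inverted"

rm -r "$workdir"
```

For this file the negated set and the inverted search return the same four lines. They would differ on a file containing empty lines: -v would print them, whereas the negated set needs at least one character to match.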

Searching for Files Using Find

If you have not already found the command find (/usr/bin/find) then you will need to find it soon. We can use find in a similar way to ls. Used on its own, just the word find, it lists all files in the current directory and below. The behavior of find is to recurse automatically, listing subdirectory content. The output can be extensive, especially if run higher up in the file system, so we can run find with options that set criteria for the search. We can also control the recursion, limiting the depth of directories searched.

$ find -type d
.
./.ssh
./.gnupg
./.gnupg/private-keys-v1.d
./.cache
./.cache/libvirt
./.cache/libvirt/virsh
./.cache/virt-manager

The command above lists only directories (-type d) within the current directory and below.

$ find /var -maxdepth 1 -type d -perm /g+s
/var/local
/var/mail

The find syntax above will search the /var directory for directories; the -maxdepth option limits the search to 1 level down, that is /var itself and the entries directly inside it. The additional criterion tests the file permissions, here the special group permission. The two criteria are ANDed together; we could use -o to OR them instead. We are then looking for directories with the SGID bit set that are directly below the /var directory.
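Depth limits and the -o operator can be tried safely on a small tree of our own. The sketch below builds one in a scratch directory; the names a, b, file.txt and link are placeholders. It also uses -mindepth 1, which leaves the starting directory itself out of the results.

```shell
# Build a small tree: a directory, a nested directory, a file and a symlink.
workdir=$(mktemp -d)
mkdir -p "$workdir/a/b"
touch "$workdir/a/file.txt"
ln -s "$workdir/a/file.txt" "$workdir/link"

# Without a depth limit find recurses: workdir, a and a/b are all listed.
all_dirs=$(find "$workdir" -type d | wc -l)

# Limit the search to entries directly inside workdir.
shallow=$(find "$workdir" -mindepth 1 -maxdepth 1 -type d)

# OR two criteria together; the escaped parentheses group them.
either=$(find "$workdir" -mindepth 1 -maxdepth 1 \( -type d -o -type l \) | sort)

echo "directories at any depth: $all_dirs"
echo "$either"

rm -r "$workdir"
```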

The criteria that we can use in our searches include:

  • -type : file type being a value of:
    • f for regular files
    • l for symbolic links
    • d for directories
    • c for character devices
    • b for block devices
    • p for pipes
    • s for sockets
  • -perm : files with certain permissions
  • -atime : last accessed time
  • -mtime : last modified time
  • -size : file size
  • -inum : find files based on the inode number
  • And many more. The man page for find is very good with lots of examples
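Two of the criteria above, -size and -mtime, can be sketched against a pair of freshly created files. The file names small.txt and big.bin and their sizes are chosen purely for illustration.

```shell
# Create a 1 byte file and a 4096 byte file in a scratch directory.
workdir=$(mktemp -d)
printf 'x' > "$workdir/small.txt"
head -c 4096 /dev/zero > "$workdir/big.bin"

# -size +1k: sizes are rounded up to whole 1 KiB units,
# so this matches regular files larger than 1024 bytes.
large=$(find "$workdir" -type f -size +1k)

# -mtime -1: modified less than one day ago, true for both new files.
recent=$(find "$workdir" -type f -mtime -1 | sort)

echo "$large"
echo "$recent"

rm -r "$workdir"
```

A leading + means more than the given value and a leading - means less than it; the same convention applies to -atime.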

Find also has actions. The default action is -print, which prints to the screen. It is optional, so the following two commands are the same, displaying symbolic links from the /etc directory down:

$ find /etc -type l -print
$ find /etc -type l

Another simple action is -delete. You are not prompted before deletion; the files meeting the criteria are simply deleted:

$ find $HOME/Documents/ -type f -atime +365 -delete

The Documents directory in the current user's home directory is searched for files that have not been accessed in the last 365 days, and those files are deleted.
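Because -delete does not prompt, a cautious habit is to run the same criteria without an action first and read the list before deleting anything. The sketch below rehearses this in a scratch directory rather than in Documents; the file names are placeholders, and touch -a -t is used to give one file an artificially old access time.

```shell
workdir=$(mktemp -d)
touch "$workdir/recent.txt" "$workdir/stale.txt"

# Set only the access time (-a) of stale.txt to 1 January 2020.
touch -a -t 202001010000 "$workdir/stale.txt"

# Preview: with no action given, find just prints what matches.
preview=$(find "$workdir" -type f -atime +365)
echo "would delete: $preview"

# Same criteria, now with the -delete action.
find "$workdir" -type f -atime +365 -delete

# Only the recently accessed file is left.
remaining=$(ls "$workdir")
echo "left behind: $remaining"

rm -r "$workdir"
```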

Very powerfully, we can use -exec or -ok to run any command against the found files. The -exec action proceeds without any prompts, whereas the -ok action prompts for each file before running the command.

$ find $HOME/Documents/ -type f -atime +365 -ok rm {} \;
$ find $HOME/Documents/ -type f -atime +365 -exec rm {} \;

The two commands above are similar: the first will prompt and the second will not. Both run rm against the file name that replaces the placeholder {}. Each file located is placed in the braces awaiting its imminent expunging. The escaped semicolon, \;, marks the end of the command. Any command can be used in place of rm; this is just an example. The next example removes the execute permission from files only, not from directories or links:

$ find $HOME/Documents/ -type f -exec chmod -x {} \;
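The same chmod example can be rehearsed on throwaway files before it is pointed at real documents. In this sketch the directory scripts and the files a.sh and b.sh are invented for the purpose.

```shell
# Two executable files inside a subdirectory of a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/scripts"
touch "$workdir/scripts/a.sh" "$workdir/scripts/b.sh"
chmod +x "$workdir/scripts/a.sh" "$workdir/scripts/b.sh"

# One chmod per file: {} is replaced by each path and \; ends the command.
find "$workdir" -type f -exec chmod -x {} \;

# List regular files that still have the owner execute bit (octal 100).
still_executable=$(find "$workdir" -type f -perm -100)

# Directories were not matched by -type f, so scripts can still be entered.
if [ -x "$workdir/scripts" ]; then
    dir_state="searchable"
else
    dir_state="blocked"
fi

echo "files still executable: ${still_executable:-none}"
echo "scripts directory is $dir_state"

rm -r "$workdir"
```

Ending the command with + instead of \; hands many file names to a single run of the command, which is usually quicker when there are a lot of matches.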

Extracting Data Using Head or Tail

If we need to view the top of a file we can use head (/usr/bin/head) and should we need to view the end of a file we can use tail (/usr/bin/tail). The following command will list the first two lines of the file test.txt.

$ head -n 2 test.txt 
no color
color

Using the similar command tail we can display the last two lines:

$ tail -n 2 test.txt 
coloured
colored
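Since both commands read standard input, they can be joined with a pipe to extract lines from the middle of a file. The sketch below uses a scratch copy of test.txt; it also shows tail -n with a leading +, which counts from the top of the file instead of the bottom.

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'no color' 'color' 'colour' 'coloured' 'colored' > "$workdir/test.txt"

# Take the first three lines, then the last of those: line 3 on its own.
third_line=$(head -n 3 "$workdir/test.txt" | tail -n 1)

# tail -n +4 starts at line 4 and prints through to the end.
from_fourth=$(tail -n +4 "$workdir/test.txt")

echo "line 3: $third_line"
echo "$from_fourth"

rm -r "$workdir"
```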

When reading log files it is common to follow the end of the log. This uses the -f option to tail, which displays the last 10 lines of the log and then continues to print new lines as they are written. Use Ctrl+C to stop following the file.

$ tail -f /var/log/syslog
Nov 15 20:23:03 kvm dnsmasq-dhcp[12415]: DHCPACK(virbr0) 192.168.122.5 52:54:00:63:51:a2 proxy2
Nov 15 20:35:04 kvm dnsmasq-dhcp[12415]: DHCPREQUEST(virbr0) 192.168.122.4 52:54:00:63:51:a1
Nov 15 20:35:04 kvm dnsmasq-dhcp[12415]: DHCPACK(virbr0) 192.168.122.4 52:54:00:63:51:a1 proxy1
Nov 15 20:50:52 kvm dnsmasq-dhcp[12415]: DHCPREQUEST(virbr0) 192.168.122.5 52:54:00:63:51:a2
Nov 15 20:50:52 kvm dnsmasq-dhcp[12415]: DHCPACK(virbr0) 192.168.122.5 52:54:00:63:51:a2 proxy2
Nov 15 21:01:37 kvm dnsmasq-dhcp[12415]: DHCPREQUEST(virbr0) 192.168.122.4 52:54:00:63:51:a1
Nov 15 21:01:37 kvm dnsmasq-dhcp[12415]: DHCPACK(virbr0) 192.168.122.4 52:54:00:63:51:a1 proxy1
Nov 15 21:17:01 kvm CRON[11521]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Nov 15 21:18:52 kvm dnsmasq-dhcp[12415]: DHCPREQUEST(virbr0) 192.168.122.5 52:54:00:63:51:a2
Nov 15 21:18:52 kvm dnsmasq-dhcp[12415]: DHCPACK(virbr0) 192.168.122.5 52:54:00:63:51:a2 proxy2

Note: The log file used here is on Ubuntu; on other systems the more general log file is /var/log/messages.

Similarly we can use cat (/bin/cat) and tac (/usr/bin/tac): cat lists, or concatenates, files from top to bottom and tac from bottom to top. If your focus is on the bottom of the file use cat, as you will be left at the bottom of the file. If your focus is on the top use tac, as you will be left at the top of the file.
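A short sketch makes the reversal visible. It assumes tac is installed, as it is on typical Linux systems with GNU coreutils, and works on a scratch copy of test.txt.

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'no color' 'color' 'colour' 'coloured' 'colored' > "$workdir/test.txt"

# tac prints the file with the line order reversed.
reversed=$(tac "$workdir/test.txt")

# The first line of tac output is the last line of the file.
first_of_reversed=$(tac "$workdir/test.txt" | head -n 1)

echo "$reversed"

rm -r "$workdir"
```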

Counting Words, Lines or Characters Using wc

With the command wc we are at the heart of extracting and searching data from files at the Linux command line. The command wc stands for word count but it can count much more than that. We will step through some examples.

$ wc test.txt
$ wc -l test.txt
$ wc -w test.txt
$ wc -c test.txt

The first command counts lines, words and bytes. The following commands count just lines, then just words and then just bytes; for plain ASCII text a byte is a character, and wc -m counts characters where they differ. The output from wc -l is shown below; note that it shows both the line count and the file name.

$ wc -l test.txt 
5 test.txt
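When only the number is wanted, one option is to redirect the file into wc rather than naming it as an argument: wc then has no file name to print. The sketch below does this for each count on a scratch copy of test.txt.

```shell
# Recreate the sample file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'no color' 'color' 'colour' 'coloured' 'colored' > "$workdir/test.txt"

# Reading from standard input leaves wc with no file name to report.
lines=$(wc -l < "$workdir/test.txt")
words=$(wc -w < "$workdir/test.txt")
bytes=$(wc -c < "$workdir/test.txt")

echo "lines: $lines, words: $words, bytes: $bytes"

rm -r "$workdir"
```

The byte count includes the newline at the end of each of the five lines.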

Searching and Extracting Data Using Fields and the Command Cut

The command cut (/usr/bin/cut) can be useful where viewing every field in a file is not required. We may only want to see certain fields. We can also pipe the output of a command to cut. Suppose we only need the line count from wc, not the file name:

$ wc -l test.txt | cut -d' ' -f1
5

With cut we use the -d option to say that the output is space delimited and the -f option to display only the first field: