Compressing Tar Archives
In the previous blog we took a look at how we can extract tar archives. Having covered the basics, we can start extending tar by compressing tar archives. Whilst we could easily compress a tar archive after creating it, if we know that we want the archive compressed we can handle the compression and decompression within the tar utility itself. Most commonly we use gzip compression, but other common options include the bzip2 and xz compression routines.
First, we will check the time taken to create an uncompressed archive and the resulting file size. Adding the command time as a prefix shows the time taken:
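To make the difference between compressing after creation and compressing within tar concrete, here is a minimal sketch. It assumes GNU tar and gzip are installed; the docs directory and the archive names are invented for the example and are not part of the measurements in this post.

```shell
# Sketch only: the directory and archive names below are made up.
set -e
workdir=$(mktemp -d)
mkdir "$workdir/docs"
printf 'hello tar\n' > "$workdir/docs/readme.txt"

# Two steps: create the archive, then compress it.
# gzip replaces two-step.tar with two-step.tar.gz.
tar --create --file="$workdir/two-step.tar" --directory="$workdir" docs
gzip "$workdir/two-step.tar"

# One step: ask tar to run the gzip compression for us.
tar --create --gzip --file="$workdir/one-step.tar.gz" --directory="$workdir" docs

# Both archives should list the same members.
two_step_list=$(tar --list --file="$workdir/two-step.tar.gz")
one_step_list=$(tar --list --file="$workdir/one-step.tar.gz")
printf '%s\n' "$one_step_list"

rm -rf "$workdir"
```

Either route ends with a gzip compressed tar archive; the one-step form simply saves us a command and an intermediate file.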
$ time tar --create --file=my.tar /usr/share/doc
tar: Removing leading `/' from member names
tar --create --file=my.tar /usr/share/doc 0.03s user 0.08s system 96% cpu 0.112 total
$ ls -lh my.tar
-rw-rw-r-- 1 ubuntu ubuntu 17M Mar 26 16:33 my.tar
We can see that, without compression, the archive took 0.112 seconds in total to create and produced a file of 17M. Adding compression will increase the CPU time but produce a smaller file. The balance is to achieve a smaller file size without affecting the CPU time too much.
Our first look at compressing tar archives will use gzip. Having created the baseline measurements without compression, we will now use gzip to compress the archive. There is no requirement to prefix the command with time when compressing archives; we use it purely so we can see the CPU time used.
$ time tar --create --gzip --file=my.tar.gz /usr/share/doc
tar: Removing leading `/' from member names
tar --create --gzip --file=my.tar.gz /usr/share/doc 1.33s user 0.06s system 99% cpu 1.403 total
$ ls -lh my.tar.gz
-rw-rw-r-- 1 ubuntu ubuntu 9.1M Mar 26 16:44 my.tar.gz
We can use --gzip or the short form -z. The extension of the file should be .tgz or .tar.gz. The extension helps us recognize the compression used in creating the archive. Older versions of tar required us to use the same compression switch when listing or extracting the archive. Newer versions can detect the compression used.
We see that the cost of the compression in CPU time was just over a second, taking us to a total of 1.403 seconds; however, the file size is reduced from 17M to 9.1M.
Better compression can be gained with bzip2, but is the smaller file size worth the extra CPU time?
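As a quick illustration of the long and short forms side by side, the sketch below builds the same small archive both ways. It assumes GNU tar; the docs directory and the file names long.tar.gz and short.tgz are hypothetical.

```shell
# Sketch only: file and directory names are invented for the example.
set -e
workdir=$(mktemp -d)
mkdir "$workdir/docs"
printf 'sample\n' > "$workdir/docs/notes.txt"

# Long form, as used throughout this post.
tar --create --gzip --file="$workdir/long.tar.gz" --directory="$workdir" docs

# Short form: -c (create), -z (gzip), -f (file), -C (directory).
tar -czf "$workdir/short.tgz" -C "$workdir" docs

# The member lists should be identical whichever form we used.
long_members=$(tar --list --file="$workdir/long.tar.gz")
short_members=$(tar -tf "$workdir/short.tgz")
printf '%s\n' "$short_members"

rm -rf "$workdir"
```

The long options are easier to read in scripts and blog posts; the short options are what we tend to type at the prompt.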
$ time tar --create --bzip2 --file=my.tar.bzip2 /usr/share/doc
tar: Removing leading `/' from member names
tar --create --bzip2 --file=my.tar.bzip2 /usr/share/doc 8.38s user 0.14s system 99% cpu 8.594 total
$ ls -lh my.tar.bzip2
-rw-rw-r-- 1 ubuntu ubuntu 8.5M Mar 26 16:52 my.tar.bzip2
The switch used here is --bzip2; the short form is -j. The resultant file size is 8.5M, but the time was much longer at 8.594 seconds.
Finally we will look at XZ compression:
$ time tar --create --xz --file=my.tar.xz /usr/share/doc
tar: Removing leading `/' from member names
tar --create --xz --file=my.tar.xz /usr/share/doc 19.01s user 0.58s system 99% cpu 19.767 total
$ ls -lh my.tar.xz
-rw-rw-r-- 1 ubuntu ubuntu 7.6M Mar 26 16:58 my.tar.xz
To create this compressed archive we used the --xz switch, which has the short form -J. We get a file size of 7.6M but a massive time of nearly 20 seconds. We can start to understand now the balancing act that we need to manage when compressing our archives, and why gzip is so popular as a good balance between performance and file size.
The following table summarizes the results gained in compressing tar archives:
Compression      Total time   File size
No compression   0.112s       17M
GZIP             1.403s       9.1M
BZIP2            8.594s       8.5M
XZ               19.767s      7.6M
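If we want to repeat this kind of comparison on our own data, a loop along the following lines is one possible sketch. It assumes GNU tar, generates its own throwaway sample data rather than using /usr/share/doc, and skips any compressor that is not installed. It reports sizes only, so its numbers will not match the table above; prefixing the tar command with time would add the timings.

```shell
# Sketch only: sample data and file names are invented for the example.
set -e
workdir=$(mktemp -d)
mkdir "$workdir/data"

# Repetitive text compresses well, which makes the size difference easy to see.
i=0
while [ "$i" -lt 2000 ]; do
    printf 'line %s of some repetitive sample text\n' "$i"
    i=$((i + 1))
done > "$workdir/data/sample.txt"

# Baseline: no compression.
tar --create --file="$workdir/plain.tar" --directory="$workdir" data
plain_size=$(wc -c < "$workdir/plain.tar" | tr -d ' ')
echo "none: $plain_size bytes"

# Each entry is tar-switch:file-extension.
for entry in gzip:gz bzip2:bz2 xz:xz; do
    method=${entry%%:*}
    ext=${entry##*:}
    # Skip compressors that are not installed on this machine.
    command -v "$method" >/dev/null 2>&1 || continue
    tar --create "--$method" --file="$workdir/out.tar.$ext" --directory="$workdir" data
    size=$(wc -c < "$workdir/out.tar.$ext" | tr -d ' ')
    echo "$method: $size bytes"
    if [ "$method" = gzip ]; then
        gzip_size=$size
    fi
done

rm -rf "$workdir"
```

How much each method saves depends heavily on the data, so it is worth measuring on a representative sample before settling on a compressor.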
If the correct extension has not been used, we can use the file command to test the type of archive. Recall that later versions of tar can detect the compression used; with the file command, so can we:
$ file my.tar my.tar.gz my.tar.bzip2 my.tar.xz
my.tar: POSIX tar archive (GNU)
my.tar.gz: gzip compressed data, last modified: Mon Mar 26 16:44:35 2018, from Unix
my.tar.bzip2: bzip2 compressed data, block size = 900k
my.tar.xz: XZ compressed data
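Detection of this kind relies on the content of the file rather than its name. As a rough sketch of the idea, and not a description of exactly how file or tar implement it, we can look at the first two bytes of a gzip compressed archive, which are always 1f 8b in hex. The archive name mystery.tar is deliberately misleading and invented for the example; GNU tar is assumed.

```shell
# Sketch only: mystery.tar is a made-up name with a deliberately wrong extension.
set -e
workdir=$(mktemp -d)
printf 'sample\n' > "$workdir/notes.txt"

# gzip compressed, but named as if it were a plain tar archive.
tar --create --gzip --file="$workdir/mystery.tar" --directory="$workdir" notes.txt

# Print the first two bytes as hex; gzip streams begin with 1f 8b.
magic=$(head -c 2 "$workdir/mystery.tar" | od -An -tx1 | tr -d ' \n')
echo "first bytes: $magic"

# If the file utility is installed, it reports the type in words.
if command -v file >/dev/null 2>&1; then
    file "$workdir/mystery.tar"
fi

rm -rf "$workdir"
```

The extension is only a hint for us as readers; the bytes at the start of the file are what identify the compression.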
Although we can use the compression switch when listing an archive, it is not required in the latest versions of tar. Either of the following commands will work on a gzip compressed archive:
$ tar --list --file=my.tar.gz
$ tar --list --gzip --file=my.tar.gz
I would argue that the first example is the more correct, as it works no matter what compression was used. Where compression is not explicitly set, it is automatically detected. Explicitly setting the compression program supersedes any detection and MUST match the compression used in creation. For example, explicitly setting --bzip2 will cause an error on this file:
$ tar --list --bzip2 --file=my.tar.gz
bzip2: (stdin) is not a bzip2 file.
tar: Child returned status 2
tar: Error is not recoverable: exiting now
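In a script we would normally check the exit status rather than read the error text. The sketch below assumes GNU tar, where forcing the wrong compression switch makes the listing fail; other tar implementations may behave differently. The archive is a throwaway one created just for the example.

```shell
# Sketch only: assumes GNU tar; the archive is created just for this example.
workdir=$(mktemp -d)
printf 'sample\n' > "$workdir/notes.txt"
tar --create --gzip --file="$workdir/my.tar.gz" --directory="$workdir" notes.txt

# No compression switch: tar detects gzip for itself.
if tar --list --file="$workdir/my.tar.gz" >/dev/null 2>&1; then
    auto=ok
else
    auto=failed
fi

# Forcing the wrong switch overrides detection, so the listing should fail.
if tar --list --bzip2 --file="$workdir/my.tar.gz" >/dev/null 2>&1; then
    forced=ok
else
    forced=failed
fi

echo "auto-detect: $auto, forced --bzip2: $forced"
rm -rf "$workdir"
```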
The same applies to extracting archives: it is best to allow tar to detect the compression used.
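For extraction, a minimal sketch looks like the following. It assumes GNU tar; the src and dest directories and the backup.tar.gz name are hypothetical. Note that no compression switch is given on the extract command.

```shell
# Sketch only: directory and archive names are invented for the example.
set -e
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/dest"
printf 'restored\n' > "$workdir/src/hello.txt"

# Create a gzip compressed archive of the src directory.
tar --create --gzip --file="$workdir/backup.tar.gz" --directory="$workdir" src

# Extract without naming the compression; tar works it out from the archive.
tar --extract --file="$workdir/backup.tar.gz" --directory="$workdir/dest"

restored=$(cat "$workdir/dest/src/hello.txt")
echo "$restored"

rm -rf "$workdir"
```

The same extract command would work unchanged if the archive had been created with --bzip2 or --xz, provided the matching compression program is installed.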