One-liner shell script for data mining from log files

In our radio station we use Icecast as a stream server. Like any other good linux software it generates log files that – after hitting predefined size – are compressed and a new log is started. This is a nice feature as it save space on you hard drive.

If you’re not familiar with linux shell you might think it will be hard to get some data out of all these log files. Maybe it’s not a simple drag’n’drop operation but once you understand how it works everything will look like basic shell scripting.

Here we go… Let’s start with printing output from the last log file, which is /var/log/icecast2/error.log. Open you terminal. There is a cat function that will make this work:

cat /var/log/icecast2/error.log

You should see a long output from the log file. Let’s do some data mining now. We’re interested in the number of times selected artist have been played (in our case “Jethro Tull”). Our one-liner is now:

cat /var/log/icecast2/error.log | grep ‘Jethro Tull’

This will produce output like this:

[2011-10-09  10:34:55] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Kissing Willie"
[2011-10-09  12:24:54] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Acres Wild"
[2011-10-09  17:32:39] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - SeaLion"
[2011-10-09  19:52:59] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Locomotive Breath"
[2011-10-10  17:07:04] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Heavy Horses"
[2011-10-10  19:20:36] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Nothing Is Easy"
[2011-10-10  19:28:09] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Bouree"
[2011-10-10  22:09:49] INFO admin/command_metadata Metadata on mountpoint /stream changed to "Jethro Tull - Taxi Grab"

The vertical line is called a pipeline. It’s function is to redirect output from one command to another. In our case it will redirect the cat function output (whole file content) to a grep command, which is responsible for printing only the lines we are interested in.

You might have noticed that there are two streams available in this Icecast config ices and stream. The first one is a Ices stream that will pick random mp3 file from the drive and play it to Icecast server. When a DJ is streaming it goes to /stream stream, but /ices is a fallback server – once the DJ disconnects all clients are redirected back to /ices stream. This means that not all grep results were really played, they might be kept in the fallback stream in case the DJ disconnect. To see “more” we will check for DJ dc-s:

cat /var/log/icecast2/error.log | egrep ‘Jethro Tull|source_move_client’

Two things happened here. First – grep changed to egrep. Egrep is a more powerful grep command that allows basic logics. Second – we used another pipeline in egrep to make a logical OR. This command will look for lines with ‘Jethro Tull’ OR ‘source_move_client’. The second option we look for is a signal from Icecast that DJ connected or disconnected and clients were moved from one source to another. Here’s the output:

[2011-10-09  10:34:55] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Kissing Willie"
[2011-10-09  12:24:54] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Acres Wild"
[2011-10-09  12:59:30] INFO source/source_move_clients passing 4 listeners to "/stream"
[2011-10-09  12:59:47] INFO source/source_move_clients passing 4 listeners to "/ices"
[2011-10-09  13:03:18] INFO source/source_move_clients passing 5 listeners to "/stream"
[2011-10-09  16:05:04] INFO source/source_move_clients passing 5 listeners to "/ices"
[2011-10-09  17:00:07] INFO source/source_move_clients passing 0 listeners to "/stream"
[2011-10-09  17:32:39] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - SeaLion"
[2011-10-09  18:23:59] INFO source/source_move_clients passing 0 listeners to "/ices"
[2011-10-09  18:30:24] INFO source/source_move_clients passing 0 listeners to "/stream"
[2011-10-09  19:52:59] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Locomotive Breath"
[2011-10-09  21:30:51] INFO source/source_move_clients passing 0 listeners to "/ices"
[2011-10-09  21:34:53] INFO source/source_move_clients passing 0 listeners to "/stream"
[2011-10-10  00:05:08] INFO source/source_move_clients passing 1 listeners to "/ices"
[2011-10-10  17:07:04] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Heavy Horses"
[2011-10-10  19:20:36] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Nothing Is Easy"
[2011-10-10  19:28:09] INFO admin/command_metadata Metadata on mountpoint /ices changed to "Jethro Tull - Bouree"
[2011-10-10  20:00:01] INFO source/source_move_clients passing 6 listeners to "/stream"
[2011-10-10  22:09:49] INFO admin/command_metadata Metadata on mountpoint /stream changed to "Jethro Tull - Taxi Grab"
[2011-10-11  00:03:49] INFO source/source_move_clients passing 4 listeners to "/ices"

OK, now we know if mp3 was “on air” or just a fallback. At 17:00 we switched from ices to stream. Half an hour later Jethro Tull was played on /ices so it was not on air. As for now you have to analyze it this way.

Now, let’s get back to the compressed log files. One we found in the Icecast2 log directory was error.log.1.gz. If you will read help page for gzip command you’ll notice two options: d and c. First will tell gzip to decompress a file, second will output decompressed content to console.

gzip -d -c /var/log/icecast2/error.log.1.gz | egrep ‘Jethro Tull|source_move_clients’

We can improve it easily with one char, asterisk. It’s a wildcard that will be used to combine our gziped log files :

gzip -d -c /var/log/icecast2/error.log.*.gz | egrep ‘Jethro Tull|source_move_clients’

This will produce what we wanted to achieve for the archived log files. Let’s just combine outputs from the two commands and then “egrep it”:

(gzip -d -c /var/log/icecast2/error.log.*.gz;cat /var/log/icecast2/error.log) | egrep ‘Jethro Tull|source_move_clients’

And that’s all. All you have to do now is to analyze it by yourself… as it’s not really in chronological order – as soon as I will fix it I will make a note here.

As for now this kind of command will be used to export all this data to a database where it will be easier to mine.

Leave a comment

Your email address will not be published. Required fields are marked *