Wednesday 1 March 2017

Process log files inside a gzipped tar file within Python

So I have a task: collect the information from a set of lines in JBoss server.log files.

My first step is to collect the log files from the server. It's a production server, so I don't really want to process log files there, as I may inadvertently impact the running application. Instead I'll tar them up and scp them off the host:

tar cfz /var/tmp/server.log.tgz /apps/jbooss/aslogs/server.log.201[67]*

Now I have a tgz file filled with the log files I want to process on my local machine. There are several ways to attack this. I could just untar the files and process them individually, but I'd be using up disk space that I might forget to clean up, and I don't really want to untar the file if I don't need to. Processing time isn't a concern for me.

My solution is to use the tarfile module within Python and write a generator function that yields each log file line I want, so I can process the lines one at a time.

Here's what I came up with:

import tarfile

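def tar_read_log_lines(input_tar, logfile_name):
    # 'r:*' lets tarfile detect the compression (gzip here) on its own
    with tarfile.open(input_tar, 'r:*') as tar:
        for member in tar.getmembers():
            # only read regular files whose name contains the match string
            if member.isfile() and logfile_name in member.name:
                memberfile = tar.extractfile(member)
                for line in memberfile:
                    yield [member.name, line]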

This function takes a compressed tar file and a string to match against the filenames, then yields each matching line together with the name of the file it came from, as a two-item list.
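To try the generator without a production archive, here is a minimal sketch that builds a throwaway tgz in a temporary directory and runs the function over it. The helper make_sample_tgz, the sample file names and their contents are all made up for illustration, and the substring match on the member name is my assumption about how the filename match works.

```python
import os
import tarfile
import tempfile
from collections import Counter


def tar_read_log_lines(input_tar, logfile_name):
    """Yield [member name, line] for each line of every matching file."""
    with tarfile.open(input_tar, 'r:*') as tar:
        for member in tar.getmembers():
            # Substring match on the member name is an assumption here.
            if member.isfile() and logfile_name in member.name:
                memberfile = tar.extractfile(member)
                for line in memberfile:
                    yield [member.name, line]


def make_sample_tgz(directory):
    """Hypothetical helper: pack two tiny fake logs and one unrelated file."""
    samples = {
        'server.log.2017-01-20': 'INFO start\nERROR boom\n',
        'server.log.2017-01-21': 'INFO start\n',
        'other.txt': 'not a log\n',
    }
    tgz_path = os.path.join(directory, 'sample.tgz')
    with tarfile.open(tgz_path, 'w:gz') as tar:
        for name, text in samples.items():
            path = os.path.join(directory, name)
            with open(path, 'w') as handle:
                handle.write(text)
            # arcname stores the bare file name inside the archive
            tar.add(path, arcname=name)
    return tgz_path


with tempfile.TemporaryDirectory() as workdir:
    sample_tgz = make_sample_tgz(workdir)
    # Count lines per file; other.txt is skipped by the name match
    sample_counts = Counter(
        name for name, _ in tar_read_log_lines(sample_tgz, 'server.log')
    )
```

Nothing is ever extracted to disk on the reading side; the only files written are the fake logs that go into the sample archive.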

Here's how we make use of that in a simple line count:

>>> from collections import defaultdict
>>> linecounts = defaultdict(int)
>>> for linedata in tar_read_log_lines(r'server.log.tgz', 'server.log'):
...     linecounts[linedata[0]] += 1
>>> for c in linecounts:
...     print(c, linecounts[c])
server.log.2017-02-19 72045
server.log.2017-02-18 86586
server.log.2017-01-21 20864
server.log.2017-01-20 30641

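This is good. It's slow to start, presumably because tar.getmembers() has to read through the whole compressed archive before the first line comes back, but not slow enough to make me investigate other methods. Now I can take that and process the log lines that I want without having to untar the file.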
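As a rough sketch of that next step, here is one way to pick out only the lines of interest and count them per file. The function name count_matching_lines, the 'ERROR' marker and the sample lines are all hypothetical; the UTF-8 decoding is also an assumption about the log encoding. On Python 3 the lines coming out of extractfile() are bytes, so they get decoded before matching.

```python
from collections import defaultdict


def count_matching_lines(line_pairs, needle):
    """Count, per file name, the lines that contain `needle`.

    `line_pairs` is any iterable of [file name, line] pairs, such as the
    output of tar_read_log_lines().
    """
    counts = defaultdict(int)
    for name, line in line_pairs:
        if isinstance(line, bytes):
            # UTF-8 is an assumption; undecodable bytes are replaced
            line = line.decode('utf-8', errors='replace')
        if needle in line:
            counts[name] += 1
    return dict(counts)


# Stand-in data so the sketch runs on its own; in practice pass
# tar_read_log_lines('server.log.tgz', 'server.log') instead.
sample_pairs = [
    ['server.log.2017-01-20', b'10:00:01 INFO  started\n'],
    ['server.log.2017-01-20', b'10:00:02 ERROR something failed\n'],
    ['server.log.2017-01-21', b'09:30:00 ERROR something else failed\n'],
]
error_counts = count_matching_lines(sample_pairs, 'ERROR')
```

Because the function only needs an iterable of pairs, the generator can be fed straight in and the archive is still read one line at a time.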

