Wednesday, 1 March 2017

Calculate the duration of tasks from two log file lines in python

This is a fairly common pattern in my work. This is a made up example but it's used as a template each time I need to perform a task like this.

I have a set of log files that look like this:
2017-03-01|01:00:00.000|Start|75b5f25e-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:00.500|End|75b5f4f2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:00.550|End|75b5f25e-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:49.000|Start|75b5fdb2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:12.200|Start|75b60398-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:13.800|End|75b60398-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:30.000|End|75b5fdb2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:02:00.000|Start|75b60528-fe7e-11e6-bc64-92361f002671
2017-03-01|01:02:05.123|End|75b60528-fe7e-11e6-bc64-92361f002671
And I need to calculate the duration of each UUID task within the file. For this we'll turn to python.

The full code is here. In this space we'll concentrate on how to perform the above task.

We'll assume we can open the file and read line by line. I've put all the line processing in to a separate function called process_line.

We can see these things about the file:
  1. Fields are separated by |
  2. We know a line that has the start time of the transaction contains 'Start'
  3. We know a  line that has the end time of the transaction contains 'End'
  4. The UUID is the link between start and end times
And here's what I came up with:

from collections import defaultdict
from datetime import datetime
import re

start_string = 'Start'
end_string = 'End'
uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
uuid_times = defaultdict()

def process_line(line):
    linelist = line.split('|')
    time = linelist[1].strip()
    uuid = re.search(uuid_pattern, line)
    
    if uuid:
        identifer = '{uuid}'.format(uuid=uuid.group(0))
    
        if start_string in line:
            uuid_times[identifer] = {'start': time, 'end': None, 'duration': None}
            return

        if end_string in line:
            if identifer not in uuid_times.keys():
                # This uuid has not seen a start time, probably in 
                # a previous log file we don't have.
                return
            
            uuid_times[identifer]['end'] = time
            
            duration = datetime.strptime(uuid_times[identifer]['end'], '%H:%M:%S.%f') \
                - datetime.strptime(uuid_times[identifer]['start'], '%H:%M:%S.%f')
            uuid_times[identifer]['duration'] = duration

The function works like this:
  1. Take in a line and split it in to fields
  2. Record the time of the line
  3. Match a uuid in the line - you could just use linelist[3] instead of regex here, but this is a made up example.
  4. If the line contains 'Start', record the time in a dictionary to refer to later, then process the next line
  5. If the line contains 'End', record the end time of the line. - If we've never seen the uuid then continue to next.
  6. Now that we have a start and end time for the uuid, calculate the duration between the two recorded times and store it in the dictionary.

And to verify it works:
>>> for uuid in uuid_times:
...     print uuid, uuid_times[uuid]['duration']
...
75b5fdb2-fe7e-11e6-bc64-92361f002671 0:00:41
75b60528-fe7e-11e6-bc64-92361f002671 0:00:05.123000
75b60398-fe7e-11e6-bc64-92361f002671 0:00:01.600000
75b5f25e-fe7e-11e6-bc64-92361f002671 0:00:00.550000

No comments:

Post a Comment