Thursday 2 March 2017

Using Python and regex to iterate over stacks in jstack output

I went away thinking about a multiline regex to capture the Java stack output from yesterday's post, and it was bugging me slightly that I didn't do it that way in the first place.

So this morning I quickly wrote up a regex version of the stack capture, which iterates over each stack rather than each line in the file:
import re

stackdata = []

# Loop through all the stack files that have been captured
for stackfile in stackfiles:
    # The filenames captured by the script are built up of these values:
    process_start_time, stack_capture_time, capture_count = stackfile.split('_')[1:]
    
    # Open the stack file and get the contents
    with open(stackfile, 'r') as f:
        stackdump = f.read()
    
    # Iterate over each stack found in the file.
    # The regular expression matches from any line beginning with " up to the first
    # blank line (two newlines in a row), capturing the text between the quotes as a
    # group. Both * quantifiers are made non-greedy with ?.
    for stack in re.finditer(r'^"(.*?)".*?\n\n', stackdump, re.DOTALL|re.MULTILINE):
        fullstack = stack.group(0)
        threadname = stack.group(1)
        
        # Do the same genericising of the thread name by stripping the digits.
        # str.translate(None, ...) only works on Python 2; re.sub works on both 2 and 3.
        threadtype = re.sub(r'[0-9]', '', threadname)
        
        # Find the thread state line, but if it doesn't exist, leave it as None
        threadstate_match = re.search(r'java\.lang\.Thread\.State: (.*)', fullstack)
        threadstate = threadstate_match.group(1) if threadstate_match else None
            
        stackdata.append([process_start_time, stack_capture_time, capture_count, threadtype, threadname, threadstate])


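The regex is pretty simple, as explained in the comments above. The only things to be careful with are DOTALL, so that . also matches newlines, MULTILINE, so that ^ matches at the start of every line, and the ? after each *, which makes it match the shortest possible text rather than the longest.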
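To see why those flags and the ? matter, here is a small self-contained sketch. The sample_dump text is a made-up, heavily shortened stand-in for real jstack output (the thread names and frames are assumptions for illustration only), but the pattern is the same one used above.

```python
import re

# Hypothetical, heavily shortened jstack output: two stacks, each ended by a blank line.
sample_dump = (
    '"main" #1 prio=5 os_prio=0 tid=0x01 nid=0x1 runnable [0x02]\n'
    '   java.lang.Thread.State: RUNNABLE\n'
    '\tat com.example.App.main(App.java:10)\n'
    '\n'
    '"worker-1" #12 daemon prio=5 os_prio=0 tid=0x03 nid=0x2 waiting on condition [0x04]\n'
    '   java.lang.Thread.State: WAITING (parking)\n'
    '\tat sun.misc.Unsafe.park(Native Method)\n'
    '\n'
)

flags = re.DOTALL | re.MULTILINE

# Non-greedy version (the one used in the post): one match per stack.
non_greedy_names = [m.group(1)
                    for m in re.finditer(r'^"(.*?)".*?\n\n', sample_dump, flags)]

# Greedy version (no ?): the first * swallows everything up to the last quote,
# so the whole dump comes back as a single match.
greedy_matches = list(re.finditer(r'^"(.*)".*\n\n', sample_dump, flags))

# Without DOTALL, . cannot cross a newline, so no stack is matched at all.
no_dotall_matches = list(re.finditer(r'^"(.*?)".*?\n\n', sample_dump, re.MULTILINE))

print(non_greedy_names)        # ['main', 'worker-1']
print(len(greedy_matches))     # 1
print(len(no_dotall_matches))  # 0
```

One thing this also shows: the last stack only matches if the file ends with a blank line, so a dump that is cut off mid-stack would silently lose its final thread.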

With this version I can then scan the full stack to make some kind of guess about what each thread is for. For example, if the keyword 'oracle' appears anywhere in a stack, I could attribute that thread to Oracle even if it's not named as such.
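A minimal sketch of that idea might look like the following. The function name guess_thread_type and its keywords parameter are made up for illustration, and the stack frames in the usage lines are only example text, not captured output.

```python
import re

def guess_thread_type(threadname, fullstack, keywords=('oracle',)):
    """Guess a thread type from its name, or from a keyword found in its stack.

    Hypothetical helper: if a keyword appears in the stack text but not in the
    thread name, the thread is attributed to that keyword. Otherwise the name
    with its digits stripped is used, as in the main loop.
    """
    lowered_name = threadname.lower()
    lowered_stack = fullstack.lower()
    for keyword in keywords:
        # Only re-attribute when the name doesn't already mention the keyword
        if keyword not in lowered_name and keyword in lowered_stack:
            return keyword
    # Fall back to the generic name with digits removed
    return re.sub(r'[0-9]', '', threadname)

# Example usage with made-up stack text
oracle_stack = ('"pool-3-thread-7" #41 prio=5\n'
                '\tat oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java)\n\n')
plain_stack = '"pool-3-thread-8" #42 prio=5\n\tat java.lang.Object.wait(Native Method)\n\n'

type_from_stack = guess_thread_type('pool-3-thread-7', oracle_stack)
type_from_name = guess_thread_type('pool-3-thread-8', plain_stack)
```

In the main loop this would replace the threadtype line, with fullstack passed in alongside threadname. A plain substring test is crude, so a keyword like 'oracle' will also match unrelated package or class names that happen to contain it.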

At a later date I'll use both versions and collect stats to see which is fastest.
