Wednesday 1 March 2017

Capturing and processing Java stack traces

I've put a few of my Java stack capture/analysis scripts on GitHub here.

capture.sh is pretty simple. There's some checking in there for disk space and cleanup, so we can schedule it in crontab without having to worry too much about filling a disk. Its task is to run jstack against the JBoss PID 10 times over 10 seconds. We schedule this in crontab to run hourly if we're trying to track down problems.
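The actual capture.sh is a shell script in the repo; as a rough sketch of what it does, something like the following Python covers the same ground. The filename layout, the one-second spacing, the free-space threshold, and how the PID and process start time are obtained are all assumptions here, not taken from the script.

import os
import subprocess
import time

def capture_stacks(pid, process_start_time, count=10, interval=1, min_free_bytes=1 << 30):
    # Skip this run if free disk space is low, mirroring the script's disk check
    # (the 1 GB threshold is an assumption).
    st = os.statvfs('.')
    if st.f_bavail * st.f_frsize < min_free_bytes:
        return
    for i in range(count):
        capture_time = time.strftime('%Y%m%d-%H%M%S')
        # Assumed filename layout: prefix, process start time, capture time and a
        # counter, separated by underscores so process.py can split them back out.
        outfile = 'stack_{0}_{1}_{2}'.format(process_start_time, capture_time, i)
        with open(outfile, 'w') as f:
            subprocess.call(['jstack', str(pid)], stdout=f)
        time.sleep(interval)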

process.py is where we consolidate and process those stack outputs into something we can feed into other tools. In this case we just used Excel to graph thread types over time, to identify pools that didn't grow, or that hit their limits, as incoming request volume increased.

I'll go through process.py here.
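The loop below assumes stackfiles already holds the list of capture filenames. A minimal way to build that list might look like this (the 'stack_*' pattern is an assumption about how capture.sh names its output):

import glob

# Collect every capture file written by capture.sh (filename pattern assumed).
stackfiles = sorted(glob.glob('stack_*'))
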
stackdata = []

# Loop through all the stack files that have been captured
for stackfile in stackfiles:
    # The filenames written by the capture script are built up from these values:
    process_start_time, stack_capture_time, capture_count = stackfile.split('_')[1:]
    
    # Open the stack file and get the contents
    with open(stackfile, 'r') as f:
        stackdump = f.readlines()
    
    # Reset our thread based variables, as we're processing line by line rather
    # than stack by stack.
    threadline, threadname, threadtype, threadstate = None, None, None, None
    
    for line in stackdump:
        # Each thread's header line starts with a double quote. Capture the bit between the quotes to get the thread name
        if line.startswith('"'):
            threadline = line
            threadname = line.split('"')[1]
            
            # Remove any numbers from the thread name to make it a generic type:
            threadtype = threadname.translate(None, '0123456789')
        
        # The thread's state is summarised on its header line, but it's also on the next line, which starts with this:
        if threadline and line.startswith('   java.lang.Thread.State:'):
            threadstate = line.split(':')[1].strip()
        
        # If the line is blank, it means we're finished processing one of the thread stacks
        if line == '\n' and threadline and threadname and threadtype and threadstate:
            # Append all the captured data to the data list, then reset all variables for next thread
            # in the stack.
            stackdata.append([process_start_time, stack_capture_time, capture_count, threadtype, threadname, threadstate])
            threadline, threadname, threadtype, threadstate = None, None, None, None

All we're doing here is looping through each file, collecting the thread name details and the running state, then saving those into a list along with some other variables (such as the capture time and the process start time). All of the data we're capturing can actually be found on the thread's first line; however, with this setup we could also capture extra information from the stack frames as we pass through them (for example, if a frame includes 'oracle' we can mark it as an Oracle-related thread). In my case I captured the top and bottom frames of the stack to help identify what a thread was for when it had a generic name.
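As a hedged sketch of that kind of extension, a helper like the one below could take the frame lines collected for a single thread and return a few extra columns to append to stackdata. The 'oracle' test and the column choices are illustrative, not what the repo script does; collecting the frame lines (the lines starting with 'at') inside the loop is left out for brevity.

def classify_thread(framelines):
    # framelines is the list of 'at ...' lines collected for one thread's stack.
    # Tag the thread by whatever its frames mention (illustrative rule only).
    marker = 'oracle' if any('oracle' in frame.lower() for frame in framelines) else ''
    # Keep the top and bottom frames to help identify generically named threads.
    top_frame = framelines[0].strip() if framelines else ''
    bottom_frame = framelines[-1].strip() if framelines else ''
    return marker, top_frame, bottom_frame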

The only smart thing we're doing is removing any digits from the name to make the thread name more generic, so we can group threads by type later. For example, 'cb-1-3', a Couchbase-related thread, becomes 'cb--', and we can then total the number of Couchbase threads being created.
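A quick sketch of that grouping step, and of dumping the totals to a CSV that Excel can graph (the column layout and output filename are just examples):

import csv
from collections import Counter

# Total threads per (capture time, thread type), e.g. how many 'cb--' threads
# existed in each capture. Column positions match the stackdata rows above.
totals = Counter((row[1], row[3]) for row in stackdata)

# One row per capture time / thread type, so Excel can pivot and graph it.
with open('thread_counts.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['stack_capture_time', 'threadtype', 'count'])
    for (capture_time, threadtype), count in sorted(totals.items()):
        writer.writerow([capture_time, threadtype, count])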

A better way to do this would be to use a multiline regex to capture the entire stack for each thread and process each one as a block, rather than working line by line.
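A hedged sketch of that approach, assuming the usual jstack layout where each thread block starts with a quoted name and ends at a blank line (threads without a java.lang.Thread.State line, such as the JVM's internal threads, simply won't match):

import re

THREAD_BLOCK = re.compile(
    r'^"(?P<name>[^"]+)"[^\n]*\n'                          # header line with the quoted name
    r'\s+java\.lang\.Thread\.State: (?P<state>[^\n]+)\n'   # state line
    r'(?P<frames>(?:[^\n]+\n)*)',                          # frame lines, up to the blank line
    re.MULTILINE)

with open(stackfiles[0], 'r') as f:
    threads = [(m.group('name'), m.group('state')) for m in THREAD_BLOCK.finditer(f.read())]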
