Sunday 26 March 2017

Using Feature Importance to identify metrics of concern in an application

This is less of a snippet, more of a story.

I've recently been involved in trying to diagnose an intermittent performance problem in a production system for which many performance metrics were being collected. However, with the volume of data recorded for each transaction, there was no clear indicator of what was causing the issues. Every time a batch of transactions hit a threshold an incident would be raised, investigations would follow, and they would end with no clear answer on the cause. This kept repeating over and over, each time looking at a different component that had historical issues.

In the background I'd been reading about TensorFlow and seeing what machine learning could do. There's a lot of interest at the moment as it's just hit a milestone release, but nothing immediately applied to any problem I was having; it was more of an interest topic. What did come out of thinking about how machine learning could apply to the problem above was a better framing of my question: how do I identify which metrics correlate the most with a transaction taking longer than the defined SLA?

Google wasn't so helpful with my new question, but it did lead me to discover more about how to select the data you train your machine learning models with. In machine learning it's important to identify which features give the best information about what you're classifying, so that you don't waste time training your model on unnecessary data; out of that comes a topic called feature importance.

After reading about this and experimenting with the scikit-learn example for a few hours, I was able to set up an IPython notebook, load our data in, and produce a graph which clearly identified which of the components was causing issues. It was a component we hadn't looked at before, and the output clearly showed that any increase in the timings of that component had a massive correlation with the total transaction time exceeding the SLA. I had loaded in a lot of other metrics along with this, such as the time of day and the datacenter the processing took place in, but the clearest measure by a huge margin was this one often-ignored component.
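I can't reproduce the production data or the exact notebook here, but the core of the approach looks roughly like the sketch below. Everything in it is invented for illustration - the metric names, the SLA threshold and the synthetic data - and a random forest classifier is just one convenient model that exposes feature_importances_:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up transaction metrics: per-component timings plus some context fields.
rng = np.random.RandomState(0)
n = 5000
metrics = pd.DataFrame({
    'component_a_ms': rng.gamma(2.0, 20.0, n),
    'component_b_ms': rng.gamma(2.0, 20.0, n),
    'component_c_ms': rng.gamma(2.0, 60.0, n),
    'hour_of_day': rng.randint(0, 24, n),
    'datacenter': rng.randint(0, 2, n),
})
total_ms = metrics[['component_a_ms', 'component_b_ms', 'component_c_ms']].sum(axis=1)
breached_sla = (total_ms > 250).astype(int)  # 1 = transaction exceeded the (made-up) SLA

# Train a classifier to predict SLA breaches, then rank the inputs by importance.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(metrics, breached_sla)

for name, importance in sorted(zip(metrics.columns, clf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print('{0:20s} {1:.3f}'.format(name, importance))

In the real notebook the importances went into a plot rather than a printout, which is what made the problem component stand out so clearly.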

What this allowed me to do was help redirect the investigations to this component so we could find out why it impacted total transaction time so much. Eventually the investigations may have ended up looking at this area anyway, but with this technique I was able to sift through the massive volume of recorded metrics and shortcut those investigations.

I've put together an IPython notebook to show how feature importance can be applied to identifying an application performance problem here, with a lot of additional information included in the descriptions. It should be very easy to pick this up and apply it to some other problem where you need to find what exactly correlates with an expected output.

Thursday 2 March 2017

Quick Zabbix API client from Python

Here's a problem I had: I'd like to collect the data used to make the graphs in Zabbix so that I can analyse that data further.

There are details on Python libraries here, but it looks simple enough from the API reference, so I'll give it a go myself.

I've identified that I eventually want to call the history.get function, but it looks like it needs an itemid as input and I only have a graphid for now. Some more searching brings up the graphitem.get function, which will take a graphid and return the itemids used to build it.

So I need to:
  1. Authenticate, to get a token
  2. Get itemids from a graphitem.get call, using the graphid as input
  3. Get the history for each itemid returned.
First step, some authentication; I'll use getpass and raw_input for this:
from getpass import getpass

username = raw_input('Username: ')
password = getpass('Password: ')

Now I have a username and password stored that I can use later - note that the password is held in cleartext in the Python process.

Next up I need to get my token:
import json
import requests
import warnings

# Site-specific values - the URL below is a placeholder for your own Zabbix frontend.
zabbix_api_url = 'https://zabbix.example.com/api_jsonrpc.php'
zabbix_json_rpc_header = {'Content-Type': 'application/json-rpc'}

def zabbix_request(data):
    warnings.simplefilter('ignore')
    response = requests.post(zabbix_api_url, data=data, headers=zabbix_json_rpc_header, verify=False)
    return response.json()['result']

def zabbix_authenticate(username, password):
    auth_request = {
        u'params': {
            u'password': password, 
            u'user': username
        }, 
        u'jsonrpc': u'2.0', 
        u'method': u'user.login',
        u'auth': None, 
        u'id': 1
    }
    token = zabbix_request(data=json.dumps(auth_request))
    return token

token = zabbix_authenticate(username, password)
The warnings module was used here to suppress the SSL certificate warning from requests, as I've turned certificate verification off. It's good that the warning is there, but if I was giving this to someone else I'd remove the suppression so they understand they're connecting to a potentially insecure service.
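For completeness, here's the kind of variant I'd hand over instead - certificate verification stays on, and the CA bundle path below is only a placeholder:

# Keep certificate verification on; point requests at whatever CA bundle
# signs your Zabbix server's certificate (the path here is a placeholder).
def zabbix_request_verified(data, ca_bundle='/path/to/zabbix-ca.pem'):
    response = requests.post(zabbix_api_url, data=data,
                             headers=zabbix_json_rpc_header, verify=ca_bundle)
    return response.json()['result']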

At this point I have my token that I can use with the rest of my requests.

I'll call the graphitem.get function first to get the itemid values, then the history.get function to get the values over time:
def zabbix_get_history(token, item, limit=10):
    request_data = {
        u'params': {
           u'itemids': item, 
           u'sortfield': u'clock', 
           u'limit': limit, 
           u'sortorder': u'ASC', 
           u'output': u'extend',
           u'history': 0
        }, 
        u'jsonrpc': u'2.0', 
        u'method': u'history.get', 
        u'auth': token, 
        u'id': 1
    }
    return zabbix_request(data=json.dumps(request_data))
    
def zabbix_get_graph_items(token, graphid):
    request_data = {
        "jsonrpc": "2.0",
        "method": "graphitem.get",
        "params": {
            "output": "extend",
            "expandData": 1,
            "graphids": graphid
        },
        "auth": token,
        "id": 1
    }
    return zabbix_request(data=json.dumps(request_data))

# graphid identifies the graph we're interested in (taken from the Zabbix frontend)
graphitems = zabbix_get_graph_items(token, graphid)
history = []
for graphitem in graphitems:
    history.append(zabbix_get_history(token, graphitem['itemid'], limit=100))
And here we should have an entry in the history list for each item in the specified graph (two, in this case), each holding the 100 latest values for that item. This gets me exactly what I need. It also highlights that I've created two functions that look almost identical...
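If I were to tidy that up, one option would be a single generic wrapper around the JSON-RPC envelope so that each API method becomes a one-liner. A rough sketch, reusing the zabbix_request helper from above:

def zabbix_call(method, params, token=None, request_id=1):
    # Build the JSON-RPC envelope once; every API method reuses it.
    request_data = {
        u'jsonrpc': u'2.0',
        u'method': method,
        u'params': params,
        u'auth': token,
        u'id': request_id,
    }
    return zabbix_request(data=json.dumps(request_data))

# The two functions above then collapse to calls like:
# graphitems = zabbix_call(u'graphitem.get', {u'output': u'extend', u'graphids': graphid}, token=token)
# history = zabbix_call(u'history.get', {u'itemids': itemid, u'history': 0, u'limit': 100}, token=token)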

Using python and regex to iterate over stacks in jstack output

I went away thinking about the multiline regex to capture the java stack output from yesterday's post, and it was bugging me slightly that I didn't do it that way in the first place.

Now this morning I've quickly written up a regex version of the stack capture which will iterate through each stack rather than each line in the file.
import re

# stackfiles is assumed to already hold the list of captured stack file paths,
# e.g. collected with glob.glob().
stackdata = []

# Loop through all the stack files that have been captured
for stackfile in stackfiles:
    # The filenames captured by the script are built up of these values:
    process_start_time, stack_capture_time, capture_count = stackfile.split('_')[1:]
    
    # Open the stack file and get the contents
    with open(stackfile, 'r') as f:
        stackdump = f.read()
    
    # Iterate over each stack found in the file
    # The regular expression here is finding any line beginning with ", up to two new
    # lines in a row. But also capturing the text between the quotes as a group. Both
    # * characters are non-greedy with ?.
    for stack in re.finditer('^"(.*?)".*?\n\n', stackdump, re.DOTALL|re.MULTILINE):
        fullstack = stack.group(0)
        threadname = stack.group(1)
        
        # Do the same genericising of the thread name
        threadtype = threadname.translate(None, '0123456789')
        
        # Find the thread state line, but if it doesn't exist, leave it as None
        threadstate_match = re.search('java.lang.Thread.State: (.*)', fullstack)
        threadstate = threadstate_match.group(1) if threadstate_match else None
            
        stackdata.append([process_start_time, stack_capture_time, capture_count, threadtype, threadname, threadstate])


The regex is pretty simple, as explained above. The only things to be careful with are re.DOTALL, so that . also matches newlines, and the ? after each *, so that each match is the shortest possible one.
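As a quick sanity check, here's the same pattern run against a small fabricated dump (the thread names and frames are invented):

import re

sample = (
    '"http-exec-1" #12 daemon prio=5\n'
    '   java.lang.Thread.State: RUNNABLE\n'
    '\tat java.net.SocketInputStream.read(SocketInputStream.java)\n'
    '\n'
    '"cb-1-3" #13 daemon prio=5\n'
    '   java.lang.Thread.State: WAITING\n'
    '\tat sun.misc.Unsafe.park(Native Method)\n'
    '\n'
)

# Should print the two thread names: http-exec-1 and cb-1-3
for stack in re.finditer('^"(.*?)".*?\n\n', sample, re.DOTALL|re.MULTILINE):
    print stack.group(1)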

With this version I can also scan the full stack to make an educated guess about what a thread is for. For example, if the stack contains the keyword 'oracle' then I would attribute that thread to Oracle even if it isn't named as such, as sketched below.
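Something along these lines could slot into the loop above - the keyword list is just an example:

def guess_thread_type(threadtype, fullstack, keywords=('oracle', 'couchbase')):
    # If a generically named thread mentions one of these keywords anywhere
    # in its stack, attribute it to that component instead.
    for keyword in keywords:
        if keyword in fullstack.lower():
            return keyword
    return threadtype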

At a later date I'll run both versions and collect stats to see which is faster.

Wednesday 1 March 2017

Capturing and processing java stack traces

I've put a few of my java stack capture/analysis scripts on github here.

capture.sh is pretty simple. There's some checking in there for disk space and cleanup, so we can schedule it in crontab without having to worry too much about filling a disk. Its task is to run jstack against the JBoss PID 10 times over 10 seconds, and we schedule it to run hourly when we're trying to track down problems.

process.py is where we consolidate and process those stack outputs into something we can feed into other tools. In this case we just used Excel to graph thread types over time, to identify pools that didn't grow or that hit their limits as incoming request volume increased.

I'll go through process.py here.
stackdata = []

# Loop through all the stack files that have been captured
for stackfile in stackfiles:
    # The filenames captured by the script are built up of these values:
    process_start_time, stack_capture_time, capture_count = stackfile.split('_')[1:]
    
    # Open the stack file and get the contents
    with open(stackfile, 'r') as f:
        stackdump = f.readlines()
    
    # Reset our thread based variables, as we're processing line by line rather
    # than stack by stack.
    threadline, threadname, threadtype, threadstate = None, None, None, None
    
    for line in stackdump:
        # Each thread name starts with ". Capture the bit between the quotes to get a thread name
        if line.startswith('"'):
            threadline = line
            threadname = line.split('"')[1]
            
            # Remove any numbers from the thread name to make it a generic type:
            threadtype = threadname.translate(None, '0123456789')
        
        # The state of the thread is on the running line, but it's also on the next line starting with this:
        if threadline and line.startswith('   java.lang.Thread.State:'):
            threadstate = line.split(':')[1].strip()
        
        # If the line is blank, it means we're finished processing one of the thread stacks
        if line == '\n' and threadline and threadname and threadtype and threadstate:
            # Append all the captured data to the data list, then reset all variables for next thread
            # in the stack.
            stackdata.append([process_start_time, stack_capture_time, capture_count, threadtype, threadname, threadstate])
            threadline, threadname, threadtype, threadstate = None, None, None, None

All we're doing here is looping through each file, collecting the thread name details and the running state, then saving those into a list along with some other variables (such as the time of capture and the process start time). All of the data we're capturing can actually be found on the first line of each stack, but with this setup we could add more information captured from the stack frames as we pass through them (if a frame includes 'oracle', for example, we could mark it as an Oracle-type thread). In my case I captured the top and bottom frames of the stack to help identify what a thread was for when it had a generic name.

The only smart thing we're doing is removing any digits from the thread name to make it a generic type so threads can be grouped later. For example 'cb-1-3', a Couchbase-related thread, becomes 'cb--', and we can then total the number of Couchbase threads being created.
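If you didn't want to round-trip through Excel, a rough Python equivalent of that grouping would be something like:

from collections import Counter

# Total the threads per generic type across everything collected in stackdata
# (index 3 is the threadtype column appended above).
type_counts = Counter(row[3] for row in stackdata)
for threadtype, count in type_counts.most_common():
    print threadtype, count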

To do this a better way, we should use a multiline regex to capture the entire stack for each thread and process each stack, rather than processing each line individually.

Calculate the duration of tasks from two log file lines in python

This is a fairly common pattern in my work. The example below is made up, but it's the template I reach for each time I need to perform a task like this.

I have a set of log files that look like this:
2017-03-01|01:00:00.000|Start|75b5f25e-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:00.500|End|75b5f4f2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:00.550|End|75b5f25e-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:49.000|Start|75b5fdb2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:12.200|Start|75b60398-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:13.800|End|75b60398-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:30.000|End|75b5fdb2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:02:00.000|Start|75b60528-fe7e-11e6-bc64-92361f002671
2017-03-01|01:02:05.123|End|75b60528-fe7e-11e6-bc64-92361f002671
And I need to calculate the duration of each UUID task within the file. For this we'll turn to Python.

The full code is here; below we'll concentrate on how to perform the task above.

We'll assume we can open the file and read line by line. I've put all the line processing into a separate function called process_line.

We can see these things about the file:
  1. Fields are separated by |
  2. We know a line that has the start time of the transaction contains 'Start'
  3. We know a line that has the end time of the transaction contains 'End'
  4. The UUID is the link between start and end times
And here's what I came up with:

from datetime import datetime
import re

start_string = 'Start'
end_string = 'End'
uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
uuid_times = {}

def process_line(line):
    linelist = line.split('|')
    time = linelist[1].strip()
    uuid = re.search(uuid_pattern, line)
    
    if uuid:
        identifier = uuid.group(0)
    
        if start_string in line:
            uuid_times[identifier] = {'start': time, 'end': None, 'duration': None}
            return

        if end_string in line:
            if identifier not in uuid_times:
                # This uuid has not seen a start time, probably in 
                # a previous log file we don't have.
                return
            
            uuid_times[identifier]['end'] = time
            
            duration = datetime.strptime(uuid_times[identifier]['end'], '%H:%M:%S.%f') \
                - datetime.strptime(uuid_times[identifier]['start'], '%H:%M:%S.%f')
            uuid_times[identifier]['duration'] = duration

The function works like this:
  1. Take in a line and split it into fields
  2. Record the time of the line
  3. Match a uuid in the line - you could just use linelist[3] instead of regex here, but this is a made up example.
  4. If the line contains 'Start', record the time in a dictionary to refer to later, then process the next line
  5. If the line contains 'End', record the end time of the line - if we've never seen the uuid, continue to the next line.
  6. Now that we have a start and end time for the uuid, calculate the duration between the two recorded times and store it in the dictionary.
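
To drive it over a file, all that's needed is something like this (the filename is a placeholder):

# Feed each line of the log file through process_line.
with open('transactions.log', 'r') as logfile:
    for line in logfile:
        process_line(line)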

And to verify it works:
>>> for uuid in uuid_times:
...     print uuid, uuid_times[uuid]['duration']
...
75b5fdb2-fe7e-11e6-bc64-92361f002671 0:00:41
75b60528-fe7e-11e6-bc64-92361f002671 0:00:05.123000
75b60398-fe7e-11e6-bc64-92361f002671 0:00:01.600000
75b5f25e-fe7e-11e6-bc64-92361f002671 0:00:00.550000

Use bash to test if a port is open

This one is pretty simple. You can use bash to open a TCP connection as a quick test to see if a port is open or closed:

</dev/tcp/127.0.0.1/80 3>&1 && echo "Connected!" || echo "Not Connected!"


Important things here are:

<, which opens the socket as standard input; since nothing reads from or writes to it, no data is sent over the connection. This is better than using telnet to test sockets - telnet will try to negotiate a terminal, and on some connections you don't know how that input will behave.

&& executes the statement if the previous statement didn't return an error, in this case the socket opened.

|| executes the statement if the previous statement failed, in this case if the socket didn't connect.

Combining all of this we have a simple tcp connection test. Here it is in function form with a timeout value also:

function tc {
    HOST=$1
    PORT=$2
    TIMEOUT=$([ -z "$3" ] && echo 1 || echo $3)   
    timeout ${TIMEOUT} bash -c "echo </dev/tcp/${HOST}/${PORT}" && echo "Connected!" || echo "Not Connected!"
}

This takes the host and port as the first two parameters. The third parameter is a timeout value, if it's not specified it will default to 1 second.

Process log files inside a gzipped tar file within python

So I have a task: Collect the information in a set of lines from jboss server.log files.

My first step is to collect the log files from the server. It's a production server, so I don't really want to process the log files there in case I inadvertently impact the running application. So I'll tar them up and scp them off the host:

tar cfz /var/tmp/server.log.tgz /apps/jboss/aslogs/server.log.201[67]*

Now I have a tgz file filled with the log files I want to process on my local machine. There are several ways to attack this and I could just untar the files and process them individually, but I'd be using up disk space that I might forget to clean up and I don't really want to untar the file if I don't need to. Processing time isn't of concern to me.

My solution is to use the tarfile module within python and create a generator function to return each log file line that I want so I can process the lines individually.

Here's what I came up with:

import tarfile

def tar_read_log_lines(input_tar, logfile_name):
    with tarfile.open(input_tar, 'r:*') as tar:
        for member in tar.getmembers():
            if member.name.startswith(logfile_name):
                memberfile = tar.extractfile(member)
                for line in memberfile:
                    yield [member.name, line]


This function will take in a compressed tar file and a match for the filenames, and then return each line and the file it came from in a list.

Here's how we make use of that in a simple line count:

>>> from collections import defaultdict
>>> linecounts = defaultdict(int)
>>> for linedata in tar_read_log_lines(r'server.log.tgz', 'server.log'):
...     linecounts[linedata[0]]+=1
...
>>> for c in linecounts:
...    print c, linecounts[c]
...
...
server.log.2017-02-19 72045
server.log.2017-02-18 86586
server.log.2017-01-21 20864
server.log.2017-01-20 30641
--------8<-snip-----------

This is good. It's a little slow to start, but not enough to warrant investigating other methods. Now I can take that and process the log lines I want without having to untar the file - for example, as below.
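Pulling out only the lines that mention a particular keyword (the keyword here is made up) becomes a one-liner:

# Collect (filename, line) pairs for any line containing 'ERROR'.
error_lines = [(name, line)
               for name, line in tar_read_log_lines(r'server.log.tgz', 'server.log')
               if 'ERROR' in line]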

Identify individual processes swap usage

This is my first post! I'm still figuring out the formatting here, but as a first intro to what I hope to post: Here's a little explanation of how to find the swap usage of each process on a system using bash.

Here we'll make use of the /proc file system, so we need that available - Note that some kernels do not make this available for security reasons, but we'll assume it's there for our purposes.

Within /proc, each PID has a file called status, which contains an entry called 'VmSwap' showing the swap currently allocated to that PID.

Here are 10 entries from a random host:

[root@randomhost ryanbeales]$grep "VmSwap" /proc/*/status | head
/proc/10023/status:VmSwap:           644 kB
/proc/10031/status:VmSwap:           636 kB
/proc/10036/status:VmSwap:             0 kB
/proc/10062/status:VmSwap:           640 kB
/proc/10862/status:VmSwap:           644 kB
/proc/1098/status:VmSwap:            628 kB
/proc/1131/status:VmSwap:            548 kB
/proc/11347/status:VmSwap:           640 kB
/proc/11355/status:VmSwap:           468 kB
/proc/11390/status:VmSwap:             0 kB


This is a good first start, but I want information that's more relevant to me: I'd like to add the process start date, the executable name and the current working directory to help identify which processes these are. For that I've created this loop, which goes through the status file for each process and:
  1. If the swap variable is empty, ignore the PID as it isn't relevant
  2. If the swap space is less than 100kb, then I'm not concerned, ignore it
  3. Collect the start time of the process from ps - the implication being that if a short-lived PID is using a lot of swap then I'll want to focus on it more.
  4. Get the executable and working directory from proc
Here's the completed function:

function get_swap_space() {
    # Print column headers:
    printf "PID kBSwap StartDate EXE CWD\n"
    
    # Loop through all pids:
    for file in /proc/*/status 
    do 
        # Get the PID and SWAP values from the status file
        read PID SWAP <<< $(awk ' /VmSwap|^Pid/ {printf $2 " "} END{print ""}' $file 2>/dev/null)
        
        # If the swap variable is empty, skip to the next one
        if [ -z "$SWAP" ]; then continue; fi
        
        # If less than 100kb of swap used, not significant, skip on. 
        if [ "$SWAP" -lt "100" ]; then continue; fi
        
        
        START_TIME=$(ps -p $PID -o lstart= |tr ' ' '-')
        CWD=$(readlink /proc/${PID}/cwd)
        EXE=$(readlink /proc/${PID}/exe)
        echo $PID $SWAP $START_TIME $EXE $CWD
    done
}




And its output looks like this:

[root@randomhost ryanbeales]$get_swap_space | column -t | head
PID    kBSwap  StartDate                 EXE                                  CWD
10023  644     Tue-Jan-24-23:13:41-2017  /usr/bin/ssh-agent                   /
10031  636     Tue-Jan-31-20:25:14-2017  /usr/bin/ssh-agent                   /
10062  640     Mon-Jan-23-23:36:33-2017  /usr/bin/ssh-agent                   /
10862  644     Fri-Jan-27-06:00:13-2017  /usr/bin/ssh-agent                   /
1098   628     Thu-Feb--2-08:07:22-2017  /usr/bin/ssh-agent                   /
1131   548     Sun-Jan-22-02:11:43-2017  /usr/lib/vmware-tools/sbin/vmtoolsd  /
11347  640     Tue-Jan-31-01:12:45-2017  /usr/bin/ssh-agent                   /
11355  468     Fri-Feb-10-01:27:32-2017  /usr/sbin/ntpd                       /
1155   640     Thu-Feb--2-01:49:10-2017  /usr/bin/ssh-agent                   /

Excellent! Exactly what I wanted. With this I can then sort the output by usage or start date to find any processes which might be causing an issue.