Tuesday, 16 May 2017

Finding deleted files and summing all numbers in a column


Here are two handy things, which can be combined into one!

First: is there a mismatch between what du says is used on disk and what df reports as used? You probably have some processes holding deleted files open. The directory links have been removed, so du can't sum their sizes, but the blocks are still allocated on disk, so df reports them as used space.
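
A quick way to see the two disagree, using /var as an example mount point (du's -x flag keeps it from crossing filesystems):

df -h /var
sudo du -shx /var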

You often see this with log files where someone has gone in to clean up files after a disk space alert, but the process is still writing to them. As an aside, it's better to avoid this whole situation: truncate a file with >logfile.name rather than rm it, if the owning process can't be restarted right away.
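
For example, with logfile.name standing in for whatever file the process has open:

# Don't: the process keeps the deleted file open and the blocks stay allocated
rm logfile.name
# Do: truncate to zero bytes; the process keeps writing and the space is released
> logfile.name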

Point one: find the deleted files. You can do this easily with:

sudo lsof | grep deleted

Also, if you know which process is holding the files open, you can use the proc filesystem and look in that process's fd directory; the links will have (deleted) after their names if they've been removed. Historical tip: the Flash plugin used to protect streaming videos (such as YouTube) by saving the file in /tmp and then deleting it. You could just go into proc and retrieve it if you liked. That's changed now of course with HTML5 and DRM. Anyway, back on topic.
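
For example, with 1234 standing in for the PID in question:

sudo ls -l /proc/1234/fd | grep deleted
# A deleted-but-open file can even be copied back out via its descriptor, e.g. fd 5:
sudo cp /proc/1234/fd/5 /var/tmp/recovered.log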

The second part of this is: now that I have the list of deleted files, how do I sum the size values to see how much disk space is being held?

Here you can use awk to extract the size column, paste to remove the newlines and put '+' in their place (awk can probably do this alone, but this is easier for me to remember), then pipe that into bc to get a total:

sudo lsof | grep deleted | awk '{ print $7}' | paste -sd+ | bc

Easy. That number should match the mismatch between the du and df values.
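
And for the record, awk can indeed do the summing itself:

sudo lsof | grep deleted | awk '{ sum += $7 } END { print sum }'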

Wednesday, 10 May 2017

Tar and rename files using substring replacement

In my day-to-day work I'm asked to tar up logs quite often, frequently from hosts which have the same log file names. I've got this little snippet saved that renames the log file paths inside the tar file, so we don't clobber log files when extracting at the destination end.

find /var/log/jbossas/standalone -mtime -1 -type f -print | \
   xargs tar --transform "flags=r;s|var/log/jbossas/standalone|$(hostname)|" \
   -cvf /var/tmp/logs_$(hostname)_$(date +%Y%m%d).tgz

The first part of the command runs find, looking for any files modified in the past 24 hours (add -daystart to snap that window to day boundaries, as shown below). We print any filenames found and pass them to xargs, which runs tar and adds the files to the output tar file. In the middle is the transform option: it does a simple substring replacement of "var/log/jbossas/standalone" with the output of "$(hostname)".
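
For example, the -daystart variant, which selects files modified since the start of the previous day rather than in a rolling 24-hour window:

find /var/log/jbossas/standalone -daystart -mtime -1 -type f -print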

And that's it. Simple tar filename transformation.

This is of course completely ignoring solutions like Splunk or Graylog, but vendors often want the raw log files to look at.

Extract start/end dates of SSL certificates across a list of servers

This openssl command will extract the certificate start and end dates from a server:

HOST=randomnamehost.com
openssl s_client -showcerts -connect ${HOST}:443 </dev/null 2>/dev/null | \
openssl x509 -noout -dates

This is useful to extract the dates for monitoring/checking (although, about 1000 other solutions could work too).

Here's a quick application of that in a loop to check a list of servers I have saved:

for HOST in $(cat /host/name/file); do
    echo "==========="
    echo ${HOST}
    openssl s_client -showcerts -connect ${HOST}:443 </dev/null 2>/dev/null | \
    openssl x509 -noout -dates
done

There's most likely a Python module to do the same, which would make comparing dates easier in a monitoring script. I should look for one.
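
In the meantime, openssl can do a basic expiry comparison itself with the -checkend flag; for example, to warn when a certificate expires within the next 7 days (604800 seconds):

openssl s_client -connect ${HOST}:443 </dev/null 2>/dev/null | \
openssl x509 -noout -checkend 604800 || echo "${HOST}: certificate expires within 7 days!"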

Sunday, 16 April 2017

Create a generic host in AWS using Terraform and Ansible

For my test projects I've been using Terraform to create a test host within AWS, then applying a configuration with Ansible. This one is less of a snippet and more of a kickstart for creating development hosts.

The full repository is here:
Terraform and Ansible repository

What the terraform script does is:
  1. Create an EC2 instance (of type and region specified)
  2. Assign a public IP and apply a security group to allow SSH
  3. Tag the instance to be found by ansible
  4. Create a DNS entry for the newly created host
Then with ansible it will:
  1. Apply the hostname
  2. Add a user, apply the public key and create a sudoers file
  3. Install required packages
The advantage here is that when I need a new host I can start a new instance quickly (in a few minutes) and address it via an easy DNS entry. I can adjust the security groups to allow web traffic if needed. Terraform and Ansible runs are idempotent, so I can run the scripts as often as I want and push updates to my host if needed.

All of this means I can create/destroy/update the environment on demand. For instance, I mostly work on my personal projects on weekends: I could schedule the creation of the environment on a Friday afternoon and its destruction on a Monday morning. That would save a lot of cost in unused AWS EC2 time - a large advantage if I need a GPU instance for TensorFlow.
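
A hypothetical crontab for that schedule might look like this, assuming the Terraform project lives in ~/projects/testhost and terraform is on cron's PATH (destroy -force skips the confirmation prompt):

# Create the environment on Friday evening, destroy it on Monday morning
0 17 * * 5 cd ~/projects/testhost && terraform apply
0 7 * * 1 cd ~/projects/testhost && terraform destroy -force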

There are of course many ways to achieve the same goal. But built this way, I can adjust the Terraform scripts to target an instance in VMware, another cloud provider, or even locally - you're not tied to AWS.

Tuesday, 11 April 2017

Identify which Java threads are consuming CPU time

Here's a little bash snippet that will help identify which of the JVM threads are consuming CPU time.

Theoretical problem #1 (which may have actually occurred...):
The JVM is using 100% of the CPU time on a host, what is the JVM doing?
You can see this condition with a cursory glance at top, but assume there's no log message or other information that tells you why it's occurring. The usual suspect is garbage collection, but the GC logs and jstat don't show it. To investigate further we need to find what inside the JVM is causing the CPU usage, so we'll look at per-thread CPU usage. To begin with: how do I tell how much CPU time a thread is using? This ps command will show all threads for a single JVM, sorted by 'percent CPU' usage (more on that later). For your own use, adjust the pgrep line to find whichever JVM you want:

PID=$(pgrep java | head -1)
ps -eLq $PID -o cputime,lwp,%cpu | sort -k3,3 -r

Sample output:

[user@hostname default]# ps -eLq $PID -o cputime,lwp,%cpu | sort -k3,3 -r | head
    TIME   LWP %CPU
00:00:38  9951  0.1
00:00:38  9950  0.1
00:00:26 11384  0.0
00:00:26 10953  0.0
00:00:25 14177  0.0
00:00:25 10954  0.0
00:00:24  9946  0.0
00:00:24 10955  0.0
00:00:23 13721  0.0

This is good; however, the %CPU column is a little misleading. According to the man page for ps:
CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.
This makes the value not immediately useful here. We can identify threads that have consumed CPU over the entire lifetime of the JVM, but if a thread suddenly picks up after the JVM has been running for days or weeks, its percentage will still be very small. We need to focus on the cputime value instead.

What I need to do here is put the above ps line into a function, then call that from diff in a loop so I can see the differences over time:

# Function to list threads by PID, Identify PID by adjusting the pgrep line
function java_ps() {
  PID=$(pgrep java | head -1)
  ps -eLq $PID -o cputime,lwp
}

# Wait 30 seconds between two captures of ps output to identify the differences
LOOP_WAIT=30

while true
do
    # Some formatting to make it look nice
    echo "---------------------"
    date
    echo "Waiting ${LOOP_WAIT} seconds..."
    # And this is where we do the magic
    # Provide two inputs to diff, one is the ps output, and the second is a sleep for the LOOP_WAIT
    # and then take another output from ps. Ask diff to show only the new/changed lines between the two inputs
    diff --unchanged-line-format= --old-line-format= --new-line-format='%L' <(java_ps) <(sleep ${LOOP_WAIT};java_ps)
done

Some example output from the above:

---------------------
Mon Apr 10 13:44:30 UTC 2017
Waiting 30 seconds...
00:01:32 20990
---------------------
Mon Apr 10 13:45:00 UTC 2017
Waiting 30 seconds...
00:02:02 20990
00:10:43  5805
00:01:30  8701
00:01:30  8702
00:02:57 30297
00:01:31  2402
00:00:00 19579
00:00:00 19580
00:00:00 19582

What's shown here is LWP 20990 consuming 30 seconds of CPU time in each 30-second sample - a full core. This is the thread we want to investigate more closely.

Once I've identified the LWP I can convert the number to hex and extract it from the jstack output of the JVM with this:

LWP=20990
THREADID=$(printf '%x\n' $LWP)
PID=$(pgrep java | head -1)
sudo -u ${javauser} jstack $PID | pcregrep -M "(?s)nid=0x${THREADID}.*?\t.*?\n\n" | head -n -2

As per a previous post on extracting stack traces: we extract using pcregrep in multiline mode (-M), looking for the nid field (the native thread ID - our LWP in hex), and then capture everything up to the next blank line ('\n\n').

That will give us the stack trace of the problem thread, for example:

"LDAPv3EventService" daemon prio=10 tid=0x00007f6a6ce14000 nid=0xaae runnable [0x00007f6acc2db000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.put(HashMap.java:494)
        at java.util.HashSet.add(HashSet.java:217)
        at com.sun.identity.idm.server.IdRepoJAXRPCObjectImpl.processEntryChanged_idrepo(IdRepoJAXRPCObjectImpl.java:795)
        at com.sun.identity.idm.server.JAXRPCObjectImplEventListener.identityDeleted(JAXRPCObjectImplEventListener.java:43)
        at com.sun.identity.idm.IdRepoListener.objectChanged(IdRepoListener.java:205)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3Repo.objectChanged(LDAPv3Repo.java:5308)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.dispatchEvent(LDAPv3EventService.java:949)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.processSearchResultMessage(LDAPv3EventService.java:1092)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.processResponse(LDAPv3EventService.java:718)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.run(LDAPv3EventService.java:589)
        at java.lang.Thread.run(Thread.java:745)

In our maybe-not-so-theoretical problem we've now identified the stack trace of the problem thread. A quick search shows whether the problem has been seen before; in this case it leads to these bug reports (and probably this post now):

https://bugster.forgerock.org/jira/browse/OPENAM-5324
https://bugster.forgerock.org/jira/browse/OPENAM-119

Now we can form a plan to resolve the issue.

Sunday, 26 March 2017

Using Feature Importance to identify metrics of concern in an application

This is less of a snippet, more of a story.

I've recently been involved in trying to diagnose an intermittent performance problem in a production system, one with many performance metrics collected. But with the volume of data recorded for each transaction, there was no clear indicator of what was causing the issues. Every time a batch of transactions hit a threshold, an incident would be raised, investigations would follow, and no clear cause would be found. This kept repeating over and over, each time looking at different components that had had historical issues.

In the background I'd been reading about TensorFlow and seeing what machine learning could do. There's a lot of interest at the moment as it's just hit a milestone release, but nothing immediately applied to any problem I was having; it was more an interest topic. What did come out of trying to apply machine learning to the problems above was a better framing of my question: how do I identify which metrics correlate the most with a transaction taking longer than the defined SLA?

Google wasn't so helpful with my new question, but it did lead me to discover more about how to select the data you train machine learning models with. In machine learning it's important to identify which features give the best information about what you're classifying, so you don't waste time training your model on unnecessary data; out of that comes a topic called feature importance.

After reading about this and experimenting with the scikit-learn example for a few hours, I was able to set up a notebook in IPython, load our data in, and produce a graph which clearly identified which of the components was causing issues. It was a component we hadn't looked at before, and the output clearly showed that any increase in that component's timings had a massive correlation with the total transaction time exceeding the SLA. I had loaded in plenty of other metrics alongside it, such as the time of day and the datacenter the processing took place in, but the clearest signal by a huge margin was this one often-ignored component.

What this allowed me to do was redirect the investigations to this component, so we could find out why it impacted total transaction time so much. Investigations may eventually have ended up looking at this area anyway, but with this technique I was able to sift through the massive volume of recorded metrics and shortcut those investigations.

I've put together an IPython notebook showing how feature importance can be applied to identifying an application performance problem here. I've included a lot of other information in the descriptions too. It should be very easy to pick this up and apply it to some other problem where you need to find what correlates with an expected output.

Thursday, 2 March 2017

Quick Zabbix API client from Python

Here's a problem I had: I'd like to collect the data used to make the graphs in Zabbix so that I can analyse it further.

There are details on Python libraries here, but it looks simple enough from the API reference, so I'll give it a go myself.

I've identified that I eventually want to call the history.get function, but it needs an itemid as input and I only have a graphid for now. Some more searching brings up the graphitem.get function, which takes a graphid and returns the itemids used to build the graph.

So I need to:
  1. Authenticate, to get a token
  2. Get itemids from a graphitem.get call, using the graphid as input
  3. Get itemid history for each itemid returned.
First step, some authentication; I'll use getpass and raw_input for this:
from getpass import getpass

username = raw_input('Username: ')
password = getpass('Password: ')

Now I have a username and password stored that I can use later - note that the password is held in cleartext in Python's memory.

Next up I need to get my token:
import json
import requests
import warnings

# Adjust these two for your own server; the header is the content type the
# Zabbix JSON-RPC API expects.
zabbix_api_url = 'https://zabbix.example.com/api_jsonrpc.php'
zabbix_json_rpc_header = {'Content-Type': 'application/json-rpc'}

def zabbix_request(data):
    warnings.simplefilter('ignore')
    response = requests.post(zabbix_api_url, data=data, headers=zabbix_json_rpc_header, verify=False)
    return response.json()['result']

def zabbix_authenticate(username, password):
    auth_request = {
        u'jsonrpc': u'2.0',
        u'method': u'user.login',
        u'params': {
            u'user': username,
            u'password': password
        },
        u'auth': None,
        u'id': 1
    }
    token = zabbix_request(data=json.dumps(auth_request))
    return token

token = zabbix_authenticate(username, password)

The warnings module is used here to suppress the SSL certificate warning from requests, as I've turned certificate verification off. It's good that the warning exists, but if I were giving this to someone else I'd remove the suppression so they understand they're connecting to a potentially insecure service.

At this point I have my token that I can use with the rest of my requests.

I'll call the graphitem.get function first to get the itemid values, then the history.get function to get those values over time:
def zabbix_get_history(token, item, limit=10):
    request_data = {
        u'jsonrpc': u'2.0',
        u'method': u'history.get',
        u'params': {
            u'itemids': item,
            u'history': 0,
            u'sortfield': u'clock',
            u'sortorder': u'ASC',
            u'limit': limit,
            u'output': u'extend'
        },
        u'auth': token,
        u'id': 1
    }
    return zabbix_request(data=json.dumps(request_data))
    
def zabbix_get_graph_items(token, graphid):
    request_data = {
        "jsonrpc": "2.0",
        "method": "graphitem.get",
        "params": {
            "output": "extend",
            "expandData": 1,
            "graphids": graphid
        },
        "auth": token,
        "id": 1
    }
    return zabbix_request(data=json.dumps(request_data))

graphitems = zabbix_get_graph_items(token, graphid)
history = []
for graphitem in graphitems:
    history.append(zabbix_get_history(token, graphitem['itemid'], limit=100))

And here we should have two values in the history list, showing the latest 100 values for each of the items in the specified graphid. This gets me exactly what I need. It also highlights that I've created two functions that look almost identical...

Using Python and regex to iterate over stacks in jstack output

I went away thinking about the multiline regex for capturing the java stack output from yesterday's post, and it was bugging me slightly that I didn't do it that way in the first place.

Now this morning I've quickly written up a regex version of the stack capture, which iterates through each stack rather than each line of the file:

stackdata = []

# Loop through all the stack files that have been captured
for stackfile in stackfiles:
    # The filenames captured by the script are built up of these values:
    process_start_time, stack_capture_time, capture_count = stackfile.split('_')[1:]
    
    # Open the stack file and get the contents
    with open(stackfile, 'r') as f:
        stackdump = f.read()
    
    # Iterate over each stack found in the file
    # The regular expression here is finding any line beginning with ", up to two new
    # lines in a row. But also capturing the text between the quotes as a group. Both
    # * characters are non-greedy with ?.
    for stack in re.finditer('^"(.*?)".*?\n\n', stackdump, re.DOTALL|re.MULTILINE):
        fullstack = stack.group(0)
        threadname = stack.group(1)
        
        # Do the same genericising of the thread name
        threadtype = threadname.translate(None, '0123456789')
        
        # Find the thread state line, but if it doesn't exist, leave it as None
        threadstate_match = re.search('java.lang.Thread.State: (.*)', fullstack)
        threadstate = threadstate_match.group(1) if threadstate_match else None
            
        stackdata.append([process_start_time, stack_capture_time, capture_count, threadtype, threadname, threadstate])


The regex is pretty simple, as explained in the comments. The only things to be careful with are DOTALL, so . matches newlines, and the ? after each *, to make sure each match is the shortest possible.

With this version I can then scan the full stack to make some kind of guess about what each thread is for. Say I scan for the keyword 'oracle': I can then attribute the thread to Oracle even if it's not named as such.

At a later date I'll use both versions and collect stats to see which is fastest.

Wednesday, 1 March 2017

Capturing and processing java stack traces

I've put a few of my java stack capture/analysis scripts on github here.

capture.sh is pretty simple. There's some checking in there for disk space and cleanup, so we can schedule it in crontab without worrying too much about filling a disk. Its task is to run jstack against the jboss PID 10 times over 10 seconds. We schedule this in crontab to run hourly when we're trying to track down problems.

process.py is where we consolidate and process those stack outputs into something we can feed into other tools. In this case we just used Excel to graph thread types over time, to identify pools that didn't grow, or that hit their limits, as request volume came in.

I'll go through process.py here.

stackdata = []

# Loop through all the stack files that have been captured
for stackfile in stackfiles:
    # The filenames captured by the script are built up of these values:
    process_start_time, stack_capture_time, capture_count = stackfile.split('_')[1:]
    
    # Open the stack file and get the contents
    with open(stackfile, 'r') as f:
        stackdump = f.readlines()
    
    # Reset our thread based variables, as we're processing line by line rather
    # than stack by stack.
    threadline, threadname, threadtype, threadstate = None, None, None, None
    
    for line in stackdump:
        # Each thread name starts with ". Capture the bit between the quotes to get a thread name
        if line.startswith('"'):
            threadline = line
            threadname = line.split('"')[1]
            
            # Remove any numbers from the thread name to make it a generic type:
            threadtype = threadname.translate(None, '0123456789')
        
        # The state of the thread is on the running line, but it's also on the next line starting with this:
        if threadline and line.startswith('   java.lang.Thread.State:'):
            threadstate = line.split(':')[1].strip()
        
        # If the line is blank, it means we're finished processing one of the thread stacks
        if line == '\n' and threadline and threadname and threadtype and threadstate:
            # Append all the captured data to the data list, then reset all variables for next thread
            # in the stack.
            stackdata.append([process_start_time, stack_capture_time, capture_count, threadtype, threadname, threadstate])
            threadline, threadname, threadtype, threadstate = None, None, None, None

All we're doing here is looping through each file, collecting the thread name details and the running state, then saving that into a list along with some other variables (such as time of capture and process start time). All of the data we're capturing can actually be found on the first line of each stack, but with this setup we could add more information captured from the stack frames as we pass through them (if a frame includes 'oracle' we can mark it as an oracle-type thread). In my case I captured the top and bottom frames of the stack to help identify what a thread was for when it had a generic name.

The only smart thing we're doing is removing any digits from the name to make it more generic for grouping later. For example 'cb-1-3', a couchbase-related thread, becomes 'cb--'; then we can total the number of couchbase threads being created.

To do this a better way, we should use a multiline regex to capture the entire stack for each thread and process each stack whole, rather than processing each line individually.

Calculate the duration of tasks from two log file lines in Python

This is a fairly common pattern in my work. It's a made-up example, but I use it as a template each time I need to perform a task like this.

I have a set of log files that look like this:
2017-03-01|01:00:00.000|Start|75b5f25e-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:00.500|End|75b5f4f2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:00.550|End|75b5f25e-fe7e-11e6-bc64-92361f002671
2017-03-01|01:00:49.000|Start|75b5fdb2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:12.200|Start|75b60398-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:13.800|End|75b60398-fe7e-11e6-bc64-92361f002671
2017-03-01|01:01:30.000|End|75b5fdb2-fe7e-11e6-bc64-92361f002671
2017-03-01|01:02:00.000|Start|75b60528-fe7e-11e6-bc64-92361f002671
2017-03-01|01:02:05.123|End|75b60528-fe7e-11e6-bc64-92361f002671
And I need to calculate the duration of each UUID's task within the file. For this we'll turn to Python.

The full code is here. In this space we'll concentrate on how to perform the above task.

We'll assume we can open the file and read it line by line. I've put all the line processing into a separate function called process_line.

We can see these things about the file:
  1. Fields are separated by |
  2. We know a line that has the start time of the transaction contains 'Start'
  3. We know a line that has the end time of the transaction contains 'End'
  4. The UUID is the link between start and end times
And here's what I came up with:

from datetime import datetime
import re

start_string = 'Start'
end_string = 'End'
uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
uuid_times = {}

def process_line(line):
    linelist = line.split('|')
    time = linelist[1].strip()
    uuid = re.search(uuid_pattern, line)

    if uuid:
        identifier = uuid.group(0)

        if start_string in line:
            uuid_times[identifier] = {'start': time, 'end': None, 'duration': None}
            return

        if end_string in line:
            if identifier not in uuid_times:
                # This uuid has not seen a start time, probably in
                # a previous log file we don't have.
                return

            uuid_times[identifier]['end'] = time

            duration = datetime.strptime(uuid_times[identifier]['end'], '%H:%M:%S.%f') \
                - datetime.strptime(uuid_times[identifier]['start'], '%H:%M:%S.%f')
            uuid_times[identifier]['duration'] = duration

The function works like this:
  1. Take in a line and split it into fields
  2. Record the time of the line
  3. Match a uuid in the line - you could just use linelist[3] instead of regex here, but this is a made-up example.
  4. If the line contains 'Start', record the time in a dictionary to refer to later, then move on to the next line
  5. If the line contains 'End', record the end time; if we've never seen the uuid, skip to the next line.
  6. Now that we have a start and end time for the uuid, calculate the duration between the two recorded times and store it in the dictionary.

And to verify it works:
>>> for uuid in uuid_times:
...     print uuid, uuid_times[uuid]['duration']
...
75b5fdb2-fe7e-11e6-bc64-92361f002671 0:00:41
75b60528-fe7e-11e6-bc64-92361f002671 0:00:05.123000
75b60398-fe7e-11e6-bc64-92361f002671 0:00:01.600000
75b5f25e-fe7e-11e6-bc64-92361f002671 0:00:00.550000

Use bash to test if a port is open

This one is pretty simple. You can use bash to open a TCP port as a simple test of whether a port is open or closed:

</dev/tcp/127.0.0.1/80 3>&1 && echo "Connected!" || echo "Not Connected!"


Important things here are:

<, which opens the socket and attaches it to stdin; no data is sent over the connection. This makes it better than telnet for testing sockets - telnet will try to negotiate a terminal, and on some connections you don't know how that input will behave.

&& executes the next statement if the previous one succeeded, in this case if the socket opened.

|| executes the next statement if the previous one failed, in this case if the socket didn't connect.

Combining all of this, we have a simple TCP connection test. Here it is in function form, with a timeout value as well:

function tc {
    HOST=$1
    PORT=$2
    TIMEOUT=${3:-1}
    timeout ${TIMEOUT} bash -c "echo </dev/tcp/${HOST}/${PORT}" && echo "Connected!" || echo "Not Connected!"
}

This takes the host and port as the first two parameters. The third parameter is a timeout value; if it's not specified it defaults to 1 second.
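
For example:

tc 127.0.0.1 22       # checks SSH locally with the default 1 second timeout
tc 10.0.0.5 8443 5    # waits up to 5 seconds before giving up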

Process log files inside a gzipped tar file within python

So I have a task: Collect the information in a set of lines from jboss server.log files.

My first step is to collect the log files from the server. It's a production server and I don't really want to process log files there, as I may inadvertently impact the running application. So I'll tar them up and scp them off the host:

tar cfz /var/tmp/server.log.tgz /apps/jboss/aslogs/server.log.201[67]*

Now I have a tgz file of the log files I want to process on my local machine. There are several ways to attack this: I could just untar the files and process them individually, but I'd be using up disk space that I might forget to clean up, and I don't really want to untar the file if I don't need to. Processing time isn't a concern for me.

My solution is to use the tarfile module in Python and create a generator function that returns each log file line I want, so I can process the lines individually.

Here's what I came up with:

import tarfile

def tar_read_log_lines(input_tar, logfile_name):
    with tarfile.open(input_tar, 'r:*') as tar:
        for member in tar.getmembers():
            if member.name.startswith(logfile_name):
                memberfile = tar.extractfile(member)
                for line in memberfile:
                    yield [member.name, line]


This function will take in a compressed tar file and a match for the filenames, and then return each line and the file it came from in a list.

Here's how we make use of that in a simple line count:

>>> from collections import defaultdict
>>> linecounts = defaultdict(int)
>>> for linedata in tar_read_log_lines(r'server.log.tgz', 'server.log'):
...     linecounts[linedata[0]]+=1
...
>>> for c in linecounts:
...    print c, linecounts[c]
...
server.log.2017-02-19 72045
server.log.2017-02-18 86586
server.log.2017-01-21 20864
server.log.2017-01-20 30641
--------8<-snip-----------

This is good. It's a little slow to start, but not enough to make me investigate other methods. Now I can take that and process the log lines I want without having to untar the file.

Identify individual processes' swap usage

This is my first post! I'm still figuring out the formatting here, but as a first intro to what I hope to post, here's a little explanation of how to find the swap usage of each process on a system using bash.

Here we'll make use of the /proc filesystem, so we need that available - note that some hardened kernels restrict access to it for security reasons, but we'll assume it's there for our purposes.

Within proc, each PID has a file called status, which contains an entry called 'VmSwap': the currently allocated swap for that PID.

Here are 10 entries from a random host:

[root@randomhost ryanbeales]$grep "VmSwap" /proc/*/status | head
/proc/10023/status:VmSwap:           644 kB
/proc/10031/status:VmSwap:           636 kB
/proc/10036/status:VmSwap:             0 kB
/proc/10062/status:VmSwap:           640 kB
/proc/10862/status:VmSwap:           644 kB
/proc/1098/status:VmSwap:            628 kB
/proc/1131/status:VmSwap:            548 kB
/proc/11347/status:VmSwap:           640 kB
/proc/11355/status:VmSwap:           468 kB
/proc/11390/status:VmSwap:             0 kB


This is a good first start, but I want information more relevant to me: I'd like to add the process start date, the executable name and the current working directory, to help identify which processes these are. For that I've created this loop, which goes through the status file for each process and will:
  1. If the swap variable is empty, ignore the PID as it isn't relevant
  2. If the swap space is less than 100kb, then I'm not concerned, ignore it
  3. Collect the start time of the process from ps - the implication being that if a short-lived PID shows a lot of swap usage, I'll want to focus on it more.
  4. Get the executable and working directory from proc
Here's the completed function:

function get_swap_space() {
    # Print column headers:
    printf "PID kBSwap StartDate EXE CWD\n"
    
    # Loop through all pids:
    for file in /proc/*/status 
    do 
        # Get the PID and SWAP values from the status file
        read PID SWAP <<< $(awk ' /VmSwap|^Pid/ {printf $2 " "} END{print ""}' $file 2>/dev/null)
        
        # If the swap variable is empty, skip to the next one
        if [ -z "$SWAP" ]; then continue; fi
        
        # If less than 100kb of swap used, not significant, skip on. 
        if [ "$SWAP" -lt "100" ]; then continue; fi
        
        
        START_TIME=$(ps -p $PID -o lstart= |tr ' ' '-')
        CWD=$(readlink /proc/${PID}/cwd)
        EXE=$(readlink /proc/${PID}/exe)
        echo $PID $SWAP $START_TIME $EXE $CWD
    done
}




And its output looks like this:

[root@randomhost ryanbeales]$get_swap_space | column -t | head
PID    kBSwap  StartDate                 EXE                                  CWD
10023  644     Tue-Jan-24-23:13:41-2017  /usr/bin/ssh-agent                   /
10031  636     Tue-Jan-31-20:25:14-2017  /usr/bin/ssh-agent                   /
10062  640     Mon-Jan-23-23:36:33-2017  /usr/bin/ssh-agent                   /
10862  644     Fri-Jan-27-06:00:13-2017  /usr/bin/ssh-agent                   /
1098   628     Thu-Feb--2-08:07:22-2017  /usr/bin/ssh-agent                   /
1131   548     Sun-Jan-22-02:11:43-2017  /usr/lib/vmware-tools/sbin/vmtoolsd  /
11347  640     Tue-Jan-31-01:12:45-2017  /usr/bin/ssh-agent                   /
11355  468     Fri-Feb-10-01:27:32-2017  /usr/sbin/ntpd                       /
1155   640     Thu-Feb--2-01:49:10-2017  /usr/bin/ssh-agent                   /

Excellent! Exactly what I wanted. With this I can then sort the output by usage or start date to find any processes which might be causing an issue.
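
For example, to see the biggest swap users first (the header row sorts to the bottom under -n):

get_swap_space | column -t | sort -k2 -rn | head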