Sunday 16 April 2017

Create a generic host in AWS using Terraform and Ansible

For my test projects I've been using Terraform to create a test host within AWS and then applying a configuration with Ansible. This is less of a snippet and more of a kickstart for creating development hosts.

The full repository is here:
Terraform and Ansible repository

The Terraform script will:
  1. Create an EC2 instance (of type and region specified)
  2. Assign a public IP and apply a security group to allow SSH
  3. Tag the instance so it can be found by Ansible
  4. Create a DNS entry for the newly created host
Then with Ansible it will:
  1. Apply the hostname
  2. Add a user, apply the public key and create a sudoers file
  3. Install required packages
The advantage here is that when I need a new host I can quickly start a new instance (in a few minutes) and address it via an easy DNS entry. I can adjust the security groups to allow web traffic if needed. Terraform and Ansible are idempotent, so I can run the scripts as often as I want and push updates to my host if needed.
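
To show how the two tools fit together, the basic create-and-configure cycle from a checkout of the repository looks something like the sketch below. The playbook and inventory file names here are illustrative assumptions, not necessarily what the linked repository uses.

# Sketch of the create/configure cycle (file names are illustrative assumptions)
terraform init                          # download the AWS provider plugins
terraform plan                          # preview the instance, security group and DNS changes
terraform apply                         # create or update the resources
ansible-playbook -i hosts site.yml      # set the hostname, user, keys and packages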

The advantage of all this is that I can create/destroy/update the environment on demand. For instance, I mostly work on my personal projects on weekends, so I could schedule the creation of the environment on a Friday afternoon and its destruction on a Monday morning. This would save a lot of cost in unused AWS EC2 time, a big advantage if I need a GPU instance for TensorFlow.
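
As a rough illustration of that weekend schedule, a pair of cron entries could drive it. This is only a sketch: it assumes the Terraform working directory and state are available to the user running cron, and that non-interactive runs are acceptable (-auto-approve skips the confirmation prompt on recent Terraform versions).

# Hypothetical crontab entries: build the environment on Friday afternoon,
# tear it down on Monday morning. The directory path is an example only.
0 16 * * 5  cd $HOME/dev-env && terraform apply -auto-approve
0 7  * * 1  cd $HOME/dev-env && terraform destroy -auto-approve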

There are of course many ways to achieve the same goal. However, with this approach I can adjust the Terraform scripts to target an instance in VMware, another cloud provider, or even a local machine, so I'm not tied to AWS.

Tuesday 11 April 2017

Identify which Java threads are consuming CPU time

Here's a little bash snippet that will help identify which of the JVM threads are consuming CPU time.

Theoretical problem #1 (which may have actually occurred...):
The JVM is using 100% of the CPU time on a host. What is the JVM doing?
You can see this condition with a cursory glance at top, but assume there is no log message and no further information that tells you why this is occurring. The usual suspect is garbage collection, but the GC logs and jstat don't show this. To investigate further we need to find what inside the JVM is causing the CPU usage, so we'll look at per-thread CPU usage.

To begin with, how do I tell how much CPU time a thread is using? This ps command will show all threads for a single JVM sorted by 'Percent CPU' usage (more on that later). For your own use, adjust the pgrep line to find whichever JVM you want:

PID=$(pgrep java | head -1)
ps -eLq $PID -o cputime,lwp,%cpu | sort -k3,3 -r

Sample output:

[user@hostname default]# ps -eLq $PID -o cputime,lwp,%cpu | sort -k3,3 -r | head
    TIME   LWP %CPU
00:00:38  9951  0.1
00:00:38  9950  0.1
00:00:26 11384  0.0
00:00:26 10953  0.0
00:00:25 14177  0.0
00:00:25 10954  0.0
00:00:24  9946  0.0
00:00:24 10955  0.0
00:00:23 13721  0.0

This is good; however, the %CPU column is a little misleading. According to the man page for ps:
CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.
This makes the value not immediately useful here: it identifies threads that have consumed CPU over the entire lifetime of the JVM, but if a thread suddenly becomes busy after days or weeks of running, its percentage will still be very small. Instead we need to focus on the cputime value.

What I need to do here is put the above ps line into a function, then call it from diff in a loop so I can see the differences over time:

# Function to list threads for a PID. Identify the PID by adjusting the pgrep line
function java_ps() {
  PID=$(pgrep java | head -1)
  ps -eLq $PID -o cputime,lwp
}

# Wait 30 seconds between two captures of ps output to identify the differences
LOOP_WAIT=30

while true
do
    # Some formatting to make it look nice
    echo "---------------------"
    date
    echo "Waiting ${LOOP_WAIT} seconds..."
    # And this is where we do the magic
    # Provide two inputs to diff, one is the ps output, and the second is a sleep for the LOOP_WAIT
    # and then take another output from ps. Ask diff to show only the new/changed lines between the two inputs
    diff --unchanged-line-format= --old-line-format= --new-line-format='%L' <(java_ps) <(sleep ${LOOP_WAIT};java_ps)
done

Some example output from the above:

---------------------
Mon Apr 10 13:44:30 UTC 2017
Waiting 30 seconds...
00:01:32 20990
---------------------
Mon Apr 10 13:45:00 UTC 2017
Waiting 30 seconds...
00:02:02 20990
00:10:43  5805
00:01:30  8701
00:01:30  8702
00:02:57 30297
00:01:31  2402
00:00:00 19579
00:00:00 19580
00:00:00 19582

What is shown here is that LWP 20990 consumed 30 seconds of CPU time (00:01:32 to 00:02:02) during the 30-second sample interval, i.e. it is keeping a CPU core fully busy. This is the thread we want to investigate more closely.
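
To make that delta explicit, here is a small helper, just a sketch, that converts a cputime field to seconds: 00:02:02 is 122 seconds and 00:01:32 is 92 seconds, so the thread burned 30 seconds of CPU in the 30-second window.

# Sketch: convert an HH:MM:SS cputime value to seconds so two samples can be
# compared numerically. Long-running threads can report DD-HH:MM:SS, which this
# simple version does not handle.
cputime_to_seconds() {
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Example: cputime_to_seconds 00:02:02  ->  122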

Once I've identified the LWP I can convert the number to hex and extract the matching thread from the jstack output of the JVM with this:

LWP=20990
THREADID=$(printf '%x\n' $LWP)
PID=$(pgrep java | head -1)
# javauser is the account the JVM runs as
sudo -u ${javauser} jstack $PID | pcregrep -M "(?s)nid=0x${THREADID}.*?\t.*?\n\n" | head -n -2

As per a previous post on extracting stack traces: we extract using pcregrep in multiline mode (-M), looking for the nid field (the native thread ID, which is the LWP in hex), and then capture everything up to the next blank line ('\n\n').
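
For convenience, the same steps can be rolled into a single function. This is only a sketch: JAVA_USER is an assumed variable for the account the JVM runs as, and it simply reuses the pgrep/printf/jstack/pcregrep pipeline above.

# Sketch: print the stack trace for a given LWP of the first java process found.
# JAVA_USER is an assumption for the account that owns the JVM.
java_thread_stack() {
  local lwp=$1
  local pid=$(pgrep java | head -1)
  local nid=$(printf '%x' "$lwp")
  sudo -u "${JAVA_USER}" jstack "$pid" | pcregrep -M "(?s)nid=0x${nid}.*?\t.*?\n\n" | head -n -2
}

# Example: java_thread_stack 20990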

That will give us the stack trace of the problem thread, for example:

"LDAPv3EventService" daemon prio=10 tid=0x00007f6a6ce14000 nid=0xaae runnable [0x00007f6acc2db000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.put(HashMap.java:494)
        at java.util.HashSet.add(HashSet.java:217)
        at com.sun.identity.idm.server.IdRepoJAXRPCObjectImpl.processEntryChanged_idrepo(IdRepoJAXRPCObjectImpl.java:795)
        at com.sun.identity.idm.server.JAXRPCObjectImplEventListener.identityDeleted(JAXRPCObjectImplEventListener.java:43)
        at com.sun.identity.idm.IdRepoListener.objectChanged(IdRepoListener.java:205)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3Repo.objectChanged(LDAPv3Repo.java:5308)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.dispatchEvent(LDAPv3EventService.java:949)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.processSearchResultMessage(LDAPv3EventService.java:1092)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.processResponse(LDAPv3EventService.java:718)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.run(LDAPv3EventService.java:589)
        at java.lang.Thread.run(Thread.java:745)

In our maybe-not-so-theoretical problem we've identified the stack trace of the problem thread. We can then perform a quick search to see if the problem has been seen before, and in this case it leads to these bug reports (and probably this post now):

https://bugster.forgerock.org/jira/browse/OPENAM-5324
https://bugster.forgerock.org/jira/browse/OPENAM-119

Now we can form a plan to resolve the issue.