Tuesday 11 April 2017

Identify which Java threads are consuming CPU time

Here's a little bash snippet that will help identify which of the JVM threads are consuming CPU time.

Theoretical problem #1 (which may have actually occurred...):
The JVM is using 100% of the CPU time on a host. What is the JVM doing?
You can see this condition with a cursory glance at top, but assume there is no log message and no other information that tells you why it is occurring. The usual cause is garbage collection, but the GC logs and jstat don't show this. To investigate further we need to find what inside the JVM is causing this CPU usage, so we'll look at per-thread CPU usage. To begin with, how do I tell how much CPU time a thread is using? This ps command will show all threads for a single JVM sorted by 'Percent CPU' usage (more on that later). For your own use, adjust the pgrep line to find whichever JVM you want:

# -L lists each thread (LWP) of the JVM, -o picks the columns: total CPU time, thread ID and %CPU
PID=$(pgrep java | head -1)
ps -eLq $PID -o cputime,lwp,%cpu | sort -k3,3 -r

Sample output:

[user@hostname default]# ps -eLq $PID -o cputime,lwp,%cpu | sort -k3,3 -r | head
    TIME   LWP %CPU
00:00:38  9951  0.1
00:00:38  9950  0.1
00:00:26 11384  0.0
00:00:26 10953  0.0
00:00:25 14177  0.0
00:00:25 10954  0.0
00:00:24  9946  0.0
00:00:24 10955  0.0
00:00:23 13721  0.0

This is good; however, the %CPU column is a little misleading. According to the man page for ps:
CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process. This is not ideal, and it does not conform to the standards that ps otherwise conforms to. CPU usage is unlikely to add up to exactly 100%.
This makes the value not immediately useful here. It identifies threads that have consumed CPU over the entire lifetime of the JVM, but if a thread suddenly starts burning CPU after the JVM has been running for days or weeks, its percentage will still be very small. We need to focus on the cputime value instead.
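
To make that concrete, here's a minimal sketch (not part of the original snippet) that takes two cputime samples for a single thread and turns the delta into a percentage of the sample interval. The cputime_secs helper and the LWP value are just examples, and the conversion assumes the thread has used less than a day of CPU time (so cputime is plain HH:MM:SS):

PID=$(pgrep java | head -1)
LWP=9951       # pick a thread ID from the ps output above
INTERVAL=30

# Hypothetical helper: print the current cputime of one LWP, converted to seconds
cputime_secs() {
  ps -eLq "$PID" -o lwp=,cputime= | awk -v lwp="$LWP" '
    $1 == lwp { n = split($2, a, ":"); s = 0
                for (i = 1; i <= n; i++) s = s * 60 + a[i]
                print s }'
}

T1=$(cputime_secs); sleep "$INTERVAL"; T2=$(cputime_secs)
echo "LWP ${LWP} used $((T2 - T1))s of CPU in ${INTERVAL}s (~$(( (T2 - T1) * 100 / INTERVAL ))%)"

A thread sitting near 100% over a recent interval stands out immediately this way, even if its lifetime %CPU in ps is tiny.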

What I need to do here is put that original ps line into a function, then call it from diff in a loop so I can see the differences over time:

# Function to list threads by PID. Identify the PID by adjusting the pgrep line
function java_ps() {
  PID=$(pgrep java | head -1)
  ps -eLq $PID -o cputime,lwp
}

# Wait 30 seconds between two captures of ps output to identify the differences
LOOP_WAIT=30

while true
do
    # Some formatting to make it look nice
    echo "---------------------"
    date
    echo "Waiting ${LOOP_WAIT} seconds..."
    # And this is where we do the magic
    # Provide two inputs to diff, one is the ps output, and the second is a sleep for the LOOP_WAIT
    # and then take another output from ps. Ask diff to show only the new/changed lines between the two inputs
    diff --unchanged-line-format= --old-line-format= --new-line-format='%L' <(java_ps) <(sleep ${LOOP_WAIT};java_ps)
done

Some example output from the above:

---------------------
Mon Apr 10 13:44:30 UTC 2017
Waiting 30 seconds...
00:01:32 20990
---------------------
Mon Apr 10 13:45:00 UTC 2017
Waiting 30 seconds...
00:02:02 20990
00:10:43  5805
00:01:30  8701
00:01:30  8702
00:02:57 30297
00:01:31  2402
00:00:00 19579
00:00:00 19580
00:00:00 19582

What is shown here is that LWP 20990 consumed 30 seconds of CPU time (00:01:32 to 00:02:02) during the 30-second sample interval, i.e. it is keeping a CPU core fully busy. This is the thread we want to investigate more closely.

Once I've identified the LWP, I can convert the number to hex and extract the matching thread's stack trace from the jstack output of the JVM with this:

LWP=20990
THREADID=$(printf '%x\n' $LWP)
PID=$(pgrep java | head -1)
# ${javauser} is the account the JVM runs as; jstack has to attach as that user
sudo -u ${javauser} jstack $PID | pcregrep -M "(?s)nid=0x${THREADID}.*?\t.*?\n\n" | head -n -2

As per a previous post on extracting stack traces: we extract using pcregrep in multiline mode (-M), looking for the nid field (the native thread ID, i.e. the LWP in hex), and then capture everything up to the next blank line ('\n\n').
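
If pcregrep isn't available, roughly the same extraction can be done with awk in paragraph mode, since jstack separates thread entries with blank lines. This is a sketch, assuming the jstack output has been saved to a file (jstack.out here):

THREADID=$(printf '%x' $LWP)
# RS="" makes awk treat each blank-line-separated block (one thread entry) as a single record;
# print the record that mentions our native thread ID
awk -v nid="nid=0x${THREADID}" 'BEGIN { RS = ""; ORS = "\n\n" } $0 ~ nid' jstack.out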

That will give us the stack trace of the problem thread, for example:

"LDAPv3EventService" daemon prio=10 tid=0x00007f6a6ce14000 nid=0xaae runnable [0x00007f6acc2db000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.put(HashMap.java:494)
        at java.util.HashSet.add(HashSet.java:217)
        at com.sun.identity.idm.server.IdRepoJAXRPCObjectImpl.processEntryChanged_idrepo(IdRepoJAXRPCObjectImpl.java:795)
        at com.sun.identity.idm.server.JAXRPCObjectImplEventListener.identityDeleted(JAXRPCObjectImplEventListener.java:43)
        at com.sun.identity.idm.IdRepoListener.objectChanged(IdRepoListener.java:205)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3Repo.objectChanged(LDAPv3Repo.java:5308)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.dispatchEvent(LDAPv3EventService.java:949)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.processSearchResultMessage(LDAPv3EventService.java:1092)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.processResponse(LDAPv3EventService.java:718)
        at com.sun.identity.idm.plugins.ldapv3.LDAPv3EventService.run(LDAPv3EventService.java:589)
        at java.lang.Thread.run(Thread.java:745)

In our maybe-not-so-theoretical problem we've identified the stack trace of the problem thread. We can then do a quick search to see if the problem has been reported before, and in this case it leads to these bug reports (and probably to this post now):

https://bugster.forgerock.org/jira/browse/OPENAM-5324
https://bugster.forgerock.org/jira/browse/OPENAM-119

Now we can form a plan to resolve the issue.
