This is less of a snippet, more of a story.
I've recently been involved in trying to diagnose an intermittent performance problem in a production system that collected a large number of performance metrics. With the volume of data recorded for each transaction, however, there was no clear indicator of what was causing the issues. Every time a batch of transactions hit a threshold, an incident would be raised, investigations would follow, and they would end with no clear answer as to the cause. This kept repeating, each time focusing on a different component that had a history of problems.
In the background I'd been reading about TensorFlow and seeing what machine learning could do. There's a lot of interest in it at the moment as it's just hit a milestone release, but nothing applied immediately to any problem I was having; it was more of an interest topic. What did come out of thinking about how machine learning might apply to the problem above was a better framing of my question: how do I identify which metrics correlate most strongly with a transaction taking longer than the defined SLA?
Google wasn't so helpful with my new question, but it did lead me to discover more about how to select the data you train your machine learning models with. In machine learning it's important to identify which features give the most information about what you're classifying, so that you don't waste time training your model on unnecessary data. Out of that comes a topic called feature importance.
After reading about this and experimenting with the scikit-learn example for a few hours, I was able to set up an IPython notebook, load our data in, and produce a graph that clearly identified which component was causing the issues. It was a component we hadn't looked at before, and the output showed that any increase in that component's timings correlated strongly with the total transaction time exceeding the SLA. I had loaded in plenty of other metrics alongside it, such as the time of day and the datacenter the processing took place in, but the clearest signal by a huge margin came from this one often-ignored component.
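The core of the notebook looks something like the sketch below. It assumes the per-transaction metrics are available as a CSV; the file name, column names, and SLA threshold are illustrative placeholders, not the real production values, and I'm using a random forest purely because its feature importances are easy to read off.

```python
# A minimal sketch of the approach, assuming a CSV of per-transaction metrics.
# The file name, column names, and SLA threshold below are illustrative only.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

SLA_MS = 2000  # hypothetical SLA threshold in milliseconds

# Load the recorded metrics; each row is one transaction.
df = pd.read_csv("transaction_metrics.csv")

# Label each transaction: did the total time exceed the SLA?
y = (df["total_time_ms"] > SLA_MS).astype(int)

# Everything except the total time becomes a candidate feature
# (component timings, time of day, datacenter, etc.); get_dummies
# turns categorical columns into numeric ones the model can use.
X = pd.get_dummies(df.drop(columns=["total_time_ms"]))

# Fit a random forest and read off its feature importances.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

importances = pd.Series(clf.feature_importances_, index=X.columns)
importances.sort_values().plot.barh(title="Feature importance vs. SLA breaches")
plt.tight_layout()
plt.show()
```

The bar chart this produces is the graph I'm describing: the feature that dominates it is the component whose timings move with SLA breaches.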
What this allowed me to do was help redirect the investigations towards this component so we could find out why it impacted total transaction time so much. The investigations may eventually have ended up in this area anyway, but with this technique I was able to sift through the massive volume of recorded metrics and short-cut them.
I've put together an IPython notebook showing how feature importance can be applied to identifying an application performance problem here, with plenty of extra detail in the descriptions. It should be easy to pick it up and apply it to another problem where you need to find what exactly correlates with an expected output.