This article series has covered a lot of ground: it presented an overview of application performance management (APM), it identified the challenges in implementing an APM strategy, it proposed a top-5 list of important metrics to measure to assess the health of an enterprise Node.js application, and it presented AppDynamics’ approach to building an APM solution. In this final installment this article provides some tips-and-tricks to help you implement an optimal APM strategy. Specifically, this article addresses the following topics:
- Business Transaction Optimization
- Snapshot Tuning
- Threshold Tuning
- Tier Management
- Capturing Contextual Information
1. Business Transaction Optimization
Over and over throughout this article series I have been emphasizing the importance of business transactions to your monitoring solution. To get the most out of your business transaction monitoring, however, you need to do a few things:
- Properly name your business transactions to match your business functions
- Properly identify your business transactions
- Reduce noise by excluding business transactions that you do not care about
AppDynamics will automatically identify business transactions for you and try to name them the best that it can, but depending on how your application is written, these names may or may not be reflective of the business transactions themselves. For example, you may have a business transaction identified as “POST /payment” that equates to your checkout flow. In this case, it is going to be easier for your operations staff, as well as when generating reports that you might share with executives, if business transactions names reflect their business function. So consider renaming this business transaction to “Checkout”.
Next, if you have multiple business transactions that are identified by a single entry-point, take the time to break those into individual business transactions. There are several examples where this might happen, which include the following:
- URLs that route to the same MVC controller and action
- Business Transactions that determine their function based on their payload
- Business Transactions that determine their function based on GET parameters
- Complex URI paths
If a single entry-point corresponds to multiple business functions then configure the business transactions based on the differentiating criteria. For example, if the body of an HTTP POST has an “operation” element that identifies the operation to perform then break the transaction based on that operation. Or if there is an “execute” action that accepts a “command” URI argument, then break the transaction based on the “command” segment. Finally, URI patterns can vary from application to application, so it is important for you to choose the one that best matches your application. For example, AppDynamics automatically defines business transactions for URIs based on two segments, such as /one/two. For most Node.js MVC frameworks, this automatically routes to the application controller and action. If your application uses one segment or if it uses four segments, then you need to define your business transactions based on your naming convention.
Naming and identifying business transactions is important to ensuring that you’re capturing the correct business functionality, but it is equally important to exclude as much noise as possible. Do you have any business transactions that you really do not care about? For example, is there a web game that checks high scores every couple minutes? Or is there a Node.js CLI cron job that runs every night, takes a long time, but because it is offline and does not impact the end user, you do not care? If so then exclude these transactions so that they do not add noise to your analysis.
2. Snapshot Tuning
As mentioned in the previous article, AppDynamics intelligently captures performance snapshots at a specified interval and by limiting the number of snapshots captured in a performance session. Because both of these values can be tuned, it can benefit you to tune them.
Out-of-the-box, AppDynamics captures the entire process call graph while trimming any granularity below the configured threshold. If you are only interested in “big” performance problems then you may not require granularity as fine as 10 milliseconds. If you were to increase this interval to 50 milliseconds, you will lose granularity. If you are finely tuning your application then you may want 10-millisecond granularity, but if you have no intention of tuning methods that execute in under 50 milliseconds, then why do you need that level of granularity? The point is that you should analyze your requirements and tune accordingly.
Next, observe your production troubleshooting patterns and determine whether or not the number of process snapshots that AppDynamics captures is appropriate for your situation. If you find that, while capturing up 2 snapshots every minute is resulting in too many snapshots, then you can configure AppDynamics to adjust the snapshot intervals. Try configuring AppDynamics to capture up to 1 process snapshot every minute. And if you’re only interested in systemic problems then you can turn down the maximum number of attempts to 5. This will significantly reduce that constant overhead, but at the cost of possibly not capturing a representative snapshot.
3. Threshold Tuning
AppDynamics has designed a generic monitoring solution and, as such, it defaults to alerting to business transactions that are slower than two standard deviations from normal. This works well in most circumstances, but you need to identify how volatile your application response times are to determine whether or not this is the best configuration for your business needs.
AppDynamics defines three types of thresholds against which business transactions are evaluated with their baselines:
- Standard Deviation: compares the response time of a business transaction against a number of standard deviations from its baseline
- Percentage: compares the response time of a business transaction against a percentage of difference from baseline
- Static SLAs: compares the response time of a business transaction against a static value, such as 2 seconds
If your application response times are volatile, then the default threshold of two standard deviations might result in too many false alerts. In this case you might want to increase this to more standard deviations or even switch to another strategy. If your application response times have low volatility then you might want to decrease your thresholds to alert you to problems sooner. Furthermore, if you have services or APIs that you provide to users that have specific SLAs then you should setup a static SLA value for that business transaction. AppDynamics provides you with the flexibility of defining alerting rules generally or on individual business transactions.
You need to analyze your application behavior and configure the alerting engine accordingly.
4. Tier Management
I’ve described how AppDynamics captures baselines for business transactions, but it also captures baselines for business transactions across tiers. For example, if your business transaction calls a rules engine service tier then AppDynamics will capture the number of calls and the average response time for that tier as a contributor to the business transaction baseline. Therefore, you want to ensure that all of your tiers are clearly identified.
Out of the box, AppDynamics identifies tiers across common protocols, such as HTTP, JMS, JDBC, and so forth. For example, if it sees you make a database call then it assumes that there is a database and allocates the time spent in the method call to the database. This is important because you don’t want to think that you have a very slow “save” method in a ORM class, instead you want to know how long it takes to persist your object to the database and attribute that time to the database.
AppDynamics does a good job of identifying tiers that follow common protocols, but there are times when you’re communication with a back-end system does not use a common protocol. For example, I was working at an insurance company that used an AS/400 for quoting. We leveraged a library that used a proprietary socket protocol to make a connection to the server. Obviously AppDynamics would know nothing about that socket connection and how it was being used, so the answer to our problem was to identify the method call that makes the connection to the AS/400 and identify it as a custom back-end resource. When you do this, AppDynamics treats that method call as a tier and counts the number of calls and captures the average response time of that method execution.
You might be able to use the out of the box functionality, but if you have special requirements then AppDynamics provides a mechanism that allows you to manually define your application tiers by using the Node.js API functions to further tailor your application.
5. Capturing Contextual Information
When performance problems occur, they are sometimes limited to a specific browser or mobile device, or they may only occur based on input associated with a request. If the problem is not systemic (across all of your servers), then how do you identify the subset of requests that are causing the problem?
The answer is that you need to capture context-specific information in your snapshots so that you can look for commonalities. These might include:
- HTTP headers, such as browser type (user-agent), cookies, or referrer
- HTTP GET parameter values
- Method parameter values
- Application variables and their values
Think about all of the pieces of information that you might need to troubleshoot and isolate a subset of poor performing Node.js transactions. For example, if you capture the User-Agent HTTP header then you can know the browser that the user was using to execute your business transaction. If your HTTP request accepts GET parameters, such as a search string, then you might want to see the value of one or more of those parameters, e.g. what was the user searching for? Additionally, if you have code-level understanding about how your application works, you might want to see the values of specific method parameters.
AppDynamics can be configured to capture contextual information and add it to snapshots, which can include all of the aforementioned types of values. The process can be summarized as follow:
- AppDynamics observes that a business transaction is running slow
- It triggers the capture of a session of snapshots
- On each snapshot, it captures the contextual information that you requested and associates it with the snapshot
The result is that when you find a snapshot illustrating the problem, you can review this contextual information to see if it provides you with more diagnostic information.
The only warning is that this comes at a small price: AppDynamics uses code instrumentation to capture the values of methods parameters. In other words, use this functionality where you need to, but use it sparingly.
Application Performance Management (APM) is a challenge that balances the richness of data and the ability to diagnose the root cause of Node.js performance problems with minimum overhead required to capture that data. There are configuration options and tuning capabilities that you can employ to provide you with the information you need while minimizing the amount of overhead on your application.