Monitoring Grails Apps part 2 – data collection

Congratulations, you’ve got your new Grails site live – now you just need to make sure it keeps running!

We’ve already seen how to expose custom data for a Grails application via JMX – but with this post we’ll focus on the data items that we might want to collect and monitor.

If we take a step back, what are we trying to achieve with monitoring our application?

Business perspective
Your answer may depend upon the maturity of the operational support for the application under scrutiny, but the first big question to answer is usually:

Is my site up?

For some people it may be “is my site making me money?”, but they probably won’t make money if the site is down – so let’s go with availability.

This is a reactive question – if the answer is no, get the sys admin out of bed at 3 in the morning and restore the service!

Because this is critically important, you would normally have an HTTP check that hits the front page of the site. The HTTP response is checked for a 2xx status code, and often you will also check for an expected string in the body.

For the practical side of this post, we’ll be using Opsview to demonstrate the theory. Assuming you’ve got the Opsview VMware appliance installed & have completed the Quick Start, this check for a generic Grails application called ‘Monitored’ could be:

check_http -H $HOSTADDRESS$ -p 8443 --ssl -w 5 -c 10 -u '/monitored/js/application.js' -s 'spinner'

The TCP port is specified with -p and the --ssl flag tells the check to perform an SSL handshake.
The -w and -c options are warning and critical threshold values respectively for response time in seconds.
The context path portion of the URL is specified by the -u option and -s is used to check for an expected string in the content.
Note: Opsview can show plugin help that will list the other options & provide more information on usage.
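
To make the logic of that check concrete, here is a minimal Python sketch of the same idea: a 2xx status, an expected string in the body, and warning / critical limits on response time. It is not how check_http is implemented; the function names are my own, the exit codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL), and treating a response time equal to the limit as a breach is an assumption.

```python
import time
import urllib.error
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def evaluate(status, body, elapsed, expected, warn, crit):
    """Turn one HTTP response into a (state, message) pair."""
    if not 200 <= status < 300:
        return CRITICAL, f"HTTP CRITICAL - status {status}"
    if expected not in body:
        return CRITICAL, f"HTTP CRITICAL - string '{expected}' not found"
    if elapsed >= crit:  # assumption: reaching the limit counts as a breach
        return CRITICAL, f"HTTP CRITICAL - {elapsed:.2f}s response time"
    if elapsed >= warn:
        return WARNING, f"HTTP WARNING - {elapsed:.2f}s response time"
    return OK, f"HTTP OK - status {status}, {elapsed:.2f}s response time"


def check(url, expected, warn=5.0, crit=10.0):
    """Fetch url and evaluate it; connection failures count as CRITICAL."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=crit) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:  # non-2xx answers such as 404 or 503
        status, body = err.code, ""
    except OSError as err:  # refused connections, DNS failures, timeouts
        return CRITICAL, f"HTTP CRITICAL - {err}"
    elapsed = time.monotonic() - start
    return evaluate(status, body, elapsed, expected, warn, crit)
```

Calling check() with the application.js URL and the string 'spinner' would mirror the command above; note that urllib follows redirects by default, which may differ from how your plugin is configured.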

If you front your application server with a web server, it is common practice to have two checks: one to check via the web server and the other to check directly on the application server. This helps pinpoint where the problem is in the event of an outage. If you have load-balanced servers – then the checks should be applied across all the nodes.
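
As a rough illustration of how the two results narrow things down, here is a small Python sketch. The function name and the wording of the conclusions are my own, and the conclusions are only starting points for investigation.

```python
def locate_fault(via_web_server_ok, direct_app_server_ok):
    """Suggest where to look first, given the results of the two HTTP checks."""
    if via_web_server_ok and direct_app_server_ok:
        return "no fault detected"
    if direct_app_server_ok:
        # The application answers directly, so the problem sits in front of it.
        return "web server, or its connection to the application server"
    if via_web_server_ok:
        # Unusual: the site works through the front door but not directly.
        return "direct check path (e.g. firewall), or web server serving cached content"
    return "application server, or something both paths share"
```

For load-balanced nodes you would evaluate each node separately and compare the answers.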

As with all things, the effectiveness of these ‘front door’ checks depends on the point of view of the monitoring server. If the web pages are served from the same data centre as the monitoring server, you might choose to check outbound connectivity to a well-known site or use a second viewpoint to determine whether your data centre is isolated from the Internet.

For example, this could be done by GETting http://www.downforeveryoneorjustme.com/ with your desired URL and checking for the response string “It’s just you.”…

A second big question is usually “What are our customers experiencing?” The realm of user experience monitoring / synthetic transactions is beyond the scope of this post; however, you might be interested in this blog post on using Opsview with Selenium.

Operational perspective

Going down a level, from an operational point of view, you will want to ensure that:

1. the server is up
This is done with a host check command (e.g. ping / SSH)

2. the server is responding to connections on the appropriate ports (including 8080/8443)
Here we use check_tcp.

3. the application server can connect to the database
This is a check_tcp but executed by the agent on the app server and invoked by check_nrpe or check_by_ssh.

4. the database server is responding to queries
You can use check_sql_advanced, other Nagios check_sql plugins or a custom check (e.g. https://gist.github.com/994973).

5. the email server is working
You guessed it, check_smtp.
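
Several of the checks above boil down to “can I open a TCP connection to this port?”. A minimal Python sketch of that idea follows; it is not the check_tcp implementation, and the function names are hypothetical.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, name not resolved
        return False


def closed_ports(host, ports, timeout=3.0):
    """Return the subset of ports that did not accept a connection."""
    return [port for port in ports if not tcp_port_open(host, port, timeout)]
```

For the application server this might be closed_ports(host, [8080, 8443]). For the database check the same function would need to run on the app server itself, which is what the NRPE / SSH wrapping achieves.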

More proactive checks

As this is about Grails applications, we’ll focus on Tomcat & the JVM. We’ll assume you’re using MySQL and running on a Linux server that has the Opsview Agent package installed and is accessible via NRPE (Opsview can also monitor Windows servers with NSClient++). I strongly recommend using iptables to restrict access to the NRPE port to just the monitoring server.

Top 5 things to check at the OS level with examples:

1. CPU utilisation / load average on the server
check_load -w 8,5,2 -c 20,9,4

2. Amount of free memory
check_memory -w 90 -c 98

3. Amount of free disk space
check_disk -w 5% -c 2% -p /

4. That the Tomcat process is running
check_procs -C jsvc -a tomcat -w 3:3 -c 3:3

5. That the MySQL process is running
check_procs -a mysqld --metric=CPU -w 60 -c 90
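
The load check takes its thresholds as comma-separated triplets for the 1, 5 and 15 minute load averages. Here is a minimal Python sketch of how such triplets could be evaluated. The helper names are my own, and I’m assuming a threshold is breached only when the load is strictly greater than it; check your plugin’s help for the exact rule.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def parse_triplet(text):
    """Parse '8,5,2' into (8.0, 5.0, 2.0); reject anything that is not three numbers."""
    parts = [float(part) for part in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated values, got {text!r}")
    return tuple(parts)


def load_state(loads, warn, crit):
    """Compare (1min, 5min, 15min) load averages against the two triplets."""
    if any(load > limit for load, limit in zip(loads, crit)):
        return CRITICAL
    if any(load > limit for load, limit in zip(loads, warn)):
        return WARNING
    return OK
```

On a Linux box you could feed it live data with load_state(os.getloadavg(), parse_triplet("8,5,2"), parse_triplet("20,9,4")) after importing os.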

We’ll ignore scanning log files for this post & assume you’re using jAlarms to notify Opsview of any problems that the application detects (how-to here).

JMX Checks

These are queried by check_jmx. There are a number of variants including JNRPE, which has the benefit of not running up a JVM for each check, however for simplicity we’ll assume the use of the standard jmxquery.jar check.

1. Heap space (e.g. for -Xmx512m)
check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:1616/jmxrmi -O java.lang:type=Memory -A HeapMemoryUsage -K used -I HeapMemoryUsage -J used -vvvv -w 400000000 -c 500000000

2. Active database connections
check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:1616/jmxrmi -O bean:name=dataSourceMBean -A NumActive -vvvv -w 16 -c 19

3. Any key application metrics that you’ve exposed over JMX (assuming web analytics are handled separately). The standard check_jmx implementations expect these to be numeric.
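
The heap thresholds in check 1 are given in bytes, so it helps to see them as a proportion of the configured maximum. A small Python sketch of the arithmetic follows; the helper names are my own and the 75% / 90% figures in the usage note are just examples.

```python
MIB = 1024 ** 2  # bytes in one mebibyte, the unit behind -Xmx512m


def as_fraction_of_heap(threshold_bytes, max_heap_mib):
    """Express a byte threshold as a fraction of the -Xmx value."""
    return threshold_bytes / (max_heap_mib * MIB)


def heap_thresholds(max_heap_mib, warn_fraction, crit_fraction):
    """Derive (warning, critical) byte thresholds from fractions of -Xmx."""
    max_bytes = max_heap_mib * MIB
    return int(max_bytes * warn_fraction), int(max_bytes * crit_fraction)
```

For -Xmx512m (536870912 bytes), the 400000000 warning level works out at roughly 74.5% of the heap and the 500000000 critical level at roughly 93.1%. Going the other way, heap_thresholds(512, 0.75, 0.9) gives (402653184, 483183820). The JVM may report a maximum slightly below -Xmx (the sample output later in this post shows max=532742144), so leave a little headroom.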

This will appear in Opsview like so (the observant will notice that there are graphs available – keep reading to find out how):

The art of monitoring

Thresholds
These will require tuning once you understand how your application behaves; you need to ensure that ‘false positives’ are minimised yet alerts are raised if something is going wrong.

Frequency of checks
Again, this needs to be balanced: if you check too frequently you will place additional burden on your system and use computing power that could be serving customers; if you check too infrequently, you might miss an opportunity to avert a disaster or worse still be unaware of an outage / breach of SLA until the phone rings… Ultimately the choice is yours on this one – though Opsview defaults to a 5 minute check period.

Notifications
Obviously we want our operations team to be notified in the event of a problem. Opsview supports multiple channels such as email and RSS, with SMS and service desk integration available in the Enterprise edition. Further information on notification methods is here.

Graphing
Alerts are about the ‘here & now’, while graphs allow you to view trend information and can assist in correlation or identifying areas for investigation whilst troubleshooting root causes.
E.g. if you were monitoring the number of active sessions and site response time, you would see the corresponding peaks of a Slashdot effect.

Opsview will graph performance data but not ‘unmapped’ status data. The data for the graphs are stored in ‘Round-Robin Database’ (RRD) format under /usr/local/nagios/var/rrd/<host-name>/<servicecheck-name>.

Mapping JMX service check results
The standard check_jmx only returns status data, therefore we need to map the data so that it can be graphed.
This is done in the /usr/local/nagios/etc/map.local file.

e.g. to match the JMX checks defined above

# java_HeapMemoryUsage_used
# Service type: Java JMX Heapspace check using check_jmx
# output:JMX OK HeapMemoryUsage.used=454405600{committed=532742144:init=0:max=532742144:used=454405600}
/output:JMX.*HeapMemoryUsage.used=([0-9]+).committed=([0-9]+).init=([0-9]+).max=([0-9]+)/
and push @s, [ "Heap_Memory",
    [ "usedMB", GAUGE, $1/(1024**2) ],
    [ "committedMB", GAUGE, $2/(1024**2) ],
    [ "initMB", GAUGE, $3/(1024**2) ],
    [ "maxMB", GAUGE, $4/(1024**2) ] ];


# JMX Active DB connections
# output:JMX OK NumActive=3
/output:JMX.*NumActive=([0-9]+)/
and push @s, [ "NumActive",
    [ "NumActive", GAUGE, $1 ] ];

You’ll need to ensure your map.local file is valid after you’ve saved it – this can be done by:
perl -c map.local
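
If you’d like to see what those two patterns capture before waiting for a graph, the same regular expressions can be tried against the sample outputs. This is a Python re-creation for experimenting only, not part of Opsview: the function names are my own, and I’ve dropped the leading “output:” prefix because the strings here are just the plugin output.

```python
import re

MIB = 1024 ** 2  # same divisor as 1024**2 in map.local

HEAP = re.compile(
    r"JMX.*HeapMemoryUsage.used=([0-9]+).committed=([0-9]+)"
    r".init=([0-9]+).max=([0-9]+)"
)
NUM_ACTIVE = re.compile(r"JMX.*NumActive=([0-9]+)")


def heap_metrics(output):
    """Return the four heap gauges in MB, or None if the output does not match."""
    match = HEAP.search(output)
    if match is None:
        return None
    used, committed, init, maximum = (int(group) for group in match.groups())
    return {
        "usedMB": used / MIB,
        "committedMB": committed / MIB,
        "initMB": init / MIB,
        "maxMB": maximum / MIB,
    }


def num_active(output):
    """Return the active connection count, or None if the output does not match."""
    match = NUM_ACTIVE.search(output)
    return int(match.group(1)) if match else None
```

With the sample heap output from the comment above, usedMB comes out at about 433.4 and maxMB at 508.0625, which are the values you should expect to see plotted.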

Why can’t I see my graphs straight away?
The RRDs won’t be created until the map.local changes have been picked up – this should happen when the performance data has been processed. However Opsview may not be aware that graphs are available for a service check until the configuration is reloaded.
So if you’re impatient like me you could always try restarting Opsview, forcing the desired servicecheck to be executed, checking for the RRD in nagios/var/rrd and then performing an admin reload…

Then you should see e.g. for the heap space:

Lastly, a recent introduction to Opsview was the very useful viewport sparkline graphs on the performance view allowing rapid correlation (with filtering):

Summary
We’ve had a very quick run through of Grails application monitoring with Opsview – hopefully there’s enough there to keep you busy and your application healthy!

5 responses to “Monitoring Grails Apps part 2 – data collection”

  2. Or you can just install grails.org/plugin/grails-melody

    • You could install JavaMelody and it would give you a good set of statistics out of the box – however I’d say it is only really appropriate for single server setups. If you have multiple applications you want to be able to monitor them all from one place.

      Opsview is an enterprise-grade monitoring solution:
      – being built on top of Nagios gives it access to a vast amount of existing service check plugins
      – provides alert notifications
      – can run in a distributed topology for large scale monitoring deployments
      – monitors the whole infrastructure stack (supports SNMP)
      – …
      – has professional support available

      I also think that there is scope for integrating the two with a set of service checks.
