What thresholds should I set when monitoring SQL server performance counters

Question:

We are looking to use Kaseya / KNM / Traverse or a third party tool to monitor the SQL server that Kaseya is using.

What is "normal" for SQL?

Answer:

There is no single answer to this - no set of values that are valid for all SQL servers - the size of database depends on how much data is stored, the number of page locks per second depends on the number of queries running against the database, etc.

You need to establish a baseline - what is "normal" for the server, and what do things look like when there are problems?

Below are some useful things to look at

1. RAM - Available Megabytes - Clearly, if the system is low on RAM, then performance is affected as Windows will start writing to disk. The exact amount of RAM available is dependent on how much RAM is in the system and what the system is doing. Look at the system when it is not under load, and again when all applications are running, to understand what the lower and higher thresholds for the system are.

2. %Disk Time - This is a little easier to use to spot performance issues. If a disk is running at 100% or more, then you have a problem. 100% means that the disk is constantly working, greater than 100% and your disk is being given more requests than it can service. Having said that, disks do get busy when under load, so a 1 second blip of 200% Disk time does not mean that it is time to upgrade the disk. If the average over the day is at or above 80%, then this would be more of a problem that needs addressing

3. CPU Queue - Much the same as the disk queue. This is not % CPU Time, as a 100% utilised CPU is not a problem, but if there is more work being asked of the CPU than it can handle, performance issues follow right behind.

SQL Specific

4. Locks / second - When a query runs against a table, row or page of a database, a lock is enabled so that the data is not modified while being worked on (you see the same thing in Excel - 2 people cannot edit a cell at the same time). A high number of locks is not necessarily a problem - a database that is handling 1 million updates per second will naturally have a higher number of locks than a database that is only doing 10 updates per second.

Seeing 100,000 locks / sec on the first database may be normal, but would obviously indicate an issue on the second server.

Again, you need to establish a baseline, to understand what is "normal" for your server in your environment

5. Lock Wait Time - this is the amount of time that SQL had to wait for a lock. Closer to zero is better, but as with the locks, a database with millions of updates can have a big value for this, but still be performing as designed. As before, you need a baseline to understand what is normal for your server.

The key thing here for SQL and/or windows, is to understand what your system looks like when it is running correctly, and when there is a problem, then set the thresholds to trigger before the values reach critical levels.

Browse this section