Question: How do you set up monitoring and alerting for when a downstream DGEx stops polling tests?
Answer: When a DGEx stops polling tests, you need to be made aware of it, whether actively (email alerting) or passively (a stale results indicator). This article details the ways this can be done:
1) The components of a DGE/DGEx periodically check in with the BVE. When a DGE or DGEx fails to check in with the BVE, a Warning email is sent after 5 minutes and a Critical email is sent after 15 minutes. The subject of these emails contains 'DGE Communication Lost', and they are sent to the email address specified in 'SUPERUSER > GLOBAL CONFIG > DATA GATHERING ENGINE'.
Note: If you receive one of these emails, investigate immediately, as the DGE or DGEx may be down.
2) As described in https://helpdesk.kaseya.com/hc/en-gb/articles/229042528, you can create tests on the upstream DGE that monitor whether the downstream DGEx's results are being received by the DGE.
Note: It is highly recommended you add these tests every time you add a new DGEx.
The default thresholds for this test are 15 minutes for a Warning alert and one hour for a Critical alert. You may want to lower these thresholds so you are alerted in a more timely manner (values below 5 minutes are not recommended).
3) Next, assign an appropriate Action Profile to these 'Time Since Results' tests so the appropriate people are alerted.
4) Many of our customers share dashboards with their own customers, enabling them to see the current status of their environment. In this scenario, it is advisable to add the 'Time Since Results' test for the customer's DGEx to the dashboard itself. The 'Bar Charts' and 'Test Status Data Grids' are the most visually striking ways to show that results are not current or are not being written to the database.
5) You could also add the 'Time Since Results' tests to any important Infrastructure containers, so you will see on initial login if the container is critical and/or be alerted to it.
6) The 'SUPERUSER > HEALTH' page shows whether any BVE/DGE/DGEx components are not checking in, indicated by a triangle warning icon.
Note: Clicking on the row shows which components have not checked in, without having to log in to the server directly.
7) Finally, while this does not need to be configured, the 'Stale Results' indicator (grayed out, triangle icon) is an instant visual indicator that the results are behind or that the device is not being monitored. A Stale result is defined as a test result that has not been written to the MySQL database in at least 3 polling intervals. For example, a test that polls every 5 minutes and has not written a result to the database in at least 15 minutes is considered Stale.
Note: If you see Stale results, you should immediately investigate and confirm all services are running on the monitoring DGE/DGEx for that device.
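The timing rules in the steps above can be illustrated with a short Python sketch. This is not Traverse code or a Traverse API; every name in it is hypothetical, and only the durations come from this article. Whether a threshold fires exactly at the boundary (for example at precisely 5 minutes) is an assumption made here for the check-in and 'Time Since Results' rules.

```python
from datetime import timedelta

# Durations quoted in this article; the constant names are hypothetical.
CHECKIN_WARNING = timedelta(minutes=5)        # 'DGE Communication Lost' Warning email
CHECKIN_CRITICAL = timedelta(minutes=15)      # 'DGE Communication Lost' Critical email
TSR_DEFAULT_WARNING = timedelta(minutes=15)   # 'Time Since Results' default Warning
TSR_DEFAULT_CRITICAL = timedelta(hours=1)     # 'Time Since Results' default Critical
TSR_MIN_RECOMMENDED = timedelta(minutes=5)    # thresholds below this are not recommended
STALE_INTERVALS = 3                           # polling intervals before a result is Stale


def severity(elapsed, warning, critical):
    """Classify the time since the last check-in or result.

    Assumption: a threshold is treated as reached when elapsed >= threshold.
    """
    if elapsed >= critical:
        return "CRITICAL"
    if elapsed >= warning:
        return "WARNING"
    return "OK"


def below_recommended_minimum(threshold):
    """True if a 'Time Since Results' threshold is under the 5 minute guidance."""
    return threshold < TSR_MIN_RECOMMENDED


def is_stale(since_last_result, polling_interval):
    """True if no result was written for at least 3 polling intervals."""
    return since_last_result >= STALE_INTERVALS * polling_interval


if __name__ == "__main__":
    # A DGEx that last checked in 7 minutes ago has passed the Warning threshold only.
    print(severity(timedelta(minutes=7), CHECKIN_WARNING, CHECKIN_CRITICAL))  # WARNING
    # A 5 minute test with no result for 15 minutes is Stale.
    print(is_stale(timedelta(minutes=15), timedelta(minutes=5)))  # True
```

The same severity function serves both the check-in emails and the 'Time Since Results' test because both follow the same pattern: a Warning threshold followed by a later Critical threshold.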
By implementing the above steps, you could be alerted in 6 different ways if a DGEx stops polling tests again:
1: 'DGE Communication Lost' emails
2: 'Time Since Results' action profile alert
3: Dashboard warning/critical test indicator
4: Container indicator/alert
5: SUPERUSER > HEALTH
6: Stale Results indicator (grayed out) on the Status page