How to Read Logs: Monitoring

Advanced Monitoring with examples

This document goes over monitors and some information that can be used for troubleshooting.
(Thank you Jim Deverall and Chris Edes)

"Basics" about Monitors:

The CagService looks for the monitoring update on the platform and installs the agent in the relevant directory. Once that is done, it starts the .NET runtimes and the data logging used to record information. The CagService and AEMAgent both run as SYSTEM (NT AUTHORITY\SYSTEM), which affects certain portions of monitoring.

When an agent is added, it looks up the web remote URL and is controlled by the AEMAgent. Once the AEMAgent is approved, it establishes a connection to the platform that is unique to the device. It does this over a channel recorded in the log files. An example is below:

11.4.0.5|2023-02-01T09:49:50.448-05|INFO|CONNECTION NEW
"https://concord-agent-notifications.centrastage.net/device/c5479d2e-f772-6986-5ff2-7d155aefcce8/notification"
|{
  "uri":"https://concord-agent-notifications.centrastage.net/device/c5479d2e-f772-6986-5ff2-7d155aefcce8/notification",
  "logcontext":"PlatformHttpClient"
}
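
Each entry starts with a pipe-delimited header: agent version, timestamp, level and event type. The Python sketch below splits that header. The function and field names are illustrative assumptions, not part of the agent, and it assumes the UTC offset is written as hours only (for example -05), as in the sample above.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogHeader(NamedTuple):
    version: str
    timestamp: datetime
    level: str
    event: str


def parse_header(line: str) -> Optional[LogHeader]:
    """Parse 'version|timestamp|level|event'; return None if it does not match."""
    parts = line.strip().split("|", 3)
    if len(parts) < 4:
        return None
    version, stamp, level, event = parts
    # Offsets appear as hours only ("-05"); pad to "-05:00" for fromisoformat.
    if re.search(r"[+-]\d{2}$", stamp):
        stamp += ":00"
    try:
        timestamp = datetime.fromisoformat(stamp)
    except ValueError:
        return None
    return LogHeader(version, timestamp, level, event.strip())


# Header line copied from the CONNECTION NEW sample above.
header = parse_header("11.4.0.5|2023-02-01T09:49:50.448-05|INFO|CONNECTION NEW")
print(header.event, header.timestamp.isoformat())
```

Lines that do not begin with a header (the JSON continuation lines, for instance) return None, so they can be attached to the previous entry when reading a whole file.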

When you want to review what the platform is expecting, use the links below. These URLs show what the platform should be sending to the end device. The local monitors should not differ from this, but they can. This is also useful when you need to find the monitor ID to search for in the AEMAgent.log.

https://PLATFORM-monitoring.centrastage.net/device/DEVICE-ID/monitor?key=KEY
https://PLATFORM-agent-notifications.centrastage.net/device/DEVICE-ID/notification?key=KEY

Opening the link without a key fails, because the request needs the agent encryption key, which is normally handled on the backend. When you add the key from the key file within:

C:\ProgramData\CentraStage\AEMAgent

it will let you through, and you can keep refreshing the page. Key files are restricted and cannot be opened by default. To read one, you would need to go to the file's security permissions and potentially add yourself as a principal.
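
As a rough illustration, the two URL templates can be filled in with a few lines of Python. This is a sketch only: the function name is made up, the key value is a placeholder, and percent-encoding the key is an assumption about what the platform accepts.

```python
from urllib.parse import quote, urlencode


def monitor_urls(platform: str, device_id: str, key: str) -> dict:
    """Fill in the PLATFORM, DEVICE-ID and KEY placeholders of the two URLs."""
    # Assumption: the key is passed percent-encoded in the query string.
    query = urlencode({"key": key})
    device = quote(device_id, safe="")
    base = "centrastage.net/device"
    return {
        "monitor": f"https://{platform}-monitoring.{base}/{device}/monitor?{query}",
        "notification": (
            f"https://{platform}-agent-notifications.{base}/{device}/notification?{query}"
        ),
    }


# Platform name and device ID come from the log sample earlier in this document;
# EXAMPLE-KEY is a placeholder, not a real key.
urls = monitor_urls("concord", "c5479d2e-f772-6986-5ff2-7d155aefcce8", "EXAMPLE-KEY")
for name, url in urls.items():
    print(name, url)
```

The sketch only builds the addresses. Opening them still requires a valid key taken from the restricted key file.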

Note that the agent does not poll for monitors. In principle this means a monitors.json might not exist, but in practice there should always be one because default metric monitors are in place. If the monitors.json already exists on startup, the agent uses the cached monitors to populate the device. The metrics appear in the monitors.json because they are defaults added to make sure that information can be recorded for the New UI. CPU, memory and disk usage share the same poll within RMM, as one poll is enough for multiple monitors. To view this from the web, use the following link:

https://PLATFORM-agent-notifications.centrastage.net/device/DEVICE-ID/notification?key=KEY

“GET DEFINITIONS” entries are outbound requests from the device to the platform (monitors, software and alerts). You can search the log files for “monitor-definitions-changed”; any result means there was a change. Overnight audits are designed to push out updated json files, which can then be used on the device. These can be found under “GET DEFINITIONS MONITORS” or “GET DEFINITIONS SOFTWARE”. You will only see “DEFINITIONS UPDATED PLATFORM” after a “GET DEFINITIONS X” connection.
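
The search described above can be sketched in Python as a plain substring scan. The function name is illustrative, and the sample lines are made up in the same header layout as the log excerpts in this document; they are not real agent output.

```python
from typing import Iterable, List, Tuple

# Markers named in the text above; matching is a plain substring test.
MARKERS = (
    "monitor-definitions-changed",
    "GET DEFINITIONS",
    "DEFINITIONS UPDATED PLATFORM",
)


def find_definition_events(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, marker, text) for every line containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for marker in MARKERS:
            if marker in line:
                hits.append((number, marker, line.rstrip()))
                break  # report each line once, under the first marker it matches
    return hits


# Made-up sample lines, for illustration only.
sample = [
    "11.4.0.5|2023-02-01T09:50:01.000-05|INFO|GET DEFINITIONS MONITORS",
    "11.4.0.5|2023-02-01T09:50:02.000-05|INFO|DEFINITIONS UPDATED PLATFORM",
    "11.4.0.5|2023-02-01T09:50:03.000-05|INFO|POLL",
]
for number, marker, text in find_definition_events(sample):
    print(number, marker, text)
```

To scan a real log, pass an open file object instead of the sample list, since a text file iterates line by line.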

If you would prefer not to use the information above when looking for the monitor ID, you can also use the AEMAgent Decryptor tool. Should you need a copy, please take a look here or at the bottom of this document.

What would you see when an alert is triggered?

When you look at a monitor in the log, you would see monitor-definition-change when the policy is pushed.

Example: a service monitor is set to raise an alert when the service has not been running for 0 minutes. When this triggers, you would see the below in the log.

11.4.0.5|2022-12-08T14:55:37.873+00|INFO|ALERT
fc8dd63e-b920-4f4d-96fb-b4a4037cb6d6
c4e95ab4-e66e-4849-84d8-d34756cf2cc7
ProcessSensitivity {"value":"Stopped","metric":"processstatus","dummy":1,"isThreshold":true}
{"logcontext":"Monitoring.Program","membername":"handler"}

Once it alerts, it will then push that information to the platform:

11.4.0.5|2022-12-08T14:55:38.272+00|INFO|POLL
"fc8dd63e-b920-4f4d-96fb-b4a4037cb6d6"
"processSensitivity":{"value":"Stopped","metric":"processstatus","dummy":1,"isThreshold":true}
{"monitorId":"fc8dd63e-b920-4f4d-96fb-b4a4037cb6d6","instanceId":"","definitionName":"ProcessSensitivity","json":{"value":"Stopped"},"metric":"processstatus","dummy":1,"isThreshold":true}
{"logcontext":"Monitoring.Program","membername":"handler","httpStatusCode":200}

The alert status is triggered by the ALERT line, but it also depends on the BATCH POST. Without the BATCH POST, the alert will not be raised on the platform. Once the agent finds the alert.dat, it renames it to alert.dat.tosend, and the POST pushes this to the platform to show the actual trigger. You can think of this as the receipt of the alert. The latest value is taken from “value” in the POLL. RLPOLLs are used for longer polling, such as the metrics found in the New UI, although this is not guaranteed.
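
To check whether an alert is still waiting for its BATCH POST, one option is to list the alert files by suffix. The Python sketch below assumes the DataLog path shown in the BATCH POST sample in this document and classifies files by name only. The function name and the "recorded" and "queued" labels are made up for illustration, and what remains on disk after a successful post is not covered here.

```python
from pathlib import Path
from typing import Dict, List

# DataLog location as it appears in the BATCH POST sample in this document.
DATALOG_DIR = Path(r"C:\ProgramData\CentraStage\AEMAgent\DataLog")


def alert_files(datalog_dir: Path) -> Dict[str, List[str]]:
    """Group alert files by state, based only on their file-name suffix."""
    found: Dict[str, List[str]] = {"recorded": [], "queued": []}
    if not datalog_dir.is_dir():
        return found
    for path in sorted(datalog_dir.iterdir()):
        name = path.name.lower()
        if name.endswith(".alert.dat.tosend"):
            found["queued"].append(path.name)  # renamed, waiting for the POST
        elif name.endswith(".alert.dat"):
            found["recorded"].append(path.name)  # written, not yet renamed
    return found


if __name__ == "__main__":
    for state, names in alert_files(DATALOG_DIR).items():
        print(state, names)
```

Reading this folder may require elevated permissions, since the agent runs as System.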

When resolved, you would see the following:

11.4.0.5|2022-12-08T15:12:38.490+00|INFO|CONNECTION MESSAGE
"id,_data":{
"id":"fc8dd63e-b920-4f4d-96fb-b4a4037cb6d6",
"EventName":"monitor-alert-resolved",
"Retry":"",
"userEvent":{
"id":"fc8dd63e-b920-4f4d-96fb-b4a4037cb6d6",
"EventName":"monitor-alert-resolved",
"Retry":""
}}

Note that the EventName “monitor-alert-resolved” just confirms that the monitor has received a response status and that the local device can now re-alert on it.

Another example: if CPU usage is over 10% for more than one minute, you will see the alert as below:

11.7.1122|2023-04-19T10:23:59.274-04|INFO|ALERT
b41c546e-ecf5-45aa-9b59-2634347d5d60
2a5ae7a6-c641-4b96-8bc1-d9a0bfa0642e
Performance {"value":"11.93","metric":"performance","cpu","system","isThreshold":true}
{"diagnostics":"AEMAgent.exe : 3
%\r\nbaratella.exe : 3
%\r\nsvchost.exe : 2
%\r\nExplorer.EXE : 2
%\r\nsvchost.exe : 1
%\r\nbaratella.exe : 0
%\r\nGUI.exe : 0
%\r\nDllHost.exe : 0
%\r\nWUDFHost.exe : 0
%\r\nsvchost.exe : 0
%\r\n"}
{"logcontext":"Monitoring.Program","membername":"handler"}

The poll:

11.7.1122|2023-04-19T10:23:59.295-04|INFO|POLL
"b41c546e-ecf5-45aa-9b59-2634347d5d60"
"performance":{"value":"11.93","metric":"performance","cpu","system","isThreshold":true}
{"diagnostics":"AEMAgent.exe : 3
%\r\nbaratella.exe : 3
%\r\nsvchost.exe : 2
%\r\nExplorer.EXE : 2
%\r\nsvchost.exe : 1
%\r\nbaratella.exe : 0
%\r\nGUI.exe : 0
%\r\nDllHost.exe : 0
%\r\nWUDFHost.exe : 0
%\r\nsvchost.exe : 0
%\r\n"}
{"monitorId":"b41c546e-ecf5-45aa-9b59-2634347d5d60",
"instanceId":"",
"definitionName":"Performance",
"json":{"value":"11.93"},
"metric":"performance","cpu","system","isThreshold":true}
{"logcontext":"Monitoring.Program","membername":"handler","httpStatusCode":200}

The Batch Post:

11.7.1122|2023-04-19T10:24:27.283-04|INFO|BATCH POST
"C:\ProgramData\CentraStage\AEMAgent\DataLog\e5eea64d-fd9d-25f5-8cc8-07eb686f3cb7.alert.dat.tosend"
"C:\ProgramData\CentraStage\AEMAgent\DataLog\e5eea64d-fd9d-25f5-8cc8-07eb686f3cb7.alert.dat.tosend","httpStatusCode":200

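The gap between the ALERT line and the BATCH POST can be worked out from the timestamps. The Python sketch below uses the three timestamps from the CPU example above. The helper name is illustrative, and it assumes the offset is always written as hours only.

```python
from datetime import datetime


def parse_stamp(stamp: str) -> datetime:
    """Parse a log timestamp whose UTC offset is hours only, e.g. '-04'."""
    return datetime.fromisoformat(stamp + ":00")


# Timestamps copied from the ALERT, POLL and BATCH POST samples above.
alert = parse_stamp("2023-04-19T10:23:59.274-04")
poll = parse_stamp("2023-04-19T10:23:59.295-04")
batch_post = parse_stamp("2023-04-19T10:24:27.283-04")

poll_delay = (poll - alert).total_seconds()
post_delay = (batch_post - alert).total_seconds()
print(f"POLL after ALERT: {poll_delay:.3f} s")
print(f"BATCH POST after ALERT: {post_delay:.3f} s")
```

For this sample the POLL follows the ALERT by 0.021 seconds and the BATCH POST by 28.009 seconds, so the platform would not show the alert until roughly half a minute after the local trigger.
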
If we look at an example where there is a response component, you would get the following:

Set up the monitor with a response component. If the response component is built into the monitor, the script can be delivered inside the monitor definition. When the monitor triggers and has access to the script, it runs the script at the same time as the trigger. The result is then sent to the platform, which can happen with a lengthy delay from when the monitor originally triggered. You would see something like the below snippet:

11.4.0.5|2023-01-09T07:53:51.130-05|INFO|ALERT
b41c546e-ecf5-45aa-9b59-2634347d5d60
0a0f06a2-0e35-4ee8-9487-58dbaec56f6e
Performance {"cpu","system","isThreshold":true,"diagnostics":"chrome.exe : 5
%\r\nwinprocess.exe : 4
%\r\nsvchost.exe : 3
%\r\nAEMAgent.exe : 3
%\r\nchrome.exe : 2
%\r\nsvchost.exe : 2
%\r\nWUDFHost.exe : 1
%\r\nChrome.exe : 1
%\r\ncomponentResult":{
"StdOut":"Original\r\n",
"ExitCode":0,
"StdErr":""
}}
{"logcontext":"Monitoring.Program","membername":"handler"}

If the response component has a file, the alert has to trigger first so that the file can be sent to the device. The response component then runs, as shown below within the AEMAgent.log.

4.4.2190.2190|2023-03-31T10:48:38.251-05|INFO|softwareJobWorker: StdOut
C:\ProgramData\CentraStage\Packages\fde80c48-4720-4c16-aefa-07cdc4b16dfb\#! /bin/bash
"Latest rmm tool version for Windows: 1.0.0.109"
"eps.rmm download link:
https://download.bitdefender.com/SMB/RMM/Tools/Win/1.0.0.109/x64/eps.rmm.exe"
"Update process started"
"Component finishes with exit code 0"

If there is no error handling within the response component, the alert may not reach the platform.

To test alerts, you can use the commands below with event log monitors to exercise auto resolution, or use a service monitor with manual resolution.

PowerShell

# Test event that an event log monitor can alert on (ID 1384).
Write-EventLog -LogName "Application" -Source "CagService" -EventID 1384 `
    -EntryType Information -Message "Testing" -Category 1 -RawData 10,20

# Second test event (ID 1385), intended as the closing event for auto resolution.
Write-EventLog -LogName "Application" -Source "CagService" -EventID 1385 `
    -EntryType Information -Message "TestingClose" -Category 1 -RawData 10,20

Tools that will help with troubleshooting:

BareTail and Notepad++, both of which are found within the ComStore.