An agent shows as online only while it maintains a constant communication handshake with the platform. Log.txt is your best friend for seeing what is happening.
Normal behavior:
You will see the lines below repeating over and over in log.txt when the agent is communicating properly. The agent sends a request to the platform, and the platform responds with an HTTP 200 message, which basically means "okay". When the agent does not get a 200 in response, the handshake has failed and the device status will then be "offline".
4.4.2206.2206|2024-02-02T00:01:34.743-05|INFO|12|CsControlConnection: Sending request dcdf670a-95a3-4a98-911f-4ca5c0d33c44 to CS - "ping"|{ "messageId": "dcdf670a-95a3-4a98-911f-4ca5c0d33c44", "messageMethod": "ping" }
4.4.2206.2206|2024-02-02T00:01:35.760-05|INFO|38|CsControlConnection: Response to dcdf670a-95a3-4a98-911f-4ca5c0d33c44 is Http - 200|{ "messageId": "dcdf670a-95a3-4a98-911f-4ca5c0d33c44", "responseStatus": 200 }
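The request/response pairing shown in the two sample lines can be checked with a short script. This is only a sketch: the field layout (version, timestamp, level, thread, message, JSON) is inferred from the sample lines, and `summarize_pings` is a hypothetical helper name, not part of the agent or any tool. It assumes any non-200 response uses the same "Response to ... is Http - NNN" wording.

```python
import re

# Patterns taken from the two sample lines; other message types are ignored.
REQUEST_RE = re.compile(r"Sending request (?P<id>[0-9a-f-]+) to CS")
RESPONSE_RE = re.compile(r"Response to (?P<id>[0-9a-f-]+) is Http - (?P<status>\d+)")


def summarize_pings(lines):
    """Return (ok_count, failed, unanswered) for the given log lines."""
    pending = {}   # message id -> timestamp of the request
    ok_count = 0
    failed = []    # (timestamp, message id, status) for non-200 responses
    for line in lines:
        # Assumed layout: version|timestamp|level|thread|message|json
        fields = line.split("|", 4)
        if len(fields) < 5:
            continue
        timestamp, message = fields[1], fields[4]
        request = REQUEST_RE.search(message)
        if request:
            pending[request["id"]] = timestamp
            continue
        response = RESPONSE_RE.search(message)
        if response:
            # A response closes the matching request, whatever its status.
            pending.pop(response["id"], None)
            status = int(response["status"])
            if status == 200:
                ok_count += 1
            else:
                failed.append((timestamp, response["id"], status))
    # Requests that never received a response are returned as sorted ids.
    return ok_count, failed, sorted(pending)
```

To use it, read log.txt into a list of lines and pass it in. A healthy agent should give a growing `ok_count` with `failed` and `unanswered` empty, or nearly so.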
Troubleshooting:
For an agent to communicate with the portal, these things need to happen in order (i.e. anything below requires everything above to be met):
- Agent is Installed correctly
- Confirm it didn't get uninstalled or failed installation
- Service is running (not stopped/crashed)
- Check services locally and check log.txt to confirm latest log entries are from seconds/minutes ago and that new log entries are being added
- Make sure the device isn't running as on-demand (logs will show ondemand: true/false)
- Make sure the account isn't overlicensed (logs will show account overlicensed and you can see the count of managed agents vs license in the billing section online)
- Device is configured to connect properly
- Check internet settings to ensure TLS 1.2 is enabled, device meets supportability requirements, the correct cipher suites are present to allow connection via TLS (more below)
- Service is able to send pings through the partner's network
- Check for the above log activity – if you are seeing connection errors, it suggests the connection isn't making it out of the network
- Probe first by checking if other devices can connect properly in the same network. If multiple devices can't, it's almost certainly a network issue; if only one device is affected and other devices CAN connect, it's likely a device-specific problem.
- If confirmed to be affecting the entire network, ask about the allowlisting configuration and/or run the healthcheck tool to show the failing IPs as proof that the network is not configured correctly
- Also ask if the partner is using deep packet inspection or stateful packet inspection. This can cause connection issues even when the healthcheck shows general connectivity success
- The platform sees the connection and deems the device as online
- Check for the account being overlicensed, as this may make device show offline even if the platform does see the connection activity. Logs will also show there is not an available license if this is the case (see below)
- In rare cases, the agent may communicate correctly with the platform but the platform fails to show the device online – get elevated support involved if this is the case
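The "latest log entries are from seconds/minutes ago" check in the list above can be sketched as follows. It assumes the timestamp is the second pipe-delimited field and uses an hour-only UTC offset such as `-05`, as in the sample log lines. The helper names and the 5-minute threshold are illustrative choices, not product settings.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches a timestamp ending in an hour-only offset, e.g. "T00:01:34.743-05".
HOUR_ONLY_OFFSET = re.compile(r"T[\d:.]+[+-]\d{2}$")


def parse_log_timestamp(text):
    """Parse a log timestamp, padding an hour-only offset to +HH:MM form."""
    if HOUR_ONLY_OFFSET.search(text):
        text += ":00"
    return datetime.fromisoformat(text)


def newest_entry_age(lines, now=None):
    """Return how long ago the last parseable log line was written, or None."""
    now = now or datetime.now(timezone.utc)
    for line in reversed(lines):
        fields = line.split("|", 2)
        if len(fields) < 3:
            continue
        try:
            stamp = parse_log_timestamp(fields[1])
        except ValueError:
            continue
        if stamp.tzinfo is None:
            continue  # no offset recorded, so it cannot be compared safely
        return now - stamp
    return None


def looks_stale(lines, max_age=timedelta(minutes=5), now=None):
    """True when there is no recent entry, i.e. the service may have stopped."""
    age = newest_entry_age(lines, now)
    return age is None or age > max_age
```

A stale log points at the earlier items in the list (installation or service state) rather than at the network, since the agent is not even attempting to write new entries.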
Common Errors:
- "SSL Error The remote certificate is invalid according to the validation procedure"
- "CS returned error code 429 - This request was made too soon after a previously failed login request - This is normally due to the account being over it's device limit"
- This is likely caused by the account having more devices than 110% of the licenses purchased
- The health check will likely catch this, but the error should also appear fairly frequently in log.txt
- The true solution is to purchase more licenses, but if the account is only over by a couple of devices, in a pinch the partner could delete less important devices to get a mission-critical device online quickly
- "SSL Error A call to SSPI failed, see inner exception." OR "Message":"SSL Error The client and server cannot communicate, because they do not possess a common algorithm"
- Most often caused by the device not having the needed cipher suites (more common on older devices, especially servers)
- Full info and solution here: Kristopher Corpus: Windows Server devices failing to reconnect - Post v11.2 update...
- Note that when we test for cipher suites, the test needs to be done in Internet Explorer
- See also: https://rmm.datto.com/help/en/Content/1INTRODUCTION/Infrastructure/INFRASTRUCTUREANDSECURITY.htm#Supported_TLS_Cipher_Suites
- "An existing connection was forcibly closed by the remote host"
- There could be a few causes but most commonly:
- General allowlisting
- Deep Packet Inspection or Stateful Packet Inspection: this is particularly likely if devices are generally able to connect and stay online for periods of time before randomly disconnecting. Connections affected by packet inspection tend to go on and offline unpredictably, whereas other network issues tend to leave the device completely unable to connect. If we see this behavior and error, we should find out from the partner whether packet inspection is in use; they will likely need to disable it.
- 4 bytes requested but only 0 read in BEBinaryReader (HTTPS filter)
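When a log contains many errors, the list above can be turned into a quick tally of likely causes. The sketch below only matches the error substrings quoted in this section and maps them to the causes given here. `KNOWN_ERRORS` and `classify_errors` are hypothetical names, and the certificate error is labeled without a cause because this guide does not record one.

```python
from collections import Counter

# (substring to look for, likely cause as described in this guide)
KNOWN_ERRORS = [
    ("The remote certificate is invalid according to the validation procedure",
     "certificate validation error (no cause recorded in this guide)"),
    ("CS returned error code 429",
     "account over its device limit"),
    ("A call to SSPI failed",
     "missing cipher suites"),
    ("do not possess a common algorithm",
     "missing cipher suites"),
    ("An existing connection was forcibly closed by the remote host",
     "allowlisting or packet inspection"),
    ("4 bytes requested but only 0 read in BEBinaryReader",
     "HTTPS filter"),
]


def classify_errors(lines):
    """Count log lines per likely cause, using simple substring matching."""
    counts = Counter()
    for line in lines:
        for needle, cause in KNOWN_ERRORS:
            if needle in line:
                counts[cause] += 1
                break  # count each line once, under the first match
    return counts
```

Calling `classify_errors(lines).most_common()` on the contents of log.txt gives a rough ranking of where to look first. It is a triage aid only and does not replace reading the surrounding log entries.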
Cipher reply:
Hi {{ticket.requester.first_name}},
Thank you for working with us on this case and we apologize for any inconvenience.
The error messages we are seeing in our Agent Logs appear to be related to required TLS Cipher Suites not being enabled to allow TLS communication with the RMM platform.
Most of these Cipher Suites (listed below) should be enabled by default on Windows Server OS but may have been disabled at some point by a previous MSP or other means. Therefore, the device's RMM Agent is unable to connect via TLS 1.2 after our recent DRMM update as we deprecated support for the TLS 1.0 & 1.1 protocols.
To resolve the Agent connection failures on these Windows Server devices, we would recommend the following steps:
1) Please re-run the SSL Labs test within the Internet Explorer browser to check for the following four Cipher Suites:
TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- SSL Labs Test: https://clienttest.ssllabs.com:8443/ssltest/viewMyClient.html
- NOTE: Need to open link in Internet Explorer only as Chrome/Edge/etc may not properly report the device's Cipher Suites (SCHANNEL configuration).
2) If any of these TLS Cipher Suites are missing from the report, you can use a 3rd-party tool called "IIS Crypto" to enable the supported TLS Cipher Suites.
- We would recommend enabling all of these suites to avoid any possible DRMM Agent connection issues as some older Windows Server OS versions may not support all of these suites within the list above.
- NOTE: This is only a recommendation; showing how to use this tool is out of scope for Datto Support.
- Link to 3rd-party tool: https://www.nartac.com/Products/IISCrypto/
3) After configuring via the 3rd-party tool, make sure to reboot the devices for the Cipher Suite changes to take effect. The RMM Agent should then be able to connect to our platform properly after the reboot.
Lastly, if you would like more information on TLS Cipher Suites and SCHANNEL, here are some Microsoft Docs:
- https://learn.microsoft.com/en-us/windows/win32/secauthn/cipher-suites-in-schannel
- https://learn.microsoft.com/en-us/windows/win32/secauthn/tls-handshake-protocol
Please report back with the results regarding our suggested fix above. Also, should you have any questions or additional information, please do not hesitate to reply to this email as well.
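Internal note (not part of the reply above): when a partner pastes back their SSL Labs results, the comparison against the four suites can be done mechanically. This is a sketch that assumes the pasted text contains suite names in the standard `TLS_..._...` form. `missing_suites` is a hypothetical helper, and the fourth name is written as the standard `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384` suite.

```python
import re

# The four suites the reply asks the partner to check for.
REQUIRED_SUITES = {
    "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
}


def missing_suites(reported_text):
    """Return the required suites that do not appear in the pasted report."""
    reported = set(re.findall(r"TLS_[A-Z0-9_]+", reported_text))
    return sorted(REQUIRED_SUITES - reported)
```

An empty result means all four names were found in the pasted text. Anything returned is a candidate to enable before the reboot step.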