Network Discovery - How Does it Work?

Written by:

Team SkyNetSquad, Datto RMM

Document Purpose

This document is intended to provide a technical overview of the Network Discovery feature within Datto RMM, including its capabilities, limitations, troubleshooting scenarios, and frequently asked questions.

Intended Audience

This document is intended to be read and interpreted by an internal (Datto) audience that:

Is familiar with Datto RMM
Has a foundational understanding of computer networking
Is in a technical role supporting internal or partner-facing stakeholders

Frequently asked questions

What does Network Discovery enable me to do?

Network Discovery enables one or more RMM agents within your customer’s site footprint to take on an additional role as an agentless “collector” that will find, onboard, and (to varying degrees) manage:

Devices that are capable of having an agent installed upon them (e.g., Mac, Linux, and Windows devices)
Devices that can be monitored by agentless monitoring protocols, such as the Simple Network Management Protocol (SNMP)
Devices that cannot be monitored with an agent or an agentless monitoring protocol

What’s the difference between a Discovered Device and a Managed Device?

Attribute	Discovered Devices	Managed Devices
License consumed?	No	Yes
What’s visible within RMM?	Initial discovery audit	Full device audit, custom monitors
Monitoring, alerting, policies, reporting?	No	Yes
Visible on topology map in new UI?	Yes	Yes

How do we classify discovered devices?

Datto RMM looks for common fingerprints that devices expose, or return in response to queries sent over a set of discovery protocols. The method used for initial and detailed identification depends on the device type.

These queries aim to answer the following questions:

Who are you?
What are you?
What do you do?

Datto RMM’s device classification is a multi-step process. For example, Network Devices is a top-level category with multiple sub-types (e.g., Network Device - Printer, Network Device - Router).

The algorithm first performs a top-level classification of the device, and then refines it through additional queries and fingerprint matching. Let’s work through an example:

A network switch running SNMP responds to a ping by the Network Node.
Through the System MIB, initial device details (such as system name, system description, and system object ID) are fetched.
Then, a sub-type classification test is performed by checking which device-type-specific OIDs the device responds to.
1. For example: if the device responds to common switch-centric OIDs (BRIDGE-MIB), the device is classified as a switch.
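The two-step flow above can be sketched in Python. This is a minimal illustration under stated assumptions, not Datto RMM’s implementation: the device’s SNMP answers are modelled as a plain dictionary of OID to value (as a MIB walker might return them), and the probe list (Printer-MIB, then BRIDGE-MIB), its order, and the sub-type labels are hypothetical. The System MIB OIDs (sysDescr, sysObjectID, sysName) are standard.

```python
# Standard System MIB (SNMPv2-MIB) scalars used for top-level identification.
SYS_DESCR = "1.3.6.1.2.1.1.1.0"
SYS_OBJECT_ID = "1.3.6.1.2.1.1.2.0"
SYS_NAME = "1.3.6.1.2.1.1.5.0"

# Hypothetical sub-type probes, tested in order: (sub-type label, OID subtree).
SUBTYPE_PROBES = [
    ("Network Device - Printer", "1.3.6.1.2.1.43"),  # Printer-MIB
    ("Network Device - Switch", "1.3.6.1.2.1.17"),   # BRIDGE-MIB
]


def responds_under(answers: dict[str, str], subtree: str) -> bool:
    """True if the device answered any OID at or below `subtree`."""
    return any(oid == subtree or oid.startswith(subtree + ".") for oid in answers)


def classify(answers: dict[str, str]) -> dict[str, object]:
    """Top-level classification from the System MIB, then sub-type refinement."""
    if SYS_OBJECT_ID not in answers:
        # No System MIB answer: nothing to fingerprint in this sketch.
        return {"category": "Unknown", "sub_type": None}
    device = {
        "category": "Network Device",
        "name": answers.get(SYS_NAME),
        "description": answers.get(SYS_DESCR),
        "object_id": answers[SYS_OBJECT_ID],
        "sub_type": None,
    }
    for sub_type, subtree in SUBTYPE_PROBES:
        if responds_under(answers, subtree):
            device["sub_type"] = sub_type
            break
    return device


# Example: a switch answering the System MIB plus a BRIDGE-MIB scalar
# (dot1dBaseBridgeAddress). The sysObjectID value is a made-up placeholder.
switch_answers = {
    SYS_NAME: "core-sw-01",
    SYS_DESCR: "Example managed switch",
    SYS_OBJECT_ID: "1.3.6.1.4.1.99999.1",
    "1.3.6.1.2.1.17.1.1.0": "00:11:22:33:44:55",
}
print(classify(switch_answers)["sub_type"])  # Network Device - Switch
```

Because a device can answer more than one device-type MIB, probe order is a design decision in any approach like this; the first matching probe wins here.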

Fingerprinting methods applied to all devices:

DHCP requests broadcast within the broadcast domain (i.e., the subnet in which the Network Node resides) are captured and inspected by the Network Node. This allows the Network Node to perform a technique called DHCP fingerprinting.
- Fingerprints are on file for common operating systems (e.g., iOS, Android, Windows, Mac, Linux)
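DHCP fingerprinting commonly keys on the DHCP Parameter Request List (option 55), whose contents and ordering tend to differ between operating systems; the same packets can also carry the client’s hostname (option 12). The sketch below parses a captured DHCP packet (layout per RFC 2131/2132) and looks up its option 55 signature. It assumes the packet has already been captured off the wire, and the FINGERPRINTS table is a hypothetical placeholder, not real signature data or the table Datto RMM uses.

```python
DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"  # RFC 2131: marks the start of the options field
BOOTP_FIXED_HEADER_LEN = 236              # op .. file fields of the BOOTP/DHCP header

# Hypothetical signature table: option 55 contents -> OS label.
# Placeholder values only; real fingerprinting relies on a curated database.
FINGERPRINTS = {
    "1,3,6,15,31,33": "Example OS A",
    "1,3,6": "Example OS B",
}


def parse_dhcp_options(packet: bytes) -> dict[int, bytes]:
    """Parse the TLV-encoded options that follow the magic cookie."""
    start = BOOTP_FIXED_HEADER_LEN
    if packet[start:start + 4] != DHCP_MAGIC_COOKIE:
        raise ValueError("not a DHCP packet: magic cookie missing")
    options: dict[int, bytes] = {}
    i = start + 4
    while i < len(packet):
        code = packet[i]
        if code == 0:  # Pad option: single byte, no length field
            i += 1
            continue
        if code == 255 or i + 1 >= len(packet):  # End option, or truncated packet
            break
        length = packet[i + 1]
        options[code] = packet[i + 2:i + 2 + length]
        i += 2 + length
    return options


def fingerprint(packet: bytes) -> dict[str, object]:
    opts = parse_dhcp_options(packet)
    signature = ",".join(str(code) for code in opts.get(55, b""))  # Parameter Request List
    hostname = opts[12].decode("ascii", "replace") if 12 in opts else None  # Host Name
    return {
        "signature": signature,
        "os_guess": FINGERPRINTS.get(signature, "Unknown"),
        "hostname": hostname,
    }


# Example: a minimal DHCPDISCOVER-style packet carrying options 53, 55 and 12.
sample = (
    bytes(BOOTP_FIXED_HEADER_LEN)
    + DHCP_MAGIC_COOKIE
    + bytes([53, 1, 1])                    # DHCP Message Type = DISCOVER
    + bytes([55, 6, 1, 3, 6, 15, 31, 33])  # Parameter Request List
    + bytes([12, 4]) + b"pc01"             # Host Name
    + bytes([255])                         # End
)
print(fingerprint(sample))
# {'signature': '1,3,6,15,31,33', 'os_guess': 'Example OS A', 'hostname': 'pc01'}
```

A client that omits option 12 yields a hostname of None, which is one reason a discovered device can appear without a hostname when hostnames are sourced from DHCP.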

When a device is promoted to a Managed Device, a full “audit” is conducted that provides greater visibility into its hardware, software, and configuration.

Device-type-specific fingerprints:

Device Type	Protocol / Method	Notes
Workstation	WMI is used to determine whether the device is running Windows. If so, the “OS” field is populated accordingly.	Physical form factor (desktop, laptop) is determined during an audit (i.e., once the device is Managed)
ESXi Hypervisor Host	HTTP is used to check whether the device responds to the VMware HTTP API

Datto-branded devices, such as Datto Access Points or Datto Continuity appliances, are discovered in the same way as other third-party devices. The only difference is that their onboarding is conducted through the respective integration (Datto Networking or Datto Continuity).

If a device cannot be identified, it is marked as Unknown.

How often is a scan conducted?

A full initial scan begins immediately after a user completes and exits the Network Discovery wizard in the UI, using the networks and credentials the user provided.

After that, additional devices can be discovered every 60 seconds.

If a Network Node fails to check in to the platform, it completes a new initial scan once it reconnects and re-initializes.

Can I trigger a scan on-demand?

Not at this time. We are exploring the possibility of adding this capability in the future.

Can I force a full deletion and re-discovery of devices?

Not at this time. We are exploring the possibility of adding this capability in the future.

What credential(s) need to be entered into RMM for a successful scan / initial discovery of devices?

Enter credentials for all devices of interest. Higher-quality data is typically privileged, and is only accessible after authenticating with valid credentials.

Can I set up multiple network nodes in a load sharing or hot-standby arrangement?

Hot-standby configurations are not available. Multiple Network Nodes can be deployed if the customer environment contains subnets that are not routable to one another at layer 3 (e.g., Subnet A is firewalled off from accessing hosts on Subnet B, and vice versa). When multiple Network Nodes are deployed, the nodes automatically determine which of the user-configured subnets each of them can reach, and divide up the scanning work accordingly.
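The division of work can be pictured as a small assignment problem. The sketch below is hypothetical, since the actual algorithm is not described here: it assumes each Network Node has already determined which user-configured subnets it can reach, places the most constrained subnets first, and gives each subnet to the least-loaded node that can reach it. Subnets that no node can reach are reported separately. The node names and subnets are illustrative.

```python
def assign_subnets(
    reachable: dict[str, set[str]], subnets: list[str]
) -> tuple[dict[str, list[str]], list[str]]:
    """Split user-configured subnets across Network Nodes that can reach them.

    reachable: Network Node name -> set of subnets (CIDR strings) it can reach.
    Returns (subnets assigned per node, subnets that no node can reach).
    """
    assignment: dict[str, list[str]] = {node: [] for node in sorted(reachable)}
    unreachable: list[str] = []

    def candidates(subnet: str) -> list[str]:
        return [node for node in sorted(reachable) if subnet in reachable[node]]

    # Most-constrained first: subnets reachable by fewer nodes are placed earlier.
    for subnet in sorted(subnets, key=lambda s: len(candidates(s))):
        nodes = candidates(subnet)
        if not nodes:
            unreachable.append(subnet)
            continue
        # Least-loaded reachable node wins; ties go to the first node by name.
        target = min(nodes, key=lambda node: len(assignment[node]))
        assignment[target].append(subnet)
    return assignment, unreachable


plan, orphans = assign_subnets(
    {"nn-a": {"10.0.1.0/24", "10.0.2.0/24"}, "nn-b": {"10.0.2.0/24", "10.0.3.0/24"}},
    ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24", "10.0.4.0/24"],
)
print(plan)     # {'nn-a': ['10.0.1.0/24', '10.0.2.0/24'], 'nn-b': ['10.0.3.0/24']}
print(orphans)  # ['10.0.4.0/24']
```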

How do we build the hierarchical layout of which devices are connected to each other?

The topology engine ingests data from the following data sources to assert which devices are connected to others:

Routing tables
Interface table (ifTable)
Bridge forwarding table (BRIDGE-MIB + Q-BRIDGE-MIB)
ARP table (ipNetToMediaTable)

The algorithm works by detecting which direction (“north” or “south”) the Internet lies in from the perspective of the device currently being discovered. It then iterates through each discovered device to build a graph of devices by both hop and NIC/port adjacency.
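As an illustration of how these tables can combine, the sketch below infers links around a single switch from its bridge forwarding table (MAC to port) and an ARP table (IP to MAC). The heuristics are common ones but are assumptions here, not necessarily those of the topology engine: the port on which the default gateway’s MAC is learned is treated as the northbound (Internet-facing) port, and any other port that has learned exactly one MAC is treated as directly connected to that device. The MACs come from the documentation range 00:00:5e:00:53:xx.

```python
from collections import defaultdict
from typing import Optional


def infer_links(
    switch: str, fdb: dict[str, str], arp: dict[str, str], gateway_ip: str
) -> tuple[Optional[str], list[tuple[str, str, str]]]:
    """Infer the uplink port and directly attached devices for one switch.

    fdb: MAC -> port, as learned in the switch's bridge forwarding table.
    arp: IP -> MAC, e.g. from an ipNetToMediaTable walk.
    """
    mac_to_ip = {mac.lower(): ip for ip, mac in arp.items()}
    macs_on_port: dict[str, set[str]] = defaultdict(set)
    for mac, port in fdb.items():
        macs_on_port[port].add(mac.lower())

    # Northbound = the port through which the default gateway (towards the Internet) is reached.
    gateway_mac = arp.get(gateway_ip, "").lower()
    northbound = next((p for p, macs in macs_on_port.items() if gateway_mac in macs), None)

    links = []
    for port, macs in sorted(macs_on_port.items()):
        # A southbound port that has learned exactly one MAC is assumed to face a single device.
        if port != northbound and len(macs) == 1:
            (mac,) = macs
            links.append((switch, port, mac_to_ip.get(mac, mac)))
    return northbound, links


fdb = {
    "00:00:5e:00:53:01": "Gi0/24",  # default gateway
    "00:00:5e:00:53:02": "Gi0/24",  # another host reached via the uplink
    "00:00:5e:00:53:10": "Gi0/1",
    "00:00:5e:00:53:11": "Gi0/2",
}
arp = {
    "192.0.2.1": "00:00:5e:00:53:01",
    "192.0.2.10": "00:00:5e:00:53:10",
    "192.0.2.11": "00:00:5e:00:53:11",
}
print(infer_links("Switch-1", fdb, arp, "192.0.2.1"))
# ('Gi0/24', [('Switch-1', 'Gi0/1', '192.0.2.10'), ('Switch-1', 'Gi0/2', '192.0.2.11')])
```

Heuristics like these break down in exactly the cases listed under the known limitations below (for example, unmanaged switches hiding several devices behind one port, or proxy ARP).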

What are some of the known limitations of the current topology engine?

Increasing the detail and efficacy of the topology engine is a continuous work-in-progress. Current gaps that we’re aware of and are working towards closing include:

Wi-Fi associations between endpoints and Access Points
Sourcing more layer 1 adjacency data from protocols such as CDP and LLDP
Increasing the amount of IoT device identification that can be done through multicast discovery protocols
Multi-homed devices, i.e., those that connect to more than one subnet in the subject network
VLANs
Unmanaged L2/L3 network devices, and/or those that we don't have valid credentials for
MAC address hiding technologies, such as proxy ARP
Multiple Internet connections for the same network
Redundant or load-balanced routers (e.g., HSRP)

Is there a rule of thumb around the number of endpoints, IP addresses, or subnets that I should associate to a given network node?

The simplest deployment is a single Network Node scanning the customer’s network footprint. A Network Node is expected to scale to hundreds of devices and monitors. Adding further Network Nodes is only recommended on a case-by-case basis at this time.

Troubleshooting steps:

I’m not seeing hostnames for my discovered devices. Why is this?

Hostnames currently depend on the DHCP scanner, so a hostname will be missing if:
- The device does not obtain its IP address automatically via DHCP
- The device does not communicate its hostname when obtaining or renewing its DHCP lease
Other reasons:
- NetBIOS/Windows network scanning has not been implemented
- Bonjour/Zeroconf scanning has not been implemented
- No other scanning method that can supply hostnames has been implemented

I’m seeing a discrepancy between the hostnames in the old UI versus the new UI. Why is this?

The old UI’s discovery is populated by the CentraStage agent (CAG), which supports NetBIOS-based hostname resolution. The new UI’s discovery does not currently use NetBIOS.

I’m not seeing devices that I’d expect to be on these subnets. Is there something I can do to remedy this?

Several checks are involved. First, confirm that the device really exists and is visible from the Network Node (NN):
1. Ping it from the NN.
2. Shortly after the ping:
  1. If the device is on the same subnet as the NN, check the output of the “arp -a -n” command.
  2. Check the contents of the ipNetToMediaTable of a NID on the same subnet as the missing device. The NID itself must already have been discovered by the NN as a NID.
3. If step 2 returns a record (i.e., both an IP address AND a MAC address), raise a ticket with the engineering team.
4. Note: On a large network, it can take up to 1 hour for all subnets to be discovered and scanned at least once.

Verify that the devices in question respond to ICMP echo requests (pings) originating from the Network Node. If not:
- Verify the Windows and/or third-party firewall policies on the endpoint
- If the endpoint is on a different subnet from the one hosting the Network Node, check that ICMP echo traffic is permitted in both directions through the layer 3 appliance that routes between the two subnets (e.g., firewall, router, layer 3 switch).
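The ping and ARP checks above can be partly scripted on the Network Node host. The Python sketch below is an illustrative helper under assumptions, not a Datto tool: it shells out to the operating system’s ping command, then looks for the target IP and a MAC address in `arp -a` style output (Linux, macOS and Windows print different layouts, so the parsing is deliberately loose). A second helper looks the IP up in an ipNetToMediaTable walk, modelled as a dictionary of OID to raw MAC bytes; how that walk is obtained (e.g., a third-party MIB walker) is left open.

```python
import platform
import re
import subprocess
from typing import Optional

# Matches MACs written as aa:bb:cc:dd:ee:ff, aa-bb-cc-dd-ee-ff, or macOS-style 0:1b:63:a:b:c.
MAC_RE = re.compile(r"\b[0-9a-fA-F]{1,2}(?:[:-][0-9a-fA-F]{1,2}){5}\b")

# ipNetToMediaPhysAddress column of ipNetToMediaTable (RFC 1213).
IP_NET_TO_MEDIA_PHYS = "1.3.6.1.2.1.4.22.1.2."


def ping_once(ip: str) -> bool:
    """Send one ICMP echo request via the OS ping command; the exit code is only a rough signal."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(["ping", count_flag, "1", ip], capture_output=True, timeout=10)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


def normalise_mac(mac: str) -> str:
    return ":".join(part.zfill(2).lower() for part in re.split(r"[:-]", mac))


def mac_from_arp_output(arp_output: str, ip: str) -> Optional[str]:
    """Return the MAC listed on a line mentioning `ip` in `arp -a` style output, if any."""
    ip_re = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?![\d.])")
    for line in arp_output.splitlines():
        if ip_re.search(line):
            mac = MAC_RE.search(line)
            if mac:  # lines such as "<incomplete>" entries carry no MAC
                return normalise_mac(mac.group(0))
    return None


def mac_from_ipnettomedia(walk: dict[str, bytes], ip: str) -> Optional[str]:
    """Look up `ip` in a walked ipNetToMediaTable (OID -> raw MAC bytes)."""
    for oid, phys in walk.items():
        # The index is ifIndex.a.b.c.d, so the last four arcs are the IP address.
        if oid.startswith(IP_NET_TO_MEDIA_PHYS) and oid.split(".")[-4:] == ip.split("."):
            return ":".join(f"{b:02x}" for b in phys)
    return None
```

On Linux or macOS, for example, the text output of `arp -a -n` captured shortly after `ping_once()` can be passed to `mac_from_arp_output()`. A MAC found by either helper corresponds to the “IP address AND MAC address” record described in the steps above.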

What log(s) can Support reference when there is an issue with any of the above functionality?

Network Node logs
- The network node writes its logs into a separate log file:
  - Location
    - c:\ProgramData\CentraStage\AEMAgent\DataLog\networknode.log
    - /usr/local/share/CentraStage/AEMAgent/DataLog/networknode.log
  - Format
    - Elements are separated by the pipe (|) character; otherwise the file is CSV-compatible, so it can be imported into Excel by specifying | as the delimiter.
    - Sample
      - 10.1.0.320|2022-01-26T14:32:52.956+00|1012|INFO|MEM|199|DEVC|56|NETC|6|NIDC|7|DDSQ|7|DEVQ|0|TOPQ|0|LOGQ|0|DEV|097ebfeb-ad1b-45cb-9b57-86518485bf5a/4C5E0CEFE28D4C5E0CEFE28E4C5E0CEFE28F4C5E0CEFE2904C5E0CEFE291|IP|10.183.5.40|TopologyDiscovery|Building topology packet
    - Elements (each label, such as MEM, is written as its own field and is immediately followed by its value)
      - AEMAgent version
      - DateTime
      - Agent process ID
      - Message type: INFO, WARN, ERR, DEBUG
      - MEM: memory used by the current process, in megabytes
      - DEVC: device count
      - NETC: network count
      - NIDC: NID count
      - DDSQ: DDS queue size (essentially the number of devices pushed by scanners that are waiting to be processed)
      - DEVQ: number of discovered devices waiting to be uploaded to the platform
      - TOPQ: number of topology packets waiting to be uploaded to the platform
      - LOGQ: number of log messages waiting to be written
      - DEV: device ID, if applicable
      - IP: device IP address, if applicable
      - Name of the discovery component that wrote the log
      - Message
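For ad hoc analysis beyond Excel, the networknode.log format can be parsed with a few lines of Python. This is a sketch based only on the sample line and field list in this document: it assumes each label is immediately followed by its value, that DEV and IP may be absent when not applicable, and that only the trailing message can itself contain a pipe character.

```python
from typing import Union

# Labels documented for networknode.log; counters carry numeric values.
COUNTER_LABELS = {"MEM", "DEVC", "NETC", "NIDC", "DDSQ", "DEVQ", "TOPQ", "LOGQ"}
OPTIONAL_LABELS = {"DEV", "IP"}  # only present "if applicable"
KNOWN_LABELS = COUNTER_LABELS | OPTIONAL_LABELS


def parse_networknode_line(line: str) -> dict[str, Union[str, int]]:
    parts = line.rstrip("\r\n").split("|")
    record: dict[str, Union[str, int]] = {
        "version": parts[0],
        "datetime": parts[1],
        "pid": int(parts[2]),
        "level": parts[3],
    }
    i = 4
    # Consume LABEL|value pairs until the component name is reached.
    while i + 1 < len(parts) and parts[i] in KNOWN_LABELS:
        label, value = parts[i], parts[i + 1]
        record[label] = int(value) if label in COUNTER_LABELS and value.isdigit() else value
        i += 2
    record["component"] = parts[i] if i < len(parts) else ""
    # Re-join the remainder in case the free-text message contains "|".
    record["message"] = "|".join(parts[i + 1:])
    return record


sample = (
    "10.1.0.320|2022-01-26T14:32:52.956+00|1012|INFO|MEM|199|DEVC|56|NETC|6|NIDC|7|"
    "DDSQ|7|DEVQ|0|TOPQ|0|LOGQ|0|DEV|097ebfeb-ad1b-45cb-9b57-86518485bf5a/"
    "4C5E0CEFE28D4C5E0CEFE28E4C5E0CEFE28F4C5E0CEFE2904C5E0CEFE291|IP|10.183.5.40|"
    "TopologyDiscovery|Building topology packet"
)
rec = parse_networknode_line(sample)
print(rec["MEM"], rec["component"], rec["message"])  # 199 TopologyDiscovery Building topology packet
```

Parsing every line of the file this way makes it straightforward to chart MEM, DDSQ or DEVQ over time, for example to spot a growing upload backlog.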
Kibana
- The Kibana index is called “datto-discovery-*”
- Relevant fields to filter on
  - Platform
    - Pre-production
      - Sandbox
      - Devb
      - Staging
      - Bacchus
    - Production
      - Merlot
      - Pinotage
      - Zinfandel
      - Syrah
      - Concord
  - Component
    - Agent/NN facing
      - Devicestream
      - Topologystream
      - Devicediscoverymicroservice
      - Topologymicroservice
      - GetnicVendor
      - Updatenicvendor
      - GetManufacturer
      - Updatemanufacturer
      - Dhcpfingerprintingmservice
    - UI/FEAPI facing
      - Getdiscovereddevice
      - Getsitetopology
      - Hastopology
    - Other
      - Storagecleanup
      - Discoverymetrics
    - Onboarding related
      - Updateonboardingattempted
      - Resetonboardingattempted

I’m not seeing any discovery or audit data available for some or all of my SNMP devices. Why is this?

Typically, this is caused by the Network Node not receiving an SNMP response back from the device after a request has been made. There are a couple of typical root causes for this:
- SNMP service/agent is not enabled on the device.
  - Resolution: Enable SNMP service/agent on the device.
- The SNMP credential being used is incorrect:
  - It was not entered into RMM correctly (e.g., a typo)
  - For SNMPv3, one or more authentication parameters may have been entered incorrectly
- Some devices require explicit allow-list entries for the IP addresses of the hosts that will be querying them. If the partner uses such a device, they’ll need to add entries for every Network Node that could be responsible for monitoring and managing it.
- (Very rarely) When an SNMP community is configured, or an existing configuration is inherited, the community definition may include an access-control list (ACL) that grants access to only a limited portion of the SNMP tree (e.g., a specific SNMP table).
  - Remediation: remove the ACL and grant read-only access to the entire SNMP tree (.1 downwards)

The partner reports discrepancies in the topology map.

For example:

Certain links between devices are missing
Links between devices are inaccurate

Follow the case escalation steps in the next section.

Topology Case Escalation:

If a network discovery issue is not resolved through initial and escalated support channels, cases can be raised with the Product Engineering team.

Topology issues have traditionally been among the more challenging issues to debug, because each site’s environment (and therefore its dataset) is unique.

What to include:

Agent log files (which include Network Node logs)
A description of the expected topological layout. For example:
1. Switch-1 Port 1 <-> Port 5 Router-1
A description + screenshot of the actual topological layout. For example:
1. Switch-1 Port 1 <-> Port 1 Router-1
Manufacturers and models of NIDs, especially those near which the topology appears to be wrong
Verification that:
1. SNMP is enabled on all network devices of interest
2. SNMP credentials have been entered into RMM and have been validated successfully, using the SNMP Test Tool or a third-party SNMP MIB walker
3. No known ACLs or firewall policies are impeding SNMP access between the Network Node and the device(s) in question.
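Writing the expected and actual links in the same notation (`Switch-1 Port 1 <-> Port 5 Router-1`) makes the discrepancies easy to list mechanically before raising the case. The sketch below is a hypothetical helper, not part of any Datto tooling: it assumes device names contain no spaces, treats each link as undirected, and reports which expected links are missing and which reported links were not expected.

```python
import re

LINK_RE = re.compile(
    r"^\s*(?P<a_dev>\S+)\s+Port\s+(?P<a_port>\S+)\s*<->\s*Port\s+(?P<b_port>\S+)\s+(?P<b_dev>\S+)\s*$"
)


def parse_link(text: str) -> frozenset[tuple[str, str]]:
    """'Switch-1 Port 1 <-> Port 5 Router-1' -> {('Switch-1', '1'), ('Router-1', '5')}."""
    match = LINK_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised link description: {text!r}")
    # A frozenset makes the link undirected: A <-> B equals B <-> A.
    return frozenset({(match["a_dev"], match["a_port"]), (match["b_dev"], match["b_port"])})


def diff_links(expected: list[str], actual: list[str]) -> dict[str, set[frozenset[tuple[str, str]]]]:
    exp = {parse_link(link) for link in expected}
    act = {parse_link(link) for link in actual}
    return {"missing": exp - act, "unexpected": act - exp}


result = diff_links(
    ["Switch-1 Port 1 <-> Port 5 Router-1"],
    ["Switch-1 Port 1 <-> Port 1 Router-1"],
)
for kind, links in result.items():
    for link in links:
        print(kind, sorted(link))
```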

Note: As wireless associations are currently not supported, engineering tickets for such issues will not be treated as defects until that capability has been introduced.

General Network Discovery Info