BMC Discovery can scan a wide variety of SNMP devices. When a scan succeeds, these devices are usually modeled as Network Devices, SNMP Managed Devices, or Printers. However, it’s not uncommon to run into problems when scanning them. This post offers some tools for troubleshooting those problems, along with root causes and solutions for specific use cases.
The following information applies to both Helix Discovery and on-premise BMC Discovery, although some references (for example, doing a command line snmpwalk) are relevant only to on-premise BMC Discovery.
Section 1: How to troubleshoot problems with a scan, credential test, or device capture of an SNMP-enabled device?
Discovery sometimes experiences access problems when processing an SNMP-enabled device. The most common situation is when a device capture, credential test, or scan fails with "ERROR: SNMP++: SNMP request timeout" or "Device skipped - no SNMP access".
Here’s an example from a device capture:
Note that there are two ways to get a device capture (see https://docs.bmc.com/docs/discovery/113/capturing-snmp-devices-788111406.html):
- from the Discovery Access page
- from the Device Info page (which is reached from the Discovery Access page)
In some cases, doing a capture from a Device Info node may result in a blank screen. When clicking the browser "back" button, it returns to the Device Info page and briefly shows a green banner that says "Device skipped (no SNMP access)", which then fades out after a few seconds:
Here's an example of a timeout message from a credential test:
For a scan, the Discovery Access page will typically have a result of “Skipped (Device is an unsupported device)”. The session result page will show “SNMP++: SNMP request timed out”.
There are many possible reasons for this. Here are some possible causes and things to check:
- Make sure an SNMP credential is present and that its IP range includes the address being discovered (a valid credential is needed for device capture).
- Run a credential test. What is the result?
- Increase the timeout in the SNMP credential to 100 seconds and retry.
- What is the SNMP version (v1, v2c, v3) on the credential? Is the device configured to respond on that SNMP version?
- If the device supports SNMP versions other than the one specified in the credential, change the version in the credential and retry.
- For SNMP v1 and v2c, make sure the community string in the SNMP credential is correct. An invalid community string can cause an "Unable to get the deviceInfo: TRANSIENT" error.
- Ask the device administrator:
- if there might be an Access Control List (ACL) or some other configuration on the device that prevents responses to the Discovery appliance.
- if the device is configured to use the default SNMP port (161). Run nmap to confirm the port is open (see below).
- On the Discovery Access page, click on any links for "script failure" or "xx Session Results related" to look for clues.
- In the case of a timeout during a device capture, check the log in /usr/tideway/var/captures for additional clues
- In the case of a timeout during a scan, turn DEBUG logging on for Discovery, run the scan again, and check the tw_svc_discovery.log for clues. Remember to turn DEBUG logging off!
- In one case, the customer made corrections to the IP range and mask on the device, then was able to discover the device.
- From the Discovery command line, as user tideway, run the following commands and check the results:
1/ Check connectivity from the appliance to the endpoint:
2/ Check the port status of the device by running nmap. For example:
/usr/bin/nmap --privileged -sT -sU -p T:22,U:161 [device_ip_address]
The expected result is that port 161 would have a state of "open" or "open|filtered" :
3/ Run an snmpwalk against the device. For example:
/usr/tideway/snmp++/bin/snmpWalk [device_ip_address] -v2c -cpublic > /usr/tideway/snmpwalk.out
Change the SNMP version and community string as needed. If using SNMP v3, other parameters need to be specified. To see the usage notes with a list of available options, run snmpwalk with the "--help" option.
If snmpwalk also fails, please consult the device administrator.
If the problem persists, please contact Customer Support and provide the results to all the questions / checks above.
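As a rough aid for step 2, the nmap result can be checked with a couple of small shell functions. This is only a sketch: the function names are made up for this example, and it assumes the usual nmap port table layout (port, state, service on each line).

```shell
# Read nmap output on stdin and print the state reported for 161/udp.
# Prints "unknown" when no 161/udp line is present.
snmp_port_state() {
    awk '$1 == "161/udp" { print $2; found = 1 }
         END { if (!found) print "unknown" }'
}

# Succeed when the state is one of the two expected results
# ("open" or "open|filtered"); fail for anything else.
snmp_port_ok() {
    case "$1" in
        open|"open|filtered") return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use on the appliance (not run here):
#   state=$(/usr/bin/nmap --privileged -sT -sU -p T:22,U:161 [device_ip_address] | snmp_port_state)
#   snmp_port_ok "$state" || echo "port 161/udp is $state, check with the device administrator"
```

Keep in mind that "open|filtered" only means nmap received no answer on UDP 161, so a firewall may still be dropping the traffic. The snmpwalk in step 3 remains the more reliable check.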
Section 2: Specific Use Cases
Use Case #1
Symptom: A scan of a supported SNMP device fails with Skipped / Unsupported device. The Discovery Access page shows a NoAccessMethod result in getMacAddresses (or other methods such as GetPortInfo).
The related Script Failure page may show the error: SNMP++: SNMP request timed out
A device capture may also fail, and the last thing written in the UI is:
Dumping range: Start of the MIB to End of the MIB
ERROR: SNMP++: SNMP request timed out
A credential test may succeed.
This problem can occur on many different devices, and has been observed on routers, load balancers, and some Lexmark printers.
By default, Discovery asks for large chunks of data at one time from the device, using the "Use GETBULK" option. Some devices may be unable to transfer so much data at one time without hitting a timeout. In other cases, the cause may be a problem with the SNMP agent on the device.
The best solution is to correct the problem on the device or in its SNMP agent. As a workaround, disable GETBULK by editing the appropriate SNMP credential and unchecking the "Use GETBULK" option.
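To get a feel for the trade-off, the sketch below estimates how many request/response exchanges are needed to walk a table. It is a simplified model, not how Discovery is implemented: the function name is hypothetical, and the figure of 25 objects per GETBULK response is an arbitrary illustration rather than a Discovery setting.

```shell
# Number of request/response exchanges needed to retrieve $1 objects
# when each response carries at most $2 objects (ceiling division).
# GETNEXT returns one object per request, so pass 1 for it.
round_trips() {
    objects=$1
    per_response=$2
    echo $(( (objects + per_response - 1) / per_response ))
}

# Example comparison for a 5000-entry table:
#   round_trips 5000 1    # GETNEXT: one object per exchange
#   round_trips 5000 25   # GETBULK with an assumed 25 objects per response
```

With GETBULK disabled, each exchange is small and easier for a struggling agent to answer, but there are many more of them, so the scan of that device takes longer.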
Use Case #2
Symptom: A scan of an unsupported network device returns NoAccess instead of Skipped/Unsupported.
A test of the SNMP credential is successful.
The discovery debug log shows that:
- Discovery detects that the sysobjectid is unsupported
discovery.devices: DEBUG: no SysObjectId 1.3.6.1.4.1.388.14 found in MODELS
- Discovery reports that it can get the sysdescr, but it is UNKNOWN
api.audit: DEBUG: 184.108.40.206: snmp.getSysDesc(): Got system description status = SUCCESS
api.classifier: DEBUG: classify(): processing 'WS5100 Wireless Switch, Revision WS.02.3.3.4.0-009R MIB=01a'
discovery.heuristics.snmp: DEBUG: identifyDevice: 220.127.116.11 sysDescr is UNKNOWN
Root cause: The scan takes too much time trying other credentials (such as SSH) and hits the reasoning timeout of 30 minutes before trying the SNMP credentials.
This can occur when the system description does not contain a known keyword like "cisco" that indicates the endpoint is a network device.
To confirm the root cause, set the Discovery logging to DEBUG and run the scan again. In the discovery log, look for traces like this related to the device:
no SysObjectId <the sysobjectid of the device> found in MODELS
sysDescr is UNKNOWN
Solution: Open a support case to request that the device be integrated in Discovery. The SNMP credentials are used when the device is supported.
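The confirmation step for this use case can be scripted with grep. The sketch below is an illustration only: the function name is invented, and the log file is passed in as a parameter because its location depends on the installation.

```shell
# Print the DEBUG lines that point to an unsupported, unclassified device.
# $1 is the path to a Discovery log captured with DEBUG logging enabled.
unsupported_device_traces() {
    grep -E 'no SysObjectId .* found in MODELS|sysDescr is UNKNOWN' "$1"
}

# Possible use (path is a placeholder):
#   unsupported_device_traces /path/to/tw_svc_discovery.log
```

If both kinds of line appear for the device, include them in the support case together with a device capture.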
Use Case #3:
Symptom: An SNMP scan fails with "Unable to get the deviceinfo: TIMEOUT" after 30 minutes. The Discovery log has "credential failed: SNMP++: SNMP request timed out".
The correct SNMP credential is at the bottom of the credential list.
The Discovery log shows that the scan had 14 SNMP credentials to try, and the first 12 failed. The 13th credential was still being tried when the scan ended.
The 14th SNMP credential (not listed in the discovery log) was actually the correct one for the device and it was at the bottom of the credential list. When this credential was moved to the top of the list, the scan was successful.
Root Cause: The 13th credential (with uuid of b6c7a4337c564c71870a0a4a50983b4d) did not time out. To identify the 13th credential, the following was run:
-> replacing <scanner> with the actual Discovery scanner hostname or IP address, and using the uuid from the Discovery log.
Looking at the credential, it was found that the timeout was set to 9000 seconds, which exceeds the default reasoning timeout of 30 minutes. This is why the scan timed out.
Solution: The timeout on this credential was lowered to the default value.
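The arithmetic behind this use case can be sketched in shell. The example assumes the default 30-minute reasoning timeout mentioned above, uses made-up function names, and ignores retries and the time spent on non-SNMP credentials, so treat the numbers as rough estimates.

```shell
REASONING_TIMEOUT_SECONDS=1800   # 30 minutes, the default reasoning timeout

# Succeed when a single credential's timeout (in seconds) exceeds the
# reasoning timeout, meaning the scan ends before the credential gives up.
credential_outlasts_scan() {
    [ "$1" -gt "$REASONING_TIMEOUT_SECONDS" ]
}

# Worst-case seconds spent before reaching the credential at position $1
# in the list, if every earlier credential times out after $2 seconds.
wait_before_position() {
    echo $(( ($1 - 1) * $2 ))
}

# Examples:
#   credential_outlasts_scan 9000   # succeeds: 9000 s is longer than 1800 s
#   wait_before_position 14 100     # seconds lost before the 14th credential
```

This also shows why moving the correct credential to the top of the list helps even when no single timeout is excessive.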
Use Case #4:
Symptom: A scan of a supported SNMP device fails with Skipped / Unsupported device. The Discovery Access page shows that getDeviceInfo, getMACAddresses, getIPAddresses, getNetworkInterfaces, and getNames all have status "OK". The getDeviceInfo method has a script failure with the message "Ambiguity in determining device kind - falling back to unsupported device".
Root Cause: In the Discovery UI, on the Administration-> Discovery Configuration page, one or more of the following options have been modified to have a value of "No" :
- Use SNMP SysDescr to identify OS
- Always try "public" community when using SNMP to identify OS
- Use Open ports to identify OS
Solution: Change the above options to a value of "Yes", and the network devices will be discovered successfully.
Use Case #5:
Symptom: When scanning a Cisco Nexus device, getNetworkInterfaces fails on TIMEOUT_CallTimedOutOnClient after 30 minutes.
Root Cause #1: Cisco defect CSCtw72949. See https://quickview.cloudapps.cisco.com/quickview/bug/CSCtw72949. To work around this, Discovery always uses the getNext method instead of getBulk to scan these particular devices. This method is slower and can lead to the reported timeout.
Solution #1: Upgrade the Cisco OS. Cisco defect CSCtw72949 is fixed in Cisco NX-OS Release 5.2(1)N1(4) and above.
Root Cause #2: Same symptoms, but the device has already been upgraded to a firmware version that includes the bug fix (for example "7.1(4)N1(1c)"). In this case, the root cause is a very large number of VLANs on the device. To get edge connectivity information, Discovery requests data for every one of these VLANs and cannot finish before the reasoning timeout.
Solution #2: Two previous RFEs (DRDC1-10658 and DRDC1-11888) were submitted and changes were included in the September 2018 TKU. However, this may not correct the problem in all cases. An additional RFE (DRDC1-11973) has been submitted to find another way to gather this information in less time. However, as of February 2020, there is no ETA for this request.
The only known workarounds for root cause #2 are:
- Disable edge connectivity. To do this, on the Discovery Configuration page, change "Discover neighbor information when scanning network devices" to NO. Please note that on subsequent scans, all existing host-switch connections will be deleted. Reference: https://docs.bmc.com/docs/display/DISCO113/Edge+connectivity
- The scan fails because it exceeds the reasoning request timeout. It is possible to increase this timeout, but use caution: doing so forces Discovery to wait longer for a scan to end, even when the scan cannot finish. This could affect the performance of some scans, and there is no way to quantify the impact in advance.
The reasoning timeout can be increased with the command below:
tw_options -u system REASONING_REQUEST_TIMEOUT=3600000
When prompted, provide the password for the UI 'system' account.
In this example, the timeout (30 mins by default) is increased to 1 hour. It is not recommended to increase the reasoning timeout to more than 2 hours. A restart of the Discovery services is required for this option change to take effect.
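The value in the example is consistent with a timeout expressed in milliseconds (3600000 ms = 1 hour). A small helper like the one below can avoid unit mistakes when choosing another value. It is a sketch with a hypothetical function name, and it only prints the option assignment; it does not run tw_options.

```shell
# Print the option assignment for a reasoning timeout given in minutes.
# Refuses values above 120 minutes, following the 2-hour recommendation.
reasoning_timeout_option() {
    minutes=$1
    if [ "$minutes" -gt 120 ]; then
        echo "refusing: more than 2 hours is not recommended" >&2
        return 1
    fi
    echo "REASONING_REQUEST_TIMEOUT=$(( minutes * 60 * 1000 ))"
}

# Possible use on the appliance (not run here):
#   tw_options -u system "$(reasoning_timeout_option 60)"
```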
Section 3: SNMP v3 specific use cases
Use Case #6:
Symptom: An SNMP v3 scan of a supported network device fails with “Skipped (Device is an unsupported device)”, or possibly a timeout in getMACAddresses.
A credential test fails with "SNMP request timed out". Increasing the timeout to 100s does not help.
An snmpwalk from the Discovery command line is successful.
Other SNMP devices are discovered by the appliance using the same credentials.
After a restart of the Discovery services (or all the services on all cluster members), the credential test succeeds, and the device is discovered successfully.
Root cause 1: Defect DRUD1-25505 - Two or more network devices present the same EngineID, which is supposed to be unique. Discovery scans the first one successfully, but then the second one fails until the cache is flushed (by the service restart) - at which point a rescan of the first would fail, and so on.
To confirm the root cause:
- If the scan works with SNMP v2 and fails with SNMP v3, this root cause is likely.
- If the SNMP v3 scan (or the credential test) fails and then works after an appliance restart, the root cause is confirmed.
- It is also possible to confirm the problem with the query below:
search NetworkDevice show name, type, vendor, model, #InferredElement:Inference:Associate:DiscoveryAccess.endpoint as 'Scanned via', #InferredElement:Inference:Associate:DiscoveryAccess.end_state as 'End State', #InferredElement:Inference:Associate:DiscoveryAccess.#DiscoveryAccess:DiscoveryAccessResult:DiscoveryResult:DeviceInfo.snmpv3_engine_id as 'SNMP v3 Engine Identifier'
If two Network Devices have the same snmpv3_engine_id, the problem is confirmed. Otherwise, restart the appliance, rescan the devices, then re-execute the query above. This is needed because NetworkDevice nodes are only created after a successful scan (which is not possible until the appliance is restarted in this case).
This query will only show the snmpv3_engine_id that Discovery was able to find. If workaround #1 below (service restart) was not used, the issue may occur even if the query above does not return anything wrong.
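When the query returns many rows, spotting duplicates by eye is error-prone. The sketch below assumes the results have been reduced to a simple two-column "device_name,engine_id" text file; that layout, the function name, and the sample values are assumptions made for this example, not an export format provided by Discovery.

```shell
# Read "device_name,engine_id" lines on stdin and print every engine ID
# that appears on more than one device, followed by the device names.
# Rows with an empty engine ID are ignored.
duplicate_engine_ids() {
    awk -F',' '
        $2 != "" { count[$2]++; names[$2] = names[$2] " " $1 }
        END { for (id in count) if (count[id] > 1) print id ":" names[id] }
    ' | sort
}

# Example with made-up values:
#   printf 'sw-a,80001f8880aabbcc\nfw-1,80001f8880aabbcc\n' | duplicate_engine_ids
```

An empty result does not rule out the problem, for the reason given above: devices that have never been scanned successfully have no engine ID recorded.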
If the following command is executed from the appliance:
sudo tcpdump -i any -s0 host <ipAddress> -w /tmp/snmp_issue.cap
The dump may show the elements below. This is not enough to confirm the cause but it is compatible with it.
Discovery sends get-request
Device sends report 1.3.6.1.6.3.15.1.1.4.0 (usmStatsUnknownEngineIDs)
Discovery sends get-request with EngineID
Device sends report 1.3.6.1.6.3.15.1.1.5.0 (usmStatsWrongDigests)
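The counters named in these reports belong to the usmStats group defined in RFC 3414. When reading a capture that shows only numeric OIDs, a lookup like the following can help. The function name is invented for this example; the OID-to-name mapping is taken from the RFC.

```shell
# Map a usmStats counter OID (RFC 3414) to its name.
# Accepts the OID with or without a leading dot.
usm_counter_name() {
    oid=${1#.}
    case "$oid" in
        1.3.6.1.6.3.15.1.1.1.0) echo usmStatsUnsupportedSecLevels ;;
        1.3.6.1.6.3.15.1.1.2.0) echo usmStatsNotInTimeWindows ;;
        1.3.6.1.6.3.15.1.1.3.0) echo usmStatsUnknownUserNames ;;
        1.3.6.1.6.3.15.1.1.4.0) echo usmStatsUnknownEngineIDs ;;
        1.3.6.1.6.3.15.1.1.5.0) echo usmStatsWrongDigests ;;
        1.3.6.1.6.3.15.1.1.6.0) echo usmStatsDecryptionErrors ;;
        *) echo unknown; return 1 ;;
    esac
}
```

A usmStatsUnknownEngineIDs report on its own is a normal part of SNMP v3 engine ID discovery. It is the usmStatsWrongDigests report that follows which is of interest here.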
Workarounds for root cause 1:
1A- Restart the Discovery services of a standalone appliance (or the Discovery services of all members of a cluster) before scanning any of the devices that use a duplicate engine ID. Each restart allows Discovery to scan only one of the N devices with duplicate engineIDs. If the services are restarted once or twice a day, Discovery has a reasonable chance of scanning each of the affected devices over time.
1B- Rescan with SNMP v2
1C- Upgrade to a version that resolves DRUD1-25505. As of January 2020, this is still in progress. When available, this change will allow the scans and credential tests to succeed even when the SNMP engine ids are duplicated.
Note that workaround 3B below (root cause 3, SNMP_USE_ENGINE_ID_CACHE) will not help if root cause 1 is confirmed.
Solution 1: Change the SNMP v3 engineID of the scanned device and make it unique. This is recommended for security reasons.
For Cisco Devices, it is possible to make it unique using MAC addresses: see https://supportforums.cisco.com/discussion/11539996/snmp-engineid-same-multiple-routers.
It may be possible to execute a similar procedure for other vendors, such as HP.
Note this solution is not suitable when pairs of master/standby devices share the same engineid (see root cause 2 below).
Root cause 2: Some devices (such as Cisco firewalls, Brocade load balancers, or Juniper devices) can be configured with an Active/Backup setup (also referred to as master/standby). This means that the active and backup devices are two different physical devices, but they share internal configurations to support failover. Because they share configurations, they also share SNMP v3 engineIDs. This conflicts with the SNMP v3 security standard, which requires engine IDs to be unique per device.
Workaround 2: Use the workarounds provided for root cause 1.
Root cause 3: A new network device was found, then replaced (on the same IP, i.e. was installed with a new MAC address and SNMP v3 EngineID), and rescanned.
Workaround 3: See workarounds 1A and 1B above
A) Upgrade to a version that resolves defect DRUD1-25505.
B) If not already done, upgrade to Discovery version 11.2 or 184.108.40.206 and execute the command below:
(enter system password)
Please note that, in theory, this solution could affect appliance performance.
Use Case #7:
Symptom: When using SNMP V3, a scan of a supported SNMP device fails with various “USM” error messages
Problem: SNMP V3 scan fails with "SNMPv3: USM: Authentication failure". In some cases, the device can be discovered using SNMP v2c, but fails when using SNMP v3.
Solution: Try the following suggestions:
- The error indicates an authentication problem (for example, the digest being invalid). Please check the credential, making sure it is valid for the specified device.
- Verify that the authentication and privacy passwords are correct. Check with the device administrator.
- Test the credential by running snmpGet from the appliance command line, using SNMP V3 parameters. Run "/usr/tideway/snmp++/bin/snmpGet" to see usage notes. If snmpGet also fails, please consult the device administrator.
Problem: SNMP V3 scan or credential test fails with "SNMPv3: USM: Unknown SecurityName".
Solution: This error means that the SNMPV3 security name being used is unknown to the device. To confirm, run snmpwalk using the same SNMPV3 parameters as the Discovery credential. If snmpwalk reports the same error, consult with the device owner about the correct security name to use.
It's also possible to check the security name from the command line by running 'tw_vault_control -S'. The security name will appear (when applicable) like this:
snmp.v3.securityname = '<the value that you set>'
Check for any extra spaces or unprintable characters at the end of the security name.
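Trailing spaces and control characters are hard to see in terminal output. The sketch below flags them in a value copied from the tw_vault_control output. The function name is hypothetical, and because it treats anything outside printable ASCII as suspicious, a security name that legitimately contains non-ASCII characters would also be flagged.

```shell
# Succeed when the value ends with a space or contains any character
# outside the printable ASCII range (tabs, newlines, control characters).
has_hidden_chars() {
    value=$1
    # Keep only printable ASCII characters, then compare with the original.
    printable=$(printf '%s' "$value" | LC_ALL=C tr -cd '[:print:]')
    case "$value" in
        *' ') return 0 ;;
    esac
    [ "$value" != "$printable" ]
}

# Example:
#   has_hidden_chars 'snmpuser ' && echo "security name has hidden characters"
```

If the check succeeds, re-enter the security name in the credential by typing it rather than pasting it.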
Problem: SNMP V3 scan fails with Skipped / Unsupported device. A device capture fails with "no SNMP access". An snmpWalk from the appliance command line reports “SNMPv3: USM: Decryption error”.
Solution: A decryption error typically indicates that there is some problem with the Network Device security configuration. Please check with the device administrator and/or network team to confirm that these SNMP V3 values on the credential are valid for the device:
- Authentication Protocol
- Authentication Key
- Privacy Protocol
- Private key
Problem: SNMP V3 scan fails in getMACAddresses with "SNMPv3: USM: Message not in TimeWindow". Running snmpWalk from the Discovery command line succeeds, and device MAC addresses can be found in the snmpWalk result. This indicates that the configured SNMPv3 credential has the required permissions to retrieve MAC addresses from the device.
Solution: The probable cause is that the device's SNMP agent cannot properly process Discovery's GETBULK requests. Because Discovery cannot obtain the output for the getMACAddresses method within the defined time window, it reports "SNMPv3: USM: Message not in TimeWindow".
To confirm this, create a separate SNMPv3 credential just for this device and uncheck "Use GETBULK". Move this credential to the top and temporarily deactivate the original credential. Re-scan the device. If the scan completes, contact the device owner/support and ask them to check why the SNMP agent is not able to process the GETBULK requests.