Troubleshooting: why isn't my Windows client connecting to its relay?

Version 11

    There can be several reasons why an agent appears in red, or never shows up in the console, after you've installed it. I'll cover the common ones.

    Please feel free to share your thoughts so I can update or fix this article.

     

    Prerequisite: Enable full logging on the device then restart the agent

     

    - Check that you have available licenses:

     

    [Screenshot: licenses.png]

    If not, you might want to delete devices that are no longer in production, or contact the salesperson in charge of your account to buy more licenses or to get a temporary extension.

     

     

    - Check if the relay is started. Is it in red in the console? You might want to restart its service first.

    Relays need to be restarted regularly because they need to vacuum their sqlite databases. If you don't restart them, the sqlite files get bigger and bigger and end up slowing down the module, or even get corrupted. We recommend restarting a relay's service automatically once in a while (every two weeks to once a month), depending on how many clients it has and on how often you send packages, patches, etc.

    You might want to enable full logging on the relay to check that it works correctly and that it can contact the master and the client.

     

    - IP ranges in the relay list must be the real subnet range: you must calculate it from the mask if it's not the standard mask. E.g.: if your device has the IP 10.5.159.56/255.255.252.0, the subnet to add in the relay list is "10.5.156.0" and not "10.5.159.0".
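
If you'd rather not work out the subnet by hand, a few lines of Python can do it. This is only a minimal sketch using the standard ipaddress module; the function name relay_list_subnet is made up for the example and has nothing to do with the product.

```python
import ipaddress


def relay_list_subnet(ip: str, mask: str) -> str:
    """Return the network address that the ip/mask pair belongs to."""
    interface = ipaddress.ip_interface(f"{ip}/{mask}")
    return str(interface.network.network_address)


# The example from this article: the mask is /22, so the subnet starts at .156.0
print(relay_list_subnet("10.5.159.56", "255.255.252.0"))  # 10.5.156.0
```

The value printed is what should go in the relay list for that device.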

     

     

    - Check that there's an exception for our agent's install folder in your antivirus. We've had a lot of issues with Symantec products recently. If there isn't one, add it prior to trying anything else.

     

     

    - Check the client's logs in /client/log:

    - start by filtering on "ERR " (for errors) and " W  " (for warnings) to see if there's anything obvious in the client's and the relay's logs. Do not forget the spaces inside the quotes, or the filter will match far too many lines.

     

    - If you find the following lines, the agent is unable to send its identity to the relay because the module's sqlite is corrupted or locked:

    - 2013/05/23 11:35:18 AsynchronousActions ERR SQL Error: database disk image is malformed

    To solve this you'll have to:

    - stop the agent

    - delete everything in /client/data/AsynchronousActions/ except for the subfolder "sql" it contains

    - restart the agent

    - 2013/05/23 11:35:18 AsynchronousActions ERR SQL Error: database disk image is locked/corrupted

    I'm unsure of the exact error message, but you might have to do the same as for the previous error, or check whether there's an antivirus exception on the agent's folder.

    This should also be checked on its relay (if it has one) and on its master.

     

    - the port you have set in /client/config/HttpProtocoleHandler.ini "Port=" is already used by another application, or wasn't released properly by Windows. If that's the case, you should find these lines in the logs shortly after the agent has started:

    2013/05/23 11:35:18 AgentCore ERR Socket::AddListeningPort failed: cannot bind to port 1610

    2013/05/23 11:35:18 HttpProtocolHandler ERR failed to bind virtual host 'HttpProtocolHandler': retrying in 7200 seconds
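
The log checks from this section can be scripted. Here's a minimal sketch, assuming you feed it the lines of a client log: it keeps the lines that contain the "ERR " or " W  " markers (with the capital W seen in the log excerpts of this article) and flags the two error signatures quoted above. The function name triage and the hint texts are made up for the example.

```python
# Markers to filter on; the spaces are part of the markers.
LEVEL_MARKERS = ("ERR ", " W  ")

# Error signatures quoted in this article, mapped to the suggested follow-up.
KNOWN_SIGNATURES = {
    "database disk image is malformed":
        "module sqlite corrupted: clean /client/data/AsynchronousActions/ (keep 'sql')",
    "cannot bind to port":
        "agent port already in use: check Port= in HttpProtocoleHandler.ini",
}


def triage(lines):
    """Return (problem_lines, hints) for an iterable of log lines."""
    problems, hints = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if any(marker in line for marker in LEVEL_MARKERS):
            problems.append(line)
        for signature, hint in KNOWN_SIGNATURES.items():
            if signature in line and hint not in hints:
                hints.append(hint)
    return problems, hints


sample = [
    "2013/05/23 11:35:18 AsynchronousActions ERR SQL Error: database disk image is malformed",
    "2013/05/23 11:35:18 AgentCore ERR Socket::AddListeningPort failed:",
    "cannot bind to port 1610",
]
problem_lines, hints = triage(sample)
print(len(problem_lines), "problem line(s)")
for hint in hints:
    print("hint:", hint)
```

The signatures are checked on every line, not only on the filtered ones, so a message that wraps onto a second line in the log is still caught.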

     

     

    - You forgot to set a parent name, or your relay selection sequence is not correctly set in your rollout configuration:

    - go to your agent's installation directory and edit /client/config/Relay.ini

    - if "Sequence=" is empty, check that "ParentName=" is set to the relay's hostname or IP address (or the master's, if the master is the device's relay) and, if "ParentPort=" is empty, set it to the relay's port

    - if "Sequence=" is not empty, look up each mechanism of that sequence in your Relay.ini and check that its settings are consistent:

    - if "static" is set, check that "StaticParentName=" and "StaticParentPort=" are set correctly with the relay's name/IP and its port

    - if "list" is set, check that "ListServerUrl=" is correctly set. You can compare with a device that works normally

    - if "dhcp" is set, check that "DhcpExtendedOption=" is set with the correct DHCP option and that the option is set on the DHCP server. You might want to check, with Nagios or from the command line, that the option is available to devices.

    - if "script" is set, check that "ScriptPath=" is correctly set: is it the correct path to the script? Is the script the same version as on your other devices? Also check whether there's a chilli.log in /client/log: if the script fails for some reason, there will probably be errors in this specific log.

    - if "backup" is set, check that "BackupRelays=" is set like this: "Relay's_Hostname:Relay's_Port"

     

    You will find more information on these modes in this document.
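
The Relay.ini checks above can be sketched as a small script. It only uses the key names quoted in this article. Two things are assumptions on my side: that the mechanisms in "Sequence=" are separated by commas, and that simple "key=value" parsing is enough for this file. The function names are made up for the example.

```python
# Keys that must be filled in for each mechanism, as listed in this article.
KEYS_BY_MECHANISM = {
    "static": ["StaticParentName", "StaticParentPort"],
    "list": ["ListServerUrl"],
    "dhcp": ["DhcpExtendedOption"],
    "script": ["ScriptPath"],
    "backup": ["BackupRelays"],
}


def parse_key_values(text):
    """Collect key=value pairs, skipping blank lines, comments and [sections]."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("[", ";", "#")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def relay_ini_problems(text):
    """Return the keys that look empty or malformed in a Relay.ini content."""
    values = parse_key_values(text)
    # Assumption: mechanisms in Sequence= are comma-separated.
    sequence = [m.strip().lower() for m in values.get("Sequence", "").split(",") if m.strip()]
    if not sequence:
        return [key for key in ("ParentName", "ParentPort") if not values.get(key)]
    problems = []
    for mechanism in sequence:
        for key in KEYS_BY_MECHANISM.get(mechanism, []):
            value = values.get(key, "")
            # BackupRelays is expected to look like hostname:port.
            if not value or (key == "BackupRelays" and ":" not in value):
                problems.append(key)
    return problems


example = "Sequence=static,backup\nStaticParentName=relay01\nStaticParentPort=1610\nBackupRelays=\n"
print(relay_ini_problems(example))
```

An empty result only means the fields are filled in. Whether the relay name, URL or DHCP option is the right one still has to be checked by hand.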

     

    - check the client's logs to see if it:

    - synchronizes with its relay:

    2013/08/22 17:20:31 Relay                         I   Synchronized with relay 10.5.65.244:1610 (self_ip=10.5.159.243, relay_guid=0001343EC363E507691E73398CEC07CD26FB, relay_tunnel=1611)

    - enters the mechanisms you have set in your Relay.ini and if they're working. As an example, these logs show that the previous relay went down and that the relay module is now entering the backup mechanism and that it manages to synchronize with this new relay:

    2013/08/22 17:43:18 Relay                         W   Failed to verify the supplied relay (10.5.65.244:1610)

    2013/08/22 17:43:18 Relay                         T   Entering backup mechanism

    2013/08/22 17:43:18 Relay                         T   Processing <action.RelaySetValues>

    2013/08/22 17:43:19 AgentActionDB                 I   Invoke action RelayCheckClient on remote host http://Numara FootPrints Asset Core Agent:****@***.no-ip.com:1610

    2013/08/22 17:43:20 Relay                         I   Synchronized with relay lerch.no-ip.com:1610 (self_ip=198.147.192.8, relay_guid=00014BE9AB357DA842E62CDA291B03B3A8DE, relay_tunnel=1611)

    2013/08/22 17:43:20 Relay                         T   Processing <event.agent.runtime.parent.updated>

    - obtains the new relay through the newly selected mechanism. These logs show that the parent name couldn't be obtained using the DHCP option. This could come from a bug that was fixed by a cumulative hotfix in 11.1 and 11.5:

    2013/03/19 11:42:16 Relay                         T   Entering DHCP mechanism

    2013/03/19 11:42:46 Relay                         W   Failed to receive DHCP response (timeout)
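
To find out quickly which relay a client last synchronized with, the "Synchronized with relay" lines can be parsed. This is a minimal sketch: the regular expression is written from the log excerpts in this article only, so it may need adjusting for other versions, and the function name is made up for the example.

```python
import re

# Pattern built from the "Synchronized with relay" lines quoted in this article.
SYNC_RE = re.compile(
    r"Synchronized with relay (?P<host>[^\s:]+):(?P<port>\d+) "
    r"\(self_ip=(?P<self_ip>[^,]+), relay_guid=(?P<relay_guid>[^,]+), "
    r"relay_tunnel=(?P<relay_tunnel>\d+)\)"
)


def last_synchronization(lines):
    """Return the fields of the last synchronization line, or None if there is none."""
    last = None
    for line in lines:
        match = SYNC_RE.search(line)
        if match:
            last = match.groupdict()
    return last


log_line = (
    "2013/08/22 17:20:31 Relay                         I   "
    "Synchronized with relay 10.5.65.244:1610 (self_ip=10.5.159.243, "
    "relay_guid=0001343EC363E507691E73398CEC07CD26FB, relay_tunnel=1611)"
)
print(last_synchronization([log_line]))
```

If the function returns None for a whole log file, the client never synchronized while that log was being written.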

     

     

    - Don't forget to check the firewall configuration, ping, telnet and DNS resolution:

    - the agent sets exceptions in the Windows firewall the first time it starts: are they set for each type of network? Have you tried deactivating the device's firewall to be sure?

     

    - can you ping, and telnet to the agent's port, from the device to the relay and from the relay to the device? If not, it's probably:

    - a firewall issue if the device can bind to that port

    - an issue on the relay if you can ping it but not telnet to it: check the relay's logs

     

    - can you resolve the relay's name from the client? Try setting the parent's IP address instead of the hostname in the client's Relay.ini
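
If telnet isn't installed on the device, the same checks can be approximated with a short script. This is a minimal sketch with made-up function names, using only the standard socket module. It covers name resolution, the TCP connection to the agent's port (what telnet would test) and whether the port can be bound locally. Ping is not covered, since ICMP needs raw sockets.

```python
import socket


def can_resolve(hostname):
    """DNS check: True if the name resolves to at least one address."""
    try:
        return bool(socket.getaddrinfo(hostname, None))
    except socket.gaierror:
        return False


def can_connect(host, port, timeout=3.0):
    """Rough equivalent of 'telnet host port': True if a TCP connection opens."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def port_is_free(port, address="0.0.0.0"):
    """True if a socket can be bound to the port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((address, port))
            return True
        except OSError:
            return False
```

For instance, on the client you would call can_resolve() with the relay's name and can_connect() with the relay's name and 1610 (the port seen in the logs of this article, yours may differ), then do the same from the relay towards the client. port_is_free() returns False while the agent is running and listening on that port, so run it with the agent stopped.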

     

     

    - The client you have deployed might have the same GUID (Globally Unique ID) as other devices:

    - In 11.6, GUIDs were not correctly set if you had selected multiple options in the system variables to generate them. This was fixed in the first cumulative hotfix

     

    - Are the agents included in an OS Deployment WIM image or similar? If so, the agent must not have been started before you captured the device, or all of your devices will have the same GUID.

     

    If you have devices in one of the two previous situations, you probably want to copy the value of the "GUID=" field from the ../config/Identity.ini of one of these devices into the GUID blacklist in the system variables of the console. Devices with this GUID will then be ordered to recalculate their GUID the next time they update their identity on the master.

     

    - You have devices with the same name on your network. In this situation you might need to change the GUID scheme in the system variables of the console so that it takes into account criteria other than the hostname alone.

    You will also need to allow duplicate device names in the system variables.

    Note: this will regenerate the GUIDs of all your devices, so some devices might be unreachable for some time.

     

    - Do you deploy the agent before setting a specific device name? If your devices have the same name as others when you deploy an agent on them, and you have chosen in the system variables to generate the GUID based on the hostname, they'll all have the same GUID.

     

    You can check this by copying the "GUID=" value from your client's /client/config/Identity.ini and running this query on the database (don't forget that you might have to specify the owner in the query, e.g.: "facdbuser.Devices"):

    SELECT DeviceName, GloballyUniqueID FROM Devices WHERE GloballyUniqueID='_THE_CLIENTS_GUID_';

    If the query returns more than one device, you probably have that issue.
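
To list every shared GUID at once instead of checking them one by one, a GROUP BY query does the job. The sketch below is only an illustration: it runs the query against an in-memory SQLite table that stands in for the real database, and only the table and column names (Devices, DeviceName, GloballyUniqueID) come from the query above. The device names, the GUID values and the function name are made up.

```python
import sqlite3

# Lists every GUID that is shared by more than one row of the Devices table.
DUPLICATE_GUIDS_SQL = """
SELECT GloballyUniqueID, COUNT(*) AS DeviceCount
FROM Devices
GROUP BY GloballyUniqueID
HAVING COUNT(*) > 1
"""


def find_duplicate_guids(connection):
    """Return (guid, device_count) rows for GUIDs used by several devices."""
    return connection.execute(DUPLICATE_GUIDS_SQL).fetchall()


# Stand-in database with made-up devices, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Devices (DeviceName TEXT, GloballyUniqueID TEXT)")
connection.executemany(
    "INSERT INTO Devices VALUES (?, ?)",
    [("PC-001", "GUID-A"), ("PC-002", "GUID-A"), ("PC-003", "GUID-B")],
)
print(find_duplicate_guids(connection))  # [('GUID-A', 2)]
```

On the real database you would run the SQL statement itself with your usual query tool, adding the owner prefix if needed.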

     

    You should also filter the master's logs on this GUID and see whether different IP addresses are related to it. The best way to do this is to configure enough log retention in /master/config/mtxagent.ini to cover 25 hours.

     

    - You installed the device with some name and the agent was started before it got a new name. Agent services must only be started after the device has its final name. Another similar case would be a device that was in production and was then renamed to be assigned to someone else.

    If it is renamed after the agent was started, you'll need to either:

    - delete the device from the console, then wait for it to send a new identity update, or force an identity update by:

    - stopping the agent service on the device

    - deleting the value of the field "lastidentitysent=" of its ../config/Identity.ini

    - restarting the agent service

    - or reinitialize its ../config/Identity.ini and update the DB:

    - stop the device agent service

    - delete the value of the field "GUID Scheme="

    - delete the value of the field "GUID="

    - restart the service

    - wait a bit for the device to regenerate a GUID

    - edit its ../config/Identity.ini and copy the value of its "GUID=" field to update the "GloballyuniqueID" column of this device in the table "Devices" of your DB:

    UPDATE DEVICES SET GloballyUniqueID='_DEVICE_GUID_' WHERE DeviceName='_DEVICE_NAME_';

    Where "_DEVICE_GUID_" will be the value of the "GUID=" field you just copied and "_DEVICE_NAME_" will be the name of the device to update
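
The Identity.ini edits described above (emptying "lastidentitysent=", or "GUID Scheme=" and "GUID=") can be scripted. Here's a minimal sketch that works on the text of the file and leaves every other line untouched. The function name is made up, key matching is case-insensitive as a precaution, and the agent service still has to be stopped before the file is modified.

```python
def clear_ini_values(text, keys):
    """Return the INI text with the values of the given keys emptied."""
    wanted = {key.lower() for key in keys}
    cleaned = []
    for line in text.splitlines():
        key, separator, _ = line.partition("=")
        if separator and key.strip().lower() in wanted:
            cleaned.append(f"{key}=")  # keep the key, drop the value
        else:
            cleaned.append(line)
    return "\n".join(cleaned)


# Made-up file content, for illustration only.
identity = "GUID Scheme=hostname\nGUID=0001ABCD\nlastidentitysent=123"
print(clear_ini_values(identity, ["GUID Scheme", "GUID"]))
```

Keep a copy of the original file before writing the result back, so that you can roll back if the agent doesn't regenerate its GUID.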

     

    - Check its relay's logs, if it has one, for the same things as in the client's logs, to make sure the relay can connect to its own relay or to the master. Alternatively, simply assign a basic operational rule to the relay (I usually use the step "Wait" set to "3") to see whether it is executed

     

    - Open your client's and your parent's /client/config/mtxagent.ini (or /client/etc/mtxagent.ini if the parent is a Linux device) to see whether "PAC=" and "SSL=" are set to the same values on each of your devices. If that's not the case, you'll have to:

    - edit the client's mtxagent.ini to match the parent's settings, then restart the client's service

    - edit your rollout configuration so that the next clients you install are set up correctly:

     

    [Screenshot: Security.png]

    "PAC=" corresponds to "Access Control", "SSL=" to "Secure communication".

     

    Note that the GMT date and time of all your devices must be synchronized; communication won't be possible if they aren't!
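
The "PAC=" / "SSL=" comparison between a client and its parent can be sketched as below. It assumes you have the content of both mtxagent.ini files at hand and that plain "key=value" parsing is enough. The function names are made up for the example.

```python
SECURITY_KEYS = ("PAC", "SSL")


def read_security_settings(text):
    """Extract the PAC= and SSL= values from the content of an mtxagent.ini."""
    found = {}
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator and key.strip() in SECURITY_KEYS:
            found[key.strip()] = value.strip()
    return found


def security_mismatches(client_text, parent_text):
    """Return {key: (client_value, parent_value)} for the settings that differ."""
    client = read_security_settings(client_text)
    parent = read_security_settings(parent_text)
    return {
        key: (client.get(key), parent.get(key))
        for key in SECURITY_KEYS
        if client.get(key) != parent.get(key)
    }


# Made-up values, for illustration only.
print(security_mismatches("PAC=1\nSSL=0", "PAC=1\nSSL=1"))  # {'SSL': ('0', '1')}
```

An empty dictionary means both files agree on these two settings.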

     

    - Check the size of the ../data/asynchronousactions/asynchronousactions.sqlite on its relay (if it has one) and on the master. In some versions this sqlite tended to grow because the module was no longer able to process identities. This has been fixed for a while now; make sure you have the latest cumulative hotfix.

     

    - Check the master's logs to see whether there are errors from the module "Vision64database" stating that the master cannot write to the database.

     

    - Check that there is no jam in the "Workqueue" table of your master's database. If you do not remember where the database is, you can find some information in the master's ../config/Vision64database.ini.

    As a rule of thumb, if there are more than 50 entries in that table, you probably have a problem and should call support for a deeper investigation.

     

    Notes:

    - This article was written for 11.6 and earlier. Logs might differ in 10.1 and earlier, or in 11.7, as the logs were reviewed. I'll update the article when I can.