An interesting problem came up recently that I thought I would share. A customer had a cluster of version 11 appliances, configured to use SSO and Secure LDAP. As far as they were aware, there were no problems.


They were using the default appliance disk layout and wanted to move the datastore to a new, larger disk that had already been provisioned on each appliance VM. All good. So they started the Disk Configuration UI, which shut down the services (as expected) and then did nothing more for several hours (not as expected). Not good.


The immediate priority was, of course, to get the appliances back and usable, which consisted of:

  •     Running "tw_disk_utils --fix-interrupted" on the CLI
  •     Restarting the tideway services.


Subsequent investigation showed that while the UI was working for the LDAP-based administrator user that had been used on most appliances, it was NOT working on the last machine to be added to the cluster, and CLI authentication failed there too. Importantly, that machine had been provisioned after the other machines had been configured for LDAP.


This sequence of events had triggered defect DRUD1-18597, whereby the LDAP CA bundle is not distributed to a newly added member. This meant that although the local system user was working fine, when the LDAP administrator tried to initiate a disk operation, the coordinator got an error from that machine (because the LDAPS authentication could not be completed) and simply retried, ad infinitum.


A simple workaround exists:

  •     Copy the file (/usr/tideway/etc/ldap_cacert.pem) from a working appliance to the affected one
  •     Restart the tideway service.
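
Before restarting, it may be worth confirming that the copied bundle really is identical to the one on a working appliance. The following is only a minimal sketch of one way to do that by comparing SHA-256 digests. The helper names (sha256_of, bundle_matches) are hypothetical and not part of any product tooling; the only detail taken from this post is the file path.

```python
import hashlib
from pathlib import Path

# Path taken from the workaround above; everything else here is illustrative.
CA_BUNDLE = Path("/usr/tideway/etc/ldap_cacert.pem")


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, or None if the file is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def bundle_matches(local_path, reference_digest):
    """Return True if the local file exists and its digest equals the reference."""
    digest = sha256_of(local_path)
    return digest is not None and digest == reference_digest
```

One possible way to use it: call sha256_of(CA_BUNDLE) on a working appliance and note the digest, then call bundle_matches(CA_BUNDLE, digest) on the newly added member. A result of False (including the case where the file is missing altogether) suggests the bundle was not distributed or was not copied correctly.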


This should be fixed in 11.2.