25 Replies Latest reply on May 17, 2016 9:58 AM by Achim Hilker

    Agent user locking on Windows cluster nodes

    Jim Campbell

      Does anyone else have a problem with the agent's local user (BladeLogicRSCD) locking on Windows cluster nodes?  I am familiar with this problem occurring in previous agent versions, but we are still seeing this behaviour with newer agents.  This is ONLY occurring on some of our Windows hosts that are in clusters (so about 20 servers out of 5000ish).  It does not occur on all clusters, and unfortunately I have not seen it occur on any of the sandbox clusters where we could test.

       

      For a while I had only observed this on Windows 2003 clusters but now have seen it on Windows 2008R2 and Windows 2012 clusters as well.  I usually just temporarily fix it ( delete user, restart agent ) but it seems to keep coming back eventually.  The agent will continue to work for some period of time but eventually the user becomes locked again.

        • 1. Re: Agent user locking on Windows cluster nodes
          Bill Robinson

          what kind of clustering is set up and what is clustered?  (is the agent part of the clustering?)

           

          where are the failed auth attempts coming from for the BladeLogicRSCD user that are locking it ?

           

          do you see any jobs or other activity running against the server prior to the failed auth attempts ?

          • 2. Re: Agent user locking on Windows cluster nodes
            Jim Campbell

            Windows clustering ( the 'Failover-Clustering' Windows server role )

            SQL Server and SQL Server Agent resources are the only services that are clustered.  The Blade agent service is not involved

             

            I don't really know how to determine when the failed auth attempts are occurring.  As soon as the user locks, the rscd.log files start to roll as the service automatically tries to restart 50 times.  Each log file just looks like this:

             

            02/21/16 01:03:42.903 INFO     rscd -  SERVERNAME_HERE 7020 SYSTEM (???): ???: The following local user will be used by the agent for user privilege mapping: BladeLogicRSCD

            02/21/16 01:03:42.913 ERROR    rscd -  SERVERNAME_HERE 7020 SYSTEM (???): ???: [listen_thread:RSCD_WinUser::logonPassword:LsaLogonUser()] : The referenced account is currently locked out and may not be logged on to. (BladeLogicRSCD@SERVERNAME_HERE)

            02/21/16 01:03:42.915 ERROR    rscd -  SERVERNAME_HERE 7020 SYSTEM (???): ???: listen_thread ERROR: 9002:Internal Error - Caught exeception.

            02/21/16 01:03:42.933 ERROR    rscd -  SERVERNAME_HERE 2496 SYSTEM (???): ???: Main: Wait Failed on handle. RSCD start failed.

             

            In the case of the one server I can find in this situation it appears to have occurred following a reboot so I would assume the failed login attempts for this one occurred as the agent attempted to start following the reboot.  The Windows security event logs have rolled since this as well.

            • 3. Re: Agent user locking on Windows cluster nodes
              Bill Robinson

              You can change the rollmaxfiles in the log4crc to 100.  And you can check the event logs on the server to find the failed auth attempts.

              • 4. Re: Agent user locking on Windows cluster nodes
                Achim Hilker

                Hi,

                 

                We have nearly the same problem in our environment. It first occurred in version 8.1 SP3; we opened a case with BMC at that time, got the workaround, and were told that a lifecycle upgrade would solve the problem. In 8.2 we had the same problem and were again told a lifecycle upgrade should help, which has been done by now. We are on 8.5.01.304 and still facing the problem.

                The workaround itself works fine (stop process / delete user / start process), but that can't be the solution. In the past we did most of our BladeLogic work on Linux/Unix, but Windows is coming more and more, so the problem keeps getting bigger for us.

                 

                It's nearly as Jim explained in his first post, except that we have not seen the problem on 2k12 so far.

                I have done some investigation and it's really strange. I will try to explain it with some logs as examples.

                 

                Environment:

                A business unit where the problem occurs more often:

                Win2k3 Cluster:

                ClusterA1-4

                 

                ClusterB1-4

                8 nodes in total, but only 2 are active.

                Win2k8 Servers:

                ServerA

                24 servers in total, which have nothing to do with the clusters directly, except that they are in the same OU.

                We are in a replicated domain controller environment. The RSCD agent is installed (same version as the BL environment, 8.5.01.304) with the local default user BladeLogicRSCD.

                 

                Scenario:

                I have been observing some suspicious systems for two months now. That means an agentinfo runs every hour and I save the last two rscd logs from the system. When agentinfo fails I get a mail. This gives me the time at which the local user gets locked on the system. Since then I have observed a couple of locks, and it was always nearly the same scenario.

                The trigger is a BladeLogic deployment that runs against the systems. This could be a single file deploy job or a BL package. In every case there was a file transfer of some kind; I have only observed the problem when some kind of file is deployed. Now the strange part begins: the deploy job runs against one of the servers in the OU, and then that server tries to log in to the cluster with the local BladeLogicRSCD account.

                 

                 

                Windows Events on ClusterA:

                 

                Event-ID 4625

                 

                <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
                  <System>
                    <Provider Name="Microsoft-Windows-Security-Auditing" Guid="{54849625-5478-4994-A5BA-3E3B0328C30D}" />
                    <EventID>4625</EventID>
                    <Version>0</Version>
                    <Level>0</Level>
                    <Task>12546</Task>
                    <Opcode>0</Opcode>
                    <Keywords>0x8010000000000000</Keywords>
                    <TimeCreated SystemTime="2016-02-17T16:08:53.292999800Z" />
                    <EventRecordID>4744914</EventRecordID>
                    <Correlation />
                    <Execution ProcessID="556" ThreadID="19920" />
                    <Channel>Security</Channel>
                    <Computer>ClusterA</Computer>
                    <Security />
                  </System>
                  <EventData>
                    <Data Name="SubjectUserSid">S-1-5-18</Data>
                    <Data Name="SubjectUserName">ClusterA$</Data>
                    <Data Name="SubjectDomainName">DOMAIN</Data>
                    <Data Name="SubjectLogonId">0x3e7</Data>
                    <Data Name="TargetUserSid">S-1-0-0</Data>
                    <Data Name="TargetUserName">BladeLogicRSCD</Data>
                    <Data Name="TargetDomainName">ClusterA</Data>
                    <Data Name="Status">0xc0000234</Data> <!-- 0xC0000234: user is currently locked out (per Windows documentation) -->
                    <Data Name="FailureReason">%%2307</Data>
                    <Data Name="SubStatus">0x0</Data>
                    <Data Name="LogonType">4</Data> <!-- 4: Batch (e.g. scheduled task) -->
                    <Data Name="LogonProcessName">BlRscd</Data>
                    <Data Name="AuthenticationPackageName">MICROSOFT_AUTHENTICATION_PACKAGE_V1_0</Data>
                    <Data Name="WorkstationName">ClusterA</Data>
                    <Data Name="TransmittedServices">-</Data>
                    <Data Name="LmPackageName">-</Data>
                    <Data Name="KeyLength">0</Data>
                    <Data Name="ProcessId">0x4e34</Data>
                    <Data Name="ProcessName">C:\BladeLogic\RSCD\RSCD.exe</Data>
                    <Data Name="IpAddress">-</Data>
                    <Data Name="IpPort">-</Data>
                  </EventData>
                </Event>

                 

                 

                 

                Event-ID 4776

                 

                <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
                  <System>
                    <Provider Name="Microsoft-Windows-Security-Auditing" Guid="{54849625-5478-4994-A5BA-3E3B0328C30D}" />
                    <EventID>4776</EventID>
                    <Version>0</Version>
                    <Level>0</Level>
                    <Task>14336</Task>
                    <Opcode>0</Opcode>
                    <Keywords>0x8010000000000000</Keywords>
                    <TimeCreated SystemTime="2016-02-17T16:08:53.292999800Z" />
                    <EventRecordID>4744913</EventRecordID>
                    <Correlation />
                    <Execution ProcessID="556" ThreadID="19920" />
                    <Channel>Security</Channel>
                    <Computer>ClusterA</Computer>
                    <Security />
                  </System>
                  <EventData>
                    <Data Name="PackageName">MICROSOFT_AUTHENTICATION_PACKAGE_V1_0</Data>
                    <Data Name="TargetUserName">BladeLogicRSCD</Data>
                    <Data Name="Workstation">ClusterA</Data>
                    <Data Name="Status">0xc0000234</Data>
                  </EventData>
                </Event>

                 

                 

                 

                Event-ID 4611

                 

                <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
                  <System>
                    <Provider Name="Microsoft-Windows-Security-Auditing" Guid="{54849625-5478-4994-A5BA-3E3B0328C30D}" />
                    <EventID>4611</EventID>
                    <Version>0</Version>
                    <Level>0</Level>
                    <Task>12289</Task>
                    <Opcode>0</Opcode>
                    <Keywords>0x8020000000000000</Keywords>
                    <TimeCreated SystemTime="2016-02-17T16:08:53.292999800Z" />
                    <EventRecordID>4744912</EventRecordID>
                    <Correlation />
                    <Execution ProcessID="556" ThreadID="19920" />
                    <Channel>Security</Channel>
                    <Computer>ClusterA</Computer>
                    <Security />
                  </System>
                  <EventData>
                    <Data Name="SubjectUserSid">S-1-5-18</Data>
                    <Data Name="SubjectUserName">ClusterA$</Data>
                    <Data Name="SubjectDomainName">DOMAIN</Data>
                    <Data Name="SubjectLogonId">0x3e7</Data>
                    <Data Name="LogonProcessName">BlRscd</Data>
                  </EventData>
                </Event>

                Okay, we see that the local user is locked on ClusterA. But the job runs against ServerA.

                Unfortunately the problem mostly occurs in the production environment, where I can't set the RSCD agent to DEBUG level. I set it on a DEV server where the problem happened in February, but since then I have not been able to provoke the problem there.
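For anyone who wants to triage exported events like the ones above without reading the XML by hand, here is a minimal Python sketch. It assumes each event has been saved as a standalone XML string in the same schema as shown; the function name `summarize_event` and the lookup tables are illustrative, not part of any BMC or Microsoft tooling. The two status codes are the ones that appear in this thread (0xC0000234 locked out, 0xC000006A wrong password).

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML, as seen in the events above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Only the NTSTATUS codes that occur in this thread; extend as needed.
STATUS_MEANING = {
    0xC0000234: "account is currently locked out",
    0xC000006A: "user name is correct but the password is wrong",
}
# A few common logon types; 4 (Batch) is the one the RSCD agent shows here.
LOGON_TYPE = {"2": "Interactive", "3": "Network", "4": "Batch", "5": "Service"}


def summarize_event(xml_text):
    """Return the handful of fields that matter for lockout triage."""
    root = ET.fromstring(xml_text)
    # Flatten <Data Name="...">value</Data> into a plain dict.
    data = {d.get("Name"): (d.text or "")
            for d in root.findall("e:EventData/e:Data", NS)}
    status = int(data.get("Status", "0x0"), 16)
    return {
        "event_id": int(root.findtext("e:System/e:EventID", default="0", namespaces=NS)),
        "computer": root.findtext("e:System/e:Computer", default="", namespaces=NS),
        "target_user": data.get("TargetUserName", ""),
        # 4625 uses WorkstationName, 4776 uses Workstation.
        "source_workstation": data.get("WorkstationName") or data.get("Workstation", ""),
        "status": STATUS_MEANING.get(status, hex(status)),
        "logon_type": LOGON_TYPE.get(data.get("LogonType", ""), "n/a"),
    }
```

Events with status "locked out" only show the agent running into an existing lock; the events worth hunting for are the earlier ones with a wrong-password status, because their workstation field names the machine that caused the lock.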

                 

                Snippet of RSCD Log from ClusterA:

                 

                10:36:53.047 INFO     rscd -  ClusterA 3076 SYSTEM (Not_available): (Not_available): Main: Starting AgentHouseKeeping.

                 

                10:37:03.592 INFO     rscd -  ClusterA 9072 SYSTEM (Not_available): (Not_available): User Privilege Mapping enabled.

                 

                10:37:03.592 INFO     rscd -  ClusterA 9072 SYSTEM (Not_available): (Not_available): The following local user will be used by the agent for user privilege mapping: BladeLogicRSCD

                 

                10:37:03.608 ERROR    rscd -  ClusterA 9072 SYSTEM (Not_available): (Not_available): User Impersonation Failed ; Error Location: RSCD_WinUser::logonPassword:LsaLogonUser() ; Error Message: The referenced account is currently locked out and may not be logged on to.

                 

                ; Auxiliary Error Message: BladeLogicRSCD@ClusterA

                 

                10:37:03.608 WARN     rscd -  BLAPPSERVER 9072 SYSTEM (ROLE:user): agentinfo: Impersonation failed

                 

                10:46:53.054 INFO     rscd -  ClusterA 3076 SYSTEM (Not_available): (Not_available): Main: Starting AgentHouseKeeping.

                 

                10:56:53.061 INFO     rscd -  ClusterA 3076 SYSTEM (Not_available): (Not_available): Main: Starting AgentHouseKeeping.

                 

                11:01:45.436 INFO     rscd -  ClusterA 6720 SYSTEM (Not_available): (Not_available): User Privilege Mapping enabled.

                 

                11:01:45.436 INFO     rscd -  ClusterA 6720 SYSTEM (Not_available): (Not_available): The following local user will be used by the agent for user privilege mapping: BladeLogicRSCD

                 

                11:01:45.452 ERROR    rscd -  ClusterA 6720 SYSTEM (Not_available): (Not_available): User Impersonation Failed ; Error Location: RSCD_WinUser::logonPassword:LsaLogonUser() ; Error Message: The referenced account is currently locked out and may not be logged on to.

                 

                ; Auxiliary Error Message: BladeLogicRSCD@ ClusterA
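As a rough aid for pinning down when a node first notices the lock, the following Python sketch scans rscd log lines for the lockout error. It assumes the two line layouts quoted in this thread (with and without a leading date); the regular expression and the name `find_lockout_errors` are illustrative, and other agent versions may log differently.

```python
import re

# Matches lines such as
#   10:37:03.608 ERROR    rscd -  ClusterA 9072 SYSTEM (...): ...
#   02/21/16 01:03:42.913 ERROR    rscd -  SERVERNAME_HERE 7020 SYSTEM (...): ...
LINE = re.compile(
    r"^\s*(?:(?P<date>\d{2}/\d{2}/\d{2})\s+)?"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+rscd\s+-\s+"
    r"(?P<host>\S+)\s+(?P<pid>\d+)\s+(?P<rest>.*)$"
)
LOCKED = "The referenced account is currently locked out"


def find_lockout_errors(lines):
    """Return (date, time, host, pid) for every ERROR line reporting the lockout.

    date is None when the log line carries only a time of day.
    Continuation lines (e.g. '; Auxiliary Error Message: ...') are skipped.
    """
    hits = []
    for line in lines:
        m = LINE.match(line)
        if m and m.group("level") == "ERROR" and LOCKED in m.group("rest"):
            hits.append((m.group("date"), m.group("time"),
                         m.group("host"), int(m.group("pid"))))
    return hits
```

The first hit only bounds the time of the lock from above: the failed logons that caused it happened earlier and are recorded in the Windows security log, not in rscd.log.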

                 

                At the time the local user got locked, I couldn't find anything in the logs. We have ArcSight in our environment, and there I could investigate further and see which server tried to log on to the cluster:

                 

                10:44:00 CEST The computer attempted to validate the credentials for an account. ServerA  ClusterA

                 

                10:44:00 CEST An account failed to log on. ServerA  ClusterA

                 

                 

                10:44:00 CEST The computer attempted to validate the credentials for an account. ServerA  ClusterA

                 

                 

                10:44:00 CEST An account failed to log on. ServerA  ClusterA

                 

                 

                10:44:00 CEST A logon was attempted using explicit credential. ServerA  ClusterA

                 

                 

                10:44:00 CEST The computer attempted to validate the credentials for an account. ServerA  ClusterA

                 

                10:44:00 CEST An account failed to log on. ServerA  ClusterA

                 

                One example RAW msg from ArcSight (sensitive data removed):

                 

                RAW   CEF:0|Microsoft|Microsoft Windows||Microsoft-Windows-Security-Auditing:4776|The computer attempted to validate the credentials for an account.|Low cnt=1 type=0 priority=3 start=1456910695000 customer=500000044 externalId=4776 msg=The specified network password is not correct. modelConfidence=4 severity=0 relevance=10 locality=Local assetCriticality=0 cat=Security deviceSeverity=Audit_failure rt=1456910695000 cs2=Account Logon:Credential Validation cs4=0xc000006a cs5=MICROSOFT_AUTHENTICATION_PACKAGE_V1_0 originator=0 dhost=ClusterA duser=BladeLogicRSCD == shost=ServerA == reason=User name is correct but the password is wrong slat=50.0 slong=8.0 sourceGeoCountryCode=-- destinationGeoCountryCode=-- categoryObject=/Host/Operating System categoryBehavior=/Authentication/Verify categorySignificance=/Informational/Warning categoryOutcome=/Failure categoryDeviceGroup=/Operating System categoryDeviceType=Operating System aid=3gGULAkcBABCXvCTCJ+CaTw== at=logger agentAssetId=4I6i0IDwBABCd1L94w9l4HA== dvchost= ClusterA lblString1Label=Accesses lblString2Label=EventlogCategory lblString4Label=Reason or Error Code lblString5Label=Authentication Package Name lblNumber1Label=LogonType lblNumber2Label=CrashOnAuditFail lblNumber3Label=Count destinationZone=500000316 == oat=windowsfg fDeviceVendor=Microsoft fDeviceProduct=Microsoft Windows fdvchost=ClusterA
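To pull the interesting fields (source host, destination host, user, reason code) out of raw CEF messages like the one above, a small Python sketch follows. It is a simplified key=value splitter, not a full CEF parser: it assumes values do not themselves contain text of the form ` word=`, and it drops the stray `==` markers that appear in this export. The function name `parse_cef_extension` is illustrative.

```python
import re

# key=value pairs; a value runs until the next " key=" or the end of the text.
_PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|\s*$)", re.S)


def parse_cef_extension(raw):
    """Return the key=value extension fields of one CEF message as a dict."""
    # Everything after the pipe-delimited header; the header text in this
    # export contains no '=', so leftover header words are simply ignored.
    extension = raw.split("|", 7)[-1]
    fields = {}
    for key, value in _PAIR.findall(extension):
        # Remove a trailing " ==" marker (preceded by whitespace) but keep
        # "==" that is part of a value, such as base64 padding.
        fields[key] = re.sub(r"\s+==$", "", value).strip()
    return fields
```

For the message quoted above this yields `shost` = ServerA, `dhost` = ClusterA, `duser` = BladeLogicRSCD and `cs4` = 0xc000006a, i.e. a wrong-password attempt rather than a logon against an already locked account.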

                 

                Here you can see that ServerA is trying to log on to ClusterA with the BladeLogicRSCD user. The big question now is, why would it try to do this?

                 

                The job that provokes the behavior runs against ServerA, which has nothing to do with ClusterA except that it's in the same OU. I don't have a clue. I discussed it with some specialists from that OU, with Windows colleagues, and with anyone else concerned … same result.

                 

                Do you have any idea?

                 

                I will open a case with the information from this thread and hope we can solve this issue. Until now I thought it was something specific to our environment, but the problem Jim has sounds exactly like ours. Excuse my English. I hope you could follow the problem description.

                 

                Best regards,

                 

                Achim

                • 5. Re: Agent user locking on Windows cluster nodes
                  Jim Campbell

                  Sounds like the same behaviour.  I have had no luck replicating this on the one sandbox cluster I have available - the user has never locked on the cluster nodes at any point while this seems to be common behaviour on several other clusters.  At present I can see the user is locked on 2 nodes of a 4-node cluster and ( for now ) all of the other cluster nodes in our environment are functional.  It is especially difficult to troubleshoot as we typically only use clustering in production environments so I have yet to observe the behaviour on a server on which I can experiment.

                  • 6. Re: Agent user locking on Windows cluster nodes
                    Bill Robinson

                    when you see this:

                    "Here you can see that ServerA is trying to log on to ClusterA with the BladeLogicRSCD user. The big question now is, why would it try to do this?"

                    is that for serverB's BladeLogicRSCD account or serverA's ?

                     

                    is the rscd service part of the clustering ?

                    • 7. Re: Agent user locking on Windows cluster nodes
                      Achim Hilker

                      is that for serverB's BladeLogicRSCD account or serverA's ?

                      ServerA's BladeLogicRSCD account locks the account on ClusterA. The BladeLogicRSCD account on ServerA still works after that. There is no ServerB in this scenario.

                      is the rscd service part of the clustering ?

                      No, the rscd service is local and not part of the clustering.

                      • 8. Re: Agent user locking on Windows cluster nodes
                        Bill Robinson

                        why is the cluster relevant here?  the account should be local to the node right?  each node has its own local account.  what is part of the cluster in this case ?

                        • 9. Re: Agent user locking on Windows cluster nodes
                          Achim Hilker

                          That's right, and that is the part I really don't understand.

                          The cluster should not be involved. The cluster name is mentioned neither in the deploy job nor in the targets. The only thing in common is the OU of the business unit concerned.

                          But the logs say that the server tries to log in to the cluster at the exact moment the deploy job runs against the server. And I have observed this behavior at least 5 times now in completely different constellations (different clusters/OUs/servers).

                          • 10. Re: Agent user locking on Windows cluster nodes
                            Bill Robinson

                            What kind of cluster is it ?  sql ? file server ?

                            • 11. Re: Agent user locking on Windows cluster nodes
                              Achim Hilker

                              In the case I described above, there is an application running. For HA we have 2 sides with 4 nodes each. One node is active on each side, with a drive mapped. The problem occurs on both sides, on the active node only.

                               

                              But it happened on a file server as well. I'm not sure what the rest do; I assume application hosting. We mainly use Linux for databases. We only have some MSSQL on Windows clusters, but I have never observed the problem there.

                              • 12. Re: Agent user locking on Windows cluster nodes
                                Jim Campbell

                                My theory on this is that when a bladelogic process on a node is spawned something external to bladelogic may be causing the user to attempt to map a drive to another node in the cluster.  Since both nodes have a local 'bladelogicrscd' user if this occurs the node will attempt to use its own bladelogicrscd user password.

                                 

                                We have observed this behaviour in the past when creating packages that attempt to install from Windows shares - if the installation process creates e.g. a regkey for the source it will be something like \\repository_server\share_name_here\Installation_source .  Then in the future when the bladelogicrscd user 'logs on' via some job part of the login process may entail attempting to access the remote source for whatever it was that was previously installed ( a DB2 client in our case ).  Normally this would simply fail but if both servers have a local user with the same name the client will attempt to log in to the remote server that has a local user with the same name using its own password ( and if the password is the same this will actually work even though both users are local users ).

                                 

                                I suspect this could be worked around by renaming the local user to be different on all of the nodes but have really wanted to find a smoking gun to fully explain the problem.

                                • 13. Re: Agent user locking on Windows cluster nodes
                                  Bill Robinson

                                  that may be the case - we've certainly seen issues where someone runs a net use in a bldeploy and it first tries to connect w/ the local creds for the BladeLogicRSCD account, no matter if you specify the user and password args to net use.  using an AP for this is one way around it, another way would be to rename the bladelogicrscd account w/ the reg change method on the server w/ the share.

                                   

                                  what's odd about achim's situation is that the failed auth attempts seem to come from the server itself, not some other system.

                                   

                                  in arcsight what do dhost and shost mean ?

                                  dhost=ClusterA duser=BladeLogicRSCD == shost=ServerA =

                                   

                                  do you have the actual windows event log entry ?

                                  • 14. Re: Agent user locking on Windows cluster nodes
                                    Achim Hilker

                                    dhost means destinationHost and shost means sourceHost.

                                     

                                     

                                     

                                    I will try to get the logs tomorrow

                                     

                                    Maybe you misunderstood me, or my example names were just bad. But the server is not part of the cluster. The job runs against the server as the only target. The job copies stuff from the BL repo to the server. That's it.

                                    The server then tries to log in to the cluster for absolutely no reason. The cluster is not in the targets and not hidden in any NSH scripts. From my BladeLogic point of view there is no relation between the job/server and the cluster.
