The basic idea is that most customers have three groups of people: platform owners, Subject Matter Experts (SMEs), and end users.


The platform owners, like in any utility, keep the lights on and troubleshoot basic infrastructure and performance.  In most customers, they are the primary members of the BLAdmins or equivalent role, have local administrative access (root/Administrator) to the Application servers and other infrastructure, and are ideally expert-level systems administrators and automation engineers in their own right.  They directly support the Subject Matter Experts when those experts run into trouble using the tool.  Any subject-specific product questions they can't answer will likely end up in a support ticket.  They rarely have deep knowledge of every use case, but they should usually be able to answer mid-level questions about how to use BSA to accomplish a given task, and to explain the basic idea of what BSA does and does not address for a given use case.  Most customers have at least two people in this role who back each other up.  When there's only one person in this role, it's relatively easy for them to burn out.  It's also good practice for the team to have expertise in both UNIX and Windows between its two or more members.  In environments with more than 25,000 servers under management, platform ownership should probably be the responsibility of more than two people: not necessarily because of workload (which doesn't appear to scale linearly), but as a hedge against turnover and knowledge loss.


At the 25,000+ server level, RBAC administration is usually at least a half-time job for someone.  With much less attention than this, the RBAC design and implementation tend to lag the business by several years, which leads to challenges, as we've seen.  At the 50,000+ server level, RBAC could be a full-time job for someone.


The platform owners are usually backed up by one or more people who maintain agents.  At some customers this is the responsibility of either the entire L2 operations team or key members of that team (as they tend to have the credentials to access servers out-of-band via ssh, RDP, VMware, lights-out or physical consoles, and can pull in their team members as necessary).  In some cases this role is played by the owners of an OS platform (the "Linux Team"), or by the team that owns and maintains a given set of servers, often aligned with OS platform, project, or line of business.  In all cases, there needs to be an effective and simple process to deploy, onboard, maintain, and troubleshoot agents.  Ideally this is backed by some sort of agent compliance or health checking, but this is not common in most customers.  This is one staffing area that basically scales linearly with the number of deployed agents: I suspect that it's not uncommon to have one person tasked with this, but not dedicated to it, per 2,000-5,000 agents.  With a 14-20% annual server replacement rate (reflecting a 5-7 year server lifecycle), and 1-3% of servers down at any given time for maintenance of one sort or another, it's quite possible that at this staffing level each admin will be asked to look at 20-150 agents per week to diagnose why they're "not visible in BSA".  At scales approaching 25,000 servers, this should be fairly automated, with an agent monitor as standard deployment, and some correlation to OS availability.  It's not uncommon to see BAO workflows doing the first pass of this work in larger environments.  In small environments (fewer than 5,000 servers), agent health checking is more commonly done as an ad-hoc activity.
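
As a rough illustration of how the replacement rate and the share of servers down at any given time add up to a weekly triage load, here is a minimal sketch.  The function name and the simple model behind it (servers replaced during the week, plus servers down at a point in time) are assumptions made for illustration; they are not a formula from BSA or from any particular customer.

```python
def weekly_agent_triage(agents_per_admin, annual_replacement_rate, fraction_down):
    """Rough estimate of agents one admin may be asked to look at in a week.

    Assumed model (illustrative only): servers replaced during the week,
    plus servers that are down at a given point in time.
    """
    replaced_per_week = agents_per_admin * annual_replacement_rate / 52
    down_now = agents_per_admin * fraction_down
    return replaced_per_week + down_now


# Low end: 2,000 agents per admin, 14% annual replacement, 1% down.
low = weekly_agent_triage(2000, 0.14, 0.01)
# High end: 5,000 agents per admin, 20% annual replacement, 3% down.
high = weekly_agent_triage(5000, 0.20, 0.03)

print(f"low end: about {low:.0f} agents per week")
print(f"high end: about {high:.0f} agents per week")
```

Under these assumptions the estimate comes out at roughly 25 and 169 agents per week, which is in the same rough range as the 20-150 quoted above.  Most of the load comes from servers that are down rather than from replacements, which is one reason correlating agent status with OS availability pays off at scale.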


The Subject Matter Experts have a fair amount of power within their subject area, and at least enough RBAC access to do their job comfortably.  Without enough access here, the Experts will become frustrated, and decrease their use of the tool in favor of anything that will give them freedom to get the job done with minimum friction, even if it is a substantially lower-quality or less capable alternative (think Puppet or ssh-keys).  For infrastructure-heavy use cases like Provisioning and Patching, they may have a fair amount of access to the core platform, potentially including provisional membership in BLAdmins or an equivalent role.  There tend to be at least one or two SMEs per use case, and one per use case and platform for Patching and Provisioning.  In environments with 25,000 servers or more, there tends to be at least a dedicated SME for Reporting, one for each Patching platform, one for each Provisioning platform, and at least a couple for Compliance (usually aligned to UNIX and Windows).  The SME will usually also be the person who carries the responsibility for executing the business's needs in their given subject area.  This ensures that those doing the work are working from the latest requirements.  As to support, the SMEs may open their own tickets on subject-area issues, but often work with the Platform Owners for infrastructure issues.  For business-critical use cases or large environments (25,000+), there are usually several SMEs in a given subject area, sometimes aligned with content creation vs. use case execution or operations.  Report execution and quality control tend to be at least a part-time job in any environment with a critical use case (like Compliance for highly regulated industries).

In organizations where there are separate Build and Run or Engineering and Operations groups, the Subject Matter Expert may be in the Build/Engineering role, or there may be another Operational owner for executing the use case.  This is especially common around Patching and Provisioning use cases.  In the case of Patching, since it is often executed in narrow windows of time as a large-scale activity (2-3 days per month on Windows, less frequently for UNIX), the Operational owner will often supervise a small group of end users that execute Patching and audit/QC the results.  Typical ratios here range from 125 servers per patch administrator at the low end to over 5,000 servers per patch administrator for well-organized enterprises.  The total number of Subject Matter Experts can be as low as 2-3 for small customers, or as many as 25 or more for large environments with several use cases in play.
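
To make the spread in those patching ratios concrete, the sketch below turns a servers-per-administrator ratio into a headcount.  The fleet size of 25,000 is a hypothetical example and the function name is made up for illustration; only the two ratios come from the text above.

```python
import math


def patch_admins_needed(servers, servers_per_admin):
    """Headcount needed at a given ratio, rounded up to whole people."""
    return math.ceil(servers / servers_per_admin)


fleet = 25000  # hypothetical fleet size, chosen only for illustration

# Ratios quoted above: 125 servers per admin (low end),
# 5,000 servers per admin (well-organized enterprise).
at_low_ratio = patch_admins_needed(fleet, 125)
at_high_ratio = patch_admins_needed(fleet, 5000)

print(f"at 125 servers per admin: {at_low_ratio} patch administrators")
print(f"at 5,000 servers per admin: {at_high_ratio} patch administrators")
```

For this hypothetical fleet the two ends of the range work out to 200 and 5 patch administrators, a forty-fold difference that comes from how well the patching process is organized rather than from the number of servers.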


The end users who are focused on a single use case should ideally work primarily with their Subject Matter Experts for guidance and initial troubleshooting, since the SMEs have the highest shared context around how the organization is working a given problem or use case.  In environments with more than 10,000 servers, end users are unlikely to be platform experts or to have access to the core platform, so basic platform troubleshooting usually goes through the SMEs or through the local product expert.  There may be hundreds or thousands of end users, so it's important to have enough support staff to answer their questions and address their requests.  Otherwise, the platform owner team will quickly become overwhelmed and frustrated, and quality of service will degrade.