2 Replies Latest reply on Apr 5, 2012 5:16 AM by Mike Hibbert

    Scaling BladeLogic - Architecture for High Performance & Availability

      1. Introduction

      With the release of BladeLogic 7.4.x, our ability to scale has grown dramatically. By using multiple application servers, our clients can achieve performance levels impossible even six months ago.

      2. Overview

      The following specifications are based on our experience at other customers who are also managing several thousand servers. It is important to note, however, that every BladeLogic system will behave differently, depending on how it is used. The number of servers, the number of jobs, the frequency at which they are run, the amount of data they return, etc., all affect the specifications. It is therefore recommended to monitor how resources are affected as the use of BladeLogic increases.

      In order to spread the workload, BladeLogic offers the ability to run multiple application servers in parallel. This means for instance that one server can be used to handle interactive user sessions, while other servers can be dedicated to running jobs. This mechanism can be implemented by installing the Application Server on multiple physical machines, but also by running multiple instances of the Application Server on one machine.

      The key to achieving high performance is to scale the number of application servers based on the number of agents and the performance of each app server. More importantly, bottlenecks in the overall solution must be considered: the network interconnect, depot performance, and the database back end.

      A “rule of thumb” is to assign one application server to every 500-1000 agents. This is the formula which has been used for our three biggest customers, with great success.

      For instance, if you intend to manage 2000 servers, you should use two to four application servers. These application servers do not have to be on discrete physical hardware; you may opt to deploy four application servers on a single hardware instance. The drawback to using a single host is that it becomes a “single point of failure.” Therefore, two or more physical hardware instances are recommended.
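      The rule of thumb above reduces to simple arithmetic. The following is a minimal sketch in Python; the function name and parameters are illustrative only and are not part of any BladeLogic tool.

```python
import math


def app_server_range(agent_count, low_agents_per_server=500, high_agents_per_server=1000):
    """Return (fewest, most) application servers suggested by the rule of thumb.

    The 500-1000 agents-per-server band comes from the text above;
    everything else here is an illustrative assumption.
    """
    if agent_count <= 0:
        raise ValueError("agent_count must be positive")
    # Fewest servers: each one carries the upper end of the band.
    fewest = math.ceil(agent_count / high_agents_per_server)
    # Most servers: each one carries only the lower end of the band.
    most = math.ceil(agent_count / low_agents_per_server)
    return fewest, most


print(app_server_range(2000))  # (2, 4)
```

      The result is only a starting point; hardware, operating system, and workload can shift it in either direction.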

      The BladeLogic Application Tier

      Starting with our rule of thumb of using one application server for 500-1000 agents, we can refine the architecture based on the hardware type, operating system, and managed server workload.

      With regard to hardware type, a modern hardware architecture utilizing state-of-the-art CPUs can reduce the number of app servers required. For instance, a contemporary server outfitted with a Core 2 Duo based Intel Xeon processor will perform better than a 2006 model using Intel's legacy Netburst architecture.

      The biggest impact of the operating system is due to restrictions on the JVM. At the time of this document, BladeLogic uses a 32-bit JVM. More importantly, only Sun Solaris can utilize close to the theoretical 4 GB limit of a 32-bit JVM. Windows can only use a small fraction of the 4 GB limit, approximately 1 GB. Before making an architectural decision, consult with BladeLogic support, as JVM limitations may change in future releases.

      Application Server Profiles are a method to run multiple Application Servers on a single host. Application Server Profiles attempt to address the following issues with existing Application Server deployments:

      • An Application Server using a 32-bit JVM can only address a fixed amount of memory (the Java VM heap limit), and a single host cannot run more than one Application Server; therefore, most memory on a large server goes unutilized.
      • There is no built-in mechanism to define which parts of an Application Server are necessary for particular types of deployments (e.g. there is no standard way to set up a server just for running jobs).

        In summary, the use of one app server per 500-1000 agents is a rule of thumb. This recommendation can be scaled based on operating system: Solaris can utilize more memory than Windows and is therefore less prone to memory limitations in an application server instance (Windows can utilize approximately 1 GB, and Linux 1.5 GB). At the time this article was written, the CPUs available for x86 and Linux application servers offered superior clock speeds; this can “level the playing field” in terms of BladeLogic performance. The use of multiple application servers and multiple hardware instances is highly recommended to leverage modern server resources.
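        To illustrate how the per-OS heap limits interact with host memory, here is a rough Python sketch that estimates how many 32-bit application server instances a host's RAM could hold. The Windows and Linux figures are taken from the text; the Solaris figure of 3.5 GB and the 2 GB reserved for the operating system are assumptions chosen for illustration. A real JVM also needs memory beyond its heap, so treat the result as an upper bound.

```python
# Approximate usable heap per 32-bit JVM, in GB.
# Windows and Linux values come from the text; the Solaris value is an
# assumption standing in for "close to the theoretical 4 GB limit".
APPROX_HEAP_GB = {"windows": 1.0, "linux": 1.5, "solaris": 3.5}


def max_instances_by_memory(host_ram_gb, os_name, reserved_gb=2.0):
    """Estimate how many instances fit in RAM if each uses its full heap.

    reserved_gb is a hypothetical allowance for the OS and other processes.
    """
    heap_gb = APPROX_HEAP_GB[os_name.lower()]
    usable_gb = host_ram_gb - reserved_gb
    if usable_gb <= 0:
        return 0
    return int(usable_gb // heap_gb)


for os_name in ("windows", "linux", "solaris"):
    print(os_name, max_instances_by_memory(16, os_name))
```

        Memory is only one constraint; CPU, I/O, and the per-instance agent load will usually cap the instance count well before RAM does.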

        Recommended Architecture for a 25,000 Agent Deployment

        Using six application servers to interface with thousands of BladeLogic RSCD agents is a minimum configuration. We can leverage the memory and CPU power of modern hardware in one of two ways. The simplest method is to run multiple application server instances on each server. By running a minimum of four instances on each of the six servers, we scale the solution to a total of 24 application server instances. The other option divides each physical server into a minimum of four virtual instances via virtualization (VMware VMs, Solaris Zones, or Xen virtual machines). Four is a minimum; even more instances could be supported.
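        As a quick check of the arithmetic, the sketch below computes the instance count and the average agent load per instance for a 25,000-agent deployment on six hosts. The function name is illustrative, and the eight-instances-per-host case is a hypothetical example of running more than the minimum of four.

```python
def instance_load(total_agents, hosts, instances_per_host):
    """Return (total instances, average agents per instance)."""
    instances = hosts * instances_per_host
    return instances, total_agents / instances


# Four instances per host is the minimum named in the text;
# eight per host is a hypothetical larger configuration.
for per_host in (4, 8):
    instances, load = instance_load(25000, 6, per_host)
    print(f"{instances} instances, about {load:.0f} agents each")
```

        With four instances per host, the average load sits just above the upper end of the 500-1000 band, which is one reason to treat four as a floor rather than a target.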

        Note that BladeLogic does not currently support the installation of the Application or Reporting servers in Solaris zones. Each application server, as well as the reporting server, should be a dedicated physical machine.

        In this scenario, the primary advantage of virtualization is security. Because BladeLogic has built-in support for running multiple simultaneous application server instances, the use of Solaris zones does not offer a performance or latency advantage. In fact the additional overhead of Solaris zones could contribute to a small performance degradation.



        The Database Tier

        The database server can typically be used to host both the Core BladeLogic and the Reporting database. In this case, however, it is recommended to use two separate, dedicated servers. Clustering technologies such as Oracle RAC, Veritas Cluster Server or HP Service Guard are highly recommended. One of the key factors is the number of transactions per minute they can handle (see table below for recommended values).

        Finally, a file storage area, used to store depot objects too big to go into the database, must be accessible from all application servers. The BladeLogic file server can become a single point-of-failure unless redundancy is considered. A 200 GB NAS or SAN device is recommended at minimum.

        The Network Tier

        A key function of BladeLogic is the deployment of software packages. Deployments can place a burden on the network, due to the file content. Therefore, the network must be designed and segmented in a fashion which maximizes throughput and controls latency. In an environment with hundreds or thousands of servers, Gigabit Ethernet is a minimum.
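        To get a feel for why link speed matters, the following Python sketch estimates how long it takes to push one package to many targets through a single shared link. The 70% link efficiency and the example package size and target count are assumptions for illustration; the model ignores parallelism limits, protocol details, and any caching or staging.

```python
def transfer_seconds(package_mb, target_count, link_mbps, efficiency=0.7):
    """Estimate seconds to send a package to every target over one shared link.

    package_mb is in megabytes and link_mbps in megabits per second.
    efficiency is a hypothetical fraction of nominal bandwidth actually achieved.
    """
    total_megabits = package_mb * 8 * target_count
    return total_megabits / (link_mbps * efficiency)


# Hypothetical example: a 100 MB package sent to 1000 servers.
for link_mbps in (100, 1000):
    minutes = transfer_seconds(100, 1000, link_mbps) / 60
    print(f"{link_mbps} Mbps link: about {minutes:.0f} minutes")
```

        Under these assumptions the same deployment takes roughly ten times longer on Fast Ethernet than on Gigabit Ethernet, which is the gap the recommendation above is meant to avoid.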

        The Storage Tier

        With thousands of agents in play, the BladeLogic depot could become a significant bottleneck unless the storage solution is given serious thought. Every aspect of the storage solution should be optimized, including the cache on the SAN/NAS, the connection to the servers, the configuration of the fabric, and the disks on the storage frame. Cutting corners on the storage solution will seriously impair overall performance, due to the nature of deploying software via BladeLogic.

        3. Recommendations

        These recommendations apply to a Solaris/Oracle environment. All CPUs should be the fastest available.

        Option 1 – Six Application Servers, 24-48 application server instances


        Option 2 – Four Application Servers, 24-48 application server instances