

So, someone else owned (BladeLogic) Server Automation in your environment.  They left/quit/got fired/got promoted/moved to New Zealand to raise sheep.  Now you own Server Automation.  Your company sent the other people (let's call them Bob and Kelly) to training.  They got the knowledge transfer, they had the product expertise, they knew a support person by name. 

 

However it happened, now you own it, and your manager wants to know when you'll have that inventory report working, and what you want to do about setting up Disaster Recovery in the shiny new data center.

 

Where do you start?

 

Two easy places to start. The first is the rapidly-wiki-izing documentation (all of those links that say docs.bmc.com). The links below include the basic documentation and a few of the Best Practices that have been developed recently; start with Deployment Architecture to understand some of the moving pieces. The second is the list of "Howto" videos that have been developed over the last few months: basic walkthroughs that show you how to build your own self-contained "sandbox" environment of BSA on a Windows VM.

 

• BSA 8.2 base documentation: https://docs.bmc.com/docs/display/bsa82/Home

• Deployment Architecture: https://docs.bmc.com/docs/display/bsa82/Deployment+architecture

• Sizing and Scalability: https://docs.bmc.com/docs/display/bsa82/Sizing+and+scalability+factors

• Disaster Recovery and High Availability: https://docs.bmc.com/docs/display/bsa82/High+availability+and+disaster+recovery

• Large Scale Installations: https://docs.bmc.com/docs/display/bsa82/Large-scale+installations

 

Howto Videos:

 

• Initial Install – Database Setup: On BMCdocs YouTube at http://www.youtube.com/watch?v=91FEUDVD6sE

• Initial Install – File Server and App Server Installs: On Communities YouTube at http://www.youtube.com/watch?v=m7Y3SY23kuQ

• Initial Install – Console GUI and Appserver Config: On Communities YouTube at http://www.youtube.com/watch?v=uwqlj60Lvo0

• Compliance Content Install: On BMCdocs YouTube at http://www.youtube.com/watch?v=bXdaogDsCNc

• Compliance Quick Audit: On BMCdocs YouTube at http://www.youtube.com/watch?v=i8BLi4WAWEY

• Setting up Compliance – Discovery Jobs:

• BSA 8.2 Patching - Setting Up a Windows Patch Catalog: On Communities YouTube at http://www.youtube.com/watch?v=nfpFpOuub9k

• Windows Patch Analysis: On Communities YouTube at http://www.youtube.com/watch?v=ODWhC01uEaQ

• Patching in Short Maintenance Windows with BMC BladeLogic Server Automation: On Communities YouTube at http://www.youtube.com/watch?v=o6Lfzbb3JZg



Common wisdom says you should buy a vehicle that will meet your needs 90% of the time, and rent whatever you need for the rest. Since my car spends 90% of its time at an airport garage or driving to an airport, I've got a small car that's comfortable for me and one other (and will seat four in a pinch).

 

So it is with automation tasks. I will commonly build out install packages for the 90% use case: SQL Server installation is a common task which, while it must be executed correctly and ideally the same way every time, is not terribly interesting once you've done the initial configuration. There are a dozen other middleware components, a dozen agents, a dozen common configurations: changing the name of the built-in Administrator account, for example. While all of these are "common" tasks, I often end up talking to people trying to automate the most complex configurations in their environment, or seeking tools to address 100% of the tasks at hand.

 

I share their interest: I rarely want to bother with the easier tasks in a given environment. They're boring, and once you've installed Oracle 11g a couple of times, there's really not much left in it that's interesting, unless you're trying to do something completely different (like stand up a 3-node super-HA RAC). All of that said, I've been noticing lately how easy it is for us to set aside the basic work that needs to be done, the first 80-90% of automation tasks: setting up the various patching, security, regulatory, or build compliance audits, or building provisioning or software deployment packages.

 

Instead we tend to focus on whether that last remediation instruction is exactly correct, on whether one condition in particular works correctly on 100% of the systems.  Unfortunately, that last 10% seems to cost as much (time, money, resources) as the first 90%.  A customer I know of has a metric on their software installs: every time one of their senior resources doesn't have to spend an hour staring at billboards while a given software package installs, they add $40 to the automation bucket, and at the end of the year they total it up.

 

Now, it's not fun or exciting to set up and maintain the "first 90%" jobs, but they get the job done, and you'd be surprised how much they'll save your organization over a year.  If you want to know exactly how much, just set up a quick report to measure the number of runs over the course of a year.  Then you'll know how much time you freed up to work on the "more interesting" tasks.
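
If you want a rough sense of the payoff before you build the report, a back-of-the-envelope calculation like the sketch below gets you in the ballpark. All of the job names, run counts, and rates here are made-up placeholders; the real numbers should come from your own job-run report and your own loaded hourly cost.

```python
# Back-of-the-envelope estimate of what the "first 90%" jobs save in a year.
# Every figure below is a placeholder; plug in the numbers from your own
# job-run report and your own hourly rate.
jobs = {
    # job name: (runs per year, admin-hours saved per run)
    "SQL Server install package": (120, 1.0),
    "Windows patch deployment":   (52 * 40, 0.25),   # 40 servers per weekly window
    "Build compliance audit":     (365, 0.5),
}
hourly_cost = 40.0  # dollars per admin-hour, per the customer's metric

total_hours = sum(runs * hours for runs, hours in jobs.values())
print(f"Hours handed back to the team: {total_hours:,.0f}")
print(f"Value at ${hourly_cost:.0f}/hour: ${total_hours * hourly_cost:,.0f}")
```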



Today I was onsite at a customer that holds a special place in my heart, because they were the first POC I ever went to as a new pre-sales technical guy.  I like to go see customers, help them work through whatever's going on, and help them figure out how to get more mileage out of the software they've already got in house.  They also provide some of the best feedback for what we could be doing better.  They have the most direct, timely, and honest advice: they're in the thick of the day to day operations, and will tell you how well a given feature is working for them. 

 

Over the last couple of years, I've had a variant of this type of discussion with a number of BMC Server Automation (BSA) users, where we either start with a question, or with "how do I do this thing I'm trying to do?"  The good news is that there are lots of use cases out there: different things you can do with BSA.  It's much more than your pocket Swiss Army knife, but like any decently advanced bit of machinery, sometimes it's hard to figure out -what- to do with it.

 

Fortunately (for both of us), our conversations tend to start with "hey, I'm trying to do X, there's got to be a better way."  Working back from "X" generally yields a clear, simple business requirement.  Sometimes the user is trying to dump out a list of software or some other config item, what I tend to think of as the "survey" phase, that will later be cross-checked against a list of supported/allowed versions.  The business requirement is that they've been bitten enough times by old agent versions lying around, and now they want to get more serious about updating that last 5%.  Historically, it's been fairly easy for us as systems admins to run some command on a bunch of machines, dump out some info, then fold/spindle/mutilate it into some basic statistics and determine how much work we have (or how much we can fob off on the overnight crew).
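
As a rough illustration of that survey-and-cross-check step, here's a small sketch. The CSV file name, its columns, and the allowed version list are all assumptions for the example; in practice you'd feed it whatever your inventory command or report actually dumps out.

```python
# A minimal sketch of the "survey" phase: cross-check a dump of agent
# versions against an allowed list and see how much work is left.
import csv
from collections import Counter

ALLOWED_AGENT_VERSIONS = {"8.2.01", "8.2.02"}   # assumption: your supported list

out_of_date = []
versions = Counter()
with open("agent_inventory.csv", newline="") as f:   # hypothetical columns: host,agent_version
    for row in csv.DictReader(f):
        versions[row["agent_version"]] += 1
        if row["agent_version"] not in ALLOWED_AGENT_VERSIONS:
            out_of_date.append(row["host"])

print("Version spread:", dict(versions))
print(f"{len(out_of_date)} hosts still need an agent update")
```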

 


 

Unfortunately, doing it that way is only so repeatable, and even if we very effectively automate -that- task, it doesn't usually scale for others: not everyone's going to be able to re-use the scripted command lines that worked for me.  Which brings us back to build compliance.  It sounds so straightforward: everyone who's been building servers for any amount of time has some kind of build standard, be it scribbled on a legal pad, documented in a spreadsheet, scripted into a set of post-provisioning command lines, or fully automated with checks and packaged remediations. 

 

The catch with build compliance is that it changes over time.  While a given agent version, software version, or name service configuration is valid this month, it will be different in six months, and it may well have been different six months to a year ago.  If the servers built 3 years ago were never updated, they're not going to look much like today's servers.  That makes them much more expensive to support, because now you're depending on tribal knowledge: on the admin who was working on them years ago (and still remembers), or on someone new who will take more time to figure out "how we used to do it."  Wouldn't it be better to bring them up to date and keep them in compliance, if we had the capability to do so?  To drive compliance from one place, one policy, that could be easily updated and quickly show where the new gaps were?

 

I worked with a lead developer at an e-commerce site whose motto was to never get bitten by the same customer-facing problem twice.  If something could have been averted in advance, or was a silly error, why not check for it in the future?  His goal was to add a monitor or a check, or change the process, anywhere it would make sense to catch that particular failure.  These checks and monitors needed to be completely automated, because anything manual just wasn't going to get done.

 

What are you doing manually today, and what would you do, if it were easier to survey your environment, compare and validate correct configurations, or automate the repair of your production environment?  What if it were something you could teach the new guy how to do, and he could be doing it this afternoon?



I probably interact with 80 customer users a year.  Ideally, I visit them when things are going well, talk about current state, talk about new things they want to do, how to use what they've purchased, and how to free up resources from firefighting for project work.  Lately, however, I've been visiting a number of customers who have been running for a year or two and are hitting some challenges – particularly with people changing jobs.  When those people are the linchpins that ensure the success of a given project, these projects are suddenly at risk.  Often the backup person hasn't been working on the project nearly as regularly or intensely as the primary was, or their other duties have kept them out of the loop.  This is a risk with almost any project: the only way to know the subtleties is to have been in the thick of them.

 


 

One of the challenges of BMC Server Automation (BSA) being so flexible a platform is that there's often more than one way to go about addressing a problem.  Historically, figuring out the best way to do these things has involved a services engagement to consult with the customer and understand the details of their needs, occasional spot help from a software consultant, or tickets filed with Support.

 

To help with this process, I'm starting to collect our internal best practices and lessons learned at customers, and rolling them up into both written papers and videos that talk through common ways to solve problems with BSA.  I've got a few years working with the product, I know a bunch of experts here, and I've been soliciting ideas. I'm not saying they'll be perfect, and I'm not saying you still can't stub your toe if you do what I'm doing.  What I can tell you is that they'll explain how I, and others like me, commonly approach a typical data center problem like PCI FIM Change Tracking, Solaris Patching, or Granular Access Control for Non-Admins.  I will happily take feedback and amend them over time (perhaps with kung-fu style dubbing).  I'm also planning to include the more common questions and issues that tend to come up.


The ideal audience for this will be someone who already knows a little about BSA, but wants to experiment with something in a new functional area (like Compliance, or Change Tracking, or Patching) to learn more about how it works, and have enough understanding to talk about it to their own teams.

This won't replace training, education, services and consulting, or simply talking to your local sales rep or software consultant about how the products can help address your organization's challenges. My expectation is that if it's easier to figure out how to use the product, more people will use it.  I'll also be making these videos available internally for our own people to use, whether they're a new software consultant or developer, or the veteran who wants to learn more.

These will definitely be available through communities.bmc.com, and I'm planning to make them available in a video podcast or portable-device-friendly format.

If there's something that took you (as an end user, software consultant, services consultant, or customer) longer to figure out than it should have, or something that you're spending a lot of time explaining to your coworkers, please let me know at sean_berry@bmc.com. I'll add it to the list, or prioritize it higher.

I'm also taking requests for theme songs. Is "Take 5" too cheesy, too overdone, too 1990s and "Pleasantville"?

 




I'm always looking for something new to learn how to do, or a technique to improve something I'm already doing.  We can't sign up for new opportunities if we're spending all of our time, and more importantly our energy, reinventing the wheel or redoing the same things over and over.  In our company, as likely in yours, there are -lots- of new initiatives, plenty of changes, and new good things going on.  If we're stuck spinning DVDs, hand-coding kickstart configs, or figuring out the latest arguments to patch utilities (and documenting them somewhere so our teammates can mis-type them themselves), we're not likely to have much time to worry about what comes next; we'll be stuck staring at the trees rather than thinking about how to get out of this particular forest.

 

 

While VMware ESX and the various J2EE application servers have been around in one form or another for years, they get more, not fewer, configuration items.  And while we're bright people (hey, you're reading this, aren't you?), it can be tough to stay ahead of configurations with multiple people touching servers and applications.  The net result of policies and inventories documented only on paper or in a spreadsheet, literally out of date the day they're printed, is that we're never in compliance and our inventories are never current.  When we know an audit is due, we scramble, all hands on deck, and work until the machines are near enough to compliant to pass.  And the next day...

 

 

The greatest loss in this situation is not the compliance gap, the hundred servers, the hundred -different- servers in a given pool, or our aging inventory spreadsheet.  The greatest loss is the project work, the new revenue our company can't go after, all because we're spending our prime troubleshooting and problem-solving energy... on policies we defined long ago, but didn't have an effective way to survey, collect, and enforce.  What project were you working on this week that slipped to next week because of yet another production outage?  What could you have been doing instead?



So you've got some reasonable, enterprise-class problem to solve: say, patching 1600 Windows servers on Patch Tuesday, or deploying a specific configuration change across 1100+ WebSphere instances in under an hour.  How do you (a) figure out how long it'll take, and (b) figure out whether you've got the infrastructure to do it?  I'll take the long way to get there, but perhaps it'll help to think through the different parts of a real-world example.

 

When I first started working in the BladeLogic Services group, one of my tasks at the customer I worked with regularly was patching the Windows server environment on six continents within a couple of days of new patches being released.

 

There are three major parts to this problem.  The first is older, less reliable machines that don't always completely restart.  Of course, if you're patching the system, the app owner looks to you to explain why their application, which has been running "fine" up until now, suddenly no longer works.  The second challenge is getting payloads out to these machines in a timely fashion: for some systems, we may need to transmit several hundred MB if a large patch or a new service pack needs to go out.  The third is executing these changes quickly, so that we're not keeping production systems out of service any longer than necessary.  Other challenges include the fairly basic ones inherent to any automation system: tracking newly built, imported, and decommissioned servers, and determining all the ways a system can become unavailable for the duration of the patching window.

 

The first challenge is relatively easy: on Windows, even more so than on UNIX, services expect reboots and are usually configured to start at boot, so a reboot will rarely cause a hardship.  Even so, a reboot executed without other changes, the weekend or window before the one we want to patch in, can greatly simplify the patching administrator's troubleshooting during the maintenance window.

 

Payload distribution has always been a challenge in distributed environments.  Every host either pulls or is pushed its payloads, resulting in saturated network connections, long delivery times, or both.  By using a push model with relays or repeaters in designated data centers, administrators can balance the added overhead of maintaining another software component against the direct benefit of local copies over fast networks.  Payloads can be synchronized a number of different ways, but at least one copy is going to need to reach each remote data center one way or another, and then ideally be copied around over a faster, lower-cost network.
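
To see why the repeater approach matters, here's a rough, illustrative comparison. Every number in it (payload size, server count, link speeds) is an assumption, and it deliberately ignores parallel transfers and compression, so treat it as a sanity check rather than a sizing tool.

```python
# Rough comparison: push a patch payload to every server over the WAN, versus
# stage one copy on a per-data-center repeater and fan out over the LAN.
payload_mb     = 300     # e.g. a service pack (assumed)
servers_per_dc = 200     # assumed
wan_mbps       = 50      # effective WAN throughput to the remote site (assumed)
lan_mbps       = 1000    # effective LAN throughput inside the site (assumed)

def transfer_hours(mb, mbps, copies=1):
    """Serialized copies of `mb` megabytes over an `mbps` link, in hours."""
    return (mb * 8 * copies) / mbps / 3600

direct = transfer_hours(payload_mb, wan_mbps, copies=servers_per_dc)
staged = (transfer_hours(payload_mb, wan_mbps)
          + transfer_hours(payload_mb, lan_mbps, copies=servers_per_dc))

print(f"Push every copy across the WAN: ~{direct:.1f} hours")
print(f"Stage once, fan out on the LAN: ~{staged:.1f} hours")
```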

 

The execution of changes in a timely fashion is the critical piece of any successful maintenance window.  Usually there's more than one change that has to go in, usually there's more than one team, and usually it means someone's staying up past their usual bedtime.  When I was regularly executing OS patching, figuring out a way to successfully patch twice as many boxes in parallel might mean handing the system off to the next guy an hour or two early, and on an 11PM-7AM maintenance window, that might mean -they- get home -before- their kids are up for Sunday morning, or even before the bars close.

 

So what's this all got to do with Service Automation, or Server Automation?  We have customers who, every weekend, are trying to fit 10 pounds of changes into a 5-pound bag.  Through appropriate tuning of the various tools they work with, they can run only a certain number of things at any given time.  For a parallelizable task, the number of parallel threads times the wall-clock time available in a given maintenance window dictates how many actions you can take: how many things you can kick off, monitor, and close out in that window.
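
Here's that capacity math as a minimal sketch, with made-up numbers; your own thread counts, window lengths, and per-server times will obviously differ.

```python
# (parallel threads) x (window length) / (average time per server) puts an
# upper bound on how many servers you can touch in one maintenance window.
threads            = 40      # concurrent jobs the infrastructure can sustain (assumed)
window_hours       = 8       # e.g. an 11PM-7AM window
minutes_per_server = 20      # kick off, patch, reboot, verify, close out (assumed)

capacity = threads * (window_hours * 60) / minutes_per_server
print(f"Upper bound for this window: ~{capacity:.0f} servers")

servers_to_patch = 1600
print("Fits in one window" if servers_to_patch <= capacity
      else f"Need ~{servers_to_patch / capacity:.1f} windows, more threads, or faster steps")
```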

 

We worked out the number of parallel threads available out of the box in the environment we were working with, and found the basic configuration wasn't going to get our task done in the window we wanted.  So we looked at the available resources and determined we could significantly increase the number of parallel threads for this task (10x) while staying within the performance constraints of our back-end systems (which were not especially beefy).  We were able to achieve a result 8x faster than our initial design constraints: think about getting 8 hours of work done in 1, or catching the end of Saturday Night Live at midnight instead of falling into bed at 7AM.

 

Key factors in our success: objectively evaluating the workloads for how parallel-friendly they are (and on which end, the infrastructure or the endpoints, they will be most intensive); the ability to effectively measure the performance capabilities of the key systems involved; an in-depth knowledge of how these systems scale, born of long experience; the ability to effectively package change; and some practical on-site testing.

 

You too can achieve tighter maintenance windows through some basic performance assessment, using the right tools for the job, and tuning for most effective performance.  If this is something you're struggling with, reach out to me.



I read on someone else's blog that you need to do something for about 10,000 hours (that's about 5 working years) to become competent at it.  I tend to figure that gets you just good enough to get yourself into real trouble.

 

This year, one of the things I'm learning more about is flying.  I started out on a very basic remote control helicopter a couple of birthdays ago, and eventually ended up with a bag of textbooks and a very slowly growing logbook.  But the first thing I learned to do, before I even got into an airplane, was to check out the airplane, make sure it's safe, serviceable, and legal.  That's done with a standard checklist, a little bit different for every type of airplane.

 

One of the things Atul Gawande talks about in his book, The Checklist Manifesto (http://gawande.com/the-checklist-manifesto), is that there are three kinds of tasks out there: simple, complex, and complicated.

  • Simple's easy: you learn how to do it (maybe with a bit of practice), and then you can do it over and over again.
  • Raising a child or sustaining a marriage is complex: no two are identical, and not everyone else's advice or input will work for your situation.
  • However, the complicated, with many moving parts, -can- be mastered, and memory aids can help ensure successful executions every time.  That's how rockets get launched into space, with literally thousands of moving parts, participants and tasks: checklists.

 

While helping a customer stand up a particularly complicated process this week, I was reminded again of the checklist.  We run into this every week we work with real computers: two machines that were supposed to be built the same (within the same few weeks, by two different people) don't end up built -quite- the same.  They're usually built from a manual checklist that reads more like a task list: was backup software installed?  (Quick, install the backup software!)  Was monitoring software installed?  (Quick, install the monitoring software!  What version?  Where's the latest installer?  Eek!)

 

While these -can- be scripted, they usually remain in document form unless there's a fairly easy way to automate them (hint: we can help here).  The problem with a checklist that lives only on paper, combined with get-it-done personalities like mine, and like most sysadmins', is that we tend to gloss over things we "know".  "Of course I ran the script.  What do you mean the !*@# service didn't install/won't start?"  While there are better ways to catch and prevent install errors, that can be a longer road.  I want results -now-, without having to do -any- process engineering, and definitely without having to spend more than five minutes showing anyone how to do something.

 

With most customers I get a few minutes with, I'll throw together a basic Build Audit in about 5 minutes.  I'll snapshot the 5-10 most common configuration objects on that platform, then we'll do a quick audit across their test lab machines.  There will be a few things that -should- be different, but many things should be exactly the same, unless there's an upgrade going on -right- now.  And wouldn't it be nice to know which ones are done, and which are left to do, across the entire environment, in a few minutes?
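
For the concept (not the product mechanics; BSA does this with its own snapshot and audit jobs rather than a script), here's a toy sketch of what a build audit is checking: a reference snapshot of a handful of configuration objects, compared against each host, with the drift reported. The hosts and settings are hypothetical.

```python
# Toy build-audit: compare each host's configuration snapshot to a reference
# and report anything that drifted. All data below is invented for illustration.
reference = {
    "backup_agent_version":     "7.6.1",
    "monitoring_agent_version": "4.2",
    "ntp_server":               "ntp.corp.example.com",
    "admin_account_renamed":    True,
}

lab_hosts = {
    "web01": {**reference},                                        # built to spec
    "web02": {**reference, "monitoring_agent_version": "3.9"},     # missed an upgrade
    "db01":  {**reference, "admin_account_renamed": False,
              "ntp_server": "pool.ntp.org"},                       # two gaps
}

for host, actual in sorted(lab_hosts.items()):
    drift = {k: (reference[k], actual.get(k))
             for k in reference if actual.get(k) != reference[k]}
    status = "OK" if not drift else ", ".join(
        f"{k}: expected {exp!r}, found {got!r}" for k, (exp, got) in drift.items())
    print(f"{host}: {status}")
```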

 

The checklist isn't there to tell you what to do, or how to do your job: we all know -how- to do our job: most sysadmins can install a base operating system by hand more easily than almost anything else.  The checklist is just a reminder, a way for us to validate whether we did everything for this environment we were supposed to.

 

What the build audit is there for is to check the product of our build process.  Be it fully automated, script-based, image/template-based, or completely manual (Step 1: "Take the blue folder of DVDs to the data center"), there's potential for error and failure in any process.  Just as a student on their first solo flight will forget a couple of things if they don't look at the checklist at the right time (because they "know" what to do), errors tend to creep into build processes.  Those machines still usually go to production.  And once they're in production, it gets more expensive to fix something that didn't get put together quite right: it'll usually require at least a change control.

 

While I'd like -everyone- to be using a fully automated build process, not everyone's gotten there yet.  (Although I would -love- to help you get there if you're not, and am building a guide for that very process.)  In the meantime, if you have our Server Automation, Build Audits are easily within your reach, and it takes less than 15 minutes to see results.  Much like a Cessna 172, they're a gateway to bigger things (like the other kinds of Compliance Validation, and more automated software deployments).
