
Using cloud technologies can be a daunting task, especially given the ocean of choice now available to the modern enterprise. One capability worth leveraging from the multitude of cloud providers is the ability to create Big Data clusters with Hadoop for processing information in the cloud, without spending the large amounts of time and effort required to house clusters of servers on-premises. Better still, the clusters don't need to run 24/7; they only need to be spun up on demand when processing is to occur. This is something that Control-M handles easily.


The AWS Use Case


AWS offers several services for Big Data and data processing in general, but one stands out: Elastic MapReduce (EMR). With EMR, users can spin up clusters of any size with Hadoop and various other tools pre-installed. Users can not only make use of the AWS CLI toolset to perform actions in the cloud, but more importantly tie those AWS calls into the jobs they run in Control-M, whether as an embedded script, a command, or a script file. This allows total flexibility when using AWS. Control-M also integrates natively with AWS through the Cloud Control Module, whereby a Control-M Agent can interact with any AWS account and, for example, perform actions on running EC2 instances or launch new ones from templates.
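To make that concrete, here is a hypothetical sketch of what an embedded Control-M job script calling the AWS CLI might look like. It assumes the `aws` CLI is installed and configured on the agent host; the launch-template name is a placeholder, and `DRY_RUN=1` (the default here) only echoes the calls so the script can be exercised without an AWS account.

```shell
#!/bin/sh
# Sketch of an embedded Control-M job script driving EC2 via the AWS CLI.
# DRY_RUN=1 (default) echoes the commands instead of calling AWS.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

# List the instance IDs of all currently running EC2 instances.
run aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text

# Launch a new instance from a launch template (placeholder name).
run aws ec2 run-instances \
  --launch-template "LaunchTemplateName=my-worker-template" --count 1
```

Dropping `DRY_RUN=0` into the job's environment would switch the same script from rehearsal to real calls, which keeps testing and production behavior in one place.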


Big Deal


So, what does this mean for data scientists and business intelligence professionals working to leverage the power of Big Data in the cloud? Using an enterprise-grade scheduling solution such as Control-M, they can employ AWS services in a streamlined manner and use the output AWS provides to take dynamic action inside Control-M. EMR is a powerful tool that, properly leveraged, can deliver real, valuable results quickly while minimizing costs. The benefit is that the clusters only need to exist for as long as the processing takes and can then be eliminated, hence the notion of "ephemeral clusters". By processing what is needed but keeping only the results, users effectively circumvent the need to keep these machines running 24/7 while still deriving value for their company. These flows can also be triggered dynamically, as they are now service-oriented instead of regularly scheduled.




What if a series of massive files arrives every day and needs to be processed? Do you already have the steps built into your schedules to handle them? Do you want to start leveraging AWS? With some modifications, existing flows that call a local Hadoop cluster could instead make those calls in the cloud, since Control-M can run anywhere: on-premises, off-premises, or multi-cloud, depending on the need. You could not only process the files automatically with a file watcher, but also trigger the flow manually whenever you'd like to refresh data in reporting tables or generate information at a point in time. With Control-M managing these steps, you have much more power at your disposal when invoking AWS.


But don't just take my word for it...


Using Control-M, we can invoke the necessary AWS services in the right order: data is received, a cluster is instantiated and attached to Control-M dynamically, data is processed, results are sent out, the cluster is de-instantiated, and information is streamed to a dashboard. All of this is done dynamically by ordering a service from a smartphone, monitored by our Batch Impact Manager to ensure the service does not run over its SLA. Technologies at play here include the Hadoop Control Module for Control-M, our Automation API for dynamic agent provisioning and deployment, our Managed File Transfer Control Module to handle files in and out, our Batch Impact Manager job type to manage SLAs, Control-M's innate ability to pass variables and share information between jobs, and our Self-Service Portal as well as Control-M for Mobile Devices for use by business stakeholders.
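The sequence above can be compressed into a sketch of ordered phases. In a real deployment each phase would be its own Control-M job with conditions passing values such as the cluster ID between them; here the phases are shell functions, every name is a placeholder, and `DRY_RUN=1` (the default) only echoes the AWS calls.

```shell
#!/bin/sh
# Sketch: the end-to-end service as ordered phases. In Control-M each phase
# would be a separate job, with conditions carrying CLUSTER_ID forward.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

receive_data()   { echo "phase 1: MFT job pulls partner files to staging"; }
create_cluster() { CLUSTER_ID=$(run aws emr create-cluster --name ephemeral-etl \
                     --query ClusterId --output text); }
process_data()   { echo "phase 3: Hadoop CM jobs load HDFS and run Hive"; }
teardown()       { run aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"; }

# Each phase only runs if the previous one succeeded, mirroring job conditions.
receive_data && create_cluster && process_data && teardown
```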


Taming AWS


We start by ensuring that Control-M has access to the AWS CLI so it can make calls to the AWS account where we're triggering cluster builds and machine instantiation. By invoking the "aws emr create-cluster" command, we can set up a custom cluster with all the bells and whistles we'd hope for. This readies a cluster; while we wait, we invoke "aws emr wait cluster-running" to block until the cluster is up and running before proceeding to the next portion. The AWS CLI is extremely comprehensive: you can trigger almost anything in AWS with it, and you can also craft a custom Application Integrator job that does the specific things you need, effectively wrapping the calls and abstracting the runtime into the Control-M WLA or Web client.
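Here is a minimal sketch of that cluster lifecycle. The release label, instance sizing, and names are illustrative, and `DRY_RUN=1` (the default) echoes the calls rather than executing them, so no AWS account is needed to walk through it.

```shell
#!/bin/sh
# Ephemeral EMR lifecycle: create, wait until running, later terminate.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

# Create the cluster and capture its ID (in dry-run the echo is captured).
CLUSTER_ID=$(run aws emr create-cluster \
  --name "ephemeral-analytics" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --query "ClusterId" --output text)

# Block until the cluster reaches the RUNNING state.
run aws emr wait cluster-running --cluster-id "$CLUSTER_ID"

# ... processing jobs run here ...

# Tear the cluster down once the results are out.
run aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"
```

In Control-M these three calls would typically be three jobs, with the captured cluster ID passed between them as a variable.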


File Transfers


You might want to pipe some files in from outside your AWS installation, and MFT jobs are ideal here. You can leverage FTP, FTPS, and SFTP transfers to bring files in from partners or from other systems in your multi-cloud environment, and these files can ultimately be piped into AWS S3 or HDFS. What's great about the Control-M MFT module is that you can create jobs whose successors depend on files being successfully delivered for ingestion. Here I set up a simple transfer from Azure, since that's where one of our partners keeps their data, and bring it to our server to be pushed into S3. Hadoop's CM can then reference files in the S3 bucket and pull them into HDFS for processing.
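Once the MFT job has landed the partner file on the staging server, a follow-on command job might push it to S3 with the AWS CLI. This is a sketch under assumptions: the bucket and paths are placeholders, and `DRY_RUN=1` (the default) only echoes the call.

```shell
#!/bin/sh
# Push an MFT-delivered file from the staging host into S3 for EMR to consume.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

STAGING_FILE="/staging/partner_feed.csv"   # where the MFT job dropped the file
BUCKET="s3://my-ingest-bucket/raw/"        # placeholder bucket and prefix

run aws s3 cp "$STAGING_FILE" "$BUCKET"
```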


Big Data Workflows


Control-M’s Hadoop CM can install itself on the leader node of Hadoop clusters and control the flow of HDFS, YARN, Hive, Sqoop, and most other common binaries found in the Big Data world. With tracking for these kinds of executions, and the passing of information between jobs and workloads, you can effectively take a platform-agnostic stance when integrating your various disparate platforms to achieve the results you’re looking for. In this case, we’re just executing some HDFS directory-creation and movement commands, pulling data in from an S3 bucket into Hive tables, where I run some very simple analysis. The results are emailed back to me, and the cluster is wiped away.
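The steps above might look like the following on the cluster's leader node. This is a hedged sketch: the bucket, paths, and table schema are placeholders, and `DRY_RUN=1` (the default) only echoes each command.

```shell
#!/bin/sh
# Steps the Hadoop CM would drive on the leader node: stage HDFS directories,
# pull the S3 data in, and run a simple Hive query over it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

run hdfs dfs -mkdir -p /data/raw
# On EMR, HDFS commands can read s3:// paths directly via EMRFS.
run hdfs dfs -cp s3://my-ingest-bucket/raw/partner_feed.csv /data/raw/

# Declare a Hive table over the landed data (placeholder schema).
run hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS raw_feed (id INT, amount DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/raw'"

# A very simple analysis query, standing in for the real workload.
run hive -e "SELECT COUNT(*), SUM(amount) FROM raw_feed"
```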




Agents can be deployed and un-deployed as needed with Control-M, which makes this kind of ephemeral instantiation a reality. Capabilities like these make Control-M an ideal solution for staying flexible in a world where infrastructure is as malleable as putty and servers run only for as long as they’re needed, such as with containers. The Hadoop Control Module is the only CM we make that needs to be physically installed on one of the application servers, in this case the leader node of the Hadoop cluster. With the Automation API, all we do is call the necessary commands to pull the packages to the server for provisioning, bringing the agent bundled along with the CM, and it can all be detached and removed when done.
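As a rough sketch, the provisioning calls could look like this with the Automation API CLI. Assumptions to flag: the `ctm` CLI is installed on the target host, the image and server/agent names below are illustrative and vary by Control-M version, and `DRY_RUN=1` (the default) only echoes the commands.

```shell
#!/bin/sh
# Sketch: provision an ephemeral agent on the cluster's leader node with the
# Automation API, then remove it at teardown. Names are illustrative only.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

# Pull and install the agent image on this host (image name varies by version).
run ctm provision image Agent_Amazon.Linux

# ... the agent registers with the Control-M server and runs its jobs ...

# Deregister and remove the agent once the ephemeral cluster is torn down.
run ctm config server:agent::delete ctm-server leader-node-agent
```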






With support for streaming functions, you can get the information you’ve built to the dashboards that need it, all as part of the workflows you’ve created. This is ideal for visibility: you can force a refresh of information after every run, see real-time information about the data you’re collecting, and gain insight into trends that could create a competitive advantage. You can pipe this kind of data to dynamic dashboards showing data loads, throughput, areas of interest, you name it. Perhaps you’re running an ELK stack and piping information out to it so it can perform trend analysis for you.
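In the ELK case, a final job in the flow could push a run summary straight into Elasticsearch over its document API. This is a sketch under assumptions: the host, index name, and document fields are all placeholders, and `DRY_RUN=1` (the default) only echoes the call.

```shell
#!/bin/sh
# Push a run-summary document into an Elasticsearch index so a dashboard
# (e.g. Kibana) can chart it. Host, index, and fields are placeholders.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

run curl -s -X POST "http://elk.internal:9200/emr-runs/_doc" \
  -H "Content-Type: application/json" \
  -d '{"run_date":"2024-01-01","rows_loaded":120000,"status":"complete"}'
```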




All of this can be triggered by business stakeholders as services if needed, letting the workflows run on demand and refresh data as needed. Real-time insight that can be called on demand is a powerful concept, and it becomes a reality when using Control-M to manage these kinds of flows. As mentioned before: you’re on the train coming to work in the morning and see an email noting that the last file transfer you were waiting for overnight has completed. You get on VPN with your iPhone or Android device, connect to Control-M via our mobile app, and order in your EMR flow while in transit, so that when you arrive at the office and sit down with your coffee, your jobs are done and the results are delivered.




SLA Management


Take the services that were declared and made usable for business stakeholders, and tie SLAs to them. Not only can Control-M trend the runtime of these services to report on slowdowns, it can also catch problems before they occur. These features are built in, allowing long-term analysis of workloads that stretch past Big Data and out to the rest of the enterprise as well. What's great is that Batch Impact Manager, besides taking dynamic action based on a service's runtime, automatically creates a service for you when the flow is ordered in. This is key: you want to make sure that when you order your flows in, delivery times aren’t starting to creep, and BIM tracks exactly this kind of thing for you.


To Wrap Up


Since Control-M can be tailored to any situation, invoking AWS to achieve ephemeral clustering is a reality, and one that we are seeing more and more customers take advantage of. The possibility of running only what you need is a major advantage over the classic always-on paradigm of hosting data lakes. Control-M, with its massive flexibility and toolsets that allow for dynamic integration, is an ideal choice for managing any automation that occurs in the cloud.


I always like to say that the triggering of the job or the action itself is only a tiny part of what makes up Control-M. The rest of the “bread and butter” of Control-M is what really seals the deal: the on-do actions, the condition passing, the alerting, the SLA management, the ability to play between hosts, the archiving and auditability, and everything else that is brought to the table. Being a third-party arbiter of enterprise workloads is a powerful thing, and I suggest you bring some of that power over to the public cloud and start trying it out!