Runbook
Automation systems empower organizations to design, build, orchestrate, manage,
and report on workflows that support IT processes. Run Book Automation is not
limited by IT infrastructure elements and acts as the connecting layer between disparate
IT processes and user-friendly service desk tools. With Run Book, enterprise
users can automate routine, repetitive operational processes, drive down costs,
and remove complexity from the datacenter. The solution introduces greater
accountability into the system, and furnishes organizations with tools to
measure productivity and improvements in efficiency.
1.0 RBA is the Next IT Battleground
A study of past trends reveals that in the early nineties, businesses were more
concerned about the management framework they needed to adopt in order to
manage their IT infrastructure effectively and efficiently. While this need was
met by packaged solutions like CA Unicenter, IBM Tivoli and HP OpenView, today
the focus is on the configuration management database, or CMDB, as it is more
popularly known. This unified repository of information helps organizations
understand the relationships between the various components of their
information system and track their configuration. The CMDB is a powerful tool,
as it can serve as the foundation that ties together all the different
processes within IT. It is a
fundamental component of the IT Infrastructure Library (ITIL) framework's Configuration
Management process.
Today’s
rapidly evolving and highly competitive marketplace doesn’t allow enterprises to
focus only on current requirements. To leverage their IT investments
effectively and ensure maximum returns on the same, it has become imperative
for organizations to gain a deeper understanding of emerging trends in IT and
the developments that are likely to take place in the next 2-5 years.
The need to design, build, and report on workflows that support IT operations
processes has become critical, and traditional IT management approaches, such
as job scheduling products and custom scripting, are not adequate to meet this
requirement. In light of the changed requirements and demands of the IT
marketplace, a solution like Run Book Automation can deliver to enterprises
what is seen as the need of the hour, that is… According to international
market analyst firm Gartner, RBA is the next battleground for IT.
2.0 What is Run Book Automation
In a report published by Gartner in June 2006, Run Book Automation is defined
as the ability to design, build, orchestrate, manage and report on workflows
that support IT operations processes. It refers to products that help
organizations employ workflows to automate operational tasks across IT
management disciplines in support of IT management processes. This includes
identifying manual tasks that are repetitive and error prone and moving them
onto automated workflows, backed by reporting, auditing and enforcement. RBA
leverages technology to replicate activities that are otherwise performed
manually, especially routine and repetitive tasks. Thus RBA enables
organizations to streamline their processes and make them more manageable.
In addition,
there are several critical customer requirements that need to be taken into
consideration. For instance, operators require a visual user interface that provides
step-by-step directions through the various IT processes and procedures. RBA,
therefore, leverages the visual medium to guide users through the various steps
involved in triage, diagnosis, and other repetitive maintenance tasks.
Customers have
different needs and processes and hence they might want to initiate these
automation workflows through different modes. Some customers may like to
schedule certain activities on a periodic basis; at other times, in the
likelihood of a particular event they may want to automatically trigger a series
of follow up actions. Hence, the level of flexibility that a Runbook Automation
system offers is very important.
Another key requirement is the ease with which workflows can be created. A
customer may previously have used scripting to create automated processes.
However, scripting is a specialized skill set, and not many IT professionals
have it. The customer therefore faces serious limitations when they need to
broaden the base of people who can create automation workflows. This raises the
need for a solution that is much easier and more intuitive to use than
scripting.
Nonetheless, most customers have already invested significant sums in their
legacy solutions, and those solutions have to be leveraged to justify the
investment. This makes it necessary to identify techniques for incorporating
existing scripts into any new solution that is likely to be implemented.
Additionally, most customers insist on leveraging their existing knowledge and
expertise.
Integration
with the existing solution at the customer’s end is highly important especially
because millions of dollars have already been invested in these management
frameworks and CMDBs. The challenge is in taking these investments and adding Runbook
Automation to the mix.
Customers also
demand capabilities like additional filtering, ability to change different
automation processes, and leverage existing processes to form blocks that can
be put together to create more complex processes. This forms an important
aspect of the expectations customers have from RBA systems.
Other than this, mundane but critical aspects of running processes, such as
reporting and documentation, need to be automated. In addition to providing
highly sophisticated and detailed reporting, RBA systems should also be able to
create documentation on the fly.
This not only brings about greater operational efficiency but also improves the
quality of output while cutting down on the time generally spent on
completing the task. It also enables an organization to leverage its resources
to provide higher-level analytics and functioning, in order to ensure
continuity and holistic delivery of services.
Security is another key concern for most organizations. Customers are often
faced with a dilemma when they have to delegate certain tasks to different
individuals. Though automation doesn’t do away with manual involvement, it puts
in place processes that identify who is involved in actions such as restarting
or stopping a service. Greater security is introduced into the system because
manual involvement is now role-based, which demands greater accountability from
each individual user.
3.0 Why RBA is Becoming a Key Initiative
The sudden
surge in demand for Run Book Automation systems is mainly driven by two factors.
First, IT departments of most enterprises, small and large, are under
tremendous pressure to show tangible return on investment. With no significant
increase in their budgets or resources, IT departments are expected to chart
clearly the benefits IT brings to the organization and prove improvements in
service levels, which calls for mapping the various processes, followed by
detailed auditing and reporting.
The second
factor responsible for the increase in the demand for Run Book Automation
systems is the need for greater control over the datacenter. The adoption of
the Information Technology Infrastructure Library (ITIL) has given an impetus to
the maturing of various operational processes. This is essential if an
organization wants to maintain a predictable, repeatable and streamlined
environment in its datacenters.
Today, the
expectation is that IT should be able to support more and more systems and
applications when the demand arises. This, in today’s environment, can be a
daily occurrence. Internally, organizations are identifying the means to get
more out of their current IT investments. This calls for driving greater
efficiency into existing processes and identifying resources that can be moved
from banal, iterative work to more demanding tasks. In light of these
expectations, automation seems to be the only solution for taking existing
efficiencies to the next level.
The actual
availability of technology that can satisfy these internal requirements has
also provided a boost to the rising demand for RBA systems. Traditional
automation methods like custom coding and scripts lack best practices, change
management, documentation and flexibility. However, these qualities are
imperative in an operations environment, where business rules and configuration
settings change frequently.
Implementing
best practices is achieved primarily by defining and automating IT operations
management processes. Though most data center tools also provide evolved automation
capabilities, they do not automate processes between applications. Thankfully,
technology has now matured and is in a position to tackle the need that has
arisen internally. Today, sophisticated IT operations management platforms and
tools are available that can enable a customer to deploy a Runbook Automation
solution and experience its benefits in a live environment.
No wonder then that, according to Gartner, through 2012 Runbook Automation will
have the highest RoI of any IT initiative you can undertake today, because
automation is the direction for IT operations going forward.
4.0 Runbook Automation and IT Infrastructure Library (ITIL)
The ITIL framework outlines best practices for all IT activities. The service
support areas of ITIL, including incident, problem, configuration, change, and
release management, make up the daily operational tasks within IT.
With ITIL, the most important processes fall under service support and service
delivery. RBA ties these different processes together: from incident and
problem management, to change, release, and configuration management, and even
to mobility and capacity management. These processes can be integrated, and the
release and change processes can then be automated using RBA. Prior to RBA,
putting such processes in place would have required investment in additional
resources. Automation, however, accomplishes a more effective result using
minimal resources.
5.0 Leveraging Existing Investments While Filling a Critical Gap
RBA is a layer that enables an organization to leverage its existing system
management tools, including monitoring systems and event consoles. The solution
acts as a connecting layer between an organization’s monitoring solution and
its service desk solution. Inputs from the monitoring solution are fed into the
RBA system. When the conditions coded against an alert ID are fulfilled, an
automated workflow is triggered. Since the results are fed into the ticketing
system, detailed auditing and tracking are available without relying on manual
data input. Thus RBA acts as a layer that sits between an organization’s system
management, system monitoring and service desk tools.
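The alert-to-workflow hand-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alert IDs, the `Ticket` class and the two repair steps are hypothetical names invented for the example, not part of any particular RBA product.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Stand-in for a service desk ticket; the audit trail is filled in automatically."""
    alert_id: str
    host: str
    audit_trail: list = field(default_factory=list)


def restart_service(ticket):
    # Placeholder step: a real workflow would call the managed host here.
    ticket.audit_trail.append(f"restart requested on {ticket.host}")


def clear_temp_files(ticket):
    # Placeholder step: a real workflow would free disk space here.
    ticket.audit_trail.append(f"temp files cleared on {ticket.host}")


# Hypothetical mapping from monitoring alert IDs to workflow steps.
WORKFLOWS = {
    "SVC_DOWN": [restart_service],
    "DISK_FULL": [clear_temp_files],
}


def handle_alert(alert_id, host):
    """Receive an alert from monitoring, run the matching workflow, return the ticket."""
    ticket = Ticket(alert_id, host)
    ticket.audit_trail.append(f"alert {alert_id} received from monitoring")
    steps = WORKFLOWS.get(alert_id)
    if steps is None:
        ticket.audit_trail.append("no workflow matched; left for manual handling")
        return ticket
    for step in steps:
        step(ticket)
    return ticket
```

Every step writes to the ticket itself, which is what removes the need for manual data entry in this sketch.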
For instance, iConclude’s OpsForce Central is a web-based application where
Tier 1 system administrators based in the Network Operations Center (NOC) can
find the common repairs published by iConclude or the workflow automations an
organization has created to meet the requirements of its specific environment.
In addition to these, iConclude also provides accelerator packs for common
infrastructure such as Linux, Unix, Solaris, databases, web servers and
networking. iConclude’s solution covers the various platforms as well as the
situations that can occur across the vast majority of datacenters.
A study of these workflows and diagnostics shows that customers typically
prefer to initiate them in one of three modes. In the ‘guided run’ mode, the
operator running the workflow can proceed in a step-by-step manner and
ascertain exactly what is going on.
Another option, which is emerging as the preferred one, is the automatic mode.
Here the workflow is integrated into a monitoring or service desk product, so
that whenever an alert or ticket is created an automatic workflow is put into
motion. The automatic workflow follows the triage, diagnose and repair process,
wherein the system first gauges the criticality of the problem and then assigns
the next step accordingly. In routine cases, the system might even carry out
the repair itself, doing away with the need to escalate it further.
The third
option is running the diagnostics for system maintenance at pre-scheduled
intervals depending on whether the customer wants to opt for a daily or weekly or
another pre-decided cycle.
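The triage, diagnose and repair sequence of the automatic mode could look roughly like the following Python sketch. The criticality rule, the symptom names and the `repairs` lookup are assumptions made up for the example; a real system would take them from its own configuration.

```python
def triage(alert):
    """Gauge criticality with a hypothetical rule: some services are always critical."""
    if alert["service"] in {"trading", "payments"}:
        return "critical"
    return "routine"


def diagnose(alert):
    """Map a symptom to a probable cause; a real diagnosis would inspect the host."""
    return "service_stopped" if alert["symptom"] == "no_response" else "unknown"


def run_automatic(alert, repairs):
    """Triage, diagnose, then repair routine known problems or escalate the rest."""
    level = triage(alert)
    cause = diagnose(alert)
    if level == "routine" and cause in repairs:
        repairs[cause](alert)  # carry out the repair without escalation
        return {"level": level, "cause": cause, "outcome": "repaired"}
    return {"level": level, "cause": cause, "outcome": "escalated"}
```

In this sketch only routine problems with a known repair are fixed automatically; everything else is handed to a person.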
6.0 Common RBA Use Cases
Below we have
provided the common Runbook Automation cases that we have encountered, and
which customers are likely to face during the course of automating various
business processes.
Out of The Box Repair
OpsForce Central includes critical features like an out-of-the-box repair
solution built around Windows Server diagnostics. For instance, a technician
attempting to isolate a problem in a Windows datacenter would first try to
determine whether it is a network connectivity problem or a problem related to
CPU usage or a lack of memory or disk space. During the course of these tests,
an audit trail is formed, which is really the key. A lot of enterprises with
very large datacenters are struggling with compliance issues. For instance,
those in the healthcare industry are struggling with issues related to
Sarbanes-Oxley or HIPAA compliance, which impose in-depth audit requirements,
even though they have invested heavily in documenting their procedures.
Automation tackles the repetitive processes successfully and at the same time
enables users to audit the workflow. This helps ensure that the processes that
were documented earlier are being strictly adhered to. The audit feature enables
an organization to track which piece of infrastructure was touched, identify
the individual who touched it, and see the details of the interaction and what
data was returned. This has proved highly helpful to these companies, as it has
been identified as one of the best ways to address the audit requirements laid
down by HIPAA or SOX. The solution also raises an alarm whenever it detects
potential problems and then offers users a diagnostic panel that lists the
potential problems in the system.
This is a common use case that can be run either manually or automatically if
an organization is affected by poor server performance. But in today’s
datacenters, where enterprises rely on large server farms to service an
application, they may need to analyze things at an application level instead of
looking at them server by server.
An organization that has implemented application-level or service-level
monitoring in its environment and has some monitors looking at web pages may
get the feedback that certain key web pages are taking much longer to load than
expected. These kinds of issues can be challenging because a single web page is
probably being served through a load balancer by a large number of web servers.
Using dedicated resources to check when the pages slow down is impractical
today.
However, using automation technology, you can dynamically query the load
balancer to find out the IP addresses of the servers that are servicing a
particular application. Information on these servers can then be gathered in
real time and analyzed to identify potential problems with a server. Automation
takes that information, runs it through the workflow that has been implemented
within the ticketing environment, and escalates it accordingly. While all the
work is done automatically, the data that has been gathered is placed into the
ticket. This provides a technician who looks at the ticket with all the
necessary information.
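A minimal Python sketch of that flow follows. The two callables passed in, `list_pool_members` and `collect_metrics`, are hypothetical stand-ins for whatever interfaces the load balancer and the servers actually expose, and the 90% CPU limit is an arbitrary example threshold.

```python
def diagnose_slow_page(app, list_pool_members, collect_metrics, cpu_limit=90.0):
    """Collect per-server metrics for an application and build a ticket note.

    list_pool_members(app) -> iterable of server IPs (assumed interface)
    collect_metrics(ip)    -> dict with 'cpu_pct' and 'mem_pct' (assumed interface)
    """
    lines = [f"Diagnostics for application '{app}':"]
    suspects = []
    for ip in list_pool_members(app):
        metrics = collect_metrics(ip)
        flag = ""
        if metrics["cpu_pct"] >= cpu_limit:
            suspects.append(ip)
            flag = " <-- high CPU"
        lines.append(
            f"  {ip}: cpu={metrics['cpu_pct']}% mem={metrics['mem_pct']}%{flag}"
        )
    # The note is what gets placed into the ticket for the technician.
    return {"suspects": suspects, "ticket_note": "\n".join(lines)}
```

Passing the two lookups in as functions keeps the sketch independent of any specific load balancer or monitoring agent.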
The manual
process of information gathering, which can generally take 10-15 minutes for
one server and more if there are multiple servers, can now be done
automatically and the key information placed into the ticket. Often, when
people commence with automation, doing this repetitive triage and diagnosis
proves a great way to get quicker time to value.
Virtualization
Another example is virtual machine management. Virtualization has been gaining
ground in almost all large datacenters. However, most enterprises are finding
virtual machines a challenge to manage. A financial services company, for
instance, was looking at ways to automate its daily business cycle. During the
trading day, it needed to dedicate a large portion of its hardware resources to
servicing its trading applications. As the trading day came to a close, those
resources needed to be reprovisioned automatically to bring other applications
into play.
Automating this process means identifying the virtual machines running trading
applications that are not heavily utilized and shutting them down. Virtual
machines that are still heavily utilized are kept running, as they can be taken
out of service later. The automation then goes through all the hosts and
identifies the ones where it can provision the other virtual machines. It looks
at various characteristics of those hosts to make sure there is enough spare
CPU and memory to bring the new virtual machines online. This is a common
repetitive task that can be done on a daily basis to optimize the operational
environment.
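The end-of-day decision logic described above might be sketched as follows. The field names, the 10% idle threshold and the first-fit placement rule are all assumptions chosen for illustration; the source does not say how the hosts are actually selected.

```python
def plan_end_of_day(trading_vms, hosts, new_vms, idle_cpu_pct=10.0):
    """Return (VMs to stop, placements for new VMs, VMs that could not be placed)."""
    # 1. Trading VMs below the idle threshold are shut down; busy ones keep running.
    to_stop = [vm["name"] for vm in trading_vms if vm["cpu_pct"] < idle_cpu_pct]

    # 2. Track spare capacity per host. For simplicity, capacity freed by the
    #    stopped VMs is not credited back in this sketch.
    free = {
        h["name"]: {"cpu": h["free_cpu"], "mem_gb": h["free_mem_gb"]} for h in hosts
    }

    # 3. First-fit placement: use the first host with enough spare CPU and memory.
    placements, unplaced = {}, []
    for vm in new_vms:
        for host, cap in free.items():
            if cap["cpu"] >= vm["cpu"] and cap["mem_gb"] >= vm["mem_gb"]:
                placements[vm["name"]] = host
                cap["cpu"] -= vm["cpu"]
                cap["mem_gb"] -= vm["mem_gb"]
                break
        else:
            unplaced.append(vm["name"])
    return to_stop, placements, unplaced
```

The function only produces a plan; actually stopping and starting virtual machines would be separate workflow steps.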
In ITIL,
automation would help organizations garner information about their environment.
This provides the ability to start doing data mining, and business intelligence
gathering to help move from plain incident management to problem management.
The difference essentially is that while the former is a reactive approach to
problem solving the latter helps the datacenter operator or engineer take more
proactive measures.
Configurable Dashboard
iConclude’s configurable dashboard lets users configure charts around various
dimensions and provides a visual representation of the alerts flowing into the
datacenter, whether they are infrastructure-style alerts like CPU thresholds or
application-style alerts like slow page loads. Because repairs on these
applications are recorded, the dashboard offers drill-down views that show the
configuration items causing the most problems and the actions being taken to
solve them. Based on the problem area, this information can then be driven back
to the respective teams, be it the development team or the capacity management
or capacity planning units. Thus the information gathered during incident
management can now be leveraged to enable more proactive problem management.
OpsForce Studio
iConclude’s OpsForce Studio helps create and modify automations. If an
enterprise has a large number of different backup devices to manage and
failures occur on them, dealing with the inundation of log files poses a
serious challenge. OpsForce Studio can automate the process of analyzing those
log files and restarting backups on the particular servers that were having
problems. If the servers are unable to back up due to lack of disk space, the
system can archive some files offline, enabling the backup jobs to run without
incident. Here automation goes through the backup log files, identifies errors,
analyzes server loads, checks disk capacity on those servers and then takes the
appropriate action.
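As a rough illustration of the backup scenario, the Python sketch below scans log lines for failures and decides what to do per server. The log line format, the action names and the 5 GB free-space threshold are hypothetical; they are not taken from OpsForce Studio.

```python
import re

# Assumed log format for the example: "<server> BACKUP FAILED reason=<word>"
FAILURE = re.compile(r"^(?P<server>\S+)\s+BACKUP\s+FAILED\s+reason=(?P<reason>\w+)$")


def plan_backup_recovery(log_lines, free_disk_gb, min_free_gb=5.0):
    """Return an ordered list of (server, action) pairs for every failed backup."""
    actions = []
    for line in log_lines:
        match = FAILURE.match(line.strip())
        if not match:
            continue  # successful or unrelated log entries are ignored
        server = match["server"]
        # Free up space first when the server is short on disk.
        if free_disk_gb.get(server, 0.0) < min_free_gb:
            actions.append((server, "archive_old_files"))
        actions.append((server, "restart_backup"))
    return actions
```

A server with no known disk figure is treated as short on space here, which errs on the side of archiving before retrying.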
RBA can also automate periodic checks of an organization’s network
infrastructure to ensure that the devices are running smoothly and are
up-to-date on firmware. By dynamically walking the routing infrastructure, the
automation can verify firmware versions and then create tickets to have
out-of-date devices updated.
Runbook Automation can help glue together ITIL disciplines like change
management, release management and configuration management. Here automation
touches change management, waiting for a particular change request to be
approved. Once the change request is approved, the automation can remove the
indicated server from the cluster, lead off the transactions that were linked
to that server, take it out of monitoring, hook up with the customer’s
provisioning software, install the particular patch, reboot the server if
necessary, bring it back into monitoring, bring it back into the cluster, and
then update the ticket.
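The approval-gated sequence above can be expressed as an ordered list of steps, as in this hedged Python sketch. The step labels paraphrase the paragraph, while the `change` dictionary layout and the `execute` callback are assumptions for the example.

```python
# Ordered steps paraphrased from the text; the strings are illustrative labels only.
PATCH_STEPS = [
    "remove server from cluster",
    "lead off transactions",
    "take server out of monitoring",
    "install patch via provisioning software",
    "reboot server",
    "return server to monitoring",
    "return server to cluster",
    "update ticket",
]


def run_patch_change(change, execute):
    """Run the patch steps in order, but only once the change request is approved.

    change  -- dict with 'status', 'server' and optional 'reboot_required' (assumed layout)
    execute -- callable(step, server) that performs one step (assumed interface)
    """
    if change["status"] != "approved":
        return []  # nothing happens until change management approves
    completed = []
    for step in PATCH_STEPS:
        if step == "reboot server" and not change.get("reboot_required", False):
            continue  # the reboot is optional
        execute(step, change["server"])
        completed.append(step)
    return completed
```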
Automations should be conceptually easy to use and have reusable
subcomponents. Drilling into one step of an automation should reveal that the
step is itself another automation, built in turn from sub-automations. By
building things in this hierarchical manner, you get reusable components that
make automations scalable. This can prove critical in disaster recovery.
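One simple way to picture this hierarchy is function composition, sketched below in Python. The `step` and `flow` helpers and the step names are hypothetical; they only illustrate how a flow can be reused as a building block inside a larger flow.

```python
def step(name):
    """Create a leaf step that records its name when run (placeholder for real work)."""
    def run(log):
        log.append(name)
    return run


def flow(*parts):
    """Compose steps or other flows into a new flow that runs them in order."""
    def run(log):
        for part in parts:
            part(log)
    return run


# Two small reusable flows...
take_out_of_service = flow(step("remove from cluster"), step("suspend monitoring"))
return_to_service = flow(step("resume monitoring"), step("rejoin cluster"))

# ...combined into a larger one. Each building block is itself an automation.
patch_server = flow(take_out_of_service, step("install patch"), return_to_service)
```

Because a flow has the same shape as a step, the same two sub-flows could be reused unchanged in, say, a hardware replacement automation.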
Even if an organization has monitoring software that incorporates the ability
to restart a service, it is highly unlikely that it offers the level of
sophistication provided by RBA. Take the example of an organization that
requires a service restart. An SQL query gets the information about the
machine, the automation verifies whether the particular service is running,
and if it is, an email is sent reporting that there is no problem.
This check is necessary because some events are transient: they show up on the
monitoring software and then go away, and are sometimes termed false alerts.
But if there is a real problem, a trouble ticket has to be created and
additional information added to it, and an email has to be sent to an
escalation contact saying that the ticket has been created. The automation then
tries to restart the service and verifies that the restart succeeded. If it
did, the automation updates the ticket, updates the database that held the
information about the trap, and sends an updated case email. If the restart
failed, the ticket has to be updated with a failure notice, another SQL query
has to find out whom to escalate to, and the case has to be escalated to that
person. If we drill into things that seem conceptually simple, like restarting
a service, even that requires true process automation.
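The branching in the service restart example can be sketched as a single Python function. The callables `is_running`, `restart` and `notify`, the recipient names and the ticket log are placeholders assumed for the example; the SQL lookups mentioned in the text are left out to keep the sketch short.

```python
def handle_service_alert(service, is_running, restart, notify, ticket_log):
    """Handle one service alert: verify, open a ticket, restart, verify, escalate.

    is_running(service) -> bool, restart(service), notify(recipient, message)
    are assumed interfaces supplied by the caller.
    """
    # Transient event: the service is actually fine, so just report it.
    if is_running(service):
        notify("ops", f"{service}: transient alert, no problem found")
        return "false_alert"

    ticket_log.append(f"ticket opened: {service} is down")
    notify("escalation", f"ticket created for {service}")

    restart(service)
    if is_running(service):
        ticket_log.append("restart succeeded")
        notify("escalation", f"{service} restarted; ticket updated")
        return "repaired"

    ticket_log.append("restart failed")
    notify("on_call", f"{service} restart failed; escalating")
    return "escalated"
```

Even this stripped-down version has three distinct outcomes, which is the point the text makes about seemingly simple tasks.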
Another use case that is frequently seen involves clustered systems and load
balancers. For instance, if an online service provider wants to take a few
servers offline, it has to ensure that there are enough other servers to handle
the load at that time. Automation would examine the server pools, determine how
many nodes are currently available to service a particular application, check
that number against thresholds and disable nodes only if there are enough other
servers available to handle the load. This gives the ability to
automatically manage the environment based on current conditions.
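The threshold check described above reduces to a small guard, shown here as a Python sketch. The pool representation (node name mapped to an up/down flag) and the `min_active` parameter are assumptions for the example.

```python
def nodes_safe_to_disable(pool, requested, min_active):
    """Approve disabling requested nodes only while enough active nodes remain.

    pool       -- dict of node name -> True if the node is currently serving traffic
    requested  -- node names the operator wants to take offline, in order
    min_active -- minimum number of nodes that must stay active (assumed threshold)
    """
    active = [node for node, up in pool.items() if up]
    approved = []
    for node in requested:
        # Only approve if the pool stays at or above the threshold afterwards.
        if node in active and len(active) - 1 >= min_active:
            active.remove(node)
            approved.append(node)
    return approved
```

Requests are evaluated one at a time, so a later request is refused once earlier approvals have brought the pool down to the minimum.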
7.0 Getting Started with RBA
Before
investing in an RBA solution, it is necessary to analyze and identify your key
business requirements. An organization needs to first determine the most common
alerts and incidents that could be automated, which in turn would provide maximum
RoI. It is, however, important to set realistic objectives and goals. It is imperative
to plan each minor detail to ensure that you derive maximum benefit. This
includes the IT strategy that you plan to follow, the platform you intend to
adopt and the tool vendor.
After these initial needs have been identified, an organization needs to plan
out the workflow design. The top five or 10 alerts that have been identified
need to be documented, and the common steps required to remediate them should
be enumerated. Once these initial steps have been completed, the organization
designs and develops the automation flow, which is a straightforward process
using the RBA tool’s visual user interface. The next step is to pilot and
implement the initial automation flows. For the pilot project, it is advisable
to stick to a few selected processes and expand to other processes, domains and
groups once these are running smoothly.
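Identifying the most common alerts is a simple counting exercise once alert history can be exported. The Python sketch below assumes each history record is a dictionary with a `type` field, which is a made-up layout for the example.

```python
from collections import Counter


def top_alerts(alert_history, n=5):
    """Return the n most frequent alert types as (type, count) pairs."""
    return Counter(alert["type"] for alert in alert_history).most_common(n)
```

The resulting short list is a reasonable starting point for deciding which remediation steps to document and automate first.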
8.0 Keys to Ensuring RBA Success
There are three
key factors that need to be adhered to in order to ensure successful
implementation of an RBA system.
An
organization needs to understand the complexity of its infrastructure. It is
important to understand the requirements of all the different stakeholders in the
initial analysis. While analyzing process requirements, it is important for an
organization to plan for the future. While the needs may not seem many at the
moment rapid growth can change that in a very short span of time. This might
then put pressure to move to a more robust and scalable platform. Hence it is
imperative to ensure that an integration plan is in place throughout the course
of a Runbook Automation project.
During the
initial trial phase, organizations should select processes that are likely to demonstrate
quicker RoI. It is advisable to avoid highly complex processes. The complex
processes may be highly important to the organization and might also have
greater visibility. However, the likelihood of errors is higher during the
initial trial or learning phase. Hence, going in for a less complex process
will enable you to demonstrate the success of the system much more easily and
in the process provide more buy-in for an organization-wide roll out.
Enterprises should adopt a phased rollout strategy that will enable them to
take a more proactive approach instead of being reactive. Closely consider the
audit records and data coming out of the incident and problem management
processes, and then proceed to a more predictive mode. Leverage Runbook
Automation products to automate these processes and integrate all existing
tools and processes. This will enable you to
predict where your datacenter needs are headed.
9.0 Gartner Recommendations
- Ensure you understand Runbook automation and make sure whatever you are doing is in line with the service level and business priority you have identified together with your business counterparts.
- Make your initial project very narrow in scope so you can deliver very tangible benefits.
- Set clear objectives.
- Ensure your process requirements are in line with your current operational tools. Select the right tool set with the right features that are going to support your most flexible set of needs.
- Make sure that integration is taken into consideration in the initial project, because you never want to implement this in a vacuum. Understand the different integration points, and where you want to hit in the initial target. You probably don’t want to integrate with everything in your pilot, but you definitely want to hit one or two key tools that are currently a backbone in your datacenter.
- Obtain full support across the organization – talk to different stakeholders.
- Put in place well defined processes and take it to the next level with Runbook Automation.