
Since January 2, 2018, the chip-level Intel CPU vulnerabilities Meltdown and Spectre have been widely publicized by the media, with nearly unprecedented coverage. Yet even after nearly a month of effort across the entire industry chain to fix the vulnerabilities, there still seems to be no final solution. I think this may be one of the most serious challenges the industry has faced since the birth of the computer, alongside the millennium bug! Some experts have even predicted that Moore's Law for CPUs may be nearing its end because of this incident.

If you react to the CPU Meltdown/Spectre (M/S) vulnerabilities by just patching, you may open a terrible Pandora's box! In this blog, I’ll tell you why I’m not being an alarmist, the most important things you need to think about, and the correct next steps.

 

The Impact of CPU M/S Bugs

1. The relevance of the vulnerability: Devices and systems you know may not be immune

According to evaluations and reports from well-known international research institutions, equipment manufacturers, and the media, almost all processors manufactured in the past 20 years are affected by these two vulnerabilities, including but not limited to those from Intel, AMD, Qualcomm and ARM. Even NVIDIA GPUs for home use (including GeForce, Quadro, NVS, Tesla, and GRID, mostly affected by the Spectre vulnerabilities) are susceptible.

Devices and systems running on these CPUs include mobile phones, mobile terminals, PCs, servers, cloud virtual machines, and other specialized devices.  And, it doesn’t matter which operating systems are running on them because the underlying CPUs contain the vulnerabilities.

 

2. The severity/risk of the vulnerability: Critical sensitive information will be leaked

By exploiting the CPU M/S vulnerabilities, hackers may:

  • Access the underlying operating system's operating information and steal secret key information;
  • Grab information that will allow them to bypass kernel or hypervisor isolation protection;
  • Gain access to multiple tenants running on shared cloud services and steal private information;
  • Capture critical information through the browser, such as the victim's account numbers and passwords, bank card numbers and passwords, email addresses, cookies and other private user information.

risk.gif

Sample of exploiting the vulnerabilities to obtain a user’s password

 

The Right Steps to React to CPU M/S Bugs

1. Enhance the security of key information assets

As of today (Feb 16, 2018), there is still no good patch available, so it is imperative to strengthen the security of key information assets. We should re-examine our information security technology and management systems, and further optimize how protective technologies are designed and deployed. Enabling two-factor or multi-factor verification for access to and processing of key information may be one of the most cost-effective and rapid methods.

 

2. Pay close attention to the progress

Given the current circumstances, we need to keep a close watch on announcements and evaluations from major CPU vendors, operating system vendors, equipment manufacturers and related technical communities, and be ready to act when they release new remediation information or improved, stable software patches.

 

3. New patches: test, test, test again

I’m going to say this three times, “Test, Test and Test again.”  This is especially relevant for these vulnerabilities as the patches released to date have resulted in system failures or performance hits.   Just because a vendor released a patch, doesn’t mean it will work in your environment.

 

4. Do a good job of computing resource capacity planning

Numerous experts have analyzed performance after patching and found that the patches significantly affect system performance (slowing it by 10% to 30%), and future patches are not likely to be better. Many companies’ IT systems may crash with even a 10% decline in performance.
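
To get a feel for what a 10% to 30% slowdown means for headroom, here is a minimal back-of-the-envelope sketch. It assumes (my assumption, not a measured result) that a patch which cuts throughput by a fraction `slowdown` makes the same workload need 1 / (1 - slowdown) times as much CPU. The 75% starting utilization and the 80% target are hypothetical numbers chosen only for illustration.

```python
def utilization_after_patch(current_utilization: float, slowdown: float) -> float:
    """Estimate CPU utilization after a patch that cuts throughput by `slowdown`.

    Assumption: the same workload needs 1 / (1 - slowdown) times as much CPU time.
    """
    if not 0 <= slowdown < 1:
        raise ValueError("slowdown must be in [0, 1)")
    return current_utilization / (1 - slowdown)


def extra_capacity_needed(current_utilization: float, slowdown: float,
                          target_utilization: float) -> float:
    """Fraction of additional capacity needed to stay at or below the target."""
    projected = utilization_after_patch(current_utilization, slowdown)
    return max(0.0, projected / target_utilization - 1)


# Hypothetical system running at 75% CPU with an 80% utilization target.
for slowdown in (0.10, 0.20, 0.30):
    projected = utilization_after_patch(0.75, slowdown)
    extra = extra_capacity_needed(0.75, slowdown, target_utilization=0.80)
    print(f"slowdown {slowdown:.0%}: projected utilization {projected:.0%}, "
          f"extra capacity needed {extra:.0%}")
```

Under these assumptions, a system that looks comfortable today can be pushed past saturation at the upper end of the reported range, which is why the slowdown has to be measured in your own environment rather than assumed.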

A rational approach would include looking at past experience of capacity to anticipate where slow-downs will be a concern.  However, given the lack of history with these vulnerabilities and patches, that might not be relevant. 

Feedback from AWS and from different users in the AWS community shows that the performance degradation that results from this varies greatly from system to system.

 

Let me illustrate with a non-technical example. Suppose that you usually need to drive 40 minutes to get home from work each day. Now suppose that you run into a bad traffic jam. You have seen similar traffic jams in the past and have some idea that it will delay you, but you can’t be sure by how much. Is it just excess rush-hour traffic, or is there a serious accident up ahead that will shut down the freeway completely?

Because capacity is complex, past performance experience may be completely invalid, forcing you to adapt quickly to new variables and to provision excess capacity to account for unknowns.

 

 

It is time to engage capacity analysis and forecasting systems!

  1. Collect historical performance data of the system, including business performance KPIs (number of transactions per time unit, number of concurrent users, total time for a single transaction, etc.), infrastructure resource performance KPIs (CPU utilization, memory utilization, I/O read and write speed, etc.), and key component performance KPIs (typical transaction processing time for databases, middleware, etc.).
  2. Model capacity using historical performance data. Use tools that leverage machine learning and other artificial intelligence algorithms, trained on that data, to establish a historical capacity model.

4.jpg

Capacity model and analysis sample

 

  3. Perform a performance stress test of the new patch in the system test environment, collect performance data, and re-model using capacity analysis tools.

  4. Compare the two models, before and after patching. Take the business performance KPI trends forecast by the historical data model as the capacity requirement, use the test data model to forecast resource requirements after patching, and treat the resource requirement forecast by the historical data model as the floor for resource capacity. Alternatively, use another method to consolidate the two capacity forecasts.

  5. If the new test data is not sufficient for training and modeling, analyze the percentage change (deterioration) in performance under typical load and conduct a what-if analysis with the historical data model. That is, analyze how capacity requirements change when each resource performance parameter degrades by a given proportion.

5.jpg

what-if analysis sample
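
The modeling and what-if steps above can be sketched in a few lines. This is only an illustration under loud assumptions: the "capacity model" is a plain least-squares line relating transactions per second to CPU utilization (real capacity tools use far richer models), the history is made-up data, and `degradation` is taken to mean the proportional increase in per-transaction CPU cost observed in the patch stress test.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def what_if(intercept, slope, forecast_tps, degradation):
    """Projected CPU % at forecast_tps if per-transaction CPU cost rises by `degradation`."""
    return intercept + slope * (1 + degradation) * forecast_tps


# Hypothetical history: transactions per second vs. CPU utilization (%).
history_tps = [100, 200, 300, 400, 500]
history_cpu = [15.0, 25.0, 35.0, 45.0, 55.0]

intercept, slope = fit_line(history_tps, history_cpu)

# What-if: business forecast of 600 TPS under different degradation scenarios.
for degradation in (0.0, 0.1, 0.3):
    projected = what_if(intercept, slope, 600, degradation)
    print(f"degradation {degradation:.0%}: projected CPU {projected:.1f}%")
```

The zero-degradation scenario corresponds to the floor from the historical model; the other scenarios show how quickly the requirement grows as the measured degradation increases.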

 

5. Prepare to provision new computing resources

Based on the priority of patching this bug and on the capacity analysis and forecast results, prepare the corresponding resources and get the surrounding environment ready for the patch change and deployment.

 

6. Keep communicating with business units and customers

Inform the business units as soon as possible that the CPU M/S vulnerabilities may have a negative business impact, and develop response plans together. Business units need to maintain good communication with customers. When preparing the patch deployment, evaluate the change plan, impact analysis and rollback plan jointly with the business department in advance, and carry out the change only after the business department is prepared and has agreed.

 

7. Production environment patch deployment and resource expansion

This will be a large-scale deployment. If you use a patch automation deployment tool to configure the deployment strategy, you do not have to worry about deployment speed or the risk of manual operation. In today's automated world, many enterprises have automated IT resource expansion as well.

 

8. Continue to monitor the operation

After patching, promptly configure the monitoring system to increase monitoring frequency, closely monitor the running status of the patched systems, continuously collect feedback through various channels from other enterprises and communities using the new patches, and take timely measures in response.

 

In summary, the Meltdown / Spectre vulnerabilities are not to be taken lightly. The vendors’ attempts so far haven’t produced solid patches, and at the same time hackers are actively working on exploits. You should have good capacity planning in place for the potentially major performance impact of patches (TrueSight Capacity Optimization can help with this). For those patches that have been released, it’s important to test not only whether the patches are stable, but also how any performance degradation impacts your applications or services. One thing that you can do now is better understand where those vulnerabilities exist in your environment (TrueSight Vulnerability Mgmt can help with this) and plan for a fast and smooth remediation process once you have tested the patches (BladeLogic Automation can help with this).

 

Got a question or feedback? Talk to me in the comments section below.   

                
  [i] Li Peng contributed to this article



Summary

 

 

 

 

In recent years, DevOps has become incredibly popular with everyone from CIOs to IT engineers. Several industry pain points have fostered its popularity:

  • The cycle of the traditional software development model (such as waterfall) is too long
  • It is difficult for the development team and the operations team to cooperate because they don’t share the same responsibilities
  • The operations team pursues stability and fully controls the production environment via a “bulky bureaucratic process” (usually an ITIL process)

 

Many evangelists even claim that DevOps + Cloud is the future of IT operations management, and that ITIL will die. However, this argument reads like the closing line of a fairy tale: "and they lived happily ever after."

 

But this is far from reality. After much trial and error and many pitfalls, many companies despondently discover that DevOps is not a silver bullet for IT operations and maintenance, and does not provide a truly systematic solution! It is far from happily ever after, which raises the question of whether DevOps actually possesses the capability to replace ITIL.

 

 

Why can’t DevOps replace ITIL in operations management and service support?

(1) DevOps doesn’t provide a methodology for operations management

 

In theory, DevOps' major recommendation for operation is to integrate the development and operation roles into one team to eliminate organizational barriers, and unify roles and responsibilities. However, DevOps does not provide any advice on how the integrated DevOps team manages its operations.

 

For any crucial business application, if the organization merely redefines personnel roles or rotates developers into operations tasks, without the corresponding strategy, procedures and management controls, operations management will return to the state of “chaos” that ITIL was created to address. No one is responsible for long-term service quality, and no management framework exists to organize internal collaboration and improve the quality of operations.

 

When adopting DevOps, organizations recruit "DevOps engineers" and "full stack engineers" who were software development engineers in the past. Their skills have not changed, their learning ability has not changed, and the requirements for domain expertise have not changed. Is the word DevOps a magic wand? Human nature has not changed, the division of labor has not changed, and the quality and stability requirements have not changed. Can DevOps eliminate the basic management methods of operational specifications, control procedures and management processes?

 

Operationally, DevOps advocates implementing CI (Continuous Integration) and CD (Continuous Delivery). Even if CD is interpreted as Continuous Deployment, as many people do, deployment is only the first step for new software to enter the production environment and start running. DevOps has no concept like CO (Continuous Operation) to cover management of the operations phase.

图片1.png

Fig1: Comparison between DevOps and traditional mode

 

 

Therefore, the DevOps framework has little to say about operations management and service support.

 

 

(2) The main founders and evangelists of DevOps suggest integrating DevOps and ITIL

图片2.png

Fig2: Cover of book The Phoenix Project

 

In the widely circulated DevOps novel The Phoenix Project, co-author Gene Kim specifically explores the relationship between DevOps and ITIL. In the book, he argues that ITIL and ITSM are still the best codifications of the business processes that underpin IT Operations, and actually describe many of the capabilities needed in order for IT Operations to support a DevOps-style work stream.

 

 

(3) Many well-known Internet companies use the combination of ITIL and DevOps to achieve orderly and efficient operation and maintenance

On Quora, a leading knowledge-sharing website, many people have raised questions like “Should ITIL Die?”, and these are among its popular posts. One of the questions is as follows (see note 1):

图片3.png

Fig3: Discussion about ITIL on Quora

 

 

(4) The DevOps Master certification program launched by EXIN integrates ITIL for the operation domain guide

EXIN, an international training and certification institution, has launched the DevOps Master certification in recent years. ITIL/ITSM is one of the core modules in its knowledge map (see note 2).

图片4.png

Fig4: EXIN DevOps Master Certification Domain Framework

 

 

 

Traditional ITIL faces five "sins"

 

Although emerging concepts and models, including DevOps, have so far been unable to replace ITIL, traditional ITIL practice faces more and more challenges:

(1) ITIL is not agile and struggles to adapt to fast-paced software development iterations and business changes

In current enterprise practice, ITIL processes require a large number of approvals and supporting documents, which inevitably gives business and development departments the impression of a “bureaucratic style” and a “heavy process”. Of course, this is not a problem with ITIL itself; rather, control points have been added to the processes over a long time for various reasons. The tightening has always been one-way, and no one dares to reverse it and simplify. All ITIL processes and related procedures need to be revisited for optimization and simplification.

 

(2) Poor user experience makes ITIL tools difficult to use

Many traditional ITIL platforms are built on old architectures and technologies, and their UIs are outdated. They can't support mobile, location-based and social user experiences or the corresponding end-user devices. Process forms require dozens of fields to be filled in manually. Searching historical work order information is particularly difficult and time-consuming. The system reports are so poor that users would rather calculate the data manually. These are the things that make operations engineers and service users of the millennial generation dread ITIL.

 

 

(3) ITIL platforms are difficult to modify and customize for maintenance

On many traditional ITIL platforms, any change, even adjusting the color or font of a form, requires code changes, not to mention process modifications, field creation, and KPI calculations. Today, more and more non-ITIL processes and service processes from functions such as HR, Finance, and Facilities need to be implemented quickly on ITSM platforms.

 

 

(4) Configuration management relies on manual input and manual auditing

As the number of IT systems and devices grows rapidly and their complexity increases significantly, the operations team must rely on the CMDB to record and manage configuration information, so as to support scenarios such as fault impact analysis, fault root-cause analysis and location, and change risk assessment. However, it has proven impossible to keep the CMDB accurate and up to date manually!

(5) Dynamic management requirements are difficult to satisfy

Technologies such as cloud and containers confront the operations team with managed objects and operation scenarios that change from moment to moment. The team cannot rely on manually created work orders, manually triggered actions, and the like. ITIL process tools must support dynamic calls to code and scripts, and real-time integration with third-party systems.

 

 

Agile ITIL: Make the operation process agile, automated, and friendly

 

As DevOps can't replace ITIL, and traditional ITIL itself can't meet DevOps requirements, why not make ITIL agile like DevOps?

 

 

What are the characteristics of agile ITIL?

Let us use the core concepts of DevOps to derive an agile form of ITIL.

图片5.png

Fig5: Deriving Agile ITIL from DevOps

 

 

Key points of agile ITIL:

 

1. Change process automation and integration

As shown in the following figure, implement process integration with change and release automation as the core:

 

图片6.png

 

Fig6: process integration and automation

 

 

  • The change & release process is key to agile ITIL:
    • Process simplification: Evaluate the existing change management process and remove unnecessary approval requirements.
    • Change category (standard, normal, major, emergency, etc.) downgrade: For each sub-category of the existing change process, evaluate whether its implementation can be automated and whether the risk is controllable, and downgrade it to a standard change or a lower category.
    • Refinement of the standard change process: Evaluate the secondary and tertiary sub-categories of standard changes, mapping the implementation activities to automated scripts or an orchestrated flow of scripts. This way, more and more standard changes will be executed automatically.

 

  • Process integration with change and release process:
    • Incident process integration: For incidents with low risk, low impact and standard procedures, the incident automatically triggers the standard change process, and the repair script is executed automatically.
    • Service request process: Assess how to automate service fulfillment for each service request type, and implement the automation for those types. Integrate the service request with the change and release process for service logging and audit tracing, and then call automation for service delivery.
    • Software release process: Automate the deployment activity using automation platforms, and integrate it into the change and release process.
    • Job management process: Shift operational responsibility for batch jobs left to the development team, and use a jobs-as-code API to generate job tasks and schedules. Through integration with the change process, the jobs are delivered automatically under central automation orchestration.
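
The downgrade-and-automate idea above can be sketched as a small routing rule. This is a hypothetical illustration, not how any particular ITSM product works: the `ChangeRequest` fields, the category names as strings, and the simplification that only "normal" changes are candidates for downgrade are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    summary: str
    category: str            # "standard", "normal", "major" or "emergency"
    automatable: bool        # an automated script or orchestration exists
    risk_controllable: bool  # the risk has been assessed as controllable


def effective_category(change: ChangeRequest) -> str:
    """Downgrade a normal change to standard when it is automatable and its risk is controllable."""
    if change.category == "normal" and change.automatable and change.risk_controllable:
        return "standard"
    return change.category


def route(change: ChangeRequest) -> str:
    """Send automatable standard changes to automation; everything else needs approval."""
    if effective_category(change) == "standard" and change.automatable:
        return "auto-execute"
    return "manual-approval"


restart = ChangeRequest("Restart web service", "normal", True, True)
upgrade = ChangeRequest("Upgrade core database", "major", False, False)
print(restart.summary, "->", route(restart))
print(upgrade.summary, "->", route(upgrade))
```

A low-risk incident with a standard repair procedure would follow the same path: the incident creates a change request that routes to automatic execution, while the change record keeps the audit trail.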

 

 

2. Configuration management process automation and integration

图片7.png

Fig 7: Configuration Management Automation Scenarios

 

 

Turn CMDB manual management into automated management with the following steps:

  • CMDB initialization: Automated scanning of the managed environment completes information collection, relationship discovery and application system modeling for all configuration items (CIs) without manual input. Non-technical information still requires external input, and the application model and relationships still need to be refined based on auto-discovery findings.
  • Scan and audit daily: With automatic CI discovery and audit at least daily, ops can automatically compare the configuration baseline data with the current actual data and find CI changes. Verified changes can be updated in the CMDB after manual confirmation.
  • Integration with change process: Scan the zones where the change will take place and take a snapshot as a baseline before the change is implemented, scan again after the change is implemented, and then import the new information into the CMDB after manual confirmation.
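
The baseline-versus-current comparison in the steps above comes down to a structured diff. Here is a minimal sketch, assuming each scan is a dictionary keyed by CI identifier whose values are attribute dictionaries; the CI names and attributes are made up, and a real discovery tool would produce much richer data.

```python
def diff_configuration(baseline: dict, current: dict) -> dict:
    """Compare two scans keyed by CI id; each value is a dict of CI attributes."""
    added = sorted(current.keys() - baseline.keys())
    removed = sorted(baseline.keys() - current.keys())
    changed = {}
    for ci in sorted(baseline.keys() & current.keys()):
        before, after = baseline[ci], current[ci]
        # Record (old, new) for every attribute whose value differs.
        delta = {attr: (before.get(attr), after.get(attr))
                 for attr in sorted(before.keys() | after.keys())
                 if before.get(attr) != after.get(attr)}
        if delta:
            changed[ci] = delta
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots taken before and after a change.
baseline = {
    "web01": {"os": "RHEL 7.4", "kernel": "3.10.0-693"},
    "db01": {"os": "RHEL 7.4", "kernel": "3.10.0-693"},
}
current = {
    "web01": {"os": "RHEL 7.4", "kernel": "3.10.0-862"},
    "db01": {"os": "RHEL 7.4", "kernel": "3.10.0-693"},
    "web02": {"os": "RHEL 7.4", "kernel": "3.10.0-862"},
}

print(diff_configuration(baseline, current))
```

The resulting diff is what a person would review before confirming the update, so the manual step shrinks from data entry to approval.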

 

The core capabilities of the auto-discovery include:

  • The number of supported devices and software (whether it can achieve the fullest possible coverage of common types, models and versions in the industry)
  • The speed of discovery (to ensure the CMDB can deliver accurate data on time)
  • Support for multiple cloud service providers, etc.

 

 

3. Agile communication and support services for users

Use technology that supports mobile, location-based, social, and other popular user preferences to provide IT service users with support through multiple channels (app, instant messaging, web, SMS, etc.) and continuously improve users' digital experience and satisfaction.

图片8.png

Fig8: agile communication & supporting service samples

 

Furthermore, intelligent customer service (chatbot) can be used to provide personalized service support for users.

 

 

4. Process customization based on a graphical configuration

The process platform must be agile and capable of supporting rapid optimization and iteration to continuously meet the needs of IT operations and service support. A process platform that provides process customization based on graphical configuration will reduce the cost, time, and risk of process creation and modification, and serve production operations faster. Platform customization that relies on manual code changes is becoming a historical legacy.

图片9.png

Fig9: Sample of process creation based on GUI

 

 

5. Ops as code

Ops as code has become a trend in IT operation and maintenance tools. Given the scale of the managed objects, the complexity of the architecture, the dynamic nature of the scenarios, and the requirements of real-time management, all operation and maintenance tools must provide an external API, so that third-party systems or code can integrate with them and orchestrate their logic to achieve dynamic, real-time operation and maintenance.

图片10.png

Fig10: Ops as code Framework
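
To make "ops as code" concrete, here is a minimal sketch of a script raising a standard change through a REST API. Everything specific is hypothetical: the `/changes` endpoint, the payload fields, the bearer token and the `itsm.example.com` host are placeholders, not the API of any real product. The function only builds the request; sending it would be a call to `urllib.request.urlopen`.

```python
import json
from urllib.request import Request


def build_change_request(base_url: str, token: str, summary: str, script_name: str) -> Request:
    """Build (but do not send) a POST that asks an ITSM tool to open a standard change.

    The endpoint path and payload fields are hypothetical placeholders.
    """
    payload = {"summary": summary, "category": "standard", "script": script_name}
    return Request(
        url=f"{base_url.rstrip('/')}/changes",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_change_request(
    "https://itsm.example.com/api/", "placeholder-token",
    "Restart web service on web01", "restart_service",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Because the request is ordinary code, it can live in version control, be reviewed, and be triggered by a monitoring event or a pipeline instead of a person filling in a form.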

 

 

Reference Architecture of Agile ITIL

To achieve the above agile ITIL capabilities and better implement the DevOps-style agile operation and maintenance, an implementation architecture is shown in the following figure:

图片11.png

Fig11: Reference Architecture of Agile ITIL

 

The system can be divided into four layers which include:

  • User layer: Agile support and communication channels that support a variety of service technologies such as mobile, location and social, introduce AI-supported intelligent customer service, and deliver support services in a more efficient, faster, and user-friendly way.
  • Presentation layer: An agile operation and maintenance dashboard that defines KPIs for agile operation, displays real-time and historical KPI data in various ways, and demonstrates the efficiency, value, and control of agile operation and maintenance to stakeholders and the management team.
  • Process management layer: Focuses on the change and release management process, its automation and integration with other processes, and its connection to the automation layer. CMDB configuration management is based on the automatic discovery of CI information.
  • Automation layer: Automates the implementation activities of the change and release process, including:
    • Enterprise-level central automation orchestration and scheduler: Handles logical orchestration of change and release tasks, change window scheduling, resource allocation, process supervision, feedback of automation results to the requesting process, etc.
    • Automated deployment tools: Dynamically deliver automated scripts to target objects, provide a run-time environment for those scripts, and serve as the communication channel between the scripts and the orchestration scheduler
    • Automated script library: Provides a low-coupling, fine-grained script library for easy programming of complex automation tasks
    • Customized scripts: Custom development for users with special requirements that are difficult to implement with the existing automated script library
    • API interface: REST APIs provided by third-party tools or scripts, called by scripts or the orchestrator
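
As a rough illustration of what the orchestration and scheduling component in the automation layer does, here is a minimal sketch: it checks the change window, runs the steps of a change in order, stops at the first failure, and returns a result that could be fed back to the requesting process. The function names, the result format and the 22:00 to 04:00 window are assumptions for the example, not features of a specific orchestrator.

```python
from datetime import datetime, time


def in_change_window(now: datetime, start: time, end: time) -> bool:
    """True if `now` falls inside a daily window, which may cross midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def run_change(steps, now: datetime, window_start: time, window_end: time) -> dict:
    """Run (name, action) steps in order inside the window; stop at the first failure."""
    if not in_change_window(now, window_start, window_end):
        return {"status": "deferred", "completed": []}
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # report any failure back to the requesting process
            return {"status": "failed", "completed": completed,
                    "failed_step": name, "error": str(exc)}
        completed.append(name)
    return {"status": "succeeded", "completed": completed}


# Hypothetical steps; real ones would call deployment tools or scripts.
steps = [
    ("snapshot", lambda: None),
    ("apply_patch", lambda: None),
    ("health_check", lambda: None),
]
result = run_change(steps, datetime(2018, 6, 1, 23, 30), time(22, 0), time(4, 0))
print(result)
```

Stopping at the first failed step and reporting which steps completed gives the change process what it needs to decide between retry and rollback.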

 

 

BMC Agile ITIL Solution

BMC Software, a leading provider of ITIL solutions, can provide a full stack to implement Agile ITIL with the following products and modules:

图片12.png

Fig12: BMC Agile ITIL Solution

 

 

Conclusion

To adapt to rapid change in the digital era and to integrate with the DevOps development model, traditional ITIL can be transformed into agile ITIL through process optimization, integration, and change automation.

 

图片13.png

 

Fig13: Features and benefits of Agile ITIL

 

Thanks to Satarupa Datta and Bishakha Chatterjee from the COE for their help with content refinement!

 

Notes:

1. Is ITIL used at companies such as Amazon, Google, Facebook? Why? - Quora

2. https://www.exin.com/en/certifications/exin-devops-master-exam