AI in Data Center Operations

RESEARCH AND INNOVATION

AIOps: The Power of AI in Data Center Operations

 
June 9, 2020

We know the world through our five basic senses; these serve as instruments that gather information and deliver it to the brain to process, in order to react, interact and protect the body. When you perceive a danger or you need to react to situations, your senses work together to provide you with the data, and your intelligence puts the information together, correlates it, and makes sense out of it to help you make a decision.

Now, let us apply this analogy to data center operations. A typical data center has a multitude of monitoring systems that capture metrics from across systems, applications, networks, databases and devices. The volume of metrics and events generated by infrastructure alone has grown so rapidly that it has become practically impossible to analyze the data manually and pinpoint bottlenecks when troubleshooting performance issues.

The monitoring systems are individually configured to send alerts on exceptions, but more often than not it becomes difficult to segregate genuine alerts from the noise. It is also difficult to correlate alert events from different sources to infer the underlying cause and take appropriate remedial action. Efficient data center operations require a system that simulates the intelligence of the human brain, sifts through information received from diverse sources, automatically identifies related events, and helps to pinpoint the root cause to facilitate faster troubleshooting.

Welcome to the world of AIOps.

Towards intelligent IT operations: Two key use cases of AIOps

AIOps is a technology practice that leverages Artificial Intelligence techniques and methods to enhance the efficiency of operations and thereby business value. It involves using cognitive models that simulate human reasoning, self-learning machines that constantly improve the accuracy of decision-making, and powerful analytical engines that choose the most appropriate algorithms to deliver contextual operational insights.

AIOps is at the intersection of three crucial aspects of IT operations: service management, performance management, and automation. It bridges these aspects to enrich the information available from various data sources, derive continuous insights, and resolve issues in real time.

Consider the scenario of performance degradation of a critical application.

The underlying root cause could lie in any of the layers: the web UI, application server, middleware, database, network or server infrastructure. With AIOps, the alerts, metrics and logs from the various monitoring systems are aggregated, filtered to remove noise, and passed through analytical models to narrow down the possible root causes. Once a root cause is validated, the issue resolution and testing are automated. All this is done within minutes, as against the hours it would otherwise take.
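The flow described above (aggregate, filter noise, correlate, narrow down the root cause) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular AIOps product works: the `Alert` fields, the `LAYERS` ordering, the five-minute correlation window and the "deepest layer wins" heuristic are all hypothetical choices made for the example, and a real system would use learned models rather than fixed rules.

```python
from dataclasses import dataclass

# Hypothetical layer ordering, deepest dependency first (an assumption for illustration).
LAYERS = ["server", "network", "database", "middleware", "app_server", "web_ui"]


@dataclass(frozen=True)
class Alert:
    timestamp: int  # seconds since an arbitrary reference point
    source: str     # monitoring system that raised the alert
    layer: str      # one of LAYERS
    message: str


def deduplicate(alerts):
    """Noise filter: keep only the earliest alert for each (layer, message) pair."""
    earliest = {}
    for alert in alerts:
        key = (alert.layer, alert.message)
        if key not in earliest or alert.timestamp < earliest[key].timestamp:
            earliest[key] = alert
    return sorted(earliest.values(), key=lambda a: a.timestamp)


def correlate(alerts, window=300):
    """Group alerts that occur within `window` seconds of the previous alert."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= window:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


def likely_root_cause(incident):
    """Heuristic: the alert on the deepest layer is the first candidate to check."""
    return min(incident, key=lambda a: LAYERS.index(a.layer))


# Example input aggregated from three hypothetical monitoring systems.
alerts = [
    Alert(100, "apm", "web_ui", "slow response"),
    Alert(110, "apm", "web_ui", "slow response"),  # repeat of the alert above
    Alert(90, "dbmon", "database", "connection pool exhausted"),
    Alert(5000, "netmon", "network", "packet loss"),
]

incidents = correlate(deduplicate(alerts))
candidates = [likely_root_cause(incident) for incident in incidents]
```

In this sample, the repeated web UI alert is collapsed, the database and web UI alerts are grouped into one incident because they occur close together, and the database alert is proposed as the candidate root cause for that incident.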

Another area where AIOps can significantly help businesses, by improving customer satisfaction and productivity, is the Service Desk. In typical support operations, considerable effort and time are spent in comprehending business user needs, categorizing them appropriately and routing them to the correct team for resolution. Since customer satisfaction is key to any support operation and directly impacts business, it is imperative to respond to and serve the user as quickly as possible. AI models built on incident data derive the intent of the user, categorize the incident, identify relevant knowledge documents and assign the incident to the appropriate resolution group. Chatbots with built-in intelligence and self-learning capabilities help the support team execute commands and make the results available in the chat channel for collaborative troubleshooting. This can improve the productivity of the team significantly, by 25-30%.
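The categorize-and-route step can be illustrated with a small Python sketch. Simple keyword matching stands in here for the trained AI models the article refers to; the category names, keyword sets and resolution group names are hypothetical and exist only for the example.

```python
import re

# Hypothetical categories, keywords and resolution groups (illustrative only).
CATEGORY_KEYWORDS = {
    "access": {"password", "login", "locked", "account"},
    "network": {"vpn", "wifi", "network", "connection"},
    "hardware": {"laptop", "printer", "monitor", "keyboard"},
}
RESOLUTION_GROUPS = {
    "access": "identity-team",
    "network": "network-team",
    "hardware": "desktop-support",
}
DEFAULT_GROUP = "service-desk-triage"


def categorize(description):
    """Return the category whose keywords best match the text, or None if none match."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def route(description):
    """Return (category, resolution group); unmatched incidents go to manual triage."""
    category = categorize(description)
    return category, RESOLUTION_GROUPS.get(category, DEFAULT_GROUP)
```

For instance, `route("I forgot my password and my account is locked")` matches three "access" keywords and is assigned to the hypothetical `identity-team`, while a description that matches no keywords falls back to `service-desk-triage` for a human to categorize.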

The concept of AIOps can also be extended beyond IT operations to include business and security operations as well.

Some key benefits of AIOps include:

  • Resilient operations: Operational issues are identified, analyzed and acted upon through intelligent automation to restore services as quickly as possible, thereby improving the reliability of services.
  • Adaptive management: Cognitive models provide critical recommendations for dynamic orchestration of workflows to ensure a seamless user experience.
  • Improved productivity: Incident analysis, correlation and intelligent automation together facilitate faster resolution, thereby improving operational efficiency and productivity.
  • Contextual insights: Analyzing metrics in context with events and logs, using machine intelligence, provides the insight needed to comprehend current operations and predict likely behavior in the future.
     

AIOps: The new way forward for IT operations

Gartner predicts that “Large enterprise exclusive use of AIOps and digital experience monitoring tools to monitor applications and infrastructure will rise from 5% in 2018 to 30% in 2023.” Going forward, AIOps will be a key enabler of organizations’ digital transformation success, as it provides end-to-end visibility into infrastructure and applications, a key imperative given the rising complexity of IT environments.

Jayashree is a Technical Architect working with the Incubation unit, Research and Innovation, TCS. Her focus areas include identifying and developing new offerings that leverage automation and machine learning to deliver IT services. She holds a graduate degree in Electronics and Communications Engineering and has 23+ years of experience in IT Service Management, Network Management, Operations Automation, DevOps and, more recently, MLOps.