Purdue, Adobe researchers use AI to cut cloud service failures
 
	A new research breakthrough is helping technology companies like Adobe quickly pinpoint the causes of failures in complex cloud-based systems, potentially saving millions in downtime and improving the reliability of services many people use every day.
Cloud applications—such as photo editors, file storage services, or customer support platforms—are made up of countless interlinked components called microservices. If even one of these microservices fails, it can set off a chain reaction, making it difficult and time-consuming to figure out where the original problem began.
Researchers from Purdue University, led by Saurabh Bagchi and Murat Kocaoglu, both faculty in the Elmore Family School of Electrical and Computer Engineering, have worked with Adobe researchers to develop a new method that detects and diagnoses these failures faster using a branch of artificial intelligence known as “causal inference.” The technique works like a digital detective, helping systems automatically trace problems back to their roots.
The research paper was recently accepted at the 41st Conference on Uncertainty in Artificial Intelligence (UAI), one of the top conferences in the field. The lead authors of the paper are ECE PhD students Azam Ikram and Kenneth Lee.
Shiv Saini, Principal Research Scientist at Adobe and a co-author on the paper, said, “Internally at Adobe we did a Proof of Concept (PoC) implementation of the theory in this paper and that has showed promising results in reducing time to resolution. We are now evaluating it for deployment with Adobe's commercial cloud offerings to improve performance and reduce the risk of outages.”
Traditionally, engineers may spend hours manually tracking down the source of a cloud service failure. For example, prior studies show that diagnosing problems in platforms like IBM's Bluemix can take an average of three hours. In that time, customers may experience slow performance or complete outages, leading to frustration and financial losses for companies.
The Purdue-led team tackled this challenge by treating the system of microservices as a map of cause-and-effect connections, known as a causal graph. Using causal inference, an AI method that models how one thing leads to another, they created an algorithm that traverses the graph more efficiently than previous methods. The algorithm can also figure out how to ask the right questions even when parts of the graph are unknown, which is often the case in practice because only some cause-effect relations are fully understood.
“Our solution works when the causal graph’s structure is not fully known,” said Saurabh Bagchi, professor of ECE. “Further, it can work with cases where there is a single root cause as well as cases with multiple root causes.”
When the causal graph is fully known, the researchers’ method is guaranteed to find a problem in the fewest possible steps in the worst-case scenario, something that's especially valuable for large, dynamic systems with hundreds or thousands of parts.
“We identified a new statistical test that can immediately inform us if a microservice is the problem when only a partial causal graph is available,” said Kocaoglu, assistant professor of ECE. “With that, we were able to design a theoretically grounded, but also practical algorithm for complex, modern cloud platforms.”
In a world where digital experiences must be fast and flawless, accurate failure diagnosis means better service for users and less stress for engineers working behind the scenes. The project advances the frontier on both fronts: the theory of causal inference and its application to cloud systems.
