arXiv:2402.13264v1 Announce Type: new
Abstract: Fault localization is challenging in online micro-services due to the wide variety of monitoring data volumes, types, and events, and the complex interdependencies among services and components. Fault events in services are propagative and can trigger a cascade of alerts in a short period of time. In the industry, fault localization is typically conducted manually by experienced personnel. This reliance on experience is unreliable and lacks automation. Different modules present information barriers during manual localization, making it difficult to align quickly during urgent faults. This inefficiency hampers stability assurance, whose goal is to minimize fault detection and repair time. Although existing methods have aimed to automate the process, their accuracy and efficiency are less than satisfactory. The precision of fault localization results is of paramount importance, as it underpins engineers' trust in the diagnostic conclusions, which are derived from multiple perspectives and offer comprehensive insights. Therefore, a more reliable method is required to automatically identify the associative relationships among fault events and their propagation paths. To achieve this, KGroot uses event knowledge and the correlation between events to perform root cause reasoning by integrating knowledge graphs and GCNs for RCA. The FEKG is built from historical data, an online graph is constructed in real time when a failure event occurs, and the similarity between each knowledge graph and the online graph is compared using GCNs to pinpoint the fault type through a ranking strategy. Comprehensive experiments demonstrate that KGroot can locate the root cause within the top 3 potential causes with an accuracy of 93.5%, at a latency on the order of seconds. This performance matches the level of real-time fault diagnosis in the industrial environment and significantly surpasses state-of-the-art baselines in RCA in terms of effectiveness and efficiency.
Fault Localization in Online Micro-Services: An Analysis of KGroot
Fault localization in online micro-services is a complex and challenging task due to the various types of monitoring data, events, and interdependencies involved. The industry has traditionally relied on manual localization by experienced personnel, which is not only unreliable but also lacks automation. This manual process becomes even more difficult during urgent faults when different modules present information barriers that hinder quick alignment.
To address these inefficiencies and minimize fault detection and repair time, a more reliable and automated approach is needed. KGroot offers a potential solution by using event knowledge and the correlation between events to perform root cause analysis (RCA) through the integration of knowledge graphs (KGs) and graph convolutional networks (GCNs).
The KGroot approach has two stages. Offline, it builds the FEKG, its knowledge graph of fault events, from historical data. Online, it constructs a graph of the observed events in real time when a failure occurs. GCNs then compare the online graph against each knowledge graph, and a ranking strategy over the resulting similarity scores identifies the most likely fault type. In this way KGroot automatically identifies the associative relationships among fault events and their propagation paths, which is what makes root cause localization possible.
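The compare-and-rank step can be illustrated with a minimal sketch. It is not KGroot's implementation: it assumes one-hot event-type features, a single untrained GCN-style propagation step (symmetric normalization with self-loops, no learned weights), mean pooling, and cosine similarity. The event types, fault names, and function names are all hypothetical.

```python
import math

def gcn_embed(edges, features):
    """One GCN-style propagation step, D^-1/2 (A + I) D^-1/2 X, then mean pooling.

    Assumption: no learned weight matrix and no nonlinearity; a trained
    model would have both.
    """
    n, dim = len(features), len(features[0])
    # Adjacency matrix with self-loops; edges are treated as undirected.
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    pooled = [0.0] * dim
    for i in range(n):
        for j in range(n):
            if adj[i][j]:
                weight = 1.0 / math.sqrt(deg[i] * deg[j])
                for d in range(dim):
                    pooled[d] += weight * features[j][d] / n
    return pooled

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_fault_types(online_graph, knowledge_graphs, top_k=3):
    """Return the top_k fault types whose graphs are most similar to the online graph."""
    online_vec = gcn_embed(*online_graph)
    scored = [(cosine(online_vec, gcn_embed(*graph)), name)
              for name, graph in knowledge_graphs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Hypothetical one-hot event types.
CPU, LAT, DB = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

# Each graph is (edges, node_features); the fault names are made up.
knowledge_graphs = {
    "db_overload": ([(0, 1), (0, 2)], [DB, LAT, LAT]),
    "cpu_saturation": ([(0, 1)], [CPU, LAT]),
    "network_partition": ([(0, 1)], [LAT, LAT]),
}
online_graph = ([(0, 1)], [DB, LAT])  # a DB timeout followed by a latency alert

ranking = rank_fault_types(online_graph, knowledge_graphs)
print(ranking)
```

In this toy setup the online graph shares its database event with only one stored graph, so that fault type ranks first. A real system would learn the embedding so that structurally similar failures score high even when the events do not match exactly.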
KGroot's combination of techniques is noteworthy. Knowledge graphs encode what is known about past fault events and how they relate, while GCNs supply a learned measure of similarity between graphs. According to the authors' experiments, this combination surpasses existing RCA baselines in both effectiveness and efficiency.
In the reported experiments, KGroot placed the true root cause among its top 3 candidates in 93.5% of cases, with localization completing within seconds. The authors describe this as matching the level of real-time fault diagnosis in industrial environments, which supports the practicality of KGroot for fault localization.
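For readers unfamiliar with the metric, top-3 accuracy is commonly computed as the fraction of cases in which the true cause appears among the first three ranked candidates. The sketch below assumes that standard definition; the function name and the toy data are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(ranked_predictions, true_causes, k=3):
    """Fraction of cases whose true cause is among the first k ranked candidates."""
    if not true_causes:
        raise ValueError("need at least one case")
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_causes)
               if truth in ranked[:k])
    return hits / len(true_causes)

# Hypothetical toy data: four failure cases, each with a ranked candidate list.
predictions = [
    ["db_overload", "cpu_saturation", "network_partition", "disk_full"],
    ["cpu_saturation", "disk_full", "db_overload", "network_partition"],
    ["disk_full", "network_partition", "cpu_saturation", "db_overload"],
    ["network_partition", "db_overload", "disk_full", "cpu_saturation"],
]
truths = ["db_overload", "db_overload", "db_overload", "cpu_saturation"]

print(top_k_accuracy(predictions, truths, k=3))  # 0.5
print(top_k_accuracy(predictions, truths, k=1))  # 0.25
```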
Overall, KGroot presents a promising solution for automated fault localization in online micro-services. Its integration of knowledge graphs and GCNs offers a multi-faceted approach that enhances the accuracy and efficiency of root cause analysis. As the industry continues to rely on micro-services for various applications, tools like KGroot will play a crucial role in maintaining stability and minimizing downtime.