ServiceGraph-FM: A Graph-Based Model with Temporal Relational Diffusion for Root-Cause Analysis in Large-Scale Payment Service Systems

Zhuoqi Zeng*, Mengjie Zhou

*Corresponding author for this work

Research output: Contribution to journal › Article (Academic Journal) › peer-review

Abstract

Root-cause analysis (RCA) in large-scale microservice-based payment systems is challenging due to complex failure propagation along service dependencies, limited availability of labeled incident data, and heterogeneous service topologies across deployments. We propose ServiceGraph-FM, a pretrained graph-based foundation model for RCA, where “foundation” denotes a self-supervised graph encoder pretrained on large-scale production cluster traces and then adapted to downstream diagnosis. ServiceGraph-FM introduces three components: (1) masked graph autoencoding pretraining to learn transferable service-dependency embeddings for cross-topology generalization; (2) a temporal relational diffusion module that models anomaly propagation as graph diffusion on dynamic service graphs (i.e., Laplacian-governed information flow with learnable edge propagation strengths); and (3) a causal attention mechanism that leverages multi-hop path signals to better separate likely causes from correlated downstream effects. Experiments on the Alibaba Cluster Trace and synthetic PayPal-style topologies show that ServiceGraph-FM outperforms state-of-the-art baselines, improving Top-1 accuracy by 23.7% and Top-3 accuracy by 18.4% on average, and reducing mean time to detection by 31.2%. In zero-shot deployment on unseen architectures, the pretrained model retains 78.3% of its fully fine-tuned performance, indicating strong transferability for practical incident management.
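
The "Laplacian-governed information flow" mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a static, symmetric three-service graph, fixed edge weights standing in for the learnable propagation strengths, and explicit Euler steps of dx/dt = -Lx. The service names, the weights, and the helper names `laplacian` and `diffuse` are all hypothetical.

```python
# Illustrative sketch only (assumptions: static symmetric graph, fixed weights).
# The paper's module is temporal and learns its edge strengths; this does neither.

SERVICES = ["gateway", "payments", "ledger"]  # hypothetical service names

# Hypothetical symmetric edge weights (propagation strengths), no self-loops.
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]


def laplacian(weights):
    """Return the graph Laplacian L = D - W, where D holds the row sums of W."""
    n = len(weights)
    lap = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] = sum(weights[i]) - weights[i][i]
    return lap


def diffuse(scores, weights, step=0.1, n_steps=1):
    """Apply n_steps explicit Euler updates x <- x - step * L x."""
    lap = laplacian(weights)
    x = list(scores)
    n = len(x)
    for _ in range(n_steps):
        x = [x[i] - step * sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


if __name__ == "__main__":
    # An anomaly score injected at the gateway spreads to its neighbours.
    spread = diffuse([1.0, 0.0, 0.0], W, step=0.1, n_steps=5)
    for name, score in zip(SERVICES, spread):
        print(f"{name}: {score:.4f}")
```

Because the weights here are symmetric, the columns of L sum to zero and the total anomaly mass is conserved while it spreads from the source to its neighbours. A directed or time-varying graph, as the abstract describes, would not in general have this property.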

Original language: English
Article number: 236
Number of pages: 24
Journal: Mathematics
Volume: 14
Issue number: 2
Early online date: 8 Jan 2026
Publication status: E-pub ahead of print - 8 Jan 2026

Bibliographical note

Publisher Copyright:
© 2026 by the authors.

Keywords

  • temporal graph networks
  • root-cause analysis
  • microservice architecture
  • anomaly detection
  • AIOps
  • graph-based models
  • 68T07
