Graph Neural Networks (GNNs) excel in various domains, from detecting
e-commerce spam to social network classification problems. However, the lack of
public graph datasets hampers research progress, particularly in heterogeneous
information networks (HINs). The demand for datasets for fair HIN comparisons is
growing due to advancements in GNN interpretation models. In response, we
propose SynHIN, a unique method for generating synthetic heterogeneous
information networks. SynHIN identifies motifs in real-world datasets,
summarizes graph statistics, and constructs a synthetic network. Our approach
utilizes In-Cluster and Out-Cluster Merge modules to build the synthetic HIN
from primary motif clusters. After the In/Out-Cluster merges and a post-pruning
step that enforces the real dataset's constraints, the synthetic graph
statistics closely match those of the reference. SynHIN generates a synthetic
heterogeneous graph dataset for node classification tasks, using the primary
motif as the explanation ground truth. It thereby addresses the lack of
heterogeneous graph datasets and motif ground truths, making it useful for
assessing heterogeneous graph neural network (HGNN) explainers. We further present a
benchmark dataset for future heterogeneous graph explainer model research. Our
work marks a significant step towards explainable AI in HGNNs.
Generating Synthetic Heterogeneous Information Networks with SynHIN
Graph Neural Networks (GNNs) have proven effective in various domains, ranging from e-commerce spam detection to social network classification problems. However, the lack of public graph datasets, particularly for heterogeneous information networks (HINs), has hindered research progress in this area. As GNN interpretation models continue to advance, there is a growing demand for datasets that enable fair comparisons and evaluations.
In response to this challenge, we present SynHIN, a unique method for generating synthetic heterogeneous information networks. SynHIN takes inspiration from real-world datasets by identifying motifs, which are recurring patterns or substructures, and leveraging them to construct a synthetic network. By summarizing graph statistics and employing In-Cluster and Out-Cluster Merge modules, our approach builds the synthetic HIN based on primary motif clusters.
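To make the statistics-summarizing step concrete, the following is a minimal sketch of how per-type counts might be collected from a small heterogeneous graph. The graph representation (a dict of node types plus a list of typed edge triples), the function name `summarize_hin`, and the toy data are assumptions made for illustration; they are not SynHIN's actual data structures or its full set of statistics.

```python
from collections import Counter

def summarize_hin(node_types, edges):
    """Return per-type node counts, per-relation edge counts, and mean degree per node type.

    node_types: dict mapping node id -> type label
    edges: iterable of (src, relation, dst) triples, treated as undirected
    """
    node_counts = Counter(node_types.values())
    edge_counts = Counter()
    degree = Counter()
    for src, rel, dst in edges:
        # Key each edge by its (source type, relation, target type) triple.
        edge_counts[(node_types[src], rel, node_types[dst])] += 1
        degree[src] += 1
        degree[dst] += 1
    # Sum degrees per node type, then divide by the number of nodes of that type.
    degree_sum = Counter()
    for node, node_type in node_types.items():
        degree_sum[node_type] += degree[node]
    mean_degree = {t: degree_sum[t] / node_counts[t] for t in node_counts}
    return node_counts, edge_counts, mean_degree

# Hypothetical toy graph: two authors, two papers, one venue.
node_types = {"a1": "author", "a2": "author", "p1": "paper", "p2": "paper", "v1": "venue"}
edges = [
    ("a1", "writes", "p1"),
    ("a2", "writes", "p1"),
    ("a2", "writes", "p2"),
    ("p1", "published_in", "v1"),
    ("p2", "published_in", "v1"),
]
node_counts, edge_counts, mean_degree = summarize_hin(node_types, edges)
```

Statistics of this kind, computed on the real dataset, could then serve as targets that the synthetic graph is compared against.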
One of the key considerations in developing SynHIN is ensuring that the synthetic graph closely aligns with the reference dataset. To achieve this, we apply the In/Out-Cluster merges and then perform a post-pruning process that makes the synthetic graph adhere to the constraints imposed by the real dataset. The result is a synthetic heterogeneous graph dataset whose characteristics and statistics are similar to those of the reference.
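One way to picture a post-pruning step is as edge removal under a per-relation budget that never touches motif edges. The sketch below is only an illustration under that assumption: the function name `prune_to_targets`, the random removal policy, and the idea of expressing the constraints as a maximum edge count per relation are all hypothetical and may differ from the pruning rule SynHIN actually uses.

```python
import random

def prune_to_targets(edges, protected, targets, seed=0):
    """Drop random non-protected edges until each relation is at or below its target.

    edges: list of (src, relation, dst) triples
    protected: set of triples that must survive (for example, motif edges)
    targets: dict mapping relation -> maximum number of edges to keep
    """
    rng = random.Random(seed)
    by_relation = {}
    for edge in edges:
        by_relation.setdefault(edge[1], []).append(edge)
    kept = []
    for rel, group in by_relation.items():
        fixed = [e for e in group if e in protected]
        removable = [e for e in group if e not in protected]
        # Budget left for non-protected edges once all protected edges are kept.
        # Protected edges are kept even if they alone exceed the target.
        budget = max(targets.get(rel, len(group)) - len(fixed), 0)
        rng.shuffle(removable)
        kept.extend(fixed + removable[:budget])
    return kept

# Hypothetical example: four "writes" edges, one of which belongs to a motif.
edges = [
    ("a1", "writes", "p1"),
    ("a2", "writes", "p1"),
    ("a2", "writes", "p2"),
    ("a3", "writes", "p2"),
]
protected = {("a1", "writes", "p1")}
pruned = prune_to_targets(edges, protected, {"writes": 2})
```

Keeping the protected set intact matters because the motif edges are what later serve as the explanation ground truth.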
The primary objective of SynHIN is to generate node classification datasets that come with an explanation ground truth. Because the primary motif serves as that ground truth, researchers can assess the efficacy of heterogeneous graph neural network explainers against it. This aspect of our approach is particularly valuable in addressing the lack of available heterogeneous graph datasets and motif ground truths.
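As an illustration of how a motif ground truth can be used, the sketch below scores the edges returned by an explainer against the edges of the known motif using precision, recall, and F1. The function name `explanation_scores`, the undirected edge-set comparison, and the choice of metrics are assumptions for this example, not necessarily the evaluation protocol used with SynHIN.

```python
def explanation_scores(predicted_edges, motif_edges):
    """Precision, recall, and F1 of predicted explanation edges against motif edges."""
    # Treat edges as undirected by storing each pair of endpoints as a frozenset.
    pred = {frozenset(e) for e in predicted_edges}
    truth = {frozenset(e) for e in motif_edges}
    hits = len(pred & truth)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical motif around paper p1 and a hypothetical explainer output.
motif = [("a1", "p1"), ("p1", "v1"), ("a2", "p1")]
predicted = [("p1", "a1"), ("p1", "v1"), ("a2", "p2"), ("p2", "v1")]
precision, recall, f1 = explanation_scores(predicted, motif)
```

Here two of the four predicted edges belong to the motif, and two of the three motif edges are recovered, so an explainer that returned exactly the motif edges would score 1.0 on all three metrics.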
In addition to providing a method for generating synthetic HINs, we present a benchmark dataset for future research on heterogeneous graph explainer models. This benchmark dataset will serve as a resource for evaluating and comparing different explainers in the context of heterogeneous graph neural networks.
Overall, our work with SynHIN is a significant step towards advancing the field of explainable AI in heterogeneous graph neural networks (HGNNs). By addressing the scarcity of public graph datasets and providing a means to generate synthetic HINs, we enable researchers to make meaningful progress in understanding and interpreting the behavior of these complex networks. The multi-disciplinary nature of our approach, which combines concepts from graph theory, machine learning, and data synthesis, underscores the wide-ranging potential and applicability of this work.