Understanding intermediate representations of the concepts learned by deep
learning classifiers is indispensable for interpreting general model behaviors.
Existing approaches to reveal learned concepts often rely on human supervision,
such as pre-defined concept sets or segmentation processes. In this paper, we
propose a novel unsupervised method for discovering distributed representations
of concepts by selecting a principal subset of neurons. Our empirical findings
demonstrate that instances with similar neuron activation states tend to share
coherent concepts. Based on these observations, the proposed method selects
principal neurons that construct an interpretable region, namely a Relaxed
Decision Region (RDR), which encompasses instances with coherent concepts in
the feature space. The RDR can be used to identify unlabeled subclasses within
the data and to detect the causes of misclassifications. Furthermore, applying
our method to various layers reveals distinct distributed representations
across the layers, providing deeper insights into the internal mechanisms of
the deep learning model.
Unsupervised Method for Discovering Distributed Representations of Concepts in Deep Learning
Understanding the intermediate representations of concepts learned by deep learning classifiers is crucial for interpreting the general behaviors of these models. However, most existing approaches rely on human supervision, such as predefined concept sets or segmentation processes, which limits their applicability and scalability. In this paper, a novel unsupervised method is proposed that discovers distributed representations of concepts by selecting a principal subset of neurons.
The key finding of this study is that instances with similar neuron activation states tend to share coherent concepts. Based on this observation, the proposed method selects principal neurons to form an interpretable region called a Relaxed Decision Region (RDR). The RDR encompasses instances with coherent concepts in the feature space, allowing for the identification of unlabeled subclasses within the data and the detection of causes of misclassifications.
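To make this concrete, the sketch below shows one way such a region could be formed over binary activation states, assuming post-ReLU activations and taking the anchor instance's most strongly activated neurons as the principal subset. Both the selection rule and the helper `relaxed_decision_region` are illustrative assumptions rather than the paper's actual algorithm.

```python
# Minimal, hypothetical sketch of a relaxed decision region over activation
# states. The neuron-selection rule (the anchor's strongest activations) is an
# assumption for illustration, not the paper's selection procedure.
import numpy as np

def relaxed_decision_region(acts, anchor_idx, k=10):
    """Return instances whose on/off activation states match the anchor
    on a principal subset of k neurons.

    acts: (n_instances, n_neurons) post-ReLU feature activations.
    anchor_idx: index of the instance anchoring the region.
    k: number of principal neurons defining the relaxed region (assumed).
    """
    states = acts > 0                               # binary activation states
    principal = np.argsort(-acts[anchor_idx])[:k]   # assumed "principal" neurons
    match = (states[:, principal] == states[anchor_idx, principal]).all(axis=1)
    return np.flatnonzero(match), principal

# Instances inside the region would be expected to share a coherent concept.
acts = np.maximum(np.random.randn(1000, 512), 0.0)  # stand-in for real activations
members, neurons = relaxed_decision_region(acts, anchor_idx=0, k=10)
print(f"{len(members)} instances match the anchor on {len(neurons)} neurons")
```

Inspecting the inputs indexed by `members`, for example within a single predicted class or around a misclassified instance, is how shared concepts, unlabeled subclasses, or failure causes would then be examined.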
One notable aspect of this approach is its interdisciplinary nature, combining ideas from deep learning, unsupervised learning, and interpretability research. By relying on unsupervised techniques, the method avoids the need for concept annotations or other human supervision, making it more scalable and applicable to real-world scenarios where such annotations are scarce or expensive to obtain.
Moreover, the discovery of distributed representations at various layers reveals distinct internal mechanisms within the deep learning model, providing deeper insights into how information is processed and transformed across different levels of abstraction and bridging the gap between deep learning theory and interpretability research.
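As a rough illustration of this layer-wise view, the sketch below collects activations at several depths of a standard convolutional network so that activation states can be compared per layer. The model (a torchvision ResNet-18) and the layer names are assumptions made only for illustration.

```python
# Hypothetical sketch: capture activations at several layers so the same
# region-based analysis can be repeated per layer. Model and layer names are
# illustrative assumptions, not choices prescribed by the paper.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # flatten spatial dimensions so each instance has one activation vector
        captured[name] = output.detach().flatten(start_dim=1)
    return hook

modules = dict(model.named_modules())
for name in ["layer2", "layer3", "layer4"]:
    modules[name].register_forward_hook(make_hook(name))

x = torch.randn(8, 3, 224, 224)            # stand-in batch of images
with torch.no_grad():
    model(x)

for name, acts in captured.items():
    states = (acts > 0).float()             # per-layer binary activation states
    print(name, tuple(acts.shape), "fraction of neurons on:", round(states.mean().item(), 3))
```

Comparing which neurons enter the principal subset at each layer is one way the distinct distributed representations described above could be made visible.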
The empirical results demonstrate the effectiveness of the proposed method. By selecting principal neurons that form coherent regions in the feature space, the method uncovers meaningful concepts learned by deep learning classifiers, opening up possibilities for understanding and interpreting the inner workings of these models more transparently.
In conclusion, this paper presents a novel unsupervised method for discovering distributed representations of concepts in deep learning. By selecting principal neurons that form interpretable regions, the proposed method enables the identification of unlabeled subclasses and the detection of the causes of misclassifications. Its unsupervised design and the deeper insights it provides into the internal mechanisms of deep learning models make it a valuable contribution to both the deep learning and interpretability research communities.