Systematic adaptation of network depth at runtime can be an effective way to
control inference latency and meet the resource constraints of various devices.
However, previous depth-adaptive networks provide neither general principles
nor a formal explanation of why and which layers can be skipped; hence, their
approaches are hard to generalize and require long, complex training steps. In
this paper, we present an architectural pattern and training method for
adaptive depth networks that provide flexible accuracy-efficiency trade-offs in
a single network. In our approach, every residual stage is divided into two
consecutive sub-paths with different properties. While the first sub-path is
mandatory for hierarchical feature learning, the other is optimized to incur
minimal performance degradation even if it is skipped. Unlike previous adaptive
networks, our approach does not iteratively self-distill a fixed set of
sub-networks, resulting in significantly shorter training time. However, once
deployed on devices, it can instantly construct sub-networks of varying depths
to provide various accuracy-efficiency trade-offs in a single model. We provide
a formal rationale for why the proposed architectural pattern and training
method reduce overall prediction errors while minimizing the impact of skipping
selected sub-paths. We also demonstrate the generality and effectiveness of our
approach with various residual networks, from both convolutional neural
networks and vision transformers.

Expert Commentary: The Flexibility of Adaptive Depth Networks

Runtime adaptation of neural network depth can be a powerful technique to control inference latency and meet the computational restrictions of diverse devices. However, previous approaches to adaptive depth networks have lacked general principles and formal explanations for why and which layers can be skipped, making them difficult to generalize and requiring extensive training steps.

This paper introduces an architectural pattern and training method for adaptive depth networks that address these limitations and offer flexible accuracy-efficiency trade-offs within a single network. The proposed approach divides each residual stage into two consecutive sub-paths with different properties. The first sub-path is deemed mandatory for hierarchical feature learning, while the second is optimized to minimize performance degradation when skipped.
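To make the described pattern concrete, here is a minimal, hypothetical sketch (not the authors' code) of a residual stage split into a mandatory first sub-path and a skippable second sub-path. Features are plain floats and each "block" is a toy residual update; the real networks would use convolutional or transformer blocks.

```python
def block(x, weight):
    """A toy residual block: x plus a simple transformation of x."""
    return x + weight * x

class ResidualStage:
    """One residual stage divided into two consecutive sub-paths."""

    def __init__(self, w_mandatory, w_skippable):
        self.w_mandatory = w_mandatory  # first sub-path: always executed
        self.w_skippable = w_skippable  # second sub-path: optional at inference

    def forward(self, x, skip=False):
        # Mandatory sub-path: responsible for hierarchical feature learning.
        x = block(x, self.w_mandatory)
        # Skippable sub-path: trained so that dropping it degrades
        # accuracy minimally, trading accuracy for lower latency.
        if not skip:
            x = block(x, self.w_skippable)
        return x

stage = ResidualStage(w_mandatory=0.5, w_skippable=0.1)
full = stage.forward(1.0)             # both sub-paths executed
fast = stage.forward(1.0, skip=True)  # mandatory sub-path only
```

The `skip` flag is the only runtime control needed, which is what makes the depth adaptation instantaneous at inference time.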

The key distinction of this approach is that it does not rely on iteratively self-distilling a fixed set of sub-networks, which significantly reduces training time. Nevertheless, once deployed on devices, the network can instantly construct sub-networks of varying depths, providing a range of accuracy-efficiency trade-offs within a single model.
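The instant sub-network construction can be sketched as follows. This is an illustrative assumption, not the paper's implementation: each stage exposes a boolean choice (skip its second sub-path or not), and a deployment-time configuration selects one combination; the sub-path count serves only as a crude latency proxy.

```python
def build_subnetwork(num_stages, skip_pattern):
    """Return a per-stage skip configuration and a crude depth-based cost.

    skip_pattern is a tuple of booleans, one per residual stage; True means
    that stage's second sub-path is skipped at inference time.
    """
    assert len(skip_pattern) == num_stages
    # Each stage runs 2 sub-paths normally, 1 when its second is skipped.
    executed_subpaths = sum(1 if skip else 2 for skip in skip_pattern)
    return {"skip_pattern": skip_pattern,
            "executed_subpaths": executed_subpaths}

# A 4-stage network yields 2**4 = 16 selectable sub-networks, all drawn
# from a single trained model with no retraining.
fastest = build_subnetwork(4, (True, True, True, True))      # shallowest
largest = build_subnetwork(4, (False, False, False, False))  # full depth
```

A device could pick a pattern whose cost fits its latency budget and switch patterns on the fly as conditions change.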

The authors also provide a formal rationale for why the proposed architectural pattern and training method reduce overall prediction errors while minimizing the impact of skipping selected sub-paths, offering valuable insight into the mechanics of adaptive depth networks.

Furthermore, the generality and effectiveness of the approach are demonstrated by applying it to various residual networks, including both convolutional neural networks (CNNs) and vision transformers. This shows that the underlying concepts are not tied to a single architecture family but apply wherever residual stages are present.

Conclusion

The introduction of an architectural pattern and training method for adaptive depth networks fills a crucial gap in the field of deep learning. By providing a formal rationale and general principles, this approach allows for more efficient and flexible network architectures. Moreover, its effectiveness with different network types emphasizes its broad applicability. Moving forward, this research opens up possibilities for further exploration and optimization in adaptive depth networks across various domains.
