Densely Connected Convolutional Networks
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer, our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet
Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago, improvements in computer hardware and network structure have enabled the training of truly deep CNNs only recently. The original LeNet5 consisted of 5 layers, VGG featured 19, and only last year have Highway Networks and Residual Networks (ResNets) surpassed the 100-layer barrier.
As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and "wash out" by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets and Highway Networks bypass signal from one layer to the next via identity connections. Stochastic depth shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.
In this paper, we propose an architecture that distills this insight into a simple connectivity pattern: to ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. Figure 1 illustrates this layout schematically. Crucially, in contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the
ℓth layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures. Because of its dense connectivity pattern, we refer to our approach as the Dense Convolutional Network (DenseNet).
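The connection count follows from a simple sum: layer ℓ receives one input per preceding block (the original input plus the outputs of the ℓ−1 earlier layers), so an L-layer network has 1 + 2 + ... + L = L(L+1)/2 direct connections. A minimal check of this arithmetic (the function name is ours, purely illustrative):

```python
def num_dense_connections(L):
    """Layer l of a densely connected L-layer network receives l inputs
    (the original input plus the outputs of the l-1 preceding layers),
    so the total number of direct connections is 1 + 2 + ... + L."""
    return sum(l for l in range(1, L + 1))

# The sum matches the closed form L(L+1)/2:
for L in (1, 5, 100):
    assert num_dense_connections(L) == L * (L + 1) // 2

print(num_dense_connections(100))  # 5050
```

For a traditional L-layer chain the same count is just L, which is why dense connectivity grows quadratically with depth within a block.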
A possibly counter-intuitive effect of this dense connectivity pattern is that it requires fewer parameters than traditional convolutional networks, as there is no need to relearn redundant feature-maps. Traditional feed-forward architectures can be viewed as algorithms with a state, which is passed on from layer to layer. Each layer reads the state from its preceding layer and writes to the subsequent layer. It changes the state but also passes on information that needs to be preserved. ResNets make this information preservation explicit through additive identity transformations. Recent variations of ResNets show that many layers contribute very little and can in fact be randomly dropped during training. This makes the state of ResNets similar to (unrolled) recurrent neural networks, but the number of parameters of ResNets is substantially larger because each layer has its own weights. Our proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved. DenseNet layers are very narrow (e.g., 12 filters per layer), adding only a small set of feature-maps to the "collective knowledge" of the network and keeping the remaining feature-maps unchanged; the final classifier then makes a decision based on all feature-maps in the network.
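The mechanics can be sketched in a few lines of NumPy. This is not the paper's implementation: each composite function H_ℓ (in the paper, BN-ReLU-Conv) is stood in by a random linear map followed by ReLU, spatial convolution is omitted, and each "feature-map" is a plain vector. What the sketch does show is the accumulation: every layer reads the concatenation of everything before it and appends only `growth_rate` new feature-maps to the collective state.

```python
import numpy as np

def dense_block(x0, num_layers, growth_rate, rng):
    """Toy dense block: each layer sees the concatenation of all
    preceding feature-maps and contributes `growth_rate` new ones.
    A random linear map + ReLU stands in for the real H_l."""
    features = [x0]                                  # running "collective knowledge"
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)       # [x0, x1, ..., x_{l-1}]
        w = rng.standard_normal((growth_rate, inp.shape[0]))
        features.append(np.maximum(w @ inp, 0.0))    # add growth_rate new maps
    return np.concatenate(features, axis=0)          # classifier sees everything

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 8))                     # 3 input feature-maps of width 8
out = dense_block(x0, num_layers=4, growth_rate=12, rng=rng)
print(out.shape)                                     # (3 + 4*12, 8) = (51, 8)
```

Note how the channel count grows additively, 3 + ℓ·12 after ℓ layers, rather than each layer carrying a full-width state of its own; this additive growth is why such narrow layers suffice.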
Besides better parameter efficiency, one big advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easy to train. Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision. This helps the training of deeper network architectures. Further, we also observe that dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.
We evaluate DenseNets on four highly competitive benchmark datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet). Our models tend to require far fewer parameters than existing algorithms with comparable accuracy. Further, we significantly outperform the current state-of-the-art results on most of the benchmark tasks.
The exploration of network architectures has been a part of neural network research since their initial discovery. The recent resurgence in popularity of neural networks has also revived this research domain. The increasing number of layers in modern networks amplifies the differences between architectures and motivates the exploration of different connectivity patterns and the revisiting of old research ideas.
A cascade structure similar to our proposed dense network layout has already been studied in the neural networks literature in the 1980s. That pioneering work focuses on fully connected multi-layer perceptrons trained in a layer-by-layer fashion. More recently, fully connected cascade networks trained with batch gradient descent were proposed. Although effective on small datasets, this approach only scales to networks with a few hundred parameters.
In [9, 23, 31, 41], utilizing multi-level features in CNNs through skip-connections has been found to be effective for various vision tasks. In parallel to our work, a purely theoretical framework for networks with cross-layer connections similar to ours was derived.
Highway Networks were amongst the first architectures that provided a means to effectively train end-to-end networks with more than 100 layers. Using bypassing paths along with gating units, Highway Networks with hundreds of layers can be optimized without difficulty. The bypassing paths are presumed to be the key factor that eases the training of these very deep networks. This point is further supported by ResNets, in which pure identity mappings are used as bypassing paths. ResNets have achieved impressive, record-breaking performance on many challenging image recognition, localization, and detection tasks, such as ImageNet and COCO object detection. Recently, stochastic depth was proposed as a way to successfully train a 1202-layer ResNet. Stochastic depth improves the training of deep residual networks by dropping layers randomly during training. This shows that not all layers may be needed and highlights that there is a great amount of redundancy in deep (residual) networks. Our paper was partly inspired by that observation. ResNets with pre-activation also facilitate the training of state-of-the-art networks with > 1000 layers.
An orthogonal approach to making networks deeper (e.g., with the help of skip connections) is to increase the network width. The GoogLeNet [36, 37] uses an "Inception module" which concatenates feature-maps produced by filters of different sizes. A variant of ResNets with wide generalized residual blocks was also proposed. In fact, simply increasing the number of filters in each layer of ResNets can improve its performance, provided the depth is sufficient. FractalNets also achieve competitive results on several datasets using a wide network structure.
Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and highly parameter efficient. Concatenating feature-maps learned by different layers increases variation in the input of subsequent layers and improves efficiency. This constitutes a major difference between DenseNets and ResNets. Compared to Inception networks [36, 37], which also concatenate features from different layers, DenseNets are simpler and more efficient.
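The contrast with ResNets can be made concrete in a toy NumPy comparison (the shapes and the random stand-in for the layer function H are our own illustrative choices): summation merges new features into the old ones, leaving the channel count unchanged, whereas concatenation keeps both operands intact for every subsequent layer.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))                       # 4 feature-maps of width 8
h = np.maximum(rng.standard_normal((4, 4)) @ x, 0.0)  # stand-in for H(x)

res_out = x + h                              # ResNet: combine by summation
dense_out = np.concatenate([x, h], axis=0)   # DenseNet: combine by concatenation

print(res_out.shape)    # (4, 8)  -- new features are folded into the old state
print(dense_out.shape)  # (8, 8)  -- both x and H(x) remain separately visible
```

Because concatenation never overwrites earlier feature-maps, later layers (and the final classifier) can weight old and new features independently, which is the mechanism behind the feature reuse described above.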
There are other notable network architecture innovations which have yielded competitive results. The Network in Network (NIN) structure incorporates micro multi-layer perceptrons into the filters of convolutional layers to extract more complicated features. In the Deeply Supervised Network (DSN), internal layers are directly supervised by auxiliary classifiers, which can strengthen the gradients received by earlier layers. Ladder Networks [27, 25] introduce lateral connections into autoencoders, producing impressive accuracies on semi-supervised learning tasks. Deeply-Fused Nets (DFNs) were proposed to improve information flow by combining intermediate layers of different base networks. The augmentation of networks with pathways that minimize reconstruction losses was also shown to improve image classification models.