(Neural Best-buddies, Aberman et al. in ACM SIGGRAPH 2018)
(Robust flash deblurring, Zhuo et al. in CVPR 2010)
(Scene Completion Using Millions of Photographs, Hays et al. in ACM SIGGRAPH 2007)
(Visual Attribute Transfer through Deep Image Analogy, Liao et al. in ACM SIGGRAPH 2017)
(Cross-domain Correspondence Learning for Exemplar-based Image Translation, Zhang et al. in CVPR 2020)
(Across Scales & Across Dimensions: Temporal Super-Resolution using Deep Internal Learning, Zuckerman et al. in ECCV 2020)
(Light Field Super-Resolution with Zero-Shot Learning, Cheng et al. in CVPR 2021)
Target: Controllable image manipulation. Conditional generation needs an additional prior (edge/layout), and finding a semantically meaningful latent code is non-trivial.
Key idea: image swapping as a pretext task
Origin:
Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization, Huang et al. ICCV 2017
FiLM: Visual Reasoning with a General Conditioning Layer, Perez et al. AAAI 2018
Application:
Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform, Wang et al. CVPR 2018
Guided Image-to-Image Translation with Bi-Directional Feature Transformation, AlBahar and Huang, ICCV 2019
SEAN: Image Synthesis with Semantic Region-Adaptive Normalization, Zhu et al. CVPR 2020
Analyzing and Improving the Image Quality of StyleGAN, Karras et al. CVPR 2020
Adaptive Convolutions for Structure-Aware Style Transfer, Chandran et al. CVPR 2021
Target: Few-shot synthesis (hard to train or adapt). Key idea: transfer relationships/similarities.
Pairwise similarity Constraint
(1) Cross-domain consistency loss L_dist aims to preserve the relative pairwise distances between source and target generations. In this case, the relative similarities between synthesized images from z_0 and other latent codes are encouraged to be similar. (2) Relaxed realism is implemented by using two discriminators, D_img for noise sampled from the anchor region (z_anch) and D_patch otherwise.
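A minimal numpy sketch of the pairwise-similarity constraint (1), assuming features are taken from intermediate generator layers and similarity is cosine; the function names and the use of KL over softmax-normalized similarity profiles are illustrative, not the exact published implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_domain_dist_loss(feat_src, feat_tgt):
    """L_dist sketch: match each sample's relative-similarity profile
    (softmax over cosine similarities to the other samples in the batch)
    between the source and the adapted (target) generator."""
    def sim_profiles(f):
        f = np.asarray(f, dtype=float)
        f = f / np.linalg.norm(f, axis=1, keepdims=True)
        s = f @ f.T                      # pairwise cosine similarities
        n = len(f)
        mask = ~np.eye(n, dtype=bool)    # drop self-similarity
        return softmax(s[mask].reshape(n, n - 1), axis=1)
    p, q = sim_profiles(feat_src), sim_profiles(feat_tgt)
    # KL(p || q), averaged over the batch
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

Identical source/target features give zero loss, so the term only penalizes changes in the *relative* structure of generations, not the generations themselves.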
StyleGAN2 Distillation for Feed-forward Image Manipulation, Viazovetskyi et al. ECCV 2020
Controlling Style and Semantics in Weakly-Supervised Image Generation, Pavllo et al. ECCV 2020
COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder, Saito et al. ECCV 2020
Example-Guided Image Synthesis using Masked Spatial-Channel Attention and Self-Supervision, Zheng et al. ECCV 2020
Online Exemplar Fine-Tuning for Image-to-Image Translation, Kang et al. ArXiv 2020
Swapping Autoencoder for Deep Image Manipulation, Park et al. NeurIPS 2020
Conditional Generative Modeling via Learning the Latent Space, Ramasinghe et al. ICLR 2021
Semantic Layout Manipulation with High-Resolution Sparse Attention, Zheng et al. CVPR 2021
Spatially-Invariant Style-Codes Controlled Makeup Transfer, Deng et al. CVPR 2021
Adaptive Convolutions for Structure-Aware Style Transfer, Chandran et al. CVPR 2021
ReMix: Towards Image-to-Image Translation with Limited Data, Cao et al. CVPR 2021
Learning Semantic Person Image Generation by Region-Adaptive Normalization, Lv et al. CVPR 2021
Spatially-Adaptive Pixelwise Networks for Fast Image Translation, Shaham et al. CVPR 2021
Bi-level Feature Alignment for Versatile Image Translation and Manipulation, Zhan et al. ArXiv 2021
Controllable Person Image Synthesis with Spatially-Adaptive Warped Normalization, Zhang et al. ArXiv 2021
Sketch Your Own GAN, Wang et al. ICCV 2021
Semantically multi-modal image synthesis (SMIS): generating multi-modal images at the semantic level.
Previous work uses multiple class-specific generators, constraining its usage to datasets with a small number of classes.
Gu et al. (CVPR 2019) focused on portrait editing. However, this type of method soon faces performance degradation, and training time and computational resource consumption grow linearly as the number of classes increases. (really weak)
The key is to divide the latent code into a series of class-specific latent codes, each of which controls the generation of only one specific semantic class.
Applications: Semantically multi-modal image synthesis, Appearance mixture, Semantic manipulation, Style morphing
Controllable Image Synthesis
The process of working in feature space typically involves the following stages: an image is fed to a pre-trained classification network; its feature responses from different layers are extracted, and optionally manipulated according to the application at hand. The manipulated features are then inverted back to an image by solving a reconstruction optimization problem. However, the problem of inverting deep features into a realistic image is challenging – there is no one-to-one mapping between the deep features and an image, especially when the features are taken from deep layers. This has been addressed so far mostly by imposing regularization priors on the reconstructed image, which often leads to blurry unrealistic reconstructions and limits the type of features that can be used.
A hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features. By doing so, we bridge the gap between optimization based methods for feature inversion and generative adversarial learning.
Semantic pyramid image pipeline. (a) The generator works in full mirror-like conjunction with a pre-trained classification model. Each stage of the classification model has a corresponding block in the generator. (b) Specification of a single generator block. The feature map is first multiplied by its input mask. The masked feature map then undergoes a convolution layer and the result is summed with the result of the corresponding generator block.
generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing
High-quality image synthesis from salient object layout.
I don’t think this is a new task.
Top row: images synthesized from semantic segmentation maps. Bottom row: high-resolution images synthesized from salient object layouts, which allows users to create an image by drawing only a few bounding boxes.
Scene graph (Johnson et al. CVPR 2018), with rich structural representation, can potentially reveal more visual relations of objects in an image. However, pairwise object relation labels are difficult to obtain in real-life applications. The lack of object size, location and background information also limits the quality of synthesized images.
Layout2im (CVPR 2019) proposed the task of image synthesis from object layout; however, both foreground and background object layouts are required, and only low-resolution images are generated.
High-resolution(?) synthesis and background inference from foreground layout.
BachGAN generates an image via two steps: (i) a background retrieval module selects from a large candidate pool a set of segmentation maps most relevant to the given object layout; (ii) these candidate layouts are then encoded via a background fusion module to hallucinate a best-matching background. With this retrieval-and-hallucination approach, BachGAN can dynamically provide detailed and realistic background that aligns well with any given foreground layout.
a slightly higher-resolution version of Layout2im; paper-level progress
3D Controllable Image Synthesis: We define this task as an unsupervised learning problem, where a 3D controllable generative image synthesis model that allows for manipulating 3D scene properties is learned without 3D supervision.
Current image synthesis models operate in the 2D domain where disentangling 3D properties such as camera viewpoint or object pose is challenging. Furthermore, they lack an interpretable and controllable representation.
Our key idea is to learn the image generation process jointly in 3D and 2D space by combining a 3D generator with a differentiable renderer and a 2D image synthesis model. This allows our model to learn abstract 3D representations which conform to the physical image formation process, thereby retaining interpretability and controllability.
a 3D image generation baseline
Latent space editing for image synthesis
First, we introduce noise optimization as a complement to the W+ latent space embedding. Second, we extend the global W+ latent space embedding to enable local embeddings. Third, we combine embedding with activation tensor manipulation to perform high-quality local edits along with global semantic edits on images.
some fancy face editing results
synthetic image generation
Starting from SPADE: 1) use only one style code for the whole image; 2) insert the style code only at the beginning of the network.
None of previous networks use style information to generate spatially varying normalization parameters.
control the style of each semantic region individually, e.g., we can specify one style reference image per region
use style input images to create spatially varying normalization parameters per semantic region. An important aspect of this work is that the spatially varying normalization parameters are dependent on the segmentation mask as well as the style input images.
SEAN normalization
The input are style matrix ST and segmentation mask M. In the upper part, the style codes in ST undergo a per style convolution and are then broadcast to their corresponding regions according to M to yield a style map. The style map is processed by conv layers to produce per pixel normalization values $\gamma^s$ and $\beta^s$ . The lower part (light blue layers) creates per pixel normalization values using only the region information similar to SPADE.
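The broadcast step of the upper path can be sketched as follows (a simplified numpy illustration; the per-style convolutions and the conv layers that turn the style map into $\gamma^s$ and $\beta^s$ are omitted, and the function name is hypothetical):

```python
import numpy as np

def sean_style_map(style_codes, mask):
    """Broadcast per-region style codes to a spatial style map (SEAN, upper path).
    style_codes: (S, C), one code per semantic region.
    mask: (H, W) integer map of region ids in [0, S).
    Returns an (H, W, C) style map where every pixel carries its region's code."""
    H, W = mask.shape
    style_map = np.zeros((H, W, style_codes.shape[1]))
    for s in range(style_codes.shape[0]):
        style_map[mask == s] = style_codes[s]
    return style_map
```

Because the broadcast follows the segmentation mask, the downstream normalization parameters depend on both the mask and the style inputs, which is exactly the property the paper emphasizes.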
The Generator
(A) On the left, the style encoder takes an input image and outputs a style matrix ST. The generator on the right consists of interleaved SEAN ResBlocks and Upsampling layers. (B) A detailed view of a SEAN ResBlock used in (A).
Applications: style interpolation; a per-region extension to SPADE
Conditional image generation of a GAN framework using our proposed attentive normalization module. (a) Class conditional image generation. (b) Image inpainting.
Conditional Image Synthesis
Traditional convolution-based generative adversarial networks synthesize images based on hierarchical local operations, where long-range dependency relation is implicitly modeled with a Markov chain. It is still not sufficient for categories with complicated structures.
Self-Attention GAN: the self-attention module requires computing the correlation between every two points in the feature map. Therefore, the computational cost grows rapidly as the feature map becomes large.
Instance Normalization (IN): the previous solution of (IN) normalizes the mean and variance of a feature map along its spatial dimensions. This strategy ignores the fact that different locations may correspond to semantics with varying mean and variance.
Attentive Normalization (AN) predicts a semantic layout from the input feature map and then conducts regional instance normalization on the feature map based on this layout.
AN is formed by the proposed semantic layout learning (SLL) module, and a regional normalization, as shown in Figure 2. It has a semantics learning branch and a self-sampling branch. The semantic learning branch employs a certain number of convolutional filters to capture regions with different semantics (which are activated by a specific filter), with the assumption that each filter in this branch corresponds to some semantic entities.
semantics-aware attention + regional normalization
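A toy numpy sketch of the regional normalization step, assuming a soft layout has already been predicted by the SLL module (shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def regional_instance_norm(x, layout, eps=1e-5):
    """Regional normalization sketch (AN).
    x: (C, H, W) feature map.
    layout: (K, H, W) soft assignment of each pixel to K learned semantic
    regions (assumed to sum to 1 over K at each pixel).
    Each region gets its own per-channel mean/variance instead of one
    global pair as in plain instance normalization."""
    out = np.zeros_like(x)
    for k in range(layout.shape[0]):
        w = layout[k]                                   # (H, W) soft membership
        denom = w.sum() + eps
        mu = (x * w).sum(axis=(1, 2)) / denom           # per-channel regional mean
        var = (((x - mu[:, None, None]) ** 2) * w).sum(axis=(1, 2)) / denom
        out += w * (x - mu[:, None, None]) / np.sqrt(var[:, None, None] + eps)
    return out
```

With a single all-ones region this reduces to ordinary instance normalization, which makes the claimed limitation of IN (one mean/variance per map) easy to see.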
an image-to-image translation problem suitable for the setting when domain labels are unavailable.
Image-to-image translation approaches require domain labels at training as well as at inference time. The recent FUNIT model relaxes this constraint partially. Thus, to extract the style at inference time, it uses several images from the target domain as guidance for translation (known as the few-shot setting). The domain annotations are however still needed during training.
The only external (weak) supervision used by our approach are coarse segmentation maps estimated using an off-the-shelf semantic segmentation network.
HiDT learning data flow. We show half of the (symmetric) architecture; s′ = Es(x′) is the style extracted from the other image x′, and ŝ′ is obtained similarly to ŝ with x and x′ swapped. Light blue nodes denote data elements; light green, loss functions; others, functions (subnetworks). Functions with identical labels have shared weights. Adversarial losses are omitted for clarity.
Enhancement scheme: the input is split into subimages (color-coded) that are translated individually by HiDT at medium resolution. The outputs are then merged using the merging network Genh. For illustration purposes, we show upsampling by a factor of two, but in the experiments we use a factor of four. We also apply bilinear downsampling (with shifts – see text for detail) rather than strided subsampling when decomposing the input into medium resolution images
High-resolution translation
Swapping styles between two images. Original images are shown on the main diagonal. The examples show that HiDT is capable of swapping the styles between two real images while preserving details.
Unsupervised image-to-image translation
Current translation frameworks discard the discriminator once training is complete. This paper argues for a novel role of the discriminator: reusing it to encode the images of the target domain.
We reuse a certain number of early layers of the discriminator as the encoder of the target domain.
We develop a decoupled training strategy in which the encoder is only trained when maximizing the adversarial loss and is kept frozen otherwise.
sounds like a plug-in strategy to all I2I frameworks.
Few-shot(both in source and target) unpaired image-to-image translation
(c) Few-shot semi-supervised (Ours): same as few-shot, but the source domain has only a limited amount of labeled data at train time.
First, the target domain is required to contain the same categories or attributes as the source domain at test time, therefore failing to scale to unseen categories (see Fig. 1(a)). Second, they highly rely upon having access to vast quantities of labeled data (Fig. 1(a, b)) at train time. Such labels provide useful information during the training process and play a key role in some settings (e.g. scalable I2I translation).
We propose using semi-supervised learning to reduce the requirement of labeled source images and effectively use unlabeled data. More concretely, we assign pseudo-labels to the unlabeled images based on an initial small set of labeled images. These pseudo-labels provide soft supervision to train an image translation model from source images to unseen target domains. Since this mechanism can potentially introduce noisy labels, we employ a pseudo-labeling technique that is highly robust to noisy labels. In order to further leverage the unlabeled images from the dataset (or even external images), we use a cycle consistency constraint [48].
exemplar-based image translation
Previous exemplar-based methods only use the style code globally. The style code only characterizes the global style of the exemplar, regardless of spatially relevant information; this causes some local styles to be "washed away" in the final image.
Deep Image Analogy is not cross-domain, may fail to handle a more challenging mapping from mask (or edge, keypoints) to photo since the pretrained network does not recognize such images.
With the cross-domain correspondence, we present a general solution to exemplar-based image translation, that for the first time, outputs images resembling the fine structures of the exemplar at instance level.
The network architecture comprises two sub-networks: 1) Cross-domain correspondence network transforms the inputs from distinct domains to an intermediate feature domain where reliable dense correspondence can be established; 2) Translation network employs a set of spatially-variant de-normalization blocks [38] to progressively synthesize the output, using the style details from a warped exemplar which is semantically aligned to the mask (or edge, keypoint map) according to the estimated correspondence.
Applications: Image editing, Makeup transfer
multiple domain image translation
Existing image-to-image translation methods have only considered a mapping between two domains, so they are not scalable to an increasing number of domains. For example, with K domains, these methods require training K(K-1) generators to handle translations between each and every domain pair, limiting their practical usage.
StarGAN still learns a deterministic mapping per each domain, which does not capture the multi-modal nature of the data distribution.
generate diverse images across multiple domains.
(a) The generator translates an input image into an output image reflecting the domain-specific style code. (b) The mapping network transforms a latent code into style codes for multiple domains, one of which is randomly selected during training. (c) The style encoder extracts the style code of an image, allowing the generator to perform reference-guided image synthesis. (d) The discriminator distinguishes between real and fake images from multiple domains
In particular, we start from StarGAN and replace its domain label with our proposed domain-specific style code that can represent diverse styles of a specific domain. To this end, we introduce two modules, a mapping network and a style encoder. The mapping network learns to transform random Gaussian noise into a style code, while the encoder learns to extract the style code from a given reference image. Considering multiple domains, both modules have multiple output branches, each of which provides style codes for a specific domain. Finally, utilizing these style codes, our generator learns to successfully synthesize diverse images over multiple domains
reference-guided synthesis
from panoptic map to image
Previous conditional image synthesis algorithms mostly rely on semantic maps, and often fail in complex environments where multiple instances occlude each other. This is the result of conventional convolution and upsampling algorithms being independent of class and instance boundaries.
We are interested in panoptic maps because semantic maps do not provide sufficient information to synthesize “things” (instances) especially in complex environments with multiple of them interacting with each other.
We propose Panoptic aware upsampling that addresses the misalignment between the upsampled low resolution features and high resolution panoptic maps. This ensures that the semantic and instance details are not lost, and that we also maintain higher accuracy alignment between the generated images and the panoptic maps.
Panoptic Aware Convolution Layer
Panoptic aware partial convolution layer takes a panoptic map (colorized for visualization) and based on the center of each sliding window it generates a binary mask, M. The pixels that share the same identity with the center of the window are assigned 1 and the others 0.
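The per-window mask construction can be sketched as follows (a hypothetical numpy helper for a single window position; the real layer does this for every sliding-window position inside the convolution):

```python
import numpy as np

def panoptic_window_mask(panoptic, cy, cx, k=3):
    """Panoptic-aware convolution sketch: binary mask for one k x k window.
    panoptic: (H, W) map of panoptic ids; (cy, cx) is the window center.
    Pixels sharing the center's panoptic id get 1, others get 0.
    Assumes the window lies fully inside the map (no padding handled)."""
    r = k // 2
    win = panoptic[cy - r:cy + r + 1, cx - r:cx + r + 1]
    return (win == panoptic[cy, cx]).astype(np.float32)
```

Multiplying the window's features by this mask is what keeps the convolution from mixing features across instance boundaries.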
Panoptic Aware Upsampling Layer
As shown in Figure (top), first we correct misalignment by replicating a feature vector from a neighboring pixel that belongs to the same panoptic instance. This operation is different from nearest neighbor upsampling which would always replicate the top-left feature. Second, as shown in Figure (bottom), we resolve pixels where new semantic or instance classes have just appeared by encoding new features from semantic maps with Panoptic aware convolution layer.
controllably generating realistic images with many objects and relationships from a freehand scene-level sketch
The authors argue that this is a new problem.
Using freehand sketches as conditional signal is hard.
Two sequential stages, foreground and background generation, based on the characteristics of scene-level sketching. The first stage focuses on foreground generation where the generated image content is supposed to exactly meet the user’s specific requirement. The second stage is responsible for background generation where the generated image content may be loosely aligned with the sketches.
Dataset: SketchyCOCO
Illustration of five-tuple ground truth data of SketchyCOCO, i.e., (a) {foreground image, foreground sketch, foreground edge maps} (training: 18,869, test: 1,329), (b) {background image, background sketch} (training: 11,265, test: 2,816), (c) {scene image, foreground image & background sketch} (training: 11,265, test: 2,816), (d) {scene image, scene sketch} (training: 11,265, test: 2,816), and (e) sketch segmentation (training: 11,265, test: 2,816)
EdgeGAN
It contains four sub-networks: two generators $G_I$ and $G_E$ , three discriminators $D_I$ , $D_E$ , and $D_J$ , an edge encoder $E$ and an image classifier $C$. EdgeGAN learns a joint embedding for an image and various-style edge maps depicting this image into a shared latent space where vectors can encode high-level attribute information from cross-modality data.
A weird task setting.
The model detects an object as a pair of bounding box corners grouped together. A convolutional network outputs a heatmap for all top-left corners, a heatmap for all bottom-right corners, and an embedding vector for each detected corner. The network is trained to predict similar embeddings for corners that belong to the same object.
The backbone network is followed by two prediction modules, one for the top-left corners and the other for the bottom-right corners. Using the predictions from both modules, we locate and group the corners.
CornerNet-Saccade speeds up inference by reducing the number of pixels to process. It uses an attention mechanism similar to saccades in human vision. It starts with a downsized full image and generates an attention map, which is then zoomed in on and processed further by the model. This differs from the original CornerNet in that it is applied fully convolutionally across multiple scales.
We predict a set of possible object locations from the attention maps and bounding boxes generated on a downsized full image. We zoom into each location and crop a small region around that location. Then we detect objects in each region. We control the efficiency by ranking the object locations and choosing top k locations to process. Finally, we merge the detections by NMS.
CornerNet-Squeeze speeds up inference by reducing the amount of processing per pixel. It incorporates ideas from SqueezeNet and MobileNets, and introduces a new, compact hourglass backbone that makes extensive use of 1×1 convolution, bottleneck layer, and depth-wise separable convolution.
In this paper, we propose ExtremeNet, a bottom-up object detection framework that detects four extreme points (top-most, left-most, bottom-most, right-most) of an object. We use a state-of-the-art keypoint estimation framework to find extreme points, by predicting four multi-peak heatmaps for each object category. In addition, we use one heatmap per category predicting the object center, as the average of two bounding box edges in both the x and y dimension. We group extreme points into objects with a purely geometry-based approach. We group four extreme points, one from each map, if and only if their geometric center is predicted in the center heatmap with a score higher than a pre-defined threshold. We enumerate all $O(n^4)$ combinations of extreme point predictions, and select the valid ones.
Given four extreme points t, b, r, l extracted from heatmaps Ŷ(t), Ŷ(l), Ŷ(b), Ŷ(r), we compute their geometric center $c=\left(\frac{l_{x}+r_{x}}{2}, \frac{t_{y}+b_{y}}{2}\right)$. If this center is predicted with a high response in the center map Ŷ(c), i.e. $\hat{Y}^{(c)}_{c_x, c_y} \geq \tau_c$ for a threshold $\tau_c$, we commit the extreme points as a valid detection. We then enumerate over all quadruples of keypoints t, b, r, l in a brute-force manner. We extract detections for each class independently.
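The center check can be sketched as follows (an illustrative helper, with extreme points given as (x, y) tuples and a hypothetical threshold):

```python
import numpy as np

def is_valid_group(t, l, b, r, center_heatmap, tau_c=0.1):
    """ExtremeNet grouping sketch: commit the quadruple (t, l, b, r) of
    extreme points, each an (x, y) pair, only if the center heatmap has a
    high response at their geometric center."""
    cx = (l[0] + r[0]) / 2.0  # center x from left-most and right-most points
    cy = (t[1] + b[1]) / 2.0  # center y from top-most and bottom-most points
    return center_heatmap[int(round(cy)), int(round(cx))] >= tau_c
```

In the full detector this test is applied to every quadruple produced by the brute-force enumeration, which is why the geometric check has to be this cheap.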
In this paper, we present a low-cost yet effective solution named CenterNet, which explores the central part of a proposal, i.e., the region that is close to the geometric center of a box, with one extra keypoint. We intuit that if a predicted bounding box has a high IoU with the ground-truth box, then the probability that the center keypoint in the central region of the bounding box will be predicted as the same class is high, and vice versa. Thus, during inference, after a proposal is generated as a pair of corner keypoints, we determine if the proposal is indeed an object by checking if there is a center keypoint of the same class falling within its central region.
A convolutional backbone network applies cascade corner pooling and center pooling to output two corner heatmaps and a center keypoint heatmap, respectively. Similar to CornerNet, a pair of detected corners and their similar embeddings are used to detect a potential bounding box. Then the detected center keypoints are used to determine the final bounding boxes.
We represent objects by a single point at their bounding box center. Other properties, such as object size, dimension, 3D extent, orientation, and pose are then regressed directly from image features at the center location. Object detection is then a standard keypoint estimation problem. We simply feed the input image to a fully convolutional network that generates a heatmap. Peaks in this heatmap correspond to object centers. Image features at each peak predict the object's bounding box height and width. The model trains using standard dense supervised learning. Inference is a single network forward-pass, without non-maximal suppression for post-processing.
The numbers in the boxes represent the stride to the image. (a): Hourglass Network, used as is from CornerNet. (b): ResNet with transpose convolutions. We add one 3 × 3 deformable convolutional layer before each up-sampling layer. Specifically, we first use deformable convolution to change the channels and then use transposed convolution to upsample the feature map (such two steps are shown separately in 32 → 16; we show these two steps together as a dashed arrow for 16 → 8 and 8 → 4). (c): The original DLA-34 for semantic segmentation. (d): Our modified DLA-34. We add more skip connections from the bottom layers and upgrade every convolutional layer in upsampling stages to a deformable convolutional layer.
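Peak extraction from a center heatmap can be sketched as follows (a naive numpy loop standing in for the 3×3 max-pooling trick that replaces NMS; names and the score threshold are illustrative):

```python
import numpy as np

def heatmap_peaks(hm, k=3, thresh=0.3):
    """CenterNet-style peak extraction sketch: a pixel is a peak if it equals
    the maximum of its k x k neighborhood and exceeds a score threshold.
    Returns a list of (y, x, score) tuples."""
    H, W = hm.shape
    r = k // 2
    padded = np.pad(hm, r, constant_values=-np.inf)
    peaks = []
    for y in range(H):
        for x in range(W):
            if hm[y, x] >= thresh and hm[y, x] == padded[y:y + k, x:x + k].max():
                peaks.append((y, x, float(hm[y, x])))
    return peaks
```

In a real implementation the same check is done in one shot with a max-pooling layer and an equality comparison, which is why no separate NMS pass is needed.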
Objects as Points (CenterNet-Z)
Our motivation is to let each instance select the best level of feature freely to optimize the network, so there should be no anchor boxes to constrain the feature selection in our module. Instead, we encode the instances in an anchor-free manner to learn the parameters for classification and regression. An anchor-free branch is built per level of feature pyramid, independent to the anchor-based branch. Similar to the anchor-based branch, it consists of a classification subnet and a regression subnet. An instance can be assigned to arbitrary level of the anchor-free branch. During training, we dynamically select the most suitable level of feature for each instance based on the instance content instead of just the size of instance box. The selected level of feature then learns to detect the assigned instances. At inference, the FSAF module can run independently or jointly with anchor-based branches.
The FSAF module only introduces two additional conv layers (dashed feature maps) per pyramid level, keeping the architecture fully convolutional.
Online feature selection mechanism. Each instance is passing through all levels of anchor-free branches to compute the averaged classification (focal) loss and regression (IoU) loss over effective regions. Then the level with minimal summation of two losses is selected to set up the supervision signals for that instance.
FoveaBox is motivated from the fovea of human eyes: the center of the vision field (object) is with the highest visual acuity. FoveaBox jointly predicts the locations where the object’s center area is likely to exist as well as the bounding box at each valid location. Thanks to the feature pyramidal representations, different scales of objects are naturally detected from multiple levels of features.
The design of the architecture follows RetinaNet to make a fair comparison. FoveaBox uses a Feature Pyramid Network backbone on top of a feedforward ResNet architecture. To this backbone, FoveaBox attaches two subnetworks, one for classifying the corresponding cells and one for predicting the (x1, y1, x2, y2) of ground-truth object boxes. For each spatial output location, FoveaBox predicts one score output for each class and the corresponding 4-dimensional box, which is different from previous works attaching A anchors at each position (usually A = 9).
In order to suppress these low-quality detections, we introduce a novel “center-ness” branch (only one layer) to predict the deviation of a pixel to the center of its corresponding bounding box. This score is then used to down-weight low-quality detected bounding boxes and merge the detection results in NMS. The simple yet effective center-ness branch allows the FCN-based detector to outperform anchor-based counterparts under exactly the same training and testing settings.
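The center-ness target can be written down directly from the per-pixel distances l, t, r, b to the four sides of the ground-truth box (a transcription of the FCOS formula; the function name is illustrative):

```python
import math

def centerness(l, t, r, b):
    """FCOS center-ness target for a pixel with distances l, t, r, b to the
    left, top, right, and bottom sides of its ground-truth box:
    sqrt( min(l,r)/max(l,r) * min(t,b)/max(t,b) ).
    Equals 1 at the box center and decays toward 0 near the border."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

Multiplying classification scores by this value is what down-weights boxes predicted from pixels far from object centers before NMS.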
The network architecture of FCOS, where C3, C4, and C5 denote the feature maps of the backbone network and P3 to P7 are the feature levels used for the final prediction. H × W is the height and width of feature maps. ‘/s’ (s = 8, 16, …, 128) is the downsampling ratio of the feature maps at the level to the input image. As an example, all the numbers are computed with an 800 × 1024 input.
RepPoints is a set of points that learns to adaptively position themselves over an object in a manner that circumscribes the object’s spatial extent and indicates semantically significant local areas. The training of RepPoints is driven jointly by object localization and recognition targets, such that the RepPoints are tightly bound by the ground-truth bounding box and guide the detector toward correct object classification. This adaptive and differentiable representation can be coherently used across the different stages of a modern object detector, and does not require the use of anchors to sample over a space of bounding boxes.
Overview of the proposed RPDet (RepPoints detector). While feature pyramid networks (FPN) are adopted as the backbone, we only draw the afterwards pipeline of one scale of FPN feature maps for clear illustration. Note all scales of FPN feature maps share the same afterwards network architecture and the same model weights.
A generative model learns a probability distribution from data with prior knowledge, producing new images from the learned distribution.
There are two main choices for learned representation: factorized model and latent variable model.
Factorized model writes probability distribution as a product of simpler terms, via chain rule.
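For concreteness, with variables $x_1, \dots, x_n$ in a fixed order (e.g. pixels in raster order), the chain-rule factorization reads:

\(p(\mathbf{x})=\prod_{i=1}^{n} p\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)\)

Each factor is a simpler conditional distribution, so the model only ever has to predict one variable given the ones before it.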
Latent variable model defines a latent space to extract the core information from data, which is much smaller than the original one.
Max Likelihood Estimation
Adversarial Training
Comparison of GAN, VAE and Flow-based Models
Highlights
The key idea: approximate the posterior $p_θ(z|x)$ with a simpler, tractable distribution $q_ϕ(z|x)$.
The graphical model involved in Variational Autoencoder. Solid lines denote the generative distribution $p_θ(.)$ and dashed lines denote the distribution $q_ϕ(z | x)$ to approximate the intractable posterior $p_θ(z | x)$.
Loss Function: ELBO Using KL Divergence: \(D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z} | \mathbf{x}) \| p_{\theta}(\mathbf{z} | \mathbf{x})\right)=\log p_{\theta}(\mathbf{x})+D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z} | \mathbf{x}) \| p_{\theta}(\mathbf{z})\right)-\mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z} | \mathbf{x})} \log p_{\theta}(\mathbf{x} | \mathbf{z})\)
The ELBO is defined as: \(\begin{aligned} L_{\mathrm{VAE}}(\theta, \phi) &=-\log p_{\theta}(\mathbf{x})+D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z} | \mathbf{x}) \| p_{\theta}(\mathbf{z} | \mathbf{x})\right) \\ &=-\mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z} | \mathbf{x})} \log p_{\theta}(\mathbf{x} | \mathbf{z})+D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z} | \mathbf{x}) \| p_{\theta}(\mathbf{z})\right) \\ \theta^{*}, \phi^{*} &=\arg \min _{\theta, \phi} L_{\mathrm{VAE}} \end{aligned}\)
By minimizing the loss we are maximizing the lower bound of the probability of generating real data samples.
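For the common case where $q_ϕ(z|x)$ is a diagonal Gaussian and the prior $p_θ(z)$ is a standard normal, the KL term of the loss above has a closed form. A minimal NumPy sketch (the function name is my own, not from any library):

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Closed form: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# The KL term vanishes exactly when q equals the prior:
print(kl_diag_gaussian(np.zeros(4), np.zeros(4)))  # 0.0
```

This is the regularizer that keeps the encoder's posterior close to the prior; the reconstruction term is estimated by sampling.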
The Reparameterization Trick
The expectation term in the loss function involves generating samples from $z∼q_ϕ(z|x)$. Sampling is a stochastic process and therefore we cannot backpropagate the gradient through it. To make it trainable, the reparameterization trick is introduced: it is often possible to express the random variable $z$ as a deterministic variable $\mathbf{z}=\mathcal{T}_{\phi}(\mathbf{x}, \boldsymbol{\epsilon})$, where $ϵ$ is an auxiliary independent random variable, and the transformation function $\mathcal{T}_{\phi}$, parameterized by $ϕ$, converts $ϵ$ to $z$.
For example, a common choice of the form of $q_ϕ(z|x)$ is a multivariate Gaussian with a diagonal covariance structure: \(\begin{array}{l}{\mathbf{z} \sim q_{\phi}\left(\mathbf{z} | \mathbf{x}^{(i)}\right)=\mathcal{N}\left(\mathbf{z} ; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)} \boldsymbol{I}\right)} \\ {\mathbf{z}=\boldsymbol{\mu}+\boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \text { where } \boldsymbol{\epsilon} \sim \mathcal{N}(0, \boldsymbol{I})}\end{array}\) where $⊙$ refers to the element-wise product.
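A sketch of the trick for this diagonal-Gaussian case (names are illustrative):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, I).

    All randomness is pushed into eps, so z becomes a deterministic,
    differentiable function of mu and log_var.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(3), np.zeros(3), rng)  # one sample from N(0, I)
```

In an autodiff framework the same expression lets gradients flow to the encoder outputs `mu` and `log_var` while `eps` is treated as a constant.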
Highlights
Vector Quantisation (VQ) Vector quantisation is a method to map $K$-dimensional vectors into a finite set of “code” vectors. The encoder output $E(\mathbf{x})=\mathbf{z}_{e}$ goes through a nearest-neighbor lookup to match one of $K$ embedding vectors, and this matched code vector becomes the input for the decoder $D(.)$:
\[z_{q}(x)=e_{k}, \quad \text { where } \quad k=\operatorname{argmin}_{j}\left\|z_{e}(x)-e_{j}\right\|_{2}\]

The dictionary items are updated using Exponential Moving Averages (EMA), which is similar to EM methods like K-Means.
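The nearest-neighbor lookup can be sketched in a few lines of NumPy (shapes and names are my own):

```python
import numpy as np

def vq_lookup(z_e, codebook):
    """Quantize encoder outputs to their nearest codebook entries.

    z_e: (N, d) encoder vectors; codebook: (K, d) embeddings e_j.
    Returns z_q (the matched code vectors) and the chosen indices k.
    """
    # squared L2 distance from every z_e(x) to every e_j
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    k = dists.argmin(axis=1)
    return codebook[k], k
```

In VQ-VAE the same lookup runs at every spatial position of the encoder's feature map; the EMA update then moves each code vector toward the mean of the encoder outputs assigned to it.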
Loss Design
where sg[.] is the stop-gradient operator.
By training PixelCNN and WaveNet on the learned latent space for images and audio respectively, the VQ-VAE model avoids the “posterior collapse” problem which VAE suffers from.
Highlights
Stage 1: Training hierarchical VQ-VAE The design of hierarchical latent variables intends to separate local patterns (i.e., texture) from global information (i.e., object shapes). The training of the larger bottom level codebook is conditioned on the smaller top level code too, so that it does not have to learn everything from scratch.
Stage 2: Learning a prior over the latent discrete codebook The decoder can receive input vectors sampled from a similar distribution as the one in training. A powerful autoregressive model enhanced with multi-headed self-attention layers is used to capture the correlations in spatial locations that are far apart in the image with a larger receptive field.
There are two types of flow: normalizing flow and autoregressive flow.
A normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions. Flowing through a chain of transformations, we repeatedly substitute the variable for the new one according to the change of variables theorem and eventually obtain a probability distribution of the final target variable. Illustration of a normalizing flow model, transforming a simple distribution $p_0(z_0)$ to a complex one $p_K(z_K)$ step by step.
NICE defines additive coupling layer: \(\left\{\begin{array}{l}{\mathbf{y}_{1 : d}=\mathbf{x}_{1 : d}} \\ {\mathbf{y}_{d+1 : D}=\mathbf{x}_{d+1 : D}+m\left(\mathbf{x}_{1 : d}\right)}\end{array} \Leftrightarrow\left\{\begin{array}{l}{\mathbf{x}_{1 : d}=\mathbf{y}_{1 : d}} \\ {\mathbf{x}_{d+1 : D}=\mathbf{y}_{d+1 : D}-m\left(\mathbf{y}_{1 : d}\right)}\end{array}\right.\right.\)
Real NVP implements a normalizing flow by stacking a sequence of invertible bijective transformation functions. In each bijection $f:x↦y$, known as affine coupling layer, the input dimensions are split into two parts:
where $s(.)$ and $t(.)$ are scale and translation functions and both map $\mathbb{R}^{d} \mapsto \mathbb{R}^{D-d}$. The $⊙$ operation is the element-wise product.
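A minimal sketch of one affine coupling layer and its exact inverse; the toy callables `s` and `t` below stand in for the learned scale and translation networks:

```python
import numpy as np

def coupling_forward(x, d, s, t):
    """y_{1:d} = x_{1:d};  y_{d+1:D} = x_{d+1:D} * exp(s(x_{1:d})) + t(x_{1:d})."""
    y = x.copy()
    y[d:] = x[d:] * np.exp(s(x[:d])) + t(x[:d])
    return y

def coupling_inverse(y, d, s, t):
    """Inversion never needs to invert s or t themselves."""
    x = y.copy()
    x[d:] = (y[d:] - t(y[:d])) * np.exp(-s(y[:d]))
    return x

# toy scale/translation functions mapping R^d -> R^{D-d}
s = lambda u: 0.5 * np.array([u.sum(), u[0] - u[1]])
t = lambda u: np.array([u.prod(), u.sum()])

x = np.array([0.3, -1.2, 0.7, 2.0])
y = coupling_forward(x, 2, s, t)
assert np.allclose(coupling_inverse(y, 2, s, t), x)
```

Because the first $d$ dimensions pass through unchanged, the Jacobian is triangular and its log-determinant is just the sum of $s(x_{1:d})$.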
Masked Autoregressive Flow is a type of normalizing flows, where the transformation layer is built as an autoregressive neural network. MAF is very similar to Inverse Autoregressive Flow (IAF) introduced later. See more discussion on the relationship between MAF and IAF in the next section.
Given two random variables, $z∼π(z)$ and $x∼p(x)$ and the probability density function $π(z)$ is known, MAF aims to learn $p(x)$. MAF generates each $x_i$ conditioned on the past dimensions $x_{1:i−1}$.
Precisely the conditional probability is an affine transformation of $z$, where the scale and shift terms are functions of the observed part of $x$.
Data generation, producing a new $x$: \(x_{i} \sim p\left(x_{i} | \mathbf{x}_{1 : i-1}\right)=z_{i} \odot \sigma_{i}\left(\mathbf{x}_{1 : i-1}\right)+\mu_{i}\left(\mathbf{x}_{1 : i-1}\right), \text { where } \mathbf{z} \sim \pi(\mathbf{z})\) Density estimation, given a known $x$: \(p(\mathbf{x})=\prod_{i=1}^{D} p\left(x_{i} | \mathbf{x}_{1 : i-1}\right)\)
The generation procedure is sequential, so it is slow by design, while density estimation needs only one pass through the network using an architecture like MADE. The transformation function is trivial to invert and the Jacobian determinant is easy to compute.
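The asymmetry can be seen in a toy sketch, with the μ and α conditioners replaced by simple callables (these stand-ins are mine, not MADE):

```python
import numpy as np

def maf_sample(z, mu, alpha):
    """Sequential by design: x_i needs the already-generated x_{1:i-1}."""
    x = np.zeros_like(z)
    for i in range(len(z)):
        x[i] = z[i] * np.exp(alpha(x[:i])) + mu(x[:i])
    return x

def maf_invert(x, mu, alpha):
    """Density-estimation direction: every z_i depends only on the observed x,
    so all dimensions can be computed in one parallel pass."""
    return np.array([(x[i] - mu(x[:i])) * np.exp(-alpha(x[:i]))
                     for i in range(len(x))])

mu = lambda h: h.sum()          # toy conditioners standing in for MADE
alpha = lambda h: 0.1 * h.sum()

z = np.array([0.5, -0.3, 1.2])
assert np.allclose(maf_invert(maf_sample(z, mu, alpha), mu, alpha), z)
```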
The gray unit $x_i$ is the unit we are trying to compute, and the blue units are the values it depends on. $\alpha_i$ and $\mu_i$ are scalars computed by passing $x_{1:i−1}$ through neural networks (magenta, orange circles). Even though the transformation is a mere scale-and-shift, the scale and shift can have complex dependencies on previous variables. For the first unit $x_1$, $μ_1$ and $α_1$ are usually set to learnable scalar variables that don’t depend on any $x$ or $u$.
The inverse pass:
Similar to MAF, Inverse Autoregressive Flow (IAF; Kingma et al., 2016) models the conditional probability of the target variable as an autoregressive model too, but with a reversed flow, thus achieving a much more efficient sampling process.
First, let’s reverse the affine transformation in MAF: \(z_{i}=\frac{x_{i}-\mu_{i}\left(\mathbf{x}_{1 : i-1}\right)}{\sigma_{i}\left(\mathbf{x}_{1 : i-1}\right)}=-\frac{\mu_{i}\left(\mathbf{x}_{1 : i-1}\right)}{\sigma_{i}\left(\mathbf{x}_{1 : i-1}\right)}+x_{i} \odot \frac{1}{\sigma_{i}\left(\mathbf{x}_{1 : i-1}\right)}\) if let: \(\begin{array}{l}{\mathbf{x}=\mathbf{z}, p( .)=\pi( .), \mathbf{x} \sim p(\mathbf{x})} \\ {\mathbf{z}=\mathbf{x}, \pi( .)=p( .), \mathbf{z} \sim \pi(\mathbf{z})}\end{array}\\ \begin{aligned} \mu_{i}\left(\mathbf{z}_{1 : i-1}\right) &=\mu_{i}\left(\mathbf{x}_{1 : i-1}\right)=-\frac{\mu_{i}\left(\mathbf{x}_{1 : i-1}\right)}{\sigma_{i}\left(\mathbf{x}_{1 : i-1}\right)} \\ \sigma_{i}\left(\mathbf{z}_{1 : i-1}\right) &=\sigma_{i}\left(\mathbf{x}_{1 : i-1}\right)=\frac{1}{\sigma_{i}\left(\mathbf{x}_{1 : i-1}\right)} \end{aligned}\) Then we have:
IAF intends to estimate the probability density function of $x̃$ given that $π̃ (z̃ )$ is already known. The inverse flow is an autoregressive affine transformation too, same as in MAF, but the scale and shift terms are autoregressive functions of observed variables from the known distribution $π̃ (z̃)$.
Like other normalizing flows, drawing samples from an approximate posterior with Inverse Autoregressive Flow (IAF) consists of an initial sample $z$ drawn from a simple distribution, such as a Gaussian with diagonal covariance, followed by a chain of nonlinear invertible transformations of $z$, each with a simple Jacobian determinant.
Computations of the individual elements $\tilde{x}_i$ do not depend on each other, so they are easily parallelizable (only one pass using MADE). The density estimation for a known $\tilde{x}$ is not efficient, because we have to recover the value of $\tilde{z}_i$ in sequential order, $\tilde{z}_i=(\tilde{x}_i−\tilde{\mu}_i(\tilde{z}_{1:i−1}))/\tilde{\sigma}_i(\tilde{z}_{1:i−1})$, thus $D$ times in total.
The proposed flow
The authors propose a generative flow where each step (left) consists of an actnorm step, followed by an invertible 1 × 1 convolution, followed by an affine transformation (Dinh et al., 2014). This flow is combined with a multi-scale architecture (right).
There are three steps in one stage of flow in Glow.
Step 1:Activation normalization (short for “actnorm”)
It performs an affine transformation using a scale and bias parameter per channel, similar to batch normalization, but works for mini-batch size 1. The parameters are trainable but initialized so that the first minibatch of data have mean 0 and standard deviation 1 after actnorm.
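A sketch of the data-dependent initialization, assuming a per-channel `(N, C)` batch layout (names are mine):

```python
import numpy as np

def actnorm_init(first_batch):
    """Choose per-channel scale/bias so the first minibatch comes out
    zero-mean, unit-std after the affine transform."""
    mu = first_batch.mean(axis=0)
    sigma = first_batch.std(axis=0)
    scale = 1.0 / (sigma + 1e-6)
    bias = -mu * scale
    return scale, bias          # trainable parameters afterwards

def actnorm(x, scale, bias):
    return x * scale + bias
```

After this one-time initialization the parameters are updated by gradient descent like any other weights, which is why it works even with mini-batch size 1.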
Step 2: Invertible 1x1 conv
Between layers of the RealNVP flow, the ordering of channels is reversed so that all the data dimensions have a chance to be altered. A 1×1 convolution with equal number of input and output channels is a generalization of any permutation of the channel ordering.
Say we have an invertible 1x1 convolution of an input $h×w×c$ tensor $\mathbf{h}$ with a weight matrix $W$ of size $c×c$. The output is an $h×w×c$ tensor, labeled as $f=\mathtt{conv2d}(\mathbf{h};W)$. In order to apply the change of variable rule, we need to compute the Jacobian determinant $|\det(\partial f/\partial \mathbf{h})|$.
Both the input and output of the 1x1 convolution here can be viewed as a matrix of size $h×w$. Each entry $\mathbf{x}_{ij}$ ($i=1,2,\dots,h$, $j=1,2,\dots,w$) in $\mathbf{h}$ is a vector of $c$ channels, and each entry is multiplied by the weight matrix $W$ to obtain the corresponding entry $\mathbf{y}_{ij}$ in the output matrix. The derivative of each entry is $\partial \mathbf{x}_{ij} \mathbf{W} / \partial \mathbf{x}_{ij}=\mathbf{W}$ and there are $h×w$ such entries in total:
The inverse 1x1 convolution depends on the inverse matrix $W^{−1}$ . Since the weight matrix is relatively small, the amount of computation for the matrix determinant (tf.linalg.det) and inversion (tf.linalg.inv) is still under control.
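A sketch of the layer and its log-determinant, assuming a channel-last layout (function names are mine):

```python
import numpy as np

def conv1x1(x, W):
    """Apply the c x c matrix W to every pixel's channel vector.

    x: (h, w, c). Equivalent to a 1x1 convolution with weight W.
    """
    h, w, c = x.shape
    return (x.reshape(-1, c) @ W).reshape(h, w, c)

def conv1x1_logdet(h, w, W):
    """log|det dF/dx| = h * w * log|det W|: the per-pixel Jacobian is W."""
    _, logabsdet = np.linalg.slogdet(W)
    return h * w * logabsdet

# invertibility: undo the layer with W^{-1}
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))
W = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # keep W well-conditioned
assert np.allclose(conv1x1(conv1x1(x, W), np.linalg.inv(W)), x)
```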
Step 3: Affine coupling layer
The design is same as in RealNVP.
The three main components of proposed flow, their reverses, and their log-determinants. Here, $x$ signifies the input of the layer, and $y$ signifies its output. Both $x$ and $y$ are tensors of shape $[h × w × c]$ with spatial dimensions (h, w) and channel dimension $c$. With $(i, j)$ we denote spatial indices into tensors $x$ and $y$. The function NN() is a nonlinear mapping, such as a (shallow) convolutional neural network like in ResNets (He et al., 2016) and RealNVP (Dinh et al., 2016).
Highlights
Design To estimate the joint distribution $p(x)$ we write it as the product of the conditional distributions over the pixels:
\[p(\mathbf{x})=\prod_{i=1}^{n^{2}} p\left(x_{i} | x_{1}, \ldots, x_{i-1}\right)\]

Generating pixel-by-pixel with CNN, LSTM:
Highlights
Design Typically, to make sure the CNN can only use information about pixels above and to the left of the current pixel, the filters of the convolution in PixelCNN are masked. However, the computational cost rises rapidly when such masked convolutions are stacked.
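The masks themselves are simple to construct. A sketch (type "A", used only in the first layer, hides the current pixel; type "B" may see it):

```python
import numpy as np

def pixelcnn_mask(k, mask_type="A"):
    """k x k binary mask over the conv kernel: keep all rows above the
    center, plus the entries left of the center in the center row."""
    m = np.zeros((k, k))
    m[:k // 2, :] = 1
    m[k // 2, :k // 2] = 1
    if mask_type == "B":
        m[k // 2, k // 2] = 1   # type B also sees the current pixel
    return m

print(pixelcnn_mask(3, "A"))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

Multiplying the kernel weights by this mask before every convolution enforces the raster-scan ordering of the factorization.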
The gated activation unit: \(\mathbf{y}=\tanh \left(W_{k, f} * \mathbf{x}\right) \odot \sigma\left(W_{k, g} * \mathbf{x}\right),\) where $σ$ is the sigmoid non-linearity, $k$ is the number of the layer, $⊙$ is the element-wise product and $∗$ is the convolution operator.
Add a high-level image description represented as a latent vector $h$: \(\mathbf{y}=\tanh \left(W_{k, f} * \mathbf{x}+V_{k, f}^{T} \mathbf{h}\right) \odot \sigma\left(W_{k, g} * \mathbf{x}+V_{k, g}^{T} \mathbf{h}\right)\)
WaveNet consists of a stack of causal convolutions, a convolution operation designed to respect the ordering: the prediction at a certain timestep can only consume data observed in the past, with no dependency on the future. In PixelCNN, the causal convolution is implemented by a masked convolution kernel. In WaveNet, it is implemented simply by shifting the output by a number of timesteps so that the output is aligned with the last input element.
One big drawback of the convolution layer is the very limited size of its receptive field. The output can hardly depend on the input hundreds or thousands of timesteps ago, which can be a crucial requirement for modeling long sequences. WaveNet therefore adopts dilated convolution (animation), where the kernel is applied to an evenly-distributed subset of samples in a much larger receptive field of the input.
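The growth is easy to quantify: with kernel size 2, stacking dilations $1, 2, 4, \dots$ doubles the receptive field per layer. A sketch (formula assumes stride-1 causal convolutions):

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field of stacked dilated causal convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# WaveNet-style doubling: ten layers cover 1024 timesteps
print(receptive_field([2 ** i for i in range(10)]))  # 1024
```

With undilated layers the same ten convolutions would cover only 11 timesteps, which is why dilation is essential for audio.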
Highlights
Design By choosing a simple continuous distribution for modeling $ν$ we obtain a smooth and memory-efficient predictive distribution for $x$. Here, we take this continuous univariate distribution to be a mixture of logistic distributions, which allows us to easily calculate the probability of the observed discretized value $x$. For all sub-pixel values $x$ except the edge cases 0 and 255 we have: \(\nu \sim \sum_{i=1}^{K} \pi_{i} \operatorname{logistic}\left(\mu_{i}, s_{i}\right)\)
\[P(x | \pi, \mu, s)=\sum_{i=1}^{K} \pi_{i}\left[\sigma\left(\left(x+0.5-\mu_{i}\right) / s_{i}\right)-\sigma\left(\left(x-0.5-\mu_{i}\right) / s_{i}\right)\right]\]

The output of our network is thus of much lower dimension, yielding much denser gradients of the loss with respect to our parameters.
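A sketch of evaluating the discretized mixture probability for an interior sub-pixel value (names are mine):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def discretized_logistic_prob(x, pi, mu, s):
    """Mass the K-component logistic mixture puts on the bin
    [x - 0.5, x + 0.5]; edge cases 0 and 255 are handled separately."""
    pi, mu, s = map(np.asarray, (pi, mu, s))
    return float(np.sum(pi * (sigmoid((x + 0.5 - mu) / s)
                              - sigmoid((x - 0.5 - mu) / s))))

# a single component centered at 100: the mass peaks at x = 100
probs = [discretized_logistic_prob(x, [1.0], [100.0], [2.0])
         for x in (90, 100, 110)]
```

Each pixel therefore needs only $3K$ parameters ($\pi_i, \mu_i, s_i$) instead of a 256-way softmax.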
General structure of a Generative Adversarial Network, where the generator G takes a noise vector z as input and output a synthetic sample G(z), and the discriminator takes both the synthetic input G(z) and true sample x as inputs and predict whether they are real or fake.
Generative Adversarial Net (GAN) consists of two separate neural networks: a generator G that takes a random noise vector z, and outputs synthetic data G(z); a discriminator D that takes an input x or G(z) and output a probability D(x) or D(G(z)) to indicate whether it is synthetic or from the true data distribution, as shown in Figure 1. Both of the generator and discriminator can be arbitrary neural networks.
In other words, D and G play the following two-player minimax game with value function $V (G, D)$: \(\min _{G} \max _{D} V(D, G)=\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))]\)
The main loop of GAN training. Novel data samples, $x′$, may be drawn by passing random samples, $z$ through the generator network. The gradient of the discriminator may be updated $k$ times before updating the generator.
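The alternation can be sketched abstractly; here `d_step` and `g_step` are placeholders for one gradient update of the discriminator and generator:

```python
def gan_training_loop(num_iters, k, d_step, g_step):
    """Vanilla GAN schedule: k discriminator updates per generator update.

    d_step / g_step stand in for one gradient ascent step on V(D, G)
    w.r.t. D and one descent step w.r.t. G, respectively.
    """
    for _ in range(num_iters):
        for _ in range(k):
            d_step()
        g_step()
```

In the original paper $k$ is a hyper-parameter (often 1); the sketch only fixes the update schedule, not the losses.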
GANs provide an implicit way to model the data distribution, which is much more versatile than explicit models like PixelCNN.
In the original GAN, we have no control over what is generated, since the output depends only on random noise. However, we can add a conditional input $c$ to the random noise $z$ so that the generated image is defined by $G(c,z)$. Typically, the conditional input vector $c$ is concatenated with the noise vector $z$, and the resulting vector is fed into the generator as in the original GAN. Besides, we can perform other data augmentation on $c$ and $z$. The meaning of the conditional input $c$ is arbitrary; for example, it can be the class of the image, attributes of an object, or an embedding of a text description of the image we want to generate.
The objective function of the two-player minimax game becomes: \(\min _{G} \max _{D} V(D, G)=\mathbb{E}_{\boldsymbol{x} \sim p_{\mathrm{data}}(\boldsymbol{x})}[\log D(\boldsymbol{x} | \boldsymbol{y})]+\mathbb{E}_{\boldsymbol{z} \sim p_{z}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z} | \boldsymbol{y})))]\)
Architecture of GAN with auxiliary classifier, where $y$ is the conditional input label and $C$ is the classifier that takes the synthetic image $G(y, z)$ as input and predict its label $\hat{y}$.
Building blocks of DCGAN, where the generator uses transposed convolution, batch-normalization and ReLU activation, while the discriminator uses convolution, batch-normalization and LeakyReLU activation.
DCGAN provides significant contributions to GAN in that its suggested convolutional neural network (CNN) architecture greatly stabilizes GAN training. DCGAN suggests an architecture guideline in which the generator is modeled with a transposed CNN, and the discriminator is modeled with a CNN whose output dimension is 1. It also proposes other techniques, such as batch normalization and suitable activation functions for the generator and the discriminator, to help stabilize GAN training. As it addresses the instability of GAN training purely through architecture, it has become a baseline for many GANs proposed later.
Improved GAN proposed several useful tricks to stabilize the training of GANs.
Feature matching This technique substitutes the discriminator’s output in the objective function with the activations of an intermediate layer of the discriminator, to prevent overfitting to the current discriminator. Feature matching does not aim at the discriminator’s final output; rather, it guides the generator to match the statistics or features of real training data, in an effort to stabilize training.
Label smoothing As mentioned previously, $V (G, D)$ is a binary cross entropy loss whose real data label is 1 and whose generated data label is 0. However, since a deep neural network classifier tends to output a class probability with extremely high confidence, label smoothing encourages it to produce a softer estimation by assigning label values lower than 1. Importantly, for GAN, label smoothing has to be applied to the labels of real data only, not to the labels of fake data, since otherwise the discriminator can behave incorrectly.
Minibatch Discrimination
With minibatch discrimination, the discriminator is able to digest the relationship between training data points in one batch, instead of processing each point independently.
In one minibatch, we approximate the closeness between every pair of samples, $c(x_i,x_j)$, and get the overall summary of one data point by summing up how close it is to the other samples in the same batch, $o\left(x_{i}\right)=\sum_{j} c\left(x_{i}, x_{j}\right)$. Then $o(x_i)$ is explicitly added to the input of the model.
Historical Averaging
For both models, add $\left\|\Theta-\frac{1}{t} \sum_{i=1}^{t} \Theta_{i}\right\|^{2}$ into the loss function, where $Θ$ is the model parameter and $Θ_i$ is how the parameter was configured at past training time $i$. This additional piece penalizes the training when $Θ$ is changing too dramatically over time.
Virtual Batch Normalization (VBN)
Each data sample is normalized based on a fixed batch (“reference batch”) of data rather than within its minibatch. The reference batch is chosen once at the beginning and stays the same through the training.
Adding Noises
Based on the discussion in the previous section, we now know $p_r$ and $p_g$ are disjoint in a high dimensional space and it causes the problem of vanishing gradient. To artificially “spread out” the distribution and to create higher chances for two probability distributions to have overlaps, one solution is to add continuous noises onto the inputs of the discriminator $D$.
Use Better Metric of Distribution Similarity
The loss function of the vanilla GAN measures the JS divergence between the distributions of $p_r$ and $p_g$. This metric fails to provide a meaningful value when two distributions are disjoint.
The theoretical and practical issues of GAN
The Kullback-Leibler (KL) divergence \(K L\left(\mathbb{P}_{r} \| \mathbb{P}_{g}\right)=\int \log \left(\frac{P_{r}(x)}{P_{g}(x)}\right) P_{r}(x) d \mu(x)\) where both $P_r$ and $P_g$ are assumed to be absolutely continuous, and therefore admit densities, with respect to a same measure $μ$ defined on $\mathcal{X}$. The KL divergence is famously asymmetric and possibly infinite when there are points such that $P_g(x) = 0$ and $P_r(x) > 0$.
The Jensen-Shannon (JS) divergence \(J S\left(\mathbb{P}_{r}, \mathbb{P}_{g}\right)=K L\left(\mathbb{P}_{r} \| \mathbb{P}_{m}\right)+K L\left(\mathbb{P}_{g} \| \mathbb{P}_{m}\right)\) where $P_m$ is the mixture $(P_r + P_g)/2$. This divergence is symmetrical and always defined because we can choose $μ = P_m$.
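For discrete distributions these divergences are one-liners; the sketch below follows the (unhalved) JS convention written above:

```python
import numpy as np

def kl(p, q):
    """KL(p || q); terms with p(x) = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """KL(p || m) + KL(q || m) with m = (p + q) / 2 -- always defined."""
    m = 0.5 * (p + q)
    return kl(p, m) + kl(q, m)

p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # disjoint supports
print(js(p, q))   # 2 log 2, defined even though KL(p || q) blows up
```

This is exactly the pathology discussed below: for disjoint supports JS saturates at a constant, so its gradient carries no information about how far apart the distributions are.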
The Earth-Mover (EM) distance or Wasserstein-1 \(W\left(\mathbb{P}_{r}, \mathbb{P}_{g}\right)=\inf _{\gamma \in \Pi\left(\mathbb{P}_{r}, \mathbb{P}_{g}\right)} \mathbb{E}_{(x, y) \sim \gamma}[\|x-y\|]\) where $\Pi\left(\mathbb{P}_{r}, \mathbb{P}_{g}\right)$ denotes the set of all joint distributions $γ(x,y)$ whose marginals are respectively $\mathbb{P}_r$ and $\mathbb{P}_g$. Intuitively, $γ(x,y)$ indicates how much “mass” must be transported from $x$ to $y$ in order to transform the distribution $\mathbb{P}_r$ into the distribution $\mathbb{P}_g$. The EM distance then is the “cost” of the optimal transport plan.
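In one dimension, with two equal-size empirical samples, the optimal plan simply matches sorted points to sorted points, which gives a quick way to build intuition (this is a special case, not the general computation):

```python
import numpy as np

def w1_empirical_1d(xs, ys):
    """Earth-Mover distance between two equal-size 1-D samples:
    the optimal transport plan pairs sorted points with sorted points."""
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

# shifting a distribution by delta costs exactly delta
print(w1_empirical_1d(np.array([0.0, 1.0, 2.0]),
                      np.array([1.0, 2.0, 3.0])))  # 1.0
```

Unlike JS, this distance keeps growing smoothly with the shift, which is why its gradients remain informative even for disjoint supports.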
Compared to the original GAN algorithm, the WGAN undertakes the following changes:
(left) Gradient norms of deep WGAN critics during training on the Swiss Roll dataset either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).
Gradient penalty
The authors implicitly define $P_{\hat{x}}$ by sampling uniformly along straight lines between pairs of points sampled from the data distribution $P_r$ and the generator distribution $P_g$. This is motivated by the fact that the optimal critic contains straight lines with gradient norm 1 connecting coupled points from $P_r$ and $P_g$. Given that enforcing the unit gradient norm constraint everywhere is intractable, enforcing it only along these straight lines seems sufficient and experimentally results in good performance.
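A sketch of the sampling step; the penalty itself, $\lambda(\|\nabla_{\hat{x}} D(\hat{x})\|_2-1)^2$, would then be added to the critic loss via autodiff (function name is mine):

```python
import numpy as np

def sample_x_hat(real, fake, rng):
    """Draw x_hat uniformly on the straight line between paired
    real and fake samples: x_hat = eps * x_r + (1 - eps) * x_g."""
    eps = rng.uniform(size=(real.shape[0], 1))  # one eps per pair
    return eps * real + (1.0 - eps) * fake

rng = np.random.default_rng(0)
real = np.zeros((8, 2))
fake = np.ones((8, 2))
x_hat = sample_x_hat(real, fake, rng)           # every entry lies in [0, 1]
```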
Generating high resolution images is highly challenging since a large scale generated image is easily distinguished by the discriminator, so the generator often fails to be trained. Moreover, there is a memory issue in that we are forced to set a low mini-batch size due to the large size of neural networks. Therefore, some studies adopt hierarchical stacks of multiple generators and discriminators. This strategy divides a large complex generator’s mapping space step by step for each GAN pair, making it easier to learn to generate high resolution images. However, Progressive GAN succeeds in generating high resolution images in a single GAN, making training faster and more stable.
Progressive GAN generates high resolution images by stacking each layer of the generator and the discriminator incrementally. It starts training to generate a very low spatial resolution (e.g. 4×4), and progressively doubles the resolution of generated images by adding layers to the generator and the discriminator incrementally. In addition, it proposes various training techniques such as pixel normalization, equalized learning rate and mini-batch standard deviation, all of which help GAN training to become more stable.
The training starts with both the generator (G) and discriminator (D) having a low spatial resolution of 4×4 pixels. As the training advances, layers are incrementally added to G and D, thus increasing the spatial resolution of the generated images. All existing layers remain trainable throughout the process. Here $N × N$ refers to convolutional layers operating on $N × N$ spatial resolution. This allows stable synthesis in high resolutions and also speeds up training considerably. On the right are six example images generated using progressive growing at 1024 × 1024.
For GAN models trained with ImageNet, they are good at classes with a lot of texture (landscape, sky) but perform much worse on structure. For example, a GAN may render the fur of a dog nicely but fail badly on the dog’s legs. While convolutional filters are good at exploring spatial locality information, their receptive fields may not be large enough to cover larger structures. We can increase the filter size or the depth of the network, but this makes GANs even harder to train. Alternatively, we can apply the attention concept. For example, to refine the image quality of the eye region (the red dot in the left figure), SAGAN only uses the feature map region in the highlighted area of the middle figure. As shown, this region has a larger receptive field and the context is more focused and more relevant. The right figure shows another example on the mouth area (the green dot).
Code: PyTorch, TensorFlow
The StyleGAN architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis.
While a traditional generator feeds the latent code through the input layer only, we first map the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here “A” stands for a learned affine transform, and “B” applies learned per-channel scaling factors to the noise input. The mapping network f consists of 8 layers and the synthesis network g consists of 18 layers—two for each resolution ($4^2 − 1024^2$)
Code: PyTorch
Code: PyTorch
The authors demonstrate that GANs benefit dramatically from scaling, and train models with two to four times as many parameters and eight times the batch size compared to prior art. They introduce two simple, general architectural changes that improve scalability, and modify a regularization scheme to improve conditioning, demonstrably boosting performance.
As a side effect of these modifications, their models become amenable to the “truncation trick,” a simple sampling technique that allows explicit, fine-grained control of the trade-off between sample variety and fidelity.
They discover instabilities specific to large scale GANs, and characterize them empirically. Leveraging insights from this analysis, they demonstrate that a combination of novel and existing techniques can reduce these instabilities, but complete training stability can only be achieved at a dramatic cost to performance.
(a) A typical architectural layout for BigGAN’s G; details are in the following tables. (b) A Residual Block (ResBlock up) in BigGAN’s G. (c) A Residual Block (ResBlock down) in BigGAN’s D.
(a) A typical architectural layout for BigGAN-deep’s G; details are in the following tables. (b) A Residual Block (ResBlock up) in BigGAN-deep’s G. (c) A Residual Block (ResBlock down) in BigGAN-deep’s D. A ResBlock (without up or down) in BigGAN-deep does not include the Upsample or Average Pooling layers, and has identity skip connections.
Inspired by Karpathy’s impressive post A Recipe for Training Neural Networks, I decided to review and share my daily routine of engineering work on Machine Learning related research. Compared to Karpathy’s advice, I’d like to talk more about approaches and practices instead of strategies. This article is merely a personal review from a newbie, so any suggestions are welcome.
Everything is a file.
– one of defining features of Unix
I believe that it is extremely important to learn from failures, and recording and observing may be the first step toward that. So I’m obsessed with storing and organizing everything for future review.
I added a plus symbol to emphasize the difference between the original dataset and experiment-usable dataset. From my point of view, a dataset+ may include the following parts:
`tf.record`, `.lmdb`, etc.

Because of a) the complicated and unstable optimization process, b) fast-developing frameworks, and c) data dependency, the reproduction problem is one of the bad reputations of learning-based methods. For a proposed learning-based model, not only the algorithms described in the paper, but also its entire reproduction stack as well as the datasets should be taken note of.
Thanks to container technology and Python environment management tools, namely Docker, virtualenv, etc., we’ve got much more control over the environment where our code runs. I prepared two `Dockerfile`s, one for development and another for production (release). Upon docker images, a virtual env is created for each project. A lot of tools are useful for environment management in Python, including `conda-env`, `pyenv`, `poetry`, `pipenv`, or just a `requirements.txt` file.

Every time an experiment runs, a docker container is built and only the project’s dependencies are installed, both of which can be totally controlled through the `Dockerfile` and `requirements.txt` (or equivalents).

All code is monitored using `git`, with at least one commit per day if any files changed. I once suffered a lot from an unintentional bug which could have been found immediately if I had used version control in my early days. I encourage myself to only write code which can be released.
Configs mainly include the hyper-parameters of an experiment. Common items are: dataset root directory, output directory, checkpoints directory, learning rate, batch size, GPU ids, epochs, input shapes, loss weights, etc. Common implementations of a configure system:

- `argparse`: easy to implement, needs extra effort to manage scripts. e.g. SPADE
- `config` variable: requires utility functions to parse YAML files and update `config`. e.g. StackGAN-v2
- `dict`: easy to implement, but adds unnecessary code pieces. e.g. mmdetection
- `gin` package. e.g. compare_gan

I used to create a `checkpoint` folder under my home directory and organize checkpoints based on projects and datasets. On the other hand, a `model` folder is created for pretrained models and released models, which are usually shared across projects. Inside projects, different runs are identified by their starting time.
Logs are stored in the corresponding `checkpoint` folder during training, with the stack version, hyper-params, train loss per batch, and val loss per epoch recorded. I also store intermediate results in a separate folder for debugging. Some scripts for parsing logs are required.

Tools like `Tensorboard` and `Visdom` are useful for visualizing loss changes on the fly during training. Furthermore, they can record nearly everything. However, they usually need additional ports and eat memory. It’s recommended to mount the `log` folder and then run these visualization tools on another machine.

The `output` folder is where I store predicted images, numpy arrays, and text files for evaluation (validation set) or submission (test set). In this way, the evaluation process can be decoupled from the training process, which is rather convenient when comparing different methods (or models from different runs).
The playground refers to a super environment for trying new code and packages, or getting familiar with some functions. It can also be a single jupyter notebook inside the project folder.
Shell scripts bundle a series of jobs for an environment, e.g. a preprocessing-training-evaluation-visualization-test pipeline.
The debug folder stores anything that provides insights from the debugging process. It also serves as a transfer station between training nodes and my personal computer.
I often write notes as comments inside code and in `git` commit messages. They are reorganized into note-taking apps during the review process.
I hope that creativity is the only constraint of my research, not engineering. Thanks for your time and feel free to leave your advice below.
The reason R-CNN is time-consuming is that the CNN is performed separately on each proposal, with no shared computation.
The picture above shows the architecture of Fast R-CNN. The feature map is generated from the image by the feature extractor, and the RoIs (Regions of Interest) produced by the Selective Search algorithm are mapped onto the feature map. Then, an RoI Pooling operation is performed for each RoI to obtain feature vectors of the same length, which are used for classification and BBox regression.
This structure of Fast R-CNN is the prototype of the meta-structure used in the main 2-stage method of the detection task. The entire system consists of several components: Proposal, Feature Extractor, Object Recognition & Localization. The Proposal part is replaced by RPN (Faster R-CNN), the Feature Extractor part uses SOTA’s classified CNN network (ResNet, etc.), and the last part is often a parallel multitasking structure (Mask R-CNN, etc.).
Check full introduction at Fast R-CNN - Girshick - ICCV 2015 - Caffe Code
Faster R-CNN is the mainstream two-stage method. The proposed RPN replaces the Selective Search algorithm so that the detection task can be completed end-to-end by a neural network. Roughly speaking, Faster R-CNN = RPN + Fast R-CNN; because the RPN shares convolutional features with Fast R-CNN, the extra computation it introduces is very small, allowing Faster R-CNN to run at 5 fps on a single GPU while reaching SOTA accuracy.
The RPN models the proposal task as a binary classification problem.
The first step is to generate anchor boxes of different sizes and aspect ratios at each sliding-window position, then label each anchor as positive or negative according to its IoU with the ground truth. The samples fed into the RPN are therefore the anchor boxes together with whether each contains an object. The RPN maps each sample to one probability value and four coordinate values: the probability reflects how likely the anchor contains an object, and the four coordinates regress the object's position. Finally, the classification loss and the coordinate regression loss are combined as the training target of the RPN.
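The anchor generation and IoU-based labeling step can be sketched as follows; this is a simplified illustration (one window position, illustrative thresholds of 0.7/0.3 as in the paper), and the helper names are my own:

```python
import itertools
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes (x0, y0, x1, y1) centered at one sliding-window position."""
    boxes = []
    for s, r in itertools.product(scales, ratios):
        w, h = s * np.sqrt(r), s / np.sqrt(r)  # keep area ~ s^2, vary shape
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

anchors = make_anchors(300, 300)            # 3 scales x 3 ratios = 9 anchors
gt = (200, 200, 400, 400)                   # one ground-truth box
# 1 = positive, 0 = negative, -1 = ignored (between the two IoU thresholds)
labels = np.array([1 if iou(a, gt) > 0.7 else 0 if iou(a, gt) < 0.3 else -1
                   for a in anchors])
```

The RPN head then predicts, for each of these 9 anchors per position, one objectness score and four regression offsets.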
The RPN has many hyper-parameters: the sizes and aspect ratios of the anchor boxes, the IoU thresholds, and the ratio of positive to negative proposals sampled on each image.
Check full introduction at Faster R-CNN: Towards Real Time Object Detection with Region Proposal - Ren - NIPS 2015.
The article points out an unnatural design in earlier detection frameworks: a fully convolutional feature extractor followed by a fully connected classifier, even though the best-performing image classifiers are fully convolutional (ResNet, etc.). The root cause is the contradiction between the translation invariance desired for classification and the translation sensitivity required for detection: when a detection model reuses a classification feature extractor, position information is lost. This article proposes position-sensitive score maps to solve the problem.
The comparison with Faster R-CNN shows that R-FCN achieves better accuracy while maintaining shorter inference time.
Check full introduction at R-FCN: Object Detection via Region-based Fully Convolutional Networks - Dai - NIPS 2016
Starting from the picture, cascaded feature extraction proceeds as usual, and a top-down path is added: starting from the highest-level feature map, nearest-neighbor upsampling produces top-down feature maps of the same size as the lower-level ones. An element-wise lateral connection then merges the two to form the features at each depth.
The intuition behind this operation is that low-level feature maps contain more location information while high-level feature maps carry better semantic information for classification; combining the two aims to satisfy detection's dual demands of localization and classification.
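The top-down merge can be sketched in a few lines of NumPy (single channel, so the 1x1 lateral convolution reduces to identity); this is only an illustration of the dataflow, not FPN's actual implementation:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor upsampling by a factor of 2 in both dimensions."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_merge(coarse, fine):
    """Merge a coarser top-down map into a finer lateral map element-wise.
    (In real FPN a 1x1 conv aligns channels first; omitted for one channel.)"""
    return fine + upsample2x(coarse)

c5 = np.ones((2, 2))                  # coarse, semantically strong map
c4 = np.arange(16.0).reshape(4, 4)    # finer, location-rich map
p4 = fpn_merge(c5, c4)                # same resolution as c4, enriched by c5
```

Iterating `fpn_merge` down the pyramid (P5 → P4 → P3 → P2) yields feature maps that are both high-resolution and semantically strong, one per scale.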
The main experimental results of the article are as follows:
Comparing the different head parts, the change of input features does improve detection accuracy, and both the lateral and top-down operations are indispensable.
Check full introduction at Feature Pyramid Networks for Object Detection - Lin et al. - CVPR 2017.
The loss function is divided into three parts: coordinate error, objectness error, and class error. To balance the effects of class imbalance and of large versus small objects, the terms are weighted, and the square roots of width and height are regressed instead of the raw values.
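The square-root trick can be shown numerically: the same absolute width/height error costs more on a small box than on a large one. A minimal sketch of just the size term (the function name is mine; the full YOLO loss has more terms and weights):

```python
import numpy as np

def box_size_loss(w_pred, h_pred, w_true, h_true):
    """YOLO-style size term: regress sqrt(w), sqrt(h) so that a fixed
    absolute error is penalized more for small boxes than for large ones."""
    return ((np.sqrt(w_pred) - np.sqrt(w_true)) ** 2
            + (np.sqrt(h_pred) - np.sqrt(h_true)) ** 2)

# Same 10-pixel width error on a 20-px box vs. a 200-px box:
small = box_size_loss(30.0, 20.0, 20.0, 20.0)
large = box_size_loss(210.0, 200.0, 200.0, 200.0)
```

Without the square root, both cases would contribute the same squared error, and large objects would dominate the size term.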
Compared to Fast R-CNN, YOLO's background false detections account for a small proportion of its errors, while localization errors account for a large proportion (coordinates are regressed directly, without log encoding).
Check full introduction at You Only Look Once: Unified, Real Time Object Detection - Redmon et al. - CVPR 2016.
This is a significant improvement for YOLO.
With these improvements, YOLOv2 absorbs the strengths of much prior work, matching SSD's accuracy with faster inference.
Check full introduction at YOLO9000: Better, Faster, Stronger - Redmon et al. - 2016.
In single-stage models, a massive number of training samples enter the loss function at the same time because there is no candidate-region proposal stage. Based on the finding that the loss of a single-stage model is dominated by easy samples (usually backgrounds), Focal Loss introduces a factor that suppresses the loss from these easy samples, letting hard cases play a bigger role during training.
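A minimal NumPy version of the binary focal loss shows the suppression effect; the default alpha=0.25, gamma=2 follow the paper's reported settings:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted probability of the positive class; y: 0/1 label."""
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy, confidently-rejected background (p=0.01, y=0) contributes almost
# nothing, while a hard background (p=0.6, y=0) keeps a sizable loss.
easy = focal_loss(np.array([0.01]), np.array([0]))
hard = focal_loss(np.array([0.6]), np.array([0]))
```

With gamma=0 the modulating factor disappears and the formula reduces to ordinary (alpha-balanced) cross-entropy, which is why easy backgrounds dominate standard single-stage training.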
Utilizing the focal loss term, a dense detector called RetinaNet is designed on top of ResNet and FPN:
The authors' experiments show that a single-stage detector can achieve accuracy comparable to two-stage models thanks to the proposed loss function. However, Focal Loss brings in two additional hyper-parameters, which the authors tune by grid search; this is not very instructive, since it offers little guidance for applying the loss to other datasets or scenarios. In short, Focal Loss re-weights easy and hard training samples in the loss function from the perspective of sample imbalance.
The process of SNIP:
Select 3 image resolutions: (480, 800) to train proposals in [120, ∞), (800, 1200) for proposals in [40, 160], and (1400, 2000) for proposals in [0, 80].
For each resolution, back-propagation only returns gradients for proposals within the corresponding scale range.
This keeps a single network while constraining every training object to a consistent size range that matches ImageNet object sizes, solving the domain-shift problem and staying consistent with the backbone's pre-training; the training and test scales also match. In short: the ImageNet pre-training size, the object size, the network, and the receptive field all match each other, and the train and test scales are the same.
Compared with scale-specific detectors, SNIP trains one network on all objects, making full use of the data.
During testing, the same detector is run once on each of the three resolutions; only the detected boxes within the corresponding scale range are kept for each resolution, and the results are then merged with Soft-NMS.
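The per-resolution scale filter can be sketched as below; this is only an illustration of the valid-range idea (the function name and the exact box measure, sqrt of area, are my simplifications, and the range table reuses the values quoted above):

```python
import math

def snip_valid(boxes, lo, hi):
    """Keep only boxes whose sqrt(area) lies in [lo, hi). Outside this range
    the gradient is masked at train time and the box is dropped at test time."""
    return [b for b in boxes
            if lo <= math.sqrt((b[2] - b[0]) * (b[3] - b[1])) < hi]

# Illustrative per-resolution valid ranges, following the setup quoted above:
ranges = {(480, 800): (120, float("inf")),
          (800, 1200): (40, 160),
          (1400, 2000): (0, 80)}

dets = [(0, 0, 50, 50), (0, 0, 500, 500)]   # sqrt(area) = 50 and 500
kept = snip_valid(dets, *ranges[(800, 1200)])  # only the 50-px box survives
```

At test time each resolution contributes only its surviving boxes, and the union of the three filtered sets is what Soft-NMS finally merges.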
The authors conducted experiments on R-FCN and Faster R-CNN, and SNIP improves performance for both meta-architectures.
Check full introduction at An analysis of scale invariance in object detection - SNIP - Singh - CVPR 2018.
Following SNIP, the authors feed crops of an image that contain objects to be detected (called chips) into training instead of the entire image. This design also makes large-batch training possible, which accelerates training. The method still exploits the context around each object while skipping unnecessary computation on plain backgrounds (such as the sky), so the utilization of the training data improves.
The core design of SNIPER is the selection strategy for RoIs within a chip (a crop of the entire image). The authors use several hyper-parameters to keep only boxes of a proper size in a batch, hoping that the detector learns features independent of object scale.
Due to its memory-efficient design, SNIPER can benefit from Batch Normalization during training, and it makes larger batch sizes possible for instance-level recognition tasks on a single GPU, so there is no need to synchronize batch-normalization statistics across GPUs.
An improvement of the accuracy of small-size objects was reported according to the author’s experiments.
Faster R-CNN uses a 1:3 sampling strategy that draws negative RoIs (backgrounds) to balance the ratio of positive and negative data in a batch. This is empirical and hand-designed (it needs extra effort to set the hyper-parameters).
The authors designed an additional sub-network to "learn" the sampling of negative RoIs, forcing the network to focus on the ones most similar to objects (the hard ones), such as backgrounds containing part of an object.
The 'hard' examples are defined by the probability from the detection head, which means the sampling network is exactly the classification network itself. In practice, the selection range is set to [0.1, 0.5].
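The core of online hard example mining is simply "sort RoIs by their current loss and keep only the top k". A minimal sketch, with my own function name and toy numbers (the real OHEM uses a read-only copy of the head to score all RoIs first):

```python
import numpy as np

def ohem_select(losses, k):
    """Keep the k RoIs with the highest loss and zero out the rest,
    so easy samples contribute no gradient in this iteration."""
    keep = np.argsort(losses)[::-1][:k]   # indices of the k hardest RoIs
    mask = np.zeros_like(losses)
    mask[keep] = 1.0
    return losses * mask

losses = np.array([0.05, 2.3, 0.01, 1.1, 0.4])  # per-RoI losses in one batch
hard = ohem_select(losses, k=2)                  # only the two hardest survive
```

This replaces the hand-tuned 1:3 positive/negative ratio: the data itself decides which samples are worth back-propagating.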
OHEM improves performance even after adding bells and whistles like multi-scale training and iterative bbox regression.
A common practice since R-CNN is to pre-train a backbone on a classification dataset like ImageNet, then use the pre-trained weights to initialize the detection model. Although I once successfully trained a small detection network from random initialization on a large dataset, few models are trained from scratch when the number of instances in a dataset is limited, as in Pascal VOC and COCO. In fact, using better pre-trained weights is one of the common tricks in detection challenges. DSOD attempts to train a detection network from scratch with the help of "deep supervision" from DenseNet.
The four principles the authors argue for in object detection networks:
1. Proposal-free
2. Deep supervision
3. Stem Block
4. Dense Prediction Structure
DSOD outperforms detectors initialized with pre-trained weights.
Ablation Study on parts: