February 1, 2020

3138 words 15 mins read

Paper Group AWR 351



Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence

Title Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
Authors Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon
Abstract Blind video decaptioning is the problem of automatically removing text overlays and inpainting the occluded parts in videos without any input masks. While recent deep learning based inpainting methods deal with a single image and mostly assume that the positions of the corrupted pixels are known, we aim at automatic text removal in video sequences without mask information. In this paper, we propose a simple yet effective framework for fast blind video decaptioning. We construct an encoder-decoder model, where the encoder takes multiple source frames that can provide visible pixels revealed from the scene dynamics. These hints are aggregated and fed into the decoder. We apply a residual connection from the input frame to the decoder output to force our network to focus on the corrupted regions only. Our proposed model ranked first in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. In addition, we further improve this strong model by applying recurrent feedback. The recurrent feedback not only enforces temporal coherence but also provides strong clues on where the corrupted pixels are. Both qualitative and quantitative experiments demonstrate that our full model produces accurate and temporally consistent video results in real time (50+ fps).
Tasks Video Denoising, Video Inpainting, Video-to-Video Synthesis
Published 2019-05-08
URL https://arxiv.org/abs/1905.02949v1
PDF https://arxiv.org/pdf/1905.02949v1.pdf
PWC https://paperswithcode.com/paper/deep-blind-video-decaptioning-by-temporal
Repo https://github.com/shwoo93/video_decaptioning
Framework pytorch
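
A minimal PyTorch sketch of the residual-connection idea above: the encoder aggregates hints from several source frames, and the decoder predicts only a correction that is added back to the center input frame. Module shapes, channel counts, and names are illustrative assumptions, not the authors' implementation (see the linked repo for that).

```python
import torch
import torch.nn as nn

class BlindDecaptionNet(nn.Module):
    """Illustrative encoder-decoder with a residual connection from the
    center input frame to the output (channels/shapes are assumptions)."""
    def __init__(self, in_frames=5, ch=64):
        super().__init__()
        # Encoder aggregates visible-pixel hints from multiple source frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * in_frames, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder upsamples back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        # frames: (B, T, 3, H, W); the center frame is the one to restore.
        b, t, c, h, w = frames.shape
        center = frames[:, t // 2]
        feat = self.encoder(frames.view(b, t * c, h, w))
        residual = self.decoder(feat)
        # Residual connection: the network only has to fill corrupted regions.
        return center + residual
```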

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Title One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Authors Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee
Abstract Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model suffers from the limitation that it can only convert the voice to speakers seen in the training data, which narrows down the applicable scenarios of VC. In this paper, we propose a novel one-shot VC approach that performs VC with only one example utterance each from the source and target speakers, neither of whom needs to be seen during training. This is achieved by disentangling speaker and content representations with instance normalization (IN). Objective and subjective evaluations show that our model is able to generate voices similar to the target speaker. In addition to the performance measurements, we also demonstrate that this model learns meaningful speaker representations without any supervision.
Tasks Voice Conversion
Published 2019-04-10
URL https://arxiv.org/abs/1904.05742v4
PDF https://arxiv.org/pdf/1904.05742v4.pdf
PWC https://paperswithcode.com/paper/one-shot-voice-conversion-by-separating
Repo https://github.com/jjery2243542/adaptive_voice_conversion
Framework pytorch
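
The disentanglement described above rests on instance normalization: the content encoder applies IN without affine parameters to strip speaker-dependent channel statistics, and the decoder re-injects the target speaker's statistics through adaptive IN. A hedged sketch of both pieces; layer sizes and names are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class ContentEncoderBlock(nn.Module):
    """IN without learnable affine parameters removes per-utterance
    (speaker-dependent) channel statistics from the content features."""
    def __init__(self, ch=256):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, kernel_size=5, padding=2)
        self.norm = nn.InstanceNorm1d(ch, affine=False)

    def forward(self, x):          # x: (B, C, T) spectral features
        return torch.relu(self.norm(self.conv(x)))

def adaptive_instance_norm(content, speaker_emb, to_scale, to_bias):
    """Decoder-side AdaIN: re-apply channel-wise scale/shift predicted
    from the target speaker embedding (to_scale/to_bias are nn.Linear)."""
    normalized = nn.functional.instance_norm(content)      # (B, C, T)
    gamma = to_scale(speaker_emb).unsqueeze(-1)             # (B, C, 1)
    beta = to_bias(speaker_emb).unsqueeze(-1)
    return gamma * normalized + beta
```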

Mish: A Self Regularized Non-Monotonic Neural Activation Function

Title Mish: A Self Regularized Non-Monotonic Neural Activation Function
Authors Diganta Misra
Abstract The concept of non-linearity in a neural network is introduced by an activation function, which serves an integral role in the training and performance evaluation of the network. Over the years of theoretical research, many activation functions have been proposed; however, only a few are widely used across most applications, including ReLU (Rectified Linear Unit), TanH (Hyperbolic Tangent), Sigmoid, Leaky ReLU and Swish. In this work, a novel neural activation function called Mish is proposed. The experiments show that Mish tends to work better than both ReLU and Swish, along with other standard activation functions, in many deep networks across challenging datasets. For instance, in Squeeze-Excite Net-18 for CIFAR-100 classification, the network with Mish improved Top-1 test accuracy by 0.494% and 1.671% compared to the same network with Swish and ReLU, respectively. Its similarity to Swish, its performance gains, and its simple implementation make it easy for researchers and developers to use Mish in their neural network models.
Tasks Image Classification
Published 2019-08-23
URL https://arxiv.org/abs/1908.08681v2
PDF https://arxiv.org/pdf/1908.08681v2.pdf
PWC https://paperswithcode.com/paper/mish-a-self-regularized-non-monotonic-neural
Repo https://github.com/thomasbrandon/mish-cuda
Framework pytorch
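
Mish itself is a one-liner, x * tanh(softplus(x)). A minimal PyTorch version is below; recent PyTorch releases also ship it as nn.Mish.

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish(x) = x * tanh(softplus(x)); smooth, non-monotonic, unbounded above."""
    return x * torch.tanh(F.softplus(x))

# Drop-in usage, e.g. replacing a ReLU activation:
x = torch.randn(4, 16)
y = mish(x)
```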

A Feasible Framework for Arbitrary-Shaped Scene Text Recognition

Title A Feasible Framework for Arbitrary-Shaped Scene Text Recognition
Authors Jinjin Zhang, Wei Wang, Di Huang, Qingjie Liu, Yunhong Wang
Abstract Deep learning based methods have achieved surprising progress in Scene Text Recognition (STR), one of the classic problems in computer vision. In this paper, we propose a feasible framework for multi-lingual arbitrary-shaped STR, including instance segmentation based text detection and a language model based attention mechanism for text recognition. Our STR algorithm not only recognizes Latin and non-Latin characters, but also supports arbitrary-shaped text recognition. Our method won the championship on the Scene Text Spotting task (Latin only; Latin and Chinese) of the ICDAR 2019 Robust Reading Challenge on Arbitrary-Shaped Text. Code is available at https://github.com/zhang0jhon/AttentionOCR.
Tasks Instance Segmentation, Language Modelling, Scene Text Recognition, Semantic Segmentation, Text Spotting
Published 2019-12-10
URL https://arxiv.org/abs/1912.04561v2
PDF https://arxiv.org/pdf/1912.04561v2.pdf
PWC https://paperswithcode.com/paper/a-feasible-framework-for-arbitrary-shaped
Repo https://github.com/zhang0jhon/AttentionOCR
Framework tf

Auto-Encoding Progressive Generative Adversarial Networks For 3D Multi Object Scenes

Title Auto-Encoding Progressive Generative Adversarial Networks For 3D Multi Object Scenes
Authors Vedant Singh, Manan Oza, Himanshu Vaghela, Pratik Kanani
Abstract 3D multi-object generative models allow us to synthesize a large range of novel 3D multi-object scenes and also identify objects, shapes, layouts and their positions. However, multi-object scenes are difficult to create because the underlying datasets are multimodal in nature. Conventional 3D generative adversarial models are not efficient at generating multi-object scenes: they tend to generate either a single object or fuzzy results for multiple objects. Auto-encoder models offer much scope for feature extraction and representation learning using the unsupervised paradigm in probabilistic spaces, and we make use of this property in our proposed model. In this paper we propose a novel architecture using 3D ConvNets trained with the progressive training paradigm, which is able to generate realistic high-resolution 3D scenes of rooms, bedrooms, offices, etc. with various pieces of furniture and objects. We make use of an adversarial auto-encoder along with the WGAN-GP gradient penalty in our discriminator loss function. Finally, this new approach to multi-object scene generation is also able to generate more objects per scene.
Tasks Representation Learning, Scene Generation
Published 2019-03-08
URL http://arxiv.org/abs/1903.03477v1
PDF http://arxiv.org/pdf/1903.03477v1.pdf
PWC https://paperswithcode.com/paper/auto-encoding-progressive-generative
Repo https://github.com/yunishi3/3D-FCR-alphaGAN
Framework tf
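
The discriminator loss mentioned in the abstract uses the WGAN-GP formulation. A generic sketch of the gradient-penalty term for a 3D voxel critic; the critic itself and the voxel tensor shape are assumptions.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP: penalize the critic's gradient norm at points interpolated
    between real and generated samples (here 3D voxel grids, B x 1 x D x H x W)."""
    eps = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Critic loss: critic(fake).mean() - critic(real).mean() + gradient_penalty(...)
```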

LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators

Title LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators
Authors Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang, Tingfa Xu
Abstract Layout is important for graphic design and scene generation. We propose a novel Generative Adversarial Network, called LayoutGAN, that synthesizes layouts by modeling geometric relations of different types of 2D elements. The generator of LayoutGAN takes as input a set of randomly-placed 2D graphic elements and uses self-attention modules to refine their labels and geometric parameters jointly to produce a realistic layout. Accurate alignment is critical for good layouts. We thus propose a novel differentiable wireframe rendering layer that maps the generated layout to a wireframe image, upon which a CNN-based discriminator is used to optimize the layouts in image space. We validate the effectiveness of LayoutGAN in various experiments including MNIST digit generation, document layout generation, clipart abstract scene generation and tangram graphic design.
Tasks Scene Generation
Published 2019-01-21
URL http://arxiv.org/abs/1901.06767v1
PDF http://arxiv.org/pdf/1901.06767v1.pdf
PWC https://paperswithcode.com/paper/layoutgan-generating-graphic-layouts-with
Repo https://github.com/zyf12389/LayoutGAN-Alpha
Framework pytorch
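
The generator refines a set of randomly placed elements, each a vector of class probabilities and geometric parameters, with self-attention. A hedged sketch of one such refinement step using nn.MultiheadAttention; the element parameterization and dimensions are assumptions, and the differentiable wireframe rendering layer (the paper's other contribution) is not shown.

```python
import torch
import torch.nn as nn

class ElementRefiner(nn.Module):
    """Self-attention over a set of layout elements; each element is a
    vector of class probabilities concatenated with geometric parameters."""
    def __init__(self, n_classes=6, n_geo=4, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Linear(n_classes + n_geo, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_cls = nn.Linear(dim, n_classes)
        self.to_geo = nn.Linear(dim, n_geo)

    def forward(self, elements):             # (B, N, n_classes + n_geo)
        h = self.embed(elements)
        h, _ = self.attn(h, h, h)             # elements attend to each other
        cls = torch.softmax(self.to_cls(h), dim=-1)
        geo = torch.sigmoid(self.to_geo(h))   # e.g. normalized x, y, w, h
        return torch.cat([cls, geo], dim=-1)  # refined labels + geometry
```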

Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices

Title Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices
Authors Chandramouli Shama Sastry, Sageev Oore
Abstract When presented with Out-of-Distribution (OOD) examples, deep neural networks yield confident, incorrect predictions. Detecting OOD examples is challenging, and the potential risks are high. In this paper, we propose to detect OOD examples by identifying inconsistencies between activity patterns and the predicted class. We find that characterizing activity patterns by Gram matrices and identifying anomalies in Gram matrix values can yield high OOD detection rates. We identify anomalies in the Gram matrices by simply comparing each value with its respective range observed over the training data. Unlike many approaches, this can be used with any pre-trained softmax classifier and requires access to OOD data neither for fine-tuning hyperparameters nor for inferring parameters. The method is applicable across a variety of architectures and vision datasets and, for the important and surprisingly hard task of detecting far-from-distribution examples, it generally performs at least as well as state-of-the-art OOD detection methods (including those that do assume access to OOD examples).
Tasks
Published 2019-12-28
URL https://arxiv.org/abs/1912.12510v2
PDF https://arxiv.org/pdf/1912.12510v2.pdf
PWC https://paperswithcode.com/paper/detecting-out-of-distribution-examples-with
Repo https://github.com/VectorInstitute/gram-ood-detection
Framework pytorch
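
The detection rule itself is easy to reproduce: record the per-layer ranges of Gram-matrix entries on training data, then score a test input by how far its Gram values fall outside those ranges. A condensed sketch of the deviation computation; higher-order Gram matrices and the class-conditional bookkeeping used in the paper are omitted.

```python
import torch

def gram_matrix(feat):
    """Channel-wise Gram matrix of a conv feature map (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = feat.shape
    flat = feat.view(b, c, h * w)
    return torch.bmm(flat, flat.transpose(1, 2))

def deviation(gram, mins, maxs, eps=1e-6):
    """Penalty for Gram entries lying outside the [min, max] range
    observed on training data (mins/maxs have shape (C, C))."""
    below = torch.clamp(mins - gram, min=0) / (mins.abs() + eps)
    above = torch.clamp(gram - maxs, min=0) / (maxs.abs() + eps)
    return (below + above).sum(dim=(1, 2))   # larger => more likely OOD

# Total score: sum deviations across layers, normalized by the expected
# deviation measured on held-out in-distribution data.
```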

Towards Generating Stylized Image Captions via Adversarial Training

Title Towards Generating Stylized Image Captions via Adversarial Training
Authors Omid Mohamad Nezami, Mark Dras, Stephen Wan, Cecile Paris, Len Hamey
Abstract While most image captioning aims to generate objective descriptions of images, the last few years have seen work on generating visually grounded image captions which have a specific style (e.g., incorporating positive or negative sentiment). However, because the stylistic component is typically the last part of training, current models usually pay more attention to the style at the expense of accurate content description. In addition, there is a lack of variability in terms of the stylistic aspects. To address these issues, we propose an image captioning model called ATTEND-GAN which has two core components: first, an attention-based caption generator to strongly correlate different parts of an image with different parts of a caption; and second, an adversarial training mechanism to assist the caption generator to add diverse stylistic components to the generated captions. Because of these components, ATTEND-GAN can generate correlated captions as well as more human-like variability of stylistic patterns. Our system outperforms the state-of-the-art as well as a collection of our baseline models. A linguistic analysis of the generated captions demonstrates that captions generated using ATTEND-GAN have a wider range of stylistic adjectives and adjective-noun pairs.
Tasks Image Captioning
Published 2019-08-08
URL https://arxiv.org/abs/1908.02943v1
PDF https://arxiv.org/pdf/1908.02943v1.pdf
PWC https://paperswithcode.com/paper/towards-generating-stylized-image-captions
Repo https://github.com/omidmnezami/Style-GAN
Framework tf

Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Title Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Authors Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, Hanqing Lu
Abstract Image captioning attempts to generate a sentence composed of several linguistic words, which describe objects, attributes, and interactions in an image, denoted as visual semantic units in this paper. Based on this view, we propose to explicitly model object interactions in semantics and geometry with Graph Convolutional Networks (GCNs), and to fully exploit the alignment between linguistic words and visual semantic units for image captioning. In particular, we construct a semantic graph and a geometry graph, where each node corresponds to a visual semantic unit, i.e., an object, an attribute, or a semantic (geometrical) interaction between two objects. Accordingly, the semantic (geometrical) context-aware embeddings for each unit are obtained through the corresponding GCN learning processes. At each time step, a context gated attention module takes as input the embeddings of the visual semantic units and hierarchically aligns the current word with these units by first deciding which type of visual semantic unit (object, attribute, or interaction) the current word is about, and then finding the most correlated visual semantic units of this type. Extensive experiments are conducted on the challenging MS-COCO image captioning dataset, and superior results are reported in comparison to state-of-the-art approaches.
Tasks Image Captioning
Published 2019-08-06
URL https://arxiv.org/abs/1908.02127v1
PDF https://arxiv.org/pdf/1908.02127v1.pdf
PWC https://paperswithcode.com/paper/aligning-linguistic-words-and-visual-semantic
Repo https://github.com/ltguo19/VSUA-Captioning
Framework pytorch
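
The semantic and geometry graphs are processed with GCN layers to produce context-aware embeddings for each visual semantic unit. A minimal sketch of one such layer with simple row-normalized aggregation; the graph construction itself, which is the paper's main contribution, is not shown and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbor features through a
    normalized adjacency matrix, then apply a shared linear transform."""
    def __init__(self, in_dim=2048, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, nodes, adj):
        # nodes: (B, N, in_dim) unit features; adj: (B, N, N) with self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        norm_adj = adj / deg                     # row-normalized adjacency
        return torch.relu(self.proj(torch.bmm(norm_adj, nodes)))
```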

Multi-View Stereo by Temporal Nonparametric Fusion

Title Multi-View Stereo by Temporal Nonparametric Fusion
Authors Yuxin Hou, Juho Kannala, Arno Solin
Abstract We propose a novel idea for depth estimation from multi-view image-pose pairs, where the model can leverage information from previous latent-space encodings of the scene. This model uses pairs of images and poses, which are passed through an encoder–decoder model for disparity estimation. The novelty lies in soft-constraining the bottleneck layer by a nonparametric Gaussian process prior. We propose a pose-kernel structure that encourages similar poses to have resembling latent spaces. The flexibility of the Gaussian process (GP) prior provides adapting memory for fusing information from previous views. We train the encoder–decoder and the GP hyperparameters jointly, end-to-end. In addition to a batch method, we derive a lightweight estimation scheme that circumvents standard pitfalls in scaling Gaussian process inference, and demonstrate how our scheme can run in real-time on smart devices.
Tasks Depth Estimation, Disparity Estimation
Published 2019-04-12
URL https://arxiv.org/abs/1904.06397v2
PDF https://arxiv.org/pdf/1904.06397v2.pdf
PWC https://paperswithcode.com/paper/multi-view-stereo-by-temporal-nonparametric
Repo https://github.com/AaltoML/GP-MVS
Framework pytorch
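
The key modeling choice is the pose kernel that soft-constrains the latent bottleneck so that views with similar camera poses get correlated encodings. A hedged illustration of such a kernel, combining a translation gap with a rotation-angle gap; the exact distance measure and hyperparameters used in GP-MVS may differ.

```python
import numpy as np

def pose_distance(T1, T2, w_rot=1.0):
    """Distance between two 4x4 camera-to-world pose matrices:
    Euclidean translation gap plus a weighted rotation-angle gap."""
    t_gap = np.linalg.norm(T1[:3, 3] - T2[:3, 3])
    R_rel = T1[:3, :3].T @ T2[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t_gap + w_rot * np.arccos(cos_angle)

def pose_kernel(T1, T2, lengthscale=0.5, variance=1.0):
    """Squared-exponential kernel on pose distance: similar poses
    get strongly correlated latent codes under the GP prior."""
    d = pose_distance(T1, T2)
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)
```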

Aesthetic Attributes Assessment of Images

Title Aesthetic Attributes Assessment of Images
Authors Xin Jin, Le Wu, Geng Zhao, Xiaodong Li, Xiaokun Zhang, Shiming Ge, Dongqing Zou, Bin Zhou, Xinghui Zhou
Abstract Image aesthetic quality assessment has been a relatively hot topic during the last decade. Most recently, comment-type assessment (aesthetic captions) has been proposed to describe the general aesthetic impression of an image using text. In this paper, we propose Aesthetic Attributes Assessment of Images, i.e., aesthetic attributes captioning. This is a new formulation of image aesthetic assessment, which predicts aesthetic attribute captions together with an aesthetic score for each attribute. We introduce a new dataset named DPC-Captions, which contains comments on up to five aesthetic attributes per image, built through knowledge transfer from a fully-annotated small-scale dataset. Then, we propose the Aesthetic Multi-Attribute Network (AMAN), which is trained on a mixture of the fully-annotated small-scale PCCD dataset and the weakly-annotated large-scale DPC-Captions dataset. Our AMAN makes full use of transfer learning and attention models in a single framework. The experimental results on our DPC-Captions and PCCD datasets reveal that our method can predict captions for five aesthetic attributes together with a numerical score for each attribute. Using the evaluation criteria of image captioning, we show that our specially designed AMAN model outperforms the traditional CNN-LSTM model and the modern SCA-CNN captioning model.
Tasks Image Captioning, Transfer Learning
Published 2019-07-11
URL https://arxiv.org/abs/1907.04983v2
PDF https://arxiv.org/pdf/1907.04983v2.pdf
PWC https://paperswithcode.com/paper/aesthetic-attributes-assessment-of-images
Repo https://github.com/BestiVictory/DPC-Captions
Framework none

Image Captioning: Transforming Objects into Words

Title Image Captioning: Transforming Objects into Words
Authors Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares
Abstract Image captioning models typically follow an encoder-decoder architecture which uses abstract image feature vectors as input to the encoder. One of the most successful algorithms uses feature vectors extracted from the region proposals obtained from an object detector. In this work we introduce the Object Relation Transformer, which builds upon this approach by explicitly incorporating information about the spatial relationships between detected input objects through geometric attention. Quantitative and qualitative results demonstrate the importance of such geometric attention for image captioning, leading to improvements on all common captioning metrics on the MS-COCO dataset.
Tasks Image Captioning
Published 2019-06-14
URL https://arxiv.org/abs/1906.05963v2
PDF https://arxiv.org/pdf/1906.05963v2.pdf
PWC https://paperswithcode.com/paper/image-captioning-transforming-objects-into
Repo https://github.com/yahoo/object_relation_transformer
Framework pytorch
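
Geometric attention augments the usual scaled-dot-product weights with a term computed from relative box geometry, in the spirit of Relation Networks. The sketch below shows one common formulation; the exact feature embedding and the learned projection that produces geo_weight are assumptions omitted here.

```python
import torch

def box_relation_features(boxes):
    """Pairwise relative geometry for boxes (N, 4) given as (cx, cy, w, h):
    log-scaled center offsets and size ratios, shape (N, N, 4)."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-3)
    dy = torch.log(torch.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-3)
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

def geometric_attention(q, k, v, geo_weight):
    """Combine appearance attention with a geometric bias:
    softmax over (appearance logits + log geometric weights).
    geo_weight (N, N) would come from projecting the relative-geometry
    features through a small learned layer with ReLU; here it is given."""
    logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    attn = torch.softmax(logits + torch.log(geo_weight.clamp(min=1e-6)), dim=-1)
    return attn @ v
```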

Towards Interpretable Reinforcement Learning Using Attention Augmented Agents

Title Towards Interpretable Reinforcement Learning Using Attention Augmented Agents
Authors Alex Mott, Daniel Zoran, Mike Chrzanowski, Daan Wierstra, Danilo J. Rezende
Abstract Inspired by recent work in attention models for image captioning and question answering, we present a soft attention model for the reinforcement learning domain. This model uses a soft, top-down attention mechanism to create a bottleneck in the agent, forcing it to focus on task-relevant information by sequentially querying its view of the environment. The output of the attention mechanism allows direct observation of the information used by the agent to select its actions, enabling easier interpretation of this model than of traditional models. We analyze different strategies that the agents learn and show that a handful of strategies arise repeatedly across different games. We also show that the model learns to query separately about space and content ("where" vs. "what"). We demonstrate that an agent using this mechanism can achieve performance competitive with state-of-the-art models on ATARI tasks while still being interpretable.
Tasks Image Captioning, Question Answering
Published 2019-06-06
URL https://arxiv.org/abs/1906.02500v1
PDF https://arxiv.org/pdf/1906.02500v1.pdf
PWC https://paperswithcode.com/paper/towards-interpretable-reinforcement-learning
Repo https://github.com/aluscher/torchbeastpopart
Framework pytorch
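
Interpretability comes from the top-down query: the recurrent agent state produces a query that is matched against keys computed from the convolutional feature map, and the resulting spatial attention map both feeds the policy and can be visualized. A minimal single-head sketch; the spatial basis, multiple heads, and exact shapes in the paper are not reproduced.

```python
import torch
import torch.nn as nn

class TopDownSpatialAttention(nn.Module):
    """Query from the agent's recurrent state; keys/values from CNN features.
    Returns an attended feature vector plus the attention map for inspection."""
    def __init__(self, feat_ch=64, state_dim=256, key_dim=64):
        super().__init__()
        self.to_key = nn.Conv2d(feat_ch, key_dim, kernel_size=1)
        self.to_value = nn.Conv2d(feat_ch, key_dim, kernel_size=1)
        self.to_query = nn.Linear(state_dim, key_dim)

    def forward(self, feat, state):             # feat: (B, C, H, W), state: (B, S)
        b, _, h, w = feat.shape
        keys = self.to_key(feat).flatten(2)      # (B, K, H*W)
        values = self.to_value(feat).flatten(2)  # (B, K, H*W)
        query = self.to_query(state).unsqueeze(1)               # (B, 1, K)
        logits = torch.bmm(query, keys) / keys.size(1) ** 0.5   # (B, 1, H*W)
        attn = torch.softmax(logits, dim=-1)
        read = torch.bmm(values, attn.transpose(1, 2)).squeeze(-1)  # (B, K)
        return read, attn.view(b, h, w)          # the map is what gets visualized
```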

SwiftNet: Using Graph Propagation as Meta-knowledge to Search Highly Representative Neural Architectures

Title SwiftNet: Using Graph Propagation as Meta-knowledge to Search Highly Representative Neural Architectures
Authors Hsin-Pai Cheng, Tunhou Zhang, Yukun Yang, Feng Yan, Shiyu Li, Harris Teague, Hai Li, Yiran Chen
Abstract Designing neural architectures for edge devices is subject to constraints of accuracy, inference latency, and computational cost. Traditionally, researchers manually craft deep neural networks to meet the needs of mobile devices. Neural Architecture Search (NAS) was proposed to automate neural architecture design without requiring extensive domain expertise and significant manual effort. Recent works utilized NAS to design mobile models by taking hardware constraints into account and achieved state-of-the-art accuracy with fewer parameters and less computational cost measured in multiply-accumulates (MACs). To find highly compact neural architectures, existing works rely on predefined cells and directly apply a width multiplier, which may limit model flexibility, reduce useful feature map information, and cause accuracy drops. To address this issue, we propose GRAM (GRAph propagation as Meta-knowledge), which adopts a fine-grained (node-wise) search method and accumulates the knowledge learned in updates into a meta-graph. As a result, GRAM enables a more flexible search space and achieves higher search efficiency. Without the constraints of predefined cells or blocks, we propose a new structure-level pruning method to remove redundant operations in neural architectures. SwiftNet, a set of models discovered by GRAM, outperforms MobileNet-V2 with 2.15x higher accuracy density and a 2.42x speedup at similar accuracy. Compared with FBNet, SwiftNet reduces the search cost by 26x and achieves 2.35x higher accuracy density and a 1.47x speedup while preserving similar accuracy. SwiftNet obtains 63.28% top-1 accuracy on ImageNet-1K with only 53M MACs and 2.07M parameters. The corresponding inference latency is only 19.09 ms on a Google Pixel 1.
Tasks Neural Architecture Search
Published 2019-06-19
URL https://arxiv.org/abs/1906.08305v2
PDF https://arxiv.org/pdf/1906.08305v2.pdf
PWC https://paperswithcode.com/paper/swiftnet-using-graph-propagation-as-meta
Repo https://github.com/newwhitecheng/swiftnet
Framework none

Functional Variational Bayesian Neural Networks

Title Functional Variational Bayesian Neural Networks
Authors Shengyang Sun, Guodong Zhang, Jiaxin Shi, Roger Grosse
Abstract Variational Bayesian neural networks (BNNs) perform variational inference over weights, but it is difficult to specify meaningful priors and approximate posteriors in a high-dimensional weight space. We introduce functional variational Bayesian neural networks (fBNNs), which maximize an Evidence Lower BOund (ELBO) defined directly on stochastic processes, i.e. distributions over functions. We prove that the KL divergence between stochastic processes equals the supremum of marginal KL divergences over all finite sets of inputs. Based on this, we introduce a practical training objective which approximates the functional ELBO using finite measurement sets and the spectral Stein gradient estimator. With fBNNs, we can specify priors entailing rich structures, including Gaussian processes and implicit stochastic processes. Empirically, we find fBNNs extrapolate well using various structured priors, provide reliable uncertainty estimates, and scale to large datasets.
Tasks Bayesian Inference, Gaussian Processes
Published 2019-03-14
URL http://arxiv.org/abs/1903.05779v1
PDF http://arxiv.org/pdf/1903.05779v1.pdf
PWC https://paperswithcode.com/paper/functional-variational-bayesian-neural-1
Repo https://github.com/ssydasheng/FBNN
Framework tf
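
The training objective described in the abstract can be written compactly: the functional ELBO swaps the weight-space KL for a KL between stochastic processes, which equals a supremum of marginal KLs over finite input sets and is approximated on sampled measurement sets. In notation paraphrased from the abstract (not copied from the paper):

```latex
% Functional ELBO over a distribution q(f) on functions, with the
% process-level KL expressed via its finite-dimensional marginals:
\mathcal{L}(q) = \mathbb{E}_{q(f)}\!\left[\log p(\mathcal{D}\mid f)\right]
               - \mathrm{KL}\!\left(q(f)\,\|\,p(f)\right),
\qquad
\mathrm{KL}\!\left(q(f)\,\|\,p(f)\right)
  = \sup_{\mathbf{X}} \mathrm{KL}\!\left(q(f_{\mathbf{X}})\,\|\,p(f_{\mathbf{X}})\right)
```

In practice, as the abstract notes, the supremum is approximated by sampling finite measurement sets and the gradient of the marginal KL is estimated with the spectral Stein gradient estimator.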