Paper Group AWR 351
Papers in this group:
Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Mish: A Self Regularized Non-Monotonic Neural Activation Function
A Feasible Framework for Arbitrary-Shaped Scene Text Recognition
Auto-Encoding Progressive Generative Adversarial Networks For 3D Multi Object Scenes
LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators
Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices
Towards Generating Stylized Image Captions via Adversarial Training
Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Multi-View Stereo by Temporal Nonparametric Fusion
Aesthetic Attributes Assessment of Images
Image Captioning: Transforming Objects into Words
Towards Interpretable Reinforcement Learning Using Attention Augmented Agents
SwiftNet: Using Graph Propagation as Meta-knowledge to Search Highly Representative Neural Architectures
Functional Variational Bayesian Neural Networks
Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence
Title | Deep Blind Video Decaptioning by Temporal Aggregation and Recurrence |
Authors | Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon |
Abstract | Blind video decaptioning is the problem of automatically removing text overlays and inpainting the occluded parts in videos without any input masks. While recent deep learning based inpainting methods deal with a single image and mostly assume that the positions of the corrupted pixels are known, we aim at automatic text removal in video sequences without mask information. In this paper, we propose a simple yet effective framework for fast blind video decaptioning. We construct an encoder-decoder model, where the encoder takes multiple source frames that can provide visible pixels revealed by the scene dynamics. These hints are aggregated and fed into the decoder. We apply a residual connection from the input frame to the decoder output, forcing our network to focus only on the corrupted regions. Our proposed model ranked first in the ECCV Chalearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. In addition, we further improve this strong model by applying recurrent feedback. The recurrent feedback not only enforces temporal coherence but also provides strong clues on where the corrupted pixels are. Both qualitative and quantitative experiments demonstrate that our full model produces accurate and temporally consistent video results in real time (50+ fps). |
Tasks | Video Denoising, Video Inpainting, Video-to-Video Synthesis |
Published | 2019-05-08 |
URL | https://arxiv.org/abs/1905.02949v1 |
https://arxiv.org/pdf/1905.02949v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-blind-video-decaptioning-by-temporal |
Repo | https://github.com/shwoo93/video_decaptioning |
Framework | pytorch |
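The input-to-output residual connection described above is the part that most directly shapes the training signal: the decoder only has to predict what should change in the corrupted regions. Below is a minimal PyTorch sketch of that idea; the single-convolution encoder and decoder are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ResidualDecaptioner(nn.Module):
    """Toy sketch: stack several source frames, predict a residual image,
    and add it to the centre frame so only corrupted pixels need changing.
    The one-layer encoder/decoder are placeholders, not the paper's model."""
    def __init__(self, num_frames=5):
        super().__init__()
        self.encoder = nn.Conv2d(3 * num_frames, 64, kernel_size=3, padding=1)
        self.decoder = nn.Conv2d(64, 3, kernel_size=3, padding=1)

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        feat = torch.relu(self.encoder(frames.reshape(b, t * c, h, w)))
        residual = self.decoder(feat)
        center = frames[:, t // 2]                 # the frame being restored
        return torch.clamp(center + residual, 0.0, 1.0)

# Usage: restore the middle frame of a 5-frame window.
frames = torch.rand(2, 5, 3, 128, 128)
restored = ResidualDecaptioner()(frames)           # (2, 3, 128, 128)
```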
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Title | One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization |
Authors | Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee |
Abstract | Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model suffers from the limitation that it can only convert the voice to speakers seen in the training data, which narrows the applicable scenarios of VC. In this paper, we propose a novel one-shot VC approach that performs VC given only one example utterance each from the source and target speakers, neither of whom needs to be seen during training. This is achieved by disentangling speaker and content representations with instance normalization (IN). Objective and subjective evaluations show that our model is able to generate voice similar to the target speaker. In addition to the performance measurement, we also demonstrate that this model is able to learn meaningful speaker representations without any supervision. |
Tasks | Voice Conversion |
Published | 2019-04-10 |
URL | https://arxiv.org/abs/1904.05742v4 |
https://arxiv.org/pdf/1904.05742v4.pdf | |
PWC | https://paperswithcode.com/paper/one-shot-voice-conversion-by-separating |
Repo | https://github.com/jjery2243542/adaptive_voice_conversion |
Framework | pytorch |
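The key mechanism above is instance normalization without affine parameters in the content encoder, which strips per-channel statistics (argued to carry speaker information), while the decoder re-injects statistics derived from the speaker encoder. A minimal PyTorch sketch under those assumptions; layer shapes and module names are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContentEncoderBlock(nn.Module):
    """IN without affine parameters removes per-channel statistics, which the
    paper argues carry speaker information, leaving content information."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=5, padding=2)
        self.norm = nn.InstanceNorm1d(channels, affine=False)

    def forward(self, x):                      # x: (B, channels, T) spectrogram frames
        return torch.relu(self.norm(self.conv(x)))

def adaptive_instance_norm(content, spk_mean, spk_std, eps=1e-5):
    """Decoder side: re-inject channel statistics predicted by the speaker encoder."""
    mean = content.mean(dim=2, keepdim=True)
    std = content.std(dim=2, keepdim=True)
    return (content - mean) / (std + eps) * spk_std + spk_mean

# Usage: content from one utterance, speaker statistics from another.
content = ContentEncoderBlock(80)(torch.randn(1, 80, 128))
converted = adaptive_instance_norm(content, torch.zeros(1, 80, 1), torch.ones(1, 80, 1))
```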
Mish: A Self Regularized Non-Monotonic Neural Activation Function
Title | Mish: A Self Regularized Non-Monotonic Neural Activation Function |
Authors | Diganta Misra |
Abstract | The concept of non-linearity in a neural network is introduced by an activation function, which serves an integral role in the training and performance evaluation of the network. Over the years of theoretical research, many activation functions have been proposed; however, only a few are widely used across most applications, including ReLU (Rectified Linear Unit), TanH (Tan Hyperbolic), Sigmoid, Leaky ReLU and Swish. In this work, a novel neural activation function called Mish is proposed. The experiments show that Mish tends to work better than both ReLU and Swish, along with other standard activation functions, in many deep networks across challenging datasets. For instance, in Squeeze Excite Net-18 for CIFAR-100 classification, the network with Mish had an increase in Top-1 test accuracy of 0.494% and 1.671% compared to the same network with Swish and ReLU, respectively. Its similarity to Swish, its boost in performance, and its simplicity of implementation make it easy for researchers and developers to use Mish in their neural network models. |
Tasks | Image Classification |
Published | 2019-08-23 |
URL | https://arxiv.org/abs/1908.08681v2 |
https://arxiv.org/pdf/1908.08681v2.pdf | |
PWC | https://paperswithcode.com/paper/mish-a-self-regularized-non-monotonic-neural |
Repo | https://github.com/thomasbrandon/mish-cuda |
Framework | pytorch |
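The activation itself is a one-liner, x * tanh(softplus(x)). The linked repository provides a fused CUDA kernel; the reference formula in plain PyTorch is simply:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

# Drop-in replacement for ReLU/Swish in an existing model, e.g.:
layer = nn.Sequential(nn.Linear(128, 128), Mish())
out = layer(torch.randn(4, 128))
```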
A Feasible Framework for Arbitrary-Shaped Scene Text Recognition
Title | A Feasible Framework for Arbitrary-Shaped Scene Text Recognition |
Authors | Jinjin Zhang, Wei Wang, Di Huang, Qingjie Liu, Yunhong Wang |
Abstract | Deep learning based methods have achieved surprising progress in Scene Text Recognition (STR), one of the classic problems in computer vision. In this paper, we propose a feasible framework for multi-lingual arbitrary-shaped STR, including instance segmentation based text detection and a language model based attention mechanism for text recognition. Our STR algorithm not only recognizes Latin and non-Latin characters, but also supports arbitrary-shaped text recognition. Our method won the championship on the Scene Text Spotting task (Latin only; Latin and Chinese) of the ICDAR 2019 Robust Reading Challenge on Arbitrary-Shaped Text competition. Code is available at https://github.com/zhang0jhon/AttentionOCR. |
Tasks | Instance Segmentation, Language Modelling, Scene Text Recognition, Semantic Segmentation, Text Spotting |
Published | 2019-12-10 |
URL | https://arxiv.org/abs/1912.04561v2 |
https://arxiv.org/pdf/1912.04561v2.pdf | |
PWC | https://paperswithcode.com/paper/a-feasible-framework-for-arbitrary-shaped |
Repo | https://github.com/zhang0jhon/AttentionOCR |
Framework | tf |
Auto-Encoding Progressive Generative Adversarial Networks For 3D Multi Object Scenes
Title | Auto-Encoding Progressive Generative Adversarial Networks For 3D Multi Object Scenes |
Authors | Vedant Singh, Manan Oza, Himanshu Vaghela, Pratik Kanani |
Abstract | 3D multi-object generative models allow us to synthesize a wide range of novel 3D multi-object scenes and to identify objects, shapes, layouts, and their positions. However, multi-object scenes are difficult to generate because the underlying datasets are multimodal in nature. Conventional 3D generative adversarial models are not effective at generating multi-object scenes; they usually tend to generate either a single object or fuzzy renderings of multiple objects. Auto-encoder models offer considerable scope for feature extraction and representation learning in unsupervised, probabilistic settings, and we make use of this property in our proposed model. In this paper we propose a novel architecture using 3D ConvNets trained with the progressive training paradigm, which is able to generate realistic high-resolution 3D scenes of rooms, bedrooms, offices, etc., with various pieces of furniture and objects. We make use of an adversarial auto-encoder along with the WGAN-GP loss term in our discriminator loss function. Finally, this new approach to multi-object scene generation is also able to generate a greater number of objects per scene. |
Tasks | Representation Learning, Scene Generation |
Published | 2019-03-08 |
URL | http://arxiv.org/abs/1903.03477v1 |
http://arxiv.org/pdf/1903.03477v1.pdf | |
PWC | https://paperswithcode.com/paper/auto-encoding-progressive-generative |
Repo | https://github.com/yunishi3/3D-FCR-alphaGAN |
Framework | tf |
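The abstract mentions the WGAN-GP term in the discriminator loss. The referenced repository is TensorFlow, but the gradient penalty itself is framework-agnostic; a minimal sketch (written in PyTorch for consistency with the other sketches here, and not tied to the paper's 3D architecture) is:

```python
import torch

def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    """WGAN-GP term: interpolate between real and fake samples and push the
    discriminator's gradient norm toward 1. Works for 3D voxel batches too,
    since the interpolation broadcasts over all non-batch dimensions."""
    b = real.size(0)
    alpha = torch.rand(b, *([1] * (real.dim() - 1)), device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.reshape(b, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# Usage with a toy critic on 32^3 voxel grids:
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 ** 3, 1))
real, fake = torch.rand(4, 1, 32, 32, 32), torch.rand(4, 1, 32, 32, 32)
gp = gradient_penalty(critic, real, fake)
```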
LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators
Title | LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators |
Authors | Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang, Tingfa Xu |
Abstract | Layout is important for graphic design and scene generation. We propose a novel Generative Adversarial Network, called LayoutGAN, that synthesizes layouts by modeling geometric relations of different types of 2D elements. The generator of LayoutGAN takes as input a set of randomly-placed 2D graphic elements and uses self-attention modules to refine their labels and geometric parameters jointly to produce a realistic layout. Accurate alignment is critical for good layouts. We thus propose a novel differentiable wireframe rendering layer that maps the generated layout to a wireframe image, upon which a CNN-based discriminator is used to optimize the layouts in image space. We validate the effectiveness of LayoutGAN in various experiments including MNIST digit generation, document layout generation, clipart abstract scene generation and tangram graphic design. |
Tasks | Scene Generation |
Published | 2019-01-21 |
URL | http://arxiv.org/abs/1901.06767v1 |
http://arxiv.org/pdf/1901.06767v1.pdf | |
PWC | https://paperswithcode.com/paper/layoutgan-generating-graphic-layouts-with |
Repo | https://github.com/zyf12389/LayoutGAN-Alpha |
Framework | pytorch |
Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices
Title | Detecting Out-of-Distribution Examples with In-distribution Examples and Gram Matrices |
Authors | Chandramouli Shama Sastry, Sageev Oore |
Abstract | When presented with Out-of-Distribution (OOD) examples, deep neural networks yield confident, incorrect predictions. Detecting OOD examples is challenging, and the potential risks are high. In this paper, we propose to detect OOD examples by identifying inconsistencies between activity patterns and the predicted class. We find that characterizing activity patterns by Gram matrices and identifying anomalies in Gram matrix values can yield high OOD detection rates. We identify anomalies in the Gram matrices by simply comparing each value with its respective range observed over the training data. Unlike many approaches, this can be used with any pre-trained softmax classifier and does not require access to OOD data for fine-tuning hyperparameters, nor does it require OOD access for inferring parameters. The method is applicable across a variety of architectures and vision datasets and, for the important and surprisingly hard task of detecting far-from-distribution out-of-distribution examples, it generally performs better than or equal to state-of-the-art OOD detection methods (including those that do assume access to OOD examples). |
Tasks | |
Published | 2019-12-28 |
URL | https://arxiv.org/abs/1912.12510v2 |
https://arxiv.org/pdf/1912.12510v2.pdf | |
PWC | https://paperswithcode.com/paper/detecting-out-of-distribution-examples-with |
Repo | https://github.com/VectorInstitute/gram-ood-detection |
Framework | pytorch |
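The detection rule above is concrete enough to sketch: compute Gram matrices of intermediate activations, record the per-entry min/max over training data, and score a test input by how far its Gram values fall outside those ranges. The paper additionally uses higher-order Gram matrices, multiple layers, and class-conditional ranges, which this simplified PyTorch sketch omits.

```python
import torch

def layer_gram(features):
    """Gram matrix of a conv feature map: (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = features.shape
    flat = features.reshape(b, c, h * w)
    return flat @ flat.transpose(1, 2)

def fit_ranges(train_grams):
    """Per-entry min/max of Gram values observed over in-distribution data.
    `train_grams` is a list of (B, C, C) tensors from the training set."""
    stacked = torch.cat(train_grams, dim=0)
    return stacked.min(dim=0).values, stacked.max(dim=0).values

def ood_deviation(gram, lo, hi, eps=1e-6):
    """Total deviation of test Gram values from the training ranges;
    larger scores suggest an out-of-distribution input."""
    below = torch.clamp(lo - gram, min=0) / (lo.abs() + eps)
    above = torch.clamp(gram - hi, min=0) / (hi.abs() + eps)
    return (below + above).sum(dim=(1, 2))

# Usage: fit ranges on in-distribution activations, then score test inputs.
train_grams = [layer_gram(torch.randn(8, 16, 4, 4)) for _ in range(10)]
lo, hi = fit_ranges(train_grams)
scores = ood_deviation(layer_gram(torch.randn(2, 16, 4, 4)), lo, hi)
```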
Towards Generating Stylized Image Captions via Adversarial Training
Title | Towards Generating Stylized Image Captions via Adversarial Training |
Authors | Omid Mohamad Nezami, Mark Dras, Stephen Wan, Cecile Paris, Len Hamey |
Abstract | While most image captioning aims to generate objective descriptions of images, the last few years have seen work on generating visually grounded image captions which have a specific style (e.g., incorporating positive or negative sentiment). However, because the stylistic component is typically the last part of training, current models usually pay more attention to the style at the expense of accurate content description. In addition, there is a lack of variability in terms of the stylistic aspects. To address these issues, we propose an image captioning model called ATTEND-GAN which has two core components: first, an attention-based caption generator to strongly correlate different parts of an image with different parts of a caption; and second, an adversarial training mechanism to assist the caption generator to add diverse stylistic components to the generated captions. Because of these components, ATTEND-GAN can generate correlated captions as well as more human-like variability of stylistic patterns. Our system outperforms the state-of-the-art as well as a collection of our baseline models. A linguistic analysis of the generated captions demonstrates that captions generated using ATTEND-GAN have a wider range of stylistic adjectives and adjective-noun pairs. |
Tasks | Image Captioning |
Published | 2019-08-08 |
URL | https://arxiv.org/abs/1908.02943v1 |
https://arxiv.org/pdf/1908.02943v1.pdf | |
PWC | https://paperswithcode.com/paper/towards-generating-stylized-image-captions |
Repo | https://github.com/omidmnezami/Style-GAN |
Framework | tf |
Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Title | Aligning Linguistic Words and Visual Semantic Units for Image Captioning |
Authors | Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, Hanqing Lu |
Abstract | Image captioning attempts to generate a sentence composed of several linguistic words, which are used to describe objects, attributes, and interactions in an image, denoted as visual semantic units in this paper. Based on this view, we propose to explicitly model the object interactions in semantics and geometry based on Graph Convolutional Networks (GCNs), and fully exploit the alignment between linguistic words and visual semantic units for image captioning. Particularly, we construct a semantic graph and a geometry graph, where each node corresponds to a visual semantic unit, i.e., an object, an attribute, or a semantic (geometrical) interaction between two objects. Accordingly, the semantic (geometrical) context-aware embeddings for each unit are obtained through the corresponding GCN learning processes. At each time step, a context gated attention module takes as input the embeddings of the visual semantic units and hierarchically aligns the current word with these units by first deciding which type of visual semantic unit (object, attribute, or interaction) the current word is about, and then finding the most correlated visual semantic units under this type. Extensive experiments are conducted on the challenging MS-COCO image captioning dataset, and superior results are reported compared to state-of-the-art approaches. |
Tasks | Image Captioning |
Published | 2019-08-06 |
URL | https://arxiv.org/abs/1908.02127v1 |
https://arxiv.org/pdf/1908.02127v1.pdf | |
PWC | https://paperswithcode.com/paper/aligning-linguistic-words-and-visual-semantic |
Repo | https://github.com/ltguo19/VSUA-Captioning |
Framework | pytorch |
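The graph step described above reduces, in its simplest form, to GCN message passing over the visual-semantic-unit nodes. A generic single-layer sketch in PyTorch (not the paper's exact propagation rule), assuming a pre-normalised adjacency matrix:

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One message-passing step over a graph of visual semantic units
    (objects, attributes, relations); `adj` is a pre-normalised adjacency matrix."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):        # node_feats: (N, D), adj: (N, N)
        return torch.relu(self.linear(adj @ node_feats))

# Usage: 10 visual semantic units with 512-d features and a toy adjacency matrix.
nodes = torch.randn(10, 512)
adj = torch.eye(10)                            # self-loops only, for illustration
out = SimpleGCNLayer(512)(nodes, adj)
```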
Multi-View Stereo by Temporal Nonparametric Fusion
Title | Multi-View Stereo by Temporal Nonparametric Fusion |
Authors | Yuxin Hou, Juho Kannala, Arno Solin |
Abstract | We propose a novel idea for depth estimation from multi-view image-pose pairs, where the model has capability to leverage information from previous latent-space encodings of the scene. This model uses pairs of images and poses, which are passed through an encoder–decoder model for disparity estimation. The novelty lies in soft-constraining the bottleneck layer by a nonparametric Gaussian process prior. We propose a pose-kernel structure that encourages similar poses to have resembling latent spaces. The flexibility of the Gaussian process (GP) prior provides adapting memory for fusing information from previous views. We train the encoder–decoder and the GP hyperparameters jointly end-to-end. In addition to a batch method, we derive a lightweight estimation scheme that circumvents standard pitfalls in scaling Gaussian process inference, and demonstrate how our scheme can run in real-time on smart devices. |
Tasks | Depth Estimation, Disparity Estimation |
Published | 2019-04-12 |
URL | https://arxiv.org/abs/1904.06397v2 |
https://arxiv.org/pdf/1904.06397v2.pdf | |
PWC | https://paperswithcode.com/paper/multi-view-stereo-by-temporal-nonparametric |
Repo | https://github.com/AaltoML/GP-MVS |
Framework | pytorch |
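The pose-kernel idea, that nearby camera poses should share latent-space information, can be illustrated with a simple squared-exponential kernel over translation distance and rotation angle. The length scales and the exact kernel form below are assumptions for illustration; the paper defines its own pose kernel.

```python
import numpy as np

def pose_kernel(t1, R1, t2, R2, ell_t=1.0, ell_r=0.5):
    """Hypothetical squared-exponential pose kernel: camera poses that are close
    in translation and rotation get high covariance, so their latent encodings
    are fused more strongly by the GP prior."""
    d_t = np.linalg.norm(t1 - t2)                         # translation distance
    cos_angle = (np.trace(R1.T @ R2) - 1.0) / 2.0         # relative rotation angle
    d_r = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return np.exp(-0.5 * (d_t / ell_t) ** 2 - 0.5 * (d_r / ell_r) ** 2)

# Identical poses give covariance 1.0; distant poses decay toward 0.
t, R = np.zeros(3), np.eye(3)
print(pose_kernel(t, R, t, R))                            # 1.0
```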
Aesthetic Attributes Assessment of Images
Title | Aesthetic Attributes Assessment of Images |
Authors | Xin Jin, Le Wu, Geng Zhao, Xiaodong Li, Xiaokun Zhang, Shiming Ge, Dongqing Zou, Bin Zhou, Xinghui Zhou |
Abstract | Image aesthetic quality assessment has been a relatively hot topic during the last decade. Most recently, comment-type assessment (aesthetic captions) has been proposed to describe the general aesthetic impression of an image using text. In this paper, we propose Aesthetic Attributes Assessment of Images, i.e., aesthetic attribute captioning. This is a new formulation of image aesthetic assessment, which predicts aesthetic attribute captions together with the aesthetic score of each attribute. We introduce a new dataset named DPC-Captions, which contains comments on up to 5 aesthetic attributes of one image, built through knowledge transfer from a fully-annotated small-scale dataset. Then, we propose the Aesthetic Multi-Attribute Network (AMAN), which is trained on a mixture of the fully-annotated small-scale PCCD dataset and the weakly-annotated large-scale DPC-Captions dataset. Our AMAN makes full use of transfer learning and an attention model in a single framework. The experimental results on our DPC-Captions and PCCD datasets reveal that our method can predict captions for 5 aesthetic attributes together with a numerical score assessment of each attribute. We use the evaluation criteria of image captioning to show that our specially designed AMAN model outperforms the traditional CNN-LSTM model and the modern SCA-CNN model for image captioning. |
Tasks | Image Captioning, Transfer Learning |
Published | 2019-07-11 |
URL | https://arxiv.org/abs/1907.04983v2 |
https://arxiv.org/pdf/1907.04983v2.pdf | |
PWC | https://paperswithcode.com/paper/aesthetic-attributes-assessment-of-images |
Repo | https://github.com/BestiVictory/DPC-Captions |
Framework | none |
Image Captioning: Transforming Objects into Words
Title | Image Captioning: Transforming Objects into Words |
Authors | Simao Herdade, Armin Kappeler, Kofi Boakye, Joao Soares |
Abstract | Image captioning models typically follow an encoder-decoder architecture which uses abstract image feature vectors as input to the encoder. One of the most successful algorithms uses feature vectors extracted from the region proposals obtained from an object detector. In this work we introduce the Object Relation Transformer, which builds upon this approach by explicitly incorporating information about the spatial relationships between detected input objects through geometric attention. Quantitative and qualitative results demonstrate the importance of such geometric attention for image captioning, leading to improvements on all common captioning metrics on the MS-COCO dataset. |
Tasks | Image Captioning |
Published | 2019-06-14 |
URL | https://arxiv.org/abs/1906.05963v2 |
https://arxiv.org/pdf/1906.05963v2.pdf | |
PWC | https://paperswithcode.com/paper/image-captioning-transforming-objects-into |
Repo | https://github.com/yahoo/object_relation_transformer |
Framework | pytorch |
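The geometric attention described above biases standard scaled dot-product attention with pairwise box geometry. Below is a simplified PyTorch sketch in the spirit of relation-network-style attention; the feature construction and the way the geometric weight enters the logits follow the general recipe rather than the paper's exact implementation.

```python
import torch

def box_relation_features(boxes, eps=1e-3):
    """Pairwise relative geometry for boxes given as (x, y, w, h).
    Returns an (N, N, 4) tensor of log-scaled offsets and size ratios."""
    x, y, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(x.unsqueeze(1) - x.unsqueeze(0)) / w.unsqueeze(1) + eps)
    dy = torch.log(torch.abs(y.unsqueeze(1) - y.unsqueeze(0)) / h.unsqueeze(1) + eps)
    dw = torch.log(w.unsqueeze(1) / w.unsqueeze(0))
    dh = torch.log(h.unsqueeze(1) / h.unsqueeze(0))
    return torch.stack([dx, dy, dw, dh], dim=-1)

def geometric_attention(q, k, v, geo_weights):
    """Scaled dot-product attention whose logits are biased by the log of a
    positive pairwise geometric weight (e.g. produced by a small learned network
    over box_relation_features). q, k, v: (N, D); geo_weights: (N, N)."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    logits = logits + torch.log(geo_weights.clamp(min=1e-6))
    return torch.softmax(logits, dim=-1) @ v

# Toy usage: 5 detected boxes and random appearance features.
boxes, feats = torch.rand(5, 4) + 0.1, torch.randn(5, 64)
geo = torch.relu(box_relation_features(boxes)).mean(dim=-1)   # stand-in geometric weights
out = geometric_attention(feats, feats, feats, geo)
```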
Towards Interpretable Reinforcement Learning Using Attention Augmented Agents
Title | Towards Interpretable Reinforcement Learning Using Attention Augmented Agents |
Authors | Alex Mott, Daniel Zoran, Mike Chrzanowski, Daan Wierstra, Danilo J. Rezende |
Abstract | Inspired by recent work in attention models for image captioning and question answering, we present a soft attention model for the reinforcement learning domain. This model uses a soft, top-down attention mechanism to create a bottleneck in the agent, forcing it to focus on task-relevant information by sequentially querying its view of the environment. The output of the attention mechanism allows direct observation of the information used by the agent to select its actions, enabling easier interpretation of this model than of traditional models. We analyze different strategies that the agents learn and show that a handful of strategies arise repeatedly across different games. We also show that the model learns to query separately about space and content ('where' vs. 'what'). We demonstrate that an agent using this mechanism can achieve performance competitive with state-of-the-art models on ATARI tasks while still being interpretable. |
Tasks | Image Captioning, Question Answering |
Published | 2019-06-06 |
URL | https://arxiv.org/abs/1906.02500v1 |
https://arxiv.org/pdf/1906.02500v1.pdf | |
PWC | https://paperswithcode.com/paper/towards-interpretable-reinforcement-learning |
Repo | https://github.com/aluscher/torchbeastpopart |
Framework | pytorch |
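The interpretability claim above rests on a soft, top-down spatial attention read-out whose weights can be visualised directly. A minimal PyTorch sketch of such a query mechanism follows; the layer names and sizes are illustrative, not the paper's agent.

```python
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    """A top-down query attends over spatial features; the attention map itself
    can be visualised to see what the agent is looking at."""
    def __init__(self, feat_dim, query_dim):
        super().__init__()
        self.key = nn.Conv2d(feat_dim, query_dim, kernel_size=1)
        self.value = nn.Conv2d(feat_dim, feat_dim, kernel_size=1)

    def forward(self, features, query):        # features: (B, C, H, W), query: (B, Dq)
        b, c, h, w = features.shape
        keys = self.key(features).flatten(2)             # (B, Dq, H*W)
        values = self.value(features).flatten(2)         # (B, C, H*W)
        logits = torch.einsum('bd,bdn->bn', query, keys) / keys.size(1) ** 0.5
        attn = torch.softmax(logits, dim=-1)             # (B, H*W)
        read = torch.einsum('bn,bcn->bc', attn, values)  # attended read-out
        return read, attn.reshape(b, h, w)

# Toy usage: a 7x7 feature map and a 32-d top-down query.
attn_layer = SoftSpatialAttention(feat_dim=64, query_dim=32)
read, attn_map = attn_layer(torch.randn(2, 64, 7, 7), torch.randn(2, 32))
```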
SwiftNet: Using Graph Propagation as Meta-knowledge to Search Highly Representative Neural Architectures
Title | SwiftNet: Using Graph Propagation as Meta-knowledge to Search Highly Representative Neural Architectures |
Authors | Hsin-Pai Cheng, Tunhou Zhang, Yukun Yang, Feng Yan, Shiyu Li, Harris Teague, Hai Li, Yiran Chen |
Abstract | Designing neural architectures for edge devices is subject to constraints of accuracy, inference latency, and computational cost. Traditionally, researchers manually craft deep neural networks to meet the needs of mobile devices. Neural Architecture Search (NAS) was proposed to automate neural architecture design without requiring extensive domain expertise and significant manual effort. Recent works utilized NAS to design mobile models by taking hardware constraints into account and achieved state-of-the-art accuracy with fewer parameters and less computational cost measured in multiply-accumulates (MACs). To find highly compact neural architectures, existing works rely on predefined cells and directly apply a width multiplier, which may limit model flexibility, reduce useful feature map information, and cause accuracy drops. To address this issue, we propose GRAM (GRAph propagation as Meta-knowledge), which adopts a fine-grained (node-wise) search method and accumulates the knowledge learned in updates into a meta-graph. As a result, GRAM enables a more flexible search space and achieves higher search efficiency. Without the constraints of predefined cells or blocks, we propose a new structure-level pruning method to remove redundant operations in neural architectures. SwiftNet, a set of models discovered by GRAM, outperforms MobileNet-V2 with 2.15x higher accuracy density and 2.42x faster inference at similar accuracy. Compared with FBNet, SwiftNet reduces the search cost by 26x and achieves 2.35x higher accuracy density and a 1.47x speedup while preserving similar accuracy. SwiftNet can obtain 63.28% top-1 accuracy on ImageNet-1K with only 53M MACs and 2.07M parameters. The corresponding inference latency is only 19.09 ms on Google Pixel 1. |
Tasks | Neural Architecture Search |
Published | 2019-06-19 |
URL | https://arxiv.org/abs/1906.08305v2 |
https://arxiv.org/pdf/1906.08305v2.pdf | |
PWC | https://paperswithcode.com/paper/swiftnet-using-graph-propagation-as-meta |
Repo | https://github.com/newwhitecheng/swiftnet |
Framework | none |
Functional Variational Bayesian Neural Networks
Title | Functional Variational Bayesian Neural Networks |
Authors | Shengyang Sun, Guodong Zhang, Jiaxin Shi, Roger Grosse |
Abstract | Variational Bayesian neural networks (BNNs) perform variational inference over weights, but it is difficult to specify meaningful priors and approximate posteriors in a high-dimensional weight space. We introduce functional variational Bayesian neural networks (fBNNs), which maximize an Evidence Lower BOund (ELBO) defined directly on stochastic processes, i.e. distributions over functions. We prove that the KL divergence between stochastic processes equals the supremum of marginal KL divergences over all finite sets of inputs. Based on this, we introduce a practical training objective which approximates the functional ELBO using finite measurement sets and the spectral Stein gradient estimator. With fBNNs, we can specify priors entailing rich structures, including Gaussian processes and implicit stochastic processes. Empirically, we find fBNNs extrapolate well using various structured priors, provide reliable uncertainty estimates, and scale to large datasets. |
Tasks | Bayesian Inference, Gaussian Processes |
Published | 2019-03-14 |
URL | http://arxiv.org/abs/1903.05779v1 |
http://arxiv.org/pdf/1903.05779v1.pdf | |
PWC | https://paperswithcode.com/paper/functional-variational-bayesian-neural-1 |
Repo | https://github.com/ssydasheng/FBNN |
Framework | tf |
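The central identity above, that the KL divergence between stochastic processes is the supremum of marginal KL divergences over all finite input sets, is what makes the functional ELBO tractable: only finite-dimensional marginals ever need to be compared. As a toy illustration in PyTorch, two Gaussian-process marginals evaluated at the same measurement set have a closed-form finite-set KL; the paper's BNN posterior is not Gaussian, which is why it needs the spectral Stein gradient estimator instead.

```python
import torch

def finite_set_kl(mean_p, cov_p, mean_q, cov_q):
    """KL( N(mean_p, cov_p) || N(mean_q, cov_q) ) between the marginals of two
    Gaussian processes evaluated at the same finite measurement set."""
    p = torch.distributions.MultivariateNormal(mean_p, covariance_matrix=cov_p)
    q = torch.distributions.MultivariateNormal(mean_q, covariance_matrix=cov_q)
    return torch.distributions.kl_divergence(p, q)

# Toy measurement set of 3 points: two GP marginals with different covariances.
mean = torch.zeros(3)
cov_p = torch.eye(3)
cov_q = 2.0 * torch.eye(3)
print(finite_set_kl(mean, cov_p, mean, cov_q))
```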