February 2, 2020

# Paper Group AWR 25

DRIT++: Diverse Image-to-Image Translation via Disentangled Representations. Approximating two value functions instead of one: towards characterizing a new family of Deep Reinforcement Learning algorithms. Bidirectional Learning for Domain Adaptation of Semantic Segmentation. From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings M …

#### DRIT++: Diverse Image-to-Image Translation via Disentangled Representations

Title DRIT++: Diverse Image-to-Image Translation via Disentangled Representations
Authors Hsin-Ying Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang, Yu-Ding Lu, Maneesh Singh, Ming-Hsuan Yang
Abstract Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for this task: 1) lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for generating diverse outputs without paired training images. To synthesize diverse outputs, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and attribute vectors sampled from the attribute space to synthesize diverse outputs at test time. To handle unpaired training data, we introduce a cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative evaluations, we measure realism with user study and Fréchet inception distance, and measure diversity with the perceptual distance metric, Jensen-Shannon divergence, and number of statistically-different bins.
Published 2019-05-02
URL https://arxiv.org/abs/1905.01270v2
PDF https://arxiv.org/pdf/1905.01270v2.pdf
PWC https://paperswithcode.com/paper/drit-diverse-image-to-image-translation-via
Repo https://github.com/taki0112/DRIT-Tensorflow
Framework tf

#### Approximating two value functions instead of one: towards characterizing a new family of Deep Reinforcement Learning algorithms

Title Approximating two value functions instead of one: towards characterizing a new family of Deep Reinforcement Learning algorithms
Authors Matthia Sabatelli, Gilles Louppe, Pierre Geurts, Marco A. Wiering
Abstract This paper makes one step forward towards characterizing a new family of model-free Deep Reinforcement Learning (DRL) algorithms. The aim of these algorithms is to jointly learn an approximation of the state-value function ($V$), alongside an approximation of the state-action value function ($Q$). Our analysis starts with a thorough study of the Deep Quality-Value Learning (DQV) algorithm, a DRL algorithm which has been shown to outperform popular techniques such as Deep-Q-Learning (DQN) and Double-Deep-Q-Learning (DDQN) (Sabatelli et al., 2018). Intending to investigate why DQV’s learning dynamics allow this algorithm to perform so well, we formulate a set of research questions which help us characterize a new family of DRL algorithms. Among our results, we present some specific cases in which DQV’s performance can be harmed and introduce a novel off-policy DRL algorithm, called DQV-Max, which can outperform DQV. We then study the behavior of the $V$ and $Q$ functions that are learned by DQV and DQV-Max and show that both algorithms might perform so well on several DRL test-beds because they are less prone to suffer from the overestimation bias of the $Q$ function.
Published 2019-09-01
URL https://arxiv.org/abs/1909.01779v2
PDF https://arxiv.org/pdf/1909.01779v2.pdf
Repo https://github.com/paintception/Deep-Quality-Value-DQV-Learning-
Framework tf
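
The joint learning of $V$ and $Q$ described in the abstract above can be illustrated with a tabular sketch. The abstract does not spell out the update rules, so the targets below are an assumption based on a reading of the DQV papers: in DQV both estimates bootstrap from $V(s')$, while in DQV-Max the $V$ estimate bootstraps from $\max_a Q(s', a)$. The function names, the tabular setting, and the hyperparameter defaults are hypothetical; the published algorithms use neural networks rather than tables.

```python
from collections import defaultdict


def dqv_update(V, Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular DQV-style step (assumed form): V and Q share the target r + gamma * V(s')."""
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def dqv_max_update(V, Q, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.99):
    """One tabular DQV-Max-style step (assumed form): V bootstraps from max_a Q(s', a)."""
    if done:
        v_target = q_target = r
    else:
        v_target = r + gamma * max(Q[(s_next, b)] for b in actions)
        q_target = r + gamma * V[s_next]
    V[s] += alpha * (v_target - V[s])
    Q[(s, a)] += alpha * (q_target - Q[(s, a)])


# Toy transition: state 0, action 1, reward 1.0, next state 1.
V = defaultdict(float)
Q = defaultdict(float)
dqv_update(V, Q, s=0, a=1, r=1.0, s_next=1, done=False)
```

Because the $Q$ update never takes a maximum over its own estimates in the DQV-style step, it is at least plausible that such a scheme is less exposed to overestimation, which is the hypothesis the abstract discusses.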

#### Bidirectional Learning for Domain Adaptation of Semantic Segmentation

Title Bidirectional Learning for Domain Adaptation of Semantic Segmentation
Authors Yunsheng Li, Lu Yuan, Nuno Vasconcelos
Abstract Domain adaptation for semantic image segmentation is necessary because manually labeling large datasets with pixel-level labels is expensive and time-consuming. Existing domain adaptation techniques either work only on limited datasets or perform poorly compared with supervised learning. In this paper, we propose a novel bidirectional learning framework for domain adaptation of segmentation. With bidirectional learning, the image translation model and the segmentation adaptation model are learned alternately and promote each other. Furthermore, we propose a self-supervised learning algorithm to learn a better segmentation adaptation model and in return improve the image translation model. Experiments show that our method outperforms state-of-the-art methods in domain adaptation of segmentation by a large margin. The source code is available at https://github.com/liyunsheng13/BDL.
Published 2019-04-24
URL http://arxiv.org/abs/1904.10620v1
PDF http://arxiv.org/pdf/1904.10620v1.pdf
Repo https://github.com/liyunsheng13/BDL
Framework pytorch

#### From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

Title From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction
Authors Mika Hämäläinen, Simon Hengchen
Abstract A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
Tasks Machine Translation, Optical Character Recognition, Word Embeddings
Published 2019-10-12
URL https://arxiv.org/abs/1910.05535v1
PDF https://arxiv.org/pdf/1910.05535v1.pdf
PWC https://paperswithcode.com/paper/from-the-paft-to-the-fiiture-a-fully-1
Repo https://github.com/mikahama/natas
Framework none

#### Learning to Explain: Answering Why-Questions via Rephrasing

Title Learning to Explain: Answering Why-Questions via Rephrasing
Authors Allen Nie, Erin D. Bennett, Noah D. Goodman
Abstract Providing plausible responses to why questions is a challenging but critical goal for language based human-machine interaction. Explanations are challenging in that they require many different forms of abstract knowledge and reasoning. Previous work has either relied on human-curated structured knowledge bases or detailed domain representation to generate satisfactory explanations. They are also often limited to ranking pre-existing explanation choices. In our work, we contribute to the under-explored area of generating natural language explanations for general phenomena. We automatically collect large datasets of explanation-phenomenon pairs which allow us to train sequence-to-sequence models to generate natural language explanations. We compare different training strategies and evaluate their performance using both automatic scores and human ratings. We demonstrate that our strategy is sufficient to generate highly plausible explanations for general open-domain phenomena compared to other models trained on different datasets.
Published 2019-06-04
URL https://arxiv.org/abs/1906.01243v1
PDF https://arxiv.org/pdf/1906.01243v1.pdf
Repo https://github.com/windweller/L2EWeb
Framework pytorch

#### FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network

Title FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network
Authors Aditya Kusupati, Manish Singh, Kush Bhatia, Ashish Kumar, Prateek Jain, Manik Varma
Abstract This paper develops the FastRNN and FastGRNN algorithms to address the twin RNN limitations of inaccurate training and inefficient prediction. Previous approaches have improved accuracy at the expense of prediction costs making them infeasible for resource-constrained and real-time applications. Unitary RNNs have increased accuracy somewhat by restricting the range of the state transition matrix’s singular values but have also increased the model size as they require a larger number of hidden units to make up for the loss in expressive power. Gated RNNs have obtained state-of-the-art accuracies by adding extra parameters thereby resulting in even larger models. FastRNN addresses these limitations by adding a residual connection that does not constrain the range of the singular values explicitly and has only two extra scalar parameters. FastGRNN then extends the residual connection to a gate by reusing the RNN matrices to match state-of-the-art gated RNN accuracies but with a 2-4x smaller model. Enforcing FastGRNN’s matrices to be low-rank, sparse and quantized resulted in accurate models that could be up to 35x smaller than leading gated and unitary RNNs. This allowed FastGRNN to accurately recognize the “Hey Cortana” wakeword with a 1 KB model and to be deployed on severely resource-constrained IoT microcontrollers too tiny to store other RNN models. FastGRNN’s code is available at https://github.com/Microsoft/EdgeML/.
Tasks Action Classification, Language Modelling, Speech Recognition, Time Series, Time Series Classification
Published 2019-01-08
URL http://arxiv.org/abs/1901.02358v1
PDF http://arxiv.org/pdf/1901.02358v1.pdf
PWC https://paperswithcode.com/paper/fastgrnn-a-fast-accurate-stable-and-tiny
Repo https://github.com/Microsoft/EdgeML
Framework tf
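
The two cells described in the FastGRNN abstract above can be sketched in plain Python. The abstract only says that FastRNN adds a residual connection with two extra scalars and that FastGRNN turns it into a gate that reuses the RNN matrices; the exact update equations below are an assumption recalled from the paper, and all function and parameter names are hypothetical. The low-rank, sparse, and quantized constraints are not shown.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fastrnn_step(W, U, b, alpha, beta, x, h_prev):
    """Assumed FastRNN update: h_t = alpha * tanh(W x + U h_prev + b) + beta * h_prev."""
    pre = [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h_prev), b)]
    return [alpha * math.tanh(p) + beta * hp for p, hp in zip(pre, h_prev)]


def fastgrnn_step(W, U, b_z, b_h, zeta, nu, x, h_prev):
    """Assumed FastGRNN update: gate and candidate share W and U; only the biases differ."""
    pre = [wx + uh for wx, uh in zip(matvec(W, x), matvec(U, h_prev))]
    z = [sigmoid(p + bz) for p, bz in zip(pre, b_z)]
    h_cand = [math.tanh(p + bh) for p, bh in zip(pre, b_h)]
    return [(zeta * (1.0 - zi) + nu) * hc + zi * hp
            for zi, hc, hp in zip(z, h_cand, h_prev)]


# With zero weights the candidate state is 0, so only the residual path contributes.
h_fast = fastrnn_step([[0.0]], [[0.0]], [0.0], alpha=0.1, beta=0.9, x=[1.0], h_prev=[2.0])
h_gated = fastgrnn_step([[0.0]], [[0.0]], [0.0], [0.0], zeta=1.0, nu=0.0, x=[1.0], h_prev=[2.0])
```

Sharing `W` and `U` between the gate and the candidate state is what keeps the parameter count close to that of a plain RNN in this sketch; a conventional gated cell would carry a separate matrix pair per gate.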

#### Parallel Gaussian process surrogate Bayesian inference with noisy likelihood evaluations

Title Parallel Gaussian process surrogate Bayesian inference with noisy likelihood evaluations
Authors Marko Järvenpää, Michael Gutmann, Aki Vehtari, Pekka Marttinen
Abstract We consider Bayesian inference when only a limited number of noisy log-likelihood evaluations can be obtained. This occurs for example when complex simulator-based statistical models are fitted to data, and the synthetic likelihood (SL) method is used to form the noisy log-likelihood estimates using computationally costly forward simulations. We frame the inference task as a sequential Bayesian experimental design problem, where the log-likelihood function is modelled with a hierarchical Gaussian process (GP) surrogate model, which is used to efficiently select additional log-likelihood evaluation locations. Motivated by recent progress in the related problem of batch Bayesian optimisation, we develop various batch-sequential design strategies which allow running some of the potentially costly simulations in parallel. We analyse the properties of the resulting method theoretically and empirically. Experiments with several toy problems and simulation models suggest that our method is robust, highly parallelisable, and sample-efficient.
Published 2019-05-03
URL https://arxiv.org/abs/1905.01252v4
PDF https://arxiv.org/pdf/1905.01252v4.pdf
PWC https://paperswithcode.com/paper/parallel-gaussian-process-surrogate-method-to
Repo https://github.com/mjarvenpaa/parallel-GP-SL
Framework none

#### DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Title DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Authors Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf
Abstract As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
Tasks Language Modelling, Linguistic Acceptability, Natural Language Inference, Question Answering, Semantic Textual Similarity, Sentiment Analysis, Transfer Learning
Published 2019-10-02
URL https://arxiv.org/abs/1910.01108v4
PDF https://arxiv.org/pdf/1910.01108v4.pdf
PWC https://paperswithcode.com/paper/distilbert-a-distilled-version-of-bert
Repo https://github.com/mkavim/finetune_bert
Framework tf
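
The triple loss mentioned in the DistilBERT abstract above (language modeling, distillation, and cosine-distance terms) can be sketched for a single token position. This is a minimal illustration, not the released training code: the temperature, the equal weights, and every function name are assumptions, and real training averages these terms over batches of masked tokens.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))


def mlm_loss(student_logits, target_index):
    """Standard negative log-likelihood of the true (masked) token."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_loss(u, v):
    """1 - cos(u, v); zero when student and teacher hidden vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triple_loss(student_logits, teacher_logits, target_index,
                h_student, h_teacher, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights here are placeholders."""
    w_mlm, w_kd, w_cos = weights
    return (w_mlm * mlm_loss(student_logits, target_index)
            + w_kd * distillation_loss(student_logits, teacher_logits)
            + w_cos * cosine_loss(h_student, h_teacher))


loss = triple_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8], 0, [1.0, 0.0], [0.9, 0.1])
```

The temperature softens both distributions so the student also learns from the teacher's relative preferences among wrong tokens, not only from its top prediction.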

#### Virtual Conditional Generative Adversarial Networks

Title Virtual Conditional Generative Adversarial Networks
Authors Haifeng Shi, Guanyu Cai, Yuqin Wang, Shaohua Shang, Lianghua He
Abstract When trained on multimodal image datasets, normal Generative Adversarial Networks (GANs) are usually outperformed by class-conditional GANs and ensemble GANs, but conditional GANs are restricted to labeled datasets and ensemble GANs lack efficiency. We propose a novel GAN variant called virtual conditional GAN (vcGAN) which is not only an ensemble GAN with multiple generative paths while adding almost zero network parameters, but also a conditional GAN that can be trained on unlabeled datasets without explicit clustering steps or objectives other than the adversary loss. Inside the vcGAN’s generator, a learnable “analog-to-digital converter (ADC)” module maps a slice of the input multivariate Gaussian noise to discrete/digital noise (virtual label), according to which a selector selects the corresponding generative path to produce the sample. All the generative paths share the same decoder network while in each path the decoder network is fed with a concatenation of a different pre-computed amplified one-hot vector and the input Gaussian noise. We conducted extensive experiments on several balanced/imbalanced image datasets to demonstrate that vcGAN converges faster and achieves improved Fréchet Inception Distance (FID). In addition, we show the training byproduct that the ADC in vcGAN learned the categorical probability of each mode and that each generative path generates samples of a specific mode, which enables class-conditional sampling. Code is available at https://github.com/annonnymmouss/vcgan
Tasks Conditional Image Generation, Image Generation
Published 2019-01-25
URL http://arxiv.org/abs/1901.09822v1
PDF http://arxiv.org/pdf/1901.09822v1.pdf
Repo https://github.com/annonnymmouss/vcgan
Framework tf

#### SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

Title SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery
Authors Shion Honda, Shoi Shi, Hiroki R. Ueda
Abstract In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly for shallow prediction models or small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of the sequence-to-sequence language model using a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithms in small-data settings where pre-training facilitated good generalization. Moreover, we define a novel metric to concurrently measure model accuracy and data efficiency.
Published 2019-11-12
URL https://arxiv.org/abs/1911.04738v1
PDF https://arxiv.org/pdf/1911.04738v1.pdf
PWC https://paperswithcode.com/paper/smiles-transformer-pre-trained-molecular
Repo https://github.com/DSPsleeporg/smiles-transformer
Framework pytorch

#### Fair Regression for Health Care Spending

Title Fair Regression for Health Care Spending
Authors Anna Zink, Sherri Rose
Abstract The distribution of health care payments to insurance plans has substantial consequences for social policy. Risk adjustment formulas predict spending in health insurance markets in order to provide fair benefits and health care coverage for all enrollees, regardless of their health status. Unfortunately, current risk adjustment formulas are known to underpredict spending for specific groups of enrollees leading to undercompensated payments to health insurers. This incentivizes insurers to design their plans such that individuals in undercompensated groups will be less likely to enroll, impacting access to health care for these groups. To improve risk adjustment formulas for undercompensated groups, we expand on concepts from the statistics, computer science, and health economics literature to develop new fair regression methods for continuous outcomes by building fairness considerations directly into the objective function. We additionally propose a novel measure of fairness while asserting that a suite of metrics is necessary in order to evaluate risk adjustment formulas more fully. Our data application using the IBM MarketScan Research Databases and simulation studies demonstrate that these new fair regression methods may lead to massive improvements in group fairness (e.g., 98%) with only small reductions in overall fit (e.g., 4%).
Published 2019-01-28
URL https://arxiv.org/abs/1901.10566v2
PDF https://arxiv.org/pdf/1901.10566v2.pdf
PWC https://paperswithcode.com/paper/fair-regression-for-health-care-spending
Repo https://github.com/zinka88/Fair-Regression
Framework none

#### Using Embeddings to Correct for Unobserved Confounding in Networks

Title Using Embeddings to Correct for Unobserved Confounding in Networks
Authors Victor Veitch, Yixin Wang, David M. Blei
Abstract We consider causal inference in the presence of unobserved confounding. We study the case where a proxy is available for the unobserved confounding in the form of a network connecting the units. For example, the link structure of a social network carries information about its members. We show how to effectively use the proxy to do causal inference. The main idea is to reduce the causal estimation problem to a semi-supervised prediction of both the treatments and outcomes. Networks admit high-quality embedding models that can be used for this semi-supervised prediction. We show that the method yields valid inferences under suitable (weak) conditions on the quality of the predictive model. We validate the method with experiments on a semi-synthetic social network dataset. Code is available at github.com/vveitch/causal-network-embeddings.
Published 2019-02-11
URL https://arxiv.org/abs/1902.04114v2
PDF https://arxiv.org/pdf/1902.04114v2.pdf
PWC https://paperswithcode.com/paper/using-embeddings-to-correct-for-unobserved
Repo https://github.com/vveitch/causal-embeddings
Framework tf

#### Index Network

Title Index Network
Authors Hao Lu, Yutong Dai, Chunhua Shen, Songcen Xu
Abstract We show that existing upsampling operators can be unified using the notion of the index function. This notion is inspired by an observation in the decoding process of deep image matting where indices-guided unpooling can often recover boundary details considerably better than other upsampling operators such as bilinear interpolation. By viewing the indices as a function of the feature map, we introduce the concept of “learning to index”, and present a novel index-guided encoder-decoder framework where indices are self-learned adaptively from data and are used to guide the downsampling and upsampling stages, without extra training supervision. At the core of this framework is a new learnable module, termed Index Network (IndexNet), which dynamically generates indices conditioned on the feature map itself. IndexNet can be used as a plug-in applying to almost all off-the-shelf convolutional networks that have coupled downsampling and upsampling stages, giving the networks the ability to dynamically capture variations of local patterns. In particular, we instantiate and investigate five families of IndexNet and demonstrate their effectiveness on four dense prediction tasks, including image denoising, image matting, semantic segmentation, and monocular depth estimation. Code and models have been made available at: https://tinyurl.com/IndexNetV1
Tasks Denoising, Depth Estimation, Image Denoising, Image Matting, Monocular Depth Estimation, Semantic Segmentation
Published 2019-08-11
URL https://arxiv.org/abs/1908.09895v1
PDF https://arxiv.org/pdf/1908.09895v1.pdf
PWC https://paperswithcode.com/paper/index-network
Repo https://github.com/poppinace/indexnet_matting
Framework pytorch
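
The indices-guided unpooling that the Index Network abstract above takes as its starting point can be shown on a 1-D signal. The sketch uses fixed max-pooling indices, which is the classic operator the abstract compares against bilinear interpolation; IndexNet itself learns the indices from the feature map, which is not shown here. The function names are hypothetical.

```python
def max_pool_with_indices(x, k=2):
    """Non-overlapping max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for start in range(0, len(x) - k + 1, k):
        window = x[start:start + k]
        offset = max(range(k), key=lambda i: window[i])
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def unpool(pooled, indices, length):
    """Place each pooled value back at its recorded position; other positions stay 0."""
    out = [0.0] * length
    for value, idx in zip(pooled, indices):
        out[idx] = value
    return out


x = [1.0, 3.0, 2.0, 0.5]
pooled, idx = max_pool_with_indices(x)   # pooled == [3.0, 2.0], idx == [1, 2]
restored = unpool(pooled, idx, len(x))   # restored == [0.0, 3.0, 2.0, 0.0]
```

The restored signal keeps each peak at its original location, which gives some intuition for why index-guided upsampling can preserve boundary detail that interpolation would blur.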

#### End-to-End Neural Speaker Diarization with Self-attention

Title End-to-End Neural Speaker Diarization with Self-attention
Authors Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe
Abstract Speaker diarization has been mainly developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems: (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, the End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. In contrast to BLSTM, which is conditioned only on its previous and next hidden states, self-attention is directly conditioned on all the other frames, making it much more suitable for dealing with the speaker diarization problem. We evaluated our proposed method on simulated mixtures, real telephone calls, and real dialogue recordings. The experimental results revealed that the self-attention was the key to achieving good performance and that our proposed method performed significantly better than the conventional BLSTM-based method. Our method was even better than the state-of-the-art x-vector clustering-based method. Finally, by visualizing the latent representation, we show that the self-attention can capture global speaker characteristics in addition to local speech activity dynamics. Our source code is available online at https://github.com/hitachi-speech/EEND.
Published 2019-09-13
URL https://arxiv.org/abs/1909.06247v1
PDF https://arxiv.org/pdf/1909.06247v1.pdf
PWC https://paperswithcode.com/paper/end-to-end-neural-speaker-diarization-with
Repo https://github.com/hitachi-speech/EEND
Framework none

#### End-to-End Neural Speaker Diarization with Permutation-Free Objectives

Title End-to-End Neural Speaker Diarization with Permutation-Free Objectives
Authors Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe
Abstract In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem, and introduce a permutation-free objective function to directly minimize diarization errors without suffering from the speaker-label permutation problem. Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference. Because of this benefit, our model can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels. We evaluated the proposed method on simulated speech mixtures. The proposed method achieved a diarization error rate of 12.28%, while a conventional clustering-based system produced a diarization error rate of 28.77%. Furthermore, the domain adaptation with real-recorded speech provided 25.6% relative improvement on the CALLHOME dataset. Our source code is available online at https://github.com/hitachi-speech/EEND.
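
The permutation-free objective described in the abstract above can be sketched as a binary cross-entropy minimized over all orderings of the reference speakers. This is a minimal sketch, assuming per-frame speaker-activity probabilities as model output; the function names, the clipping constant, and the normalization by frames times speakers are illustrative choices rather than details taken from the paper.

```python
import math
from itertools import permutations


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def permutation_free_loss(pred, label):
    """pred, label: lists of T frames, each a list of S per-speaker values.

    Tries every assignment of reference speakers to output channels and
    keeps the lowest total loss, so the label order does not matter.
    """
    n_speakers = len(label[0])
    best = float("inf")
    for perm in permutations(range(n_speakers)):
        total = sum(bce(p_t[s], y_t[perm[s]])
                    for p_t, y_t in zip(pred, label)
                    for s in range(n_speakers))
        best = min(best, total)
    return best / (len(label) * n_speakers)


pred = [[0.9, 0.1], [0.8, 0.2]]          # output channel 0 is active in both frames
aligned = [[1, 0], [1, 0]]               # reference speakers in the same order
swapped = [[0, 1], [0, 1]]               # same reference, speakers listed in reverse
loss_aligned = permutation_free_loss(pred, aligned)
loss_swapped = permutation_free_loss(pred, swapped)
```

Both calls return the same value, which is the point of the objective. The exhaustive search costs S! evaluations, which stays cheap for the small speaker counts typical of such experiments. Overlapping speech needs no special handling because each frame may carry several active labels at once.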