April 2, 2020

# Paper Group ANR 146

A Neural-embedded Choice Model: TasteNet-MNL Modeling Taste Heterogeneity with Flexibility and Interpretability. Exploiting Temporal Coherence for Multi-modal Video Categorization. Objective Mismatch in Model-based Reinforcement Learning. Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization. On the Complexity of Minim …

#### A Neural-embedded Choice Model: TasteNet-MNL Modeling Taste Heterogeneity with Flexibility and Interpretability

Title A Neural-embedded Choice Model: TasteNet-MNL Modeling Taste Heterogeneity with Flexibility and Interpretability
Authors Yafei Han, Christopher Zegras, Francisco Camara Pereira, Moshe Ben-Akiva
Abstract Discrete choice models (DCMs) and neural networks (NNs) can complement each other. We propose a neural network embedded choice model - TasteNet-MNL, to improve the flexibility in modeling taste heterogeneity while keeping model interpretability. The hybrid model consists of a TasteNet module: a feed-forward neural network that learns taste parameters as flexible functions of individual characteristics; and a choice module: a multinomial logit model (MNL) with manually specified utility. TasteNet and MNL are fully integrated and jointly estimated. By embedding a neural network into a DCM, we exploit a neural network’s function approximation capacity to reduce specification bias. Through special structure and parameter constraints, we incorporate expert knowledge to regularize the neural network and maintain interpretability. On synthetic data, we show that TasteNet-MNL can recover the underlying non-linear utility function, and provide predictions and interpretations as accurate as the true model; while examples of logit or random coefficient logit models with misspecified utility functions result in large parameter bias and low predictability. In the case study of Swissmetro mode choice, TasteNet-MNL outperforms benchmarking MNLs’ predictability; and discovers a wider spectrum of taste variations within the population, and higher values of time on average. This study takes an initial step towards developing a framework to combine theory-based and data-driven approaches for discrete choice modeling.
Published 2020-02-03
URL https://arxiv.org/abs/2002.00922v1
PDF https://arxiv.org/pdf/2002.00922v1.pdf
PWC https://paperswithcode.com/paper/a-neural-embedded-choice-model-tastenet-mnl
Repo
Framework
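
The two-module structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the one-hidden-layer network, the layer sizes, the softplus-based sign constraint, and the names `taste_net` and `choice_probabilities` are all assumptions made for the example.

```python
import numpy as np

def taste_net(z, W1, b1, W2, b2):
    """Hypothetical network mapping individual characteristics z to taste parameters."""
    h = np.maximum(0.0, z @ W1 + b1)      # ReLU hidden layer
    raw = h @ W2 + b2
    # Assumed parameter constraint: tastes (e.g. for time and cost) are non-positive.
    return -np.logaddexp(0.0, raw)        # negative softplus

def choice_probabilities(z, x, params):
    """z: (n, d_z) characteristics; x: (n, J, K) attributes of J alternatives."""
    beta = taste_net(z, *params)                   # (n, K) individual-specific tastes
    utility = np.einsum("nk,njk->nj", beta, x)     # manually specified linear utility
    utility -= utility.max(axis=1, keepdims=True)  # numerical stability
    expu = np.exp(utility)
    return expu / expu.sum(axis=1, keepdims=True)  # multinomial logit probabilities

rng = np.random.default_rng(0)
n, d_z, hidden, J, K = 4, 3, 8, 3, 2
params = (rng.normal(size=(d_z, hidden)), np.zeros(hidden),
          rng.normal(size=(hidden, K)), np.zeros(K))
z = rng.normal(size=(n, d_z))
x = rng.uniform(size=(n, J, K))
probs = choice_probabilities(z, x, params)
```

Because the network output feeds directly into the utility, both modules could be estimated jointly by maximizing the log-likelihood of observed choices under `probs`.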

#### Exploiting Temporal Coherence for Multi-modal Video Categorization

Title Exploiting Temporal Coherence for Multi-modal Video Categorization
Authors Palash Goyal, Saurabh Sahu, Shalini Ghosh, Chul Lee
Abstract Multimodal ML models can process data in multiple modalities (e.g., video, images, audio, text) and are useful for video content analysis in a variety of problems (e.g., object detection, scene understanding). In this paper, we focus on the problem of video categorization by using a multimodal approach. We have developed a novel temporal coherence-based regularization approach, which applies to different types of models (e.g., RNN, NetVLAD, Transformer). We demonstrate through experiments how our proposed multimodal video categorization models with temporal coherence out-perform strong state-of-the-art baseline models.
Published 2020-02-07
URL https://arxiv.org/abs/2002.03844v1
PDF https://arxiv.org/pdf/2002.03844v1.pdf
PWC https://paperswithcode.com/paper/exploiting-temporal-coherence-for-multi-modal
Repo
Framework

#### Objective Mismatch in Model-based Reinforcement Learning

Title Objective Mismatch in Model-based Reinforcement Learning
Authors Nathan Lambert, Brandon Amos, Omry Yadan, Roberto Calandra
Abstract Model-based reinforcement learning (MBRL) has been shown to be a powerful framework for data-efficiently learning control of continuous tasks. Recent work in MBRL has mostly focused on using more advanced function approximators and planning schemes, with little development of the general framework. In this paper, we identify a fundamental issue of the standard MBRL framework – what we call the objective mismatch issue. Objective mismatch arises when one objective is optimized in the hope that a second, often uncorrelated, metric will also be optimized. In the context of MBRL, we characterize the objective mismatch between training the forward dynamics model w.r.t. the likelihood of the one-step ahead prediction, and the overall goal of improving performance on a downstream control task. For example, this issue can emerge with the realization that dynamics models effective for a specific task do not necessarily need to be globally accurate, and vice versa globally accurate models might not be sufficiently accurate locally to obtain good control performance on a specific task. In our experiments, we study this objective mismatch issue and demonstrate that the likelihood of one-step ahead predictions is not always correlated with control performance. This observation highlights a critical limitation in the MBRL framework which will require further research to be fully understood and addressed. We propose an initial method to mitigate the mismatch issue by re-weighting dynamics model training. Building on it, we conclude with a discussion about other potential directions of research for addressing this issue.
Published 2020-02-11
URL https://arxiv.org/abs/2002.04523v1
PDF https://arxiv.org/pdf/2002.04523v1.pdf
PWC https://paperswithcode.com/paper/objective-mismatch-in-model-based-1
Repo
Framework

#### Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization

Title Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization
Authors Junjie Yan, Ruosi Wan, Xiangyu Zhang, Wei Zhang, Yichen Wei, Jian Sun
Abstract Batch Normalization (BN) is one of the most widely used techniques in the deep learning field, but its performance can degrade severely when the batch size is insufficient. This weakness limits the usage of BN on many computer vision tasks like detection or segmentation, where batch size is usually small due to the constraint of memory consumption. Therefore many modified normalization techniques have been proposed, which either fail to restore the performance of BN completely, or have to introduce additional nonlinear operations in the inference procedure and greatly increase consumption. In this paper, we reveal that there are two extra batch statistics involved in the backward propagation of BN, which have never been well discussed before. The extra batch statistics associated with gradients can also severely affect the training of deep neural networks. Based on our analysis, we propose a novel normalization method, named Moving Average Batch Normalization (MABN). MABN can completely restore the performance of vanilla BN in small batch cases, without introducing any additional nonlinear operations in the inference procedure. We prove the benefits of MABN by both theoretical analysis and experiments. Our experiments demonstrate the effectiveness of MABN in multiple computer vision tasks including ImageNet and COCO. The code has been released at https://github.com/megvii-model/MABN.
Published 2020-01-19
URL https://arxiv.org/abs/2001.06838v1
PDF https://arxiv.org/pdf/2001.06838v1.pdf
PWC https://paperswithcode.com/paper/towards-stabilizing-batch-statistics-in-1
Repo
Framework

#### On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

Title On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions
Authors Yossi Arjevani, Amit Daniely, Stefanie Jegelka, Hongzhou Lin
Abstract Recent advances in randomized incremental methods for minimizing $L$-smooth $\mu$-strongly convex finite sums have culminated in tight complexity of $\tilde{O}((n+\sqrt{n L/\mu})\log(1/\epsilon))$ and $O(n+\sqrt{nL/\epsilon})$, where $\mu>0$ and $\mu=0$, respectively, and $n$ denotes the number of individual functions. Unlike incremental methods, stochastic methods for finite sums do not rely on an explicit knowledge of which individual function is being addressed at each iteration, and as such, must perform at least $\Omega(n^2)$ iterations to obtain $O(1/n^2)$-optimal solutions. In this work, we exploit the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model, showing that this lower bound is indeed tight. Following a similar approach, we propose a novel adaptation of SVRG which is both *compatible with stochastic oracles*, and achieves complexity bounds of $\tilde{O}((n^2+n\sqrt{L/\mu})\log(1/\epsilon))$ and $O(n\sqrt{L/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively. Our bounds hold w.h.p. and match in part existing lower bounds of $\tilde{\Omega}(n^2+\sqrt{nL/\mu}\log(1/\epsilon))$ and $\tilde{\Omega}(n^2+\sqrt{nL/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively.
Published 2020-02-09
URL https://arxiv.org/abs/2002.03273v1
PDF https://arxiv.org/pdf/2002.03273v1.pdf
PWC https://paperswithcode.com/paper/on-the-complexity-of-minimizing-convex-finite
Repo
Framework

#### Analytic Properties of Trackable Weak Models

Title Analytic Properties of Trackable Weak Models
Authors Mark Chilenski, George Cybenko, Isaac Dekine, Piyush Kumar, Gil Raz
Abstract We present several new results on the feasibility of inferring the hidden states in strongly-connected trackable weak models. Here, a weak model is a directed graph in which each node is assigned a set of colors which may be emitted when that node is visited. A hypothesis is a node sequence which is consistent with a given color sequence. A weak model is said to be trackable if the worst case number of such hypotheses grows as a polynomial in the sequence length. We show that the number of hypotheses in strongly-connected trackable models is bounded by a constant and give an expression for this constant. We also consider the problem of reconstructing which branch was taken at a node with same-colored out-neighbors, and show that it is always eventually possible to identify which branch was taken if the model is strongly connected and trackable. We illustrate these properties by assigning transition probabilities and employing standard tools for analyzing Markov chains. In addition, we present new results for the entropy rates of weak models according to whether they are trackable or not. These theorems indicate that the combination of trackability and strong connectivity dramatically simplifies the task of reconstructing which nodes were visited. This work has implications for any problem which can be described in terms of an agent traversing a colored graph, such as the reconstruction of hidden states in a hidden Markov model (HMM).
Published 2020-01-08
URL https://arxiv.org/abs/2001.07608v1
PDF https://arxiv.org/pdf/2001.07608v1.pdf
PWC https://paperswithcode.com/paper/analytic-properties-of-trackable-weak-models
Repo
Framework
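
The definitions of a weak model and a hypothesis given in the abstract can be made concrete with a small counting routine. The sketch below is only an illustration under stated assumptions: the graph, the colors, and the function name `count_hypotheses` are invented for the example, and any node is allowed as a starting point.

```python
def count_hypotheses(edges, colors, observed):
    """Count node sequences (paths in the graph) consistent with the observed colors.

    edges:    dict mapping each node to its out-neighbors
    colors:   dict mapping each node to the set of colors it may emit
    observed: non-empty list of observed colors
    """
    if not observed:
        return 0
    # counts[v] = number of consistent paths ending at node v (assumed: any start node)
    counts = {v: 1 for v in colors if observed[0] in colors[v]}
    for color in observed[1:]:
        nxt = {}
        for u, c in counts.items():
            for v in edges.get(u, ()):
                if color in colors[v]:
                    nxt[v] = nxt.get(v, 0) + c
        counts = nxt
    return sum(counts.values())

# Toy strongly connected graph: a -> b, a -> c, b -> a, c -> a
edges = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
observed = ["red", "blue", "red", "blue"]

ambiguous = {"a": {"red"}, "b": {"blue"}, "c": {"blue"}}
distinct = {"a": {"red"}, "b": {"blue"}, "c": {"green"}}

n_ambiguous = count_hypotheses(edges, ambiguous, observed)  # 4
n_distinct = count_hypotheses(edges, distinct, observed)    # 1
```

In the first coloring, every visit to `a` is followed by an indistinguishable branch, so the hypothesis count doubles with each red-blue pair and the toy model is not trackable. In the second, the count stays at one.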

#### Fundamental Issues Regarding Uncertainties in Artificial Neural Networks

Title Fundamental Issues Regarding Uncertainties in Artificial Neural Networks
Authors Neil A. Thacker, Carole J. Twining, Paul D. Tar, Scott Notley, Visvanathan Ramesh
Abstract Artificial Neural Networks (ANNs) implement a specific form of multi-variate extrapolation and will generate an output for any input pattern, even when there is no similar training pattern. Extrapolations are not necessarily to be trusted, and in order to support safety critical systems, we require such systems to give an indication of the training sample related uncertainty associated with their output. Some readers may think that this is a well known issue which is already covered by the basic principles of pattern recognition. We will explain below how this is not the case and how the conventional (Likelihood estimate of) conditional probability of classification does not correctly assess this uncertainty. We provide a discussion of the standard interpretations of this problem and show how a quantitative approach based upon long standing methods can be practically applied. The methods are illustrated on the task of early diagnosis of dementing diseases using Magnetic Resonance Imaging.
Published 2020-02-25
URL https://arxiv.org/abs/2002.11152v1
PDF https://arxiv.org/pdf/2002.11152v1.pdf
PWC https://paperswithcode.com/paper/fundamental-issues-regarding-uncertainties-in
Repo
Framework

#### Causal Structure Discovery from Distributions Arising from Mixtures of DAGs

Title Causal Structure Discovery from Distributions Arising from Mixtures of DAGs
Authors Basil Saeed, Snigdha Panigrahi, Caroline Uhler
Abstract We consider distributions arising from a mixture of causal models, where each model is represented by a directed acyclic graph (DAG). We provide a graphical representation of such mixture distributions and prove that this representation encodes the conditional independence relations of the mixture distribution. We then consider the problem of structure learning based on samples from such distributions. Since the mixing variable is latent, we consider causal structure discovery algorithms such as FCI that can deal with latent variables. We show that such algorithms recover a “union” of the component DAGs and can identify variables whose conditional distribution across the component DAGs vary. We demonstrate our results on synthetic and real data showing that the inferred graph identifies nodes that vary between the different mixture components. As an immediate application, we demonstrate how retrieval of this causal information can be used to cluster samples according to each mixture component.
Published 2020-01-31
URL https://arxiv.org/abs/2001.11940v1
PDF https://arxiv.org/pdf/2001.11940v1.pdf
PWC https://paperswithcode.com/paper/causal-structure-discovery-from-distributions
Repo
Framework

#### TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions

Title TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions
Authors Benjamin Regler, Matthias Scheffler, Luca M. Ghiringhelli
Abstract The identification of relevant features, i.e., the driving variables that determine a process or the property of a system, is an essential part of the analysis of data sets whose entries are described by a large number of variables. The preferred measure for quantifying the relevance of nonlinear statistical dependencies is mutual information, which requires as input probability distributions. Probability distributions cannot be reliably sampled and estimated from limited data, especially for real-valued data samples such as lengths or energies. Here, we introduce total cumulative mutual information (TCMI), a measure of the relevance of mutual dependencies based on cumulative probability distributions. TCMI can be estimated directly from sample data and is a non-parametric, robust and deterministic measure that facilitates comparisons and rankings between feature sets with different cardinality. The ranking induced by TCMI allows for feature selection, i.e., the identification of the set of relevant features that are statistically related to the process or the property of a system, while taking into account the number of data samples as well as the cardinality of the feature subsets. We evaluate the performance of our measure with simulated data, compare its performance with similar multivariate dependence measures, and demonstrate the effectiveness of our feature selection method on a set of standard data sets and a typical scenario in materials science.
Published 2020-01-30
URL https://arxiv.org/abs/2001.11212v1
PDF https://arxiv.org/pdf/2001.11212v1.pdf
PWC https://paperswithcode.com/paper/tcmi-a-non-parametric-mutual-dependence
Repo
Framework

#### FsNet: Feature Selection Network on High-dimensional Biological Data

Title FsNet: Feature Selection Network on High-dimensional Biological Data
Abstract Biological data are generally high-dimensional and require efficient machine learning methods that are well generalized and scalable to discover their complex nonlinear patterns. The recent advances in the domain of artificial intelligence and machine learning can be attributed to deep neural networks (DNNs) because they accomplish a variety of tasks in computer vision and natural language processing. However, standard DNNs are not suitable for handling high-dimensional data and data with a small number of samples because they require a large pool of computing resources as well as plenty of samples to learn a large number of parameters. In particular, although interpretability is important for high-dimensional biological data such as gene expression data, a nonlinear feature selection algorithm for DNN models has not been fully investigated. In this paper, we propose a novel nonlinear feature selection method called the Feature Selection Network (FsNet), which is a scalable concrete neural network architecture for setups with high dimensionality and a small number of samples. Specifically, our network consists of a selector layer that uses a concrete random variable for discrete feature selection and a supervised deep neural network regularized with the reconstruction loss. Because a large number of parameters in the selector and reconstruction layer can easily cause overfitting under a limited number of samples, we use two tiny networks to predict the large virtual weight matrices of the selector and reconstruction layers. The experimental results on several real-world high-dimensional biological datasets demonstrate the efficacy of the proposed approach.
Published 2020-01-23
URL https://arxiv.org/abs/2001.08322v1
PDF https://arxiv.org/pdf/2001.08322v1.pdf
PWC https://paperswithcode.com/paper/fsnet-feature-selection-network-on-high
Repo
Framework

#### Investigating Classification Techniques with Feature Selection For Intention Mining From Twitter Feed

Title Investigating Classification Techniques with Feature Selection For Intention Mining From Twitter Feed
Abstract In the last decade, social networks became the most popular medium for communication and interaction. As an example, the micro-blogging service Twitter has more than 200 million registered users who exchange more than 65 million posts per day. Users express their thoughts, ideas, and even their intentions through these tweets. Most of the tweets are written informally and often in slang that contains misspelt and abbreviated words. This paper investigates the problem of selecting features that affect extracting users' intentions from Twitter feeds based on text mining techniques. It starts by presenting the method we used to construct our own dataset from extracted Twitter feeds. Following that, we present two techniques of feature selection followed by classification. In the first technique, we use Information Gain as a one-phase feature selection, followed by supervised classification algorithms. In the second technique, we use a hybrid approach based on a forward feature selection algorithm, in which two feature selection techniques are employed, followed by classification algorithms. We examine these two techniques with four classification algorithms. We evaluate them using our own dataset, and we critically review the results.
Published 2020-01-22
URL https://arxiv.org/abs/2001.10380v1
PDF https://arxiv.org/pdf/2001.10380v1.pdf
PWC https://paperswithcode.com/paper/investigating-classification-techniques-with
Repo
Framework
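
Information Gain, the one-phase feature selection criterion named in the abstract, can be computed as in the sketch below. The toy tweets, labels, and word features are invented for illustration and are not taken from the paper's dataset. The function names are likewise assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting the samples on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical data: does each tweet contain the word? (1 = yes, 0 = no)
labels = ["intent", "intent", "none", "none"]
features = {
    "want": [1, 1, 0, 0],   # perfectly separates the two classes
    "the":  [1, 0, 1, 0],   # carries no information about the class
}

scores = {word: information_gain(values, labels) for word, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Features would then be kept in order of `ranking`, with the top-scoring ones passed on to a supervised classifier.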

#### Analytic Study of Double Descent in Binary Classification: The Impact of Loss

Title Analytic Study of Double Descent in Binary Classification: The Impact of Loss
Authors Ganesh Kini, Christos Thrampoulidis
Abstract Extensive empirical evidence reveals that, for a wide range of different learning methods and datasets, the risk curve exhibits a double-descent (DD) trend as a function of the model size. In a recent paper [Zeyu, Kammoun, Thrampoulidis, 2019] the authors studied binary linear classification models and showed that the test error of gradient descent (GD) with logistic loss undergoes a DD. In this paper, we complement these results by extending them to GD with square loss. We show that the DD phenomenon persists, but we also identify several differences compared to logistic loss. This emphasizes that crucial features of DD curves (such as their transition threshold and global minima) depend both on the training data and on the learning algorithm. We further study the dependence of DD curves on the size of the training set. Similar to our earlier work, our results are analytic: we plot the DD curves by first deriving sharp asymptotics for the test error under Gaussian features. Albeit simple, the models permit a principled study of DD features, the outcomes of which theoretically corroborate related empirical findings occurring in more complex learning tasks.
Published 2020-01-30
URL https://arxiv.org/abs/2001.11572v1
PDF https://arxiv.org/pdf/2001.11572v1.pdf
PWC https://paperswithcode.com/paper/analytic-study-of-double-descent-in-binary
Repo
Framework

#### TeCNO: Surgical Phase Recognition with Multi-Stage Temporal Convolutional Networks

Title TeCNO: Surgical Phase Recognition with Multi-Stage Temporal Convolutional Networks
Authors Tobias Czempiel, Magdalini Paschali, Matthias Keicher, Walter Simson, Hubertus Feussner, Seong Tae Kim, Nassir Navab
Abstract Automatic surgical phase recognition is a challenging and crucial task with the potential to improve patient safety and become an integral part of intra-operative decision-support systems. In this paper, we propose, for the first time in workflow analysis, a Multi-Stage Temporal Convolutional Network (MS-TCN) that performs hierarchical prediction refinement for surgical phase recognition. Causal, dilated convolutions allow for a large receptive field and online inference with smooth predictions even during ambiguous transitions. Our method is thoroughly evaluated on two datasets of laparoscopic cholecystectomy videos with and without the use of additional surgical tool information. Outperforming various state-of-the-art LSTM approaches, we verify the suitability of the proposed causal MS-TCN for surgical phase recognition.
Published 2020-03-24
URL https://arxiv.org/abs/2003.10751v1
PDF https://arxiv.org/pdf/2003.10751v1.pdf
PWC https://paperswithcode.com/paper/tecno-surgical-phase-recognition-with-multi
Repo
Framework
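
The causal, dilated convolutions mentioned in the abstract can be illustrated with a generic one-dimensional sketch. This is not the MS-TCN architecture itself: the kernel size of 2, the dilation schedule 1, 2, 4, and the function name `causal_dilated_conv1d` are assumptions chosen to show why stacking such layers gives a large receptive field while each output still depends only on current and past inputs.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with x treated as zero before t = 0."""
    y = np.zeros(len(x), dtype=float)
    for k, w in enumerate(kernel):
        shift = k * dilation
        if shift < len(x):
            y[shift:] += w * x[:len(x) - shift]
    return y

# Impulse response of three stacked layers with dilations 1, 2, 4 (assumed schedule).
signal = np.zeros(10)
signal[0] = 1.0
out = signal
for dilation in (1, 2, 4):
    out = causal_dilated_conv1d(out, kernel=[1.0, 1.0], dilation=dilation)

receptive_field = int(np.count_nonzero(out))  # 8 time steps

# Causality check: an impulse at t = 3 never influences outputs before t = 3.
late = np.zeros(8)
late[3] = 1.0
response = causal_dilated_conv1d(late, kernel=[1.0, 1.0], dilation=2)
```

With kernel size 2, the receptive field of the stack is one plus the sum of the dilations, so doubling the dilation at each layer makes it grow exponentially with depth. Because no future frames are needed, such a model can run online during a procedure.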

#### Bridging the gap between AI and Healthcare sides: towards developing clinically relevant AI-powered diagnosis systems

Title Bridging the gap between AI and Healthcare sides: towards developing clinically relevant AI-powered diagnosis systems
Authors Changhee Han, Leonardo Rundo, Kohei Murao, Takafumi Nemoto, Hideki Nakayama, Shin’ichi Satoh
Abstract This work aims to identify/bridge the gap between Artificial Intelligence (AI) and Healthcare sides in Japan towards developing medical AI fitting into a clinical environment in five years. Moreover, we attempt to confirm the clinical relevance for diagnosis of our research-proven pathology-aware Generative Adversarial Network (GAN)-based medical image augmentation: a data wrangling and information conversion technique to address data paucity. We hold a clinically valuable AI-envisioning workshop among 2 Medical Imaging experts, 2 physicians, and 3 Healthcare/Informatics generalists. A qualitative/quantitative questionnaire survey for 3 project-related physicians and 6 project non-related radiologists evaluates the GAN projects in terms of Data Augmentation (DA) and physician training. The workshop reveals the intrinsic gap between AI/Healthcare sides and its preliminary solutions on Why (i.e., clinical significance/interpretation) and How (i.e., data acquisition, commercial deployment, and safety/feeling safe). The survey confirms our pathology-aware GANs’ clinical relevance as a clinical decision support system and non-expert physician training tool. Radiologists generally have high expectations for AI-based diagnosis as a reliable second opinion and abnormal candidate detection, instead of replacing them. Our findings would play a key role in connecting inter-disciplinary research and clinical applications, not limited to the Japanese medical context and pathology-aware GANs. We find that better DA and expert physician training would require atypical image generation via further GAN-based extrapolation.
Tasks Data Augmentation, Image Augmentation, Image Generation
Published 2020-01-12
URL https://arxiv.org/abs/2001.03923v1
PDF https://arxiv.org/pdf/2001.03923v1.pdf
PWC https://paperswithcode.com/paper/bridging-the-gap-between-ai-and-healthcare
Repo
Framework

#### DADA: Differentiable Automatic Data Augmentation

Title DADA: Differentiable Automatic Data Augmentation
Authors Yonggang Li, Guosheng Hu, Yongtao Wang, Timothy Hospedales, Neil M. Robertson, Yongxin Yang
Abstract Data augmentation (DA) techniques aim to increase data variability, and thus train deep networks with better generalisation. The pioneering AutoAugment automated the search for optimal DA policies with reinforcement learning. However, AutoAugment is extremely computationally expensive, limiting its wide applicability. Followup work such as PBA and Fast AutoAugment improved efficiency, but optimization speed remains a bottleneck. In this paper, we propose Differentiable Automatic Data Augmentation (DADA) which dramatically reduces the cost. DADA relaxes the discrete DA policy selection to a differentiable optimization problem via Gumbel-Softmax. In addition, we introduce an unbiased gradient estimator, RELAX, leading to an efficient and effective one-pass optimization strategy to learn an efficient and accurate DA policy. We conduct extensive experiments on CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Furthermore, we demonstrate the value of Auto DA in pre-training for downstream detection problems. Results show our DADA is at least one order of magnitude faster than the state-of-the-art while achieving very comparable accuracy.