February 1, 2020

3058 words 15 mins read

Paper Group AWR 284



Automated Spectral Kernel Learning

Title Automated Spectral Kernel Learning
Authors Jian Li, Yong Liu, Weiping Wang
Abstract The generalization performance of kernel methods is largely determined by the kernel, but common kernels are stationary, and thus input-independent and output-independent, which limits their applicability to complicated tasks. In this paper, we propose a powerful and efficient spectral kernel learning framework in which the learned kernels depend on both inputs and outputs, obtained by using non-stationary spectral kernels and flexibly learning the spectral measure from the data. Further, we derive a data-dependent generalization error bound based on Rademacher complexity, which estimates the generalization ability of the learning framework and suggests two regularization terms to improve performance. Extensive experimental results validate the effectiveness of the proposed algorithm and confirm our theoretical results.
Tasks
Published 2019-09-11
URL https://arxiv.org/abs/1909.04894v2
PDF https://arxiv.org/pdf/1909.04894v2.pdf
PWC https://paperswithcode.com/paper/automated-spectral-kernel-learning
Repo https://github.com/superlj666/Automated-Spectral-Kernel-Learning
Framework pytorch
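A minimal PyTorch sketch of the idea in the entry above, under assumptions of my own: the spectral measure is represented by two trainable frequency matrices feeding a random-Fourier-style feature map, and a linear head plus a norm penalty stand in for the paper's output-dependent kernel and its two regularization terms. The exact ASKL feature map and regularizers differ; see the linked repo for the reference implementation.

```python
# Sketch only: learning the spectral measure of a non-stationary spectral
# kernel end-to-end. W1, W2 are trainable frequencies (the spectral measure);
# the linear head makes the learned representation output-dependent as well.
import torch
import torch.nn as nn

class SpectralKernelNet(nn.Module):
    def __init__(self, d_in, n_freq=256, d_out=1):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(d_in, n_freq))  # learned frequencies
        self.W2 = nn.Parameter(torch.randn(d_in, n_freq))
        self.head = nn.Linear(2 * n_freq, d_out)

    def features(self, x):
        z1, z2 = x @ self.W1, x @ self.W2
        # non-stationary random-Fourier-style feature map
        phi = torch.cat([torch.cos(z1) + torch.cos(z2),
                         torch.sin(z1) + torch.sin(z2)], dim=-1)
        return phi / (2 * self.W1.shape[1]) ** 0.5

    def forward(self, x):
        return self.head(self.features(x))

model = SpectralKernelNet(d_in=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 10), torch.randn(64, 1)  # toy regression batch
loss = nn.functional.mse_loss(model(x), y) + 1e-4 * model.head.weight.norm()
opt.zero_grad(); loss.backward(); opt.step()
```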

Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup

Title Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup
Authors Sebastian Goldt, Madhu S. Advani, Andrew M. Saxe, Florent Krzakala, Lenka Zdeborová
Abstract Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.
Tasks
Published 2019-06-18
URL https://arxiv.org/abs/1906.08632v2
PDF https://arxiv.org/pdf/1906.08632v2.pdf
PWC https://paperswithcode.com/paper/dynamics-of-stochastic-gradient-descent-for
Repo https://github.com/cristianoBY/nn2pp-torch
Framework pytorch
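A small numpy sketch of the teacher-student setup analysed in the entry above: an over-parameterised two-layer student is trained by one-sample (online) SGD on Gaussian inputs labelled by a fixed two-layer teacher. The ODE description and the specific learning-rate scalings with input dimension used in the paper are not reproduced here; this only generates the training dynamics for inspection.

```python
# Sketch: online SGD for a two-layer student learning from a two-layer teacher.
import numpy as np

rng = np.random.default_rng(0)
d, m_teacher, m_student, lr, steps = 500, 2, 8, 0.01, 20_000

# fixed teacher network
Wt = rng.standard_normal((m_teacher, d))
vt = np.ones(m_teacher)
# over-parameterised student; both layers are trained
Ws = rng.standard_normal((m_student, d)) / np.sqrt(d)
vs = rng.standard_normal(m_student) / np.sqrt(m_student)

g = np.tanh  # activation; the paper also studies other units

for t in range(steps):
    x = rng.standard_normal(d)                       # fresh sample each step
    y_teacher = vt @ g(Wt @ x / np.sqrt(d))
    h = Ws @ x / np.sqrt(d)
    err = vs @ g(h) - y_teacher
    # plain one-sample SGD on 0.5 * err**2 (the paper rescales lr with d)
    grad_vs = err * g(h)
    grad_Ws = err * np.outer(vs * (1 - g(h) ** 2), x) / np.sqrt(d)
    vs -= lr * grad_vs
    Ws -= lr * grad_Ws

# estimate the generalisation error on fresh inputs
Xt = rng.standard_normal((10_000, d))
eg = 0.5 * np.mean((g(Xt @ Ws.T / np.sqrt(d)) @ vs - g(Xt @ Wt.T / np.sqrt(d)) @ vt) ** 2)
print("final generalisation error:", eg)
```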

L_DMI: An Information-theoretic Noise-robust Loss Function

Title L_DMI: An Information-theoretic Noise-robust Loss Function
Authors Yilun Xu, Peng Cao, Yuqing Kong, Yizhou Wang
Abstract Accurately annotating a large-scale dataset is notoriously expensive in both time and money. Although acquiring a cheaply annotated, low-quality dataset costs much less, using such a dataset without particular treatment often badly damages the performance of trained models. Various methods have been proposed for learning with noisy labels. However, most methods only handle limited kinds of noise patterns, require auxiliary information or steps (e.g., knowing or estimating the noise transition matrix), or lack theoretical justification. In this paper, we propose a novel information-theoretic loss function, $\mathcal{L}_{DMI}$, for training deep neural networks robust to label noise. The core of $\mathcal{L}_{DMI}$ is a generalized version of mutual information, termed Determinant-based Mutual Information (DMI), which is not only information-monotone but also relatively invariant. \emph{To the best of our knowledge, $\mathcal{L}_{DMI}$ is the first loss function that is provably robust to instance-independent label noise, regardless of noise pattern, and it can be applied to any existing classification neural network straightforwardly without any auxiliary information}. In addition to the theoretical justification, we also empirically show that using $\mathcal{L}_{DMI}$ outperforms all other counterparts in classification on both image and natural language datasets, including Fashion-MNIST, CIFAR-10, Dogs vs. Cats and MR with a variety of synthesized noise patterns and amounts, as well as on the real-world dataset Clothing1M. Code is available at https://github.com/Newbeeer/L_DMI .
Tasks
Published 2019-09-08
URL https://arxiv.org/abs/1909.03388v2
PDF https://arxiv.org/pdf/1909.03388v2.pdf
PWC https://paperswithcode.com/paper/l_dmi-an-information-theoretic-noise-robust
Repo https://github.com/Newbeeer/L_DMI
Framework pytorch
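A compact PyTorch sketch of the DMI-based loss described in the entry above: the batch loss is $-\log|\det(U)|$, where $U$ is the empirical joint distribution between the predicted class probabilities and the (possibly noisy) one-hot labels. The small epsilon and the toy data are my additions; see the linked repository for the authors' reference implementation.

```python
# Sketch of the DMI loss: -log|det(U)| with U the empirical joint distribution
# between softmax outputs and one-hot labels in a batch.
import torch
import torch.nn.functional as F

def dmi_loss(logits, targets, num_classes):
    probs = F.softmax(logits, dim=1)                    # (B, C)
    onehot = F.one_hot(targets, num_classes).float()    # (B, C)
    U = probs.t() @ onehot / logits.shape[0]            # empirical joint, (C, C)
    return -torch.log(torch.abs(torch.det(U)) + 1e-6)   # eps added for stability

logits = torch.randn(32, 10, requires_grad=True)        # toy batch
targets = torch.randint(0, 10, (32,))
loss = dmi_loss(logits, targets, num_classes=10)
loss.backward()
```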

Safeguarded Dynamic Label Regression for Generalized Noisy Supervision

Title Safeguarded Dynamic Label Regression for Generalized Noisy Supervision
Authors Jiangchao Yao, Ya Zhang, Ivor W. Tsang, Jun Sun
Abstract Learning with noisy labels, which aims to reduce the expensive labor of accurate annotation, has become imperative in the Big Data era. Previous noise-transition-based methods have achieved promising results and come with a theoretical guarantee on performance in the case of class-conditional noise. However, this type of approach critically depends on an accurate pre-estimation of the noise transition, which is usually impractical. A subsequent improvement adapts the pre-estimation along with the training progress via a Softmax layer, but the parameters of that layer must be heavily tweaked and the resulting performance is fragile, because the underlying stochastic approximation is ill-posed. To address these issues, we propose a Latent Class-Conditional Noise model (LCCN) that naturally embeds the noise transition in a Bayesian framework. By projecting the noise transition into a Dirichlet-distributed space, the learning is constrained to a simplex based on the whole dataset, instead of some ad-hoc parametric space. We then deduce a dynamic label regression method for LCCN to iteratively infer the latent labels, stochastically train the classifier, and model the noise. Our approach safeguards a bounded update of the noise transition, which avoids the previous arbitrary tuning from a single batch of samples. We further generalize LCCN to open-set noisy labels and the semi-supervised setting. We perform extensive experiments on the controllable-noise datasets CIFAR-10 and CIFAR-100, and on the agnostic-noise datasets Clothing1M and WebVision17. The experimental results demonstrate that the proposed model outperforms several state-of-the-art methods.
Tasks
Published 2019-03-06
URL http://arxiv.org/abs/1903.02152v1
PDF http://arxiv.org/pdf/1903.02152v1.pdf
PWC https://paperswithcode.com/paper/safeguarded-dynamic-label-regression-for
Repo https://github.com/Sunarker/Safeguarded-Dynamic-Label-Regression-for-Noisy-Supervision
Framework tf
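A simplified numpy sketch of the dynamic label regression idea in the entry above: latent "clean" labels are sampled given the classifier's prediction and the current noise transition, and the transition is re-estimated from Dirichlet counts accumulated over the data rather than a single batch. The classifier training loop and the other safeguards of the full LCCN model are omitted, and the "predictions" below are random placeholders.

```python
# Sketch: Gibbs-style inference of latent labels and a Dirichlet update of the
# noise transition matrix.
import numpy as np

rng = np.random.default_rng(0)
C = 10                                    # number of classes
alpha = np.ones((C, C))                   # Dirichlet prior on transition rows
counts = np.zeros((C, C))                 # accumulated (latent -> noisy) counts

def sample_latent(p_pred, noisy_label, T):
    # posterior over the latent clean label z: p(z|x) * T[z, noisy_label]
    post = p_pred * T[:, noisy_label]
    post /= post.sum()
    return rng.choice(C, p=post)

T = np.full((C, C), 1.0 / C)              # initial (uninformative) transition
for _ in range(1000):                     # one sweep over a toy "dataset"
    p_pred = rng.dirichlet(np.ones(C))    # stand-in for the classifier's softmax
    noisy = rng.integers(C)               # stand-in for the observed noisy label
    z = sample_latent(p_pred, noisy, T)
    counts[z, noisy] += 1

# posterior mean of the Dirichlet-distributed noise transition matrix
T = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
```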

Controlling the Amount of Verbatim Copying in Abstractive Summarization

Title Controlling the Amount of Verbatim Copying in Abstractive Summarization
Authors Kaiqiang Song, Bingqing Wang, Zhe Feng, Liu Ren, Fei Liu
Abstract An abstract must not change the meaning of the original text. A single most effective way to achieve that is to increase the amount of copying while still allowing for text abstraction. Human editors can usually exercise control over copying, resulting in summaries that are more extractive than abstractive, or vice versa. However, it remains poorly understood whether modern neural abstractive summarizers can provide the same flexibility, i.e., learning from single reference summaries to generate multiple summary hypotheses with varying degrees of copying. In this paper, we present a neural summarization model that, by learning from single human abstracts, can produce a broad spectrum of summaries ranging from purely extractive to highly generative ones. We frame the task of summarization as language modeling and exploit alternative mechanisms to generate summary hypotheses. Our method allows for control over copying during both training and decoding stages of a neural summarization model. Through extensive experiments we illustrate the significance of our proposed method on controlling the amount of verbatim copying and achieve competitive results over strong baselines. Our analysis further reveals interesting and unobvious facts.
Tasks Abstractive Text Summarization, Language Modelling
Published 2019-11-23
URL https://arxiv.org/abs/1911.10390v1
PDF https://arxiv.org/pdf/1911.10390v1.pdf
PWC https://paperswithcode.com/paper/controlling-the-amount-of-verbatim-copying-in
Repo https://github.com/ucfnlp/control-over-copying
Framework pytorch
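A small helper sketch, not part of the authors' model, for quantifying the quantity the entry above is about: the fraction of summary n-grams that also occur in the source document, i.e. the amount of verbatim copying one would want to measure and control.

```python
# Sketch: a simple n-gram overlap statistic for verbatim copying.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copy_rate(source, summary, n=2):
    src, summ = ngrams(source.split(), n), ngrams(summary.split(), n)
    return len(summ & src) / max(len(summ), 1)

doc = "the quick brown fox jumps over the lazy dog near the river bank"
print(copy_rate(doc, "the quick brown fox rests near the river"))      # mostly copied
print(copy_rate(doc, "a fast animal leaps across a sleeping canine"))  # abstractive
```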

Leveraging Domain Knowledge to Improve Microscopy Image Segmentation with Lifted Multicuts

Title Leveraging Domain Knowledge to Improve Microscopy Image Segmentation with Lifted Multicuts
Authors Constantin Pape, Alex Matskevych, Adrian Wolny, Julian Hennies, Giulia Mizzon, Marion Louveaux, Jacob Musser, Alexis Maizel, Detlev Arendt, Anna Kreshuk
Abstract The throughput of electron microscopes has increased significantly in recent years, enabling detailed analysis of cell morphology and ultrastructure. Analysis of neural circuits at single-synapse resolution remains the flagship target of this technique, but applications to cell and developmental biology are also starting to emerge at scale. The amount of data acquired in such studies makes manual instance segmentation, a fundamental step in many analysis pipelines, impossible. While automatic segmentation approaches have improved significantly thanks to the adoption of convolutional neural networks, their accuracy still lags behind human annotations and requires additional manual proof-reading. A major hindrance to further improvements is the limited field of view of the segmentation networks preventing them from exploiting the expected cell morphology or other prior biological knowledge which humans use to inform their segmentation decisions. In this contribution, we show how such domain-specific information can be leveraged by expressing it as long-range interactions in a graph partitioning problem known as the lifted multicut problem. Using this formulation, we demonstrate significant improvement in segmentation accuracy for three challenging EM segmentation problems from neuroscience and cell biology.
Tasks graph partitioning, Instance Segmentation, Semantic Segmentation
Published 2019-05-25
URL https://arxiv.org/abs/1905.10535v2
PDF https://arxiv.org/pdf/1905.10535v2.pdf
PWC https://paperswithcode.com/paper/leveraging-domain-knowledge-to-improve-em
Repo https://github.com/constantinpape/cluster_tools
Framework none
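A toy sketch of how long-range priors enter the lifted multicut objective mentioned in the entry above: the cost of a node labelling sums the weights of cut edges, and the lifted (long-range) edges are where domain knowledge lives, e.g. a repulsive weight between fragments known to belong to different cells. Solvers such as those in the linked cluster_tools repository optimise this objective; the snippet only evaluates it for a given labelling, with made-up edge weights.

```python
# Sketch: evaluate a lifted multicut objective for a fixed node labelling.
def lifted_multicut_cost(labels, local_edges, lifted_edges):
    cost = 0.0
    for u, v, w in local_edges + lifted_edges:
        if labels[u] != labels[v]:      # the edge is cut by this labelling
            cost += w
    return cost

# 4 fragments; positive weights discourage cutting, negative weights encourage it
local_edges  = [(0, 1, 2.0), (1, 2, -1.5), (2, 3, 2.0)]
lifted_edges = [(0, 3, -3.0)]           # long-range prior: fragments 0 and 3 differ

print(lifted_multicut_cost([0, 0, 1, 1], local_edges, lifted_edges))  # -4.5
print(lifted_multicut_cost([0, 0, 0, 0], local_edges, lifted_edges))  #  0.0
```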

Hybrid Function Sparse Representation towards Image Super Resolution

Title Hybrid Function Sparse Representation towards Image Super Resolution
Authors Junyi Bian, Baojun Lin, Ke Zhang
Abstract Sparse representation with a training-based dictionary has been shown to be successful for super-resolution (SR), but it still has some limitations. Based on the idea of magnifying a function curve without losing its fidelity, we propose a function-based dictionary for sparse-representation super-resolution, called hybrid function sparse representation (HFSR). The dictionary we design is generated directly from preset hybrid functions without additional training, and thanks to this scalable property it can be scaled to any required size. We mix approximated Heaviside functions (AHF), sine functions and DCT functions to form the dictionary. Multi-scale refinement is then proposed to exploit the scalable property of the dictionary and improve the results. In addition, a reconstruction strategy is adopted to deal with the overlaps. Experiments on the Set14 SR dataset show that our method performs particularly well on images containing rich details and context, compared with non-learning-based state-of-the-art methods.
Tasks Image Super-Resolution, Super-Resolution
Published 2019-06-11
URL https://arxiv.org/abs/1906.04363v1
PDF https://arxiv.org/pdf/1906.04363v1.pdf
PWC https://paperswithcode.com/paper/hybrid-function-sparse-representation-towards
Repo https://github.com/Eulring/Hybrid-Function-Sparse-Representation
Framework none
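A rough 1-D sketch of the hybrid-function dictionary idea from the entry above: atoms are generated directly from preset functions (an approximated Heaviside, sines and DCT-style cosines) rather than learned, so the same functions could be re-sampled at any target resolution. The actual HFSR method works on image patches; the atom parameters and the use of scikit-learn's OMP solver here are illustrative choices, not the authors' pipeline.

```python
# Sketch: build a hybrid function dictionary and sparse-code a 1-D signal.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def hybrid_dictionary(n_samples, n_per_family=16, k=50.0):
    t = np.linspace(0.0, 1.0, n_samples)
    ahf  = [0.5 + np.arctan(k * (t - c)) / np.pi
            for c in np.linspace(0.1, 0.9, n_per_family)]      # approx. Heaviside atoms
    sine = [np.sin(np.pi * f * t) for f in range(1, n_per_family + 1)]
    dct  = [np.cos(np.pi * f * t) for f in range(1, n_per_family + 1)]
    D = np.stack(ahf + sine + dct, axis=1)
    return D / np.linalg.norm(D, axis=0)                       # unit-norm atoms

signal = np.sign(np.linspace(-1, 1, 64)) + 0.3 * np.sin(4 * np.pi * np.linspace(0, 1, 64))
D = hybrid_dictionary(64)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=6).fit(D, signal)
recon = D @ omp.coef_ + omp.intercept_
print("reconstruction error:", np.linalg.norm(signal - recon))
```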

Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching

Title Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching
Authors Hongteng Xu, Dixin Luo, Lawrence Carin
Abstract We propose a scalable Gromov-Wasserstein learning (S-GWL) method and establish a novel and theoretically-supported paradigm for large-scale graph analysis. The proposed method is based on the fact that Gromov-Wasserstein discrepancy is a pseudometric on graphs. Given two graphs, the optimal transport associated with their Gromov-Wasserstein discrepancy provides the correspondence between their nodes and achieves graph matching. When one of the graphs has isolated but self-connected nodes (i.e., a disconnected graph), the optimal transport indicates the clustering structure of the other graph and achieves graph partitioning. Using this concept, we extend our method to multi-graph partitioning and matching by learning a Gromov-Wasserstein barycenter graph for multiple observed graphs; the barycenter graph plays the role of the disconnected graph, and since it is learned, so is the clustering. Our method combines a recursive $K$-partition mechanism with a regularized proximal gradient algorithm, whose time complexity is $\mathcal{O}(K(E+V)\log_K V)$ for graphs with $V$ nodes and $E$ edges. To our knowledge, our method is the first attempt to make Gromov-Wasserstein discrepancy applicable to large-scale graph analysis and unify graph partitioning and matching into the same framework. It outperforms state-of-the-art graph partitioning and matching methods, achieving a trade-off between accuracy and efficiency.
Tasks Graph Matching, graph partitioning
Published 2019-05-18
URL https://arxiv.org/abs/1905.07645v5
PDF https://arxiv.org/pdf/1905.07645v5.pdf
PWC https://paperswithcode.com/paper/scalable-gromov-wasserstein-learning-for
Repo https://github.com/HongtengXu/s-gwl
Framework none
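A hedged sketch of Gromov-Wasserstein graph matching using the POT library rather than the authors' S-GWL code (the linked repo adds the recursive partitioning and proximal gradient machinery that makes this scale): the optimal coupling between two graphs' structure matrices gives a soft node correspondence, hardened here with an argmax. Accuracy can fall below 1 because of graph symmetries and local optima.

```python
# Sketch: GW coupling between a graph and a permuted copy, via POT.
import numpy as np
import networkx as nx
import ot

rng = np.random.default_rng(0)
g1 = nx.erdos_renyi_graph(20, 0.3, seed=0)
perm = rng.permutation(20)
g2 = nx.relabel_nodes(g1, {i: int(p) for i, p in enumerate(perm)})  # isomorphic copy

C1 = nx.to_numpy_array(g1, nodelist=range(20))
C2 = nx.to_numpy_array(g2, nodelist=range(20))
p = np.ones(20) / 20                                  # uniform node distributions
q = np.ones(20) / 20

coupling = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
matching = coupling.argmax(axis=1)                    # hard node correspondence
print("recovered permutation accuracy:", float((matching == perm).mean()))
```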

Learning by Cheating

Title Learning by Cheating
Authors Dian Chen, Brady Zhou, Vladlen Koltun, Philipp Krähenbühl
Abstract Vision-based urban driving is hard. The autonomous system needs to learn to perceive the world and act in it. We show that this challenging learning problem can be simplified by decomposing it into two stages. We first train an agent that has access to privileged information. This privileged agent cheats by observing the ground-truth layout of the environment and the positions of all traffic participants. In the second stage, the privileged agent acts as a teacher that trains a purely vision-based sensorimotor agent. The resulting sensorimotor agent does not have access to any privileged information and does not cheat. This two-stage training procedure is counter-intuitive at first, but has a number of important advantages that we analyze and empirically demonstrate. We use the presented approach to train a vision-based autonomous driving system that substantially outperforms the state of the art on the CARLA benchmark and the recent NoCrash benchmark. Our approach achieves, for the first time, 100% success rate on all tasks in the original CARLA benchmark, sets a new record on the NoCrash benchmark, and reduces the frequency of infractions by an order of magnitude compared to the prior state of the art. For the video that summarizes this work, see https://youtu.be/u9ZCxxD-UUw
Tasks Autonomous Driving
Published 2019-12-27
URL https://arxiv.org/abs/1912.12294v1
PDF https://arxiv.org/pdf/1912.12294v1.pdf
PWC https://paperswithcode.com/paper/learning-by-cheating
Repo https://github.com/dianchen96/LearningByCheating
Framework none
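A schematic PyTorch sketch of the two-stage procedure described in the entry above: stage 1 trains a privileged agent on ground-truth state, stage 2 trains a vision-based agent to imitate the privileged agent's actions. The networks, tensor shapes and random data below are placeholders, not the authors' architecture or CARLA interface.

```python
# Sketch: "learning by cheating" as privileged-teacher -> vision-student distillation.
import torch
import torch.nn as nn

privileged_agent = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
vision_agent = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                             nn.Linear(256, 2))

def stage1_step(state, expert_action, opt):
    # the privileged agent "cheats": it sees the ground-truth environment state
    loss = nn.functional.mse_loss(privileged_agent(state), expert_action)
    opt.zero_grad(); loss.backward(); opt.step()

def stage2_step(image, state, opt):
    # the privileged agent now acts as a teacher for the purely vision-based agent
    with torch.no_grad():
        target_action = privileged_agent(state)
    loss = nn.functional.mse_loss(vision_agent(image), target_action)
    opt.zero_grad(); loss.backward(); opt.step()

opt1 = torch.optim.Adam(privileged_agent.parameters(), lr=1e-3)
opt2 = torch.optim.Adam(vision_agent.parameters(), lr=1e-3)
stage1_step(torch.randn(8, 64), torch.randn(8, 2), opt1)          # toy stage-1 batch
stage2_step(torch.randn(8, 3, 32, 32), torch.randn(8, 64), opt2)  # toy stage-2 batch
```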

Towards Federated Learning at Scale: System Design

Title Towards Federated Learning at Scale: System Design
Authors Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander
Abstract Federated Learning is a distributed machine learning approach which enables model training on a large corpus of decentralized data. We have built a scalable production system for Federated Learning in the domain of mobile devices, based on TensorFlow. In this paper, we describe the resulting high-level design, sketch some of the challenges and their solutions, and touch upon the open problems and future directions.
Tasks
Published 2019-02-04
URL http://arxiv.org/abs/1902.01046v2
PDF http://arxiv.org/pdf/1902.01046v2.pdf
PWC https://paperswithcode.com/paper/towards-federated-learning-at-scale-system
Repo https://github.com/zillerium/broomy
Framework none
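The paper above is about production system design rather than algorithm definition, but the algorithm the system serves is federated averaging; a minimal numpy sketch of that generic idea follows. Client data, the linear model and the hyperparameters are toy placeholders.

```python
# Sketch: federated averaging (FedAvg) on toy per-client linear-regression data.
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients, rounds, local_steps, lr = 10, 5, 20, 5, 0.1

true_w = rng.standard_normal(dim)
clients = []
for _ in range(n_clients):
    X = rng.standard_normal((50, dim))
    y = X @ true_w + 0.1 * rng.standard_normal(50)
    clients.append((X, y))

global_w = np.zeros(dim)
for r in range(rounds):
    updates, weights = [], []
    for X, y in clients:                     # in production: a sampled cohort of devices
        w = global_w.copy()
        for _ in range(local_steps):         # local SGD on the client
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        updates.append(w - global_w)
        weights.append(len(y))
    # server: data-weighted average of the client updates
    global_w += np.average(updates, axis=0, weights=weights)

print("distance to true weights:", np.linalg.norm(global_w - true_w))
```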

Estimation and Feature Selection in Mixtures of Generalized Linear Experts Models

Title Estimation and Feature Selection in Mixtures of Generalized Linear Experts Models
Authors Bao Tuyen Huynh, Faicel Chamroukhi
Abstract Mixtures-of-Experts (MoE) are conditional mixture models that have demonstrated their performance in modeling heterogeneous data in many statistical learning approaches for prediction, including regression and classification, as well as for clustering. Their estimation in high-dimensional problems is, however, still challenging. We consider the problem of parameter estimation and feature selection in MoE models with different generalized linear expert models, and propose a regularized maximum likelihood estimation that efficiently encourages sparse solutions for heterogeneous data with high-dimensional predictors. The developed proximal-Newton EM algorithm includes proximal Newton-type procedures to update the model parameters by monotonically maximizing the objective function, and allows us to perform efficient estimation and feature selection. An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data, compared to the main state-of-the-art competitors.
Tasks Feature Selection
Published 2019-07-14
URL https://arxiv.org/abs/1907.06994v1
PDF https://arxiv.org/pdf/1907.06994v1.pdf
PWC https://paperswithcode.com/paper/estimation-and-feature-selection-in-mixtures
Repo https://github.com/fchamroukhi/prEMME
Framework none
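A simplified numpy sketch of two ingredients of regularized MoE estimation mentioned in the entry above: the E-step computing expert responsibilities, and a proximal (soft-thresholding) update that produces sparse expert coefficients. The paper's proximal-Newton EM for generalized linear experts is considerably more elaborate; treat this as an illustration under my own toy assumptions (Gaussian experts, fixed gates, a crude least-squares-then-shrink M-step).

```python
# Sketch: E-step responsibilities + l1 proximal shrinkage for sparse experts.
import numpy as np

def soft_threshold(w, lam):
    # proximal operator of the l1 penalty (source of the sparse solutions)
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def e_step(X, y, gates, experts, sigma):
    # responsibilities of each Gaussian expert for each observation
    resp = np.zeros((len(y), len(experts)))
    for k, (pi_k, w_k) in enumerate(zip(gates, experts)):
        resid = y - X @ w_k
        resp[:, k] = pi_k * np.exp(-0.5 * resid ** 2 / sigma ** 2)
    return resp / resp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)   # only 2 relevant features

experts = [rng.standard_normal(20) * 0.1 for _ in range(2)]
gates = np.array([0.5, 0.5])
for _ in range(50):
    resp = e_step(X, y, gates, experts, sigma=0.5)
    for k in range(2):
        # weighted least squares followed by the proximal l1 shrinkage
        W = np.diag(resp[:, k])
        w_ls = np.linalg.solve(X.T @ W @ X + 1e-6 * np.eye(20), X.T @ W @ y)
        experts[k] = soft_threshold(w_ls, lam=0.05)
    gates = resp.mean(axis=0)

print("non-zero coefficients per expert:", [int((w != 0).sum()) for w in experts])
```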

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

Title EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Authors Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
Abstract We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities – RGB, Flow and Audio – and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previous works, modalities are fused before temporal aggregation, with shared modality and fusion weights over time. Our proposed architecture is trained end-to-end and outperforms both the individual modalities and late fusion of modalities. We demonstrate the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects. Our method achieves state-of-the-art results on both the seen and unseen test sets of the largest egocentric dataset, EPIC-Kitchens, on all metrics on the public leaderboard.
Tasks
Published 2019-08-22
URL https://arxiv.org/abs/1908.08498v1
PDF https://arxiv.org/pdf/1908.08498v1.pdf
PWC https://paperswithcode.com/paper/epic-fusion-audio-visual-temporal-binding-for
Repo https://github.com/ekazakos/temporal-binding-network
Framework tf
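A schematic PyTorch sketch of the mid-level fusion with temporal binding described in the entry above: per-modality features within each binding window are fused first, with the same fusion weights applied at every time step, and only then aggregated over the sampled segments. The encoders are assumed to be applied upstream, and all dimensions and the classifier head are placeholders rather than the EPIC-Fusion architecture.

```python
# Sketch: fuse modalities before temporal aggregation, sharing fusion weights over time.
import torch
import torch.nn as nn

class TemporalBindingFusion(nn.Module):
    def __init__(self, d_rgb=512, d_flow=512, d_audio=256, d_fused=512, n_classes=100):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(d_rgb + d_flow + d_audio, d_fused), nn.ReLU())
        self.classifier = nn.Linear(d_fused, n_classes)

    def forward(self, rgb, flow, audio):
        # inputs: (batch, segments, feature_dim) per modality, one row per binding window
        fused = self.fuse(torch.cat([rgb, flow, audio], dim=-1))   # fuse before aggregation
        pooled = fused.mean(dim=1)                                 # temporal aggregation
        return self.classifier(pooled)

model = TemporalBindingFusion()
logits = model(torch.randn(4, 3, 512), torch.randn(4, 3, 512), torch.randn(4, 3, 256))
print(logits.shape)   # torch.Size([4, 100])
```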

Generalized Inner Loop Meta-Learning

Title Generalized Inner Loop Meta-Learning
Authors Edward Grefenstette, Brandon Amos, Denis Yarats, Phu Mon Htut, Artem Molchanov, Franziska Meier, Douwe Kiela, Kyunghyun Cho, Soumith Chintala
Abstract Many (but not all) approaches self-qualifying as “meta-learning” in deep learning and reinforcement learning fit a common pattern of approximating the solution to a nested optimization problem. In this paper, we give a formalization of this shared pattern, which we call GIMLI, prove its general requirements, and derive a general-purpose algorithm for implementing similar approaches. Based on this analysis and algorithm, we describe a library of our design, higher, which we share with the community to assist and enable future research into these kinds of meta-learning approaches. We end the paper by showcasing the practical applications of this framework and library through illustrative experiments and ablation studies which they facilitate.
Tasks Meta-Learning
Published 2019-10-03
URL https://arxiv.org/abs/1910.01727v2
PDF https://arxiv.org/pdf/1910.01727v2.pdf
PWC https://paperswithcode.com/paper/generalized-inner-loop-meta-learning
Repo https://github.com/facebookresearch/higher
Framework pytorch
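A short sketch of the nested-optimization pattern the paper formalizes, written with the higher library it introduces (the linked repo): the inner loop adapts a differentiable functional copy of the model, and gradients flow back through those inner steps to the outer meta-parameters. The model, losses and random task data below are placeholders of my own.

```python
# Sketch: a generalized inner loop with higher (MAML-style outer/inner optimization).
import torch
import torch.nn as nn
import higher

model = nn.Linear(10, 1)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)

x_train, y_train = torch.randn(16, 10), torch.randn(16, 1)   # toy support set
x_val, y_val = torch.randn(16, 10), torch.randn(16, 1)        # toy query set

meta_opt.zero_grad()
with higher.innerloop_ctx(model, inner_opt, copy_initial_weights=False) as (fmodel, diffopt):
    for _ in range(5):                                        # inner-loop adaptation steps
        inner_loss = nn.functional.mse_loss(fmodel(x_train), y_train)
        diffopt.step(inner_loss)
    # outer objective, differentiated through the whole inner loop
    meta_loss = nn.functional.mse_loss(fmodel(x_val), y_val)
    meta_loss.backward()
meta_opt.step()
```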

Generative Synthesis of Insurance Datasets

Title Generative Synthesis of Insurance Datasets
Authors Kevin Kuo
Abstract One of the impediments in advancing actuarial research and developing open source assets for insurance analytics is the lack of realistic publicly available datasets. In this work, we develop a workflow for synthesizing insurance datasets leveraging state-of-the-art neural network techniques. We evaluate the predictive modeling efficacy of datasets synthesized from publicly available data in the domains of general insurance pricing and life insurance shock lapse modeling. The trained synthesizers are able to capture representative characteristics of the real datasets. This workflow is implemented via an R interface to promote adoption by researchers and data owners.
Tasks
Published 2019-12-05
URL https://arxiv.org/abs/1912.02423v1
PDF https://arxiv.org/pdf/1912.02423v1.pdf
PWC https://paperswithcode.com/paper/generative-synthesis-of-insurance-datasets
Repo https://github.com/kasaai/ctgan
Framework none

Effective Aesthetics Prediction with Multi-level Spatially Pooled Features

Title Effective Aesthetics Prediction with Multi-level Spatially Pooled Features
Authors Vlad Hosu, Bastian Goldlucke, Dietmar Saupe
Abstract We propose an effective deep learning approach to aesthetics quality assessment that relies on a new type of pre-trained features, and apply it to the AVA data set, the currently largest aesthetics database. While previous approaches miss some of the information in the original images, due to taking small crops, down-scaling or warping the originals during training, we propose the first method that efficiently supports full resolution images as an input, and can be trained on variable input sizes. This allows us to significantly improve upon the state of the art, increasing the Spearman rank-order correlation coefficient (SRCC) with ground-truth mean opinion scores (MOS) from the best previously reported value of 0.612 to 0.756. To achieve this performance, we extract multi-level spatially pooled (MLSP) features from all convolutional blocks of a pre-trained InceptionResNet-v2 network, and train a custom shallow Convolutional Neural Network (CNN) architecture on these new features.
Tasks Aesthetics Quality Assessment, Image Quality Assessment
Published 2019-04-02
URL http://arxiv.org/abs/1904.01382v1
PDF http://arxiv.org/pdf/1904.01382v1.pdf
PWC https://paperswithcode.com/paper/effective-aesthetics-prediction-with-multi
Repo https://github.com/subpic/ava-mlsp
Framework tf
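A hedged tf.keras sketch of the MLSP idea from the entry above: globally pool features from multiple depths of a pre-trained InceptionResNet-v2 and train a small head on the concatenated vector. The paper pools every convolutional block, pre-extracts the features, and uses a custom shallow architecture; here a coarse subset of layers and a plain dense head stand in for both.

```python
# Sketch: multi-level spatially pooled features from a frozen InceptionResNet-v2.
import tensorflow as tf

base = tf.keras.applications.InceptionResNetV2(include_top=False, weights='imagenet')
base.trainable = False                       # features are fixed; only the head is trained

# take a spread of intermediate layers as a stand-in for "all convolutional blocks"
tap_layers = base.layers[50::100]
pooled = [tf.keras.layers.GlobalAveragePooling2D()(l.output) for l in tap_layers]
mlsp = tf.keras.layers.Concatenate()(pooled)  # multi-level spatially pooled feature vector

x = tf.keras.layers.Dense(512, activation='relu')(mlsp)
score = tf.keras.layers.Dense(1)(x)           # predicted mean opinion score
model = tf.keras.Model(base.input, score)
model.compile(optimizer='adam', loss='mse')
model.summary()
```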