April 1, 2020

3246 words 16 mins read

Paper Group NANR 39

Projected Canonical Decomposition for Knowledge Base Completion. Learning a Spatio-Temporal Embedding for Video Instance Segmentation. A Stochastic Derivative Free Optimization Method with Momentum. Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks. Non-linear System Identification from Partial Observations via Iterative Smo …

Projected Canonical Decomposition for Knowledge Base Completion

Title Projected Canonical Decomposition for Knowledge Base Completion
Authors Anonymous
Abstract The leading approaches to tensor completion and link prediction are based on the canonical polyadic (CP) decomposition of tensors. While these approaches were originally motivated by low-rank approximations, the best performance is usually obtained for ranks as high as permitted by computation constraints. For large-scale factorization problems where the factor dimensions have to be kept small, the performance of these approaches tends to drop drastically. The other main tensor factorization model, Tucker decomposition, is more flexible than CP for fixed factor dimensions, so we expect Tucker-based approaches to yield better performance under strong constraints on the number of parameters. However, as we show in this paper through experiments on standard benchmarks of link prediction in knowledge bases, ComplEx, a variant of CP, achieves performance similar to recent approaches based on Tucker decomposition on all operating points in terms of number of parameters. In a control experiment, we show that one problem in the practical application of Tucker decomposition to large-scale tensor completion comes from the adaptive optimization algorithms based on diagonal rescaling, such as Adagrad. We present a new algorithm for a constrained version of Tucker which implicitly applies Adagrad to a CP-based model with an additional projection of the embeddings onto a fixed lower-dimensional subspace. The resulting Tucker-style extension of ComplEx matches the best performance of ComplEx, with substantial gains on some datasets under constraints on the number of parameters.
Tasks Knowledge Base Completion, Link Prediction
Published 2020-01-01
URL https://openreview.net/forum?id=ByeAK1BKPB
PDF https://openreview.net/pdf?id=ByeAK1BKPB
PWC https://paperswithcode.com/paper/projected-canonical-decomposition-for
Repo
Framework
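
For readers unfamiliar with the scoring functions discussed above, the sketch below shows the standard ComplEx score (the real part of a trilinear product of complex embeddings) together with a hypothetical projection of the embeddings onto a fixed lower-dimensional subspace, in the spirit of the Tucker-style extension the abstract describes. The projection matrix `proj` and all dimensions are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, rank = 1000, 20, 64

# Complex-valued entity and relation embeddings, stored as complex arrays for clarity.
E = rng.standard_normal((n_entities, rank)) + 1j * rng.standard_normal((n_entities, rank))
R = rng.standard_normal((n_relations, rank)) + 1j * rng.standard_normal((n_relations, rank))

def complex_score(s, r, o):
    """ComplEx score: real part of the trilinear product <e_s, w_r, conj(e_o)>."""
    return np.real(np.sum(E[s] * R[r] * np.conj(E[o])))

# Hypothetical Tucker-style variant: project embeddings onto a fixed
# lower-dimensional subspace before scoring (illustration only).
proj = rng.standard_normal((rank, 16)) / np.sqrt(rank)

def projected_score(s, r, o):
    es, wr, eo = E[s] @ proj, R[r] @ proj, E[o] @ proj
    return np.real(np.sum(es * wr * np.conj(eo)))

print(complex_score(3, 5, 42), projected_score(3, 5, 42))
```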

Learning a Spatio-Temporal Embedding for Video Instance Segmentation

Title Learning a Spatio-Temporal Embedding for Video Instance Segmentation
Authors Anonymous
Abstract Understanding object motion is one of the core problems in computer vision. It requires segmenting and tracking objects over time. Significant progress has been made in instance segmentation, but such models cannot track objects, and more crucially, they are unable to reason in both 3D space and time. We propose a new spatio-temporal embedding loss on videos that generates temporally consistent video instance segmentation. Our model includes a temporal network that learns to model temporal context and motion, which is essential to produce smooth embeddings over time. Further, our model also estimates monocular depth, with a self-supervised loss, as the relative distance to an object effectively constrains where it can be next, ensuring a time-consistent embedding. Finally, we show that our model can accurately track and segment instances, even with occlusions and missed detections, advancing the state-of-the-art on the KITTI Multi-Object and Tracking Dataset.
Tasks Instance Segmentation, Semantic Segmentation
Published 2020-01-01
URL https://openreview.net/forum?id=HyxTJxrtvr
PDF https://openreview.net/pdf?id=HyxTJxrtvr
PWC https://paperswithcode.com/paper/learning-a-spatio-temporal-embedding-for
Repo
Framework
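
The abstract does not spell out the embedding loss, so the snippet below is only a generic pull/push instance-embedding loss (pixels are pulled toward their instance mean, and instance means are pushed apart), a common formulation that such models build on; the margins `delta_pull`/`delta_push`, and the loss itself, are assumptions rather than the authors' spatio-temporal loss.

```python
import numpy as np

def instance_embedding_loss(emb, inst_ids, delta_pull=0.5, delta_push=1.5):
    """Pull pixels of an instance toward its mean embedding; push instance means apart.
    emb: (N, D) per-pixel embeddings, inst_ids: (N,) integer instance labels."""
    ids = np.unique(inst_ids)
    means = np.stack([emb[inst_ids == i].mean(axis=0) for i in ids])
    # Pull (variance) term: distance of each pixel to its instance mean, with a margin.
    pull = 0.0
    for k, i in enumerate(ids):
        d = np.linalg.norm(emb[inst_ids == i] - means[k], axis=1)
        pull += np.mean(np.maximum(d - delta_pull, 0.0) ** 2)
    pull /= len(ids)
    # Push (distance) term: keep different instance means at least 2 * delta_push apart.
    push = 0.0
    if len(ids) > 1:
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                d = np.linalg.norm(means[a] - means[b])
                push += np.maximum(2 * delta_push - d, 0.0) ** 2
        push /= len(ids) * (len(ids) - 1) / 2
    return pull + push

emb = np.random.default_rng(0).standard_normal((100, 8))
inst = np.repeat(np.arange(4), 25)
print(instance_embedding_loss(emb, inst))
```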

A Stochastic Derivative Free Optimization Method with Momentum

Title A Stochastic Derivative Free Optimization Method with Momentum
Authors Anonymous
Abstract We consider the problem of unconstrained minimization of a smooth objective function in $\mathbb{R}^d$ in a setting where only function evaluations are possible. We propose and analyze a stochastic zeroth-order method with heavy-ball momentum. In particular, we propose SMTP, a momentum version of the stochastic three-point method (STP) of Bergou et al. (2019). We show new complexity results for non-convex, convex and strongly convex functions. We test our method on a collection of continuous control tasks on several MuJoCo (Todorov et al., 2012) environments of varying difficulty and compare against STP, other state-of-the-art derivative-free optimization algorithms and policy gradient methods. SMTP significantly outperforms STP and all other methods that we considered in our numerical experiments. Our second contribution is SMTP with importance sampling, which we call SMTP_IS. We provide a convergence analysis of this method for non-convex, convex and strongly convex objectives.
Tasks Continuous Control, Policy Gradient Methods
Published 2020-01-01
URL https://openreview.net/forum?id=HylAoJSKvH
PDF https://openreview.net/pdf?id=HylAoJSKvH
PWC https://paperswithcode.com/paper/a-stochastic-derivative-free-optimization
Repo
Framework
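
As a rough illustration of a derivative-free three-point step with heavy-ball momentum, the sketch below samples a random direction, forms two momentum-augmented candidates, and keeps whichever of the three points has the lowest function value. The step size, momentum handling and direction sampling are assumptions; the SMTP update in the paper may differ in detail.

```python
import numpy as np

def smtp_like(f, x0, step=0.1, beta=0.5, iters=500, seed=0):
    """Derivative-free step with momentum: sample a random unit direction,
    try moving along -(beta*v + s) and -(beta*v - s), and keep whichever of
    the three candidate points (including staying put) has the lowest value."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    v = np.zeros_like(x)
    for _ in range(iters):
        s = rng.standard_normal(x.shape)
        s /= np.linalg.norm(s)
        v_plus, v_minus = beta * v + s, beta * v - s
        cands = [(f(x), x, beta * v),
                 (f(x - step * v_plus), x - step * v_plus, v_plus),
                 (f(x - step * v_minus), x - step * v_minus, v_minus)]
        _, x, v = min(cands, key=lambda t: t[0])
    return x

# Example: minimize a quadratic without gradients.
print(smtp_like(lambda z: np.sum(z ** 2), np.ones(10)))
```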

Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks

Title Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks
Authors Anonymous
Abstract High-performance Deep Neural Networks (DNNs) are increasingly deployed in many real-world applications, e.g., cloud prediction APIs. Recent advances in model functionality stealing attacks via black-box access (i.e., inputs in, predictions out) threaten the business model of such applications, which require a lot of time, money, and effort to develop. Existing defenses take a passive role against stealing attacks, such as by truncating predicted information. We find such passive defenses ineffective against DNN stealing attacks. In this paper, we propose the first defense which actively perturbs predictions, targeted at poisoning the training objective of the attacker. We find our defense effective across a wide range of challenging datasets and DNN model stealing attacks, and it additionally outperforms existing defenses. Our defense is the first that can withstand highly accurate model stealing attacks for tens of thousands of queries, amplifying the attacker’s error rate up to a factor of 85$\times$ with minimal impact on the utility for benign users.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=SyevYxHtDB
PDF https://openreview.net/pdf?id=SyevYxHtDB
PWC https://paperswithcode.com/paper/prediction-poisoning-towards-defenses-against
Repo
Framework
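
The defense is described only at a high level, so the snippet below is merely a toy illustration of actively perturbing returned probabilities while preserving the top-1 label for benign users; the mixing scheme and `eps` are hypothetical and do not reproduce the paper's poisoning objective.

```python
import numpy as np

def perturb_posterior(p, eps=0.5, seed=0):
    """Toy active defense: return a perturbed probability vector that keeps the
    original argmax (so benign top-1 accuracy is unchanged) but distorts the
    soft information an attacker would use as training targets."""
    rng = np.random.default_rng(seed)
    top = int(np.argmax(p))
    noise = rng.dirichlet(np.ones_like(p))
    q = (1 - eps) * p + eps * noise          # mix with a random distribution
    q[top] = q.max() + 1e-6                  # preserve the top-1 label
    return q / q.sum()

p = np.array([0.7, 0.2, 0.1])
print(perturb_posterior(p))
```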

Non-linear System Identification from Partial Observations via Iterative Smoothing and Learning

Title Non-linear System Identification from Partial Observations via Iterative Smoothing and Learning
Authors Anonymous
Abstract System identification is the process of building a mathematical model of an unknown system from measurements of its inputs and outputs. It is a key step for model-based control, estimator design, and output prediction. This work presents an algorithm for non-linear offline system identification from partial observations, i.e. situations in which the system’s full-state is not directly observable. The algorithm presented, called SISL, iteratively infers the system’s full state through non-linear optimization and then updates the model parameters. We test our algorithm on a simulated system of coupled Lorenz attractors, showing our algorithm’s ability to identify high-dimensional systems that prove intractable for particle-based approaches. We also use SISL to identify the dynamics of an aerobatic helicopter. By augmenting the state with unobserved fluid states, we learn a model that predicts the acceleration of the helicopter better than state-of-the-art approaches.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=B1gR3ANFPS
PDF https://openreview.net/pdf?id=B1gR3ANFPS
PWC https://paperswithcode.com/paper/non-linear-system-identification-from-partial
Repo
Framework
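
The alternating structure described above (infer the full state by smoothing, then refit the dynamics parameters) can be illustrated on a toy scalar system; everything below, including the noise weights and the closed-form parameter update, is an assumption made for the example, not the SISL algorithm itself.

```python
import numpy as np
from scipy.optimize import minimize

# Toy partially observed system: x_{t+1} = a * x_t, y_t = x_t + noise,
# where only y is observed and the scalar parameter a is unknown.
rng = np.random.default_rng(0)
a_true, T = 0.9, 50
x = np.empty(T)
x[0] = 1.0
for t in range(T - 1):
    x[t + 1] = a_true * x[t]
y = x + 0.05 * rng.standard_normal(T)

def smooth_states(a, y):
    """Smoothing step: infer the state trajectory given fixed dynamics by
    trading off dynamics residuals against measurement residuals."""
    def cost(xs):
        dyn = xs[1:] - a * xs[:-1]
        meas = y - xs
        return 100.0 * np.sum(dyn ** 2) + 400.0 * np.sum(meas ** 2)
    return minimize(cost, y.copy()).x

def fit_dynamics(xs):
    """Learning step: least-squares fit of the parameter to the smoothed states."""
    return np.dot(xs[:-1], xs[1:]) / np.dot(xs[:-1], xs[:-1])

a = 0.5
for _ in range(10):          # iterate smoothing and learning
    xs = smooth_states(a, y)
    a = fit_dynamics(xs)
print("estimated a:", a)
```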

Continuous Graph Flow

Title Continuous Graph Flow
Authors Anonymous
Abstract In this paper, we propose Continuous Graph Flow, a generative continuous flow based method that aims to model complex distributions of graph-structured data. Once learned, the model can be applied to an arbitrary graph, defining a probability density over the random variables represented by the graph. It is formulated as an ordinary differential equation system with shared and reusable functions that operate over the graphs. This leads to a new type of neural graph message passing scheme that performs continuous message passing over time. This class of models offers several advantages: a flexible representation that can generalize to variable data dimensions; ability to model dependencies in complex data distributions; reversible and memory-efficient; and exact and efficient computation of the likelihood of the data. We demonstrate the effectiveness of our model on a diverse set of generation tasks across different domains: graph generation, image puzzle generation, and layout generation from scene graphs. Our proposed model achieves significantly better performance compared to state-of-the-art models.
Tasks Graph Generation
Published 2020-01-01
URL https://openreview.net/forum?id=BkgZSCEtvr
PDF https://openreview.net/pdf?id=BkgZSCEtvr
PWC https://paperswithcode.com/paper/continuous-graph-flow
Repo
Framework
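
A minimal sketch of continuous message passing, assuming node states that evolve under an ODE whose right-hand side aggregates messages from graph neighbours; the tiny fixed message map `W` and the example graph are illustrative only, and the sketch omits the change-of-variables term needed for exact likelihoods.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
n, d = 4, 3
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], float)
W = rng.standard_normal((2 * d, d)) * 0.1   # tiny fixed message map (stand-in for a learned network)

def message(xi, xj):
    return np.tanh(np.concatenate([xi, xj]) @ W)

def dynamics(t, flat):
    """dx_i/dt = mean over neighbours j of message(x_i, x_j)."""
    x = flat.reshape(n, d)
    dx = np.zeros_like(x)
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                dx[i] += message(x[i], x[j])
        dx[i] /= max(adj[i].sum(), 1.0)
    return dx.ravel()

x0 = rng.standard_normal((n, d))
sol = solve_ivp(dynamics, (0.0, 1.0), x0.ravel())
print(sol.y[:, -1].reshape(n, d))   # node states after continuous message passing
```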

Multichannel Generative Language Models

Title Multichannel Generative Language Models
Authors Anonymous
Abstract A channel corresponds to a viewpoint or transformation of an underlying meaning. A pair of parallel sentences in English and French express the same underlying meaning but through two separate channels corresponding to their languages. In this work, we present Multichannel Generative Language Models (MGLM), which models the joint distribution over multiple channels, and all its decompositions, using a single neural network. MGLM can be trained by feeding it k-way parallel data, bilingual data, or monolingual data across pre-determined channels. MGLM is capable of both conditional generation and unconditional sampling. For conditional generation, the model is given a fully observed channel, and generates the k-1 channels in parallel. In the case of machine translation, this is akin to giving it one source, and the model generates k-1 targets. MGLM can also do partial conditional sampling, where the channels are seeded with prespecified words, and the model is asked to infill the rest. Finally, we can sample from MGLM unconditionally over all k channels. Our experiments on the Multi30K dataset containing English, French, Czech, and German languages suggest that the multitask training with the joint objective leads to improvements in bilingual translations. We provide a quantitative analysis of the quality-diversity trade-offs for different variants of the multichannel model for conditional generation, and a measurement of self-consistency during unconditional generation. We provide qualitative examples for parallel greedy decoding across languages and sampling from the joint distribution of the 4 languages.
Tasks Machine Translation
Published 2020-01-01
URL https://openreview.net/forum?id=r1xQNlBYPS
PDF https://openreview.net/pdf?id=r1xQNlBYPS
PWC https://paperswithcode.com/paper/multichannel-generative-language-models
Repo
Framework

Training Neural Networks for and by Interpolation

Title Training Neural Networks for and by Interpolation
Authors Anonymous
Abstract In modern supervised learning, many deep neural networks are able to interpolate the data: the empirical loss can be driven to near zero on all samples simultaneously. In this work, we explicitly exploit this interpolation property for the design of a new optimization algorithm for deep learning. Specifically, we use it to compute an adaptive learning-rate in closed form at each iteration. This results in the Adaptive Learning-rates for Interpolation with Gradients (ALI-G) algorithm. ALI-G retains the main advantage of SGD which is a low computational cost per iteration. But unlike SGD, the learning-rate of ALI-G uses a single constant hyper-parameter and does not require a decay schedule, which makes it considerably easier to tune. We provide convergence guarantees of ALI-G in the stochastic convex setting. Notably, all our convergence results tackle the realistic case where the interpolation property is satisfied up to some tolerance. We provide experiments on a variety of architectures and tasks: (i) learning a differentiable neural computer; (ii) training a wide residual network on the SVHN data set; (iii) training a Bi-LSTM on the SNLI data set; and (iv) training wide residual networks and densely connected networks on the CIFAR data sets. ALI-G produces state-of-the-art results among adaptive methods, and even yields comparable performance with SGD, which requires manually tuned learning-rate schedules. Furthermore, ALI-G is simple to implement in any standard deep learning framework and can be used as a drop-in replacement in existing code.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJevJCVYvB
PDF https://openreview.net/pdf?id=BJevJCVYvB
PWC https://paperswithcode.com/paper/training-neural-networks-for-and-by-1
Repo
Framework
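
A minimal sketch of the closed-form adaptive step the abstract describes: under interpolation, the per-sample loss can be driven toward zero, so the step size is the current loss divided by the squared gradient norm, clipped at a single maximal learning rate. Variable names and the small constant `delta` are assumptions; the exact update in the paper may include refinements (e.g. projections for constrained parameters).

```python
import numpy as np

def alig_step(w, loss_value, grad, max_lr=0.1, delta=1e-5):
    """Closed-form adaptive step: move far enough to (approximately) drive the
    current sample loss to zero under a local linear model, clipped at max_lr."""
    lr = min(loss_value / (np.dot(grad, grad) + delta), max_lr)
    return w - lr * grad

# Example on a single least-squares sample.
w = np.zeros(3)
x_i, y_i = np.array([1.0, 2.0, -1.0]), 2.0
for _ in range(20):
    err = w @ x_i - y_i
    loss, grad = 0.5 * err ** 2, err * x_i
    w = alig_step(w, loss, grad)
print(w @ x_i)   # approaches y_i, i.e. the sample is interpolated
```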

Learning Neural Causal Models from Unknown Interventions

Title Learning Neural Causal Models from Unknown Interventions
Authors Anonymous
Abstract Meta-learning over a set of distributions can be interpreted as learning different types of parameters corresponding to short-term vs long-term aspects of the mechanisms underlying the generation of data. These are respectively captured by quickly-changing parameters and slowly-changing meta-parameters. We present a new framework for meta-learning causal models where the relationship between each variable and its parents is modeled by a neural network, modulated by structural meta-parameters which capture the overall topology of a directed graphical model. Our approach avoids a discrete search over models in favour of a continuous optimization procedure. We study a setting where interventional distributions are induced as a result of a random intervention on a single unknown variable of an unknown ground truth causal model, and the observations arising after such an intervention constitute one meta-example. To disentangle the slow-changing aspects of each conditional from the fast-changing adaptations to each intervention, we parametrize the neural network into fast parameters and slow meta-parameters. We introduce a meta-learning objective that favours solutions robust to frequent but sparse interventional distribution change, and which generalize well to previously unseen interventions. Optimizing this objective is shown experimentally to recover the structure of the causal graph. Finally, we find that when the learner is unaware of the intervention variable, it is able to infer that information, improving results further and focusing the parameter and meta-parameter updates where needed.
Tasks Meta-Learning
Published 2020-01-01
URL https://openreview.net/forum?id=H1gN6kSFwS
PDF https://openreview.net/pdf?id=H1gN6kSFwS
PWC https://paperswithcode.com/paper/learning-neural-causal-models-from-unknown
Repo
Framework

Sliced Cramer Synaptic Consolidation for Preserving Deeply Learned Representations

Title Sliced Cramer Synaptic Consolidation for Preserving Deeply Learned Representations
Authors Anonymous
Abstract Deep neural networks suffer from the inability to preserve the learned data representation (i.e., catastrophic forgetting) in domains where the input data distribution is non-stationary, and it changes during training. Various selective synaptic plasticity approaches have been recently proposed to preserve network parameters, which are crucial for previously learned tasks while learning new tasks. We explore such selective synaptic plasticity approaches through a unifying lens of memory replay and show the close relationship between methods like Elastic Weight Consolidation (EWC) and Memory-Aware-Synapses (MAS). We then propose a fundamentally different class of preservation methods that aim at preserving the distribution of internal neural representations for previous tasks while learning a new one. We propose the sliced Cramér distance as a suitable choice for such preservation and evaluate our Sliced Cramer Preservation (SCP) algorithm through extensive empirical investigations on various network architectures in both supervised and unsupervised learning settings. We show that SCP consistently utilizes the learning capacity of the network better than online-EWC and MAS methods on various incremental learning tasks.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJge3TNKwH
PDF https://openreview.net/pdf?id=BJge3TNKwH
PWC https://paperswithcode.com/paper/sliced-cramer-synaptic-consolidation-for
Repo
Framework
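
To make the distance concrete, here is a small sketch of a sliced Cramér distance between two sets of activations: samples are projected onto random unit directions and the squared area between the 1D empirical CDFs is averaged over slices. The number of slices and the discretization are assumptions made for the example, not the paper's exact estimator.

```python
import numpy as np

def cramer_1d(a, b):
    """Squared Cramér distance between 1D samples: integral of (F_a(x) - F_b(x))^2 dx,
    approximated on the merged sorted support."""
    xs = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), xs, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), xs, side="right") / len(b)
    return np.sum((Fa[:-1] - Fb[:-1]) ** 2 * np.diff(xs))

def sliced_cramer(X, Y, n_slices=50, seed=0):
    """Average 1D Cramér distance over random projection directions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_slices):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        total += cramer_1d(X @ theta, Y @ theta)
    return total / n_slices

# Example: distance between old and new feature activations of a layer.
old = np.random.default_rng(1).standard_normal((256, 32))
new = old + 0.1
print(sliced_cramer(old, new))
```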

Hierarchical Disentangle Network for Object Representation Learning

Title Hierarchical Disentangle Network for Object Representation Learning
Authors Anonymous
Abstract An object can be described as the combination of primary visual attributes. Disentangling such underlying primitives is a long-standing objective of representation learning. It is observed that categories have natural multi-granularity or hierarchical characteristics, i.e. any two objects can share some common primitives in a particular category granularity while they may possess their unique ones in another granularity. However, previous works usually operate in a flat manner (i.e. in a particular granularity) to disentangle the representations of objects. Though they may obtain the primitives that constitute objects as the categories in that granularity, their results are neither efficient nor complete. In this paper, we propose the hierarchical disentangle network (HDN) to exploit the rich hierarchical characteristics among categories to divide the disentangling process in a coarse-to-fine manner, such that each level only focuses on learning the specific representations in its granularity and finally the common and unique representations in all granularities jointly constitute the raw object. Specifically, HDN is designed based on an encoder-decoder architecture. To simultaneously ensure the disentanglement and interpretability of the encoded representations, a novel hierarchical generative adversarial network (GAN) is elaborately designed. Quantitative and qualitative evaluations on four object datasets validate the effectiveness of our method.
Tasks Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=rkg8xTEtvB
PDF https://openreview.net/pdf?id=rkg8xTEtvB
PWC https://paperswithcode.com/paper/hierarchical-disentangle-network-for-object
Repo
Framework

In-Domain Representation Learning For Remote Sensing

Title In-Domain Representation Learning For Remote Sensing
Authors Anonymous
Abstract Given the importance of remote sensing, surprisingly little attention has been paid to it by the representation learning community. To address it and to speed up innovation in this domain, we provide simplified access to 5 diverse remote sensing datasets in a standardized form. We specifically explore in-domain representation learning and address the question of “what characteristics should a dataset have to be a good source for remote sensing representation learning”. The established baselines achieve state-of-the-art performance on these datasets.
Tasks Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=BJx_JAVKDB
PDF https://openreview.net/pdf?id=BJx_JAVKDB
PWC https://paperswithcode.com/paper/in-domain-representation-learning-for-remote
Repo
Framework

Knowledge Hypergraphs: Prediction Beyond Binary Relations

Title Knowledge Hypergraphs: Prediction Beyond Binary Relations
Authors Anonymous
Abstract A Knowledge Hypergraph is a knowledge base where relations are defined on two or more entities. In this work, we introduce two embedding-based models that perform link prediction in knowledge hypergraphs: (1) HSimplE is a shift-based method that is inspired by an existing model operating on knowledge graphs, in which the representation of an entity is a function of its position in the relation, and (2) HypE is a convolution-based method which disentangles the representation of an entity from its position in the relation. We test our models on two new knowledge hypergraph datasets that we obtain from Freebase, and show that both HSimplE and HypE are more effective in predicting links in knowledge hypergraphs than the proposed baselines and existing methods. Our experiments show that HypE outperforms HSimplE when trained with fewer parameters and when tested on samples that contain at least one entity in a position never encountered during training.
Tasks Knowledge Graphs, Link Prediction
Published 2020-01-01
URL https://openreview.net/forum?id=ryxIZR4tvS
PDF https://openreview.net/pdf?id=ryxIZR4tvS
PWC https://paperswithcode.com/paper/knowledge-hypergraphs-prediction-beyond
Repo
Framework
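
As a loose illustration of position-dependent scoring for facts with more than two entities, the sketch below rotates each entity embedding by an amount determined by its position in the relation and combines everything multiplicatively; this is only a stand-in for the shift-based idea mentioned above, not the actual HSimplE or HypE score.

```python
import numpy as np

def shift(v, k):
    """Circularly shift an embedding; the shift amount encodes the position."""
    return np.roll(v, k)

def nary_score(rel_emb, entity_embs):
    """Toy score for an n-ary fact: each entity embedding is rotated by a
    position-dependent amount, then the relation and all shifted entity
    embeddings are combined elementwise and summed."""
    d = len(rel_emb)
    prod = rel_emb.copy()
    for pos, e in enumerate(entity_embs):
        prod = prod * shift(e, pos * d // max(len(entity_embs), 1))
    return float(np.sum(prod))

rng = np.random.default_rng(0)
r = rng.standard_normal(8)
entities = [rng.standard_normal(8) for _ in range(3)]   # a ternary fact
print(nary_score(r, entities))
```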

MaskConvNet: Training Efficient ConvNets from Scratch via Budget-constrained Filter Pruning

Title MaskConvNet: Training Efficient ConvNets from Scratch via Budget-constrained Filter Pruning
Authors Raden Mu’az Mun’im, Jie Lin, Vijay Chandrasekhar, Koichi Shinoda
Abstract In this paper, we propose a framework, called MaskConvNet, for ConvNets filter pruning. MaskConvNet provides elegant support for training budget-aware pruned networks from scratch, by adding a simple mask module to a ConvNet architecture. MaskConvNet enjoys several advantages - (1) Flexible, the mask module can be integrated with any ConvNets in a plug-and-play manner. (2) Simple, the mask module is implemented by a hard Sigmoid function with a small number of trainable mask variables, adding negligible memory and computational overheads to the networks during training. (3) Effective, it is able to achieve competitive pruning rate while maintaining comparable accuracy with the baseline ConvNets without pruning, regardless of the datasets and ConvNet architectures used. (4) Fast, it is observed that the number of training epochs required by MaskConvNet is close to training a baseline without pruning. (5) Budget-aware, with a sparsity budget on a target metric (e.g. model size and FLOP), MaskConvNet is able to train in a way that the optimizer can adaptively sparsify the network and automatically maintain the sparsity level, until the pruned network produces good accuracy and fulfills the budget constraint simultaneously. Results on CIFAR-10 and ImageNet with several ConvNet architectures show that MaskConvNet works competitively well compared to previous pruning methods, with the budget constraint well respected. Code is available at https://www.dropbox.com/s/c4zi3n7h1bexl12/maskconv-iclr-code.zip?dl=0. We hope MaskConvNet, as a simple and general pruning framework, can address the gaps in the existing literature and advance future studies to push the boundaries of neural network pruning.
Tasks Network Pruning
Published 2020-01-01
URL https://openreview.net/forum?id=S1gyl6Vtvr
PDF https://openreview.net/pdf?id=S1gyl6Vtvr
PWC https://paperswithcode.com/paper/maskconvnet-training-efficient-convnets-from
Repo
Framework
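
A small sketch of the mask-module idea: a per-filter trainable variable passed through a hard sigmoid scales each output channel, and a budget term penalizes keeping more filters than allowed. The hard-sigmoid slope/offset and the penalty form are assumptions, not the paper's exact formulation.

```python
import numpy as np

def hard_sigmoid(v):
    """Piecewise-linear gate in [0, 1]; the exact slope/offset are an assumption."""
    return np.clip(0.2 * v + 0.5, 0.0, 1.0)

def masked_conv_forward(feature_maps, mask_vars):
    """Filter-mask module: each output channel is scaled by a trainable gate;
    channels whose gate reaches 0 are effectively pruned.
    feature_maps: (C, H, W) conv-layer output, mask_vars: (C,) trainable variables."""
    m = hard_sigmoid(mask_vars)
    return feature_maps * m[:, None, None], m

def budget_penalty(m, budget_fraction):
    """Penalize exceeding a sparsity budget on the fraction of kept filters."""
    return max(m.sum() / len(m) - budget_fraction, 0.0)

feats = np.random.default_rng(0).standard_normal((8, 16, 16))
mask_vars = np.linspace(-3, 3, 8)
out, m = masked_conv_forward(feats, mask_vars)
print(m, budget_penalty(m, 0.5))
```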

The Intriguing Effects of Focal Loss on the Calibration of Deep Neural Networks

Title The Intriguing Effects of Focal Loss on the Calibration of Deep Neural Networks
Authors Anonymous
Abstract Miscalibration – a mismatch between a model’s confidence and its correctness – of Deep Neural Networks (DNNs) makes their predictions hard for downstream components to trust. Ideally, we want networks to be accurate, calibrated and confident. Temperature scaling, the most popular calibration approach, will calibrate a DNN without affecting its accuracy, but it will also make its correct predictions under-confident. In this paper, we show that replacing the widely used cross-entropy loss with focal loss allows us to learn models that are already very well calibrated. When combined with temperature scaling, focal loss, whilst preserving accuracy and yielding state-of-the-art calibrated models, also preserves the confidence of the model’s correct predictions, which is extremely desirable for downstream tasks. We provide a thorough analysis of the factors causing miscalibration, and use the insights we glean from this to theoretically justify the empirically excellent performance of focal loss. We perform extensive experiments on a variety of computer vision (CIFAR-10/100) and NLP (SST, 20 Newsgroup) datasets, and with a wide variety of different network architectures, and show that our approach achieves state-of-the-art accuracy and calibration in almost all cases.
Tasks Calibration
Published 2020-01-01
URL https://openreview.net/forum?id=SJxTZeHFPH
PDF https://openreview.net/pdf?id=SJxTZeHFPH
PWC https://paperswithcode.com/paper/the-intriguing-effects-of-focal-loss-on-the
Repo
Framework
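
For reference, the focal loss for one sample and logit temperature scaling look as follows; the focusing parameter `gamma` and the example temperature are the only free choices here.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with optional temperature; dividing logits by T > 1 softens the
    distribution without changing the argmax (so accuracy is unaffected)."""
    z = np.asarray(z, float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def focal_loss(logits, label, gamma=2.0):
    """Focal loss for one sample: -(1 - p_y)^gamma * log(p_y).
    With gamma = 0 this reduces to the standard cross-entropy loss."""
    p = softmax(logits)[label]
    return -((1.0 - p) ** gamma) * np.log(p)

logits, label = np.array([2.0, 0.5, -1.0]), 0
print(focal_loss(logits, label), softmax(logits, T=1.5))
```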