Paper Group NANR 42
Improved Structural Discovery and Representation Learning of Multi-Agent Data. Extreme Language Model Compression with Optimal Subwords and Shared Projections. Feature Map Transform Coding for Energy-Efficient CNN Inference. Convergence Analysis of a Momentum Algorithm with Adaptive Step Size for Nonconvex Optimization. Deep Spike Decoder (DSD). Ad …
Improved Structural Discovery and Representation Learning of Multi-Agent Data
Title | Improved Structural Discovery and Representation Learning of Multi-Agent Data |
Authors | Anonymous |
Abstract | Central to all machine learning algorithms is data representation. For multi-agent systems, selecting a representation which adequately captures the interactions among agents is challenging due to the latent group structure, which tends to vary with context. However, in multi-agent systems with strong group structure, we can simultaneously learn this structure and map a set of agents to a consistently ordered representation for further learning. In this paper, we present a dynamic alignment method that provides a robust ordering of structured multi-agent data, allowing representation learning to occur in a fraction of the time required by previous methods. We demonstrate the value of this approach using a large amount of soccer tracking data from a professional league. |
Tasks | Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1ervR4FwH |
https://openreview.net/pdf?id=H1ervR4FwH | |
PWC | https://paperswithcode.com/paper/improved-structural-discovery-and |
Repo | |
Framework | |
Extreme Language Model Compression with Optimal Subwords and Shared Projections
Title | Extreme Language Model Compression with Optimal Subwords and Shared Projections |
Authors | Anonymous |
Abstract | Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model’s memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies different from the original teacher models. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT-BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques. |
Tasks | Language Modelling, Model Compression, Word Embeddings |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1x6ueSKPr |
https://openreview.net/pdf?id=S1x6ueSKPr | |
PWC | https://paperswithcode.com/paper/extreme-language-model-compression-with |
Repo | |
Framework | |
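As a rough illustration of the embedding-size argument in the abstract above, the sketch below computes the memory footprint of an input embedding table. The teacher figures use the standard BERT-BASE WordPiece vocabulary (30,522 tokens) and hidden size (768). The student vocabulary and dimension are hypothetical values chosen for illustration, not the settings reported in the paper, and 4 bytes per parameter (float32 storage) is also an assumption.

```python
def embedding_megabytes(vocab_size: int, embed_dim: int, bytes_per_param: int = 4) -> float:
    """Size of a vocab_size x embed_dim embedding table in megabytes (10**6 bytes)."""
    return vocab_size * embed_dim * bytes_per_param / 1e6


# BERT-BASE: 30,522-token WordPiece vocabulary, 768-dimensional embeddings.
teacher_mb = embedding_megabytes(30522, 768)

# Hypothetical student settings (assumed for illustration only).
student_mb = embedding_megabytes(5000, 48)

print(f"teacher embeddings: {teacher_mb:.1f} MB, student embeddings: {student_mb:.2f} MB")
```

Shrinking both the vocabulary and the embedding dimension reduces the table size multiplicatively, which is why the abstract targets both at once.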
Feature Map Transform Coding for Energy-Efficient CNN Inference
Title | Feature Map Transform Coding for Energy-Efficient CNN Inference |
Authors | Anonymous |
Abstract | Convolutional neural networks (CNNs) achieve state-of-the-art accuracy in a variety of tasks in computer vision and beyond. One of the major obstacles hindering the ubiquitous use of CNNs for inference on low-power edge devices is their high computational complexity and memory bandwidth requirements. The latter often dominates the energy footprint on modern hardware. In this paper, we introduce a lossy transform coding approach, inspired by image and video compression, designed to reduce the memory bandwidth due to the storage of intermediate activation calculation results. Our method does not require fine-tuning the network weights and halves the data transfer volumes to the main memory by compressing feature maps, which are highly correlated, with variable length coding. Our method outperforms the previous approach in terms of the number of bits per value, with minor accuracy degradation on ResNet-34 and MobileNetV2. We analyze the performance of our approach on a variety of CNN architectures and demonstrate that an FPGA implementation of ResNet-18 with our approach results in a reduction of around 40% in the memory energy footprint, compared to a quantized network, with negligible impact on accuracy. When allowing accuracy degradation of up to 2%, a reduction of 60% is achieved. A reference implementation accompanies the paper. |
Tasks | Video Compression |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJeTCAEtDB |
https://openreview.net/pdf?id=BJeTCAEtDB | |
PWC | https://paperswithcode.com/paper/feature-map-transform-coding-for-energy-1 |
Repo | |
Framework | |
Convergence Analysis of a Momentum Algorithm with Adaptive Step Size for Nonconvex Optimization
Title | Convergence Analysis of a Momentum Algorithm with Adaptive Step Size for Nonconvex Optimization |
Authors | Anonymous |
Abstract | Although Adam is a very popular algorithm for optimizing the weights of neural networks, it has been recently shown that it can diverge even in simple convex optimization examples. Therefore, several variants of Adam have been proposed to circumvent this convergence issue. In this work, we study the algorithm for smooth nonconvex optimization under a boundedness assumption on the adaptive learning rate. The bound on the adaptive step size depends on the Lipschitz constant of the gradient of the objective function and provides safe theoretical adaptive step sizes. Under this boundedness assumption, we show a novel first order convergence rate result in both deterministic and stochastic contexts. Furthermore, we establish convergence rates of the function value sequence using the Kurdyka-Lojasiewicz property. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SyeYiyHFDH |
https://openreview.net/pdf?id=SyeYiyHFDH | |
PWC | https://paperswithcode.com/paper/convergence-analysis-of-a-momentum-algorithm |
Repo | |
Framework | |
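The boundedness assumption described in the abstract above can be pictured with a small sketch: an Adam-like momentum update whose per-coordinate step size is clipped to a fixed interval tied to the Lipschitz constant of the gradient. The function name, the choice of `1 / lipschitz` as the upper bound, and the lower bound of one tenth of that value are all assumptions made for illustration. They are not the bounds derived in the paper.

```python
import math


def clipped_adaptive_momentum(grad, x0, lipschitz, steps=500,
                              base_lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum update with a per-coordinate adaptive step clipped to [lower, upper]."""
    upper = 1.0 / lipschitz   # assumed safe upper bound on the step size
    lower = 0.1 * upper       # assumed lower bound
    x = list(x0)
    m = [0.0] * len(x)        # first-moment (momentum) estimate
    v = [0.0] * len(x)        # second-moment estimate
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            raw_step = base_lr / (math.sqrt(v[i]) + eps)
            step = min(max(raw_step, lower), upper)  # enforce boundedness
            x[i] -= step * m[i]
    return x


# Usage: minimise f(x) = 0.5 * (x0**2 + 4 * x1**2), whose gradient is 4-Lipschitz.
solution = clipped_adaptive_momentum(lambda x: [x[0], 4.0 * x[1]], [1.0, 1.0], lipschitz=4.0)
```

Clipping keeps the effective step inside a range where a standard smoothness argument applies, regardless of how small the second-moment estimate becomes.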
Deep Spike Decoder (DSD)
Title | Deep Spike Decoder (DSD) |
Authors | Anonymous |
Abstract | Spike-sorting is of central importance for neuroscience research. We introduce a novel spike-sorting method comprising a deep autoencoder trained end-to-end with a biophysical generative model, biophysically motivated priors, and a self-supervised loss function. The encoder infers the action potential event times for each source, while the decoder parameters represent each source’s spatiotemporal response waveform. We evaluate this approach in the context of real and synthetic multi-channel surface electromyography (sEMG) data, a noisy superposition of motor unit action potentials (MUAPs). Relative to an established spike-sorting method, this autoencoder-based approach shows superior recovery of source waveforms and event times. Moreover, the biophysical nature of the loss functions facilitates interpretability and hyperparameter tuning. Overall, these results demonstrate the efficacy and motivate further development of self-supervised spike sorting techniques. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1eZOeBKDS |
https://openreview.net/pdf?id=S1eZOeBKDS | |
PWC | https://paperswithcode.com/paper/deep-spike-decoder-dsd |
Repo | |
Framework | |
Additive Powers-of-Two Quantization: A Non-uniform Discretization for Neural Networks
Title | Additive Powers-of-Two Quantization: A Non-uniform Discretization for Neural Networks |
Authors | Anonymous |
Abstract | We propose Additive Powers-of-Two (APoT) quantization, an efficient nonuniform quantization scheme that attends to the bell-shaped and long-tailed distribution of weights in neural networks. By constraining all quantization levels to be sums of several Powers-of-Two terms, APoT quantization enjoys overwhelming efficiency of computation and a good match with the weights’ distribution. A simple reparameterization of the clipping function is applied to generate a better-defined gradient for updating the optimal clipping threshold. Moreover, weight normalization is presented to refine the input distribution of weights to be more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with the full-precision models, demonstrating the effectiveness of our proposed APoT quantization. For example, our 3-bit quantized ResNet-34 on ImageNet only drops 0.3% Top-1 and 0.2% Top-5 accuracy without bells and whistles, while the computation of our model is approximately 2× less than uniformly quantized neural networks. |
Tasks | Quantization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkgXT24tDS |
https://openreview.net/pdf?id=BkgXT24tDS | |
PWC | https://paperswithcode.com/paper/additive-powers-of-two-quantization-a-non-1 |
Repo | |
Framework | |
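To make the idea of "levels as sums of Powers-of-Two terms" from the abstract above concrete, the following sketch builds a set of non-uniform levels by summing one term from each of several small sets of powers of two, then rounds a weight to the nearest level. The particular term sets, the rescaling so the largest level equals 1, and the helper names are assumptions for illustration. The paper's exact construction may differ.

```python
import math
from itertools import product


def apot_levels(term_sets):
    """All distinct sums of one term per set, rescaled so the largest level is 1."""
    sums = {sum(choice) for choice in product(*term_sets)}
    top = max(sums)
    return sorted(s / top for s in sums)


def quantize(weight, levels, clip):
    """Clip |weight| to `clip`, then round to the nearest level (sign preserved)."""
    magnitude = min(abs(weight), clip) / clip
    nearest = min(levels, key=lambda level: abs(level - magnitude))
    return math.copysign(nearest * clip, weight)


# Two hypothetical term sets; every entry is zero or a power of two.
term_sets = [
    [0.0, 2.0 ** 0, 2.0 ** -2, 2.0 ** -4],
    [0.0, 2.0 ** -1, 2.0 ** -3, 2.0 ** -5],
]
levels = apot_levels(term_sets)
print(len(levels), quantize(0.3, levels, clip=1.0))
```

Because each level is a short sum of powers of two, multiplying by a level reduces to a few shifts and adds, which is the computational advantage the abstract points to. The levels also cluster near zero, where most weights lie.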
Deep 3D-Zoom Net: Unsupervised Learning of Photo-Realistic 3D-Zoom
Title | Deep 3D-Zoom Net: Unsupervised Learning of Photo-Realistic 3D-Zoom |
Authors | Anonymous |
Abstract | The 3D-zoom operation is the positive translation of the camera in the Z-axis, perpendicular to the image plane. In contrast, the optical zoom changes the focal length and the digital zoom is used to enlarge a certain region of an image to the original image size. In this paper, we are the first to formulate an unsupervised 3D-zoom learning problem where images with an arbitrary zoom factor can be generated from a given single image. An unsupervised framework is convenient, as it is a challenging task to obtain a 3D-zoom dataset of natural scenes due to the need for special equipment to ensure camera movement is restricted to the Z-axis. Besides, the objects in the scenes should not move when being captured, which hinders the construction of a large dataset of outdoor scenes. We present a novel unsupervised framework to learn how to generate arbitrarily 3D-zoomed versions of a single image, not requiring a 3D-zoom ground truth, called the Deep 3D-Zoom Net. The Deep 3D-Zoom Net incorporates the following features: (i) transfer learning from a pre-trained disparity estimation network via a back re-projection reconstruction loss; (ii) a fully convolutional network architecture that models depth-image-based rendering (DIBR), taking into account high-frequency details without the need for estimating the intermediate disparity; and (iii) incorporating a discriminator network that acts as a no-reference penalty for unnaturally rendered areas. Even though there is no baseline to fairly compare our results, our method outperforms previous novel view synthesis research in terms of realistic appearance on large camera baselines. We performed extensive experiments to verify the effectiveness of our method on the KITTI and Cityscapes datasets. |
Tasks | Disparity Estimation, Novel View Synthesis, Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HylohJrKPS |
https://openreview.net/pdf?id=HylohJrKPS | |
PWC | https://paperswithcode.com/paper/deep-3d-zoom-net-unsupervised-learning-of-1 |
Repo | |
Framework | |
Graph Warp Module: an Auxiliary Module for Boosting the Power of Graph Neural Networks in Molecular Graph Analysis
Title | Graph Warp Module: an Auxiliary Module for Boosting the Power of Graph Neural Networks in Molecular Graph Analysis |
Authors | Anonymous |
Abstract | Graph Neural Network (GNN) is a popular architecture for the analysis of chemical molecules, and it has numerous applications in material and medicinal science. Current lines of GNNs developed for molecular analysis, however, do not fit well on the training set, and their performance does not scale well with the complexity of the network. In this paper, we propose an auxiliary module to be attached to a GNN that can boost the representation power of the model without hindering the original GNN architecture. Our auxiliary module can improve the representation power and the generalization ability of a wide variety of GNNs, including those that are used commonly in biochemical applications. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1l66nNFvB |
https://openreview.net/pdf?id=S1l66nNFvB | |
PWC | https://paperswithcode.com/paper/graph-warp-module-an-auxiliary-module-for-1 |
Repo | |
Framework | |
Rethinking Curriculum Learning With Incremental Labels And Adaptive Compensation
Title | Rethinking Curriculum Learning With Incremental Labels And Adaptive Compensation |
Authors | Anonymous |
Abstract | Like humans, deep networks learn better when samples are organized and introduced in a meaningful order or curriculum. While conventional approaches to curriculum learning emphasize the difficulty of samples as the core incremental strategy, they force networks to learn from small subsets of data while introducing pre-computation overheads. In this work, we propose Learning with Incremental Labels and Adaptive Compensation (LILAC), which introduces a novel approach to curriculum learning. LILAC emphasizes incrementally learning labels instead of incrementally learning difficult samples. It works in two distinct phases: first, in the incremental label introduction phase, we unmask ground-truth labels in fixed increments during training, to improve the starting point from which networks learn. Second, in the adaptive compensation phase, we compensate for failed predictions by adaptively altering the target vector to a smoother distribution. We evaluate LILAC against the closest comparable methods in batch and curriculum learning and label smoothing, across three standard image benchmarks, CIFAR-10, CIFAR-100, and STL-10. We show that our method outperforms batch learning with higher mean recognition accuracy as well as lower standard deviation in performance consistently across all benchmarks. We further extend LILAC to state-of-the-art performance across CIFAR-10 using simple data augmentation while exhibiting label order invariance among other important properties. |
Tasks | Data Augmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1lTUCVYvH |
https://openreview.net/pdf?id=H1lTUCVYvH | |
PWC | https://paperswithcode.com/paper/rethinking-curriculum-learning-with |
Repo | |
Framework | |
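The adaptive compensation phase described in the abstract above can be sketched as follows: a sample that the network predicted correctly keeps its one-hot target, while a mispredicted sample receives a smoothed target. The function name, the uniform spreading of the remaining mass, and the `eps` value are assumptions for illustration, since the abstract does not give the exact form of the smoother distribution.

```python
def compensated_target(true_label, predicted_label, num_classes, eps=0.1):
    """One-hot target for correct predictions, smoothed target for failed ones."""
    if predicted_label == true_label:
        return [1.0 if c == true_label else 0.0 for c in range(num_classes)]
    # Failed prediction: move eps of the probability mass to the other classes.
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if c == true_label else off_value for c in range(num_classes)]


print(compensated_target(1, 1, 3))           # correct prediction keeps the one-hot target
print(compensated_target(1, 2, 3, eps=0.2))  # failed prediction gets a smoothed target
```

Unlike plain label smoothing, which softens every target, this variant softens targets only for samples the network currently gets wrong.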
CGT: Clustered Graph Transformer for Urban Spatio-temporal Prediction
Title | CGT: Clustered Graph Transformer for Urban Spatio-temporal Prediction |
Authors | Anonymous |
Abstract | Deep learning based approaches have been widely used in various urban spatio-temporal forecasting problems, but most of them fail to account for the unsmoothness issue of urban data in their architecture design, which significantly deteriorates their prediction performance. The aim of this paper is to develop a novel clustered graph transformer framework that integrates both a graph attention network and a transformer under an encoder-decoder architecture to address this unsmoothness issue. Specifically, we propose two novel structural components to refine the architectures of existing deep learning models. In the spatial domain, we propose a gradient-based clustering method to distribute different feature extractors to regions in different contexts. In the temporal domain, we propose to use multi-view position encoding to address the periodicity and closeness of urban time series data. Experiments on real datasets obtained from a ride-hailing business show that our method achieves a 10%-25% improvement over many state-of-the-art baselines. |
Tasks | Spatio-Temporal Forecasting, Time Series |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1eJAANtvr |
https://openreview.net/pdf?id=H1eJAANtvr | |
PWC | https://paperswithcode.com/paper/cgt-clustered-graph-transformer-for-urban |
Repo | |
Framework | |
Calibration, Entropy Rates, and Memory in Language Models
Title | Calibration, Entropy Rates, and Memory in Language Models |
Authors | Anonymous |
Abstract | Building accurate language models that capture meaningful long-term dependencies is a core challenge in natural language processing. Towards this end, we present a calibration-based approach to measure long-term discrepancies between a generative sequence model and the true distribution, and use these discrepancies to improve the model. Empirically, we show that state-of-the-art language models, including LSTMs and Transformers, are miscalibrated: the entropy rates of their generations drift dramatically upward over time. We then provide provable methods to mitigate this phenomenon. Furthermore, we show how this calibration-based approach can also be used to measure the amount of memory that language models use for prediction. |
Tasks | Calibration |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1eQcCEtDB |
https://openreview.net/pdf?id=B1eQcCEtDB | |
PWC | https://paperswithcode.com/paper/calibration-entropy-rates-and-memory-in-1 |
Repo | |
Framework | |
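The upward entropy drift mentioned in the abstract above can be checked with a very simple diagnostic: compute the entropy of the model's next-token distribution at each generation step and compare the average over late steps with the average over early steps. The helper names and the half-and-half split are assumptions made for illustration. They are not the calibration procedure proposed in the paper.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def entropy_drift(step_distributions):
    """Mean entropy over the second half of the steps minus the mean over the first half."""
    entropies = [entropy(d) for d in step_distributions]
    half = len(entropies) // 2
    early, late = entropies[:half], entropies[half:]
    return sum(late) / len(late) - sum(early) / len(early)


# Toy trace: confident predictions early, uniform predictions late.
trace = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]
drift = entropy_drift(trace)
print(f"entropy drift: {drift:.4f} nats")
```

A positive drift on long generations is the symptom the abstract describes. A well-calibrated model would keep this value close to zero.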
Combining MixMatch and Active Learning for Better Accuracy with Fewer Labels
Title | Combining MixMatch and Active Learning for Better Accuracy with Fewer Labels |
Authors | Anonymous |
Abstract | We propose using active-learning-based techniques to further improve the state-of-the-art semi-supervised learning MixMatch algorithm. We provide a thorough empirical evaluation of several active-learning and baseline methods, which successfully demonstrate a significant improvement on the benchmark CIFAR-10, CIFAR-100, and SVHN datasets (as much as 1.5% in absolute accuracy). We also provide an empirical analysis of the cost trade-off between incrementally gathering more labeled versus unlabeled data. This analysis can be used to measure the relative value of labeled/unlabeled data at different points of the learning curve, where we find that although the incremental value of labeled data can be as much as 20x that of unlabeled, it quickly diminishes to less than 3x once more than 2,000 labeled examples are observed. |
Tasks | Active Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJxWl0NKPB |
https://openreview.net/pdf?id=HJxWl0NKPB | |
PWC | https://paperswithcode.com/paper/combining-mixmatch-and-active-learning-for |
Repo | |
Framework | |
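One common active-learning criterion of the kind the abstract above evaluates is least-confidence sampling: ask for labels on the unlabeled examples whose top predicted probability is lowest. The sketch below is a generic version of that criterion with a hypothetical function name. The abstract does not say which acquisition functions performed best, so this should not be read as the paper's method.

```python
def select_for_labeling(predicted_probs, budget):
    """Indices of the `budget` examples with the lowest top-class probability."""
    by_confidence = sorted(range(len(predicted_probs)),
                           key=lambda i: max(predicted_probs[i]))
    return by_confidence[:budget]


# Model predictions on three unlabeled examples (two classes).
unlabeled_probs = [[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]]
chosen = select_for_labeling(unlabeled_probs, budget=2)
print(chosen)
```

The selected examples would be labeled and moved into the labeled pool before the next round of semi-supervised training.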
Mutual Exclusivity as a Challenge for Deep Neural Networks
Title | Mutual Exclusivity as a Challenge for Deep Neural Networks |
Authors | Anonymous |
Abstract | Strong inductive biases allow children to learn in fast and adaptable ways. Children use the mutual exclusivity (ME) bias to help disambiguate how words map to referents, assuming that if an object has one label then it does not need another. In this paper, we investigate whether or not standard neural architectures have an ME bias, demonstrating that they lack this learning assumption. Moreover, we show that their inductive biases are poorly matched to lifelong learning formulations of classification and translation. We demonstrate that there is a compelling case for designing neural networks that reason by mutual exclusivity, which remains an open challenge. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1lvn0NtwH |
https://openreview.net/pdf?id=S1lvn0NtwH | |
PWC | https://paperswithcode.com/paper/mutual-exclusivity-as-a-challenge-for-deep |
Repo | |
Framework | |
Posterior sampling for multi-agent reinforcement learning: solving extensive games with imperfect information
Title | Posterior sampling for multi-agent reinforcement learning: solving extensive games with imperfect information |
Authors | Anonymous |
Abstract | Posterior sampling for reinforcement learning (PSRL) is a useful framework for making decisions in an unknown environment. PSRL maintains a posterior distribution over the environment and then plans on an environment sampled from that posterior. Though PSRL works well on single-agent reinforcement learning problems, how to apply PSRL to multi-agent reinforcement learning problems is relatively unexplored. In this work, we extend PSRL to two-player zero-sum extensive games with imperfect information (TZIEG), which is a class of multi-agent systems. More specifically, we combine PSRL with counterfactual regret minimization (CFR), which is the leading algorithm for TZIEG with a known environment. Our main contribution is a novel design of interaction strategies. With our interaction strategies, our algorithm provably converges to the Nash Equilibrium at a rate of $O(\sqrt{\log T/T})$. Empirical results show that our algorithm works well. |
Tasks | Multi-agent Reinforcement Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Syg-ET4FPS |
https://openreview.net/pdf?id=Syg-ET4FPS | |
PWC | https://paperswithcode.com/paper/posterior-sampling-for-multi-agent |
Repo | |
Framework | |
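Counterfactual regret minimization, which the abstract above combines with posterior sampling, builds its strategy at each decision point with regret matching: actions are played in proportion to their positive cumulative regret. The sketch below shows only that standard building block. It does not implement the posterior sampling step or the interaction strategies that form the paper's contribution.

```python
def regret_matching(cumulative_regrets):
    """Strategy proportional to positive cumulative regrets, uniform if none is positive."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    return [1.0 / len(cumulative_regrets)] * len(cumulative_regrets)


print(regret_matching([1.0, 3.0, -2.0]))  # mass only on actions with positive regret
print(regret_matching([-1.0, -1.0]))      # falls back to the uniform strategy
```

In full CFR this rule is applied at every information set, and it is the average of the resulting strategies over iterations that approaches a Nash equilibrium in two-player zero-sum games.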
GRAPH NEIGHBORHOOD ATTENTIVE POOLING
Title | GRAPH NEIGHBORHOOD ATTENTIVE POOLING |
Authors | Anonymous |
Abstract | Network representation learning (NRL) is a powerful technique for learning low-dimensional vector representations of high-dimensional and sparse graphs. Most studies explore the structure and metadata associated with the graph using random walks and employ unsupervised or semi-supervised learning schemes. Learning in these methods is context-free, because only a single representation per node is learned. Recent studies have questioned the sufficiency of a single representation and proposed a context-sensitive approach that proved to be highly effective in applications such as link prediction and ranking. However, most of these methods rely on additional textual features that require RNNs or CNNs to capture high-level features, or rely on a community detection algorithm to identify multiple contexts of a node. In this study, requiring neither additional features nor a community detection algorithm, we propose a novel context-sensitive algorithm called GAP that learns to attend to different parts of a node’s neighborhood using attentive pooling networks. We show the efficacy of GAP using three real-world datasets on link prediction and node clustering tasks and compare it against 10 popular and state-of-the-art (SOTA) baselines. GAP consistently outperforms them and achieves up to ≈9% and ≈20% gain over the best performing methods on link prediction and clustering tasks, respectively. |
Tasks | Community Detection, Link Prediction, Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BkeqATVYwr |
https://openreview.net/pdf?id=BkeqATVYwr | |
PWC | https://paperswithcode.com/paper/graph-neighborhood-attentive-pooling |
Repo | |
Framework | |