Paper Group NANR 10
Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
Title | Self-Attentional Credit Assignment for Transfer in Reinforcement Learning |
Authors | Anonymous |
Abstract | The ability to transfer knowledge to novel environments and tasks is a sensible desideratum for general learning agents. Despite its apparent promise, transfer in RL remains an open and little-explored research area. In this paper, we take a new perspective on transfer: we suggest that the ability to assign credit unveils structural invariants in the tasks that can be transferred to make RL more sample efficient. Our main contribution is Secret, a novel approach to transfer learning for RL that uses a backward-view credit assignment mechanism based on a self-attentive architecture. Two aspects are key to its generality: it learns to assign credit as a separate offline supervised process and it exclusively modifies the reward function. Consequently, it can be supplemented by transfer methods that do not modify the reward function and it can be plugged on top of any RL algorithm. |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1xybgSKwB |
https://openreview.net/pdf?id=B1xybgSKwB | |
PWC | https://paperswithcode.com/paper/self-attentional-credit-assignment-for |
Repo | |
Framework | |
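The abstract describes credit assignment as a separate offline process whose output only reshapes the reward. Below is a minimal, hedged sketch of that backward-view redistribution idea; `credit_model` and its attention output are hypothetical stand-ins, and the paper's exact architecture and shaping rule are not reproduced here.

```python
# Hedged sketch: redistribute episode rewards using attention weights from an
# offline-trained self-attentive credit model. All names are hypothetical.
import torch

def redistribute_rewards(observations, rewards, credit_model):
    """observations: (T, obs_dim); rewards: (T,);
    credit_model(observations) -> (T, T) attention weights, row t over steps <= t."""
    with torch.no_grad():
        attn = credit_model(observations)  # rows are causal attention distributions
    # Step s receives the attention mass that later rewarded steps assign to it,
    # yielding a shaped reward that can be fed to any RL algorithm.
    return attn.t() @ rewards
```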
An Empirical Study of Encoders and Decoders in Graph-Based Dependency Parsing
Title | An Empirical Study of Encoders and Decoders in Graph-Based Dependency Parsing |
Authors | Anonymous |
Abstract | Graph-based dependency parsing consists of two steps: first, an encoder produces a feature representation for each parsing substructure of the input sentence, which is then used to compute a score for the substructure; and second, a decoder finds the parse tree whose substructures have the largest total score. Over the past few years, powerful neural techniques have been introduced into the encoding step, substantially increasing parsing accuracy. However, advanced decoding techniques, in particular high-order decoding, have seen a decline in usage. It is widely believed that contextualized features produced by neural encoders can help capture high-order decoding information and hence diminish the need for a high-order decoder. In this paper, we empirically evaluate combinations of different neural and non-neural encoders with first- and second-order decoders and provide a comprehensive analysis of the effectiveness of these combinations with varied training data sizes. We find that: first, with large training data, a strong neural encoder with first-order decoding is sufficient to achieve high parsing accuracy and only slightly lags behind the combination of neural encoding and second-order decoding; second, with small training data, a non-neural encoder with a second-order decoder outperforms the other combinations in most cases. |
Tasks | Dependency Parsing |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1ehEgHKwH |
https://openreview.net/pdf?id=r1ehEgHKwH | |
PWC | https://paperswithcode.com/paper/an-empirical-study-of-encoders-and-decoders |
Repo | |
Framework | |
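As a concrete illustration of the first-order (arc-factored) decoding the abstract refers to, here is a toy sketch in which each dependent independently picks its highest-scoring head from an arc score matrix. This is only an approximation for illustration; actual first-order decoders use Chu-Liu/Edmonds or Eisner's algorithm to guarantee a well-formed tree.

```python
# Hedged sketch of first-order decoding from an arc score matrix.
import numpy as np

def greedy_first_order_decode(arc_scores):
    """arc_scores[h, d]: score for head h governing dependent d; index 0 is ROOT."""
    n = arc_scores.shape[0]
    heads = np.zeros(n, dtype=int)
    for d in range(1, n):                  # skip the artificial ROOT token
        scores = arc_scores[:, d].copy()
        scores[d] = -np.inf                # forbid self-loops
        heads[d] = int(np.argmax(scores))  # arcs are scored independently
    return heads                           # may contain cycles; an MST decoder would not
```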
Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation
Title | Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation |
Authors | Jiawei Ma*, Zheng Shou*, Alireza Zareian, Hassan Mansour, Anthony Vetro, Shih-Fu Chang |
Abstract | Many real-world applications involve multivariate, geo-tagged time series data: at each location, multiple sensors record corresponding measurements. For example, an air quality monitoring system records PM2.5, CO, etc. The resulting time series data often have missing values due to device outages or communication errors. To impute the missing values, state-of-the-art methods are built on Recurrent Neural Networks (RNNs), which process each time stamp sequentially, prohibiting the direct modeling of the relationship between distant time stamps. Recently, the self-attention mechanism has been proposed for sequence modeling tasks such as machine translation, significantly outperforming RNNs because the relationship between any two time stamps can be modeled explicitly. In this paper, we are the first to adapt the self-attention mechanism to multivariate, geo-tagged time series data. To jointly capture self-attention across different dimensions (i.e., time, location and sensor measurements) while keeping the size of attention maps reasonable, we propose a novel approach called Cross-Dimensional Self-Attention (CDSA) that processes each dimension sequentially, yet in an order-independent manner. On three real-world datasets, including our newly collected NYC-traffic dataset, extensive experiments demonstrate the superiority of our approach over state-of-the-art methods for both imputation and forecasting tasks. |
Tasks | Imputation, Machine Translation, Time Series |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJxRKT4Fwr |
https://openreview.net/pdf?id=SJxRKT4Fwr | |
PWC | https://paperswithcode.com/paper/cross-dimensional-self-attention-for |
Repo | |
Framework | |
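A minimal sketch of the "attend along one dimension at a time" idea from the abstract, assuming a (time, location, measurement) tensor; the paper's actual parameterization and its order-independence construction are not reproduced here.

```python
# Hedged sketch: self-attention applied along each dimension of a (T, L, M)
# tensor in turn, with the other dimensions folded into the batch.
import torch
import torch.nn as nn

class DimWiseAttention(nn.Module):
    def __init__(self, d_model=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.proj = nn.Linear(1, d_model)   # lift scalar entries to d_model
        self.out = nn.Linear(d_model, 1)    # project back to scalars

    def attend_along(self, x, dim):
        x = x.movedim(dim, -1)                        # (..., n)
        shape = x.shape
        seq = self.proj(x.reshape(-1, shape[-1], 1))  # (B, n, d_model)
        seq, _ = self.attn(seq, seq, seq)             # attend along dimension `dim`
        return self.out(seq).reshape(shape).movedim(-1, dim)

    def forward(self, x):                   # x: (T, L, M)
        for dim in range(x.dim()):          # time, then location, then sensor
            x = x + self.attend_along(x, dim)  # residual update per dimension
        return x
```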
Mode Connectivity and Sparse Neural Networks
Title | Mode Connectivity and Sparse Neural Networks |
Authors | Anonymous |
Abstract | We uncover a connection between two seemingly unrelated empirical phenomena: mode connectivity and sparsity. On the one hand, there is a growing catalog of situations where, across multiple runs, SGD learns weights that fall into minima that are connected (mode connectivity). A striking example is described by Nagarajan & Kolter (2019). They observe that test error on MNIST does not change along the linear path connecting the end points of two independent SGD runs, starting from the same random initialization. On the other hand, there is the lottery ticket hypothesis of Frankle & Carbin (2019), where dense, randomly initialized networks have sparse subnetworks capable of training in isolation to full accuracy. However, neither phenomenon scales beyond small vision networks. We start by proposing a technique to find sparse subnetworks after initialization. We observe that these subnetworks match the accuracy of the full network only when two SGD runs for the same subnetwork are connected by linear paths with no change in test error. Our findings connect the existence of sparse subnetworks that train to high accuracy with the dynamics of optimization via mode connectivity. In doing so, we identify analogues of the phenomena uncovered by Nagarajan & Kolter and Frankle & Carbin in ImageNet-scale architectures at state-of-the-art sparsity levels. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkeO-lrYwr |
https://openreview.net/pdf?id=rkeO-lrYwr | |
PWC | https://paperswithcode.com/paper/mode-connectivity-and-sparse-neural-networks |
Repo | |
Framework | |
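The linear-path check described in the abstract is easy to sketch: interpolate the weights of two trained networks and evaluate test error at each point along the path. `evaluate` is a hypothetical stand-in, and the sketch assumes all state-dict entries are float tensors.

```python
# Hedged sketch: test error along the linear path between two SGD solutions.
import copy
import torch

def linear_path_errors(model_a, model_b, evaluate, steps=11):
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    errors = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        interp = copy.deepcopy(model_a)
        interp.load_state_dict({k: (1 - alpha) * sd_a[k] + alpha * sd_b[k]
                                for k in sd_a})
        errors.append(evaluate(interp))  # test error at this point on the path
    return errors                        # a flat curve indicates mode connectivity
```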
Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification
Title | Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification |
Authors | Anonymous |
Abstract | Semmelhack et al. (2014) achieved high classification accuracy in distinguishing swim bouts of zebrafish using a Support Vector Machine (SVM). Convolutional Neural Networks (CNNs) have surpassed SVMs in various image recognition tasks, but these powerful networks remain a black box. Improving transparency helps build trust in their classifications and makes learned features interpretable to experts. Using a recently developed technique called Deep Taylor Decomposition, we generated heatmaps to highlight input regions of high relevance for predictions. We find that our CNN makes predictions by analyzing the steadiness of the tail’s trunk, which markedly differs from the manually extracted features used by Semmelhack et al. (2014). We further uncovered that the network paid attention to experimental artifacts; removing these artifacts ensured the validity of predictions. After correction, our best CNN beats the SVM by 6.12%, achieving a classification accuracy of 96.32%. Our work thus demonstrates the utility of AI explainability for CNNs. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJgQkT4twH |
https://openreview.net/pdf?id=rJgQkT4twH | |
PWC | https://paperswithcode.com/paper/analysis-of-video-feature-learning-in-two |
Repo | |
Framework | |
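The paper's heatmaps come from Deep Taylor Decomposition; as a simpler, hedged stand-in that also highlights input regions driving a prediction, the sketch below computes gradient-times-input saliency (a different attribution method, named plainly here).

```python
# Hedged sketch: gradient-times-input saliency for a video classifier.
import torch

def gradient_times_input(model, clip, target_class):
    """clip: (1, C, T, H, W) video tensor; returns a relevance map of the same shape."""
    clip = clip.clone().requires_grad_(True)
    model(clip)[0, target_class].backward()  # gradient of the class score w.r.t. input
    return (clip.grad * clip).detach()       # elementwise relevance heatmap
```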
Explaining Time Series by Counterfactuals
Title | Explaining Time Series by Counterfactuals |
Authors | Anonymous |
Abstract | We propose a method to automatically compute the importance of features at every observation in time series, by simulating counterfactual trajectories given previous observations. We define the importance of each observation as the change in the model output caused by replacing the observation with a generated one. Our method can be applied to arbitrarily complex time series models. We compare the generated feature importance to existing methods like sensitivity analyses, feature occlusion, and other explanation baselines to show that our approach generates more precise explanations and is less sensitive to noise in the input signals. |
Tasks | Feature Importance, Time Series |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HygDF1rYDB |
https://openreview.net/pdf?id=HygDF1rYDB | |
PWC | https://paperswithcode.com/paper/explaining-time-series-by-counterfactuals |
Repo | |
Framework | |
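A minimal sketch of the counterfactual importance idea from the abstract: replace one observation with a value sampled from a generative model conditioned on the past, and measure how much the model output changes. `model` and `generator` are hypothetical stand-ins for the paper's components.

```python
# Hedged sketch: importance of one observation via counterfactual replacement.
import torch

def observation_importance(model, generator, x, t, feature, n_samples=10):
    """x: (T, D) series; `model` and `generator` are hypothetical stand-ins."""
    with torch.no_grad():
        base = model(x.unsqueeze(0))
        deltas = []
        for _ in range(n_samples):
            x_cf = x.clone()
            # Sample a plausible counterfactual value given observations before t.
            x_cf[t, feature] = generator.sample(x[:t], feature)
            deltas.append((model(x_cf.unsqueeze(0)) - base).norm().item())
    return sum(deltas) / n_samples           # average change in model output
```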
Programmable Neural Network Trojan for Pre-trained Feature Extractor
Title | Programmable Neural Network Trojan for Pre-trained Feature Extractor |
Authors | Anonymous |
Abstract | The neural network (NN) trojaning attack is an emerging and important attack that can broadly damage systems deployed with NN models. Unlike adversarial attacks, it hides malicious functionality in the weight parameters of NN models. Existing studies have explored NN trojaning attacks on small datasets for specific domains, with limited numbers of fixed target classes. In this paper, we propose a more powerful trojaning attack method for large models, which outperforms existing studies in capability, generality, and stealthiness. First, the attack is programmable: the malicious misclassification target is not fixed and can be generated on demand even after the victim’s deployment. Second, our trojaning attack is not limited to a small domain; one trojaned model on a large-scale dataset can affect applications of different domains that reuse its general features. Third, our trojan shows no biased behavior across target classes, which makes it more difficult to defend against. |
Tasks | Adversarial Attack |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bkgwp3NtDH |
https://openreview.net/pdf?id=Bkgwp3NtDH | |
PWC | https://paperswithcode.com/paper/programmable-neural-network-trojan-for-pre-1 |
Repo | |
Framework | |
Improving Robustness Without Sacrificing Accuracy with Patch Gaussian Augmentation
Title | Improving Robustness Without Sacrificing Accuracy with Patch Gaussian Augmentation |
Authors | Anonymous |
Abstract | Deploying machine learning systems in the real world requires both high accuracy on clean data and robustness to naturally occurring corruptions. While architectural advances have led to improved accuracy, building robust models remains challenging, involving major changes in training procedure and datasets. Prior work has argued that there is an inherent trade-off between robustness and accuracy, as exemplified by standard data augmentation techniques such as Cutout, which improves clean accuracy but not robustness, and additive Gaussian noise, which improves robustness but hurts accuracy. We introduce Patch Gaussian, a simple augmentation scheme that adds noise to randomly selected patches in an input image. Models trained with Patch Gaussian achieve state of the art on the CIFAR-10 and ImageNet Common Corruptions benchmarks while also maintaining accuracy on clean data. We find that this augmentation leads to reduced sensitivity to high frequency noise (similar to Gaussian) while retaining the ability to take advantage of relevant high frequency information in the image (similar to Cutout). We show it can be used in conjunction with other regularization methods and data augmentation policies such as AutoAugment. Finally, we find that the idea of restricting perturbations to patches can also be useful in the context of adversarial learning, yielding models without the loss in accuracy that is found with unconstrained adversarial training. |
Tasks | Data Augmentation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HkxWXkStDB |
https://openreview.net/pdf?id=HkxWXkStDB | |
PWC | https://paperswithcode.com/paper/improving-robustness-without-sacrificing-1 |
Repo | |
Framework | |
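Patch Gaussian, as the abstract describes it, is straightforward to sketch: additive Gaussian noise applied only inside a randomly placed square patch. The patch size and noise scale below are illustrative, not the paper's exact settings.

```python
# Hedged sketch of Patch Gaussian augmentation on a single image.
import torch

def patch_gaussian(img, patch_size=16, sigma=0.3):
    """img: (C, H, W) tensor with values in [0, 1]."""
    _, h, w = img.shape
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y0, y1 = max(0, cy - patch_size // 2), min(h, cy + patch_size // 2)
    x0, x1 = max(0, cx - patch_size // 2), min(w, cx + patch_size // 2)
    noisy = img.clone()
    noisy[:, y0:y1, x0:x1] += sigma * torch.randn_like(noisy[:, y0:y1, x0:x1])
    return noisy.clamp(0.0, 1.0)             # noise only inside the patch
```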
Amata: An Annealing Mechanism for Adversarial Training Acceleration
Title | Amata: An Annealing Mechanism for Adversarial Training Acceleration |
Authors | Anonymous |
Abstract | Despite their empirical success in various domains, deep neural networks have been revealed to be vulnerable to maliciously perturbed input data that severely degrades their performance; these are known as adversarial attacks. To counter adversarial attacks, adversarial training, formulated as a form of robust optimization, has been demonstrated to be effective. However, adversarial training incurs substantial computational overhead compared with standard training. To reduce the computational cost, we propose a simple yet effective modification to the commonly used projected gradient descent (PGD) adversarial training: increasing the number of adversarial training steps and decreasing the adversarial training step size gradually as training proceeds. We analyze the optimality of this annealing mechanism through the lens of optimal control theory, and we also prove the convergence of our proposed algorithm. Numerical experiments on standard datasets, such as MNIST and CIFAR10, show that our method can achieve similar or even better robustness with around 1/3 to 1/2 of the computation time of PGD. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1xI_TEtwS |
https://openreview.net/pdf?id=S1xI_TEtwS | |
PWC | https://paperswithcode.com/paper/amata-an-annealing-mechanism-for-adversarial |
Repo | |
Framework | |
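The annealing mechanism the abstract describes is easy to sketch: as training proceeds, increase the number of PGD attack steps and decrease the attack step size. The schedule endpoints below are illustrative, not the paper's settings.

```python
# Hedged sketch of an Amata-style annealing schedule for PGD adversarial training.
def amata_schedule(epoch, total_epochs, min_steps=2, max_steps=10,
                   max_alpha=0.02, min_alpha=0.004):
    frac = epoch / max(1, total_epochs - 1)
    steps = round(min_steps + frac * (max_steps - min_steps))  # more PGD steps later
    alpha = max_alpha - frac * (max_alpha - min_alpha)         # smaller step size later
    return steps, alpha
```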
Path Space for Recurrent Neural Networks with ReLU Activations
Title | Path Space for Recurrent Neural Networks with ReLU Activations |
Authors | Anonymous |
Abstract | It is well known that neural networks with rectified linear unit (ReLU) activation functions are positively scale-invariant (i.e., the network is invariant to positive rescaling of weights). Optimization algorithms like stochastic gradient descent, however, operate in the vector space of weights, which is not positively scale-invariant. To resolve this mismatch, a new parameter space called path space has been proposed for feedforward and convolutional neural networks. Path space is positively scale-invariant, and optimization algorithms operating in path space have been shown to be superior to those in the original weight space. However, the theory of path space and the corresponding optimization algorithms cannot be naturally extended to more complex neural networks, such as Recurrent Neural Networks (RNNs), due to the recurrent structure and the parameter sharing scheme over time. In this work, we aim to construct path space for RNNs with ReLU activations so that we can employ optimization algorithms in path space. To achieve this goal, we propose leveraging the reduction graph of the RNN, which removes the influence of time steps, and prove that the values of its paths can serve as a sufficient representation of the RNN with ReLU activations. We then prove that the path space for RNNs is composed of the basis paths in the reduction graph, and design a \emph{Skeleton Method} to identify the basis paths efficiently. With the identified basis paths, we develop an optimization algorithm in path space for RNN models. Our experiments on several benchmark datasets show that we can obtain significantly more effective RNN models in this way than using optimization methods in the weight space. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rked_6NFwH |
https://openreview.net/pdf?id=rked_6NFwH | |
PWC | https://paperswithcode.com/paper/path-space-for-recurrent-neural-networks-with |
Repo | |
Framework | |
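The positive scale-invariance that motivates path space can be checked in a few lines: rescaling a hidden ReLU unit's incoming weights by c > 0 and its outgoing weights by 1/c leaves the network function, and every path value (the product of weights along an input-to-output path), unchanged. A minimal sketch on a one-hidden-layer net:

```python
# Hedged sketch: positive scale-invariance of a ReLU network and its path values.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

def net(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)   # one hidden ReLU layer

c = 2.5                                    # rescale hidden unit 0
W1s, W2s = W1.copy(), W2.copy()
W1s[0, :] *= c                             # incoming weights scaled by c
W2s[:, 0] /= c                             # outgoing weights scaled by 1/c
assert np.allclose(net(W1, W2, x), net(W1s, W2s, x))  # identical outputs
# Path value through unit 0 is W2[j, 0] * W1[0, i] -- also invariant under (c, 1/c).
```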
DeepSphere: a graph-based spherical CNN
Title | DeepSphere: a graph-based spherical CNN |
Authors | Anonymous |
Abstract | Designing a convolution for a spherical neural network requires a delicate tradeoff between efficiency and rotation equivariance. DeepSphere, a method based on a graph representation of the discretized sphere, strikes a controllable balance between these two desiderata. Our contribution is twofold. First, we study both theoretically and empirically how equivariance is affected by the underlying graph with respect to the number of pixels and neighbors. Second, we evaluate DeepSphere on relevant problems. Experiments show state-of-the-art performance and demonstrate the efficiency and flexibility of this formulation. Perhaps surprisingly, comparison with previous work suggests that anisotropic filters might be an unnecessary price to pay. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1e3OlStPB |
https://openreview.net/pdf?id=B1e3OlStPB | |
PWC | https://paperswithcode.com/paper/deepsphere-a-graph-based-spherical-cnn |
Repo | |
Framework | |
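Graph-based spherical CNNs of this kind are typically built on Chebyshev polynomial filters of the graph Laplacian of the sphere's pixel graph; a minimal sketch of such a filter is below. The graph construction and the paper's exact layer are assumptions not reproduced here.

```python
# Hedged sketch: K-order Chebyshev polynomial filter on a graph signal.
import numpy as np

def chebyshev_filter(L_rescaled, x, theta):
    """L_rescaled: (N, N) Laplacian rescaled to a [-1, 1] spectrum,
    x: (N,) graph signal, theta: (K,) filter coefficients, K >= 2."""
    t_prev, t_curr = x, L_rescaled @ x                 # T_0(L)x and T_1(L)x
    out = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2 * L_rescaled @ t_curr - t_prev  # recurrence
        out = out + theta[k] * t_curr
    return out
```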
Context-Gated Convolution
Title | Context-Gated Convolution |
Authors | Anonymous |
Abstract | As the basic building block of Convolutional Neural Networks (CNNs), the convolutional layer is designed to extract local patterns and by nature lacks the ability to model global context. Many efforts have recently been made to complement CNNs with global modeling ability, especially by a family of works on global feature interaction. In these works, global context information is incorporated into local features before they are fed into convolutional layers. However, research in neuroscience reveals that, beyond influences that change the inputs to our neurons, the neurons’ ability to modify their functions dynamically according to context is essential for perceptual tasks, an ability that has been overlooked in most CNNs. Motivated by this, we propose a novel Context-Gated Convolution (CGC) to explicitly modify the weights of convolutional layers adaptively under the guidance of global context. Being aware of the global context, the modulated convolution kernel of our proposed CGC can better extract representative local patterns and compose discriminative features. Moreover, our proposed CGC is lightweight, amenable to modern CNN architectures, and consistently improves the performance of CNNs according to extensive experiments on image classification, action recognition, and machine translation. |
Tasks | Image Classification, Machine Translation |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1lFkRNKDS |
https://openreview.net/pdf?id=B1lFkRNKDS | |
PWC | https://paperswithcode.com/paper/context-gated-convolution-1 |
Repo | |
Framework | |
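A minimal sketch of the context-gating idea: pool the input into a global context vector, turn it into a multiplicative gate, and modulate the convolution kernel per sample before applying it. The paper's actual lightweight gate construction is not reproduced; the dense gate below is only for illustration.

```python
# Hedged sketch: a convolution whose kernel is gated by global context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextGatedConv2d(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.05)
        self.gate = nn.Linear(c_in, c_out * c_in * k * k)  # dense gate, illustrative

    def forward(self, x):                        # x: (B, c_in, H, W)
        b, c, h, w = x.shape
        ctx = x.mean(dim=(2, 3))                 # (B, c_in) global context by pooling
        gate = torch.sigmoid(self.gate(ctx))     # one gate value per kernel entry
        kernels = self.weight.unsqueeze(0) * gate.view(b, *self.weight.shape)
        # Apply a different modulated kernel to each sample via grouped conv.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kernels.reshape(-1, c, *self.weight.shape[2:]),
                       padding=self.weight.shape[-1] // 2, groups=b)
        return out.reshape(b, -1, h, w)
```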
Efficient Transformer for Mobile Applications
Title | Efficient Transformer for Mobile Applications |
Authors | Anonymous |
Abstract | The Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires an enormous amount of computation to achieve high performance, which makes it unsuitable for real-world mobile applications, since mobile phones are tightly constrained by hardware resources and battery. In this paper, we investigate the mobile setting (under 500M Mult-Adds) for NLP tasks to facilitate deployment on edge devices. We present Long-Short Range Attention (LSRA), where some heads specialize in local context modeling (by convolution) while the others capture long-distance relationships (by attention). Based on this primitive, we design the Mobile Transformer (MBT), tailored for mobile NLP applications. Our MBT demonstrates consistent improvement over the Transformer on two well-established language tasks: IWSLT 2014 German-English and WMT 2014 English-German. It outperforms the Transformer by 0.9 BLEU under 500M Mult-Adds and by 1.1 BLEU under 100M Mult-Adds on WMT’14 English-German. Without the costly architecture search that requires more than 250 GPU years, our manually designed MBT achieves 0.4 higher BLEU than the AutoML-based Evolved Transformer under the extremely efficient mobile setting (i.e., 100M Mult-Adds). |
Tasks | AutoML, Machine Translation, Question Answering |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByeMPlHKPH |
https://openreview.net/pdf?id=ByeMPlHKPH | |
PWC | https://paperswithcode.com/paper/efficient-transformer-for-mobile-applications |
Repo | |
Framework | |
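A minimal sketch of Long-Short Range Attention as the abstract describes it: the input is split channel-wise, one half goes through convolution (local context), the other through self-attention (long-range context), and the halves are concatenated. Kernel size and head count are illustrative.

```python
# Hedged sketch of an LSRA-style block.
import torch
import torch.nn as nn

class LSRABlock(nn.Module):
    def __init__(self, d_model=256, heads=4, kernel=5):
        super().__init__()
        half = d_model // 2
        self.conv = nn.Conv1d(half, half, kernel, padding=kernel // 2)
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)

    def forward(self, x):                        # x: (B, T, d_model)
        local, distant = x.chunk(2, dim=-1)      # channel-wise split
        local = self.conv(local.transpose(1, 2)).transpose(1, 2)  # conv over time
        distant, _ = self.attn(distant, distant, distant)         # attention over time
        return torch.cat([local, distant], dim=-1)
```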
Neural Epitome Search for Architecture-Agnostic Network Compression
Title | Neural Epitome Search for Architecture-Agnostic Network Compression |
Authors | Anonymous |
Abstract | Traditional compression methods, including network pruning, quantization, low-rank factorization and knowledge distillation, all assume that network architectures and parameters are hardwired together. In this work, we propose a new perspective on network compression: network parameters can be disentangled from the architecture. From this viewpoint, we present Neural Epitome Search (NES), a new neural network compression approach that learns to find compact yet expressive epitomes for the weight parameters of a specified network architecture end-to-end. The complete network to compress can be generated from the learned epitome via a novel transformation method that adaptively transforms the epitome to match the shapes of the given architecture. Compared with existing compression methods, NES allows the weight tensors to be independent of the architecture design and hence can achieve a good trade-off between compression rate and performance given a specific model size constraint. Experiments demonstrate that, on ImageNet, with MobileNetV2 as the backbone, our approach improves the full-model baseline by 1.47% in top-1 accuracy with a 25% MAdd reduction, and improves on AutoML for Model Compression (AMC) by 2.5% at nearly the same compression ratio. Moreover, with EfficientNet-B0 as the baseline, our NES yields an improvement of 1.2% with 10% fewer MAdds. In particular, our method achieves a new state-of-the-art result of 77.5% under mobile settings (<350M MAdds). Code will be made publicly available. |
Tasks | AutoML, Model Compression, Network Pruning, Neural Network Compression, Quantization |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HyxjOyrKvr |
https://openreview.net/pdf?id=HyxjOyrKvr | |
PWC | https://paperswithcode.com/paper/neural-epitome-search-for-architecture |
Repo | |
Framework | |
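A rough, hedged sketch of the epitome idea: the full kernel is generated from a smaller learned tensor (the epitome) by a transformation. The modular tiling below is purely illustrative; the paper learns its transformation end-to-end, which is not reproduced here.

```python
# Hedged sketch: generate a full conv kernel from a smaller learned epitome.
import torch
import torch.nn as nn

class EpitomeConv2dWeight(nn.Module):
    def __init__(self, epit_shape=(8, 8, 3, 3), full_shape=(32, 16, 3, 3)):
        super().__init__()
        self.epitome = nn.Parameter(torch.randn(*epit_shape) * 0.05)
        self.full_shape = full_shape          # assumes matching kernel sizes

    def forward(self):
        eo, ei, _, _ = self.epitome.shape
        fo, fi, _, _ = self.full_shape
        oi = torch.arange(fo) % eo            # tile epitome over output channels
        ii = torch.arange(fi) % ei            # tile epitome over input channels
        return self.epitome[oi][:, ii]        # (fo, fi, k, k) generated kernel
```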
Searching to Exploit Memorization Effect in Learning from Corrupted Labels
Title | Searching to Exploit Memorization Effect in Learning from Corrupted Labels |
Authors | Anonymous |
Abstract | Sample-selection approaches, which attempt to pick clean instances from the training data set, have become a promising direction for robust learning from corrupted labels. These methods all build on the memorization effect: deep networks learn easy patterns first and then gradually over-fit the training data set. In this paper, we show that properly selecting instances so that the training process benefits the most from the memorization effect is a hard problem. Specifically, memorization can heavily depend on many factors, e.g., the data set and network architecture. Nonetheless, there still exist general patterns in how memorization occurs. These facts motivate us to exploit memorization with automated machine learning (AutoML) techniques. First, we design an expressive but compact search space based on the observed general patterns. Then, we propose a natural gradient-based search algorithm to efficiently search through this space. Finally, extensive experiments on both synthetic and benchmark data sets demonstrate that the proposed method is not only much more efficient than existing AutoML algorithms but also achieves much better performance than state-of-the-art approaches for learning from corrupted labels. |
Tasks | AutoML |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJxqZkSFDB |
https://openreview.net/pdf?id=rJxqZkSFDB | |
PWC | https://paperswithcode.com/paper/searching-to-exploit-memorization-effect-in-1 |
Repo | |
Framework | |
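The selection primitive underlying such sample-selection methods is easy to sketch: keep only the smallest-loss fraction of each batch, on the assumption that low-loss samples are more likely to be clean. The schedule for `keep_ratio` is what the paper's AutoML search effectively tunes; the function below is only an illustrative building block.

```python
# Hedged sketch: one training step on small-loss (presumed clean) samples only.
import torch
import torch.nn.functional as F

def small_loss_step(model, optimizer, x, y, keep_ratio):
    losses = F.cross_entropy(model(x), y, reduction="none")  # per-sample losses
    k = max(1, int(keep_ratio * len(losses)))
    keep = torch.argsort(losses)[:k]          # indices of the smallest losses
    optimizer.zero_grad()
    losses[keep].mean().backward()            # update on selected samples only
    optimizer.step()
```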