Paper Group NANR 34
Prestopping: How Does Early Stopping Help Generalization Against Label Noise?
Title | Prestopping: How Does Early Stopping Help Generalization Against Label Noise? |
Authors | Anonymous |
Abstract | Noisy labels are very common in real-world training data and lead to poor generalization on test data because the model overfits to them. In this paper, we claim that such overfitting can be avoided by “early stopping” the training of a deep neural network before the noisy labels are severely memorized. Then, we resume training the early stopped network using a “maximal safe set,” which maintains a collection of almost certainly true-labeled samples at each epoch since the early stop point. Putting them all together, our novel two-phase training method, called Prestopping, realizes noise-free training under any type of label noise for practical use. Extensive experiments using four image benchmark data sets verify that our method significantly outperforms four state-of-the-art methods in test error by 0.4–8.2 percentage points in the presence of real-world noise. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BklSwn4tDH |
PDF | https://openreview.net/pdf?id=BklSwn4tDH |
PWC | https://paperswithcode.com/paper/prestopping-how-does-early-stopping-help |
Repo | |
Framework | |
Transferring Optimality Across Data Distributions via Homotopy Methods
Title | Transferring Optimality Across Data Distributions via Homotopy Methods |
Authors | Anonymous |
Abstract | Homotopy methods, also known as continuation methods, are a powerful mathematical tool to efficiently solve various problems in numerical analysis, including complex non-convex optimization problems where little or no prior knowledge regarding the localization of the solutions is available. In this work, we propose a novel homotopy-based numerical method that can be used to transfer knowledge regarding the localization of an optimum across different task distributions in deep learning applications. We validate the proposed methodology with empirical evaluations in regression and classification scenarios, where we show that superior numerical performance can be achieved on popular deep learning benchmarks (FashionMNIST and CIFAR-10), and we draw connections with the widely used fine-tuning heuristic. In addition, we give more insight into the properties of a general homotopy method when used in combination with Stochastic Gradient Descent by conducting a theoretical analysis in a simplified setting. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1gEIerYwH |
PDF | https://openreview.net/pdf?id=S1gEIerYwH |
PWC | https://paperswithcode.com/paper/transferring-optimality-across-data |
Repo | |
Framework | |
Scalable and Order-robust Continual Learning with Additive Parameter Decomposition
Title | Scalable and Order-robust Continual Learning with Additive Parameter Decomposition |
Authors | Anonymous |
Abstract | While recent continual learning methods largely alleviate the catastrophic problem on toy-sized datasets, there are issues that remain to be tackled in order to apply them to real-world problem domains. First, a continual learning model should effectively handle catastrophic forgetting and be efficient to train even with a large number of tasks. Secondly, it needs to tackle the problem of order-sensitivity, where the performance of the tasks largely varies based on the order of the task arrival sequence, as it may cause serious problems where fairness plays a critical role (e.g. medical diagnosis). To tackle these practical challenges, we propose a novel continual learning method that is scalable as well as order-robust, which instead of learning a completely shared set of weights, represents the parameters for each task as a sum of task-shared and sparse task-adaptive parameters. With our Additive Parameter Decomposition (APD), the task-adaptive parameters for earlier tasks remain mostly unaffected, where we update them only to reflect the changes made to the task-shared parameters. This decomposition of parameters effectively prevents catastrophic forgetting and order-sensitivity, while being computation- and memory-efficient. Further, we can achieve even better scalability with APD using hierarchical knowledge consolidation, which clusters the task-adaptive parameters to obtain hierarchically shared parameters. We validate our network with APD, APD-Net, on multiple benchmark datasets against state-of-the-art continual learning methods, which it largely outperforms in accuracy, scalability, and order-robustness. |
Tasks | Continual Learning, Medical Diagnosis |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1gdj2EKPB |
PDF | https://openreview.net/pdf?id=r1gdj2EKPB |
PWC | https://paperswithcode.com/paper/scalable-and-order-robust-continual-learning |
Repo | |
Framework | |
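As a rough illustration of the additive decomposition described in the APD abstract above, the sketch below builds each task's parameters as the sum of a shared vector and a sparse task-specific correction. The names `task_params`, `shared`, and `adaptive_by_task` are hypothetical, and the training procedure and hierarchical knowledge consolidation from the abstract are not reproduced here.

```python
def task_params(shared, task_adaptive):
    """Combine shared weights with a sparse per-task correction.

    `task_adaptive` maps a parameter index to its task-specific offset.
    Indices that are absent are treated as zero, which keeps the
    per-task storage small.
    """
    params = list(shared)
    for idx, delta in task_adaptive.items():
        params[idx] += delta
    return params


# Hypothetical values chosen only for illustration.
shared = [0.5, -1.0, 2.0, 0.0]
adaptive_by_task = {
    "task_a": {1: 0.25},
    "task_b": {0: -0.5, 3: 1.0},
}

theta_a = task_params(shared, adaptive_by_task["task_a"])
theta_b = task_params(shared, adaptive_by_task["task_b"])
```

Because each task stores only its sparse offsets, adding a task in this sketch does not overwrite the parameters used by earlier tasks.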
Coloring graph neural networks for node disambiguation
Title | Coloring graph neural networks for node disambiguation |
Authors | Anonymous |
Abstract | In this paper, we show that a simple coloring scheme can improve, both theoretically and empirically, the expressive power of Message Passing Neural Networks (MPNNs). More specifically, we introduce a graph neural network called Colored Local Iterative Procedure (CLIP) that uses colors to disambiguate identical node attributes, and show that this representation is a universal approximator of continuous functions on graphs with node attributes. Our method relies on separability, a key topological characteristic that allows well-chosen neural networks to be extended into universal representations. Finally, we show experimentally that CLIP is capable of capturing structural characteristics that traditional MPNNs fail to distinguish, while being state-of-the-art on benchmark graph classification datasets. |
Tasks | Graph Classification |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rJxt0JHKvS |
PDF | https://openreview.net/pdf?id=rJxt0JHKvS |
PWC | https://paperswithcode.com/paper/coloring-graph-neural-networks-for-node |
Repo | |
Framework | |
SMiRL: Surprise Minimizing RL in Entropic Environments
Title | SMiRL: Surprise Minimizing RL in Entropic Environments |
Authors | Anonymous |
Abstract | All living organisms struggle against the forces of nature to carve out niches where they can maintain relative stasis. We propose that such a search for order amidst chaos might offer a unifying principle for the emergence of useful behaviors in artificial agents. We formalize this idea into an unsupervised reinforcement learning method called surprise minimizing RL (SMiRL). SMiRL trains an agent with the objective of maximizing the probability of observed states under a model trained on all previously seen states. The resulting agents acquire several proactive behaviors to seek and maintain stable states, such as balancing and damage avoidance, that are closely tied to the affordances of the environment and its prevailing sources of entropy, such as winds, earthquakes, and other agents. We demonstrate that our surprise minimizing agents can successfully play Tetris and Doom, and control a humanoid to avoid falls, without any task-specific reward supervision. We further show that SMiRL can be used as an unsupervised pre-training objective that substantially accelerates subsequent reward-driven learning. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1lDbaVYvH |
PDF | https://openreview.net/pdf?id=H1lDbaVYvH |
PWC | https://paperswithcode.com/paper/smirl-surprise-minimizing-rl-in-entropic |
Repo | |
Framework | |
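The SMiRL objective in the abstract above (reward the agent with the probability of each observed state under a model of previously seen states) can be sketched for scalar states as follows. This is a minimal illustration, not the paper's implementation: the one-dimensional Gaussian density model, the variance floor `min_var`, and the names `RunningGaussian` and `smirl_rewards` are all assumptions made here.

```python
import math


class RunningGaussian:
    """Running mean and variance of scalar states (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def log_prob(self, x, min_var=1e-2):
        # Assumption: unit variance before any data, floored variance after.
        var = max(self.m2 / self.n, min_var) if self.n > 0 else 1.0
        return -0.5 * (math.log(2 * math.pi * var) + (x - self.mean) ** 2 / var)


def smirl_rewards(states):
    """Score each state under a model fit to the states seen before it."""
    model = RunningGaussian()
    rewards = []
    for s in states:
        rewards.append(model.log_prob(s))
        model.update(s)
    return rewards


# A familiar sequence followed by one surprising state.
rewards = smirl_rewards([0.0, 0.1, -0.1, 0.0, 5.0])
```

In this sketch the final state is far from everything seen before, so it receives a much lower reward than the familiar states that precede it.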
Visual Hide and Seek
Title | Visual Hide and Seek |
Authors | Anonymous |
Abstract | We train embodied agents to play Visual Hide and Seek where a prey must navigate in a simulated environment in order to avoid capture from a predator. We place a variety of obstacles in the environment for the prey to hide behind, and we only give the agents partial observations of their environment using an egocentric perspective. Although we train the model to play this game from scratch without any prior knowledge of its visual world, experiments and visualizations show that a representation of other agents automatically emerges in the learned representation. Furthermore, we quantitatively analyze how agent weaknesses, such as slower speed, affect the learned policy. Our results suggest that, although agent weaknesses make the learning problem more challenging, they also cause useful features to emerge in the representation. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Skg9aAEKwH |
PDF | https://openreview.net/pdf?id=Skg9aAEKwH |
PWC | https://paperswithcode.com/paper/visual-hide-and-seek-1 |
Repo | |
Framework | |
Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning
Title | Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning |
Authors | Anonymous |
Abstract | Stochastic methods with coordinate-wise adaptive stepsize (such as RMSprop and Adam) have been widely used in training deep neural networks. Despite their fast convergence, they can generalize worse than stochastic gradient descent. In this paper, by revisiting the design of Adagrad, we propose to split the network parameters into blocks, and use a blockwise adaptive stepsize. Intuitively, blockwise adaptivity is less aggressive than adaptivity to individual coordinates, and can have a better balance between adaptivity and generalization. We show theoretically that the proposed blockwise adaptive gradient descent has comparable regret in online convex learning and a comparable convergence rate for optimizing a nonconvex objective to its counterpart with coordinate-wise adaptive stepsize, but is better up to some constant. We also study its uniform stability and show that blockwise adaptivity can lead to lower generalization error than coordinate-wise adaptivity. Experimental results show that blockwise adaptive gradient descent converges faster and improves generalization performance over Nesterov’s accelerated gradient and Adam. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJlYqRNKDS |
PDF | https://openreview.net/pdf?id=SJlYqRNKDS |
PWC | https://paperswithcode.com/paper/blockwise-adaptivity-faster-training-and-1 |
Repo | |
Framework | |
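To make the blockwise idea from the abstract above concrete, here is a minimal sketch of an Adagrad-style step in which all coordinates of a block share one adaptive stepsize. The choice of accumulator (the block's mean squared gradient), the function name `blockwise_adagrad_step`, and the example values are assumptions for illustration; the paper's exact update rule may differ.

```python
import math


def blockwise_adagrad_step(params, grads, accum, blocks, lr=0.1, eps=1e-8):
    """One update where every coordinate of a block shares one stepsize.

    params, grads: flat lists of floats.
    accum: one running accumulator per block, updated in place.
    blocks: list of index lists that partition range(len(params)).
    """
    new_params = list(params)
    for b, idxs in enumerate(blocks):
        # Assumption: accumulate the block's mean squared gradient.
        accum[b] += sum(grads[i] ** 2 for i in idxs) / len(idxs)
        step = lr / (math.sqrt(accum[b]) + eps)
        for i in idxs:
            new_params[i] = params[i] - step * grads[i]
    return new_params


params = [1.0, 1.0, 1.0, 1.0]
grads = [3.0, 4.0, 0.1, 0.1]
blocks = [[0, 1], [2, 3]]  # e.g. one block per layer
accum = [0.0, 0.0]
updated = blockwise_adagrad_step(params, grads, accum, blocks)
```

Within a block the update direction stays proportional to the raw gradient, whereas a coordinate-wise method would rescale every entry separately.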
RotationOut as a Regularization Method for Neural Network
Title | RotationOut as a Regularization Method for Neural Network |
Authors | Anonymous |
Abstract | In this paper, we propose a novel regularization method, RotationOut, for neural networks. Unlike Dropout, which handles each neuron/channel independently, RotationOut regards its input layer as an entire vector and introduces regularization by randomly rotating the vector. RotationOut can also be used in convolutional layers and recurrent layers with a small modification. We further use a noise analysis method to interpret the difference between RotationOut and Dropout in co-adaptation reduction. Using this method, we also show how to use RotationOut/Dropout together with Batch Normalization. Extensive experiments in vision and language tasks are conducted to show the effectiveness of the proposed method. Code will be made available. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1e7M6VYwH |
PDF | https://openreview.net/pdf?id=r1e7M6VYwH |
PWC | https://paperswithcode.com/paper/rotationout-as-a-regularization-method-for |
Repo | |
Framework | |
On Stochastic Sign Descent Methods
Title | On Stochastic Sign Descent Methods |
Authors | Anonymous |
Abstract | Various gradient compression schemes have been proposed to mitigate the communication cost in distributed training of large scale machine learning models. Sign-based methods, such as signSGD (Bernstein et al., 2018), have recently been gaining popularity because of their simple compression rule and connection to adaptive gradient methods, like ADAM. In this paper, we perform a general analysis of sign-based methods for non-convex optimization. Our analysis is built on intuitive bounds on success probabilities and relies neither on special noise distributions nor on the boundedness of the variance of stochastic gradients. Extending the theory to the distributed setting within a parameter server framework, we ensure exponentially fast variance reduction with respect to the number of nodes, maintaining 1-bit compression in both directions and using small mini-batch sizes. We validate our theoretical findings experimentally. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rkxNelrKPB |
PDF | https://openreview.net/pdf?id=rkxNelrKPB |
PWC | https://paperswithcode.com/paper/on-stochastic-sign-descent-methods |
Repo | |
Framework | |
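The sign-based compression mentioned in the abstract above can be sketched as a single parameter-server step with one bit per coordinate in each direction. The aggregation by majority vote follows the signSGD line of work cited in the abstract; whether this paper uses exactly that rule is an assumption, and `majority_vote_sign_step` and the example gradients are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def majority_vote_sign_step(params, worker_grads, lr=0.01):
    """One distributed sign-descent step.

    Each worker sends only the sign of each gradient coordinate.
    The server sums the votes and sends back the sign of the total,
    so both directions carry one bit per coordinate.
    """
    new_params = []
    for j, p in enumerate(params):
        votes = sum(sign(g[j]) for g in worker_grads)
        new_params.append(p - lr * sign(votes))
    return new_params


params = [0.0, 0.0]
worker_grads = [[0.9, -0.2], [0.1, -3.0], [-0.5, -0.1]]
stepped = majority_vote_sign_step(params, worker_grads)
```

Gradient magnitudes play no role here: the worker reporting -3.0 has the same single vote as the one reporting -0.1.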
Weighted Empirical Risk Minimization: Transfer Learning based on Importance Sampling
Title | Weighted Empirical Risk Minimization: Transfer Learning based on Importance Sampling |
Authors | Anonymous |
Abstract | We consider statistical learning problems in which the distribution $P'$ of the training observations $Z'_1, \ldots, Z'_n$ differs from the distribution $P$ involved in the risk one seeks to minimize (referred to as the test distribution) but is still defined on the same measurable space as $P$ and dominates it. In the unrealistic case where the likelihood ratio $\Phi(z)=dP/dP'(z)$ is known, one may straightforwardly extend the Empirical Risk Minimization (ERM) approach to this specific transfer learning setup using the same idea as that behind Importance Sampling, by minimizing a weighted version of the empirical risk functional computed from the ‘biased’ training data $Z'_i$ with weights $\Phi(Z'_i)$. Although the importance function $\Phi(z)$ is generally unknown in practice, we show that, in various situations frequently encountered in practice, it takes a simple form and can be directly estimated from the $Z'_i$'s and some auxiliary information on the statistical population $P$. By means of linearization techniques, we then prove that the generalization capacity of the aforementioned approach is preserved when plugging the resulting estimates of the $\Phi(Z'_i)$'s into the weighted empirical risk. Beyond these theoretical guarantees, numerical results provide strong empirical evidence of the relevance of the approach promoted in this article. |
Tasks | Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bye2uJHYwr |
PDF | https://openreview.net/pdf?id=Bye2uJHYwr |
PWC | https://paperswithcode.com/paper/weighted-empirical-risk-minimization-transfer |
Repo | |
Framework | |
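The weighted empirical risk from the abstract above, with weights $\Phi(Z'_i)$ applied to the per-sample losses, might look like the following in its simplest form. The two-class example, in which the weights are ratios of known class proportions, is a hypothetical scenario chosen for illustration; the names `weighted_empirical_risk`, `test_prior`, and `train_prior` are not from the paper.

```python
def weighted_empirical_risk(losses, weights):
    """Compute (1/n) * sum_i w_i * loss_i, with w_i standing in for Phi(Z'_i)."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Hypothetical setup: class 1 makes up 50% of the test population but
# only 25% of the training sample, so its weight is 0.5 / 0.25 = 2.
labels = [0, 0, 0, 1]
losses = [0.2, 0.2, 0.2, 1.0]
test_prior = {0: 0.5, 1: 0.5}
train_prior = {0: 0.75, 1: 0.25}
weights = [test_prior[y] / train_prior[y] for y in labels]

plain_risk = sum(losses) / len(losses)
reweighted_risk = weighted_empirical_risk(losses, weights)
```

The unweighted average underestimates the risk on the test distribution here because the harder class is under-represented in training; the reweighted value corrects for that.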
Learning from Label Proportions with Consistency Regularization
Title | Learning from Label Proportions with Consistency Regularization |
Authors | Anonymous |
Abstract | The problem of learning from label proportions (LLP) involves training classifiers with weak labels on bags of instances, rather than strong labels on individual instances. The weak labels only contain the label proportions of each bag. The LLP problem is important for many practical applications that only allow label proportions to be collected because of data privacy or annotation costs, and has recently received lots of research attention. Most existing works focus on extending supervised learning models to solve the LLP problem, but the weak learning nature makes it hard to further improve LLP performance with a supervised angle. In this paper, we take a different angle from semi-supervised learning. In particular, we propose a novel model inspired by consistency regularization, a popular concept in semi-supervised learning that encourages the model to produce a decision boundary that better describes the data manifold. With the introduction of consistency regularization, we further extend our study to non-uniform bag-generation and validation-based parameter-selection procedures that better match practical needs. Experiments not only justify that LLP with consistency regularization achieves superior performance, but also demonstrate the practical usability of the proposed procedures. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SyecdJSKvr |
PDF | https://openreview.net/pdf?id=SyecdJSKvr |
PWC | https://paperswithcode.com/paper/learning-from-label-proportions-with-1 |
Repo | |
Framework | |
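One way to read the LLP setup in the abstract above is as a bag-level loss on label proportions plus a consistency term on individual instances. The sketch below assumes a cross-entropy between the given proportions and the bag's mean prediction, and a squared-difference consistency penalty between predictions on clean and perturbed inputs. Both loss forms, the weighting of 1.0, and all names are assumptions, not the paper's definitions.

```python
import math


def proportion_loss(bag_probs, bag_proportions, eps=1e-12):
    """Cross-entropy between given proportions and the bag's mean prediction."""
    n = len(bag_probs)
    k = len(bag_proportions)
    mean_pred = [sum(p[c] for p in bag_probs) / n for c in range(k)]
    return -sum(q * math.log(m + eps) for q, m in zip(bag_proportions, mean_pred))


def consistency_penalty(probs, probs_perturbed):
    """Mean squared difference between clean and perturbed predictions."""
    total = 0.0
    for p, q in zip(probs, probs_perturbed):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / len(probs)


# Hypothetical predicted class probabilities for a bag of four instances.
bag_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]
perturbed = [[0.8, 0.2], [0.2, 0.8], [0.7, 0.3], [0.5, 0.5]]
proportions = [0.5, 0.5]  # the only supervision available for the bag

loss = proportion_loss(bag_probs, proportions) + 1.0 * consistency_penalty(
    bag_probs, perturbed
)
```

Only the bag-level proportions act as labels; the consistency term uses no labels at all, which is why it can be borrowed from semi-supervised learning.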
A Base Model Selection Methodology for Efficient Fine-Tuning
Title | A Base Model Selection Methodology for Efficient Fine-Tuning |
Authors | Anonymous |
Abstract | While the accuracy of image classification achieves significant improvement with deep Convolutional Neural Networks (CNN), training a deep CNN is a time-consuming task because it requires a large amount of labeled data and takes a long time to converge even with high performance computing resources. Fine-tuning, one of the transfer learning methods, is effective in decreasing time and the amount of data necessary for CNN training. It is known that fine-tuning can be performed efficiently if the source and the target tasks have high relativity. However, the technique to evaluate the relativity or transferability of trained models quantitatively from their parameters has not been established. In this paper, we propose and evaluate several metrics to estimate the transferability of pre-trained CNN models for a given target task from the feature maps of the last convolutional layer. We found that some of the proposed metrics are good predictors of fine-tuned accuracy, but their effectiveness depends on the structure of the network. Therefore, we also propose to combine two metrics to get a generally applicable indicator. The experimental results reveal that one of the combined metrics is well correlated with fine-tuned accuracy across a variety of network structures and that our method has good potential to reduce the burden of CNN training. |
Tasks | Image Classification, Model Selection, Transfer Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BylT8RNKPH |
PDF | https://openreview.net/pdf?id=BylT8RNKPH |
PWC | https://paperswithcode.com/paper/a-base-model-selection-methodology-for |
Repo | |
Framework | |
Learning-Augmented Data Stream Algorithms
Title | Learning-Augmented Data Stream Algorithms |
Authors | Anonymous |
Abstract | The data stream model is a fundamental model for processing massive data sets with limited memory and fast processing time. Recently Hsu et al. (2019) incorporated machine learning techniques into the data stream model in order to learn relevant patterns in the input data. Such techniques were encapsulated by training an oracle to predict item frequencies in the streaming model. In this paper we explore the full power of such an oracle, showing that it can be applied to a wide array of problems in data streams, sometimes resulting in the first optimal bounds for such problems. Namely, we apply the oracle to counting distinct elements on the difference of streams, estimating frequency moments, estimating cascaded aggregates, and estimating moments of geometric data streams. For the distinct elements problem, we obtain the first memory-optimal algorithms. For estimating the $p$-th frequency moment for $0 < p < 2$ we obtain the first algorithms with optimal update time. For estimating the $p$-th frequency moment for $p > 2$ we obtain a quadratic saving in memory. We empirically validate our results, demonstrating also our improvements in practice. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HyxJ1xBYDH |
PDF | https://openreview.net/pdf?id=HyxJ1xBYDH |
PWC | https://paperswithcode.com/paper/learning-augmented-data-stream-algorithms |
Repo | |
Framework | |
Can I Trust the Explainer? Verifying Post-Hoc Explanatory Methods
Title | Can I Trust the Explainer? Verifying Post-Hoc Explanatory Methods |
Authors | Anonymous |
Abstract | For AI systems to garner widespread public acceptance, we must develop methods capable of explaining the decisions of black-box models such as neural networks. In this work, we identify two issues of current explanatory methods. First, we show that two prevalent perspectives on explanations (feature-additivity and feature-selection) lead to fundamentally different instance-wise explanations. In the literature, explainers from different perspectives are currently being directly compared, despite their distinct explanation goals. The second issue is that current post-hoc explainers have only been thoroughly validated on simple models, such as linear regression, and, when applied to real-world neural networks, explainers are commonly evaluated under the assumption that the learned models behave reasonably. However, neural networks often rely on unreasonable correlations, even when producing correct decisions. We introduce a verification framework for explanatory methods under the feature-selection perspective. Our framework is based on a non-trivial neural network architecture trained on a real-world task, and for which we are able to provide guarantees on its inner workings. We validate the efficacy of our evaluation by showing the failure modes of current explainers. We aim for this framework to provide a publicly available, off-the-shelf evaluation when the feature-selection perspective on explanations is needed. |
Tasks | Feature Selection |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=S1e-0kBYPB |
PDF | https://openreview.net/pdf?id=S1e-0kBYPB |
PWC | https://paperswithcode.com/paper/can-i-trust-the-explainer-verifying-post-hoc |
Repo | |
Framework | |
Towards Interpretable Evaluations: A Case Study of Named Entity Recognition
Title | Towards Interpretable Evaluations: A Case Study of Named Entity Recognition |
Authors | Anonymous |
Abstract | With the proliferation of models for natural language processing (NLP) tasks, it is even harder to understand the differences between models and their relative merits. Simply looking at differences between holistic metrics such as accuracy, BLEU, or F1 does not tell us why or how a particular method is better and how dataset biases influence the choices of model design. In this paper, we present a general methodology for interpretable evaluation of NLP systems and choose the task of named entity recognition (NER) as a case study, which is a core task of identifying people, places, or organizations in text. The proposed evaluation method enables us to interpret the model biases, dataset biases, and how the differences in the datasets affect the design of the models, identifying the strengths and weaknesses of current approaches. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive the progress in this area. |
Tasks | Named Entity Recognition |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=HJxTgeBtDr |
PDF | https://openreview.net/pdf?id=HJxTgeBtDr |
PWC | https://paperswithcode.com/paper/towards-interpretable-evaluations-a-case |
Repo | |
Framework | |