April 1, 2020

3124 words 15 mins read

Paper Group NANR 40

Semi-Supervised Few-Shot Learning with Prototypical Random Walks. On the Evaluation of Conditional GANs. The Visual Task Adaptation Benchmark. Residual EBMs: Does Real vs. Fake Text Discrimination Generalize?. Variable Complexity in the Univariate and Multivariate Structural Causal Model. Forecasting Deep Learning Dynamics with Applications to Hype …

Semi-Supervised Few-Shot Learning with Prototypical Random Walks

Title Semi-Supervised Few-Shot Learning with Prototypical Random Walks
Authors Anonymous
Abstract Learning from a few examples is a key characteristic of human intelligence that inspired machine learning researchers to build data-efficient AI models. Recent progress has shown that few-shot learning can be improved with access to unlabelled data, known as semi-supervised few-shot learning (SS-FSL). We introduce an SS-FSL approach, dubbed Prototypical Random Walk Networks (PRWN), built on top of Prototypical Networks (PN). We develop a random walk semi-supervised loss that enables the network to learn representations that are compact and well-separated. Our work is related to very recent developments in graph-based approaches for few-shot learning. However, we show that compact and well-separated class embeddings can be achieved by our prototypical random walk notion without needing additional graph-NN parameters or requiring a transductive setting where a collective test set is provided. Our model outperforms prior art in most benchmarks, with significant improvements in some cases. For example, in a mini-ImageNet 5-shot classification task, we obtain 69.65% accuracy compared to the 64.59% state of the art. Our model, trained with 40% of the data labelled, compares competitively against fully supervised prototypical networks trained on 100% of the labels, even outperforming them in the 1-shot mini-ImageNet case with 50.89% versus 49.4% accuracy. We also show that our model is resistant to distractors, unlabelled data that does not belong to any of the training classes, reflecting robustness to labelled/unlabelled class-distribution mismatch. We also performed a challenging discriminative power test, showing a relative improvement over the baseline of 14% on 20 classes of mini-ImageNet and 60% on 800 classes of Omniglot.
Tasks Few-Shot Learning
Published 2020-01-01
URL https://openreview.net/forum?id=Bygka64KPH
PDF https://openreview.net/pdf?id=Bygka64KPH
PWC https://paperswithcode.com/paper/semi-supervised-few-shot-learning-with
Repo
Framework
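
The following is a toy sketch, in PyTorch, of what a prototypical random-walk regularizer could look like, pieced together from the abstract above (class prototypes as in Prototypical Networks, plus a random walk between prototypes and unlabelled points). The exact loss, normalization, and distance function used in the paper are not specified here and may well differ.

```python
# Hypothetical sketch of a prototypical random-walk loss; details are assumptions.
import torch
import torch.nn.functional as F

def prototypes(support_emb, support_lbl, n_classes):
    # Mean embedding per class, as in standard Prototypical Networks.
    return torch.stack([support_emb[support_lbl == c].mean(0) for c in range(n_classes)])

def random_walk_loss(protos, unlabeled_emb):
    sim = protos @ unlabeled_emb.t()              # (C, U) similarities
    p_pu = F.softmax(sim, dim=1)                  # step: prototype -> unlabelled point
    p_up = F.softmax(sim.t(), dim=1)              # step: unlabelled point -> prototype
    round_trip = p_pu @ p_up                      # (C, C): prototype -> point -> prototype
    # Reward walks that land back on the prototype they started from, which pulls
    # unlabelled points towards compact, well-separated class clusters.
    return -round_trip.diagonal().clamp_min(1e-8).log().mean()

# Toy usage: 5-way 5-shot episode with 30 unlabelled points and random embeddings.
emb = torch.randn(25, 64)
lbl = torch.arange(5).repeat_interleave(5)
unl = torch.randn(30, 64)
print(float(random_walk_loss(prototypes(emb, lbl, 5), unl)))
```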

On the Evaluation of Conditional GANs

Title On the Evaluation of Conditional GANs
Authors Anonymous
Abstract Conditional Generative Adversarial Networks (cGANs) are finding increasingly widespread use in many application domains. Despite outstanding progress, quantitative evaluation of such models often involves multiple distinct metrics to assess different desirable properties, such as image quality, conditional consistency, and intra-conditioning diversity. In this setting, model benchmarking becomes a challenge, as each metric may indicate a different “best” model. In this paper, we propose the Frechet Joint Distance (FJD), which is defined as the Frechet distance between joint distributions of images and conditioning, allowing it to implicitly capture the aforementioned properties in a single metric. We conduct proof-of-concept experiments on a controllable synthetic dataset, which consistently highlight the benefits of FJD when compared to currently established metrics. Moreover, we use the newly introduced metric to compare existing cGAN-based models for a variety of conditioning modalities (e.g. class labels, object masks, bounding boxes, images, and text captions). We show that FJD can be used as a promising single metric for model benchmarking.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=rylxpA4YwH
PDF https://openreview.net/pdf?id=rylxpA4YwH
PWC https://paperswithcode.com/paper/on-the-evaluation-of-conditional-gans-1
Repo
Framework
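
As a rough illustration of the metric described above, the sketch below computes a Frechet distance between Gaussians fitted to joint (image, conditioning) embeddings. The embedding networks and the way image and conditioning features are merged are assumptions made for illustration (simple concatenation of pre-computed features), not the paper's exact procedure.

```python
# Minimal Frechet-style distance between joint embeddings, in the spirit of FJD.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    # Fit Gaussians to the two samples and compute the Frechet (Wasserstein-2) distance.
    mu_x, mu_y = x.mean(0), y.mean(0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(((mu_x - mu_y) ** 2).sum() + np.trace(cov_x + cov_y - 2.0 * covmean))

def joint_embedding(image_emb, cond_emb):
    # Assumed joint representation: concatenate image and conditioning embeddings.
    return np.concatenate([image_emb, cond_emb], axis=1)

# Toy usage with random features standing in for real and generated embeddings.
rng = np.random.default_rng(0)
real = joint_embedding(rng.normal(size=(500, 64)), rng.normal(size=(500, 16)))
fake = joint_embedding(rng.normal(loc=0.3, size=(500, 64)), rng.normal(size=(500, 16)))
print(frechet_distance(real, fake))
```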

The Visual Task Adaptation Benchmark

Title The Visual Task Adaptation Benchmark
Authors Anonymous
Abstract Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified yardstick to evaluate general visual representations hinders progress. Many sub-fields promise representations, but each has different evaluation protocols that are either too constrained (linear classification), limited in scope (ImageNet, CIFAR, Pascal-VOC), or only loosely related to representation quality (generation). We present the Visual Task Adaptation Benchmark (VTAB): a diverse, realistic, and challenging benchmark to evaluate representations. VTAB embodies one principle: good representations adapt to unseen tasks with few examples. We run a large VTAB study of popular algorithms, answering questions like: How effective are ImageNet representations on non-standard datasets? Are generative models competitive? Is self-supervision useful if one already has labels?
Tasks Representation Learning
Published 2020-01-01
URL https://openreview.net/forum?id=BJena3VtwS
PDF https://openreview.net/pdf?id=BJena3VtwS
PWC https://paperswithcode.com/paper/the-visual-task-adaptation-benchmark-1
Repo
Framework

Residual EBMs: Does Real vs. Fake Text Discrimination Generalize?

Title Residual EBMs: Does Real vs. Fake Text Discrimination Generalize?
Authors Anonymous
Abstract Energy-based models (EBMs), a.k.a. un-normalized models, have had recent successes in continuous spaces. However, they have not been successfully applied to model text sequences. While decreasing the energy at training samples is straightforward, mining (negative) samples where the energy should be increased is difficult. In part, this is because standard gradient-based methods are not readily applicable when the input is high-dimensional and discrete. Here, we side-step this issue by generating negatives using pre-trained auto-regressive language models. The EBM then works in the {\em residual} of the language model; and is trained to discriminate real text from text generated by the auto-regressive models. We investigate the generalization ability of residual EBMs, a pre-requisite for using them in other applications. We extensively analyze generalization for the task of classifying whether an input is machine or human generated, a natural task given the training loss and how we mine negatives. Overall, we observe that EBMs can generalize remarkably well to changes in the architecture of the generators producing negatives. However, EBMs exhibit more sensitivity to the training set used by such generators.
Tasks Language Modelling
Published 2020-01-01
URL https://openreview.net/forum?id=SkgpGgrYPH
PDF https://openreview.net/pdf?id=SkgpGgrYPH
PWC https://paperswithcode.com/paper/residual-ebms-does-real-vs-fake-text
Repo
Framework
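
A minimal sketch of the training signal described above: an energy network is trained to assign low energy to human-written text and high energy to negatives sampled from a pre-trained language model. The tiny bag-of-embeddings energy network and the logistic loss below are placeholders for illustration, not the paper's architecture or exact objective.

```python
# Sketch of a residual-EBM-style real-vs-fake text discriminator; architecture is a placeholder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens):                       # tokens: (batch, seq_len) token ids
        return self.head(self.emb(tokens).mean(dim=1)).squeeze(-1)   # one energy per sequence

model = EnergyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for human text and language-model-generated negatives (random ids here).
real_tokens = torch.randint(0, 1000, (32, 20))
fake_tokens = torch.randint(0, 1000, (32, 20))

# Binary discrimination: low energy means "real". Logistic loss on the energies.
e_real, e_fake = model(real_tokens), model(fake_tokens)
loss = F.softplus(e_real).mean() + F.softplus(-e_fake).mean()
loss.backward()
opt.step()
print(float(loss))
```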

Variable Complexity in the Univariate and Multivariate Structural Causal Model

Title Variable Complexity in the Univariate and Multivariate Structural Causal Model
Authors Anonymous
Abstract We show that by comparing the individual complexities of the univariate cause and effect in the Structural Causal Model, one can identify the cause and the effect without considering their interaction at all. The entropy of each variable is ineffective in measuring the complexity, and we propose to capture it by an autoencoder that operates on the list of sorted samples. Comparing the reconstruction errors of the two autoencoders, one for each variable, is shown to perform well on the accepted benchmarks of the field. In the multivariate case, where one can ensure that the complexities of the cause and effect are balanced, we propose a new method that mimics the disentangled structure of the causal model. We extend the results of Zhang et al. (2009) to the multidimensional case, showing that such modeling is only likely in the direction of causality. Furthermore, the learned model is shown theoretically to separate the causal component from the residual (noise) component. Our multidimensional method obtains a significantly higher accuracy than existing methods.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=BJlPLlrFvH
PDF https://openreview.net/pdf?id=BJlPLlrFvH
PWC https://paperswithcode.com/paper/variable-complexity-in-the-univariate-and
Repo
Framework
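
The abstract above proposes measuring per-variable complexity with an autoencoder over the sorted list of samples and comparing reconstruction errors across the two variables. The sketch below is a rough, assumption-laden illustration of that complexity proxy; the network size, training budget, and decision rule are guesses, not the paper's settings.

```python
# Rough sketch of a sorted-sample autoencoder complexity score; all settings are assumptions.
import torch
import torch.nn as nn

def sorted_sample_complexity(samples, hidden=8, steps=500):
    x = torch.sort(torch.as_tensor(samples, dtype=torch.float32))[0].unsqueeze(0)  # (1, n)
    n = x.shape[1]
    ae = nn.Sequential(nn.Linear(n, hidden), nn.Tanh(), nn.Linear(hidden, n))      # small bottleneck AE
    opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((ae(x) - x) ** 2).mean()
        loss.backward()
        opt.step()
    return float(loss)    # reconstruction error as a complexity proxy

# Toy cause-effect pair: y = x^3 + noise. Compare the two complexity proxies.
torch.manual_seed(0)
x = torch.randn(200)
y = x ** 3 + 0.1 * torch.randn(200)
print("complexity(x) =", sorted_sample_complexity(x))
print("complexity(y) =", sorted_sample_complexity(y))
```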

Forecasting Deep Learning Dynamics with Applications to Hyperparameter Tuning

Title Forecasting Deep Learning Dynamics with Applications to Hyperparameter Tuning
Authors Anonymous
Abstract Well-performing deep learning models have enormous impact, but getting them to perform well is complicated, as the model architecture must be chosen and a number of hyperparameters tuned. This requires experimentation, which is time-consuming and costly. We propose to address the problem of hyperparameter tuning by learning to forecast the training behaviour of deep learning architectures. Concretely, we introduce a forecasting model that, given a hyperparameter schedule (e.g., learning rate, weight decay) and a history of training observations (such as loss and accuracy), predicts how the training will continue. Naturally, forecasting is much faster and less expensive than running actual deep learning experiments. The main question we study is whether the forecasting model is good enough to be of use: can it indeed replace real experiments? We answer this affirmatively in two ways. First, we show that the forecasted curves are close to real ones. On the practical side, we apply our forecaster to learn hyperparameter tuning policies. We experiment on a version of ResNet on CIFAR10 and on a Transformer in a language modeling task. The policies learned using our forecaster match or exceed the ones learned in real experiments and, in one case, even the default schedules discovered by researchers. We study the learning rate schedules created using the forecaster and find that they are not only effective, but also lead to interesting insights.
Tasks Language Modelling
Published 2020-01-01
URL https://openreview.net/forum?id=ByxHJeBYDB
PDF https://openreview.net/pdf?id=ByxHJeBYDB
PWC https://paperswithcode.com/paper/forecasting-deep-learning-dynamics-with
Repo
Framework
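
As a hypothetical illustration of the forecasting model described above, the sketch below uses an LSTM that consumes a hyperparameter schedule (here just the per-step learning rate) together with the losses observed so far, then rolls out predictions for the remaining steps. The architecture and input encoding are assumptions for illustration, not the paper's model.

```python
# Hypothetical training-curve forecaster; architecture and encoding are assumptions.
import torch
import torch.nn as nn

class CurveForecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, lr_schedule, losses_so_far, horizon):
        # Encode the observed prefix as (lr_t, loss_t) pairs.
        prefix = torch.stack([lr_schedule[:, :losses_so_far.shape[1]], losses_so_far], dim=-1)
        _, state = self.rnn(prefix)
        # Roll forward on the future part of the schedule, feeding back predictions.
        preds, last = [], losses_so_far[:, -1:]
        for t in range(losses_so_far.shape[1], losses_so_far.shape[1] + horizon):
            step = torch.stack([lr_schedule[:, t:t + 1], last], dim=-1)
            out, state = self.rnn(step, state)
            last = self.out(out[:, -1])
            preds.append(last)
        return torch.cat(preds, dim=1)

# Toy usage: 4 runs, 100-step schedules, 30 observed losses, forecast 10 more steps.
model = CurveForecaster()
lr = torch.rand(4, 100) * 0.1
hist = torch.rand(4, 30)
print(model(lr, hist, horizon=10).shape)   # torch.Size([4, 10])
```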

Improving End-to-End Object Tracking Using Relational Reasoning

Title Improving End-to-End Object Tracking Using Relational Reasoning
Authors Fabian B. Fuchs, Adam R. Kosiorek, Li Sun, Oiwi Parker Jones, Ingmar Posner
Abstract Relational reasoning, the ability to model interactions and relations between objects, is valuable for robust multi-object tracking and pivotal for trajectory prediction. In this paper, we propose MOHART, a class-agnostic, end-to-end multi-object tracking and trajectory prediction algorithm, which explicitly accounts for permutation invariance in its relational reasoning. We explore a number of permutation invariant architectures and show that multi-headed self-attention outperforms the provided baselines and better accounts for complex physical interactions in a challenging toy experiment. We show on three real-world tracking datasets that adding relational reasoning capabilities in this way increases the tracking and trajectory prediction performance, particularly in the presence of ego-motion, occlusions, crowded scenes, and faulty sensor inputs. To the best of our knowledge, MOHART is the first fully end-to-end multi-object tracking from vision approach applied to real-world data reported in the literature.
Tasks Multi-Object Tracking, Object Tracking, Relational Reasoning, Trajectory Prediction
Published 2020-01-01
URL https://openreview.net/forum?id=Byl-264tvr
PDF https://openreview.net/pdf?id=Byl-264tvr
PWC https://paperswithcode.com/paper/improving-end-to-end-object-tracking-using
Repo
Framework
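
A minimal sketch of relational reasoning over per-object states with multi-headed self-attention, the mechanism highlighted in the abstract above (the attention itself is permutation-equivariant over objects, so object ordering does not matter). The feature sizes and residual block structure are illustrative assumptions, not MOHART itself.

```python
# Sketch of a self-attention relational block over object tracks; sizes are assumptions.
import torch
import torch.nn as nn

class RelationalBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, obj_feats):               # (batch, n_objects, dim)
        # Each object attends to every other object; permuting objects permutes outputs.
        attended, _ = self.attn(obj_feats, obj_feats, obj_feats)
        return obj_feats + self.ffn(attended)

# Toy usage: 2 scenes, 5 tracked objects each, 64-d per-object state.
feats = torch.randn(2, 5, 64)
print(RelationalBlock()(feats).shape)           # torch.Size([2, 5, 64])
```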

Co-Attentive Equivariant Neural Networks: Focusing Equivariance On Transformations Co-Occurring in Data

Title Co-Attentive Equivariant Neural Networks: Focusing Equivariance On Transformations Co-Occurring in Data
Authors Anonymous
Abstract Equivariance is a desirable property, as it produces much more parameter-efficient neural architectures and preserves the structure of the input through the feature mapping. Even though some combinations of transformations might never appear (e.g. a face with a horizontal nose), current equivariant architectures consider the set of all possible transformations in the transformation group when generating feature representations. In contrast, the human visual system is able to attend to the set of relevant transformations occurring in the environment to assist and improve object recognition. Based on this observation, we modify conventional equivariant feature mappings such that they are able to attend to the set of co-occurring transformations in data. Our experiments show that neural networks utilizing co-attentive equivariant feature mappings consistently outperform those utilizing conventional ones in both fully (rotated MNIST) and partially (CIFAR-10) rotational settings.
Tasks Object Recognition
Published 2020-01-01
URL https://openreview.net/forum?id=r1g6ogrtDr
PDF https://openreview.net/pdf?id=r1g6ogrtDr
PWC https://paperswithcode.com/paper/co-attentive-equivariant-neural-networks
Repo
Framework
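
The sketch below is a rough guess at the co-attentive idea described above, for the group of four 90-degree rotations: compute responses for each transformed copy of the input and pool them with learned attention weights over the transformation axis, rather than uniformly. The specific attention form and where it is applied are assumptions, not the paper's architecture.

```python
# Rough sketch of attention over a C4 rotation group; the attention form is an assumption.
import torch
import torch.nn as nn

class CoAttentiveRotConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.attn_logits = nn.Parameter(torch.zeros(4))       # one attention logit per rotation

    def forward(self, x):
        # Lift to the four 90-degree rotations: convolve each rotated copy, rotate back to align.
        responses = torch.stack(
            [torch.rot90(self.conv(torch.rot90(x, k, dims=(2, 3))), -k, dims=(2, 3)) for k in range(4)]
        )
        weights = torch.softmax(self.attn_logits, dim=0).view(4, 1, 1, 1, 1)
        return (weights * responses).sum(dim=0)                # attention-weighted pool over the group

x = torch.randn(2, 3, 28, 28)
print(CoAttentiveRotConv(3, 16)(x).shape)                      # torch.Size([2, 16, 28, 28])
```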

Information lies in the eye of the beholder: The effect of representations on observed mutual information

Title Information lies in the eye of the beholder: The effect of representations on observed mutual information
Authors Anonymous
Abstract Learning can be framed as trying to encode the mutual information between input and output while discarding other information in the input. Since the joint distribution of input and output is unknown, so is the true mutual information. To quantify how difficult it is to learn a task, we calculate an observed mutual information score by dividing the estimated mutual information by the entropy of the input. We substantiate this score analytically by showing that the estimated mutual information has an error that increases with the entropy of the data. Intriguingly, depending on how the data is represented, the observed entropy and mutual information can vary wildly. There needs to be a match between how data is represented and how a model encodes it. Experimentally, we analyze image-based input data representations and demonstrate that the performance outcomes of extensive network architecture searches align well with the calculated score. Therefore, to ensure better learning outcomes, representations may need to be tailored to both task and model to align with the implicit distribution of the model.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=r1gmu1SYDS
PDF https://openreview.net/pdf?id=r1gmu1SYDS
PWC https://paperswithcode.com/paper/information-lies-in-the-eye-of-the-beholder
Repo
Framework
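
The score described above is the estimated mutual information between input and output divided by the estimated entropy of the input. The sketch below computes it for discrete toy data with simple plug-in histogram estimators, which stand in for whatever estimator the paper actually uses.

```python
# Plug-in estimate of the observed mutual information score for discrete toy data.
import numpy as np

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(x, y):
    joint = x * (y.max() + 1) + y        # encode (x, y) pairs as single integers
    return entropy(x) + entropy(y) - entropy(joint)

def observed_mi_score(x, y):
    return mutual_information(x, y) / entropy(x)

# Toy example: y is a noisy copy of a 4-symbol input.
rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=5000)
y = np.where(rng.random(5000) < 0.8, x, rng.integers(0, 4, size=5000))
print(round(observed_mi_score(x, y), 3))
```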

Non-Autoregressive Dialog State Tracking

Title Non-Autoregressive Dialog State Tracking
Authors Anonymous
Abstract Recent efforts in Dialogue State Tracking (DST) for task-oriented dialogues have progressed toward open-vocabulary or generation-based approaches where the models can generate slot value candidates from the dialogue history itself. These approaches have shown good performance gains, especially in complicated dialogue domains with dynamic slot values. However, they fall short in two aspects: (1) they do not allow models to explicitly learn signals across domains and slots to detect potential dependencies among (domain, slot) pairs; and (2) existing models follow auto-regressive approaches which incur high time cost when the dialogue evolves over multiple domains and multiple turns. In this paper, we propose a novel framework of Non-Autoregressive Dialog State Tracking (NADST) which can factor in potential dependencies among domains and slots to optimize the model towards better prediction of dialogue states as a complete set rather than as separate slots. In particular, the non-autoregressive nature of our method not only enables decoding in parallel to significantly reduce the latency of DST for real-time dialogue response generation, but also allows detecting dependencies among slots at the token level in addition to the slot and domain levels. Our empirical results show that our model achieves state-of-the-art joint accuracy across all domains on the MultiWOZ 2.1 corpus, and that its latency is an order of magnitude lower than the previous state of the art as the dialogue history extends over time.
Tasks Dialogue State Tracking
Published 2020-01-01
URL https://openreview.net/forum?id=H1e_cC4twS
PDF https://openreview.net/pdf?id=H1e_cC4twS
PWC https://paperswithcode.com/paper/non-autoregressive-dialog-state-tracking
Repo
Framework
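
A bare-bones sketch of the non-autoregressive decoding idea described above: every (domain, slot) pair is decoded in a single parallel pass by attending over the encoded dialogue history, instead of generating slot values left to right. Fertility prediction, the full NADST architecture, and all sizes below are omitted or are illustrative assumptions.

```python
# Sketch of parallel (non-autoregressive) slot decoding; sizes and structure are assumptions.
import torch
import torch.nn as nn

class ParallelSlotDecoder(nn.Module):
    def __init__(self, n_slots=30, dim=128, vocab=500):
        super().__init__()
        self.slot_emb = nn.Embedding(n_slots, dim)        # one query per (domain, slot) pair
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.value_head = nn.Linear(dim, vocab)

    def forward(self, history_enc):                       # (batch, history_len, dim)
        b = history_enc.shape[0]
        queries = self.slot_emb.weight.unsqueeze(0).expand(b, -1, -1)
        # All slots attend to the dialogue history at once: one parallel decoding pass.
        slot_states, _ = self.attn(queries, history_enc, history_enc)
        return self.value_head(slot_states)               # (batch, n_slots, vocab)

hist = torch.randn(4, 60, 128)                            # stand-in for an encoded dialogue history
print(ParallelSlotDecoder()(hist).shape)                  # torch.Size([4, 30, 500])
```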

MetaPix: Few-Shot Video Retargeting

Title MetaPix: Few-Shot Video Retargeting
Authors Anonymous
Abstract We address the task of unsupervised retargeting of human actions from one video to another. We consider the challenging setting where only a few frames of the target are available. The core of our approach is a conditional generative model that can transcode input skeletal poses (automatically extracted with an off-the-shelf pose estimator) into output target frames. However, it is challenging to build a universal transcoder because humans can appear wildly different due to clothing and background scene geometry. Instead, we learn to adapt, or personalize, a universal generator to the particular human and background in the target. To do so, we make use of meta-learning to discover effective strategies for on-the-fly personalization. One significant benefit of meta-learning is that the personalized transcoder naturally enforces temporal coherence across its generated frames; all frames contain consistent clothing and background geometry of the target. We experiment on in-the-wild internet videos and images and show that our approach improves over widely used baselines for the task.
Tasks Meta-Learning
Published 2020-01-01
URL https://openreview.net/forum?id=SJx1URNKwH
PDF https://openreview.net/pdf?id=SJx1URNKwH
PWC https://paperswithcode.com/paper/metapix-few-shot-video-retargeting-1
Repo
Framework
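
As a very rough illustration of meta-learning for on-the-fly personalization, the sketch below uses a Reptile-style outer loop (a stand-in named plainly; the paper's actual generator, losses, and meta-algorithm are not reproduced): a few inner gradient steps personalize a copy of the model to a new target, and the meta-weights then move towards the personalized weights.

```python
# Reptile-style personalization sketch; the tiny linear "generator" is a placeholder.
import torch
import torch.nn as nn

def adapt(meta_model, poses, frames, steps=3, lr=1e-2):
    # Inner loop: personalize a copy of the generator on a few (pose, frame) pairs.
    local = nn.Linear(16, 16)                    # tiny stand-in for a pose-to-frame generator
    local.load_state_dict(meta_model.state_dict())
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((local(poses) - frames) ** 2).mean().backward()
        opt.step()
    return local

meta_model = nn.Linear(16, 16)
meta_lr = 0.1
for _ in range(100):                             # outer loop over sampled "target people"
    poses, frames = torch.randn(5, 16), torch.randn(5, 16)
    adapted = adapt(meta_model, poses, frames)
    # Reptile update: move the meta-weights towards the personalized weights.
    with torch.no_grad():
        for p, q in zip(meta_model.parameters(), adapted.parameters()):
            p += meta_lr * (q - p)
print("meta-training loop finished")
```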

Spatial Information is Overrated for Image Classification

Title Spatial Information is Overrated for Image Classification
Authors Anonymous
Abstract Intuitively, image classification should profit from using spatial information. Recent work, however, suggests that this might be overrated in standard CNNs. In this paper, we are pushing the envelope and aim to further investigate the reliance on and necessity of spatial information. We propose and analyze three methods, namely Shuffle Conv, GAP+FC and 1x1 Conv, that destroy spatial information during both training and testing phases. We extensively evaluate these methods on several object recognition datasets (CIFAR100, Small-ImageNet, ImageNet) with a wide range of CNN architectures (VGG16, ResNet50, ResNet152, MobileNet, SqueezeNet). Interestingly, we consistently observe that spatial information can be completely deleted from a significant number of layers with no or only small performance drops.
Tasks Image Classification, Object Recognition
Published 2020-01-01
URL https://openreview.net/forum?id=H1l7AkrFPS
PDF https://openreview.net/pdf?id=H1l7AkrFPS
PWC https://paperswithcode.com/paper/spatial-information-is-overrated-for-image
Repo
Framework
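
Of the three spatial-information-destroying operations named above, Shuffle Conv is the easiest to sketch: spatial positions of the feature map are randomly permuted before the convolution, so the layer cannot exploit spatial arrangement. The placement of the permutation and the treatment of the GAP+FC and 1x1 Conv variants are assumptions here; only the general idea is shown.

```python
# Sketch of a Shuffle Conv layer that destroys spatial information; details are assumptions.
import torch
import torch.nn as nn

class ShuffleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        perm = torch.randperm(h * w, device=x.device)      # fresh random permutation each call
        x = x.flatten(2)[:, :, perm].view(b, c, h, w)      # shuffle spatial positions
        return self.conv(x)

# Toy usage on a CIFAR-sized feature map.
x = torch.randn(8, 64, 16, 16)
print(ShuffleConv(64, 128)(x).shape)    # torch.Size([8, 128, 16, 16])
```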

Learning to Generate Grounded Visual Captions without Localization Supervision

Title Learning to Generate Grounded Visual Captions without Localization Supervision
Authors Anonymous
Abstract When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, or if the model hallucinates based on priors in the dataset and/or the language model. The most common way of relating image regions with words in caption models is through an attention mechanism over the regions that are used as input to predict the next word. The model must therefore learn to predict the attentional weights without knowing the word it should localize. This is difficult to train without grounding supervision since recurrent models can propagate past information and there is no explicit signal to force the captioning model to properly ground the individual decoded words. In this work, we help the model to achieve this via a novel cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it, and then reconstruct the sentence from the localized image region(s) to match the ground-truth. Our proposed framework only requires learning one extra fully-connected layer (the localizer), a layer that can be removed at test time. We show that our model significantly improves grounding accuracy without relying on grounding supervision or introducing extra computation during inference for both image and video captioning tasks.
Tasks Language Modelling, Video Captioning
Published 2020-01-01
URL https://openreview.net/forum?id=SylR6n4tPS
PDF https://openreview.net/pdf?id=SylR6n4tPS
PWC https://paperswithcode.com/paper/learning-to-generate-grounded-visual-captions
Repo
Framework

Geometry-Aware Visual Predictive Models of Intuitive Physics

Title Geometry-Aware Visual Predictive Models of Intuitive Physics
Authors Anonymous
Abstract Learning object dynamics for model-based control usually involves choosing among two alternatives: i) engineered 3D state representations comprised of 3D object locations and poses, or ii) learnt 2D image representations trained end-to-end for the dynamics prediction task. The former requires laborious human annotations to extract the 3D information from 2D images, and does not permit end-to-end learning. The latter has not been shown to date to generalize across camera viewpoints or to handle camera motion and cross-object occlusions. We propose neural architectures that learn to disentangle an RGB-D video stream into camera motion and 3D scene appearance, and capture the latter in 3D feature representations that can be trained end-to-end with 3D object detection and object motion forecasting. We feed object-centric 3D feature maps and the actions of the agent into differentiable neural modules and learn to forecast object 3D motion. We empirically demonstrate that the proposed 3D representations learn object dynamics that generalize across camera viewpoints and can handle object occlusions. They do not suffer from error accumulation when unrolled over time, thanks to the permanence of object appearance in 3D. They outperform both learned 2D image representations and engineered 3D ones by a margin in forecasting object dynamics.
Tasks 3D Object Detection, Motion Forecasting, Object Detection
Published 2020-01-01
URL https://openreview.net/forum?id=Bklfcxrtvr
PDF https://openreview.net/pdf?id=Bklfcxrtvr
PWC https://paperswithcode.com/paper/geometry-aware-visual-predictive-models-of
Repo
Framework

Shifted Randomized Singular Value Decomposition

Title Shifted Randomized Singular Value Decomposition
Authors Anonymous
Abstract We extend the randomized singular value decomposition (SVD) algorithm (Halko et al., 2011) to estimate the SVD of a shifted data matrix without explicitly constructing the matrix in the memory. With no loss in the accuracy of the original algorithm, the extended algorithm provides for a more efficient way of matrix factorization. The algorithm facilitates the low-rank approximation and principal component analysis (PCA) of off-center data matrices. When applied to different types of data matrices, our experimental results confirm the advantages of the extensions made to the original algorithm.
Tasks
Published 2020-01-01
URL https://openreview.net/forum?id=B1eZYkHYPS
PDF https://openreview.net/pdf?id=B1eZYkHYPS
PWC https://paperswithcode.com/paper/shifted-randomized-singular-value
Repo
Framework
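
The sketch below illustrates the core trick suggested by the abstract above: running the randomized SVD of Halko et al. (2011) on a mean-shifted matrix X - 1 mu^T without ever materializing the shifted matrix, by distributing the shift through the sketching and projection steps. The rank, oversampling, and absence of power iterations are simplifications of the standard algorithm.

```python
# Randomized SVD of a shifted matrix without forming it; simplified, no power iterations.
import numpy as np

def shifted_randomized_svd(X, mu, rank, oversample=10):
    n, d = X.shape
    omega = np.random.default_rng(0).normal(size=(d, rank + oversample))
    # Range finding: (X - 1 mu^T) @ omega computed as X @ omega - 1 (mu^T omega),
    # so the dense shifted matrix is never materialized.
    Y = X @ omega - np.outer(np.ones(n), mu @ omega)
    Q, _ = np.linalg.qr(Y)
    # Projection: B = Q^T (X - 1 mu^T) = Q^T X - (Q^T 1) mu^T.
    B = Q.T @ X - np.outer(Q.T @ np.ones(n), mu)
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]

# Toy check against the exact SVD of the explicitly shifted (mean-centered) matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 30))   # low-rank data matrix
mu = X.mean(axis=0)
U, s, Vt = shifted_randomized_svd(X, mu, rank=10)
s_exact = np.linalg.svd(X - mu, full_matrices=False)[1][:10]
print(np.allclose(s, s_exact, rtol=1e-2, atol=1e-8))       # expected: True
```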