Paper Group NANR 40
Semi-Supervised Few-Shot Learning with Prototypical Random Walks. On the Evaluation of Conditional GANs. The Visual Task Adaptation Benchmark. Residual EBMs: Does Real vs. Fake Text Discrimination Generalize?. Variable Complexity in the Univariate and Multivariate Structural Causal Model. Forecasting Deep Learning Dynamics with Applications to Hype …
Semi-Supervised Few-Shot Learning with Prototypical Random Walks
Title | Semi-Supervised Few-Shot Learning with Prototypical Random Walks |
Authors | Anonymous |
Abstract | Learning from a few examples is a key characteristic of human intelligence that inspired machine learning researchers to build data-efficient AI models. Recent progress has shown that few-shot learning can be improved with access to unlabelled data, known as semi-supervised few-shot learning (SS-FSL). We introduce an SS-FSL approach, dubbed Prototypical Random Walk Networks (PRWN), built on top of Prototypical Networks (PN). We develop a random walk semi-supervised loss that enables the network to learn representations that are compact and well-separated. Our work is related to very recent developments in graph-based approaches for few-shot learning. However, we show that compact and well-separated class embeddings can be achieved by our prototypical random walk notion without needing additional graph-NN parameters or requiring a transductive setting where the collective test set is provided. Our model outperforms prior art in most benchmarks, with significant improvements in some cases. For example, in a mini-Imagenet 5-shot classification task, we obtain 69.65% accuracy compared to the 64.59% state of the art. Our model, trained with 40% of the data as labelled, compares competitively against fully supervised prototypical networks trained on 100% of the labels, even outperforming them in the 1-shot mini-Imagenet case with 50.89% vs. 49.4% accuracy. We also show that our model is resistant to distractors, unlabelled data that does not belong to any of the training classes, reflecting robustness to labelled/unlabelled class distribution mismatch. We also performed a challenging discriminative power test, showing a relative improvement over the baseline of 14% on 20 classes on mini-Imagenet and 60% on 800 classes on Omniglot. |
Tasks | Few-Shot Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bygka64KPH |
PDF | https://openreview.net/pdf?id=Bygka64KPH |
PWC | https://paperswithcode.com/paper/semi-supervised-few-shot-learning-with |
Repo | |
Framework | |
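The prototypical random-walk loss described in the entry above can be sketched compactly. The sketch below is illustrative only: it assumes class prototypes are the means of labelled support embeddings (as in Prototypical Networks) and penalizes a prototype → unlabelled-point → prototype walk for not returning to its starting prototype; the temperature, distance choice, and loss weighting are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prototypical_random_walk_loss(support, support_labels, unlabeled, n_classes, temperature=1.0):
    """Sketch of a prototype -> unlabelled -> prototype random-walk consistency loss.

    support:        (n_support, d) embeddings of labelled examples
    support_labels: (n_support,) integer class labels in [0, n_classes)
    unlabeled:      (n_unlabeled, d) embeddings of unlabelled examples
    """
    # Class prototypes = mean embedding of each class's support points.
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_classes)]
    )  # (n_classes, d)

    # Transition probabilities prototype -> unlabelled (softmax over negative distances).
    d_pu = torch.cdist(prototypes, unlabeled)          # (n_classes, n_unlabeled)
    p_pu = F.softmax(-d_pu / temperature, dim=1)

    # Transition probabilities unlabelled -> prototype.
    p_up = F.softmax(-d_pu.t() / temperature, dim=1)   # (n_unlabeled, n_classes)

    # One round trip of the walk: prototype -> unlabelled -> prototype.
    round_trip = p_pu @ p_up                           # (n_classes, n_classes)

    # The walk should land back on the prototype it started from,
    # i.e. round_trip should be close to the identity matrix.
    targets = torch.arange(n_classes, device=support.device)
    return F.nll_loss(torch.log(round_trip + 1e-8), targets)
```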
On the Evaluation of Conditional GANs
Title | On the Evaluation of Conditional GANs |
Authors | Anonymous |
Abstract | Conditional Generative Adversarial Networks (cGANs) are finding increasingly widespread use in many application domains. Despite outstanding progress, quantitative evaluation of such models often involves multiple distinct metrics to assess different desirable properties, such as image quality, conditional consistency, and intra-conditioning diversity. In this setting, model benchmarking becomes a challenge, as each metric may indicate a different “best” model. In this paper, we propose the Frechet Joint Distance (FJD), which is defined as the Frechet distance between joint distributions of images and conditioning, allowing it to implicitly capture the aforementioned properties in a single metric. We conduct proof-of-concept experiments on a controllable synthetic dataset, which consistently highlight the benefits of FJD when compared to currently established metrics. Moreover, we use the newly introduced metric to compare existing cGAN-based models for a variety of conditioning modalities (e.g. class labels, object masks, bounding boxes, images, and text captions). We show that FJD can be used as a promising single metric for model benchmarking. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=rylxpA4YwH |
PDF | https://openreview.net/pdf?id=rylxpA4YwH |
PWC | https://paperswithcode.com/paper/on-the-evaluation-of-conditional-gans-1 |
Repo | |
Framework | |
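The core of FJD, as described above, is a standard Frechet distance computed over joint (image, conditioning) embeddings rather than image embeddings alone. A minimal sketch, assuming the image and conditioning embeddings are already extracted as arrays (the embedding networks themselves are not shown):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_joint_distance(img_emb_real, cond_emb_real, img_emb_gen, cond_emb_gen):
    """Sketch of a Frechet distance between *joint* (image, conditioning) embeddings.

    Each argument is an (n_samples, d) array; conditioning embeddings are concatenated
    to image embeddings so that image quality, conditional consistency, and diversity
    are all reflected in a single Gaussian fit.
    """
    joint_real = np.concatenate([img_emb_real, cond_emb_real], axis=1)
    joint_gen = np.concatenate([img_emb_gen, cond_emb_gen], axis=1)

    mu_r, cov_r = joint_real.mean(axis=0), np.cov(joint_real, rowvar=False)
    mu_g, cov_g = joint_gen.mean(axis=0), np.cov(joint_gen, rowvar=False)

    # Standard Frechet (2-Wasserstein) distance between two Gaussians.
    diff = mu_r - mu_g
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```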
The Visual Task Adaptation Benchmark
Title | The Visual Task Adaptation Benchmark |
Authors | Anonymous |
Abstract | Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified yardstick to evaluate general visual representations hinders progress. Many sub-fields promise representations, but each has different evaluation protocols that are either too constrained (linear classification), limited in scope (ImageNet, CIFAR, Pascal-VOC), or only loosely related to representation quality (generation). We present the Visual Task Adaptation Benchmark (VTAB): a diverse, realistic, and challenging benchmark to evaluate representations. VTAB embodies one principle: good representations adapt to unseen tasks with few examples. We run a large VTAB study of popular algorithms, answering questions like: How effective are ImageNet representations on non-standard datasets? Are generative models competitive? Is self-supervision useful if one already has labels? |
Tasks | Representation Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJena3VtwS |
PDF | https://openreview.net/pdf?id=BJena3VtwS |
PWC | https://paperswithcode.com/paper/the-visual-task-adaptation-benchmark-1 |
Repo | |
Framework | |
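VTAB's single principle, adapting a representation to an unseen task with few labelled examples and scoring the result, can be illustrated with the simplest admissible adaptation, a linear probe on frozen features. This is only one instance of the protocol (the benchmark also permits full fine-tuning), and `embed_fn` is a hypothetical frozen feature extractor:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adapt_and_score(embed_fn, x_train, y_train, x_test, y_test):
    """Sketch of the benchmark's core loop: take a frozen representation, adapt a
    lightweight head to a downstream task on a small labelled set, report test accuracy.
    """
    clf = LogisticRegression(max_iter=1000).fit(embed_fn(x_train), y_train)
    return clf.score(embed_fn(x_test), y_test)

# The benchmark score is then an average of such accuracies over a diverse suite of tasks.
```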
Residual EBMs: Does Real vs. Fake Text Discrimination Generalize?
Title | Residual EBMs: Does Real vs. Fake Text Discrimination Generalize? |
Authors | Anonymous |
Abstract | Energy-based models (EBMs), a.k.a. un-normalized models, have had recent successes in continuous spaces. However, they have not been successfully applied to model text sequences. While decreasing the energy at training samples is straightforward, mining (negative) samples where the energy should be increased is difficult. In part, this is because standard gradient-based methods are not readily applicable when the input is high-dimensional and discrete. Here, we side-step this issue by generating negatives using pre-trained auto-regressive language models. The EBM then works in the *residual* of the language model and is trained to discriminate real text from text generated by the auto-regressive models. We investigate the generalization ability of residual EBMs, a prerequisite for using them in other applications. We extensively analyze generalization for the task of classifying whether an input is machine or human generated, a natural task given the training loss and how we mine negatives. Overall, we observe that EBMs can generalize remarkably well to changes in the architecture of the generators producing negatives. However, EBMs exhibit more sensitivity to the training set used by such generators. |
Tasks | Language Modelling |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SkgpGgrYPH |
PDF | https://openreview.net/pdf?id=SkgpGgrYPH |
PWC | https://paperswithcode.com/paper/residual-ebms-does-real-vs-fake-text |
Repo | |
Framework | |
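The training signal described above is binary discrimination: human-written text should receive low energy, text sampled from a pre-trained autoregressive LM should receive high energy. A minimal sketch, assuming pooled sequence representations are already computed and leaving the negative-sampling language model out of scope:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualEnergyModel(nn.Module):
    """Sketch of a residual EBM head: assigns a scalar energy to a sequence encoding."""
    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, seq_repr):            # seq_repr: (batch, d_model) pooled encoding
        return self.score(seq_repr).squeeze(-1)

def ebm_step(ebm, real_repr, fake_repr, optimizer):
    """One training step: low energy for human text, high energy for LM-generated text."""
    energy_real = ebm(real_repr)
    energy_fake = ebm(fake_repr)
    # Logistic loss with sigmoid(-E) as the probability of being real:
    #   real  -> -log sigmoid(-E) = softplus(E)
    #   fake  -> -log sigmoid(E)  = softplus(-E)
    loss = F.softplus(energy_real).mean() + F.softplus(-energy_fake).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```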
Variable Complexity in the Univariate and Multivariate Structural Causal Model
Title | Variable Complexity in the Univariate and Multivariate Structural Causal Model |
Authors | Anonymous |
Abstract | We show that by comparing the individual complexities of univariate cause and effect in the Structural Causal Model, one can identify the cause and the effect, without considering their interaction at all. The entropy of each variable is ineffective in measuring the complexity, and we propose to capture it by an autoencoder that operates on the list of sorted samples. Comparing the reconstruction errors of the two autoencoders, one for each variable, is shown to perform well on the accepted benchmarks of the field. In the multivariate case, where one can ensure that the complexities of the cause and effect are balanced, we propose a new method that mimics the disentangled structure of the causal model. We extend the results of Zhang et al. (2009) to the multidimensional case, showing that such modeling is only likely in the direction of causality. Furthermore, the learned model is shown theoretically to perform the separation into the causal component and the residual (noise) component. Our multidimensional method obtains a significantly higher accuracy than the literature methods. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=BJlPLlrFvH |
PDF | https://openreview.net/pdf?id=BJlPLlrFvH |
PWC | https://paperswithcode.com/paper/variable-complexity-in-the-univariate-and |
Repo | |
Framework | |
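For the univariate case above, the decision comes down to training one small autoencoder per variable on its sorted sample list and comparing reconstruction errors. The sketch below is a loose illustration: the window-based training, architecture, and normalization are assumptions, and the rule for mapping the two errors to a causal direction is the paper's to specify.

```python
import torch
import torch.nn as nn

def sorted_sample_reconstruction_error(samples, code_dim=4, window=64, epochs=200, seed=0):
    """Train a small autoencoder on windows of one variable's sorted sample list and
    return its reconstruction error, used as a proxy for that variable's complexity.
    Illustrative sketch only; not the paper's exact recipe.
    """
    torch.manual_seed(seed)
    x = torch.as_tensor(samples, dtype=torch.float32).sort().values
    # Slide fixed-size windows over the sorted list so the autoencoder sees many inputs.
    windows = x.unfold(0, window, 1)                      # (n - window + 1, window)
    windows = (windows - windows.mean()) / (windows.std() + 1e-8)

    ae = nn.Sequential(nn.Linear(window, code_dim), nn.Tanh(), nn.Linear(code_dim, window))
    opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
    for _ in range(epochs):
        loss = ((ae(windows) - windows) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# The causal direction is then read off by comparing the two errors, e.g.:
# err_x = sorted_sample_reconstruction_error(x_samples)
# err_y = sorted_sample_reconstruction_error(y_samples)
```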
Forecasting Deep Learning Dynamics with Applications to Hyperparameter Tuning
Title | Forecasting Deep Learning Dynamics with Applications to Hyperparameter Tuning |
Authors | Anonymous |
Abstract | Well-performing deep learning models have enormous impact, but getting them to perform well is complicated, as the model architecture must be chosen and a number of hyperparameters tuned. This requires experimentation, which is time-consuming and costly. We propose to address the problem of hyperparameter tuning by learning to forecast the training behaviour of deep learning architectures. Concretely, we introduce a forecasting model that, given a hyperparameter schedule (e.g., learning rate, weight decay) and a history of training observations (such as loss and accuracy), predicts how the training will continue. Naturally, forecasting is much faster and less expensive than running actual deep learning experiments. The main question we study is whether the forecasting model is good enough to be of use - can it indeed replace real experiments? We answer this affirmatively in two ways. For one, we show that the forecasted curves are close to real ones. On the practical side, we apply our forecaster to learn hyperparameter tuning policies. We experiment on a version of ResNet on CIFAR10 and on Transformer in a language modeling task. The policies learned using our forecaster match or exceed the ones learned in real experiments and in one case even the default schedules discovered by researchers. We study the learning rate schedules created using the forecaster and find that they are not only effective, but also lead to interesting insights. |
Tasks | Language Modelling |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=ByxHJeBYDB |
PDF | https://openreview.net/pdf?id=ByxHJeBYDB |
PWC | https://paperswithcode.com/paper/forecasting-deep-learning-dynamics-with |
Repo | |
Framework | |
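The forecaster described above can be as simple as a sequence model that consumes the hyperparameter schedule alongside past observations and emits the next observation. A minimal sketch with an assumed feature layout and dimensions (the paper's actual architecture may differ):

```python
import torch
import torch.nn as nn

class TrainingCurveForecaster(nn.Module):
    """Sketch of a model that, given a hyperparameter schedule and past training
    observations, predicts the next observation (e.g. loss and accuracy)."""
    def __init__(self, n_hparams=2, n_observations=2, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_hparams + n_observations, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_observations)

    def forward(self, hparam_schedule, observations):
        # hparam_schedule: (batch, T, n_hparams), e.g. learning rate and weight decay per step
        # observations:    (batch, T, n_observations), e.g. loss and accuracy per step
        x = torch.cat([hparam_schedule, observations], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)   # predicted next observation at every step

# Rolling such a forecaster forward in place of real training runs is what makes
# hyperparameter search with it cheap.
```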
Improving End-to-End Object Tracking Using Relational Reasoning
Title | Improving End-to-End Object Tracking Using Relational Reasoning |
Authors | Fabian B. Fuchs, Adam R. Kosiorek, Li Sun, Oiwi Parker Jones, Ingmar Posner |
Abstract | Relational reasoning, the ability to model interactions and relations between objects, is valuable for robust multi-object tracking and pivotal for trajectory prediction. In this paper, we propose MOHART, a class-agnostic, end-to-end multi-object tracking and trajectory prediction algorithm, which explicitly accounts for permutation invariance in its relational reasoning. We explore a number of permutation invariant architectures and show that multi-headed self-attention outperforms the provided baselines and better accounts for complex physical interactions in a challenging toy experiment. We show on three real-world tracking datasets that adding relational reasoning capabilities in this way increases the tracking and trajectory prediction performance, particularly in the presence of ego-motion, occlusions, crowded scenes, and faulty sensor inputs. To the best of our knowledge, MOHART is the first fully end-to-end multi-object tracking from vision approach applied to real-world data reported in the literature. |
Tasks | Multi-Object Tracking, Object Tracking, Relational Reasoning, Trajectory Prediction |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Byl-264tvr |
PDF | https://openreview.net/pdf?id=Byl-264tvr |
PWC | https://paperswithcode.com/paper/improving-end-to-end-object-tracking-using |
Repo | |
Framework | |
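The relational-reasoning component the paper reports works best, multi-headed self-attention over per-object states, is easy to sketch. The surrounding end-to-end tracker and trajectory-prediction head are omitted, and the dimensions are assumptions:

```python
import torch
import torch.nn as nn

class ObjectInteractionBlock(nn.Module):
    """Sketch of permutation-invariant relational reasoning over per-object states
    via multi-headed self-attention."""
    def __init__(self, d_obj=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_obj, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_obj, d_obj), nn.ReLU(), nn.Linear(d_obj, d_obj))

    def forward(self, object_states):
        # object_states: (batch, n_objects, d_obj); attention runs over the object axis,
        # so the block is equivariant to permutations of the tracked objects.
        attended, _ = self.attn(object_states, object_states, object_states)
        return object_states + self.ff(attended)
```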
Co-Attentive Equivariant Neural Networks: Focusing Equivariance On Transformations Co-Occurring in Data
Title | Co-Attentive Equivariant Neural Networks: Focusing Equivariance On Transformations Co-Occurring in Data |
Authors | Anonymous |
Abstract | Equivariance is a nice property to have, as it produces much more parameter-efficient neural architectures and preserves the structure of the input through the feature mapping. Even though some combinations of transformations might never appear (e.g. a face with a horizontal nose), current equivariant architectures consider the set of all possible transformations in the transformation group when generating feature representations. In contrast, the human visual system is able to attend to the set of relevant transformations occurring in the environment so as to assist and improve object recognition. Based on this observation, we modify conventional equivariant feature mappings such that they are able to attend to the set of co-occurring transformations in data. Our experiments show that neural networks utilizing co-attentive equivariant feature mappings consistently outperform those utilizing conventional ones, both for fully (rotated MNIST) and partially (CIFAR-10) rotational settings. |
Tasks | Object Recognition |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1g6ogrtDr |
PDF | https://openreview.net/pdf?id=r1g6ogrtDr |
PWC | https://paperswithcode.com/paper/co-attentive-equivariant-neural-networks |
Repo | |
Framework | |
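One way to picture the idea above: compute equivariant responses for each element of a transformation group (here, four-fold rotations) and let learned attention reweight them, so transformations that rarely co-occur in the data can be down-weighted. This sketch uses a data-independent attention vector and rotates the input rather than the filters for brevity; the paper's co-attention mechanism is likely richer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentiveRotationConv(nn.Module):
    """Sketch of a p4 (0/90/180/270 degree) lifting convolution whose group responses
    are reweighted by learned attention over the four rotations."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.attn_logits = nn.Parameter(torch.zeros(4))    # one logit per rotation

    def forward(self, x):                                   # x: (batch, in_ch, H, W)
        responses = []
        for k in range(4):
            # Response of the shared filter to the k*90-degree rotated input,
            # rotated back so all responses live in the same frame.
            r = self.conv(torch.rot90(x, k, dims=(2, 3)))
            responses.append(torch.rot90(r, -k, dims=(2, 3)))
        stack = torch.stack(responses, dim=1)               # (batch, 4, out_ch, H, W)
        weights = F.softmax(self.attn_logits, dim=0).view(1, 4, 1, 1, 1)
        return (weights * stack).sum(dim=1)
```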
Information lies in the eye of the beholder: The effect of representations on observed mutual information
Title | Information lies in the eye of the beholder: The effect of representations on observed mutual information |
Authors | Anonymous |
Abstract | Learning can be framed as trying to encode the mutual information between input and output while discarding other information in the input. Since the joint distribution of input and output is unknown, so is the true mutual information. To quantify how difficult it is to learn a task, we calculate an observed mutual information score by dividing the estimated mutual information by the entropy of the input. We substantiate this score analytically by showing that the estimated mutual information has an error that increases with the entropy of the data. Intriguingly, depending on how the data is represented, the observed entropy and mutual information can vary wildly. There needs to be a match between how data is represented and how a model encodes it. Experimentally, we analyze image-based input data representations and demonstrate that performance outcomes of extensive network architecture searches are well aligned with the calculated score. Therefore, to ensure better learning outcomes, representations may need to be tailored to both task and model to align with the implicit distribution of the model. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=r1gmu1SYDS |
PDF | https://openreview.net/pdf?id=r1gmu1SYDS |
PWC | https://paperswithcode.com/paper/information-lies-in-the-eye-of-the-beholder |
Repo | |
Framework | |
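The score itself is simple: an estimate of the mutual information between input and output, divided by an estimate of the input's entropy. A minimal sketch with plug-in histogram estimators for one-dimensional variables (the paper's estimator and data dimensionality may differ):

```python
import numpy as np

def observed_mutual_information_score(x, y, bins=32):
    """Sketch of the score: estimated mutual information between input x and output y,
    normalized by the (estimated) entropy of the input."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    # Plug-in estimates of H(X) and I(X; Y).
    h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
    outer = np.outer(p_x, p_y)
    nz = p_xy > 0
    mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / outer[nz]))
    return mi / h_x

# Different representations of the same data (e.g. raw pixels vs. a learned encoding)
# can give very different scores, which is the paper's point.
```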
Non-Autoregressive Dialog State Tracking
Title | Non-Autoregressive Dialog State Tracking |
Authors | Anonymous |
Abstract | Recent efforts in Dialogue State Tracking (DST) for task-oriented dialogues have progressed toward open-vocabulary or generation-based approaches where the models can generate slot value candidates from the dialogue history itself. These approaches have shown good performance gains, especially in complicated dialogue domains with dynamic slot values. However, they fall short in two aspects: (1) they do not allow models to explicitly learn signals across domains and slots to detect potential dependencies among (domain, slot) pairs; and (2) existing models follow auto-regressive approaches which incur high time cost when the dialogue evolves over multiple domains and multiple turns. In this paper, we propose a novel framework of Non-Autoregressive Dialog State Tracking (NADST) which can factor in potential dependencies among domains and slots to optimize the models towards better prediction of dialogue states as a complete set rather than separate slots. In particular, the non-autoregressive nature of our method not only enables decoding in parallel to significantly reduce the latency of DST for real-time dialogue response generation, but also detects dependencies among slots at the token level, in addition to the slot and domain levels. Our empirical results show that our model achieves the state-of-the-art joint accuracy across all domains on the MultiWOZ 2.1 corpus, and the latency of our model is an order of magnitude lower than the previous state of the art as the dialogue history extends over time. |
Tasks | Dialogue State Tracking |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1e_cC4twS |
PDF | https://openreview.net/pdf?id=H1e_cC4twS |
PWC | https://paperswithcode.com/paper/non-autoregressive-dialog-state-tracking |
Repo | |
Framework | |
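The non-autoregressive idea above, decoding every (domain, slot) value in one parallel pass over the encoded dialogue history instead of slot by slot, can be sketched with a set of learned slot queries attending to the history. This omits most of the paper's actual architecture; the dimensions and the per-slot vocabulary classifier are assumptions.

```python
import torch
import torch.nn as nn

class ParallelSlotDecoder(nn.Module):
    """Sketch of parallel (non-autoregressive) dialogue-state decoding: all (domain, slot)
    values are predicted simultaneously from the dialogue-history encoding."""
    def __init__(self, n_slots, d_model=256, vocab_size=1000):
        super().__init__()
        self.slot_queries = nn.Parameter(torch.randn(n_slots, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.value_head = nn.Linear(d_model, vocab_size)

    def forward(self, history_encoding):
        # history_encoding: (batch, T, d_model) from any dialogue-history encoder.
        batch = history_encoding.size(0)
        queries = self.slot_queries.unsqueeze(0).expand(batch, -1, -1)
        # Every slot attends to the history at once -> one parallel decoding pass,
        # so latency does not grow with the number of slots decoded so far.
        slot_states, _ = self.attn(queries, history_encoding, history_encoding)
        return self.value_head(slot_states)       # (batch, n_slots, vocab_size)
```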
MetaPix: Few-Shot Video Retargeting
Title | MetaPix: Few-Shot Video Retargeting |
Authors | Anonymous |
Abstract | We address the task of unsupervised retargeting of human actions from one video to another. We consider the challenging setting where only a few frames of the target are available. The core of our approach is a conditional generative model that can transcode input skeletal poses (automatically extracted with an off-the-shelf pose estimator) to output target frames. However, it is challenging to build a universal transcoder because humans can appear wildly different due to clothing and background scene geometry. Instead, we learn to adapt – or personalize – a universal generator to the particular human and background in the target. To do so, we make use of meta-learning to discover effective strategies for on-the-fly personalization. One significant benefit of meta-learning is that the personalized transcoder naturally enforces temporal coherence across its generated frames; all frames contain consistent clothing and background geometry of the target. We experiment on in-the-wild internet videos and images and show that our approach improves over widely used baselines for the task. |
Tasks | Meta-Learning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SJx1URNKwH |
PDF | https://openreview.net/pdf?id=SJx1URNKwH |
PWC | https://paperswithcode.com/paper/metapix-few-shot-video-retargeting-1 |
Repo | |
Framework | |
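Meta-learning for on-the-fly personalization, as described above, can be pictured as: fine-tune a copy of the universal pose-to-frame generator on the few available target frames, then nudge the universal weights toward the adapted copy. The Reptile-style outer update below is an assumption chosen for brevity; the paper's meta-learning algorithm and losses may differ.

```python
import copy
import torch

def personalization_meta_step(generator, target_frames_loader, loss_fn,
                              inner_steps=5, inner_lr=1e-4, meta_lr=1e-3):
    """Sketch of meta-learning for fast personalization: adapt a copy of the universal
    generator to one target, then move the universal weights toward the adapted ones."""
    adapted = copy.deepcopy(generator)
    opt = torch.optim.Adam(adapted.parameters(), lr=inner_lr)
    for step, (pose, frame) in enumerate(target_frames_loader):
        if step >= inner_steps:
            break
        loss = loss_fn(adapted(pose), frame)    # transcode pose -> target frame
        opt.zero_grad(); loss.backward(); opt.step()

    # Meta-update: interpolate the universal generator toward the personalized copy.
    with torch.no_grad():
        for p, p_adapted in zip(generator.parameters(), adapted.parameters()):
            p.add_(meta_lr * (p_adapted - p))
```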
Spatial Information is Overrated for Image Classification
Title | Spatial Information is Overrated for Image Classification |
Authors | Anonymous |
Abstract | Intuitively, image classification should profit from using spatial information. Recent work, however, suggests that this might be overrated in standard CNNs. In this paper, we are pushing the envelope and aim to further investigate the reliance on and necessity of spatial information. We propose and analyze three methods, namely Shuffle Conv, GAP+FC and 1x1 Conv, that destroy spatial information during both training and testing phases. We extensively evaluate these methods on several object recognition datasets (CIFAR100, Small-ImageNet, ImageNet) with a wide range of CNN architectures (VGG16, ResNet50, ResNet152, MobileNet, SqueezeNet). Interestingly, we consistently observe that spatial information can be completely deleted from a significant number of layers with no or only small performance drops. |
Tasks | Image Classification, Object Recognition |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=H1l7AkrFPS |
PDF | https://openreview.net/pdf?id=H1l7AkrFPS |
PWC | https://paperswithcode.com/paper/spatial-information-is-overrated-for-image |
Repo | |
Framework | |
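Of the three spatial-information-destroying operations studied above, Shuffle Conv is the easiest to sketch: randomly permute the spatial positions of the activations before a standard convolution. Whether the permutation is resampled per batch, per sample, or per channel is an implementation detail assumed here.

```python
import torch
import torch.nn as nn

class ShuffleConv2d(nn.Module):
    """Sketch of 'Shuffle Conv': spatially shuffle the feature map before a standard
    convolution, destroying spatial layout while keeping channel content intact."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):                       # x: (batch, C, H, W)
        b, c, h, w = x.shape
        perm = torch.randperm(h * w, device=x.device)
        # Apply the same random spatial permutation to every sample and channel.
        x = x.flatten(2)[:, :, perm].view(b, c, h, w)
        return self.conv(x)
```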
Learning to Generate Grounded Visual Captions without Localization Supervision
Title | Learning to Generate Grounded Visual Captions without Localization Supervision |
Authors | Anonymous |
Abstract | When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, or if the model hallucinates based on priors in the dataset and/or the language model. The most common way of relating image regions with words in caption models is through an attention mechanism over the regions that are used as input to predict the next word. The model must therefore learn to predict the attentional weights without knowing the word it should localize. This is difficult to train without grounding supervision since recurrent models can propagate past information and there is no explicit signal to force the captioning model to properly ground the individual decoded words. In this work, we help the model to achieve this via a novel cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it, and then reconstruct the sentence from the localized image region(s) to match the ground-truth. Our proposed framework only requires learning one extra fully-connected layer (the localizer), a layer that can be removed at test time. We show that our model significantly improves grounding accuracy without relying on grounding supervision or introducing extra computation during inference for both image and video captioning tasks. |
Tasks | Language Modelling, Video Captioning |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=SylR6n4tPS |
PDF | https://openreview.net/pdf?id=SylR6n4tPS |
PWC | https://paperswithcode.com/paper/learning-to-generate-grounded-visual-captions |
Repo | |
Framework | |
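The extra machinery described above is small: a single fully-connected localizer that, after a word is decoded, attends over region features; the attended region is then used to regenerate the word, closing the cycle. The sketch below shows only that localization step, with assumed dimensions and a simple dot-product attention; the full cyclical training loop and its losses are not reproduced.

```python
import torch
import torch.nn as nn

class CyclicalGroundingLocalizer(nn.Module):
    """Sketch of the extra localizer layer: given a decoded word embedding, attend over
    region features and return the grounded region feature used for reconstruction.
    The layer can be dropped at test time."""
    def __init__(self, d_word=300, d_region=2048):
        super().__init__()
        self.localizer = nn.Linear(d_word, d_region)   # the single extra fully-connected layer

    def forward(self, word_emb, region_feats):
        # word_emb:     (batch, d_word) embedding of the word just decoded
        # region_feats: (batch, n_regions, d_region) image-region features
        query = self.localizer(word_emb).unsqueeze(1)             # (batch, 1, d_region)
        attn = torch.softmax((query * region_feats).sum(-1), -1)  # (batch, n_regions)
        grounded = (attn.unsqueeze(-1) * region_feats).sum(1)     # (batch, d_region)
        return grounded, attn   # `grounded` feeds the decoder that reconstructs the word
```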
Geometry-Aware Visual Predictive Models of Intuitive Physics
Title | Geometry-Aware Visual Predictive Models of Intuitive Physics |
Authors | Anonymous |
Abstract | Learning object dynamics for model-based control usually involves choosing among two alternatives: i) engineered 3D state representations comprised of 3D object locations and poses, or, ii) learnt 2D image representations trained end-to-end for the dynamics prediction task. The former requires laborious human annotations to extract the 3D information from 2D images, and does not permit end-to-end learning. The latter has not been shown to date to generalize across camera viewpoints or to handle camera motion and cross-object occlusions. We propose neural architectures that learn to disentangle an RGB-D video stream into camera motion and 3D scene appearance, and capture the latter into 3D feature representations that can be trained end-to-end with 3D object detection and object motion forecasting. We feed object-centric 3D feature maps and actions of the agent into differentiable neural modules and learn to forecast object 3D motion. We empirically demonstrate that the proposed 3D representations learn object dynamics that generalize across camera viewpoints and can handle object occlusions. They do not suffer from error accumulation when unrolled over time, thanks to the permanence of object appearance in 3D. They outperform both learned 2D image representations and engineered 3D ones by a margin in forecasting object dynamics. |
Tasks | 3D Object Detection, Motion Forecasting, Object Detection |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=Bklfcxrtvr |
PDF | https://openreview.net/pdf?id=Bklfcxrtvr |
PWC | https://paperswithcode.com/paper/geometry-aware-visual-predictive-models-of |
Repo | |
Framework | |
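Once the RGB-D stream has been lifted into object-centric 3D feature maps (the disentangling step this sketch does not attempt), forecasting object motion reduces to a function of a 3D feature grid plus the agent's action. A deliberately small, hypothetical forecaster with assumed shapes:

```python
import torch
import torch.nn as nn

class ObjectMotionForecaster3D(nn.Module):
    """Sketch of forecasting an object's 3D motion from its object-centric 3D feature
    map and the agent's action, using 3D convolutions."""
    def __init__(self, feat_ch=32, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(feat_ch, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(64 + action_dim, 3)   # predicted 3D translation of the object

    def forward(self, object_feature_map, action):
        # object_feature_map: (batch, feat_ch, D, H, W) 3D feature grid around one object
        # action:             (batch, action_dim) agent action (e.g. push direction)
        z = self.encoder(object_feature_map).flatten(1)
        return self.head(torch.cat([z, action], dim=-1))
```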
Shifted Randomized Singular Value Decomposition
Title | Shifted Randomized Singular Value Decomposition |
Authors | Anonymous |
Abstract | We extend the randomized singular value decomposition (SVD) algorithm (Halko et al., 2011) to estimate the SVD of a shifted data matrix without explicitly constructing the matrix in the memory. With no loss in the accuracy of the original algorithm, the extended algorithm provides for a more efficient way of matrix factorization. The algorithm facilitates the low-rank approximation and principal component analysis (PCA) of off-center data matrices. When applied to different types of data matrices, our experimental results confirm the advantages of the extensions made to the original algorithm. |
Tasks | |
Published | 2020-01-01 |
URL | https://openreview.net/forum?id=B1eZYkHYPS |
PDF | https://openreview.net/pdf?id=B1eZYkHYPS |
PWC | https://paperswithcode.com/paper/shifted-randomized-singular-value |
Repo | |
Framework | |
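The key point of the paper above is that the shifted matrix (for PCA, the column-mean-centered data) never has to be formed: the randomized algorithm only needs products with the shifted matrix and its transpose, both of which can be computed from the original matrix and the shift vector directly. A sketch following the Halko et al. (2011) range-finder scheme, with illustrative defaults:

```python
import numpy as np

def shifted_randomized_svd(A, shift, rank, n_oversample=10, n_iter=2, seed=0):
    """Sketch of a randomized SVD of the implicitly shifted matrix B = A - 1 @ shift.T
    (e.g. column-mean centering) without materializing B in memory."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    shift = np.asarray(shift).reshape(1, n)

    def B_mat(X):     # B @ X    without forming B
        return A @ X - np.ones((m, 1)) @ (shift @ X)

    def Bt_mat(X):    # B.T @ X  without forming B
        return A.T @ X - shift.T @ (np.ones((1, m)) @ X)

    # Range finding with a Gaussian test matrix and a few power iterations.
    Q = np.linalg.qr(B_mat(rng.standard_normal((n, rank + n_oversample))))[0]
    for _ in range(n_iter):
        Q = np.linalg.qr(Bt_mat(Q))[0]
        Q = np.linalg.qr(B_mat(Q))[0]

    # Project onto the range, take a small dense SVD, and map back.
    U_small, s, Vt = np.linalg.svd(Bt_mat(Q).T, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

# With shift = A.mean(axis=0), the leading right singular vectors returned here are
# the principal components of the off-center data matrix A.
```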