Paper Group AWR 240
On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length
Title | On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length |
Authors | Stanisław Jastrzębski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey |
Abstract | Stochastic Gradient Descent (SGD) based training of neural networks with a large learning rate or a small batch-size typically ends in well-generalizing, flat regions of the weight space, as indicated by small eigenvalues of the Hessian of the training loss. However, the curvature along the SGD trajectory is poorly understood. An empirical investigation shows that initially SGD visits increasingly sharp regions, reaching a maximum sharpness determined by both the learning rate and the batch-size of SGD. When studying the SGD dynamics in relation to the sharpest directions in this initial phase, we find that the SGD step is large compared to the curvature and commonly fails to minimize the loss along the sharpest directions. Furthermore, using a reduced learning rate along these directions can improve training speed while leading to both sharper and better generalizing solutions compared to vanilla SGD. In summary, our analysis of the dynamics of SGD in the subspace of the sharpest directions shows that they influence the regions that SGD steers to (where larger learning rate or smaller batch size result in wider regions visited), the overall training speed, and the generalization ability of the final model. |
Tasks | |
Published | 2018-07-13 |
URL | https://arxiv.org/abs/1807.05031v6 |
https://arxiv.org/pdf/1807.05031v6.pdf | |
PWC | https://paperswithcode.com/paper/on-the-relation-between-the-sharpest |
Repo | https://github.com/kudkudak/dnn_sharpest_directions |
Framework | tf |
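The central quantity in this abstract is the sharpness along the SGD trajectory, i.e. the largest Hessian eigenvalue of the training loss. Below is a minimal PyTorch sketch of how such a quantity is commonly estimated, via power iteration on Hessian-vector products; it illustrates the measurement only and is not the authors' TensorFlow code, and the iteration count is an arbitrary choice.

```python
import torch

def top_hessian_eigenpair(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products (double backward)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eigval = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]                      # normalize the direction
        hv = torch.autograd.grad(                      # Hessian-vector product
            sum((g * x).sum() for g, x in zip(grads, v)),
            params, retain_graph=True)
        eigval = sum((h * x).sum() for h, x in zip(hv, v)).item()
        v = [h.detach() for h in hv]                   # next power iterate
    return eigval, v                                   # v ~ top eigenvector (unnormalized)
```

For gradient descent on a quadratic approximation of the loss, training is stable along this direction only while the learning rate stays below 2/eigval, which is the kind of step-length-versus-curvature comparison the paper investigates.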
Sequence-to-Action: End-to-End Semantic Graph Generation for Semantic Parsing
Title | Sequence-to-Action: End-to-End Semantic Graph Generation for Semantic Parsing |
Authors | Bo Chen, Le Sun, Xianpei Han |
Abstract | This paper proposes a neural semantic parsing approach – Sequence-to-Action, which models semantic parsing as an end-to-end semantic graph generation process. Our method simultaneously leverages the advantages from two recent promising directions of semantic parsing. Firstly, our model uses a semantic graph to represent the meaning of a sentence, which has a tight-coupling with knowledge bases. Secondly, by leveraging the powerful representation learning and prediction ability of neural network models, we propose an RNN model which can effectively map sentences to action sequences for semantic graph generation. Experiments show that our method achieves state-of-the-art performance on the OVERNIGHT dataset and competitive performance on the GEO and ATIS datasets. |
Tasks | Graph Generation, Representation Learning, Semantic Parsing |
Published | 2018-09-04 |
URL | http://arxiv.org/abs/1809.00773v1 |
http://arxiv.org/pdf/1809.00773v1.pdf | |
PWC | https://paperswithcode.com/paper/sequence-to-action-end-to-end-semantic-graph |
Repo | https://github.com/dongpobeyond/Seq2Act |
Framework | none |
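As a toy illustration of "semantic parsing as action-sequence execution", the sketch below applies a predicted action sequence to build a graph. The action inventory (ADD_NODE / ADD_EDGE) and its string format are hypothetical, not the paper's actual action set.

```python
# Hypothetical action executor: turn a predicted action sequence into a
# semantic graph. The action names and argument format are illustrative only.
def execute_actions(actions):
    nodes, edges = [], []
    for act in actions:
        op, *args = act.split()
        if op == "ADD_NODE":            # e.g. "ADD_NODE city"
            nodes.append(args[0])
        elif op == "ADD_EDGE":          # e.g. "ADD_EDGE 0 1 loc"
            src, dst, label = args
            edges.append((int(src), int(dst), label))
    return {"nodes": nodes, "edges": edges}

graph = execute_actions(["ADD_NODE city", "ADD_NODE state", "ADD_EDGE 0 1 loc"])
```

In the paper, an RNN predicts such action sequences from the input sentence; an executor of this kind is what turns the sequence into the final semantic graph.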
Unsupervised learning from videos using temporal coherency deep networks
Title | Unsupervised learning from videos using temporal coherency deep networks |
Authors | Carolina Redondo-Cabrera, Roberto J. López-Sastre |
Abstract | In this work we address the challenging problem of unsupervised learning from videos. Existing methods utilize the spatio-temporal continuity in contiguous video frames as regularization for the learning process. Typically, this temporal coherence of close frames is used as a free form of annotation, encouraging the learned representations to exhibit small differences between these frames. But this type of approach fails to capture the dissimilarity between videos with different content, hence learning less discriminative features. We here propose two Siamese architectures for Convolutional Neural Networks, and their corresponding novel loss functions, to learn from unlabeled videos, which jointly exploit the local temporal coherence between contiguous frames, and a global discriminative margin used to separate representations of different videos. An extensive experimental evaluation is presented, where we validate the proposed models on various tasks. First, we show how the learned features can be used to discover actions and scenes in video collections. Second, we show the benefits of such unsupervised learning from just unlabeled videos, which can be directly used as a prior for the supervised recognition tasks of actions and objects in images, where our results further show that our features can even surpass a traditional and heavily supervised pre-training plus fine-tuning strategy. |
Tasks | |
Published | 2018-01-24 |
URL | http://arxiv.org/abs/1801.08100v2 |
http://arxiv.org/pdf/1801.08100v2.pdf | |
PWC | https://paperswithcode.com/paper/unsupervised-learning-from-videos-using |
Repo | https://github.com/gramuah/unsupervised |
Framework | caffe2 |
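The loss described in the abstract combines a local temporal-coherence term with a global margin between videos. A hedged PyTorch sketch of such a combination is given below; the exact form and weighting of the terms in the paper may differ.

```python
import torch
import torch.nn.functional as F

def coherence_margin_loss(f_t, f_t1, f_other, margin=1.0):
    """f_t, f_t1: embeddings of contiguous frames of the same video (B, D);
    f_other: embeddings of frames from different videos (B, D)."""
    pull = (f_t - f_t1).pow(2).sum(dim=1)              # temporal coherence term
    dist = (f_t - f_other).pow(2).sum(dim=1).sqrt()
    push = F.relu(margin - dist).pow(2)                # discriminative margin term
    return (pull + push).mean()
```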
This Time with Feeling: Learning Expressive Musical Performance
Title | This Time with Feeling: Learning Expressive Musical Performance |
Authors | Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, Karen Simonyan |
Abstract | Music generation has generally been focused on either creating scores or interpreting them. We discuss differences between these two problems and propose that, in fact, it may be valuable to work in the space of direct $\it performance$ generation: jointly predicting the notes $\it and$ $\it also$ their expressive timing and dynamics. We consider the significance and qualities of the data set needed for this. Having identified both a problem domain and characteristics of an appropriate data set, we show an LSTM-based recurrent network model that subjectively performs quite well on this task. Critically, we provide generated examples. We also include feedback from professional composers and musicians about some of these examples. |
Tasks | Music Generation |
Published | 2018-08-10 |
URL | http://arxiv.org/abs/1808.03715v1 |
http://arxiv.org/pdf/1808.03715v1.pdf | |
PWC | https://paperswithcode.com/paper/this-time-with-feeling-learning-expressive |
Repo | https://github.com/khasanovaa/Sonia |
Framework | pytorch |
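Predicting notes jointly with their expressive timing and dynamics is usually done over an event vocabulary rather than a fixed time grid. The sketch below shows such a vocabulary (NOTE_ON, NOTE_OFF, TIME_SHIFT and VELOCITY tokens); the bin counts are assumptions for illustration, not necessarily the paper's exact configuration.

```python
# Illustrative event vocabulary for expressive performance modelling.
N_PITCH, N_SHIFT, N_VEL = 128, 100, 32      # bin counts are assumptions

def note_on(pitch):      return pitch                           # 0 .. 127
def note_off(pitch):     return N_PITCH + pitch                 # 128 .. 255
def time_shift(bin_id):  return 2 * N_PITCH + bin_id            # quantized pauses
def velocity(bin_id):    return 2 * N_PITCH + N_SHIFT + bin_id  # loudness bins

VOCAB_SIZE = 2 * N_PITCH + N_SHIFT + N_VEL   # token ids fed to the LSTM
```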
The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos
Title | The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos |
Authors | Hazel Doughty, Walterio Mayol-Cuevas, Dima Damen |
Abstract | We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Skill determination is formulated as a ranking problem, making it suitable for common and generic tasks. However, for long videos, parts of the video are irrelevant for assessing skill, and there may be variability in the skill exhibited throughout a video. We therefore propose a method which assesses the relative overall level of skill in a long video by attending to its skill-relevant parts. Our approach trains temporal attention modules, learned with only video-level supervision, using a novel rank-aware loss function. In addition to attending to task relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skill. We evaluate our approach on the EPIC-Skills dataset and additionally annotate a larger dataset from YouTube videos for skill determination with five previously unexplored tasks. Our method outperforms previous approaches and classic softmax attention on both datasets by over 4% pairwise accuracy, and as much as 12% on individual tasks. We also demonstrate our model’s ability to attend to rank-aware parts of the video. |
Tasks | |
Published | 2018-12-13 |
URL | http://arxiv.org/abs/1812.05538v2 |
http://arxiv.org/pdf/1812.05538v2.pdf | |
PWC | https://paperswithcode.com/paper/the-pros-and-cons-rank-aware-temporal |
Repo | https://github.com/hazeld/rank-aware-attention-network |
Framework | pytorch |
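A hedged sketch of the two ingredients named in the abstract, attention-weighted pooling over video segments and a pairwise ranking objective, is shown below. The paper's rank-aware loss additionally trains separate "pros" and "cons" attention modules, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def attention_pool(segment_feats, attn_scores):
    """segment_feats: (T, D) segment features; attn_scores: (T,) unnormalized."""
    w = torch.softmax(attn_scores, dim=0)
    return (w.unsqueeze(1) * segment_feats).sum(dim=0)

def pairwise_rank_loss(score_high, score_low, margin=1.0):
    # The higher-skill video of a pair should score above the lower-skill one.
    return F.relu(margin - (score_high - score_low))
```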
Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities?
Title | Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities? |
Authors | Oren Mangoubi, Natesh S. Pillai, Aaron Smith |
Abstract | Hamiltonian Monte Carlo (HMC) is a very popular and generic collection of Markov chain Monte Carlo (MCMC) algorithms. One explanation for the popularity of HMC algorithms is their excellent performance as the dimension $d$ of the target becomes large: under conditions that are satisfied for many common statistical models, optimally-tuned HMC algorithms have a running time that scales like $d^{0.25}$. In stark contrast, the running time of the usual Random-Walk Metropolis (RWM) algorithm, optimally tuned, scales like $d$. This superior scaling of the HMC algorithm with dimension is attributed to the fact that it, unlike RWM, incorporates the gradient information in the proposal distribution. In this paper, we investigate a different scaling question: does HMC beat RWM for highly $\textit{multimodal}$ targets? We find that the answer is often $\textit{no}$. We compute the spectral gaps for both the algorithms for a specific class of multimodal target densities, and show that they are identical. The key reason is that, within one mode, the gradient is effectively ignorant about other modes, thus negating the advantage the HMC algorithm enjoys in unimodal targets. We also give heuristic arguments suggesting that the above observation may hold quite generally. Our main tool for answering this question is a novel simple formula for the conductance of HMC using Liouville’s theorem. This result allows us to compute the spectral gap of HMC algorithms, for both the classical HMC with isotropic momentum and the recent Riemannian HMC, for multimodal targets. |
Tasks | |
Published | 2018-08-09 |
URL | http://arxiv.org/abs/1808.03230v2 |
http://arxiv.org/pdf/1808.03230v2.pdf | |
PWC | https://paperswithcode.com/paper/does-hamiltonian-monte-carlo-mix-faster-than |
Repo | https://github.com/sir-deenicus/EvolutionaryBayes |
Framework | none |
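For readers unfamiliar with the algorithm being analysed, here is a minimal NumPy sketch of a single HMC transition (leapfrog integration plus a Metropolis accept step) on a bimodal 1-D target. It only illustrates the sampler the paper studies; it is not the paper's analysis, and the step size and leapfrog count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_density(x):
    # Unnormalized bimodal target: unit-variance Gaussian modes at -3 and +3.
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

def grad_log_density(x, eps=1e-5):
    return (log_density(x + eps) - log_density(x - eps)) / (2 * eps)

def hmc_step(x, step=0.1, n_leapfrog=20):
    p0 = rng.normal()
    p = p0 + 0.5 * step * grad_log_density(x)      # initial momentum half step
    x_new = x
    for _ in range(n_leapfrog):
        x_new = x_new + step * p
        p = p + step * grad_log_density(x_new)
    p = p - 0.5 * step * grad_log_density(x_new)   # turn the last update into a half step
    log_accept = (log_density(x_new) - 0.5 * p ** 2) - (log_density(x) - 0.5 * p0 ** 2)
    return x_new if np.log(rng.uniform()) < log_accept else x
```

With modes this far apart, the gradient inside one mode carries essentially no information about the other, which is the intuition behind the paper's negative result.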
Neural Granger Causality for Nonlinear Time Series
Title | Neural Granger Causality for Nonlinear Time Series |
Authors | Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, Emily Fox |
Abstract | While most classical approaches to Granger causality detection assume linear dynamics, many interactions in applied domains, like neuroscience and genomics, are inherently nonlinear. In these cases, using linear models may lead to inconsistent estimation of Granger causal interactions. We propose a class of nonlinear methods by applying structured multilayer perceptrons (MLPs) or recurrent neural networks (RNNs) combined with sparsity-inducing penalties on the weights. By encouraging specific sets of weights to be zero—in particular through the use of convex group-lasso penalties—we can extract the Granger causal structure. To further contrast with traditional approaches, our framework naturally enables us to efficiently capture long-range dependencies between series either via our RNNs or through an automatic lag selection in the MLP. We show that our neural Granger causality methods outperform state-of-the-art nonlinear Granger causality methods on the DREAM3 challenge data. This data consists of nonlinear gene expression and regulation time courses with only a limited number of time points. The successes we show in this challenging dataset provide a powerful example of how deep learning can be useful in cases that go beyond prediction on large datasets. We likewise demonstrate our methods in detecting nonlinear interactions in a human motion capture dataset. |
Tasks | Motion Capture, Time Series |
Published | 2018-02-16 |
URL | http://arxiv.org/abs/1802.05842v1 |
http://arxiv.org/pdf/1802.05842v1.pdf | |
PWC | https://paperswithcode.com/paper/neural-granger-causality-for-nonlinear-time |
Repo | https://github.com/icc2115/Neural-GC |
Framework | pytorch |
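The key mechanism in the abstract is a group-lasso penalty that can zero out all first-layer weights attached to one candidate input series. Below is a hedged PyTorch sketch of that penalty for a single per-series MLP; the input layout is an assumption, and the proximal optimization typically used with such penalties is omitted.

```python
import torch
import torch.nn as nn

class ComponentMLP(nn.Module):
    """MLP for one target series. Inputs are the lagged values of all series,
    ordered series-major, so first-layer columns can be grouped by series."""
    def __init__(self, n_series, n_lags, hidden=32):
        super().__init__()
        self.n_series, self.n_lags = n_series, n_lags
        self.net = nn.Sequential(
            nn.Linear(n_series * n_lags, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x):                 # x: (batch, n_series * n_lags)
        return self.net(x)

    def group_lasso(self):
        W = self.net[0].weight.view(-1, self.n_series, self.n_lags)
        # One group = all first-layer weights attached to one input series;
        # a zeroed group means "this series is not Granger-causal" for the target.
        return W.pow(2).sum(dim=(0, 2)).sqrt().sum()
```

Training would then minimize something like `F.mse_loss(model(x), y) + lam * model.group_lasso()`.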
Deep Imbalanced Attribute Classification using Visual Attention Aggregation
Title | Deep Imbalanced Attribute Classification using Visual Attention Aggregation |
Authors | Nikolaos Sarafianos, Xiang Xu, Ioannis A. Kakadiaris |
Abstract | For many computer vision applications, such as image description and human identification, recognizing the visual attributes of humans is an essential yet challenging problem. Its challenges originate from its multi-label nature, the large underlying class imbalance and the lack of spatial annotations. Existing methods follow either a computer vision approach while failing to account for class imbalance, or explore machine learning solutions, which disregard the spatial and semantic relations that exist in the images. With that in mind, we propose an effective method that extracts and aggregates visual attention masks at different scales. We introduce a loss function to handle class imbalance both at class and at an instance level and further demonstrate that penalizing attention masks with high prediction variance accounts for the weak supervision of the attention mechanism. By identifying and addressing these challenges, we achieve state-of-the-art results with a simple attention mechanism in both PETA and WIDER-Attribute datasets without additional context or side information. |
Tasks | |
Published | 2018-07-10 |
URL | http://arxiv.org/abs/1807.03903v2 |
http://arxiv.org/pdf/1807.03903v2.pdf | |
PWC | https://paperswithcode.com/paper/deep-imbalanced-attribute-classification |
Repo | https://github.com/cvcode18/imbalanced_learning |
Framework | mxnet |
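A hedged sketch of the class-level reweighting idea, a weighted multi-label binary cross-entropy in which rare attributes receive larger positive weights, is given below. The exponential weighting scheme is an assumption for illustration; the paper's loss also includes instance-level weighting and an attention-variance penalty that are not shown.

```python
import torch
import torch.nn.functional as F

def weighted_bce(logits, targets, pos_freq):
    """logits, targets: (batch, n_attrs); pos_freq: (n_attrs,) positive rates."""
    w_pos = torch.exp(1.0 - pos_freq)   # up-weight positives of rare attributes
    w_neg = torch.exp(pos_freq)         # weight negatives more for frequent attributes
    w = targets * w_pos + (1 - targets) * w_neg
    return F.binary_cross_entropy_with_logits(logits, targets, weight=w)
```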
Self-Attention with Relative Position Representations
Title | Self-Attention with Relative Position Representations |
Authors | Peter Shaw, Jakob Uszkoreit, Ashish Vaswani |
Abstract | Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs. |
Tasks | Machine Translation |
Published | 2018-03-06 |
URL | http://arxiv.org/abs/1803.02155v2 |
http://arxiv.org/pdf/1803.02155v2.pdf | |
PWC | https://paperswithcode.com/paper/self-attention-with-relative-position |
Repo | https://github.com/tensorflow/tensor2tensor |
Framework | tf |
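The extension described in the abstract adds learned relative-position embeddings to the attention logits, with relative distances clipped to a maximum value. A simplified single-head PyTorch sketch of the logits computation is below; the value-side relative term and multi-head bookkeeping are omitted, and the clipping distance is an arbitrary choice.

```python
import torch
import torch.nn as nn

class RelativeAttentionLogits(nn.Module):
    """e_ij = (q_i . k_j + q_i . a_ij) / sqrt(d), with a_ij a learned
    embedding of the clipped relative position j - i (single head)."""
    def __init__(self, d_model, max_dist=16):
        super().__init__()
        self.max_dist = max_dist
        self.rel_k = nn.Embedding(2 * max_dist + 1, d_model)

    def forward(self, q, k):                      # q, k: (seq, d_model)
        seq, d = q.shape
        pos = torch.arange(seq)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
        a_k = self.rel_k(rel + self.max_dist)     # (seq, seq, d_model)
        logits = q @ k.t() + torch.einsum('id,ijd->ij', q, a_k)
        return logits / d ** 0.5
```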
TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation
Title | TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation |
Authors | Vladimir Iglovikov, Alexey Shvets |
Abstract | Pixel-wise image segmentation is a demanding task in computer vision. Classical U-Net architectures composed of encoders and decoders are very popular for segmentation of medical images, satellite images, etc. Typically, a neural network initialized with weights from a network pre-trained on a large data set like ImageNet shows better performance than one trained from scratch on a small dataset. In some practical applications, particularly in medicine and traffic safety, the accuracy of the models is of utmost importance. In this paper, we demonstrate how the U-Net type architecture can be improved by the use of the pre-trained encoder. Our code and corresponding pre-trained weights are publicly available at https://github.com/ternaus/TernausNet. We compare three weight initialization schemes: LeCun uniform, the encoder with weights from VGG11, and the full network trained on the Carvana dataset. This network architecture was a part of the winning solution (1st out of 735) in the Kaggle: Carvana Image Masking Challenge. |
Tasks | Semantic Segmentation |
Published | 2018-01-17 |
URL | http://arxiv.org/abs/1801.05746v1 |
http://arxiv.org/pdf/1801.05746v1.pdf | |
PWC | https://paperswithcode.com/paper/ternausnet-u-net-with-vgg11-encoder-pre |
Repo | https://github.com/kaichoulyc/tgs-salts |
Framework | pytorch |
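A hedged sketch of the encoder side, slicing an ImageNet-pretrained VGG11 from torchvision into U-Net encoder stages whose outputs feed the skip connections, is shown below. It assumes a recent torchvision (weights enum API) and omits the decoder.

```python
import torch.nn as nn
from torchvision import models

# Slice an ImageNet-pretrained VGG11 into U-Net encoder stages.
# Indices follow torchvision's vgg11().features layout (conv/ReLU/maxpool).
vgg = models.vgg11(weights=models.VGG11_Weights.IMAGENET1K_V1).features

encoder_stages = nn.ModuleList([
    vgg[:2],     # conv1 block  ->  64 channels
    vgg[2:5],    # conv2 block  -> 128 channels
    vgg[5:10],   # conv3 block  -> 256 channels
    vgg[10:15],  # conv4 block  -> 512 channels
    vgg[15:20],  # conv5 block  -> 512 channels
])
# Each stage's output is kept and concatenated with the matching decoder
# feature map (the decoder and upsampling path are not shown).
```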
Probabilistic Video Generation using Holistic Attribute Control
Title | Probabilistic Video Generation using Holistic Attribute Control |
Authors | Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, Leonid Sigal |
Abstract | Videos express highly structured spatio-temporal patterns of visual data. A video can be thought of as being governed by two factors: (i) temporally invariant (e.g., person identity), or slowly varying (e.g., activity), attribute-induced appearance, encoding the persistent content of each frame, and (ii) an inter-frame motion or scene dynamics (e.g., encoding evolution of the person executing the action). Based on this intuition, we propose a generative framework for video generation and future prediction. The proposed framework generates a video (short clip) by decoding samples sequentially drawn from a latent space distribution into full video frames. Variational Autoencoders (VAEs) are used as a means of encoding/decoding frames into/from the latent space and RNN as a way to model the dynamics in the latent space. We improve the video generation consistency through temporally-conditional sampling and quality by structuring the latent space with attribute controls; ensuring that attributes can be both inferred and conditioned on during learning/generation. As a result, given attributes and/or the first frame, our model is able to generate diverse but highly consistent sets of video sequences, accounting for the inherent uncertainty in the prediction task. Experimental results on Chair CAD, Weizmann Human Action, and MIT-Flickr datasets, along with detailed comparison to the state-of-the-art, verify effectiveness of the framework. |
Tasks | Future prediction, Video Generation |
Published | 2018-03-21 |
URL | http://arxiv.org/abs/1803.08085v1 |
http://arxiv.org/pdf/1803.08085v1.pdf | |
PWC | https://paperswithcode.com/paper/probabilistic-video-generation-using-holistic |
Repo | https://github.com/charlescheng0117/pytorch-VideoVAE |
Framework | pytorch |
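A highly simplified sketch of the encode / latent-dynamics / decode structure described in the abstract (a frame VAE with an RNN over latents) is given below. Attribute conditioning and the paper's temporally-conditional sampling are omitted, and all layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class LatentDynamicsVideoVAE(nn.Module):
    """Toy frame VAE with an LSTM over latent codes (attributes omitted)."""
    def __init__(self, frame_dim=1024, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(frame_dim, 2 * z_dim)   # mean and log-variance
        self.dec = nn.Linear(z_dim, frame_dim)
        self.dyn = nn.LSTM(z_dim, z_dim, batch_first=True)

    def encode(self, frame):                         # reparameterized latent sample
        mu, logvar = self.enc(frame).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def generate(self, z0, n_frames):                # roll the latent dynamics forward
        zs, z, state = [z0], z0.unsqueeze(1), None
        for _ in range(n_frames - 1):
            z, state = self.dyn(z, state)
            zs.append(z.squeeze(1))
        return torch.stack([self.dec(z_t) for z_t in zs], dim=1)
```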
First Order Generative Adversarial Networks
Title | First Order Generative Adversarial Networks |
Authors | Calvin Seward, Thomas Unterthiner, Urs Bergmann, Nikolay Jetchev, Sepp Hochreiter |
Abstract | GANs excel at learning high dimensional distributions, but they can update generator parameters in directions that do not correspond to the steepest descent direction of the objective. Prominent examples of problematic update directions include those used in both Goodfellow’s original GAN and the WGAN-GP. To formally describe an optimal update direction, we introduce a theoretical framework which allows the derivation of requirements on both the divergence and corresponding method for determining an update direction, with these requirements guaranteeing unbiased mini-batch updates in the direction of steepest descent. We propose a novel divergence which approximates the Wasserstein distance while regularizing the critic’s first order information. Together with an accompanying update direction, this divergence fulfills the requirements for unbiased steepest descent updates. We verify our method, the First Order GAN, with image generation on CelebA, LSUN and CIFAR-10 and set a new state of the art on the One Billion Word language generation task. Code to reproduce experiments is available. |
Tasks | Image Generation, Text Generation |
Published | 2018-02-13 |
URL | http://arxiv.org/abs/1802.04591v2 |
http://arxiv.org/pdf/1802.04591v2.pdf | |
PWC | https://paperswithcode.com/paper/first-order-generative-adversarial-networks |
Repo | https://github.com/zalandoresearch/first_order_gan |
Framework | tf |
Telepresence System based on Simulated Holographic Display
Title | Telepresence System based on Simulated Holographic Display |
Authors | Diana-Margarita Córdova-Esparza, Juan Terven, Hugo Jiménez-Hernández, Ana Herrera-Navarro, Alberto Vázquez-Cervantes, Juan-M. García-Huerta |
Abstract | We present a telepresence system based on a custom-made simulated holographic display that produces a full 3D model of the remote participants using commodity depth sensors. Our display is composed of a video projector and a quadrangular pyramid made of acrylic, that allows the user to experience an omnidirectional visualization of a remote person without the need for head-mounted displays. To obtain a precise representation of the participants, we fuse together multiple views extracted using a deep background subtraction method. Our system represents an attempt to democratize high-fidelity 3D telepresence using off-the-shelf components. |
Tasks | |
Published | 2018-04-06 |
URL | http://arxiv.org/abs/1804.02343v1 |
http://arxiv.org/pdf/1804.02343v1.pdf | |
PWC | https://paperswithcode.com/paper/telepresence-system-based-on-simulated |
Repo | https://github.com/jrterven/backsub |
Framework | none |
Cross-validation in high-dimensional spaces: a lifeline for least-squares models and multi-class LDA
Title | Cross-validation in high-dimensional spaces: a lifeline for least-squares models and multi-class LDA |
Authors | Matthias S. Treder |
Abstract | Least-squares models such as linear regression and Linear Discriminant Analysis (LDA) are amongst the most popular statistical learning techniques. However, since their computation time increases cubically with the number of features, they are inefficient in high-dimensional neuroimaging datasets. Fortunately, for k-fold cross-validation, an analytical approach has been developed that yields the exact cross-validated predictions in least-squares models without explicitly training the model. Its computation time grows with the number of test samples. Here, this approach is systematically investigated in the context of cross-validation and permutation testing. LDA is used exemplarily but results hold for all other least-squares methods. Furthermore, a non-trivial extension to multi-class LDA is formally derived. The analytical approach is evaluated using complexity calculations, simulations, and permutation testing of an EEG/MEG dataset. Depending on the ratio between features and samples, the analytical approach is up to 10,000x faster than the standard approach (retraining the model on each training set). This allows for a fast cross-validation of least-squares models and multi-class LDA in high-dimensional data, with obvious applications in multi-dimensional datasets, Representational Similarity Analysis, and permutation testing. |
Tasks | EEG |
Published | 2018-03-27 |
URL | http://arxiv.org/abs/1803.10016v1 |
http://arxiv.org/pdf/1803.10016v1.pdf | |
PWC | https://paperswithcode.com/paper/cross-validation-in-high-dimensional-spaces-a |
Repo | https://github.com/treder/Fast-Least-Squares |
Framework | none |
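The paper's analytical approach generalizes a classical identity: for least-squares fits, held-out predictions can be computed from the full-data fit without retraining. The NumPy sketch below shows the leave-one-out special case for (optionally ridge-penalized) linear regression via the hat matrix; the paper's k-fold and multi-class LDA extensions go beyond this.

```python
import numpy as np

def loo_predictions(X, y, ridge=0.0):
    """Exact leave-one-out predictions for (ridge-penalized) least squares,
    without retraining, using the hat matrix H = X (X'X + r I)^-1 X'."""
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T)
    y_hat = H @ y
    h = np.diag(H)
    # Classical identity: y_hat_loo_i = (y_hat_i - h_ii * y_i) / (1 - h_ii).
    return (y_hat - h * y) / (1.0 - h)
```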
NDDR-CNN: Layerwise Feature Fusing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction
Title | NDDR-CNN: Layerwise Feature Fusing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction |
Authors | Yuan Gao, Jiayi Ma, Mingbo Zhao, Wei Liu, Alan L. Yuille |
Abstract | In this paper, we propose a novel Convolutional Neural Network (CNN) structure for general-purpose multi-task learning (MTL), which enables automatic feature fusing at every layer from different tasks. This is in contrast with the most widely used MTL CNN structures which empirically or heuristically share features on some specific layers (e.g., share all the features except the last convolutional layer). The proposed layerwise feature fusing scheme is formulated by combining existing CNN components in a novel way, with clear mathematical interpretability as discriminative dimensionality reduction, which is referred to as Neural Discriminative Dimensionality Reduction (NDDR). Specifically, we first concatenate features with the same spatial resolution from different tasks according to their channel dimension. Then, we show that the discriminative dimensionality reduction can be fulfilled by 1x1 Convolution, Batch Normalization, and Weight Decay in one CNN. The use of existing CNN components ensures the end-to-end training and the extensibility of the proposed NDDR layer to various state-of-the-art CNN architectures in a “plug-and-play” manner. The detailed ablation analysis shows that the proposed NDDR layer is easy to train and also robust to different hyperparameters. Experiments on different task sets with various base network architectures demonstrate the promising performance and desirable generalizability of our proposed method. The code of our paper is available at https://github.com/ethanygao/NDDR-CNN. |
Tasks | Multi-Task Learning |
Published | 2018-01-25 |
URL | http://arxiv.org/abs/1801.08297v4 |
http://arxiv.org/pdf/1801.08297v4.pdf | |
PWC | https://paperswithcode.com/paper/nddr-cnn-layer-wise-feature-fusing-in-multi |
Repo | https://github.com/ethanygao/NDDR-CNN |
Framework | tf |
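The fusion step described in the abstract (concatenate same-resolution features from the tasks along channels, then reduce back per task with a 1x1 convolution and BatchNorm) is easy to sketch. The PyTorch version below is a hedged two-task illustration; the ReLU placement is an assumption, and weight decay on the 1x1 convolutions is left to the optimizer.

```python
import torch
import torch.nn as nn

class NDDRLayer(nn.Module):
    """Fuse same-resolution features of two tasks with 1x1 conv + BatchNorm."""
    def __init__(self, c_task1, c_task2):
        super().__init__()
        c_all = c_task1 + c_task2
        self.fuse1 = nn.Sequential(nn.Conv2d(c_all, c_task1, 1),
                                   nn.BatchNorm2d(c_task1), nn.ReLU())
        self.fuse2 = nn.Sequential(nn.Conv2d(c_all, c_task2, 1),
                                   nn.BatchNorm2d(c_task2), nn.ReLU())

    def forward(self, f1, f2):
        x = torch.cat([f1, f2], dim=1)   # concatenate along the channel axis
        return self.fuse1(x), self.fuse2(x)
```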