February 2, 2020

# Paper Group AWR 26

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection. Associative Convolutional Layers. PredNet and Predictive Coding: A Critical Review. Winning the Lottery with Continuous Sparsification. FISR: Deep Joint Frame Interpolation and Super-Resolution with A Multi-scale Temporal Loss. Event-driven Video Frame Synthesis. Frame attenti …

#### AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Title AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection
Authors Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru
Abstract Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human-labeled frames, or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.
Published 2019-01-05
URL https://arxiv.org/abs/1901.01342v2
PDF https://arxiv.org/pdf/1901.01342v2.pdf
PWC https://paperswithcode.com/paper/ava-activespeaker-an-audio-visual-dataset-for
Repo https://github.com/cvdfoundation/ava-dataset
Framework none

#### Associative Convolutional Layers

Title Associative Convolutional Layers
Authors Hamed Omidvar, Vahideh Akhlaghi, Massimo Franceschetti, Rajesh K. Gupta
Abstract Motivated by the necessity for parameter efficiency in distributed machine learning and AI-enabled edge devices, we provide a general and easy-to-implement method for significantly reducing the number of parameters of Convolutional Neural Networks (CNNs), during both the training and inference phases. We introduce a simple auxiliary neural network which can generate the convolutional filters of any CNN architecture from a low-dimensional latent space. This auxiliary neural network, which we call “Convolutional Slice Generator” (CSG), is unique to the network and provides the association between its convolutional layers. During the training of the CNN, instead of training the filters of the convolutional layers, only the parameters of the CSG and their corresponding “code vectors” are trained. This results in a significant reduction of the number of parameters because the CNN can be fully represented using only the parameters of the CSG, the code vectors, the fully connected layers, and the architecture of the CNN. We evaluate our approach by applying it to ResNet and DenseNet models trained on the CIFAR-10 and ImageNet datasets. While reducing the number of parameters by $\approx 2 \times$ on average, the accuracies of these networks remain within 1% of their original counterparts, and in some cases accuracy increases.
Published 2019-06-10
URL https://arxiv.org/abs/1906.04309v3
PDF https://arxiv.org/pdf/1906.04309v3.pdf
PWC https://paperswithcode.com/paper/associative-convolutional-layers
Repo https://github.com/hamedomidvar/associativeconv
Framework pytorch
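
The parameter saving described in the abstract above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the CSG is a single linear map from a code vector to one filter slice; the slice size, code dimension, and slice count are hypothetical values, not figures from the paper, and the authors' generator may be structured differently.

```python
import random

def make_generator(code_dim, slice_size, seed=0):
    """Shared generator weights: one row per output element of a filter slice.
    A plain linear map is an assumption made for this sketch."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(code_dim)]
            for _ in range(slice_size)]

def generate_slice(generator, code):
    """Expand one low-dimensional code vector into one filter slice."""
    return [sum(w * c for w, c in zip(row, code)) for row in generator]

def parameter_counts(num_slices, slice_size, code_dim):
    """Trainable parameters: direct filters vs. shared generator plus codes."""
    direct = num_slices * slice_size
    shared = slice_size * code_dim + num_slices * code_dim
    return direct, shared

# Hypothetical sizes: 3x3 slices spanning 16 channels, 72-dimensional codes.
SLICE_SIZE = 3 * 3 * 16
CODE_DIM = 72
NUM_SLICES = 1000

generator = make_generator(CODE_DIM, SLICE_SIZE)
code = [0.01 * i for i in range(CODE_DIM)]
filter_slice = generate_slice(generator, code)
direct, shared = parameter_counts(NUM_SLICES, SLICE_SIZE, CODE_DIM)
```

With these made-up sizes, only the generator and the per-slice codes are trained, so the trainable parameter count falls from `direct` to `shared`; the actual ratio depends on the code dimension and on how many slices share the generator.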

#### PredNet and Predictive Coding: A Critical Review

Title PredNet and Predictive Coding: A Critical Review
Authors Roshan Rane, Edit Szügyi, Vageesh Saxena, André Ofner, Sebastian Stober
Abstract The PredNet architecture by Lotter et al. combines a biologically plausible architecture called Predictive Coding with self-supervised video prediction in order to learn the complex structure of the visual world. While the architecture has drawn a lot of attention and various extensions of the model exist, a critical analysis is lacking. We fill this gap by evaluating PredNet, both as an implementation of Predictive Coding theory and as a self-supervised video prediction model, using a challenging video action classification dataset. We also design an extended architecture to test whether conditioning future frame predictions on the action class of the video, and vice versa, improves model performance. With substantial evidence, we show that PredNet does not completely follow the principles of Predictive Coding. Our comprehensive analysis and results are intended to guide future research based on PredNet or similar architectures based on the Predictive Coding theory.
Published 2019-06-14
URL https://arxiv.org/abs/1906.11902v2
PDF https://arxiv.org/pdf/1906.11902v2.pdf
PWC https://paperswithcode.com/paper/video-action-classification-using-prednet
Repo https://github.com/RoshanRane/PredNet-and-Predictive-Coding-A-Critical-Review
Framework none

#### Winning the Lottery with Continuous Sparsification

Title Winning the Lottery with Continuous Sparsification
Authors Pedro Savarese, Hugo Silva, Michael Maire
Abstract The Lottery Ticket Hypothesis from Frankle & Carbin (2019) conjectures that, for typically-sized neural networks, it is possible to find small sub-networks which train faster and yield better performance than their original counterparts. The proposed algorithm to search for “winning tickets”, Iterative Magnitude Pruning, consistently finds sub-networks with 90-95% fewer parameters which train faster and better than the overparameterized models they were extracted from, creating potential applications to problems such as transfer learning. In this paper, we propose Continuous Sparsification, a new algorithm to search for winning tickets which continuously removes parameters from a network during training, and learns the sub-network’s structure with gradient-based methods instead of relying on pruning strategies. We show empirically that our method is capable of finding tickets that outperform the ones learned by Iterative Magnitude Pruning, while also providing faster search, when measured in number of training epochs or wall-clock time.
Published 2019-12-10
URL https://arxiv.org/abs/1912.04427v2
PDF https://arxiv.org/pdf/1912.04427v2.pdf
PWC https://paperswithcode.com/paper/winning-the-lottery-with-continuous-1
Repo https://github.com/lolemacs/continuous-sparsification
Framework pytorch
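
The idea of learning a sub-network's structure with gradients, as the abstract above describes, can be sketched with a soft mask that hardens over training. The sketch assumes a sigmoid gate `sigmoid(beta * s)` with a temperature `beta` that is increased over time; the scores and temperature values are hypothetical, and the paper's exact parameterization and schedule may differ.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def soft_mask(scores, beta):
    """Differentiable gate in (0, 1) for each weight; larger beta is sharper."""
    return [sigmoid(beta * s) for s in scores]

def hard_mask(scores):
    """Binary sub-network that the soft mask approaches as beta grows."""
    return [1.0 if s > 0 else 0.0 for s in scores]

# Hypothetical learned scores, one per weight.
scores = [-2.0, -0.5, 0.5, 2.0]

early = soft_mask(scores, beta=1.0)    # smooth, gradients flow to all scores
late = soft_mask(scores, beta=200.0)   # nearly binary
final = hard_mask(scores)
```

At low temperature every score receives a useful gradient, while at high temperature the gate is effectively the binary mask, which is what allows weights to be removed continuously rather than in discrete pruning rounds.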

#### FISR: Deep Joint Frame Interpolation and Super-Resolution with A Multi-scale Temporal Loss

Title FISR: Deep Joint Frame Interpolation and Super-Resolution with A Multi-scale Temporal Loss
Authors Soo Ye Kim, Jihyong Oh, Munchurl Kim
Abstract Super-resolution (SR) has been widely used to convert low-resolution legacy videos to high-resolution (HR) ones, to suit the increasing resolution of displays (e.g. UHD TVs). However, it becomes easier for humans to notice motion artifacts (e.g. motion judder) in HR videos rendered on larger display devices. Thus, broadcasting standards support higher frame rates for UHD (Ultra High Definition) videos (4K@60 fps, 8K@120 fps), meaning that applying SR alone is insufficient to produce genuinely high-quality videos. Hence, to up-convert legacy videos for realistic applications, not only SR but also video frame interpolation (VFI) is necessary. In this paper, we first propose a joint VFI-SR framework for up-scaling the spatio-temporal resolution of videos from 2K 30 fps to 4K 60 fps. For this, we propose a novel training scheme with a multi-scale temporal loss that imposes temporal regularization on the input video sequence, which can be applied to any general video-related task. The proposed structure is analyzed in depth with extensive experiments.
Published 2019-12-16
URL https://arxiv.org/abs/1912.07213v1
PDF https://arxiv.org/pdf/1912.07213v1.pdf
PWC https://paperswithcode.com/paper/fisr-deep-joint-frame-interpolation-and-super
Repo https://github.com/JihyongOh/FISR
Framework tf

#### Event-driven Video Frame Synthesis

Title Event-driven Video Frame Synthesis
Authors Zihao W. Wang, Weixin Jiang, Kuan He, Boxin Shi, Aggelos Katsaggelos, Oliver Cossairt
Abstract Temporal Video Frame Synthesis (TVFS) aims at synthesizing novel frames at timestamps different from existing frames, which has wide applications in video coding, editing and analysis. In this paper, we propose a high-framerate TVFS framework which takes hybrid input data from a low-speed frame-based sensor and a high-speed event-based sensor. Compared to frame-based sensors, event-based sensors report brightness changes at very high speed, which may well provide useful spatio-temporal information for high-framerate TVFS. In our framework, we first introduce a differentiable forward model to approximate the physical sensing process, fusing the two different modes of data as well as unifying a variety of TVFS tasks, i.e., interpolation, prediction and motion deblurring. We leverage autodifferentiation, which propagates the gradients of a loss defined on the measured data back to the latent high-framerate video. We show results with better performance compared to the state of the art. Second, we develop a deep learning-based strategy to enhance the results from the first step, which we refer to as a residual “denoising” process. Our trained “denoiser” goes beyond Gaussian denoising and shows properties such as contrast enhancement and motion awareness. We show that our framework is capable of handling challenging scenes including both fast motion and strong occlusions.
Tasks Deblurring, Denoising, Sensor Fusion, Video Frame Interpolation, Video Prediction
Published 2019-02-26
URL https://arxiv.org/abs/1902.09680v2
PDF https://arxiv.org/pdf/1902.09680v2.pdf
PWC https://paperswithcode.com/paper/event-driven-video-frame-synthesis
Repo https://github.com/winswang/int-event-fusion
Framework tf

#### Frame attention networks for facial expression recognition in videos

Title Frame attention networks for facial expression recognition in videos
Authors Debin Meng, Xiaojiang Peng, Kai Wang, Yu Qiao
Abstract Video-based facial expression recognition aims to classify a given video into one of several basic emotions. How to integrate the facial features of individual frames is crucial for this task. In this paper, we propose Frame Attention Networks (FAN) to automatically highlight discriminative frames in an end-to-end framework. The network takes a video with a variable number of face images as its input and produces a fixed-dimension representation. The whole network is composed of two modules. The feature embedding module is a deep Convolutional Neural Network (CNN) which embeds face images into feature vectors. The frame attention module learns multiple attention weights which are used to adaptively aggregate the feature vectors to form a single discriminative video representation. We conduct extensive experiments on the CK+ and AFEW8.0 datasets. Our proposed FAN shows superior performance compared to other CNN-based methods and achieves state-of-the-art performance on CK+.
Published 2019-06-29
URL https://arxiv.org/abs/1907.00193v2
PDF https://arxiv.org/pdf/1907.00193v2.pdf
PWC https://paperswithcode.com/paper/frame-attention-networks-for-facial
Repo https://github.com/Open-Debin/Emotion-FAN
Framework pytorch
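
The aggregation step in the abstract above, where a variable number of frame features becomes one fixed-size vector, can be sketched as an attention-weighted average. The sketch assumes one sigmoid attention weight per frame computed from a dot product with a learned vector; the name `query` and all numbers are hypothetical, and the CNN embedding module is replaced by ready-made feature lists.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def frame_attention(frames, query):
    """Aggregate any number of frame feature vectors into one vector.

    Each frame gets a scalar weight sigmoid(frame . query); the output is
    the weight-normalized sum of the frame features.
    """
    weights = [sigmoid(sum(f * q for f, q in zip(frame, query)))
               for frame in frames]
    total = sum(weights)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) / total
            for d in range(dim)]

# Hypothetical 4-dimensional frame features for videos of different lengths.
query = [0.5, -0.25, 1.0, 0.0]
short_video = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.5, 0.2], [0.9, 0.1, 0.0, 0.3]]
long_video = short_video + [[0.3, 0.3, 0.3, 0.3], [1.0, 0.0, 1.0, 0.0]]

short_repr = frame_attention(short_video, query)
long_repr = frame_attention(long_video, query)
```

Both calls return a 4-dimensional vector regardless of the number of frames, which is the property that lets a fixed-size classifier sit on top of the aggregated representation.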

#### M-VAD Names: a Dataset for Video Captioning with Naming

Title M-VAD Names: a Dataset for Video Captioning with Naming
Authors Stefano Pini, Marcella Cornia, Federico Bolelli, Lorenzo Baraldi, Rita Cucchiara
Abstract Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic “someone” tag. The lack of movie description datasets with characters’ visual annotations surely plays a relevant role in this shortage. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures to replace the “someone” tags with proper character names in existing video captions. The evaluation is further extended by testing this application on videos outside of the M-VAD Names dataset.
Published 2019-03-04
URL http://arxiv.org/abs/1903.01489v1
PDF http://arxiv.org/pdf/1903.01489v1.pdf
Framework none

#### The Czech Court Decisions Corpus (CzCDC): Availability as the First Step

Title The Czech Court Decisions Corpus (CzCDC): Availability as the First Step
Authors Tereza Novotná, Jakub Harašta
Abstract In this paper, we describe the Czech Court Decisions Corpus (CzCDC). CzCDC is a dataset of 237,723 decisions published by the Czech apex (or top-tier) courts, namely the Supreme Court, the Supreme Administrative Court and the Constitutional Court. All the decisions were published between 1st January 1993 and 30th September 2018. Court decisions are available on the webpages of the respective courts or via commercial databases of legal information. Researchers interested in these decisions therefore often have to turn either to the respective court or to a commercial provider, which leads to delays and additional costs. These are further exacerbated by the lack of an inter-court standard for the data format in which courts provide their decisions. Additionally, courts’ databases often lack proper documentation. Our goal is to make the dataset of court decisions freely available online in a consistent (plain) format to lower the cost associated with obtaining data for future research. We believe that simplified access to court decisions through the CzCDC could benefit other researchers. In this paper, we describe the processing of decisions before their inclusion in CzCDC and basic statistics of the dataset. The dataset contains plain texts of court decisions, and these texts are not annotated for any grammatical or syntactical features.
Published 2019-10-21
URL https://arxiv.org/abs/1910.09513v1
PDF https://arxiv.org/pdf/1910.09513v1.pdf
PWC https://paperswithcode.com/paper/the-czech-court-decisions-corpus-czcdc
Repo https://github.com/czech-case-law-relevance/czech-court-citations-dataset
Framework none

#### SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Title SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
Authors Beidi Chen, Tharun Medini, James Farwell, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava
Abstract Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters to maintain enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, specialized hardware is expensive and hard to generalize to a multitude of tasks. Progress on the algorithmic front has so far failed to demonstrate a direct advantage over powerful hardware such as NVIDIA V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine), which uniquely blends smart randomized algorithms with multi-core parallelism and workload optimization. Using just a CPU, SLIDE drastically reduces the computations during both training and inference, outperforming an optimized implementation of TensorFlow (TF) on the best available GPU. Our evaluations on industry-scale recommendation datasets, with large fully connected architectures, show that training with SLIDE on a 44-core CPU is more than 3.5 times (1 hour vs. 3.5 hours) faster than the same network trained using TF on a Tesla V100 at any given accuracy level. On the same CPU hardware, SLIDE is over 10x faster than TF. We provide code and scripts for reproducibility.
Published 2019-03-07
URL https://arxiv.org/abs/1903.03129v2
PDF https://arxiv.org/pdf/1903.03129v2.pdf
PWC https://paperswithcode.com/paper/slide-in-defense-of-smart-algorithms-over
Repo https://github.com/keroro824/HashingDeepLearning
Framework tf
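
The abstract above does not spell out which randomized algorithm reduces the computation. The sketch below assumes locality-sensitive hashing with signed random projections (SimHash): neurons are bucketed by the hash of their weight vector, and for each input only the neurons in the matching bucket are evaluated. All names and sizes are hypothetical, and this is not the SLIDE implementation.

```python
import random

def make_planes(num_bits, dim, seed=0):
    """Random hyperplanes defining a SimHash function (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def simhash(planes, vec):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def build_table(planes, weights):
    """Bucket neuron indices by the hash of their weight vectors."""
    table = {}
    for idx, w in enumerate(weights):
        table.setdefault(simhash(planes, w), []).append(idx)
    return table

def sparse_forward(planes, table, weights, x):
    """Compute activations only for neurons sharing the input's bucket."""
    active = table.get(simhash(planes, x), [])
    return {i: sum(w * v for w, v in zip(weights[i], x)) for i in active}

# Hypothetical layer: neuron 0 is aligned with the input, neuron 1 opposes it.
x = [1.0, -2.0, 0.5, 3.0]
rng = random.Random(1)
weights = [list(x), [-v for v in x]]
weights += [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]

planes = make_planes(num_bits=6, dim=4)
table = build_table(planes, weights)
activations = sparse_forward(planes, table, weights, x)
```

Neurons whose weights point in a similar direction to the input are likely to collide with it in the hash table, so the layer evaluates a small subset of likely-large activations instead of a full matrix product.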

#### AMR Parsing as Sequence-to-Graph Transduction

Title AMR Parsing as Sequence-to-Graph Transduction
Authors Sheng Zhang, Xutai Ma, Kevin Duh, Benjamin Van Durme
Abstract We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers that rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free, and it can be effectively trained with limited amounts of labeled AMR data. Our experimental results outperform all previously reported SMATCH scores, on both AMR 2.0 (76.3% F1 on LDC2017T10) and AMR 1.0 (70.2% F1 on LDC2014T12).
Published 2019-05-21
URL https://arxiv.org/abs/1905.08704v2
PDF https://arxiv.org/pdf/1905.08704v2.pdf
PWC https://paperswithcode.com/paper/amr-parsing-as-sequence-to-graph-transduction
Repo https://github.com/sheng-z/stog
Framework pytorch

#### GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification

Title GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification
Authors Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, Maosong Sun
Abstract Fact verification (FV) is a challenging task which requires retrieving relevant evidence from plain text and using the evidence to verify given claims. Many claims require simultaneously integrating and reasoning over several pieces of evidence for verification. However, previous work employs simple models to extract information from evidence without letting the pieces of evidence communicate with each other, e.g., by merely concatenating the evidence for processing. Therefore, these methods are unable to grasp sufficient relational and logical information among the evidence. To alleviate this issue, we propose a graph-based evidence aggregating and reasoning (GEAR) framework which enables information to transfer on a fully-connected evidence graph and then utilizes different aggregators to collect multi-evidence information. We further employ BERT, an effective pre-trained language representation model, to improve the performance. Experimental results on the large-scale benchmark dataset FEVER demonstrate that GEAR can leverage multi-evidence information for FV and thus achieves a promising result, with a test FEVER score of 67.10%. Our code is available at https://github.com/thunlp/GEAR.
Published 2019-07-22
URL https://arxiv.org/abs/1908.01843v1
PDF https://arxiv.org/pdf/1908.01843v1.pdf
PWC https://paperswithcode.com/paper/gear-graph-based-evidence-aggregating-and-1
Repo https://github.com/thunlp/KernelGAT
Framework pytorch

#### Layerwise Relevance Visualization in Convolutional Text Graph Classifiers

Title Layerwise Relevance Visualization in Convolutional Text Graph Classifiers
Authors Robert Schwarzenberg, Marc Hübner, David Harbecke, Christoph Alt, Leonhard Hennig
Abstract Representations in the hidden layers of Deep Neural Networks (DNN) are often hard to interpret since it is difficult to project them into an interpretable domain. Graph Convolutional Networks (GCN) allow this projection, but existing explainability methods do not exploit this fact, i.e. do not focus their explanations on intermediate states. In this work, we present a novel method that traces and visualizes features that contribute to a classification decision in the visible and hidden layers of a GCN. Our method exposes hidden cross-layer dynamics in the input graph structure. We experimentally demonstrate that it yields meaningful layerwise explanations for a GCN sentence classifier.
Published 2019-09-24
URL https://arxiv.org/abs/1909.10911v1
PDF https://arxiv.org/pdf/1909.10911v1.pdf
PWC https://paperswithcode.com/paper/layerwise-relevance-visualization-in
Repo https://github.com/DFKI-NLP/lrv
Framework pytorch

#### Neural network augmented wave-equation simulation

Title Neural network augmented wave-equation simulation
Authors Ali Siahkoohi, Mathias Louboutin, Felix J. Herrmann
Abstract Accurate forward modeling is important for solving inverse problems. An inaccurate wave-equation simulation, as a forward operator, will offset the results obtained via inversion. In this work, we consider the case where we deal with incomplete physics. One proxy of incomplete physics is an inaccurate discretization of the Laplacian in the simulation of the wave equation via the finite-difference method. We exploit intrinsic one-to-one similarities between timestepping algorithms and Convolutional Neural Networks (CNNs), and propose to intersperse CNNs between low-fidelity timesteps. Augmenting neural networks with low-fidelity timestepping algorithms may allow us to take large timesteps while limiting the numerical dispersion artifacts. While simulating the wave equation with a low-fidelity timestepping algorithm, by correcting the wavefield several times during propagation, we hope to limit the numerical dispersion artifacts introduced by a poor discretization of the Laplacian. As a proof of concept, we demonstrate this principle by correcting for numerical dispersion while keeping the velocity model fixed and varying the source locations to generate training and testing pairs for our supervised learning algorithm.
Published 2019-09-27
URL https://arxiv.org/abs/1910.00925v2
PDF https://arxiv.org/pdf/1910.00925v2.pdf
PWC https://paperswithcode.com/paper/neural-network-augmented-wave-equation
Repo https://github.com/alisiahkoohi/NN-augmented-wave-sim
Framework tf
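
The similarity between timestepping and convolution mentioned in the abstract above can be made concrete with a minimal 1D sketch. It assumes a second-order finite-difference Laplacian (stencil `[1, -2, 1]`) and a leapfrog update; the `correct` callable is a hypothetical stand-in for the trained CNN, which is not reproduced here.

```python
LAPLACIAN = [1.0, -2.0, 1.0]  # low-order finite-difference stencil

def conv1d_same(signal, kernel):
    """Apply a 3-tap stencil; the two boundary samples are left at zero."""
    out = [0.0] * len(signal)
    for i in range(1, len(signal) - 1):
        out[i] = (kernel[0] * signal[i - 1]
                  + kernel[1] * signal[i]
                  + kernel[2] * signal[i + 1])
    return out

def timestep(u_prev, u, courant):
    """Leapfrog update: u_next = 2u - u_prev + (c*dt/dx)^2 * Laplacian(u)."""
    lap = conv1d_same(u, LAPLACIAN)
    return [2.0 * a - b + courant ** 2 * l
            for a, b, l in zip(u, u_prev, lap)]

def propagate(u_prev, u, courant, num_steps, correct=None, every=1):
    """Run low-fidelity timesteps, applying a correction every `every` steps.

    `correct` is a placeholder for a learned dispersion-correcting network.
    """
    for step in range(1, num_steps + 1):
        u_prev, u = u, timestep(u_prev, u, courant)
        if correct is not None and step % every == 0:
            u = correct(u)
    return u

# Quadratic test field: its discrete second difference is exactly 2.
field = [0.0, 1.0, 4.0, 9.0, 16.0]
next_field = timestep(field, field, courant=0.5)
```

Because each timestep is a fixed convolution followed by a pointwise combination, it has the same form as a CNN layer, which is why a learned correction can be inserted between timesteps without changing the overall structure of the propagation loop.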

#### wav2vec: Unsupervised Pre-training for Speech Recognition

Title wav2vec: Unsupervised Pre-training for Speech Recognition
Authors Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli
Abstract We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce the WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data are available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.