February 2, 2020

Paper Group AWR 26

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection


Title	AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection
Authors	Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru
Abstract	Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human labeled frames or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.
Tasks	Speaker Diarization, Speech Enhancement
Published	2019-01-05
URL	https://arxiv.org/abs/1901.01342v2
PDF	https://arxiv.org/pdf/1901.01342v2.pdf
PWC	https://paperswithcode.com/paper/ava-activespeaker-an-audio-visual-dataset-for
Repo	https://github.com/cvdfoundation/ava-dataset
Framework	none

Associative Convolutional Layers


Title	Associative Convolutional Layers
Authors	Hamed Omidvar, Vahideh Akhlaghi, Massimo Franceschetti, Rajesh K. Gupta
Abstract	Motivated by the necessity for parameter efficiency in distributed machine learning and AI-enabled edge devices, we provide a general and easy to implement method for significantly reducing the number of parameters of Convolutional Neural Networks (CNNs), during both the training and inference phases. We introduce a simple auxiliary neural network which can generate the convolutional filters of any CNN architecture from a low dimensional latent space. This auxiliary neural network, which we call “Convolutional Slice Generator” (CSG), is unique to the network and provides the association between its convolutional layers. During the training of the CNN, instead of training the filters of the convolutional layers, only the parameters of the CSG and their corresponding “code vectors” are trained. This results in a significant reduction of the number of parameters due to the fact that the CNN can be fully represented using only the parameters of the CSG, the code vectors, the fully connected layers, and the architecture of the CNN. We evaluate our approach by applying it to ResNet and DenseNet models when trained on CIFAR-10 and ImageNet datasets. While reducing the number of parameters by $\approx 2 \times$ on average, the accuracies of these networks remain within 1% of their original counterparts and in some cases there is an increase in the accuracy.
Tasks
Published	2019-06-10
URL	https://arxiv.org/abs/1906.04309v3
PDF	https://arxiv.org/pdf/1906.04309v3.pdf
PWC	https://paperswithcode.com/paper/associative-convolutional-layers
Repo	https://github.com/hamedomidvar/associativeconv
Framework	pytorch
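
The parameter saving described in the abstract can be illustrated with a rough count. The sketch below is not the authors' implementation: it assumes a purely linear generator shared by all layers, filter banks tiled by fixed-size slices, and made-up sizes (`code_dim`, `slice_in`, `slice_out` are hypothetical names). The resulting ratio is illustrative only and is not the figure reported in the paper.

```python
from math import ceil


def plain_conv_params(layers, k):
    """Weights in ordinary k x k convolutional layers (biases ignored)."""
    return sum(c_in * c_out * k * k for c_in, c_out in layers)


def csg_params(layers, k, code_dim, slice_in, slice_out):
    """Trainable weights when filters are generated from code vectors.

    Assumption: each filter bank is tiled by slices of shape
    (slice_out, slice_in, k, k), and every slice is produced by one shared
    linear map applied to a code vector of length code_dim.
    """
    slice_size = slice_in * slice_out * k * k
    generator = slice_size * code_dim  # shared across all layers
    codes = sum(
        ceil(c_in / slice_in) * ceil(c_out / slice_out) * code_dim
        for c_in, c_out in layers
    )
    return generator + codes


# Four hypothetical 64 -> 64 layers with 3 x 3 kernels.
layers = [(64, 64)] * 4
plain = plain_conv_params(layers, k=3)
shared = csg_params(layers, k=3, code_dim=8, slice_in=16, slice_out=16)
print(plain, shared)
```

Only the generator and the per-slice code vectors are counted as trainable here; fully connected layers, which the abstract says are kept as-is, are left out of both totals.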

PredNet and Predictive Coding: A Critical Review


Title	PredNet and Predictive Coding: A Critical Review
Authors	Roshan Rane, Edit Szügyi, Vageesh Saxena, André Ofner, Sebastian Stober
Abstract	The PredNet architecture by Lotter et al. combines a biologically plausible architecture called Predictive Coding with self-supervised video prediction in order to learn the complex structure of the visual world. While the architecture has drawn a lot of attention and various extensions of the model exist, there is a lack of a critical analysis. We fill in the gap by evaluating PredNet, both as an implementation of Predictive Coding theory and as a self-supervised video prediction model, using a challenging video action classification dataset. We also design an extended architecture to test if conditioning future frame predictions on the action class of the video and vice versa improves the model performance. With substantial evidence, we show that PredNet does not completely follow the principles of Predictive Coding. Our comprehensive analysis and results are aimed to guide future research based on PredNet or similar architectures based on the Predictive Coding theory.
Tasks	Action Classification, Video Prediction
Published	2019-06-14
URL	https://arxiv.org/abs/1906.11902v2
PDF	https://arxiv.org/pdf/1906.11902v2.pdf
PWC	https://paperswithcode.com/paper/video-action-classification-using-prednet
Repo	https://github.com/RoshanRane/PredNet-and-Predictive-Coding-A-Critical-Review
Framework	none

Winning the Lottery with Continuous Sparsification


Title	Winning the Lottery with Continuous Sparsification
Authors	Pedro Savarese, Hugo Silva, Michael Maire
Abstract	The Lottery Ticket Hypothesis from Frankle & Carbin (2019) conjectures that, for typically-sized neural networks, it is possible to find small sub-networks which train faster and outperform their original counterparts. The proposed algorithm to search for “winning tickets”, Iterative Magnitude Pruning, consistently finds sub-networks with 90-95% fewer parameters which train faster and better than the overparameterized models they were extracted from, creating potential applications to problems such as transfer learning. In this paper, we propose Continuous Sparsification, a new algorithm to search for winning tickets which continuously removes parameters from a network during training, and learns the sub-network’s structure with gradient-based methods instead of relying on pruning strategies. We show empirically that our method is capable of finding tickets that outperform the ones learned by Iterative Magnitude Pruning, while at the same time providing faster search, when measured in number of training epochs or wall-clock time.
Tasks	Transfer Learning
Published	2019-12-10
URL	https://arxiv.org/abs/1912.04427v2
PDF	https://arxiv.org/pdf/1912.04427v2.pdf
PWC	https://paperswithcode.com/paper/winning-the-lottery-with-continuous-1
Repo	https://github.com/lolemacs/continuous-sparsification
Framework	pytorch
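
One way to picture "continuously removing parameters" with gradient-based learning is a soft mask that hardens over training. The sketch below assumes a sigmoid mask with a temperature `beta` that grows over time; the function names and the chosen values are illustrative assumptions, not code from the linked repository.

```python
import math


def soft_mask(s, beta):
    """Differentiable mask in (0, 1): sigmoid(beta * s) for each score s."""
    return [1.0 / (1.0 + math.exp(-beta * v)) for v in s]


def hard_mask(s):
    """Binary mask obtained once the temperature is very large."""
    return [1.0 if v > 0 else 0.0 for v in s]


# Hypothetical learned mask scores for three weights.
scores = [-0.5, 0.1, 0.8]
for beta in (1, 10, 200):
    print(beta, [round(m, 4) for m in soft_mask(scores, beta)])
print("hard", hard_mask(scores))
```

As `beta` increases, each mask entry moves toward 0 or 1 according to the sign of its score, so the mask stays trainable early on and approaches a discrete sub-network selection later.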

FISR: Deep Joint Frame Interpolation and Super-Resolution with A Multi-scale Temporal Loss


Title	FISR: Deep Joint Frame Interpolation and Super-Resolution with A Multi-scale Temporal Loss
Authors	Soo Ye Kim, Jihyong Oh, Munchurl Kim
Abstract	Super-resolution (SR) has been widely used to convert low-resolution legacy videos to high-resolution (HR) ones, to suit the increasing resolution of displays (e.g. UHD TVs). However, it becomes easier for humans to notice motion artifacts (e.g. motion judder) in HR videos being rendered on larger-sized display devices. Thus, broadcasting standards support higher frame rates for UHD (Ultra High Definition) videos (4K@60 fps, 8K@120 fps), meaning that applying SR only is insufficient to produce genuine high quality videos. Hence, to up-convert legacy videos for realistic applications, not only SR but also video frame interpolation (VFI) is necessitated. In this paper, we first propose a joint VFI-SR framework for up-scaling the spatio-temporal resolution of videos from 2K 30 fps to 4K 60 fps. For this, we propose a novel training scheme with a multi-scale temporal loss that imposes temporal regularization on the input video sequence, which can be applied to any general video-related task. The proposed structure is analyzed in depth with extensive experiments.
Tasks	Super-Resolution, Video Frame Interpolation
Published	2019-12-16
URL	https://arxiv.org/abs/1912.07213v1
PDF	https://arxiv.org/pdf/1912.07213v1.pdf
PWC	https://paperswithcode.com/paper/fisr-deep-joint-frame-interpolation-and-super
Repo	https://github.com/JihyongOh/FISR
Framework	tf

Event-driven Video Frame Synthesis


Title	Event-driven Video Frame Synthesis
Authors	Zihao W. Wang, Weixin Jiang, Kuan He, Boxin Shi, Aggelos Katsaggelos, Oliver Cossairt
Abstract	Temporal Video Frame Synthesis (TVFS) aims at synthesizing novel frames at timestamps different from existing frames, which has wide applications in video codec, editing and analysis. In this paper, we propose a high framerate TVFS framework which takes hybrid input data from a low-speed frame-based sensor and a high-speed event-based sensor. Compared to frame-based sensors, event-based sensors report brightness changes at very high speed, which may well provide useful spatio-temporal information for high framerate TVFS. In our framework, we first introduce a differentiable forward model to approximate the physical sensing process, fusing the two different modes of data as well as unifying a variety of TVFS tasks, i.e., interpolation, prediction and motion deblur. We leverage autodifferentiation which propagates the gradients of a loss defined on the measured data back to the latent high framerate video. We show results with better performance compared to state-of-the-art. Second, we develop a deep learning-based strategy to enhance the results from the first step, which we refer to as a residual “denoising” process. Our trained “denoiser” is beyond Gaussian denoising and shows properties such as contrast enhancement and motion awareness. We show that our framework is capable of handling challenging scenes including both fast motion and strong occlusions.
Tasks	Deblurring, Denoising, Sensor Fusion, Video Frame Interpolation, Video Prediction
Published	2019-02-26
URL	https://arxiv.org/abs/1902.09680v2
PDF	https://arxiv.org/pdf/1902.09680v2.pdf
PWC	https://paperswithcode.com/paper/event-driven-video-frame-synthesis
Repo	https://github.com/winswang/int-event-fusion
Framework	tf

Frame attention networks for facial expression recognition in videos


Title	Frame attention networks for facial expression recognition in videos
Authors	Debin Meng, Xiaojiang Peng, Kai Wang, Yu Qiao
Abstract	Video-based facial expression recognition aims to classify a given video into several basic emotions. How to integrate facial features of individual frames is crucial for this task. In this paper, we propose the Frame Attention Networks (FAN), to automatically highlight some discriminative frames in an end-to-end framework. The network takes a video with a variable number of face images as its input and produces a fixed-dimension representation. The whole network is composed of two modules. The feature embedding module is a deep Convolutional Neural Network (CNN) which embeds face images into feature vectors. The frame attention module learns multiple attention weights which are used to adaptively aggregate the feature vectors to form a single discriminative video representation. We conduct extensive experiments on CK+ and AFEW8.0 datasets. Our proposed FAN shows superior performance compared to other CNN based methods and achieves state-of-the-art performance on CK+.
Tasks	Facial Expression Recognition
Published	2019-06-29
URL	https://arxiv.org/abs/1907.00193v2
PDF	https://arxiv.org/pdf/1907.00193v2.pdf
PWC	https://paperswithcode.com/paper/frame-attention-networks-for-facial
Repo	https://github.com/Open-Debin/Emotion-FAN
Framework	pytorch
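
The aggregation step in the abstract, turning a variable number of frame features into one fixed-dimension vector, can be sketched as an attention-weighted average. The code below is a simplified illustration: it assumes a single learned query vector and sigmoid attention weights, and `aggregate_frames` is a hypothetical name rather than a function from the linked repository.

```python
import math


def aggregate_frames(features, query):
    """Attention-weighted average of per-frame feature vectors.

    features: list of frame vectors (any number of frames, same length each)
    query:    assumed learned vector that scores each frame
    """
    weights = []
    for f in features:
        score = sum(a * b for a, b in zip(f, query))
        weights.append(1.0 / (1.0 + math.exp(-score)))  # sigmoid weight
    total = sum(weights)
    dim = len(query)
    return [
        sum(w * f[d] for w, f in zip(weights, features)) / total
        for d in range(dim)
    ]


query = [0.1, 0.2]
short_clip = [[1.0, 2.0], [0.5, 0.0]]
long_clip = [[1.0, 2.0], [0.5, 0.0], [0.2, 0.3], [0.9, 0.1], [0.0, 1.0]]
print(len(aggregate_frames(short_clip, query)), len(aggregate_frames(long_clip, query)))
```

Both clips map to a vector of the same length as a single frame feature, which is what lets a fixed classifier sit on top of videos of different lengths.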

M-VAD Names: a Dataset for Video Captioning with Naming


Title	M-VAD Names: a Dataset for Video Captioning with Naming
Authors	Stefano Pini, Marcella Cornia, Federico Bolelli, Lorenzo Baraldi, Rita Cucchiara
Abstract	Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic “someone” tag. The lack of movie description datasets with characters’ visual annotations surely plays a relevant role in this shortage. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures to replace the “someone” tags with proper character names in existing video captions. The evaluation is further extended by testing this application on videos outside of the M-VAD Names dataset.
Tasks	Video Captioning
Published	2019-03-04
URL	http://arxiv.org/abs/1903.01489v1
PDF	http://arxiv.org/pdf/1903.01489v1.pdf
PWC	https://paperswithcode.com/paper/m-vad-names-a-dataset-for-video-captioning
Repo	https://github.com/aimagelab/mvad-names-dataset
Framework	none

The Czech Court Decisions Corpus (CzCDC): Availability as the First Step


Title	The Czech Court Decisions Corpus (CzCDC): Availability as the First Step
Authors	Tereza Novotná, Jakub Harašta
Abstract	In this paper, we describe the Czech Court Decision Corpus (CzCDC). CzCDC is a dataset of 237,723 decisions published by the Czech apex (or top-tier) courts, namely the Supreme Court, the Supreme Administrative Court and the Constitutional Court. All the decisions were published between 1st January 1993 and 30th September 2018. Court decisions are available on the webpages of the respective courts or via commercial databases of legal information. This often leads researchers interested in these decisions to turn either to the respective court or to a commercial provider. This leads to delays and additional costs. These are further exacerbated by a lack of inter-court standard in terms of the data format in which courts provide their decisions. Additionally, courts’ databases often lack proper documentation. Our goal is to make the dataset of court decisions freely available online in a consistent (plain) format to lower the cost associated with obtaining data for future research. We believe that simplified access to court decisions through the CzCDC could benefit other researchers. In this paper, we describe the processing of decisions before their inclusion into CzCDC and basic statistics of the dataset. This dataset contains plain texts of court decisions and these texts are not annotated for any grammatical or syntactical features.
Tasks
Published	2019-10-21
URL	https://arxiv.org/abs/1910.09513v1
PDF	https://arxiv.org/pdf/1910.09513v1.pdf
PWC	https://paperswithcode.com/paper/the-czech-court-decisions-corpus-czcdc
Repo	https://github.com/czech-case-law-relevance/czech-court-citations-dataset
Framework	none

SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems


Title	SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
Authors	Beidi Chen, Tharun Medini, James Farwell, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava
Abstract	Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters to maintain enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, specialized hardware is expensive and hard to generalize to a multitude of tasks. The progress on the algorithmic front has failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine) that uniquely blends smart randomized algorithms, with multi-core parallelism and workload optimization. Using just a CPU, SLIDE drastically reduces the computations during both training and inference outperforming an optimized implementation of Tensorflow (TF) on the best available GPU. Our evaluations on industry-scale recommendation datasets, with large fully connected architectures, show that training with SLIDE on a 44 core CPU is more than 3.5 times (1 hour vs. 3.5 hours) faster than the same network trained using TF on Tesla V100 at any given accuracy level. On the same CPU hardware, SLIDE is over 10x faster than TF. We provide codes and scripts for reproducibility.
Tasks
Published	2019-03-07
URL	https://arxiv.org/abs/1903.03129v2
PDF	https://arxiv.org/pdf/1903.03129v2.pdf
PWC	https://paperswithcode.com/paper/slide-in-defense-of-smart-algorithms-over
Repo	https://github.com/keroro824/HashingDeepLearning
Framework	tf
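
The "smart randomized algorithms" in the abstract refer to hashing-based selection of which neurons to compute. The sketch below illustrates the general idea with signed random projections (SimHash) and a single hash table; this is one possible hash family chosen for illustration, and all names and sizes are assumptions rather than details of the SLIDE codebase.

```python
import random


def simhash(vec, planes):
    """Bit signature: which side of each random hyperplane vec falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_table(weights, planes):
    """Bucket neuron indices by the signature of their weight vectors."""
    table = {}
    for idx, w in enumerate(weights):
        table.setdefault(simhash(w, planes), []).append(idx)
    return table


def sparse_forward(x, weights, planes, table):
    """Compute activations only for neurons sharing the input's bucket."""
    active = table.get(simhash(x, planes), [])
    return {i: sum(w * v for w, v in zip(weights[i], x)) for i in active}


rng = random.Random(0)
dim, n_neurons, n_bits = 8, 50, 4
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_neurons)]
table = build_table(weights, planes)

x = list(weights[7])  # an input perfectly aligned with neuron 7
out = sparse_forward(x, weights, planes, table)
print(sorted(out))
```

Neurons whose weights point in a similar direction to the input are likely to share its bucket, so the layer evaluates a small candidate set instead of every neuron. A practical system would typically use several tables and rebuild them as weights change, which this sketch omits.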

AMR Parsing as Sequence-to-Graph Transduction


Title	AMR Parsing as Sequence-to-Graph Transduction
Authors	Sheng Zhang, Xutai Ma, Kevin Duh, Benjamin Van Durme
Abstract	We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers that rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free, and it can be effectively trained with limited amounts of labeled AMR data. Our experimental results outperform all previously reported SMATCH scores, on both AMR 2.0 (76.3% F1 on LDC2017T10) and AMR 1.0 (70.2% F1 on LDC2014T12).
Tasks	Amr Parsing, Semantic Parsing
Published	2019-05-21
URL	https://arxiv.org/abs/1905.08704v2
PDF	https://arxiv.org/pdf/1905.08704v2.pdf
PWC	https://paperswithcode.com/paper/amr-parsing-as-sequence-to-graph-transduction
Repo	https://github.com/sheng-z/stog
Framework	pytorch

GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification


Title	GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification
Authors	Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, Maosong Sun
Abstract	Fact verification (FV) is a challenging task which requires retrieving relevant evidence from plain text and using the evidence to verify given claims. Many claims require simultaneously integrating and reasoning over several pieces of evidence for verification. However, previous work employs simple models to extract information from evidence without letting evidence communicate with each other, e.g., merely concatenating the evidence for processing. Therefore, these methods are unable to grasp sufficient relational and logical information among the evidence. To alleviate this issue, we propose a graph-based evidence aggregating and reasoning (GEAR) framework which enables information to transfer on a fully-connected evidence graph and then utilizes different aggregators to collect multi-evidence information. We further employ BERT, an effective pre-trained language representation model, to improve the performance. Experimental results on a large-scale benchmark dataset FEVER have demonstrated that GEAR could leverage multi-evidence information for FV and thus achieves a promising result with a test FEVER score of 67.10%. Our code is available at https://github.com/thunlp/GEAR.
Tasks
Published	2019-07-22
URL	https://arxiv.org/abs/1908.01843v1
PDF	https://arxiv.org/pdf/1908.01843v1.pdf
PWC	https://paperswithcode.com/paper/gear-graph-based-evidence-aggregating-and-1
Repo	https://github.com/thunlp/KernelGAT
Framework	pytorch

Layerwise Relevance Visualization in Convolutional Text Graph Classifiers


Title	Layerwise Relevance Visualization in Convolutional Text Graph Classifiers
Authors	Robert Schwarzenberg, Marc Hübner, David Harbecke, Christoph Alt, Leonhard Hennig
Abstract	Representations in the hidden layers of Deep Neural Networks (DNN) are often hard to interpret since it is difficult to project them into an interpretable domain. Graph Convolutional Networks (GCN) allow this projection, but existing explainability methods do not exploit this fact, i.e. do not focus their explanations on intermediate states. In this work, we present a novel method that traces and visualizes features that contribute to a classification decision in the visible and hidden layers of a GCN. Our method exposes hidden cross-layer dynamics in the input graph structure. We experimentally demonstrate that it yields meaningful layerwise explanations for a GCN sentence classifier.
Tasks
Published	2019-09-24
URL	https://arxiv.org/abs/1909.10911v1
PDF	https://arxiv.org/pdf/1909.10911v1.pdf
PWC	https://paperswithcode.com/paper/layerwise-relevance-visualization-in
Repo	https://github.com/DFKI-NLP/lrv
Framework	pytorch

Neural network augmented wave-equation simulation


Title	Neural network augmented wave-equation simulation
Authors	Ali Siahkoohi, Mathias Louboutin, Felix J. Herrmann
Abstract	Accurate forward modeling is important for solving inverse problems. An inaccurate wave-equation simulation, as a forward operator, will offset the results obtained via inversion. In this work, we consider the case where we deal with incomplete physics. One proxy of incomplete physics is an inaccurate discretization of the Laplacian in finite-difference simulation of the wave equation. We exploit intrinsic one-to-one similarities between timestepping algorithms and Convolutional Neural Networks (CNNs), and propose to intersperse CNNs between low-fidelity timesteps. Augmenting neural networks with low-fidelity timestepping algorithms may allow us to take large timesteps while limiting the numerical dispersion artifacts. While simulating the wave equation with a low-fidelity timestepping algorithm, by correcting the wavefield several times during propagation, we hope to limit the numerical dispersion artifacts introduced by a poor discretization of the Laplacian. As a proof of concept, we demonstrate this principle by correcting for numerical dispersion by keeping the velocity model fixed, and varying the source locations to generate training and testing pairs for our supervised learning algorithm.
Tasks
Published	2019-09-27
URL	https://arxiv.org/abs/1910.00925v2
PDF	https://arxiv.org/pdf/1910.00925v2.pdf
PWC	https://paperswithcode.com/paper/neural-network-augmented-wave-equation
Repo	https://github.com/alisiahkoohi/NN-augmented-wave-sim
Framework	tf
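
The similarity between timestepping and convolution mentioned in the abstract can be seen in a one-dimensional toy case: the discrete Laplacian is a convolution with the stencil [1, -2, 1], and a leapfrog update combines it with the two previous wavefields. The sketch below is a minimal illustration under those assumptions (1D grid, second-order stencil, zero boundary update); it is not the authors' simulation code, and `courant` stands for the assumed ratio c·dt/dx.

```python
def laplacian_1d(u):
    """Convolve u with the stencil [1, -2, 1]; boundary entries stay 0."""
    out = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        out[i] = u[i - 1] - 2.0 * u[i] + u[i + 1]
    return out


def step(u_prev, u_curr, courant):
    """One leapfrog timestep of the 1D wave equation u_tt = c^2 u_xx."""
    lap = laplacian_1d(u_curr)
    return [
        2.0 * c - p + courant ** 2 * l
        for p, c, l in zip(u_prev, u_curr, lap)
    ]


# A unit spike at the centre of a five-point grid, initially at rest.
u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
u1 = step(u0, u0, courant=0.5)
print(u1)
```

Because each step is a fixed small convolution followed by a pointwise combination, a learned convolutional correction can be slotted in between steps, which is the structure the abstract proposes to exploit.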

wav2vec: Unsupervised Pre-training for Speech Recognition


Title	wav2vec: Unsupervised Pre-training for Speech Recognition
Authors	Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli
Abstract	We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.
Tasks	Speech Recognition
Published	2019-04-11
URL	https://arxiv.org/abs/1904.05862v4
PDF	https://arxiv.org/pdf/1904.05862v4.pdf
PWC	https://paperswithcode.com/paper/wav2vec-unsupervised-pre-training-for-speech
Repo	https://github.com/pytorch/fairseq/blob/master/fairseq/models/wav2vec.py
Framework	pytorch
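
The "noise contrastive binary classification task" in the abstract can be sketched as a loss that scores a true future representation against sampled distractors. The code below is a simplified illustration: it assumes plain dot-product scoring and omits details such as any per-step projection of the context vector or weighting of the negative term, and the function names are hypothetical rather than taken from fairseq.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def contrastive_loss(context, positive, negatives):
    """Binary classification loss: true future sample vs. distractors."""
    loss = -log_sigmoid(dot(context, positive))
    for neg in negatives:
        loss -= log_sigmoid(-dot(context, neg))
    return loss


context = [1.0, 0.0]
good = contrastive_loss(context, [2.0, 0.0], [[-2.0, 0.0]])
bad = contrastive_loss(context, [-2.0, 0.0], [[2.0, 0.0]])
print(good < bad)
```

The loss is small when the context vector agrees with the true sample and disagrees with the distractors, which is the signal that lets the representation be trained without transcriptions.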