Paper Group AWR 169
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN. Zero-Shot Detection. The Music Streaming Sessions Dataset. Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification. Network Uncertainty Informed Semantic Feature Selection for Visual SLAM. Soft Actor-Critic Algorithms and Applications. A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications. Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction. Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction. Learning Invariances for Policy Generalization. Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy. Gyroscope-Aided Motion Deblurring with Deep Networks. Zero-Shot Object Detection by Hybrid Region Embedding. Tangent Convolutions for Dense Prediction in 3D. IVD-Net: Intervertebral disc localization and segmentation in MRI with a multi-modal UNet.
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN
Title | Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN |
Authors | Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, Yanbo Gao |
Abstract | Recurrent neural networks (RNNs) have been widely used for processing sequential data. However, RNNs are commonly difficult to train due to the well-known gradient vanishing and exploding problems, and they struggle to learn long-term patterns. Long short-term memory (LSTM) and gated recurrent unit (GRU) were developed to address these problems, but the use of hyperbolic tangent and sigmoid activation functions results in gradient decay over layers. Consequently, constructing an efficiently trainable deep network is challenging. In addition, all the neurons in an RNN layer are entangled together and their behaviour is hard to interpret. To address these problems, a new type of RNN, referred to as independently recurrent neural network (IndRNN), is proposed in this paper, where neurons in the same layer are independent of each other and are connected across layers. We have shown that an IndRNN can be easily regulated to prevent the gradient exploding and vanishing problems while allowing the network to learn long-term dependencies. Moreover, an IndRNN can work with non-saturated activation functions such as ReLU (rectified linear unit) and still be trained robustly. Multiple IndRNNs can be stacked to construct a network that is deeper than existing RNNs. Experimental results have shown that the proposed IndRNN is able to process very long sequences (over 5000 time steps), can be used to construct very deep networks (21 layers used in the experiments) and can still be trained robustly. Better performance has been achieved on various tasks by using IndRNNs compared with the traditional RNN and LSTM. The code is available at https://github.com/Sunnydreamrain/IndRNN_Theano_Lasagne. |
Tasks | Language Modelling, Sequential Image Classification, Skeleton Based Action Recognition |
Published | 2018-03-13 |
URL | http://arxiv.org/abs/1803.04831v3 |
http://arxiv.org/pdf/1803.04831v3.pdf | |
PWC | https://paperswithcode.com/paper/independently-recurrent-neural-network-indrnn |
Repo | https://github.com/trevor-richardson/rnn_zoo |
Framework | pytorch |
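The key idea in the IndRNN abstract is that each neuron keeps an independent scalar recurrent weight, so the hidden-state update uses an element-wise product instead of a full recurrent weight matrix. Below is a minimal NumPy sketch of one IndRNN layer step with a ReLU activation; the dimensions and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def indrnn_step(x_t, h_prev, W, u, b):
    """One IndRNN time step: h_t = relu(W @ x_t + u * h_prev + b).

    Unlike a vanilla RNN, the recurrent weight `u` is a vector, so each
    neuron only sees its own previous state (element-wise product).
    """
    return np.maximum(0.0, W @ x_t + u * h_prev + b)

# Illustrative dimensions: 4 inputs, 8 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 4))   # input-to-hidden weights
u = rng.uniform(0.0, 1.0, size=8)        # per-neuron recurrent weights
b = np.zeros(8)

h = np.zeros(8)
for t in range(5000):                    # long sequences stay stable when |u| is bounded
    x_t = rng.normal(size=4)
    h = indrnn_step(x_t, h, W, u, b)
```

Keeping the magnitude of `u` within a range determined by the sequence length is what the paper's regulation of the recurrent weights amounts to; stacking several such layers, with full connections only between layers, gives the deep IndRNN.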
Zero-Shot Detection
Title | Zero-Shot Detection |
Authors | Pengkai Zhu, Hanxiao Wang, Venkatesh Saligrama |
Abstract | As we move towards large-scale object detection, it is unrealistic to expect annotated training data, in the form of bounding box annotations around objects, for all object classes at sufficient scale, and so methods capable of unseen object detection are required. We propose a novel zero-shot method based on training an end-to-end model that fuses semantic attribute prediction with visual features to propose object bounding boxes for seen and unseen classes. While we utilize semantic features during training, our method is agnostic to semantic information for unseen classes at test time. Our method retains the efficiency and effectiveness of YOLOv2 for objects seen during training, while improving its performance for novel and unseen objects. The ability of state-of-the-art detection methods to learn discriminative object features to reject background proposals also limits their performance for unseen objects. We posit that, to detect unseen objects, we must incorporate semantic information into the visual domain so that the learned visual features reflect this information and lead to improved recall rates for unseen objects. We test our method on the PASCAL VOC and MS COCO datasets and observe significant improvements in the average precision of unseen classes. |
Tasks | Object Detection |
Published | 2018-03-19 |
URL | http://arxiv.org/abs/1803.07113v2 |
http://arxiv.org/pdf/1803.07113v2.pdf | |
PWC | https://paperswithcode.com/paper/zero-shot-detection |
Repo | https://github.com/howBiGaStorm/ZeroShot-YOLO |
Framework | pytorch |
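The fusion described in this abstract can be illustrated with a small sketch: a detection head predicts a semantic-attribute vector for each candidate box, and class scores for both seen and unseen classes come from comparing that prediction against per-class semantic embeddings. This is only a schematic reading of the abstract, not the authors' YOLOv2-based implementation; all names and shapes below are hypothetical.

```python
import numpy as np

def zsd_scores(box_features, project, class_embeddings):
    """Score candidate boxes against class semantic embeddings.

    box_features:     (num_boxes, d_visual) visual features from the detector
    project:          (d_semantic, d_visual) learned visual-to-semantic projection
    class_embeddings: (num_classes, d_semantic) attribute/word vectors,
                      including classes never seen during detector training
    """
    semantic_pred = box_features @ project.T                      # (num_boxes, d_semantic)
    semantic_pred /= np.linalg.norm(semantic_pred, axis=1, keepdims=True) + 1e-8
    class_emb = class_embeddings / (np.linalg.norm(class_embeddings, axis=1, keepdims=True) + 1e-8)
    return semantic_pred @ class_emb.T                            # cosine-similarity scores
```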
The Music Streaming Sessions Dataset
Title | The Music Streaming Sessions Dataset |
Authors | Brian Brost, Rishabh Mehrotra, Tristan Jehan |
Abstract | At the core of many important machine learning problems faced by online streaming services is a need to model how users interact with the content. These problems can often be reduced to a combination of 1) sequentially recommending items to the user, and 2) exploiting the user’s interactions with the items as feedback for the machine learning model. Unfortunately, there are no public datasets currently available that enable researchers to explore this topic. In order to spur that research, we release the Music Streaming Sessions Dataset (MSSD), which consists of approximately 150 million listening sessions and associated user actions. Furthermore, we provide audio features and metadata for the approximately 3.7 million unique tracks referred to in the logs. This is the largest collection of such track metadata currently available to the public. This dataset enables research on important problems including how to model user listening and interaction behaviour in streaming, as well as Music Information Retrieval (MIR), and session-based sequential recommendations. |
Tasks | Information Retrieval, Music Information Retrieval |
Published | 2018-12-31 |
URL | http://arxiv.org/abs/1901.09851v1 |
http://arxiv.org/pdf/1901.09851v1.pdf | |
PWC | https://paperswithcode.com/paper/the-music-streaming-sessions-dataset |
Repo | https://github.com/rguo12/awesome-causality-data |
Framework | none |
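As a rough illustration of how session logs and track metadata of this kind are typically combined for session-based recommendation research, the sketch below joins per-event session rows with per-track audio features. The file names and column names are placeholders, not the dataset's actual schema; consult the MSSD documentation for the real field names.

```python
import pandas as pd

# Placeholder file and column names -- not the MSSD schema.
sessions = pd.read_csv("session_logs.csv")      # one row per playback event
tracks = pd.read_csv("track_features.csv")      # one row per unique track

# Attach audio features to every event, then group events back into sessions.
events = sessions.merge(tracks, on="track_id", how="left")
by_session = events.sort_values("position").groupby("session_id")

# Example: sequences of (audio-feature matrix, skip labels) for a sequential model.
sequences = [
    (grp.filter(like="acoustic_").to_numpy(), grp["skipped"].to_numpy())
    for _, grp in by_session
]
```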
Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification
Title | Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification |
Authors | Siyang Song, Shuimei Zhang, Björn Schuller, Linlin Shen, Michel Valstar |
Abstract | The performance of speaker-related systems usually degrades heavily in practical applications, largely due to the presence of background noise. To improve the robustness of such systems in unknown noisy environments, this paper proposes a simple pre-processing method called Noise Invariant Frame Selection (NIFS). Based on several noise-related constraints, it selects noise-invariant frames from utterances to represent speakers. Experiments conducted on the TIMIT database showed that NIFS can significantly improve the performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal Background Model (GMM-UBM) and i-vector-based speaker verification systems in different unknown noisy environments with different SNRs, in comparison to their baselines. Meanwhile, the proposed NIFS-based speaker verification systems achieve similar performance when we change the constraints (hyper-parameters) or features, which indicates that the method is robust and easy to reproduce. Since NIFS is designed as a general algorithm, it could be further applied to other similar tasks. |
Tasks | Quantization, Speaker Verification, Text-Independent Speaker Verification |
Published | 2018-05-03 |
URL | http://arxiv.org/abs/1805.01259v1 |
http://arxiv.org/pdf/1805.01259v1.pdf | |
PWC | https://paperswithcode.com/paper/noise-invariant-frame-selection-a-simple |
Repo | https://github.com/shuimove1234/Noise-Invariant-Frame-Selection |
Framework | none |
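One way to read the frame-selection idea in this abstract is: a frame is "noise invariant" if its acoustic features barely change when background noise is mixed in. The sketch below scores frames that way and keeps the most stable ones; it is an illustrative interpretation, not the specific constraints used in the paper.

```python
import numpy as np

def select_noise_invariant_frames(features_clean, features_noisy, keep_ratio=0.5):
    """Keep the frames whose features change least between clean and noise-mixed audio.

    features_clean, features_noisy: (num_frames, feature_dim), e.g. per-frame MFCCs
    """
    distortion = np.linalg.norm(features_clean - features_noisy, axis=1)
    num_keep = max(1, int(keep_ratio * len(distortion)))
    keep_idx = np.argsort(distortion)[:num_keep]      # most noise-invariant frames
    return np.sort(keep_idx)

# The selected frames would then feed the usual speaker representation
# (VQ codebook, GMM-UBM statistics, or i-vector extraction).
```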
Network Uncertainty Informed Semantic Feature Selection for Visual SLAM
Title | Network Uncertainty Informed Semantic Feature Selection for Visual SLAM |
Authors | Pranav Ganti, Steven L. Waslander |
Abstract | In order to facilitate long-term localization using a visual simultaneous localization and mapping (SLAM) algorithm, careful feature selection can help ensure that reference points persist over long durations and the runtime and storage complexity of the algorithm remain consistent. We present SIVO (Semantically Informed Visual Odometry and Mapping), a novel information-theoretic feature selection method for visual SLAM which incorporates semantic segmentation and neural network uncertainty into the feature selection pipeline. Our algorithm selects points which provide the highest reduction in Shannon entropy between the entropy of the current state and the joint entropy of the state, given the addition of the new feature with the classification entropy of the feature from a Bayesian neural network. Each selected feature significantly reduces the uncertainty of the vehicle state and has been detected to be a static object (building, traffic sign, etc.) repeatedly with a high confidence. This selection strategy generates a sparse map which can facilitate long-term localization. The KITTI odometry dataset is used to evaluate our method, and we also compare our results against ORB_SLAM2. Overall, SIVO performs comparably to the baseline method while reducing the map size by almost 70%. |
Tasks | Feature Selection, Semantic Segmentation, Simultaneous Localization and Mapping, Visual Odometry |
Published | 2018-11-29 |
URL | https://arxiv.org/abs/1811.11946v2 |
https://arxiv.org/pdf/1811.11946v2.pdf | |
PWC | https://paperswithcode.com/paper/visual-slam-with-network-uncertainty-informed |
Repo | https://github.com/navganti/SIVO |
Framework | none |
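The selection criterion described above combines two quantities: the drop in state entropy obtained by adding a feature (a Kalman-style information gain) and the semantic classification entropy of that feature from a Bayesian segmentation network. A hedged sketch of such a score, with invented variable names and a Gaussian-state assumption; the exact way SIVO combines the two terms follows the paper, not this code.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian with covariance `cov`."""
    n = cov.shape[0]
    return 0.5 * (n * np.log(2 * np.pi * np.e) + np.log(np.linalg.det(cov)))

def feature_score(state_cov, H_jacobian, meas_noise, class_probs):
    """Score a candidate feature: state-entropy reduction minus semantic uncertainty.

    state_cov:   current state covariance
    H_jacobian:  measurement Jacobian of the candidate feature
    meas_noise:  measurement noise covariance
    class_probs: per-class probabilities from a Bayesian segmentation network
    """
    # Posterior covariance after a hypothetical Kalman update with this feature.
    S = H_jacobian @ state_cov @ H_jacobian.T + meas_noise
    K = state_cov @ H_jacobian.T @ np.linalg.inv(S)
    post_cov = (np.eye(state_cov.shape[0]) - K @ H_jacobian) @ state_cov

    entropy_reduction = gaussian_entropy(state_cov) - gaussian_entropy(post_cov)
    class_entropy = -np.sum(class_probs * np.log(class_probs + 1e-12))
    return entropy_reduction - class_entropy   # one possible way to combine the two terms
```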
Soft Actor-Critic Algorithms and Applications
Title | Soft Actor-Critic Algorithms and Applications |
Authors | Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine |
Abstract | Soft actor-critic (SAC) is an off-policy actor-critic deep reinforcement learning algorithm based on the maximum entropy framework, in which the actor aims to maximize expected return while also maximizing entropy. This paper presents the algorithm, extends it with an automatic adjustment of the entropy temperature hyperparameter, and evaluates it on a range of continuous-control benchmarks as well as real-world robotic tasks, including quadrupedal locomotion and dexterous manipulation, demonstrating strong sample efficiency and robustness. |
Tasks | Decision Making |
Published | 2018-12-13 |
URL | http://arxiv.org/abs/1812.05905v2 |
http://arxiv.org/pdf/1812.05905v2.pdf | |
PWC | https://paperswithcode.com/paper/soft-actor-critic-algorithms-and-applications |
Repo | https://github.com/iclavera/cassie |
Framework | tf |
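To make the entry above more concrete: soft actor-critic trains a stochastic policy to maximize reward plus policy entropy, and this version also tunes the entropy temperature automatically. The sketch below shows the actor and temperature losses under standard SAC conventions; it is a schematic in PyTorch, not the authors' implementation, and the `policy`/`q1`/`q2` callables are assumed interfaces.

```python
import torch

def sac_actor_and_alpha_losses(policy, q1, q2, log_alpha, states, target_entropy):
    """Compute SAC policy and temperature losses for a batch of states.

    policy(states) is assumed to return (actions, log_probs) via the
    reparameterization trick; q1/q2 are the two critics.
    """
    actions, log_probs = policy(states)
    alpha = log_alpha.exp()

    # Actor: maximize E[Q(s, a) - alpha * log pi(a|s)]  (minimize the negative).
    q_min = torch.min(q1(states, actions), q2(states, actions))
    actor_loss = (alpha.detach() * log_probs - q_min).mean()

    # Temperature: drive the policy's entropy toward the target entropy.
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    return actor_loss, alpha_loss
```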
A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications
Title | A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications |
Authors | Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, Roy Schwartz |
Abstract | Peer reviewing is a central component in the scientific publishing process. We present the first public dataset of scientific peer reviews available for research purposes (PeerRead v1), providing an opportunity to study this important artifact. The dataset consists of 14.7K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR. The dataset also includes 10.7K textual peer reviews written by experts for a subset of the papers. We describe the data collection process and report interesting observed phenomena in the peer reviews. We also propose two novel NLP tasks based on this dataset and provide simple baseline models. In the first task, we show that simple models can predict whether a paper is accepted with up to 21% error reduction compared to the majority baseline. In the second task, we predict the numerical scores of review aspects and show that simple models can outperform the mean baseline for aspects with high variance such as ‘originality’ and ‘impact’. |
Tasks | |
Published | 2018-04-25 |
URL | http://arxiv.org/abs/1804.09635v1 |
http://arxiv.org/pdf/1804.09635v1.pdf | |
PWC | https://paperswithcode.com/paper/a-dataset-of-peer-reviews-peerread-collection |
Repo | https://github.com/allenai/PeerRead |
Framework | none |
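The first task in the abstract, predicting accept/reject from a paper draft, is evaluated with simple baselines. Below is a minimal sketch of one such baseline, a bag-of-words logistic regression over abstracts; it assumes the texts and labels have already been extracted from the dataset files and is not the paper's exact baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def acceptance_baseline(abstracts, accepted):
    """Cross-validated accuracy of a TF-IDF + logistic-regression accept/reject predictor.

    `abstracts` is a list of strings, `accepted` a parallel list of 0/1 labels,
    both assumed to be parsed from PeerRead beforehand.
    """
    model = make_pipeline(
        TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, abstracts, accepted, cv=5, scoring="accuracy")
    return scores.mean()
```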
Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction
Title | Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction |
Authors | Nelly Elsayed, Anthony S. Maida, Magdy Bayoumi |
Abstract | Spatiotemporal sequence prediction is an important problem in deep learning. We study next-frame(s) video prediction using a deep-learning-based predictive coding framework that uses convolutional, long short-term memory (convLSTM) modules. We introduce a novel reduced-gate convolutional LSTM (rgcLSTM) architecture that requires a significantly lower parameter budget than a comparable convLSTM. Our reduced-gate model achieves equal or better next-frame(s) prediction accuracy than the original convolutional LSTM while using a smaller parameter budget, thereby reducing training time. We tested our reduced-gate modules within a predictive coding architecture on the moving MNIST and KITTI datasets. We found that our reduced-gate model achieves a reduction of approximately 40 percent in the total number of training parameters and a 25 percent reduction in elapsed training time in comparison with the standard convolutional LSTM model. This makes our model more attractive for hardware implementation, especially on small devices. |
Tasks | Video Prediction |
Published | 2018-10-16 |
URL | http://arxiv.org/abs/1810.07251v9 |
http://arxiv.org/pdf/1810.07251v9.pdf | |
PWC | https://paperswithcode.com/paper/reduced-gate-convolutional-lstm-using |
Repo | https://github.com/NellyElsayed/rgcLSTM |
Framework | none |
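To make the parameter saving concrete, the sketch below shows a convLSTM-style cell in which a single convolutional gate stands in for the separate input, forget, and output gates. This is only one schematic reading of "reduced-gate"; the exact gate sharing in rgcLSTM follows the paper and its repository, not this code.

```python
import torch
import torch.nn as nn

class SingleGateConvLSTMCell(nn.Module):
    """ConvLSTM-style cell with one shared gate (illustrative, not the exact rgcLSTM)."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        self.gate = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)       # one gate
        self.candidate = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)  # cell update

    def forward(self, x, state):
        h, c = state
        z = torch.cat([x, h], dim=1)
        g = torch.sigmoid(self.gate(z))            # shared gate
        c_tilde = torch.tanh(self.candidate(z))
        c = g * c + (1.0 - g) * c_tilde            # forget and input share `g`
        h = g * torch.tanh(c)                      # `g` reused in place of an output gate
        return h, (h, c)
```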
Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction
Title | Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction |
Authors | Alaaeldin El-Nouby, Shikhar Sharma, Hannes Schulz, Devon Hjelm, Layla El Asri, Samira Ebrahimi Kahou, Yoshua Bengio, Graham W. Taylor |
Abstract | Conditional text-to-image generation is an active area of research, with many possible applications. Existing research has primarily focused on generating a single image from available conditioning information in one step. One practical extension beyond one-step generation is a system that generates an image iteratively, conditioned on ongoing linguistic input or feedback. This is significantly more challenging than one-step generation tasks, as such a system must understand the contents of its generated images with respect to the feedback history, the current feedback, as well as the interactions among concepts present in the feedback history. In this work, we present a recurrent image generation model which takes into account both the generated output up to the current step as well as all past instructions for generation. We show that our model is able to generate the background, add new objects, and apply simple transformations to existing objects. We believe our approach is an important step toward interactive generation. Code and data are available at: https://www.microsoft.com/en-us/research/project/generative-neural-visual-artist-geneva/ . |
Tasks | Image Generation, Text-to-Image Generation |
Published | 2018-11-24 |
URL | https://arxiv.org/abs/1811.09845v3 |
https://arxiv.org/pdf/1811.09845v3.pdf | |
PWC | https://paperswithcode.com/paper/keep-drawing-it-iterative-language-based |
Repo | https://github.com/Maluuba/GeNeVA_datasets |
Framework | none |
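The iterative setting in this abstract boils down to a loop that, at every turn, encodes the new instruction, folds it into a running dialogue state, and conditions the generator on both that state and the previously generated image. A structural sketch of that loop follows; the callables and their interfaces are placeholders, not the released GeNeVA code.

```python
def iterative_generation(instructions, text_encoder, dialogue_rnn, generator, init_image):
    """Generate an image step by step from a sequence of linguistic instructions."""
    image = init_image
    state = dialogue_rnn.initial_state()
    outputs = []
    for instruction in instructions:
        text_emb = text_encoder(instruction)      # encode the current feedback
        state = dialogue_rnn(text_emb, state)     # accumulate the instruction history
        image = generator(image, state)           # modify the previous canvas
        outputs.append(image)
    return outputs
```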
Learning Invariances for Policy Generalization
Title | Learning Invariances for Policy Generalization |
Authors | Remi Tachet des Combes, Philip Bachman, Harm van Seijen |
Abstract | While recent progress has spawned very powerful machine learning systems, those agents remain extremely specialized and fail to transfer the knowledge they gain to similar yet unseen tasks. In this paper, we study a simple reinforcement learning problem and focus on learning policies that encode the proper invariances for generalization to different settings. We evaluate three potential methods for policy generalization: data augmentation, meta-learning and adversarial training. We find our data augmentation method to be effective, and study the potential of meta-learning and adversarial learning as alternative task-agnostic approaches. Keywords: reinforcement learning, generalization, data augmentation, meta-learning, adversarial learning. |
Tasks | Data Augmentation, Meta-Learning |
Published | 2018-09-07 |
URL | http://arxiv.org/abs/1809.02591v1 |
http://arxiv.org/pdf/1809.02591v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-invariances-for-policy |
Repo | https://github.com/Maluuba/jumping-task |
Framework | none |
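The data-augmentation approach mentioned in this abstract amounts to training the policy on randomly perturbed versions of each observation so it cannot latch onto nuisance details of the training setting. A minimal sketch of observation augmentation for a batched policy update, with invented shapes and a random pixel shift as the assumed perturbation:

```python
import numpy as np

def augment_observation(obs, rng, max_shift=2):
    """Randomly translate a 2D observation by a few pixels (an illustrative invariance)."""
    shift = tuple(int(s) for s in rng.integers(-max_shift, max_shift + 1, size=2))
    return np.roll(obs, shift, axis=(0, 1))

def augmented_batch(observations, rng, copies=4):
    """Expand a batch with perturbed copies; the policy is then trained on all of them."""
    batch = list(observations)
    for obs in observations:
        batch.extend(augment_observation(obs, rng) for _ in range(copies))
    return np.stack(batch)
```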
Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy
Title | Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy |
Authors | Kin Wah Edward Lin, Balamurali B. T., Enyan Koh, Simon Lui, Dorien Herremans |
Abstract | Separating a singing voice from its music accompaniment remains an important challenge in the field of music information retrieval. We present a unique neural network approach inspired by a technique that has revolutionized the field of vision: pixel-wise image classification, which we combine with cross entropy loss and pretraining of the CNN as an autoencoder on singing voice spectrograms. The pixel-wise classification technique directly estimates the sound source label for each time-frequency (T-F) bin in our spectrogram image, thus eliminating common pre- and postprocessing tasks. The proposed network is trained by using the Ideal Binary Mask (IBM) as the target output label. The IBM identifies the dominant sound source in each T-F bin of the magnitude spectrogram of a mixture signal, by considering each T-F bin as a pixel with a multi-label (for each sound source). Cross entropy is used as the training objective, so as to minimize the average probability error between the target and predicted label for each pixel. By treating the singing voice separation problem as a pixel-wise classification task, we additionally eliminate one of the commonly used, yet not easy to comprehend, postprocessing steps: the Wiener filter postprocessing. The proposed CNN outperforms the first runner up in the Music Information Retrieval Evaluation eXchange (MIREX) 2016 and the winner of MIREX 2014 with a gain of 2.2702 ~ 5.9563 dB global normalized source to distortion ratio (GNSDR) when applied to the iKala dataset. An experiment with the DSD100 dataset on the full-tracks song evaluation task also shows that our model is able to compete with cutting-edge singing voice separation systems which use multi-channel modeling, data augmentation, and model blending. |
Tasks | Data Augmentation, Image Classification, Information Retrieval, Music Information Retrieval |
Published | 2018-12-04 |
URL | http://arxiv.org/abs/1812.01278v1 |
http://arxiv.org/pdf/1812.01278v1.pdf | |
PWC | https://paperswithcode.com/paper/singing-voice-separation-using-a-deep |
Repo | https://github.com/EdwardLin2014/CNN-with-IBM-for-Singing-Voice-Separation |
Framework | tf |
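The Ideal Binary Mask used as the training target above is straightforward to compute when the isolated vocal and accompaniment tracks are available: mark each time-frequency bin by which source dominates, then train the network with per-pixel cross entropy against that mask. A small sketch, assuming magnitude spectrograms have already been computed:

```python
import numpy as np

def ideal_binary_mask(vocal_mag, accomp_mag):
    """IBM: 1 where the vocal dominates a T-F bin, 0 where the accompaniment does."""
    return (vocal_mag > accomp_mag).astype(np.float32)

def pixelwise_cross_entropy(pred_probs, ibm, eps=1e-7):
    """Average binary cross entropy over all T-F bins (the per-pixel training objective)."""
    p = np.clip(pred_probs, eps, 1.0 - eps)
    return float(np.mean(-(ibm * np.log(p) + (1.0 - ibm) * np.log(1.0 - p))))

# At test time the predicted mask is applied directly to the mixture spectrogram,
# which is what removes the need for Wiener-filter postprocessing.
```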
Gyroscope-Aided Motion Deblurring with Deep Networks
Title | Gyroscope-Aided Motion Deblurring with Deep Networks |
Authors | Janne Mustaniemi, Juho Kannala, Simo Särkkä, Jiri Matas, Janne Heikkilä |
Abstract | We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN). With the help of such measurements, it can handle extremely strong and spatially-variant motion blur. At the same time, the image data is used to overcome the limitations of gyro-based blur estimation. To train our network, we also introduce a novel way of generating realistic training data using the gyroscope. The evaluation shows a clear improvement in visual quality over the state-of-the-art while achieving real-time performance. Furthermore, the method is shown to improve the performance of existing feature detectors and descriptors against the motion blur. |
Tasks | Deblurring |
Published | 2018-10-01 |
URL | http://arxiv.org/abs/1810.00986v2 |
http://arxiv.org/pdf/1810.00986v2.pdf | |
PWC | https://paperswithcode.com/paper/gyroscope-aided-motion-deblurring-with-deep |
Repo | https://github.com/jannemus/DeepGyro |
Framework | none |
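For purely rotational camera shake, gyroscope readings determine the blur directly: integrating the angular velocities over the exposure gives a rotation, and the induced image motion is the homography K R K^{-1}. The sketch below shows that standard gyro-based blur geometry, not the paper's full deblurring network.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_from_gyro(angular_velocities, timestamps):
    """Integrate gyro samples (rad/s, one 3-vector per timestamp) into a rotation matrix."""
    R = np.eye(3)
    for w, dt in zip(angular_velocities, np.diff(timestamps)):
        R = Rotation.from_rotvec(w * dt).as_matrix() @ R
    return R

def blur_homography(R, K):
    """Image-to-image mapping induced by a pure rotation R with camera intrinsics K."""
    return K @ R @ np.linalg.inv(K)
```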
Zero-Shot Object Detection by Hybrid Region Embedding
Title | Zero-Shot Object Detection by Hybrid Region Embedding |
Authors | Berkan Demirel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis |
Abstract | Object detection is considered one of the most challenging problems in computer vision, since it requires correct prediction of both the classes and the locations of objects in images. In this study, we define a more difficult scenario, namely zero-shot object detection (ZSD), where no visual training data is available for some of the target object classes. We present a novel approach to tackle this ZSD problem, where a convex combination of embeddings is used in conjunction with a detection framework. For evaluation of ZSD methods, we propose a simple dataset constructed from Fashion-MNIST images and also a custom zero-shot split for the Pascal VOC detection challenge. The experimental results suggest that our method yields promising results for ZSD. |
Tasks | Object Detection, Zero-Shot Object Detection |
Published | 2018-05-16 |
URL | http://arxiv.org/abs/1805.06157v2 |
http://arxiv.org/pdf/1805.06157v2.pdf | |
PWC | https://paperswithcode.com/paper/zero-shot-object-detection-by-hybrid-region |
Repo | https://github.com/berkandemirel/zero-shot-detection |
Framework | none |
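The "convex combination of embeddings" above can be illustrated in a few lines: the detector's seen-class probabilities weight the seen-class word embeddings, and the resulting vector is matched against unseen-class embeddings. This mirrors the general convex-combination idea rather than the exact architecture; all names and shapes are placeholders.

```python
import numpy as np

def zero_shot_class_scores(seen_probs, seen_embeddings, unseen_embeddings):
    """Score unseen classes for one region via a convex combination of seen-class embeddings.

    seen_probs:        (num_seen,) detector probabilities for the region (sum to 1)
    seen_embeddings:   (num_seen, d) word/attribute vectors of seen classes
    unseen_embeddings: (num_unseen, d) vectors of classes with no visual training data
    """
    region_embedding = seen_probs @ seen_embeddings                # convex combination
    region_embedding /= np.linalg.norm(region_embedding) + 1e-8
    unseen = unseen_embeddings / (np.linalg.norm(unseen_embeddings, axis=1, keepdims=True) + 1e-8)
    return unseen @ region_embedding                               # cosine similarity per unseen class
```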
Tangent Convolutions for Dense Prediction in 3D
Title | Tangent Convolutions for Dense Prediction in 3D |
Authors | Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, Qian-Yi Zhou |
Abstract | We present an approach to semantic scene analysis using deep convolutional networks. Our approach is based on tangent convolutions - a new construction for convolutional networks on 3D data. In contrast to volumetric approaches, our method operates directly on surface geometry. Crucially, the construction is applicable to unstructured point clouds and other noisy real-world data. We show that tangent convolutions can be evaluated efficiently on large-scale point clouds with millions of points. Using tangent convolutions, we design a deep fully-convolutional network for semantic segmentation of 3D point clouds, and apply it to challenging real-world datasets of indoor and outdoor 3D environments. Experimental results show that the presented approach outperforms other recent deep network constructions in detailed analysis of large 3D scenes. |
Tasks | Semantic Segmentation |
Published | 2018-07-06 |
URL | http://arxiv.org/abs/1807.02443v1 |
http://arxiv.org/pdf/1807.02443v1.pdf | |
PWC | https://paperswithcode.com/paper/tangent-convolutions-for-dense-prediction-in |
Repo | https://github.com/tatarchm/tangent_conv |
Framework | tf |
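The core construction above projects each point's local neighborhood onto its tangent plane so that a regular 2D convolution can be applied there. A sketch of that projection step, using a local PCA for the tangent-plane estimate; the rasterization into a tangent image and the network itself are omitted, and the function names are not from the released code.

```python
import numpy as np

def tangent_plane_coordinates(point, neighbors):
    """Project a point's neighbors onto its estimated tangent plane.

    Returns 2D coordinates in the tangent plane, ready to be rasterized into a
    small image on which an ordinary 2D convolution can operate.
    """
    centered = neighbors - neighbors.mean(axis=0)
    # The two largest principal directions of the local covariance span the tangent
    # plane; the remaining direction is the surface normal.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    tangent_basis = vt[:2]                        # (2, 3) principal directions
    return (neighbors - point) @ tangent_basis.T  # (num_neighbors, 2)
```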
IVD-Net: Intervertebral disc localization and segmentation in MRI with a multi-modal UNet
Title | IVD-Net: Intervertebral disc localization and segmentation in MRI with a multi-modal UNet |
Authors | Jose Dolz, Christian Desrosiers, Ismail Ben Ayed |
Abstract | Accurate localization and segmentation of intervertebral discs (IVDs) are crucial for spine disease diagnosis and assessment. Despite the technological advances in medical imaging, IVD localization and segmentation are still performed manually, which is time-consuming and prone to errors. If, in addition, multi-modal imaging is considered, the burden imposed on disease assessments increases substantially. In this paper, we propose an architecture for IVD localization and segmentation in multi-modal MRI which extends the well-known UNet. Compared to single images, multi-modal data brings complementary information, contributing to better data representation and discriminative power. Our contributions are three-fold. First, how to effectively integrate and fully leverage multi-modal data remains almost unexplored. In this work, each MRI modality is processed in a different path to better exploit its unique information. Second, inspired by HyperDenseNet, the network is densely connected both within each path and across different paths, granting the model the freedom to learn where and how the different modalities should be processed and combined. Third, we improve standard U-Net modules by extending inception modules with two dilated convolution blocks of different scale, which helps handle multi-scale context. We report experiments on the dataset of the public MICCAI 2018 Challenge on Automatic Intervertebral Disc Localization and Segmentation, with 13 multi-modal MRI images used for training and 3 for validation. We trained IVD-Net on an NVIDIA TITAN Xp GPU with 16 GB of RAM, using Adam as the optimizer and a learning rate of 10e-5 for 200 epochs. Training took about 5 hours, and segmentation of a whole volume took about 2-3 seconds on average. Several baselines, with different multi-modal fusion strategies, were used to demonstrate the effectiveness of the proposed architecture. |
Tasks | Medical Image Segmentation |
Published | 2018-11-19 |
URL | http://arxiv.org/abs/1811.08305v1 |
http://arxiv.org/pdf/1811.08305v1.pdf | |
PWC | https://paperswithcode.com/paper/ivd-net-intervertebral-disc-localization-and |
Repo | https://github.com/josedolz/IVD-Net |
Framework | pytorch |
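The third contribution above, inception-style modules extended with dilated convolutions, can be sketched as a small PyTorch block: parallel branches at different dilation rates whose outputs are concatenated, which is what provides the multi-scale context the abstract mentions. This is an illustration of the idea with hypothetical channel counts, not the released IVD-Net code.

```python
import torch
import torch.nn as nn

class DilatedInceptionBlock(nn.Module):
    """Parallel 3x3 convolutions at different dilation rates, concatenated (illustrative)."""

    def __init__(self, in_ch, branch_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: DilatedInceptionBlock(32, 16)(torch.randn(1, 32, 64, 64)).shape -> (1, 48, 64, 64)
```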