Paper Group AWR 169
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN. Zero-Shot Detection. The Music Streaming Sessions Dataset. Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification. Network Uncertainty Informed Semantic Feature Selection for Visual SLAM. Soft Actor-Critic Algorithms and Applications. A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications. Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction. Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction. Learning Invariances for Policy Generalization. Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy. Gyroscope-Aided Motion Deblurring with Deep Networks. Zero-Shot Object Detection by Hybrid Region Embedding. Tangent Convolutions for Dense Prediction in 3D. IVD-Net: Intervertebral disc localization and segmentation in MRI with a multi-modal UNet.
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN
Title | Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN |
Authors | Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, Yanbo Gao |
Abstract | Recurrent neural networks (RNNs) have been widely used for processing sequential data. However, RNNs are commonly difficult to train due to the well-known gradient vanishing and exploding problems, and they struggle to learn long-term patterns. Long short-term memory (LSTM) and gated recurrent unit (GRU) were developed to address these problems, but the use of hyperbolic tangent and sigmoid activation functions results in gradient decay over layers. Consequently, constructing an efficiently trainable deep network is challenging. In addition, all the neurons in an RNN layer are entangled together and their behaviour is hard to interpret. To address these problems, a new type of RNN, referred to as independently recurrent neural network (IndRNN), is proposed in this paper, where neurons in the same layer are independent of each other and are connected across layers. We have shown that an IndRNN can be easily regulated to prevent the gradient exploding and vanishing problems while allowing the network to learn long-term dependencies. Moreover, an IndRNN can work with non-saturated activation functions such as ReLU (rectified linear unit) and still be trained robustly. Multiple IndRNNs can be stacked to construct a network that is deeper than existing RNNs. Experimental results have shown that the proposed IndRNN is able to process very long sequences (over 5000 time steps), can be used to construct very deep networks (21 layers used in the experiments) and can still be trained robustly. Better performance has been achieved on various tasks by using IndRNNs compared with the traditional RNN and LSTM. The code is available at https://github.com/Sunnydreamrain/IndRNN_Theano_Lasagne. |
Tasks | Language Modelling, Sequential Image Classification, Skeleton Based Action Recognition |
Published | 2018-03-13 |
URL | http://arxiv.org/abs/1803.04831v3 |
http://arxiv.org/pdf/1803.04831v3.pdf | |
PWC | https://paperswithcode.com/paper/independently-recurrent-neural-network-indrnn |
Repo | https://github.com/trevor-richardson/rnn_zoo |
Framework | pytorch |
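The key idea in the IndRNN abstract is that each neuron keeps an independent scalar recurrent weight, so the hidden-state update uses an element-wise product instead of a full recurrent weight matrix. Below is a minimal NumPy sketch of one IndRNN layer step with a ReLU activation; the dimensions and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def indrnn_step(x_t, h_prev, W, u, b):
    """One IndRNN time step: h_t = relu(W @ x_t + u * h_prev + b).

    Unlike a vanilla RNN, the recurrent weight `u` is a vector, so each
    neuron only sees its own previous state (element-wise product).
    """
    return np.maximum(0.0, W @ x_t + u * h_prev + b)

# Illustrative dimensions: 4 inputs, 8 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 4))   # input-to-hidden weights
u = rng.uniform(0.0, 1.0, size=8)        # per-neuron recurrent weights
b = np.zeros(8)

h = np.zeros(8)
for t in range(5000):                    # long sequences stay stable when |u| is bounded
    x_t = rng.normal(size=4)
    h = indrnn_step(x_t, h, W, u, b)
```

Keeping the magnitude of `u` within a range determined by the sequence length is what the paper's regulation of the recurrent weights amounts to; stacking several such layers, with full connections only between layers, gives the deep IndRNN.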
Zero-Shot Detection
Title | Zero-Shot Detection |
Authors | Pengkai Zhu, Hanxiao Wang, Venkatesh Saligrama |
Abstract | As we move towards large-scale object detection, it is unrealistic to expect annotated training data, in the form of bounding box annotations around objects, for all object classes at sufficient scale, and so methods capable of unseen object detection are required. We propose a novel zero-shot method based on training an end-to-end model that fuses semantic attribute prediction with visual features to propose object bounding boxes for seen and unseen classes. While we utilize semantic features during training, our method is agnostic to semantic information for unseen classes at test time. Our method retains the efficiency and effectiveness of YOLOv2 for objects seen during training, while improving its performance for novel and unseen objects. The ability of state-of-the-art detection methods to learn discriminative object features to reject background proposals also limits their performance for unseen objects. We posit that, to detect unseen objects, we must incorporate semantic information into the visual domain so that the learned visual features reflect this information and lead to improved recall rates for unseen objects. We test our method on the PASCAL VOC and MS COCO datasets and observe significant improvements in the average precision of unseen classes. |
Tasks | Object Detection |
Published | 2018-03-19 |
URL | http://arxiv.org/abs/1803.07113v2 |
http://arxiv.org/pdf/1803.07113v2.pdf | |
PWC | https://paperswithcode.com/paper/zero-shot-detection |
Repo | https://github.com/howBiGaStorm/ZeroShot-YOLO |
Framework | pytorch |
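The fusion described in this abstract can be illustrated with a small sketch: a detection head predicts a semantic-attribute vector for each candidate box, and class scores for both seen and unseen classes come from comparing that prediction against per-class semantic embeddings. This is only a schematic reading of the abstract, not the authors' YOLOv2-based implementation; all names and shapes below are hypothetical.

```python
import numpy as np

def zsd_scores(box_features, project, class_embeddings):
    """Score candidate boxes against class semantic embeddings.

    box_features:     (num_boxes, d_visual) visual features from the detector
    project:          (d_semantic, d_visual) learned visual-to-semantic projection
    class_embeddings: (num_classes, d_semantic) attribute/word vectors,
                      including classes never seen during detector training
    """
    semantic_pred = box_features @ project.T                      # (num_boxes, d_semantic)
    semantic_pred /= np.linalg.norm(semantic_pred, axis=1, keepdims=True) + 1e-8
    class_emb = class_embeddings / (np.linalg.norm(class_embeddings, axis=1, keepdims=True) + 1e-8)
    return semantic_pred @ class_emb.T                            # cosine-similarity scores
```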
The Music Streaming Sessions Dataset
Title | The Music Streaming Sessions Dataset |
Authors | Brian Brost, Rishabh Mehrotra, Tristan Jehan |
Abstract | At the core of many important machine learning problems faced by online streaming services is a need to model how users interact with the content. These problems can often be reduced to a combination of 1) sequentially recommending items to the user, and 2) exploiting the user’s interactions with the items as feedback for the machine learning model. Unfortunately, there are no public datasets currently available that enable researchers to explore this topic. In order to spur that research, we release the Music Streaming Sessions Dataset (MSSD), which consists of approximately 150 million listening sessions and associated user actions. Furthermore, we provide audio features and metadata for the approximately 3.7 million unique tracks referred to in the logs. This is the largest collection of such track metadata currently available to the public. This dataset enables research on important problems including how to model user listening and interaction behaviour in streaming, as well as Music Information Retrieval (MIR), and session-based sequential recommendations. |
Tasks | Information Retrieval, Music Information Retrieval |
Published | 2018-12-31 |
URL | http://arxiv.org/abs/1901.09851v1 |
http://arxiv.org/pdf/1901.09851v1.pdf | |
PWC | https://paperswithcode.com/paper/the-music-streaming-sessions-dataset |
Repo | https://github.com/rguo12/awesome-causality-data |
Framework | none |
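As a rough illustration of how session logs and track metadata of this kind are typically combined for session-based recommendation research, the sketch below joins per-event session rows with per-track audio features. The file names and column names are placeholders, not the dataset's actual schema; consult the MSSD documentation for the real field names.

```python
import pandas as pd

# Placeholder file and column names -- not the MSSD schema.
sessions = pd.read_csv("session_logs.csv")      # one row per playback event
tracks = pd.read_csv("track_features.csv")      # one row per unique track

# Attach audio features to every event, then group events back into sessions.
events = sessions.merge(tracks, on="track_id", how="left")
by_session = events.sort_values("position").groupby("session_id")

# Example: sequences of (audio-feature matrix, skip labels) for a sequential model.
sequences = [
    (grp.filter(like="acoustic_").to_numpy(), grp["skipped"].to_numpy())
    for _, grp in by_session
]
```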
Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification
Title | Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification |
Authors | Siyang Song, Shuimei Zhang, Björn Schuller, Linlin Shen, Michel Valstar |
Abstract | The performance of speaker-related systems usually degrades heavily in practical applications, largely due to the presence of background noise. To improve the robustness of such systems in unknown noisy environments, this paper proposes a simple pre-processing method called Noise Invariant Frame Selection (NIFS). Based on several noise-related constraints, it selects noise-invariant frames from utterances to represent speakers. Experiments conducted on the TIMIT database showed that NIFS can significantly improve the performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal Background Model (GMM-UBM) and i-vector-based speaker verification systems in different unknown noisy environments with different SNRs, in comparison to their baselines. Meanwhile, the proposed NIFS-based speaker verification systems achieve similar performance when we change the constraints (hyper-parameters) or features, which indicates that the method is robust and easy to reproduce. Since NIFS is designed as a general algorithm, it could be further applied to other similar tasks. |
Tasks | Quantization, Speaker Verification, Text-Independent Speaker Verification |
Published | 2018-05-03 |
URL | http://arxiv.org/abs/1805.01259v1 |
http://arxiv.org/pdf/1805.01259v1.pdf | |
PWC | https://paperswithcode.com/paper/noise-invariant-frame-selection-a-simple |
Repo | https://github.com/shuimove1234/Noise-Invariant-Frame-Selection |
Framework | none |
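One way to read the frame-selection idea in this abstract is: a frame is "noise invariant" if its acoustic features barely change when background noise is mixed in. The sketch below scores frames that way and keeps the most stable ones; it is an illustrative interpretation, not the specific constraints used in the paper.

```python
import numpy as np

def select_noise_invariant_frames(features_clean, features_noisy, keep_ratio=0.5):
    """Keep the frames whose features change least between clean and noise-mixed audio.

    features_clean, features_noisy: (num_frames, feature_dim), e.g. per-frame MFCCs
    """
    distortion = np.linalg.norm(features_clean - features_noisy, axis=1)
    num_keep = max(1, int(keep_ratio * len(distortion)))
    keep_idx = np.argsort(distortion)[:num_keep]      # most noise-invariant frames
    return np.sort(keep_idx)

# The selected frames would then feed the usual speaker representation
# (VQ codebook, GMM-UBM statistics, or i-vector extraction).
```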
Network Uncertainty Informed Semantic Feature Selection for Visual SLAM
Title | Network Uncertainty Informed Semantic Feature Selection for Visual SLAM |
Authors | Pranav Ganti, Steven L. Waslander |
Abstract | In order to facilitate long-term localization using a visual simultaneous localization and mapping (SLAM) algorithm, careful feature selection can help ensure that reference points persist over long durations and the runtime and storage complexity of the algorithm remain consistent. We present SIVO (Semantically Informed Visual Odometry and Mapping), a novel information-theoretic feature selection method for visual SLAM which incorporates semantic segmentation and neural network uncertainty into the feature selection pipeline. Our algorithm selects points which provide the highest reduction in Shannon entropy between the entropy of the current state and the joint entropy of the state, given the addition of the new feature with the classification entropy of the feature from a Bayesian neural network. Each selected feature significantly reduces the uncertainty of the vehicle state and has been detected to be a static object (building, traffic sign, etc.) repeatedly with a high confidence. This selection strategy generates a sparse map which can facilitate long-term localization. The KITTI odometry dataset is used to evaluate our method, and we also compare our results against ORB_SLAM2. Overall, SIVO performs comparably to the baseline method while reducing the map size by almost 70%. |
Tasks | Feature Selection, Semantic Segmentation, Simultaneous Localization and Mapping, Visual Odometry |
Published | 2018-11-29 |
URL | https://arxiv.org/abs/1811.11946v2 |
https://arxiv.org/pdf/1811.11946v2.pdf | |
PWC | https://paperswithcode.com/paper/visual-slam-with-network-uncertainty-informed |
Repo | https://github.com/navganti/SIVO |
Framework | none |
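The selection criterion described above combines two quantities: the drop in state entropy obtained by adding a feature (a Kalman-style information gain) and the semantic classification entropy of that feature from a Bayesian segmentation network. A hedged sketch of such a score, with invented variable names and a Gaussian-state assumption; the exact way SIVO combines the two terms follows the paper, not this code.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian with covariance `cov`."""
    n = cov.shape[0]
    return 0.5 * (n * np.log(2 * np.pi * np.e) + np.log(np.linalg.det(cov)))

def feature_score(state_cov, H_jacobian, meas_noise, class_probs):
    """Score a candidate feature: state-entropy reduction minus semantic uncertainty.

    state_cov:   current state covariance
    H_jacobian:  measurement Jacobian of the candidate feature
    meas_noise:  measurement noise covariance
    class_probs: per-class probabilities from a Bayesian segmentation network
    """
    # Posterior covariance after a hypothetical Kalman update with this feature.
    S = H_jacobian @ state_cov @ H_jacobian.T + meas_noise
    K = state_cov @ H_jacobian.T @ np.linalg.inv(S)
    post_cov = (np.eye(state_cov.shape[0]) - K @ H_jacobian) @ state_cov

    entropy_reduction = gaussian_entropy(state_cov) - gaussian_entropy(post_cov)
    class_entropy = -np.sum(class_probs * np.log(class_probs + 1e-12))
    return entropy_reduction - class_entropy   # one possible way to combine the two terms
```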
Soft Actor-Critic Algorithms and Applications
Title | Soft Actor-Critic Algorithms and Applications |
Authors | Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine |
Abstract | Soft actor-critic (SAC) is an off-policy actor-critic deep reinforcement learning algorithm based on the maximum entropy framework, in which the actor aims to maximize expected return while also maximizing entropy. This paper presents the algorithm, extends it with an automatic adjustment of the entropy temperature hyperparameter, and evaluates it on a range of continuous-control benchmarks as well as real-world robotic tasks, including quadrupedal locomotion and dexterous manipulation, demonstrating strong sample efficiency and robustness. |
Tasks | Decision Making |
Published | 2018-12-13 |
URL | http://arxiv.org/abs/1812.05905v2 |
http://arxiv.org/pdf/1812.05905v2.pdf | |
PWC | https://paperswithcode.com/paper/soft-actor-critic-algorithms-and-applications |
Repo | https://github.com/iclavera/cassie |
Framework | tf |
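To make the entry above more concrete: soft actor-critic trains a stochastic policy to maximize reward plus policy entropy, and this version also tunes the entropy temperature automatically. The sketch below shows the actor and temperature losses under standard SAC conventions; it is a schematic in PyTorch, not the authors' implementation, and the `policy`/`q1`/`q2` callables are assumed interfaces.

```python
import torch

def sac_actor_and_alpha_losses(policy, q1, q2, log_alpha, states, target_entropy):
    """Compute SAC policy and temperature losses for a batch of states.

    policy(states) is assumed to return (actions, log_probs) via the
    reparameterization trick; q1/q2 are the two critics.
    """
    actions, log_probs = policy(states)
    alpha = log_alpha.exp()

    # Actor: maximize E[Q(s, a) - alpha * log pi(a|s)]  (minimize the negative).
    q_min = torch.min(q1(states, actions), q2(states, actions))
    actor_loss = (alpha.detach() * log_probs - q_min).mean()

    # Temperature: drive the policy's entropy toward the target entropy.
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    return actor_loss, alpha_loss
```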
A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications
Title | A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications |
Authors | Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, Roy Schwartz |
Abstract | Peer reviewing is a central component in the scientific publishing process. We present the first public dataset of scientific peer reviews available for research purposes (PeerRead v1), providing an opportunity to study this important artifact. The dataset consists of 14.7K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR. The dataset also includes 10.7K textual peer reviews written by experts for a subset of the papers. We describe the data collection process and report interesting observed phenomena in the peer reviews. We also propose two novel NLP tasks based on this dataset and provide simple baseline models. In the first task, we show that simple models can predict whether a paper is accepted with up to 21% error reduction compared to the majority baseline. In the second task, we predict the numerical scores of review aspects and show that simple models can outperform the mean baseline for aspects with high variance such as ‘originality’ and ‘impact’. |
Tasks | |
Published | 2018-04-25 |
URL | http://arxiv.org/abs/1804.09635v1 |
http://arxiv.org/pdf/1804.09635v1.pdf | |
PWC | https://paperswithcode.com/paper/a-dataset-of-peer-reviews-peerread-collection |
Repo | https://github.com/allenai/PeerRead |
Framework | none |
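The first task in the abstract, predicting accept/reject from a paper draft, is evaluated with simple baselines. Below is a minimal sketch of one such baseline, a bag-of-words logistic regression over abstracts; it assumes the texts and labels have already been extracted from the dataset files and is not the paper's exact baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def acceptance_baseline(abstracts, accepted):
    """Cross-validated accuracy of a TF-IDF + logistic-regression accept/reject predictor.

    `abstracts` is a list of strings, `accepted` a parallel list of 0/1 labels,
    both assumed to be parsed from PeerRead beforehand.
    """
    model = make_pipeline(
        TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, abstracts, accepted, cv=5, scoring="accuracy")
    return scores.mean()
```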
Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction
Title | Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction |
Authors | Nelly Elsayed, Anthony S. Maida, Magdy Bayoumi |
Abstract | Spatiotemporal sequence prediction is an important problem in deep learning. We study next-frame(s) video prediction using a deep-learning-based predictive coding framework that uses convolutional, long short-term memory (convLSTM) modules. We introduce a novel reduced-gate convolutional LSTM (rgcLSTM) architecture that requires a significantly lower parameter budget than a comparable convLSTM. Our reduced-gate model achieves equal or better next-frame(s) prediction accuracy than the original convolutional LSTM while using a smaller parameter budget, thereby reducing training time. We tested our reduced-gate modules within a predictive coding architecture on the moving MNIST and KITTI datasets. We found that our reduced-gate model achieves a reduction of approximately 40 percent in the total number of training parameters and a 25 percent reduction in elapsed training time in comparison with the standard convolutional LSTM model. This makes our model more attractive for hardware implementation, especially on small devices. |
Tasks | Video Prediction |
Published | 2018-10-16 |
URL | http://arxiv.org/abs/1810.07251v9 |
http://arxiv.org/pdf/1810.07251v9.pdf | |
PWC | https://paperswithcode.com/paper/reduced-gate-convolutional-lstm-using |
Repo | https://github.com/NellyElsayed/rgcLSTM |
Framework | none |
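To make the parameter saving concrete, the sketch below shows a convLSTM-style cell in which a single convolutional gate stands in for the separate input, forget, and output gates. This is only one schematic reading of "reduced-gate"; the exact gate sharing in rgcLSTM follows the paper and its repository, not this code.

```python
import torch
import torch.nn as nn

class SingleGateConvLSTMCell(nn.Module):
    """ConvLSTM-style cell with one shared gate (illustrative, not the exact rgcLSTM)."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        self.gate = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)       # one gate
        self.candidate = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)  # cell update

    def forward(self, x, state):
        h, c = state
        z = torch.cat([x, h], dim=1)
        g = torch.sigmoid(self.gate(z))            # shared gate
        c_tilde = torch.tanh(self.candidate(z))
        c = g * c + (1.0 - g) * c_tilde            # forget and input share `g`
        h = g * torch.tanh(c)                      # `g` reused in place of an output gate
        return h, (h, c)
```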
Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction
Title | Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction |
Authors | Alaaeldin El-Nouby, Shikhar Sharma, Hannes Schulz, Devon Hjelm, Layla El Asri, Samira Ebrahimi Kahou, Yoshua Bengio, Graham W. Taylor |
Abstract | Conditional text-to-image generation is an active area of research, with many possible applications. Existing research has primarily focused on generating a single image from available conditioning information in one step. One practical extension beyond one-step generation is a system that generates an image iteratively, conditioned on ongoing linguistic input or feedback. This is significantly more challenging than one-step generation tasks, as such a system must understand the contents of its generated images with respect to the feedback history, the current feedback, as well as the interactions among concepts present in the feedback history. In this work, we present a recurrent image generation model which takes into account both the generated output up to the current step as well as all past instructions for generation. We show that our model is able to generate the background, add new objects, and apply simple transformations to existing objects. We believe our approach is an important step toward interactive generation. Code and data are available at: https://www.microsoft.com/en-us/research/project/generative-neural-visual-artist-geneva/ . |
Tasks | Image Generation, Text-to-Image Generation |
Published | 2018-11-24 |
URL | https://arxiv.org/abs/1811.09845v3 |
https://arxiv.org/pdf/1811.09845v3.pdf | |
PWC | https://paperswithcode.com/paper/keep-drawing-it-iterative-language-based |
Repo | https://github.com/Maluuba/GeNeVA_datasets |
Framework | none |
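The iterative setting in this abstract boils down to a loop that, at every turn, encodes the new instruction, folds it into a running dialogue state, and conditions the generator on both that state and the previously generated image. A structural sketch of that loop follows; the callables and their interfaces are placeholders, not the released GeNeVA code.

```python
def iterative_generation(instructions, text_encoder, dialogue_rnn, generator, init_image):
    """Generate an image step by step from a sequence of linguistic instructions."""
    image = init_image
    state = dialogue_rnn.initial_state()
    outputs = []
    for instruction in instructions:
        text_emb = text_encoder(instruction)      # encode the current feedback
        state = dialogue_rnn(text_emb, state)     # accumulate the instruction history
        image = generator(image, state)           # modify the previous canvas
        outputs.append(image)
    return outputs
```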
Learning Invariances for Policy Generalization
Title | Learning Invariances for Policy Generalization |
Authors | Remi Tachet des Combes, Philip Bachman, Harm van Seijen |
Abstract | While recent progress has spawned very powerful machine learning systems, those agents remain extremely specialized and fail to transfer the knowledge they gain to similar yet unseen tasks. In this paper, we study a simple reinforcement learning problem and focus on learning policies that encode the proper invariances for generalization to different settings. We evaluate three potential methods for policy generalization: data augmentation, meta-learning and adversarial training. We find our data augmentation method to be effective, and study the potential of meta-learning and adversarial learning as alternative task-agnostic approaches. Keywords: reinforcement learning, generalization, data augmentation, meta-learning, adversarial learning. |
Tasks | Data Augmentation, Meta-Learning |
Published | 2018-09-07 |
URL | http://arxiv.org/abs/1809.02591v1 |
http://arxiv.org/pdf/1809.02591v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-invariances-for-policy |
Repo | https://github.com/Maluuba/jumping-task |
Framework | none |
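The data-augmentation approach mentioned in this abstract amounts to training the policy on randomly perturbed versions of each observation so it cannot latch onto nuisance details of the training setting. A minimal sketch of observation augmentation for a batched policy update, with invented shapes and a random pixel shift as the assumed perturbation:

```python
import numpy as np

def augment_observation(obs, rng, max_shift=2):
    """Randomly translate a 2D observation by a few pixels (an illustrative invariance)."""
    shift = tuple(int(s) for s in rng.integers(-max_shift, max_shift + 1, size=2))
    return np.roll(obs, shift, axis=(0, 1))

def augmented_batch(observations, rng, copies=4):
    """Expand a batch with perturbed copies; the policy is then trained on all of them."""
    batch = list(observations)
    for obs in observations:
        batch.extend(augment_observation(obs, rng) for _ in range(copies))
    return np.stack(batch)
```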
Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy
Title | Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy |
Authors | Kin Wah Edward Lin, Balamurali B. T., Enyan Koh, Simon Lui, Dorien Herremans |
Abstract | Separating a singing voice from its music accompaniment remains an important challenge in the field of music information retrieval. We present a unique neural network approach inspired by a technique that has revolutionized the field of vision: pixel-wise image classification, which we combine with cross entropy loss and pretraining of the CNN as an autoencoder on singing voice spectrograms. The pixel-wise classification technique directly estimates the sound source label for each time-frequency (T-F) bin in our spectrogram image, thus eliminating common pre- and postprocessing tasks. The proposed network is trained by using the Ideal Binary Mask (IBM) as the target output label. The IBM identifies the dominant sound source in each T-F bin of the magnitude spectrogram of a mixture signal, by considering each T-F bin as a pixel with a multi-label (for each sound source). Cross entropy is used as the training objective, so as to minimize the average probability error between the target and predicted label for each pixel. By treating the singing voice separation problem as a pixel-wise classification task, we additionally eliminate one of the commonly used, yet not easy to comprehend, postprocessing steps: the Wiener filter postprocessing. The proposed CNN outperforms the first runner up in the Music Information Retrieval Evaluation eXchange (MIREX) 2016 and the winner of MIREX 2014 with a gain of 2.2702 ~ 5.9563 dB global normalized source to distortion ratio (GNSDR) when applied to the iKala dataset. An experiment with the DSD100 dataset on the full-tracks song evaluation task also shows that our model is able to compete with cutting-edge singing voice separation systems which use multi-channel modeling, data augmentation, and model blending. |
Tasks | Data Augmentation, Image Classification, Information Retrieval, Music Information Retrieval |
Published | 2018-12-04 |
URL | http://arxiv.org/abs/1812.01278v1 |
http://arxiv.org/pdf/1812.01278v1.pdf | |
PWC | https://paperswithcode.com/paper/singing-voice-separation-using-a-deep |
Repo | https://github.com/EdwardLin2014/CNN-with-IBM-for-Singing-Voice-Separation |
Framework | tf |
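The Ideal Binary Mask used as the training target above is straightforward to compute when the isolated vocal and accompaniment tracks are available: mark each time-frequency bin by which source dominates, then train the network with per-pixel cross entropy against that mask. A small sketch, assuming magnitude spectrograms have already been computed:

```python
import numpy as np

def ideal_binary_mask(vocal_mag, accomp_mag):
    """IBM: 1 where the vocal dominates a T-F bin, 0 where the accompaniment does."""
    return (vocal_mag > accomp_mag).astype(np.float32)

def pixelwise_cross_entropy(pred_probs, ibm, eps=1e-7):
    """Average binary cross entropy over all T-F bins (the per-pixel training objective)."""
    p = np.clip(pred_probs, eps, 1.0 - eps)
    return float(np.mean(-(ibm * np.log(p) + (1.0 - ibm) * np.log(1.0 - p))))

# At test time the predicted mask is applied directly to the mixture spectrogram,
# which is what removes the need for Wiener-filter postprocessing.
```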
Gyroscope-Aided Motion Deblurring with Deep Networks
Title | Gyroscope-Aided Motion Deblurring with Deep Networks |
Authors | Janne Mustaniemi, Juho Kannala, Simo Särkkä, Jiri Matas, Janne Heikkilä |
Abstract | We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN). With the help of such measurements, it can handle extremely strong and spatially-variant motion blur. At the same time, the image data is used to overcome the limitations of gyro-based blur estimation. To train our network, we also introduce a novel way of generating realistic training data using the gyroscope. The evaluation shows a clear improvement in visual quality over the state-of-the-art while achieving real-time performance. Furthermore, the method is shown to improve the performance of existing feature detectors and descriptors against the motion blur. |
Tasks | Deblurring |
Published | 2018-10-01 |
URL | http://arxiv.org/abs/1810.00986v2 |
http://arxiv.org/pdf/1810.00986v2.pdf | |
PWC | https://paperswithcode.com/paper/gyroscope-aided-motion-deblurring-with-deep |
Repo | https://github.com/jannemus/DeepGyro |
Framework | none |
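For purely rotational camera shake, gyroscope readings determine the blur directly: integrating the angular velocities over the exposure gives a rotation, and the induced image motion is the homography K R K^{-1}. The sketch below shows that standard gyro-based blur geometry, not the paper's full deblurring network.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_from_gyro(angular_velocities, timestamps):
    """Integrate gyro samples (rad/s, one 3-vector per timestamp) into a rotation matrix."""
    R = np.eye(3)
    for w, dt in zip(angular_velocities, np.diff(timestamps)):
        R = Rotation.from_rotvec(w * dt).as_matrix() @ R
    return R

def blur_homography(R, K):
    """Image-to-image mapping induced by a pure rotation R with camera intrinsics K."""
    return K @ R @ np.linalg.inv(K)
```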
Zero-Shot Object Detection by Hybrid Region Embedding
Title | Zero-Shot Object Detection by Hybrid Region Embedding |
Authors | Berkan Demirel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis |
Abstract | Object detection is considered one of the most challenging problems in computer vision, since it requires correct prediction of both the classes and the locations of objects in images. In this study, we define a more difficult scenario, namely zero-shot object detection (ZSD), where no visual training data is available for some of the target object classes. We present a novel approach to tackle this ZSD problem, where a convex combination of embeddings is used in conjunction with a detection framework. For evaluation of ZSD methods, we propose a simple dataset constructed from Fashion-MNIST images and also a custom zero-shot split for the Pascal VOC detection challenge. The experimental results suggest that our method yields promising results for ZSD. |
Tasks | Object Detection, Zero-Shot Object Detection |
Published | 2018-05-16 |
URL | http://arxiv.org/abs/1805.06157v2 |
http://arxiv.org/pdf/1805.06157v2.pdf | |
PWC | https://paperswithcode.com/paper/zero-shot-object-detection-by-hybrid-region |
Repo | https://github.com/berkandemirel/zero-shot-detection |
Framework | none |
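The "convex combination of embeddings" above can be illustrated in a few lines: the detector's seen-class probabilities weight the seen-class word embeddings, and the resulting vector is matched against unseen-class embeddings. This mirrors the general convex-combination idea rather than the exact architecture; all names and shapes are placeholders.

```python
import numpy as np

def zero_shot_class_scores(seen_probs, seen_embeddings, unseen_embeddings):
    """Score unseen classes for one region via a convex combination of seen-class embeddings.

    seen_probs:        (num_seen,) detector probabilities for the region (sum to 1)
    seen_embeddings:   (num_seen, d) word/attribute vectors of seen classes
    unseen_embeddings: (num_unseen, d) vectors of classes with no visual training data
    """
    region_embedding = seen_probs @ seen_embeddings                # convex combination
    region_embedding /= np.linalg.norm(region_embedding) + 1e-8
    unseen = unseen_embeddings / (np.linalg.norm(unseen_embeddings, axis=1, keepdims=True) + 1e-8)
    return unseen @ region_embedding                               # cosine similarity per unseen class
```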
Tangent Convolutions for Dense Prediction in 3D
Title | Tangent Convolutions for Dense Prediction in 3D |
Authors | Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, Qian-Yi Zhou |
Abstract | We present an approach to semantic scene analysis using deep convolutional networks. Our approach is based on tangent convolutions - a new construction for convolutional networks on 3D data. In contrast to volumetric approaches, our method operates directly on surface geometry. Crucially, the construction is applicable to unstructured point clouds and other noisy real-world data. We show that tangent convolutions can be evaluated efficiently on large-scale point clouds with millions of points. Using tangent convolutions, we design a deep fully-convolutional network for semantic segmentation of 3D point clouds, and apply it to challenging real-world datasets of indoor and outdoor 3D environments. Experimental results show that the presented approach outperforms other recent deep network constructions in detailed analysis of large 3D scenes. |
Tasks | Semantic Segmentation |
Published | 2018-07-06 |
URL | http://arxiv.org/abs/1807.02443v1 |
http://arxiv.org/pdf/1807.02443v1.pdf | |
PWC | https://paperswithcode.com/paper/tangent-convolutions-for-dense-prediction-in |
Repo | https://github.com/tatarchm/tangent_conv |
Framework | tf |
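The core construction above projects each point's local neighborhood onto its tangent plane so that a regular 2D convolution can be applied there. A sketch of that projection step, using a local PCA for the tangent-plane estimate; the rasterization into a tangent image and the network itself are omitted, and the function names are not from the released code.

```python
import numpy as np

def tangent_plane_coordinates(point, neighbors):
    """Project a point's neighbors onto its estimated tangent plane.

    Returns 2D coordinates in the tangent plane, ready to be rasterized into a
    small image on which an ordinary 2D convolution can operate.
    """
    centered = neighbors - neighbors.mean(axis=0)
    # The two largest principal directions of the local covariance span the tangent
    # plane; the remaining direction is the surface normal.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    tangent_basis = vt[:2]                        # (2, 3) principal directions
    return (neighbors - point) @ tangent_basis.T  # (num_neighbors, 2)
```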
IVD-Net: Intervertebral disc localization and segmentation in MRI with a multi-modal UNet
Title | IVD-Net: Intervertebral disc localization and segmentation in MRI with a multi-modal UNet |
Authors | Jose Dolz, Christian Desrosiers, Ismail Ben Ayed |
Abstract | Accurate localization and segmentation of intervertebral discs (IVDs) are crucial for spine disease diagnosis and assessment. Despite the technological advances in medical imaging, IVD localization and segmentation are still performed manually, which is time-consuming and prone to errors. If, in addition, multi-modal imaging is considered, the burden imposed on disease assessments increases substantially. In this paper, we propose an architecture for IVD localization and segmentation in multi-modal MRI which extends the well-known UNet. Compared to single images, multi-modal data brings complementary information, contributing to better data representation and discriminative power. Our contributions are three-fold. First, how to effectively integrate and fully leverage multi-modal data remains almost unexplored. In this work, each MRI modality is processed in a different path to better exploit its unique information. Second, inspired by HyperDenseNet, the network is densely connected both within each path and across different paths, granting the model the freedom to learn where and how the different modalities should be processed and combined. Third, we improve standard U-Net modules by extending inception modules with two dilated convolution blocks of different scale, which helps handle multi-scale context. We report experiments on the dataset of the public MICCAI 2018 Challenge on Automatic Intervertebral Disc Localization and Segmentation, with 13 multi-modal MRI images used for training and 3 for validation. We trained IVD-Net on an NVIDIA TITAN Xp GPU with 16 GB of RAM, using Adam as the optimizer and a learning rate of 10e-5 for 200 epochs. Training took about 5 hours, and segmentation of a whole volume took about 2-3 seconds on average. Several baselines, with different multi-modal fusion strategies, were used to demonstrate the effectiveness of the proposed architecture. |
Tasks | Medical Image Segmentation |
Published | 2018-11-19 |
URL | http://arxiv.org/abs/1811.08305v1 |
http://arxiv.org/pdf/1811.08305v1.pdf | |
PWC | https://paperswithcode.com/paper/ivd-net-intervertebral-disc-localization-and |
Repo | https://github.com/josedolz/IVD-Net |
Framework | pytorch |
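The third contribution above, inception-style modules extended with dilated convolutions, can be sketched as a small PyTorch block: parallel branches at different dilation rates whose outputs are concatenated, which is what provides the multi-scale context the abstract mentions. This is an illustration of the idea with hypothetical channel counts, not the released IVD-Net code.

```python
import torch
import torch.nn as nn

class DilatedInceptionBlock(nn.Module):
    """Parallel 3x3 convolutions at different dilation rates, concatenated (illustrative)."""

    def __init__(self, in_ch, branch_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: DilatedInceptionBlock(32, 16)(torch.randn(1, 32, 64, 64)).shape -> (1, 48, 64, 64)
```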