Paper Group AWR 52
Design Challenges and Misconceptions in Neural Sequence Labeling. Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. SlowFast Networks for Video Recognition. Fast and Robust Multiple ColorChecker Detection using Deep Convolutional Neural Networks. Learning to Separate Object Sounds by Watching Unlabeled Video. LaneNet …
Design Challenges and Misconceptions in Neural Sequence Labeling
Title | Design Challenges and Misconceptions in Neural Sequence Labeling |
Authors | Jie Yang, Shuailong Liang, Yue Zhang |
Abstract | We investigate the design challenges of constructing effective and efficient neural sequence labeling systems by reproducing twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conducting a systematic model comparison on three benchmarks (i.e. NER, Chunking, and POS tagging). Misconceptions and inconsistent conclusions in existing literature are examined and clarified under statistical experiments. In the comparison and analysis process, we reach several practical conclusions which can be useful to practitioners. |
Tasks | Chunking |
Published | 2018-06-12 |
URL | http://arxiv.org/abs/1806.04470v2 |
http://arxiv.org/pdf/1806.04470v2.pdf | |
PWC | https://paperswithcode.com/paper/design-challenges-and-misconceptions-in |
Repo | https://github.com/jiesutd/NCRFpp |
Framework | pytorch |
Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Title | Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset |
Authors | Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck |
Abstract | Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music. |
Tasks | Music Modeling, Piano Music Modeling |
Published | 2018-10-29 |
URL | http://arxiv.org/abs/1810.12247v5 |
http://arxiv.org/pdf/1810.12247v5.pdf | |
PWC | https://paperswithcode.com/paper/enabling-factorized-piano-music-modeling-and |
Repo | https://github.com/BShakhovsky/PolyphonicPianoTranscription |
Framework | tf |
SlowFast Networks for Video Recognition
Title | SlowFast Networks for Video Recognition |
Authors | Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He |
Abstract | We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast |
Tasks | Action Classification, Action Detection, Video Recognition |
Published | 2018-12-10 |
URL | https://arxiv.org/abs/1812.03982v3 |
https://arxiv.org/pdf/1812.03982v3.pdf | |
PWC | https://paperswithcode.com/paper/slowfast-networks-for-video-recognition |
Repo | https://github.com/Guocode/SlowFast-Networks |
Framework | none |
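The abstract above sketches a two-pathway design: a Slow pathway over sparsely sampled frames for spatial semantics and a lightweight Fast pathway over densely sampled frames for motion. Below is a minimal PyTorch sketch of that idea; the pathway depths, channel ratio, and fusion by concatenation are illustrative assumptions, not the paper's exact architecture, and the lateral Fast-to-Slow connections used throughout the real network are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    """Toy two-pathway network: Slow sees few frames with many channels,
    Fast sees many frames with few channels; features are fused for classification."""
    def __init__(self, num_classes=400, alpha=8, beta=8):
        super().__init__()
        self.alpha = alpha                      # temporal stride of the Slow pathway's input
        slow_c, fast_c = 64, 64 // beta         # Fast pathway has beta-times fewer channels
        self.slow = nn.Sequential(nn.Conv3d(3, slow_c, kernel_size=(1, 7, 7),
                                            stride=(1, 2, 2), padding=(0, 3, 3)),
                                  nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.fast = nn.Sequential(nn.Conv3d(3, fast_c, kernel_size=(5, 7, 7),
                                            stride=(1, 2, 2), padding=(2, 3, 3)),
                                  nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(slow_c + fast_c, num_classes)

    def forward(self, clip):
        # clip: (batch, 3, T, H, W); the Slow pathway subsamples frames by alpha
        slow_in = clip[:, :, ::self.alpha]
        s = self.slow(slow_in).flatten(1)
        f = self.fast(clip).flatten(1)
        return self.fc(torch.cat([s, f], dim=1))

logits = TinySlowFast()(torch.randn(2, 3, 32, 112, 112))   # -> (2, 400)
```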
Fast and Robust Multiple ColorChecker Detection using Deep Convolutional Neural Networks
Title | Fast and Robust Multiple ColorChecker Detection using Deep Convolutional Neural Networks |
Authors | Pedro D. Marrero Fernandez, Fidel A. Guerrero-Peña, Tsang Ing Ren, Jorge J. G. Leandro |
Abstract | ColorCheckers are reference standards that professional photographers and filmmakers use to ensure predictable results under every lighting condition. The objective of this work is to propose a new fast and robust method for automatic ColorChecker detection. The process is divided into two steps: (1) ColorChecker localization and (2) ColorChecker patch recognition. For ColorChecker localization, we trained a detection convolutional neural network using synthetic images. The synthetic images are created with 3D models of the ColorChecker and different background images. The output of the neural network is a bounding box for each possible ColorChecker candidate in the input image. Each bounding box defines a cropped image that is evaluated by a recognition system, and each image is canonized with regard to color and dimensions. Subsequently, all possible color patches are extracted and grouped according to the distance between their centers. Each group is evaluated as a candidate for a ColorChecker part, and its position in the scene is estimated. Finally, a cost function is applied to evaluate the accuracy of the estimation. The method is tested using real and synthetic images. The proposed method is fast, robust to overlaps, and invariant to affine projections. The algorithm also performs well in the case of multiple ColorChecker detection. |
Tasks | |
Published | 2018-10-19 |
URL | http://arxiv.org/abs/1810.08639v1 |
http://arxiv.org/pdf/1810.08639v1.pdf | |
PWC | https://paperswithcode.com/paper/fast-and-robust-multiple-colorchecker |
Repo | https://github.com/pedrodiamel/colorchacker-detection |
Framework | none |
Learning to Separate Object Sounds by Watching Unlabeled Video
Title | Learning to Separate Object Sounds by Watching Unlabeled Video |
Authors | Ruohan Gao, Rogerio Feris, Kristen Grauman |
Abstract | Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale “in the wild” videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results: http://vision.cs.utexas.edu/projects/separating_object_sounds/ |
Tasks | Audio Denoising, Denoising, Multi-Label Learning |
Published | 2018-04-05 |
URL | http://arxiv.org/abs/1804.01665v2 |
http://arxiv.org/pdf/1804.01665v2.pdf | |
PWC | https://paperswithcode.com/paper/learning-to-separate-object-sounds-by |
Repo | https://github.com/rhgao/Deep-MIML-Network |
Framework | pytorch |
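The abstract above pairs non-negative audio bases with a multi-instance multi-label (MIML) objective: each recovered frequency basis is an instance, and video-level visual object tags supervise the whole bag. A rough sketch of that pairing follows, with an NMF front end and a small scoring head standing in for the authors' deep MIML network; the component count, layer sizes, and object count are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import NMF

# 1) Decompose a mixture spectrogram into K non-negative frequency bases (the "instances").
spectrogram = np.random.rand(257, 400)                     # stand-in magnitude spectrogram (freq x time)
nmf = NMF(n_components=25, init="random", max_iter=200)
bases = torch.tensor(nmf.fit_transform(spectrogram).T, dtype=torch.float32)   # (K, freq)

# 2) MIML head: score every basis against every visual object label, then max-pool over
#    instances so that a video-level label only needs one matching basis.
class MIMLHead(nn.Module):
    def __init__(self, freq_bins=257, num_objects=15):
        super().__init__()
        self.instance_scorer = nn.Sequential(
            nn.Linear(freq_bins, 128), nn.ReLU(), nn.Linear(128, num_objects))

    def forward(self, bases):                               # bases: (K, freq_bins)
        instance_logits = self.instance_scorer(bases)       # (K, num_objects)
        return instance_logits.max(dim=0).values            # bag-level logits per object

bag_logits = MIMLHead()(bases)
# train with nn.BCEWithLogitsLoss against the video's (weak) visual object tags
```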
LaneNet: Real-Time Lane Detection Networks for Autonomous Driving
Title | LaneNet: Real-Time Lane Detection Networks for Autonomous Driving |
Authors | Ze Wang, Weiqiang Ren, Qiang Qiu |
Abstract | Lane detection aims to detect lanes on the road and provide the accurate location and shape of each lane. It serves as one of the key techniques enabling modern assisted and autonomous driving systems. However, several unique properties of lanes challenge detection methods. The lack of distinctive features means lane detection algorithms tend to be confused by other objects with similar local appearance. Moreover, the inconsistent number of lanes on a road as well as diverse lane line patterns, e.g. solid, broken, single, double, merging, and splitting lines, further hamper performance. In this paper, we propose a deep neural network based method, named LaneNet, that breaks down lane detection into two stages: lane edge proposal and lane line localization. Stage one uses a lane edge proposal network for pixel-wise lane edge classification, and the lane line localization network in stage two then detects lane lines based on the lane edge proposals. Note that LaneNet is built to detect lane lines only, which makes it more difficult to suppress false detections on similar road markings such as arrows and characters. Despite these difficulties, our lane detection method is shown to be robust in both highway and urban road scenarios without relying on any assumptions about the number of lanes or the lane line patterns. The high running speed and low computational cost endow our LaneNet with the capability of being deployed on vehicle-based systems. Experiments validate that our LaneNet consistently delivers outstanding performance on real-world traffic scenarios. |
Tasks | Autonomous Driving, Lane Detection |
Published | 2018-07-04 |
URL | http://arxiv.org/abs/1807.01726v1 |
http://arxiv.org/pdf/1807.01726v1.pdf | |
PWC | https://paperswithcode.com/paper/lanenet-real-time-lane-detection-networks-for |
Repo | https://github.com/klintan/pytorch-lanenet |
Framework | pytorch |
Model compression via distillation and quantization
Title | Model compression via distillation and quantization |
Authors | Antonio Polino, Razvan Pascanu, Dan Alistarh |
Abstract | Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The first method we propose is called quantized distillation and leverages distillation during the training process, by incorporating distillation loss, expressed with respect to the teacher, into the training of a student network whose weights are quantized to a limited set of levels. The second method, differentiable quantization, optimizes the location of quantization points through stochastic gradient descent, to better fit the behavior of the teacher model. We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to full-precision teacher models, while providing order of magnitude compression, and inference speedup that is linear in the depth reduction. In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices. |
Tasks | Model Compression, Quantization |
Published | 2018-02-15 |
URL | http://arxiv.org/abs/1802.05668v1 |
http://arxiv.org/pdf/1802.05668v1.pdf | |
PWC | https://paperswithcode.com/paper/model-compression-via-distillation-and |
Repo | https://github.com/NervanaSystems/distiller |
Framework | pytorch |
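The first method in the abstract, quantized distillation, trains a student whose weights are quantized while the loss includes a distillation term against the teacher. A minimal sketch of one such training step is below; the uniform quantizer, temperature, and mixing weight are illustrative assumptions, and the full-precision shadow copy plays the role of a straight-through update rather than the paper's exact bucketing scheme.

```python
import torch
import torch.nn.functional as F

def uniform_quantize(w, num_levels=16):
    """Map weights onto num_levels evenly spaced values in [w.min(), w.max()]."""
    lo, hi = w.min(), w.max()
    step = (hi - lo).clamp_min(1e-8) / (num_levels - 1)
    return lo + torch.round((w - lo) / step) * step

def quantized_distillation_step(student, teacher, x, y, optimizer, T=4.0, alpha=0.7):
    """One step: forward/backward with quantized student weights, distillation loss
    against the teacher, then apply the gradients to the full-precision weights."""
    full_precision = [p.detach().clone() for p in student.parameters()]
    with torch.no_grad():                       # run the student with quantized weights
        for p in student.parameters():
            p.copy_(uniform_quantize(p))
    student_logits = student(x)
    with torch.no_grad():
        teacher_logits = teacher(x)
    distill = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       F.softmax(teacher_logits / T, dim=1),
                       reduction="batchmean") * T * T
    loss = alpha * distill + (1 - alpha) * F.cross_entropy(student_logits, y)
    optimizer.zero_grad()
    loss.backward()
    with torch.no_grad():                       # restore and update the full-precision copy
        for p, fp in zip(student.parameters(), full_precision):
            p.copy_(fp)
    optimizer.step()
    return loss.item()
```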
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
Title | AMC: AutoML for Model Compression and Acceleration on Mobile Devices |
Authors | Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, Song Han |
Abstract | Model compression is a critical technique to efficiently deploy neural network models on mobile devices, which have limited computation resources and tight power budgets. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore the large design space, trading off model size, speed, and accuracy, which is usually sub-optimal and time-consuming. In this paper, we propose AutoML for Model Compression (AMC), which leverages reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio, better preserving accuracy, and reducing human labor. Under 4x FLOPs reduction, we achieved 2.7% better accuracy than the handcrafted model compression policy for VGG-16 on ImageNet. We applied this automated, push-the-button compression pipeline to MobileNet and achieved a 1.81x speedup of measured inference latency on an Android phone and a 1.43x speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy. |
Tasks | AutoML, Model Compression, Neural Architecture Search |
Published | 2018-02-10 |
URL | http://arxiv.org/abs/1802.03494v4 |
http://arxiv.org/pdf/1802.03494v4.pdf | |
PWC | https://paperswithcode.com/paper/amc-automl-for-model-compression-and |
Repo | https://github.com/NervanaSystems/distiller |
Framework | pytorch |
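The abstract above frames compression as a search for per-layer compression ratios driven by a reward that trades accuracy off against a resource budget. The sketch below illustrates that policy-search formulation, with magnitude pruning and random search standing in for the paper's channel pruning and DDPG agent; the reward shaping, budget, and the user-supplied `evaluate` function are assumptions.

```python
import copy
import numpy as np
import torch.nn as nn

def magnitude_prune(model, ratios):
    """Zero out the smallest-magnitude weights of each Linear/Conv2d layer
    according to that layer's pruning ratio."""
    pruned = copy.deepcopy(model)
    layers = [m for m in pruned.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    for layer, r in zip(layers, ratios):
        w = layer.weight.data
        k = int(w.numel() * r)
        if k > 0:
            threshold = w.abs().flatten().kthvalue(k).values
            w[w.abs() <= threshold] = 0.0
    return pruned

def search_compression_policy(model, evaluate, num_layers, trials=50):
    """Random search over per-layer pruning ratios; the reward favors accuracy while
    a budget on remaining weights stands in for the FLOPs constraint."""
    best_reward, best_ratios = -np.inf, None
    for _ in range(trials):
        ratios = np.random.uniform(0.1, 0.8, size=num_layers)
        acc = evaluate(magnitude_prune(model, ratios))       # validation accuracy of the pruned model
        density = 1.0 - ratios.mean()
        reward = acc if density <= 0.5 else acc - (density - 0.5)   # penalize over-budget policies
        if reward > best_reward:
            best_reward, best_ratios = reward, ratios
    return best_ratios
```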
Object-Oriented Dynamics Predictor
Title | Object-Oriented Dynamics Predictor |
Authors | Guangxiang Zhu, Zhiao Huang, Chongjie Zhang |
Abstract | Generalization has been one of the major challenges for learning dynamics models in model-based reinforcement learning. However, previous work on action-conditioned dynamics prediction focuses on learning the pixel-level motion and thus does not generalize well to novel environments with different object layouts. In this paper, we present a novel object-oriented framework, called object-oriented dynamics predictor (OODP), which decomposes the environment into objects and predicts the dynamics of objects conditioned on both actions and object-to-object relations. It is an end-to-end neural network and can be trained in an unsupervised manner. To enable the generalization ability of dynamics learning, we design a novel CNN-based relation mechanism that is class-specific (rather than object-specific) and exploits the locality principle. Empirical results show that OODP significantly outperforms previous methods in terms of generalization over novel environments with various object layouts. OODP is able to learn from very few environments and accurately predict dynamics in a large number of unseen environments. In addition, OODP learns semantically and visually interpretable dynamics models. |
Tasks | |
Published | 2018-05-25 |
URL | http://arxiv.org/abs/1806.07371v3 |
http://arxiv.org/pdf/1806.07371v3.pdf | |
PWC | https://paperswithcode.com/paper/object-oriented-dynamics-predictor |
Repo | https://github.com/mig-zh/OODP |
Framework | tf |
Deep-FSMN for Large Vocabulary Continuous Speech Recognition
Title | Deep-FSMN for Large Vocabulary Continuous Speech Recognition |
Authors | Shiliang Zhang, Ming Lei, Zhijie Yan, Lirong Dai |
Abstract | In this paper, we present an improved feedforward sequential memory network (FSMN) architecture, namely Deep-FSMN (DFSMN), by introducing skip connections between memory blocks in adjacent layers. These skip connections enable the information flow across different layers and thus alleviate the gradient vanishing problem when building very deep structures. As a result, DFSMN significantly benefits from these skip connections and the deep structure. We have compared the performance of DFSMN to BLSTM both with and without lower frame rate (LFR) on several large speech recognition tasks, including English and Mandarin. Experimental results show that DFSMN can consistently outperform BLSTM with dramatic gains, especially when trained with LFR using CD-Phone as modeling units. On the 2000-hour Fisher (FSH) task, the proposed DFSMN can achieve a word error rate of 9.4% by purely using the cross-entropy criterion and decoding with a 3-gram language model, which is a 1.5% absolute improvement compared to the BLSTM. On a 20000-hour Mandarin recognition task, the LFR-trained DFSMN can achieve more than 20% relative improvement compared to the LFR-trained BLSTM. Moreover, we can easily design the lookahead filter order of the memory blocks in DFSMN to control the latency for real-time applications. |
Tasks | Language Modelling, Large Vocabulary Continuous Speech Recognition, Speech Recognition |
Published | 2018-03-04 |
URL | http://arxiv.org/abs/1803.05030v1 |
http://arxiv.org/pdf/1803.05030v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-fsmn-for-large-vocabulary-continuous |
Repo | https://github.com/yangxueruivs/DFSMN |
Framework | tf |
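The abstract above describes FSMN memory blocks with skip connections between the memory blocks of adjacent layers. A toy PyTorch sketch of that structure follows, where a depthwise 1-D convolution plays the role of the learnable memory filter; the hidden size, filter orders, and nonlinearity are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DFSMNBlock(nn.Module):
    """One Deep-FSMN-style layer: a linear projection plus a learnable FIR "memory"
    over past/future frames, with a skip connection from the previous memory block."""
    def __init__(self, dim=256, lookback=10, lookahead=2):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # a depthwise 1-D convolution implements the per-dimension memory filter
        self.memory = nn.Conv1d(dim, dim, kernel_size=lookback + lookahead + 1,
                                groups=dim, bias=False)
        self.lookback, self.lookahead = lookback, lookahead

    def forward(self, h, prev_memory=None):
        # h: (batch, time, dim)
        p = self.proj(h)
        x = nn.functional.pad(p.transpose(1, 2), (self.lookback, self.lookahead))
        m = self.memory(x).transpose(1, 2) + p      # memory block output
        if prev_memory is not None:                 # skip connection between memory blocks
            m = m + prev_memory
        return torch.relu(m), m

blocks = nn.ModuleList([DFSMNBlock() for _ in range(6)])
h, mem = torch.randn(4, 100, 256), None
for block in blocks:
    h, mem = block(h, mem)
```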
Malthusian Reinforcement Learning
Title | Malthusian Reinforcement Learning |
Authors | Joel Z. Leibo, Julien Perolat, Edward Hughes, Steven Wheelwright, Adam H. Marblestone, Edgar Duéñez-Guzmán, Peter Sunehag, Iain Dunning, Thore Graepel |
Abstract | Here we explore a new algorithmic framework for multi-agent reinforcement learning, called Malthusian reinforcement learning, which extends self-play to include fitness-linked population size dynamics that drive ongoing innovation. In Malthusian RL, increases in a subpopulation’s average return drive subsequent increases in its size, just as Thomas Malthus argued in 1798 was the relationship between preindustrial income levels and population growth. Malthusian reinforcement learning harnesses the competitive pressures arising from growing and shrinking population size to drive agents to explore regions of state and policy spaces that they could not otherwise reach. Furthermore, in environments where there are potential gains from specialization and division of labor, we show that Malthusian reinforcement learning is better positioned to take advantage of such synergies than algorithms based on self-play. |
Tasks | Multi-agent Reinforcement Learning |
Published | 2018-12-17 |
URL | http://arxiv.org/abs/1812.07019v2 |
http://arxiv.org/pdf/1812.07019v2.pdf | |
PWC | https://paperswithcode.com/paper/malthusian-reinforcement-learning |
Repo | https://github.com/AbhijeetPendyala/Knowledge_base_ML |
Framework | tf |
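The key mechanism in the abstract is the fitness-linked population dynamic: a subpopulation's average return in one episode drives its size in the next. The sketch below shows one way such an update could look; the softmax-style mapping from returns to sizes and the fixed total agent pool are assumptions, not the paper's exact dynamic.

```python
import numpy as np

def malthusian_population_update(avg_returns, total_agents=50):
    """Redistribute a fixed pool of agent slots across subpopulations so that
    higher-return subpopulations grow and lower-return ones shrink."""
    fitness = np.exp(avg_returns - avg_returns.max())    # softmax-style fitness weights
    shares = fitness / fitness.sum()
    return np.maximum(1, np.round(shares * total_agents)).astype(int)

# toy loop: each episode's measured returns set the next episode's subpopulation sizes
sizes = np.array([10, 10, 10, 10, 10])
for episode in range(3):
    avg_returns = np.random.randn(5) + 0.01 * sizes       # stand-in for measured returns
    sizes = malthusian_population_update(avg_returns)
    print(episode, sizes)
```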
Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points
Title | Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points |
Authors | Fabien Baradel, Christian Wolf, Julien Mille, Graham W. Taylor |
Abstract | We propose a method for human activity recognition from RGB data that does not rely on any pose information during test time and does not explicitly calculate pose information internally. Instead, a visual attention module learns to predict glimpse sequences in each frame. These glimpses correspond to interest points in the scene that are relevant to the classified activities. No spatial coherence is forced on the glimpse locations, which gives the module liberty to explore different points at each frame and better optimize the process of scrutinizing visual information. Tracking and sequentially integrating this kind of unstructured data is a challenge, which we address by separating the set of glimpses from a set of recurrent tracking/recognition workers. These workers receive glimpses, jointly performing subsequent motion tracking and activity prediction. The glimpses are soft-assigned to the workers, optimizing coherence of the assignments in space, time and feature space using an external memory module. No hard decisions are taken, i.e. each glimpse point is assigned to all existing workers, albeit with different importance. Our methods outperform state-of-the-art methods on the largest human activity recognition dataset available to date, the NTU RGB+D dataset, and on a smaller human action recognition dataset, the Northwestern-UCLA Multiview Action 3D dataset. Our code is publicly available at https://github.com/fabienbaradel/glimpse_clouds. |
Tasks | Action Recognition In Videos, Activity Prediction, Activity Recognition, Human Activity Recognition, Skeleton Based Action Recognition, Temporal Action Localization |
Published | 2018-02-22 |
URL | http://arxiv.org/abs/1802.07898v4 |
http://arxiv.org/pdf/1802.07898v4.pdf | |
PWC | https://paperswithcode.com/paper/glimpse-clouds-human-activity-recognition |
Repo | https://github.com/fabienbaradel/glimpse_clouds |
Framework | pytorch |
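The soft assignment described in the abstract, where every glimpse contributes to every worker with a different importance, can be illustrated with a plain attention-style weighting. The sketch below omits the paper's external memory module and its spatio-temporal coherence terms; the feature dimensions and the dot-product similarity are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_assign(glimpse_feats, worker_states, temperature=1.0):
    """Soft-assign every glimpse to every worker: similarity between glimpse features
    and worker states becomes a distribution over workers (no hard decisions)."""
    sim = glimpse_feats @ worker_states.t() / temperature    # (num_glimpses, num_workers)
    weights = F.softmax(sim, dim=1)
    # each worker receives a weighted sum of all glimpses
    worker_inputs = weights.t() @ glimpse_feats              # (num_workers, feat_dim)
    return worker_inputs, weights

glimpses = torch.randn(6, 256)       # glimpse features for one frame
workers = torch.randn(3, 256)        # recurrent worker hidden states
inputs, attn = soft_assign(glimpses, workers)
```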
Deep reinforcement learning for time series: playing idealized trading games
Title | Deep reinforcement learning for time series: playing idealized trading games |
Authors | Xiang Gao |
Abstract | Deep Q-learning is investigated as an end-to-end solution to estimate the optimal strategies for acting on time series input. Experiments are conducted on two idealized trading games. 1) Univariate: the only input is a wave-like price time series, and 2) Bivariate: the input includes a random stepwise price time series and a noisy signal time series, which is positively correlated with future price changes. The Univariate game tests whether the agent can capture the underlying dynamics, and the Bivariate game tests whether the agent can utilize the hidden relation among the inputs. Stacked Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM) units, Convolutional Neural Network (CNN), and multi-layer perceptron (MLP) are used to model Q values. For both games, all agents successfully find a profitable strategy. The GRU-based agents show best overall performance in the Univariate game, while the MLP-based agents outperform others in the Bivariate game. |
Tasks | Q-Learning, Time Series |
Published | 2018-03-11 |
URL | http://arxiv.org/abs/1803.03916v1 |
http://arxiv.org/pdf/1803.03916v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-reinforcement-learning-for-time-series |
Repo | https://github.com/golsun/deep-RL-time-series |
Framework | none |
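The abstract compares several function approximators for the Q-values; a minimal GRU-based Q-network over a window of time-series observations might look like the sketch below. The window length, feature count (price plus signal), and three-action set are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUQNetwork(nn.Module):
    """Q-network over a window of time-series observations: a stacked GRU encodes
    the window and a linear head outputs one Q-value per action."""
    def __init__(self, num_features=2, hidden=64, num_actions=3):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)   # e.g. hold / long / short

    def forward(self, window):                        # window: (batch, time, features)
        out, _ = self.gru(window)
        return self.head(out[:, -1])                  # Q-values from the last time step

q_net = GRUQNetwork()
q_values = q_net(torch.randn(8, 40, 2))               # batch of price + signal windows
action = q_values.argmax(dim=1)
```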
DSR: Direct Self-rectification for Uncalibrated Dual-lens Cameras
Title | DSR: Direct Self-rectification for Uncalibrated Dual-lens Cameras |
Authors | Ruichao Xiao, Wenxiu Sun, Jiahao Pang, Qiong Yan, Jimmy Ren |
Abstract | With the developments of dual-lens camera modules, depth information representing the third dimension of the captured scenes becomes available for smartphones. It is estimated by stereo matching algorithms, taking as input the two views captured by dual-lens cameras at slightly different viewpoints. Depth-of-field rendering (also referred to as synthetic defocus or bokeh) is one of the trending depth-based applications. However, to achieve fast depth estimation on smartphones, the stereo pairs need to be rectified in the first place. In this paper, we propose a cost-effective solution to perform stereo rectification for dual-lens cameras called direct self-rectification, short for DSR. It removes the need of individual offline calibration for every pair of dual-lens cameras. In addition, the proposed solution is robust to the slight movements, e.g., due to collisions, of the dual-lens cameras after fabrication. Different from existing self-rectification approaches, our approach computes the homography in a novel way with zero geometric distortions introduced to the master image. It is achieved by directly minimizing the vertical displacements of corresponding points between the original master image and the transformed slave image. Our method is evaluated on both realistic and synthetic stereo image pairs, and produces superior results compared to the calibrated rectification or other self-rectification approaches. |
Tasks | Calibration, Stereo Matching, Stereo Matching Hand |
Published | 2018-09-26 |
URL | http://arxiv.org/abs/1809.09763v1 |
http://arxiv.org/pdf/1809.09763v1.pdf | |
PWC | https://paperswithcode.com/paper/dsr-direct-self-rectification-for |
Repo | https://github.com/garroud/self-rectification |
Framework | none |
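The core idea in the abstract is to compute a transform for the slave image that directly minimizes the vertical displacement of corresponding points while leaving the master image untouched. The sketch below fits only the y-row of a homography by linear least squares, a simplified stand-in for the paper's constrained estimation; `match_keypoints` is a hypothetical helper.

```python
import numpy as np
import cv2

def self_rectify(master_pts, slave_pts):
    """Estimate a transform for the slave image that minimizes the vertical displacement
    between corresponding points; only the y-row of the homography is fitted here."""
    xs, ys = slave_pts[:, 0], slave_pts[:, 1]
    ym = master_pts[:, 1]
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1)
    (a, b, c), *_ = np.linalg.lstsq(A, ym, rcond=None)   # least-squares vertical alignment
    H = np.array([[1.0, 0.0, 0.0],
                  [a,   b,   c  ],
                  [0.0, 0.0, 1.0]])
    return H

# usage: match keypoints between the two views, fit H, warp the slave image
# master_pts, slave_pts = match_keypoints(master_img, slave_img)   # e.g. ORB + ratio test
# H = self_rectify(master_pts, slave_pts)
# rectified_slave = cv2.warpPerspective(slave_img, H, slave_img.shape[1::-1])
```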
End-to-End Learning of Communications Systems Without a Channel Model
Title | End-to-End Learning of Communications Systems Without a Channel Model |
Authors | Fayçal Ait Aoudia, Jakob Hoydis |
Abstract | The idea of end-to-end learning of communications systems through neural network-based autoencoders has the shortcoming that it requires a differentiable channel model. We present in this paper a novel learning algorithm which alleviates this problem. The algorithm iterates between supervised training of the receiver and reinforcement learning-based training of the transmitter. We demonstrate that this approach works as well as fully supervised methods on additive white Gaussian noise (AWGN) and Rayleigh block-fading (RBF) channels. Surprisingly, while our method converges slower on AWGN channels than supervised training, it converges faster on RBF channels. Our results are a first step towards learning of communications systems over any type of channel without prior assumptions. |
Tasks | |
Published | 2018-04-06 |
URL | http://arxiv.org/abs/1804.02276v3 |
http://arxiv.org/pdf/1804.02276v3.pdf | |
PWC | https://paperswithcode.com/paper/end-to-end-learning-of-communications-systems |
Repo | https://github.com/Aithu-Snehith/End-to-End-Learning-of-Communications-Systems-Without-a-Channel-Model |
Framework | tf |
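The alternating scheme in the abstract, supervised training of the receiver and reinforcement learning-based training of the transmitter without channel gradients, can be sketched as below. The message count, network sizes, SNR, and the Gaussian perturbation policy are assumptions; the paper's scheme differs in detail (e.g., loss normalization and scheduling).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, n = 16, 8                                    # messages and real-valued channel uses
tx = nn.Sequential(nn.Linear(M, 32), nn.ReLU(), nn.Linear(32, 2 * n))
rx = nn.Sequential(nn.Linear(2 * n, 32), nn.ReLU(), nn.Linear(32, M))
tx_opt = torch.optim.Adam(tx.parameters(), lr=1e-3)
rx_opt = torch.optim.Adam(rx.parameters(), lr=1e-3)

def channel(x, snr_db=10.0):
    """AWGN channel treated as a black box: no gradients are taken through it."""
    noise_std = 10 ** (-snr_db / 20)
    return x + noise_std * torch.randn_like(x)

sigma = 0.05                                    # std of the transmitter's exploration policy
for step in range(1000):
    msgs = torch.randint(0, M, (256,))
    onehot = F.one_hot(msgs, M).float()

    # Receiver phase: plain supervised training on whatever arrives over the channel.
    with torch.no_grad():
        received = channel(tx(onehot))
    rx_loss = F.cross_entropy(rx(received), msgs)
    rx_opt.zero_grad(); rx_loss.backward(); rx_opt.step()

    # Transmitter phase: sample perturbed symbols, use the receiver's per-example loss
    # as a reward signal, and apply a REINFORCE-style update (no channel gradient needed).
    x = tx(onehot)
    action = (x + sigma * torch.randn_like(x)).detach()
    with torch.no_grad():
        per_example_loss = F.cross_entropy(rx(channel(action)), msgs, reduction="none")
    log_prob = -((action - x) ** 2).sum(dim=1) / (2 * sigma ** 2)
    tx_loss = (per_example_loss * log_prob).mean()
    tx_opt.zero_grad(); tx_loss.backward(); tx_opt.step()
```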