Paper Group AWR 52
Design Challenges and Misconceptions in Neural Sequence Labeling. Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. SlowFast Networks for Video Recognition. Fast and Robust Multiple ColorChecker Detection using Deep Convolutional Neural Networks. Learning to Separate Object Sounds by Watching Unlabeled Video. LaneNet …
Design Challenges and Misconceptions in Neural Sequence Labeling
Title | Design Challenges and Misconceptions in Neural Sequence Labeling |
Authors | Jie Yang, Shuailong Liang, Yue Zhang |
Abstract | We investigate the design challenges of constructing effective and efficient neural sequence labeling systems by reproducing twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conducting a systematic model comparison on three benchmarks (i.e. NER, Chunking, and POS tagging). Misconceptions and inconsistent conclusions in existing literature are examined and clarified under statistical experiments. In the comparison and analysis process, we reach several practical conclusions which can be useful to practitioners. |
Tasks | Chunking |
Published | 2018-06-12 |
URL | http://arxiv.org/abs/1806.04470v2 |
http://arxiv.org/pdf/1806.04470v2.pdf | |
PWC | https://paperswithcode.com/paper/design-challenges-and-misconceptions-in |
Repo | https://github.com/jiesutd/NCRFpp |
Framework | pytorch |
Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Title | Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset |
Authors | Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck |
Abstract | Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music. |
Tasks | Music Modeling, Piano Music Modeling |
Published | 2018-10-29 |
URL | http://arxiv.org/abs/1810.12247v5 |
http://arxiv.org/pdf/1810.12247v5.pdf | |
PWC | https://paperswithcode.com/paper/enabling-factorized-piano-music-modeling-and |
Repo | https://github.com/BShakhovsky/PolyphonicPianoTranscription |
Framework | tf |
SlowFast Networks for Video Recognition
Title | SlowFast Networks for Video Recognition |
Authors | Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He |
Abstract | We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast |
Tasks | Action Classification, Action Detection, Video Recognition |
Published | 2018-12-10 |
URL | https://arxiv.org/abs/1812.03982v3 |
https://arxiv.org/pdf/1812.03982v3.pdf | |
PWC | https://paperswithcode.com/paper/slowfast-networks-for-video-recognition |
Repo | https://github.com/Guocode/SlowFast-Networks |
Framework | none |
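The abstract above sketches a two-pathway design: a Slow pathway over sparsely sampled frames for spatial semantics and a lightweight Fast pathway over densely sampled frames for motion. Below is a minimal PyTorch sketch of that idea; the pathway depths, channel ratio, and fusion by concatenation are illustrative assumptions, not the paper's exact architecture, and the lateral Fast-to-Slow connections used throughout the real network are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    """Toy two-pathway network: Slow sees few frames with many channels,
    Fast sees many frames with few channels; features are fused for classification."""
    def __init__(self, num_classes=400, alpha=8, beta=8):
        super().__init__()
        self.alpha = alpha                      # temporal stride of the Slow pathway's input
        slow_c, fast_c = 64, 64 // beta         # Fast pathway has beta-times fewer channels
        self.slow = nn.Sequential(nn.Conv3d(3, slow_c, kernel_size=(1, 7, 7),
                                            stride=(1, 2, 2), padding=(0, 3, 3)),
                                  nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.fast = nn.Sequential(nn.Conv3d(3, fast_c, kernel_size=(5, 7, 7),
                                            stride=(1, 2, 2), padding=(2, 3, 3)),
                                  nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(slow_c + fast_c, num_classes)

    def forward(self, clip):
        # clip: (batch, 3, T, H, W); the Slow pathway subsamples frames by alpha
        slow_in = clip[:, :, ::self.alpha]
        s = self.slow(slow_in).flatten(1)
        f = self.fast(clip).flatten(1)
        return self.fc(torch.cat([s, f], dim=1))

logits = TinySlowFast()(torch.randn(2, 3, 32, 112, 112))   # -> (2, 400)
```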
Fast and Robust Multiple ColorChecker Detection using Deep Convolutional Neural Networks
Title | Fast and Robust Multiple ColorChecker Detection using Deep Convolutional Neural Networks |
Authors | Pedro D. Marrero Fernandez, Fidel A. Guerrero-Peña, Tsang Ing Ren, Jorge J. G. Leandro |
Abstract | ColorCheckers are reference standards that professional photographers and filmmakers use to ensure predictable results under every lighting condition. The objective of this work is to propose a new fast and robust method for automatic ColorChecker detection. The process is divided into two steps: (1) ColorChecker localization and (2) ColorChecker patch recognition. For ColorChecker localization, we trained a detection convolutional neural network using synthetic images. The synthetic images are created with 3D models of the ColorChecker and different background images. The output of the neural network is a bounding box for each possible ColorChecker candidate in the input image. Each bounding box defines a cropped image that is evaluated by a recognition system, and each image is canonized with regard to color and dimensions. Subsequently, all possible color patches are extracted and grouped according to the distance between their centers. Each group is evaluated as a candidate for a ColorChecker part, and its position in the scene is estimated. Finally, a cost function is applied to evaluate the accuracy of the estimation. The method is tested using real and synthetic images. The proposed method is fast, robust to overlaps, and invariant to affine projections. The algorithm also performs well in the case of multiple ColorChecker detection. |
Tasks | |
Published | 2018-10-19 |
URL | http://arxiv.org/abs/1810.08639v1 |
http://arxiv.org/pdf/1810.08639v1.pdf | |
PWC | https://paperswithcode.com/paper/fast-and-robust-multiple-colorchecker |
Repo | https://github.com/pedrodiamel/colorchacker-detection |
Framework | none |
Learning to Separate Object Sounds by Watching Unlabeled Video
Title | Learning to Separate Object Sounds by Watching Unlabeled Video |
Authors | Ruohan Gao, Rogerio Feris, Kristen Grauman |
Abstract | Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale “in the wild” videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results: http://vision.cs.utexas.edu/projects/separating_object_sounds/ |
Tasks | Audio Denoising, Denoising, Multi-Label Learning |
Published | 2018-04-05 |
URL | http://arxiv.org/abs/1804.01665v2 |
http://arxiv.org/pdf/1804.01665v2.pdf | |
PWC | https://paperswithcode.com/paper/learning-to-separate-object-sounds-by |
Repo | https://github.com/rhgao/Deep-MIML-Network |
Framework | pytorch |
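The abstract above pairs non-negative audio bases with a multi-instance multi-label (MIML) objective: each recovered frequency basis is an instance, and video-level visual object tags supervise the whole bag. A rough sketch of that pairing follows, with an NMF front end and a small scoring head standing in for the authors' deep MIML network; the component count, layer sizes, and object count are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import NMF

# 1) Decompose a mixture spectrogram into K non-negative frequency bases (the "instances").
spectrogram = np.random.rand(257, 400)                     # stand-in magnitude spectrogram (freq x time)
nmf = NMF(n_components=25, init="random", max_iter=200)
bases = torch.tensor(nmf.fit_transform(spectrogram).T, dtype=torch.float32)   # (K, freq)

# 2) MIML head: score every basis against every visual object label, then max-pool over
#    instances so that a video-level label only needs one matching basis.
class MIMLHead(nn.Module):
    def __init__(self, freq_bins=257, num_objects=15):
        super().__init__()
        self.instance_scorer = nn.Sequential(
            nn.Linear(freq_bins, 128), nn.ReLU(), nn.Linear(128, num_objects))

    def forward(self, bases):                               # bases: (K, freq_bins)
        instance_logits = self.instance_scorer(bases)       # (K, num_objects)
        return instance_logits.max(dim=0).values            # bag-level logits per object

bag_logits = MIMLHead()(bases)
# train with nn.BCEWithLogitsLoss against the video's (weak) visual object tags
```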
LaneNet: Real-Time Lane Detection Networks for Autonomous Driving
Title | LaneNet: Real-Time Lane Detection Networks for Autonomous Driving |
Authors | Ze Wang, Weiqiang Ren, Qiang Qiu |
Abstract | Lane detection aims to detect lanes on the road and provide the accurate location and shape of each lane. It serves as one of the key techniques enabling modern assisted and autonomous driving systems. However, several unique properties of lanes challenge detection methods. The lack of distinctive features means lane detection algorithms tend to be confused by other objects with similar local appearance. Moreover, the inconsistent number of lanes on a road as well as diverse lane line patterns, e.g. solid, broken, single, double, merging, and splitting lines, further hamper performance. In this paper, we propose a deep neural network based method, named LaneNet, that breaks down lane detection into two stages: lane edge proposal and lane line localization. Stage one uses a lane edge proposal network for pixel-wise lane edge classification, and the lane line localization network in stage two then detects lane lines based on the lane edge proposals. Note that LaneNet is built to detect lane lines only, which makes it more difficult to suppress false detections on similar road markings such as arrows and characters. Despite these difficulties, our lane detection method is shown to be robust in both highway and urban road scenarios without relying on any assumptions about the number of lanes or the lane line patterns. The high running speed and low computational cost endow our LaneNet with the capability of being deployed on vehicle-based systems. Experiments validate that our LaneNet consistently delivers outstanding performance on real-world traffic scenarios. |
Tasks | Autonomous Driving, Lane Detection |
Published | 2018-07-04 |
URL | http://arxiv.org/abs/1807.01726v1 |
http://arxiv.org/pdf/1807.01726v1.pdf | |
PWC | https://paperswithcode.com/paper/lanenet-real-time-lane-detection-networks-for |
Repo | https://github.com/klintan/pytorch-lanenet |
Framework | pytorch |
Model compression via distillation and quantization
Title | Model compression via distillation and quantization |
Authors | Antonio Polino, Razvan Pascanu, Dan Alistarh |
Abstract | Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The first method we propose is called quantized distillation and leverages distillation during the training process, by incorporating distillation loss, expressed with respect to the teacher, into the training of a student network whose weights are quantized to a limited set of levels. The second method, differentiable quantization, optimizes the location of quantization points through stochastic gradient descent, to better fit the behavior of the teacher model. We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to full-precision teacher models, while providing order of magnitude compression, and inference speedup that is linear in the depth reduction. In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices. |
Tasks | Model Compression, Quantization |
Published | 2018-02-15 |
URL | http://arxiv.org/abs/1802.05668v1 |
http://arxiv.org/pdf/1802.05668v1.pdf | |
PWC | https://paperswithcode.com/paper/model-compression-via-distillation-and |
Repo | https://github.com/NervanaSystems/distiller |
Framework | pytorch |
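The first method in the abstract, quantized distillation, trains a student whose weights are quantized while the loss includes a distillation term against the teacher. A minimal sketch of one such training step is below; the uniform quantizer, temperature, and mixing weight are illustrative assumptions, and the full-precision shadow copy plays the role of a straight-through update rather than the paper's exact bucketing scheme.

```python
import torch
import torch.nn.functional as F

def uniform_quantize(w, num_levels=16):
    """Map weights onto num_levels evenly spaced values in [w.min(), w.max()]."""
    lo, hi = w.min(), w.max()
    step = (hi - lo).clamp_min(1e-8) / (num_levels - 1)
    return lo + torch.round((w - lo) / step) * step

def quantized_distillation_step(student, teacher, x, y, optimizer, T=4.0, alpha=0.7):
    """One step: forward/backward with quantized student weights, distillation loss
    against the teacher, then apply the gradients to the full-precision weights."""
    full_precision = [p.detach().clone() for p in student.parameters()]
    with torch.no_grad():                       # run the student with quantized weights
        for p in student.parameters():
            p.copy_(uniform_quantize(p))
    student_logits = student(x)
    with torch.no_grad():
        teacher_logits = teacher(x)
    distill = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       F.softmax(teacher_logits / T, dim=1),
                       reduction="batchmean") * T * T
    loss = alpha * distill + (1 - alpha) * F.cross_entropy(student_logits, y)
    optimizer.zero_grad()
    loss.backward()
    with torch.no_grad():                       # restore and update the full-precision copy
        for p, fp in zip(student.parameters(), full_precision):
            p.copy_(fp)
    optimizer.step()
    return loss.item()
```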
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
Title | AMC: AutoML for Model Compression and Acceleration on Mobile Devices |
Authors | Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, Song Han |
Abstract | Model compression is a critical technique to efficiently deploy neural network models on mobile devices, which have limited computation resources and tight power budgets. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore the large design space, trading off model size, speed, and accuracy, which is usually sub-optimal and time-consuming. In this paper, we propose AutoML for Model Compression (AMC), which leverages reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio, better preserving accuracy, and reducing human labor. Under 4x FLOPs reduction, we achieved 2.7% better accuracy than the handcrafted model compression policy for VGG-16 on ImageNet. We applied this automated, push-the-button compression pipeline to MobileNet and achieved a 1.81x speedup of measured inference latency on an Android phone and a 1.43x speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy. |
Tasks | AutoML, Model Compression, Neural Architecture Search |
Published | 2018-02-10 |
URL | http://arxiv.org/abs/1802.03494v4 |
http://arxiv.org/pdf/1802.03494v4.pdf | |
PWC | https://paperswithcode.com/paper/amc-automl-for-model-compression-and |
Repo | https://github.com/NervanaSystems/distiller |
Framework | pytorch |
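The abstract above frames compression as a search for per-layer compression ratios driven by a reward that trades accuracy off against a resource budget. The sketch below illustrates that policy-search formulation, with magnitude pruning and random search standing in for the paper's channel pruning and DDPG agent; the reward shaping, budget, and the user-supplied `evaluate` function are assumptions.

```python
import copy
import numpy as np
import torch.nn as nn

def magnitude_prune(model, ratios):
    """Zero out the smallest-magnitude weights of each Linear/Conv2d layer
    according to that layer's pruning ratio."""
    pruned = copy.deepcopy(model)
    layers = [m for m in pruned.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    for layer, r in zip(layers, ratios):
        w = layer.weight.data
        k = int(w.numel() * r)
        if k > 0:
            threshold = w.abs().flatten().kthvalue(k).values
            w[w.abs() <= threshold] = 0.0
    return pruned

def search_compression_policy(model, evaluate, num_layers, trials=50):
    """Random search over per-layer pruning ratios; the reward favors accuracy while
    a budget on remaining weights stands in for the FLOPs constraint."""
    best_reward, best_ratios = -np.inf, None
    for _ in range(trials):
        ratios = np.random.uniform(0.1, 0.8, size=num_layers)
        acc = evaluate(magnitude_prune(model, ratios))       # validation accuracy of the pruned model
        density = 1.0 - ratios.mean()
        reward = acc if density <= 0.5 else acc - (density - 0.5)   # penalize over-budget policies
        if reward > best_reward:
            best_reward, best_ratios = reward, ratios
    return best_ratios
```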
Object-Oriented Dynamics Predictor
Title | Object-Oriented Dynamics Predictor |
Authors | Guangxiang Zhu, Zhiao Huang, Chongjie Zhang |
Abstract | Generalization has been one of the major challenges for learning dynamics models in model-based reinforcement learning. However, previous work on action-conditioned dynamics prediction focuses on learning the pixel-level motion and thus does not generalize well to novel environments with different object layouts. In this paper, we present a novel object-oriented framework, called object-oriented dynamics predictor (OODP), which decomposes the environment into objects and predicts the dynamics of objects conditioned on both actions and object-to-object relations. It is an end-to-end neural network and can be trained in an unsupervised manner. To enable the generalization ability of dynamics learning, we design a novel CNN-based relation mechanism that is class-specific (rather than object-specific) and exploits the locality principle. Empirical results show that OODP significantly outperforms previous methods in terms of generalization over novel environments with various object layouts. OODP is able to learn from very few environments and accurately predict dynamics in a large number of unseen environments. In addition, OODP learns semantically and visually interpretable dynamics models. |
Tasks | |
Published | 2018-05-25 |
URL | http://arxiv.org/abs/1806.07371v3 |
http://arxiv.org/pdf/1806.07371v3.pdf | |
PWC | https://paperswithcode.com/paper/object-oriented-dynamics-predictor |
Repo | https://github.com/mig-zh/OODP |
Framework | tf |
Deep-FSMN for Large Vocabulary Continuous Speech Recognition
Title | Deep-FSMN for Large Vocabulary Continuous Speech Recognition |
Authors | Shiliang Zhang, Ming Lei, Zhijie Yan, Lirong Dai |
Abstract | In this paper, we present an improved feedforward sequential memory network (FSMN) architecture, namely Deep-FSMN (DFSMN), by introducing skip connections between memory blocks in adjacent layers. These skip connections enable the information flow across different layers and thus alleviate the gradient vanishing problem when building very deep structures. As a result, DFSMN significantly benefits from these skip connections and the deep structure. We have compared the performance of DFSMN to BLSTM both with and without lower frame rate (LFR) on several large speech recognition tasks, including English and Mandarin. Experimental results show that DFSMN can consistently outperform BLSTM with dramatic gains, especially when trained with LFR using CD-Phone as modeling units. On the 2000-hour Fisher (FSH) task, the proposed DFSMN can achieve a word error rate of 9.4% by purely using the cross-entropy criterion and decoding with a 3-gram language model, which is a 1.5% absolute improvement compared to the BLSTM. On a 20000-hour Mandarin recognition task, the LFR-trained DFSMN can achieve more than 20% relative improvement compared to the LFR-trained BLSTM. Moreover, we can easily design the lookahead filter order of the memory blocks in DFSMN to control the latency for real-time applications. |
Tasks | Language Modelling, Large Vocabulary Continuous Speech Recognition, Speech Recognition |
Published | 2018-03-04 |
URL | http://arxiv.org/abs/1803.05030v1 |
http://arxiv.org/pdf/1803.05030v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-fsmn-for-large-vocabulary-continuous |
Repo | https://github.com/yangxueruivs/DFSMN |
Framework | tf |
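The abstract above describes FSMN memory blocks with skip connections between the memory blocks of adjacent layers. A toy PyTorch sketch of that structure follows, where a depthwise 1-D convolution plays the role of the learnable memory filter; the hidden size, filter orders, and nonlinearity are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DFSMNBlock(nn.Module):
    """One Deep-FSMN-style layer: a linear projection plus a learnable FIR "memory"
    over past/future frames, with a skip connection from the previous memory block."""
    def __init__(self, dim=256, lookback=10, lookahead=2):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # a depthwise 1-D convolution implements the per-dimension memory filter
        self.memory = nn.Conv1d(dim, dim, kernel_size=lookback + lookahead + 1,
                                groups=dim, bias=False)
        self.lookback, self.lookahead = lookback, lookahead

    def forward(self, h, prev_memory=None):
        # h: (batch, time, dim)
        p = self.proj(h)
        x = nn.functional.pad(p.transpose(1, 2), (self.lookback, self.lookahead))
        m = self.memory(x).transpose(1, 2) + p      # memory block output
        if prev_memory is not None:                 # skip connection between memory blocks
            m = m + prev_memory
        return torch.relu(m), m

blocks = nn.ModuleList([DFSMNBlock() for _ in range(6)])
h, mem = torch.randn(4, 100, 256), None
for block in blocks:
    h, mem = block(h, mem)
```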
Malthusian Reinforcement Learning
Title | Malthusian Reinforcement Learning |
Authors | Joel Z. Leibo, Julien Perolat, Edward Hughes, Steven Wheelwright, Adam H. Marblestone, Edgar Duéñez-Guzmán, Peter Sunehag, Iain Dunning, Thore Graepel |
Abstract | Here we explore a new algorithmic framework for multi-agent reinforcement learning, called Malthusian reinforcement learning, which extends self-play to include fitness-linked population size dynamics that drive ongoing innovation. In Malthusian RL, increases in a subpopulation’s average return drive subsequent increases in its size, just as Thomas Malthus argued in 1798 was the relationship between preindustrial income levels and population growth. Malthusian reinforcement learning harnesses the competitive pressures arising from growing and shrinking population size to drive agents to explore regions of state and policy spaces that they could not otherwise reach. Furthermore, in environments where there are potential gains from specialization and division of labor, we show that Malthusian reinforcement learning is better positioned to take advantage of such synergies than algorithms based on self-play. |
Tasks | Multi-agent Reinforcement Learning |
Published | 2018-12-17 |
URL | http://arxiv.org/abs/1812.07019v2 |
http://arxiv.org/pdf/1812.07019v2.pdf | |
PWC | https://paperswithcode.com/paper/malthusian-reinforcement-learning |
Repo | https://github.com/AbhijeetPendyala/Knowledge_base_ML |
Framework | tf |
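The key mechanism in the abstract is the fitness-linked population dynamic: a subpopulation's average return in one episode drives its size in the next. The sketch below shows one way such an update could look; the softmax-style mapping from returns to sizes and the fixed total agent pool are assumptions, not the paper's exact dynamic.

```python
import numpy as np

def malthusian_population_update(avg_returns, total_agents=50):
    """Redistribute a fixed pool of agent slots across subpopulations so that
    higher-return subpopulations grow and lower-return ones shrink."""
    fitness = np.exp(avg_returns - avg_returns.max())    # softmax-style fitness weights
    shares = fitness / fitness.sum()
    return np.maximum(1, np.round(shares * total_agents)).astype(int)

# toy loop: each episode's measured returns set the next episode's subpopulation sizes
sizes = np.array([10, 10, 10, 10, 10])
for episode in range(3):
    avg_returns = np.random.randn(5) + 0.01 * sizes       # stand-in for measured returns
    sizes = malthusian_population_update(avg_returns)
    print(episode, sizes)
```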
Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points
Title | Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points |
Authors | Fabien Baradel, Christian Wolf, Julien Mille, Graham W. Taylor |
Abstract | We propose a method for human activity recognition from RGB data that does not rely on any pose information during test time and does not explicitly calculate pose information internally. Instead, a visual attention module learns to predict glimpse sequences in each frame. These glimpses correspond to interest points in the scene that are relevant to the classified activities. No spatial coherence is forced on the glimpse locations, which gives the module liberty to explore different points at each frame and better optimize the process of scrutinizing visual information. Tracking and sequentially integrating this kind of unstructured data is a challenge, which we address by separating the set of glimpses from a set of recurrent tracking/recognition workers. These workers receive glimpses, jointly performing subsequent motion tracking and activity prediction. The glimpses are soft-assigned to the workers, optimizing coherence of the assignments in space, time and feature space using an external memory module. No hard decisions are taken, i.e. each glimpse point is assigned to all existing workers, albeit with different importance. Our methods outperform state-of-the-art methods on the largest human activity recognition dataset available to date, the NTU RGB+D dataset, and on a smaller human action recognition dataset, the Northwestern-UCLA Multiview Action 3D dataset. Our code is publicly available at https://github.com/fabienbaradel/glimpse_clouds. |
Tasks | Action Recognition In Videos, Activity Prediction, Activity Recognition, Human Activity Recognition, Skeleton Based Action Recognition, Temporal Action Localization |
Published | 2018-02-22 |
URL | http://arxiv.org/abs/1802.07898v4 |
http://arxiv.org/pdf/1802.07898v4.pdf | |
PWC | https://paperswithcode.com/paper/glimpse-clouds-human-activity-recognition |
Repo | https://github.com/fabienbaradel/glimpse_clouds |
Framework | pytorch |
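The soft assignment described in the abstract, where every glimpse contributes to every worker with a different importance, can be illustrated with a plain attention-style weighting. The sketch below omits the paper's external memory module and its spatio-temporal coherence terms; the feature dimensions and the dot-product similarity are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_assign(glimpse_feats, worker_states, temperature=1.0):
    """Soft-assign every glimpse to every worker: similarity between glimpse features
    and worker states becomes a distribution over workers (no hard decisions)."""
    sim = glimpse_feats @ worker_states.t() / temperature    # (num_glimpses, num_workers)
    weights = F.softmax(sim, dim=1)
    # each worker receives a weighted sum of all glimpses
    worker_inputs = weights.t() @ glimpse_feats              # (num_workers, feat_dim)
    return worker_inputs, weights

glimpses = torch.randn(6, 256)       # glimpse features for one frame
workers = torch.randn(3, 256)        # recurrent worker hidden states
inputs, attn = soft_assign(glimpses, workers)
```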
Deep reinforcement learning for time series: playing idealized trading games
Title | Deep reinforcement learning for time series: playing idealized trading games |
Authors | Xiang Gao |
Abstract | Deep Q-learning is investigated as an end-to-end solution to estimate the optimal strategies for acting on time series input. Experiments are conducted on two idealized trading games. 1) Univariate: the only input is a wave-like price time series, and 2) Bivariate: the input includes a random stepwise price time series and a noisy signal time series, which is positively correlated with future price changes. The Univariate game tests whether the agent can capture the underlying dynamics, and the Bivariate game tests whether the agent can utilize the hidden relation among the inputs. Stacked Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM) units, Convolutional Neural Network (CNN), and multi-layer perceptron (MLP) are used to model Q values. For both games, all agents successfully find a profitable strategy. The GRU-based agents show best overall performance in the Univariate game, while the MLP-based agents outperform others in the Bivariate game. |
Tasks | Q-Learning, Time Series |
Published | 2018-03-11 |
URL | http://arxiv.org/abs/1803.03916v1 |
http://arxiv.org/pdf/1803.03916v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-reinforcement-learning-for-time-series |
Repo | https://github.com/golsun/deep-RL-time-series |
Framework | none |
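The abstract compares several function approximators for the Q-values; a minimal GRU-based Q-network over a window of time-series observations might look like the sketch below. The window length, feature count (price plus signal), and three-action set are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUQNetwork(nn.Module):
    """Q-network over a window of time-series observations: a stacked GRU encodes
    the window and a linear head outputs one Q-value per action."""
    def __init__(self, num_features=2, hidden=64, num_actions=3):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)   # e.g. hold / long / short

    def forward(self, window):                        # window: (batch, time, features)
        out, _ = self.gru(window)
        return self.head(out[:, -1])                  # Q-values from the last time step

q_net = GRUQNetwork()
q_values = q_net(torch.randn(8, 40, 2))               # batch of price + signal windows
action = q_values.argmax(dim=1)
```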
DSR: Direct Self-rectification for Uncalibrated Dual-lens Cameras
Title | DSR: Direct Self-rectification for Uncalibrated Dual-lens Cameras |
Authors | Ruichao Xiao, Wenxiu Sun, Jiahao Pang, Qiong Yan, Jimmy Ren |
Abstract | With the developments of dual-lens camera modules, depth information representing the third dimension of the captured scenes becomes available for smartphones. It is estimated by stereo matching algorithms, taking as input the two views captured by dual-lens cameras at slightly different viewpoints. Depth-of-field rendering (also referred to as synthetic defocus or bokeh) is one of the trending depth-based applications. However, to achieve fast depth estimation on smartphones, the stereo pairs need to be rectified in the first place. In this paper, we propose a cost-effective solution to perform stereo rectification for dual-lens cameras called direct self-rectification, short for DSR. It removes the need of individual offline calibration for every pair of dual-lens cameras. In addition, the proposed solution is robust to the slight movements, e.g., due to collisions, of the dual-lens cameras after fabrication. Different from existing self-rectification approaches, our approach computes the homography in a novel way with zero geometric distortions introduced to the master image. It is achieved by directly minimizing the vertical displacements of corresponding points between the original master image and the transformed slave image. Our method is evaluated on both realistic and synthetic stereo image pairs, and produces superior results compared to the calibrated rectification or other self-rectification approaches. |
Tasks | Calibration, Stereo Matching, Stereo Matching Hand |
Published | 2018-09-26 |
URL | http://arxiv.org/abs/1809.09763v1 |
http://arxiv.org/pdf/1809.09763v1.pdf | |
PWC | https://paperswithcode.com/paper/dsr-direct-self-rectification-for |
Repo | https://github.com/garroud/self-rectification |
Framework | none |
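The core idea in the abstract is to compute a transform for the slave image that directly minimizes the vertical displacement of corresponding points while leaving the master image untouched. The sketch below fits only the y-row of a homography by linear least squares, a simplified stand-in for the paper's constrained estimation; `match_keypoints` is a hypothetical helper.

```python
import numpy as np
import cv2

def self_rectify(master_pts, slave_pts):
    """Estimate a transform for the slave image that minimizes the vertical displacement
    between corresponding points; only the y-row of the homography is fitted here."""
    xs, ys = slave_pts[:, 0], slave_pts[:, 1]
    ym = master_pts[:, 1]
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1)
    (a, b, c), *_ = np.linalg.lstsq(A, ym, rcond=None)   # least-squares vertical alignment
    H = np.array([[1.0, 0.0, 0.0],
                  [a,   b,   c  ],
                  [0.0, 0.0, 1.0]])
    return H

# usage: match keypoints between the two views, fit H, warp the slave image
# master_pts, slave_pts = match_keypoints(master_img, slave_img)   # e.g. ORB + ratio test
# H = self_rectify(master_pts, slave_pts)
# rectified_slave = cv2.warpPerspective(slave_img, H, slave_img.shape[1::-1])
```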
End-to-End Learning of Communications Systems Without a Channel Model
Title | End-to-End Learning of Communications Systems Without a Channel Model |
Authors | Fayçal Ait Aoudia, Jakob Hoydis |
Abstract | The idea of end-to-end learning of communications systems through neural network-based autoencoders has the shortcoming that it requires a differentiable channel model. We present in this paper a novel learning algorithm which alleviates this problem. The algorithm iterates between supervised training of the receiver and reinforcement learning-based training of the transmitter. We demonstrate that this approach works as well as fully supervised methods on additive white Gaussian noise (AWGN) and Rayleigh block-fading (RBF) channels. Surprisingly, while our method converges slower on AWGN channels than supervised training, it converges faster on RBF channels. Our results are a first step towards learning of communications systems over any type of channel without prior assumptions. |
Tasks | |
Published | 2018-04-06 |
URL | http://arxiv.org/abs/1804.02276v3 |
http://arxiv.org/pdf/1804.02276v3.pdf | |
PWC | https://paperswithcode.com/paper/end-to-end-learning-of-communications-systems |
Repo | https://github.com/Aithu-Snehith/End-to-End-Learning-of-Communications-Systems-Without-a-Channel-Model |
Framework | tf |
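The alternating scheme in the abstract, supervised training of the receiver and reinforcement learning-based training of the transmitter without channel gradients, can be sketched as below. The message count, network sizes, SNR, and the Gaussian perturbation policy are assumptions; the paper's scheme differs in detail (e.g., loss normalization and scheduling).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, n = 16, 8                                    # messages and real-valued channel uses
tx = nn.Sequential(nn.Linear(M, 32), nn.ReLU(), nn.Linear(32, 2 * n))
rx = nn.Sequential(nn.Linear(2 * n, 32), nn.ReLU(), nn.Linear(32, M))
tx_opt = torch.optim.Adam(tx.parameters(), lr=1e-3)
rx_opt = torch.optim.Adam(rx.parameters(), lr=1e-3)

def channel(x, snr_db=10.0):
    """AWGN channel treated as a black box: no gradients are taken through it."""
    noise_std = 10 ** (-snr_db / 20)
    return x + noise_std * torch.randn_like(x)

sigma = 0.05                                    # std of the transmitter's exploration policy
for step in range(1000):
    msgs = torch.randint(0, M, (256,))
    onehot = F.one_hot(msgs, M).float()

    # Receiver phase: plain supervised training on whatever arrives over the channel.
    with torch.no_grad():
        received = channel(tx(onehot))
    rx_loss = F.cross_entropy(rx(received), msgs)
    rx_opt.zero_grad(); rx_loss.backward(); rx_opt.step()

    # Transmitter phase: sample perturbed symbols, use the receiver's per-example loss
    # as a reward signal, and apply a REINFORCE-style update (no channel gradient needed).
    x = tx(onehot)
    action = (x + sigma * torch.randn_like(x)).detach()
    with torch.no_grad():
        per_example_loss = F.cross_entropy(rx(channel(action)), msgs, reduction="none")
    log_prob = -((action - x) ** 2).sum(dim=1) / (2 * sigma ** 2)
    tx_loss = (per_example_loss * log_prob).mean()
    tx_opt.zero_grad(); tx_loss.backward(); tx_opt.step()
```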