January 25, 2020


Paper Group ANR 1778


End-to-End Learning of Multi-scale Convolutional Neural Network for Stereo Matching


Title	End-to-End Learning of Multi-scale Convolutional Neural Network for Stereo Matching
Authors	Li Zhang, Quanhong Wang, Haihua Lu, Yong Zhao
Abstract	Deep neural networks have shown excellent performance in the stereo matching task. Recently, CNN-based methods have shown that stereo matching can be formulated as a supervised learning task. However, less attention has been paid to the fusion of contextual semantic information and details. To tackle this problem, we propose a network for disparity estimation based on abundant contextual details and semantic information, called Multi-scale Features Network (MSFNet). First, we design a new structure to encode rich semantic information and fine-grained details by fusing multi-scale features, and we combine the advantages of element-wise addition and concatenation, which is conducive to merging semantic information with details. Second, a guidance mechanism is introduced to guide the network to automatically focus more on the unreliable regions. Third, we formulate the consistency check as an error map, obtained from the low-stage features with fine-grained details. Finally, we adopt the consistency checking between the left feature and the synthetic left feature to refine the initial disparity. Experiments on the Scene Flow and KITTI 2015 benchmarks demonstrate that the proposed method can achieve state-of-the-art performance.
Tasks	Disparity Estimation, Stereo Matching, Stereo Matching Hand
Published	2019-06-25
URL	https://arxiv.org/abs/1906.10399v1
PDF	https://arxiv.org/pdf/1906.10399v1.pdf
PWC	https://paperswithcode.com/paper/end-to-end-learning-of-multi-scale
Repo
Framework

QuesNet: A Unified Representation for Heterogeneous Test Questions


Title	QuesNet: A Unified Representation for Heterogeneous Test Questions
Authors	Yu Yin, Qi Liu, Zhenya Huang, Enhong Chen, Wei Tong, Shijin Wang, Yu Su
Abstract	Understanding learning materials (e.g. test questions) is a crucial issue in online learning systems, which can promote many applications in the education domain. Unfortunately, many supervised approaches suffer from the problem of scarce human-labeled data, whereas abundant unlabeled resources are highly underutilized. To alleviate this problem, an effective solution is to use pre-trained representations for question understanding. However, existing pre-training methods in the NLP area are ill-suited to learning test question representations due to several domain-specific characteristics in education. First, questions usually comprise heterogeneous data including content text, images and side information. Second, there exist both basic linguistic information and domain logic and knowledge. To this end, in this paper, we propose a novel pre-training method, namely QuesNet, for comprehensively learning question representations. Specifically, we first design a unified framework to aggregate question information with its heterogeneous inputs into a comprehensive vector. Then we propose a two-level hierarchical pre-training algorithm to learn a better understanding of test questions in an unsupervised way. Here, a novel holed language model objective is developed to extract low-level linguistic features, and a domain-oriented objective is proposed to learn high-level logic and knowledge. Moreover, we show that QuesNet has a good capability of being fine-tuned in many question-based tasks. We conduct extensive experiments on large-scale real-world question data, where the experimental results clearly demonstrate the effectiveness of QuesNet for question understanding as well as its superior applicability.
Tasks	Language Modelling
Published	2019-05-27
URL	https://arxiv.org/abs/1905.10949v1
PDF	https://arxiv.org/pdf/1905.10949v1.pdf
PWC	https://paperswithcode.com/paper/quesnet-a-unified-representation-for
Repo
Framework

TW-SMNet: Deep Multitask Learning of Tele-Wide Stereo Matching


Title	TW-SMNet: Deep Multitask Learning of Tele-Wide Stereo Matching
Authors	Mostafa El-Khamy, Haoyu Ren, Xianzhi Du, Jungwon Lee
Abstract	In this paper, we introduce the problem of estimating the real-world depth of elements in a scene captured by two cameras with different fields of view, where the first field of view (FOV) is a Wide FOV (WFOV) captured by a wide-angle lens, and the second FOV is contained in the first FOV and is captured by a tele zoom lens. We refer to the problem of estimating the inverse depth for the union of FOVs, while leveraging the stereo information in the overlapping FOV, as Tele-Wide Stereo Matching (TW-SM). We propose different deep learning solutions to the TW-SM problem. Since the disparity is proportional to the inverse depth, we train stereo matching disparity estimation (SMDE) networks to estimate the disparity for the union WFOV. We further propose an end-to-end deep multitask tele-wide stereo matching neural network (MT-TW-SMNet), which simultaneously learns the SMDE task for the overlapped Tele FOV and the single image inverse depth estimation (SIDE) task for the WFOV. Moreover, we design multiple methods for the fusion of the SMDE and SIDE networks. We evaluate the performance of TW-SM on the popular KITTI and SceneFlow stereo datasets, and demonstrate its practicality by synthesizing the Bokeh effect on the WFOV from a tele-wide stereo image pair.
Tasks	Depth Estimation, Disparity Estimation, Stereo Matching, Stereo Matching Hand
Published	2019-06-11
URL	https://arxiv.org/abs/1906.04463v1
PDF	https://arxiv.org/pdf/1906.04463v1.pdf
PWC	https://paperswithcode.com/paper/tw-smnet-deep-multitask-learning-of-tele-wide
Repo
Framework
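
The abstract above relies on disparity being proportional to inverse depth. A minimal sketch of that relation for a rectified stereo pair follows, using the standard pinhole formula depth = focal length × baseline / disparity. The function and parameter names are illustrative assumptions and do not come from the paper.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity in pixels to a depth in metres.

    Assumes a rectified stereo pair with a shared focal length (in pixels)
    and a known baseline (in metres). All names are hypothetical.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


def depth_to_disparity(depth_m, focal_length_px, baseline_m):
    """Inverse mapping: disparity is proportional to 1 / depth."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * baseline_m / depth_m


if __name__ == "__main__":
    # Doubling the disparity halves the estimated depth.
    print(disparity_to_depth(50.0, 1000.0, 0.5))
    print(disparity_to_depth(100.0, 1000.0, 0.5))
```

Because the mapping is a simple reciprocal, a network that predicts disparity can be read as predicting inverse depth up to the constant focal length × baseline.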

On Transfer Learning For Chatter Detection in Turning Using Wavelet Packet Transform and Empirical Mode Decomposition


Title	On Transfer Learning For Chatter Detection in Turning Using Wavelet Packet Transform and Empirical Mode Decomposition
Authors	Melih C. Yesilli, Firas A. Khasawneh, Andreas Otto
Abstract	The increasing availability of sensor data at machine tools makes automatic chatter detection algorithms a trending topic in metal cutting. Two prominent and advanced methods for feature extraction via signal decomposition are Wavelet Packet Transform (WPT) and Ensemble Empirical Mode Decomposition (EEMD). We apply these two methods to time series acquired from an acceleration sensor at the tool holder of a lathe. Different turning experiments with varying dynamic behavior of the machine tool structure were performed. We compare the performance of these two methods with Support Vector Machine (SVM), Logistic Regression, Random Forest Classification and Gradient Boosting combined with Recursive Feature Elimination (RFE). We also show that the common WPT-based approach of choosing wavelet packets with the highest energy ratios as representative features for chatter does not always result in packets that enclose the chatter frequency, thus reducing the classification accuracy. Further, we test the transfer learning capability of each of these methods by training the classifier on one of the cutting configurations and then testing it on the other cases. It is found that when training and testing on data from the same cutting configuration both methods yield high accuracies reaching in one of the cases as high as 94% and 95%, respectively, for WPT and EEMD. However, our experimental results show that EEMD can outperform WPT in transfer learning applications with accuracy of up to 95%.
Tasks	Time Series, Transfer Learning
Published	2019-05-03
URL	https://arxiv.org/abs/1905.01982v2
PDF	https://arxiv.org/pdf/1905.01982v2.pdf
PWC	https://paperswithcode.com/paper/on-transfer-learning-for-chatter-detection-in
Repo
Framework
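
The WPT-based feature selection mentioned in the abstract above ranks wavelet packets by their energy ratio. A minimal sketch of that ranking step follows, assuming the packet coefficients have already been computed by some wavelet library. The input layout (one list of coefficients per packet) and the function names are assumptions made for illustration, not the authors' code.

```python
def energy_ratios(packets):
    """Return each packet's share of the total signal energy.

    `packets` is a list of coefficient lists, one per wavelet packet
    (hypothetical layout). Energy is the sum of squared coefficients.
    """
    energies = [sum(c * c for c in coeffs) for coeffs in packets]
    total = sum(energies)
    if total == 0:
        raise ValueError("all packets have zero energy")
    return [e / total for e in energies]


def top_packets(packets, k):
    """Indices of the k packets with the highest energy ratio."""
    ratios = energy_ratios(packets)
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    demo = [[1.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
    print(energy_ratios(demo))
    print(top_packets(demo, 1))
```

As the abstract notes, the highest-energy packets do not always contain the chatter frequency, so a ranking like this is only a starting point for feature selection.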

Learning from Trajectories via Subgoal Discovery


Title	Learning from Trajectories via Subgoal Discovery
Authors	Sujoy Paul, Jeroen van Baar, Amit K. Roy-Chowdhury
Abstract	Learning to solve complex goal-oriented tasks with sparse terminal-only rewards often requires an enormous number of samples. In such cases, using a set of expert trajectories could help to learn faster. However, Imitation Learning (IL) via supervised pre-training with these trajectories may not perform as well and generally requires additional finetuning with expert-in-the-loop. In this paper, we propose an approach which uses the expert trajectories and learns to decompose the complex main task into smaller sub-goals. We learn a function which partitions the state-space into sub-goals, which can then be used to design an extrinsic reward function. We follow a strategy where the agent first learns from the trajectories using IL and then switches to Reinforcement Learning (RL) using the identified sub-goals, to alleviate the errors in the IL step. To deal with states which are under-represented by the trajectory set, we also learn a function to modulate the sub-goal predictions. We show that our method is able to solve complex goal-oriented tasks, which other RL, IL or their combinations in literature are not able to solve.
Tasks	Imitation Learning
Published	2019-11-03
URL	https://arxiv.org/abs/1911.07224v1
PDF	https://arxiv.org/pdf/1911.07224v1.pdf
PWC	https://paperswithcode.com/paper/learning-from-trajectories-via-subgoal-1
Repo
Framework

Towards Unified INT8 Training for Convolutional Neural Network


Title	Towards Unified INT8 Training for Convolutional Neural Network
Authors	Feng Zhu, Ruihao Gong, Fengwei Yu, Xianglong Liu, Yanfei Wang, Zhelong Li, Xiuqi Yang, Junjie Yan
Abstract	Recently, low-bit (e.g., 8-bit) network quantization has been extensively studied to accelerate inference. Besides inference, low-bit training with quantized gradients can bring further considerable acceleration, since the backward process is often computation-intensive. Unfortunately, inappropriate quantization of backward propagation usually makes training unstable and can even cause it to crash. There is still no successful unified low-bit training framework that can support diverse networks on various tasks. In this paper, we attempt to build a unified 8-bit (INT8) training framework for common convolutional neural networks from the aspects of both accuracy and speed. First, we empirically find four distinctive characteristics of gradients, which provide us with insightful clues for gradient quantization. Then, we theoretically give an in-depth analysis of the convergence bound and derive two principles for stable INT8 training. Finally, we propose two universal techniques: Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients, and Deviation Counteractive Learning Rate Scaling, which avoids illegal gradient updates along the wrong direction. The experiments show that our unified solution promises accurate and efficient INT8 training for a variety of networks and tasks, including MobileNetV2, InceptionV3 and object detection, on which prior studies have never succeeded. Moreover, it enjoys strong flexibility to run on off-the-shelf hardware, and reduces the training time by 22% on a Pascal GPU without too much optimization effort. We believe that this pioneering study will help lead the community towards fully unified INT8 training for convolutional neural networks.
Tasks	Object Detection, Quantization
Published	2019-12-29
URL	https://arxiv.org/abs/1912.12607v1
PDF	https://arxiv.org/pdf/1912.12607v1.pdf
PWC	https://paperswithcode.com/paper/towards-unified-int8-training-for
Repo
Framework
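
To make the idea of quantizing gradients to 8 bits concrete, here is a minimal sketch of generic symmetric INT8 quantization with a fixed clipping threshold. It is not the paper's Direction Sensitive Gradient Clipping or its learning rate scaling. The clipping value, the function names and the choice of the range [-127, 127] are assumptions made for illustration.

```python
def quantize_int8(values, clip):
    """Clip values to [-clip, clip] and map them to integers in [-127, 127].

    Returns the integer codes and the scale needed to dequantize them.
    Generic symmetric quantization, not the method from the paper.
    """
    if clip <= 0:
        raise ValueError("clip must be positive")
    scale = clip / 127.0
    codes = []
    for v in values:
        clipped = max(-clip, min(clip, v))
        codes.append(int(round(clipped / scale)))
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    grads = [0.0, 1.0, -2.0, 0.25]
    codes, scale = quantize_int8(grads, clip=1.0)
    print(codes)
    print(dequantize(codes, scale))
```

Values beyond the clipping threshold saturate, which changes the direction of the gradient vector. The abstract names this direction deviation as the effect its clipping technique is designed to reduce.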

Short-term Load Forecasting with Dense Average Network


Title	Short-term Load Forecasting with Dense Average Network
Authors	Zhifang Liao, Haihui Pan
Abstract	Short-term load forecasting is of great significance to power systems. In this paper, we propose a new connection, the Dense Average connection, in which the outputs of all previous layers are averaged as the input of the next layer in a feedforward manner. Compared with a fully connected layer, the Dense Average connection does not introduce new training parameters. Based on the Dense Average connection, we build the Dense Average Network for load forecasting. On two public datasets and one real dataset, we verify the validity of the model. Compared with ANN, our proposed model has better convergence and prediction performance. Meanwhile, we use the ensemble method to further improve the prediction performance. In order to verify the reliability of the model, we also disturb the input of the model to different degrees. Experimental results show that the proposed model is very robust.
Tasks	Load Forecasting
Published	2019-12-08
URL	https://arxiv.org/abs/1912.03668v2
PDF	https://arxiv.org/pdf/1912.03668v2.pdf
PWC	https://paperswithcode.com/paper/short-term-load-forecasting-with-dense
Repo
Framework
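
The Dense Average connection described in the abstract above can be sketched as a forward pass in which each layer receives the element-wise mean of all earlier outputs. This is a minimal illustration only. It assumes that the network input counts as the first "previous output" and that every layer keeps the same width, and the layer functions are hypothetical stand-ins for trained layers.

```python
def elementwise_mean(vectors):
    """Average a list of equal-length vectors position by position."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]


def dense_average_forward(x, layers):
    """Forward pass where each layer's input is the mean of all prior outputs.

    `layers` is a list of callables mapping a vector to a vector of the
    same length. The averaging step itself has no trainable parameters.
    """
    outputs = [x]  # assumption: the raw input is treated as output 0
    for layer in layers:
        layer_input = elementwise_mean(outputs)
        outputs.append(layer(layer_input))
    return outputs[-1]


if __name__ == "__main__":
    double = lambda v: [2.0 * a for a in v]
    add_one = lambda v: [a + 1.0 for a in v]
    print(dense_average_forward([1.0, 3.0], [double, add_one]))
```

Since the averaging is a fixed operation, the only learnable weights are those inside the layers themselves, which matches the abstract's claim that the connection adds no new training parameters.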

PEPSI++: Fast and Lightweight Network for Image Inpainting


Title	PEPSI++: Fast and Lightweight Network for Image Inpainting
Authors	Yong-Goo Shin, Min-Cheol Sagong, Yoon-Jae Yeo, Seung-Wook Kim, Sung-Jea Ko
Abstract	Among the various generative adversarial network (GAN)-based image inpainting methods, a coarse-to-fine network with a contextual attention module (CAM) has shown remarkable performance. However, owing to two stacked generative networks, the coarse-to-fine network needs numerous computational resources such as convolution operations and network parameters, which result in low speed. To address this problem, we propose a novel network architecture called PEPSI: parallel extended-decoder path for semantic inpainting network, which aims at reducing the hardware costs and improving the inpainting performance. PEPSI consists of a single shared encoding network and parallel decoding networks called coarse and inpainting paths. The coarse path produces a preliminary inpainting result to train the encoding network for the prediction of features for the CAM. Simultaneously, the inpainting path generates higher inpainting quality using the refined features reconstructed via the CAM. In addition, we propose Diet-PEPSI that significantly reduces the network parameters while maintaining the performance. In Diet-PEPSI, to capture the global contextual information with low hardware costs, we propose novel rate-adaptive dilated convolutional layers, which employ the common weights but produce dynamic features depending on the given dilation rates. Extensive experiments comparing the performance with state-of-the-art image inpainting methods demonstrate that both PEPSI and Diet-PEPSI improve the qualitative scores, i.e. the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), as well as significantly reduce hardware costs such as computational time and the number of network parameters.
Tasks	Image Inpainting
Published	2019-05-22
URL	https://arxiv.org/abs/1905.09010v5
PDF	https://arxiv.org/pdf/1905.09010v5.pdf
PWC	https://paperswithcode.com/paper/pepsi-fast-and-lightweight-network-for-image
Repo
Framework

CraftAssist Instruction Parsing: Semantic Parsing for a Minecraft Assistant


Title	CraftAssist Instruction Parsing: Semantic Parsing for a Minecraft Assistant
Authors	Yacine Jernite, Kavya Srinet, Jonathan Gray, Arthur Szlam
Abstract	We propose a large-scale semantic parsing dataset focused on instruction-driven communication with an agent in Minecraft. We describe the data collection process, which yields an additional 35K human-generated instructions with their semantic annotations. We report the performance of three baseline models and find that while a dataset of this size helps us train a usable instruction parser, it still poses interesting generalization challenges, which we hope will help develop better and more robust models.
Tasks	Semantic Parsing
Published	2019-04-17
URL	http://arxiv.org/abs/1905.01978v1
PDF	http://arxiv.org/pdf/1905.01978v1.pdf
PWC	https://paperswithcode.com/paper/190501978
Repo
Framework

Towards Good Practices for Multi-Person Pose Estimation


Title	Towards Good Practices for Multi-Person Pose Estimation
Authors	Dongdong Yu, Kai Su, Changhu Wang
Abstract	Multi-Person Pose Estimation is an interesting yet challenging task in computer vision. In this paper, we conduct a series of refinements with the MSPN and PoseFix Networks, and empirically evaluate their impact on the final model performance through ablation studies. By taking all the refinements, we achieve 78.7 on the COCO test-dev dataset and 76.3 on the COCO test-challenge dataset.
Tasks	Multi-Person Pose Estimation, Pose Estimation
Published	2019-10-28
URL	https://arxiv.org/abs/1911.07938v1
PDF	https://arxiv.org/pdf/1911.07938v1.pdf
PWC	https://paperswithcode.com/paper/towards-good-practices-for-multi-person-pose
Repo
Framework

Towards Good Practices for Video Object Segmentation


Title	Towards Good Practices for Video Object Segmentation
Authors	Dongdong Yu, Kai Su, Hengkai Guo, Jian Wang, Kaihui Zhou, Yuanyuan Huang, Minghui Dong, Jie Shao, Changhu Wang
Abstract	Semi-supervised video object segmentation is an interesting yet challenging task in machine learning. In this work, we conduct a series of refinements with the propagation-based video object segmentation method and empirically evaluate their impact on the final model performance through an ablation study. By taking all the refinements, we improve the space-time memory networks to achieve an Overall score of 79.1 on the Youtube-VOS Challenge 2019.
Tasks	Semantic Segmentation, Semi-supervised Video Object Segmentation, Video Object Segmentation, Video Semantic Segmentation
Published	2019-09-30
URL	https://arxiv.org/abs/1909.13583v1
PDF	https://arxiv.org/pdf/1909.13583v1.pdf
PWC	https://paperswithcode.com/paper/towards-good-practices-for-video-object
Repo
Framework

RPM-Net: Robust Pixel-Level Matching Networks for Self-Supervised Video Object Segmentation


Title	RPM-Net: Robust Pixel-Level Matching Networks for Self-Supervised Video Object Segmentation
Authors	Youngeun Kim, Seokeon Choi, Hankyeol Lee, Taekyung Kim, Changick Kim
Abstract	In this paper, we introduce a self-supervised approach for video object segmentation without human-labeled data. Specifically, we present Robust Pixel-level Matching Networks (RPM-Net), a novel deep architecture that matches pixels between adjacent frames, using only color information from unlabeled videos for training. Technically, RPM-Net can be separated into two main modules. The embedding module first projects input images into a high-dimensional embedding space. Then the matching module with deformable convolution layers matches pixels between reference and target frames based on the embedding features. Unlike previous methods using deformable convolution, our matching module adopts deformable convolution to focus on similar features in spatio-temporally neighboring pixels. Our experiments show that the selective feature sampling improves the robustness to challenging problems in video object segmentation such as camera shake, fast motion, deformation, and occlusion. Also, we carry out comprehensive experiments on three public datasets (i.e., DAVIS-2017, SegTrack-v2, and Youtube-Objects) and achieve state-of-the-art performance on self-supervised video object segmentation. Moreover, we significantly reduce the performance gap between self-supervised and fully-supervised video object segmentation (41.0% vs. 52.5% on the DAVIS-2017 validation set).
Tasks	Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Published	2019-09-29
URL	https://arxiv.org/abs/1909.13247v2
PDF	https://arxiv.org/pdf/1909.13247v2.pdf
PWC	https://paperswithcode.com/paper/rpm-net-robust-pixel-level-matching-networks
Repo
Framework

Do Neural Language Representations Learn Physical Commonsense?


Title	Do Neural Language Representations Learn Physical Commonsense?
Authors	Maxwell Forbes, Ari Holtzman, Yejin Choi
Abstract	Humans understand language based on the rich background knowledge about how the physical world works, which in turn allows us to reason about the physical world through language. In addition to the properties of objects (e.g., boats require fuel) and their affordances, i.e., the actions that are applicable to them (e.g., boats can be driven), we can also reason about if-then inferences between what properties of objects imply the kind of actions that are applicable to them (e.g., that if we can drive something then it likely requires fuel). In this paper, we investigate the extent to which state-of-the-art neural language representations, trained on a vast amount of natural language text, demonstrate physical commonsense reasoning. While recent advancements of neural language models have demonstrated strong performance on various types of natural language inference tasks, our study based on a dataset of over 200k newly collected annotations suggests that neural language representations still only learn associations that are explicitly written down.
Tasks	Natural Language Inference
Published	2019-08-08
URL	https://arxiv.org/abs/1908.02899v1
PDF	https://arxiv.org/pdf/1908.02899v1.pdf
PWC	https://paperswithcode.com/paper/do-neural-language-representations-learn
Repo
Framework

Fast Video Object Segmentation via Mask Transfer Network


Title	Fast Video Object Segmentation via Mask Transfer Network
Authors	Tao Zhuo, Zhiyong Cheng, Mohan Kankanhalli
Abstract	Accuracy and processing speed are two important factors that affect the use of video object segmentation (VOS) in real applications. With the advanced techniques of deep neural networks, the accuracy has been significantly improved; however, the speed is still far below real-time needs because of complicated network designs, such as the requirement of a first-frame fine-tuning step. To overcome this limitation, we propose a novel mask transfer network (MTN), which can greatly boost the processing speed of VOS and also achieve a reasonable accuracy. The basic idea of MTN is to transfer the reference mask to the target frame via an efficient global pixel matching strategy. The global pixel matching between the reference frame and the target frame is to ensure good matching results. To enhance the matching speed, we perform the matching on a downsampled feature map with 1/32 of the original frame size. At the same time, to preserve the detailed mask information in such a small feature map, a mask network is designed to encode the annotated mask information with 512 channels. Finally, an efficient feature warping method is used to transfer the encoded reference mask to the target frame. Based on this design, our method avoids the fine-tuning step on the first frame and does not rely on temporal cues or particular object categories. Therefore, it runs very fast and can be conveniently trained only with images, as well as being robust to unseen objects. Experiments on the DAVIS datasets demonstrate that MTN can achieve a speed of 37 fps, and also achieves competitive accuracy in comparison to the state-of-the-art methods.
Tasks	Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Published	2019-08-28
URL	https://arxiv.org/abs/1908.10717v1
PDF	https://arxiv.org/pdf/1908.10717v1.pdf
PWC	https://paperswithcode.com/paper/fast-video-object-segmentation-via-mask
Repo
Framework
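
The mask-transfer idea in the abstract above (match every target pixel against all reference pixels, then carry the reference mask over) can be illustrated with a deliberately simplified sketch. It uses a hard nearest-neighbour match on dot-product similarity over flat lists of feature vectors. The paper's encoded 512-channel mask features and its feature warping step are not reproduced here, and all names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def transfer_mask(ref_features, ref_mask, tgt_features):
    """Copy each target pixel's label from its best-matching reference pixel.

    ref_features: one feature vector per reference pixel
    ref_mask:     one label per reference pixel (e.g. 1 = object, 0 = background)
    tgt_features: one feature vector per target pixel
    Matching is global: every target pixel is compared with every reference pixel.
    """
    transferred = []
    for t in tgt_features:
        scores = [dot(t, r) for r in ref_features]
        best = max(range(len(scores)), key=scores.__getitem__)
        transferred.append(ref_mask[best])
    return transferred


if __name__ == "__main__":
    ref_features = [[1.0, 0.0], [0.0, 1.0]]
    ref_mask = [1, 0]
    tgt_features = [[0.1, 0.9], [0.8, 0.2], [0.0, 1.0]]
    print(transfer_mask(ref_features, ref_mask, tgt_features))
```

The cost of this global comparison grows with the product of the two pixel counts, which is consistent with the abstract's choice to match on a feature map at 1/32 of the original frame size.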

Meta Learning with Differentiable Closed-form Solver for Fast Video Object Segmentation


Title	Meta Learning with Differentiable Closed-form Solver for Fast Video Object Segmentation
Authors	Yu Liu, Lingqiao Liu, Haokui Zhang, Hamid Rezatofighi, Ian Reid
Abstract	This paper tackles the problem of video object segmentation. We are specifically concerned with the task of segmenting all pixels of a target object in all frames, given the annotation mask in the first frame. Even when such annotation is available, this remains a challenging problem because of the changing appearance and shape of the object over time. In this paper, we tackle this task by formulating it as a meta-learning problem, where the base learner grasps the semantic scene understanding for a general type of objects, and the meta learner quickly adapts to the appearance of the target object with a few examples. Our proposed meta-learning method uses a closed-form optimizer, the so-called “ridge regression”, which has been shown to be conducive to fast and better training convergence. Moreover, we propose a mechanism, named “block splitting”, to further speed up the training process as well as to reduce the number of learning parameters. In comparison with the state-of-the-art methods, our proposed framework achieves a significant boost in processing speed, while having very competitive performance compared to the best performing methods on the widely used datasets.
Tasks	Meta-Learning, Scene Understanding, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Published	2019-09-28
URL	https://arxiv.org/abs/1909.13046v1
PDF	https://arxiv.org/pdf/1909.13046v1.pdf
PWC	https://paperswithcode.com/paper/meta-learning-with-differentiable-closed-form-2
Repo
Framework
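
The "ridge regression" optimizer mentioned in the abstract above has a well-known closed-form solution, written out below for reference. The symbols are assumptions chosen for illustration: $X$ is a matrix of features, $Y$ the corresponding targets, $W$ the weights being solved for, and $\lambda > 0$ the regularization strength. The paper's own notation may differ.

```latex
W^{*} = \arg\min_{W} \; \lVert XW - Y \rVert_F^{2} + \lambda \lVert W \rVert_F^{2}
      = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} Y
```

Because the solution is a single differentiable matrix expression, gradients can pass through it, which is what makes such a solver usable inside a meta-learning loop.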