Paper Group ANR 495
Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost
Title | Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost |
Authors | Henry Zhu, Abhishek Gupta, Aravind Rajeswaran, Sergey Levine, Vikash Kumar |
Abstract | Dexterous multi-fingered robotic hands can perform a wide range of manipulation skills, making them an appealing component for general-purpose robotic manipulators. However, such hands pose a major challenge for autonomous control, due to the high dimensionality of their configuration space and complex intermittent contact interactions. In this work, we propose deep reinforcement learning (deep RL) as a scalable solution for learning complex, contact-rich behaviors with multi-fingered hands. Deep RL provides an end-to-end approach to directly map sensor readings to actions, without the need for task-specific models or policy classes. We show that contact-rich manipulation behavior with multi-fingered hands can be learned by directly training with model-free deep RL algorithms in the real world, with minimal additional assumptions and without the aid of simulation. We learn a variety of complex behaviors on two different low-cost hardware platforms. We show that each task can be learned entirely from scratch, and study how learning can be accelerated by using a small number of human demonstrations to bootstrap it. Our experiments demonstrate that complex multi-fingered manipulation skills can be learned in the real world in about 4-7 hours for most tasks, and that demonstrations can decrease this to 2-3 hours, indicating that direct deep RL training in the real world is a viable and practical alternative to simulation and model-based control. \url{https://sites.google.com/view/deeprl-handmanipulation} |
Tasks | |
Published | 2018-10-14 |
URL | http://arxiv.org/abs/1810.06045v1 |
http://arxiv.org/pdf/1810.06045v1.pdf | |
PWC | https://paperswithcode.com/paper/dexterous-manipulation-with-deep |
Repo | |
Framework | |
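As a companion to the abstract above, here is a minimal sketch of the general recipe it describes: model-free policy-gradient training, optionally bootstrapped with behavior cloning from a few demonstrations. The linear-Gaussian policy, the names, and the plain REINFORCE update are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

class Policy:
    """Hypothetical linear-Gaussian policy over hand joint targets."""
    def __init__(self, obs_dim, act_dim):
        self.W = np.zeros((act_dim, obs_dim))
        self.sigma = 0.1

    def act(self, obs):
        return self.W @ obs + self.sigma * rng.standard_normal(self.W.shape[0])

    def grad_logp(self, obs, act):
        # d/dW of log N(act | W obs, sigma^2 I)
        return np.outer(act - self.W @ obs, obs) / self.sigma**2

def bc_pretrain(policy, demos, lr=1e-2, epochs=100):
    """Bootstrap from demonstrations via least-squares behavior cloning."""
    for _ in range(epochs):
        for obs, act in demos:
            err = act - policy.W @ obs
            policy.W += lr * np.outer(err, obs)

def reinforce_update(policy, trajectories, lr=1e-3):
    """One model-free policy-gradient (REINFORCE) step on real-world rollouts."""
    grad = np.zeros_like(policy.W)
    for traj in trajectories:                 # traj: list of (obs, act, reward)
        ret = sum(r for _, _, r in traj)      # undiscounted return
        for obs, act, _ in traj:
            grad += ret * policy.grad_logp(obs, act)
    policy.W += lr * grad / len(trajectories)
```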
Neural Nets via Forward State Transformation and Backward Loss Transformation
Title | Neural Nets via Forward State Transformation and Backward Loss Transformation |
Authors | Bart Jacobs, David Sprunger |
Abstract | This article studies (multilayer perceptron) neural networks with an emphasis on the transformations involved — both forward and backward — in order to develop a semantical/logical perspective that is in line with standard program semantics. The common two-pass neural network training algorithms make this viewpoint particularly fitting. In the forward direction, neural networks act as state transformers. In the reverse direction, however, neural networks change losses of outputs to losses of inputs, thereby acting like a (real-valued) predicate transformer. In this way, backpropagation is functorial by construction, as shown in other recent work. We illustrate this perspective by training a simple instance of a neural network. |
Tasks | |
Published | 2018-03-25 |
URL | http://arxiv.org/abs/1803.09356v1 |
http://arxiv.org/pdf/1803.09356v1.pdf | |
PWC | https://paperswithcode.com/paper/neural-nets-via-forward-state-transformation |
Repo | |
Framework | |
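The two-directional view in the abstract maps directly onto code: each layer is a pair of maps, a forward state transformer and a backward loss transformer, and training composes them in opposite orders. A minimal sketch (squared loss, with the SGD step folded into the backward pass; all names are illustrative):

```python
import numpy as np

class Affine:
    """A layer as a pair of transformers: states forward, losses backward."""
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):                  # state transformer
        self.x = x
        return self.W @ x + self.b

    def backward(self, dloss_dy, lr=0.1):  # loss (predicate) transformer
        dloss_dx = self.W.T @ dloss_dy     # transform output loss to input loss
        self.W -= lr * np.outer(dloss_dy, self.x)
        self.b -= lr * dloss_dy
        return dloss_dx

class Sigmoid:
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y

    def backward(self, dloss_dy, lr=None):
        return dloss_dy * self.y * (1.0 - self.y)

def train_step(layers, x, target):
    for layer in layers:                   # forward: transform states
        x = layer.forward(x)
    dloss = x - target                     # gradient of the squared loss
    for layer in reversed(layers):         # backward: transform losses
        dloss = layer.backward(dloss)
```

Note that the backward map of a composite network is the reverse composite of the layers' backward maps — the functoriality the paper makes precise.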
Learning Generic Diffusion Processes for Image Restoration
Title | Learning Generic Diffusion Processes for Image Restoration |
Authors | Peng Qiao, Yong Dou, Yunjin Chen, Wensen Feng |
Abstract | Image restoration problems are typical ill-posed problems where the regularization term plays an important role. The regularization term learned via generative approaches is easy to transfer to various image restoration problems, but offers inferior restoration quality compared with that learned via discriminative approaches. By contrast, the regularization term learned via discriminative approaches is usually trained for a specific image restoration problem and fails on problems for which it was not trained. To address this issue, we propose a generic diffusion process (genericDP) to handle multiple Gaussian denoising problems based on the Trainable Non-linear Reaction Diffusion (TNRD) models. Instead of one model, which consists of a diffusion and a reaction term, per Gaussian denoising problem as in TNRD, we enforce multiple TNRD models to share one diffusion term. The trained genericDP model provides both promising denoising performance and high training efficiency compared with the original TNRD models. We also transfer the trained diffusion term to non-blind deconvolution, which is unseen in the training phase. Experimental results show that the trained diffusion term for multiple Gaussian denoising can be transferred to image non-blind deconvolution as an image prior and provides competitive performance. |
Tasks | Denoising, Image Restoration |
Published | 2018-07-17 |
URL | http://arxiv.org/abs/1807.06216v1 |
http://arxiv.org/pdf/1807.06216v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-generic-diffusion-processes-for |
Repo | |
Framework | |
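A sketch of the shared-diffusion idea, assuming a TNRD-style step: the filters and influence function (a stand-in tanh here) are shared across all noise levels, while only the reaction weight is noise-level specific.

```python
import numpy as np
from scipy.signal import convolve2d

def diffusion_term(x, filters, phi):
    """Shared diffusion term, TNRD-style: sum_i K_i^T phi(K_i x)."""
    out = np.zeros_like(x)
    for k in filters:
        u = convolve2d(x, k, mode="same", boundary="symm")
        # np.flip(k) gives the transpose (adjoint) of the convolution
        out += convolve2d(phi(u), np.flip(k), mode="same", boundary="symm")
    return out

def generic_dp_step(x, y, filters, lam):
    """One diffusion-reaction step. `filters` are shared across noise
    levels; `lam` is the noise-level-specific reaction weight."""
    phi = np.tanh  # hypothetical influence function
    return x - diffusion_term(x, filters, phi) - lam * (x - y)
```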
Move Forward and Tell: A Progressive Generator of Video Descriptions
Title | Move Forward and Tell: A Progressive Generator of Video Descriptions |
Authors | Yilei Xiong, Bo Dai, Dahua Lin |
Abstract | We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips. They typically treat an entire video as a whole and generate the caption conditioned on a single embedding. On the contrary, we consider videos with rich temporal structures and aim to generate paragraph descriptions that can preserve the story flow while being coherent and concise. Towards this goal, we propose a new approach, which produces a descriptive paragraph by assembling temporally localized descriptions. Given a video, it selects a sequence of distinctive clips and generates sentences thereon in a coherent manner. In particular, the selection of clips and the production of sentences are done jointly and progressively, driven by a recurrent network – what to describe next depends on what has been said before. Here, the recurrent network is learned via self-critical sequence training with both sentence-level and paragraph-level rewards. On the ActivityNet Captions dataset, our method demonstrates the capability of generating high-quality paragraph descriptions for videos. Compared to those by other methods, the descriptions produced by our method are often more relevant, more coherent, and more concise. |
Tasks | Video Captioning |
Published | 2018-07-26 |
URL | http://arxiv.org/abs/1807.10018v1 |
http://arxiv.org/pdf/1807.10018v1.pdf | |
PWC | https://paperswithcode.com/paper/move-forward-and-tell-a-progressive-generator |
Repo | |
Framework | |
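The progressive loop described in the abstract can be sketched as follows; `select_next_clip` and `generate_sentence` are hypothetical stand-ins for the paper's recurrent components, and the shared `history` is what makes "what to describe next" depend on what has been said before.

```python
def progressive_describe(video_feats, select_next_clip, generate_sentence,
                         max_sents=8):
    """Sketch of the progressive generator: jointly pick the next
    distinctive clip and describe it, conditioned on the story so far."""
    t, history, paragraph = 0, None, []
    while t < len(video_feats) and len(paragraph) < max_sents:
        # choose where the next localized description starts and ends
        t_start, t_end, history = select_next_clip(video_feats, t, history)
        sentence, history = generate_sentence(video_feats[t_start:t_end],
                                              history)
        paragraph.append(sentence)
        t = t_end
    return " ".join(paragraph)
```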
VADRA: Visual Adversarial Domain Randomization and Augmentation
Title | VADRA: Visual Adversarial Domain Randomization and Augmentation |
Authors | Rawal Khirodkar, Donghyun Yoo, Kris M. Kitani |
Abstract | We address the issue of learning effectively from synthetic domain-randomized data. While previous works have showcased domain randomization as an effective learning approach, it fails to challenge the learner and wastes valuable compute on generating easy examples. This can be attributed to uniform randomization over the rendering parameter distribution. In this work, we first provide a theoretical perspective on the characteristics of domain randomization and analyze its limitations. As a solution to these limitations, we propose a novel algorithm which closes the loop between the synthetic generative model and the learner in an adversarial fashion. Our framework easily extends to the scenario where unlabelled target data is available, thus incorporating domain adaptation. We evaluate our method on diverse vision tasks using state-of-the-art simulators for public datasets like CLEVR, Syn2Real, and VIRAT, where we demonstrate that a learner trained using adversarial data generation performs better than one trained using a random data generation strategy. |
Tasks | Domain Adaptation |
Published | 2018-12-03 |
URL | http://arxiv.org/abs/1812.00491v1 |
http://arxiv.org/pdf/1812.00491v1.pdf | |
PWC | https://paperswithcode.com/paper/vadra-visual-adversarial-domain-randomization |
Repo | |
Framework | |
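One simple instantiation of closing the loop between generator and learner: instead of sampling rendering parameters uniformly, nudge the sampling distribution toward parameters the learner currently finds hard. This cross-entropy-method-style heuristic is an assumption for illustration; the paper's adversarial formulation differs.

```python
import numpy as np

rng = np.random.default_rng(0)

def adversarial_randomization(render, learner_loss, n_rounds=100,
                              n_candidates=32, step=0.1):
    """Sketch: bias the rendering-parameter distribution toward
    high-loss regions instead of uniform randomization.
    `render` and `learner_loss` are hypothetical callables."""
    mu = np.zeros(4)  # hypothetical rendering parameters (pose, light, ...)
    for _ in range(n_rounds):
        params = mu + rng.standard_normal((n_candidates, mu.size))
        losses = np.array([learner_loss(render(p)) for p in params])
        hard = params[np.argsort(losses)[-n_candidates // 4:]]  # hardest quarter
        mu += step * (hard.mean(axis=0) - mu)  # move toward hard examples
        # (the learner would be retrained on the newly generated data here)
    return mu
```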
Finite-sample Analysis of M-estimators using Self-concordance
Title | Finite-sample Analysis of M-estimators using Self-concordance |
Authors | Dmitrii Ostrovskii, Francis Bach |
Abstract | We demonstrate how self-concordance of the loss can be exploited to obtain asymptotically optimal rates for M-estimators in finite-sample regimes. We consider two classes of losses: (i) canonically self-concordant losses in the sense of Nesterov and Nemirovski (1994), i.e., with the third derivative bounded by the $3/2$ power of the second; (ii) pseudo self-concordant losses, for which the power is removed, as introduced by Bach (2010). These classes contain some losses arising in generalized linear models, including logistic regression; in addition, the second class includes some common pseudo-Huber losses. Our results establish the critical sample size sufficient to reach the asymptotically optimal excess risk for both classes of losses. Denoting by $d$ the parameter dimension and by $d_{\text{eff}}$ the effective dimension, which takes into account possible model misspecification, we find the critical sample size to be $O(d_{\text{eff}} \cdot d)$ for canonically self-concordant losses, and $O(\rho \cdot d_{\text{eff}} \cdot d)$ for pseudo self-concordant losses, where $\rho$ is the problem-dependent local curvature parameter. In contrast to existing results, we only impose local assumptions on the data distribution, assuming that the calibrated design, i.e., the design scaled with the square root of the second derivative of the loss, is subgaussian at the best predictor $\theta_*$. Moreover, we obtain improved bounds on the critical sample size, scaling near-linearly in $\max(d_{\text{eff}},d)$, under the extra assumption that the calibrated design is subgaussian in the Dikin ellipsoid of $\theta_*$. Motivated by these findings, we construct canonically self-concordant analogues of the Huber and logistic losses with improved statistical properties. Finally, we extend some of these results to $\ell_1$-regularized M-estimators in high dimensions. |
Tasks | |
Published | 2018-10-16 |
URL | http://arxiv.org/abs/1810.06838v1 |
http://arxiv.org/pdf/1810.06838v1.pdf | |
PWC | https://paperswithcode.com/paper/finite-sample-analysis-of-m-estimators-using |
Repo | |
Framework | |
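For reference, the two notions the abstract contrasts, stated for a convex loss $\ell$ along a line (constants vary by convention):

```latex
% Canonical self-concordance (Nesterov & Nemirovski, 1994):
% the third derivative is bounded by the 3/2 power of the second.
|\ell'''(t)| \le 2\,\ell''(t)^{3/2}

% Pseudo self-concordance (Bach, 2010): the power is removed;
% the logistic loss satisfies this bound.
|\ell'''(t)| \le \ell''(t)
```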
Domain-to-Domain Translation Model for Recommender System
Title | Domain-to-Domain Translation Model for Recommender System |
Authors | Linh Nguyen, Tsukasa Ishigaki |
Abstract | Recently, multi-domain recommender systems have received much attention from researchers because they can solve the cold-start problem and support cross-selling. However, when applied to multi-domain items, algorithms that address a single domain have difficulty capturing the specific characteristics of each domain, while multi-domain algorithms have less opportunity to capture features shared among domains. Because both similarities and differences exist among domains, multi-domain models must capture both to achieve good performance. Other studies of multi-domain systems merely transfer knowledge from the source domain to the target domain, where the source domain usually comes from external sources such as search queries or social networks, which are sometimes impossible to obtain. To handle these two problems, we propose a model that can extract both homogeneous and divergent features among domains, and in which data from one domain can equally support the others: a so-called Domain-to-Domain Translation Model (D2D-TM). It is based on generative adversarial networks (GANs), Variational Autoencoders (VAEs), and Cycle-Consistency (CC) for weight sharing. We use the user interaction history of each domain as input and extract latent features through a VAE-GAN-CC network. Experiments underscore the effectiveness of the proposed system over state-of-the-art methods by a large margin. |
Tasks | Multi-Domain Recommender Systems, Recommendation Systems |
Published | 2018-12-15 |
URL | http://arxiv.org/abs/1812.06229v1 |
http://arxiv.org/pdf/1812.06229v1.pdf | |
PWC | https://paperswithcode.com/paper/domain-to-domain-translation-model-for |
Repo | |
Framework | |
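A sketch of the cycle-consistency piece of such an objective, with hypothetical encoder/decoder interfaces; the VAE and GAN terms of the full model are omitted.

```python
import numpy as np

def cycle_consistency_loss(x_a, x_b, enc_a, dec_a, enc_b, dec_b):
    """Sketch: translate domain-A interaction vectors to domain B and
    back, and require the round trip to reconstruct the input.
    enc_*/dec_* are hypothetical encoder/decoder callables."""
    ab = dec_b(enc_a(x_a))                           # A -> B translation
    ba = dec_a(enc_b(x_b))                           # B -> A translation
    cyc_a = np.mean((dec_a(enc_b(ab)) - x_a) ** 2)   # A -> B -> A
    cyc_b = np.mean((dec_b(enc_a(ba)) - x_b) ** 2)   # B -> A -> B
    return cyc_a + cyc_b
```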
Semi-Supervised Semantic Image Segmentation with Self-correcting Networks
Title | Semi-Supervised Semantic Image Segmentation with Self-correcting Networks |
Authors | Mostafa S. Ibrahim, Arash Vahdat, Mani Ranjbar, William G. Macready |
Abstract | Building a large image dataset with high-quality object masks for semantic segmentation is costly and time-consuming. In this paper, we introduce a principled semi-supervised framework that uses only a small set of fully supervised images (having semantic segmentation labels and box labels) and a set of images with only object bounding box labels (which we call the weak set). Our framework trains the primary segmentation model with the aid of an ancillary model that generates initial segmentation labels for the weak set and a self-correction module that improves the generated labels during training using the increasingly accurate primary model. We introduce two variants of the self-correction module, using either linear or convolutional functions. Experiments on the PASCAL VOC 2012 and Cityscapes datasets show that our models trained with a small fully supervised set perform similarly to, or better than, models trained with a large fully supervised set while requiring ~7x less annotation effort. |
Tasks | Semantic Segmentation |
Published | 2018-11-17 |
URL | https://arxiv.org/abs/1811.07073v3 |
https://arxiv.org/pdf/1811.07073v3.pdf | |
PWC | https://paperswithcode.com/paper/weakly-supervised-semantic-image-segmentation |
Repo | |
Framework | |
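One plausible reading of the *linear* self-correction variant, sketched below: soft labels for the weak set come from a convex combination of ancillary and primary predictions, with the weight shifting toward the primary model as it becomes more accurate. The exact combination the paper learns may differ.

```python
import numpy as np

def linear_self_correction(primary_logits, ancillary_logits, alpha):
    """Sketch: refine weak-set labels by blending the two models' logits;
    alpha ramps from 0 (trust ancillary) to 1 (trust primary) over training.
    Logits have shape (..., num_classes)."""
    logits = (1.0 - alpha) * ancillary_logits + alpha * primary_logits
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)                 # soft labels
```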
Exponential Weights on the Hypercube in Polynomial Time
Title | Exponential Weights on the Hypercube in Polynomial Time |
Authors | Sudeep Raja Putta, Abhishek Shetty |
Abstract | We study a general online linear optimization problem (OLO). At each round, a subset of objects from a fixed universe of $n$ objects is chosen, and a linear cost associated with the chosen subset is incurred. To measure the performance of our algorithms, we use the notion of regret, which is the difference between the total cost incurred over all iterations and the cost of the best fixed subset in hindsight. We consider Full Information and Bandit feedback for this problem. This problem is equivalent to OLO on the $\{0,1\}^n$ hypercube. The Exp2 algorithm and its bandit variant are commonly used strategies for this problem. It was previously unknown whether it is possible to run Exp2 on the hypercube in polynomial time. In this paper, we present a polynomial-time algorithm called PolyExp for OLO on the hypercube. We show that our algorithm is equivalent to Exp2 on $\{0,1\}^n$, Online Mirror Descent (OMD), Follow The Regularized Leader (FTRL) and Follow The Perturbed Leader (FTPL) algorithms. We show PolyExp achieves an expected regret bound that is a factor of $\sqrt{n}$ better than Exp2 in the full information setting under $L_\infty$ adversarial losses. Because of the equivalence of these algorithms, this implies an improvement on Exp2's regret bound in full information. We also show matching regret lower bounds. Finally, we show how to use PolyExp on the $\{-1,+1\}^n$ hypercube, solving an open problem of Bubeck et al (COLT 2012). |
Tasks | |
Published | 2018-06-12 |
URL | https://arxiv.org/abs/1806.04594v5 |
https://arxiv.org/pdf/1806.04594v5.pdf | |
PWC | https://paperswithcode.com/paper/exponential-weights-on-the-hypercube-in |
Repo | |
Framework | |
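The factorization behind a polynomial-time implementation fits in a few lines: with linear losses, the exponential weight of a subset $x$ is $\exp(-\eta \sum_t \langle \ell_t, x \rangle) = \prod_i \exp(-\eta L_i x_i)$, where $L_i$ is the cumulative per-coordinate loss, so the Exp2 distribution is a product of independent Bernoullis. A sketch for the full-information setting (the variable names are illustrative):

```python
import numpy as np

def polyexp(loss_vectors, eta, seed=0):
    """Sketch of exponential weights on {0,1}^n in polynomial time:
    track one cumulative loss per coordinate and sample each
    coordinate independently. `loss_vectors` has shape (T, n)."""
    rng = np.random.default_rng(seed)
    n = loss_vectors.shape[1]
    cum = np.zeros(n)
    for loss in loss_vectors:                  # full-information feedback
        p = 1.0 / (1.0 + np.exp(eta * cum))    # P[x_i = 1] = sigmoid(-eta*L_i)
        x = (rng.random(n) < p).astype(int)    # play a random subset
        yield x
        cum += loss                            # observe the full loss vector
```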
Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search
Title | Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search |
Authors | Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, Tieniu Tan |
Abstract | Text-based person search aims to retrieve the corresponding person images in an image database by virtue of a describing sentence about the person, which poses great potential for various applications such as video surveillance. Extracting visual contents corresponding to the human description is the key to this cross-modal matching problem. Moreover, correlated images and descriptions involve different granularities of semantic relevance, which is usually ignored in previous methods. To exploit the multilevel corresponding visual contents, we propose a pose-guided multi-granularity attention network (PMA). First, we propose a coarse alignment network (CA) that selects the image regions related to the global description via similarity-based attention. To further capture the phrase-related visual body parts, a fine-grained alignment network (FA) is proposed, which employs pose information to learn latent semantic alignment between visual body parts and textual noun phrases. To verify the effectiveness of our model, we perform extensive experiments on the CUHK Person Description Dataset (CUHK-PEDES), which is currently the only available dataset for text-based person search. Experimental results show that our approach outperforms the state-of-the-art methods by 15% in terms of the top-1 metric. |
Tasks | Person Search |
Published | 2018-09-22 |
URL | https://arxiv.org/abs/1809.08440v3 |
https://arxiv.org/pdf/1809.08440v3.pdf | |
PWC | https://paperswithcode.com/paper/cascade-attention-network-for-person-search |
Repo | |
Framework | |
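A sketch of similarity-based attention in the spirit of the coarse alignment stage, under the assumption of cosine similarities and a softmax over regions:

```python
import numpy as np

def coarse_alignment(region_feats, text_feat):
    """Weight image regions by similarity to the global description
    embedding. Shapes: region_feats (R, d), text_feat (d,)."""
    sims = region_feats @ text_feat / (
        np.linalg.norm(region_feats, axis=1) * np.linalg.norm(text_feat) + 1e-8)
    w = np.exp(sims - sims.max())
    w /= w.sum()                      # softmax attention over regions
    return w @ region_feats           # attended visual feature
```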
An Anti-fraud System for Car Insurance Claim Based on Visual Evidence
Title | An Anti-fraud System for Car Insurance Claim Based on Visual Evidence |
Authors | Pei Li, Bingyu Shen, Weishan Dong |
Abstract | Automatic scene understanding using machine learning algorithms has been widely applied across industries to reduce the cost of manual labor. Nowadays, insurance companies offer express vehicle insurance claims and settlement by allowing customers to upload pictures taken with mobile devices. Such claims are treated as small claims and can be processed quickly, either manually or automatically. However, due to the increasing number of claims every day, systems or people are likely to be fooled by repeated claims for an identical case, leading to large losses for insurance companies. Thus, anti-fraud checking before processing a claim is necessary. We create the first dataset of car damage images collected from the internet and local parking lots. In addition, we propose an approach to generate robust deep features by locating the damage accurately and efficiently in the images. The state-of-the-art real-time object detector YOLO \cite{redmon2016you} is modified and trained to discover damage regions as an important part of the pipeline. Both local and global deep features are extracted using the VGG model \cite{Simonyan14c} and fused later for more robust system performance. Experiments show our approach is effective in preventing fraudulent claims as well as meeting the requirement to speed up insurance claim preprocessing. |
Tasks | Scene Understanding |
Published | 2018-04-30 |
URL | http://arxiv.org/abs/1804.11207v1 |
http://arxiv.org/pdf/1804.11207v1.pdf | |
PWC | https://paperswithcode.com/paper/an-anti-fraud-system-for-car-insurance-claim |
Repo | |
Framework | |
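The anti-fraud check itself reduces to near-duplicate retrieval over fused features; a sketch follows (YOLO damage localization and VGG feature extraction are assumed to happen upstream, and the concatenation fusion is an assumption):

```python
import numpy as np

def fuse(local_feat, global_feat):
    """Concatenate L2-normalized local (damage-region) and global features."""
    l = local_feat / (np.linalg.norm(local_feat) + 1e-8)
    g = global_feat / (np.linalg.norm(global_feat) + 1e-8)
    return np.concatenate([l, g])

def is_repeated_claim(claim_feat, past_feats, threshold=0.9):
    """Flag a new claim whose fused feature is a near-duplicate of a
    past claim. past_feats has shape (N, d)."""
    sims = past_feats @ claim_feat / (
        np.linalg.norm(past_feats, axis=1) * np.linalg.norm(claim_feat) + 1e-8)
    return bool(sims.size) and float(sims.max()) > threshold
```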
Automatic Configuration of Deep Neural Networks with EGO
Title | Automatic Configuration of Deep Neural Networks with EGO |
Authors | Bas van Stein, Hao Wang, Thomas Bäck |
Abstract | Designing the architecture for an artificial neural network is a cumbersome task because of the numerous parameters to configure, including activation functions, layer types, and hyper-parameters. With the large number of parameters in most networks nowadays, it is intractable to find a good configuration for a given task by hand. In this paper, an Efficient Global Optimization (EGO) algorithm is adapted to automatically optimize and configure convolutional neural network architectures. A configurable neural network architecture based solely on convolutional layers is proposed for the optimization. Without using any knowledge of the target problem and without any data augmentation techniques, it is shown that on several image classification tasks this approach is able to find competitive network architectures in terms of prediction accuracy, compared to the best hand-crafted ones in the literature. In addition, a very small training budget (200 evaluations and 10 epochs of training) is spent on each optimized architecture, in contrast to the usual long training time of hand-crafted networks. Moreover, instead of the standard sequential evaluation in EGO, several candidate architectures are proposed and evaluated in parallel, which reduces execution overheads significantly and leads to efficient automation of deep neural network design. |
Tasks | Data Augmentation, Image Classification |
Published | 2018-10-10 |
URL | http://arxiv.org/abs/1810.05526v1 |
http://arxiv.org/pdf/1810.05526v1.pdf | |
PWC | https://paperswithcode.com/paper/automatic-configuration-of-deep-neural |
Repo | |
Framework | |
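At the core of EGO is a Gaussian-process surrogate scored by expected improvement; a compact sketch over vector-encoded architectures (the RBF kernel and the encoding are illustrative assumptions). The parallel variant the abstract mentions would propose the several highest-EI candidates per round rather than a single one.

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, ls=1.0):
    """RBF kernel between rows of A (N, d) and B (M, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def expected_improvement(Xc, X, y, noise=1e-6):
    """Fit a zero-mean GP surrogate to (architecture, error) pairs
    (X, y) and score candidate encodings Xc by expected improvement
    for minimization."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    k = rbf(Xc, X)
    mu = k @ Kinv @ y                                  # posterior mean
    var = 1.0 - np.einsum("ij,jk,ik->i", k, Kinv, k)   # posterior variance
    s = np.sqrt(np.clip(var, 1e-12, None))
    best = y.min()
    z = (best - mu) / s
    return (best - mu) * norm.cdf(z) + s * norm.pdf(z)
```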
Multi-view Common Component Discriminant Analysis for Cross-view Classification
Title | Multi-view Common Component Discriminant Analysis for Cross-view Classification |
Authors | Xinge You, Jiamiao Xu, Wei Yuan, Xiao-Yuan Jing, Dacheng Tao, Taiping Zhang |
Abstract | Cross-view classification, i.e., classifying samples from heterogeneous views, is a significant yet challenging problem in computer vision. A promising approach to this problem is multi-view subspace learning (MvSL), which intends to find a common subspace for multi-view data. Despite the satisfactory results achieved by existing methods, their performance degrades dramatically when multi-view data lies on nonlinear manifolds. To circumvent this drawback, we propose Multi-view Common Component Discriminant Analysis (MvCCDA) to handle view discrepancy, discriminability and nonlinearity in a joint manner. Specifically, our MvCCDA incorporates supervised information and local geometric information into the common component extraction process to learn a discriminant common subspace and to discover the nonlinear structure embedded in multi-view data. We develop a kernel method of MvCCDA to further boost its performance. Beyond the kernel extension, optimization and complexity analysis of MvCCDA are also presented for completeness. Our MvCCDA is competitive with the state-of-the-art MvSL-based methods on four benchmark datasets, demonstrating its superiority. |
Tasks | |
Published | 2018-05-14 |
URL | http://arxiv.org/abs/1805.05029v2 |
http://arxiv.org/pdf/1805.05029v2.pdf | |
PWC | https://paperswithcode.com/paper/multi-view-common-component-discriminant |
Repo | |
Framework | |
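For orientation, here is the generic discriminant-subspace recipe that MvSL methods build on, reduced to a single view; MvCCDA's actual objective adds cross-view alignment, local geometric terms, and a kernel variant, so this is only a simplified sketch.

```python
import numpy as np
from scipy.linalg import eigh

def discriminant_subspace(X, labels, dim):
    """Solve the generalized eigenproblem S_b w = lambda S_w w and keep
    the top eigenvectors as the projection. X has shape (N, d)."""
    mu = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))   # between-class scatter
    Sw = np.zeros_like(Sb)                    # within-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        d = (Xc.mean(axis=0) - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)
        Sw += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(len(Sw)))  # regularized
    return vecs[:, np.argsort(vals)[::-1][:dim]]        # (d, dim) projection
```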
Multimodal Relational Tensor Network for Sentiment and Emotion Classification
Title | Multimodal Relational Tensor Network for Sentiment and Emotion Classification |
Authors | Saurav Sahay, Shachi H Kumar, Rui Xia, Jonathan Huang, Lama Nachman |
Abstract | Understanding affect from video segments has brought researchers from the language, audio and video domains together. Most current multimodal research in this area deals with various techniques for fusing the modalities, and mostly treats the segments of a video independently. Motivated by the work of (Zadeh et al., 2017) and (Poria et al., 2017), we present our architecture, Relational Tensor Network, where we use the inter-modal interactions within a segment (intra-segment) and also consider the sequence of segments in a video to model inter-segment inter-modal interactions. We also generate rich representations of the text and audio modalities by leveraging richer audio and linguistic context, along with fusing fine-grained knowledge-based polarity scores from text. We present the results of our model on the CMU-MOSEI dataset and show that it outperforms many baselines and state-of-the-art methods for sentiment classification and emotion recognition. |
Tasks | Emotion Classification, Emotion Recognition, Sentiment Analysis |
Published | 2018-06-07 |
URL | http://arxiv.org/abs/1806.02923v1 |
http://arxiv.org/pdf/1806.02923v1.pdf | |
PWC | https://paperswithcode.com/paper/multimodal-relational-tensor-network-for |
Repo | |
Framework | |
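The tensor-style fusion such architectures build on (in the spirit of Zadeh et al., 2017) fits in a few lines: appending a constant 1 to each modality before the outer product makes unimodal, bimodal, and trimodal interaction terms all appear as entries of one tensor. The relational/sequential part of the model is not shown.

```python
import numpy as np

def tensor_fusion(text, audio, video):
    """Outer-product fusion of per-segment modality embeddings.
    Returns a flattened (|t|+1)*(|a|+1)*(|v|+1) interaction vector."""
    t = np.append(text, 1.0)
    a = np.append(audio, 1.0)
    v = np.append(video, 1.0)
    return np.einsum("i,j,k->ijk", t, a, v).ravel()
```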
Lipschitz regularized Deep Neural Networks generalize and are adversarially robust
Title | Lipschitz regularized Deep Neural Networks generalize and are adversarially robust |
Authors | Chris Finlay, Jeff Calder, Bilal Abbasi, Adam Oberman |
Abstract | In this work we study input gradient regularization of deep neural networks, and demonstrate that such regularization leads to generalization proofs and improved adversarial robustness. The proof of generalization does not overcome the curse of dimensionality, but it is independent of the number of layers in the networks. The adversarial robustness regularization combines adversarial training, which we show to be equivalent to Total Variation regularization, with Lipschitz regularization. We demonstrate empirically that the regularized models are more robust, and that gradient norms of images can be used for attack detection. |
Tasks | |
Published | 2018-08-28 |
URL | https://arxiv.org/abs/1808.09540v4 |
https://arxiv.org/pdf/1808.09540v4.pdf | |
PWC | https://paperswithcode.com/paper/lipschitz-regularized-deep-neural-networks |
Repo | |
Framework | |
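A minimal sketch of the input-gradient regularization the abstract describes, in PyTorch; the cross-entropy data term, the squared-norm penalty, and the weighting are assumptions for illustration.

```python
import torch

def gradient_regularized_loss(model, x, y, lam=0.1):
    """Penalize the norm of the loss gradient with respect to the
    *input*, alongside the usual data term. x: (B, ...), y: (B,)."""
    x = x.clone().requires_grad_(True)
    data_loss = torch.nn.functional.cross_entropy(model(x), y)
    # create_graph=True so the penalty itself is differentiable
    (grad,) = torch.autograd.grad(data_loss, x, create_graph=True)
    penalty = grad.flatten(1).norm(dim=1).pow(2).mean()
    return data_loss + lam * penalty
```

The same per-example gradient norms can be thresholded at test time for the attack detection mentioned in the abstract.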