Paper Group ANR 769
The Complexity of Making the Gradient Small in Stochastic Convex Optimization. Reference-Based Sequence Classification. On the Importance of Delexicalization for Fact Verification. Attributed Graph Learning with 2-D Graph Convolution. Conversational Product Search Based on Negative Feedback. Distribution-Dependent Analysis of Gibbs-ERM Principle. N …
The Complexity of Making the Gradient Small in Stochastic Convex Optimization
Title | The Complexity of Making the Gradient Small in Stochastic Convex Optimization |
Authors | Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth |
Abstract | We give nearly matching upper and lower bounds on the oracle complexity of finding $\epsilon$-stationary points ($\|\nabla F(x)\| \leq \epsilon$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and reveals some surprising differences between the complexity of stochastic optimization versus learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning. Our upper bounds are based on extensions of a recent “recursive regularization” technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite sum methods, and suggests an interesting new computational-statistical tradeoff. |
Tasks | Stochastic Optimization |
Published | 2019-02-13 |
URL | http://arxiv.org/abs/1902.04686v2 |
http://arxiv.org/pdf/1902.04686v2.pdf | |
PWC | https://paperswithcode.com/paper/the-complexity-of-making-the-gradient-small |
Repo | |
Framework | |
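The stopping criterion at the heart of this paper is reaching a point whose gradient norm is at most $\epsilon$. Below is a minimal sketch, assuming a toy smooth convex quadratic and plain SGD with a noisy gradient oracle; it only illustrates the $\|\nabla F(x)\| \leq \epsilon$ criterion, not the recursive-regularization algorithm analyzed in the paper.

```python
import numpy as np

# Illustrative only: SGD on a smooth convex quadratic, stopping when the
# *true* gradient norm drops below epsilon (the epsilon-stationarity
# criterion studied in the paper). This is NOT the paper's
# recursive-regularization algorithm, just a toy baseline.
rng = np.random.default_rng(0)
d = 20
A = np.diag(np.linspace(0.1, 1.0, d))   # PSD Hessian -> smooth convex F
b = rng.normal(size=d)

def grad(x):                 # exact gradient of F(x) = 0.5 x^T A x - b^T x
    return A @ x - b

def stochastic_grad(x):      # local stochastic oracle: unbiased noisy gradient
    return grad(x) + 0.01 * rng.normal(size=d)

eps = 0.01
x = np.zeros(d)
for t in range(1, 200_001):
    x -= (10.0 / (t + 100.0)) * stochastic_grad(x)   # decaying step size
    if np.linalg.norm(grad(x)) <= eps:               # eps-stationary point?
        print(f"eps-stationary after {t} stochastic gradient steps")
        break
print("final ||grad F(x)||:", np.linalg.norm(grad(x)))
```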
Reference-Based Sequence Classification
Title | Reference-Based Sequence Classification |
Authors | Zengyou He, Guangyao Xu, Chaohua Sheng, Bo Xu, Quan Zou |
Abstract | Sequence classification is an important data mining task in many real world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms. |
Tasks | |
Published | 2019-05-17 |
URL | https://arxiv.org/abs/1905.07188v1 |
https://arxiv.org/pdf/1905.07188v1.pdf | |
PWC | https://paperswithcode.com/paper/reference-based-sequence-classification |
Repo | |
Framework | |
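As a rough illustration of the reference-based idea, the hypothetical sketch below represents each sequence by its similarity to a handful of reference sequences and classifies in that feature space. The choice of references, the longest-common-subsequence similarity, and the 1-nearest-neighbour classifier are all assumptions made for illustration, not the algorithms proposed in the paper.

```python
import numpy as np

# Hypothetical illustration of a reference-based representation: each
# sequence is mapped to a feature vector of similarities to a small set of
# reference sequences (here: longest-common-subsequence length, normalised).

def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def featurise(seq, references):
    return np.array([lcs_len(seq, r) / max(len(seq), len(r)) for r in references])

references = ["ABAB", "CCCD", "ABCD"]            # reference set (assumed given)
train = [("ABABAB", 0), ("CCCDCC", 1), ("ABABBA", 0), ("DCCCDC", 1)]
X = np.stack([featurise(s, references) for s, _ in train])
y = np.array([label for _, label in train])

query = "ABBA"
q = featurise(query, references)
pred = y[np.argmin(np.linalg.norm(X - q, axis=1))]   # 1-NN in reference space
print("predicted class:", pred)
```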
On the Importance of Delexicalization for Fact Verification
Title | On the Importance of Delexicalization for Fact Verification |
Authors | Sandeep Suntwal, Mithun Paul, Rebecca Sharp, Mihai Surdeanu |
Abstract | In this work we aim to understand and estimate the importance that a neural network assigns to various aspects of the data while learning and making predictions. Here we focus on the recognizing textual entailment (RTE) task and its application to fact verification. In this context, the contributions of this work are as follows. We investigate the attention weights a state-of-the-art RTE method assigns to input tokens in the RTE component of fact verification systems, and confirm that most of the weight is assigned to POS tags of nouns (e.g., NN, NNP, etc.) or their phrases. To verify that these lexicalized models transfer poorly, we implement a domain transfer experiment where an RTE component is trained on the FEVER data, and tested on the Fake News Challenge (FNC) dataset. As expected, even though this method achieves high accuracy when evaluated in the same domain, the performance in the target domain is poor, marginally above chance. To mitigate this dependence on lexicalized information, we experiment with several strategies for masking out names by replacing them with their semantic category, coupled with a unique identifier to mark that the same or new entities are referenced between claim and evidence. The results show that, while the performance on the FEVER dataset remains on par with that of the model trained on lexicalized data, it improves significantly when tested on the FNC dataset. Thus our experiments demonstrate that our strategy is successful in mitigating the dependency on lexical information. |
Tasks | Natural Language Inference |
Published | 2019-09-21 |
URL | https://arxiv.org/abs/1909.09868v1 |
https://arxiv.org/pdf/1909.09868v1.pdf | |
PWC | https://paperswithcode.com/paper/190909868 |
Repo | |
Framework | |
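A minimal sketch of the masking strategy described above: entity mentions are replaced by their semantic category plus an index that stays consistent between claim and evidence. The hard-coded entity dictionary is a stand-in for a real NER system, which the paper's pipeline would rely on.

```python
# Delexicalization sketch: named entities in claim and evidence are replaced
# by their semantic category plus an index that is consistent across the
# pair, so a model can still tell whether the same entity is referenced.
# The tiny entity dictionary below substitutes for a real NER system.

ENTITY_TYPES = {"Paris": "LOCATION", "France": "LOCATION", "Marie Curie": "PERSON"}

def delexicalize(claim, evidence):
    mapping, counters = {}, {}
    def replace(text):
        for name, etype in ENTITY_TYPES.items():
            if name in text:
                if name not in mapping:
                    counters[etype] = counters.get(etype, 0) + 1
                    mapping[name] = f"{etype}-C{counters[etype]}"
                text = text.replace(name, mapping[name])
        return text
    return replace(claim), replace(evidence)

claim = "Marie Curie was born in Paris."
evidence = "Marie Curie was born in Warsaw, not in Paris, France."
print(delexicalize(claim, evidence))
# ('PERSON-C1 was born in LOCATION-C1.',
#  'PERSON-C1 was born in Warsaw, not in LOCATION-C1, LOCATION-C2.')
```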
Attributed Graph Learning with 2-D Graph Convolution
Title | Attributed Graph Learning with 2-D Graph Convolution |
Authors | Qimai Li, Xiaotong Zhang, Han Liu, Xiao-Ming Wu |
Abstract | Recent methods based on graph convolutional neural networks have shown promising results in attributed graph learning, due to the use of graph convolution for learning effective node representations. However, they commonly adopt 1-D graph convolution that operates on the object link graph while completely overlooking informative relational information on other data dimensions. This significantly limits their modeling capability and may lead to inferior performance on noisy and sparse real-world attributed networks. To address this issue, we propose to explore relations among object attributes to complement object links for node representation learning. In particular, we use 2-D graph convolution to jointly model the relations on the two data dimensions and develop a computationally efficient dimensionwise separable 2-D graph convolutional filter (DSGC). Theoretically, we show that DSGC can reduce intra-class variance of node features on both the object dimension and the attribute dimension to facilitate learning. Empirically, we demonstrate that by modeling attribute relations, DSGC achieves significant performance gain over state-of-the-art methods for node classification and clustering on several real-life attributed networks. |
Tasks | Node Classification, Representation Learning |
Published | 2019-09-26 |
URL | https://arxiv.org/abs/1909.12038v3 |
https://arxiv.org/pdf/1909.12038v3.pdf | |
PWC | https://paperswithcode.com/paper/attributed-graph-learning-with-2-d-graph-1 |
Repo | |
Framework | |
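The sketch below contrasts 1-D and dimension-wise separable 2-D graph convolution on random data: one normalized filter smooths over the object-link graph and a second smooths over an attribute graph. The GCN-style symmetric normalization is an assumption for illustration; the paper derives its own filters.

```python
import numpy as np

def norm_filter(A):
    """Symmetrically normalised adjacency with self-loops (GCN-style filter)."""
    A = A + np.eye(A.shape[0])
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

rng = np.random.default_rng(0)
n_obj, n_attr = 6, 4
X = rng.normal(size=(n_obj, n_attr))        # node-by-attribute feature matrix

A_obj = (rng.random((n_obj, n_obj)) < 0.3).astype(float)     # object link graph
A_obj = np.triu(A_obj, 1); A_obj = A_obj + A_obj.T
A_attr = (rng.random((n_attr, n_attr)) < 0.5).astype(float)  # attribute graph
A_attr = np.triu(A_attr, 1); A_attr = A_attr + A_attr.T

F_obj, F_attr = norm_filter(A_obj), norm_filter(A_attr)

H_1d = F_obj @ X            # 1-D convolution: smooth over objects only
H_2d = F_obj @ X @ F_attr   # dimension-wise separable 2-D convolution
print(H_2d.shape)           # (6, 4): smoothed over objects AND attributes
```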
Conversational Product Search Based on Negative Feedback
Title | Conversational Product Search Based on Negative Feedback |
Authors | Keping Bi, Qingyao Ai, Yongfeng Zhang, W. Bruce Croft |
Abstract | Intelligent assistants change the way people interact with computers and make it possible for people to search for products through conversations when they have purchase needs. During the interactions, the system could ask questions on certain aspects of the ideal products to clarify the users’ needs. For example, previous work proposed to ask users the exact characteristics of their ideal items before showing results. However, users may not have clear ideas about what an ideal item looks like, especially when they have not seen any item. So it is more feasible to facilitate the conversational search by showing example items and asking for feedback instead. In addition, when the users provide negative feedback for the presented items, it is easier to collect their detailed feedback on certain properties (aspect-value pairs) of the non-relevant items. By breaking down the item-level negative feedback into fine-grained feedback on aspect-value pairs, more information is available to help clarify users’ intents. So in this paper, we propose a conversational paradigm for product search driven by non-relevant items, based on which fine-grained feedback is collected and utilized to show better results in the next iteration. We then propose an aspect-value likelihood model to incorporate both positive and negative feedback on fine-grained aspect-value pairs of the non-relevant items. Experimental results show that our model is significantly better than state-of-the-art product search baselines without using feedback and those baselines using item-level negative feedback. |
Tasks | |
Published | 2019-09-04 |
URL | https://arxiv.org/abs/1909.02071v1 |
https://arxiv.org/pdf/1909.02071v1.pdf | |
PWC | https://paperswithcode.com/paper/conversational-product-search-based-on |
Repo | |
Framework | |
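A toy sketch of the fine-grained feedback loop: a rejected item yields disliked and accepted aspect-value pairs, and candidate items are re-scored accordingly. The additive scoring rule and tiny catalog are illustrative stand-ins for the paper's aspect-value likelihood model.

```python
# Toy illustration of fine-grained negative feedback: when a user rejects a
# shown item, they indicate which aspect-value pairs were wrong and which
# were acceptable. Candidates are then re-scored with a simple additive rule
# (a simplification, not the paper's aspect-value likelihood model).

catalog = {
    "item_a": {"brand": "acme", "color": "red", "size": "large"},
    "item_b": {"brand": "acme", "color": "blue", "size": "small"},
    "item_c": {"brand": "zenith", "color": "blue", "size": "large"},
}

def rescore(catalog, liked_pairs, disliked_pairs):
    scores = {}
    for item, aspects in catalog.items():
        s = 0.0
        for aspect, value in aspects.items():
            if (aspect, value) in liked_pairs:
                s += 1.0
            if (aspect, value) in disliked_pairs:
                s -= 1.0
        scores[item] = s
    return sorted(scores.items(), key=lambda kv: -kv[1])

# User was shown item_a and rejected it: the color was wrong, the brand was fine.
liked = {("brand", "acme")}
disliked = {("color", "red")}
print(rescore(catalog, liked, disliked))   # item_b ranked first
```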
Distribution-Dependent Analysis of Gibbs-ERM Principle
Title | Distribution-Dependent Analysis of Gibbs-ERM Principle |
Authors | Ilja Kuzborskij, Nicolò Cesa-Bianchi, Csaba Szepesvári |
Abstract | Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as Stochastic Gradient Langevin Dynamics and —to some extent— Stochastic Gradient Descent), while it also arises in other contexts, including PAC-Bayesian theory and sampling mechanisms. In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularized empirical risk with the goal of understanding the interplay between the data-generating distribution and learning in large hypothesis spaces. Our main results are distribution-dependent upper bounds on several notions of excess risk. We show that, in all cases, the distribution-dependent excess risk is essentially controlled by the effective dimension $\mathrm{tr}\left(\boldsymbol{H}^{\star} (\boldsymbol{H}^{\star} + \lambda \boldsymbol{I})^{-1}\right)$ of the problem, where $\boldsymbol{H}^{\star}$ is the Hessian matrix of the risk at a local minimum. This is a well-established notion of effective dimension appearing in several previous works, including the analyses of SGD and ridge regression, but ours is the first work that brings this dimension to the analysis of learning using Gibbs densities. The distribution-dependent view we advocate here improves upon earlier results of Raginsky et al. (2017), and can yield much tighter bounds depending on the interplay between the data-generating distribution and the loss function. The first part of our analysis focuses on the localized excess risk in the vicinity of a fixed local minimizer. This result is then extended to bounds on the global excess risk, by characterizing probabilities of local minima (and their complement) under Gibbs densities, a result which might be of independent interest. |
Tasks | Stochastic Optimization |
Published | 2019-02-05 |
URL | http://arxiv.org/abs/1902.01846v1 |
http://arxiv.org/pdf/1902.01846v1.pdf | |
PWC | https://paperswithcode.com/paper/distribution-dependent-analysis-of-gibbs-erm |
Repo | |
Framework | |
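The effective dimension $\mathrm{tr}\left(\boldsymbol{H}^{\star} (\boldsymbol{H}^{\star} + \lambda \boldsymbol{I})^{-1}\right)$ that controls the bounds is easy to compute; the short sketch below, with a made-up Hessian spectrum, shows how it interpolates between the ambient dimension for small $\lambda$ and a handful of high-curvature directions for large $\lambda$.

```python
import numpy as np

def effective_dimension(H, lam):
    """tr(H (H + lam I)^(-1)) for a symmetric PSD Hessian H at a local minimum."""
    eigvals = np.linalg.eigvalsh(H)
    return float(np.sum(eigvals / (eigvals + lam)))

# Example Hessian with a few large curvature directions and many tiny ones.
eigs = np.concatenate([np.full(5, 10.0), np.full(95, 0.01)])
H = np.diag(eigs)

for lam in (0.001, 0.1, 10.0):
    print(f"lambda={lam:>6}: effective dim = {effective_dimension(H, lam):.2f}")
# Small lambda -> close to the ambient dimension (100);
# large lambda -> only a handful of high-curvature directions still contribute.
```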
Node Injection Attacks on Graphs via Reinforcement Learning
Title | Node Injection Attacks on Graphs via Reinforcement Learning |
Authors | Yiwei Sun, Suhang Wang, Xianfeng Tang, Tsung-Yu Hsieh, Vasant Honavar |
Abstract | Real-world graph applications, such as advertisement and product recommendation systems, make profits based on accurately classifying the labels of nodes. However, in such scenarios, there are high incentives for adversaries to attack such graphs to reduce the node classification performance. Previous work on graph adversarial attacks focuses on modifying existing graph structures, which is infeasible in most real-world applications. In contrast, it is more practical to inject adversarial nodes into existing graphs, which can also potentially reduce the performance of the classifier. In this paper, we study the novel problem of node injection poisoning attacks, which aims to poison the graph. We describe a reinforcement learning based method, namely NIPA, to sequentially modify the adversarial information of the injected nodes. We report the results of experiments using several benchmark data sets that show the superior performance of the proposed method NIPA, relative to the existing state-of-the-art methods. |
Tasks | Node Classification |
Published | 2019-09-14 |
URL | https://arxiv.org/abs/1909.06543v1 |
https://arxiv.org/pdf/1909.06543v1.pdf | |
PWC | https://paperswithcode.com/paper/node-injection-attacks-on-graphs-via |
Repo | |
Framework | |
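The sketch below shows only the injection mechanics assumed in this setting: the original graph is untouched, and an adversarial node with chosen features and links is appended. The reinforcement-learning policy (NIPA) that decides what to inject is not modeled here.

```python
import numpy as np

# Mechanics of node injection (not the NIPA reinforcement-learning policy):
# the original graph is left untouched; the attacker appends a new node with
# chosen features and chosen links to existing nodes.

def inject_node(A, X, links, features):
    n = A.shape[0]
    A_new = np.zeros((n + 1, n + 1))
    A_new[:n, :n] = A                    # existing edges are not modified
    for t in links:                      # attacker-chosen connections
        A_new[n, t] = A_new[t, n] = 1.0
    X_new = np.vstack([X, features])     # attacker-chosen node attributes
    return A_new, X_new

rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(5, 3))

A_atk, X_atk = inject_node(A, X, links=[0, 2], features=rng.normal(size=3))
print(A.shape, "->", A_atk.shape)        # (5, 5) -> (6, 6)
```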
Crossmodal Voice Conversion
Title | Crossmodal Voice Conversion |
Authors | Hirokazu Kameoka, Kou Tanaka, Aaron Valero Puche, Yasunori Ohishi, Takuhiro Kaneko |
Abstract | Humans are able to imagine a person’s voice from the person’s appearance and imagine the person’s appearance from his/her voice. In this paper, we make the first attempt to develop a method that can convert speech into a voice that matches an input face image and generate a face image that matches the voice of the input speech by leveraging the correlation between faces and voices. We propose a model, consisting of a speech converter, a face encoder/decoder and a voice encoder. We use the latent code of an input face image encoded by the face encoder as the auxiliary input into the speech converter and train the speech converter so that the original latent code can be recovered from the generated speech by the voice encoder. We also train the face decoder along with the face encoder to ensure that the latent code will contain sufficient information to reconstruct the input face image. We confirmed experimentally that a speech converter trained in this way was able to convert input speech into a voice that matched an input face image and that the voice encoder and face decoder can be used to generate a face image that matches the voice of the input speech. |
Tasks | Voice Conversion |
Published | 2019-04-09 |
URL | http://arxiv.org/abs/1904.04540v1 |
http://arxiv.org/pdf/1904.04540v1.pdf | |
PWC | https://paperswithcode.com/paper/crossmodal-voice-conversion |
Repo | |
Framework | |
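A data-flow sketch of the training signal described above: a face encoder yields a latent code, the speech converter is conditioned on it, and a voice encoder must recover the code from the converted speech. The linear stand-in modules and dimensions are assumptions; the actual model uses deep networks over speech features.

```python
import numpy as np

# Latent-consistency signal sketch: all modules are stand-in linear maps,
# not the paper's encoder/decoder networks.

rng = np.random.default_rng(0)
d_face, d_speech, d_latent = 32, 20, 8

W_face = rng.normal(size=(d_latent, d_face)) * 0.1               # face encoder
W_conv = rng.normal(size=(d_speech, d_speech + d_latent)) * 0.1  # speech converter
W_voice = rng.normal(size=(d_latent, d_speech)) * 0.1            # voice encoder

face_img = rng.normal(size=d_face)
src_speech = rng.normal(size=d_speech)

z = W_face @ face_img                                  # latent code from the face
converted = W_conv @ np.concatenate([src_speech, z])   # speech conditioned on z
z_rec = W_voice @ converted                            # code recovered from speech

latent_loss = np.sum((z - z_rec) ** 2)     # train converter so z stays recoverable
print("latent-consistency loss:", round(float(latent_loss), 3))
```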
Lattice Transformer for Speech Translation
Title | Lattice Transformer for Speech Translation |
Authors | Pei Zhang, Boxing Chen, Niyu Ge, Kai Fan |
Abstract | Recent advances in sequence modeling have highlighted the strengths of the transformer architecture, especially in achieving state-of-the-art machine translation results. However, depending on the upstream systems, e.g., speech recognition or word segmentation, the input to the translation system can vary greatly. The goal of this work is to extend the attention mechanism of the transformer to naturally consume the lattice in addition to the traditional sequential input. We first propose a general lattice transformer for speech translation where the input is the output of automatic speech recognition (ASR), which contains multiple paths and posterior scores. To leverage the extra information from the lattice structure, we develop a novel controllable lattice attention mechanism to obtain latent representations. On the LDC Spanish-English speech translation corpus, our experiments show that the lattice transformer generalizes significantly better and outperforms both a transformer baseline and a lattice LSTM. Additionally, we validate our approach on the WMT 2017 Chinese-English translation task with lattice inputs from different BPE segmentations. In this task, we also observe improvements over strong baselines. |
Tasks | Machine Translation, Speech Recognition |
Published | 2019-06-13 |
URL | https://arxiv.org/abs/1906.05551v1 |
https://arxiv.org/pdf/1906.05551v1.pdf | |
PWC | https://paperswithcode.com/paper/lattice-transformer-for-speech-translation |
Repo | |
Framework | |
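A simplified sketch of attending over a lattice: a reachability mask keeps each node from attending to alternatives that never share a lattice path, and ASR posterior scores bias the attention logits. The mask, scores, and single-head formulation are illustrative assumptions; the paper's controllable lattice attention is richer than this.

```python
import numpy as np

def lattice_attention(Q, K, V, reachable, log_posterior):
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + log_posterior[None, :]  # posterior-score bias
    logits = np.where(reachable, logits, -1e9)               # mask unreachable nodes
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8                        # 4 lattice nodes, hidden size 8
Q = rng.normal(size=(n, d)); K = rng.normal(size=(n, d)); V = rng.normal(size=(n, d))
posterior = np.array([1.0, 0.7, 0.3, 1.0])   # ASR confidence per lattice node
reachable = np.array([[1, 1, 1, 1],          # node i may attend to node j
                      [1, 1, 0, 1],          # nodes 1 and 2 are alternatives:
                      [1, 0, 1, 1],          # they never share a lattice path
                      [1, 1, 1, 1]], dtype=bool)

out = lattice_attention(Q, K, V, reachable, np.log(posterior))
print(out.shape)                   # (4, 8)
```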
Learning to Make Analogies by Contrasting Abstract Relational Structure
Title | Learning to Make Analogies by Contrasting Abstract Relational Structure |
Authors | Felix Hill, Adam Santoro, David G. T. Barrett, Ari S. Morcos, Timothy Lillicrap |
Abstract | Analogical reasoning has been a principal focus of various waves of AI research. Analogy is particularly challenging for machines because it requires relational structures to be represented such that they can be flexibly applied across diverse domains of experience. Here, we study how analogical reasoning can be induced in neural networks that learn to perceive and reason about raw visual data. We find that the critical factor for inducing such a capacity is not an elaborate architecture, but rather, careful attention to the choice of data and the manner in which it is presented to the model. The most robust capacity for analogical reasoning is induced when networks learn analogies by contrasting abstract relational structures in their input domains, a training method that uses only the input data to force models to learn about important abstract features. Using this technique we demonstrate capacities for complex, visual and symbolic analogy making and generalisation in even the simplest neural network architectures. |
Tasks | |
Published | 2019-01-31 |
URL | http://arxiv.org/abs/1902.00120v1 |
http://arxiv.org/pdf/1902.00120v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-to-make-analogies-by-contrasting |
Repo | |
Framework | |
Deep Learning Based on Orthogonal Approximate Message Passing for CP-Free OFDM
Title | Deep Learning Based on Orthogonal Approximate Message Passing for CP-Free OFDM |
Authors | Jing Zhang, Hengtao He, Chao-Kai Wen, Shi Jin, Geoffrey Ye Li |
Abstract | Channel estimation and signal detection are very challenging for an orthogonal frequency division multiplexing (OFDM) system without cyclic prefix (CP). In this article, deep learning based on orthogonal approximate message passing (DL-OAMP) is used to address these problems. The DL-OAMP receiver includes a channel estimation neural network (CE-Net) and a signal detection neural network based on OAMP, called OAMP-Net. The CE-Net is initialized by the least square channel estimation algorithm and refined by a minimum mean-squared error (MMSE) neural network. The OAMP-Net is established by unfolding the iterative OAMP algorithm and adding some trainable parameters to improve the detection performance. The DL-OAMP receiver has low complexity and can estimate time-varying channels with only a single training. Simulation results demonstrate that the bit-error rate (BER) of the proposed scheme is lower than those of competitive algorithms for high-order modulation. |
Tasks | |
Published | 2019-05-04 |
URL | https://arxiv.org/abs/1905.02541v1 |
https://arxiv.org/pdf/1905.02541v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-learning-based-on-orthogonal-approximate |
Repo | |
Framework | |
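The sketch below illustrates the deep-unfolding idea in general terms: a fixed number of iterations of a classical detector become network layers, with a few per-layer scalars (step sizes here) left trainable. The plain gradient-plus-tanh update is a simplification under assumed toy data; OAMP-Net unfolds the full OAMP update with its linear and nonlinear estimators.

```python
import numpy as np

# Unfolding sketch: a classical iterative detector for y = Hx + noise is run
# for a fixed number of iterations, and the per-iteration step sizes would be
# treated as trainable parameters (kept fixed here). Not the OAMP update.

rng = np.random.default_rng(0)
m, n = 32, 16
H = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = rng.choice([-1.0, 1.0], size=n)          # BPSK symbols
y = H @ x_true + 0.05 * rng.normal(size=m)

T = 10
gammas = np.full(T, 0.5)          # per-layer step sizes; "trainable" in OAMP-Net

def unrolled_detector(y, H, gammas):
    x = np.zeros(H.shape[1])
    for gamma in gammas:                      # one loop iteration = one layer
        x = x + gamma * H.T @ (y - H @ x)     # linear estimation step
        x = np.tanh(2.0 * x)                  # soft symbol decision (nonlinear step)
    return x

x_hat = unrolled_detector(y, H, gammas)
print("symbol errors:", int(np.sum(np.sign(x_hat) != x_true)))
```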
Model Learning: Primal Dual Networks for Fast MR imaging
Title | Model Learning: Primal Dual Networks for Fast MR imaging |
Authors | Jing Cheng, Haifeng Wang, Leslie Ying, Dong Liang |
Abstract | Magnetic resonance imaging (MRI) is known to be a slow imaging modality and undersampling in k-space has been used to increase the imaging speed. However, image reconstruction from undersampled k-space data is an ill-posed inverse problem. Iterative algorithms based on compressed sensing have been used to address the issue. In this work, we unroll the iterations of the primal-dual hybrid gradient algorithm to a learnable deep network architecture, and gradually relax the constraints to reconstruct MR images from highly undersampled k-space data. The proposed method combines the theoretical convergence guarantee of optimization methods with the powerful learning capability of deep networks. As the constraints are gradually relaxed, the reconstruction model is finally learned from the training data by updating in k-space and image domain alternately. Experiments on in vivo MR data demonstrate that the proposed method achieves superior MR reconstructions from highly undersampled k-space data over other state-of-the-art image reconstruction methods. |
Tasks | Image Reconstruction |
Published | 2019-08-07 |
URL | https://arxiv.org/abs/1908.02426v1 |
https://arxiv.org/pdf/1908.02426v1.pdf | |
PWC | https://paperswithcode.com/paper/model-learning-primal-dual-networks-for-fast |
Repo | |
Framework | |
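A sketch of the unrolling idea on a toy 1-D problem: a fixed number of primal-dual hybrid gradient (PDHG) iterations alternate between a k-space (dual) update and an image-domain (primal) update on an undersampled Fourier operator. The steps are not learned here, and the gradual relaxation of constraints described in the abstract is omitted.

```python
import numpy as np

# Plain (non-learned) PDHG iterations on a toy 1-D "image" with an
# undersampled unitary Fourier operator. In the paper, each iteration
# becomes a network layer with learned components.

rng = np.random.default_rng(0)
n = 64
image = np.zeros(n); image[20:28] = 1.0; image[40:44] = 0.5   # toy piecewise image
mask = rng.random(n) < 0.5                                    # keep ~50% of k-space

def A(x):      # undersampled unitary Fourier transform (forward operator)
    return mask * np.fft.fft(x, norm="ortho")

def A_adj(p):  # its adjoint
    return np.fft.ifft(mask * p, norm="ortho")

y = A(image)                         # measured k-space data

tau = sigma = 0.9                    # step sizes with sigma * tau * ||A||^2 < 1
x = np.zeros(n, dtype=complex); x_bar = x.copy(); p = np.zeros(n, dtype=complex)
for k in range(10):                  # 10 unrolled iterations = 10 "layers"
    p = (p + sigma * (A(x_bar) - y)) / (1.0 + sigma)   # k-space (dual) update
    x_new = x - tau * A_adj(p)                         # image-domain (primal) update
    x_bar = 2 * x_new - x                              # over-relaxation step
    x = x_new
    print(f"iter {k}: data misfit = {np.linalg.norm(A(x) - y):.4f}")
```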
Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization
Title | Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization |
Authors | Junxin Liu, Fangzhao Wu, Chuhan Wu, Yongfeng Huang, Xing Xie |
Abstract | Existing methods for Chinese word segmentation (CWS) usually rely on a large number of labeled sentences to train word segmentation models, which are expensive and time-consuming to annotate. Luckily, unlabeled data is usually easy to collect and many high-quality Chinese lexicons are off-the-shelf, both of which can provide useful information for CWS. In this paper, we propose a neural approach for Chinese word segmentation which can exploit both lexicon and unlabeled data. Our approach is based on a variant of the posterior regularization algorithm, and the unlabeled data and lexicon are incorporated into model training as indirect supervision by regularizing the prediction space of CWS models. Extensive experiments on multiple benchmark datasets in both in-domain and cross-domain scenarios validate the effectiveness of our approach. |
Tasks | Chinese Word Segmentation |
Published | 2019-04-26 |
URL | http://arxiv.org/abs/1905.01963v1 |
http://arxiv.org/pdf/1905.01963v1.pdf | |
PWC | https://paperswithcode.com/paper/190501963 |
Repo | |
Framework | |
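A toy sketch of the indirect supervision: on an unlabeled sentence, a lexicon match suggests segmentation tags for some characters, and a penalty measures how far the model's predicted tag distributions are from that constraint. The numbers and the penalty form are illustrative assumptions, not the paper's exact posterior-regularization objective.

```python
import numpy as np

# Lexicon-as-constraint sketch for Chinese word segmentation with the
# B/M/E/S tagging scheme. The penalty form is illustrative only.

TAGS = ["B", "M", "E", "S"]          # Begin / Middle / End / Single-character word

# Model's current tag distribution for a 4-character unlabeled sentence.
pred = np.array([[0.4, 0.1, 0.2, 0.3],
                 [0.3, 0.3, 0.2, 0.2],
                 [0.2, 0.2, 0.4, 0.2],
                 [0.1, 0.1, 0.2, 0.6]])

# A lexicon match says characters 0-1 form a word: tags should be B then E.
constraint = {0: "B", 1: "E"}

def constraint_penalty(pred, constraint):
    """Negative log-probability the model assigns to lexicon-preferred tags."""
    return -sum(np.log(pred[i, TAGS.index(t)]) for i, t in constraint.items())

print(f"penalty before training: {constraint_penalty(pred, constraint):.3f}")
# Training would add this penalty (over many unlabeled sentences) to the
# supervised loss, nudging the prediction space towards lexicon-consistent tags.
```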
Evaluating Built-in ECC of FPGA on-chip Memories for the Mitigation of Undervolting Faults
Title | Evaluating Built-in ECC of FPGA on-chip Memories for the Mitigation of Undervolting Faults |
Authors | Behzad Salami, Osman S. Unsal, Adrian Cristal Kestelman |
Abstract | Voltage underscaling below the nominal level is an effective solution for improving energy efficiency in digital circuits, e.g., Field Programmable Gate Arrays (FPGAs). However, further undervolting below a safe voltage level and without accompanying frequency scaling leads to timing related faults, potentially undermining the energy savings. Through experimental voltage underscaling studies on commercial FPGAs, we observed that the rate of these faults exponentially increases for on-chip memories, or Block RAMs (BRAMs). To mitigate these faults, we evaluated the efficiency of the built-in Error-Correction Code (ECC) and observed that more than 90% of the faults are correctable and a further 7% are detectable (but not correctable). This efficiency is the result of the single-bit type of these faults, which are then effectively covered by the Single-Error Correction and Double-Error Detection (SECDED) design of the built-in ECC. Finally, motivated by the above experimental observations, we evaluated an FPGA-based Neural Network (NN) accelerator under low-voltage operations, while built-in ECC is leveraged to mitigate undervolting faults and thus, prevent significant NN accuracy loss. As a consequence, we achieve a 40% BRAM power saving through undervolting below the minimum safe voltage level, with a negligible NN accuracy loss, thanks to the substantial fault coverage by the built-in ECC. |
Tasks | |
Published | 2019-03-29 |
URL | http://arxiv.org/abs/1903.12514v1 |
http://arxiv.org/pdf/1903.12514v1.pdf | |
PWC | https://paperswithcode.com/paper/evaluating-built-in-ecc-of-fpga-on-chip |
Repo | |
Framework | |
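The built-in BRAM ECC is a SECDED code, which is why single-bit faults are correctable while double-bit faults are only detectable. The sketch below implements a textbook extended Hamming(8,4) SECDED code to show that behavior; it is not the FPGA's actual ECC circuitry, which operates on wider data words.

```python
def encode(d):
    """Extended Hamming(8,4) SECDED encoder; d is a list of 4 data bits."""
    d3, d5, d6, d7 = d
    c = [0] * 8                       # c[0] = overall parity, c[1..7] = Hamming(7,4)
    c[3], c[5], c[6], c[7] = d3, d5, d6, d7
    c[1] = d3 ^ d5 ^ d7               # parity over positions 1,3,5,7
    c[2] = d3 ^ d6 ^ d7               # parity over positions 2,3,6,7
    c[4] = d5 ^ d6 ^ d7               # parity over positions 4,5,6,7
    c[0] = c[1]^c[2]^c[3]^c[4]^c[5]^c[6]^c[7]   # overall (even) parity
    return c

def decode(r):
    """Return (status, data bits). status: 'ok', 'corrected', or 'detected'."""
    s = (r[1]^r[3]^r[5]^r[7]) | ((r[2]^r[3]^r[6]^r[7]) << 1) | ((r[4]^r[5]^r[6]^r[7]) << 2)
    overall = r[0]^r[1]^r[2]^r[3]^r[4]^r[5]^r[6]^r[7]
    r = list(r)
    if s == 0 and overall == 0:
        return "ok", [r[3], r[5], r[6], r[7]]
    if overall == 1:                  # single-bit error -> correctable
        pos = s if s != 0 else 0
        r[pos] ^= 1
        return "corrected", [r[3], r[5], r[6], r[7]]
    return "detected", None           # double-bit error -> detectable only

word = [1, 0, 1, 1]
cw = encode(word)

one_err = list(cw); one_err[5] ^= 1                     # flip one bit
two_err = list(cw); two_err[2] ^= 1; two_err[6] ^= 1    # flip two bits
print(decode(one_err))   # ('corrected', [1, 0, 1, 1])
print(decode(two_err))   # ('detected', None)
```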
Reducing catastrophic forgetting when evolving neural networks
Title | Reducing catastrophic forgetting when evolving neural networks |
Authors | Joseph Early |
Abstract | A key stepping stone in the development of an artificial general intelligence (a machine that can perform any task) is the production of agents that can perform multiple tasks at once instead of just one. Unfortunately, canonical methods are very prone to catastrophic forgetting (CF) - the act of overwriting previous knowledge about a task when learning a new task. Recent efforts have developed techniques for overcoming CF in learning systems, but no attempt has been made to apply these new techniques to evolutionary systems. This research presents a novel technique, weight protection, for reducing CF in evolutionary systems by adapting a method from learning systems. It is used in conjunction with other evolutionary approaches for overcoming CF and is shown to be effective at alleviating CF when applied to a suite of reinforcement learning tasks. It is speculated that this work could indicate the potential for a wider application of existing learning-based approaches to evolutionary systems and that evolutionary techniques may be competitive with or better than learning systems when it comes to reducing CF. |
Tasks | |
Published | 2019-04-05 |
URL | http://arxiv.org/abs/1904.03178v1 |
http://arxiv.org/pdf/1904.03178v1.pdf | |
PWC | https://paperswithcode.com/paper/reducing-catastrophic-forgetting-when |
Repo | |
Framework | |
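A sketch of folding weight protection into an evolutionary fitness: candidate weight vectors are scored on the new task minus a quadratic penalty for moving weights deemed important on the old task. The importance estimates, stand-in tasks, and trade-off constant are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Weight protection folded into an evolutionary fitness: genomes (weight
# vectors) are scored on the new task minus an EWC-style quadratic penalty
# for moving weights that mattered on the old task.

rng = np.random.default_rng(0)
n_weights = 10
w_old = rng.normal(size=n_weights)          # weights after learning task A
importance = rng.random(n_weights)          # per-weight importance on task A

def task_b_score(w):                        # stand-in for task-B performance
    return -np.sum((w - 1.0) ** 2)

def protected_fitness(w, lam=0.5):
    penalty = np.sum(importance * (w - w_old) ** 2)
    return task_b_score(w) - lam * penalty  # trade off new task vs. forgetting

# One generation of a simple evolutionary step: mutate, then select the best.
population = [w_old + 0.3 * rng.normal(size=n_weights) for _ in range(20)]
best = max(population, key=protected_fitness)
print("best protected fitness:", round(protected_fitness(best), 3))
```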