February 1, 2020

Paper Group AWR 246

Homogeneous Online Transfer Learning with Online Distribution Discrepancy Minimization


Title	Homogeneous Online Transfer Learning with Online Distribution Discrepancy Minimization
Authors	Yuntao Du, Zhiwen Tan, Qian Chen, Yi Zhang, Chongjun Wang
Abstract	Transfer learning has been demonstrated to be successful and essential in diverse applications, which transfers knowledge from related but different source domains to the target domain. Online transfer learning (OTL) is a more challenging problem where the target data arrive in an online manner. Most OTL methods combine source classifier and target classifier directly by assigning a weight to each classifier, and adjust the weights constantly. However, these methods pay little attention to reducing the distribution discrepancy between domains. In this paper, we propose a novel online transfer learning method which seeks to find a new feature representation, so that the marginal distribution and conditional distribution discrepancy can be reduced online simultaneously. We focus on online transfer learning with multiple source domains and use the Hedge strategy to leverage knowledge from source domains. We analyze the theoretical properties of the proposed algorithm and provide an upper mistake bound. Comprehensive experiments on two real-world datasets show that our method outperforms state-of-the-art methods by a large margin.
Tasks	Transfer Learning
Published	2019-12-31
URL	https://arxiv.org/abs/1912.13226v1
PDF	https://arxiv.org/pdf/1912.13226v1.pdf
PWC	https://paperswithcode.com/paper/homogeneous-online-transfer-learning-with
Repo	https://github.com/yaoyueduzhen/HomOTL-ODDM
Framework	none
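
The abstract above mentions the Hedge strategy for weighting several source domains. As a rough illustration only, the following is a minimal sketch of the generic Hedge (multiplicative weights) update, not the paper's algorithm: the learning rate `eta`, the 0/1 loss, the three source classifiers, and their predictions are all hypothetical.

```python
import math


def hedge_update(weights, losses, eta=0.5):
    """One Hedge step: scale each weight by exp(-eta * loss), then renormalize."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def weighted_vote(weights, predictions):
    """Combine +1/-1 predictions from the experts by the sign of the weighted sum."""
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score >= 0 else -1


# Hypothetical stream: each round holds the predictions of three source
# classifiers and the true label that is revealed afterwards.
rounds = [
    ([1, -1, 1], 1),
    ([1, 1, -1], 1),
    ([-1, -1, 1], -1),
]

weights = [1 / 3, 1 / 3, 1 / 3]
for predictions, label in rounds:
    combined = weighted_vote(weights, predictions)
    # 0/1 loss per classifier (an assumption made for this sketch).
    losses = [0.0 if p == label else 1.0 for p in predictions]
    weights = hedge_update(weights, losses)
    print(combined, [round(w, 3) for w in weights])
```

Classifiers that make more mistakes lose weight exponentially fast, so the combined vote gradually follows the sources that agree best with the target stream.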

On-Policy Trust Region Policy Optimisation with Replay Buffers


Title	On-Policy Trust Region Policy Optimisation with Replay Buffers
Authors	Dmitry Kangin, Nicolas Pugeault
Abstract	Building upon the recent success of deep reinforcement learning methods, we investigate the possibility of on-policy reinforcement learning improvement by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as the ability to evaluate each resulting policy. However, they usually discard all the information about the policies which existed before. In this work, we propose adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to create the method, combining advantages of on- and off-policy learning. To achieve this, the proposed algorithm generalises the $Q$-, value and advantage functions for data from multiple policies. The method uses trust region optimisation, while avoiding some of the common problems of the algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics, as well as the trainable covariance matrix instead of the fixed one. In many cases, the method not only improves the results compared to the state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also with respect to their off-policy counterpart DDPG.
Tasks	Continuous Control, Policy Gradient Methods
Published	2019-01-18
URL	http://arxiv.org/abs/1901.06212v1
PDF	http://arxiv.org/pdf/1901.06212v1.pdf
PWC	https://paperswithcode.com/paper/on-policy-trust-region-policy-optimisation
Repo	https://github.com/dkangin/baselines/tree/master/baselines/trpo_replay
Framework	tf

Fine-tune BERT for Extractive Summarization


Title	Fine-tune BERT for Extractive Summarization
Authors	Yang Liu
Abstract	BERT, a pre-trained Transformer model, has achieved ground-breaking performance on multiple NLP tasks. In this paper, we describe BERTSUM, a simple variant of BERT, for extractive summarization. Our system is the state of the art on the CNN/Dailymail dataset, outperforming the previous best-performing system by 1.65 on ROUGE-L. The code to reproduce our results is available at https://github.com/nlpyang/BertSum
Tasks	Extractive Document Summarization
Published	2019-03-25
URL	https://arxiv.org/abs/1903.10318v2
PDF	https://arxiv.org/pdf/1903.10318v2.pdf
PWC	https://paperswithcode.com/paper/fine-tune-bert-for-extractive-summarization
Repo	https://github.com/nlpyang/BertSum
Framework	pytorch

compare-mt: A Tool for Holistic Comparison of Language Generation Systems


Title	compare-mt: A Tool for Holistic Comparison of Language Generation Systems
Authors	Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, Xinyi Wang, John Wieting
Abstract	In this paper, we describe compare-mt, a tool for holistic analysis and comparison of the results of systems for language generation tasks such as machine translation. The main goal of the tool is to give the user a high-level and coherent view of the salient differences between systems that can then be used to guide further analysis or system improvement. It implements a number of tools to do so, such as analysis of accuracy of generation of particular types of words, bucketed histograms of sentence accuracies or counts based on salient characteristics, and extraction of characteristic $n$-grams for each system. It also has a number of advanced features such as use of linguistic labels, source side data, or comparison of log likelihoods for probabilistic models, and also aims to be easily extensible by users to new types of analysis. The code is available at https://github.com/neulab/compare-mt
Tasks	Machine Translation, Text Generation
Published	2019-03-19
URL	https://arxiv.org/abs/1903.07926v2
PDF	https://arxiv.org/pdf/1903.07926v2.pdf
PWC	https://paperswithcode.com/paper/compare-mt-a-tool-for-holistic-comparison-of
Repo	https://github.com/neulab/compare-mt
Framework	none

DeepCaps: Going Deeper with Capsule Networks


Title	DeepCaps: Going Deeper with Capsule Networks
Authors	Jathushan Rajasegaran, Vinoj Jayasundara, Sandaru Jayasekara, Hirunima Jayasekara, Suranga Seneviratne, Ranga Rodrigo
Abstract	Capsule Network is a promising concept in deep learning, yet its true potential is not fully realized thus far, providing sub-par performance on several key benchmark datasets with complex data. Drawing intuition from the success achieved by Convolutional Neural Networks (CNNs) by going deeper, we introduce DeepCaps, a deep capsule network architecture which uses a novel 3D convolution based dynamic routing algorithm. With DeepCaps, we surpass the state-of-the-art results in the capsule network domain on CIFAR10, SVHN and Fashion MNIST, while achieving a 68% reduction in the number of parameters. Further, we propose a class-independent decoder network, which strengthens the use of reconstruction loss as a regularization term. This leads to an interesting property of the decoder, which allows us to identify and control the physical attributes of the images represented by the instantiation parameters.
Tasks
Published	2019-04-21
URL	http://arxiv.org/abs/1904.09546v1
PDF	http://arxiv.org/pdf/1904.09546v1.pdf
PWC	https://paperswithcode.com/paper/deepcaps-going-deeper-with-capsule-networks
Repo	https://github.com/HopefulRational/DeepCaps-PyTorch
Framework	pytorch

Learning to Exploit Long-term Relational Dependencies in Knowledge Graphs


Title	Learning to Exploit Long-term Relational Dependencies in Knowledge Graphs
Authors	Lingbing Guo, Zequn Sun, Wei Hu
Abstract	We study the problem of knowledge graph (KG) embedding. A widely-established assumption to this problem is that similar entities are likely to have similar relational roles. However, existing related methods derive KG embeddings mainly based on triple-level learning, which lack the capability of capturing long-term relational dependencies of entities. Moreover, triple-level learning is insufficient for the propagation of semantic information among entities, especially for the case of cross-KG embedding. In this paper, we propose recurrent skipping networks (RSNs), which employ a skipping mechanism to bridge the gaps between entities. RSNs integrate recurrent neural networks (RNNs) with residual learning to efficiently capture the long-term relational dependencies within and between KGs. We design an end-to-end framework to support RSNs on different tasks. Our experimental results showed that RSNs outperformed state-of-the-art embedding-based methods for entity alignment and achieved competitive performance for KG completion.
Tasks	Entity Alignment, Knowledge Graphs
Published	2019-05-13
URL	https://arxiv.org/abs/1905.04914v1
PDF	https://arxiv.org/pdf/1905.04914v1.pdf
PWC	https://paperswithcode.com/paper/learning-to-exploit-long-term-relational
Repo	https://github.com/nju-websoft/RSN
Framework	tf

ZeRO: Memory Optimization Towards Training A Trillion Parameter Models


Title	ZeRO: Memory Optimization Towards Training A Trillion Parameter Models
Authors	Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
Abstract	Training large DL models with billions and potentially trillions of parameters is challenging. Existing solutions exhibit fundamental limitations to obtain both memory and scaling (computation/communication) efficiency together. Data parallelism does not help reduce memory footprint per device: a model with 1.5 billion parameters or more runs out of memory. Model parallelism hardly scales efficiently beyond multiple devices of a single node due to fine-grained computation and expensive communication. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, achieving both memory efficiency and scaling efficiency. Unlike basic data parallelism where memory states are replicated across data-parallel processes, ZeRO partitions model states instead, to scale the model size linearly with the number of devices. Furthermore, it retains scaling efficiency via computation and communication rescheduling and by reducing the model parallelism degree required to run large models. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today’s hardware (e.g., 1024 GPUs, 64 DGX-2 nodes). To meet near-term scaling goals and serve as a demonstration of ZeRO’s capability, we implemented stage-1 optimizations of ZeRO (out of 3 stages in total described in the paper) and tested this ZeRO-OS version. ZeRO-OS reduces memory and boosts model size by 4x compared with the state-of-art, scaling up to 100B parameters. Moving forward, we will work on unlocking stage-2 optimizations, with up to 8x memory savings per device, and ultimately stage-3 optimizations, reducing memory linearly with respect to the number of devices and potentially scaling to models of arbitrary size. We are excited to transform very large models from impossible to train to feasible and efficient to train!
Tasks
Published	2019-10-04
URL	https://arxiv.org/abs/1910.02054v2
PDF	https://arxiv.org/pdf/1910.02054v2.pdf
PWC	https://paperswithcode.com/paper/zero-memory-optimization-towards-training-a
Repo	https://github.com/microsoft/DeepSpeed
Framework	pytorch
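
The 4x and 8x savings quoted in the ZeRO abstract can be checked with a short back-of-the-envelope calculation. The sketch below assumes mixed-precision Adam training with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state; these byte counts do not appear in the abstract and are an assumption, and `model_state_bytes` is a hypothetical helper, not part of DeepSpeed.

```python
def model_state_bytes(num_params, num_devices, stage):
    """Approximate per-device bytes for model states under ZeRO stages 0 to 3.

    Stage 0 is plain data parallelism (everything replicated). Each later
    stage additionally partitions one more state across the devices.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    weights = 2.0 * num_params     # fp16 weights (assumed 2 bytes each)
    grads = 2.0 * num_params       # fp16 gradients (assumed 2 bytes each)
    optimizer = 12.0 * num_params  # fp32 Adam state (assumed 12 bytes each)
    if stage >= 1:
        optimizer /= num_devices   # stage 1: partition optimizer state
    if stage >= 2:
        grads /= num_devices       # stage 2: also partition gradients
    if stage >= 3:
        weights /= num_devices     # stage 3: also partition weights
    return weights + grads + optimizer


# A 1.5 billion parameter model on 64 devices (sizes taken from the abstract).
baseline = model_state_bytes(1_500_000_000, 64, 0)
for stage in range(4):
    per_device = model_state_bytes(1_500_000_000, 64, stage)
    print(f"stage {stage}: {per_device / 1e9:.2f} GB, {baseline / per_device:.1f}x smaller")
```

Under these assumptions the savings approach 4x for stage 1 and 8x for stage 2 as the device count grows, while stage 3 shrinks linearly with the number of devices, which matches the trend described in the abstract. Activations and temporary buffers are ignored.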

System Identification with Time-Aware Neural Sequence Models


Title	System Identification with Time-Aware Neural Sequence Models
Authors	Thomas Demeester
Abstract	Established recurrent neural networks are well-suited to solve a wide variety of prediction tasks involving discrete sequences. However, they do not perform as well in the task of dynamical system identification, when dealing with observations from continuous variables that are unevenly sampled in time, for example due to missing observations. We show how such neural sequence models can be adapted to deal with variable step sizes in a natural way. In particular, we introduce a time-aware and stationary extension of existing models (including the Gated Recurrent Unit) that allows them to deal with unevenly sampled system observations by adapting to the observation times, while facilitating higher-order temporal behavior. We discuss the properties and demonstrate the validity of the proposed approach, based on samples from two industrial input/output processes.
Tasks
Published	2019-11-21
URL	https://arxiv.org/abs/1911.09431v1
PDF	https://arxiv.org/pdf/1911.09431v1.pdf
PWC	https://paperswithcode.com/paper/system-identification-with-time-aware-neural
Repo	https://github.com/tdmeeste/TimeAwareRNN
Framework	pytorch

Optimization algorithms inspired by the geometry of dissipative systems


Title	Optimization algorithms inspired by the geometry of dissipative systems
Authors	Alessandro Bravetti, Maria L. Daza-Torres, Hugo Flores-Arguedas, Michael Betancourt
Abstract	Accelerated gradient methods are a powerful optimization tool in machine learning and statistics but their development has traditionally been driven by heuristic motivations. Recent research, however, has demonstrated that these methods can be derived as discretizations of dynamical systems, which in turn has provided a basis for more systematic investigations, especially into the structure of those dynamical systems and their structure-preserving discretizations. In this work we introduce dynamical systems defined through a contact geometry which are not only naturally suited to the optimization goal but also subsume all previous methods based on geometric dynamical systems. These contact dynamical systems also admit a natural, robust discretization through geometric contact integrators. We demonstrate these features in paradigmatic examples which show that we can indeed obtain optimization algorithms that achieve oracle lower bounds on convergence rates while also improving on previous proposals in terms of stability.
Tasks
Published	2019-12-06
URL	https://arxiv.org/abs/1912.02928v2
PDF	https://arxiv.org/pdf/1912.02928v2.pdf
PWC	https://paperswithcode.com/paper/optimization-algorithms-inspired-by-the
Repo	https://github.com/mdazatorres/Contact-optimization-algorithms
Framework	none


Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images


Title	Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images
Authors	Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, Steven C. H. Hoi
Abstract	Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and healthy lifestyle. An important task under the food-computing umbrella is retrieval, which is particularly helpful for health related applications, where we are interested in retrieving important information about food (e.g., ingredients, nutrition, etc.). In this paper, we investigate an open research task of cross-modal retrieval between cooking recipes and food images, and propose a novel framework Adversarial Cross-Modal Embedding (ACME) to resolve the cross-modal retrieval task in food domains. Specifically, the goal is to learn a common embedding feature space between the two modalities, in which our approach consists of several novel ideas: (i) learning by using a new triplet loss scheme together with an effective sampling strategy, (ii) imposing modality alignment using an adversarial learning strategy, and (iii) imposing cross-modal translation consistency such that the embedding of one modality is able to recover some important information of corresponding instances in the other modality. ACME achieves the state-of-the-art performance on the benchmark Recipe1M dataset, validating the efficacy of the proposed technique.
Tasks	Cross-Modal Retrieval
Published	2019-05-03
URL	https://arxiv.org/abs/1905.01273v1
PDF	https://arxiv.org/pdf/1905.01273v1.pdf
PWC	https://paperswithcode.com/paper/learning-cross-modal-embeddings-with
Repo	https://github.com/LARC-CMU-SMU/ACME
Framework	pytorch

NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization


Title	NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization
Authors	Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, Kuansan Wang, Jie Tang
Abstract	We study the problem of large-scale network embedding, which aims to learn latent representations for network mining applications. Previous research shows that 1) popular network embedding benchmarks, such as DeepWalk, are in essence implicitly factorizing a matrix with a closed form, and 2) the explicit factorization of such matrix generates more powerful embeddings than existing methods. However, directly constructing and factorizing this matrix, which is dense, is prohibitively expensive in terms of both time and space, making it not scalable for large networks. In this work, we present the algorithm of large-scale network embedding as sparse matrix factorization (NetSMF). NetSMF leverages theories from spectral sparsification to efficiently sparsify the aforementioned dense matrix, enabling significantly improved efficiency in embedding learning. The sparsified matrix is spectrally close to the original dense one with a theoretically bounded approximation error, which helps maintain the representation power of the learned embeddings. We conduct experiments on networks of various scales and types. Results show that among both popular benchmarks and factorization based methods, NetSMF is the only method that achieves both high efficiency and effectiveness. We show that NetSMF requires only 24 hours to generate effective embeddings for a large-scale academic collaboration network with tens of millions of nodes, while it would cost DeepWalk months and is computationally infeasible for the dense matrix factorization solution. The source code of NetSMF is publicly available (https://github.com/xptree/NetSMF).
Tasks	Network Embedding
Published	2019-06-26
URL	https://arxiv.org/abs/1906.11156v1
PDF	https://arxiv.org/pdf/1906.11156v1.pdf
PWC	https://paperswithcode.com/paper/netsmf-large-scale-network-embedding-as
Repo	https://github.com/xptree/NetSMF
Framework	none

Bayesian Optimization with Unknown Search Space


Title	Bayesian Optimization with Unknown Search Space
Authors	Huong Ha, Santu Rana, Sunil Gupta, Thanh Nguyen, Hung Tran-The, Svetha Venkatesh
Abstract	Applying Bayesian optimization in problems wherein the search space is unknown is challenging. To address this problem, we propose a systematic volume expansion strategy for the Bayesian optimization. We devise a strategy to guarantee that in iterative expansions of the search space, our method can find a point whose function value is within epsilon of the objective function maximum. Without the need to specify any parameters, our algorithm automatically triggers a minimal expansion required iteratively. We derive analytic expressions for when to trigger the expansion and by how much to expand. We also provide theoretical analysis to show that our method achieves epsilon-accuracy after a finite number of iterations. We demonstrate our method on both benchmark test functions and machine learning hyper-parameter tuning tasks and demonstrate that our method outperforms baselines.
Tasks
Published	2019-10-29
URL	https://arxiv.org/abs/1910.13092v1
PDF	https://arxiv.org/pdf/1910.13092v1.pdf
PWC	https://paperswithcode.com/paper/bayesian-optimization-with-unknown-search
Repo	https://github.com/HuongHa12/BO_unknown_searchspace
Framework	tf

Learning with Long-term Remembering: Following the Lead of Mixed Stochastic Gradient


Title	Learning with Long-term Remembering: Following the Lead of Mixed Stochastic Gradient
Authors	Yunhui Guo, Mingrui Liu, Tianbao Yang, Tajana Rosing
Abstract	Current deep neural networks can achieve remarkable performance on a single task. However, when the deep neural network is continually trained on a sequence of tasks, it seems to gradually forget the previous learned knowledge. This phenomenon is referred to as catastrophic forgetting and motivates the field called lifelong learning. The central question in lifelong learning is how to enable deep neural networks to maintain performance on old tasks while learning a new task. In this paper, we introduce a novel and effective lifelong learning algorithm, called MixEd stochastic GrAdient (MEGA), which allows deep neural networks to acquire the ability of retaining performance on old tasks while learning new tasks. MEGA modulates the balance between old tasks and the new task by integrating the current gradient with the gradient computed on a small reference episodic memory. Extensive experimental results show that the proposed MEGA algorithm significantly advances the state-of-the-art on all four commonly used lifelong learning benchmarks, reducing the error by up to 18%.
Tasks
Published	2019-09-25
URL	https://arxiv.org/abs/1909.11763v5
PDF	https://arxiv.org/pdf/1909.11763v5.pdf
PWC	https://paperswithcode.com/paper/learning-with-long-term-remembering-following
Repo	https://github.com/yunhuiguo/MEGA
Framework	tf

Adversarial Training Methods for Network Embedding


Title	Adversarial Training Methods for Network Embedding
Authors	Quanyu Dai, Xiao Shen, Liang Zhang, Qiang Li, Dan Wang
Abstract	Network Embedding is the task of learning continuous node representations for networks, which has been shown effective in a variety of tasks such as link prediction and node classification. Most of existing works aim to preserve different network structures and properties in low-dimensional embedding vectors, while neglecting the existence of noisy information in many real-world networks and the overfitting issue in the embedding learning process. Most recently, generative adversarial networks (GANs) based regularization methods are exploited to regularize embedding learning process, which can encourage a global smoothness of embedding vectors. These methods have very complicated architecture and suffer from the well-recognized non-convergence problem of GANs. In this paper, we aim to introduce a more succinct and effective local regularization method, namely adversarial training, to network embedding so as to achieve model robustness and better generalization performance. Firstly, the adversarial training method is applied by defining adversarial perturbations in the embedding space with an adaptive $L_2$ norm constraint that depends on the connectivity pattern of node pairs. Though effective as a regularizer, it suffers from the interpretability issue which may hinder its application in certain real-world scenarios. To improve this strategy, we further propose an interpretable adversarial training method by enforcing the reconstruction of the adversarial examples in the discrete graph domain. These two regularization methods can be applied to many existing embedding models, and we take DeepWalk as the base model for illustration in the paper. Empirical evaluations in both link prediction and node classification demonstrate the effectiveness of the proposed methods.
Tasks	Link Prediction, Network Embedding, Node Classification
Published	2019-08-30
URL	https://arxiv.org/abs/1908.11514v1
PDF	https://arxiv.org/pdf/1908.11514v1.pdf
PWC	https://paperswithcode.com/paper/adversarial-training-methods-for-network
Repo	https://github.com/wonniu/AdvT4NE_WWW2019
Framework	tf
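
The abstract above describes adversarial perturbations in the embedding space under an $L_2$ norm constraint. The following is a minimal sketch of the standard norm-bounded perturbation along the loss gradient, assuming a fixed budget `epsilon`; the paper's adaptive, connectivity-dependent constraint is not reproduced here, and the function name and example gradient are hypothetical.

```python
import math


def l2_adversarial_perturbation(grad, epsilon):
    """Return the perturbation of L2 norm epsilon that points along grad.

    grad is the gradient of the loss with respect to one embedding vector.
    A zero gradient gives a zero perturbation.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


# Hypothetical embedding and loss gradient for a single node.
embedding = [0.2, -0.1]
grad = [3.0, 4.0]
delta = l2_adversarial_perturbation(grad, epsilon=0.5)
perturbed = [e + d for e, d in zip(embedding, delta)]
print(delta, perturbed)
```

Training on the perturbed embedding in addition to the clean one acts as a local regularizer, because the model is penalized when a small move in the worst-case direction changes its loss sharply.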

Fast and Accurate Network Embeddings via Very Sparse Random Projection


Title	Fast and Accurate Network Embeddings via Very Sparse Random Projection
Authors	Haochen Chen, Syed Fahad Sultan, Yingtao Tian, Muhao Chen, Steven Skiena
Abstract	We present FastRP, a scalable and performant algorithm for learning distributed node representations in a graph. FastRP is over 4,000 times faster than state-of-the-art methods such as DeepWalk and node2vec, while achieving comparable or even better performance as evaluated on several real-world networks on various downstream tasks. We observe that most network embedding methods consist of two components: construct a node similarity matrix and then apply dimension reduction techniques to this matrix. We show that the success of these methods should be attributed to the proper construction of this similarity matrix, rather than the dimension reduction method employed. FastRP is proposed as a scalable algorithm for network embeddings. Two key features of FastRP are: 1) it explicitly constructs a node similarity matrix that captures transitive relationships in a graph and normalizes matrix entries based on node degrees; 2) it utilizes very sparse random projection, which is a scalable optimization-free method for dimension reduction. An extra benefit from combining these two design choices is that it allows the iterative computation of node embeddings so that the similarity matrix need not be explicitly constructed, which further speeds up FastRP. FastRP is also advantageous for its ease of implementation, parallelization and hyperparameter tuning. The source code is available at https://github.com/GTmac/FastRP.
Tasks	Dimensionality Reduction, Network Embedding
Published	2019-08-30
URL	https://arxiv.org/abs/1908.11512v1
PDF	https://arxiv.org/pdf/1908.11512v1.pdf
PWC	https://paperswithcode.com/paper/fast-and-accurate-network-embeddings-via-very
Repo	https://github.com/GTmac/FastRP
Framework	none
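
The two ingredients named in the FastRP abstract, very sparse random projection and iterative propagation that never builds the similarity matrix, can be sketched in plain Python. This is an illustrative approximation and not the authors' implementation: the sparsity setting `s = sqrt(n)`, the equal `weights`, the simple row normalization by degree, and the name `fastrp_like` are all assumptions.

```python
import math
import random


def sparse_random_matrix(n, dim, s, rng):
    """Very sparse random projection matrix: each entry is +sqrt(s) or
    -sqrt(s) with probability 1/(2s) each, and 0 otherwise."""
    scale = math.sqrt(s)
    rows = []
    for _ in range(n):
        row = []
        for _ in range(dim):
            u = rng.random()
            if u < 1 / (2 * s):
                row.append(scale)
            elif u < 1 / s:
                row.append(-scale)
            else:
                row.append(0.0)
        rows.append(row)
    return rows


def propagate(adj, emb):
    """Apply the degree-normalized adjacency matrix without materializing it:
    every node takes the average of its neighbors' vectors."""
    dim = len(emb[0])
    out = []
    for nbrs in adj:
        vec = [0.0] * dim
        for j in nbrs:
            for k in range(dim):
                vec[k] += emb[j][k] / len(nbrs)
        out.append(vec)
    return out


def fastrp_like(adj, dim=8, weights=(1.0, 1.0), seed=0):
    """Weighted sum of 1-hop, 2-hop, ... propagations of a sparse random matrix."""
    rng = random.Random(seed)
    n = len(adj)
    current = sparse_random_matrix(n, dim, s=math.sqrt(n), rng=rng)
    total = [[0.0] * dim for _ in range(n)]
    for w in weights:
        current = propagate(adj, current)
        for i in range(n):
            for k in range(dim):
                total[i][k] += w * current[i][k]
    return total


# Hypothetical 4-node path graph 0-1-2-3, given as adjacency lists.
adj = [[1], [0, 2], [1, 3], [2]]
for row in fastrp_like(adj):
    print([round(x, 2) for x in row])
```

Each propagation step costs time proportional to the number of edges times `dim`, and no optimization is involved, which is the source of the speed advantage the abstract claims over DeepWalk-style training.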