Paper Group AWR 238
Augmenting a BiLSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step. Understanding Top-k Sparsification in Distributed Deep Learning. Word-Class Embeddings for Multiclass Text Classification. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication. Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization. Text Generation from Knowledge Graphs with Graph Transformers. Weight Standardization. On Direct Distribution Matching for Adapting Segmentation Networks. Online Continual Learning with Maximally Interfered Retrieval. 3D Whole Brain Segmentation using Spatially Localized Atlas Network Tiles. Learning to Route in Similarity Graphs. Hierarchical Decision Making by Generating and Following Natural Language Instructions. Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning. LiFF: Light Field Features in Scale and Depth. Deep Reinforcement Learning Designed RF Pulse: $DeepRF_{SLR}$.
Augmenting a BiLSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step
Title | Augmenting a BiLSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step |
Authors | Steinþór Steingrímsson, Örvar Kárason, Hrafn Loftsson |
Abstract | Previous work on using BiLSTM models for PoS tagging has primarily focused on small tagsets. We evaluate BiLSTM models for tagging Icelandic, a morphologically rich language, using a relatively large tagset. Our baseline BiLSTM model achieves higher accuracy than any previously published tagger not taking advantage of a morphological lexicon. When we extend the model by incorporating such data, we outperform previous state-of-the-art results by a significant margin. We also report on work in progress that attempts to address the problem of data sparsity inherent in morphologically detailed, fine-grained tagsets. We experiment with training a separate model on only the lexical category and using the coarse-grained output tag as an input for the main model. This method further increases the accuracy and reduces the tagging errors by 21.3% compared to previous state-of-the-art results. Finally, we train and test our tagger on a new gold standard for Icelandic. |
Tasks | |
Published | 2019-07-21 |
URL | https://arxiv.org/abs/1907.09038v1 |
https://arxiv.org/pdf/1907.09038v1.pdf | |
PWC | https://paperswithcode.com/paper/augmenting-a-bilstm-tagger-with-a |
Repo | https://github.com/steinst/ABLTagger |
Framework | none |
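The abstract above describes concatenating the word representation with a morphological-lexicon signal and a coarse lexical-category tag predicted by a separate model. A minimal PyTorch sketch of that input scheme follows; all names, dimensions, and the exact form of the lexicon vector are illustrative assumptions, not the authors' ABLTagger implementation.

```python
import torch
import torch.nn as nn

class LexiconAugmentedTagger(nn.Module):
    """Sketch: a BiLSTM tagger whose per-token input concatenates a word
    embedding, a binary vector of tags licensed by a morphological lexicon,
    and an embedding of a coarse lexical-category tag predicted by a
    separate model (all sizes are illustrative)."""
    def __init__(self, vocab_size, n_fine_tags, n_coarse_tags,
                 lexicon_dim, word_dim=128, coarse_dim=16, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.coarse_emb = nn.Embedding(n_coarse_tags, coarse_dim)
        self.bilstm = nn.LSTM(word_dim + lexicon_dim + coarse_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_fine_tags)

    def forward(self, word_ids, lexicon_vec, coarse_ids):
        # word_ids: (B, T), lexicon_vec: (B, T, lexicon_dim), coarse_ids: (B, T)
        x = torch.cat([self.word_emb(word_ids), lexicon_vec,
                       self.coarse_emb(coarse_ids)], dim=-1)
        h, _ = self.bilstm(x)
        return self.out(h)  # (B, T, n_fine_tags) scores over the fine-grained tagset
```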
Understanding Top-k Sparsification in Distributed Deep Learning
Title | Understanding Top-k Sparsification in Distributed Deep Learning |
Authors | Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See |
Abstract | Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, but the communication overhead among workers has become the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of the Top-$k$ operator in gradient sparsification and use relaxed bounds (e.g., the exact bound of Random-$k$) for analysis; hence the derived results cannot accurately describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through extensive experiments. We then theoretically derive a tighter bound for the Top-$k$ operator. Finally, we exploit the properties of the gradient distribution to propose an approximate top-$k$ selection algorithm, which is computationally efficient on GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead. Code is available at: \url{https://github.com/hclhkbu/GaussianK-SGD}. |
Tasks | |
Published | 2019-11-20 |
URL | https://arxiv.org/abs/1911.08772v1 |
https://arxiv.org/pdf/1911.08772v1.pdf | |
PWC | https://paperswithcode.com/paper/understanding-top-k-sparsification-in-1 |
Repo | https://github.com/hclhkbu/GaussianK-SGD |
Framework | pytorch |
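As a concrete illustration of the Top-$k$ sparsification with error compensation discussed above, here is a minimal worker-side PyTorch sketch; the function name is hypothetical and the communication step (exchanging the sparse tensors among workers) is omitted. This is not the authors' GaussianK-SGD code.

```python
import torch

def topk_with_error_feedback(grad, residual, k):
    """One worker's Top-k sparsification step with error compensation: the
    residual accumulates coordinates that were not transmitted and is added
    back before the next selection."""
    corrected = grad + residual                    # error compensation
    flat = corrected.flatten()
    _, idx = torch.topk(flat.abs(), k)             # keep the k largest-magnitude entries
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]                        # values that would be communicated
    new_residual = (flat - sparse).view_as(grad)   # what was left behind
    return sparse.view_as(grad), new_residual

# usage: the residual starts at zero and is carried across iterations
g = torch.randn(10_000)
residual = torch.zeros_like(g)
sparse_g, residual = topk_with_error_feedback(g, residual, k=100)
```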
Word-Class Embeddings for Multiclass Text Classification
Title | Word-Class Embeddings for Multiclass Text Classification |
Authors | Alejandro Moreo, Andrea Esuli, Fabrizio Sebastiani |
Abstract | Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using four popular neural architectures and six widely used and publicly available datasets for multiclass text classification. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings |
Tasks | Machine Translation, Sentiment Analysis, Text Classification, Word Embeddings, Word Sense Disambiguation |
Published | 2019-11-26 |
URL | https://arxiv.org/abs/1911.11506v1 |
https://arxiv.org/pdf/1911.11506v1.pdf | |
PWC | https://paperswithcode.com/paper/word-class-embeddings-for-multiclass-text |
Repo | https://github.com/AlexMoreo/word-class-embeddings |
Framework | pytorch |
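A rough sketch of the word-class embedding idea: each word receives a dense vector of length equal to the number of classes, built from supervised word-class co-occurrence statistics, and that vector is concatenated to the pre-trained embedding. The specific statistic below (L1-normalized class-conditional document frequency) is an illustrative assumption; the paper's exact correlation measure may differ.

```python
import numpy as np

def word_class_embeddings(X, y, n_classes):
    """X: binary document-term matrix (n_docs, vocab); y: class index per doc.
    Returns a (vocab, n_classes) matrix summarizing how strongly each word
    co-occurs with each class, to be concatenated to pre-trained embeddings."""
    Y = np.eye(n_classes)[y]                                   # one-hot labels
    counts = X.T @ Y                                           # word-class co-occurrence
    return counts / np.clip(counts.sum(axis=1, keepdims=True), 1, None)

# toy usage
X = np.random.binomial(1, 0.1, size=(100, 500)).astype(float)
y = np.random.randint(0, 4, size=100)
E_wce = word_class_embeddings(X, y, n_classes=4)               # shape (500, 4)
```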
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
Title | Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication |
Authors | Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi |
Abstract | We consider decentralized stochastic optimization with the objective function (e.g., data samples for a machine learning task) being distributed over $n$ machines that can only communicate with their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by $\omega \leq 1$ ($\omega=1$ meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, $\mathcal{O}(1/(nT))$, is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time $\mathcal{O}(1/(\delta^2\omega) \log (1/\epsilon))$ for accuracy $\epsilon > 0$. This is (to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for $\omega > 0$ and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms outperform the respective state-of-the-art baselines and that CHOCO-SGD can reduce communication by at least two orders of magnitude. |
Tasks | Stochastic Optimization |
Published | 2019-02-01 |
URL | http://arxiv.org/abs/1902.00340v1 |
http://arxiv.org/pdf/1902.00340v1.pdf | |
PWC | https://paperswithcode.com/paper/decentralized-stochastic-optimization-and |
Repo | https://github.com/JYWa/MATCHA |
Framework | pytorch |
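To make the compressed-gossip mechanism concrete, here is a single-machine NumPy simulation of one round in the spirit of CHOCO-GOSSIP: each node keeps a private value and a publicly shared estimate, communicates only a compressed difference, and mixes using the shared estimates. The step size, compression operator, and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def compress_topk(v, k):
    """A biased top-k compression operator (one of the operators covered by omega)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def choco_gossip_round(x, x_hat, W, gamma, k):
    """x: (n, d) private values; x_hat: (n, d) shared estimates;
    W: (n, n) doubly stochastic mixing matrix of the communication graph."""
    n = x.shape[0]
    q = np.stack([compress_topk(x[i] - x_hat[i], k) for i in range(n)])
    x_hat = x_hat + q                       # all nodes update the shared copies
    x = x + gamma * (W @ x_hat - x_hat)     # mix using only the shared estimates
    return x, x_hat

# toy consensus run on a ring of 4 nodes
W = np.array([[.5, .25, 0, .25], [.25, .5, .25, 0],
              [0, .25, .5, .25], [.25, 0, .25, .5]])
x, x_hat = np.random.randn(4, 10), np.zeros((4, 10))
for _ in range(200):
    x, x_hat = choco_gossip_round(x, x_hat, W, gamma=0.2, k=3)
```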
Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization
Title | Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization |
Authors | Kenji Kawaguchi, Haihao Lu |
Abstract | We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. The traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses. On the theory side, we show that the proposed method minimizes a new ordered modification of the empirical average loss, and is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for weakly convex (non-convex) loss. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, the numerical experiments show that our proposed method consistently improves the test errors compared with the standard mini-batch SGD in various models including SVM, logistic regression, and deep learning problems. |
Tasks | Stochastic Optimization |
Published | 2019-07-09 |
URL | https://arxiv.org/abs/1907.04371v5 |
https://arxiv.org/pdf/1907.04371v5.pdf | |
PWC | https://paperswithcode.com/paper/a-stochastic-first-order-method-for-ordered |
Repo | https://github.com/kenjikawaguchi/qSGD |
Framework | pytorch |
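The biased estimator described above is easy to sketch: compute per-sample losses on a mini-batch, keep only the q largest, and back-propagate through those. The snippet below is an illustrative PyTorch sketch (averaging the top-q losses; the paper's exact scaling may differ), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ordered_sgd_step(model, optimizer, x, y, q):
    """One step biased toward high-loss samples: back-propagate only the
    q largest per-sample losses in the mini-batch."""
    optimizer.zero_grad()
    losses = F.cross_entropy(model(x), y, reduction="none")  # per-sample losses
    topq, _ = torch.topk(losses, q)                          # the q hardest samples
    loss = topq.mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with an illustrative linear model
model = torch.nn.Linear(20, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 20), torch.randint(0, 3, (64,))
ordered_sgd_step(model, opt, x, y, q=16)
```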
Text Generation from Knowledge Graphs with Graph Transformers
Title | Text Generation from Knowledge Graphs with Graph Transformers |
Authors | Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, Hannaneh Hajishirzi |
Abstract | Generating texts which express complex ideas spanning multiple sentences requires a structured representation of their content (document plan), but these representations are prohibitively expensive to manually produce. In this work, we address the problem of generating coherent multi-sentence texts from the output of an information extraction system, and in particular a knowledge graph. Graphical knowledge representations are ubiquitous in computing, but pose a significant challenge for text generation techniques due to their non-hierarchical nature, collapsing of long-distance dependencies, and structural variety. We introduce a novel graph transforming encoder which can leverage the relational structure of such knowledge graphs without imposing linearization or hierarchical constraints. Incorporated into an encoder-decoder setup, we provide an end-to-end trainable system for graph-to-text generation that we apply to the domain of scientific text. Automatic and human evaluations show that our technique produces more informative texts which exhibit better document structure than competitive encoder-decoder methods. |
Tasks | Knowledge Graphs, Text Generation |
Published | 2019-04-04 |
URL | https://arxiv.org/abs/1904.02342v2 |
https://arxiv.org/pdf/1904.02342v2.pdf | |
PWC | https://paperswithcode.com/paper/text-generation-from-knowledge-graphs-with |
Repo | https://github.com/rikdz/GraphWriter |
Framework | pytorch |
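The core idea of encoding a graph without linearizing it can be illustrated with attention restricted to graph neighbors. The single-head sketch below omits relation embeddings, multi-head attention, and the decoder, so it is only a toy illustration of the mechanism, not the GraphWriter model.

```python
import torch
import torch.nn.functional as F

def graph_attention(node_feats, adj, Wq, Wk, Wv):
    """Scaled dot-product attention in which each node attends only to its
    neighbors in the graph (adjacency mask); single head, no relation types."""
    q, k, v = node_feats @ Wq, node_feats @ Wk, node_feats @ Wv
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(adj == 0, float("-inf"))   # keep only graph edges
    return F.softmax(scores, dim=-1) @ v                   # contextualized node states

# toy graph with 4 nodes; self-loops keep every softmax row well defined
d = 8
x = torch.randn(4, d)
adj = torch.eye(4) + torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0],
                                   [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
h = graph_attention(x, adj, Wq, Wk, Wv)
```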
Weight Standardization
Title | Weight Standardization |
Authors | Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, Alan Yuille |
Abstract | In this paper, we propose Weight Standardization (WS) to accelerate deep network training. WS is targeted at the micro-batch training setting where each GPU typically has only 1-2 images for training. The micro-batch training setting is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performance of BN in large-batch training. WS addresses this problem: when used with Group Normalization and trained with 1 image/GPU, WS is able to match or outperform BN trained with large batch sizes, requiring only two additional lines of code. In micro-batch training, WS significantly outperforms other normalization methods. WS achieves these superior results by standardizing the weights in the convolutional layers, which we show smooths the loss landscape by reducing the Lipschitz constants of the loss and the gradients. The effectiveness of WS is verified on many tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. The code is available here: https://github.com/joe-siyuan-qiao/WeightStandardization. |
Tasks | Image Classification, Instance Segmentation, Object Detection, Semantic Segmentation, Video Recognition |
Published | 2019-03-25 |
URL | http://arxiv.org/abs/1903.10520v1 |
http://arxiv.org/pdf/1903.10520v1.pdf | |
PWC | https://paperswithcode.com/paper/weight-standardization |
Repo | https://github.com/joe-siyuan-qiao/WeightStandardization |
Framework | tf |
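Weight Standardization is compact enough to sketch directly: standardize each convolutional filter's weights to zero mean and unit variance before applying the convolution, typically alongside Group Normalization. The PyTorch sketch below follows the common formulation (the epsilon value is an assumption).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized per output filter before the
    convolution, in the spirit of Weight Standardization."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# usage: WS is typically paired with GroupNorm for micro-batch training
layer = nn.Sequential(WSConv2d(3, 16, 3, padding=1), nn.GroupNorm(8, 16), nn.ReLU())
out = layer(torch.randn(2, 3, 32, 32))
```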
On Direct Distribution Matching for Adapting Segmentation Networks
Title | On Direct Distribution Matching for Adapting Segmentation Networks |
Authors | Georg Pichler, Jose Dolz, Ismail Ben Ayed, Pablo Piantanida |
Abstract | Minimization of distribution matching losses is a principled approach to domain adaptation in the context of image classification. However, it is largely overlooked in adapting segmentation networks, which is currently dominated by adversarial models. We propose a class of loss functions that encourage direct kernel density matching in the network-output space, up to some geometric transformations computed from unlabeled inputs. Rather than using an intermediate domain discriminator, our direct approach unifies distribution matching and segmentation in a single loss. Therefore, it simplifies segmentation adaptation by avoiding extra adversarial steps, while improving the quality, stability, and efficiency of training. We juxtapose our approach to state-of-the-art segmentation adaptation via adversarial training in the network-output space. In the challenging task of adapting brain segmentation across different magnetic resonance imaging (MRI) modalities, our approach achieves significantly better results in terms of both accuracy and stability. |
Tasks | Brain Segmentation, Domain Adaptation, Image Classification |
Published | 2019-04-04 |
URL | http://arxiv.org/abs/1904.02657v1 |
http://arxiv.org/pdf/1904.02657v1.pdf | |
PWC | https://paperswithcode.com/paper/on-direct-distribution-matching-for-adapting |
Repo | https://github.com/anonymauthor/DDMSegNet |
Framework | none |
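As a rough illustration of matching output distributions directly (without a domain discriminator), the sketch below uses an RBF-kernel two-sample (MMD-style) penalty between network outputs on the two domains. This is a generic stand-in for the class of losses the abstract describes; the paper's actual kernel density matching losses and geometric transformations are not reproduced here.

```python
import torch

def rbf_mmd(p, q, sigma=1.0):
    """Kernel two-sample discrepancy between two sets of network outputs
    (p: (n, d), q: (m, d)), used here as a simple distribution-matching loss."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return gram(p, p).mean() + gram(q, q).mean() - 2 * gram(p, q).mean()

# hypothetical usage: supervised loss on the labeled source domain plus a
# matching loss between source and target output distributions
# total = ce_loss(src_logits, src_labels) + lam * rbf_mmd(
#     src_probs.flatten(0, -2), tgt_probs.flatten(0, -2))
```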
Online Continual Learning with Maximally Interfered Retrieval
Title | Online Continual Learning with Maximally Interfered Retrieval |
Authors | Rahaf Aljundi, Lucas Caccia, Eugene Belilovsky, Massimo Caccia, Min Lin, Laurent Charlin, Tinne Tuytelaars |
Abstract | Continual learning, the setting where a learning agent is faced with a never ending stream of data, continues to be a great challenge for modern machine learning systems. In particular the online or “single-pass through the data” setting has gained attention recently as a natural setting that is difficult to tackle. Methods based on replay, either generative or from a stored memory, have been shown to be effective approaches for continual learning, matching or exceeding the state of the art in a number of standard benchmarks. These approaches typically rely on randomly selecting samples from the replay memory or from a generative model, which is suboptimal. In this work, we consider a controlled sampling of memories for replay. We retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameters update. We show a formulation for this sampling criterion in both the generative replay and the experience replay setting, producing consistent gains in performance and greatly reduced forgetting. We release an implementation of our method at https://github.com/optimass/Maximally_Interfered_Retrieval. |
Tasks | Continual Learning |
Published | 2019-08-11 |
URL | https://arxiv.org/abs/1908.04742v3 |
https://arxiv.org/pdf/1908.04742v3.pdf | |
PWC | https://paperswithcode.com/paper/online-continual-learning-with-maximally |
Repo | https://github.com/optimass/Maximally_Interfered_Retrieval |
Framework | pytorch |
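The retrieval criterion can be sketched directly: simulate the parameter update the incoming batch would cause, then pick the memory samples whose loss would increase the most under that virtual update. The PyTorch sketch below is an illustrative reading of that criterion (classification losses, plain SGD virtual step), not the released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def retrieve_max_interfered(model, lr, x_new, y_new, mem_x, mem_y, k):
    """Return the k memory samples whose loss grows the most after a
    virtual SGD step on the incoming batch (x_new, y_new)."""
    with torch.no_grad():
        pre = F.cross_entropy(model(mem_x), mem_y, reduction="none")

    virtual = copy.deepcopy(model)                      # simulate the update
    loss = F.cross_entropy(virtual(x_new), y_new)
    grads = torch.autograd.grad(loss, list(virtual.parameters()))
    with torch.no_grad():
        for p, g in zip(virtual.parameters(), grads):
            p -= lr * g
        post = F.cross_entropy(virtual(mem_x), mem_y, reduction="none")

    idx = torch.topk(post - pre, k).indices             # most interfered samples
    return mem_x[idx], mem_y[idx]
```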
3D Whole Brain Segmentation using Spatially Localized Atlas Network Tiles
Title | 3D Whole Brain Segmentation using Spatially Localized Atlas Network Tiles |
Authors | Yuankai Huo, Zhoubing Xu, Yunxi Xiong, Katherine Aboud, Prasanna Parvathaneni, Shunxing Bao, Camilo Bermudez, Susan M. Resnick, Laurie E. Cutting, Bennett A. Landman |
Abstract | Detailed whole brain segmentation is an essential quantitative technique, which provides a non-invasive way of measuring brain regions from structural magnetic resonance imaging (MRI). Recently, deep convolutional neural networks (CNNs) have been applied to whole brain segmentation. However, restricted by current GPU memory, 2D based methods, downsampling based 3D CNN methods, and patch-based high-resolution 3D CNN methods have been the de facto standard solutions. 3D patch-based high-resolution methods typically yield superior performance among CNN approaches on detailed whole brain segmentation (>100 labels); however, their performance is still commonly inferior to that of multi-atlas segmentation (MAS) methods due to the following challenges: (1) a single network is typically used to learn both spatial and contextual information for the patches, (2) limited manually traced whole brain volumes are available (typically less than 50) for training a network. In this work, we propose the spatially localized atlas network tiles (SLANT) method to distribute multiple independent 3D fully convolutional networks (FCN) for high-resolution whole brain segmentation. To address the first challenge, multiple spatially distributed networks were used in the SLANT method, in which each network learned contextual information for a fixed spatial location. To address the second challenge, auxiliary labels on 5111 initially unlabeled scans were created by multi-atlas segmentation for training. Since the method integrates multiple traditional medical image processing methods with deep learning, we developed a containerized pipeline to deploy the end-to-end solution. From the results, the proposed method achieved superior performance compared with multi-atlas segmentation methods, while reducing the computational time from >30 hours to 15 minutes (https://github.com/MASILab/SLANTbrainSeg). |
Tasks | Brain Segmentation |
Published | 2019-03-28 |
URL | http://arxiv.org/abs/1903.12152v1 |
http://arxiv.org/pdf/1903.12152v1.pdf | |
PWC | https://paperswithcode.com/paper/3d-whole-brain-segmentation-using-spatially |
Repo | https://github.com/MASILab/SLANT_brain_seg |
Framework | caffe2 |
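The tiling-and-fusion idea can be sketched independently of the networks themselves: cover the volume with overlapping subvolumes at fixed locations, run the network assigned to each location, and fuse overlapping label probabilities by averaging. In the sketch below, `nets` is a hypothetical mapping from tile start coordinates to per-location networks, and the tile sizes are illustrative; this is not the SLANT pipeline.

```python
import torch

def tiled_segmentation(volume, nets, tile, overlap, n_labels):
    """volume: (D, H, W) tensor; nets: dict mapping tile start coords to a
    network returning (1, n_labels, dz, dy, dx) label probabilities."""
    D, H, W = volume.shape
    probs = torch.zeros(n_labels, D, H, W)
    hits = torch.zeros(D, H, W)
    step = tuple(t - overlap for t in tile)              # assumes tile > overlap
    starts = [(z, y, x)
              for z in range(0, max(D - tile[0], 0) + 1, step[0])
              for y in range(0, max(H - tile[1], 0) + 1, step[1])
              for x in range(0, max(W - tile[2], 0) + 1, step[2])]
    for z, y, x in starts:
        patch = volume[z:z+tile[0], y:y+tile[1], x:x+tile[2]]
        p = nets[(z, y, x)](patch[None, None])[0]        # per-location network
        probs[:, z:z+tile[0], y:y+tile[1], x:x+tile[2]] += p
        hits[z:z+tile[0], y:y+tile[1], x:x+tile[2]] += 1
    return (probs / hits.clamp(min=1)).argmax(dim=0)     # fused label map
```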
Learning to Route in Similarity Graphs
Title | Learning to Route in Similarity Graphs |
Authors | Dmitry Baranchuk, Dmitry Persiyanov, Anton Sinitsin, Artem Babenko |
Abstract | Recently, similarity graphs have become the leading paradigm for efficient nearest neighbor search, outperforming traditional tree-based and LSH-based methods. Similarity graphs perform the search via greedy routing: a query traverses the graph and in each vertex moves to the adjacent vertex that is the closest to this query. In practice, similarity graphs are often susceptible to local minima, when queries do not reach their nearest neighbors, getting stuck in suboptimal vertices. In this paper, we propose to learn the routing function that overcomes local minima via incorporating information about the graph's global structure. In particular, we augment the vertices of a given graph with additional representations that are learned to provide the optimal routing from the start vertex to the query nearest neighbor. By thorough experiments, we demonstrate that the proposed learnable routing successfully diminishes the local minima problem and significantly improves the overall search performance. |
Tasks | |
Published | 2019-05-27 |
URL | https://arxiv.org/abs/1905.10987v1 |
https://arxiv.org/pdf/1905.10987v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-to-route-in-similarity-graphs |
Repo | https://github.com/dbaranchuk/learning-to-route |
Framework | pytorch |
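Greedy routing, the procedure being improved here, is worth seeing in miniature: from the current vertex, move to the neighbor whose representation is closest to the query, and stop at a local minimum. Plugging learned vertex representations into `reps` in place of the raw vectors is the gist of the proposal; the toy sketch below is not the authors' code.

```python
import numpy as np

def greedy_route(graph, reps, query, start):
    """graph: dict of adjacency lists; reps: (n_vertices, d) representations."""
    cur = start
    while True:
        cand = graph[cur] + [cur]
        best = min(cand, key=lambda v: np.linalg.norm(reps[v] - query))
        if best == cur:                      # no neighbor is closer: local minimum
            return cur
        cur = best

# toy graph over 5 vertices with 2-d representations
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
reps = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 2]], dtype=float)
print(greedy_route(graph, reps, query=np.array([1.8, 1.9]), start=0))  # reaches vertex 4
```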
Hierarchical Decision Making by Generating and Following Natural Language Instructions
Title | Hierarchical Decision Making by Generating and Following Natural Language Instructions |
Authors | Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, Mike Lewis |
Abstract | We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. We gather a dataset of 76 thousand pairs of instructions and executions from human play, and train instructor and executor models. Experiments show that models using natural language as a latent variable significantly outperform models that directly imitate human actions. The compositional structure of language proves crucial to its effectiveness for action representation. We also release our code, models and data. |
Tasks | Decision Making |
Published | 2019-06-03 |
URL | https://arxiv.org/abs/1906.00744v5 |
https://arxiv.org/pdf/1906.00744v5.pdf | |
PWC | https://paperswithcode.com/paper/190600744 |
Repo | https://github.com/facebookresearch/minirts |
Framework | pytorch |
Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning
Title | Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning |
Authors | Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, Song-Chun Zhu |
Abstract | This paper addresses a new problem of understanding human gaze communication in social videos at both the atomic level and the event level, which is significant for studying human social interactions. To tackle this novel and challenging problem, we contribute a large-scale video dataset, VACATION, which covers diverse daily social scenes and gaze communication behaviors with complete annotations of objects and human faces, human attention, and communication structures and labels at both the atomic and event levels. Together with VACATION, we propose a spatio-temporal graph neural network to explicitly represent the diverse gaze interactions in the social scenes and to infer atomic-level gaze communication by message passing. We further propose an event network with an encoder-decoder structure to predict event-level gaze communication. Our experiments demonstrate that the proposed model improves various baselines significantly in predicting atomic-level and event-level gaze communication. |
Tasks | |
Published | 2019-09-04 |
URL | https://arxiv.org/abs/1909.02144v1 |
https://arxiv.org/pdf/1909.02144v1.pdf | |
PWC | https://paperswithcode.com/paper/understanding-human-gaze-communication-by |
Repo | https://github.com/LifengFan/Human-Gaze-Communication |
Framework | none |
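The message-passing machinery underlying the atomic-level inference can be illustrated with a single generic round over a graph of faces and objects: messages flow along edges, are aggregated per node, and update the node states. Node and edge semantics below are illustrative; this is not the paper's VACATION model.

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    """One generic round of message passing over a graph with N nodes."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, h, adj):
        # h: (N, dim) node states, adj: (N, N) binary adjacency
        N = h.shape[0]
        pair = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                          h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        m = self.msg(pair) * adj.unsqueeze(-1)   # messages along edges only
        return self.upd(m.sum(dim=1), h)         # aggregate and update node states

# usage: 5 nodes, fully connected without self-loops
step = MessagePassingStep(32)
h = step(torch.randn(5, 32), torch.ones(5, 5) - torch.eye(5))
```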
LiFF: Light Field Features in Scale and Depth
Title | LiFF: Light Field Features in Scale and Depth |
Authors | Donald G. Dansereau, Bernd Girod, Gordon Wetzstein |
Abstract | Feature detectors and descriptors are key low-level vision tools that many higher-level tasks build on. Unfortunately these fail in the presence of challenging light transport effects including partial occlusion, low contrast, and reflective or refractive surfaces. Building on spatio-angular imaging modalities offered by emerging light field cameras, we introduce a new and computationally efficient 4D light field feature detector and descriptor: LiFF. LiFF is scale invariant and utilizes the full 4D light field to detect features that are robust to changes in perspective. This is particularly useful for structure from motion (SfM) and other tasks that match features across viewpoints of a scene. We demonstrate significantly improved 3D reconstructions via SfM when using LiFF instead of the leading 2D or 4D features, and show that LiFF runs an order of magnitude faster than the leading 4D approach. Finally, LiFF inherently estimates depth for each feature, opening a path for future research in light field-based SfM. |
Tasks | |
Published | 2019-01-13 |
URL | http://arxiv.org/abs/1901.03916v1 |
http://arxiv.org/pdf/1901.03916v1.pdf | |
PWC | https://paperswithcode.com/paper/liff-light-field-features-in-scale-and-depth |
Repo | https://github.com/doda42/LiFF |
Framework | none |
Deep Reinforcement Learning Designed RF Pulse: $DeepRF_{SLR}$
Title | Deep Reinforcement Learning Designed RF Pulse: $DeepRF_{SLR}$ |
Authors | Dongmyung Shin, Sooyeon Ji, Doohee Lee, Jieun Lee, Se-Hong Oh, Jongho Lee |
Abstract | A novel approach of applying deep reinforcement learning to RF pulse design is introduced. This method, which is referred to as $DeepRF_{SLR}$, is designed to minimize the peak amplitude or, equivalently, minimize the pulse duration of a multiband refocusing pulse generated by the Shinnar-Le Roux (SLR) algorithm. In the method, the root pattern of the SLR polynomial, which determines the RF pulse shape, is optimized by iterative applications of deep reinforcement learning and greedy tree search. When tested on the design of RF pulses with multiband factors of three and seven, $DeepRF_{SLR}$ demonstrated improved performance compared to conventional methods, generating shorter-duration RF pulses in shorter computational time. In the experiments, the RF pulse from $DeepRF_{SLR}$ produced a slice profile similar to that of the minimum-phase SLR RF pulse, and the profiles matched those from the computer simulation. Our approach suggests a new way of designing RF pulses by applying a machine learning algorithm, demonstrating a machine-designed MRI sequence. |
Tasks | |
Published | 2019-12-19 |
URL | https://arxiv.org/abs/1912.09015v1 |
https://arxiv.org/pdf/1912.09015v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-reinforcement-learning-designed-rf-pulse |
Repo | https://github.com/SNU-LIST/DeepRF_SLR |
Framework | tf |