Paper Group AWR 238
Augmenting a BiLSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step. Understanding Top-k Sparsification in Distributed Deep Learning. Word-Class Embeddings for Multiclass Text Classification. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication. Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization. Text Generation from Knowledge Graphs with Graph Transformers. Weight Standardization. On Direct Distribution Matching for Adapting Segmentation Networks. Online Continual Learning with Maximally Interfered Retrieval. 3D Whole Brain Segmentation using Spatially Localized Atlas Network Tiles. Learning to Route in Similarity Graphs. Hierarchical Decision Making by Generating and Following Natural Language Instructions. Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning. LiFF: Light Field Features in Scale and Depth. Deep Reinforcement Learning Designed RF Pulse: $DeepRF_{SLR}$.
Augmenting a BiLSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step
Title | Augmenting a BiLSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step |
Authors | Steinþór Steingrímsson, Örvar Kárason, Hrafn Loftsson |
Abstract | Previous work on using BiLSTM models for PoS tagging has primarily focused on small tagsets. We evaluate BiLSTM models for tagging Icelandic, a morphologically rich language, using a relatively large tagset. Our baseline BiLSTM model achieves higher accuracy than any previously published tagger not taking advantage of a morphological lexicon. When we extend the model by incorporating such data, we outperform previous state-of-the-art results by a significant margin. We also report on work in progress that attempts to address the problem of data sparsity inherent in morphologically detailed, fine-grained tagsets. We experiment with training a separate model on only the lexical category and using the coarse-grained output tag as an input for the main model. This method further increases the accuracy and reduces the tagging errors by 21.3% compared to previous state-of-the-art results. Finally, we train and test our tagger on a new gold standard for Icelandic. |
Tasks | |
Published | 2019-07-21 |
URL | https://arxiv.org/abs/1907.09038v1 |
https://arxiv.org/pdf/1907.09038v1.pdf | |
PWC | https://paperswithcode.com/paper/augmenting-a-bilstm-tagger-with-a |
Repo | https://github.com/steinst/ABLTagger |
Framework | none |
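The abstract above describes concatenating the word representation with a morphological-lexicon signal and a coarse lexical-category tag predicted by a separate model. A minimal PyTorch sketch of that input scheme follows; all names, dimensions, and the exact form of the lexicon vector are illustrative assumptions, not the authors' ABLTagger implementation.

```python
import torch
import torch.nn as nn

class LexiconAugmentedTagger(nn.Module):
    """Sketch: a BiLSTM tagger whose per-token input concatenates a word
    embedding, a binary vector of tags licensed by a morphological lexicon,
    and an embedding of a coarse lexical-category tag predicted by a
    separate model (all sizes are illustrative)."""
    def __init__(self, vocab_size, n_fine_tags, n_coarse_tags,
                 lexicon_dim, word_dim=128, coarse_dim=16, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.coarse_emb = nn.Embedding(n_coarse_tags, coarse_dim)
        self.bilstm = nn.LSTM(word_dim + lexicon_dim + coarse_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_fine_tags)

    def forward(self, word_ids, lexicon_vec, coarse_ids):
        # word_ids: (B, T), lexicon_vec: (B, T, lexicon_dim), coarse_ids: (B, T)
        x = torch.cat([self.word_emb(word_ids), lexicon_vec,
                       self.coarse_emb(coarse_ids)], dim=-1)
        h, _ = self.bilstm(x)
        return self.out(h)  # (B, T, n_fine_tags) scores over the fine-grained tagset
```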
Understanding Top-k Sparsification in Distributed Deep Learning
Title | Understanding Top-k Sparsification in Distributed Deep Learning |
Authors | Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See |
Abstract | Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, but the communication overhead among workers has become the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of the Top-$k$ operator in gradient sparsification and use relaxed bounds (e.g., the exact bound of Random-$k$) for analysis; hence the derived results cannot accurately describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through extensive experiments. We then theoretically derive a tighter bound for the Top-$k$ operator. Finally, we exploit the properties of the gradient distribution to propose an approximate top-$k$ selection algorithm, which is computationally efficient on GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead. Code is available at: \url{https://github.com/hclhkbu/GaussianK-SGD}. |
Tasks | |
Published | 2019-11-20 |
URL | https://arxiv.org/abs/1911.08772v1 |
https://arxiv.org/pdf/1911.08772v1.pdf | |
PWC | https://paperswithcode.com/paper/understanding-top-k-sparsification-in-1 |
Repo | https://github.com/hclhkbu/GaussianK-SGD |
Framework | pytorch |
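As a concrete illustration of the Top-$k$ sparsification with error compensation discussed above, here is a minimal worker-side PyTorch sketch; the function name is hypothetical and the communication step (exchanging the sparse tensors among workers) is omitted. This is not the authors' GaussianK-SGD code.

```python
import torch

def topk_with_error_feedback(grad, residual, k):
    """One worker's Top-k sparsification step with error compensation: the
    residual accumulates coordinates that were not transmitted and is added
    back before the next selection."""
    corrected = grad + residual                    # error compensation
    flat = corrected.flatten()
    _, idx = torch.topk(flat.abs(), k)             # keep the k largest-magnitude entries
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]                        # values that would be communicated
    new_residual = (flat - sparse).view_as(grad)   # what was left behind
    return sparse.view_as(grad), new_residual

# usage: the residual starts at zero and is carried across iterations
g = torch.randn(10_000)
residual = torch.zeros_like(g)
sparse_g, residual = topk_with_error_feedback(g, residual, k=100)
```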
Word-Class Embeddings for Multiclass Text Classification
Title | Word-Class Embeddings for Multiclass Text Classification |
Authors | Alejandro Moreo, Andrea Esuli, Fabrizio Sebastiani |
Abstract | Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using four popular neural architectures and six widely used and publicly available datasets for multiclass text classification. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings |
Tasks | Machine Translation, Sentiment Analysis, Text Classification, Word Embeddings, Word Sense Disambiguation |
Published | 2019-11-26 |
URL | https://arxiv.org/abs/1911.11506v1 |
https://arxiv.org/pdf/1911.11506v1.pdf | |
PWC | https://paperswithcode.com/paper/word-class-embeddings-for-multiclass-text |
Repo | https://github.com/AlexMoreo/word-class-embeddings |
Framework | pytorch |
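A rough sketch of the word-class embedding idea: each word receives a dense vector of length equal to the number of classes, built from supervised word-class co-occurrence statistics, and that vector is concatenated to the pre-trained embedding. The specific statistic below (L1-normalized class-conditional document frequency) is an illustrative assumption; the paper's exact correlation measure may differ.

```python
import numpy as np

def word_class_embeddings(X, y, n_classes):
    """X: binary document-term matrix (n_docs, vocab); y: class index per doc.
    Returns a (vocab, n_classes) matrix summarizing how strongly each word
    co-occurs with each class, to be concatenated to pre-trained embeddings."""
    Y = np.eye(n_classes)[y]                                   # one-hot labels
    counts = X.T @ Y                                           # word-class co-occurrence
    return counts / np.clip(counts.sum(axis=1, keepdims=True), 1, None)

# toy usage
X = np.random.binomial(1, 0.1, size=(100, 500)).astype(float)
y = np.random.randint(0, 4, size=100)
E_wce = word_class_embeddings(X, y, n_classes=4)               # shape (500, 4)
```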
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
Title | Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication |
Authors | Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi |
Abstract | We consider decentralized stochastic optimization with the objective function (e.g., data samples for a machine learning task) being distributed over $n$ machines that can only communicate with their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by $\omega \leq 1$ ($\omega=1$ meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher order terms, the first term in the rate, $\mathcal{O}(1/(nT))$, is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, CHOCO-GOSSIP, for the average consensus problem that converges in time $\mathcal{O}(1/(\delta^2\omega) \log (1/\epsilon))$ for accuracy $\epsilon > 0$. This is (to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for $\omega > 0$ and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms outperform the respective state-of-the-art baselines and that CHOCO-SGD can reduce communication by at least two orders of magnitude. |
Tasks | Stochastic Optimization |
Published | 2019-02-01 |
URL | http://arxiv.org/abs/1902.00340v1 |
http://arxiv.org/pdf/1902.00340v1.pdf | |
PWC | https://paperswithcode.com/paper/decentralized-stochastic-optimization-and |
Repo | https://github.com/JYWa/MATCHA |
Framework | pytorch |
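To make the compressed-gossip mechanism concrete, here is a single-machine NumPy simulation of one round in the spirit of CHOCO-GOSSIP: each node keeps a private value and a publicly shared estimate, communicates only a compressed difference, and mixes using the shared estimates. The step size, compression operator, and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def compress_topk(v, k):
    """A biased top-k compression operator (one of the operators covered by omega)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def choco_gossip_round(x, x_hat, W, gamma, k):
    """x: (n, d) private values; x_hat: (n, d) shared estimates;
    W: (n, n) doubly stochastic mixing matrix of the communication graph."""
    n = x.shape[0]
    q = np.stack([compress_topk(x[i] - x_hat[i], k) for i in range(n)])
    x_hat = x_hat + q                       # all nodes update the shared copies
    x = x + gamma * (W @ x_hat - x_hat)     # mix using only the shared estimates
    return x, x_hat

# toy consensus run on a ring of 4 nodes
W = np.array([[.5, .25, 0, .25], [.25, .5, .25, 0],
              [0, .25, .5, .25], [.25, 0, .25, .5]])
x, x_hat = np.random.randn(4, 10), np.zeros((4, 10))
for _ in range(200):
    x, x_hat = choco_gossip_round(x, x_hat, W, gamma=0.2, k=3)
```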
Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization
Title | Ordered SGD: A New Stochastic Optimization Framework for Empirical Risk Minimization |
Authors | Kenji Kawaguchi, Haihao Lu |
Abstract | We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. The traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses. On the theory side, we show that the proposed method minimizes a new ordered modification of the empirical average loss, and is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for weakly convex (non-convex) loss. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, the numerical experiments show that our proposed method consistently improves the test errors compared with the standard mini-batch SGD in various models including SVM, logistic regression, and deep learning problems. |
Tasks | Stochastic Optimization |
Published | 2019-07-09 |
URL | https://arxiv.org/abs/1907.04371v5 |
https://arxiv.org/pdf/1907.04371v5.pdf | |
PWC | https://paperswithcode.com/paper/a-stochastic-first-order-method-for-ordered |
Repo | https://github.com/kenjikawaguchi/qSGD |
Framework | pytorch |
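The biased estimator described above is easy to sketch: compute per-sample losses on a mini-batch, keep only the q largest, and back-propagate through those. The snippet below is an illustrative PyTorch sketch (averaging the top-q losses; the paper's exact scaling may differ), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ordered_sgd_step(model, optimizer, x, y, q):
    """One step biased toward high-loss samples: back-propagate only the
    q largest per-sample losses in the mini-batch."""
    optimizer.zero_grad()
    losses = F.cross_entropy(model(x), y, reduction="none")  # per-sample losses
    topq, _ = torch.topk(losses, q)                          # the q hardest samples
    loss = topq.mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with an illustrative linear model
model = torch.nn.Linear(20, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 20), torch.randint(0, 3, (64,))
ordered_sgd_step(model, opt, x, y, q=16)
```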
Text Generation from Knowledge Graphs with Graph Transformers
Title | Text Generation from Knowledge Graphs with Graph Transformers |
Authors | Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, Hannaneh Hajishirzi |
Abstract | Generating texts which express complex ideas spanning multiple sentences requires a structured representation of their content (document plan), but these representations are prohibitively expensive to manually produce. In this work, we address the problem of generating coherent multi-sentence texts from the output of an information extraction system, and in particular a knowledge graph. Graphical knowledge representations are ubiquitous in computing, but pose a significant challenge for text generation techniques due to their non-hierarchical nature, collapsing of long-distance dependencies, and structural variety. We introduce a novel graph transforming encoder which can leverage the relational structure of such knowledge graphs without imposing linearization or hierarchical constraints. Incorporated into an encoder-decoder setup, we provide an end-to-end trainable system for graph-to-text generation that we apply to the domain of scientific text. Automatic and human evaluations show that our technique produces more informative texts which exhibit better document structure than competitive encoder-decoder methods. |
Tasks | Knowledge Graphs, Text Generation |
Published | 2019-04-04 |
URL | https://arxiv.org/abs/1904.02342v2 |
https://arxiv.org/pdf/1904.02342v2.pdf | |
PWC | https://paperswithcode.com/paper/text-generation-from-knowledge-graphs-with |
Repo | https://github.com/rikdz/GraphWriter |
Framework | pytorch |
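The core idea of encoding a graph without linearizing it can be illustrated with attention restricted to graph neighbors. The single-head sketch below omits relation embeddings, multi-head attention, and the decoder, so it is only a toy illustration of the mechanism, not the GraphWriter model.

```python
import torch
import torch.nn.functional as F

def graph_attention(node_feats, adj, Wq, Wk, Wv):
    """Scaled dot-product attention in which each node attends only to its
    neighbors in the graph (adjacency mask); single head, no relation types."""
    q, k, v = node_feats @ Wq, node_feats @ Wk, node_feats @ Wv
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    scores = scores.masked_fill(adj == 0, float("-inf"))   # keep only graph edges
    return F.softmax(scores, dim=-1) @ v                   # contextualized node states

# toy graph with 4 nodes; self-loops keep every softmax row well defined
d = 8
x = torch.randn(4, d)
adj = torch.eye(4) + torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0],
                                   [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
h = graph_attention(x, adj, Wq, Wk, Wv)
```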
Weight Standardization
Title | Weight Standardization |
Authors | Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, Alan Yuille |
Abstract | In this paper, we propose Weight Standardization (WS) to accelerate deep network training. WS is targeted at the micro-batch training setting where each GPU typically has only 1-2 images for training. The micro-batch training setting is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performance of BN in large-batch training. WS addresses this problem: when used with Group Normalization and trained with 1 image/GPU, WS is able to match or outperform BN trained with large batch sizes, requiring only two additional lines of code. In micro-batch training, WS significantly outperforms other normalization methods. WS achieves these superior results by standardizing the weights in the convolutional layers, which we show smooths the loss landscape by reducing the Lipschitz constants of the loss and the gradients. The effectiveness of WS is verified on many tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. The code is available here: https://github.com/joe-siyuan-qiao/WeightStandardization. |
Tasks | Image Classification, Instance Segmentation, Object Detection, Semantic Segmentation, Video Recognition |
Published | 2019-03-25 |
URL | http://arxiv.org/abs/1903.10520v1 |
http://arxiv.org/pdf/1903.10520v1.pdf | |
PWC | https://paperswithcode.com/paper/weight-standardization |
Repo | https://github.com/joe-siyuan-qiao/WeightStandardization |
Framework | tf |
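Weight Standardization is compact enough to sketch directly: standardize each convolutional filter's weights to zero mean and unit variance before applying the convolution, typically alongside Group Normalization. The PyTorch sketch below follows the common formulation (the epsilon value is an assumption).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized per output filter before the
    convolution, in the spirit of Weight Standardization."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# usage: WS is typically paired with GroupNorm for micro-batch training
layer = nn.Sequential(WSConv2d(3, 16, 3, padding=1), nn.GroupNorm(8, 16), nn.ReLU())
out = layer(torch.randn(2, 3, 32, 32))
```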
On Direct Distribution Matching for Adapting Segmentation Networks
Title | On Direct Distribution Matching for Adapting Segmentation Networks |
Authors | Georg Pichler, Jose Dolz, Ismail Ben Ayed, Pablo Piantanida |
Abstract | Minimization of distribution matching losses is a principled approach to domain adaptation in the context of image classification. However, it is largely overlooked in adapting segmentation networks, which is currently dominated by adversarial models. We propose a class of loss functions that encourage direct kernel density matching in the network-output space, up to some geometric transformations computed from unlabeled inputs. Rather than using an intermediate domain discriminator, our direct approach unifies distribution matching and segmentation in a single loss. Therefore, it simplifies segmentation adaptation by avoiding extra adversarial steps, while improving the quality, stability, and efficiency of training. We juxtapose our approach to state-of-the-art segmentation adaptation via adversarial training in the network-output space. In the challenging task of adapting brain segmentation across different magnetic resonance imaging (MRI) modalities, our approach achieves significantly better results in terms of both accuracy and stability. |
Tasks | Brain Segmentation, Domain Adaptation, Image Classification |
Published | 2019-04-04 |
URL | http://arxiv.org/abs/1904.02657v1 |
http://arxiv.org/pdf/1904.02657v1.pdf | |
PWC | https://paperswithcode.com/paper/on-direct-distribution-matching-for-adapting |
Repo | https://github.com/anonymauthor/DDMSegNet |
Framework | none |
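As a rough illustration of matching output distributions directly (without a domain discriminator), the sketch below uses an RBF-kernel two-sample (MMD-style) penalty between network outputs on the two domains. This is a generic stand-in for the class of losses the abstract describes; the paper's actual kernel density matching losses and geometric transformations are not reproduced here.

```python
import torch

def rbf_mmd(p, q, sigma=1.0):
    """Kernel two-sample discrepancy between two sets of network outputs
    (p: (n, d), q: (m, d)), used here as a simple distribution-matching loss."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return gram(p, p).mean() + gram(q, q).mean() - 2 * gram(p, q).mean()

# hypothetical usage: supervised loss on the labeled source domain plus a
# matching loss between source and target output distributions
# total = ce_loss(src_logits, src_labels) + lam * rbf_mmd(
#     src_probs.flatten(0, -2), tgt_probs.flatten(0, -2))
```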
Online Continual Learning with Maximally Interfered Retrieval
Title | Online Continual Learning with Maximally Interfered Retrieval |
Authors | Rahaf Aljundi, Lucas Caccia, Eugene Belilovsky, Massimo Caccia, Min Lin, Laurent Charlin, Tinne Tuytelaars |
Abstract | Continual learning, the setting where a learning agent is faced with a never ending stream of data, continues to be a great challenge for modern machine learning systems. In particular the online or “single-pass through the data” setting has gained attention recently as a natural setting that is difficult to tackle. Methods based on replay, either generative or from a stored memory, have been shown to be effective approaches for continual learning, matching or exceeding the state of the art in a number of standard benchmarks. These approaches typically rely on randomly selecting samples from the replay memory or from a generative model, which is suboptimal. In this work, we consider a controlled sampling of memories for replay. We retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameters update. We show a formulation for this sampling criterion in both the generative replay and the experience replay setting, producing consistent gains in performance and greatly reduced forgetting. We release an implementation of our method at https://github.com/optimass/Maximally_Interfered_Retrieval. |
Tasks | Continual Learning |
Published | 2019-08-11 |
URL | https://arxiv.org/abs/1908.04742v3 |
https://arxiv.org/pdf/1908.04742v3.pdf | |
PWC | https://paperswithcode.com/paper/online-continual-learning-with-maximally |
Repo | https://github.com/optimass/Maximally_Interfered_Retrieval |
Framework | pytorch |
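The retrieval criterion can be sketched directly: simulate the parameter update the incoming batch would cause, then pick the memory samples whose loss would increase the most under that virtual update. The PyTorch sketch below is an illustrative reading of that criterion (classification losses, plain SGD virtual step), not the released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def retrieve_max_interfered(model, lr, x_new, y_new, mem_x, mem_y, k):
    """Return the k memory samples whose loss grows the most after a
    virtual SGD step on the incoming batch (x_new, y_new)."""
    with torch.no_grad():
        pre = F.cross_entropy(model(mem_x), mem_y, reduction="none")

    virtual = copy.deepcopy(model)                      # simulate the update
    loss = F.cross_entropy(virtual(x_new), y_new)
    grads = torch.autograd.grad(loss, list(virtual.parameters()))
    with torch.no_grad():
        for p, g in zip(virtual.parameters(), grads):
            p -= lr * g
        post = F.cross_entropy(virtual(mem_x), mem_y, reduction="none")

    idx = torch.topk(post - pre, k).indices             # most interfered samples
    return mem_x[idx], mem_y[idx]
```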
3D Whole Brain Segmentation using Spatially Localized Atlas Network Tiles
Title | 3D Whole Brain Segmentation using Spatially Localized Atlas Network Tiles |
Authors | Yuankai Huo, Zhoubing Xu, Yunxi Xiong, Katherine Aboud, Prasanna Parvathaneni, Shunxing Bao, Camilo Bermudez, Susan M. Resnick, Laurie E. Cutting, Bennett A. Landman |
Abstract | Detailed whole brain segmentation is an essential quantitative technique, which provides a non-invasive way of measuring brain regions from structural magnetic resonance imaging (MRI). Recently, deep convolutional neural networks (CNNs) have been applied to whole brain segmentation. However, restricted by current GPU memory, 2D based methods, downsampling based 3D CNN methods, and patch-based high-resolution 3D CNN methods have been the de facto standard solutions. 3D patch-based high-resolution methods typically yield superior performance among CNN approaches on detailed whole brain segmentation (>100 labels); however, their performance is still commonly inferior to that of multi-atlas segmentation (MAS) methods due to the following challenges: (1) a single network is typically used to learn both spatial and contextual information for the patches, (2) limited manually traced whole brain volumes are available (typically less than 50) for training a network. In this work, we propose the spatially localized atlas network tiles (SLANT) method to distribute multiple independent 3D fully convolutional networks (FCN) for high-resolution whole brain segmentation. To address the first challenge, multiple spatially distributed networks were used in the SLANT method, in which each network learned contextual information for a fixed spatial location. To address the second challenge, auxiliary labels on 5111 initially unlabeled scans were created by multi-atlas segmentation for training. Since the method integrates multiple traditional medical image processing methods with deep learning, we developed a containerized pipeline to deploy the end-to-end solution. From the results, the proposed method achieved superior performance compared with multi-atlas segmentation methods, while reducing the computational time from >30 hours to 15 minutes (https://github.com/MASILab/SLANTbrainSeg). |
Tasks | Brain Segmentation |
Published | 2019-03-28 |
URL | http://arxiv.org/abs/1903.12152v1 |
http://arxiv.org/pdf/1903.12152v1.pdf | |
PWC | https://paperswithcode.com/paper/3d-whole-brain-segmentation-using-spatially |
Repo | https://github.com/MASILab/SLANT_brain_seg |
Framework | caffe2 |
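The tiling-and-fusion idea can be sketched independently of the networks themselves: cover the volume with overlapping subvolumes at fixed locations, run the network assigned to each location, and fuse overlapping label probabilities by averaging. In the sketch below, `nets` is a hypothetical mapping from tile start coordinates to per-location networks, and the tile sizes are illustrative; this is not the SLANT pipeline.

```python
import torch

def tiled_segmentation(volume, nets, tile, overlap, n_labels):
    """volume: (D, H, W) tensor; nets: dict mapping tile start coords to a
    network returning (1, n_labels, dz, dy, dx) label probabilities."""
    D, H, W = volume.shape
    probs = torch.zeros(n_labels, D, H, W)
    hits = torch.zeros(D, H, W)
    step = tuple(t - overlap for t in tile)              # assumes tile > overlap
    starts = [(z, y, x)
              for z in range(0, max(D - tile[0], 0) + 1, step[0])
              for y in range(0, max(H - tile[1], 0) + 1, step[1])
              for x in range(0, max(W - tile[2], 0) + 1, step[2])]
    for z, y, x in starts:
        patch = volume[z:z+tile[0], y:y+tile[1], x:x+tile[2]]
        p = nets[(z, y, x)](patch[None, None])[0]        # per-location network
        probs[:, z:z+tile[0], y:y+tile[1], x:x+tile[2]] += p
        hits[z:z+tile[0], y:y+tile[1], x:x+tile[2]] += 1
    return (probs / hits.clamp(min=1)).argmax(dim=0)     # fused label map
```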
Learning to Route in Similarity Graphs
Title | Learning to Route in Similarity Graphs |
Authors | Dmitry Baranchuk, Dmitry Persiyanov, Anton Sinitsin, Artem Babenko |
Abstract | Recently, similarity graphs have become the leading paradigm for efficient nearest neighbor search, outperforming traditional tree-based and LSH-based methods. Similarity graphs perform the search via greedy routing: a query traverses the graph and in each vertex moves to the adjacent vertex that is the closest to this query. In practice, similarity graphs are often susceptible to local minima, when queries do not reach their nearest neighbors, getting stuck in suboptimal vertices. In this paper, we propose to learn the routing function that overcomes local minima via incorporating information about the graph's global structure. In particular, we augment the vertices of a given graph with additional representations that are learned to provide the optimal routing from the start vertex to the query nearest neighbor. By thorough experiments, we demonstrate that the proposed learnable routing successfully diminishes the local minima problem and significantly improves the overall search performance. |
Tasks | |
Published | 2019-05-27 |
URL | https://arxiv.org/abs/1905.10987v1 |
https://arxiv.org/pdf/1905.10987v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-to-route-in-similarity-graphs |
Repo | https://github.com/dbaranchuk/learning-to-route |
Framework | pytorch |
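Greedy routing, the procedure being improved here, is worth seeing in miniature: from the current vertex, move to the neighbor whose representation is closest to the query, and stop at a local minimum. Plugging learned vertex representations into `reps` in place of the raw vectors is the gist of the proposal; the toy sketch below is not the authors' code.

```python
import numpy as np

def greedy_route(graph, reps, query, start):
    """graph: dict of adjacency lists; reps: (n_vertices, d) representations."""
    cur = start
    while True:
        cand = graph[cur] + [cur]
        best = min(cand, key=lambda v: np.linalg.norm(reps[v] - query))
        if best == cur:                      # no neighbor is closer: local minimum
            return cur
        cur = best

# toy graph over 5 vertices with 2-d representations
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
reps = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 2]], dtype=float)
print(greedy_route(graph, reps, query=np.array([1.8, 1.9]), start=0))  # reaches vertex 4
```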
Hierarchical Decision Making by Generating and Following Natural Language Instructions
Title | Hierarchical Decision Making by Generating and Following Natural Language Instructions |
Authors | Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, Mike Lewis |
Abstract | We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. We gather a dataset of 76 thousand pairs of instructions and executions from human play, and train instructor and executor models. Experiments show that models using natural language as a latent variable significantly outperform models that directly imitate human actions. The compositional structure of language proves crucial to its effectiveness for action representation. We also release our code, models and data. |
Tasks | Decision Making |
Published | 2019-06-03 |
URL | https://arxiv.org/abs/1906.00744v5 |
https://arxiv.org/pdf/1906.00744v5.pdf | |
PWC | https://paperswithcode.com/paper/190600744 |
Repo | https://github.com/facebookresearch/minirts |
Framework | pytorch |
Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning
Title | Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning |
Authors | Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, Song-Chun Zhu |
Abstract | This paper addresses a new problem of understanding human gaze communication in social videos at both the atomic level and the event level, which is significant for studying human social interactions. To tackle this novel and challenging problem, we contribute a large-scale video dataset, VACATION, which covers diverse daily social scenes and gaze communication behaviors with complete annotations of objects and human faces, human attention, and communication structures and labels at both the atomic and event levels. Together with VACATION, we propose a spatio-temporal graph neural network to explicitly represent the diverse gaze interactions in the social scenes and to infer atomic-level gaze communication by message passing. We further propose an event network with an encoder-decoder structure to predict event-level gaze communication. Our experiments demonstrate that the proposed model improves various baselines significantly in predicting atomic-level and event-level gaze communication. |
Tasks | |
Published | 2019-09-04 |
URL | https://arxiv.org/abs/1909.02144v1 |
https://arxiv.org/pdf/1909.02144v1.pdf | |
PWC | https://paperswithcode.com/paper/understanding-human-gaze-communication-by |
Repo | https://github.com/LifengFan/Human-Gaze-Communication |
Framework | none |
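The message-passing machinery underlying the atomic-level inference can be illustrated with a single generic round over a graph of faces and objects: messages flow along edges, are aggregated per node, and update the node states. Node and edge semantics below are illustrative; this is not the paper's VACATION model.

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    """One generic round of message passing over a graph with N nodes."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, h, adj):
        # h: (N, dim) node states, adj: (N, N) binary adjacency
        N = h.shape[0]
        pair = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                          h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        m = self.msg(pair) * adj.unsqueeze(-1)   # messages along edges only
        return self.upd(m.sum(dim=1), h)         # aggregate and update node states

# usage: 5 nodes, fully connected without self-loops
step = MessagePassingStep(32)
h = step(torch.randn(5, 32), torch.ones(5, 5) - torch.eye(5))
```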
LiFF: Light Field Features in Scale and Depth
Title | LiFF: Light Field Features in Scale and Depth |
Authors | Donald G. Dansereau, Bernd Girod, Gordon Wetzstein |
Abstract | Feature detectors and descriptors are key low-level vision tools that many higher-level tasks build on. Unfortunately these fail in the presence of challenging light transport effects including partial occlusion, low contrast, and reflective or refractive surfaces. Building on spatio-angular imaging modalities offered by emerging light field cameras, we introduce a new and computationally efficient 4D light field feature detector and descriptor: LiFF. LiFF is scale invariant and utilizes the full 4D light field to detect features that are robust to changes in perspective. This is particularly useful for structure from motion (SfM) and other tasks that match features across viewpoints of a scene. We demonstrate significantly improved 3D reconstructions via SfM when using LiFF instead of the leading 2D or 4D features, and show that LiFF runs an order of magnitude faster than the leading 4D approach. Finally, LiFF inherently estimates depth for each feature, opening a path for future research in light field-based SfM. |
Tasks | |
Published | 2019-01-13 |
URL | http://arxiv.org/abs/1901.03916v1 |
http://arxiv.org/pdf/1901.03916v1.pdf | |
PWC | https://paperswithcode.com/paper/liff-light-field-features-in-scale-and-depth |
Repo | https://github.com/doda42/LiFF |
Framework | none |
Deep Reinforcement Learning Designed RF Pulse: $DeepRF_{SLR}$
Title | Deep Reinforcement Learning Designed RF Pulse: $DeepRF_{SLR}$ |
Authors | Dongmyung Shin, Sooyeon Ji, Doohee Lee, Jieun Lee, Se-Hong Oh, Jongho Lee |
Abstract | A novel approach of applying deep reinforcement learning to RF pulse design is introduced. This method, which is referred to as $DeepRF_{SLR}$, is designed to minimize the peak amplitude or, equivalently, minimize the pulse duration of a multiband refocusing pulse generated by the Shinnar-Le Roux (SLR) algorithm. In the method, the root pattern of the SLR polynomial, which determines the RF pulse shape, is optimized by iterative applications of deep reinforcement learning and greedy tree search. When tested on the design of RF pulses with multiband factors of three and seven, $DeepRF_{SLR}$ demonstrated improved performance compared to conventional methods, generating shorter-duration RF pulses in shorter computational time. In the experiments, the RF pulse from $DeepRF_{SLR}$ produced a slice profile similar to that of the minimum-phase SLR RF pulse, and the profiles matched those from the computer simulation. Our approach suggests a new way of designing RF pulses by applying a machine learning algorithm, demonstrating a machine-designed MRI sequence. |
Tasks | |
Published | 2019-12-19 |
URL | https://arxiv.org/abs/1912.09015v1 |
https://arxiv.org/pdf/1912.09015v1.pdf | |
PWC | https://paperswithcode.com/paper/deep-reinforcement-learning-designed-rf-pulse |
Repo | https://github.com/SNU-LIST/DeepRF_SLR |
Framework | tf |