January 26, 2020

Paper Group ANR 1610

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors


Title	GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors
Authors	Masato Hagiwara, Masato Mita
Abstract	The lack of large-scale datasets has been a major hindrance to the development of NLP tasks such as spelling correction and grammatical error correction (GEC). As a complementary new resource for these tasks, we present the GitHub Typo Corpus, a large-scale, multilingual dataset of misspellings and grammatical errors along with their corrections harvested from GitHub, a large and popular platform for hosting and sharing git repositories. The dataset, which we have made publicly available, contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date. We also describe our process for filtering true typo edits based on learned classifiers on a small annotated subset, and demonstrate that typo edits can be identified with F1 ~ 0.9 using a very simple classifier with only three features. The detailed analyses of the dataset show that existing spelling correctors merely achieve an F-measure of approx. 0.5, suggesting that the dataset serves as a new, rich source of spelling errors that complement existing datasets.
Tasks	Grammatical Error Correction, Spelling Correction
Published	2019-11-28
URL	https://arxiv.org/abs/1911.12893v1
PDF	https://arxiv.org/pdf/1911.12893v1.pdf
PWC	https://paperswithcode.com/paper/github-typo-corpus-a-large-scale-multilingual
Repo
Framework

Improving Grammatical Error Correction with Machine Translation Pairs


Title	Improving Grammatical Error Correction with Machine Translation Pairs
Authors	Wangchunshu Zhou, Tao Ge, Chang Mu, Ke Xu, Furu Wei, Ming Zhou
Abstract	We propose a novel data synthesis method to generate diverse error-corrected sentence pairs for improving grammatical error correction, which is based on a pair of machine translation models of different qualities (i.e., poor and good). The poor translation model resembles the ESL (English as a second language) learner and tends to generate translations of low quality in terms of fluency and grammatical correctness, while the good translation model generally generates fluent and grammatically correct translations. We build the poor and good translation models with a phrase-based statistical machine translation model with decreased language model weight and a neural machine translation model, respectively. By taking the pair of their translations of the same sentences in a bridge language as error-corrected sentence pairs, we can construct unlimited pseudo parallel data. Our approach is capable of generating diverse fluency-improving patterns without being limited by the pre-defined rule set and the seed error-corrected data. Experimental results demonstrate the effectiveness of our approach and show that it can be combined with other synthetic data sources to yield further improvements.
Tasks	Grammatical Error Correction, Language Modelling, Machine Translation
Published	2019-11-07
URL	https://arxiv.org/abs/1911.02825v1
PDF	https://arxiv.org/pdf/1911.02825v1.pdf
PWC	https://paperswithcode.com/paper/improving-grammatical-error-correction-with
Repo
Framework
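
The pairing step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `poor_translate` and `good_translate` are hypothetical callables standing in for the phrase-based and neural translation models, and dropping identical outputs is an assumption made here because such pairs carry no correction signal.

```python
from typing import Callable, Iterable, List, Tuple


def synthesize_pairs(
    bridge_sentences: Iterable[str],
    poor_translate: Callable[[str], str],   # hypothetical: weak MT system
    good_translate: Callable[[str], str],   # hypothetical: strong MT system
) -> List[Tuple[str, str]]:
    """Translate each bridge-language sentence with both systems and
    return (erroneous, corrected) pseudo-parallel pairs."""
    pairs = []
    for sentence in bridge_sentences:
        source = poor_translate(sentence)   # plays the role of learner text
        target = good_translate(sentence)   # plays the role of the correction
        # Assumption: skip pairs where both systems agree.
        if source != target:
            pairs.append((source, target))
    return pairs
```

Because the only inputs are monolingual bridge-language sentences, the amount of synthetic data is bounded by the size of that corpus rather than by any annotated error data.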

Abusive Language Detection in Online Conversations by Combining Content-and Graph-based Features


Title	Abusive Language Detection in Online Conversations by Combining Content-and Graph-based Features
Authors	Noé Cecillon, Vincent Labatut, Richard Dufour, Georges Linarès
Abstract	In recent years, online social networks have allowed worldwide users to meet and discuss. As guarantors of these communities, the administrators of these platforms must prevent users from adopting inappropriate behaviors. This verification task, mainly done by humans, is more and more difficult due to the ever-growing amount of messages to check. Methods have been proposed to automate this moderation process, mainly by providing approaches based on the textual content of the exchanged messages. Recent work has also shown that characteristics derived from the structure of conversations, in the form of conversational graphs, can help detect these abusive messages. In this paper, we propose to take advantage of both sources of information by proposing fusion methods integrating content- and graph-based features. Our experiments on raw chat logs show that the content of the messages and their dynamics within a conversation contain partially complementary information, allowing performance improvements on an abusive message classification task with a final F-measure of 93.26%.
Tasks
Published	2019-05-20
URL	https://arxiv.org/abs/1905.07894v1
PDF	https://arxiv.org/pdf/1905.07894v1.pdf
PWC	https://paperswithcode.com/paper/abusive-language-detection-in-online
Repo
Framework

Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation


Title	Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation
Authors	Siyuan Feng, Tan Lee
Abstract	This study tackles unsupervised subword modeling in the zero-resource scenario, learning frame-level speech representation that is phonetically discriminative and speaker-invariant, using only untranscribed speech for target languages. Frame label acquisition is an essential step in solving this problem. High quality frame labels should be in good consistency with golden transcriptions and robust to speaker variation. We propose to improve frame label acquisition in our previously adopted deep neural network-bottleneck feature (DNN-BNF) architecture by applying the factorized hierarchical variational autoencoder (FHVAE). FHVAEs learn to disentangle linguistic content and speaker identity information encoded in speech. By discarding or unifying speaker information, speaker-invariant features are learned and fed as inputs to DPGMM frame clustering and DNN-BNF training. Experiments conducted on ZeroSpeech 2017 show that our proposed approaches achieve $2.4\%$ and $0.6\%$ absolute ABX error rate reductions in across- and within-speaker conditions, compared to the baseline DNN-BNF system without applying FHVAEs. Our proposed approaches significantly outperform vocal tract length normalization in improving frame labeling and subword modeling.
Tasks	Representation Learning
Published	2019-06-17
URL	https://arxiv.org/abs/1906.07245v2
PDF	https://arxiv.org/pdf/1906.07245v2.pdf
PWC	https://paperswithcode.com/paper/improving-unsupervised-subword-modeling-via
Repo
Framework

Aggregated Pairwise Classification of Statistical Shapes


Title	Aggregated Pairwise Classification of Statistical Shapes
Authors	Min Ho Cho, Sebastian Kurtek, Steven N. MacEachern
Abstract	The classification of shapes is of great interest in diverse areas ranging from medical imaging to computer vision and beyond. While many statistical frameworks have been developed for the classification problem, most are strongly tied to early formulations of the problem - with an object to be classified described as a vector in a relatively low-dimensional Euclidean space. Statistical shape data have two main properties that suggest a need for a novel approach: (i) shapes are inherently infinite dimensional with strong dependence among the positions of nearby points, and (ii) shape space is not Euclidean, but is fundamentally curved. To accommodate these features of the data, we work with the square-root velocity function of the curves to provide a useful formal description of the shape, pass to tangent spaces of the manifold of shapes at different projection points which effectively separate shapes for pairwise classification in the training data, and use principal components within these tangent spaces to reduce dimensionality. We illustrate the impact of the projection point and choice of subspace on the misclassification rate with a novel method of combining pairwise classifiers.
Tasks
Published	2019-01-22
URL	http://arxiv.org/abs/1901.07593v1
PDF	http://arxiv.org/pdf/1901.07593v1.pdf
PWC	https://paperswithcode.com/paper/aggregated-pairwise-classification-of
Repo
Framework

A Communication Efficient Vertical Federated Learning Framework


Title	A Communication Efficient Vertical Federated Learning Framework
Authors	Yang Liu, Yan Kang, Xinwei Zhang, Liping Li, Yong Cheng, Tianjian Chen, Mingyi Hong, Qiang Yang
Abstract	One critical challenge for applying today’s Artificial Intelligence (AI) technologies to real-world applications is the common existence of data silos across different organizations. Due to legal, privacy and other practical constraints, data from different organizations cannot be easily integrated. Federated learning (FL), especially the vertical FL (VFL), allows multiple parties holding different sets of attributes about the same user to collaboratively build models while preserving user privacy. However, communication overhead is a principal bottleneck since the existing VFL protocols require per-iteration communications among all parties. In this paper, we propose the Federated Stochastic Block Coordinate Descent (FedBCD) to effectively reduce the communication rounds for VFL. We show that when the batch size, sample size, and the local iterations are selected appropriately, the algorithm requires $\mathcal{O}(\sqrt{T})$ communication rounds to achieve $\mathcal{O}(1/\sqrt{T})$ accuracy. Finally, we demonstrate the performance of FedBCD on several models and datasets, and on a large-scale industrial platform for VFL.
Tasks
Published	2019-12-24
URL	https://arxiv.org/abs/1912.11187v2
PDF	https://arxiv.org/pdf/1912.11187v2.pdf
PWC	https://paperswithcode.com/paper/a-communication-efficient-vertical-federated
Repo
Framework
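
The idea of running several local updates between communication rounds, as the abstract above describes, can be illustrated with a toy sketch. It is not the FedBCD algorithm itself: it uses full-batch gradients instead of stochastic mini-batches, shares the labels with both parties, and applies no privacy protection. The data, the function names, and the hyperparameters are all hypothetical.

```python
# Two-party vertical setting: each party holds one feature column and one weight.
# Hypothetical toy data; the labels satisfy y = 2*x_a - 3*x_b exactly.
X_A = [1.0, 2.0, 3.0, 4.0]
X_B = [1.0, 0.0, -1.0, 0.0]
Y = [2.0 * a - 3.0 * b for a, b in zip(X_A, X_B)]


def local_steps(x_own, w_own, other_part, y, lr, q):
    """Run q gradient steps on one party's weight for the squared loss,
    holding the other party's last communicated contribution fixed."""
    n = len(y)
    for _ in range(q):
        grad = sum((w_own * xi + oi - yi) * xi
                   for xi, oi, yi in zip(x_own, other_part, y)) / n
        w_own -= lr * grad
    return w_own


def fedbcd_sketch(rounds=100, q=5, lr=0.1):
    """Alternate one exchange of intermediate results with q local steps."""
    w_a, w_b = 0.0, 0.0
    for _ in range(rounds):
        # One communication round: parties exchange their partial predictions.
        part_a = [w_a * x for x in X_A]
        part_b = [w_b * x for x in X_B]
        # Both parties update in parallel using the (now stale) exchange.
        w_a = local_steps(X_A, w_a, part_b, Y, lr, q)
        w_b = local_steps(X_B, w_b, part_a, Y, lr, q)
    return w_a, w_b
```

With `q = 1` every gradient step costs a communication round; raising `q` trades fresher information for fewer exchanges, which is the trade-off the abstract refers to.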

High-resolution Markov state models for the dynamics of Trp-cage miniprotein constructed over slow folding modes identified by state-free reversible VAMPnets


Title	High-resolution Markov state models for the dynamics of Trp-cage miniprotein constructed over slow folding modes identified by state-free reversible VAMPnets
Authors	Hythem Sidky, Wei Chen, Andrew L. Ferguson
Abstract	State-free reversible VAMPnets (SRVs) are a neural network-based framework capable of learning the leading eigenfunctions of the transfer operator of a dynamical system from trajectory data. In molecular dynamics simulations, these data-driven collective variables (CVs) capture the slowest modes of the dynamics and are useful for enhanced sampling and free energy estimation. In this work, we employ SRV coordinates as a feature set for Markov state model (MSM) construction. Compared to the current state of the art, MSMs constructed from SRV coordinates are more robust to the choice of input features, exhibit faster implied timescale convergence, and permit the use of shorter lagtimes to construct higher kinetic resolution models. We apply this methodology to study the folding kinetics and conformational landscape of the Trp-cage miniprotein. Folding and unfolding mean first passage times are in good agreement with prior literature, and a nine macrostate model is presented. The unfolded ensemble comprises a central kinetic hub with interconversions to several metastable unfolded conformations and which serves as the gateway to the folded ensemble. The folded ensemble comprises the native state, a partially unfolded intermediate “loop” state, and a previously unreported short-lived intermediate that we were able to resolve due to the high time-resolution of the SRV-MSM. We propose SRVs as an excellent candidate for integration into modern MSM construction pipelines.
Tasks
Published	2019-06-12
URL	https://arxiv.org/abs/1906.04890v1
PDF	https://arxiv.org/pdf/1906.04890v1.pdf
PWC	https://paperswithcode.com/paper/high-resolution-markov-state-models-for-the
Repo
Framework
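
For readers unfamiliar with Markov state models, the basic estimation step mentioned in the abstract above can be sketched as follows. This is a generic illustration under simplifying assumptions, not the SRV-MSM pipeline of the paper: the trajectory is already discretized, no reversibility constraint is imposed, and the implied-timescale helper is restricted to two states so that the second eigenvalue has a closed form (trace minus one).

```python
import math


def count_transitions(traj, n_states, lag):
    """Count transitions i -> j observed at the given lag time."""
    counts = [[0] * n_states for _ in range(n_states)]
    for t in range(len(traj) - lag):
        counts[traj[t]][traj[t + lag]] += 1
    return counts


def transition_matrix(counts):
    """Row-normalize the count matrix into transition probabilities."""
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("state has no outgoing transitions")
        matrix.append([c / total for c in row])
    return matrix


def two_state_implied_timescale(T, lag):
    """Implied timescale -lag / ln(lambda_2) for a 2x2 row-stochastic
    matrix, whose eigenvalues are 1 and trace(T) - 1."""
    lam = T[0][0] + T[1][1] - 1.0
    if not 0.0 < lam < 1.0:
        raise ValueError("second eigenvalue must lie in (0, 1)")
    return -lag / math.log(lam)
```

In practice the implied timescales are recomputed for several lag times; a model is usually considered converged once they stop changing with the lag.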

Probabilistic Modeling with Matrix Product States


Title	Probabilistic Modeling with Matrix Product States
Authors	James Stokes, John Terilla
Abstract	Inspired by the possibility that generative models based on quantum circuits can provide a useful inductive bias for sequence modeling tasks, we propose an efficient training algorithm for a subset of classically simulable quantum circuit models. The gradient-free algorithm, presented as a sequence of exactly solvable effective models, is a modification of the density matrix renormalization group procedure adapted for learning a probability distribution. The conclusion that circuit-based models offer a useful inductive bias for classical datasets is supported by experimental results on the parity learning problem.
Tasks
Published	2019-02-19
URL	http://arxiv.org/abs/1902.06888v1
PDF	http://arxiv.org/pdf/1902.06888v1.pdf
PWC	https://paperswithcode.com/paper/probabilistic-modeling-with-matrix-product
Repo
Framework
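
As a rough illustration of how a matrix product state can define a probability distribution over bit strings, the sketch below scores a string by the squared value of a product of small matrices. It assumes a Born-rule style model, the same tensor at every site, and a brute-force normalization that is only feasible for very short strings. The tensors are hand-picked (not trained) so that only even-parity strings receive probability, echoing the parity problem mentioned in the abstract above. The paper's training algorithm is not shown.

```python
from itertools import product

# Site tensor for a bond-dimension-2 MPS, shared by every site (an assumption
# made for brevity). A[x] is the 2x2 matrix applied when the site reads bit x.
A = {
    0: [[1.0, 0.0], [0.0, 1.0]],  # bit 0 keeps the running parity
    1: [[0.0, 1.0], [1.0, 0.0]],  # bit 1 flips the running parity
}
LEFT = [1.0, 0.0]   # start in the "even" parity channel
RIGHT = [1.0, 0.0]  # accept only strings that end in the "even" channel


def amplitude(bits):
    """Contract the MPS along one bit string: LEFT * A[x1] * ... * A[xn] * RIGHT."""
    v = LEFT[:]
    for x in bits:
        m = A[x]
        v = [sum(v[i] * m[i][j] for i in range(len(v)))
             for j in range(len(m[0]))]
    return sum(vi * ri for vi, ri in zip(v, RIGHT))


def probability(bits):
    """Squared amplitude, normalized by summing over all strings of the
    same length (exponential cost; acceptable only for tiny examples)."""
    z = sum(amplitude(s) ** 2 for s in product((0, 1), repeat=len(bits)))
    return amplitude(bits) ** 2 / z
```

For length-3 strings this model spreads the probability uniformly over the four even-parity strings and assigns zero to the rest.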

Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder


Title	Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder
Authors	Yunsu Kim, Jiahui Geng, Hermann Ney
Abstract	Unsupervised learning of cross-lingual word embedding offers elegant matching of words across languages, but has fundamental limitations in translating sentences. In this paper, we propose simple yet effective methods to improve word-by-word translation of cross-lingual embeddings, using only monolingual corpora but without any back-translation. We integrate a language model for context-aware search, and use a novel denoising autoencoder to handle reordering. Our system surpasses state-of-the-art unsupervised neural translation systems without costly iterative training. We also analyze the effect of vocabulary size and denoising type on the translation performance, which provides better understanding of learning the cross-lingual word embedding and its usage in translation.
Tasks	Denoising, Language Modelling
Published	2019-01-06
URL	http://arxiv.org/abs/1901.01590v1
PDF	http://arxiv.org/pdf/1901.01590v1.pdf
PWC	https://paperswithcode.com/paper/improving-unsupervised-word-by-word
Repo
Framework

HABNet: Machine Learning, Remote Sensing Based Detection and Prediction of Harmful Algal Blooms


Title	HABNet: Machine Learning, Remote Sensing Based Detection and Prediction of Harmful Algal Blooms
Authors	P. R. Hill, A. Kumar, M. Temimi, D. R. Bull
Abstract	This paper describes the application of machine learning techniques to develop a state-of-the-art detection and prediction system for spatiotemporal events found within remote sensing data; specifically, Harmful Algal Bloom events (HABs). HABs cause a large variety of human health and environmental issues together with associated economic impacts. This work has focused specifically on the case study of the detection of Karenia Brevis Algae (K. brevis) HAB events within the coastal waters of Florida (over 2850 events from 2003 to 2018: an order of magnitude larger than any previous machine learning detection study into HAB events). The development of multimodal spatiotemporal datacube data structures and associated novel machine learning methods give a unique architecture for the automatic detection of environmental events. Specifically, when applied to the detection of HAB events it gives a maximum detection accuracy of 91% and a Kappa coefficient of 0.81 for the Florida data considered. A HAB prediction system was also developed where a temporal subset of each datacube was used to forecast the presence of a HAB in the future. This system was not significantly less accurate than the detection system, being able to predict with 86% accuracy up to 8 days in the future. The same datacube and machine learning structure were also applied to a more limited database of multi-species HAB events within the Arabian Gulf. The results for this additional study gave a classification accuracy of 93% and a Kappa coefficient of 0.83.
Tasks
Published	2019-12-04
URL	https://arxiv.org/abs/1912.02305v1
PDF	https://arxiv.org/pdf/1912.02305v1.pdf
PWC	https://paperswithcode.com/paper/habnet-machine-learning-remote-sensing-based
Repo
Framework
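
The abstract above reports both accuracy and a Kappa coefficient. Assuming this refers to Cohen's kappa, which corrects observed agreement for the agreement expected by chance, it can be computed from a confusion matrix as in the sketch below. The function name is hypothetical, and the confusion matrix used in the usage note is illustrative, not taken from the paper.

```python
def cohens_kappa(confusion):
    """Cohen's kappa for a square confusion matrix given as a list of rows
    (rows: true class, columns: predicted class)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: sum over classes of (row marginal * column marginal).
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(k)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)
```

For a balanced two-class matrix such as `[[45, 5], [5, 45]]`, the accuracy is 0.9 and chance agreement is 0.5, so kappa comes out at 0.8; this shows why kappa is typically lower than raw accuracy.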

A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera


Title	A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera
Authors	Huy Hieu Pham, Houssam Salmane, Louahdi Khoudour, Alain Crouzil, Pablo Zegers, Sergio A Velastin
Abstract	We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from RGB video sequences. Our approach proceeds along two stages. In the first, we run a real-time 2D pose detector to determine the precise pixel location of important keypoints of the body. A two-stream neural network is then designed and trained to map detected 2D keypoints into 3D poses. In the second, we deploy the Efficient Neural Architecture Search (ENAS) algorithm to find an optimal network architecture that is used for modeling the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and performing action recognition. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on the targeted tasks. Moreover, we show that our method requires a low computational budget for training and inference.
Tasks	3D Human Pose Estimation, 3D Pose Estimation, Neural Architecture Search, Pose Estimation
Published	2019-07-16
URL	https://arxiv.org/abs/1907.06968v1
PDF	https://arxiv.org/pdf/1907.06968v1.pdf
PWC	https://paperswithcode.com/paper/a-unified-deep-framework-for-joint-3d-pose
Repo
Framework

Computational Induction of Prosodic Structure


Title	Computational Induction of Prosodic Structure
Authors	Dafydd Gibbon
Abstract	The present study has two goals relating to the grammar of prosody, understood as the rhythms and melodies of speech. First, an overview is provided of the computable grammatical and phonetic approaches to prosody analysis which use hypothetico-deductive methods and are based on learned hermeneutic intuitions about language. Second, a proposal is presented for an inductive grounding in the physical signal, in which prosodic structure is inferred using a language-independent method from the low-frequency spectrum of the speech signal. The overview includes a discussion of computational aspects of standard generative and post-generative models, and suggestions for reformulating these to form inductive approaches. Also included is a discussion of linguistic phonetic approaches to analysis of annotations (pairs of speech unit labels with time-stamps) of recorded spoken utterances. The proposal introduces the inductive approach of Rhythm Formant Theory (RFT) and the associated Rhythm Formant Analysis (RFA) method, with the aim of filling a gap in the linguistic hypothetico-deductive cycle by grounding in a language-independent inductive procedure of speech signal analysis. The validity of the method is demonstrated and applied to rhythm patterns in read-aloud Mandarin Chinese, finding differences from English which are related to lexical and grammatical differences between the languages, as well as individual variation. The overall conclusions are (1) that normative language-to-language phonological or phonetic comparisons of rhythm, for example of Mandarin and English, are too simplistic, in view of diverse language-internal factors due to genre and style differences as well as utterance dynamics, and (2) that language-independent empirical grounding of rhythm in the physical signal is called for.
Tasks
Published	2019-12-15
URL	https://arxiv.org/abs/1912.07050v1
PDF	https://arxiv.org/pdf/1912.07050v1.pdf
PWC	https://paperswithcode.com/paper/computational-induction-of-prosodic-structure
Repo
Framework

Conditional Expectation Propagation


Title	Conditional Expectation Propagation
Authors	Zheng Wang, Shandian Zhe
Abstract	Expectation propagation (EP) is a powerful approximate inference algorithm. However, a critical barrier in applying EP is that the moment matching in message updates can be intractable. Handcrafting approximations is usually tricky, and lacks generalizability. Importance sampling is very expensive. While Laplace propagation provides a good solution, it has to run numerical optimizations to find Laplace approximations in every update, which is still quite inefficient. To overcome these practical barriers, we propose conditional expectation propagation (CEP) that performs conditional moment matching given the variables outside each message, and then takes expectation w.r.t. the approximate posterior of these variables. The conditional moments are often analytical and much easier to derive. In the most general case, we can use (fully) factorized messages to represent the conditional moments by quadrature formulas. We then compute the expectation of the conditional moments via Taylor approximations when necessary. In this way, our algorithm can always conduct efficient, analytical fixed point iterations. Experiments on several popular models for which standard EP is available or unavailable demonstrate the advantages of CEP in both inference quality and computational efficiency.
Tasks
Published	2019-10-27
URL	https://arxiv.org/abs/1910.12360v2
PDF	https://arxiv.org/pdf/1910.12360v2.pdf
PWC	https://paperswithcode.com/paper/conditional-expectation-propagation
Repo
Framework

Determining offshore wind installation times using machine learning and open data


Title	Determining offshore wind installation times using machine learning and open data
Authors	Bo Tranberg, Kasper Koops Kratmann, Jason Stege
Abstract	The installation process of offshore wind turbines requires the use of expensive jack-up vessels. These vessels regularly report their position via the Automatic Identification System (AIS). This paper introduces a novel approach of applying machine learning to AIS data from jack-up vessels. We apply the new method to 13 offshore wind farms in Danish, German and British waters. For each of the wind farms we identify individual turbine locations, individual installation times, time in transit and time in harbor for the respective vessel. This is done in an automated way exclusively using AIS data with no prior knowledge of turbine locations, thus enabling a detailed description of the entire installation process.
Tasks
Published	2019-09-25
URL	https://arxiv.org/abs/1909.11313v2
PDF	https://arxiv.org/pdf/1909.11313v2.pdf
PWC	https://paperswithcode.com/paper/determining-offshore-wind-installation-times
Repo
Framework

Using Scratch to Teach Undergraduate Students’ Skills on Artificial Intelligence


Title	Using Scratch to Teach Undergraduate Students’ Skills on Artificial Intelligence
Authors	Julian Estevez, Gorka Garate, JM Lopez Guede, Manuel Graña
Abstract	This paper presents an educational workshop in Scratch that is proposed for the active participation of undergraduate students in contexts of Artificial Intelligence. The main objective of the activity is to demystify the complexity of Artificial Intelligence and its algorithms. For this purpose, students must carry out simple exercises on clustering and two neural networks in Scratch. The detailed methodology for doing so is presented in the article.
Tasks
Published	2019-03-30
URL	http://arxiv.org/abs/1904.00296v1
PDF	http://arxiv.org/pdf/1904.00296v1.pdf
PWC	https://paperswithcode.com/paper/using-scratch-to-teach-undergraduate-students
Repo
Framework