Paper Group NANR 54
A Distribution-based Model to Learn Bilingual Word Embeddings
Title | A Distribution-based Model to Learn Bilingual Word Embeddings |
Authors | Hailong Cao, Tiejun Zhao, Shu Zhang, Yao Meng |
Abstract | We introduce a distribution-based model to learn bilingual word embeddings from monolingual data. It is simple and effective, and it requires neither parallel data nor a seed lexicon. We take advantage of the fact that word embeddings are usually dense, real-valued, low-dimensional vectors, so their distribution can be estimated accurately. A novel cross-lingual learning objective is proposed which directly matches the distribution of word embeddings in one language with that in the other. During the joint learning process, we dynamically estimate the distributions of word embeddings in the two languages and minimize the dissimilarity between them through the standard backpropagation algorithm. The learned bilingual word embeddings group each word and its translations together in the shared vector space. We demonstrate the utility of the learned embeddings on the task of finding word-to-word translations from monolingual corpora. Our model achieves encouraging performance on both related and substantially different language pairs. |
Tasks | Machine Translation, Word Embeddings |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1171/ |
PWC | https://paperswithcode.com/paper/a-distribution-based-model-to-learn-bilingual |
Repo | |
Framework | |
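To make the abstract's idea concrete, here is a minimal sketch of distribution matching through backpropagation, under simplifying assumptions: random placeholder embeddings stand in for the two monolingual spaces, a single linear map is learned rather than the embeddings themselves, and each distribution is summarized by its first two moments. This illustrates the general technique, not the authors' exact objective.

```python
# Sketch: match the distribution of mapped source embeddings to the target
# distribution by minimizing the moment mismatch with gradient descent.
# All data below are hypothetical placeholders.
import torch

torch.manual_seed(0)
d = 50
src = torch.randn(5000, d)  # placeholder source-language embeddings
tgt = torch.randn(6000, d)  # placeholder target-language embeddings

W = torch.eye(d, requires_grad=True)  # linear map into the shared space
opt = torch.optim.Adam([W], lr=1e-2)

def moments(x):
    mu = x.mean(dim=0)
    xc = x - mu
    return mu, xc.t() @ xc / (x.shape[0] - 1)

for step in range(200):
    opt.zero_grad()
    mu_s, cov_s = moments(src @ W)
    mu_t, cov_t = moments(tgt)
    # dissimilarity between the two embedding distributions
    loss = (mu_s - mu_t).pow(2).sum() + (cov_s - cov_t).pow(2).sum()
    loss.backward()
    opt.step()
```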
Towards Broad-coverage Meaning Representation: The Case of Comparison Structures
Title | Towards Broad-coverage Meaning Representation: The Case of Comparison Structures |
Authors | Omid Bakhshandeh, James Allen |
Abstract | |
Tasks | Question Answering, Reading Comprehension, Semantic Parsing, Sentiment Analysis |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/W16-6006/ |
PWC | https://paperswithcode.com/paper/towards-broad-coverage-meaning-representation |
Repo | |
Framework | |
Neural Embedding Language Models in Semantic Clustering of Web Search Results
Title | Neural Embedding Language Models in Semantic Clustering of Web Search Results |
Authors | Andrey Kutuzov, Elizaveta Kuzmenko |
Abstract | In this paper, a new approach to semantic clustering of the results of ambiguous search queries is presented. We propose using distributed vector representations of words, trained with prediction-based neural embedding models, to detect the senses of a search query and to cluster the search engine results page according to these senses. The words from titles and snippets, together with the semantic relationships between them, form a graph, which is further partitioned into components related to different query senses. This approach to search engine results clustering is evaluated against a new manually annotated evaluation set of Russian search queries. We show that in the task of semantically clustering search results, prediction-based models slightly but consistently outperform traditional count-based ones given the same training corpora. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1486/ |
PWC | https://paperswithcode.com/paper/neural-embedding-language-models-in-semantic |
Repo | |
Framework | |
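The graph-partitioning step described in the abstract can be illustrated with a small sketch: link words whose embedding vectors are sufficiently similar, and read the query senses off the connected components. The similarity threshold and the random toy vectors are assumptions for illustration; the paper's graph construction is more involved.

```python
# Sketch: connected components over a word-similarity graph (union-find).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sense_components(vectors, threshold=0.7):
    """Link word pairs with similar embeddings; return index components."""
    n = len(vectors)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Each component groups title/snippet words belonging to one query sense.
rng = np.random.default_rng(0)
print(sense_components(rng.standard_normal((6, 20))))
```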
Filtering Wiktionary Triangles by Linear Mapping between Distributed Word Models
Title | Filtering Wiktionary Triangles by Linear Mapping between Distributed Word Models |
Authors | Márton Makrai |
Abstract | Word translations arise in dictionary-like organization as well as via machine learning from corpora. The former is exemplified by Wiktionary, a crowd-sourced dictionary with editions in many languages. Ács et al. (2013) obtain word translations from Wiktionary with the pivot-based method, also called triangulation, which infers word translations in a pair of languages based on translations to other, typically better-resourced languages called pivots. Triangulation may introduce noise if words in the pivot are polysemous. The reliability of each triangulated translation is essentially estimated by the number of pivot languages (Tanaka et al., 1994). Mikolov et al. (2013) introduce a method for generating or scoring word translations. Translation is formalized as a linear mapping between distributed vector space models (VSMs) of the two languages. The VSMs are trained on monolingual data, while the mapping is learned in a supervised fashion, using a seed dictionary of a few thousand word pairs. The mapping can be used to associate existing translations with a real-valued similarity score. This paper exploits the human labor in Wiktionary combined with the distributional information in VSMs. We train VSMs on gigaword corpora, and the linear translation mapping on direct (non-triangulated) Wiktionary pairs. This mapping is used to filter triangulated translations based on their scores. The motivation is that scores from the mapping may be a smoother measure of merit than the number of pivots for a triangle alone. We evaluate the scores against dictionaries extracted from parallel corpora (Tiedemann, 2012). We show that the linear mapping provides a more reliable method for triangle scoring than pivot count. The methods we use are language-independent, and the training data are easy to obtain for many languages. We chose the German-Hungarian pair for evaluation; the filtered triangles resulting from our experiments form the largest freely available list of word translations for this pair that we are aware of. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1439/ |
PWC | https://paperswithcode.com/paper/filtering-wiktionary-triangles-by-linear |
Repo | |
Framework | |
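The scoring step is the Mikolov-style linear mapping named in the abstract. A minimal sketch under toy assumptions (random placeholder embeddings, an arbitrary threshold) might look as follows: the mapping is fit by least squares on direct pairs and then used to keep or drop triangulated candidates.

```python
# Sketch: fit a linear source-to-target map on a seed dictionary, then
# score triangulated translation candidates by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
d = 100
X = rng.standard_normal((3000, d))  # source embeddings of direct pairs
Z = rng.standard_normal((3000, d))  # target embeddings of direct pairs
W, *_ = np.linalg.lstsq(X, Z, rcond=None)  # solves X @ W ~ Z

def score(src_vec, tgt_vec):
    """Cosine similarity of the mapped source vector and the target."""
    m = src_vec @ W
    return float(m @ tgt_vec / (np.linalg.norm(m) * np.linalg.norm(tgt_vec)))

def keep_triangle(src_vec, tgt_vec, threshold=0.5):
    # Filter: keep a triangulated pair only if the mapping finds it plausible.
    return score(src_vec, tgt_vec) >= threshold
```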
Cross-validating Image Description Datasets and Evaluation Metrics
Title | Cross-validating Image Description Datasets and Evaluation Metrics |
Authors | Josiah Wang, Robert Gaizauskas |
Abstract | The task of automatically generating sentential descriptions of image content has become increasingly popular in recent years, resulting in the development of large-scale image description datasets and the proposal of various metrics for evaluating image description generation systems. However, not much work has been done to analyse and understand both datasets and the metrics. In this paper, we propose using a leave-one-out cross validation (LOOCV) process as a means to analyse multiply annotated, human-authored image description datasets and the various evaluation metrics, i.e. evaluating one image description against other human-authored descriptions of the same image. Such an evaluation process affords various insights into the image description datasets and evaluation metrics, such as the variations of image descriptions within and across datasets and also what the metrics capture. We compute and analyse (i) human upper-bound performance; (ii) ranked correlation between metric pairs across datasets; (iii) lower-bound performance by comparing a set of descriptions describing one image to another sentence not describing that image. Interesting observations are made about the evaluation metrics and image description datasets, and we conclude that such cross-validation methods are extremely useful for assessing and gaining insights into image description datasets and evaluation metrics for image descriptions. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1489/ |
PWC | https://paperswithcode.com/paper/cross-validating-image-description-datasets |
Repo | |
Framework | |
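The LOOCV protocol itself is simple enough to sketch: hold out each human description in turn and score it against the remaining descriptions of the same image. A toy unigram-F1 metric stands in for the BLEU/METEOR-style metrics studied in the paper.

```python
# Sketch: leave-one-out cross validation over multiply annotated captions.
def unigram_f1(candidate, references):
    """Toy metric: best unigram F1 of the candidate against any reference."""
    best, cand = 0.0, set(candidate.lower().split())
    for ref in references:
        r = set(ref.lower().split())
        overlap = len(cand & r)
        if overlap == 0 or not cand:
            continue
        p, rec = overlap / len(cand), overlap / len(r)
        best = max(best, 2 * p * rec / (p + rec))
    return best

def loocv_upper_bound(descriptions_per_image, metric=unigram_f1):
    """Score each held-out description against the rest for its image."""
    scores = []
    for descriptions in descriptions_per_image:
        for i, held_out in enumerate(descriptions):
            rest = descriptions[:i] + descriptions[i + 1:]
            scores.append(metric(held_out, rest))
    return sum(scores) / len(scores)

print(loocv_upper_bound([["a dog runs", "the dog is running", "a running dog"]]))
```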
Benchmarking Lexical Simplification Systems
Title | Benchmarking Lexical Simplification Systems |
Authors | Gustavo Paetzold, Lucia Specia |
Abstract | Lexical Simplification is the task of replacing complex words in a text with simpler alternatives. A variety of strategies have been devised for this challenge, yet there has been little effort to compare their performance. In this contribution, we present a benchmark of several Lexical Simplification systems. By combining resources created in previous work with automatic spelling and inflection correction techniques, we introduce BenchLS: a new evaluation dataset for the task. Using BenchLS, we evaluate the performance of solutions for various steps of the typical Lexical Simplification pipeline, both individually and jointly. This is the first time Lexical Simplification systems have been compared in this fashion on the same data, and the findings reveal several interesting properties of the systems evaluated. |
Tasks | Lexical Simplification |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1491/ |
PWC | https://paperswithcode.com/paper/benchmarking-lexical-simplification-systems |
Repo | |
Framework | |
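The typical Lexical Simplification pipeline the abstract refers to is usually decomposed into complex word identification, substitution generation, and substitution ranking. A toy sketch with hypothetical frequency data shows how the steps compose; real systems plug in much stronger components at each step.

```python
# Sketch: a frequency-based toy Lexical Simplification pipeline.
FREQ = {"use": 9_000_000, "employ": 300_000, "utilize": 40_000}  # toy counts
SUBS = {"utilize": ["use", "employ"]}                            # toy lexicon

def identify_complex(words, threshold=100_000):
    # Step 1: flag rare words as complex.
    return [w for w in words if FREQ.get(w, 0) < threshold]

def generate(word):
    # Step 2: propose candidate substitutions.
    return SUBS.get(word, [])

def rank(candidates):
    # Step 3: prefer more frequent (simpler) candidates.
    return sorted(candidates, key=lambda w: -FREQ.get(w, 0))

def simplify(sentence):
    words = sentence.split()
    for w in identify_complex(words):
        ranked = rank(generate(w))
        if ranked:
            words[words.index(w)] = ranked[0]
    return " ".join(words)

print(simplify("we utilize this method"))  # -> "we use this method"
```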
Extending AIDA framework by incorporating coreference resolution on detected mentions and pruning based on popularity of an entity
Title | Extending AIDA framework by incorporating coreference resolution on detected mentions and pruning based on popularity of an entity |
Authors | Samaikya Akarapu, C Ravindranath Chowdary |
Abstract | |
Tasks | Coreference Resolution, Entity Linking, Named Entity Recognition, Word Sense Disambiguation |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-6306/ |
PWC | https://paperswithcode.com/paper/extending-aida-framework-by-incorporating |
Repo | |
Framework | |
A Novel Evaluation Method for Morphological Segmentation
Title | A Novel Evaluation Method for Morphological Segmentation |
Authors | Javad Nouri, Roman Yangarber |
Abstract | Unsupervised learning of morphological segmentation of words in a language, based only on a large corpus of words, is a challenging task. Evaluation of the learned segmentations is a challenge in itself, due to the inherent ambiguity of the segmentation task. There is no way to posit a unique “correct” segmentation for a set of data in an objective way. Two models may arrive at different ways of segmenting the data, which may nonetheless both be valid. Several evaluation methods have been proposed to date, but they do not insist on consistency of the evaluated model. We introduce a new evaluation methodology, which enforces correctness of segmentation boundaries while also assuring consistency of segmentation decisions across the corpus. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1495/ |
PWC | https://paperswithcode.com/paper/a-novel-evaluation-method-for-morphological |
Repo | |
Framework | |
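A minimal sketch of boundary-based scoring may clarify what such evaluations measure; the consistency-enforcing part of the paper's methodology is deliberately omitted here, and the '+'-delimited segmentation format is an assumption for illustration.

```python
# Sketch: precision/recall/F1 over morph boundary positions.
def boundaries(segmentation):
    """Boundary positions of a '+'-delimited analysis: 'walk+ing' -> {4}."""
    pos, cuts = 0, set()
    for morph in segmentation.split("+")[:-1]:
        pos += len(morph)
        cuts.add(pos)
    return cuts

def boundary_f1(gold, predicted):
    g, p = boundaries(gold), boundaries(predicted)
    if not g and not p:
        return 1.0  # both analyses leave the word unsegmented
    if not g or not p:
        return 0.0
    prec, rec = len(g & p) / len(p), len(g & p) / len(g)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(boundary_f1("walk+ing", "walk+ing"))  # 1.0
print(boundary_f1("walk+ing", "wal+king"))  # 0.0
```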
EVALution-MAN: A Chinese Dataset for the Training and Evaluation of DSMs
Title | EVALution-MAN: A Chinese Dataset for the Training and Evaluation of DSMs |
Authors | Liu Hongchao, Karl Neergaard, Enrico Santus, Chu-Ren Huang |
Abstract | Distributional semantic models (DSMs) are currently used to measure word relatedness and word similarity. One shortcoming of DSMs is that they do not provide a principled way to discriminate between different semantic relations. Several approaches have been adopted that rely on annotated data, either in the training of the model or later in its evaluation. In this paper, we introduce a dataset for training and evaluating DSMs on semantic relation discrimination between words in Mandarin Chinese. The construction of the dataset followed EVALution 1.0, an English dataset for the training and evaluation of DSMs. The dataset contains 360 relation pairs, distributed over five semantic relations: antonymy, synonymy, hypernymy, meronymy and near-synonymy. All relation pairs were checked manually to estimate their quality. The 360 word relation pairs involve 373 relata, all of which were extracted and subsequently manually tagged according to their semantic type. The frequency of each relatum was calculated in a combined corpus of Sinica and Chinese Gigaword. To the best of our knowledge, EVALution-MAN is the first dataset of its kind for Mandarin Chinese. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1726/ |
PWC | https://paperswithcode.com/paper/evalution-man-a-chinese-dataset-for-the |
Repo | |
Framework | |
Case and Cause in Icelandic: Reconstructing Causal Networks of Cascaded Language Changes
Title | Case and Cause in Icelandic: Reconstructing Causal Networks of Cascaded Language Changes |
Authors | Fermín Moscoso del Prado Martín, Christian Brendel |
Abstract | |
Tasks | |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-1229/ |
PWC | https://paperswithcode.com/paper/case-and-cause-in-icelandic-reconstructing |
Repo | |
Framework | |
Automatic tagging and retrieval of E-Commerce products based on visual features
Title | Automatic tagging and retrieval of E-Commerce products based on visual features |
Authors | Vasu Sharma, Harish Karnick |
Abstract | |
Tasks | Content-Based Image Retrieval, Image Retrieval, Multi-Label Classification, Product Categorization |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/N16-2004/ |
PWC | https://paperswithcode.com/paper/automatic-tagging-and-retrieval-of-e-commerce |
Repo | |
Framework | |
Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
Title | Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms |
Authors | Christo Kirov, John Sylak-Glassman, Roger Que, David Yarowsky |
Abstract | Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation. However, it is designed primarily for human viewing rather than machine readability, and presents numerous challenges for generalized parsing and extraction due to a lack of standardized formatting and grammatical descriptor definitions. This paper describes a large-scale effort to automatically extract and standardize the data in Wiktionary and make it available for use by the NLP research community. The methodological innovations include a multidimensional table parsing algorithm, a cross-lexeme, token-frequency-based method of separating inflectional form data from grammatical descriptors, the normalization of grammatical descriptors to a unified annotation scheme that accounts for cross-linguistic diversity, and a verification and correction process that exploits within-language, cross-lexeme table format consistency to minimize human effort. The effort described here resulted in the extraction of a uniquely large normalized resource of nearly 1,000,000 inflectional paradigms across 350 languages. Evaluation shows that even though the data is extracted using a language-independent approach, it is comparable in quantity and quality to data extracted using hand-tuned, language-specific approaches. |
Tasks | Machine Translation, Morphological Analysis |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1498/ |
PWC | https://paperswithcode.com/paper/very-large-scale-parsing-and-normalization-of |
Repo | |
Framework | |
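The cross-lexeme, token-frequency heuristic mentioned in the abstract can be sketched compactly: a table cell that recurs across many lexemes' paradigm tables is likely a grammatical descriptor, while lexeme-specific cells are the inflected forms. The toy tables and threshold below are illustrative assumptions, not the paper's actual data or parameters.

```python
# Sketch: separate grammatical descriptors from inflected forms by how many
# lexemes' tables a token occurs in.
from collections import Counter

tables = {  # hypothetical parsed paradigm tables
    "walk": ["present", "walk", "past", "walked"],
    "talk": ["present", "talk", "past", "talked"],
    "jump": ["present", "jump", "past", "jumped"],
}

lexeme_counts = Counter()
for cells in tables.values():
    lexeme_counts.update(set(cells))  # count lexemes each token appears in

def split_cells(cells, min_lexemes=2):
    descriptors = [c for c in cells if lexeme_counts[c] >= min_lexemes]
    forms = [c for c in cells if lexeme_counts[c] < min_lexemes]
    return descriptors, forms

print(split_cells(tables["walk"]))
# (['present', 'past'], ['walk', 'walked'])
```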
Purely Corpus-based Automatic Conversation Authoring
Title | Purely Corpus-based Automatic Conversation Authoring |
Authors | Guillaume Dubuisson Duplessis, Vincent Letard, Anne-Laure Ligozat, Sophie Rosset |
Abstract | This paper presents an automatic corpus-based process for authoring an open-domain conversational strategy, usable both in chatterbot systems and as a fallback strategy for out-of-domain human utterances. Our approach is implemented on a corpus of television drama subtitles. The system is used as a chatterbot to collect a corpus of 41 open-domain textual dialogues with 27 human participants. The general capabilities of the system are studied through objective measures and subjective self-reports in terms of understandability, repetition and coherence of the system responses selected in reaction to human utterances. Subjective evaluations of the collected dialogues are presented with respect to amusement, engagement and enjoyability. The main factors influencing these dimensions in our chatterbot experiment are discussed. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1433/ |
PWC | https://paperswithcode.com/paper/purely-corpus-based-automatic-conversation |
Repo | |
Framework | |
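One common way to realize such a corpus-based strategy, sketched here under the assumption of a simple retrieve-next-line rule (the paper's selection process is richer), is to find the subtitle line most similar to the user's utterance and answer with the line that followed it in the drama:

```python
# Sketch: retrieval-based chatterbot over a toy subtitle corpus.
from difflib import SequenceMatcher

subtitles = [  # hypothetical consecutive subtitle lines
    "where are you going",
    "to the market",
    "did you sleep well",
    "not really no",
]

def respond(utterance):
    best_i, best_s = None, 0.0
    for i in range(len(subtitles) - 1):
        s = SequenceMatcher(None, utterance.lower(), subtitles[i]).ratio()
        if s > best_s:
            best_i, best_s = i, s
    # Reply with the line that followed the best match in the corpus.
    return subtitles[best_i + 1] if best_i is not None else "tell me more"

print(respond("Where are you going?"))  # -> "to the market"
```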
The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics
Title | The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics |
Authors | Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, Michimasa Inaba |
Abstract | Dialogue breakdown detection is a promising technique in dialogue systems. To promote the research and development of such a technique, we organized a dialogue breakdown detection challenge where the task is to detect a system's inappropriate utterances that lead to dialogue breakdowns in chat. This paper describes the design, datasets, and evaluation metrics for the challenge as well as the methods and results of the submitted runs of the participants. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1502/ |
PWC | https://paperswithcode.com/paper/the-dialogue-breakdown-detection-challenge |
Repo | |
Framework | |
Coordinating Communication in the Wild: The Artwalk Dialogue Corpus of Pedestrian Navigation and Mobile Referential Communication
Title | Coordinating Communication in the Wild: The Artwalk Dialogue Corpus of Pedestrian Navigation and Mobile Referential Communication |
Authors | Kris Liu, Jean Fox Tree, Marilyn Walker |
Abstract | The Artwalk Corpus is a collection of 48 mobile phone conversations between 24 pairs of friends and 24 pairs of strangers performing a novel, naturalistically-situated referential communication task. This task produced dialogues which, on average, are just under 40 minutes. The task requires the identification of public art while walking around and navigating pedestrian routes in the downtown area of Santa Cruz, California. The task involves a Director on the UCSC campus with access to maps providing verbal instructions to a Follower executing the task. The task provides a setting for real-world situated dialogic language and is designed to: (1) elicit entrainment and coordination of referring expressions between the dialogue participants, (2) examine the effect of friendship on dialogue strategies, and (3) examine how the need to complete the task while negotiating myriad, unanticipated events in the real world ― such as avoiding cars and other pedestrians ― affects linguistic coordination and other dialogue behaviors. Previous work on entrainment and coordinating communication has primarily focused on similar tasks in laboratory settings where there are no interruptions and no need to navigate from one point to another in a complex space. The corpus provides a general resource for studies on how coordinated task-oriented dialogue changes when we move outside the laboratory and into the world. It can also be used for studies of entrainment in dialogue, and the form and style of pedestrian instruction dialogues, as well as the effect of friendship on dialogic behaviors. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1504/ |
PWC | https://paperswithcode.com/paper/coordinating-communication-in-the-wild-the |
Repo | |
Framework | |