Paper Group NANR 54
A Distribution-based Model to Learn Bilingual Word Embeddings
Title | A Distribution-based Model to Learn Bilingual Word Embeddings |
Authors | Hailong Cao, Tiejun Zhao, Shu Zhang, Yao Meng |
Abstract | We introduce a distribution-based model to learn bilingual word embeddings from monolingual data. It is simple and effective, and it requires neither parallel data nor a seed lexicon. We take advantage of the fact that word embeddings are usually dense, real-valued, low-dimensional vectors, so their distribution can be estimated accurately. A novel cross-lingual learning objective is proposed which directly matches the distribution of word embeddings in one language with that in the other. During the joint learning process, we dynamically estimate the distributions of word embeddings in the two languages and minimize the dissimilarity between them through the standard backpropagation algorithm. The learned bilingual word embeddings group each word and its translations together in the shared vector space. We demonstrate the utility of the learned embeddings on the task of finding word-to-word translations from monolingual corpora. Our model achieves encouraging performance on both related and substantially different language pairs. |
Tasks | Machine Translation, Word Embeddings |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/C16-1171/ |
PWC | https://paperswithcode.com/paper/a-distribution-based-model-to-learn-bilingual |
Repo | |
Framework | |
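To make the abstract's idea concrete, here is a minimal sketch of distribution matching through backpropagation, under simplifying assumptions: random placeholder embeddings stand in for the two monolingual spaces, a single linear map is learned rather than the embeddings themselves, and each distribution is summarized by its first two moments. This illustrates the general technique, not the authors' exact objective.

```python
# Sketch: match the distribution of mapped source embeddings to the target
# distribution by minimizing the moment mismatch with gradient descent.
# All data below are hypothetical placeholders.
import torch

torch.manual_seed(0)
d = 50
src = torch.randn(5000, d)  # placeholder source-language embeddings
tgt = torch.randn(6000, d)  # placeholder target-language embeddings

W = torch.eye(d, requires_grad=True)  # linear map into the shared space
opt = torch.optim.Adam([W], lr=1e-2)

def moments(x):
    mu = x.mean(dim=0)
    xc = x - mu
    return mu, xc.t() @ xc / (x.shape[0] - 1)

for step in range(200):
    opt.zero_grad()
    mu_s, cov_s = moments(src @ W)
    mu_t, cov_t = moments(tgt)
    # dissimilarity between the two embedding distributions
    loss = (mu_s - mu_t).pow(2).sum() + (cov_s - cov_t).pow(2).sum()
    loss.backward()
    opt.step()
```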
Towards Broad-coverage Meaning Representation: The Case of Comparison Structures
Title | Towards Broad-coverage Meaning Representation: The Case of Comparison Structures |
Authors | Omid Bakhshandeh, James Allen |
Abstract | |
Tasks | Question Answering, Reading Comprehension, Semantic Parsing, Sentiment Analysis |
Published | 2016-11-01 |
URL | https://www.aclweb.org/anthology/W16-6006/ |
PWC | https://paperswithcode.com/paper/towards-broad-coverage-meaning-representation |
Repo | |
Framework | |
Neural Embedding Language Models in Semantic Clustering of Web Search Results
Title | Neural Embedding Language Models in Semantic Clustering of Web Search Results |
Authors | Andrey Kutuzov, Elizaveta Kuzmenko |
Abstract | In this paper, a new approach to semantic clustering of the results of ambiguous search queries is presented. We propose using distributed vector representations of words, trained with prediction-based neural embedding models, to detect the senses of a search query and to cluster the search engine results page according to these senses. The words from titles and snippets, together with the semantic relationships between them, form a graph, which is further partitioned into components related to different query senses. This approach to search engine results clustering is evaluated against a new manually annotated evaluation set of Russian search queries. We show that in the task of semantically clustering search results, prediction-based models slightly but consistently outperform traditional count-based ones given the same training corpora. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1486/ |
PWC | https://paperswithcode.com/paper/neural-embedding-language-models-in-semantic |
Repo | |
Framework | |
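The graph-partitioning step described in the abstract can be illustrated with a small sketch: link words whose embedding vectors are sufficiently similar, and read the query senses off the connected components. The similarity threshold and the random toy vectors are assumptions for illustration; the paper's graph construction is more involved.

```python
# Sketch: connected components over a word-similarity graph (union-find).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sense_components(vectors, threshold=0.7):
    """Link word pairs with similar embeddings; return index components."""
    n = len(vectors)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Each component groups title/snippet words belonging to one query sense.
rng = np.random.default_rng(0)
print(sense_components(rng.standard_normal((6, 20))))
```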
Filtering Wiktionary Triangles by Linear Mapping between Distributed Word Models
Title | Filtering Wiktionary Triangles by Linear Mapping between Distributed Word Models |
Authors | Márton Makrai |
Abstract | Word translations arise in dictionary-like organization as well as via machine learning from corpora. The former is exemplified by Wiktionary, a crowd-sourced dictionary with editions in many languages. Ács et al. (2013) obtain word translations from Wiktionary with the pivot-based method, also called triangulation, which infers word translations in a pair of languages based on translations to other, typically better-resourced languages called pivots. Triangulation may introduce noise if words in the pivot are polysemous. The reliability of each triangulated translation is essentially estimated by the number of pivot languages (Tanaka et al., 1994). Mikolov et al. (2013) introduce a method for generating or scoring word translations. Translation is formalized as a linear mapping between distributed vector space models (VSMs) of the two languages. The VSMs are trained on monolingual data, while the mapping is learned in a supervised fashion, using a seed dictionary of a few thousand word pairs. The mapping can be used to associate existing translations with a real-valued similarity score. This paper exploits the human labor in Wiktionary combined with the distributional information in VSMs. We train VSMs on gigaword corpora, and the linear translation mapping on direct (non-triangulated) Wiktionary pairs. This mapping is used to filter triangulated translations based on their scores. The motivation is that scores from the mapping may be a smoother measure of merit than the number of pivots for a triangle alone. We evaluate the scores against dictionaries extracted from parallel corpora (Tiedemann, 2012). We show that the linear mapping provides a more reliable method for triangle scoring than pivot count. The methods we use are language-independent, and the training data are easy to obtain for many languages. We chose the German-Hungarian pair for evaluation; the filtered triangles resulting from our experiments form the largest freely available list of word translations for this pair that we are aware of. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1439/ |
PWC | https://paperswithcode.com/paper/filtering-wiktionary-triangles-by-linear |
Repo | |
Framework | |
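The scoring step is the Mikolov-style linear mapping named in the abstract. A minimal sketch under toy assumptions (random placeholder embeddings, an arbitrary threshold) might look as follows: the mapping is fit by least squares on direct pairs and then used to keep or drop triangulated candidates.

```python
# Sketch: fit a linear source-to-target map on a seed dictionary, then
# score triangulated translation candidates by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
d = 100
X = rng.standard_normal((3000, d))  # source embeddings of direct pairs
Z = rng.standard_normal((3000, d))  # target embeddings of direct pairs
W, *_ = np.linalg.lstsq(X, Z, rcond=None)  # solves X @ W ~ Z

def score(src_vec, tgt_vec):
    """Cosine similarity of the mapped source vector and the target."""
    m = src_vec @ W
    return float(m @ tgt_vec / (np.linalg.norm(m) * np.linalg.norm(tgt_vec)))

def keep_triangle(src_vec, tgt_vec, threshold=0.5):
    # Filter: keep a triangulated pair only if the mapping finds it plausible.
    return score(src_vec, tgt_vec) >= threshold
```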
Cross-validating Image Description Datasets and Evaluation Metrics
Title | Cross-validating Image Description Datasets and Evaluation Metrics |
Authors | Josiah Wang, Robert Gaizauskas |
Abstract | The task of automatically generating sentential descriptions of image content has become increasingly popular in recent years, resulting in the development of large-scale image description datasets and the proposal of various metrics for evaluating image description generation systems. However, not much work has been done to analyse and understand both datasets and the metrics. In this paper, we propose using a leave-one-out cross validation (LOOCV) process as a means to analyse multiply annotated, human-authored image description datasets and the various evaluation metrics, i.e. evaluating one image description against other human-authored descriptions of the same image. Such an evaluation process affords various insights into the image description datasets and evaluation metrics, such as the variations of image descriptions within and across datasets and also what the metrics capture. We compute and analyse (i) human upper-bound performance; (ii) ranked correlation between metric pairs across datasets; (iii) lower-bound performance by comparing a set of descriptions describing one image to another sentence not describing that image. Interesting observations are made about the evaluation metrics and image description datasets, and we conclude that such cross-validation methods are extremely useful for assessing and gaining insights into image description datasets and evaluation metrics for image descriptions. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1489/ |
PWC | https://paperswithcode.com/paper/cross-validating-image-description-datasets |
Repo | |
Framework | |
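The LOOCV protocol itself is simple enough to sketch: hold out each human description in turn and score it against the remaining descriptions of the same image. A toy unigram-F1 metric stands in for the BLEU/METEOR-style metrics studied in the paper.

```python
# Sketch: leave-one-out cross validation over multiply annotated captions.
def unigram_f1(candidate, references):
    """Toy metric: best unigram F1 of the candidate against any reference."""
    best, cand = 0.0, set(candidate.lower().split())
    for ref in references:
        r = set(ref.lower().split())
        overlap = len(cand & r)
        if overlap == 0 or not cand:
            continue
        p, rec = overlap / len(cand), overlap / len(r)
        best = max(best, 2 * p * rec / (p + rec))
    return best

def loocv_upper_bound(descriptions_per_image, metric=unigram_f1):
    """Score each held-out description against the rest for its image."""
    scores = []
    for descriptions in descriptions_per_image:
        for i, held_out in enumerate(descriptions):
            rest = descriptions[:i] + descriptions[i + 1:]
            scores.append(metric(held_out, rest))
    return sum(scores) / len(scores)

print(loocv_upper_bound([["a dog runs", "the dog is running", "a running dog"]]))
```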
Benchmarking Lexical Simplification Systems
Title | Benchmarking Lexical Simplification Systems |
Authors | Gustavo Paetzold, Lucia Specia |
Abstract | Lexical Simplification is the task of replacing complex words in a text with simpler alternatives. A variety of strategies have been devised for this challenge, yet there has been little effort to compare their performance. In this contribution, we present a benchmark of several Lexical Simplification systems. By combining resources created in previous work with automatic spelling and inflection correction techniques, we introduce BenchLS: a new evaluation dataset for the task. Using BenchLS, we evaluate the performance of solutions for various steps of the typical Lexical Simplification pipeline, both individually and jointly. This is the first time Lexical Simplification systems have been compared in this fashion on the same data, and the findings reveal several interesting properties of the systems evaluated. |
Tasks | Lexical Simplification |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1491/ |
PWC | https://paperswithcode.com/paper/benchmarking-lexical-simplification-systems |
Repo | |
Framework | |
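The typical Lexical Simplification pipeline the abstract refers to is usually decomposed into complex word identification, substitution generation, and substitution ranking. A toy sketch with hypothetical frequency data shows how the steps compose; real systems plug in much stronger components at each step.

```python
# Sketch: a frequency-based toy Lexical Simplification pipeline.
FREQ = {"use": 9_000_000, "employ": 300_000, "utilize": 40_000}  # toy counts
SUBS = {"utilize": ["use", "employ"]}                            # toy lexicon

def identify_complex(words, threshold=100_000):
    # Step 1: flag rare words as complex.
    return [w for w in words if FREQ.get(w, 0) < threshold]

def generate(word):
    # Step 2: propose candidate substitutions.
    return SUBS.get(word, [])

def rank(candidates):
    # Step 3: prefer more frequent (simpler) candidates.
    return sorted(candidates, key=lambda w: -FREQ.get(w, 0))

def simplify(sentence):
    words = sentence.split()
    for w in identify_complex(words):
        ranked = rank(generate(w))
        if ranked:
            words[words.index(w)] = ranked[0]
    return " ".join(words)

print(simplify("we utilize this method"))  # -> "we use this method"
```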
Extending AIDA framework by incorporating coreference resolution on detected mentions and pruning based on popularity of an entity
Title | Extending AIDA framework by incorporating coreference resolution on detected mentions and pruning based on popularity of an entity |
Authors | Samaikya Akarapu, C Ravindranath Chowdary |
Abstract | |
Tasks | Coreference Resolution, Entity Linking, Named Entity Recognition, Word Sense Disambiguation |
Published | 2016-12-01 |
URL | https://www.aclweb.org/anthology/W16-6306/ |
PWC | https://paperswithcode.com/paper/extending-aida-framework-by-incorporating |
Repo | |
Framework | |
A Novel Evaluation Method for Morphological Segmentation
Title | A Novel Evaluation Method for Morphological Segmentation |
Authors | Javad Nouri, Roman Yangarber |
Abstract | Unsupervised learning of morphological segmentation of words in a language, based only on a large corpus of words, is a challenging task. Evaluation of the learned segmentations is a challenge in itself, due to the inherent ambiguity of the segmentation task. There is no way to posit a unique “correct” segmentation for a set of data in an objective way. Two models may arrive at different ways of segmenting the data, which may nonetheless both be valid. Several evaluation methods have been proposed to date, but they do not insist on consistency of the evaluated model. We introduce a new evaluation methodology, which enforces correctness of segmentation boundaries while also assuring consistency of segmentation decisions across the corpus. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1495/ |
PWC | https://paperswithcode.com/paper/a-novel-evaluation-method-for-morphological |
Repo | |
Framework | |
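A minimal sketch of boundary-based scoring may clarify what such evaluations measure; the consistency-enforcing part of the paper's methodology is deliberately omitted here, and the '+'-delimited segmentation format is an assumption for illustration.

```python
# Sketch: precision/recall/F1 over morph boundary positions.
def boundaries(segmentation):
    """Boundary positions of a '+'-delimited analysis: 'walk+ing' -> {4}."""
    pos, cuts = 0, set()
    for morph in segmentation.split("+")[:-1]:
        pos += len(morph)
        cuts.add(pos)
    return cuts

def boundary_f1(gold, predicted):
    g, p = boundaries(gold), boundaries(predicted)
    if not g and not p:
        return 1.0  # both analyses leave the word unsegmented
    if not g or not p:
        return 0.0
    prec, rec = len(g & p) / len(p), len(g & p) / len(g)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(boundary_f1("walk+ing", "walk+ing"))  # 1.0
print(boundary_f1("walk+ing", "wal+king"))  # 0.0
```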
EVALution-MAN: A Chinese Dataset for the Training and Evaluation of DSMs
Title | EVALution-MAN: A Chinese Dataset for the Training and Evaluation of DSMs |
Authors | Liu Hongchao, Karl Neergaard, Enrico Santus, Chu-Ren Huang |
Abstract | Distributional semantic models (DSMs) are currently used to measure word relatedness and word similarity. One shortcoming of DSMs is that they do not provide a principled way to discriminate between different semantic relations. Several approaches have been adopted that rely on annotated data, either in the training of the model or later in its evaluation. In this paper, we introduce a dataset for training and evaluating DSMs on semantic relation discrimination between words in Mandarin Chinese. The construction of the dataset followed EVALution 1.0, an English dataset for the training and evaluation of DSMs. The dataset contains 360 relation pairs, distributed over five semantic relations: antonymy, synonymy, hypernymy, meronymy and near-synonymy. All relation pairs were checked manually to estimate their quality. The 360 word relation pairs involve 373 relata, all of which were extracted and subsequently manually tagged according to their semantic type. The frequency of each relatum was calculated in a combined corpus of Sinica and Chinese Gigaword. To the best of our knowledge, EVALution-MAN is the first dataset of its kind for Mandarin Chinese. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1726/ |
PWC | https://paperswithcode.com/paper/evalution-man-a-chinese-dataset-for-the |
Repo | |
Framework | |
Case and Cause in Icelandic: Reconstructing Causal Networks of Cascaded Language Changes
Title | Case and Cause in Icelandic: Reconstructing Causal Networks of Cascaded Language Changes |
Authors | Fermín Moscoso del Prado Martín, Christian Brendel |
Abstract | |
Tasks | |
Published | 2016-08-01 |
URL | https://www.aclweb.org/anthology/P16-1229/ |
PWC | https://paperswithcode.com/paper/case-and-cause-in-icelandic-reconstructing |
Repo | |
Framework | |
Automatic tagging and retrieval of E-Commerce products based on visual features
Title | Automatic tagging and retrieval of E-Commerce products based on visual features |
Authors | Vasu Sharma, Harish Karnick |
Abstract | |
Tasks | Content-Based Image Retrieval, Image Retrieval, Multi-Label Classification, Product Categorization |
Published | 2016-06-01 |
URL | https://www.aclweb.org/anthology/N16-2004/ |
PWC | https://paperswithcode.com/paper/automatic-tagging-and-retrieval-of-e-commerce |
Repo | |
Framework | |
Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
Title | Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms |
Authors | Christo Kirov, John Sylak-Glassman, Roger Que, David Yarowsky |
Abstract | Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation. However, it is designed primarily for human viewing rather than machine readability, and presents numerous challenges for generalized parsing and extraction due to a lack of standardized formatting and grammatical descriptor definitions. This paper describes a large-scale effort to automatically extract and standardize the data in Wiktionary and make it available for use by the NLP research community. The methodological innovations include a multidimensional table parsing algorithm, a cross-lexeme, token-frequency-based method of separating inflectional form data from grammatical descriptors, the normalization of grammatical descriptors to a unified annotation scheme that accounts for cross-linguistic diversity, and a verification and correction process that exploits within-language, cross-lexeme table format consistency to minimize human effort. The effort described here resulted in the extraction of a uniquely large normalized resource of nearly 1,000,000 inflectional paradigms across 350 languages. Evaluation shows that even though the data is extracted using a language-independent approach, it is comparable in quantity and quality to data extracted using hand-tuned, language-specific approaches. |
Tasks | Machine Translation, Morphological Analysis |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1498/ |
PWC | https://paperswithcode.com/paper/very-large-scale-parsing-and-normalization-of |
Repo | |
Framework | |
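The cross-lexeme, token-frequency heuristic mentioned in the abstract can be sketched compactly: a table cell that recurs across many lexemes' paradigm tables is likely a grammatical descriptor, while lexeme-specific cells are the inflected forms. The toy tables and threshold below are illustrative assumptions, not the paper's actual data or parameters.

```python
# Sketch: separate grammatical descriptors from inflected forms by how many
# lexemes' tables a token occurs in.
from collections import Counter

tables = {  # hypothetical parsed paradigm tables
    "walk": ["present", "walk", "past", "walked"],
    "talk": ["present", "talk", "past", "talked"],
    "jump": ["present", "jump", "past", "jumped"],
}

lexeme_counts = Counter()
for cells in tables.values():
    lexeme_counts.update(set(cells))  # count lexemes each token appears in

def split_cells(cells, min_lexemes=2):
    descriptors = [c for c in cells if lexeme_counts[c] >= min_lexemes]
    forms = [c for c in cells if lexeme_counts[c] < min_lexemes]
    return descriptors, forms

print(split_cells(tables["walk"]))
# (['present', 'past'], ['walk', 'walked'])
```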
Purely Corpus-based Automatic Conversation Authoring
Title | Purely Corpus-based Automatic Conversation Authoring |
Authors | Guillaume Dubuisson Duplessis, Vincent Letard, Anne-Laure Ligozat, Sophie Rosset |
Abstract | This paper presents an automatic corpus-based process for authoring an open-domain conversational strategy, usable both in chatterbot systems and as a fallback strategy for out-of-domain human utterances. Our approach is implemented on a corpus of television drama subtitles. The system is used as a chatterbot to collect a corpus of 41 open-domain textual dialogues with 27 human participants. The general capabilities of the system are studied through objective measures and subjective self-reports in terms of understandability, repetition and coherence of the system responses selected in reaction to human utterances. Subjective evaluations of the collected dialogues are presented with respect to amusement, engagement and enjoyability. The main factors influencing these dimensions in our chatterbot experiment are discussed. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1433/ |
PWC | https://paperswithcode.com/paper/purely-corpus-based-automatic-conversation |
Repo | |
Framework | |
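One common way to realize such a corpus-based strategy, sketched here under the assumption of a simple retrieve-next-line rule (the paper's selection process is richer), is to find the subtitle line most similar to the user's utterance and answer with the line that followed it in the drama:

```python
# Sketch: retrieval-based chatterbot over a toy subtitle corpus.
from difflib import SequenceMatcher

subtitles = [  # hypothetical consecutive subtitle lines
    "where are you going",
    "to the market",
    "did you sleep well",
    "not really no",
]

def respond(utterance):
    best_i, best_s = None, 0.0
    for i in range(len(subtitles) - 1):
        s = SequenceMatcher(None, utterance.lower(), subtitles[i]).ratio()
        if s > best_s:
            best_i, best_s = i, s
    # Reply with the line that followed the best match in the corpus.
    return subtitles[best_i + 1] if best_i is not None else "tell me more"

print(respond("Where are you going?"))  # -> "to the market"
```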
The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics
Title | The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics |
Authors | Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, Michimasa Inaba |
Abstract | Dialogue breakdown detection is a promising technique in dialogue systems. To promote the research and development of such a technique, we organized a dialogue breakdown detection challenge where the task is to detect a system's inappropriate utterances that lead to dialogue breakdowns in chat. This paper describes the design, datasets, and evaluation metrics for the challenge as well as the methods and results of the submitted runs of the participants. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1502/ |
PWC | https://paperswithcode.com/paper/the-dialogue-breakdown-detection-challenge |
Repo | |
Framework | |
Coordinating Communication in the Wild: The Artwalk Dialogue Corpus of Pedestrian Navigation and Mobile Referential Communication
Title | Coordinating Communication in the Wild: The Artwalk Dialogue Corpus of Pedestrian Navigation and Mobile Referential Communication |
Authors | Kris Liu, Jean Fox Tree, Marilyn Walker |
Abstract | The Artwalk Corpus is a collection of 48 mobile phone conversations between 24 pairs of friends and 24 pairs of strangers performing a novel, naturalistically-situated referential communication task. This task produced dialogues which, on average, are just under 40 minutes. The task requires the identification of public art while walking around and navigating pedestrian routes in the downtown area of Santa Cruz, California. The task involves a Director on the UCSC campus with access to maps providing verbal instructions to a Follower executing the task. The task provides a setting for real-world situated dialogic language and is designed to: (1) elicit entrainment and coordination of referring expressions between the dialogue participants, (2) examine the effect of friendship on dialogue strategies, and (3) examine how the need to complete the task while negotiating myriad, unanticipated events in the real world ― such as avoiding cars and other pedestrians ― affects linguistic coordination and other dialogue behaviors. Previous work on entrainment and coordinating communication has primarily focused on similar tasks in laboratory settings where there are no interruptions and no need to navigate from one point to another in a complex space. The corpus provides a general resource for studies on how coordinated task-oriented dialogue changes when we move outside the laboratory and into the world. It can also be used for studies of entrainment in dialogue, and the form and style of pedestrian instruction dialogues, as well as the effect of friendship on dialogic behaviors. |
Tasks | |
Published | 2016-05-01 |
URL | https://www.aclweb.org/anthology/L16-1504/ |
PWC | https://paperswithcode.com/paper/coordinating-communication-in-the-wild-the |
Repo | |
Framework | |