May 5, 2019

1800 words 9 mins read

Paper Group NAWR 4

Adapting to All Domains at Once: Rewarding Domain Invariance in SMT. C4Corpus: Multilingual Web-size Corpus with Free License. ExploreKit: Automatic Feature Generation and Selection. An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability. Efficient Coarse-To-Fine PatchMatch for Large Displ …

Adapting to All Domains at Once: Rewarding Domain Invariance in SMT

Title Adapting to All Domains at Once: Rewarding Domain Invariance in SMT
Authors Hoang Cuong, Khalil Sima'an, Ivan Titov
Abstract Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced sub-domains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.
Tasks Domain Adaptation, Machine Translation
Published 2016-01-01
URL https://www.aclweb.org/anthology/Q16-1008/
PDF https://www.aclweb.org/anthology/Q16-1008
PWC https://paperswithcode.com/paper/adapting-to-all-domains-at-once-rewarding
Repo https://github.com/hoangcuong2011/UDIT
Framework none
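
The core intuition above, preferring phrases whose usage is spread evenly across the induced sub-domains, can be illustrated with a small entropy-based score. The sketch below is only an illustration of that intuition, not the paper's actual feature set or training procedure; the sub-domain names and counts are made up.

```python
import math

def domain_invariance_score(phrase_counts):
    """Entropy of a phrase's sub-domain distribution.

    `phrase_counts` maps an induced sub-domain id to the phrase's count in
    that sub-domain. A near-uniform distribution (high entropy) suggests a
    domain-invariant phrase; a peaked one suggests a domain-specific phrase.
    """
    total = sum(phrase_counts.values())
    probs = [c / total for c in phrase_counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Toy counts of two phrases across three induced sub-domains.
invariant = {"news": 40, "medical": 38, "subtitles": 42}
specific = {"news": 2, "medical": 115, "subtitles": 3}

print(domain_invariance_score(invariant))  # high entropy: safer, domain-invariant
print(domain_invariance_score(specific))   # low entropy: risky, domain-specific
```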

C4Corpus: Multilingual Web-size Corpus with Free License

Title C4Corpus: Multilingual Web-size Corpus with Free License
Authors Ivan Habernal, Omnia Zayed, Iryna Gurevych
Abstract Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12-million-page Web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs. Our highly scalable Hadoop-based framework is able to process the full CommonCrawl corpus on a 2,000+ CPU cluster on the Amazon Elastic Map/Reduce infrastructure. The processing pipeline includes license identification, state-of-the-art boilerplate removal, exact duplicate and near-duplicate document removal, and language detection. The construction of the corpus is highly configurable and fully reproducible, and we provide both the framework (DKPro C4CorpusTools) and the resulting data (C4Corpus) to the research community.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1146/
PDF https://www.aclweb.org/anthology/L16-1146
PWC https://paperswithcode.com/paper/c4corpus-multilingual-web-size-corpus-with
Repo https://github.com/dkpro/dkpro-c4corpus
Framework none
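
The pipeline described above chains license identification, boilerplate removal, duplicate removal, and language detection as Hadoop jobs. The snippet below sketches only the exact-duplicate stage on a single machine, hashing whitespace-normalised text; it is an illustration, not the DKPro C4CorpusTools implementation.

```python
import hashlib

def exact_dedup(documents):
    """Drop exact duplicates by hashing whitespace-normalised document text.

    One of several stages in a C4Corpus-style pipeline; the real framework
    runs this (and near-duplicate removal) as distributed Hadoop jobs.
    """
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Creative Commons  text.", "Creative Commons text.", "Another page."]
print(exact_dedup(docs))  # the normalised duplicate is removed
```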

ExploreKit: Automatic Feature Generation and Selection

Title ExploreKit: Automatic Feature Generation and Selection
Authors Gilad Katz, Eui Chul Richard Shin, Dawn Song
Abstract Feature generation is one of the challenging aspects of machine learning. We present ExploreKit, a framework for automated feature generation. ExploreKit generates a large set of candidate features by combining information in the original features, with the aim of maximizing predictive performance according to user-selected criteria. To overcome the exponential growth of the feature space, ExploreKit uses a novel machine learning-based feature selection approach to predict the usefulness of new candidate features. This approach enables efficient identification of the new features and produces superior results compared to existing feature selection solutions. We demonstrate the effectiveness and robustness of our approach by conducting an extensive evaluation on 25 datasets and 3 different classification algorithms. We show that ExploreKit can achieve classification-error reduction of 20% overall.
Tasks Automated Feature Engineering, Feature Selection
Published 2016-01-01
URL https://ieeexplore.ieee.org/document/7837936
PDF http://people.eecs.berkeley.edu/~dawnsong/papers/icdm-2016.pdf
PWC https://paperswithcode.com/paper/explorekit-automatic-feature-generation-and
Repo https://github.com/giladkatz/ExploreKit
Framework none
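
A rough sketch of the generate-then-rank loop described above. ExploreKit's real candidate operators and its learned ranking meta-model are far richer; here pairwise products stand in for candidate generation and a simple label-correlation filter stands in for the ranker. All feature names and values are toy data.

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def generate_and_rank(features, labels, top_k=2):
    """Generate pairwise product features and keep the top-ranked candidates.

    `features` maps feature names to equal-length value lists; candidates are
    ranked by absolute correlation with the labels, a crude stand-in for
    ExploreKit's learned usefulness predictor.
    """
    candidates = {}
    for (na, va), (nb, vb) in combinations(features.items(), 2):
        candidates[f"{na}*{nb}"] = [a * b for a, b in zip(va, vb)]
    ranked = sorted(candidates.items(),
                    key=lambda kv: abs(pearson(kv[1], labels)),
                    reverse=True)
    return ranked[:top_k]

feats = {"x1": [1, 2, 3, 4], "x2": [2, 1, 4, 3], "x3": [0, 1, 0, 1]}
y = [2, 2, 12, 12]
for name, _ in generate_and_rank(feats, y):
    print(name)  # "x1*x2" ranks first: it correlates perfectly with y
```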

An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability

Title An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability
Authors Johannes Hellrich, Udo Hahn
Abstract
Tasks Semantic Textual Similarity
Published 2016-08-01
URL https://www.aclweb.org/anthology/W16-2114/
PDF https://www.aclweb.org/anthology/W16-2114
PWC https://paperswithcode.com/paper/an-assessment-of-experimental-protocols-for
Repo https://github.com/hellrich/latech2016
Framework none

Efficient Coarse-To-Fine PatchMatch for Large Displacement Optical Flow

Title Efficient Coarse-To-Fine PatchMatch for Large Displacement Optical Flow
Authors Yinlin Hu, Rui Song, Yunsong Li
Abstract As a key component in many computer vision systems, optical flow estimation, especially with large displacements, remains an open problem. In this paper we present a simple but powerful matching method that works in a coarse-to-fine scheme for optical flow estimation. Inspired by nearest neighbor field (NNF) algorithms, our approach, called CPM (Coarse-to-fine PatchMatch), blends an efficient random search strategy with the coarse-to-fine scheme for the optical flow problem. Unlike existing NNF techniques, which are efficient but whose results are often too noisy for optical flow due to the lack of global regularization, we propose a propagation step with a constrained random search radius between adjacent levels of the hierarchical architecture. The resulting correspondences enjoy a built-in smoothing effect, which is better suited to optical flow estimation than NNF techniques. Furthermore, our approach can also capture tiny structures with large motions, which are a problem for traditional coarse-to-fine optical flow algorithms. Interpolated by an edge-preserving interpolation method (EpicFlow), our method outperforms the state of the art on MPI-Sintel and KITTI, and runs much faster than the competing methods.
Tasks Optical Flow Estimation
Published 2016-06-01
URL http://openaccess.thecvf.com/content_cvpr_2016/html/Hu_Efficient_Coarse-To-Fine_PatchMatch_CVPR_2016_paper.html
PDF http://openaccess.thecvf.com/content_cvpr_2016/papers/Hu_Efficient_Coarse-To-Fine_PatchMatch_CVPR_2016_paper.pdf
PWC https://paperswithcode.com/paper/efficient-coarse-to-fine-patchmatch-for-large
Repo https://github.com/YinlinHu/CPM
Framework none
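
The key mechanism, initialising each pyramid level from the coarser one and restricting the random search radius, can be shown on a 1-D toy problem. The sketch below drops the 2-D patches, the neighbour propagation, and the EpicFlow interpolation of the real method, so it only conveys the coarse-to-fine random-search idea.

```python
import random

def match_cost(src, dst, i, d, radius=2):
    """Sum of squared differences between patches around src[i] and dst[i + d]."""
    cost = 0.0
    for k in range(-radius, radius + 1):
        a = src[min(max(i + k, 0), len(src) - 1)]
        b = dst[min(max(i + d + k, 0), len(dst) - 1)]
        cost += (a - b) ** 2
    return cost

def coarse_to_fine_match(src, dst, levels=3, search_radius=2, iters=8):
    """1-D coarse-to-fine random-search matching in the spirit of CPM.

    Each level is initialised from the upsampled matches of the coarser level
    and refined with a random search restricted to `search_radius`.
    """
    pyramid = [(src, dst)]
    for _ in range(levels - 1):
        s, d = pyramid[-1]
        pyramid.append((s[::2], d[::2]))       # simple downsampling by 2
    flow = [0] * len(pyramid[-1][0])           # zero offsets at the coarsest level
    for level in range(levels - 1, -1, -1):
        s, d = pyramid[level]
        if level < levels - 1:                 # upsample the coarser flow field
            flow = [2 * flow[min(i // 2, len(flow) - 1)] for i in range(len(s))]
        for i in range(len(s)):                # constrained random search per position
            best = flow[i]
            for _ in range(iters):
                cand = best + random.randint(-search_radius, search_radius)
                if match_cost(s, d, i, cand) < match_cost(s, d, i, best):
                    best = cand
            flow[i] = best
    return flow

src = [float(x) for x in range(32)]
dst = [src[0]] * 5 + src[:-5]                  # the same signal shifted right by 5
print(coarse_to_fine_match(src, dst))          # most offsets end up close to +5
```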

Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models

Title Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models
Authors Karl Stratos, Michael Collins, Daniel Hsu
Abstract We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call anchor HMMs, assume that each tag is associated with at least one word that can have no other tag, which is a relatively benign condition for POS tagging (e.g., "the" is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong baselines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010). Furthermore, it produces an interpretable model in which hidden states are automatically lexicalized by words.
Tasks Part-Of-Speech Tagging, Unsupervised Part-Of-Speech Tagging
Published 2016-01-01
URL https://www.aclweb.org/anthology/Q16-1018/
PDF https://www.aclweb.org/anthology/Q16-1018
PWC https://paperswithcode.com/paper/unsupervised-part-of-speech-tagging-with
Repo https://github.com/karlstratos/anchor
Framework none
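
The anchor assumption itself is easy to state in code: an anchor word is one that occurs under a single tag. The check below is supervised and purely illustrative; the paper recovers anchors without any tag annotation, via a non-negative matrix factorization of co-occurrence statistics.

```python
def find_anchor_candidates(word_tag_counts):
    """Return words whose observed occurrences all carry a single tag.

    Illustrates the anchor property only; the counts here are toy numbers,
    not derived from a real corpus.
    """
    anchors = {}
    for word, counts in word_tag_counts.items():
        tags = [t for t, c in counts.items() if c > 0]
        if len(tags) == 1:
            anchors[word] = tags[0]
    return anchors

counts = {
    "the": {"DET": 1200},               # anchor candidate for DET
    "run": {"VERB": 310, "NOUN": 95},   # ambiguous, not an anchor
    "quickly": {"ADV": 87},             # anchor candidate for ADV
}
print(find_anchor_candidates(counts))   # {'the': 'DET', 'quickly': 'ADV'}
```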

Latent Tree Language Model

Title Latent Tree Language Model
Authors Tom{'a}{\v{s}} Brychc{'\i}n
Abstract
Tasks Language Modelling, Machine Translation, Optical Character Recognition, Speech Recognition
Published 2016-11-01
URL https://www.aclweb.org/anthology/D16-1042/
PDF https://www.aclweb.org/anthology/D16-1042
PWC https://paperswithcode.com/paper/latent-tree-language-model-1
Repo https://github.com/brychcin/LTLM
Framework none

A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation

Title A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation
Authors Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, Alexander Sorkine-Hornung
Abstract Over the years, datasets and benchmarks have proven their fundamental importance in computer vision research, enabling targeted progress and objective comparisons in many fields. At the same time, legacy datasets may impede the evolution of a field due to saturated algorithm performance and the lack of contemporary, high quality data. In this work we present a new benchmark dataset and evaluation methodology for the area of video object segmentation. The dataset, named DAVIS (Densely Annotated VIdeo Segmentation), consists of fifty high quality, Full HD video sequences, spanning multiple occurrences of common video object segmentation challenges such as occlusions, motion-blur and appearance changes. Each video is accompanied by densely annotated, pixel-accurate and per-frame ground truth segmentation. In addition, we provide a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics that measure the spatial extent of the segmentation, the accuracy of the silhouette contours and the temporal coherence. The results uncover strengths and weaknesses of current approaches, opening up promising directions for future work.
Tasks Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Published 2016-06-01
URL http://openaccess.thecvf.com/content_cvpr_2016/html/Perazzi_A_Benchmark_Dataset_CVPR_2016_paper.html
PDF http://openaccess.thecvf.com/content_cvpr_2016/papers/Perazzi_A_Benchmark_Dataset_CVPR_2016_paper.pdf
PWC https://paperswithcode.com/paper/a-benchmark-dataset-and-evaluation
Repo https://github.com/fperazzi/davis
Framework none
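
Of the three metrics mentioned above, region similarity (the Jaccard index, usually written J) is the simplest to compute per frame; contour accuracy and temporal stability require boundary extraction and are omitted from this sketch, which uses tiny hand-made masks.

```python
def jaccard(pred_mask, gt_mask):
    """Region similarity J (intersection over union) for one frame.

    Masks are flat lists of 0/1 values. DAVIS reports the mean J over a
    sequence alongside contour accuracy F and temporal stability T.
    """
    inter = sum(p and g for p, g in zip(pred_mask, gt_mask))
    union = sum(p or g for p, g in zip(pred_mask, gt_mask))
    return inter / union if union else 1.0

# Two tiny 2x3 frames, flattened row by row.
gt = [1, 1, 0, 1, 0, 0]
pred = [1, 1, 1, 0, 0, 0]
print(jaccard(pred, gt))   # 2 / 4 = 0.5
```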

Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources

Title Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources
Authors Gintarė Grigonytė, Kristina Nilsson Björkenstam
Abstract
Tasks Language Acquisition
Published 2016-11-01
URL https://www.aclweb.org/anthology/W16-6506/
PDF https://www.aclweb.org/anthology/W16-6506
PWC https://paperswithcode.com/paper/language-independent-exploration-of
Repo https://github.com/ginta-re/Varseta
Framework none

Parsing as Language Modeling

Title Parsing as Language Modeling
Authors Do Kook Choe, Eugene Charniak
Abstract
Tasks Constituency Parsing, Dependency Parsing, Language Modelling
Published 2016-11-01
URL https://www.aclweb.org/anthology/D16-1257/
PDF https://www.aclweb.org/anthology/D16-1257
PWC https://paperswithcode.com/paper/parsing-as-language-modeling
Repo https://github.com/cdg720/emnlp2016
Framework tf

Valencer: an API to Query Valence Patterns in FrameNet

Title Valencer: an API to Query Valence Patterns in FrameNet
Authors Alexandre Kabbach, Corentin Ribeyre
Abstract This paper introduces Valencer: a RESTful API to search for annotated sentences matching a given combination of syntactic realizations of the arguments of a predicate (also called a 'valence pattern') in the FrameNet database. The API takes as input an HTTP GET request specifying a valence pattern and outputs a list of exemplifying annotated sentences in JSON format. The API is designed to be modular and language-independent, and can therefore be easily integrated into other (NLP) server-side or client-side applications, as well as non-English FrameNet projects. Valencer is free, open-source, and licensed under the MIT license.
Tasks Question Answering
Published 2016-12-01
URL https://www.aclweb.org/anthology/C16-2033/
PDF https://www.aclweb.org/anthology/C16-2033
PWC https://paperswithcode.com/paper/valencer-an-api-to-query-valence-patterns-in
Repo https://github.com/akb89/valencer
Framework none
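
A minimal client sketch for the kind of request the abstract describes. The base URL, route, and parameter name below are assumptions for illustration only; the abstract specifies just an HTTP GET carrying a valence pattern and a JSON response, so check the Valencer repository for the actual endpoints.

```python
import requests

# Assumed local deployment and route; not taken from the paper or repo docs.
BASE_URL = "http://localhost:3030/valencer/annoSets"
pattern = "Donor.NP.Ext Theme.NP.Obj"   # example valence pattern string

response = requests.get(BASE_URL, params={"vp": pattern}, timeout=10)
response.raise_for_status()
for sentence in response.json():        # assumed: a JSON list of annotated sentences
    print(sentence)
```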

Aicyber at SemEval-2016 Task 4: i-vector based sentence representation

Title Aicyber at SemEval-2016 Task 4: i-vector based sentence representation
Authors Steven Du, Xi Zhang
Abstract
Tasks Sentiment Analysis, Speaker Verification
Published 2016-06-01
URL https://www.aclweb.org/anthology/S16-1017/
PDF https://www.aclweb.org/anthology/S16-1017
PWC https://paperswithcode.com/paper/aicyber-at-semeval-2016-task-4-i-vector-based
Repo https://github.com/StevenLOL/aicyber_semeval_2016_ivector
Framework none

UW-CSE at SemEval-2016 Task 10: Detecting Multiword Expressions and Supersenses using Double-Chained Conditional Random Fields

Title UW-CSE at SemEval-2016 Task 10: Detecting Multiword Expressions and Supersenses using Double-Chained Conditional Random Fields
Authors Mohammad Javad Hosseini, Noah A. Smith, Su-In Lee
Abstract
Tasks
Published 2016-06-01
URL https://www.aclweb.org/anthology/S16-1143/
PDF https://www.aclweb.org/anthology/S16-1143
PWC https://paperswithcode.com/paper/uw-cse-at-semeval-2016-task-10-detecting
Repo https://github.com/mjhosseini/2-CRF-MWE
Framework none

PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors

Title PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors
Authors David R. Mortensen, Patrick Littell, Akash Bharadwaj, Kartik Goyal, Chris Dyer, Lori Levin
Abstract This paper contributes to a growing body of evidence that, when coupled with appropriate machine-learning techniques, linguistically motivated, information-rich representations can outperform one-hot encodings of linguistic data. In particular, we show that phonological features outperform character-based models. PanPhon is a database relating over 5,000 IPA segments to 21 subsegmental articulatory features. We show that this database boosts performance in various NER-related tasks. Phonologically aware, neural CRF models built on PanPhon features are able to perform better on monolingual Spanish and Turkish NER tasks than character-based models. They have also been shown to work well in transfer models (as between Uzbek and Turkish). PanPhon features also contribute measurably to orthography-to-IPA conversion tasks.
Tasks
Published 2016-12-01
URL https://www.aclweb.org/anthology/C16-1328/
PDF https://www.aclweb.org/anthology/C16-1328
PWC https://paperswithcode.com/paper/panphon-a-resource-for-mapping-ipa-segments
Repo https://github.com/dmort27/panphon
Framework none
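
A short look at querying the released panphon package for the articulatory feature vectors of a word. The method name and arguments follow my reading of the panphon README and may differ between versions, so treat them as assumptions and check the repository linked above.

```python
import panphon

# FeatureTable maps IPA strings to per-segment articulatory feature vectors.
ft = panphon.FeatureTable()

# Assumed call: one numeric feature vector (+1 / 0 / -1) per IPA segment.
vectors = ft.word_to_vector_list("pʰon", numeric=True)
print(len(vectors), "segments,", len(vectors[0]), "features per segment")
```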

Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings

Title Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings
Authors Ondřej Dušek, Filip Jurčíček
Abstract
Tasks Spoken Dialogue Systems, Text Generation
Published 2016-08-01
URL https://www.aclweb.org/anthology/P16-2008/
PDF https://www.aclweb.org/anthology/P16-2008
PWC https://paperswithcode.com/paper/sequence-to-sequence-generation-for-spoken-1
Repo https://github.com/UFAL-DSG/tgen
Framework tf