May 5, 2019

1800 words 9 mins read

Paper Group NAWR 4

Adapting to All Domains at Once: Rewarding Domain Invariance in SMT. C4Corpus: Multilingual Web-size Corpus with Free License. ExploreKit: Automatic Feature Generation and Selection. An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability. Efficient Coarse-To-Fine PatchMatch for Large Displ …

Adapting to All Domains at Once: Rewarding Domain Invariance in SMT

Title Adapting to All Domains at Once: Rewarding Domain Invariance in SMT
Authors Hoang Cuong, Khalil Sima'an, Ivan Titov
Abstract Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced sub-domains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.
Tasks Domain Adaptation, Machine Translation
Published 2016-01-01
URL https://www.aclweb.org/anthology/Q16-1008/
PDF https://www.aclweb.org/anthology/Q16-1008
PWC https://paperswithcode.com/paper/adapting-to-all-domains-at-once-rewarding
Repo https://github.com/hoangcuong2011/UDIT
Framework none
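
The core intuition above, preferring phrases whose usage is spread evenly across the induced sub-domains, can be illustrated with a small entropy-based score. The sketch below is only an illustration of that intuition, not the paper's actual feature set or training procedure; the sub-domain names and counts are made up.

```python
import math

def domain_invariance_score(phrase_counts):
    """Entropy of a phrase's sub-domain distribution.

    `phrase_counts` maps an induced sub-domain id to the phrase's count in
    that sub-domain. A near-uniform distribution (high entropy) suggests a
    domain-invariant phrase; a peaked one suggests a domain-specific phrase.
    """
    total = sum(phrase_counts.values())
    probs = [c / total for c in phrase_counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Toy counts of two phrases across three induced sub-domains.
invariant = {"news": 40, "medical": 38, "subtitles": 42}
specific = {"news": 2, "medical": 115, "subtitles": 3}

print(domain_invariance_score(invariant))  # high entropy: safer, domain-invariant
print(domain_invariance_score(specific))   # low entropy: risky, domain-specific
```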

C4Corpus: Multilingual Web-size Corpus with Free License

Title C4Corpus: Multilingual Web-size Corpus with Free License
Authors Ivan Habernal, Omnia Zayed, Iryna Gurevych
Abstract Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12-million-page Web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs. Our highly scalable Hadoop-based framework is able to process the full CommonCrawl corpus on a 2,000+ CPU cluster on the Amazon Elastic Map/Reduce infrastructure. The processing pipeline includes license identification, state-of-the-art boilerplate removal, exact duplicate and near-duplicate document removal, and language detection. The construction of the corpus is highly configurable and fully reproducible, and we provide both the framework (DKPro C4CorpusTools) and the resulting data (C4Corpus) to the research community.
Tasks
Published 2016-05-01
URL https://www.aclweb.org/anthology/L16-1146/
PDF https://www.aclweb.org/anthology/L16-1146
PWC https://paperswithcode.com/paper/c4corpus-multilingual-web-size-corpus-with
Repo https://github.com/dkpro/dkpro-c4corpus
Framework none
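
The pipeline described above chains license identification, boilerplate removal, duplicate removal, and language detection as Hadoop jobs. The snippet below sketches only the exact-duplicate stage on a single machine, hashing whitespace-normalised text; it is an illustration, not the DKPro C4CorpusTools implementation.

```python
import hashlib

def exact_dedup(documents):
    """Drop exact duplicates by hashing whitespace-normalised document text.

    One of several stages in a C4Corpus-style pipeline; the real framework
    runs this (and near-duplicate removal) as distributed Hadoop jobs.
    """
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha1(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Creative Commons  text.", "Creative Commons text.", "Another page."]
print(exact_dedup(docs))  # the normalised duplicate is removed
```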

ExploreKit: Automatic Feature Generation and Selection

Title ExploreKit: Automatic Feature Generation and Selection
Authors Gilad Katz, Eui Chul Richard Shin, Dawn Song
Abstract Feature generation is one of the challenging aspects of machine learning. We present ExploreKit, a framework for automated feature generation. ExploreKit generates a large set of candidate features by combining information in the original features, with the aim of maximizing predictive performance according to user-selected criteria. To overcome the exponential growth of the feature space, ExploreKit uses a novel machine learning-based feature selection approach to predict the usefulness of new candidate features. This approach enables efficient identification of the new features and produces superior results compared to existing feature selection solutions. We demonstrate the effectiveness and robustness of our approach by conducting an extensive evaluation on 25 datasets and 3 different classification algorithms. We show that ExploreKit can achieve classification-error reduction of 20% overall.
Tasks Automated Feature Engineering, Feature Selection
Published 2016-01-01
URL https://ieeexplore.ieee.org/document/7837936
PDF http://people.eecs.berkeley.edu/~dawnsong/papers/icdm-2016.pdf
PWC https://paperswithcode.com/paper/explorekit-automatic-feature-generation-and
Repo https://github.com/giladkatz/ExploreKit
Framework none
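
A rough sketch of the generate-then-rank loop described above. ExploreKit's real candidate operators and its learned ranking meta-model are far richer; here pairwise products stand in for candidate generation and a simple label-correlation filter stands in for the ranker. All feature names and values are toy data.

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def generate_and_rank(features, labels, top_k=2):
    """Generate pairwise product features and keep the top-ranked candidates.

    `features` maps feature names to equal-length value lists; candidates are
    ranked by absolute correlation with the labels, a crude stand-in for
    ExploreKit's learned usefulness predictor.
    """
    candidates = {}
    for (na, va), (nb, vb) in combinations(features.items(), 2):
        candidates[f"{na}*{nb}"] = [a * b for a, b in zip(va, vb)]
    ranked = sorted(candidates.items(),
                    key=lambda kv: abs(pearson(kv[1], labels)),
                    reverse=True)
    return ranked[:top_k]

feats = {"x1": [1, 2, 3, 4], "x2": [2, 1, 4, 3], "x3": [0, 1, 0, 1]}
y = [2, 2, 12, 12]
for name, _ in generate_and_rank(feats, y):
    print(name)  # "x1*x2" ranks first: it correlates perfectly with y
```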

An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability

Title An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability
Authors Johannes Hellrich, Udo Hahn
Abstract
Tasks Semantic Textual Similarity
Published 2016-08-01
URL https://www.aclweb.org/anthology/W16-2114/
PDF https://www.aclweb.org/anthology/W16-2114
PWC https://paperswithcode.com/paper/an-assessment-of-experimental-protocols-for
Repo https://github.com/hellrich/latech2016
Framework none

Efficient Coarse-To-Fine PatchMatch for Large Displacement Optical Flow

Title Efficient Coarse-To-Fine PatchMatch for Large Displacement Optical Flow
Authors Yinlin Hu, Rui Song, Yunsong Li
Abstract As a key component in many computer vision systems, optical flow estimation, especially with large displacements, remains an open problem. In this paper we present a simple but powerful matching method that works in a coarse-to-fine scheme for optical flow estimation. Inspired by nearest neighbor field (NNF) algorithms, our approach, called CPM (Coarse-to-fine PatchMatch), blends an efficient random search strategy with the coarse-to-fine scheme for the optical flow problem. Unlike existing NNF techniques, which are efficient but whose results are often too noisy for optical flow due to the lack of global regularization, we propose a propagation step with a constrained random search radius between adjacent levels of the hierarchical architecture. The resulting correspondences enjoy a built-in smoothing effect, which is better suited to optical flow estimation than NNF techniques. Furthermore, our approach can also capture tiny structures with large motions, which are a problem for traditional coarse-to-fine optical flow algorithms. Interpolated by an edge-preserving interpolation method (EpicFlow), our method outperforms the state of the art on MPI-Sintel and KITTI, and runs much faster than the competing methods.
Tasks Optical Flow Estimation
Published 2016-06-01
URL http://openaccess.thecvf.com/content_cvpr_2016/html/Hu_Efficient_Coarse-To-Fine_PatchMatch_CVPR_2016_paper.html
PDF http://openaccess.thecvf.com/content_cvpr_2016/papers/Hu_Efficient_Coarse-To-Fine_PatchMatch_CVPR_2016_paper.pdf
PWC https://paperswithcode.com/paper/efficient-coarse-to-fine-patchmatch-for-large
Repo https://github.com/YinlinHu/CPM
Framework none
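
The key mechanism, initialising each pyramid level from the coarser one and restricting the random search radius, can be shown on a 1-D toy problem. The sketch below drops the 2-D patches, the neighbour propagation, and the EpicFlow interpolation of the real method, so it only conveys the coarse-to-fine random-search idea.

```python
import random

def match_cost(src, dst, i, d, radius=2):
    """Sum of squared differences between patches around src[i] and dst[i + d]."""
    cost = 0.0
    for k in range(-radius, radius + 1):
        a = src[min(max(i + k, 0), len(src) - 1)]
        b = dst[min(max(i + d + k, 0), len(dst) - 1)]
        cost += (a - b) ** 2
    return cost

def coarse_to_fine_match(src, dst, levels=3, search_radius=2, iters=8):
    """1-D coarse-to-fine random-search matching in the spirit of CPM.

    Each level is initialised from the upsampled matches of the coarser level
    and refined with a random search restricted to `search_radius`.
    """
    pyramid = [(src, dst)]
    for _ in range(levels - 1):
        s, d = pyramid[-1]
        pyramid.append((s[::2], d[::2]))       # simple downsampling by 2
    flow = [0] * len(pyramid[-1][0])           # zero offsets at the coarsest level
    for level in range(levels - 1, -1, -1):
        s, d = pyramid[level]
        if level < levels - 1:                 # upsample the coarser flow field
            flow = [2 * flow[min(i // 2, len(flow) - 1)] for i in range(len(s))]
        for i in range(len(s)):                # constrained random search per position
            best = flow[i]
            for _ in range(iters):
                cand = best + random.randint(-search_radius, search_radius)
                if match_cost(s, d, i, cand) < match_cost(s, d, i, best):
                    best = cand
            flow[i] = best
    return flow

src = [float(x) for x in range(32)]
dst = [src[0]] * 5 + src[:-5]                  # the same signal shifted right by 5
print(coarse_to_fine_match(src, dst))          # most offsets end up close to +5
```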

Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models

Title Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models
Authors Karl Stratos, Michael Collins, Daniel Hsu
Abstract We tackle unsupervised part-of-speech (POS) tagging by learning hidden Markov models (HMMs) that are particularly well-suited for the problem. These HMMs, which we call anchor HMMs, assume that each tag is associated with at least one word that can have no other tag, which is a relatively benign condition for POS tagging (e.g., "the" is a word that appears only under the determiner tag). We exploit this assumption and extend the non-negative matrix factorization framework of Arora et al. (2013) to design a consistent estimator for anchor HMMs. In experiments, our algorithm is competitive with strong baselines such as the clustering method of Brown et al. (1992) and the log-linear model of Berg-Kirkpatrick et al. (2010). Furthermore, it produces an interpretable model in which hidden states are automatically lexicalized by words.
Tasks Part-Of-Speech Tagging, Unsupervised Part-Of-Speech Tagging
Published 2016-01-01
URL https://www.aclweb.org/anthology/Q16-1018/
PDF https://www.aclweb.org/anthology/Q16-1018
PWC https://paperswithcode.com/paper/unsupervised-part-of-speech-tagging-with
Repo https://github.com/karlstratos/anchor
Framework none
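
The anchor assumption itself is easy to state in code: an anchor word is one that occurs under a single tag. The check below is supervised and purely illustrative; the paper recovers anchors without any tag annotation, via a non-negative matrix factorization of co-occurrence statistics.

```python
def find_anchor_candidates(word_tag_counts):
    """Return words whose observed occurrences all carry a single tag.

    Illustrates the anchor property only; the counts here are toy numbers,
    not derived from a real corpus.
    """
    anchors = {}
    for word, counts in word_tag_counts.items():
        tags = [t for t, c in counts.items() if c > 0]
        if len(tags) == 1:
            anchors[word] = tags[0]
    return anchors

counts = {
    "the": {"DET": 1200},               # anchor candidate for DET
    "run": {"VERB": 310, "NOUN": 95},   # ambiguous, not an anchor
    "quickly": {"ADV": 87},             # anchor candidate for ADV
}
print(find_anchor_candidates(counts))   # {'the': 'DET', 'quickly': 'ADV'}
```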

Latent Tree Language Model

Title Latent Tree Language Model
Authors Tom{'a}{\v{s}} Brychc{'\i}n
Abstract
Tasks Language Modelling, Machine Translation, Optical Character Recognition, Speech Recognition
Published 2016-11-01
URL https://www.aclweb.org/anthology/D16-1042/
PDF https://www.aclweb.org/anthology/D16-1042
PWC https://paperswithcode.com/paper/latent-tree-language-model-1
Repo https://github.com/brychcin/LTLM
Framework none

A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation

Title A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation
Authors Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, Alexander Sorkine-Hornung
Abstract Over the years, datasets and benchmarks have proven their fundamental importance in computer vision research, enabling targeted progress and objective comparisons in many fields. At the same time, legacy datasets may impede the evolution of a field due to saturated algorithm performance and the lack of contemporary, high quality data. In this work we present a new benchmark dataset and evaluation methodology for the area of video object segmentation. The dataset, named DAVIS (Densely Annotated VIdeo Segmentation), consists of fifty high quality, Full HD video sequences, spanning multiple occurrences of common video object segmentation challenges such as occlusions, motion-blur and appearance changes. Each video is accompanied by densely annotated, pixel-accurate and per-frame ground truth segmentation. In addition, we provide a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics that measure the spatial extent of the segmentation, the accuracy of the silhouette contours and the temporal coherence. The results uncover strengths and weaknesses of current approaches, opening up promising directions for future work.
Tasks Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation
Published 2016-06-01
URL http://openaccess.thecvf.com/content_cvpr_2016/html/Perazzi_A_Benchmark_Dataset_CVPR_2016_paper.html
PDF http://openaccess.thecvf.com/content_cvpr_2016/papers/Perazzi_A_Benchmark_Dataset_CVPR_2016_paper.pdf
PWC https://paperswithcode.com/paper/a-benchmark-dataset-and-evaluation
Repo https://github.com/fperazzi/davis
Framework none
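
Of the three metrics mentioned above, region similarity (the Jaccard index, usually written J) is the simplest to compute per frame; contour accuracy and temporal stability require boundary extraction and are omitted from this sketch, which uses tiny hand-made masks.

```python
def jaccard(pred_mask, gt_mask):
    """Region similarity J (intersection over union) for one frame.

    Masks are flat lists of 0/1 values. DAVIS reports the mean J over a
    sequence alongside contour accuracy F and temporal stability T.
    """
    inter = sum(p and g for p, g in zip(pred_mask, gt_mask))
    union = sum(p or g for p, g in zip(pred_mask, gt_mask))
    return inter / union if union else 1.0

# Two tiny 2x3 frames, flattened row by row.
gt = [1, 1, 0, 1, 0, 0]
pred = [1, 1, 1, 0, 0, 0]
print(jaccard(pred, gt))   # 2 / 4 = 0.5
```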

Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources

Title Language-independent exploration of repetition and variation in longitudinal child-directed speech: A tool and resources
Authors Gintarė Grigonytė, Kristina Nilsson Björkenstam
Abstract
Tasks Language Acquisition
Published 2016-11-01
URL https://www.aclweb.org/anthology/W16-6506/
PDF https://www.aclweb.org/anthology/W16-6506
PWC https://paperswithcode.com/paper/language-independent-exploration-of
Repo https://github.com/ginta-re/Varseta
Framework none

Parsing as Language Modeling

Title Parsing as Language Modeling
Authors Do Kook Choe, Eugene Charniak
Abstract
Tasks Constituency Parsing, Dependency Parsing, Language Modelling
Published 2016-11-01
URL https://www.aclweb.org/anthology/D16-1257/
PDF https://www.aclweb.org/anthology/D16-1257
PWC https://paperswithcode.com/paper/parsing-as-language-modeling
Repo https://github.com/cdg720/emnlp2016
Framework tf

Valencer: an API to Query Valence Patterns in FrameNet

Title Valencer: an API to Query Valence Patterns in FrameNet
Authors Alexandre Kabbach, Corentin Ribeyre
Abstract This paper introduces Valencer: a RESTful API to search for annotated sentences matching a given combination of syntactic realizations of the arguments of a predicate (also called a 'valence pattern') in the FrameNet database. The API takes as input an HTTP GET request specifying a valence pattern and outputs a list of exemplifying annotated sentences in JSON format. The API is designed to be modular and language-independent, and can therefore be easily integrated into other (NLP) server-side or client-side applications, as well as non-English FrameNet projects. Valencer is free, open-source, and licensed under the MIT license.
Tasks Question Answering
Published 2016-12-01
URL https://www.aclweb.org/anthology/C16-2033/
PDF https://www.aclweb.org/anthology/C16-2033
PWC https://paperswithcode.com/paper/valencer-an-api-to-query-valence-patterns-in
Repo https://github.com/akb89/valencer
Framework none
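
A minimal client sketch for the kind of request the abstract describes. The base URL, route, and parameter name below are assumptions for illustration only; the abstract specifies just an HTTP GET carrying a valence pattern and a JSON response, so check the Valencer repository for the actual endpoints.

```python
import requests

# Assumed local deployment and route; not taken from the paper or repo docs.
BASE_URL = "http://localhost:3030/valencer/annoSets"
pattern = "Donor.NP.Ext Theme.NP.Obj"   # example valence pattern string

response = requests.get(BASE_URL, params={"vp": pattern}, timeout=10)
response.raise_for_status()
for sentence in response.json():        # assumed: a JSON list of annotated sentences
    print(sentence)
```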

Aicyber at SemEval-2016 Task 4: i-vector based sentence representation

Title Aicyber at SemEval-2016 Task 4: i-vector based sentence representation
Authors Steven Du, Xi Zhang
Abstract
Tasks Sentiment Analysis, Speaker Verification
Published 2016-06-01
URL https://www.aclweb.org/anthology/S16-1017/
PDF https://www.aclweb.org/anthology/S16-1017
PWC https://paperswithcode.com/paper/aicyber-at-semeval-2016-task-4-i-vector-based
Repo https://github.com/StevenLOL/aicyber_semeval_2016_ivector
Framework none

UW-CSE at SemEval-2016 Task 10: Detecting Multiword Expressions and Supersenses using Double-Chained Conditional Random Fields

Title UW-CSE at SemEval-2016 Task 10: Detecting Multiword Expressions and Supersenses using Double-Chained Conditional Random Fields
Authors Mohammad Javad Hosseini, Noah A. Smith, Su-In Lee
Abstract
Tasks
Published 2016-06-01
URL https://www.aclweb.org/anthology/S16-1143/
PDF https://www.aclweb.org/anthology/S16-1143
PWC https://paperswithcode.com/paper/uw-cse-at-semeval-2016-task-10-detecting
Repo https://github.com/mjhosseini/2-CRF-MWE
Framework none

PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors

Title PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors
Authors David R. Mortensen, Patrick Littell, Akash Bharadwaj, Kartik Goyal, Chris Dyer, Lori Levin
Abstract This paper contributes to a growing body of evidence that, when coupled with appropriate machine-learning techniques, linguistically motivated, information-rich representations can outperform one-hot encodings of linguistic data. In particular, we show that phonological features outperform character-based models. PanPhon is a database relating over 5,000 IPA segments to 21 subsegmental articulatory features. We show that this database boosts performance in various NER-related tasks. Phonologically aware, neural CRF models built on PanPhon features are able to perform better on monolingual Spanish and Turkish NER tasks than character-based models. They have also been shown to work well in transfer models (as between Uzbek and Turkish). PanPhon features also contribute measurably to orthography-to-IPA conversion tasks.
Tasks
Published 2016-12-01
URL https://www.aclweb.org/anthology/C16-1328/
PDF https://www.aclweb.org/anthology/C16-1328
PWC https://paperswithcode.com/paper/panphon-a-resource-for-mapping-ipa-segments
Repo https://github.com/dmort27/panphon
Framework none
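
A short look at querying the released panphon package for the articulatory feature vectors of a word. The method name and arguments follow my reading of the panphon README and may differ between versions, so treat them as assumptions and check the repository linked above.

```python
import panphon

# FeatureTable maps IPA strings to per-segment articulatory feature vectors.
ft = panphon.FeatureTable()

# Assumed call: one numeric feature vector (+1 / 0 / -1) per IPA segment.
vectors = ft.word_to_vector_list("pʰon", numeric=True)
print(len(vectors), "segments,", len(vectors[0]), "features per segment")
```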

Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings

Title Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings
Authors Ondřej Dušek, Filip Jurčíček
Abstract
Tasks Spoken Dialogue Systems, Text Generation
Published 2016-08-01
URL https://www.aclweb.org/anthology/P16-2008/
PDF https://www.aclweb.org/anthology/P16-2008
PWC https://paperswithcode.com/paper/sequence-to-sequence-generation-for-spoken-1
Repo https://github.com/UFAL-DSG/tgen
Framework tf