Paper Group ANR 89
Papers in this group:
Learning spectro-temporal features with 3D CNNs for speech emotion recognition
Evaluation of Deep Learning on an Abstract Image Classification Dataset
Unsupervised neural and Bayesian models for zero-resource speech processing
No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models
Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks
Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression
Providing Self-Aware Systems with Reflexivity
Mining Smart Card Data for Travelers’ Mini Activities
Robust Optimization of Unconstrained Binary Quadratic Problems
Relevance-based Word Embedding
Deep Speaker Verification: Do We Need End to End?
Perceiving and Reasoning About Liquids Using Fully Convolutional Networks
An Investigation of Newton-Sketch and Subsampled Newton Methods
Discovering objects and their relations from entangled scene representations
Attentive Memory Networks: Efficient Machine Reading for Conversational Search
Learning spectro-temporal features with 3D CNNs for speech emotion recognition
Title | Learning spectro-temporal features with 3D CNNs for speech emotion recognition |
Authors | Jaebok Kim, Khiet P. Truong, Gwenn Englebienne, Vanessa Evers |
Abstract | In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) in order to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that give a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning compared to other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct clusters of emotions. |
Tasks | Emotion Recognition, Speech Emotion Recognition |
Published | 2017-08-14 |
URL | http://arxiv.org/abs/1708.05071v1 |
PDF | http://arxiv.org/pdf/1708.05071v1.pdf |
PWC | https://paperswithcode.com/paper/learning-spectro-temporal-features-with-3d |
Repo | |
Framework | |
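The paper's core claim is that a single 3D convolution stack can capture both short- and long-term spectro-temporal dynamics with a moderate parameter count. Below is a minimal PyTorch sketch of that idea; the kernel sizes, channel widths, and input layout are illustrative assumptions rather than the authors' exact configuration, though the kernels keep a small temporal extent and a larger spectral extent, in line with the paper's finding.

```python
import torch
import torch.nn as nn

class SER3DCNN(nn.Module):
    # Input: (batch, 1, time_steps, mel_bins, context_frames) -- an
    # illustrative layout; the paper's exact input shape may differ.
    def __init__(self, n_emotions=4):
        super().__init__()
        self.features = nn.Sequential(
            # Shallow temporal extent (3) vs. deeper spectral extent (5),
            # echoing the paper's finding; exact sizes are assumptions.
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, n_emotions)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = SER3DCNN()
dummy = torch.randn(8, 1, 32, 40, 20)  # batch of log-mel cubes
print(model(dummy).shape)              # torch.Size([8, 4])
```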
Evaluation of Deep Learning on an Abstract Image Classification Dataset
Title | Evaluation of Deep Learning on an Abstract Image Classification Dataset |
Authors | Sebastian Stabinger, Antonio Rodriguez-Sanchez |
Abstract | Convolutional Neural Networks have become state-of-the-art methods for image classification over the last couple of years. By now they perform better than human subjects on many image classification datasets. Most of these datasets are based on the notion of concrete classes (i.e. images are classified by the type of object in the image). In this paper we present a novel image classification dataset, using abstract classes, which should be easy to solve for humans, but variations of it are challenging for CNNs. The classification performance of popular CNN architectures is evaluated on this dataset and variations of the dataset that might be interesting for further research are identified. |
Tasks | Image Classification |
Published | 2017-08-25 |
URL | http://arxiv.org/abs/1708.07770v1 |
PDF | http://arxiv.org/pdf/1708.07770v1.pdf |
PWC | https://paperswithcode.com/paper/evaluation-of-deep-learning-on-an-abstract |
Repo | |
Framework | |
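For intuition, here is a toy generator in the spirit of an abstract class: membership depends on a relation between objects (equal size or not) rather than on what the objects are. The specific class pair is hypothetical and not necessarily one of the dataset's actual classes.

```python
import numpy as np

def make_example(abstract_class, size=32, rng=None):
    """Toy abstract-class image: two square blobs that are either the
    same size (class 0) or different sizes (class 1). A hypothetical
    class pair illustrating the abstract-vs-concrete distinction; blobs
    may occasionally overlap, which a real generator would avoid."""
    rng = rng or np.random.default_rng()
    img = np.zeros((size, size), dtype=np.float32)
    s1 = rng.integers(3, 7)
    s2 = s1 if abstract_class == 0 else s1 + rng.integers(2, 5)
    for s in (s1, s2):
        y, x = rng.integers(0, size - s, size=2)
        img[y:y + s, x:x + s] = 1.0
    return img

X = np.stack([make_example(c % 2) for c in range(8)])
y = np.arange(8) % 2
print(X.shape, y)
```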
Unsupervised neural and Bayesian models for zero-resource speech processing
Title | Unsupervised neural and Bayesian models for zero-resource speech processing |
Authors | Herman Kamper |
Abstract | In settings where only unlabelled speech data is available, zero-resource speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text. There are two central problems in zero-resource speech processing: (i) finding frame-level feature representations which make it easier to discriminate between linguistic units (phones or words), and (ii) segmenting and clustering unlabelled speech into meaningful units. In this thesis, we argue that a combination of top-down and bottom-up modelling is advantageous in tackling these two problems. To address the problem of frame-level representation learning, we present the correspondence autoencoder (cAE), a neural network trained with weak top-down supervision from an unsupervised term discovery system. By combining this top-down supervision with unsupervised bottom-up initialization, the cAE yields much more discriminative features than previous approaches. We then present our unsupervised segmental Bayesian model that segments and clusters unlabelled speech into hypothesized words. By imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, our system outperforms several others on multi-speaker conversational English and Xitsonga speech data. Finally, we show that the clusters discovered by the segmental Bayesian model can be made less speaker- and gender-specific by using features from the cAE instead of traditional acoustic features. In summary, the different models and systems presented in this thesis show that both top-down and bottom-up modelling can improve representation learning, segmentation and clustering of unlabelled speech data. |
Tasks | Language Modelling, Representation Learning |
Published | 2017-01-03 |
URL | http://arxiv.org/abs/1701.00851v1 |
PDF | http://arxiv.org/pdf/1701.00851v1.pdf |
PWC | https://paperswithcode.com/paper/unsupervised-neural-and-bayesian-models-for |
Repo | |
Framework | |
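The cAE idea is compact enough to sketch: rather than reconstructing its own input, the network maps a frame from one occurrence of a discovered word onto the DTW-aligned frame of the matched occurrence, and the hidden layer is then used as the feature extractor. Layer sizes, the 39-dimensional input, and the synthetic frame pairs below are assumptions.

```python
import torch
import torch.nn as nn

class CorrespondenceAE(nn.Module):
    """Correspondence autoencoder sketch: weak top-down supervision
    comes from aligned frame pairs found by unsupervised term
    discovery, not from the input frame itself."""
    def __init__(self, dim=39, hidden=100):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.decode = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decode(self.encode(x))

cae = CorrespondenceAE()
opt = torch.optim.Adam(cae.parameters(), lr=1e-3)
frames_a = torch.randn(256, 39)  # frames from one term occurrence
frames_b = torch.randn(256, 39)  # DTW-aligned frames from its pair
for _ in range(10):
    opt.zero_grad()
    loss = nn.functional.mse_loss(cae(frames_a), frames_b)
    loss.backward()
    opt.step()
with torch.no_grad():
    features = cae.encode(frames_a)  # hidden layer = learned features
```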
No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models
Title | No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models |
Authors | Tara N. Sainath, Rohit Prabhavalkar, Shankar Kumar, Seungji Lee, Anjuli Kannan, David Rybach, Vlad Schogol, Patrick Nguyen, Bo Li, Yonghui Wu, Zhifeng Chen, Chung-Cheng Chiu |
Abstract | For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has begun to be challenged recently by end-to-end models which seek to combine acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based versus grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model, or from the joint learning of the various components with grapheme-based units. In this work, we conduct detailed experiments which are aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We examine phoneme-based end-to-end models, which are contrasted against grapheme-based ones on a large vocabulary English Voice Search task, where we find that graphemes do indeed outperform phonemes. We also compare grapheme- and phoneme-based approaches on a multi-dialect English task, which once again confirms the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects. |
Tasks | Language Modelling |
Published | 2017-12-05 |
URL | http://arxiv.org/abs/1712.01864v1 |
PDF | http://arxiv.org/pdf/1712.01864v1.pdf |
PWC | https://paperswithcode.com/paper/no-need-for-a-lexicon-evaluating-the-value-of |
Repo | |
Framework | |
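The practical difference between the two target inventories can be shown in a few lines: grapheme targets come straight from the spelling, while phoneme targets require a lexicon lookup. The phoneme string below is an illustrative ARPABET-style rendering, not taken from the paper.

```python
# Grapheme vs. phoneme targets for an end-to-end model: the grapheme
# system predicts characters directly, removing the need for an
# expert-curated lexicon mapping words to phoneme strings.
word = "search"
grapheme_targets = list(word)        # ['s', 'e', 'a', 'r', 'c', 'h']
phoneme_targets = ["S", "ER", "CH"]  # needs a pronunciation lexicon entry
print(grapheme_targets, phoneme_targets)
```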
Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks
Title | Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks |
Authors | Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Troy Chinen, Sung Jin Hwang, Joel Shor, George Toderici |
Abstract | We propose a method for lossy image compression based on recurrent, convolutional neural networks that outperforms BPG (4:2:0), WebP, JPEG2000, and JPEG as measured by MS-SSIM. We introduce three improvements over previous research that lead to this state-of-the-art result. First, we show that training with a pixel-wise loss weighted by SSIM increases reconstruction quality according to several metrics. Second, we modify the recurrent architecture to improve spatial diffusion, which allows the network to more effectively capture and propagate image information through the network’s hidden state. Finally, in addition to lossless entropy coding, we use a spatially adaptive bit allocation algorithm to more efficiently use the limited number of bits to encode visually complex image regions. We evaluate our method on the Kodak and Tecnick image sets and compare against standard codecs as well as recently published methods based on deep neural networks. |
Tasks | Image Compression |
Published | 2017-03-29 |
URL | http://arxiv.org/abs/1703.10114v1 |
PDF | http://arxiv.org/pdf/1703.10114v1.pdf |
PWC | https://paperswithcode.com/paper/improved-lossy-image-compression-with-priming |
Repo | |
Framework | |
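The first improvement, an SSIM-weighted pixel loss, admits a simple sketch: compute a local SSIM map between reconstruction and target and use it to up-weight poorly reconstructed patches. This is one plausible reading of the idea, using a pooled rather than Gaussian-windowed SSIM, and is not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def local_ssim(x, y, window=8, c1=0.01**2, c2=0.03**2):
    """Coarse per-patch SSIM map via average pooling (a simplification
    of the usual Gaussian-window SSIM)."""
    mu_x, mu_y = F.avg_pool2d(x, window), F.avg_pool2d(y, window)
    var_x = F.avg_pool2d(x * x, window) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_weighted_l1(recon, target, window=8):
    # Up-weight patches the model currently reconstructs poorly
    # (low SSIM); weights are detached so only the L1 term trains.
    w = (1 - local_ssim(recon, target, window)).clamp(min=0).detach()
    w = F.interpolate(w, size=recon.shape[-2:], mode="nearest")
    return (w * (recon - target).abs()).mean()

recon, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(ssim_weighted_l1(recon, target).item())
```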
Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression
Title | Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression |
Authors | Shiva Prasad Kasiviswanathan, Mark Rudelson |
Abstract | High-dimensional settings, where the data dimension ($d$) far exceeds the number of observations ($n$), are common in many statistical and machine learning applications. Methods based on $\ell_1$-relaxation, such as Lasso, are very popular for sparse recovery in these settings. The Restricted Eigenvalue (RE) condition is among the weakest, and hence most general, conditions in the literature imposed on the Gram matrix that guarantees nice statistical properties for the Lasso estimator. It is natural to ask: what families of matrices satisfy the RE condition? Following a line of work in this area, we construct a new broad ensemble of dependent random design matrices that have an explicit RE bound. Our construction starts with a fixed (deterministic) matrix $X \in \mathbb{R}^{n \times d}$ satisfying a simple stable rank condition, and we show that a matrix drawn from the distribution $X \Phi^\top \Phi$, where $\Phi \in \mathbb{R}^{m \times d}$ is a subgaussian random matrix, with high probability, satisfies the RE condition. This construction allows incorporating a fixed matrix that has an easily {\em verifiable} condition into the design process, and allows for generation of {\em compressed} design matrices that have a lower storage requirement than a standard design matrix. We give two applications of this construction to sparse linear regression problems, including one to a compressed sparse regression setting where the regression algorithm only has access to a compressed representation of a fixed design matrix $X$. |
Tasks | |
Published | 2017-07-25 |
URL | http://arxiv.org/abs/1707.08092v4 |
PDF | http://arxiv.org/pdf/1707.08092v4.pdf |
PWC | https://paperswithcode.com/paper/restricted-eigenvalue-from-stable-rank-with |
Repo | |
Framework | |
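A few lines of NumPy make the construction concrete: start from a fixed $X$, draw a subgaussian $\Phi$, and form the ensemble $X \Phi^\top \Phi$; the compressed representation $X \Phi^\top$ needs only $n \times m$ storage. The dimensions and the Gaussian choice of $\Phi$ below are illustrative; the paper's contribution is the guarantee that, given the stable rank condition on $X$, the RE condition then holds with high probability.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 200, 40

X = rng.standard_normal((n, d))   # fixed design; in the paper it must
                                  # satisfy a stable-rank condition
Phi = rng.standard_normal((m, d)) / np.sqrt(m)  # subgaussian sketch

design = X @ Phi.T @ Phi          # the random ensemble X Phi^T Phi
compressed = X @ Phi.T            # n x m storage instead of n x d

# Stable rank of X: ||X||_F^2 / ||X||_2^2
sv = np.linalg.svd(X, compute_uv=False)
print("stable rank:", (sv ** 2).sum() / sv[0] ** 2)
print(design.shape, compressed.shape)
```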
Providing Self-Aware Systems with Reflexivity
Title | Providing Self-Aware Systems with Reflexivity |
Authors | Alessandro Valitutti, Giuseppe Trautteur |
Abstract | We propose a new type of self-aware system inspired by ideas from higher-order theories of consciousness. First, we discuss the crucial distinction between introspection and reflexion. Then, we focus on computational reflexion as a mechanism by which a computer program can inspect its own code at every stage of the computation. Finally, we provide a formal definition and a proof-of-concept implementation of computational reflexion, viewed as an enriched form of program interpretation and a way to dynamically “augment” a computational process. |
Tasks | |
Published | 2017-07-27 |
URL | http://arxiv.org/abs/1707.08901v1 |
PDF | http://arxiv.org/pdf/1707.08901v1.pdf |
PWC | https://paperswithcode.com/paper/providing-self-aware-systems-with-reflexivity |
Repo | |
Framework | |
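Python's `sys.settrace` hook gives a minimal concrete analogue of a program inspecting its own execution at every step. This is an illustration of the notion only, not the authors' proof-of-concept implementation.

```python
import sys

def reflexive_trace(frame, event, arg):
    """Called at every interpreter event: the running program observes
    its own execution state -- a crude analogue of reflexion."""
    if event == "line":
        print(f"executing {frame.f_code.co_name}:{frame.f_lineno}, "
              f"locals={frame.f_locals}")
    return reflexive_trace  # keep tracing within this frame

def compute(n):
    total = 0
    for i in range(n):
        total += i
    return total

sys.settrace(reflexive_trace)
compute(3)
sys.settrace(None)  # switch reflexion off again
```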
Mining Smart Card Data for Travelers’ Mini Activities
Title | Mining Smart Card Data for Travelers’ Mini Activities |
Authors | Boris Chidlovskii |
Abstract | In the context of public transport modeling and simulation, we address the problem of mismatch between simulated transit trips and observed ones. We point to the weakness of the current travel demand modeling process; the trips it generates are over-optimistic and do not reflect the real passenger choices. We introduce the notion of mini activities the travelers do during the trips; they can explain the deviation of simulated trips from the observed trips. We propose to mine the smart card data to extract the mini activities. We develop a technique to integrate them in the generated trips and learn such an integration from two available sources, the trip history and trip planner recommendations. For an input travel demand, we build a Markov chain over the trip collection and apply the Monte Carlo Markov Chain algorithm to integrate mini activities in such a way that the selected characteristics converge to the desired distributions. We test our method in different settings on the passenger trip collection of Nancy, France. We report experimental results demonstrating a substantial mismatch reduction. |
Tasks | |
Published | 2017-12-19 |
URL | http://arxiv.org/abs/1712.06935v1 |
PDF | http://arxiv.org/pdf/1712.06935v1.pdf |
PWC | https://paperswithcode.com/paper/mining-smart-card-data-for-travelers-mini |
Repo | |
Framework | |
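A toy Metropolis-style loop shows the mechanics: propose toggling a mini activity on a random trip and accept or reject so that a chosen statistic drifts toward its target. The single matched statistic, the temperature, and the `coffee_stop` activity are all stand-ins; the paper matches several characteristics learned from trip history and planner recommendations.

```python
import math
import random

random.seed(0)
trips = [{"duration": random.uniform(10, 30), "activity": None}
         for _ in range(200)]
TARGET_SHARE = 0.4  # assumed desired share of trips with a mini activity
TEMP = 0.01

def energy(ts):
    # Distance between current and desired statistic; the real model
    # matches several characteristics, this toy matches one.
    share = sum(t["activity"] is not None for t in ts) / len(ts)
    return abs(share - TARGET_SHARE)

e = energy(trips)
for _ in range(5000):
    t = random.choice(trips)                # propose toggling one trip
    old = t["activity"]
    t["activity"] = None if old else "coffee_stop"
    e_new = energy(trips)
    if e_new <= e or random.random() < math.exp((e - e_new) / TEMP):
        e = e_new                           # accept the move
    else:
        t["activity"] = old                 # reject and revert

share = sum(t["activity"] is not None for t in trips) / len(trips)
print("final share of trips with a mini activity:", share)
```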
Robust Optimization of Unconstrained Binary Quadratic Problems
Title | Robust Optimization of Unconstrained Binary Quadratic Problems |
Authors | Mark Lewis, Gary Kochenberger, John Metcalfe |
Abstract | In this paper we focus on the unconstrained binary quadratic optimization model, maximize $x^\top Q x$ with $x$ binary, and consider the problem of identifying optimal solutions that are robust with respect to perturbations in the $Q$ matrix. We are motivated to find robust, or stable, solutions because of the uncertainty inherent in the big data origins of $Q$ and limitations in computer numerical precision, particularly in a new class of quantum annealing computers. Experimental design techniques are used to generate a diverse subset of possible scenarios, from which robust solutions are identified. An illustrative example with practical application to business decision making is examined. The approach presented also generates a surface response equation which is used to estimate upper bounds in constant time for $Q$ instantiations within the scenario extremes. In addition, a theoretical framework for the robustness of individual $x_i$ variables is considered by examining the range of $Q$ values over which the $x_i$ are predetermined. |
Tasks | Decision Making |
Published | 2017-09-21 |
URL | http://arxiv.org/abs/1709.07511v1 |
PDF | http://arxiv.org/pdf/1709.07511v1.pdf |
PWC | https://paperswithcode.com/paper/robust-optimization-of-unconstrained-binary |
Repo | |
Framework | |
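A brute-force sketch makes the robustness question concrete for a tiny instance: solve the nominal problem, then re-solve under perturbed $Q$ scenarios and count how often the nominal optimum survives. Random perturbations stand in here for the paper's experimental-design scenarios.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 8
Q = rng.integers(-5, 6, size=(n, n)).astype(float)
Q = (Q + Q.T) / 2                     # symmetrize the nominal matrix

def best_x(Qm):
    """Exhaustive UBQP solve (fine for tiny n): maximize x^T Q x."""
    return max((np.array(x) for x in itertools.product([0, 1], repeat=n)),
               key=lambda x: x @ Qm @ x)

x_star = best_x(Q)
# Count how often x_star stays optimal under perturbed scenarios.
stable = sum(
    np.array_equal(x_star, best_x(Q + rng.normal(0, 0.5, (n, n))))
    for _ in range(20))
print(f"x* = {x_star}, optimal in {stable}/20 perturbed scenarios")
```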
Relevance-based Word Embedding
Title | Relevance-based Word Embedding |
Authors | Hamed Zamani, W. Bruce Croft |
Abstract | Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in natural language processing and information retrieval tasks. The embedding vectors are typically learned based on term proximity in a large corpus. This means that the objective in well-known word embedding algorithms, e.g., word2vec, is to accurately predict adjacent word(s) for a given word or context. However, this objective is not necessarily equivalent to the goal of many information retrieval (IR) tasks. The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query. To train our models, we used over six million unique queries and the top ranked documents retrieved in response to each query, which are assumed to be relevant to the query. We extrinsically evaluate our learned word representation models using two IR tasks: query expansion and query classification. Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding models, such as word2vec and GloVe. |
Tasks | Information Retrieval, Semantic Similarity, Semantic Textual Similarity |
Published | 2017-05-09 |
URL | http://arxiv.org/abs/1705.03556v2 |
PDF | http://arxiv.org/pdf/1705.03556v2.pdf |
PWC | https://paperswithcode.com/paper/relevance-based-word-embedding |
Repo | |
Framework | |
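A sketch of the first variant, learning a relevance distribution over the vocabulary per query, might look as follows: a query encoder scores every vocabulary term and is trained with a KL objective against the term distribution of (pseudo-)relevant documents. Dimensions, data, and the mean-pooled query encoder are assumptions.

```python
import torch
import torch.nn as nn

V, D = 1000, 64
query_emb = nn.Embedding(V, D)  # input space for query terms
word_emb = nn.Embedding(V, D)   # output ("relevance-based") embeddings

opt = torch.optim.Adam(list(query_emb.parameters()) +
                       list(word_emb.parameters()), lr=1e-2)

q_terms = torch.randint(0, V, (32, 3))          # batch of 3-term queries
target = torch.softmax(torch.randn(32, V), -1)  # toy relevance dists
                                                # (from relevant docs)
for _ in range(5):
    opt.zero_grad()
    q = query_emb(q_terms).mean(1)              # mean of query term vecs
    logits = q @ word_emb.weight.T              # score every vocab term
    loss = nn.functional.kl_div(
        torch.log_softmax(logits, -1), target, reduction="batchmean")
    loss.backward()
    opt.step()
```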
Deep Speaker Verification: Do We Need End to End?
Title | Deep Speaker Verification: Do We Need End to End? |
Authors | Dong Wang, Lantian Li, Zhiyuan Tang, Thomas Fang Zheng |
Abstract | End-to-end learning treats the entire system as a whole adaptable black box, which, if sufficient data are available, may learn a system that works very well for the target task. This principle has recently been applied in several prototype studies on speaker verification (SV), where the feature learning and classifier are learned together with an objective function that is consistent with the evaluation metric. An opposite approach to end-to-end is feature learning, which first trains a feature learning model, and then constructs a back-end classifier separately to perform SV. Recently, both approaches achieved significant performance gains on SV, mainly attributed to the smart utilization of deep neural networks. However, the two approaches have not been carefully compared, and their respective advantages have not been well discussed. In this paper, we compare the end-to-end and feature learning approaches on a text-independent SV task. Our experiments on a dataset sampled from the Fisher database and involving 5,000 speakers demonstrated that the feature learning approach outperformed the end-to-end approach. This is strong support for the feature learning approach, at least with data and computation resources similar to ours. |
Tasks | Speaker Verification |
Published | 2017-06-22 |
URL | http://arxiv.org/abs/1706.07859v1 |
PDF | http://arxiv.org/pdf/1706.07859v1.pdf |
PWC | https://paperswithcode.com/paper/deep-speaker-verification-do-we-need-end-to |
Repo | |
Framework | |
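The feature-learning route the paper favors separates the two stages: a frame-level network trained on its own, then a simple back-end for the verification trial. The sketch below uses an untrained stand-in network, plain frame averaging, and cosine scoring with an arbitrary threshold.

```python
import torch
import torch.nn as nn

# Stand-in for a frame-level feature network that would be trained
# separately (e.g., to discriminate speakers); sizes are assumptions.
feature_net = nn.Sequential(nn.Linear(40, 256), nn.ReLU(),
                            nn.Linear(256, 128))

def utterance_vector(frames):            # frames: (T, 40) fbank features
    return feature_net(frames).mean(0)   # average frame-level features

enroll = utterance_vector(torch.randn(300, 40))
test = utterance_vector(torch.randn(250, 40))
score = nn.functional.cosine_similarity(enroll, test, dim=0)
print("accept" if score > 0.5 else "reject", float(score))  # toy threshold
```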
Perceiving and Reasoning About Liquids Using Fully Convolutional Networks
Title | Perceiving and Reasoning About Liquids Using Fully Convolutional Networks |
Authors | Conor Schenck, Dieter Fox |
Abstract | Liquids are an important part of many common manipulation tasks in human environments. If we wish to have robots that can accomplish these types of tasks, they must be able to interact with liquids in an intelligent manner. In this paper, we investigate ways for robots to perceive and reason about liquids. That is, a robot asks the questions “What in the visual data stream is liquid?” and “How can I use that to infer all the potential places where liquid might be?” We collected two datasets to evaluate these questions, one using a realistic liquid simulator and another on our robot. We used fully convolutional neural networks to learn to detect and track liquids across pouring sequences. Our results show that these networks are able to perceive and reason about liquids, and that integrating temporal information is important to performing such tasks well. |
Tasks | |
Published | 2017-03-05 |
URL | http://arxiv.org/abs/1703.01564v2 |
PDF | http://arxiv.org/pdf/1703.01564v2.pdf |
PWC | https://paperswithcode.com/paper/perceiving-and-reasoning-about-liquids-using |
Repo | |
Framework | |
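A minimal fully convolutional sketch for per-pixel liquid detection follows: because every layer is convolutional, the output is a liquid-probability map at input resolution. Depth and widths are assumptions, and the paper additionally studies recurrent variants for the temporal integration it finds important.

```python
import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),   # 1x1 conv -> one liquid logit per pixel
)
frame = torch.randn(1, 3, 120, 160)      # one RGB frame
liquid_prob = torch.sigmoid(fcn(frame))  # per-pixel liquid probability
print(liquid_prob.shape)                 # torch.Size([1, 1, 120, 160])
```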
An Investigation of Newton-Sketch and Subsampled Newton Methods
Title | An Investigation of Newton-Sketch and Subsampled Newton Methods |
Authors | Albert S. Berahas, Raghu Bollapragada, Jorge Nocedal |
Abstract | Sketching, a dimensionality reduction technique, has received much attention in the statistics community. In this paper, we study sketching in the context of Newton’s method for solving finite-sum optimization problems in which the number of variables and data points are both large. We study two forms of sketching that perform dimensionality reduction in data space: Hessian subsampling and randomized Hadamard transformations. Each has its own advantages, and their relative tradeoffs have not been investigated in the optimization literature. Our study focuses on practical versions of the two methods in which the resulting linear systems of equations are solved approximately, at every iteration, using an iterative solver. The advantages of using the conjugate gradient method vs. a stochastic gradient iteration are revealed through a set of numerical experiments, and a complexity analysis of the Hessian subsampling method is presented. |
Tasks | Dimensionality Reduction |
Published | 2017-05-17 |
URL | https://arxiv.org/abs/1705.06211v4 |
PDF | https://arxiv.org/pdf/1705.06211v4.pdf |
PWC | https://paperswithcode.com/paper/an-investigation-of-newton-sketch-and |
Repo | |
Framework | |
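The Hessian-subsampling variant with an inexact CG solve is short to sketch for logistic regression: each iteration forms the Hessian on a small sample but uses the full gradient, then takes an approximate Newton step via a few CG iterations. Sample sizes, damping, and the absent step-size control are simplifications of what the paper studies.

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n, d = 5000, 20
A = rng.standard_normal((n, d))
y = np.sign(A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n))

def grad(w):
    # Full gradient of (1/n) sum_i log(1 + exp(-y_i a_i.w))
    s = 1 / (1 + np.exp(y * (A @ w)))
    return -(A.T @ (y * s)) / n

w = np.zeros(d)
for it in range(10):
    S = rng.choice(n, size=200, replace=False)   # Hessian subsample
    As = A[S]
    p = 1 / (1 + np.exp(-As @ w))
    H = (As.T * (p * (1 - p))) @ As / len(S) + 1e-8 * np.eye(d)
    step, _ = cg(H, grad(w), maxiter=10)         # inexact solve via CG
    w -= step                                    # unit step, no control

print("train accuracy:", np.mean(np.sign(A @ w) == y))
```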
Discovering objects and their relations from entangled scene representations
Title | Discovering objects and their relations from entangled scene representations |
Authors | David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, Peter Battaglia |
Abstract | Our world can be succinctly and compactly described as structured scenes of objects and relations. A typical room, for example, contains salient objects such as tables, chairs and books, and these objects typically relate to each other by their underlying causes and semantics. This gives rise to correlated features, such as position, function and shape. Humans exploit knowledge of objects and their relations for learning a wide spectrum of tasks, and more generally when learning the structure underlying observed data. In this work, we introduce relation networks (RNs) - a general purpose neural network architecture for object-relation reasoning. We show that RNs are capable of learning object relations from scene description data. Furthermore, we show that RNs can act as a bottleneck that induces the factorization of objects from entangled scene description inputs, and from distributed deep representations of scene images provided by a variational autoencoder. The model can also be used in conjunction with differentiable memory mechanisms for implicit relation discovery in one-shot learning tasks. Our results suggest that relation networks are a potentially powerful architecture for solving a variety of problems that require object relation reasoning. |
Tasks | One-Shot Learning |
Published | 2017-02-16 |
URL | http://arxiv.org/abs/1702.05068v1 |
PDF | http://arxiv.org/pdf/1702.05068v1.pdf |
PWC | https://paperswithcode.com/paper/discovering-objects-and-their-relations-from |
Repo | |
Framework | |
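The RN's general form, $\mathrm{RN}(O) = f_\phi\big(\sum_{i,j} g_\theta(o_i, o_j)\big)$, translates directly into code: a shared MLP $g_\theta$ scores every ordered object pair, the results are summed, and $f_\phi$ maps the aggregate to the output. Layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j) ): a shared pairwise
    MLP aggregated by summation, then a readout MLP."""
    def __init__(self, obj_dim=8, hidden=64, out_dim=10):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim))

    def forward(self, objects):                 # (batch, n_obj, obj_dim)
        b, n, d = objects.shape
        oi = objects.unsqueeze(2).expand(b, n, n, d)
        oj = objects.unsqueeze(1).expand(b, n, n, d)
        pairs = torch.cat([oi, oj], dim=-1)     # all ordered pairs
        rel = self.g(pairs).sum(dim=(1, 2))     # aggregate relations
        return self.f(rel)

rn = RelationNetwork()
scene = torch.randn(4, 6, 8)  # 4 scenes, 6 objects of 8 features each
print(rn(scene).shape)        # torch.Size([4, 10])
```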
Attentive Memory Networks: Efficient Machine Reading for Conversational Search
Title | Attentive Memory Networks: Efficient Machine Reading for Conversational Search |
Authors | Tom Kenter, Maarten de Rijke |
Abstract | Recent advances in conversational systems have changed the search paradigm. Traditionally, a user poses a query to a search engine that returns an answer based on its index, possibly leveraging external knowledge bases and conditioning the response on earlier interactions in the search session. In a natural conversation, there is an additional source of information to take into account: utterances produced earlier in a conversation can also be referred to, and a conversational IR system has to keep track of information conveyed by the user during the conversation, even if it is implicit. We argue that the process of building a representation of the conversation can be framed as a machine reading task, where an automated system is presented with a number of statements about which it should answer questions. The questions should be answered solely by referring to the statements provided, without consulting external knowledge. The time is right for the information retrieval community to embrace this task, both as a stand-alone task and integrated in a broader conversational search setting. In this paper, we focus on machine reading as a stand-alone task and present the Attentive Memory Network (AMN), an end-to-end trainable machine reading algorithm. Its key contribution is in efficiency, achieved by a hierarchical input encoder that iterates over the input only once. Speed is an important requirement in the setting of conversational search, as gaps between conversational turns have a detrimental effect on naturalness. On 20 datasets commonly used for evaluating machine reading algorithms we show that the AMN achieves performance comparable to the state-of-the-art models, while using considerably fewer computations. |
Tasks | Information Retrieval, Reading Comprehension |
Published | 2017-12-19 |
URL | http://arxiv.org/abs/1712.07229v1 |
PDF | http://arxiv.org/pdf/1712.07229v1.pdf |
PWC | https://paperswithcode.com/paper/attentive-memory-networks-efficient-machine |
Repo | |
Framework | |
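The efficiency claim rests on a hierarchical encoder that passes over the input once. A compact sketch follows, with assumed dimensions and a simple dot-product attention that the paper may well implement differently.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Single-pass hierarchical input encoder in the spirit of the AMN:
    a word-level GRU produces one vector per statement, a sentence-level
    GRU turns those into memories, and a question vector attends over
    them. Dimensions and attention form are assumptions."""
    def __init__(self, dim=32):
        super().__init__()
        self.word_rnn = nn.GRU(dim, dim, batch_first=True)
        self.sent_rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, statements, question):
        # statements: (n_sents, n_words, dim), question: (dim,)
        _, h = self.word_rnn(statements)  # h: (1, n_sents, dim)
        memories, _ = self.sent_rnn(h)    # one pass over sentence vectors
        mem = memories[0]                 # (n_sents, dim)
        attn = torch.softmax(mem @ question, dim=0)
        return attn @ mem                 # attended answer state

enc = HierarchicalEncoder()
out = enc(torch.randn(5, 7, 32), torch.randn(32))
print(out.shape)  # torch.Size([32])
```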