Paper Group ANR 565
Feasibility of Post-Editing Speech Transcriptions with a Mismatched Crowd. Dynamic Pricing with Demand Covariates. First Person Action-Object Detection with EgoNet. A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection. Learning camera viewpoint using CNN to improve 3D body pose estimation. Asymptotic properties of Principal Component Analysis and shrinkage-bias adjustment under the Generalized Spiked Population model. HeMIS: Hetero-Modal Image Segmentation. A Language-independent and Compositional Model for Personality Trait Recognition from Short Texts. Space-Time Representation of People Based on 3D Skeletal Data: A Review. Object Detection via Aspect Ratio and Context Aware Region-based Convolutional Networks. Optimal Quantum Sample Complexity of Learning Algorithms. Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition. Graphons, mergeons, and so on! Information Recovery in Shuffled Graphs via Graph Matching. Do logarithmic proximity measures outperform plain ones in graph clustering?
Feasibility of Post-Editing Speech Transcriptions with a Mismatched Crowd
Title | Feasibility of Post-Editing Speech Transcriptions with a Mismatched Crowd |
Authors | Purushotam Radadia, Shirish Karande |
Abstract | Manual correction of speech transcription can involve a selection from plausible transcriptions. Recent work has shown the feasibility of employing a mismatched crowd for speech transcription. However, it is yet to be established whether a mismatched worker has sufficiently fine-granular speech perception to choose among the phonetically proximate options that are likely to be generated from the trellis of an ASRU. Hence, we consider five languages, Arabic, German, Hindi, Russian and Spanish. For each we generate synthetic, phonetically proximate, options which emulate post-editing scenarios of varying difficulty. We consistently observe non-trivial crowd ability to choose among fine-granular options. |
Tasks | |
Published | 2016-09-07 |
URL | http://arxiv.org/abs/1609.02043v1 |
http://arxiv.org/pdf/1609.02043v1.pdf | |
PWC | https://paperswithcode.com/paper/feasibility-of-post-editing-speech |
Repo | |
Framework | |
Dynamic Pricing with Demand Covariates
Title | Dynamic Pricing with Demand Covariates |
Authors | Sheng Qiang, Mohsen Bayati |
Abstract | We consider a firm that sells products over $T$ periods without knowing the demand function. The firm sequentially sets prices to earn revenue and to learn the underlying demand function simultaneously. A natural heuristic for this problem, commonly used in practice, is greedy iterative least squares (GILS). At each time period, GILS estimates the demand as a linear function of the price by applying least squares to the set of prior prices and realized demands. Then a price that maximizes the revenue, given the estimated demand function, is used for the next time period. The performance is measured by the regret, which is the expected revenue loss from the optimal (oracle) pricing policy when the demand function is known. Recently, den Boer and Zwart (2014) and Keskin and Zeevi (2014) demonstrated that GILS is sub-optimal. They introduced algorithms which integrate forced price dispersion with GILS and achieve asymptotically optimal performance. In this paper, we consider this dynamic pricing problem in a data-rich environment. In particular, we assume that the firm knows the expected demand under a particular price from historical data, and in each period, before setting the price, the firm has access to extra information (demand covariates) which may be predictive of the demand. We prove that in this setting GILS achieves asymptotically optimal regret of order $\log(T)$. We also show the following surprising result: in the original dynamic pricing problem of den Boer and Zwart (2014) and Keskin and Zeevi (2014), inclusion of any set of covariates in GILS as potential demand covariates (even though they could carry no information) would make GILS asymptotically optimal. We validate our results via extensive numerical simulations on synthetic and real data sets. |
Tasks | |
Published | 2016-04-25 |
URL | http://arxiv.org/abs/1604.07463v1 |
http://arxiv.org/pdf/1604.07463v1.pdf | |
PWC | https://paperswithcode.com/paper/dynamic-pricing-with-demand-covariates |
Repo | |
Framework | |
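The greedy iterative least squares (GILS) heuristic described in the abstract above is simple to sketch. Below is a minimal, self-contained simulation under an assumed linear demand model with Gaussian noise; the demand parameters, horizon, initial prices, and noise level are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed ground-truth linear demand (unknown to the seller): d_t = alpha + beta * p_t + noise.
alpha_true, beta_true, noise_sd = 10.0, -1.5, 0.5
T = 200

prices, demands = [], []
for t in range(T):
    if t < 2:
        p = [1.0, 3.0][t]  # two initial exploratory prices to make the first fit well-posed
    else:
        # GILS step: least-squares fit of demand as a linear function of price
        # over all prior (price, demand) pairs, then the greedy revenue-maximizing price.
        X = np.column_stack([np.ones(len(prices)), prices])
        (a_hat, b_hat), *_ = np.linalg.lstsq(X, np.array(demands), rcond=None)
        # argmax_p p * (a_hat + b_hat * p)  =>  p = -a_hat / (2 * b_hat), if demand slopes down
        p = -a_hat / (2.0 * b_hat) if b_hat < 0 else prices[-1]
    d = alpha_true + beta_true * p + rng.normal(scale=noise_sd)
    prices.append(p)
    demands.append(d)

oracle_price = -alpha_true / (2.0 * beta_true)
print(f"final GILS price {prices[-1]:.3f} vs oracle price {oracle_price:.3f}")
```

Incorporating demand covariates, as the paper studies, amounts to appending the covariate columns to `X` before the least-squares fit.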
First Person Action-Object Detection with EgoNet
Title | First Person Action-Object Detection with EgoNet |
Authors | Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi |
Abstract | Unlike traditional third-person cameras mounted on robots, a first-person camera captures a person’s visual sensorimotor object interactions from up close. In this paper, we study the tight interplay between our momentary visual attention and motor action with objects from a first-person camera. We propose a concept of action-objects: the objects that capture a person’s conscious visual (watching a TV) or tactile (taking a cup) interactions. Action-objects may be task-dependent, but since many tasks share common person-object spatial configurations, action-objects exhibit a characteristic 3D spatial distance and orientation with respect to the person. We design a predictive model that detects action-objects using EgoNet, a joint two-stream network that holistically integrates visual appearance (RGB) and 3D spatial layout (depth and height) cues to predict the per-pixel likelihood of action-objects. Our network also incorporates a first-person coordinate embedding, which is designed to learn a spatial distribution of the action-objects in the first-person data. We demonstrate EgoNet’s predictive power by showing that it consistently outperforms previous baseline approaches. Furthermore, EgoNet also exhibits a strong generalization ability, i.e., it predicts semantically meaningful objects in novel first-person datasets. Our method’s ability to effectively detect action-objects could be used to improve robots’ understanding of human-object interactions. |
Tasks | Human-Object Interaction Detection, Object Detection |
Published | 2016-03-15 |
URL | http://arxiv.org/abs/1603.04908v3 |
http://arxiv.org/pdf/1603.04908v3.pdf | |
PWC | https://paperswithcode.com/paper/first-person-action-object-detection-with |
Repo | |
Framework | |
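As a rough illustration of the two-stream idea in the abstract above, the sketch below fuses an RGB stream with a spatial-layout (depth and height) stream and appends normalized image coordinates as a crude stand-in for the first-person coordinate embedding. The layer sizes, fusion scheme, and coordinate channels are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoStreamActionObjectNet(nn.Module):
    """Toy two-stream per-pixel predictor: one stream for RGB appearance,
    one for 3D spatial layout (depth + height), fused together with
    normalized x/y coordinate channels before a 1x1 prediction head."""

    def __init__(self, width=16):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            )
        self.rgb_stream = stream(3)      # appearance cues
        self.layout_stream = stream(2)   # depth and height-above-ground cues
        self.head = nn.Conv2d(2 * width + 2, 1, 1)  # +2 for the coordinate channels

    def forward(self, rgb, layout):
        b, _, h, w = rgb.shape
        ys = torch.linspace(-1, 1, h, device=rgb.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=rgb.device).view(1, 1, 1, w).expand(b, 1, h, w)
        feats = torch.cat([self.rgb_stream(rgb), self.layout_stream(layout), xs, ys], dim=1)
        return torch.sigmoid(self.head(feats))  # per-pixel action-object likelihood

net = TwoStreamActionObjectNet()
scores = net(torch.randn(2, 3, 64, 64), torch.randn(2, 2, 64, 64))  # (2, 1, 64, 64)
```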
A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection
Title | A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection |
Authors | Carlos Caetano, Sandra Avila, William Robson Schwartz, Silvio Jamil F. Guimarães, Arnaldo de A. Araújo |
Abstract | With the growing amount of inappropriate content on the Internet, such as pornography, there is a need to detect and filter such material, since it is often prohibited in certain environments (e.g., schools and workplaces) or for certain audiences (e.g., children). In recent years, many works have focused on detecting pornographic images and videos based on visual content, particularly on the detection of skin color. Although these approaches provide good results, they generally suffer from a high false positive rate, since not all images with large areas of exposed skin are pornographic (e.g., people wearing swimsuits or sports images). Local-feature-based approaches with Bag-of-Words (BoW) models have been successfully applied to visual recognition tasks in the context of pornography detection. Even though existing methods provide promising results, they rely on local feature descriptors that require high computational processing time and yield high-dimensional vectors. In this work, we propose an approach for pornography detection based on local binary feature extraction and the BossaNova image representation, a BoW model extension that preserves the visual information more richly. Moreover, we propose two approaches for video description based on the combination of mid-level representations, namely the BossaNova Video Descriptor (BNVD) and the BoW Video Descriptor (BoW-VD). The proposed techniques are promising, achieving an accuracy of 92.40% and thus reducing the classification error by 16% relative to the current state-of-the-art local-feature approach on the Pornography dataset. |
Tasks | Pornography Detection, Video Description |
Published | 2016-05-12 |
URL | http://arxiv.org/abs/1605.03804v1 |
http://arxiv.org/pdf/1605.03804v1.pdf | |
PWC | https://paperswithcode.com/paper/a-mid-level-video-representation-based-on |
Repo | |
Framework | |
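To make the mid-level pipeline concrete, here is a minimal Bag-of-Words video descriptor over binary local features: per-frame descriptors are hard-assigned to the nearest codeword under Hamming distance, and the per-frame histograms are average-pooled over the video. The codebook, descriptor size, and pooling choices are placeholders, and BossaNova additionally keeps a histogram of descriptor-to-codeword distances, which this sketch omits.

```python
import numpy as np

def hamming(a, b):
    """Pairwise Hamming distances between two sets of binary descriptors
    given as packed uint8 arrays (n x 32 bytes for 256-bit descriptors)."""
    xor = np.bitwise_xor(a[:, None, :], b[None, :, :])
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

def bow_video_descriptor(frames_descriptors, codebook):
    """frames_descriptors: list of (n_i x 32) uint8 arrays, one per frame.
    codebook: (k x 32) uint8 array of binary codewords.
    Returns the video-level BoW vector (mean of L1-normalized frame histograms)."""
    k = codebook.shape[0]
    histograms = []
    for desc in frames_descriptors:
        assign = hamming(desc, codebook).argmin(axis=1)   # hard assignment
        hist = np.bincount(assign, minlength=k).astype(float)
        histograms.append(hist / max(hist.sum(), 1.0))
    return np.mean(histograms, axis=0)

# Toy example with random "binary descriptors" standing in for e.g. ORB/BRIEF features.
rng = np.random.default_rng(0)
codebook = rng.integers(0, 256, size=(64, 32), dtype=np.uint8)
frames = [rng.integers(0, 256, size=(200, 32), dtype=np.uint8) for _ in range(10)]
video_vec = bow_video_descriptor(frames, codebook)  # shape (64,)
```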
Learning camera viewpoint using CNN to improve 3D body pose estimation
Title | Learning camera viewpoint using CNN to improve 3D body pose estimation |
Authors | Mona Fathollahi Ghezelghieh, Rangachar Kasturi, Sudeep Sarkar |
Abstract | The objective of this work is to estimate 3D human pose from a single RGB image. Extracting image representations which incorporate both the spatial relation of body parts and their relative depth plays an essential role in accurate 3D pose reconstruction. In this paper, for the first time, we show that camera viewpoint in combination with 2D joint locations significantly improves 3D pose accuracy without the explicit use of perspective geometry mathematical models. To this end, we train a deep Convolutional Neural Network (CNN) to learn categorical camera viewpoint. To make the network robust against clothing and body shape of the subject in the image, we utilized 3D computer rendering to synthesize additional training images. We test our framework on the largest 3D pose estimation benchmark, Human3.6m, and achieve up to 20% error reduction compared to the state-of-the-art approaches that do not use body part segmentation. |
Tasks | 3D Pose Estimation, Pose Estimation |
Published | 2016-09-18 |
URL | http://arxiv.org/abs/1609.05522v1 |
http://arxiv.org/pdf/1609.05522v1.pdf | |
PWC | https://paperswithcode.com/paper/learning-camera-viewpoint-using-cnn-to |
Repo | |
Framework | |
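The categorical camera viewpoint the network is trained on can be thought of as a discretization of the camera's azimuth relative to the subject. The snippet below shows one plausible binning scheme for producing such labels; the number of bins and the angle convention are assumptions, not the paper's exact setup.

```python
import numpy as np

def viewpoint_class(camera_azimuth_deg, n_bins=8):
    """Map a camera azimuth angle (degrees, relative to the subject's facing
    direction) to one of n_bins categorical viewpoint labels."""
    bin_width = 360.0 / n_bins
    # Shift by half a bin so that bin 0 is centered on 0 degrees (frontal view).
    return int(((camera_azimuth_deg + bin_width / 2.0) % 360.0) // bin_width)

assert viewpoint_class(0.0) == 0       # frontal view
assert viewpoint_class(180.0) == 4     # back view with 8 bins
assert viewpoint_class(359.0) == 0     # wraps around to frontal
```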
Asymptotic properties of Principal Component Analysis and shrinkage-bias adjustment under the Generalized Spiked Population model
Title | Asymptotic properties of Principal Component Analysis and shrinkage-bias adjustment under the Generalized Spiked Population model |
Authors | Rounak Dey, Seunggeun Lee |
Abstract | With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigated the asymptotic behaviors of PCA under the generalized spiked population model. Based on the theoretical results, we proposed a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we showed that our methods can greatly reduce bias and improve prediction accuracy. |
Tasks | |
Published | 2016-07-28 |
URL | http://arxiv.org/abs/1607.08647v1 |
http://arxiv.org/pdf/1607.08647v1.pdf | |
PWC | https://paperswithcode.com/paper/asymptotic-properties-of-principal-component |
Repo | |
Framework | |
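For orientation, the sketch below implements the classical spiked-population relationship between a large sample eigenvalue and its population counterpart (with aspect ratio $\gamma = p/n$), which is the kind of shrinkage-bias correction the paper generalizes beyond the equal-bulk-eigenvalue assumption. The inversion shown is the standard spiked-model formula, not the paper's generalized estimator.

```python
import numpy as np

def debias_spiked_eigenvalue(sample_eig, gamma):
    """Invert the classical spiked-model relation  lambda = ell + gamma*ell/(ell - 1)
    to recover the population spike ell from a large sample eigenvalue, assuming a
    unit bulk and lambda above the detection threshold (1 + sqrt(gamma))^2."""
    if sample_eig <= (1.0 + np.sqrt(gamma)) ** 2:
        return None  # spike not distinguishable from the bulk
    b = sample_eig + 1.0 - gamma
    return (b + np.sqrt(b * b - 4.0 * sample_eig)) / 2.0

# Example: ell = 5, gamma = 0.5  =>  lambda ~ 5 + 0.5 * 5 / 4 = 5.625; invert it back.
print(debias_spiked_eigenvalue(5.625, 0.5))  # ~5.0
```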
HeMIS: Hetero-Modal Image Segmentation
Title | HeMIS: Hetero-Modal Image Segmentation |
Authors | Mohammad Havaei, Nicolas Guizard, Nicolas Chapados, Yoshua Bengio |
Abstract | We introduce a deep learning image segmentation framework that is extremely robust to missing imaging modalities. Instead of attempting to impute or synthesize missing data, the proposed approach learns, for each modality, an embedding of the input image into a single latent vector space for which arithmetic operations (such as taking the mean) are well defined. Points in that space, which are averaged over modalities available at inference time, can then be further processed to yield the desired segmentation. As such, any combinatorial subset of available modalities can be provided as input, without having to learn a combinatorial number of imputation models. Evaluated on two neurological MRI datasets (brain tumors and MS lesions), the approach yields state-of-the-art segmentation results when provided with all modalities; moreover, its performance degrades remarkably gracefully when modalities are removed, significantly more so than alternative mean-filling or other synthesis approaches. |
Tasks | Imputation, Semantic Segmentation |
Published | 2016-07-18 |
URL | http://arxiv.org/abs/1607.05194v1 |
http://arxiv.org/pdf/1607.05194v1.pdf | |
PWC | https://paperswithcode.com/paper/hemis-hetero-modal-image-segmentation |
Repo | |
Framework | |
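A minimal sketch of the hetero-modal idea: one small encoder per modality, the mean taken over whichever modality embeddings are present, and a shared per-pixel decoder. The layer sizes and depths are illustrative only, and the full model also feeds higher moments (e.g., the across-modality variance) to the decoder, which is omitted here.

```python
import torch
import torch.nn as nn

class HeteroModalSeg(nn.Module):
    """Per-modality encoders, averaged in a common latent space, shared decoder."""

    def __init__(self, n_modalities=4, n_classes=2, width=16):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(1, width, 3, padding=1), nn.ReLU(),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            )
            for _ in range(n_modalities)
        )
        self.decoder = nn.Conv2d(width, n_classes, 1)

    def forward(self, images):
        # images: list with one (B, 1, H, W) tensor per modality, or None if that
        # modality is missing at inference time.
        feats = [enc(x) for enc, x in zip(self.encoders, images) if x is not None]
        fused = torch.stack(feats, dim=0).mean(dim=0)   # average over available modalities
        return self.decoder(fused)                      # per-pixel class scores

model = HeteroModalSeg()
t1, flair = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
logits = model([t1, None, flair, None])   # only 2 of 4 modalities provided
```

Because the fusion is a simple mean, any subset of modalities can be dropped at inference time without retraining or imputation, which is the property the abstract emphasizes.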
A Language-independent and Compositional Model for Personality Trait Recognition from Short Texts
Title | A Language-independent and Compositional Model for Personality Trait Recognition from Short Texts |
Authors | Fei Liu, Julien Perez, Scott Nowson |
Abstract | Many methods have been used to recognize author personality traits from text, typically combining linguistic feature engineering with shallow learning models, e.g. linear regression or Support Vector Machines. This work uses deep-learning-based models and atomic features of text, the characters, to build hierarchical, vectorial word and sentence representations for trait inference. This method, applied to a corpus of tweets, shows state-of-the-art performance across five traits and three languages (English, Spanish and Italian) compared with prior work in author profiling. The results, supported by preliminary visualisation work, are encouraging for the ability to detect complex human traits. |
Tasks | Feature Engineering, Personality Trait Recognition |
Published | 2016-10-14 |
URL | http://arxiv.org/abs/1610.04345v1 |
http://arxiv.org/pdf/1610.04345v1.pdf | |
PWC | https://paperswithcode.com/paper/a-language-independent-and-compositional |
Repo | |
Framework | |
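A bare-bones version of the character-to-word-to-sentence composition might look as follows: character embeddings are composed into word vectors by one recurrent layer, word vectors into a text vector by another, and a linear head scores the traits. The layer types and sizes are guesses for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

class CharHierarchicalTraits(nn.Module):
    """Characters -> word vectors -> text vector -> per-trait scores."""

    def __init__(self, n_chars=128, char_dim=16, word_dim=32, text_dim=64, n_traits=5):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_rnn = nn.GRU(char_dim, word_dim, batch_first=True)
        self.word_rnn = nn.GRU(word_dim, text_dim, batch_first=True)
        self.out = nn.Linear(text_dim, n_traits)

    def forward(self, words):
        # words: list of 1-D LongTensors of character ids, one tensor per word.
        word_vecs = []
        for w in words:
            _, h = self.char_rnn(self.char_emb(w).unsqueeze(0))  # h: (1, 1, word_dim)
            word_vecs.append(h[-1])
        seq = torch.stack(word_vecs, dim=1)          # (1, n_words, word_dim)
        _, h = self.word_rnn(seq)
        return self.out(h[-1])                       # (1, n_traits)

model = CharHierarchicalTraits()
tweet = [torch.tensor([ord(c) for c in w]) for w in "short text example".split()]
scores = model(tweet)   # one score per personality trait
```

Working from characters rather than word-level features is what makes the approach language-independent, since no language-specific tokenization or lexicon is required.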
Space-Time Representation of People Based on 3D Skeletal Data: A Review
Title | Space-Time Representation of People Based on 3D Skeletal Data: A Review |
Authors | Fei Han, Brian Reily, William Hoff, Hao Zhang |
Abstract | Spatiotemporal human representation based on 3D visual perception data is a rapidly growing research area. These representations can be broadly categorized into two groups according to their information source: RGB-D information or 3D skeleton data. Recently, skeleton-based human representations have been intensively studied and have attracted increasing attention, due to their robustness to variations in viewpoint, human body scale, and motion speed, as well as their real-time, online performance. This paper presents a comprehensive survey of existing space-time representations of people based on 3D skeletal data, and provides an informative categorization and analysis of these methods from several perspectives, including information modality, representation encoding, structure and transition, and feature engineering. We also provide a brief overview of skeleton acquisition devices and construction methods, list a number of public benchmark datasets with skeleton data, and discuss potential future research directions. |
Tasks | Feature Engineering |
Published | 2016-01-05 |
URL | http://arxiv.org/abs/1601.01006v3 |
http://arxiv.org/pdf/1601.01006v3.pdf | |
PWC | https://paperswithcode.com/paper/space-time-representation-of-people-based-on |
Repo | |
Framework | |
Object Detection via Aspect Ratio and Context Aware Region-based Convolutional Networks
Title | Object Detection via Aspect Ratio and Context Aware Region-based Convolutional Networks |
Authors | Bo Li, Tianfu Wu, Shuai Shao, Lun Zhang, Rufeng Chu |
Abstract | Jointly integrating aspect ratio and context has been extensively studied and shown to improve performance in traditional object detection systems such as DPMs. It has, however, been largely ignored in deep-neural-network-based detection systems. This paper presents a method of integrating a mixture of object models and region-based convolutional networks for accurate object detection. Each mixture component accounts for both object aspect ratio and multi-scale contextual information explicitly: (i) it exploits a mixture of tiling configurations in the RoI pooling to remedy the warping artifacts caused by a single type of RoI pooling (e.g., with equally-sized 7 x 7 cells), and to respect the underlying object shapes more; (ii) it “looks from both the inside and the outside of a RoI” by incorporating contextual information at two scales: global context pooled from the whole image and local context pooled from the surrounding of a RoI. To facilitate accurate detection, this paper proposes a multi-stage detection scheme for integrating the mixture of object models, which utilizes the detection results of the model at the previous stage as the proposals for the current stage in both training and testing. The proposed method is called the aspect ratio and context aware region-based convolutional network (ARC-R-CNN). In experiments, ARC-R-CNN shows very competitive results against Faster R-CNN [41] and R-FCN [10] on two datasets: PASCAL VOC and Microsoft COCO. It obtains significantly better mAP performance using high IoU thresholds on both datasets. |
Tasks | Object Detection |
Published | 2016-12-02 |
URL | http://arxiv.org/abs/1612.00534v2 |
http://arxiv.org/pdf/1612.00534v2.pdf | |
PWC | https://paperswithcode.com/paper/object-detection-via-aspect-ratio-and-context |
Repo | |
Framework | |
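The core pooling idea, a mixture of RoI tiling configurations plus global and local context, can be approximated with standard operations as below. The specific tilings (7x7, 7x3, 3x7), the context enlargement factor, and the use of adaptive average pooling are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def crop_roi(feature_map, box):
    """Crop a (C, H, W) feature map to an integer box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return feature_map[:, y1:y2, x1:x2]

def mixture_roi_pool(feature_map, box, tilings=((7, 7), (7, 3), (3, 7)), context=1.5):
    """Pool one RoI under several tiling configurations, plus an enlarged box
    (local context) and the whole image (global context)."""
    _, h, w = feature_map.shape
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    bw, bh = (x2 - x1) * context, (y2 - y1) * context
    ctx_box = (max(cx - bw / 2, 0), max(cy - bh / 2, 0),
               min(cx + bw / 2, w), min(cy + bh / 2, h))
    pooled = []
    for tiling in tilings:
        pooled.append(F.adaptive_avg_pool2d(crop_roi(feature_map, box).unsqueeze(0), tiling))
        pooled.append(F.adaptive_avg_pool2d(crop_roi(feature_map, ctx_box).unsqueeze(0), tiling))
    # Global context: pool the whole image once.
    pooled.append(F.adaptive_avg_pool2d(feature_map.unsqueeze(0), tilings[0]))
    return pooled

feats = torch.randn(256, 50, 80)                    # backbone feature map
parts = mixture_roi_pool(feats, (10, 5, 38, 33))    # 7 pooled tensors (3 tilings x 2 boxes + global)
```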
Optimal Quantum Sample Complexity of Learning Algorithms
Title | Optimal Quantum Sample Complexity of Learning Algorithms |
Authors | Srinivasan Arunachalam, Ronald de Wolf |
Abstract | In learning theory, the VC dimension $d$ of a concept class $C$ is the most common way to measure its “richness.” In the PAC model, $$ \Theta\Big(\frac{d}{\varepsilon} + \frac{\log(1/\delta)}{\varepsilon}\Big) $$ examples are necessary and sufficient for a learner to output, with probability $1-\delta$, a hypothesis $h$ that is $\varepsilon$-close to the target concept $c$. In the related agnostic model, where the samples need not come from a $c\in C$, we know that $$ \Theta\Big(\frac{d}{\varepsilon^2} + \frac{\log(1/\delta)}{\varepsilon^2}\Big) $$ examples are necessary and sufficient to output a hypothesis $h\in C$ whose error is at most $\varepsilon$ worse than the best concept in $C$. Here we analyze quantum sample complexity, where each example is a coherent quantum state. This model was introduced by Bshouty and Jackson, who showed that quantum examples are more powerful than classical examples in some fixed-distribution settings. However, Atici and Servedio, improved by Zhang, showed that in the PAC setting quantum examples cannot be much more powerful: the required number of quantum examples is $$ \Omega\Big(\frac{d^{1-\eta}}{\varepsilon} + d + \frac{\log(1/\delta)}{\varepsilon}\Big)\text{ for all }\eta> 0. $$ Our main result is that quantum and classical sample complexity are in fact equal up to constant factors in both the PAC and agnostic models. We give two approaches. The first is a fairly simple information-theoretic argument that yields the above two classical bounds and yields the same bounds for quantum sample complexity up to a $\log(d/\varepsilon)$ factor. We then give a second approach that avoids the log-factor loss, based on analyzing the behavior of the “Pretty Good Measurement” on the quantum state identification problems that correspond to learning. This shows classical and quantum sample complexity are equal up to constant factors. |
Tasks | |
Published | 2016-07-04 |
URL | http://arxiv.org/abs/1607.00932v3 |
http://arxiv.org/pdf/1607.00932v3.pdf | |
PWC | https://paperswithcode.com/paper/optimal-quantum-sample-complexity-of-learning |
Repo | |
Framework | |
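As a quick back-of-envelope reading of the classical bounds above (ignoring the hidden constants), plugging the illustrative values $d=100$, $\varepsilon=0.05$, $\delta=0.01$ into the PAC and agnostic expressions gives

$$ \frac{d}{\varepsilon} + \frac{\log(1/\delta)}{\varepsilon} = \frac{100}{0.05} + \frac{\ln 100}{0.05} \approx 2000 + 92 \approx 2.1\times 10^{3}, \qquad \frac{d}{\varepsilon^2} + \frac{\log(1/\delta)}{\varepsilon^2} \approx 40000 + 1842 \approx 4.2\times 10^{4}, $$

and the paper's main result is that coherent quantum examples cannot shrink either count by more than a constant factor.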
Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition
Title | Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition |
Authors | Yi Zhu, Shawn Newsam |
Abstract | This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves. We develop a new framework called depth2action and experiment thoroughly into how best to incorporate the depth information. We introduce spatio-temporal depth normalization (STDN) to enforce temporal consistency in our estimated depth sequences. We also propose modified depth motion maps (MDMM) to capture the subtle temporal changes in depth. These two components significantly improve the action recognition performance. We evaluate our depth2action framework on three large-scale action recognition video benchmarks. Our model achieves state-of-the-art performance when combined with appearance and motion information thus demonstrating that depth2action is indeed complementary to existing approaches. |
Tasks | Temporal Action Localization |
Published | 2016-08-15 |
URL | http://arxiv.org/abs/1608.04339v1 |
http://arxiv.org/pdf/1608.04339v1.pdf | |
PWC | https://paperswithcode.com/paper/depth2action-exploring-embedded-depth-for |
Repo | |
Framework | |
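For intuition, a classical depth motion map simply accumulates absolute frame-to-frame differences of the estimated depth sequence after normalizing each frame; the paper's modified DMM and spatio-temporal depth normalization refine this, so the snippet below is only the baseline idea under assumed inputs.

```python
import numpy as np

def normalize_depth(frame, eps=1e-6):
    """Scale one estimated depth frame to [0, 1] (a crude stand-in for the paper's
    spatio-temporal depth normalization, which also enforces consistency across frames)."""
    lo, hi = frame.min(), frame.max()
    return (frame - lo) / max(hi - lo, eps)

def depth_motion_map(depth_frames):
    """Accumulate absolute temporal differences of a (T, H, W) depth sequence."""
    d = np.stack([normalize_depth(f) for f in depth_frames])
    return np.abs(np.diff(d, axis=0)).sum(axis=0)   # (H, W) map of depth change

rng = np.random.default_rng(0)
frames = rng.random((30, 120, 160)).astype(np.float32)  # stand-in for estimated depth maps
dmm = depth_motion_map(frames)
```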
Graphons, mergeons, and so on!
Title | Graphons, mergeons, and so on! |
Authors | Justin Eldridge, Mikhail Belkin, Yusu Wang |
Abstract | In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the “correct” clustering, give sufficient conditions in which a method is statistically consistent, and provide an explicit algorithm satisfying these properties. |
Tasks | Graph Clustering |
Published | 2016-07-06 |
URL | http://arxiv.org/abs/1607.01718v4 |
http://arxiv.org/pdf/1607.01718v4.pdf | |
PWC | https://paperswithcode.com/paper/graphons-mergeons-and-so-on |
Repo | |
Framework | |
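The generative assumption ("graphs are sampled from a graphon") is simple to state in code: draw a uniform latent label per vertex and connect each pair independently with probability $W(u_i, u_j)$. The example graphon below is an arbitrary smooth, assortative choice, not one from the paper.

```python
import numpy as np

def sample_from_graphon(W, n, rng=None):
    """Sample an n-vertex undirected graph from a graphon W: [0,1]^2 -> [0,1]."""
    rng = rng or np.random.default_rng()
    u = rng.random(n)                       # latent vertex labels
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = rng.random() < W(u[i], u[j])
    return A, u

# An illustrative smooth graphon: vertices with similar latent labels connect more often.
W = lambda x, y: 0.1 + 0.8 * np.exp(-4.0 * (x - y) ** 2)
A, u = sample_from_graphon(W, n=200, rng=np.random.default_rng(0))
```

A stochastic blockmodel is the special case where $W$ is piecewise constant on a grid, which is why graphons form the strictly richer model class the abstract refers to.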
Information Recovery in Shuffled Graphs via Graph Matching
Title | Information Recovery in Shuffled Graphs via Graph Matching |
Authors | Vince Lyzinski |
Abstract | While many multiple graph inference methodologies operate under the implicit assumption that an explicit vertex correspondence is known across the vertex sets of the graphs, in practice these correspondences may only be partially or errorfully known. Herein, we provide an information theoretic foundation for understanding the practical impact that errorfully observed vertex correspondences can have on subsequent inference, and the capacity of graph matching methods to recover the lost vertex alignment and inferential performance. Working in the correlated stochastic blockmodel setting, we establish a duality between the loss of mutual information due to an errorfully observed vertex correspondence and the ability of graph matching algorithms to recover the true correspondence across graphs. In the process, we establish a phase transition for graph matchability in terms of the correlation across graphs, and we conjecture the analogous phase transition for the relative information loss due to shuffling vertex labels. We demonstrate the practical effect that graph shuffling—and matching—can have on subsequent inference, with examples from two sample graph hypothesis testing and joint spectral graph clustering. |
Tasks | Graph Clustering, Graph Matching, Spectral Graph Clustering |
Published | 2016-05-08 |
URL | http://arxiv.org/abs/1605.02315v2 |
http://arxiv.org/pdf/1605.02315v2.pdf | |
PWC | https://paperswithcode.com/paper/information-recovery-in-shuffled-graphs-via |
Repo | |
Framework | |
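The correlated stochastic blockmodel setting can be simulated directly: sample one Bernoulli graph, sample the second edgewise so that each pair keeps the stated marginal probability and has Pearson correlation $\rho$ with the first, then shuffle the second graph's vertex labels. The block probabilities and sizes below are arbitrary illustrations; a graph-matching algorithm would try to recover the hidden permutation from the pair.

```python
import numpy as np

def correlated_sbm_pair(block_probs, sizes, rho, rng):
    """Return (A1, A2_shuffled, permutation) for a pair of rho-correlated SBM
    graphs on the same vertex set, with A2's vertex labels randomly shuffled."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    P = block_probs[labels][:, labels]              # edgewise probabilities
    n = len(labels)
    A1 = np.triu((rng.random((n, n)) < P).astype(int), 1)
    # Edgewise coupling giving marginal probability P and correlation rho with A1.
    P2 = np.where(A1 == 1, P + rho * (1 - P), P * (1 - rho))
    A2 = np.triu((rng.random((n, n)) < P2).astype(int), 1)
    A1, A2 = A1 + A1.T, A2 + A2.T
    perm = rng.permutation(n)                       # the unknown vertex shuffle
    return A1, A2[np.ix_(perm, perm)], perm

rng = np.random.default_rng(0)
B = np.array([[0.5, 0.1], [0.1, 0.5]])
A1, A2_shuffled, perm = correlated_sbm_pair(B, sizes=(50, 50), rho=0.7, rng=rng)
```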
Do logarithmic proximity measures outperform plain ones in graph clustering?
Title | Do logarithmic proximity measures outperform plain ones in graph clustering? |
Authors | Vladimir Ivashkin, Pavel Chebotarev |
Abstract | We consider a number of graph kernels and proximity measures including commute time kernel, regularized Laplacian kernel, heat kernel, exponential diffusion kernel (also called “communicability”), etc., and the corresponding distances as applied to clustering nodes in random graphs and several well-known datasets. The model of generating random graphs involves edge probabilities for the pairs of nodes that belong to the same class or different predefined classes of nodes. It turns out that in most cases, logarithmic measures (i.e., measures resulting after taking logarithm of the proximities) perform better while distinguishing underlying classes than the “plain” measures. A comparison in terms of reject curves of inter-class and intra-class distances confirms this conclusion. A similar conclusion can be made for several well-known datasets. A possible origin of this effect is that most kernels have a multiplicative nature, while the nature of distances used in cluster algorithms is an additive one (cf. the triangle inequality). The logarithmic transformation is a tool to transform the first nature to the second one. Moreover, some distances corresponding to the logarithmic measures possess a meaningful cutpoint additivity property. In our experiments, the leader is usually the logarithmic Communicability measure. However, we indicate some more complicated cases in which other measures, typically, Communicability and plain Walk, can be the winners. |
Tasks | Graph Clustering |
Published | 2016-05-03 |
URL | http://arxiv.org/abs/1605.01046v3 |
http://arxiv.org/pdf/1605.01046v3.pdf | |
PWC | https://paperswithcode.com/paper/do-logarithmic-proximity-measures-outperform |
Repo | |
Framework | |
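A minimal version of the comparison the paper runs: compute a proximity kernel on the graph (the heat kernel here), optionally take its elementwise logarithm, convert to a distance, and cluster. The toy graph, the single kernel, and the kernel-to-distance conversion below are illustrative choices, not the paper's full protocol.

```python
import numpy as np

def heat_kernel(adjacency, t=1.0):
    """Heat-kernel proximity K = exp(-t * L), via the eigendecomposition
    of the combinatorial Laplacian L = D - A."""
    L = np.diag(adjacency.sum(axis=1)) - adjacency
    eigval, eigvec = np.linalg.eigh(L)
    return eigvec @ np.diag(np.exp(-t * eigval)) @ eigvec.T

def kernel_to_distance(K):
    """Standard kernel-induced squared distance d(i, j) = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

# Toy graph: two triangles joined by a single bridge edge (nodes 0-2 vs 3-5).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

K = heat_kernel(A, t=1.0)
D_plain = kernel_to_distance(K)                               # "plain" measure
D_log = kernel_to_distance(np.log(np.clip(K, 1e-12, None)))   # "logarithmic" measure
```

Either distance matrix can be fed to an ordinary clustering routine; the paper's finding is that the log-transformed variants usually separate the predefined classes better.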