January 29, 2020


Paper Group ANR 625


Multi-View Features and Hybrid Reward Strategies for Vatex Video Captioning Challenge 2019


Title	Multi-View Features and Hybrid Reward Strategies for Vatex Video Captioning Challenge 2019
Authors	Xinxin Zhu, Longteng Guo, Peng Yao, Jing Liu, Shichen Lu, Zheng Yu, Wei Liu, Hanqing Lu
Abstract	This document describes our solution for the VATEX Captioning Challenge 2019, which requires generating descriptions for the videos in both English and Chinese languages. We identified three crucial factors that improve the performance, namely: multi-view features, hybrid reward, and diverse ensemble. Our method achieves the 2nd and the 3rd places on the Chinese and English video captioning tracks, respectively.
Tasks	Video Captioning
Published	2019-10-17
URL	https://arxiv.org/abs/1910.11102v2
PDF	https://arxiv.org/pdf/1910.11102v2.pdf
PWC	https://paperswithcode.com/paper/multi-view-features-and-hybrid-reward
Repo
Framework

HyST: A Hybrid Approach for Flexible and Accurate Dialogue State Tracking


Title	HyST: A Hybrid Approach for Flexible and Accurate Dialogue State Tracking
Authors	Rahul Goel, Shachi Paul, Dilek Hakkani-Tür
Abstract	Recent works on end-to-end trainable neural network based approaches have demonstrated state-of-the-art results on dialogue state tracking. The best performing approaches estimate a probability distribution over all possible slot values. However, these approaches do not scale to the large value sets commonly present in real-life applications and are not ideal for tracking slot values that were not observed in the training set. To tackle these issues, candidate-generation-based approaches have been proposed. These approaches estimate a set of values that are possible at each turn based on the conversation history and/or language understanding outputs, and hence enable state tracking over unseen values and large value sets; however, they fall short in terms of performance in comparison to the first group. In this work, we analyze the performance of these two alternative dialogue state tracking methods, and present a hybrid approach (HyST) which learns the appropriate method for each slot type. To demonstrate the effectiveness of HyST on a rich set of slot types, we experiment with the recently released MultiWOZ-2.0 multi-domain, task-oriented dialogue dataset. Our experiments show that HyST scales to multi-domain applications. Our best performing model results in a relative improvement of 24% and 10% over the previous SOTA and our best baseline, respectively.
Tasks	Dialogue State Tracking
Published	2019-07-01
URL	https://arxiv.org/abs/1907.00883v1
PDF	https://arxiv.org/pdf/1907.00883v1.pdf
PWC	https://paperswithcode.com/paper/hyst-a-hybrid-approach-for-flexible-and
Repo
Framework

Propagated Perturbation of Adversarial Attack for well-known CNNs: Empirical Study and its Explanation


Title	Propagated Perturbation of Adversarial Attack for well-known CNNs: Empirical Study and its Explanation
Authors	Jihyeun Yoon, Kyungyul Kim, Jongseong Jang
Abstract	Deep neural network based classifiers are known to be vulnerable to input perturbations constructed by an adversarial attack to force misclassification. Most studies have focused on how to craft adversarial noise with gradient-based attack methods or on how to defend models against adversarial attacks. Using a denoiser model is a well-known way to reduce adversarial noise, although classification performance did not improve significantly. In this study, we aim to analyze the propagation of adversarial attacks from an explainable AI (XAI) point of view. Specifically, we examine the trend of adversarial perturbations through the CNN architectures. To analyze the propagated perturbation, we measured the normalized Euclidean distance and the cosine distance in each CNN layer between the feature map of the perturbed image passed through the denoiser and that of the non-perturbed original image. We used five well-known CNN-based classifiers and three gradient-based adversarial attacks. From the experimental results, we observed that in most cases the Euclidean distance increases explosively in the final fully connected layer, while the cosine distance fluctuates and disappears at the last layer. This means that the denoiser can decrease the amount of noise; however, it fails to prevent accuracy degradation.
Tasks	Adversarial Attack
Published	2019-09-19
URL	https://arxiv.org/abs/1909.09263v2
PDF	https://arxiv.org/pdf/1909.09263v2.pdf
PWC	https://paperswithcode.com/paper/propagated-perturbation-of-adversarial-attack
Repo
Framework
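
The layer-wise comparison described in the abstract above can be illustrated with a minimal sketch. The abstract does not define how the Euclidean distance is normalized, so dividing by the norm of the clean feature map is an assumption here, and the function names and the dictionary-of-layers input are hypothetical.

```python
import numpy as np


def normalized_euclidean_distance(clean, perturbed, eps=1e-12):
    """Euclidean distance between two feature maps, divided by the norm
    of the clean feature map (this normalization is an assumption)."""
    clean = np.asarray(clean, dtype=np.float64).ravel()
    perturbed = np.asarray(perturbed, dtype=np.float64).ravel()
    return float(np.linalg.norm(perturbed - clean) / (np.linalg.norm(clean) + eps))


def cosine_distance(clean, perturbed, eps=1e-12):
    """One minus the cosine similarity of the flattened feature maps."""
    clean = np.asarray(clean, dtype=np.float64).ravel()
    perturbed = np.asarray(perturbed, dtype=np.float64).ravel()
    denom = np.linalg.norm(clean) * np.linalg.norm(perturbed) + eps
    return float(1.0 - np.dot(clean, perturbed) / denom)


def layerwise_distances(clean_feats, perturbed_feats):
    """Compare feature maps layer by layer.

    Both arguments are hypothetical dicts mapping a layer name to the
    activation array recorded at that layer.
    """
    return {
        name: (
            normalized_euclidean_distance(clean_feats[name], perturbed_feats[name]),
            cosine_distance(clean_feats[name], perturbed_feats[name]),
        )
        for name in clean_feats
    }
```

Tracking both values per layer separates a change in magnitude (Euclidean) from a change in direction (cosine), which is the distinction the abstract draws for the final layer.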

Imperial College London Submission to VATEX Video Captioning Task


Title	Imperial College London Submission to VATEX Video Captioning Task
Authors	Ozan Caglayan, Zixiu Wu, Pranava Madhyastha, Josiah Wang, Lucia Specia
Abstract	This paper describes the Imperial College London team’s submission to the 2019 VATEX video captioning challenge, where we first explore two sequence-to-sequence models, namely a recurrent (GRU) model and a transformer model, which generate captions from the I3D action features. We then investigate the effect of dropping the encoder and the attention mechanism and instead conditioning the GRU decoder over two different vectorial representations: (i) a max-pooled action feature vector and (ii) the output of a multi-label classifier trained to predict visual entities from the action features. Our baselines achieved scores comparable to the official baseline. Conditioning over entity predictions performed substantially better than conditioning on the max-pooled feature vector, and only marginally worse than the GRU-based sequence-to-sequence baseline.
Tasks	Video Captioning
Published	2019-10-16
URL	https://arxiv.org/abs/1910.07482v1
PDF	https://arxiv.org/pdf/1910.07482v1.pdf
PWC	https://paperswithcode.com/paper/imperial-college-london-submission-to-vatex
Repo
Framework

Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019


Title	Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019
Authors	Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu
Abstract	This notebook paper presents our model in the VATEX video captioning challenge. In order to capture multi-level aspects in the video, we propose to integrate both temporal and spatial attentions for video captioning. The temporal attentive module focuses on global action movements, while the spatial attentive module enables the description of more fine-grained objects. Considering that these two types of attentive modules are complementary, we fuse them via a late fusion strategy. The proposed model significantly outperforms baselines and achieves a CIDEr score of 73.4 on the testing set, which ranks second on the VATEX video captioning challenge leaderboard 2019.
Tasks	Video Captioning
Published	2019-10-15
URL	https://arxiv.org/abs/1910.06737v1
PDF	https://arxiv.org/pdf/1910.06737v1.pdf
PWC	https://paperswithcode.com/paper/integrating-temporal-and-spatial-attentions
Repo
Framework

SAWNet: A Spatially Aware Deep Neural Network for 3D Point Cloud Processing


Title	SAWNet: A Spatially Aware Deep Neural Network for 3D Point Cloud Processing
Authors	Chaitanya Kaul, Nick Pears, Suresh Manandhar
Abstract	Deep neural networks have established themselves as the state-of-the-art methodology in almost all computer vision tasks to date. But their application to processing data lying on non-Euclidean domains is still a very active area of research. One such area is the analysis of point cloud data which poses a challenge due to its lack of order. Many recent techniques have been proposed, spearheaded by the PointNet architecture. These techniques use either global or local information from the point clouds to extract a latent representation for the points, which is then used for the task at hand (classification/segmentation). In our work, we introduce a neural network layer that combines both global and local information to produce better embeddings of these points. We enhance our architecture with residual connections, to pass information between the layers, which also makes the network easier to train. We achieve state-of-the-art results on the ModelNet40 dataset with our architecture, and our results are also highly competitive with the state-of-the-art on the ShapeNet part segmentation dataset and the indoor scene segmentation dataset. We plan to open source our pre-trained models on github to encourage the research community to test our networks on their data, or simply use them for benchmarking purposes.
Tasks	Scene Segmentation
Published	2019-05-18
URL	https://arxiv.org/abs/1905.07650v1
PDF	https://arxiv.org/pdf/1905.07650v1.pdf
PWC	https://paperswithcode.com/paper/sawnet-a-spatially-aware-deep-neural-network
Repo
Framework


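VATEX Captioning Challenge 2019: Multi-modal Information Fusion and Multi-stage Training Strategy for Video Captioning


Title	VATEX Captioning Challenge 2019: Multi-modal Information Fusion and Multi-stage Training Strategy for Video Captioning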
Authors	Ziqi Zhang, Yaya Shi, Jiutong Wei, Chunfeng Yuan, Bing Li, Weiming Hu
Abstract	Multi-modal information is essential to describe what has happened in a video. In this work, we represent videos by various appearance, motion and audio information guided by the video topic. By following a multi-stage training strategy, our experiments show steady and significant improvement on the VATEX benchmark. This report presents an overview and comparative analysis of our system designed for both the Chinese and English tracks of the VATEX Captioning Challenge 2019.
Tasks	Video Captioning
Published	2019-10-13
URL	https://arxiv.org/abs/1910.05752v1
PDF	https://arxiv.org/pdf/1910.05752v1.pdf
PWC	https://paperswithcode.com/paper/vatex-captioning-challenge-2019-multi-modal
Repo
Framework

Rhythm Zone Theory: Speech Rhythms are Physical after all


Title	Rhythm Zone Theory: Speech Rhythms are Physical after all
Authors	Dafydd Gibbon, Xuewei Lin
Abstract	Speech rhythms have been dealt with in three main ways: from the introspective analyses of rhythm as a correlate of syllable and foot timing in linguistics and applied linguistics, through analyses of durations of segments of utterances associated with consonantal and vocalic properties, syllables, feet and words, to models of rhythms in speech production and perception as physical oscillations. The present study avoids introspection and human-filtered annotation methods and extends the signal processing paradigm of amplitude envelope spectrum analysis by adding an additional analytic step of edge detection, and postulating the co-existence of multiple speech rhythms in rhythm zones marked by identifiable edges (Rhythm Zone Theory, RZT). An exploratory investigation of the utility of RZT is conducted, suggesting that native and non-native readings of the same text are distinct sub-genres of read speech: a reading by a US native speaker and non-native readings by relatively low-performing Cantonese adult learners of English. The study concludes by noting that with the methods used, RZT can distinguish between the speech rhythms of well-defined sub-genres of native speaker reading vs. non-native learner reading, but needs further refinement in order to be applied to the paradoxically more complex speech of low-performing language learners, whose speech rhythms are co-determined by non-fluency and disfluency factors in addition to well-known linguistic factors of grammar, vocabulary and discourse constraints.
Tasks	Edge Detection
Published	2019-01-31
URL	http://arxiv.org/abs/1902.01267v2
PDF	http://arxiv.org/pdf/1902.01267v2.pdf
PWC	https://paperswithcode.com/paper/rhythm-zone-theory-speech-rhythms-are
Repo
Framework

Human Action Sequence Classification


Title	Human Action Sequence Classification
Authors	Yan Bin Ng, Basura Fernando
Abstract	This paper classifies human action sequences from videos using a machine translation model. In contrast to classical human action classification, which outputs a set of actions, our method outputs a sequence of actions in the chronological order in which they are performed by the human. Therefore, our method is evaluated using sequential performance measures such as Bilingual Evaluation Understudy (BLEU) scores. Action sequence classification has many applications, such as learning from demonstration, action segmentation, detection, localization and video captioning. Furthermore, we use our model, which is trained to output action sequences, to solve downstream tasks such as video captioning and action localization. We obtain state-of-the-art results for video captioning on the challenging Charades dataset, with a BLEU-4 score of 34.8 and a METEOR score of 33.6, outperforming the previous state of the art of 18.8 and 19.5, respectively. Similarly, on ActivityNet captioning, we obtain excellent results in terms of ROUGE (20.24) and CIDEr (37.58) scores. For action localization, without using any explicit start/end action annotations, our method obtains a localization performance of 22.2 mAP, outperforming prior fully supervised methods.
Tasks	Action Classification, Action Localization, action segmentation, Machine Translation, Video Captioning
Published	2019-10-07
URL	https://arxiv.org/abs/1910.02602v1
PDF	https://arxiv.org/pdf/1910.02602v1.pdf
PWC	https://paperswithcode.com/paper/human-action-sequence-classification
Repo
Framework
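
To illustrate why a sequential measure such as BLEU suits ordered action outputs, here is a minimal sentence-level BLEU sketch over action labels. It is not the evaluation code used in the paper: it uses a single reference, no smoothing, and hypothetical action names.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count the n-grams (as tuples) that occur in a sequence of labels."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU with uniform n-gram weights.

    Returns 0.0 as soon as any n-gram order has no match, which is the
    behaviour of plain BLEU without smoothing.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical action labels, used only for illustration.
reference = ["open_door", "walk", "sit", "drink"]
shuffled = ["drink", "sit", "walk", "open_door"]
same_set_unigram_score = bleu(shuffled, reference, max_n=1)
same_set_bigram_score = bleu(shuffled, reference, max_n=2)
```

With unigrams only, the reversed sequence is indistinguishable from the reference because it contains the same set of actions; once bigrams are included, the wrong order is penalized.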

Optimizing vaccine distribution networks in low and middle-income countries


Title	Optimizing vaccine distribution networks in low and middle-income countries
Authors	Yuwen Yang, Hoda Bidkhori, Jayant Rajgopal
Abstract	Vaccination has been proven to be the most effective method to prevent infectious diseases. However, there are still millions of children in low and middle-income countries who are not covered by routine vaccines and remain at risk. The World Health Organization’s Expanded Programme on Immunization (WHO-EPI) was designed to provide universal childhood vaccine access for children across the world and in this work, we address the design of the distribution network for WHO-EPI vaccines. In particular, we formulate the network design problem as a mixed integer program (MIP) and present a new algorithm for typical problems that are too large to be solved using commercial MIP software. We test the algorithm using data derived from four different countries in sub-Saharan Africa and show that the algorithm is able to obtain high-quality solutions for even the largest problems within a few minutes.
Tasks
Published	2019-07-25
URL	https://arxiv.org/abs/1907.13434v1
PDF	https://arxiv.org/pdf/1907.13434v1.pdf
PWC	https://paperswithcode.com/paper/optimizing-vaccine-distribution-networks-in
Repo
Framework


Toward Maximizing the Visibility of Content in Social Media Brand Pages: A Temporal Analysis


Title	Toward Maximizing the Visibility of Content in Social Media Brand Pages: A Temporal Analysis
Authors	Nagendra Kumar, Gopi Ande, J. Shirish Kumar, Manish Singh
Abstract	A large amount of content is generated every day in social media. One of the main goals of content creators is to spread their information to a large audience. There are many factors that affect information spread, such as posting time, location, type of information, number of social connections, etc. In this paper, we look at the problem of finding the best posting time(s) to get high content visibility. The posting time is derived taking other factors into account, such as location, type of information, etc. We carry out our analysis over Facebook pages. We propose six posting schedules that can be used for individual pages or groups of pages with a similar audience reaction profile. We perform our experiment on a Facebook pages dataset containing 0.3 million posts and 10 million audience reactions. Our best posting schedule can lead to seven times more audience reactions compared to the average number of audience reactions that users would get without following any optimized posting schedule. We also present some interesting audience reaction patterns that we obtained through daily, weekly and monthly audience reaction analysis.
Tasks
Published	2019-08-22
URL	https://arxiv.org/abs/1908.08622v1
PDF	https://arxiv.org/pdf/1908.08622v1.pdf
PWC	https://paperswithcode.com/paper/toward-maximizing-the-visibility-of-content
Repo
Framework
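
As a rough illustration of deriving a posting time from audience reactions, the sketch below ranks hours of the day by mean reactions per post. This is a simple baseline under stated assumptions, not one of the six schedules the paper proposes; the `(hour, reactions)` input format and the function name are hypothetical.

```python
from collections import defaultdict


def best_posting_hours(posts, top_k=3):
    """Rank hours of the day by mean audience reactions per post.

    `posts` is an iterable of (hour, reactions) pairs. Ties are broken
    by the earlier hour so the result is deterministic.
    """
    totals = defaultdict(lambda: [0, 0])  # hour -> [reaction sum, post count]
    for hour, reactions in posts:
        totals[hour][0] += reactions
        totals[hour][1] += 1
    means = {hour: total / count for hour, (total, count) in totals.items()}
    return sorted(means, key=lambda hour: (-means[hour], hour))[:top_k]


# Made-up sample data: hour 18 averages 75 reactions, hour 9 averages 20.
sample_posts = [(9, 10), (9, 30), (18, 100), (18, 50), (23, 5)]
top_two = best_posting_hours(sample_posts, top_k=2)
```

Using the mean rather than the sum keeps hours with many low-performing posts from dominating the ranking.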

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability


Title	SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability
Authors	Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Abstract	The ability to generate natural language explanations conditioned on the visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans. While the research efforts in image and video captioning are giving promising results, this is often done at the expense of the computational requirements of the approaches, limiting their applicability to real contexts. In this paper, we propose a fully-attentive captioning algorithm which can provide state-of-the-art performances on language generation while restricting its computational demands. Our model is inspired by the Transformer model and employs only two Transformer layers in the encoding and decoding stages. Further, it incorporates a novel memory-aware encoding of image regions. Experiments demonstrate that our approach achieves competitive results in terms of caption quality while featuring reduced computational demands. Further, to evaluate its applicability on autonomous agents, we conduct experiments on simulated scenes taken from the perspective of domestic robots.
Tasks	Text Generation, Video Captioning
Published	2019-10-07
URL	https://arxiv.org/abs/1910.02974v3
PDF	https://arxiv.org/pdf/1910.02974v3.pdf
PWC	https://paperswithcode.com/paper/smart-training-shallow-memory-aware
Repo
Framework

Function Follows Form: Regression from Complete Thoracic Computed Tomography Scans


Title	Function Follows Form: Regression from Complete Thoracic Computed Tomography Scans
Authors	Max Argus, Cornelia Schaefer-Prokop, David A. Lynch, Bram van Ginneken
Abstract	Chronic Obstructive Pulmonary Disease (COPD) is a leading cause of morbidity and mortality. While COPD diagnosis is based on lung function tests, early stages and progression of different aspects of the disease can be visible and quantitatively assessed on computed tomography (CT) scans. Many studies have been published that quantify imaging biomarkers related to COPD. In this paper we present a convolutional neural network that directly computes visual emphysema scores and predicts the outcome of lung function tests for 195 CT scans from the COPDGene study. Contrary to previous work, the proposed method does not encode any specific prior knowledge about what to quantify, but it is trained end-to-end with a set of 1424 CT scans for which the output parameters were available. The network provided state-of-the-art results for these tasks: Visual emphysema scores are comparable to those assessed by trained human observers; COPD diagnosis from estimated lung function reaches an area under the ROC curve of 0.94, outperforming prior art. The method is easily generalizable to other situations where information from whole scans needs to be summarized in single quantities.
Tasks	Computed Tomography (CT)
Published	2019-09-26
URL	https://arxiv.org/abs/1909.12047v2
PDF	https://arxiv.org/pdf/1909.12047v2.pdf
PWC	https://paperswithcode.com/paper/follows-form-regression-from-complete
Repo
Framework

Detecting Bias with Generative Counterfactual Face Attribute Augmentation


Title	Detecting Bias with Generative Counterfactual Face Attribute Augmentation
Authors	Emily Denton, Ben Hutchinson, Margaret Mitchell, Timnit Gebru
Abstract	We introduce a simple framework for identifying biases of a smiling attribute classifier. Our method poses counterfactual questions of the form: how would the prediction change if this face characteristic had been different? We leverage recent advances in generative adversarial networks to build a realistic generative model of face images that affords controlled manipulation of specific image characteristics. We introduce a set of metrics that measure the effect of manipulating a specific property of an image on the output of a trained classifier. Empirically, we identify several different factors of variation that affect the predictions of a smiling classifier trained on CelebA.
Tasks
Published	2019-06-14
URL	https://arxiv.org/abs/1906.06439v2
PDF	https://arxiv.org/pdf/1906.06439v2.pdf
PWC	https://paperswithcode.com/paper/detecting-bias-with-generative-counterfactual
Repo
Framework
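
The abstract above mentions metrics for the effect of an attribute manipulation on classifier output without naming them. A minimal sketch of two plausible measures follows; both the mean score shift and the decision flip rate are assumptions chosen for illustration, and the function name is hypothetical.

```python
import numpy as np


def counterfactual_effect(scores_original, scores_manipulated, threshold=0.5):
    """Summarize how a classifier's outputs change after an attribute edit.

    Both inputs hold the classifier's score (e.g. probability of
    "smiling") for the same images before and after manipulation.
    Returns (mean score shift, fraction of decisions that flipped).
    """
    original = np.asarray(scores_original, dtype=np.float64)
    manipulated = np.asarray(scores_manipulated, dtype=np.float64)
    if original.shape != manipulated.shape:
        raise ValueError("score arrays must be paired image by image")
    mean_shift = float(np.mean(manipulated - original))
    flipped = (original >= threshold) != (manipulated >= threshold)
    return mean_shift, float(np.mean(flipped))
```

A mean shift near zero combined with a high flip rate would suggest the manipulation pushes individual predictions in both directions, which an average alone would hide.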

Predicting Rainfall using Machine Learning Techniques


Title	Predicting Rainfall using Machine Learning Techniques
Authors	Nikhil Oswal
Abstract	Rainfall prediction is one of the challenging and uncertain tasks which has a significant impact on human society. Timely and accurate predictions can help to proactively reduce human and financial loss. This study presents a set of experiments which involve the use of prevalent machine learning techniques to build models to predict whether it is going to rain tomorrow or not based on weather data for that particular day in major cities of Australia. This comparative study is conducted concentrating on three aspects: modeling inputs, modeling methods, and pre-processing techniques. The results provide a comparison of various evaluation metrics of these machine learning techniques and their reliability to predict the rainfall by analyzing the weather data.
Tasks
Published	2019-10-29
URL	https://arxiv.org/abs/1910.13827v1
PDF	https://arxiv.org/pdf/1910.13827v1.pdf
PWC	https://paperswithcode.com/paper/predicting-rainfall-using-machine-learning
Repo
Framework