February 1, 2020

3377 words 16 mins read

Paper Group AWR 209

An Automatic Cardiac Segmentation Framework based on Multi-sequence MR Image. Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN. Markov Random Fields for Collaborative Filtering. Human Attention in Image Captioning: Dataset and Analysis. An investigation of model-free planning. Discovery of Natural Language Concepts in Indi …

An Automatic Cardiac Segmentation Framework based on Multi-sequence MR Image

Title An Automatic Cardiac Segmentation Framework based on Multi-sequence MR Image
Authors Yashu Liu, Wei Wang, Kuanquan Wang, Chengqin Ye, Gongning Luo
Abstract LGE CMR is an efficient technology for detecting infarcted myocardium. An efficient and objective ventricle segmentation method in LGE can benefit the localization of the infarcted myocardium. In this paper, we propose an automatic framework for LGE image segmentation. There are only 5 labeled LGE volumes, with about 15 slices per volume. We adopted histogram matching and a rotation-invariant registration method on the other labeled modalities to achieve effective augmentation of the training data. A CNN segmentation model was trained on the augmented data using a leave-one-out strategy. The model's predictions were then refined by a connected component analysis for each class, retaining only the largest connected component as the final segmentation result. Our model was evaluated on the 2019 Multi-sequence Cardiac MR Segmentation Challenge. The mean results over 40 testing volumes on Dice score, Jaccard score, Surface distance, and Hausdorff distance are 0.8087, 0.6976, 2.8727 mm, and 15.6387 mm, respectively. The experimental results show satisfying performance of the proposed framework. Code is available at https://github.com/Suiiyu/MS-CMR2019.
Tasks Cardiac Segmentation, Semantic Segmentation
Published 2019-09-12
URL https://arxiv.org/abs/1909.05488v1
PDF https://arxiv.org/pdf/1909.05488v1.pdf
PWC https://paperswithcode.com/paper/an-automatic-cardiac-segmentation-framework
Repo https://github.com/Suiiyu/MS-CMR2019
Framework tf
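
The post-processing described in the abstract (keeping only the largest connected component per class) is straightforward to reproduce. Below is a minimal NumPy/SciPy sketch assuming a 3D integer label map; it is an illustration, not code taken from the authors' repository.

```python
import numpy as np
from scipy import ndimage


def keep_largest_component(label_map, background=0):
    """For each foreground class, zero out all but its largest 3D connected component."""
    cleaned = np.full_like(label_map, background)
    for cls in np.unique(label_map):
        if cls == background:
            continue
        mask = label_map == cls
        components, n = ndimage.label(mask)   # 6-connectivity in 3D by default
        if n == 0:
            continue
        sizes = ndimage.sum(mask, components, np.arange(1, n + 1))
        largest = np.argmax(sizes) + 1        # component labels start at 1
        cleaned[components == largest] = cls
    return cleaned


# Toy example: two blobs of class 1; only the bigger one survives.
pred = np.zeros((8, 8, 8), dtype=np.int32)
pred[1:4, 1:4, 1:4] = 1
pred[6, 6, 6] = 1
print(np.count_nonzero(keep_largest_component(pred)))  # 27
```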

Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN

Title Free-form Video Inpainting with 3D Gated Convolution and Temporal PatchGAN
Authors Ya-Liang Chang, Zhe Yu Liu, Kuan-Ying Lee, Winston Hsu
Abstract Free-form video inpainting is a very challenging task that could be widely used for video editing such as text removal. Existing patch-based methods could not handle non-repetitive structures such as faces, while directly applying image-based inpainting models to videos will result in temporal inconsistency (see http://bit.ly/2Fu1n6b). In this paper, we introduce a deep learning based free-form video inpainting model, with proposed 3D gated convolutions to tackle the uncertainty of free-form masks and a novel Temporal PatchGAN loss to enhance temporal consistency. In addition, we collect videos and design a free-form mask generation algorithm to build the free-form video inpainting (FVI) dataset for training and evaluation of video inpainting models. We demonstrate the benefits of these components and experiments on both the FaceForensics and our FVI dataset suggest that our method is superior to existing ones. Related source code, full-resolution result videos and the FVI dataset can be found on GitHub: https://github.com/amjltc295/Free-Form-Video-Inpainting .
Tasks Video Inpainting
Published 2019-04-23
URL https://arxiv.org/abs/1904.10247v3
PDF https://arxiv.org/pdf/1904.10247v3.pdf
PWC https://paperswithcode.com/paper/free-form-video-inpainting-with-3d-gated
Repo https://github.com/amjltc295/Free-Form-Video-Inpainting
Framework pytorch
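
The 3D gated convolution at the heart of the model can be illustrated as a standard Conv3d whose output is modulated by a learned sigmoid gate. This is a hedged sketch of the general mechanism, not the authors' exact layer; kernel size, activation, and the absence of normalization are assumptions.

```python
import torch
import torch.nn as nn


class GatedConv3d(nn.Module):
    """Feature conv multiplied element-wise by a sigmoid gate (illustrative sketch)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.feature = nn.Conv3d(in_ch, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv3d(in_ch, out_ch, kernel_size, stride, padding)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # The gate can learn to down-weight responses inside masked (hole) regions.
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))


# Toy usage: batch of 2 videos, 4 channels (RGB + mask), 8 frames, 32x32 resolution.
x = torch.randn(2, 4, 8, 32, 32)
print(GatedConv3d(4, 16)(x).shape)  # torch.Size([2, 16, 8, 32, 32])
```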

Markov Random Fields for Collaborative Filtering

Title Markov Random Fields for Collaborative Filtering
Authors Harald Steck
Abstract In this paper, we model the dependencies among the items that are recommended to a user in a collaborative-filtering problem via a Gaussian Markov Random Field (MRF). We build upon Besag’s auto-normal parameterization and pseudo-likelihood, which not only enables computationally efficient learning, but also connects the areas of MRFs and sparse inverse covariance estimation with autoencoders and neighborhood models, two successful approaches in collaborative filtering. We propose a novel approximation for learning sparse MRFs, where the trade-off between recommendation-accuracy and training-time can be controlled. At only a small fraction of the training-time compared to various baselines, including deep nonlinear models, the proposed approach achieved competitive ranking-accuracy on all three well-known data-sets used in our experiments, and notably a 20% gain in accuracy on the data-set with the largest number of items.
Tasks
Published 2019-10-21
URL https://arxiv.org/abs/1910.09645v1
PDF https://arxiv.org/pdf/1910.09645v1.pdf
PWC https://paperswithcode.com/paper/markov-random-fields-for-collaborative
Repo https://github.com/hasteck/MRF_NeurIPS_2019
Framework none
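
As a rough illustration of the auto-normal, item-item view the paper builds on, the dense (non-sparse) special case can be solved in closed form from the regularized item-item Gram matrix. This is essentially Steck's earlier EASE-style estimator, shown only to make the auto-normal parameterization concrete; it is not the sparse MRF approximation proposed in this paper.

```python
import numpy as np


def dense_item_item_weights(X, reg=100.0):
    """Closed-form dense item-item model (EASE-style); X is a user-item 0/1 matrix."""
    G = X.T @ X + reg * np.eye(X.shape[1])   # regularized Gram matrix
    P = np.linalg.inv(G)
    B = -P / np.diag(P)                      # auto-normal / pseudo-likelihood form
    np.fill_diagonal(B, 0.0)                 # no self-similarity
    return B


# Scores for ranking: a user's interaction row times the learned item-item matrix.
X = (np.random.rand(1000, 50) < 0.05).astype(float)
scores = X @ dense_item_item_weights(X)
print(scores.shape)  # (1000, 50)
```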

Human Attention in Image Captioning: Dataset and Analysis

Title Human Attention in Image Captioning: Dataset and Analysis
Authors Sen He, Hamed R. Tavakoli, Ali Borji, Nicolas Pugeault
Abstract In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs in free-viewing and image description tasks. Humans tend to fixate on a greater variety of regions under the latter task, (2) there is a strong relationship between described objects and attended objects (97% of the described objects are being attended), (3) a convolutional neural network as feature encoder accounts for human-attended regions during image captioning to a great extent (around 78%), (4) soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention consistency scores. These indicate a large gap between humans and machines in regards to top-down attention, and (5) by integrating the soft attention model with image saliency, we can significantly improve the model’s performance on Flickr30k and MSCOCO benchmarks. The dataset can be found at: https://github.com/SenHe/Human-Attention-in-Image-Captioning.
Tasks Image Captioning
Published 2019-03-06
URL https://arxiv.org/abs/1903.02499v3
PDF https://arxiv.org/pdf/1903.02499v3.pdf
PWC https://paperswithcode.com/paper/a-synchronized-multi-modal-attention-caption
Repo https://github.com/SenHe/Human-Attention-in-Image-Captioning
Framework none
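
One of the analyses, comparing the model's soft-attention maps with human fixation maps, can be illustrated with a simple per-image correlation score. This is a hedged stand-in; the paper's exact attention-consistency metric may differ.

```python
import numpy as np


def attention_consistency(soft_attention, fixation_map):
    """Pearson correlation between a model attention map and a human fixation map."""
    a = soft_attention.ravel().astype(float)
    h = fixation_map.ravel().astype(float)
    a = (a - a.mean()) / (a.std() + 1e-8)
    h = (h - h.mean()) / (h.std() + 1e-8)
    return float(np.mean(a * h))


# Toy example: two 14x14 maps, the "human" map loosely correlated with the attention map.
rng = np.random.default_rng(0)
att = rng.random((14, 14))
fix = att + 0.1 * rng.random((14, 14))
print(round(attention_consistency(att, fix), 3))
```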

An investigation of model-free planning

Title An investigation of model-free planning
Authors Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, Timothy Lillicrap
Abstract The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learns how to plan, by providing the structure for planning via an inductive bias in the function approximator (such as a tree structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further, and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent’s effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.
Tasks
Published 2019-01-11
URL https://arxiv.org/abs/1901.03559v2
PDF https://arxiv.org/pdf/1901.03559v2.pdf
PWC https://paperswithcode.com/paper/an-investigation-of-model-free-planning
Repo https://github.com/deepmind/boxoban-levels
Framework none
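
One of the measured properties, the ability to exploit additional "thinking time", can be illustrated by letting a recurrent policy run several internal ticks on the same observation before acting. The sketch below is a toy LSTM agent for illustration only, not the ConvLSTM architecture studied in the paper.

```python
import torch
import torch.nn as nn


class PonderingAgent(nn.Module):
    """Recurrent policy that runs extra internal ticks per observation (illustrative sketch)."""

    def __init__(self, obs_dim, hidden=128, n_actions=4, ticks=3):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.core = nn.LSTMCell(hidden, hidden)
        self.policy = nn.Linear(hidden, n_actions)
        self.ticks = ticks

    def forward(self, obs, state):
        x = torch.relu(self.encoder(obs))
        h, c = state
        for _ in range(self.ticks):   # "thinking time": repeated ticks on the same input
            h, c = self.core(x, (h, c))
        return self.policy(h), (h, c)


agent = PonderingAgent(obs_dim=10)
state = (torch.zeros(1, 128), torch.zeros(1, 128))
logits, state = agent(torch.randn(1, 10), state)
print(logits.shape)  # torch.Size([1, 4])
```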

Discovery of Natural Language Concepts in Individual Units of CNNs

Title Discovery of Natural Language Concepts in Individual Units of CNNs
Authors Seil Na, Yo Joong Choe, Dong-Hyun Lee, Gunhee Kim
Abstract Although deep convolutional networks have achieved improved performance in many natural language tasks, they have been treated as black boxes because they are difficult to interpret. Especially, little is known about how they represent language in their intermediate layers. In an attempt to understand the representations of deep convolutional networks trained on language tasks, we show that individual units are selectively responsive to specific morphemes, words, and phrases, rather than responding to arbitrary and uninterpretable patterns. In order to quantitatively analyze such an intriguing phenomenon, we propose a concept alignment method based on how units respond to the replicated text. We conduct analyses with different architectures on multiple datasets for classification and translation tasks and provide new insights into how deep models understand natural language.
Tasks
Published 2019-02-18
URL http://arxiv.org/abs/1902.07249v2
PDF http://arxiv.org/pdf/1902.07249v2.pdf
PWC https://paperswithcode.com/paper/discovery-of-natural-language-concepts-in
Repo https://github.com/seilna/CNN-Units-in-NLP
Framework tf

Autoregressive Text Generation Beyond Feedback Loops

Title Autoregressive Text Generation Beyond Feedback Loops
Authors Florian Schmidt, Stephan Mandt, Thomas Hofmann
Abstract Autoregressive state transitions, where predictions are conditioned on past predictions, are the predominant choice for both deterministic and stochastic sequential models. However, autoregressive feedback exposes the evolution of the hidden state trajectory to potential biases from well-known train-test discrepancies. In this paper, we combine a latent state space model with a CRF observation model. We argue that such autoregressive observation models form an interesting middle ground that expresses local correlations on the word level but keeps the state evolution non-autoregressive. On unconditional sentence generation we show performance improvements compared to RNN and GAN baselines while avoiding some prototypical failure modes of autoregressive models.
Tasks Text Generation
Published 2019-08-30
URL https://arxiv.org/abs/1908.11658v1
PDF https://arxiv.org/pdf/1908.11658v1.pdf
PWC https://paperswithcode.com/paper/autoregressive-text-generation-beyond
Repo https://github.com/schmiflo/crf-generation
Framework tf

Structured Object-Aware Physics Prediction for Video Modeling and Planning

Title Structured Object-Aware Physics Prediction for Video Modeling and Planning
Authors Jannik Kossen, Karl Stelzner, Marcel Hussing, Claas Voelcker, Kristian Kersting
Abstract When humans observe a physical system, they can easily locate objects, understand their interactions, and anticipate future behavior, even in settings with complicated and previously unseen interactions. For computers, however, learning such models from videos in an unsupervised fashion is an unsolved research problem. In this paper, we present STOVE, a novel state-space model for videos, which explicitly reasons about objects and their positions, velocities, and interactions. It is constructed by combining an image model and a dynamics model in a compositional manner and improves on previous work by reusing the dynamics model for inference, accelerating and regularizing training. STOVE predicts videos with convincing physical behavior over hundreds of timesteps, outperforms previous unsupervised models, and even approaches the performance of supervised baselines. We further demonstrate the strength of our model as a simulator for sample efficient model-based control in a task with heavily interacting objects.
Tasks
Published 2019-10-06
URL https://arxiv.org/abs/1910.02425v2
PDF https://arxiv.org/pdf/1910.02425v2.pdf
PWC https://paperswithcode.com/paper/structured-object-aware-physics-prediction-1
Repo https://github.com/jlko/STOVE
Framework pytorch

Explainable Video Action Reasoning via Prior Knowledge and State Transitions

Title Explainable Video Action Reasoning via Prior Knowledge and State Transitions
Authors Tao Zhuo, Zhiyong Cheng, Peng Zhang, Yongkang Wong, Mohan Kankanhalli
Abstract Human action analysis and understanding in videos is an important and challenging task. Although substantial progress has been made in past years, the explainability of existing methods is still limited. In this work, we propose a novel action reasoning framework that uses prior knowledge to explain semantic-level observations of video state changes. Our method takes advantage of both classical reasoning and modern deep learning approaches. Specifically, prior knowledge is defined as the information of a target video domain, including a set of objects, attributes and relationships in the target video domain, as well as relevant actions defined by the temporal attribute and relationship changes (i.e. state transitions). Given a video sequence, we first generate a scene graph on each frame to represent concerned objects, attributes and relationships. Then those scene graphs are linked by tracking objects across frames to form a spatio-temporal graph (also called video graph), which represents semantic-level video states. Finally, by sequentially examining each state transition in the video graph, our method can detect and explain how those actions are executed with prior knowledge, just like the logical manner of thinking by humans. Compared to previous works, the action reasoning results of our method can be explained by both logical rules and semantic-level observations of video content changes. Besides, the proposed method can be used to detect multiple concurrent actions with detailed information, such as who (particular objects), when (time), where (object locations) and how (what kind of changes). Experiments on a re-annotated dataset CAD-120 show the effectiveness of our method.
Tasks
Published 2019-08-28
URL https://arxiv.org/abs/1908.10700v1
PDF https://arxiv.org/pdf/1908.10700v1.pdf
PWC https://paperswithcode.com/paper/explainable-video-action-reasoning-via-prior
Repo https://github.com/visiontao/evar
Framework tf
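
The core reasoning step, checking relation changes between consecutive frame-level scene graphs against prior-knowledge rules, can be sketched with plain Python dictionaries. The relation and action names below are hypothetical and purely illustrative.

```python
# Each frame's scene graph is reduced here to a set of (subject, relation, object) triples.
frame_graphs = [
    {("hand", "far_from", "cup")},
    {("hand", "touching", "cup")},
    {("hand", "holding", "cup")},
]

# Hypothetical prior knowledge: an action is defined by a (before, after) relation change.
action_rules = {
    ("far_from", "touching"): "reach",
    ("touching", "holding"): "grasp",
}


def detect_actions(graphs, rules):
    """Report (time, subject, object, action) for every rule-matching relation change."""
    events = []
    for t in range(len(graphs) - 1):
        before = {(s, o): r for s, r, o in graphs[t]}
        after = {(s, o): r for s, r, o in graphs[t + 1]}
        for pair in before.keys() & after.keys():
            change = (before[pair], after[pair])
            if change in rules:
                events.append((t + 1, pair[0], pair[1], rules[change]))
    return events


print(detect_actions(frame_graphs, action_rules))
# [(1, 'hand', 'cup', 'reach'), (2, 'hand', 'cup', 'grasp')]
```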

Learning Sparse Networks Using Targeted Dropout

Title Learning Sparse Networks Using Targeted Dropout
Authors Aidan N. Gomez, Ivan Zhang, Siddhartha Rao Kamalakara, Divyam Madaan, Kevin Swersky, Yarin Gal, Geoffrey E. Hinton
Abstract Neural networks are easier to optimise when they have many more weights than are required for modelling the mapping from inputs to outputs. This suggests a two-stage learning procedure that first learns a large net and then prunes away connections or hidden units. But standard training does not necessarily encourage nets to be amenable to pruning. We introduce targeted dropout, a method for training a neural network so that it is robust to subsequent pruning. Before computing the gradients for each weight update, targeted dropout stochastically selects a set of units or weights to be dropped using a simple self-reinforcing sparsity criterion and then computes the gradients for the remaining weights. The resulting network is robust to post hoc pruning of weights or units that frequently occur in the dropped sets. The method improves upon more complicated sparsifying regularisers while being simple to implement and easy to tune.
Tasks Network Pruning, Neural Network Compression
Published 2019-05-31
URL https://arxiv.org/abs/1905.13678v5
PDF https://arxiv.org/pdf/1905.13678v5.pdf
PWC https://paperswithcode.com/paper/learning-sparse-networks-using-targeted
Repo https://github.com/for-ai/TD
Framework tf
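
A minimal sketch of the weight-level variant of targeted dropout, assuming the candidate set is the fraction of smallest-magnitude weights per column; the exact grouping and rates used in the paper may differ.

```python
import torch


def targeted_weight_dropout(weight, targ_rate=0.5, drop_rate=0.5):
    """Among the targ_rate fraction of smallest-magnitude weights in each column,
    drop each one with probability drop_rate (illustrative sketch)."""
    magnitudes = weight.abs()
    k = int(targ_rate * weight.shape[0])
    # Per-column magnitude threshold: the k-th smallest entry in that column.
    threshold = magnitudes.sort(dim=0).values[k - 1:k, :]
    targeted = magnitudes <= threshold                         # candidate set
    drop = targeted & (torch.rand_like(weight) < drop_rate)    # stochastic drop
    return weight * (~drop).float()


w = torch.randn(64, 32)
w_dropped = targeted_weight_dropout(w)
print((w_dropped == 0).float().mean().item())  # roughly targ_rate * drop_rate ≈ 0.25
```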

Geometry Normalization Networks for Accurate Scene Text Detection

Title Geometry Normalization Networks for Accurate Scene Text Detection
Authors Youjiang Xu, Jiaqi Duan, Zhanghui Kuang, Xiaoyu Yue, Hongbin Sun, Yue Guan, Wayne Zhang
Abstract Large geometry (e.g., orientation) variances are the key challenges in scene text detection. In this work, we first conduct experiments to investigate the capacity of networks to learn geometry variances when detecting scene text, and find that networks can handle only limited text geometry variances. Then, we put forward a novel Geometry Normalization Module (GNM) with multiple branches, each of which is composed of one Scale Normalization Unit and one Orientation Normalization Unit, to normalize each text instance to one desired canonical geometry range through at least one branch. The GNM is general and readily plugged into existing convolutional neural network based text detectors to construct end-to-end Geometry Normalization Networks (GNNets). Moreover, we propose a geometry-aware training scheme to effectively train the GNNets by sampling and augmenting text instances from a uniform geometry variance distribution. Finally, experiments on popular benchmarks of ICDAR 2015 and ICDAR 2017 MLT validate that our method outperforms all the state-of-the-art approaches remarkably by obtaining one-forward test F-scores of 88.52 and 74.54 respectively.
Tasks Scene Text Detection
Published 2019-09-02
URL https://arxiv.org/abs/1909.00794v1
PDF https://arxiv.org/pdf/1909.00794v1.pdf
PWC https://paperswithcode.com/paper/geometry-normalization-networks-for-accurate
Repo https://github.com/bigvideoresearch/GNNets
Framework none

Personalized Purchase Prediction of Market Baskets with Wasserstein-Based Sequence Matching

Title Personalized Purchase Prediction of Market Baskets with Wasserstein-Based Sequence Matching
Authors Mathias Kraus, Stefan Feuerriegel
Abstract Personalization in marketing aims at improving the shopping experience of customers by tailoring services to individuals. In order to achieve this, businesses must be able to make personalized predictions regarding the next purchase. That is, one must forecast the exact list of items that will comprise the next purchase, i.e., the so-called market basket. Despite its relevance to firm operations, this problem has received surprisingly little attention in prior research, largely due to its inherent complexity. In fact, state-of-the-art approaches are limited to intuitive decision rules for pattern extraction. However, the simplicity of the pre-coded rules impedes performance, since decision rules operate in an autoregressive fashion: the rules can only make inferences from past purchases of a single customer without taking into account the knowledge transfer that takes place between customers. In contrast, our research overcomes the limitations of pre-set rules by contributing a novel predictor of market baskets from sequential purchase histories: our predictions are based on similarity matching in order to identify similar purchase habits among the complete shopping histories of all customers. Our contributions are as follows: (1) We propose similarity matching based on subsequential dynamic time warping (SDTW) as a novel predictor of market baskets. Thereby, we can effectively identify cross-customer patterns. (2) We leverage the Wasserstein distance for measuring the similarity among embedded purchase histories. (3) We develop a fast approximation algorithm for computing a lower bound of the Wasserstein distance in our setting. An extensive series of computational experiments demonstrates the effectiveness of our approach. The accuracy of identifying the exact market baskets based on state-of-the-art decision rules from the literature is outperformed by a factor of 4.0.
Tasks Transfer Learning
Published 2019-05-24
URL https://arxiv.org/abs/1905.13131v2
PDF https://arxiv.org/pdf/1905.13131v2.pdf
PWC https://paperswithcode.com/paper/190513131
Repo https://github.com/mathiaskraus/MarketBasket
Framework none
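
The alignment backbone, dynamic time warping over whole baskets, can be sketched as follows. The sketch uses plain DTW with a Jaccard basket distance as a stand-in; the paper uses subsequential DTW with a Wasserstein distance over embedded purchase histories.

```python
import numpy as np


def jaccard_distance(basket_a, basket_b):
    """Stand-in basket distance; the paper uses a Wasserstein distance over item embeddings."""
    a, b = set(basket_a), set(basket_b)
    return 1.0 - len(a & b) / max(len(a | b), 1)


def dtw(seq_a, seq_b, dist=jaccard_distance):
    """Classic dynamic time warping between two sequences of market baskets."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]


customer_1 = [{"milk", "bread"}, {"milk", "eggs"}, {"coffee"}]
customer_2 = [{"milk"}, {"milk", "eggs", "butter"}, {"coffee", "sugar"}]
print(round(dtw(customer_1, customer_2), 3))
```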

Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation

Title Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation
Authors He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, Leonidas J. Guibas
Abstract The goal of this paper is to estimate the 6D pose and dimensions of unseen object instances in an RGB-D image. Contrary to “instance-level” 6D pose estimation tasks, our problem assumes that no exact object CAD models are available during either training or testing time. To handle different and unseen object instances in a given category, we introduce a Normalized Object Coordinate Space (NOCS)—a shared canonical representation for all possible object instances within a category. Our region-based neural network is then trained to directly infer the correspondence from observed pixels to this shared object representation (NOCS) along with other object information such as class label and instance mask. These predictions can be combined with the depth map to jointly estimate the metric 6D pose and dimensions of multiple objects in a cluttered scene. To train our network, we present a new context-aware technique to generate large amounts of fully annotated mixed reality data. To further improve our model and evaluate its performance on real data, we also provide a fully annotated real-world dataset with large environment and instance variation. Extensive experiments demonstrate that the proposed method is able to robustly estimate the pose and size of unseen object instances in real environments while also achieving state-of-the-art performance on standard 6D pose estimation benchmarks.
Tasks 6D Pose Estimation, 6D Pose Estimation using RGB, Pose Estimation
Published 2019-01-09
URL https://arxiv.org/abs/1901.02970v2
PDF https://arxiv.org/pdf/1901.02970v2.pdf
PWC https://paperswithcode.com/paper/normalized-object-coordinate-space-for
Repo https://github.com/lh641446825/NOCS_2019CVPR
Framework tf
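
Given predicted NOCS coordinates for an object's pixels and the corresponding depth-backprojected 3D points, the metric 6D pose and scale can be recovered with a similarity (Umeyama-style) alignment. Below is a hedged NumPy sketch of that alignment step, not the authors' implementation.

```python
import numpy as np


def similarity_transform(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ≈ s * R @ src + t.
    src: (N, 3) NOCS coordinates; dst: (N, 3) points backprojected from the depth map."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))          # avoid reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = mu_dst - scale * R @ mu_src
    return scale, R, t


# Sanity check on synthetic data: recover a known scale of 2.0.
rng = np.random.default_rng(0)
src = rng.random((100, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true *= np.sign(np.linalg.det(R_true))        # force a proper rotation
dst = 2.0 * src @ R_true.T + np.array([0.1, -0.2, 0.3])
s, R, t = similarity_transform(src, dst)
print(round(s, 3))  # ≈ 2.0
```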

Towards Similarity Graphs Constructed by Deep Reinforcement Learning

Title Towards Similarity Graphs Constructed by Deep Reinforcement Learning
Authors Dmitry Baranchuk, Artem Babenko
Abstract Similarity graphs are an active research direction for the nearest neighbor search (NNS) problem. New algorithms for similarity graph construction are continuously being proposed and analyzed by both theoreticians and practitioners. However, existing construction algorithms are mostly based on heuristics and do not explicitly maximize the target performance measure, i.e., search recall. Therefore, at the moment it is not clear whether the performance of similarity graphs has plateaued or more effective graphs can be constructed with more theoretically grounded methods. In this paper, we introduce a new principled algorithm, based on adjacency matrix optimization, which explicitly maximizes search efficiency. Namely, we propose a probabilistic model of a similarity graph defined in terms of its edge probabilities and show how to learn these probabilities from data as a reinforcement learning task. As confirmed by experiments, the proposed construction method can be used to refine the state-of-the-art similarity graphs, achieving higher recall rates for the same number of distance computations. Furthermore, we analyze the learned graphs and reveal the structural properties that are responsible for more efficient search.
Tasks graph construction
Published 2019-11-27
URL https://arxiv.org/abs/1911.12122v2
PDF https://arxiv.org/pdf/1911.12122v2.pdf
PWC https://paperswithcode.com/paper/towards-similarity-graphs-constructed-by-deep
Repo https://github.com/dbaranchuk/nns-meets-deep-rl
Framework pytorch
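
The search procedure such graphs are optimized for can be illustrated with a greedy best-first walk from an entry node toward the query. The sketch below is deliberately simple (no priority queue or beam, random toy graph) and only shows why edge choice drives the recall/distance-computation trade-off.

```python
import numpy as np


def greedy_graph_search(query, vectors, neighbors, entry=0):
    """Greedy NNS walk on a similarity graph: repeatedly move to a closer neighbor.
    vectors: (N, d) base vectors; neighbors: adjacency lists (the graph being learned)."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    n_computations = 1
    improved = True
    while improved:
        improved = False
        for cand in neighbors[current]:
            d = np.linalg.norm(vectors[cand] - query)
            n_computations += 1
            if d < current_dist:
                current, current_dist, improved = cand, d, True
    return current, current_dist, n_computations


# Toy graph: 200 random points, each node linked to 8 random others.
rng = np.random.default_rng(1)
vectors = rng.normal(size=(200, 16))
neighbors = [rng.choice(200, size=8, replace=False).tolist() for _ in range(200)]
query = rng.normal(size=16)
print(greedy_graph_search(query, vectors, neighbors))
```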

Generative Latent Flow

Title Generative Latent Flow
Authors Zhisheng Xiao, Qing Yan, Yali Amit
Abstract In this work, we propose the Generative Latent Flow (GLF), an algorithm for generative modeling of the data distribution. GLF uses an Auto-encoder (AE) to learn latent representations of the data, and a normalizing flow to map the distribution of the latent variables to that of simple i.i.d noise. In contrast to some other Auto-encoder based generative models, which use various regularizers that encourage the encoded latent distribution to match the prior distribution, our model explicitly constructs a mapping between these two distributions, leading to better density matching while avoiding over regularizing the latent variables. We compare our model with several related techniques, and show that it has many relative advantages including fast convergence, single stage training and minimal reconstruction trade-off. We also study the relationship between our model and its stochastic counterpart, and show that our model can be viewed as a vanishing noise limit of VAEs with flow prior. Quantitatively, under standardized evaluations, our method achieves state-of-the-art sample quality among AE based models on commonly used datasets, and is competitive with GANs’ benchmarks.
Tasks Image Generation
Published 2019-05-24
URL https://arxiv.org/abs/1905.10485v2
PDF https://arxiv.org/pdf/1905.10485v2.pdf
PWC https://paperswithcode.com/paper/generative-latent-flow-a-framework-for-non
Repo https://github.com/rakhimovv/GenerativeLatentFlow
Framework pytorch
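
Sampling from a GLF-style model amounts to drawing Gaussian noise, inverting the latent flow, and decoding. The sketch below uses a single toy affine-coupling layer and an untrained MLP decoder purely to show the data flow; the layer sizes and architecture are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """Single RealNVP-style coupling layer (illustrative stand-in for the latent flow)."""

    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, z):                 # latent z -> noise eps (training direction)
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, z2 * torch.exp(log_s) + t], dim=1)

    def inverse(self, eps):               # noise eps -> latent z (sampling direction)
        e1, e2 = eps[:, :self.half], eps[:, self.half:]
        log_s, t = self.net(e1).chunk(2, dim=1)
        return torch.cat([e1, (e2 - t) * torch.exp(-log_s)], dim=1)


# Sampling from a toy, untrained GLF-style model: noise -> inverse flow -> AE decoder.
latent_dim = 8
flow = AffineCoupling(latent_dim)
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 28 * 28))
eps = torch.randn(16, latent_dim)
images = decoder(flow.inverse(eps)).view(16, 1, 28, 28)
print(images.shape)  # torch.Size([16, 1, 28, 28])
```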