April 2, 2020

3073 words 15 mins read

Paper Group ANR 378

Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables. Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images. Sparse Sinkhorn Attention. Iterative Pre-Conditioning to Expedite the Gradient-Descent Method. Revisiting EXTRA for Smooth Distributed Optimization. CheXclusion: Fairness gaps in deep chest …

Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables


Title	Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables
Authors	Rohit Bhattacharya, Razieh Nabi, Ilya Shpitser
Abstract	The last decade witnessed the development of algorithms that completely solve the identifiability problem for causal effects in hidden variable causal models associated with directed acyclic graphs. However, much of this machinery remains underutilized in practice owing to the complexity of estimating identifying functionals yielded by these algorithms. In this paper, we provide simple graphical criteria and semiparametric estimators that bridge the gap between identification and estimation for causal effects involving a single treatment and a single outcome. First, we provide influence function based doubly robust estimators that cover a significant subset of hidden variable causal models where the effect is identifiable. We further characterize an important subset of this class for which we demonstrate how to derive the estimator with the lowest asymptotic variance, i.e., one that achieves the semiparametric efficiency bound. Finally, we provide semiparametric estimators for any single treatment causal effect parameter identified via the aforementioned algorithms. The resulting estimators resemble influence function based estimators that are sequentially reweighted, and exhibit a partial double robustness property, provided the parts of the likelihood corresponding to a set of weight models are correctly specified. Our methods are easy to implement and we demonstrate their utility through simulations.
Tasks
Published	2020-03-27
URL	https://arxiv.org/abs/2003.12659v1
PDF	https://arxiv.org/pdf/2003.12659v1.pdf
PWC	https://paperswithcode.com/paper/semiparametric-inference-for-causal-effects
Repo
Framework

Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images


Title	Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images
Authors	Sai Bi, Zexiang Xu, Kalyan Sunkavalli, David Kriegman, Ravi Ramamoorthi
Abstract	We introduce a novel learning-based method to reconstruct the high-quality geometry and complex, spatially-varying BRDF of an arbitrary object from a sparse set of only six images captured by wide-baseline cameras under collocated point lighting. We first estimate per-view depth maps using a deep multi-view stereo network; these depth maps are used to coarsely align the different views. We propose a novel multi-view reflectance estimation network architecture that is trained to pool features from these coarsely aligned images and predict per-view spatially-varying diffuse albedo, surface normals, specular roughness and specular albedo. We do this by jointly optimizing the latent space of our multi-view reflectance network to minimize the photometric error between images rendered with our predictions and the input images. While previous state-of-the-art methods fail on such sparse acquisition setups, we demonstrate, via extensive experiments on synthetic and real data, that our method produces high-quality reconstructions that can be used to render photorealistic images.
Tasks
Published	2020-03-27
URL	https://arxiv.org/abs/2003.12642v1
PDF	https://arxiv.org/pdf/2003.12642v1.pdf
PWC	https://paperswithcode.com/paper/deep-3d-capture-geometry-and-reflectance-from
Repo
Framework

Sparse Sinkhorn Attention


Title	Sparse Sinkhorn Attention
Authors	Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan
Abstract	We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with vanilla attention and consistently outperforms recently proposed efficient Transformer models such as Sparse Transformers.
Tasks	Document Classification, Image Generation, Language Modelling, Natural Language Inference
Published	2020-02-26
URL	https://arxiv.org/abs/2002.11296v1
PDF	https://arxiv.org/pdf/2002.11296v1.pdf
PWC	https://paperswithcode.com/paper/sparse-sinkhorn-attention
Repo
Framework

Iterative Pre-Conditioning to Expedite the Gradient-Descent Method


Title	Iterative Pre-Conditioning to Expedite the Gradient-Descent Method
Authors	Kushal Chakrabarti, Nirupam Gupta, Nikhil Chopra
Abstract	This paper considers the problem of multi-agent distributed optimization. In this problem, there are multiple agents in the system, and each agent only knows its local cost function. The objective for the agents is to collectively compute a common minimum of the aggregate of all their local cost functions. In principle, this problem is solvable using a distributed variant of the traditional gradient-descent method, which is an iterative method. However, the speed of convergence of the traditional gradient-descent method is highly influenced by the conditioning of the optimization problem being solved. Specifically, the method requires a large number of iterations to converge to a solution if the optimization problem is ill-conditioned. In this paper, we propose an iterative pre-conditioning approach that can significantly attenuate the influence of the problem’s conditioning on the convergence-speed of the gradient-descent method. The proposed pre-conditioning approach can be easily implemented in distributed systems and has minimal computation and communication overhead. For now, we only consider a specific distributed optimization problem wherein the individual local cost functions of the agents are quadratic. Besides the theoretical guarantees, the improved convergence speed of our approach is demonstrated through experiments on a real data-set.
Tasks	Distributed Optimization
Published	2020-03-13
URL	https://arxiv.org/abs/2003.07180v2
PDF	https://arxiv.org/pdf/2003.07180v2.pdf
PWC	https://paperswithcode.com/paper/iterative-pre-conditioning-to-expedite-the
Repo
Framework

Revisiting EXTRA for Smooth Distributed Optimization


Title	Revisiting EXTRA for Smooth Distributed Optimization
Authors	Huan Li, Zhouchen Lin
Abstract	EXTRA is a popular method for the dencentralized distributed optimization and has broad applications. This paper revisits the EXTRA. Firstly, we give a sharp complexity analysis for EXTRA with the improved $O\left(\left(\frac{L}{\mu}+\frac{1}{1-\sigma_2(W)}\right)\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$ communication and computation complexities for $\mu$-strongly convex and $L$-smooth problems, where $\sigma_2(W)$ is the second largest singular value of the weight matrix $W$. When the strong convexity is absent, we prove the $O\left(\left(\frac{L}{\epsilon}+\frac{1}{1-\sigma_2(W)}\right)\log\frac{1}{1-\sigma_2(W)}\right)$ complexities. Then, we use the Catalyst framework to accelerate EXTRA and obtain the $O\left(\sqrt{\frac{L}{\mu(1-\sigma_2(W))}}\log\frac{ L}{\mu(1-\sigma_2(W))}\log\frac{1}{\epsilon}\right)$ communication and computation complexities for strongly convex and smooth problems and the $O\left(\sqrt{\frac{L}{\epsilon(1-\sigma_2(W))}}\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$ complexities for non-strongly convex ones. Our communication complexities of the accelerated EXTRA are only worse by the factors of $\left(\log\frac{L}{\mu(1-\sigma_2(W))}\right)$ and $\left(\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$ from the lower complexity bounds for strongly convex and non-strongly convex problems, respectively.
Tasks	Distributed Optimization
Published	2020-02-24
URL	https://arxiv.org/abs/2002.10110v1
PDF	https://arxiv.org/pdf/2002.10110v1.pdf
PWC	https://paperswithcode.com/paper/revisiting-extra-for-smooth-distributed
Repo
Framework

CheXclusion: Fairness gaps in deep chest X-ray classifiers


Title	CheXclusion: Fairness gaps in deep chest X-ray classifiers
Authors	Laleh Seyyed-Kalantari, Guanxiong Liu, Matthew McDermott, Marzyeh Ghassemi
Abstract	Machine learning systems have received much attention recently for their ability to achieve expert-level performance on clinical tasks, particularly in medical imaging. Here, we examine the extent to which state-of-the-art deep learning classifiers trained to yield diagnostic labels from X-ray images are biased with respect to protected attributes. We train convolution neural networks to predict 14 diagnostic labels in three prominent public chest X-ray datasets: MIMIC-CXR, Chest-Xray8, and CheXpert. We then evaluate the TPR disparity - the difference in true positive rates (TPR) and - underdiagnosis rate - the false positive rate of a non-diagnosis - among different protected attributes such as patient sex, age, race, and insurance type. We demonstrate that TPR disparities exist in the state-of-the-art classifiers in all datasets, for all clinical tasks, and all subgroups. We find that TPR disparities are most commonly not significantly correlated with a subgroup’s proportional disease burden; further, we find that some subgroups and subsection of the population are chronically underdiagnosed. Such performance disparities have real consequences as models move from papers to products, and should be carefully audited prior to deployment.
Tasks
Published	2020-02-14
URL	https://arxiv.org/abs/2003.00827v1
PDF	https://arxiv.org/pdf/2003.00827v1.pdf
PWC	https://paperswithcode.com/paper/chexclusion-fairness-gaps-in-deep-chest-x-ray
Repo
Framework

Distributed Mean Estimation with Optimal Error Bounds


Title	Distributed Mean Estimation with Optimal Error Bounds
Authors	Dan Alistarh, Saleh Ashkboos, Peter Davies
Abstract	Motivated by applications to distributed optimization and machine learning, we consider the distributed mean estimation problem, in which $n$ nodes are each assigned a multi-dimensional input vector, and must cooperate to estimate the mean of the input vectors, while minimizing communication. In this paper, we provide the first tight bounds for this problem, in terms of the trade-off between the amount of communication between nodes and the variance of the node estimates relative to the true value of the mean.
Tasks	Distributed Optimization
Published	2020-02-21
URL	https://arxiv.org/abs/2002.09268v2
PDF	https://arxiv.org/pdf/2002.09268v2.pdf
PWC	https://paperswithcode.com/paper/distributed-mean-estimation-with-optimal
Repo
Framework

Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets


Title	Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets
Authors	Ilqar Ramazanli, Han Nguyen, Hai Pham, Sashank Reddi, Barnabas Poczos
Abstract	We study distributed optimization algorithms for minimizing the average of \emph{heterogeneous} functions distributed across several machines with a focus on communication efficiency. In such settings, naively using the classical stochastic gradient descent (SGD) or its variants (e.g., SVRG) with a uniform sampling of machines typically yields poor performance. It often leads to the dependence of convergence rate on maximum Lipschitz constant of gradients across the devices. In this paper, we propose a novel \emph{adaptive} sampling of machines specially catered to these settings. Our method relies on an adaptive estimate of local Lipschitz constants base on the information of past gradients. We show that the new way improves the dependence of convergence rate from maximum Lipschitz constant to \emph{average} Lipschitz constant across machines, thereby, significantly accelerating the convergence. Our experiments demonstrate that our method indeed speeds up the convergence of the standard SVRG algorithm in heterogeneous environments.
Tasks	Distributed Optimization
Published	2020-02-20
URL	https://arxiv.org/abs/2002.08528v1
PDF	https://arxiv.org/pdf/2002.08528v1.pdf
PWC	https://paperswithcode.com/paper/adaptive-sampling-distributed-stochastic
Repo
Framework

Distributed Optimization over Block-Cyclic Data


Title	Distributed Optimization over Block-Cyclic Data
Authors	Yucheng Ding, Chaoyue Niu, Yikai Yan, Zhenzhe Zheng, Fan Wu, Guihai Chen, Shaojie Tang, Rongfei Jia
Abstract	We consider practical data characteristics underlying federated learning, where unbalanced and non-i.i.d. data from clients have a block-cyclic structure: each cycle contains several blocks, and each client’s training data follow block-specific and non-i.i.d. distributions. Such a data structure would introduce client and block biases during the collaborative training: the single global model would be biased towards the client or block specific data. To overcome the biases, we propose two new distributed optimization algorithms called multi-model parallel SGD (MM-PSGD) and multi-chain parallel SGD (MC-PSGD) with a convergence rate of $O(1/\sqrt{NT})$, achieving a linear speedup with respect to the total number of clients. In particular, MM-PSGD adopts the block-mixed training strategy, while MC-PSGD further adds the block-separate training strategy. Both algorithms create a specific predictor for each block by averaging and comparing the historical global models generated in this block from different cycles. We extensively evaluate our algorithms over the CIFAR-10 dataset. Evaluation results demonstrate that our algorithms significantly outperform the conventional federated averaging algorithm in terms of test accuracy, and also preserve robustness for the variance of critical parameters.
Tasks	Distributed Optimization
Published	2020-02-18
URL	https://arxiv.org/abs/2002.07454v1
PDF	https://arxiv.org/pdf/2002.07454v1.pdf
PWC	https://paperswithcode.com/paper/distributed-optimization-over-block-cyclic
Repo
Framework

Reproducible Bootstrap Aggregating


Title	Reproducible Bootstrap Aggregating
Authors	Meimei Liu, David B. Dunson
Abstract	Heterogeneity between training and testing data degrades reproducibility of a well-trained predictive algorithm. In modern applications, how to deploy a trained algorithm in a different domain is becoming an urgent question raised by many domain scientists. In this paper, we propose a reproducible bootstrap aggregating (Rbagging) method coupled with a new algorithm, the iterative nearest neighbor sampler (INNs), effectively drawing bootstrap samples from training data to mimic the distribution of the test data. Rbagging is a general ensemble framework that can be applied to most classifiers. We further propose Rbagging+ to effectively detect anomalous samples in the testing data. Our theoretical results show that the resamples based on Rbagging have the same distribution as the testing data. Moreover, under suitable assumptions, we further provide a general bound to control the test excess risk of the ensemble classifiers. The proposed method is compared with several other popular domain adaptation methods via extensive simulation studies and real applications including medical diagnosis and imaging classifications.
Tasks	Domain Adaptation, Medical Diagnosis
Published	2020-01-12
URL	https://arxiv.org/abs/2001.03988v1
PDF	https://arxiv.org/pdf/2001.03988v1.pdf
PWC	https://paperswithcode.com/paper/reproducible-bootstrap-aggregating
Repo
Framework

When can Wasserstein GANs minimize Wasserstein Distance?


Title	When can Wasserstein GANs minimize Wasserstein Distance?
Authors	Yuanzhi Li, Zehao Dou
Abstract	Generative Adversarial Networks (GANs) are widely used models to learn complex real-world distributions. In GANs, the training of the generator usually stops when the discriminator can no longer distinguish the generator’s output from the set of training examples. A central question of GANs is that when the training stops, whether the generated distribution is actually close to the target distribution. Previously, it was found that such closeness can only be achieved when there is a strict capacity trade-off between the generator and discriminator: Neither of the two models can be too powerful than the other. In this paper, we established one of the first theoretical results in explaining this trade-off. We show that when the generator is a class of two-layer neural networks, then it is necessary and sufficient for the discriminator to be a one-layer network with ReLU-type activation functions. With this trade-off, using polynomially many training examples, when the training stops, the generator will indeed output a distribution that is inverse-polynomially close to the target. Our result also sheds light on how GANs training can find such a generator efficiently.
Tasks
Published	2020-03-09
URL	https://arxiv.org/abs/2003.04033v1
PDF	https://arxiv.org/pdf/2003.04033v1.pdf
PWC	https://paperswithcode.com/paper/when-can-wasserstein-gans-minimize
Repo
Framework

IMLI: An Incremental Framework for MaxSAT-Based Learning of Interpretable Classification Rules


Title	IMLI: An Incremental Framework for MaxSAT-Based Learning of Interpretable Classification Rules
Authors	Bishwamittra Ghosh, Kuldeep S. Meel
Abstract	The wide adoption of machine learning in the critical domains such as medical diagnosis, law, education had propelled the need for interpretable techniques due to the need for end users to understand the reasoning behind decisions due to learning systems. The computational intractability of interpretable learning led practitioners to design heuristic techniques, which fail to provide sound handles to tradeoff accuracy and interpretability. Motivated by the success of MaxSAT solvers over the past decade, recently MaxSAT-based approach, called MLIC, was proposed that seeks to reduce the problem of learning interpretable rules expressed in Conjunctive Normal Form (CNF) to a MaxSAT query. While MLIC was shown to achieve accuracy similar to that of other state of the art black-box classifiers while generating small interpretable CNF formulas, the runtime performance of MLIC is significantly lagging and renders approach unusable in practice. In this context, authors raised the question: Is it possible to achieve the best of both worlds, i.e., a sound framework for interpretable learning that can take advantage of MaxSAT solvers while scaling to real-world instances? In this paper, we take a step towards answering the above question in affirmation. We propose IMLI: an incremental approach to MaxSAT based framework that achieves scalable runtime performance via partition-based training methodology. Extensive experiments on benchmarks arising from UCI repository demonstrate that IMLI achieves up to three orders of magnitude runtime improvement without loss of accuracy and interpretability.
Tasks	Medical Diagnosis
Published	2020-01-07
URL	https://arxiv.org/abs/2001.01891v1
PDF	https://arxiv.org/pdf/2001.01891v1.pdf
PWC	https://paperswithcode.com/paper/imli-an-incremental-framework-for-maxsat
Repo
Framework

On the Sample Complexity and Optimization Landscape for Quadratic Feasibility Problems


Title	On the Sample Complexity and Optimization Landscape for Quadratic Feasibility Problems
Authors	Parth Thaker, Gautam Dasarathy, Angelia Nedić
Abstract	We consider the problem of recovering a complex vector $\mathbf{x}\in \mathbb{C}^n$ from $m$ quadratic measurements ${\langle A_i\mathbf{x}, \mathbf{x}\rangle}{i=1}^m$. This problem, known as quadratic feasibility, encompasses the well known phase retrieval problem and has applications in a wide range of important areas including power system state estimation and x-ray crystallography. In general, not only is the the quadratic feasibility problem NP-hard to solve, but it may in fact be unidentifiable. In this paper, we establish conditions under which this problem becomes {identifiable}, and further prove isometry properties in the case when the matrices ${A_i}{i=1}^m$ are Hermitian matrices sampled from a complex Gaussian distribution. Moreover, we explore a nonconvex {optimization} formulation of this problem, and establish salient features of the associated optimization landscape that enables gradient algorithms with an arbitrary initialization to converge to a \emph{globally optimal} point with a high probability. Our results also reveal sample complexity requirements for successfully identifying a feasible solution in these contexts.
Tasks
Published	2020-02-04
URL	https://arxiv.org/abs/2002.01066v1
PDF	https://arxiv.org/pdf/2002.01066v1.pdf
PWC	https://paperswithcode.com/paper/on-the-sample-complexity-and-optimization
Repo
Framework

A Hierarchical Transformer for Unsupervised Parsing


Title	A Hierarchical Transformer for Unsupervised Parsing
Authors	Ashok Thillaisundaram
Abstract	The underlying structure of natural language is hierarchical; words combine into phrases, which in turn form clauses. An awareness of this hierarchical structure can aid machine learning models in performing many linguistic tasks. However, most such models just process text sequentially and there is no bias towards learning hierarchical structure encoded into their architecture. In this paper, we extend the recent transformer model (Vaswani et al., 2017) by enabling it to learn hierarchical representations. To achieve this, we adapt the ordering mechanism introduced in Shen et al., 2018, to the self-attention module of the transformer architecture. We train our new model on language modelling and then apply it to the task of unsupervised parsing. We achieve reasonable results on the freely available subset of the WSJ10 dataset with an F1-score of about 50%.
Tasks	Language Modelling
Published	2020-03-30
URL	https://arxiv.org/abs/2003.13841v1
PDF	https://arxiv.org/pdf/2003.13841v1.pdf
PWC	https://paperswithcode.com/paper/a-hierarchical-transformer-for-unsupervised
Repo
Framework

Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-based Object Tracking


Title	Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-based Object Tracking
Authors	Haosheng Chen, Qiangqiang Wu, Yanjie Liang, Xinbo Gao, Hanzi Wang
Abstract	Event cameras, which are asynchronous bio-inspired vision sensors, have shown great potential in a variety of situations, such as fast motion and low illumination scenes. However, most of the event-based object tracking methods are designed for scenarios with untextured objects and uncluttered backgrounds. There are few event-based object tracking methods that support bounding box-based object tracking. The main idea behind this work is to propose an asynchronous Event-based Tracking-by-Detection (ETD) method for generic bounding box-based object tracking. To achieve this goal, we present an Adaptive Time-Surface with Linear Time Decay (ATSLTD) event-to-frame conversion algorithm, which asynchronously and effectively warps the spatio-temporal information of asynchronous retinal events to a sequence of ATSLTD frames with clear object contours. We feed the sequence of ATSLTD frames to the proposed ETD method to perform accurate and efficient object tracking, which leverages the high temporal resolution property of event cameras. We compare the proposed ETD method with seven popular object tracking methods, that are based on conventional cameras or event cameras, and two variants of ETD. The experimental results show the superiority of the proposed ETD method in handling various challenging environments.
Tasks	Object Tracking
Published	2020-02-13
URL	https://arxiv.org/abs/2002.05583v1
PDF	https://arxiv.org/pdf/2002.05583v1.pdf
PWC	https://paperswithcode.com/paper/asynchronous-tracking-by-detection-on
Repo
Framework