# Paper Group ANR 378

Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables. Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images. Sparse Sinkhorn Attention. Iterative Pre-Conditioning to Expedite the Gradient-Descent Method. Revisiting EXTRA for Smooth Distributed Optimization. CheXclusion: Fairness gaps in deep chest …

#### Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables

Title | Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables |

Authors | Rohit Bhattacharya, Razieh Nabi, Ilya Shpitser |

Abstract | The last decade witnessed the development of algorithms that completely solve the identifiability problem for causal effects in hidden variable causal models associated with directed acyclic graphs. However, much of this machinery remains underutilized in practice owing to the complexity of estimating identifying functionals yielded by these algorithms. In this paper, we provide simple graphical criteria and semiparametric estimators that bridge the gap between identification and estimation for causal effects involving a single treatment and a single outcome. First, we provide influence function based doubly robust estimators that cover a significant subset of hidden variable causal models where the effect is identifiable. We further characterize an important subset of this class for which we demonstrate how to derive the estimator with the lowest asymptotic variance, i.e., one that achieves the semiparametric efficiency bound. Finally, we provide semiparametric estimators for any single treatment causal effect parameter identified via the aforementioned algorithms. The resulting estimators resemble influence function based estimators that are sequentially reweighted, and exhibit a partial double robustness property, provided the parts of the likelihood corresponding to a set of weight models are correctly specified. Our methods are easy to implement and we demonstrate their utility through simulations. |

Tasks | |

Published | 2020-03-27 |

URL | https://arxiv.org/abs/2003.12659v1 |

https://arxiv.org/pdf/2003.12659v1.pdf | |

PWC | https://paperswithcode.com/paper/semiparametric-inference-for-causal-effects |

Repo | |

Framework | |

#### Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images

Title | Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images |

Authors | Sai Bi, Zexiang Xu, Kalyan Sunkavalli, David Kriegman, Ravi Ramamoorthi |

Abstract | We introduce a novel learning-based method to reconstruct the high-quality geometry and complex, spatially-varying BRDF of an arbitrary object from a sparse set of only six images captured by wide-baseline cameras under collocated point lighting. We first estimate per-view depth maps using a deep multi-view stereo network; these depth maps are used to coarsely align the different views. We propose a novel multi-view reflectance estimation network architecture that is trained to pool features from these coarsely aligned images and predict per-view spatially-varying diffuse albedo, surface normals, specular roughness and specular albedo. We do this by jointly optimizing the latent space of our multi-view reflectance network to minimize the photometric error between images rendered with our predictions and the input images. While previous state-of-the-art methods fail on such sparse acquisition setups, we demonstrate, via extensive experiments on synthetic and real data, that our method produces high-quality reconstructions that can be used to render photorealistic images. |

Tasks | |

Published | 2020-03-27 |

URL | https://arxiv.org/abs/2003.12642v1 |

https://arxiv.org/pdf/2003.12642v1.pdf | |

PWC | https://paperswithcode.com/paper/deep-3d-capture-geometry-and-reflectance-from |

Repo | |

Framework | |

#### Sparse Sinkhorn Attention

Title | Sparse Sinkhorn Attention |

Authors | Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan |

Abstract | We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with vanilla attention and consistently outperforms recently proposed efficient Transformer models such as Sparse Transformers. |

Tasks | Document Classification, Image Generation, Language Modelling, Natural Language Inference |

Published | 2020-02-26 |

URL | https://arxiv.org/abs/2002.11296v1 |

https://arxiv.org/pdf/2002.11296v1.pdf | |

PWC | https://paperswithcode.com/paper/sparse-sinkhorn-attention |

Repo | |

Framework | |

#### Iterative Pre-Conditioning to Expedite the Gradient-Descent Method

Title | Iterative Pre-Conditioning to Expedite the Gradient-Descent Method |

Authors | Kushal Chakrabarti, Nirupam Gupta, Nikhil Chopra |

Abstract | This paper considers the problem of multi-agent distributed optimization. In this problem, there are multiple agents in the system, and each agent only knows its local cost function. The objective for the agents is to collectively compute a common minimum of the aggregate of all their local cost functions. In principle, this problem is solvable using a distributed variant of the traditional gradient-descent method, which is an iterative method. However, the speed of convergence of the traditional gradient-descent method is highly influenced by the conditioning of the optimization problem being solved. Specifically, the method requires a large number of iterations to converge to a solution if the optimization problem is ill-conditioned. In this paper, we propose an iterative pre-conditioning approach that can significantly attenuate the influence of the problem’s conditioning on the convergence-speed of the gradient-descent method. The proposed pre-conditioning approach can be easily implemented in distributed systems and has minimal computation and communication overhead. For now, we only consider a specific distributed optimization problem wherein the individual local cost functions of the agents are quadratic. Besides the theoretical guarantees, the improved convergence speed of our approach is demonstrated through experiments on a real data-set. |

Tasks | Distributed Optimization |

Published | 2020-03-13 |

URL | https://arxiv.org/abs/2003.07180v2 |

https://arxiv.org/pdf/2003.07180v2.pdf | |

PWC | https://paperswithcode.com/paper/iterative-pre-conditioning-to-expedite-the |

Repo | |

Framework | |

#### Revisiting EXTRA for Smooth Distributed Optimization

Title | Revisiting EXTRA for Smooth Distributed Optimization |

Authors | Huan Li, Zhouchen Lin |

Abstract | EXTRA is a popular method for the dencentralized distributed optimization and has broad applications. This paper revisits the EXTRA. Firstly, we give a sharp complexity analysis for EXTRA with the improved $O\left(\left(\frac{L}{\mu}+\frac{1}{1-\sigma_2(W)}\right)\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$ communication and computation complexities for $\mu$-strongly convex and $L$-smooth problems, where $\sigma_2(W)$ is the second largest singular value of the weight matrix $W$. When the strong convexity is absent, we prove the $O\left(\left(\frac{L}{\epsilon}+\frac{1}{1-\sigma_2(W)}\right)\log\frac{1}{1-\sigma_2(W)}\right)$ complexities. Then, we use the Catalyst framework to accelerate EXTRA and obtain the $O\left(\sqrt{\frac{L}{\mu(1-\sigma_2(W))}}\log\frac{ L}{\mu(1-\sigma_2(W))}\log\frac{1}{\epsilon}\right)$ communication and computation complexities for strongly convex and smooth problems and the $O\left(\sqrt{\frac{L}{\epsilon(1-\sigma_2(W))}}\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$ complexities for non-strongly convex ones. Our communication complexities of the accelerated EXTRA are only worse by the factors of $\left(\log\frac{L}{\mu(1-\sigma_2(W))}\right)$ and $\left(\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$ from the lower complexity bounds for strongly convex and non-strongly convex problems, respectively. |

Tasks | Distributed Optimization |

Published | 2020-02-24 |

URL | https://arxiv.org/abs/2002.10110v1 |

https://arxiv.org/pdf/2002.10110v1.pdf | |

PWC | https://paperswithcode.com/paper/revisiting-extra-for-smooth-distributed |

Repo | |

Framework | |

#### CheXclusion: Fairness gaps in deep chest X-ray classifiers

Title | CheXclusion: Fairness gaps in deep chest X-ray classifiers |

Authors | Laleh Seyyed-Kalantari, Guanxiong Liu, Matthew McDermott, Marzyeh Ghassemi |

Abstract | Machine learning systems have received much attention recently for their ability to achieve expert-level performance on clinical tasks, particularly in medical imaging. Here, we examine the extent to which state-of-the-art deep learning classifiers trained to yield diagnostic labels from X-ray images are biased with respect to protected attributes. We train convolution neural networks to predict 14 diagnostic labels in three prominent public chest X-ray datasets: MIMIC-CXR, Chest-Xray8, and CheXpert. We then evaluate the TPR disparity - the difference in true positive rates (TPR) and - underdiagnosis rate - the false positive rate of a non-diagnosis - among different protected attributes such as patient sex, age, race, and insurance type. We demonstrate that TPR disparities exist in the state-of-the-art classifiers in all datasets, for all clinical tasks, and all subgroups. We find that TPR disparities are most commonly not significantly correlated with a subgroup’s proportional disease burden; further, we find that some subgroups and subsection of the population are chronically underdiagnosed. Such performance disparities have real consequences as models move from papers to products, and should be carefully audited prior to deployment. |

Tasks | |

Published | 2020-02-14 |

URL | https://arxiv.org/abs/2003.00827v1 |

https://arxiv.org/pdf/2003.00827v1.pdf | |

PWC | https://paperswithcode.com/paper/chexclusion-fairness-gaps-in-deep-chest-x-ray |

Repo | |

Framework | |

#### Distributed Mean Estimation with Optimal Error Bounds

Title | Distributed Mean Estimation with Optimal Error Bounds |

Authors | Dan Alistarh, Saleh Ashkboos, Peter Davies |

Abstract | Motivated by applications to distributed optimization and machine learning, we consider the distributed mean estimation problem, in which $n$ nodes are each assigned a multi-dimensional input vector, and must cooperate to estimate the mean of the input vectors, while minimizing communication. In this paper, we provide the first tight bounds for this problem, in terms of the trade-off between the amount of communication between nodes and the variance of the node estimates relative to the true value of the mean. |

Tasks | Distributed Optimization |

Published | 2020-02-21 |

URL | https://arxiv.org/abs/2002.09268v2 |

https://arxiv.org/pdf/2002.09268v2.pdf | |

PWC | https://paperswithcode.com/paper/distributed-mean-estimation-with-optimal |

Repo | |

Framework | |

#### Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets

Title | Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets |

Authors | Ilqar Ramazanli, Han Nguyen, Hai Pham, Sashank Reddi, Barnabas Poczos |

Abstract | We study distributed optimization algorithms for minimizing the average of \emph{heterogeneous} functions distributed across several machines with a focus on communication efficiency. In such settings, naively using the classical stochastic gradient descent (SGD) or its variants (e.g., SVRG) with a uniform sampling of machines typically yields poor performance. It often leads to the dependence of convergence rate on maximum Lipschitz constant of gradients across the devices. In this paper, we propose a novel \emph{adaptive} sampling of machines specially catered to these settings. Our method relies on an adaptive estimate of local Lipschitz constants base on the information of past gradients. We show that the new way improves the dependence of convergence rate from maximum Lipschitz constant to \emph{average} Lipschitz constant across machines, thereby, significantly accelerating the convergence. Our experiments demonstrate that our method indeed speeds up the convergence of the standard SVRG algorithm in heterogeneous environments. |

Tasks | Distributed Optimization |

Published | 2020-02-20 |

URL | https://arxiv.org/abs/2002.08528v1 |

https://arxiv.org/pdf/2002.08528v1.pdf | |

PWC | https://paperswithcode.com/paper/adaptive-sampling-distributed-stochastic |

Repo | |

Framework | |

#### Distributed Optimization over Block-Cyclic Data

Title | Distributed Optimization over Block-Cyclic Data |

Authors | Yucheng Ding, Chaoyue Niu, Yikai Yan, Zhenzhe Zheng, Fan Wu, Guihai Chen, Shaojie Tang, Rongfei Jia |

Abstract | We consider practical data characteristics underlying federated learning, where unbalanced and non-i.i.d. data from clients have a block-cyclic structure: each cycle contains several blocks, and each client’s training data follow block-specific and non-i.i.d. distributions. Such a data structure would introduce client and block biases during the collaborative training: the single global model would be biased towards the client or block specific data. To overcome the biases, we propose two new distributed optimization algorithms called multi-model parallel SGD (MM-PSGD) and multi-chain parallel SGD (MC-PSGD) with a convergence rate of $O(1/\sqrt{NT})$, achieving a linear speedup with respect to the total number of clients. In particular, MM-PSGD adopts the block-mixed training strategy, while MC-PSGD further adds the block-separate training strategy. Both algorithms create a specific predictor for each block by averaging and comparing the historical global models generated in this block from different cycles. We extensively evaluate our algorithms over the CIFAR-10 dataset. Evaluation results demonstrate that our algorithms significantly outperform the conventional federated averaging algorithm in terms of test accuracy, and also preserve robustness for the variance of critical parameters. |

Tasks | Distributed Optimization |

Published | 2020-02-18 |

URL | https://arxiv.org/abs/2002.07454v1 |

https://arxiv.org/pdf/2002.07454v1.pdf | |

PWC | https://paperswithcode.com/paper/distributed-optimization-over-block-cyclic |

Repo | |

Framework | |

#### Reproducible Bootstrap Aggregating

Title | Reproducible Bootstrap Aggregating |

Authors | Meimei Liu, David B. Dunson |

Abstract | Heterogeneity between training and testing data degrades reproducibility of a well-trained predictive algorithm. In modern applications, how to deploy a trained algorithm in a different domain is becoming an urgent question raised by many domain scientists. In this paper, we propose a reproducible bootstrap aggregating (Rbagging) method coupled with a new algorithm, the iterative nearest neighbor sampler (INNs), effectively drawing bootstrap samples from training data to mimic the distribution of the test data. Rbagging is a general ensemble framework that can be applied to most classifiers. We further propose Rbagging+ to effectively detect anomalous samples in the testing data. Our theoretical results show that the resamples based on Rbagging have the same distribution as the testing data. Moreover, under suitable assumptions, we further provide a general bound to control the test excess risk of the ensemble classifiers. The proposed method is compared with several other popular domain adaptation methods via extensive simulation studies and real applications including medical diagnosis and imaging classifications. |

Tasks | Domain Adaptation, Medical Diagnosis |

Published | 2020-01-12 |

URL | https://arxiv.org/abs/2001.03988v1 |

https://arxiv.org/pdf/2001.03988v1.pdf | |

PWC | https://paperswithcode.com/paper/reproducible-bootstrap-aggregating |

Repo | |

Framework | |

#### When can Wasserstein GANs minimize Wasserstein Distance?

Title | When can Wasserstein GANs minimize Wasserstein Distance? |

Authors | Yuanzhi Li, Zehao Dou |

Abstract | Generative Adversarial Networks (GANs) are widely used models to learn complex real-world distributions. In GANs, the training of the generator usually stops when the discriminator can no longer distinguish the generator’s output from the set of training examples. A central question of GANs is that when the training stops, whether the generated distribution is actually close to the target distribution. Previously, it was found that such closeness can only be achieved when there is a strict capacity trade-off between the generator and discriminator: Neither of the two models can be too powerful than the other. In this paper, we established one of the first theoretical results in explaining this trade-off. We show that when the generator is a class of two-layer neural networks, then it is necessary and sufficient for the discriminator to be a one-layer network with ReLU-type activation functions. With this trade-off, using polynomially many training examples, when the training stops, the generator will indeed output a distribution that is inverse-polynomially close to the target. Our result also sheds light on how GANs training can find such a generator efficiently. |

Tasks | |

Published | 2020-03-09 |

URL | https://arxiv.org/abs/2003.04033v1 |

https://arxiv.org/pdf/2003.04033v1.pdf | |

PWC | https://paperswithcode.com/paper/when-can-wasserstein-gans-minimize |

Repo | |

Framework | |

#### IMLI: An Incremental Framework for MaxSAT-Based Learning of Interpretable Classification Rules

Title | IMLI: An Incremental Framework for MaxSAT-Based Learning of Interpretable Classification Rules |

Authors | Bishwamittra Ghosh, Kuldeep S. Meel |

Abstract | The wide adoption of machine learning in the critical domains such as medical diagnosis, law, education had propelled the need for interpretable techniques due to the need for end users to understand the reasoning behind decisions due to learning systems. The computational intractability of interpretable learning led practitioners to design heuristic techniques, which fail to provide sound handles to tradeoff accuracy and interpretability. Motivated by the success of MaxSAT solvers over the past decade, recently MaxSAT-based approach, called MLIC, was proposed that seeks to reduce the problem of learning interpretable rules expressed in Conjunctive Normal Form (CNF) to a MaxSAT query. While MLIC was shown to achieve accuracy similar to that of other state of the art black-box classifiers while generating small interpretable CNF formulas, the runtime performance of MLIC is significantly lagging and renders approach unusable in practice. In this context, authors raised the question: Is it possible to achieve the best of both worlds, i.e., a sound framework for interpretable learning that can take advantage of MaxSAT solvers while scaling to real-world instances? In this paper, we take a step towards answering the above question in affirmation. We propose IMLI: an incremental approach to MaxSAT based framework that achieves scalable runtime performance via partition-based training methodology. Extensive experiments on benchmarks arising from UCI repository demonstrate that IMLI achieves up to three orders of magnitude runtime improvement without loss of accuracy and interpretability. |

Tasks | Medical Diagnosis |

Published | 2020-01-07 |

URL | https://arxiv.org/abs/2001.01891v1 |

https://arxiv.org/pdf/2001.01891v1.pdf | |

PWC | https://paperswithcode.com/paper/imli-an-incremental-framework-for-maxsat |

Repo | |

Framework | |

#### On the Sample Complexity and Optimization Landscape for Quadratic Feasibility Problems

Title | On the Sample Complexity and Optimization Landscape for Quadratic Feasibility Problems |

Authors | Parth Thaker, Gautam Dasarathy, Angelia Nedić |

Abstract | We consider the problem of recovering a complex vector $\mathbf{x}\in \mathbb{C}^n$ from $m$ quadratic measurements ${\langle A_i\mathbf{x}, \mathbf{x}\rangle}{i=1}^m$. This problem, known as quadratic feasibility, encompasses the well known phase retrieval problem and has applications in a wide range of important areas including power system state estimation and x-ray crystallography. In general, not only is the the quadratic feasibility problem NP-hard to solve, but it may in fact be unidentifiable. In this paper, we establish conditions under which this problem becomes {identifiable}, and further prove isometry properties in the case when the matrices ${A_i}{i=1}^m$ are Hermitian matrices sampled from a complex Gaussian distribution. Moreover, we explore a nonconvex {optimization} formulation of this problem, and establish salient features of the associated optimization landscape that enables gradient algorithms with an arbitrary initialization to converge to a \emph{globally optimal} point with a high probability. Our results also reveal sample complexity requirements for successfully identifying a feasible solution in these contexts. |

Tasks | |

Published | 2020-02-04 |

URL | https://arxiv.org/abs/2002.01066v1 |

https://arxiv.org/pdf/2002.01066v1.pdf | |

PWC | https://paperswithcode.com/paper/on-the-sample-complexity-and-optimization |

Repo | |

Framework | |

#### A Hierarchical Transformer for Unsupervised Parsing

Title | A Hierarchical Transformer for Unsupervised Parsing |

Authors | Ashok Thillaisundaram |

Abstract | The underlying structure of natural language is hierarchical; words combine into phrases, which in turn form clauses. An awareness of this hierarchical structure can aid machine learning models in performing many linguistic tasks. However, most such models just process text sequentially and there is no bias towards learning hierarchical structure encoded into their architecture. In this paper, we extend the recent transformer model (Vaswani et al., 2017) by enabling it to learn hierarchical representations. To achieve this, we adapt the ordering mechanism introduced in Shen et al., 2018, to the self-attention module of the transformer architecture. We train our new model on language modelling and then apply it to the task of unsupervised parsing. We achieve reasonable results on the freely available subset of the WSJ10 dataset with an F1-score of about 50%. |

Tasks | Language Modelling |

Published | 2020-03-30 |

URL | https://arxiv.org/abs/2003.13841v1 |

https://arxiv.org/pdf/2003.13841v1.pdf | |

PWC | https://paperswithcode.com/paper/a-hierarchical-transformer-for-unsupervised |

Repo | |

Framework | |

#### Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-based Object Tracking

Title | Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-based Object Tracking |

Authors | Haosheng Chen, Qiangqiang Wu, Yanjie Liang, Xinbo Gao, Hanzi Wang |

Abstract | Event cameras, which are asynchronous bio-inspired vision sensors, have shown great potential in a variety of situations, such as fast motion and low illumination scenes. However, most of the event-based object tracking methods are designed for scenarios with untextured objects and uncluttered backgrounds. There are few event-based object tracking methods that support bounding box-based object tracking. The main idea behind this work is to propose an asynchronous Event-based Tracking-by-Detection (ETD) method for generic bounding box-based object tracking. To achieve this goal, we present an Adaptive Time-Surface with Linear Time Decay (ATSLTD) event-to-frame conversion algorithm, which asynchronously and effectively warps the spatio-temporal information of asynchronous retinal events to a sequence of ATSLTD frames with clear object contours. We feed the sequence of ATSLTD frames to the proposed ETD method to perform accurate and efficient object tracking, which leverages the high temporal resolution property of event cameras. We compare the proposed ETD method with seven popular object tracking methods, that are based on conventional cameras or event cameras, and two variants of ETD. The experimental results show the superiority of the proposed ETD method in handling various challenging environments. |

Tasks | Object Tracking |

Published | 2020-02-13 |

URL | https://arxiv.org/abs/2002.05583v1 |

https://arxiv.org/pdf/2002.05583v1.pdf | |

PWC | https://paperswithcode.com/paper/asynchronous-tracking-by-detection-on |

Repo | |

Framework | |