Sunday, July 28
16:00 – 17:00 | Jajodia Auditorium (ENGR 1101) DS01 — Pre-Conference Data Science Meeting
Organized by Jiayang Sun | Chaired by Anand Vidyashankar
Bridging the Gap in Data Science Education: Veridical Data Science
The rapid advancement of AI relies heavily on the foundation of data science, yet data science education significantly lags the demand for it in practice. The upcoming book ‘Veridical Data Science: The Practice of Responsible Data Analysis and Decision Making’ (Yu and Barter, MIT Press, 2024; free online at www.vdsbook.com) tackles this gap by promoting Predictability, Computability, and Stability (PCS) as core principles for trustworthy data insights. It thoroughly integrates these principles into the Data Science Life Cycle (DSLC), from problem formulation through data cleansing to result communication, fostering a new standard for responsible data analysis. This talk explores the book’s motivations, comparing its approach with traditional ones. Using examples from chapters on data cleansing and clustering analysis, I will demonstrate PCS’s practical applications and describe four types of homework assignments—True/False, conceptual, mathematical, and coding—to solidify learners’ grasp. I will end the talk with a prostate cancer research case study, illustrating PCS’s effectiveness in real-world data analysis.
Keywords: Data Science Education, Data Science Life Cycle, Principles For Trustworthy Data Insights
Hunt for Explainability: Data Innovations and Reproducible Learning Strategy
AI tools and models built on machine learning, statistics, data science, and various other scientific advances are now accessible at our fingertips via LLMs. These tools, however, can be harmful if built on biased or poor data. If these models are unexplainable, they can be ineffective or useless for determining mitigation strategies. This talk advocates for quality data and will discuss 1) data innovations, 2) our triathlon learning strategy for building explainable models, and 3) their implications and applications via several data examples, including fertility and cardiovascular data.
Keywords: Artificial Intelligence, Explainability, Large Language Models, Data Quality
17:00 – 18:30 | Atrium
Registration and Welcome Reception
Monday, July 29
7:45 – 9:00 | Atrium & Jajodia Auditorium
Breakfast and Registration
9:00 – 10:30 | ENGR 1107 S01 — Statistical Data Science With Outliers And Missing Values
Organized and chaired by Stefan Van Aelst
RobPy: A Python Package for Robust Statistical Analysis
In the evolving landscape of statistical analysis, there is a need for robust methodologies that can handle outliers effectively. Outlying observations may be errors, or they could have been recorded under exceptional circumstances, or belong to another population. We introduce “RobPy,” a comprehensive Python package designed to equip researchers and data scientists with robust statistical methodologies. This package provides a set of tools for robust data analysis, including robust procedures for estimating location and scatter, linear regression and principal component analysis. Through RobPy, we aim to make robust statistical techniques more accessible to researchers and practitioners, facilitating more accurate and reliable data analysis across various fields. The presentation will detail the implementation of these methodologies, demonstrate their application on real-world datasets, and discuss potential future expansions to further support the robust statistics community.
Keywords: Robust Statistical Analysis, Python Package, Linear Regression, Principal Component Analysis
Missing Value Imputation of Wireless Sensor Data for Environmental Monitoring
Over the past few years, the scale of sensor networks has greatly expanded. This generates extended spatiotemporal datasets, which form a crucial information resource in numerous fields, ranging from sports and healthcare to environmental science and surveillance. Unfortunately, these datasets often contain missing values due to systematic or inadvertent sensor misoperation. This incompleteness hampers the subsequent data analysis, yet addressing these missing observations forms a challenging problem. This is especially the case when both the temporal correlation of timestamps within a single sensor and the spatial correlation between sensors are important. Here, we apply and evaluate 12 imputation methods to complete the missing values in a dataset originating from large-scale environmental monitoring. As part of a large citizen science project, IoT-based microclimate sensors were deployed for six months in 4400 gardens across the region of Flanders, generating 15-min recordings of temperature and soil moisture. Methods based on spatial recovery as well as time-based imputation were evaluated, including Spline Interpolation, MissForest, MICE, MCMC, M-RNN, BRITS, and others. The performance of these imputation methods was evaluated for different proportions of missing data (ranging from 10% to 50%), as well as a realistic missing value scenario. Techniques leveraging the spatial features of the data tend to outperform the time-based methods, with matrix completion techniques providing the best performance. Our results therefore provide a tool to maximize the benefit from costly, large-scale environmental monitoring efforts as well as generate insights for the imputation of large-scale missing values in spatiotemporal data.
Keywords: Missing Data, Imputation, Wireless Sensor Networks, Environmental Monitoring, Time Series
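To make the contrast between time-based and spatially informed imputation concrete, here is a minimal Python sketch on a toy sensors-by-timestamps matrix. The data, missingness mechanism, and method choices (linear interpolation, KNN across sensors) are illustrative only and are not the study’s pipeline or its 12 evaluated methods.

```python
# Minimal sketch (illustrative, not the study's pipeline): contrast a purely
# temporal interpolation with a spatially informed imputation on a toy
# sensors-by-timestamps temperature matrix with missing entries.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
n_sensors, n_times = 50, 200
t = np.linspace(0, 6 * np.pi, n_times)
signal = 15 + 5 * np.sin(t)                       # shared smooth cycle
X = signal + rng.normal(0, 0.5, size=(n_sensors, n_times))
mask = rng.random(X.shape) < 0.3                  # 30% missing, MCAR for illustration
X_missing = np.where(mask, np.nan, X)

# Time-based imputation: interpolate within each sensor's own series.
temporal = (pd.DataFrame(X_missing)
              .interpolate(axis=1, limit_direction="both")
              .to_numpy())

# Space-based imputation: rows are sensors, so each gap is filled by
# borrowing from the most similar sensors at the same timestamps.
spatial = KNNImputer(n_neighbors=5).fit_transform(X_missing)

for name, Xhat in [("temporal", temporal), ("spatial KNN", spatial)]:
    rmse = np.sqrt(np.mean((Xhat[mask] - X[mask]) ** 2))
    print(f"{name:12s} RMSE on held-out cells: {rmse:.3f}")
```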
Diagnostic functions to assess the quality of cell type annotations in single-cell RNA-seq data
Annotation transfer from a reference dataset for the cell type annotation of a new query single-cell RNA-sequencing (scRNA-seq) experiment has become an integral component of the typical analysis workflow. The approach provides a fast, automated, and reproducible alternative to the manual annotation of cell clusters based on marker gene expression. However, dataset imbalance and undiagnosed incompatibilities between query and reference dataset can lead to erroneous annotation and distort downstream applications. We present scDiagnostics, an R/Bioconductor package for the systematic evaluation of cell type assignments in scRNA-seq data. scDiagnostics offers a suite of diagnostic functions to assess whether both (query and reference) datasets are aligned, ensuring that annotations can be transferred reliably. scDiagnostics also provides functionality to assess annotation ambiguity, cluster heterogeneity, and marker gene alignment. The implemented functionality helps researchers to determine how accurately cells from a new scRNA-seq experiment can be assigned to known cell types. Some potential extensions using robust statistics will be discussed.
Keywords: Single-Cell Sequencing, Transcriptomics, Cell Type Annotation, Reference-Based Annotation Transfer, R/Bioconductor Package
9:00 – 10:30 | ENGR 1108 S02 — Big Data, Optimal Design, And Gaussian Process Models
Organized and chaired by John Stufken
Gradient-enhanced Gaussian Process Models: What, Why and How
Simulations with gradient information are increasingly used in engineering and science. From a data pooling perspective, it is appealing to use the gradient-enhanced Gaussian process model for emulating such simulations. However, it is computationally challenging to fit such an emulator for large data sets because its covariance matrix has severe singularity issues. I will present a theory to show why this problem happens. I will also propose a random Fourier feature method to mitigate the problem. The key idea of the proposed method is to employ random Fourier features to obtain an easily computable, low-dimensional feature representation for shift-invariant kernels involving gradients. The effectiveness of the proposed method is illustrated by several examples.
Keywords: Computer Experiments; Machine Learning; Design Of Experiments; Uncertainty Quantification
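For readers unfamiliar with random Fourier features, the sketch below shows the standard approximation of a squared-exponential kernel; the gradient-enhanced extension described in the talk augments this feature map with its derivatives, which is not reproduced here. All sizes and lengthscales are illustrative.

```python
# Minimal sketch of random Fourier features for an RBF kernel (Bochner's
# theorem: k(x, y) = E_w[cos(w'(x - y))] with w ~ N(0, I / lengthscale^2)).
import numpy as np

rng = np.random.default_rng(1)
n, d, D = 200, 3, 500            # samples, input dimension, random features
lengthscale = 0.8
X = rng.uniform(-1, 1, size=(n, d))

W = rng.normal(0, 1 / lengthscale, size=(d, D))
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)   # feature map, shape (n, D)

K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
                 / lengthscale**2)
K_approx = Z @ Z.T                          # low-rank kernel approximation
print("max abs error:", np.abs(K_exact - K_approx).max())
```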
Sequential Design for Parameter Optimization via Curing Process Simulation
Thermoset based fiber reinforced composite laminates are popularly processed by the cure process to produce structural parts. These structural parts have found extensive applications in aerospace and automotive industries. A typical input to the cure process is a temperature and pressure cycle also commonly known as a cure cycle. This talk will discuss our work of solving the cure optimization problem of laminated composites through a statistical approach. The approach consisted of using constrained Bayesian Optimization (cBO) along with a Gaussian Process model as a surrogate to sequentially design experiments for curing process simulations. Through the implementation of our proposed method, we efficiently achieved optimal outcomes at a reduced computational cost, demonstrating the approach’s effectiveness.
Keywords: Computer Experiments, Bayesian Optimization, Sequential Design
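As background for the sequential design loop described above, here is a minimal sketch of the expected-improvement acquisition used in GP-based Bayesian optimization; the constrained version (cBO) in the talk additionally weights this quantity by the probability of constraint feasibility. The posterior means and standard deviations below are assumed toy values, not outputs of the authors’ surrogate.

```python
# Minimal sketch of expected improvement (EI) for minimization, given a GP
# posterior mean/sd at candidate points; illustrative values only.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f, xi=0.01):
    """EI for minimization at candidate points with posterior mean mu, sd sigma."""
    sigma = np.maximum(sigma, 1e-12)
    improve = best_f - mu - xi
    z = improve / sigma
    return improve * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.8, 0.5, 0.7])     # assumed GP posterior means
sd = np.array([0.05, 0.20, 0.10])  # assumed GP posterior standard deviations
print("next simulation to run:", np.argmax(expected_improvement(mu, sd, best_f=0.6)))
```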
Scalable Methodologies for Big Data Analysis: Integrating Flexible Statistical Models and Optimal Designs
The formidable challenge presented by the analysis of big data stems not just from its sheer volume, but also from the diversity, complexity, and the rapid pace at which it needs to be processed or delivered. A compelling approach is to analyze a sample of the data, while still preserving the comprehensive information contained in the full dataset. Although there is a considerable amount of research on this subject, the majority of it relies on classical statistical models, such as linear models and generalized linear models. These models serve as powerful tools when the relationships between input and output variables are uniform. However, they may not be suitable when applied to complex datasets, as they tend to yield suboptimal results in the face of inherent complexity or heterogeneity. In this presentation, we will introduce a broadly applicable and scalable methodology designed to overcome these challenges. This is achieved through an in-depth exploration and integration of cutting-edge statistical methods, drawing particularly from neural network models and, more specifically, Mixture-of-Experts (ME) models, along with optimal designs.
Keywords: Clusters; Information Matrix; Latent Indicator; Regression
Bootstrap Aggregated Designs for Generalized Linear Models
Many experiments require modeling a non-Normal response. In particular, count responses and binary responses are quite common. The relationship between predictors and the responses is typically modeled via a Generalized Linear Model (GLM). Finding D-optimal designs for GLMs, which reduce the generalized variance of the model coefficients, is desired. A common approach to finding optimal designs for GLMs is to use a local design, but local designs are vulnerable to parameter misspecification. To create more robust designs, we use the idea of bootstrap aggregation, or “bagging,” from ensemble machine learning methods. This is done by applying a bagging procedure to pilot data, where the results of many locally optimal designs are aggregated to produce an approximate design that reflects the uncertainty in the model coefficients. Several versions of the bagging procedure are implemented, and they are compared with locally optimal designs and maximin designs. Results show that the proposed bagging procedure has relatively high efficiency and is robust to changes in the underlying model parameters.
Keywords: Design Of Experiments, Bootstrap Aggregation, Generalized Linear Models
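The following Python sketch conveys the bagging idea for a logistic GLM with a single covariate: bootstrap the pilot data, compute a locally D-optimal two-point design for each bootstrap fit by coarse grid search, and average the resulting design weights. The pilot data, grid, and two-point restriction are illustrative simplifications, not the procedures compared in the talk.

```python
# Minimal sketch of bootstrap-aggregated locally D-optimal designs for a
# logistic regression model; all data and settings are illustrative.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def info_matrix(points, weights, beta):
    """Fisher information of a weighted design for logistic regression."""
    M = np.zeros((2, 2))
    for x, w in zip(points, weights):
        f = np.array([1.0, x])
        p = 1.0 / (1.0 + np.exp(-f @ beta))
        M += w * p * (1 - p) * np.outer(f, f)
    return M

def local_d_optimal(beta, grid):
    """Best equal-weight two-point design on a candidate grid (brute force)."""
    best, best_val = None, -np.inf
    for pair in combinations(grid, 2):
        val = np.linalg.slogdet(info_matrix(pair, [0.5, 0.5], beta))[1]
        if val > best_val:
            best, best_val = pair, val
    return best

# Toy pilot experiment.
x_pilot = rng.uniform(-3, 3, 80)
y_pilot = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x_pilot))))

grid = np.linspace(-3, 3, 25)
agg = np.zeros(len(grid))
for _ in range(200):                         # bootstrap replicates
    idx = rng.integers(0, len(x_pilot), len(x_pilot))
    fit = LogisticRegression(C=1e6).fit(x_pilot[idx, None], y_pilot[idx])
    beta_hat = np.array([fit.intercept_[0], fit.coef_[0, 0]])
    for pt in local_d_optimal(beta_hat, grid):
        agg[np.argmin(np.abs(grid - pt))] += 0.5
agg /= agg.sum()                              # bagged approximate design weights
print(np.round(agg, 3))
```

Averaging over bootstrap fits spreads design weight over several support points, which is what makes the aggregated design less sensitive to misspecification of the model coefficients than a single locally optimal design.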
9:00 – 10:30 | ENGR 1110 S03 — New Advancements On Spatial And Spatio-Temporal Modeling
Organized and chaired by Abolfazl Safikhani
Robust sparse PCA for spatial data
Our goal is to introduce a robust PCA (Principal Component Analysis) method which takes into account spatial dependence among the observations. Specifically, we want to use robust spatial covariance estimators like the ssMRCD estimator [Puchhammer & Filzmoser, 2023] as the basis for PCA. The ssMRCD estimator provides N robust covariance matrix estimates, one for each part of a spatial partition, using additional smoothing across space to properly address the spatial context. For the set of covariance matrices Σ_i, i = 1, ..., N, we thus also get a set of loadings representing the same variables, leading to N p-dimensional loading vectors. To simplify the interpretation and visualization of the loadings, we resort to sparsity in the most common way, i.e., via the L1 norm. Moreover, group-wise sparsity is also included, where each group consists of the loadings of one variable across the spatial units. We start by defining the objective function for the loadings, with explained variance penalized by entry-wise and group-wise sparsity. For the k-th PC we additionally need orthogonality constraints.
To solve this non-convex, non-separable optimization problem, we develop an algorithm based on the Alternating Direction Method of Multipliers [Boyd et al., 2011]. We further illustrate the usefulness of the method in data examples and simulations and provide advanced visualization techniques for spatial PCA.
References
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122.
Puchhammer, P., & Filzmoser, P. (2023). Spatially smoothed robust covariance estimation for local outlier detection. Journal of Computational and Graphical Statistics, 1–13.
Keywords: Spatially Smoothed MRCD Estimator, Robustness
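For orientation, a plausible generic form of the penalized objective sketched in the abstract is written below; the notation is assumed here, and the exact formulation and scaling are those of the authors and Puchhammer & Filzmoser (2023), not reproduced in detail.

```latex
% Plausible generic form (notation assumed): for spatial unit i with ssMRCD
% scatter estimate \hat{\Sigma}_i, the k-th loading vectors v_{k,i} trade off
% explained variance against entry-wise (L1) and group-wise penalties, where
% a group collects the loadings of variable j across all N spatial units.
\max_{v_{k,1},\dots,v_{k,N}}\;
  \sum_{i=1}^{N} v_{k,i}^{\top}\hat{\Sigma}_i\, v_{k,i}
  \;-\;\lambda_1 \sum_{i=1}^{N}\|v_{k,i}\|_1
  \;-\;\lambda_2 \sum_{j=1}^{p}\Big\|\big(v_{k,1,j},\dots,v_{k,N,j}\big)\Big\|_2
\quad\text{s.t.}\quad \|v_{k,i}\|_2 = 1,\qquad v_{k,i}^{\top} v_{l,i} = 0\ \text{for } l<k.
```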
Bayesian multi-species N-mixture models for large scale spatial data in community ecology
Community ecologists seek to model the local abundance of multiple animal species while taking into account that observed counts only represent a portion of the underlying population size. Analogously, modeling spatial correlations in species’ latent abundances is important when attempting to explain how species compete for scarce resources. We develop a Bayesian multi-species N-mixture model with spatial latent effects to address both issues. On one hand, our model accounts for imperfect detection by modeling local abundance via a Poisson log-linear model. Conditional on the local abundance, the observed counts have a binomial distribution. On the other hand, we let a directed acyclic graph restrict spatial dependence in order to speed up computations, and use recently developed gradient-based Markov-chain Monte Carlo methods to sample a posteriori in the multivariate non-Gaussian data scenarios in which we are interested.
Keywords: Bayesian, Spatial, Multivariate, Ecology, Mcmc
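The hierarchical structure described in the abstract can be summarized as follows, with notation assumed here: site-level covariates x_i, a spatial random effect w(s_i), and detection probabilities p_ij.

```latex
% Sketch of the model structure described above (notation assumed):
N_i \mid \lambda_i \sim \operatorname{Poisson}(\lambda_i), \qquad
\log \lambda_i = x_i^{\top}\beta + w(s_i), \qquad
y_{ij} \mid N_i \sim \operatorname{Binomial}(N_i,\, p_{ij}),
```

where the spatial effect w(·) is assigned a prior whose dependence is restricted by a directed acyclic graph, which is what keeps posterior computation scalable to large numbers of sites.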
Spatial confounding: Debunking some myths
We will discuss the behavior of popular spatial regression estimators on the task of estimating the effect of an exposure on an outcome in the presence of an unmeasured spatially-structured confounder. This setting is often referred to as “spatial confounding.” We consider spline models, Gaussian processes (GP), generalized least squares (GLS), and restricted spatial regression (RSR) under two data generation processes: one where the confounder is a fixed effect and one where it is a random effect. The literature on spatial confounding is confusing and contradictory, and our results correct and clarify several misunderstandings. We first show that, like an unadjusted OLS estimator, RSR is asymptotically biased under any spatial confounding scenario. We then prove a novel result on the consistency of the GLS estimator under spatial confounding. We finally prove that estimators like GLS, GP, and splines, that are consistent under confounding by a fixed effect will also be consistent under confounding by a random effect. We conclude that, contrary to much of the recent literature on spatial confounding, traditional estimators based on partially linear models are amenable to estimating effects in the presence of spatial confounding. We support our theoretical arguments with simulation studies.
Keywords: Spatial Statistics, Confounding, Gaussian Processes, Generalized Least Squares
10:30 – 10:45 | Atrium & Jajodia Auditorium
Coffee Break
10:45 – 12:15 | ENGR 1107 S05 — Statistical Analysis Of Random Objects In Metric Spaces
Organized and chaired by Wanli Qiao
Matched Pairs Oriented Projective Shapes
The pinhole camera model, taking into account the fact that the scene being imaged lies in front of the camera, leads to the concepts of oriented projective shape and oriented projective shape space. Furthermore, in previous work it was shown that the resulting extrinsic statistical techniques for independent samples of two-dimensional image data have greater statistical power than comparable statistical techniques which ignore directional information. Here we develop a novel matched pairs test for 2D oriented projective shapes, and apply this methodology to the problem of face double identification. In the case of a Russian president we use two galleries of images, pre-2015 and post-2015, with 38 pictures in each. Our novel matched pairs test, with optimally paired images, is highly statistically significant. This is joint work with Rob Paige and Mihaela Pricop-Jackstadt. The authors acknowledge support from National Science Foundation awards DMS-2311059 (Patrangenaru) and DMS-2311058 (Paige), and from a Horizons award (Pricop-Jackstadt).
Keywords: Digital Camera Imaging, Scene Identification, Extrinsic Means, Matched Pairs, Oriented Projective Shape Analysis
Geodesic optimal transport regression
Classical regression models do not cover non-Euclidean data that reside in a general metric space, while the current literature on non-Euclidean regression by and large has focused on scenarios where either predictors or responses are random objects, i.e., non-Euclidean, but not both. In this paper we propose geodesic optimal transport regression models for the case where both predictors and responses lie in a common geodesic metric space and predictors may include not only one but also several random objects. This provides an extension of classical multiple regression to the case where both predictors and responses reside in non-Euclidean metric spaces, a scenario that has not been considered before. It is based on the concept of optimal geodesic transports, which we define as an extension of the notion of optimal transports in distribution spaces to more general geodesic metric spaces, where we characterize optimal transports as transports along geodesics. The proposed regression models cover the relation between non-Euclidean responses and vectors of non-Euclidean predictors in many spaces of practical statistical interest. These include one-dimensional distributions viewed as elements of the 2-Wasserstein space and multidimensional distributions with the Fisher-Rao metric that are represented as data on the Hilbert sphere. Also included are data on finite-dimensional Riemannian manifolds, with an emphasis on spheres, covering directional and compositional data, as well as data that consist of symmetric positive definite matrices. We illustrate the utility of geodesic optimal transport regression with data on summer temperature distributions and human mortality.
Keywords: Geodesic Metric Spaces; Metric Statistics; Multiple Regression; Random Objects; Ubiquity; Spherical Data; Distributional Data; Symmetric Positive Definite Matrices.
Density-on-Density Regression
In this study, a density-on-density regression model is introduced, where the association between densities is elucidated via a warping function. The proposed model has the advantage of being a straightforward demonstration of how one density transforms into another. Using the Riemannian representation of density functions, which is the square-root function (or half density), the model is defined in the correspondingly constructed Riemannian manifold. To estimate the warping function, it is proposed to minimize the average Hellinger distance, which is equivalent to minimizing the average Fisher-Rao distance between densities. An optimization algorithm is introduced by estimating the smooth monotone transformation of the warping function. Asymptotic properties of the proposed estimator are discussed. Simulation studies demonstrate the superior performance of the proposed approach over competing approaches in predicting outcome density functions. Applied to a proteomic-imaging study from the Alzheimer’s Disease Neuroimaging Initiative, the proposed approach illustrates the connection between the distribution of protein abundance in the cerebrospinal fluid and the distribution of brain regional volume. Discrepancies among cognitively normal subjects, patients with mild cognitive impairment, and Alzheimer’s disease (AD) are identified and the findings are in line with existing knowledge about AD.
Keywords: Fisher-Rao Metric; Hellinger Distance; Object-Oriented Regression; Probability Density Functions; Riemannian Manifold.
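For reference, the standard quantities the abstract relies on are stated below in conventional notation; the constant in the Fisher-Rao distance depends on the chosen scaling of the metric.

```latex
% Hellinger distance and Fisher-Rao (geodesic) distance between densities
% f and g, via the square-root (half-density) representation:
H^{2}(f,g) \;=\; \tfrac{1}{2}\int\!\big(\sqrt{f(x)}-\sqrt{g(x)}\big)^{2}\,dx
           \;=\; 1-\int\!\sqrt{f(x)\,g(x)}\,dx,
\qquad
d_{FR}(f,g) \;\propto\; \arccos\!\Big(\int\!\sqrt{f(x)\,g(x)}\,dx\Big)
            \;=\; \arccos\!\big(1-H^{2}(f,g)\big).
```

The two distances are thus monotone transformations of one another, which is the relation behind the estimation criterion described above.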
10:45 – 12:15 | ENGR 1108 S06 — Differential Privacy And Robustness
Organized by Elvezio Ronchetti | Chaired by Max Welz
Differentially private penalized M-estimation via noisy optimization
We propose a noisy composite gradient descent algorithm for differentially private statistical estimation in high dimensions. We begin by providing general rates of convergence for the parameter error of successive iterates under assumptions of local restricted strong convexity and local restricted smoothness. Our analysis is local, in that it ensures a linear rate of convergence when the initial iterate lies within a constant-radius region of the true parameter. At each iterate, multivariate Gaussian noise is added to the gradient in order to guarantee that the output satisfies Gaussian differential privacy. We then derive consequences of our theory for linear regression and mean estimation. Motivated by M-estimators used in robust statistics, we study loss functions which downweight the contribution of individual data points in such a way that the sensitivity of function gradients is guaranteed to be bounded, even without the usual assumption that our data lie in a bounded domain. We prove that the objective functions thus obtained indeed satisfy the restricted convexity and restricted smoothness conditions required for our general theory. We then show how the private estimators obtained by noisy composite gradient descent may be used to obtain differentially private confidence intervals for regression coefficients, by leveraging work in Lasso debiasing proposed in high-dimensional statistics. We complement our theoretical results with simulations that illustrate the favorable finite-sample performance of our methods.
Keywords: Differential Privacy, Penalized M-Estimation, Composite Gradient Descent, Debiased-Lasso, Nonconvex Regularization
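The following Python sketch illustrates the core mechanism described in the abstract for mean estimation: a Huber-type loss keeps the per-observation gradient contribution bounded, and Gaussian noise calibrated to that bound is added at every gradient step. The step size, noise scale, and clipping constant are illustrative and are not the calibration or the composite scheme from the paper.

```python
# Minimal sketch of noisy gradient descent with a bounded-influence loss
# for private robust mean estimation; constants are illustrative only.
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_t(df=3, size=500)          # heavy-tailed, unbounded data
c, sigma, eta, theta = 1.345, 0.05, 0.1, 0.0

def huber_grad(r, c):
    """Derivative of the Huber loss: bounded by c in absolute value."""
    return np.clip(r, -c, c)

for _ in range(300):
    grad = -np.mean(huber_grad(x - theta, c))
    # Changing one observation moves the averaged gradient by at most 2c/n,
    # so Gaussian noise of a matching scale yields Gaussian DP (accounting omitted).
    noise = rng.normal(0, sigma * 2 * c / len(x))
    theta -= eta * (grad + noise)

print("private robust mean estimate:", round(float(theta), 3))
```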
Privacy preserving Hellinger distance estimator for Hawkes process
Hawkes processes and extensions have been intensely investigated due to their applicability to several scientific fields. The kernel functions of these processes play a central role in describing the probabilistic and statistical properties of the estimators of various functionals of the process. Assuming a parametric kernel function, in this presentation, we describe divergence-based inferential methods for the parameters of the process. Our computational methods are based on stochastic optimization, naturally allowing the incorporation of privacy while retaining utility. We establish the consistency and asymptotic normality of the proposed minimum divergence estimators under appropriate regularity conditions. Using these results, we describe the efficiency and robustness properties of the proposed estimators under potential model misspecification. We illustrate our results using data examples and numerical experiments.
Keywords: Hawkes Process, Differential Privacy, Stochastic Optimization Methods, Divergence Based Estimators
10:45 – 12:15 | ENGR 1110 S07 — Statistical Modeling For Phenological Research
Organized by Jonathan Auerbach | Chaired by Jonathan Auerbach; Additional Panelists: Julio Betancourt
Predicting Cherry Trees Peak Bloom 2024: An Undergraduate Approach
As part of the Statistics program at George Mason University (GMU), students are encouraged to engage in various projects and competitions that allow them to apply different statistical learning algorithms. One such event is the 2024 International Cherry Blossom Prediction Competition. The primary objective of this competition was not only to predict the bloom dates of cherry trees but also to identify key ecosystem factors influencing these dates.
In previous iterations of the competition, participants utilized diverse data sources such as air pollution and temperature to study cherry trees in multiple locations. Additional research has examined crucial biological aspects of cherry trees over time. For this year’s competition, several meteorological variables, including temperature, humidity, precipitation, pressure, and wind speed, were collected, processed, and modeled using the CRISP-DM methodology.
This presentation will detail the implementation of the Gradient Boosting Regression Tree (GBRT) algorithm to predict the peak bloom dates of cherry trees at five locations around the world: Washington D.C., USA; Kyoto, Japan; Vancouver, Canada; Liestal-Weideli, Switzerland; and New York City, USA. The results of this research include point and interval predictions as well as a ranking of the most influential variables that supported these predictions. The model developed and implemented during this research achieved the award for the most accurate prediction.
Keywords: Gradient Boosting Regression Tree, Cherry Blossom Prediction
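A minimal Python sketch of the kind of GBRT fit described above is given below; the features, data, and hyperparameters are placeholders, not the competition model or its meteorological inputs.

```python
# Minimal sketch of a gradient boosting regression fit on aggregated weather
# features for bloom-date prediction; placeholder data and settings only.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
years = np.arange(1990, 2024)
df = pd.DataFrame({
    "mean_winter_temp": rng.normal(4, 2, len(years)),
    "spring_precip_mm": rng.normal(80, 20, len(years)),
    "mean_humidity": rng.normal(65, 5, len(years)),
})
# Placeholder response: peak bloom day of year, earlier with warmer winters.
bloom_doy = 95 - 2.0 * df["mean_winter_temp"] + rng.normal(0, 3, len(years))

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                  max_depth=3, random_state=0)
model.fit(df, bloom_doy)
print(dict(zip(df.columns, np.round(model.feature_importances_, 2))))
```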
Is the global fingerprint of climate change slowing down with accelerated warming?
Twenty years ago researchers reported the first ‘globally coherent fingerprint’ of climate change on natural systems: plants and animals were shifting their phenology—the timing of critical life events, such as leafout and nesting—earlier with warming. These advances appear unique across even the longest time series, such as the cherry blossom records of Japan (>800 years). Our biological understanding of these processes for plants, however, suggests that advances may slow down or stop with increased warming, as winters become too warm to meet the cool-winter ‘chilling’ requirements of many species. Recently, multiple studies have reported this predicted slowing down in tree responses to climate change, resulting in declining phenological sensitivities (days per degree C). These slowed biological processes have major implications for forecasts of future change, including carbon sequestration. Here, we contrast multiple biological models, combined with statistical models, to test expectations of what happens to phenology across diverse tree species. We show that currently apparent slowed responses may not be slower when measured in biological time, whereas most models are linear models in calendar time. These results suggest our current biological models and statistical methods may not be robust enough to detect changing biology. Current methods may thus undermine efforts to identify when and how warming will reshape biological processes.
Keywords: Climate Change, Global Warming, Tree Phenology, Non-Linear Processes, Time-Series
Reassessing Phenological Models: Illuminating Why Growing Degree Day Model Variations Fail to Advance Scientific Knowledge
Phenology, the study of the timing of recurring seasonal biological events, has long relied on the growing degree day (GDD) model to understand the temporal occurrences of various phenological phenomena. However, a growing body of literature is advocating for more complicated models, inspired by the GDD model, as these models seem to improve the accuracy of predicting the timing of phenological events. Despite the enhanced prediction accuracy of these variants of the original GDD model, however, our analysis suggests that these variants fail to yield new scientific insights. Instead, we demonstrate that the observed improvements in prediction accuracy are likely attributable to statistical artifacts and measurement error rather than substantive progress in scientific comprehension. This insight prompts a critical reevaluation of the mechanisms underlying phenological models and calls for a renewed focus on explaining the true drivers of phenological phenomena beyond mere prediction accuracy.
Keywords: Measurement Error, Growing Degree Day Model, Phenology, Prediction Accuracy
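For readers unfamiliar with the growing degree day model the abstract critiques, the classical accumulation rule is sketched below in Python: daily heat units above a base temperature are summed until a species-specific threshold is met. The base temperature, threshold, and toy temperature series are illustrative.

```python
# The classical GDD accumulation referred to above (illustrative constants).
import numpy as np

def gdd(tmax, tmin, t_base=5.0):
    """Daily growing degree days from daily max/min temperature (deg C)."""
    return np.maximum((np.asarray(tmax) + np.asarray(tmin)) / 2.0 - t_base, 0.0)

def predicted_event_day(tmax, tmin, threshold, t_base=5.0):
    """First day (1-indexed) on which cumulative GDD reaches the threshold."""
    cum = np.cumsum(gdd(tmax, tmin, t_base))
    days = np.nonzero(cum >= threshold)[0]
    return int(days[0]) + 1 if days.size else None

# Toy example: a gradually warming spring meets a 150 degree-day requirement.
tmax = np.linspace(6, 22, 120)
tmin = tmax - 8
print(predicted_event_day(tmax, tmin, threshold=150.0))
```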
10:45 – 12:15 | ENGR 1103 S08 — Novel Estimators And Visualizations For Robust Multivariate Data Analysis
Contributed session chaired by Peter Filzmoser
Visualizing compositional count data with model-based ordination methods
In recent years, model-based ordination of ecological community data has gained a lot of popularity among practitioners, largely due to greater availability and utilization of computational resources. In particular, the family of generalized linear latent variable models [GLLVM, Moustaki, 1996], a factor-analytic and rank-reduced form of mixed effect models, has proven to be both accurate and computationally efficient [Warton et al., 2015]. GLLVMs have been implemented and used for many of the response types common to ecological community data: presence-absence, biomass, and overdispersed and/or zero-inflated counts serving as examples [Niku et al., 2019]. In this work, we show how GLLVMs can be applied when visualizing high-dimensional compositional count data. The methods are useful for example in the analysis of microbiome data, as such data are usually collected using modern lab-based sampling tools, and data are inherently compositional due to finite total capacities of the sequencing instruments.
We use extensive simulation studies to compare the ordination methods based on GLLVMs with classical compositional data analysis methods that rely on log-transformations [Filzmoser et al., 2018]. Recently developed fast model-based ordination methods utilizing copulas [Popovic et al., 2022] are also included in the comparisons. The methods are illustrated with a microbiome data example.
References
Filzmoser, P., Hron, K. and Templ, M. (2018). Applied Compositional Data Analysis, Springer, Cham.
Moustaki, I. (1996). A latent trait and a latent class model for mixed observed variables. British Journal of Mathematical and Statistical Psychology, 49, 313–334.
Niku, J., Hui, F.K.C., Taskinen, S. and Warton, D.I. (2019). gllvm – Fast analysis of multivariate abundance data with generalized linear latent variable models in R. Methods in Ecology and Evolution, 10, 2173–2182.
Popovic, G.C., Hui, F.K.C., and Warton, D.I. (2022). Fast model-based ordination with copulas. Methods in Ecology and Evolution, 13, 194–202.
Warton, D.I., Guillaume Blanchet, F., O’Hara, R., Ovaskainen, O., Taskinen, S., Walker, S.C., and Hui, F.K.C. (2015). So many variables: Joint modeling in community ecology. Trends in Ecology and Evolution, 30, 766–779.
Keywords: Community-Level Modeling, Copula, Latent Variable Model, Sparsity, Zero-Inflation
A Canonical Variate Analysis Biplot based on the Generalized Singular Value Decomposition
Canonical Variate Analysis (CVA) is a multivariate statistical technique that aims to find linear combinations of variables that best differentiate between groups in a dataset. The data are partitioned into groups based on some predetermined criteria, and linear combinations of the original variables are then derived such that they maximize the separation between the groups. However, a common limitation of this optimization in CVA is that the within-group scatter matrix must be nonsingular, which precludes datasets in which the number of variables is larger than the number of observations. By applying the generalized singular value decomposition (GSVD), the same goal of CVA can be achieved regardless of the number of variables. In this presentation we use this approach to show that CVA can be applied to such data and that graphical representations can be constructed. We consider CVA biplots that display observations as points and variables as axes in a reduced dimension, providing a highly informative visual display of the respective group separations.
Keywords: Dimension Reduction, Canonical Variate Analysis, Generalized Singular Value Decomposition, Biplots
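In standard notation, the criterion and the limitation discussed above can be summarized as follows; the notation (B, W, a) is generic rather than the presenters’ own.

```latex
% Canonical variates a maximize between-group relative to within-group
% scatter (B and W, respectively):
\max_{a}\; \frac{a^{\top} B\, a}{a^{\top} W\, a},
% classically solved through the eigenproblem of W^{-1}B, which requires W
% to be nonsingular; working with the GSVD of the factored pair of scatter
% matrices avoids the explicit inversion, so the solution remains defined
% when the number of variables exceeds the number of observations.
```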
A Robust Deterministic Affine-Equivariant Algorithm for Multivariate Location and Scatter
With recent advances in measurement technology and data collection methodologies, it is increasingly common that multiple characteristics of interest are simultaneously measured and recorded for an item of interest. The added information provided by analyses of data from several variables better informs decision making. For many questions of interest where multivariate analyses prove advantageous, estimation of two fundamental parameters, the location vector and scatter matrix, is frequently of particular interest. Software that implements algorithms appropriate for these kinds of multivariate applications is now readily available. Being parametric, most of these techniques are typically sensitive to violations of model assumptions. Empirically, this often corresponds to situations when the dataset contains outliers or anomalous observations.
The minimum covariance determinant (MCD) estimator of Rousseeuw (1984) is probably the most prominent robust estimator of multivariate location and scatter. Computing the MCD estimator requires solving an NP-hard combinatorial optimization problem, which, in view of the no-free-lunch theorem, means that no algorithm can in general outperform blind search. Therefore, the pioneering FastMCD algorithm of Rousseeuw & Van Driessen (1999) and its more recent DetMCD alternative by Hubert et al. (2012) are favored by practitioners and included in most statistical software.
We propose a new algorithm for computing raw and reweighted MCD estimators (Pokojovy & Jobe, 2022). Our new procedure incorporates a projection pursuit approach and the concentration step (C-step) of Rousseeuw & Van Driessen (1999). The resulting estimator is deterministic, affine equivariant and permutation invariant, unlike prominent alternatives. The new procedure, referred to as Projection Pursuit MCD (PP MCD), combines a preliminary estimator obtained with a type of non-linear principal component analysis and the C-step. While the C-step not only reduces the covariance determinant but also converges to a fixed point in a finite number of steps, it has been a long-standing question whether these fixed points are natural covariance determinant minimizers or rather comprise some artificial attractor set. We answer this question by showing that the fixed points are indeed local minimizers of the covariance determinant over a convex relaxation of the feasible set. This implies the C-step does not terminate prematurely and no further local improvement of the objective is possible.
Extensive comparisons for simulated datasets, multivariate Swiss banknote and image segmentation examples show the new algorithm is competitive with and mostly superior to such state-of-the-art procedures as FastMCD and DetMCD for both Gaussian and heavy-tailed real-world data. Outlier detection for Swiss banknote and image segmentation data is presented. A corresponding R package is available at https://github.com/mpokojovy/PPcovMcd.
Keywords: Robust Estimation, Projection Index, Breakdown, Bulk Purity, Outlier Detection
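For readers unfamiliar with the C-step referenced above, the Python sketch below shows its core iteration from Rousseeuw & Van Driessen (1999): given an h-subset, refit location and scatter and keep the h observations with smallest Mahalanobis distances, which never increases the covariance determinant. This is only the generic C-step on toy data, not the PP MCD algorithm or its projection pursuit initialization.

```python
# Minimal sketch of concentration steps (C-steps) for the MCD objective.
import numpy as np

def c_steps(X, subset, h, max_iter=100):
    for _ in range(max_iter):
        mu = X[subset].mean(axis=0)
        S = np.cov(X[subset], rowvar=False)
        # Squared Mahalanobis distances of all observations to (mu, S).
        d2 = np.einsum("ij,jk,ik->i", X - mu, np.linalg.inv(S), X - mu)
        new_subset = np.argsort(d2)[:h]          # concentrate on closest h points
        if set(new_subset) == set(subset):
            break                                 # fixed point reached
        subset = new_subset
    return mu, S, subset

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0], np.eye(2), 200)
X[:20] += 8                                      # 10% casewise outliers
h = 150
mu, S, subset = c_steps(X, rng.choice(len(X), h, replace=False), h)
print("robust location estimate:", np.round(mu, 2))
```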
Learning Claim Severity Models from Lower-truncated and Right-censored Data
Financial and insurance sample data, often left-truncated (LT) and right-censored (RC) due to different loss control mechanisms, pose challenges for standard unimodal predictive models. Recent advances suggest multi-modal spliced/mixture models for better predictive power for unseen data. Addressing the difficulty of modeling LT and RC data, we propose robust algorithmic approaches and robust statistical learning to enhance prediction and management of future claims.
Keywords: Dynamic Estimation; Insurance Payments; Robustness; Relative Efficiency; Truncated And Censored Data
12:15 – 13:30 | (Not provided)
Lunch break
Free ice cream in the Atrium (12:30–13:30)!
13:30 – 14:30 | Jajodia Auditorium (ENGR 1101) Keynote Address by Dr. Marianthi Markatou — Statistical Distances in Theory and Practice: Beyond Likelihood with Robustness Guarantees
Are comets’ orbits uniformly distributed? Are the medical products we use safe? Can we summarize large amounts of text in an unsupervised manner efficiently? Answers to these questions present us with important challenges. Of particular importance are the challenges presented by the “large magnitude” of the data, both in terms of the dimension of the data vectors and in terms of the number of data vectors.
Statistical models are the foundation of much statistical work, but they always involve restrictions on the class of the allowable distributions. We present an approximation framework, based on statistical distances, in which we adopt the point of view that we might want to use the model even if it is probably false, provided it is sufficiently close, in a statistical sense, to the scientific true state of nature. Our point of view is that parametric models can provide informative, parsimonious descriptions of the data if the statistical loss for doing so is small enough. We discuss in detail the properties of several distances, including likelihood, and we place emphasis on determining the sense in which we can give these distances meaningful interpretations as measures of statistical loss. The discreteness of data makes it difficult to directly compare a sample distribution and a continuous probability measure, as the two supporting measures are mutually singular. We will discuss the concept of discretization robustness. Furthermore, we will identify general structural features of distances that are useful in using them for statistical analyses. An important unifying step is the identification of the class of quadratic distances with a second order von Mises expansion. We conclude with a discussion on the development of tools based on these distances for use in inference, and we exemplify our statements on data via use of the QuadratiK package.
Dr. Marianthi Markatou is SUNY Distinguished Professor and Associate Chair of Research and Healthcare Informatics in the Department of Biostatistics, School of Public Health and Health Professions. She also holds a joint appointment with the Jacobs School of Medicine & Biomedical Sciences and an Adjunct Professorship in the Department of Computer Science and Engineering, University at Buffalo. Dr. Markatou received a Ph.D. in Statistics from the Pennsylvania State University. Her research interests are broad and include problems at the interface of statistics and machine learning, mixture models, robustness, theory and applications of statistical distances, surveillance methods, emerging safety sciences, biomedical informatics, text mining, and clustering methods for mixed-type data. Her research work is continuously funded by NSF, NIH, FDA, PCORI and other non-profit organizations and research foundations. Dr. Markatou is an Elected Fellow of the American Statistical Association, an Elected Fellow of the Institute of Mathematical Statistics, and an Elected Member of the International Statistical Institute; she has received Fulbright fellowship awards (2017, 2019) and a Senior Researcher of the Year award (2019) from the School of Public Health & Health Professions, University at Buffalo. She has held many associate editor positions and is currently Co-Editor-in-Chief of the International Statistical Review, the flagship journal of the International Statistical Institute.
14:30 – 14:45 | Jajodia Auditorium (ENGR 1101) P01 — Poster Lightning Talks
Contributed session chaired by David Kepplinger
Robust estimations from distribution structures
Descriptive statistics for parametric or nonparametric models are generally sensitive to departures, gross errors, and/or random errors. Here, semiparametric methods for classifying distributions are explored to uncover the mechanisms underlying current robust estimators. Further deductions explain why the Winsorized mean typically has smaller biases compared to the trimmed mean and why the Hodges-Lehmann estimator and Bickel-Lehmann spread are the optimal nonparametric location and scale estimators. From the distribution structures, a series of new estimators was deduced. Some of them are robust to both gross errors and departures from parametric assumptions, making them ideal for estimating the mean and central moments of common unimodal distributions. This presentation sheds light on the common nature of probability distributions and of the measures built on them.
Keywords: Semiparametric, Unimodal, Hodges-Lehmann Estimator, U-Statistics
Predicting Dengue Incidence In Central Argentina Using Google Trends Data
Dengue is a mosquito-borne disease prominent in tropical and subtropical regions of the world but has been emerging in temperate areas. In Cordoba, a city in temperate central Argentina, there have been several dengue outbreaks in the last decade following the city’s first outbreak in 2009. Internet data, such as social media posts and search engine trends, have proven to be useful for predicting the spread of infectious diseases. As the first step in developing a predictive model of dengue incidence in Cordoba using Google Trends data, we have conducted a study of relationships between Google search terms and dengue incidence during recent outbreaks in the city. Specifically, using relevant search terms as predictors and dengue case data as the response variable, our trained model can identify which search terms are significant for predicting dengue cases. We study relationships between predictor and response variables in real-time and with lags in our predictive model. We employ several methods to identify the significance of search terms, and we find that terms such as “mosquito”, “dengue”, “aedes”, “aegypti”, “dengue virus”, and “virus del dengue”, are often strongly correlated with dengue incidence. We observe that the lag data, as compared to real-time data, has a better fit and predictive performance for dengue cases. Our predictive model that utilizes Google Trends data can be integrated with climate, sociodemographic, and other types of information as part of a comprehensive early warning system that predicts outbreaks and informs public health and mosquito control policies.
Keywords: Machine Learning, Disease Modeling
Copula-based Canonical Coherence for EEG data
Common measures of spectral dependence deal with associations between two signals. When the interest is in the dependence between two matrices of signals, multivariate analysis (such as canonical analysis) offers a solution. However, this kind of analysis is limited to independent datasets and normal distributions; when the data contain extreme values and exhibit time dependence, the usual analysis is not applicable. We explore this notion in the spectral domain and provide a way of estimating “canonical coherence” in terms of copulas. In this study, a vine-copula model is formed in such a way that it takes into account the cross-hemisphere cross-association (in the second tree), which may have interference from neighboring neurons. This measure was compared with the use of the variance-covariance matrix in the usual analysis, and its performance was tested for different elliptical copulas (both light- and heavy-tailed). The proposed measure shows competitive results in simulations and exceeds the performance of the usual canonical analysis, especially for heavy-tailed distributions. The paper also introduces a “Granger causality” counterpart for multivariate data. The measure was applied to occipital and parietal lobe recordings from a neonate in an intensive care unit in Finland.
Keywords: Canonical Coherence, Multivariate, Vine-Copula, EEG, Robust
Robust Decentralized Federated Learning Against both Distributional Shifts and Byzantine Attacks
Decentralized federated learning (DFL) offers enhanced resilience to client failures and potential attacks compared to its centralized counterpart. This advantage stems from its ability to aggregate learning models from distributed clients without the need for centralized server coordination. Nonetheless, the adoption of DFL in practical applications faces several challenges that can threaten the robustness of local models. On one hand, the distribution of data might change over time or space due to the decentralized infrastructure, which impacts the aggregated model’s performance on both local agents and test data. On the other hand, Byzantine attacks, which involve certain users sending malicious updates to their neighbors to propagate erroneous knowledge, could compromise the convergence and accuracy of the global model. We observe that there has not yet been any work that claims to resolve both spatial and temporal distributional shifts, let alone when combined with Byzantine attacks. In this work, we aim to fill this gap by proposing a robust decentralized learning algorithm that is resilient to both distributional shifts and Byzantine attacks. To combat distributional shifts, we propose implementing Wasserstein distributionally robust optimization. This approach involves introducing controlled perturbations to the training data across clients, enhancing the model’s robustness against shifts in data’s spatial and temporal distributions. To counter Byzantine attacks, we propose equipping non-Byzantine agents with robust aggregation measures to filter out suspicious information received from neighboring clients. The resilience of the proposed algorithm against both distributional shifts and Byzantine attacks is validated by extensive experiments on real datasets.
Keywords: Decentralized Federated Learning; Distributional Shifts; Byzantine Robustness; Wasserstein Distributionally Robust Optimization; Robust Aggregation.
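A minimal Python sketch of one of the robust aggregation rules of the kind described above is given below: a coordinate-wise trimmed mean of client updates, which bounds the influence of a minority of Byzantine clients. The specific aggregation measures, the trimming level, and the Wasserstein DRO perturbation step of the proposed algorithm are not reproduced here; the data and fractions are illustrative.

```python
# Minimal sketch of coordinate-wise trimmed-mean aggregation of client updates.
import numpy as np

def trimmed_mean_aggregate(updates, trim_frac=0.2):
    """updates: (n_clients, n_params) array of local model updates."""
    updates = np.asarray(updates)
    k = int(np.floor(trim_frac * updates.shape[0]))
    sorted_u = np.sort(updates, axis=0)          # sort each coordinate separately
    kept = sorted_u[k: updates.shape[0] - k]     # drop k smallest and k largest
    return kept.mean(axis=0)

rng = np.random.default_rng(6)
honest = rng.normal(0.0, 0.1, size=(8, 4))       # 8 honest clients
byzantine = np.full((2, 4), 50.0)                # 2 attackers send huge updates
print(np.round(trimmed_mean_aggregate(np.vstack([honest, byzantine])), 3))
```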
Uniform manifold approximation and projection for trajectory inference using compatibility of ordinal label
Dimension reduction techniques are widely used to analyze gene expression data reflecting the state of cells in processes such as development and differentiation, and for data visualization. Uniform Manifold Approximation and Projection (UMAP; McInnes et al., 2018) is a useful algorithm for classifying biological data: it embeds high-dimensional data into a low-dimensional space by focusing on local similarities. A common application of this method in biology is trajectory inference (Street et al., 2018), which uses single-cell RNA-seq data to infer cell fate, such as cell differentiation. Trajectory inference extracts common information by embedding the observed cells into low dimensions using dimension-reduction algorithms such as UMAP, so that the cell differentiation process can be interpreted in a low-dimensional space. The resulting trajectory is called pseudotime when interpreted as a temporal axis; pseudotime estimation can thus be interpreted as embedding high-dimensional data into a one-dimensional series called pseudotime.
Existing trajectory inference and pseudotime estimation methods use only the single-cell RNA-seq data. Therefore, even if label information is available for each cell, an analysis that ignores this information is performed, and consequently genes related to mutations associated with the label cannot be identified. In single-cell RNA-seq data, information associated with data acquisition, such as age, disease progression, and observation points for each cell, can be obtained as labels; incorporating such labels into the dimension reduction is therefore useful for biological interpretation.
This study proposes a UMAP variant that uses this label information and ordinal distance compatibility for single-cell RNA-seq data with ordinal labels. Using UMAP, the latent cluster information of the obtained single-cell RNA-seq data can be extracted; our proposed method preserves the order information of the clusters in UMAP by using the external ordinal labels as a constraint, while preserving the information being clustered. The proposed method incorporates the ideas of Local Ordinal Embedding (Terada & Luxburg, 2014) and ordinal distance compatibility (Weiss et al., 2020) to preserve the order information of the centroids. Our method is expected to improve the dimension-reduction task in trajectory inference applications using single-cell RNA-seq data to elucidate biological mechanisms. For example, in cases where cell differentiation patterns and other biological changes are related to the order, the proposed method is expected to capture these patterns more accurately.
Keywords: Dimensionality Reduction, Single-Cell RNA-Seq Data, Visualisation
A Nonparametric Bayesian Model of Citizen Science Data for Monitoring Environments Stressed by Climate Change
We propose a new method to adjust for the bias that occurs when citizen scientists monitor a fixed location and report whether an event of interest has occurred or not, such as whether a plant has bloomed. The bias occurs because the monitors do not record the day each plant first blooms at the location, but rather whether a certain plant has already bloomed when they arrive on site. Adjustment is important because differences in monitoring patterns can make local environments appear more or less anomalous than they actually are, and the bias may persist when the data are aggregated across space or time. To correct for this bias, we propose a nonparametric Bayesian model that uses monotonic splines to estimate the distribution of bloom dates at different sites. We then use our model to determine whether the lilac monitored by citizen scientists in the northeast United States bloomed anomalously early or late, preliminary evidence of environmental stress caused by climate change. Our analysis suggests that failing to correct for monitoring bias would underestimate the peak bloom date by 32 days on average. In addition, after adjusting for monitoring bias, several locations have anomalously early bloom dates that did not appear anomalous before adjustment.
Keywords: Nonparametric Bayes, Monotonic Splines, Monitoring Bias, Bias Correction, Crowdsourcing
Asymptotics of least trimmed absolute deviation in a location-scale model
The least trimmed absolute deviation (LTAD) estimator, introduced by Tableman (1994), is a robust regression estimator. Like Rousseeuw’s (1984) least trimmed squares (LTS) estimator, LTAD identifies the observations with the n-h largest absolute deviations as potential outliers, and then reports the least absolute deviation (LAD) estimator on the remaining subsample of size h. Building on recent work by Berenguer-Rico, Johansen, and Nielsen (2023) on LTS, we argue that the LTAD estimator is maximum likelihood in a model where the values of outlying errors lie outside the range of the realized good Laplace errors. We provide an asymptotic study of LTAD in a location-scale model and find it to be h^{1/2}-consistent and asymptotically normal. In particular, the asymptotic distribution is the same as it would have been if LAD had been applied to the infeasible set of good observations. Our assumptions differ from those of Tableman (1994) so that nuisance-parameter-free asymptotics can be derived. Monte Carlo simulations are conducted to analyze the performance of LTAD under various data generating processes, including the classical epsilon-contamination model.
Keywords: Least Trimmed Absolute Deviation; Outliers; Robust Statistics
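For intuition only, the Python sketch below uses a naive concentration-style heuristic for LTAD in the location case: alternate between an LAD fit (the median) on the current h-subset and reselecting the h observations with smallest absolute deviations. The estimator itself is defined through a combinatorial optimization, so this heuristic is an assumption-laden illustration rather than the procedure studied in the poster.

```python
# Naive concentration-style heuristic for LTAD in a location model.
import numpy as np

def ltad_location(x, h, max_iter=100):
    subset = np.argsort(np.abs(x - np.median(x)))[:h]
    for _ in range(max_iter):
        m = np.median(x[subset])                      # LAD fit on current subset
        new_subset = np.argsort(np.abs(x - m))[:h]    # keep h smallest deviations
        if set(new_subset) == set(subset):
            break
        subset = new_subset
    return np.median(x[subset])

rng = np.random.default_rng(7)
x = np.concatenate([rng.laplace(0, 1, 90), rng.normal(20, 1, 10)])  # 10% outliers
print(round(float(ltad_location(x, h=75)), 3))
```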
Byzantine Robust Federated Learning with Enhanced Fairness and Privacy Guarantees
This paper develops a comprehensive framework designed to address three critical challenges in federated learning: robustness against Byzantine attacks, fairness, and privacy preservation. To improve the system’s defense against Byzantine attacks, which send malicious information to bias the system’s performance, a Two-sided Norm Based Screening (TNBS) mechanism is incorporated at the server side to discard the gradients with the l lowest norms and the h highest norms. The TNBS mechanism functions as a screening tool to filter out malicious participants whose gradients are far away from the honest ones. To promote egalitarian fairness, which ensures somewhat uniform model performance when training from non-independent and identically distributed (non-IID) data, we adopt the q-fair federated learning scheme. Furthermore, this framework is enhanced with techniques to ensure differential privacy, establishing uncompromising privacy guarantees for individual data contributions. We work on different datasets to demonstrate the significance of the proposed framework in improving robustness and fairness, while effectively managing the trade-off between privacy and accuracy. The proposed integrative approach can serve as a robust basis for developing federated learning systems that are fair, reliable, and resilient to external threats, making it a significant addition to the field of distributed machine learning.
Keywords: Federated Learning, Byzantine Robustness, Norm Based Screening, Egalitarian Fairness, Differential Privacy
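A minimal Python sketch of a two-sided norm-based screening rule as described above: rank client gradients by Euclidean norm, discard the l lowest-norm and h highest-norm clients, and average the rest. The parameter names follow the abstract; the toy updates, the choice of l and h, and the simple averaging are illustrative, not the paper’s full server-side procedure.

```python
# Minimal sketch of two-sided norm-based screening of client gradients.
import numpy as np

def tnbs_aggregate(grads, l, h):
    """grads: (n_clients, n_params); drop the l lowest- and h highest-norm clients."""
    grads = np.asarray(grads)
    norms = np.linalg.norm(grads, axis=1)
    order = np.argsort(norms)
    kept = order[l: len(order) - h]
    return grads[kept].mean(axis=0)

rng = np.random.default_rng(8)
honest = rng.normal(0, 0.1, size=(9, 5))
malicious = np.full((1, 5), 30.0)                 # one Byzantine client
print(np.round(tnbs_aggregate(np.vstack([honest, malicious]), l=1, h=1), 3))
```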
14:45 – 15:00 | Atrium & Jajodia Auditorium
Coffee Break
15:00 – 17:00 | Jajodia Auditorium (ENGR 1101) In memoriam Dr. Rubén Zamar
Organized and chaired by Gabriela Cohen Freue and Matías Salibián-Barrera
In this session we will honor and remember the contributions and legacy of Dr. Rubén Zamar. The following speakers will share their reflections:
- Doug Martin
- Matías Salibián-Barrera
- Marcelo Ruiz
- Stefan Van Aelst
- Peter Rousseeuw
- Victor Yohai
- Anthony Christidis
17:00 – 18:00 | Atrium
Cocktail hour and poster session
Tuesday, July 30
7:45 – 8:45 | Atrium & Jajodia Auditorium
Breakfast
8:45 – 10:15 | ENGR 1107 S09 — Advances In Robust Statistical Methods For Complex Data Challenges
Organized and chaired by Lily Wang
Robust Distribution-free Predictive Inference via Round-trip Generative model (DPI-RG) under Distributional Shift
Distribution-free predictive inference is an invaluable approach for establishing precise uncertainty estimates for future predictions without the need for explicit model assumptions. However, the widespread reliance on exchangeability assumptions between training and test data can impose limitations when handling test datasets with distributional shifts. In practical deployment scenarios, exchangeability is often violated, necessitating innovative solutions.
In response to this challenge, we introduce a cutting-edge methodology for distribution-free predictive inference via a round-trip generative model (DPI-RG), designed to construct predictive sets capable of managing intricate and high-dimensional data, even when confronted with previously unseen data. Our novel round-trip transformation technique maps the original data to a lower-dimensional space while preserving the conditional distribution of input data for each class label. This transformative process enables the derivation of valid p-values with a statistical guarantee, serving as a robust measure of uncertainty.
Keywords: Distribution-Free, Predictive Inference, Generative Model
Changepoint detection in autocorrelated ordinal categorical time series
While changepoint aspects in correlated sequences of continuous random variables have been extensively explored in the literature, changepoint methods for independent categorical time series are only now coming into vogue. This talk extends changepoint methods by developing techniques for serially correlated categorical time series. In this study, a cumulative sum type test is devised to test for a single changepoint in a correlated categorical data sequence. Our categorical series is constructed from a latent Gaussian process through clipping techniques. A sequential parameter estimation method is proposed to estimate the parameters in this model. The methods are illustrated via simulations and applied to a real categorized rainfall time series from Albuquerque, New Mexico.
Keywords: Temporal Discontinuities, Latent Gaussian Process, Autocorrelation, Cusum, Ordinal Categorical Data
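For orientation, the generic single-changepoint CUSUM statistic has the form shown below; the talk’s version is adapted to estimates from the clipped latent-Gaussian model rather than applied to raw observations, and the standardization shown here is the textbook one.

```latex
% Generic CUSUM statistic for a single changepoint in a series X_1,...,X_n:
C_n \;=\; \max_{1\le k\le n}\;
  \frac{1}{\hat{\sigma}\sqrt{n}}
  \Big|\sum_{t=1}^{k} X_t \;-\; \frac{k}{n}\sum_{t=1}^{n} X_t\Big|,
% with large values of C_n indicating a change in level at some time k.
```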
Algorithms for ridge estimation with convergence guarantees
The extraction of filamentary structure from a point cloud is discussed. The filaments are modeled as ridge lines or higher dimensional ridges of an underlying density. We propose two novel algorithms and provide theoretical guarantees for their convergence, by which we mean that the algorithms can asymptotically recover the full ridge set. We consider the new algorithms as alternatives to the Subspace Constrained Mean Shift (SCMS) algorithm, for which no such theoretical guarantees are known.
Keywords: Filaments, Ridges, Manifold Learning, Mean Shift, Gradient Ascent
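For orientation, the baseline SCMS iteration that the proposed algorithms are compared against can be written compactly for a Gaussian kernel density estimate; the following is a didactic numpy sketch (ridge dimension 1 in two dimensions), not one of the new algorithms.

import numpy as np

def scms_step(x, data, h, d=1):
    """One Subspace Constrained Mean Shift step for a point x.

    data : (n, D) point cloud, h : bandwidth, d : ridge dimension.
    """
    diff = data - x                                    # (n, D)
    w = np.exp(-0.5 * np.sum(diff**2, axis=1) / h**2)  # Gaussian kernel weights
    # mean-shift vector
    m = (w[:, None] * data).sum(axis=0) / w.sum() - x
    # Hessian of the kernel density estimate (up to positive factors)
    D = data.shape[1]
    H = (w[:, None, None] * (np.einsum('ni,nj->nij', diff, diff) / h**2
                             - np.eye(D))).sum(axis=0)
    # project the mean-shift vector onto the span of the D - d
    # eigenvectors with the smallest eigenvalues of the Hessian
    vals, vecs = np.linalg.eigh(H)
    V = vecs[:, :D - d]
    return x + V @ V.T @ m

# toy usage: points scattered around a circle (a 1-d ridge in 2-d)
rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, 500)
data = np.c_[np.cos(theta), np.sin(theta)] + 0.1 * rng.normal(size=(500, 2))
x = np.array([0.8, 0.1])
for _ in range(100):
    x = scms_step(x, data, h=0.3)
print(x, np.linalg.norm(x))   # norm should end up near 1 (the circle)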
A Robust Projection-based Test for Goodness-of-fit in High-dimensional Linear Models
We propose a robust projection-based test to check linear regression models when the dimension may be divergent. The proposed test can effectively maintain the type I error rate when outliers are present, achieve dimension reduction as if only a single covariate were present, and inherit the robustness of M-estimation. The test is shown to be consistent and can detect local alternatives at the root-n rate. We further derive asymptotic distributions of the proposed test under the null hypothesis and analyze its asymptotic properties under local and global alternatives. We evaluate the finite-sample performance via simulation studies and apply the proposed method to a real dataset as an illustration.
Keywords: Consistent Test; Curse Of Dimensionality; Divergent Dimension; Empirical Process; Integrated Conditional Moment
8:45 – 10:15 | ENGR 1108 S10 — Cellwise Outliers
Organized and chaired by Peter Rousseeuw
Cellwise robust and sparse PCA
Maronna & Yohai (2008) proposed robust Principal Component Analysis (PCA) based on a low-rank approximation of the data matrix. Robustness is attained by replacing the squared loss function for the approximation error with a robust version. As the loss function is applied variable-wise, this results in a cellwise robust PCA version. We use the same objective function, but add an elastic net penalty to obtain sparsity in the loadings, offering additional modeling flexibility. To solve the optimization problem, we propose an algorithm based on Riemannian stochastic gradient descent (Bonnabel, 2013), making the approach scalable to high-dimensional data, both in terms of many variables and many observations. The resulting method is called SCRAMBLE (Sparse Cellwise Robust Algorithm for Manifold-based Learning and Estimation). Simulations show the superiority of this approach in comparison to established methods, both in the casewise and cellwise robustness paradigms. The usefulness of the proposed method is illustrated on real datasets from tribology.
Keywords: Low Rank Approximation, Principal Component Analysis, Cellwise Robustness
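For the reader's orientation, one plausible form of the objective described above, with a robust loss ρ applied cellwise to the low-rank approximation error and an elastic net penalty on the loadings B restricted to the Stiefel manifold (a schematic reconstruction, not necessarily the authors' exact formulation), is

\min_{A \in \mathbb{R}^{n \times k},\; B \in \mathrm{St}(p,k)} \; \sum_{i=1}^{n} \sum_{j=1}^{p} \rho\big( x_{ij} - (A B^{\top})_{ij} \big) \; + \; \lambda \Big( \alpha \|B\|_{1} + \tfrac{1-\alpha}{2} \|B\|_{F}^{2} \Big).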
Multivariate Singular Spectrum Analysis by Robust Diagonalwise Low-Rank Approximation
Multivariate Singular Spectrum Analysis (MSSA) is recognized as a powerful method for analyzing multivariate time series data. It is applied across a variety of disciplines, including finance, healthcare, ecology, and engineering. Despite its wide applicability, MSSA lacks robustness against outliers. This vulnerability stems from its reliance on singular value decomposition, which is highly susceptible to anomalous data points. In this work a new MSSA method is proposed, named RObust Diagonalwise Estimation of SSA (RODESSA), which enhances the robustness of MSSA against both cellwise and casewise outliers. It substitutes the traditional decomposition step with a novel robust low-rank approximation method that takes the special structure of the trajectory matrix into account. A fast algorithm is constructed, and it is proved that each iteration step decreases the objective function. In order to visualize different types of outliers, a new graphical display is introduced, called an enhanced time series plot. The effectiveness of RODESSA is validated through a comprehensive Monte Carlo simulation study, which benchmarks it against other methods in the literature. A real data example about temperature analysis in passenger railway vehicles demonstrates the practical utility of the proposed approach.
Keywords: Casewise Outliers, Cellwise Outliers, Iteratively Reweighted Least Squares, Multivariate Time Series, Robust Statistics
Cellwise outlier detection for multi-factorial compositions
Similar to classic real-valued multivariate datasets, their compositional counterparts also suffer from the presence of outliers. Compositional outliers are typically characterised by distorted ratios between the compositional parts, but when a coordinate representation is applied, outlying observations can be detected similarly to real-valued data (Filzmoser & Hron [2008], de Sousa et al. [2020]). A more challenging problem is the detection of cellwise outliers, i.e. deviating (compositional) parts (Rousseeuw & van den Bossche [2018]). Even though some methods have already been developed for vector compositions (Rieser et al. [2023], Štefelová et al. [2021]), this contribution focuses on the more complex problem of detecting outliers occurring in multi-factorial compositional datasets, i.e. datasets formed by compositional tables or cubes (Fačevicová et al. [2023]). The presented methodology enables the detection of both single deviating cells and atypical categories of the constituting factors. After introducing the main theoretical principles, the performance of the method will be demonstrated through the results of a simulation study and an application to an empirical dataset.
References
Fačevicová, K., Filzmoser, P. & Hron, K. (2023). Compositional cubes: a new concept for multi-factorial compositions. Statistical Papers, 64(3), 955–985.
Filzmoser, P. & Hron, K. (2008). Outlier detection for compositional data using robust methods. Mathematical Geosciences, 40(3), 233–248.
de Sousa, J., Hron, K., Fačevicová, K. & Filzmoser, P. (2020). Robust principal component analysis for compositional tables. Journal of Applied Statistics, 48(2), 214–233.
Štefelová, N., Alfons, A., Palarea-Albaladejo, J., Filzmoser, P. & Hron, K. (2021). Robust regression with compositional covariates including cellwise outliers. Advances in Data Analysis and Classification, 15(4), 869–909.
Rieser, C., Fačevicová, K. & Filzmoser, P. (2023). Cell-wise robust covariance estimation for compositions, with application to geochemical data. Journal of Geochemical Exploration, 253, 107299.
Rousseeuw, P. & van den Bossche, W. (2018). Detecting deviating data cells. Technometrics, 60(2), 135–145.
Keywords: Anomaly Detection; Cellwise Outliers; Compositional Data; Multi-Factorial Compositions.
8:45 – 10:15 | ENGR 1110 S11 — Spatio-Temporal Statistical Learning In Climate And Environmental Science
Organized and chaired by Ben Lee
Computationally Scalable Bayesian SPDE Modeling for Censored Spatial Responses
Observations of groundwater pollutants, such as arsenic or Perfluorooctane sulfonate (PFOS), are riddled with left censoring. These pollutants have an impact on the health and lifestyle of the populace. Left censoring of these spatially correlated observations is usually addressed by applying Gaussian processes (GPs), which have theoretical advantages. However, this comes with a challenging computational complexity of O(n^3), which is impractical for large datasets. Additionally, a sizable proportion of the data being left-censored creates further bottlenecks, since the likelihood computation now involves an intractable high-dimensional integral of the multivariate Gaussian density. In this article, we tackle these two problems simultaneously by approximating the GP using the Gaussian Markov random field (GMRF) approach, which exploits an explicit link between a GP with Matérn correlation function and a GMRF using stochastic partial differential equations (SPDEs). We introduce a GMRF-based measurement error into the model, which reduces the computational burden of the likelihood computations for censored data, drastically improving the speed of the model while maintaining admirable accuracy. Our approach demonstrates robustness and substantial computational scalability compared to state-of-the-art methods for censored spatial responses across various simulation settings. Finally, we fit this fully Bayesian model to the concentration of PFOS in groundwater available at 24,959 sites across California, where 46.62% of the responses are censored. We produce prediction surfaces and uncertainty quantification in real time, thereby substantiating the applicability and scalability of the proposed method. The implementation code for our method is made available on GitHub as an R package named CensSpBayes.
Keywords: Censored High-Dimensional Spatial Data, Gaussian Markov Random Field, Measurement Error Model, Perfluorooctane Sulfonate, Statistical Epidemiology
Locally Stationary Mapping and Uncertainty Quantification of Ocean Heat Content Based on Argo Profiles During 2004-2022
Argo floats provide us with a unique opportunity to measure the global and regional Ocean Heat Content (OHC) and improve our understanding of Earth Energy Imbalance (EEI). Yet, producing Argo-based OHC estimates with reliable uncertainties is statistically challenging due to the complex structure and large size of the Argo dataset. Here we present the latest version of our mapping and uncertainty quantification framework for Argo-based OHC estimation based on state-of-the-art methods from spatio-temporal statistics. The framework is based on modeling vertically integrated Argo temperature profiles as a locally stationary Gaussian process defined over space and time. This enables us to produce computationally tractable OHC anomaly maps based on data-driven decorrelation scales estimated from the Argo observations. We quantify the uncertainty of these maps using locally stationary conditional simulation ensembles, a novel approach that leads to principled uncertainty quantification that accounts for the spatio-temporal correlations in the mapping uncertainties. A new cross-validation approach is presented to validate these uncertainties. The mapping framework is implemented in an open-source codebase that is designed to be modular, reproducible and extensible. We present a new Argo OHC data product with uncertainties for 2004-2022 based on this framework and report on various climatological estimates and their uncertainties obtained using this product. Finally, we describe how these estimates have contributed to several international climate reports and OHC and EEI intercomparison efforts.
Keywords: Spatio-Temporal Interpolation, Gaussian Process, Conditional Simulation, Physical Oceanography, Statistical Climatology
A Comprehensive Framework Integrating Extreme Precipitation Projection and Watershed Modeling
With the rise in frequency and intensity of extreme flooding events, it is of great interest to generate accurate streamflow projections and understand the impact of human activities. The precision of watershed modeling, employing hydrological models, is significantly influenced by the spatial and temporal dependence characteristics of extreme precipitation events that trigger floods. However, the use of gridded precipitation data products in large-scale watershed modeling can lead to inaccurate flood risk assessment due to the dampening of extreme precipitation, particularly in the context of mapping regional flood risk. This paper introduces an integrated modeling framework designed to address spatio-temporal extremal dependence in weather data inputs and evaluate projected changes in flood inundation risks across the entire Mississippi River Basin. The methodology begins with the application of a Bayesian analogue method, utilizing large, synoptic-scale atmospheric patterns to generate precipitation forecasts with high spatial resolution. The analogue model incorporates a mixture of spatial Student-t processes, featuring varying tail dependence intensities, to enable the data-driven selection of appropriate levels of tail dependence. Then the projected precipitation is used to force the Soil and Water Assessment Tool (SWAT), a well-calibrated model used for regional streamflow simulation in the Mississippi River basin and predicting nutrient cycles in agricultural landscapes. Lastly, streamflow data extracted from SWAT are input in a hydraulic model to estimate regional flood inundation risks using a univariate peaks-over-threshold extreme value approach.
Keywords: Extreme Precipitation, Hydrological Modeling, Flood Risk, Water Quality
Modeling Spatio-Temporal Environmental Processes and Emerging Crises under Data Sparsity
Modeling the dynamics of spatio-temporal processes is often challenging, and this is exacerbated under data sparsity (often the case in the early stages of a process). For example, modeling the dynamics of a vector-borne infectious disease at early stages is very challenging due to data sparsity (as well as potential lack of knowledge regarding the disease dynamics itself); this is an important issue for modeling emerging and re-emerging epidemics, or an emerging climate crisis. Moreover, data sparsity may also result in inefficient inference and ineffective prediction for such processes. This is a common issue in modeling rare or emerging ecological, environmental, epidemiological, and social processes that are new or uncommon in specific areas or time periods, or under conditions that are hard to detect. Consequently, due to the urgency of modeling these processes in many situations (e.g., in a crisis situation), one often has limited predictor data to use, either because of lack of knowledge about the process or the need for fine-resolution predictor data. For example, the dynamics of climate-driven human migration may be quite complex to model (in particular, when a crisis occurs and there are abrupt migration inflows and outflows). Classic models that are commonly used in these areas often fall short of modeling such events and are unable to provide reliable inference and reasonable or accurate forecasts. Also, the factors that are linked with migration processes are often related to long-term migration and/or only available at a spatially and temporally aggregated level. In this paper, we discuss strategies for dealing with some of the statistical issues of modeling dynamics under data sparsity, including: utilizing blended data (i.e., both conventional and organic data sources), considering a mechanistic science-based modeling framework to model the dynamics of a spatio-temporal process based on zero-modified hierarchical modeling approaches, and implementing improved parameter estimation and forecasting through transfer learning. We will provide examples based on simulated and real data.
Keywords: Zero-Inflated Counts, Organic And Blended Data, Transfer Learning, Vector-Borne Infectious Disease, Climate Migration
8:45 – 10:15 | ENGR 1103 S12 — Theoretical Insights Into Robust And Complex Statistical Methods
Contributed session chaired by Syed Ejaz Ahmed
Asymptotic and Non-Asymptotic Analysis of Least Squares Estimators based on Compressed Data generated from 1-Bit Quantization, Sketching, or their Combination
In the era of big data, the extreme number of samples in a dataset often causes computational challenges even for simple statistical estimation procedures such as least squares regression. Additionally, the size of datasets can also be problematic when transferring them from their source to a processing location. Lossy compression schemes have been proposed to overcome these difficulties while controlling the error in subsequent analysis steps. In this presentation, we analyze least squares regression when using data compressed by three compression algorithms: 1-bit quantization, random sketching, and their combination, sketching followed by 1-bit quantization. Both asymptotic distributions and non-asymptotic bounds on the estimation error are studied in settings with fixed and random design. We conclude with a discussion on the optimality of these results and possible directions for future work.
Keywords: Data Compression, Quantization, Sketching, High-Dimensional Statistics
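As a minimal illustration of the sketching branch (the 1-bit quantization step, which requires its own correction, is omitted), the following toy compares least squares on Gaussian-sketched data with least squares on the full data; it is a generic sketch, not the estimators analyzed in the talk.

import numpy as np

rng = np.random.default_rng(3)
n, p, m = 10_000, 10, 500            # full sample size, dimension, sketch size

X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# Gaussian sketch: compress (X, y) from n rows down to m rows
S = rng.normal(size=(m, n)) / np.sqrt(m)
Xs, ys = S @ X, S @ y

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
print(np.linalg.norm(beta_sketch - beta_full))   # small relative to ||beta||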
The Influence Function of Graphical Lasso Estimators
The precision matrix that encodes conditional linear dependency relations among a set of variables forms an important object of interest in multivariate analysis. Sparse estimation procedures for precision matrices such as the graphical lasso (Glasso) gained popularity as they facilitate interpretability by separating pairs of variables that are conditionally dependent from those that are conditionally independent (given all other variables). Glasso lacks, however, robustness to outliers. To overcome this problem, one typically applies a robust plug-in procedure where the Glasso is computed from a robust covariance estimate instead of the sample covariance, thereby providing protection against outliers. In this talk, we study such estimators theoretically, by deriving and comparing their influence functions, sensitivity curves, and asymptotic variances.
Keywords: Graphical Lasso; Influence Function; Precision Matrix
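The robust plug-in procedure studied in this talk can be sketched with standard tools: an MCD covariance estimate fed into the graphical lasso (here via scikit-learn). The influence-function analysis itself is theoretical and not shown.

import numpy as np
from sklearn.covariance import MinCovDet, empirical_covariance, graphical_lasso

rng = np.random.default_rng(4)
n, p = 500, 10
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
X[:25] += 8.0                     # a few casewise outliers

# classical plug-in: Glasso on the sample covariance
_, prec_classical = graphical_lasso(empirical_covariance(X), alpha=0.1)

# robust plug-in: Glasso on an MCD covariance estimate
cov_robust = MinCovDet(random_state=0).fit(X).covariance_
_, prec_robust = graphical_lasso(cov_robust, alpha=0.1)

print(np.round(prec_classical[:3, :3], 2))
print(np.round(prec_robust[:3, :3], 2))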
Application of Sign Depth to Point Process Models
An important task in reliability studies is the lifetime testing of systems consisting of dependent or interacting components. Since the fatigue of a composite material is largely determined by the failure of its components, its risk of breakdown can be linked to the observable component failure times. These failure times form a simple point process that has numerous applications also in econometrics and epidemiology, among others.
A powerful tool for modeling simple point processes is the stochastic intensity, which can be thought of as the instantaneous average rate for the occurrence of an event. Here, this event represents the failure of a system component. Under a random time change based on the cumulative intensity, any such point process can be transformed into a unit-rate Poisson process with exponential interarrival times. If we consider a parametric model for the stochastic intensity, we can perform this transformation for each parameter to obtain so-called hazard transforms. As soon as the parameter deviates from its true value, these transforms will generally no longer follow a standard exponential distribution. At this point, familiar goodness-of-fit tests such as the Kolmogorov-Smirnov test can be applied.
However, viewing the transforms as “residuals”, data depth approaches commonly encountered in the regression context can be considered as an alternative. In particular, the 3-sign depth test provides a much more powerful generalization of the classical sign test. We show here the consistency of the 3-sign depth test for a wide range of hypotheses. To do so, we impose local monotonicity assumptions on the residuals and – since the 3-sign depth depends heavily on the order of the observations – introduce a random total order for hazard transforms.
The major benefit of data depth methods lies in their inherent robustness, for instance in the presence of contaminated data due to measurement errors or unexpected external influences. This robustness often entails a drop in power of the associated test. In a final simulation study, we therefore compare the 3-sign depth test with competing approaches in terms of power and robustness, and find that satisfactory results can still be achieved even if almost half of the data is contaminated.
Keywords: Data Depth, Point Process Models, Random Time Change, Hazard Function, Stochastic Intensity
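The random-time-change step can be illustrated with a short simulation: event times from a power-law intensity are transformed through the cumulative intensity, and the resulting interarrival times are compared with a standard exponential distribution by a Kolmogorov-Smirnov test. The 3-sign depth test itself is not reproduced in this sketch.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# power-law intensity lambda(t) = theta * beta * t**(beta - 1),
# cumulative intensity Lambda(t) = theta * t**beta
theta, beta, T = 2.0, 1.5, 50.0

# simulate by inverting the time change: unit-rate Poisson arrivals mapped back
gaps = rng.exponential(size=5000)
arrivals_unit = np.cumsum(gaps)
arrivals_unit = arrivals_unit[arrivals_unit < theta * T**beta]
t = (arrivals_unit / theta) ** (1.0 / beta)          # event times of the process

def hazard_transform(times, th, be):
    """Interarrival times of the time-changed process; Exp(1) at the true parameter."""
    Lam = th * times**be
    return np.diff(np.concatenate(([0.0], Lam)))

# at the true parameter the KS test should not reject;
# at a misspecified parameter it will typically reject
print(stats.kstest(hazard_transform(t, theta, beta), "expon"))
print(stats.kstest(hazard_transform(t, theta, 1.0), "expon"))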
Multivariate sign test for sphericity: robustness to the presence of skewness and dependent observations
The problem of testing that a shape parameter is a multiple of the identity is considered when the sample is drawn from a distribution with elliptical directions. This setting encompasses both cases where some skewness is present and cases where the i.i.d. hypothesis no longer holds. In the elliptical directions setting, the existing sphericity test based on the multivariate signs of the observations is valid if the location parameter is specified when constructing the multivariate signs. In practice, the location parameter needs to be estimated, and the asymptotic validity of the test will depend on the asymptotic cost of this estimation. This asymptotic cost is studied and we show under what conditions the resulting tests are asymptotically valid. We also study the local asymptotic power of the test and its asymptotic optimality properties in this new setting, showing that the multivariate sign test is not only extremely robust but also enjoys very high asymptotic power under heavy-tailed data-generating processes.
Keywords: Robust Sign-Based Inference, Semiparametric Inference, Optimal Testing, Multivariate Analysis, Asymptotic Statistics
10:15 – 10:30 | Atrium & Jajodia Auditorium
Coffee Break
10:30 – 11:30 | Jajodia Auditorium (ENGR 1101) Keynote Address by Dr. Mia Hubert Casewise and Cellwise Robust Dimension Reduction
Mia Hubert is a professor at KU Leuven, Department of Mathematics, Section of Statistics and Data Science. Her research focuses on robust statistics, outlier detection, data visualization, depth functions, and the development of statistical software. She is an elected fellow of the ISI and has served as associate editor for several journals such as JCGS, CSDA, and Technometrics. She is co-founder and organizer of The Rousseeuw Prize for Statistics, a biennial prize that rewards pioneering work in statistical methodology.
11:30 – 18:00 | (Separate ticket)
Excursion
Wednesday, July 31
8:00 – 9:00 | Atrium & Jajodia Auditorium
Breakfast
9:00 – 10:30 | ENGR 1107 S13 — Modern Data Analytics For High-Dimensional And Complex Data
Organized by Anand Vidyashankar | Chaired by Claudio Agostinelli
Deep Learning with Penalized Partially Linear Cox Models
Partially linear Cox models have gained popularity for survival analysis by dissecting the hazard function into parametric and nonparametric components, allowing for the effective incorporation of both well-established risk factors (such as age and clinical variables) and emerging risk factors (e.g., image features) within a unified framework. However, when the dimension of parametric components exceeds the sample size, the task of model fitting becomes formidable, while nonparametric modeling grapples with the curse of dimensionality. We propose a novel Penalized Deep Partially Linear Cox Model (Penalized DPLC), which incorporates the SCAD penalty to select important texture features and employs a deep neural network to estimate the nonparametric component of the model. We prove the convergence and asymptotic properties of the estimator and compare it to other methods through extensive simulation studies, evaluating its performance in risk prediction and feature selection.
Keywords: Deep Learning, Cox Model, Penalized Regression
Post-Shrinkage Strategies in Partially Linear Models for High-Dimensional Data Application
In this talk, we present post-shrinkage strategies for the regression parameters of semiparametric models. The regression parameter vector is partitioned into two sub-vectors: the first sub-vector gives the predictors of interest, i.e., main effects (treatment effects), and the second sub-vector is for variables that may or may not need to be controlled. We establish both theoretically and numerically that the proposed shrinkage strategy, which combines two semiparametric estimators computed for the full model and the submodel, outshines the full model estimation. A data example is given to show the usefulness of the strategy in practice.
We extend this strategy to high-dimensional data (HDD) analysis. For HDD analysis, many penalized methods have been introduced for simultaneous variable selection and parameter estimation when the model is sparse. However, a model may have sparse signals as well as several predictors with weak signals. In this scenario, variable selection methods may not distinguish predictors with weak signals from sparse signals. We propose a high-dimensional shrinkage strategy to improve the prediction performance of a submodel. We demonstrate that the proposed high-dimensional shrinkage strategy performs uniformly better than penalized and machine learning methods in many cases. We numerically appraise the relative performance of the proposed strategy. Some open research problems will be discussed as well.
Reference: S. Ejaz Ahmed, Feryaal Ahmed and B. Yuzbasi (2023). Post-Shrinkage Strategies in Statistical and Machine Learning for High Dimensional Data. CRC Press, USA.
Keywords: Post Shrinkage Strategies; Penalized Methods; Machine Learning; Bias And Prediction Errors
Duplicates in Prior Authorization Data: Uncovering the Prevalence, Implications, and Strategies for Mitigating Privacy Risks
A prior authorization (PA), also known as a pre-certification or pre-approval, is a process used to determine a patient’s eligibility to receive a prescription medication or medical service. A typical PA dataset contains patient, provider, and insurance information. Often, a PA dataset also includes the physician name, national provider identifier (NPI), physician address, drug name, international classification of diseases (ICD) code, insurance name, insurance type, date of service, PA start/end date, PA volume, PA status, approval rate, denial rate, or turnaround times. A common problem with the PA dataset is the presence of duplicate records.
According to the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, patient data must be de-identified for secondary use cases that adhere to the HHS guidelines. Estimating the risk of re-identifying a patient based on PA volumes depends on the information included in the PA dataset. Geographical locations for the physician can increase the likelihood of identifying a patient’s location, while PA dates could link back to a patient. Drug names can identify a vulnerable demographic (i.e., gender or age group) in a population who are more likely to use the drug for a specific disease or condition. These issues render many of the existing risk metrics challenging to use routinely.
In this presentation, we describe new metrics that allow us to estimate the risk of re-identification and the role of prevalence. We also provide uncertainty assessments for these metrics. We illustrate our results with several numerical experiments and theoretical analysis.
Keywords: Prior Authorization, De-Identification, Disease Prevalence, Risk Metrics, Hipaa
9:00 – 10:30 | ENGR 1108 S14 — Robust Methods In Selected Regression-Based Inference
Organized and chaired by Xuming He
A simulation-based estimation framework for robust parametric models
Computing estimators that are robust to data contamination is non-trivial. Indeed, to be consistent, these estimators typically rely on a non-negligible correction term with no closed-form expression. Numerical approximation of this term can introduce finite sample bias, especially when the number of parameters p is relatively large compared to the sample size n. To address these challenges, we propose a simulation-based bias correction framework, which allows us to easily construct robust estimators with reduced finite sample bias. The key advantage of the proposed framework is that it bypasses the computation of the correction term in the standard procedure. The resulting estimators also enjoy consistency and asymptotic normality, and can be obtained computationally efficiently even when p is relatively large compared to n. The advantages of the method are highlighted in different simulation studies, such as logistic regression and negative binomial regression models. We also observe empirically that our estimators are comparable, in terms of finite sample mean squared error, to classical maximum likelihood estimators under no data contamination.
Keywords: Simulation Based Estimation, Bias Correction, Generalized Linear Models
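The general mechanism of simulation-based bias correction can be conveyed with a deliberately simple toy example, an iterative-bootstrap-style calibration of a normal scale parameter from an uncalibrated initial statistic; this only illustrates the idea and is not the robust GLM estimators discussed in the talk.

import numpy as np

rng = np.random.default_rng(6)

def initial_estimator(x):
    # a simple but uncalibrated (biased for sigma) initial statistic:
    # the raw median absolute deviation, without any consistency factor
    return np.median(np.abs(x - np.median(x)))

# observed data: N(0, sigma^2) with sigma = 2
sigma_true, n = 2.0, 500
x_obs = rng.normal(0.0, sigma_true, n)
pi_obs = initial_estimator(x_obs)

# iterative bootstrap: find sigma such that the average simulated initial
# statistic matches the observed one
sigma, B = pi_obs, 200
for _ in range(20):
    sims = np.array([initial_estimator(rng.normal(0.0, sigma, n)) for _ in range(B)])
    sigma = sigma + pi_obs - sims.mean()

print(pi_obs, sigma)   # pi_obs underestimates sigma; the corrected value is close to 2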
A joint-modeling quantile regression approach for functional coefficients estimation and latent group identification
In recent years, substantial evidence of pleiotropy has been uncovered, where a single gene or genetic variation affects multiple phenotypes. Understanding the complex genetic effect that impacts various biological functions is crucial for unraveling the etiology of complex diseases. Though it is well-known that genetic effects vary with age, modeling these time-dependent curves is challenging because genetic variants are typically measured once, and longitudinal measurement of gene expression levels over the years is often impractical. To estimate the time-dependent genetic effect across multiple responses, we propose a quantile regression-based modeling approach to simultaneously estimate the coefficients of genetic variants as functions of time and identify latent group structures of those functions. We approximate the unknown functional coefficients by B-splines and develop the asymptotic theory for the coefficient curve estimation and latent group identification. In an application to UK Biobank data on postmenopausal women, our investigation sheds light on distinct patterns in how genes related to body mass index influence lipid traits and body measurements at different quantile levels.
Keywords: Time-Varying Genetic Effect; Multi-Response Modeling; Latent Group Classification; B-Splines Estimation
Expected Shortfall Regression under Heavy Tailedness
Expected Shortfall (ES), also known as superquantile or Conditional Value-at-Risk, has been recognized as an important measure in risk analysis and stochastic optimization. In this talk, we discuss the estimation and inference for conditional ES regression, which complements the well-known quantile regression in several aspects. A particular challenge for ES regression arises when the response or error distribution is heavy-tailed. To address this, we discuss the effectiveness of a two-step method that integrates quantile regression and adaptive Huber regression under three settings: linear model, sparse linear model, and nonparametric model with a hierarchical compositional structure. Theoretical analyses provide useful insights and guidance for selecting the appropriate hyperparameter in the Huber loss function across different models.
Keywords: Expected Shortfall; Quantile Regression; Huber Loss; Heavy-Tailed Distribution; Finite-Sample Theory
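One common two-step construction from the expected shortfall regression literature can be sketched as follows: quantile regression in the first step, then a Huber-type regression on a surrogate response in the second. The exact estimator, tuning, and theory in this talk may differ.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(7)
n, alpha = 5000, 0.1

x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)   # heavy-tailed errors
X = sm.add_constant(x)

# Step 1: conditional alpha-quantile via quantile regression
qfit = sm.QuantReg(y, X).fit(q=alpha)
qhat = qfit.predict(X)

# Step 2: surrogate response for the conditional expected shortfall,
# Z_i = qhat_i + (Y_i - qhat_i) 1{Y_i <= qhat_i} / alpha, then a Huber-type fit
z = qhat + (y - qhat) * (y <= qhat) / alpha
es_fit = HuberRegressor(max_iter=500).fit(x.reshape(-1, 1), z)
print(es_fit.intercept_, es_fit.coef_)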
9:00 – 10:30 | ENGR 1110 S15 — Regularization And Dimension Reduction
Organized by Andreas Alfons | Chaired by Peter Rousseeuw
Cost-sensitive Logistic Regression Ensembling
In many applications the cost of wrong decisions is not symmetric. Therefore, it makes sense to take the costs associated with wrong decisions into account in the decision process to minimize the risks for stakeholders. To achieve this, cost-sensitive methods have been developed, such as cost-sensitive logistic regression [Bahnsen et al., 2014]. Moreover, for high-dimensional data ensemble models often yield a much better performance than a single (sparse) model. However, ensembles of a large number of models are difficult to interpret. To combine the interpretability of a single model with the performance of ensemble models, the split-learning framework has been developed by Christidis et al. [2020, 2021]. In this talk we use the split-learning framework to introduce a diverse ensemble of cost-sensitive logistic regression models. This yields an ensemble that is interpretable with a low misclassification cost. To solve the non-convex optimization problem, a novel algorithm based on the partial conservative convex separable quadratic approximation is developed. The proposed method delivers outstanding savings, as demonstrated through extensive simulation and real-world applications in fraud detection and gene expression analysis.
Keywords: Classification, Logistic Regression, Split Regression, Cost-Sensitive Modeling
Robust PCA and explainable outlier detection for multivariate functional data based on a functional Mahalanobis distance
Outlier detection in multivariate functional data poses significant challenges due to the complex and high-dimensional nature of the data. In this study, we propose a new approach for explainable outlier detection utilizing Shapley values in conjunction with a truncated functional Mahalanobis semi-distance introduced in Galeano et al. [2015], thus focusing on capturing meaningful deviations while mitigating the influence of noise and irrelevant variations in the data.
Calculating the truncated functional Mahalanobis distance involves the estimation of the covariance, which can be very biased in the presence of outliers in the data. To ensure robustness, we incorporate an adaptation of the matrix minimum covariance determinant (MMCD) estimator introduced in Mayrhofer et al. [2024] for matrix-variate data to robustly estimate the functional covariance, and demonstrate the validity of the procedure for multivariate Gaussian processes. Additionally, robust covariance estimation leads to a robust functional principal component analysis.
Finally, Shapley values are employed to decompose the truncated Mahalanobis distance into contributions of individual features of the detected outliers, offering interpretability in the detection process, where the feature type depends on how we express the functional data object.
The effectiveness and interpretability of the proposed method are demonstrated in both simulated and real-data scenarios. In particular, the approach is applied to fertility curves calculated for various countries, where it was revealed that one can crudely group the studied countries based on when a sudden drop in fertility was observed. The results generally showcase the method’s ability to identify outliers in multivariate functional data while providing valuable insights into the underlying patterns contributing to anomalous observations.
Keywords: MMCD, FDA, PCA, Shapley Values, Outlier Detection
Robust Estimation and Inference in Categorical Data
Many empirically relevant variables are inherently categorical, such as questionnaire responses, customer satisfaction ratings, or counting processes. Just like their continuous counterpart, categorical variables can be subject to contamination, for instance inattentive responses to questionnaire items, which are well-known to jeopardize the validity of survey-based research (e.g. Meade & Craig, 2012). However, nearly all methods in robust statistics are designed for continuous variables and may fail to be effective or even computable in categorical data.
I propose a general framework for robustly estimating statistical functionals in models for possibly multivariate categorical data (Welz, 2024). The proposed estimator generalizes maximum likelihood estimation, is strongly consistent, asymptotically Gaussian, and of the same time complexity as maximum likelihood estimation (MLE). Notably, the estimator achieves robustness despite having the same influence function as MLE, and is therefore fully asymptotically efficient at the true model. In addition, I develop a novel test of whether a categorical observation can be fitted well by the assumed model, thereby conceptualizing the notion of an outlier in categorical data.
I verify the attractive statistical properties of the proposed methodology in extensive simulation studies and demonstrate its practical usefulness in various empirical applications. In particular, I show how it can be used to robustify fundamental analyses of psychometric rating scales such as factor models or scale reliability against inattentive responding. As such, an empirical application on a popular questionnaire instrument reveals compelling evidence for the presence of inattentive responding whose adverse effects the proposed estimator can withstand, unlike MLE. A user-friendly software implementation is provided through the R package robcat (ROBust CATegorical data analysis).
Keywords: Robust Statistics, Categorical Variables, Careless Responding, Asymptotic Normality, Multivariate Statistics
9:00 – 10:30 | ENGR 1103 S16 — Efficient Sampling And Analysis For Big And High-Dimensional Data
Organized and chaired by Nicholas Rios
Distributed Learning for Longitudinal Image-on-Scalar Regression
Neuroimaging studies are becoming increasingly important in medical fields as the prevalence of Alzheimer’s disease rises. Our study is motivated by investigating brain activities and exploring the relationship between varying scalar predictors and different brain regions over time through longitudinal image-on-scalar regression. However, the difficulty lies in the complexity of brain regions, the sparsity of time measurements for each subject, and the massive computational costs associated with analyzing large-scale imaging data. To address these issues, we propose an individual growth path model based on image-on-scalar regression in the nonparametric functional data analysis framework. The bivariate penalized spline over triangulation (BPST) method is used to handle the irregular domain of brain images for estimating the coefficient function. We propose a novel approach to parallel computing that utilizes Hilbert space-filling curve-based domain decomposition on BPST (HBDB) to reduce the computational time. The proposed nonparametric varying coefficient functions in both the BPST and HBDB methods are proven to be asymptotically normal and consistent under some regularity conditions. The proposed methods are evaluated through extensive simulation studies and analyses of studies in the Alzheimer’s Disease Neuroimaging Initiative.
Keywords: Bivariate Penalized Splines; Domain Decomposition; Longitudinal Data Analysis; Sampling; Large-Scale Data Analysis
Ladle Statistics: A New Criterion for Variable Selection
This talk introduces “Ladle Statistics,” an innovative framework developed to address the inherent complexities and challenges in variable selection within modern scientific datasets. Traditional methods often fall short in accurately identifying significant variables due to the dense and multifaceted nature of contemporary data. By integrating aspects of established variable selection methodologies, “Ladle Statistics” aims to provide a more robust and effective approach, enhancing both the interpretability and accuracy of statistical models.
The novelty of “Ladle Statistics” lies in its unique blend of criteria-based and data-driven approaches. This synthesis not only mitigates the limitations traditionally associated with these methods but also introduces a new dimension of flexibility and precision in model construction and analysis. The framework has been meticulously tested across various settings, including both linear environments and Sparse Additive Models (SAMs), showcasing its versatility and effectiveness.
Furthermore, “Ladle Statistics” leverages bootstrap methods to assess the stability of models in the context of variable selection. This approach enhances the robustness of the selection process by providing insights into the reliability and consistency of the results across different samples. By evaluating model stability in a generalized way, “Ladle Statistics” ensures that the variable selection is not only effective but also reliable in various data environments.
In conclusion, “Ladle Statistics” offers a structured approach to variable selection that adapts well to the complexities of modern datasets. The use of bootstrap methods for evaluating model stability contributes to its practicality, ensuring consistent and reliable variable selection across different studies. This talk will further explore the implications of this method in simplifying the challenges faced in statistical analysis.
Keywords: Variable Selection, Statistical Methods, Ladle Statistics, Model Complexity
Information-Based Optimal Subdata Selection for Clusterwise Linear Regression
Mixture-of-Experts (MoE) models are commonly used when there exist distinct clusters with different relationships between the independent and dependent variables. Fitting such models for large datasets, however, is computationally virtually impossible. An attractive alternative is to use subdata selected by “maximizing” the Fisher information matrix. A major challenge is that no closed-form expression for the Fisher information matrix is available for such models. Focusing on clusterwise linear regression models, a subclass of MoE models, we develop a framework that overcomes this challenge. We prove that the proposed subdata selection approach is asymptotically optimal, i.e., no other method is statistically more efficient than the proposed one when the full data size is large.
Keywords: D-Optimality; Information Matrix; Latent Indicator; Massive Data; Mle
Robust Estimation of the High-Dimensional Precision Matrix
Estimating the precision matrix (inverse of the covariance matrix) in high-dimensional settings is crucial for various applications, such as Gaussian graphical models and linear discriminant analysis. In this paper, we propose a robust and computationally efficient estimator for high-dimensional heavy-tailed data. Building upon winsorized rank-based regression, our method offers robustness without sacrificing computational tractability. We establish the statistical consistency of our estimator, with a focus on conditions required for the error variance estimator in winsorized rank-based regression. For sub-Gaussian data, the sample variance meets the criteria, while for heavy-tailed data, a robust variance estimator based on the median-of-means approach is proposed. Simulation studies and real data analysis show that the proposed method performs well compared with existing works.
Keywords: Precision Matrix Estimation; High-Dimensional Data; Heavy-Tailed Distributions; Robustness.
10:30 – 10:45 | Atrium & Jajodia Auditorium
Coffee Break
10:45 – 11:45 | Jajodia Auditorium (ENGR 1101) Keynote Address by Dr. Gabriela Cohen Freue Outliers or Reformists? Robust Estimators in Genomics Studies
Gabriela Cohen Freue is an associate professor in the Department of Statistics at the University of British Columbia (UBC). She joined the Department as a Canada Research Chair-II in Statistical Proteomics in 2012. She earned her Ph.D. in Statistics from the University of Maryland at College Park. Her contributions to Biostatistics started during her postdoctoral studies, under the supervision of Drs. Rubén Zamar and Raymond Ng, collaborating with Biomarkers in Transplantation and the PROOF Centre of Excellence at UBC. Her research interests are interdisciplinary, merging components from statistics, computational genomics, and medical sciences. She has developed robust estimators for the analysis of complex data, focusing on the analysis of -omics data to advance the identification and validation of biomarker solutions for various diseases.
11:45 – 13:00 | (Not provided)
Lunch break
13:00 – 14:30 | ENGR 1107 S17 — Robustness And Data Science In Biomedicine
Organized and chaired by Bill Rosenberger
A Robust Framework for Semiparametric Analysis of Temporal Covariate Effects with Right-Censored Data
We propose a robust framework for evaluating temporal covariate effects on potentially right-censored failure time data. All or a subset of the covariates are allowed to have completely unspecified time-varying effects on the hazard or odds of failure. We develop likelihood-based estimation and inference procedures. Such procedures are smoothing-free and statistically efficient. Furthermore, the proposed survival function estimator is guaranteed to be non-increasing. We establish the asymptotic properties of the proposed inference procedures. Simulation studies demonstrate that the proposed methods perform well in practical settings. Applications to a breast cancer study and a diabetes study are provided.
Keywords: Cox Proportional Hazards Model; Proportional Odds Model; Semiparametric Efficiency; Time-Varying Coefficient Model
With Random Regressors, Linear Regression is Robust to Arbitrary Correlation Among the Errors
Linear regression, which assumes fixed regressors and independent error terms in the linear model, is arguably the most important statistical method in empirical research. We show that with random regressors, linear regression is robust to arbitrary correlation among the error terms. Due to the correlated errors, the standard proof of the central limit theorem does not apply, and we prove the asymptotic normality of the t-statistics by establishing the Berry–Esseen bound on self-normalized statistics. We further study the local power of the t-test and show that with random regressors, correlation of the errors can even increase the power in the regime of weak signals. Overall, our robustness theory demonstrates that linear regression is applicable more broadly than the existing theory suggests.
Keywords: Random Design; Randomized Experiment; Self-Normalization.
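A small Monte Carlo check of this phenomenon (not taken from the paper) is easy to run: with random regressors and strongly autocorrelated errors, the rejection rate of the nominal 5% t-test for a zero slope stays close to 5%.

import numpy as np

rng = np.random.default_rng(8)
n, reps, rho = 200, 2000, 0.9
reject = 0

for _ in range(reps):
    # strongly AR(1)-correlated errors
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()
    x = rng.normal(size=n)            # random regressor, independent of the errors
    y = e                             # null model: slope = 0
    X = np.c_[np.ones(n), x]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    reject += abs(beta[1] / se) > 1.96

print(reject / reps)   # close to the nominal 0.05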
Sparse Heteroskedastic PCA in High Dimensions
Principal component analysis (PCA) is one of the most commonly used techniques for dimension reduction and feature extraction. Though high-dimensional sparse PCA has been well studied, little is known when the noise is heteroskedastic, which turns out to be ubiquitous in many scenarios, such as single-cell RNA sequencing (scRNA-seq) data and information network data. We propose an iterative algorithm for sparse PCA in the presence of heteroskedastic noise, which alternately updates the estimates of the sparse eigenvectors using orthogonal iteration with adaptive thresholding in one step, and imputes the diagonal values of the sample covariance matrix to reduce the estimation bias due to heteroskedasticity in the other step. Our procedure is computationally fast and provably optimal under the generalized spiked covariance model, assuming the leading eigenvectors are sparse. A comprehensive simulation study shows its robustness and effectiveness under various settings. Additionally, the application of our new method to two high-dimensional genomics datasets, i.e., microarray and scRNA-seq data, demonstrates its ability to preserve inherent cluster structures in downstream analyses.
Keywords: Heteroskedasticity; Principal Component Analysis; Orthogonal Iteration; Matrix Denoising.
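The diagonal-imputation step can be sketched as follows, in the spirit of the description above but without the adaptive thresholding used for sparsity; it is a plain illustration, not the authors' algorithm.

import numpy as np

def diag_imputed_pca(X, r, n_iter=50):
    """Leading r eigenvectors of the covariance with an iteratively imputed diagonal."""
    S = np.cov(X, rowvar=False)
    M = S.copy()
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(M)
        U = vecs[:, -r:]                        # top-r eigenvectors
        low_rank = U @ np.diag(vals[-r:]) @ U.T
        np.fill_diagonal(M, np.diag(low_rank))  # impute diagonal, keep off-diagonal of S
    return U

# toy heteroskedastic spiked model
rng = np.random.default_rng(9)
n, p, r = 400, 50, 2
U_true = np.linalg.qr(rng.normal(size=(p, r)))[0]
F = rng.normal(size=(n, r)) * 3.0
noise = rng.normal(size=(n, p)) * rng.uniform(0.2, 3.0, size=p)   # heteroskedastic noise
X = F @ U_true.T + noise

U_hat = diag_imputed_pca(X, r)
# cosines of the principal angles between estimated and true subspaces (close to 1 = good)
print(np.linalg.svd(U_hat.T @ U_true, compute_uv=False))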
13:00 – 14:30 | ENGR 1108 S18 — Robust Techniques For Statistical Analysis
Organized and chaired by Tim Verdonck
Incorporating Survey Weights into Robust Fitting of Multinomial Models
Questionnaires often include items on a rating or Thurstone scale with responses taking a discrete value. These types of data are typically modeled using a multinomial distribution, but contamination, e.g., by observations from a multinomial distribution with a different parameter or by an unusually high frequency of certain categories, poses a challenge for traditional estimators. In this work we therefore consider the problem of robust and efficient estimation of parameters in the multinomial model. We extend the E-estimator for the binomial model proposed by Ruckstuhl & Welsh (2001) to the multinomial distribution to obtain robust, efficient and reliable estimation and inference. Specifically, letting f_n represent the observed frequencies and p_π the probability from the multinomial distribution given the parameter vector π, the E-estimator is constructed by the Huberized relative entropy of the ratio f_n/p_π. We further extend the estimator by incorporating survey weights to adjust for complex sampling designs. We discuss the results of a simulation study demonstrating the failure of the maximum likelihood estimator when contamination exists and compare the performance of our proposed estimator to two other estimators: the Hellinger distance estimator and the negative exponential estimator.
Keywords: Multinomial Distribution, E-Estimator, Survey Weight, Sampling Design
Multivariate Robust Continuum Regression Through Numerical Optimization
Latent variable regression techniques occupy a unique niche in the set of modeling techniques for multivariate regression, because they combine the regularization properties and the interpretability of dimension reduction techniques with the predictive power of a regression model. Continuum regression is a latent variable regression method that allows one to span a continuum between ordinary least squares and principal component regression, while encompassing partial least squares along the path. Both univariate and multivariate versions exist. However, as they are based on optimization criteria that involve classical variances and covariances, these models are sensitive to outliers.
A robust continuum regression method (RCR) has existed in the univariate setting for over a decade, yet it was never generalized to the multivariate setting. This paper is the first to close this gap by introducing multivariate robust continuum regression (MRCR).
Moreover, the new method presented in this work offers advantages beyond the introduction of a multivariate version of RCR. The original RCR was defined according to an optimization criterion that uses hard trimmed (co)variances, which is not continuously differentiable. Therefore, the original RCR was computationally challenging. In the original publication, the optimization was solved through a brute force projection pursuit. This optimization was later improved upon through the introduction of the “grid” algorithm. While an improvement, the grid algorithm can still be computationally prohibitive in big data settings.
Recently, soft sorting has been proposed. Soft sorting is a transformation of a data vector based on numerical optimization with a tuneable softness parameter. As the softness parameter approaches zero, soft sorting becomes equivalent to hard sorting, whereas when it approaches infinity, the effect of soft sorting vanishes. Since soft sorting itself is based upon a continuously differentiable optimization criterion (except for the limit cases), robust soft trimmed estimators can now be constructed that are entirely based upon continuously differentiable optimization criteria. Therefore, they can be computed efficiently by state-of-the-art numerical optimizers, such as sequential least squares quadratic programming (SLSQP). Moreover, the result becomes virtually equivalent to the result from hard sorted estimators for low values of the softness parameter. The seminal publication on soft sorting already applied it in the context of least trimmed squares. This idea was recently further investigated and expanded into a soft loss function for neural networks.
In this work, robust multivariate continuum regression is formulated through an optimization criterion based on soft sorting. The resulting estimator is then calculated through numerical optimization. Finally, the usefulness of the new method is illustrated in an example.
Keywords: Multivariate Robust Regression; Robust Continuum Regression; Robust Dimension Reduction; Soft Trimming.
Is Distance Correlation Robust?
Distance correlation is a popular measure of dependence between random variables. It has some robustness properties, but not all. In this talk we demonstrate that the influence function of the usual distance correlation is bounded, but that its breakdown value is zero. Moreover, we show that its unbounded sensitivity function converges to the bounded influence function as the sample size increases.
To address this sensitivity to outliers, we construct a more robust version of distance correlation, which is based on a new data transformation called “biloop”. Simulations indicate that the resulting method is quite robust and has good power in the presence of outliers. We illustrate the new method using the biloop transformation on genetic data. Comparing the classical distance correlation with its more robust version provides additional insight into the dependencies between variables in a data set.
Keywords: Breakdown Value, Independence Testing, Influence Function, Robust Statistics.
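The sensitivity of the usual sample distance correlation to a single outlier is easy to reproduce with a plain numpy implementation; the biloop transformation itself is not shown here.

import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of two univariate samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # double-centered distance matrices
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar2x, dvar2y = (A * A).mean(), (B * B).mean()
    return np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2x * dvar2y))

rng = np.random.default_rng(10)
x = rng.normal(size=200)
y = rng.normal(size=200)                 # independent of x

print(distance_correlation(x, y))        # near 0
x[0], y[0] = 100.0, 100.0                # one far outlier in both coordinates
print(distance_correlation(x, y))        # jumps towards 1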
13:00 – 14:30 | ENGR 1110 S19 — Divergence/Minimum Distance Methods
Organized and chaired by Claudio Agostinelli
Divergence-based methods for Bayesian Robust Estimation in Controlled Branching Processes
Branching processes are mathematical models that describe the evolution of systems whose elements (cells, particles, individuals, etc.) produce new ones in such a way that the transition from one state of the system to another is made according to probability laws. The standard branching model, known as the Galton-Watson process, is characterized by the fact that each individual, independently of everyone else, gives birth to a random number of descendants according to a probability law that is the same for all the individuals and then disappears from the population.
This work is focused on the controlled branching process, a generalization of the standard model that incorporates into the probability model a random control mechanism which determines the number of progenitors in each generation.
The behaviour of these processes is strongly related to their main parameters. In practice, these values are unknown and their estimation is necessary.
This talk is concerned with Bayesian inferential methods for the parameters of interest in controlled branching processes that account for model robustness through the use of disparities. We assume that the offspring distribution belongs to a very general one-dimensional parametric family and the sample given by the entire family tree up to some generation is observed. In this setting, we define the D-posterior density, a density function which is obtained by replacing the log-likelihood in Bayes rule with a conveniently scaled disparity measure. The expectation and mode of the D-posterior density, denoted as EDAP and MDAP estimators, respectively, are proposed as Bayes estimators for the offspring parameter, emulating the point estimators under the squared error loss function or under 0-1 loss function, respectively, for the posterior density. Under regularity conditions, we establish that EDAP and MDAP estimators are consistent and efficient under the posited model. Additionally, we show that the estimates are robust to model misspecification and presence of aberrant outliers. To this end, we develop several fundamental ideas relating minimum disparity estimators to Bayesian estimators built on the disparity-based posterior, for dependent tree-structured data.
We illustrate the methodology through a simulated example and apply our methods to a real data set from cell kinetics. The results presented in this talk have recently been published in González et al. (2021).
References
González, M., Minuesa, C., del Puerto, I. & Vidyashankar, A.N. (2021) Robust Estimation in Controlled Branching Processes: Bayesian Estimators via Disparities. Bayesian Analysis, 16 (3), 1009 – 1037.
Keywords: Branching Process, Controlled Process, Disparity Measures, Robustness, Bayesian Inference.
Minimum Divergence Estimation for Compressed Regression
Big data and streaming data are encountered in a variety of applications in business and industry, and data compression is typically used in these settings for enhancing privacy and reducing operational costs. In these situations, it is common to use sketching and random projections to reduce the dimension of the data, yielding compressed data. These data, however, possess anomalies such as heterogeneity, outliers, and round-off errors which are hard to detect due to volume and processing challenges. In this presentation, we describe a new robust and efficient minimum divergence estimator (MDE) to analyze the compressed data in a high-dimensional regression model. Specifically, we evaluate the prediction efficiency and residual efficiency of the MDE relative to least-squares estimators derived from uncompressed data. Using large sample theory and numerical experiments, we also demonstrate that routine use of the proposed robust methods is feasible in these contexts.
Keywords: Compressed Regression; Minimum Divergence Estimator; Robust Inference; Random Projection; Statistical Efficiency
Divergences, Large deviation and Monte Carlo constrained optimization
In a first section we show that Csiszar divergences between two signed distributions on a finite set can be identified with Large Deviation rates for sequences of random vectors which can be explicitly simulated, with distribution depending on the generator of the divergence. This opens the door to stochastic optimization of the divergence over general sets of vectors with non-void interior and quite general smoothness. The class of divergences sharing this representation is quite large, since the generator of the divergence is assumed to be the Fenchel-Legendre transform of some moment generating function, a property which is shared by most divergences commonly used in the Csiszar class. This property also allows for the definition of new divergences. Specific versions of this result lead to the optimization of various entropies under general constraints.
As a second by-product, this development allows one to handle the minimization of a Csiszar divergence between a probability vector and a model, seen as a subset of the simplex with non-void interior. Explicit algorithms can also be produced when projecting an empirical measure pertaining to a sample onto such a model, hence defining an explicit construction for the estimation of an adequacy index between a distribution and a model. The relation to Importance Sampling improvements will be discussed.
The second part of the talk pertains to the extension of the above results for divergences which are not of Csiszar type; we will exhibit similar constructions for general Bregman divergences, with corresponding algorithms for their minimization under smooth constraints.
The large deviation principle, or its corresponding formulation for Bregman divergences, can be used together with Varadhan's Lemma for the minimization of general continuous functions of K variables on subsets with non-void interior, under general smoothness assumptions. Algorithms approximating both the minimal value of the function and the set of its minimizers will be discussed.
As a concluding section, a simple application to a neural network context in supervised image classification will be presented, where the above methods are applied to the minimization both of the l1 norm of a parameter vector in dimension 400,000 and of the number of its nonzero coefficients, under a constraint pertaining to the quality of the recovery of the classes on a test sample.
References
Broniatowski, M. and Stummer, W. (2023) A precise bare simulation approach to the minimization of some distances. I. Foundations. IEEE Trans. Inform. Theory, 69(5), 3062–3120.
Broniatowski, M. and Stummer, W. (2024) A precise bare simulation approach to the minimization of some distances. II. Further foundations. arXiv:2402.08478.
Bertrand, P., Broniatowski, M., and Stummer, W. (2024) Optimizing Neural Networks through Bare Simulation search. arXiv.
Keywords: Divergences, Constrained Optimisation, Large Deviation, Simulation
Large deviation asymptotics for minimum Hellinger distance estimators
Minimum divergence estimators have been widely used as alternatives to maximum likelihood methods. While the asymptotic distributions of these estimators have been well-studied, their large deviation properties are largely unknown. Our primary objective is to study the large deviation probabilities for Hellinger distance estimators under a potential model misspecification, both in one and in higher dimensions. We show that these rare event probabilities decay exponentially, and describe their decay via a large deviation “rate function.” Interestingly, we exhibit an asymmetry in the characterizations of the upper and lower bounds, where certain geometric considerations arise in the study of the lower bound. We also discuss extensions to sharp large deviation bounds, facilitating saddlepoint approximations to minimum divergence estimators.
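For orientation, a minimal one-dimensional sketch of a minimum Hellinger distance estimator under a normal location model, assuming a kernel density estimate and a grid approximation of the Hellinger affinity; the bandwidth, grid, and contamination level are illustrative.

```python
# Minimal sketch of a one-dimensional minimum Hellinger distance estimator:
# a kernel density estimate is matched to a N(mu, 1) model by maximizing the
# Hellinger affinity over mu on a grid approximation of the integral.
import numpy as np
from scipy.stats import norm, gaussian_kde
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 190), rng.normal(8.0, 1.0, 10)])  # 5% outliers

grid = np.linspace(-6, 12, 2001)
dx = grid[1] - grid[0]
f_hat = gaussian_kde(x)(grid)                    # nonparametric density estimate

def neg_affinity(mu):
    f_mod = norm.pdf(grid, loc=mu, scale=1.0)    # posited N(mu, 1) model density
    return -np.sum(np.sqrt(f_hat * f_mod)) * dx  # minus the Hellinger affinity

mhde = minimize_scalar(neg_affinity, bounds=(-5, 10), method="bounded").x
print(f"MHDE: {mhde:.3f}  vs sample mean: {x.mean():.3f}")
```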
Keywords: Hellinger Distance; Large Deviations; Divergence Measures; Rare Event Probabilities.
13:00 – 14:30 | ENGR 1103 S20 — Integrating Machine Learning Into Health Analytics, Space-Time Prediction, And Visualization
Organized and chaired by Huixia Judy Wang
ISM: A New Space-Learning Model for Heterogeneous Multi-View Data Reduction, Visualization and Clustering
We describe a new approach for integrating multiple views of data into a common latent space using non-negative tensor factorization (NTF). This approach, which we refer to as the “Integrated Sources Model” (ISM), consists of two main steps: embedding and analysis. In the embedding step, each view is transformed into a matrix with common non-negative components. In the analysis step, the transformed views are combined into a tensor and decomposed using NTF. Noteworthy, ISM can be extended to process multi-view data sets with missing views. We illustrate the new approach using two examples: the UCI digit dataset and a public cell type gene signatures dataset, to show that multi-view clustering of digits or marker genes by their respective cell type is better achieved with ISM than with other latent space approaches. We also show how the non-negativity and sparsity of the ISM model components enable straightforward interpretations, in contrast to latent factors of mixed signs. Finally, we present potential applications to single-cell multi-omics and spatial mapping, including spatial imaging and spatial transcriptomics, and computational biology, which are currently under evaluation. ISM relies on state-of-the-art algorithms invoked via a simple workflow implemented in a Jupyter Python notebook.
Keywords: Principal Component Analysis, Non-Negative Matrix Factorization, Non-Negative Tensor Factorization, Multi-View Clustering, Multi-Dimensional Scaling
Knowledge-Guided Statistical Machine Learning for Longitudinal Biomedical Studies
In large epidemiological studies, a comprehensive analysis of a disease process often includes several statistical sub-models with time-varying risk factors (i.e., functional covariates) and time-to-event disease outcomes. A useful prediction model for the disease should be constructed by incorporating all the influential covariates and functional features (i.e., the "history") of the longitudinal risk factors, so that meaningful clinical interpretations for the disease prediction model can be obtained. Existing statistical machine learning methods lack a systematic approach for incorporating all the influential functional covariates and sub-models in a clinically meaningful way. We develop a "knowledge-guided statistical machine learning" (KGSML) procedure to construct a statistical model for predicting the time-to-event and/or disease outcomes with time-varying risk factors. This KGSML procedure sequentially combines flexible statistical machine learning methods, such as nonparametric regression, spline-based subject-specific best linear unbiased prediction (BLUP), random survival forest, survival regression models and variable selection. Applications of this KGSML procedure to two landmark epidemiological studies of NHLBI, the Coronary Artery Risk Development in Young Adults (CARDIA) Study and the NHLBI Growth and Health Study (NGHS), demonstrate that early treatment of cardiovascular risk factors during young adulthood or even the adolescent period could dramatically reduce the risk of incident cardiovascular disease (CVD) in midlife. This finding leads to novel insights into the potential benefit of early primary prevention strategies for reducing the global CVD burden. We finally discuss some future research directions of statistical machine learning and other statistical methods for exploratory and confirmatory studies in biomedical research.
Keywords: B-Spline; BLUP; Cardiovascular Disease; Longitudinal Data; Machine Learning
Spatio-temporal DeepKriging for Interpolation and Probabilistic Forecasting
Gaussian processes (GP) and Kriging are widely used in traditional spatio-temporal modeling and prediction. These techniques typically presuppose that the data are observed from a stationary GP with a parametric covariance structure. However, processes in real-world applications often exhibit non-Gaussianity and nonstationarity. Moreover, likelihood-based inference for GPs is computationally expensive and thus prohibitive for large datasets. In this paper, we propose a deep neural network (DNN) based two-stage model for spatio-temporal interpolation and forecasting. Interpolation is performed in the first step, which utilizes a dependent DNN with the embedding layer constructed with spatio-temporal basis functions. For the second stage, we use Long Short-Term Memory (LSTM) and convolutional LSTM to forecast future observations at a given location. We adopt the quantile-based loss function in the DNN to provide probabilistic forecasting. Compared to Kriging, the proposed method does not require specifying covariance functions or making stationarity assumptions and is computationally efficient. Therefore, it is suitable for large-scale prediction of complex spatio-temporal processes. We apply our method to monthly PM2.5 data at more than 200,000 space–time locations from January 1999 to December 2022 for fast imputation of missing values and forecasts with uncertainties.
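The quantile-based loss mentioned above is the standard pinball loss; a minimal, framework-agnostic sketch follows, with illustrative toy forecasts (the DNN and LSTM architectures of the paper are not reproduced).

```python
# Minimal sketch of the quantile ("pinball") loss used to turn a point-forecast
# network into a probabilistic forecaster: one output column per target quantile.
import numpy as np

def pinball_loss(y_true, y_pred, quantiles):
    """Average pinball loss; y_pred has one column per quantile."""
    y_true = np.asarray(y_true)[:, None]
    q = np.asarray(quantiles)[None, :]
    diff = y_true - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

# Toy usage: three quantile forecasts for five observations.
y = np.array([1.0, 2.0, 0.5, 3.0, 2.5])
pred = np.column_stack([y - 0.8, y, y + 0.8])   # crude 10%/50%/90% forecasts
print(pinball_loss(y, pred, quantiles=[0.1, 0.5, 0.9]))
```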
Keywords: Deep Learning, Feature Embedding, Long Short-Term Memory Forecasting, Quantile Machine Learning, Spatio-Temporal Modeling
Responsible Machine Learning for Health Disparity Decomposition
The field of health disparity studies is garnering increasing attention, and health disparity decomposition has emerged as a crucial tool for examining these disparities. The Peters-Belson approach, a notable method in this area, generates pseudo-outcomes for minority groups as if they were part of the majority population. Despite the benefits of this approach, its utility is limited by an inability to effectively elucidate the impact of interventions. As machine learning algorithms have permeated various sectors, ensuring these algorithms are responsible is imperative. In this presentation, I will share my recent endeavors in developing responsible machine learning algorithms for health disparity studies. These algorithms are aimed at predicting pseudo-outcomes for minority populations and clarifying the influence of interventions. Specifically, I will demonstrate how these models can accurately forecast the efficacy of a policy or intervention, based on data from the deployment of different interventions, thereby contributing to a more equitable and informed approach to health policy.
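A hedged sketch of the Peters-Belson idea the abstract builds on: a model fitted in the majority group generates pseudo-outcomes for the minority group, and the observed disparity is split into explained and unexplained parts. The linear model, simulated data, and variable names are illustrative, not the presenter's algorithm.

```python
# Peters-Belson-style decomposition (illustrative): fit in the majority group,
# predict pseudo-outcomes for the minority group, decompose the mean disparity.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 2000
group = rng.integers(0, 2, n)                 # 0 = majority, 1 = minority
x = rng.normal(0, 1, (n, 3)) + 0.5 * group[:, None]
y = x @ np.array([1.0, -0.5, 0.3]) + 0.4 * group + rng.normal(0, 1, n)

maj, minr = group == 0, group == 1
model = LinearRegression().fit(x[maj], y[maj])
pseudo = model.predict(x[minr])               # minority outcomes "as if majority"

total = y[maj].mean() - y[minr].mean()
explained = y[maj].mean() - pseudo.mean()     # part attributable to covariates
unexplained = pseudo.mean() - y[minr].mean()  # residual disparity
print(total, explained, unexplained)
```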
Keywords: Health Disparity Decomposition, Intervention, Machine Learning, Peters-Belson Approach
14:30 – 14:45 | Atrium & Jajodia Auditorium
Coffee Break
14:45 – 16:15 | ENGR 1107 S21 — Advances In Robust Statistics
Organized and chaired by Marianthi Markatou
Estimation and model selection for robust finite mixtures of Tukey’s g-and-h distributions with application to single-cell protein data
Finite mixture distributions are popular statistical models, which are especially meaningful when the population of interest may include distinct subpopulations. This work is motivated by the analysis of single-cell protein expression levels quantified using immunofluorescence immunohistochemistry assays of breast cancer tissues. It is biologically plausible to assume that, due to tumor heterogeneity, different subpopulations of cancer cells have different expression levels for the protein of interest. The observed distributions of single-cell protein expression levels in one breast cancer tissue often exhibit multimodality, skewness and heavy tails; there is substantial variability between distributions in different tissues from different subjects, and some of these mixture distributions include components consistent with an assumption of a normal distribution.
To accommodate such diversity, we propose a mixture of 4-parameter Tukey's g-and-h distributions for fitting finite mixtures with Gaussian and non-Gaussian components. Tukey's g-and-h distribution is a flexible model that allows variable degrees of skewness and kurtosis in mixture components, including the normal distribution as a particular case. The likelihood of Tukey's g-and-h distribution does not have a closed analytical form. Therefore, we propose a Quantile Least Mahalanobis Distance (QLMD) estimator for the parameters of Tukey's g-and-h mixtures. QLMD is an indirect estimator minimizing the Mahalanobis distance between the sample and model-based quantiles, and its asymptotic properties and robustness follow from the general theory of indirect estimation. We also developed a stepwise algorithm to select a parsimonious Tukey's g-and-h mixture model with a given or unknown number of components. A simulation study was conducted to evaluate the finite sample performance of the QLMD estimator, with and without model selection, for estimating parameters of Tukey's g-and-h mixture models with various levels of complexity. We also compared Tukey's g-and-h mixtures to previously considered finite mixtures of continuous non-Gaussian distributions.
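For reference, a minimal sketch of the 4-parameter Tukey g-and-h quantile function used as the mixture component, assuming the standard parameterization with location A, scale B, skewness g and tail parameter h; a quantile-based estimator such as QLMD matches these model quantiles to sample quantiles.

```python
# Tukey g-and-h quantile function: g controls skewness, h tail heaviness;
# g = 0 gives a symmetric component and g = h = 0 recovers the normal case.
import numpy as np
from scipy.stats import norm

def tukey_gh_quantile(p, A=0.0, B=1.0, g=0.0, h=0.0):
    z = norm.ppf(p)
    skew = z if g == 0 else (np.exp(g * z) - 1.0) / g
    return A + B * skew * np.exp(h * z**2 / 2.0)

p = np.linspace(0.01, 0.99, 99)
print(tukey_gh_quantile(p, A=2.0, B=1.5, g=0.4, h=0.1)[:5])
```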
We applied Tukey’s g-and-h mixture for modeling distributions of cell-level Cyclin D1 protein expression in breast cancer tissue using the full or reduced 2-component mixture models as well as the model selected with the optimal number of components. The larger estimated component mean was evaluated as a predictor of progression-free survival (PFS) in breast cancer patients. It was demonstrated that better prognostic value is achieved using the parsimonious Tukey’s g-and-h mixture obtained using the proposed model selection algorithm.
Keywords: Finite Mixture Distribution, Indirect Estimator, Quantile Least Mahalanobis Distance Estimator, Single-Cell Protein Data, Cancer Biomarker
Exploring the Robustness of Kernel-Based Quadratic Distance Tests
In the realm of statistical analysis, the reliability of test procedures is paramount for drawing accurate inferences from the data. Central to this effort is that the data, and the models applied to understand them, accurately reflect the underlying distributions. However, the presence of outliers and of model misspecification stands as an important challenge, possibly leading to erroneous conclusions.
These challenges underscore the necessity for robust statistical test procedures that can withstand the influence of outliers and model misspecification, retaining their testing power and accuracy across different scenarios.
Kernel-based methods are critical tools in non-parametric statistics, particularly valued for their flexibility in modeling complex data structures across multiple dimensions, frequently encountered in fields ranging from bioinformatics to machine learning. Specifically, kernel-based quadratic distances have been effectively used for constructing goodness-of-fit tests. The asymptotic distributions of the tests under the null and alternative hypotheses have also been studied.
This research embarks on an exploration of the robustness properties of kernel-based quadratic distance tests, previously framed within testing for normality and the two-sample and k-sample problems, and constructed using the concept of matrix distance. These tests are examined to assess their sensitivity to outliers and their capacity to maintain power and accuracy despite data contamination.
Our approach starts with normality tests, to detect deviations from Gaussian distributions, a common assumption in many statistical analyses. This exploration sets the stage for subsequent analyses in more complex two-sample and k-sample contexts.
To illustrate the practical implications and performance of these tests, we employ a series of simulation studies and real data applications, showcasing the behavior of the kernel-based quadratic distance tests across different scenarios. The simulation studies are designed to replicate a wide array of real-world data conditions, thereby providing a thorough assessment of the tests’ performance metrics. Concurrently, the application to real data sets serves to validate the theoretical findings and to demonstrate the practical utility of the proposed methods over traditional approaches.
By addressing the robustness of kernel-based quadratic distance tests, this work will aid researchers in conducting more reliable and accurate statistical tests, particularly in environments where traditional assumptions of normality and homogeneity may be violated.
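For illustration, one common form of a kernel-based quadratic distance between two samples is the V-statistic below with a Gaussian kernel; this generic version is not the specific centered-kernel statistics studied in the talk, and the bandwidth and data are illustrative.

```python
# Generic kernel-based quadratic distance between two samples with a Gaussian
# kernel; the population version vanishes only when the two distributions
# coincide (for a characteristic kernel).
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth**2))

def kernel_quadratic_distance(x, y, bandwidth=1.0):
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(4)
x = rng.normal(0, 1, (200, 2))
y = rng.normal(0.5, 1, (200, 2))
print(kernel_quadratic_distance(x, y))
```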
Keywords: Goodness-Of-Fit, High-Dimensional Data, Kernel-Based Quadratic Distance Tests, Outliers Resistance, Statistical Software.
Robust estimation for Torus data
Robust estimation of wrapped models for multivariate circular data (torus data) is considered, based on the weighted likelihood methodology. Robust model fitting is achieved by a set of estimating equations based on the computation of data-dependent weights aimed at down-weighting anomalous values, such as unexpected directions that do not share the main pattern of the bulk of the data. Asymptotic properties and robustness features of the estimators under study are presented.
Keywords: Asymptotics, Computational Methods, Influence Function, Weighted Likelihood
14:45 – 16:15 | ENGR 1108 S22 — Recent Development Of Data Depth And Its Statistical Applications
Organized and chaired by Hyemin Yeon
Geodesic depth based robust PCA on flag manifolds
It is well known that Principal Component Analysis (PCA) is susceptible to outliers and can fail even in the presence of a single contaminated data point. Extensive work has been done in the literature to create variations of PCA that are resilient to outliers.
In this talk, we propose a geodesic depth function and introduce a robust variation of PCA as the deepest point of a collection of “weakly concentrated” principal axes over the real flag variety. We formulate this as an optimization problem over the real flag manifold to preserve the ordering of principal subspaces. We discuss the asymptotic behaviour and contamination sensitivity of the sample flag median and algorithms to efficiently compute the estimate over the exact Riemannian flag geometry by leveraging the diagonalizable nature of the orthogonal coordinates of flags. We illustrate our approach through simulation studies and real-world applications such as video-foreground separation and anomaly detection.
Keywords: Anomaly Detection; Flag Geometry; Geodesics; PCA; Riemannian Manifold.
Differentially private projection-depth-based medians
Multivariate medians based on projection depth are popular robust location estimates. The propose-test-release framework offers a methodology for developing differentially private versions of robust statistics. We explore how a combination of these two techniques can be used to produce approximately differentially private projection-depth-based medians. Under general distributional assumptions, we quantify both the probability of failing the “test portion” of the algorithm and the accuracy-privacy trade-off. We demonstrate the methodology via the canonical projection depth-based median, and consider its finite sample deviations under different population measures. Specifically, under heavy tailed population measures, we show that the “outlier error amplification” outweighs the cost of privacy.
Keywords: Differential Privacy; Projected Outlyingness; Multivariate Median; Propose-Test-Release
Fast kernel half-space depth for data with non-convex supports
Data depth is a statistical function that generalizes order and quantiles to the multivariate setting and beyond, with applications spanning descriptive and visual statistics, anomaly detection, testing, etc. The celebrated halfspace depth exploits data geometry via an optimization program to deliver invariance, robustness, and non-parametricity. Nevertheless, it implicitly assumes convex data supports and requires exponential computational cost. To tackle multimodality of the distribution, we extend the halfspace depth in a Reproducing Kernel Hilbert Space (RKHS). We show that the obtained depth is intuitive and establish its consistency with provable concentration bounds that allow for homogeneity testing. The proposed depth can be computed using manifold gradients, making it faster than the halfspace depth. The performance of our depth is demonstrated through numerical simulations as well as applications such as anomaly detection on real data and homogeneity testing.
Keywords: Data Depth; Kernel Methods; Gradient Descent.
Robustness of scatter depth
Statistical depth provides robust nonparametric tools to analyze distributions. Depth functions indeed measure the adequacy of distributional parameters to underlying probability measures. Recently, depth notions for scatter parameters have been defined and studied. The robustness properties of this latter depth function remain, however, largely unknown. In this talk, we present several results regarding the robustness of the scatter depth function and its associated scatter median.
Keywords: Scatter Depth, Robustness, Influence Function, Breakdown Point
14:45 – 16:15 | ENGR 1110 S23 — Robustness In Complex Models
Organized by Graciela Boente | Chaired by Jeffrey F. Collamore
Robust estimation of sparse Gaussian graphical models
Let X be a p-variate random vector with a Gaussian distribution with zero mean vector and positive definite covariance matrix Sigma. The precision matrix Omega contains the information about the conditional distribution of every pair of variables given the remaining variables, so, given a sample of X, the estimation of Omega has become a significant statistical problem.
Until a few decades ago, statistical procedures assumed that datasets included many observations of a few and carefully chosen variables. Nowadays, datasets may contain a large number of variables relative to the sample size, bringing along blessings but also curses of dimensionality [Donoho, 2017]. Therefore, in a high-dimensional setting, the estimation of precision matrices faces significant challenges.
To deal with this problem, several procedures for estimating Omega based on regularization have been developed under the assumption that Omega is sparse, such as the graphical lasso (Glasso) proposed by [Friedman et al., 2008]. Glasso is based on the estimation of the covariance matrix Sigma.
It is well known that Glasso is not robust in the presence of contaminated observations, especially under cellwise contamination. We propose [Lafit et al., 2023] the use of a robust covariance estimator based on multivariate Winsorization, in the context of the Tarr–Müller–Weber framework for sparse estimation of the precision matrix of a Gaussian graphical model [Tarr et al., 2016]. Like the Öllerer–Croux precision matrix estimator [Öllerer and Croux, 2015], our proposed estimator attains the maximum finite-sample breakdown point of 0.5 under cellwise contamination. We conduct an extensive Monte Carlo simulation study to assess the performance of our proposal and of the currently existing ones, and find that ours behaves competitively with regard to both the estimation of the precision matrix and the recovery of the graph. We demonstrate the usefulness of the proposed methodology in a real application to a health sciences problem.
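A simplified sketch of the plug-in idea, assuming a univariate winsorization as a stand-in for the multivariate Winsorization estimator of the paper, with the resulting covariance fed to scikit-learn's graphical lasso; the cutoff and penalty are illustrative.

```python
# Plug-in idea (simplified): replace the sample covariance fed to the graphical
# lasso by a covariance computed from winsorized data. The paper uses
# multivariate Winsorization; this univariate version is only a stand-in.
import numpy as np
from scipy.stats import median_abs_deviation
from sklearn.covariance import graphical_lasso

def univariate_winsorize(X, c=2.0):
    med = np.median(X, axis=0)
    mad = median_abs_deviation(X, axis=0, scale="normal")
    return np.clip(X, med - c * mad, med + c * mad)

rng = np.random.default_rng(5)
X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=300)
X[:10] += 10.0                                    # a few contaminated rows

cov_robust = np.cov(univariate_winsorize(X), rowvar=False)
_, precision = graphical_lasso(cov_robust, alpha=0.1)
print(np.round(precision, 2))
```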
Keywords: Gaussian Graphical Model; Precision Matrix; Sparse Robust Estimation; Cellwise Contamination; Winsorization
Toward Robust Large-Scale Spatial Data Science with ExaGeoStat
Spatial data science centers on some fundamental problems such as: 1) Spatial Gaussian likelihood inference; 2) Spatial kriging; 3) Gaussian random field simulations; 4) Multivariate Gaussian probabilities; and 5) Robust inference for spatial data. These problems become very challenging tasks when the number of spatial locations grows large. Moreover, they are the cornerstone of more sophisticated procedures involving non-Gaussian distributions, multivariate random fields, or space-time processes. Parallel computing becomes necessary for avoiding the computational and memory restrictions associated with large-scale spatial data science applications. In this talk, I will demonstrate how high-performance computing (HPC) can provide solutions to the aforementioned problems using tile-based linear algebra, tile low-rank approximations, as well as multi- and mixed-precision computational statistics. I will introduce ExaGeoStat, and its R version ExaGeoStatR, a powerful software package that can perform exascale (10^18 flops/s) geostatistics by exploiting the power of existing parallel computing hardware, such as shared-memory systems, possibly equipped with GPUs, and distributed-memory systems, i.e., supercomputers. I will then describe how ExaGeoStat can be used toward performing robust large-scale spatial data science.
Keywords: HPC, Large Data Sets, Outliers, Robustness, Spatial Statistics
Distribution-free prediction bands for clustered data with missing responses
Existing methods for clustered data with missing responses often rely on strong model assumptions and are therefore prone to model misspecification. We construct covariate-dependent prediction bands for new subjects at specific points or trajectories for such data, without making any assumptions about the model specification or the within-cluster dependence structure. The proposed methods are based on conformal inference combined with subsampling and appropriate weighting to accommodate within-cluster correlation and the missing-data mechanism, thereby ensuring a coverage guarantee in finite samples. To provide an asymptotic conditional coverage guarantee for each given subject, we further propose predictions based on the highest posterior density region of the target, which is more accurate under complex error distributions, such as asymmetric and multimodal distributions. Simulation studies show that our methods have better finite-sample performance under various complex error distributions compared to other alternatives. We will further demonstrate the proposed method's practical usage through analyses of serum cholesterol and CD4+ cell data sets.
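For orientation, a vanilla split-conformal prediction interval for i.i.d. data, the basic ingredient the proposed clustered, missing-response method builds on; the subsampling and weighting steps of the talk are not included, and the model and data are illustrative.

```python
# Vanilla split-conformal prediction interval: fit on one half of the data,
# calibrate residual quantiles on the other half, report a finite-sample band.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 1000
x = rng.uniform(-2, 2, (n, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(0, 0.3, n)

idx = rng.permutation(n)
train, calib = idx[: n // 2], idx[n // 2 :]
model = LinearRegression().fit(x[train], y[train])

alpha = 0.1
scores = np.abs(y[calib] - model.predict(x[calib]))   # calibration residuals
k = int(np.ceil((1 - alpha) * (len(calib) + 1)))
q = np.sort(scores)[k - 1]                            # conformal quantile

x_new = np.array([[0.5]])
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```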
Keywords: Clustered Data; Missing At Random; Conformal Prediction; Conditional Coverage; Covariate Shift
14:45 – 16:15 | ENGR 1103 S24 — Recent Development On Urban Analytics
Organized and chaired by Abolfazl Safikhani
From Pixels to Pathways: Bridging Informational Gaps in Urban Analytics
While cities worldwide are promoting streets and public spaces that prioritize pedestrians over vehicles, significant data gaps continue to challenge the analysis and assessment of pedestrian infrastructure in urban settings. This imbalance in data collection and use not only misrepresents the needs of urban dwellers but also propagates existing inequalities in resource allocation, reinforcing the disparities that urban scientists and planners seek to resolve.
Moreover, despite the growing attention to urban data analysis, there is a gap between the real needs of researchers and practitioners directly studying urban problems and the urban analysis tools being developed. My research aims at addressing both issues by designing practical tools, frameworks, and systems for extracting critical built-environment data and providing a human-centered approach to analyzing extensive, spatiotemporally rich urban datasets of diverse types and scales, including images, video, sound, geometry, and tabular data. In this talk, I will focus on novel computer vision methods for the extraction of important built-environment features at scale to tackle data scarcity. This line of my research seeks to promote inclusive and accessible public spaces, enabling cities around the world to conduct comprehensive analyses of their pedestrian infrastructure through a threefold understanding of where sidewalks are, how they are connected, and what condition they are in.
Through these research efforts, we aim to bridge the data gap in urban analytics, providing researchers and practitioners with robust tools to analyze and understand urban infrastructure, ultimately contributing to more equitable and sustainable urban environments.
Keywords: Urban Science, Computer Vision, Geo-Spatial Analysis, Open-Source Software, Data Democratization.
Self-organized, globally optimal, and real-world urban transportation networks — Similarities and discrepancies
Transportation networks form the backbones of our cities. However, transportation planners typically follow a top-down (vision-led) strategy or expand the network in a rather ad hoc way, without sufficiently taking into account people’s bottom-up, self-organizing mobility behavior. This negligence may result in inefficient transportation systems, congestion, and fragmented urban neighborhoods.
In this work, we introduce a data-driven modeling framework to study the similarities and discrepancies among i) self-organized, ii) globally optimal, and iii) real-world transportation networks. The minimal model for the self-organized network is motivated by the bottom-up formation of human trail networks. It balances the human preference for taking the shortest path and the propensity to share popular route segments with others (e.g., for increased travel speed and comfort, access to amenities), thus encapsulating the inherent tendencies of navigation and social behavior. The globally optimal transportation network is obtained by constructing a two-layer model that minimizes the total travel time for a fixed infrastructure cost, using link width (reflecting the travel speed on a given path segment) as the decision variable. In the lower layer, individuals will choose the minimal travel time path between their origin and destination, while the role of the upper layer is to minimize the objective function, i.e., the total travel time summed over all trips. Both network models are derived purely based on origin-destination (OD) trips of urban populations and do not consider existing transportation systems. Subsequently, the most straightforward network comparison follows from a geometric perspective. In addition, we apply principles from optimal transport theory to analyze and comprehend the geometry of functional spaces.
Taking Singapore as a case study, we apply our framework to a large mobile phone GPS dataset, from which we extract 3.9 million OD trips covering one month in 2019. These OD trips are then used to derive the self-organized and globally optimal networks, which are subsequently compared to aggregated OD trip data from Singapore’s Mass Rapid Transit (MRT) system. Our results show that the three networks have similar geometric silhouettes (as measured by the Hausdorff distance). The existing MRT network in Singapore is more similar to the self-organized network than to the globally optimal network. In other terms, the costs for modifying the existing system into the self-organized network are lower than those for modifying it into the globally optimal network.
The insights gained from understanding the similarities and discrepancies among the networks provide a first step toward designing more human-centric transportation networks, instead of building globally optimal networks that might miss people's actual needs.
Keywords: City Science, Mobility, Transport Networks
Parcel-level prediction of future land-use changes using a machine learning-based dynamic spatiotemporal modeling
Over time, rural and urban areas have developed due to natural and human activities. Some changes in land use, such as converting agricultural land to residential areas, are irreversible. Once the topsoil is removed during land conversion, the land loses its fertility and cannot be used for agriculture in the future. Therefore, predicting future land use changes at fine geographical units is essential for urban planners and policymakers because such predictions provide vital information about the potential land development dynamics. Land Use Change (LUC) models are commonly preferred to predict future land developments. These methods reveal complex relationships in land developments, which are essential for predicting future conditions. Modeling land use changes involves methodological and computational challenges. Complexities in urban systems require researchers to incorporate such complex relationships in their models while increasing data sizes introduces computational challenges. Machine learning (ML) and deep learning (DL) algorithms such as Random Forest (RF) and Artificial Neural Networks (ANN) offer computationally feasible approaches to develop large-scale LUC models while accounting for complex non-linear relationships (Kim et al., 2022; Ron-Ferguson et al., 2021; Soares-Filho et al., 2013; Talukdar et al., 2021; Zhai et al., 2020).
The main goal of this study is to introduce a dynamic LUC modeling framework for predicting future land developments without requiring manual modification of the covariate matrix, using ML and DL algorithms such as RF and ANN approaches. Once a LUC model is developed, experts must design the covariate matrix to predict future land developments. Such a requirement limits implementation opportunities of LUC models for predicting future land use changes. Our modeling framework utilizes spatial and temporal dependencies to mitigate this issue and improve prediction accuracy.
In this research, the introduced modeling framework is applied to the Florida Parcel Database, provided by the UF GeoPlan Center. The statewide parcel database consists of approximately 9 million unique parcels and contains information about parcel geometry, actual construction years of buildings, land uses, sale records, and the number of buildings. Historical land development conditions are generated using the actual construction year information, and significant parcel and neighborhood characteristics are retrieved from the data set. Historical records can be traced back to 1650, with the most recent year being 2019. In this study, we focus on the period between 1900 and 2019. Future land use changes are estimated based on RF- and ANN-based LUC models. We train all our models using the UF HiPerGator 3.0 supercomputer. Further, model training is substantially accelerated by utilizing GPU parallel processing. Using the modeling framework, future land use changes at the parcel level are predicted for every year until the target year of 2040.
Keywords: Land-Use Change; Predictions; Spatio-Temporal Modeling; Machine Learning; Florida
18:00 – 22:00 | (Separate ticket)
Conference Dinner at Draper’s Steakhouse & Seafood
Address:
3936 Blenheim Blvd
Fairfax, VA 22030
Thursday, August 01
7:30 – 8:30 | Atrium & Jajodia Auditorium
Breakfast
8:30 – 10:00 | ENGR 1107 S25 — Statistical Optimal Transport
Organized and chaired by Claudio Agostinelli
Learning Gaussian Mixtures Using the Wasserstein-Fisher-Rao Gradient Flow
Gaussian mixture models form a flexible and expressive parametric family of distributions that has found use in a wide variety of applications. Unfortunately, fitting these models to data is a notoriously hard problem from a computational perspective. Currently, only moment-based methods enjoy theoretical guarantees while likelihood-based methods are dominated by heuristics such as Expectation-Maximization that are known to fail in simple examples. In this work, we propose a new algorithm to compute the nonparametric maximum likelihood estimator (NPMLE) in a Gaussian mixture model. Our method is based on gradient descent over the space of probability measures equipped with the Wasserstein-Fisher-Rao geometry for which we establish convergence guarantees. In practice, it can be approximated using an interacting particle system where the weight and location of particles are updated alternately. We conduct extensive numerical experiments to confirm the effectiveness of the proposed algorithm compared not only to classical benchmarks but also to similar gradient descent algorithms with respect to simpler geometries. In particular, these simulations illustrate the benefit of updating both weight and location of the interacting particles. This is based on joint work with Kaizheng Wang and Philippe Rigollet.
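A hedged sketch in the spirit of the alternating particle updates described above, assuming a known bandwidth, a one-dimensional toy sample, and illustrative step sizes: locations take a scaled gradient step on the log-likelihood and weights take a multiplicative mirror step on the simplex; this is not the authors' implementation.

```python
# Alternating particle updates (illustrative, not the authors' code): the
# location step moves atoms along the likelihood gradient (Wasserstein part);
# the weight step is an exponentiated-gradient update (Fisher-Rao part).
import numpy as np

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])   # toy data
sigma = 1.0                        # known component standard deviation
m = 50                             # number of particles (mixture atoms)
mu = rng.choice(x, size=m)         # particle locations
w = np.full(m, 1.0 / m)            # particle weights

def comp_dens(mu):
    # n x m matrix of N(x_i; mu_j, sigma^2) densities
    z = (x[:, None] - mu[None, :]) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))

eta_loc, eta_w = 0.5, 0.5
for _ in range(200):
    phi = comp_dens(mu)                      # (n, m)
    mix = phi @ w                            # mixture density at each data point
    resp = phi * w / mix[:, None]            # responsibilities
    # location step: damped EM-style move along the likelihood gradient
    mu = mu + eta_loc * (resp * (x[:, None] - mu[None, :])).sum(0) / (resp.sum(0) + 1e-12)
    # weight step: exponentiated gradient, shifted for numerical stability
    grad_w = (phi / mix[:, None]).mean(0)    # gradient of the average log-likelihood in w
    w = w * np.exp(eta_w * (grad_w - grad_w.max()))
    w /= w.sum()

print(np.round(np.sort(mu[w > 1e-3]), 2))    # surviving atom locations
```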
Keywords: Gaussian Mixture Model, Nonparametric MLE, Wasserstein-Fisher-Rao Geometry, Optimal Transport, Overparameterization
Central Limit Theorems for Smooth Optimal Transport Maps
One of the central objects in the theory of optimal transport is the Brenier map: the unique monotone transformation which pushes forward an absolutely continuous probability law onto any other given law. Several recent works have analyzed the L^2 risk of plugin estimators of Brenier maps, which are defined as the unique Brenier map between density estimates of the underlying distributions. In this work, we show that such estimators enjoy pointwise central limit theorems. These results provide a first step toward the question of performing statistical inference for smooth Brenier maps in general dimension. As a motivating application, we will discuss how these inferential ideas provide first steps toward the question of constructing confidence bands for optimal transport colocalization curves—a robust measure of colocalization for super-resolution microscopy, which was recently proposed by Tameling et al. (2021).
Keywords: Nonparametric Inference, Brenier Map, Kernel Density Estimation, Colocalization
Permuted and Unlinked Monotone Regression in general dimension: an approach based on mixture modeling and optimal transport
Motivated by challenges in data integration and privacy-aware data analysis, we consider the task of learning a map between d-dimensional inputs and d-dimensional noisy outputs, without observing (input, output)-pairs, but only separate unordered lists of inputs and outputs. We develop an easy-to-use algorithm for denoising based on the Kiefer-Wolfowitz nonparametric maximum likelihood estimator (NPMLE) and techniques from the theory of optimal transport.
Keywords: Deconvolution; Estimation Of Transport Maps; Isotonic Regression; Mixture Estimation; Nonparametric Maximum Likelihood
8:30 – 10:00 | ENGR 1108 S26 — Recent Advances In Data Depth
Organized by Davy Paindaveine | Chaired by Germain Van Bever
Distributionally robust halfspace depth
Data depth is a statistical function that measures centrality of an arbitrary point of the space with respect to a probability distribution or a data cloud. The earliest and arguably most popular depth notion – for multivariate data – is the halfspace depth. By exploiting the geometry of data, halfspace depth is fully non-parametric, robust to both outliers and heavy tailed distributions, satisfies the affine-invariance property, and is used in a variety of tasks as a generalization of quantiles in higher dimensions and as an alternative to the probability density. These advantages make halfspace depth vital for many applications: supervised and unsupervised machine learning, statistical quality control, extreme value theory, imputation of missing data, to name but a few.
Halfspace depth can be seen as a stochastic program and is thus not guarded against the optimizer's curse, so that a limited training sample may easily result in poor out-of-sample performance. In the current work, we propose a generalized halfspace depth concept relying on recent advances in distributionally robust optimization, where every halfspace is examined using the respective worst-case distribution in a Wasserstein ball of (small) positive radius centered at the empirical law. This new depth can be seen as a smoothed and regularized classical halfspace depth, which is retrieved as the ball's radius tends to zero.
The distributionally robust halfspace depth inherits most of the main properties of the original halfspace depth and, additionally, enjoys various new attractive features such as continuity and strict positivity beyond the convex hull of the support. We provide numerical illustrations of the new depth and its advantages, and develop some fundamental theory. In particular, we study the upper level sets and the median region including their breakdown properties, as well as consider applications to outlier detection and supervised classification.
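For orientation, a Monte Carlo approximation of the classical empirical halfspace depth by random directions; the distributionally robust version replaces each empirical halfspace probability by a worst case over a Wasserstein ball, which is not reproduced here, and the number of directions is illustrative.

```python
# Monte Carlo approximation of the classical empirical halfspace (Tukey) depth
# using random projection directions.
import numpy as np

def halfspace_depth(point, data, n_dir=5000, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_dir, data.shape[1]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)    # random unit directions
    proj_data = data @ u.T                           # (n, n_dir) projections
    proj_pt = point @ u.T                            # projections of the query point
    # mass of the closed halfspace {x : u'x >= u'point}, minimized over directions
    return (proj_data >= proj_pt).mean(axis=0).min()

rng = np.random.default_rng(8)
X = rng.standard_normal((500, 3))
print(halfspace_depth(np.zeros(3), X))                 # a deep (central) point
print(halfspace_depth(np.array([4.0, 4.0, 4.0]), X))   # a shallow (outlying) point
```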
Keywords: Data Depth, Tukey Depth, Optimal Transport, Wasserstein Distance, Multivariate Median
A novel halfspace depth for dense and sparse functional data
Data depth is a powerful nonparametric tool originally proposed to rank multivariate data from center outward. In this context, one of the most archetypical depth notions is Tukey's halfspace depth. In the last few decades, notions of depth have also been proposed for functional data. However, Tukey's depth cannot be extended to handle functional data because of a degeneracy issue. Here, we propose a new halfspace depth for functional data which avoids degeneracy by regularization. The halfspace projection directions are constrained to have a small reproducing kernel Hilbert space norm. Desirable theoretical properties of the proposed depth, such as isometry invariance, maximality at center, monotonicity relative to a deepest point, and upper semi-continuity, are established. Moreover, the proposed regularized halfspace depth can rank functional data with varying emphasis in shape or magnitude, depending on the regularization. A new outlier detection approach is also proposed, which is capable of detecting both shape and magnitude outliers. It is applicable to trajectories in L2, a very general space of functions that include non-smooth trajectories. Based on extensive numerical studies, our methods are shown to perform well in terms of detecting outliers of different types. Finally, an extension to sparse functional data will be discussed.
Keywords: Functional Data Analysis, Functional Rankings, Infinite Dimension, Outlier Detection, Robust Statistics
Halfspace depth for directional data: Theory and computation
The angular halfspace depth (ahD) is a natural modification of the celebrated halfspace (or Tukey) depth to the setup of directional data. It allows us to define elements of nonparametric inference, such as the median, the inter-quantile regions, or the rank statistics, for datasets supported in the unit sphere. Despite being introduced already in 1987, ahD has never received ample recognition in the literature, mainly due to the lack of efficient algorithms for its computation.
In this talk, we address both the computation and the theory of ahD. First, we express the angular depth ahD in the unit sphere as a generalized (Euclidean) halfspace depth, using a projection approach. That allows us to develop fast exact computational algorithms for ahD in any dimension. Second, we show that similarly to the classical halfspace depth for multivariate data, also ahD satisfies many desirable properties of a statistical depth function. Further, we derive uniform continuity/consistency results for the associated set of directional medians, and the central regions of ahD, the latter representing a depth-based analog of the quantiles for directional data.
Keywords: Directional Data Analysis, Angular Halfspace Depth, Angular Tukey Depth, Nonparametric Analysis, Computational Statistics
8:30 – 10:00 | ENGR 1110 S27 — Robustness In Post-Linkage Data Analysis/Data Integration
Organized and chaired by Martin Slawski
Fay-Herriot Small Area Estimation with Linked Data
In finite population inference, estimates of descriptive quantities such as means or totals are often needed not only for the population as a whole but also for geographical subdivisions or other subsets (domains). In recent years, there has been a growing demand to produce estimates for subsets of populations that we can label as small areas. Area-specific sample sizes are typically not large enough to allow direct estimates to achieve adequate precision in most or all cases. Small area estimation methods are then introduced to complement survey data with auxiliary information from alternative sources such as administrative archives or remote sensing (Rao and Molina, 2015).
Within the vast small area literature, we focus on the Fay-Herriot (FH) model (Fay and Herriot, 1979), which is based on the idea of integrating the survey and the auxiliary information at the area level.
In this research, we assume that the target and the auxiliary variables come from different registers, linked at the population unit level with non-zero linkage error probability, so that units can be wrongly linked across areas. Moreover, we assume that the auxiliary variables register includes an area membership indicator.
We assume that linkage errors are exchangeable and that the probability of wrong links is known.
Linkage errors can affect predictions based on the FH model, since units from different areas can be (wrongly) linked. This can create artificial outliers, leading to an increase in the direct estimates' variances; moreover, linkage-error-contaminated direct estimates can bias the estimation of the slope coefficient in the linear regression model at the core of the FH area-level model (attenuation effect).
We derive a modified Fay-Herriot predictor based on the application of the missing information principle (Chambers et al., 2012). We study its properties using both analytical methods and simulation exercises.
These simulation results show that the modified FH predictor has a markedly lower bias than a naive FH predictor that overlooks the linkage error, and one in line with that of an FH predictor based on full information (without linkage errors). The variances of the modified predictors also improve, which, together with the bias reduction, leads to improved efficiency.
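For reference, a sketch of the standard (naive) FH predictor that the modified, linkage-error-aware predictor is compared against, assuming known sampling variances and a crude moment estimate of the model variance; all data are simulated and illustrative.

```python
# Standard Fay-Herriot (EBLUP-type) predictor: shrink the direct estimate
# toward the regression synthetic estimate with weight gamma_i.
import numpy as np

rng = np.random.default_rng(9)
m = 40                                     # number of small areas
x = np.column_stack([np.ones(m), rng.uniform(0, 1, m)])
D = rng.uniform(0.2, 0.6, m)               # known sampling variances
theta = x @ np.array([1.0, 2.0]) + rng.normal(0, np.sqrt(0.3), m)   # true area means
y = theta + rng.normal(0, np.sqrt(D))                               # direct estimates

beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
resid = y - x @ beta_ols
sigma2_v = max(0.0, (resid @ resid - D.sum()) / (m - x.shape[1]))   # crude moment estimate

w = 1.0 / (sigma2_v + D)                                    # weighted least squares
beta_hat = np.linalg.solve(x.T @ (w[:, None] * x), x.T @ (w * y))

gamma = sigma2_v / (sigma2_v + D)                           # shrinkage weights
theta_fh = gamma * y + (1 - gamma) * (x @ beta_hat)         # FH predictor
print(np.round(theta_fh[:5], 3))
```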
References:
Chambers, R. L., Steel, D. G., Wang, S., & Welsh, A. (2012). Maximum likelihood estimation for sample surveys. CRC Press.
Fay III, R. E., & Herriot, R. A. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a), 269-277.
Rao, J. N., & Molina, I. (2015). Small area estimation. John Wiley & Sons.
Keywords: Maximum Likelihood, Exchangeable Linkage Error, Missing Information Principle, Bias Reduction
Using Entity Resolution to Improve Inward FDI—QCEW Estimates
Statistical agencies increasingly rely on integrating existing data to create new products. One such example is the set of research statistics on employment, wages, and occupations for domestic establishments that are foreign owned: https://www.bls.gov/fdi/. These statistics are based on integrated micro-data that combine the Bureau of Economic Analysis' enterprise-level data on inward Foreign Direct Investment (FDI) with establishment data from the Bureau of Labor Statistics' Quarterly Census of Employment and Wages (QCEW).
Integrating these data sources is particularly challenging because they lack a consistent common identifier. Previous linking efforts involved a great deal of manual intervention. In linking 2012 data, a preliminary link using common identifiers yielded an initial error rate of 87.7%. The error rate was subsequently reduced to 19.0% via manual intervention, but at the cost of 1,510.5 hours of analyst labor.
We use entity resolution to reduce initial linkage error and subsequent labor costs. Entity resolution (also known as record linkage or data linking) augments linking the data sources using identifiers by additionally making use of other features the data sources share in common. These features, such as business names, industrial classification, and employment levels, are similar for true linkages (i.e., for pairs of records from the two data sources that refer to the same underlying entity).
A pipeline of four principal steps is employed. First, the QCEW and FDI data sources are aligned: common features are transformed so they are comparable, and QCEW establishment data are aggregated to better align with the enterprise-level FDI data. The aligned data sources are next indexed (also known as blocking) to reduce the entity resolution search space. Following the indexing step, candidate pairs are formed and supervised machine learning techniques are used to predict whether each candidate pair represents a true linkage, based on how similar the common features of the two data sources are for that pair. Finally, post-classification clustering methods are implemented to refine the predicted linkages and ensure a coherent output.
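A toy sketch of the pipeline shape (align, block, compare, classify) using pandas and a difflib name-similarity feature; the actual FDI-QCEW features, supervised classifier, and clustering step are far richer, and every column, value, and threshold here is illustrative.

```python
# Toy entity-resolution pipeline: normalize names, block on state, compute
# similarity features per candidate pair, and apply a simple decision rule
# (a supervised classifier would be trained on such features in practice).
import pandas as pd
from difflib import SequenceMatcher

fdi = pd.DataFrame({"name": ["ACME STEEL INC", "GLOBEX CORP"],
                    "state": ["VA", "MD"], "employment": [120, 800]})
qcew = pd.DataFrame({"name": ["Acme Steel Incorporated", "Globex Corporation",
                              "Initech LLC"],
                     "state": ["VA", "MD", "MD"], "employment": [118, 790, 45]})

# Align: normalize shared features.  Block: only compare pairs within a state.
for df in (fdi, qcew):
    df["name_clean"] = df["name"].str.lower().str.replace(r"[^a-z0-9 ]", "", regex=True)
pairs = fdi.merge(qcew, on="state", suffixes=("_fdi", "_qcew"))

# Compare: similarity features for each candidate pair.
pairs["name_sim"] = [SequenceMatcher(None, a, b).ratio()
                     for a, b in zip(pairs["name_clean_fdi"], pairs["name_clean_qcew"])]
pairs["emp_gap"] = ((pairs["employment_fdi"] - pairs["employment_qcew"]).abs()
                    / pairs[["employment_fdi", "employment_qcew"]].max(axis=1))
pairs["candidate_match"] = (pairs["name_sim"] > 0.8) & (pairs["emp_gap"] < 0.1)
print(pairs[["name_fdi", "name_qcew", "name_sim", "emp_gap", "candidate_match"]])
```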
Keywords: Entity Resolution, Record Linkage, Data Linking, Blocking, Clustering
Cox Proportional Hazards Regression using Linked Data: an Approach based on Mixture Modeling
Probabilistic record linkage is increasingly used to combine multiple data sources at the level of individual records. While the resulting linked data sets carry significant potential for scientific discovery and informing decisions, data contamination arising from mismatched records due to non-unique or noisy identifiers used for linking is not uncommon. Accounting for such mismatch error in the downstream analysis performed on the linked file is critical to ensure valid statistical inference. Here, we present an approach to enable valid post-linkage inference in the analysis of survival data when survival times reside in one file and covariates in another file. The proposed approach addresses the secondary analysis setting in which only the linked file (but not the two individual files) is given and is built on a general framework based on a mixture model (Slawski et al., 2023). According to that framework, inference can be conducted in terms of composite likelihood and the EM algorithm, enhanced by careful modifications to address the semiparametric nature of the Cox PH model. The effectiveness of the approach is investigated by simulation studies and an illustrative case study.
Keywords: Record Linkage, Data Integration, Linkage Errors, Secondary Analysis
Analysis of Linked Files: A Missing Data Perspective
In many applications, researchers seek to identify common entities across multiple data files. Record linkage algorithms facilitate the identification of overlapping entities, in the absence of unique identifiers. As these algorithms rely on semi-identifying information, they may miss records that represent the same entity, or incorrectly link records that do not represent the same entity. Analysis of linked files commonly ignores such linkage errors, resulting in erroneous estimates of the associations of interest. We view record linkage as a missing data problem, and delineate the different linkage mechanisms that underpin analysis methods with linked files. We group these methods under three broad categories: likelihood and Bayesian methods, imputation methods, and weighting methods. We summarize the assumptions and limitations of the methods, and evaluate their performance in a wide range of simulation scenarios.
Keywords: Record Linkage; Missing Data; Error Adjustment
8:30 – 10:00 | ENGR 1103 S28 — Advances In Statistical Modeling Of Non-Stationary And Complex Spatial Processes
Organized and chaired by Ben Lee
Identifying Geometric Anisotropy in Spatial Random Field
Geometric anisotropy is a convenient way to model stationary spatial data. In this work, we investigate the amount and nature of anisotropy present in a given spatial random field. We propose a minimum $L^2$ distance measure, which is zero in the case of an isotropic covariance structure and positive otherwise. We propose a consistent estimator of this measure and derive its asymptotic distribution under both increasing-domain and infill asymptotic schemes. We use these results to construct a consistent test for the hypothesis of isotropy. We also propose an estimator of the appropriate scaling factor and rotation that make the covariance structure isotropic, obtained by minimizing the proposed distance. The small-sample validity of the proposed methods is investigated using extensive simulations and a data example.
Keywords: Geostatistics, Geometric Anisotropy, Model Validation
Frequency Band Analysis of Nonstationary Multivariate Time Series
Information from frequency bands in biomedical time series provides useful summaries of the observed signal. Many existing methods consider summaries of the time series obtained over a few well-known, pre-defined frequency bands of interest. However, these methods do not provide data-driven methods for identifying frequency bands that optimally summarize frequency-domain information in the time series. A new method to identify partition points in the frequency space of a multivariate locally stationary time series is proposed. These partition points signify changes across frequencies in the time-varying behavior of the signal and provide frequency band summary measures that best preserve the nonstationary dynamics of the observed series. An L2 norm-based discrepancy measure that finds differences in the time-varying spectral density matrix is constructed, and its asymptotic properties are derived. New nonparametric bootstrap tests are also provided to identify significant frequency partition points and to identify components and cross-components of the spectral matrix exhibiting changes over frequencies. Finite-sample performance of the proposed method is illustrated via simulations. The proposed method is used to develop optimal frequency band summary measures for characterizing time-varying behavior in resting-state electroencephalography (EEG) time series, as well as identifying components and cross-components associated with each frequency partition point.
Keywords: Multivariate Time Series, Nonstationary, Frequency Domain, Spectral Matrix, Electroencephalography
Flexible Basis Representations for Modeling Large Non-Gaussian Spatial Data
Nonstationary and non-Gaussian spatial data are common in various fields, including ecology (e.g., counts of animal species), epidemiology (e.g., disease incidence counts in susceptible regions), and environmental science (e.g., remotely-sensed satellite imagery). Due to modern data collection methods, the sizes of these datasets have grown considerably. Spatial generalized linear mixed models (SGLMMs) are a flexible class of models used to model nonstationary and non-Gaussian datasets. Despite their utility, SGLMMs can be computationally prohibitive for even moderately large datasets (e.g., 5,000 to 100,000 observed locations). To circumvent this issue, past studies have embedded nested radial basis functions into the SGLMM. However, two crucial specifications (knot placement and bandwidth parameters), which directly affect model performance, are typically fixed prior to model-fitting. We propose a novel approach to model large nonstationary and non-Gaussian spatial datasets using adaptive radial basis functions. Our approach: (1) partitions the spatial domain into subregions; (2) employs reversible-jump Markov chain Monte Carlo (RJMCMC) to infer the number and location of the knots within each partition; and (3) models the latent spatial surface using partition-varying and adaptive basis functions. Through an extensive simulation study, we show that our approach provides more accurate predictions than competing methods while preserving computational efficiency. We demonstrate our approach on two environmental datasets – incidences of plant species and counts of bird species in the United States.
Keywords: Bayesian Hierarchical Spatial Models; Non-Gaussian Spatial Models; Nonstationary Spatial Processes; Reversible-Jump MCMC; Spatial Basis Functions
Exploring Validity in Multivariate Covariance Functions
In this talk, we tackle the use of Gaussian processes for modeling multivariate random fields, with an emphasis on establishing valid covariance structures. The requirement for these structures to be positive definite often introduces intricate constraints on the parameters. We present streamlined techniques for ensuring multivariate validity across different covariance function types, utilizing multivariate mixtures. Our methods provide both straightforward and comprehensive conditions for validity, underpinned by the theory of conditionally negative semidefinite matrices and the Schur product theorem. Additionally, we investigate the potential of spectral densities in creating valid multivariate models, which opens up new avenues for cross-covariance function development. Our results demonstrate that these approaches generate valid multivariate cross-covariance structures, which maintain important marginal properties. The effectiveness of our strategies is confirmed through numerical experiments and the analysis of real-world data in the realms of earth sciences and climatology.
Keywords: Spatial Statistics; Multivariate Random Fields; Gaussian Processes; Covariance Structures.
10:00 – 10:15 | Atrium & Jajodia Auditorium
Coffee Break
10:15 – 11:15 | ENGR 1107 S29 — Robustness And Variable Selection
Contributed session chaired by Matías Salibián-Barrera
Detection of cellwise outliers in multivariate and functional relative data
In any data processing it is important to be aware of the possible presence of outliers in the data set. Observations that deviate from the model assumptions can seriously affect the results and their interpretability, making outlier detection an important step in the analysis. For multivariate data, it may be convenient to detect outliers at the cell level, i.e. to look for deviations in individual cells of a data matrix. When working with data carrying relative information, i.e. compositional data in the multivariate case and probability density functions in the functional case, the essential information is contained in (log-)ratios, which must also be reflected by any cellwise outlier detection approach. A well-established method for comprehensive outlier detection is the Deviating Data Cells (DDC) algorithm, which allows the search for both rowwise and cellwise outliers. In both cases of relative data, it is therefore necessary to adapt the DDC algorithm to deal with the respective logratio representations. While in the multivariate case this means applying the DDC algorithm to all pairwise logratios and then aggregating the result of outlier detection at the level of the original compositional parts, in the functional case the idea of the DDC algorithm is applied to a spline representation of centred logratio transformed probability density functions (PDFs). Using the information contained in the spline coefficients, it is then possible to highlight parts of the PDFs whose behaviour deviates from the general trend. Theoretical developments are demonstrated using data from a variety of applications.
Keywords: Cellwise Outliers, Rowwise Outliers, Compositional Data, Probability Density Functions, Deviating Data Cells
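As an illustration of the multivariate (compositional) part of the pipeline, the sketch below forms all pairwise logratios, flags deviating cells with a simple robust z-score, and aggregates the flags back to the original parts. The median/MAD filter is only a univariate stand-in for the DDC algorithm (which also exploits correlations between cells to predict expected values), and the cutoff and voting rule are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from itertools import combinations

def flag_compositional_cells(X, cutoff_p=0.99):
    """Flag suspicious cells of a compositional data matrix X (n x D, positive parts).

    Works on all pairwise logratios log(x_j / x_k), robustly standardizes each
    logratio column, and counts, for every part j, how many of its logratios are
    flagged. This mimics the aggregation step described in the abstract, with a
    simple univariate filter in place of DDC.
    """
    n, D = X.shape
    cutoff = np.sqrt(stats.chi2.ppf(cutoff_p, df=1))
    votes = np.zeros((n, D), dtype=int)
    for j, k in combinations(range(D), 2):
        lr = np.log(X[:, j] / X[:, k])
        med = np.median(lr)
        mad = stats.median_abs_deviation(lr, scale="normal")
        outlying = np.abs((lr - med) / mad) > cutoff
        votes[outlying, j] += 1
        votes[outlying, k] += 1
    # Declare a cell outlying if it is implicated in most of its D-1 logratios.
    return votes > (D - 1) / 2

# Toy example with one contaminated cell (assumed data, for illustration only).
rng = np.random.default_rng(1)
X = rng.lognormal(mean=0.0, sigma=0.2, size=(100, 5))
X[0, 2] *= 50.0                      # contaminate a single cell
print(np.argwhere(flag_compositional_cells(X)))
```

In the functional case described above, the same idea operates on the spline coefficients of centred-logratio-transformed densities rather than on pairwise logratios.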
Variable Selection in Regression Models with Dependent and Asymmetrically Distributed Error Term
Variable selection methods have gained enormous attention in statistics, as many fields require reducing prediction error while simultaneously identifying important variables. Although variable selection has been studied extensively in the literature, most methods are limited to the case of independent errors, and many rely on the assumption of a symmetric error distribution. In this study, we combine a regression model with dependent errors with penalty-based variable selection methods to perform estimation and variable selection simultaneously when the innovations follow skewed and heavy-tailed distributions. We conduct a simulation study and provide a real-data example to demonstrate that the proposed method outperforms classical approaches in the presence of outliers.
Keywords: Dependent Errors; EM-Algorithm; Penalized Methods; Skew Distributions
Robust Variable Selection in High-dimensional Nonparametric Additive Model
Additive models belong to the class of structured nonparametric regression models that do not suffer from the curse of dimensionality. Finding the additive components that are nonzero when the true model is assumed to be sparse is an important and well-studied problem. The majority of existing methods rely on the $L_2$ loss function, which is sensitive to outliers in the data. We propose a new variable selection method for additive models that is robust to outliers. The proposed method employs a nonconcave penalty for variable selection and uses B-splines together with the density power divergence loss function for estimation. The loss function produces an M-estimator that down-weights the effect of outliers. The asymptotic results are derived under the sub-Weibull assumption, which allows the error distribution to have an exponentially heavy tail. Under regularity conditions, we show that the proposed method achieves the optimal convergence rate. In addition, our results include the convergence rates for sub-Gaussian and sub-Exponential distributions as special cases. We numerically validate our theoretical findings using simulations and real data analysis.
Let $(Y_i, X_{i1}, \ldots, X_{ip})$, $i=1,2,\ldots,n$, be $n$ independent and identically distributed (i.i.d.) copies of $(Y, X_1, \ldots, X_p)$, where $Y$ is a scalar response and $\{ X_j, j=1,\ldots, p\}$ are covariates. We consider the following model:
$$Y_i = \mu + \sum_{j=1}^p g_j(X_{ij}) + \epsilon_i, \qquad i=1,\ldots,n, \tag{1}$$
where $\epsilon_i$ is the random error with mean zero and finite variance $\sigma^2$, and the $g_j(\cdot)$'s are unknown functions with $E[g_j(X_{ij})]=0$, $j=1,2,\ldots,p$. When $p$ is much smaller than $n$, popular methods for estimation of (1) include backfitting, penalized splines, smooth backfitting, and marginal integration [Buja et al., 1989, Mammen et al., 1999, Linton and Nielsen, 1995].
This study considers a new approach for robust estimation of a high-dimensional additive model using the density power divergence (DPD) with a nonconcave penalty function. Basu et al. [1998] introduced the DPD measure, which produces an M-estimator and improves performance in the presence of outliers and heavy-tailed distributions.
Our contributions are twofold:
1. The proposed method considers the DPD loss function and employs a nonconcave penalty. Given the computational simplicity of the DPD loss function, the proposed method is computationally elegant. Moreover, the DPD estimator is more interpretable than other M-estimators, as it is derived from a distance-like measure.
2. In terms of theory, our results assume the error to be sub-Weibull, which is a more general condition than the sub-Gaussian or the sub-Exponential conditions used in the literature. Our theoretical results are new to the literature on nonparametric additive models.
Keywords: Nonparametric Additive Model; Nonconcave Penalty; M-Estimator; Sub-Weibull; Exponentially Heavy-Tailed.
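The estimation idea can be sketched in a few lines, assuming Gaussian errors, a truncated-power spline basis standing in for B-splines, and a plain ridge penalty in place of the nonconcave penalty used for actual variable selection. The DPD objective below is the standard one for a normal density with tuning parameter alpha; the data, knots, and tuning values are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def spline_basis(x, n_knots=5):
    """Truncated-power cubic spline basis (a simple stand-in for B-splines)."""
    knots = np.quantile(x, np.linspace(0.1, 0.9, n_knots))
    cols = [x, x**2, x**3] + [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def dpd_additive_fit(y, X, alpha=0.3, lam=0.01, n_knots=5):
    """Density-power-divergence fit of an additive model with a spline basis.

    The per-observation DPD objective for a Gaussian error density with mean
    m_i and scale sigma is
        (2*pi)^(-alpha/2) * sigma^(-alpha) / sqrt(1 + alpha)
        - (1 + 1/alpha) * phi(y_i; m_i, sigma)^alpha,
    which down-weights observations with large residuals. A ridge penalty
    replaces the nonconcave penalty of the talk purely for brevity.
    """
    n, p = X.shape
    B = np.column_stack([spline_basis(X[:, j], n_knots) for j in range(p)])
    B = (B - B.mean(axis=0)) / B.std(axis=0)          # standardize basis columns
    k = B.shape[1]

    def objective(par):
        mu, theta, log_sigma = par[0], par[1:1 + k], par[-1]
        sigma = np.exp(log_sigma)
        m = mu + B @ theta
        dens_a = (2 * np.pi * sigma**2) ** (-alpha / 2) * \
                 np.exp(-alpha * (y - m) ** 2 / (2 * sigma**2))
        const = (2 * np.pi) ** (-alpha / 2) * sigma ** (-alpha) / np.sqrt(1 + alpha)
        return np.mean(const - (1 + 1 / alpha) * dens_a) + lam * np.sum(theta**2)

    par0 = np.concatenate([[np.median(y)], np.zeros(k), [np.log(np.std(y))]])
    res = minimize(objective, par0, method="L-BFGS-B")
    return res.x, B

# Toy example: sparse additive signal with a few gross outliers (assumed data).
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 4))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 300)
y[:10] += 8.0                                          # gross outliers
par, B = dpd_additive_fit(y, X)
print("estimated residual scale:", round(float(np.exp(par[-1])), 3))
```

Larger values of alpha give more robustness at the cost of efficiency; letting alpha tend to zero recovers the non-robust maximum likelihood (least squares) fit.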
10:15 – 11:15 | ENGR 1108 S30 — Robust Statistical Learning In Complex Models
Contributed session chaired by Lily Wang
Robust learning and inference for mean functions in functional data analysis of imaging data
Imaging data has emerged as a pivotal tool in biomedical research and various other fields. Accurately identifying and locating significant effects, such as abnormalities or regions of interest, within these imaging datasets is crucial for making informed decisions. However, medical imaging data often suffer from unwanted noise contamination caused by factors such as instrument imperfections. Recognizing the impact of such contamination, we present a robust nonparametric method for learning and inference with noisy imaging data, modeling the images as contaminated functional data; this enables accurate estimation of the underlying signals and efficient detection and localization of significant effects. We propose a robust, smoothed M-estimator based on bivariate penalized splines over triangulation to handle the challenges associated with contaminated imaging data, as well as the spatial dependencies and irregular domains commonly found in brain images and other biomedical imaging applications. We establish the $L_2$ convergence of the proposed M-type mean function estimator, demonstrating the optimal convergence rate under certain regularity conditions, and investigate its asymptotic normality. Furthermore, we propose a novel approach for constructing a simultaneous confidence corridor (SCC) for the mean of noisy imaging data, and we extend the SCC to the difference in mean functions between two populations of noisy imaging data. Extensive simulation studies and a real-data application using brain imaging data demonstrate the performance of the proposed robust methods.
Keywords: Bivariate Penalized Splines, M-Estimator, Simultaneous Inference, Imaging Analysis, Triangulation
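The flavor of the estimator can be conveyed with a one-dimensional stand-in: a penalized spline fit in which iteratively reweighted least squares applies Huber weights to the residuals, so isolated contaminated observations lose influence on the estimated mean function. The talk's method uses bivariate penalized splines over a triangulation of the (possibly irregular) image domain and adds simultaneous confidence corridors; none of that is attempted here, and the basis, penalty, and tuning constants below are assumptions.

```python
import numpy as np

def penalized_huber_spline(x, y, n_knots=20, lam=1.0, c=1.345, n_iter=30):
    """Huber-type M-estimation of a mean function with a penalized spline basis.

    Iteratively reweighted least squares: at each step, solve a ridge-penalized
    weighted least-squares problem with Huber weights w_i = min(1, c / |r_i / s|).
    """
    knots = np.quantile(x, np.linspace(0.05, 0.95, n_knots))
    B = np.column_stack([np.ones_like(x), x] +
                        [np.clip(x - k, 0, None) for k in knots])  # linear spline basis
    P = np.eye(B.shape[1]); P[:2, :2] = 0.0      # penalize only the knot coefficients
    theta = np.linalg.solve(B.T @ B + lam * P, B.T @ y)
    for _ in range(n_iter):
        r = y - B @ theta
        s = 1.4826 * np.median(np.abs(r - np.median(r)))            # robust scale (MAD)
        w = np.minimum(1.0, c / np.maximum(np.abs(r / s), 1e-12))   # Huber weights
        BW = B * w[:, None]
        theta = np.linalg.solve(BW.T @ B + lam * P, BW.T @ y)
    return theta, B

# Toy 1-D "image profile" with sparse gross contamination (assumed data).
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 400))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 400)
y[::40] += 5.0                                   # contaminated observations
theta, B = penalized_huber_spline(x, y)
print("max abs deviation from true mean:",
      round(float(np.max(np.abs(B @ theta - np.sin(2 * np.pi * x)))), 3))
```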
Exact Feature Collisions in Neural Networks
Predictions made by deep neural networks have been shown to be highly sensitive to small changes in the input space; such maliciously crafted data points containing small perturbations are referred to as adversarial examples. On the other hand, recent research suggests that the same networks can also be extremely insensitive to changes of large magnitude, where predictions for two largely different data points are mapped to approximately the same output. In such cases, the features of the two data points are said to approximately collide, leading to largely similar predictions.
Our results improve and extend the work of Li, Zhang & Malik (2019) and provide specific criteria for data points to have colliding features from the perspective of the weights of neural networks, revealing that neural networks (theoretically) not only suffer from features that approximately collide but also from features that exactly collide. We identify sufficient conditions for the existence of such scenarios and investigate a large number of deep neural networks that have been used to solve various computer vision problems. Furthermore, we propose the Null-space search, a numerical approach that does not rely on heuristics, to create data points with colliding features for any input and for any task, including, but not limited to, classification, segmentation, and localization.
Keywords: Computer Vision, Feature Collisions, Neural Networks.
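The mechanism behind an exact collision is easy to demonstrate: any perturbation lying in the null space of a layer's weight matrix leaves that layer's pre-activations, and hence the network's output, exactly unchanged. The toy two-layer network below only illustrates this fact and is not the authors' Null-space search procedure, which constructs such inputs for deep networks and arbitrary tasks; all names and dimensions are invented.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)

# A toy two-layer network: f(x) = W2 @ relu(W1 @ x + b1) + b2.
d_in, d_hidden, d_out = 50, 20, 10            # wide first layer => nontrivial null space
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Any direction in null(W1) changes the input without changing W1 @ x, so the
# two inputs' first-layer features collide exactly (up to floating-point error).
N = null_space(W1)                            # columns span the null space of W1
x = rng.normal(size=d_in)
delta = N @ rng.normal(size=N.shape[1]) * 10.0   # a large, non-adversarial change
print("||delta|| =", round(float(np.linalg.norm(delta)), 2))
print("max |f(x) - f(x + delta)| =", float(np.abs(f(x) - f(x + delta)).max()))
```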
Outlier Detection with Cluster Catch Digraphs
Cluster Catch Digraphs (CCDs) are a family of clustering algorithms. The CCD algorithms are graph-based, density-based, and distribution-based approaches. They construct a certain number of hyperspheres to capture latent cluster structures. There are two versions of CCDs: KS-CCDs and RK-CCDs. The latter are especially appealing as they are almost parameter-free.
We develop an outlier detection algorithm that first utilizes RK-CCDs to build hyperspheres for each latent cluster. Then, we construct a Mutual Catch Graph (MCG) from the CCD and identify outliers among those points that are not in the hyperspheres. We call this approach the Mutual catch graph with Cluster Catch Digraphs (M-CCD) algorithm. However, due to some shortcomings of RK-CCDs, the performance degrades in high dimensions. To resolve this issue, we propose a new version of CCDs based on the Nearest Neighbor Distance (NND), referred to as NN-CCD. Subsequently, we propose another outlier detection algorithm based on NN-CCDs, which performs much better in high dimensions; we call this approach the Mutual catch graph with Nearest Neighbor Cluster Catch Digraph (M-NNCCD) algorithm. We also propose ways to adapt the above algorithms to cases where the cluster shapes are arbitrary; we call these approaches the Mutual catch graph with Flexible Cluster Catch Digraph (M-FCCD) algorithm and the Mutual catch graph with Flexible Nearest Neighbor Cluster Catch Digraph (M-FNNCCD) algorithm. Additionally, we offer two outlyingness scores based on CCDs: the Outbound Outlyingness Score (OOS) and the Inbound Outlyingness Score (IOS). In our experimental analysis, the proposed methods deliver better performance than both conventional outlier detection algorithms and other CCD-based outlier detection algorithms.
Keywords: Outlier Detection, Graph-Based Clustering, Cluster Catch Digraphs, K-Nearest-Neighborhood, Mutual Catch Graphs
10:15 – 11:15 | ENGR 1110 S31 — Novel Tools And Insights For Data Science
Contributed session chaired by Peter Filzmoser
Advancing Anomaly Detection Evaluation: Novel Measures for Private Data Analysis
In recent years, the demand for effective anomaly detection has surged across industries. With thousands of fraudulent banking transactions occurring daily, millions of users exhibiting non-conforming behaviors in specific applications, and security breaches on the rise, the need for robust anomaly detection algorithms has never been greater. Despite significant advancements in anomaly detection algorithms, many challenges persist. The scientific community's years of dedicated work have yielded a plethora of powerful anomaly detection algorithms [Kriegel et al. [2008], He et al. [2003], Lazarevic & Kumar [2005], Goldstein & Dengel [2012], Liu et al. [2008], Breunig et al. [2000]]. However, a novel approach is required to address these ongoing issues and enhance anomaly detection performance.
To tackle these challenges and advance anomaly detection evaluation, we introduce a novel method for private data analysis. Our method combines "AdarSim" (adaptive similarity) and "DSSM" (Data Structure Similarity Measure). Leveraging these measures allows for a comprehensive assessment of the similarities between private and public datasets and of algorithm performance on them, enabling practitioners to make informed decisions when applying anomaly detection algorithms to private data.
AdarSim uses information about the behaviour of the algorithms on benchmark and private data sets. DSSM utilizes various measures that provide insights into the characteristics of the data. Combining AdarSim with DSSM, we compare the private and public datasets and identify similarities between them. Based on these similarities, we analyse the behaviour of algorithms on the datasets while taking the data structure characteristics into account. By grouping algorithms according to their similarity levels and data structure similarities, we can effectively transfer performance information from benchmark data to private data. Our study demonstrates that only through the combination of both metrics do we achieve optimal results on the datasets used in our experiments.
Our results highlight the importance of considering both algorithm similarity and data structure characteristics when assessing similarities between datasets and evaluating unsupervised anomaly detection methods. This comprehensive approach enables practitioners to make informed decisions when applying unsupervised anomaly detection algorithms to private data.
Keywords: Anomaly Detection; Unsupervised Learning; AdarSim (Adaptive Similarity Measure); DSSM (Data Structure Similarity Measure)
Qindex R package for developing quantile biomarkers based on single-cell multiplex immunofluorescence imaging data
Modern pathology platforms for multiplex fluorescence-based immunohistochemistry provide distributions of cellular signal intensity (CSI) levels of proteins across the entire cell populations within the sampled tumor tissue. However, heterogeneity of CSI levels is usually ignored, and the simple mean signal intensity (MSI) value is considered as a cancer biomarker. To account for tumor heterogeneity, we consider the entire CSI distribution as a predictor of clinical outcome. This allows retaining all quantitative information at the single-cell level by considering the values of the quantile function (inverse of the cumulative distribution function) estimated from a sample of CSI levels in a tumor tissue.
We consider the entire distribution of CSI expression levels of a given protein in the cancer cell population as a predictor of clinical outcome and derive new biomarkers as single-index predictors based on the entire CSI distribution summarized as a quantile function. The proposed Quantile Index (QI) biomarker is defined as a linear or nonlinear functional regression predictor of outcome. The linear functional regression Quantile Index (FR-QI) is the integral of the subject-specific CSI quantile function multiplied by a common weight function [1]. The nonlinear functional regression Quantile Index (nFR-QI) is computed as the integral of an unspecified, twice-differentiable bivariate function with the probability p and the subject-specific quantile function as arguments. The weight function and the nonlinear bivariate function are represented by penalized splines and estimated by fitting suitable functional regression models to a clinical outcome.
We have developed the R package Qindex, which implements optimization of QI biomarkers in a training set and their validation in an independent test set. The Qindex package is available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=Qindex. The proposed QI biomarkers were derived for proteins expressed in cancer cells of malignant breast tumors and compared to the standard MSI predictors.
References:
1. Yi, M., Zhan, T., Peck, A.R., Hooke, J.A., Kovatich, A.J., Shriver, C.D., Hu, H., Sun, Y., Rui, H. and Chervoneva, I., 2023. Quantile Index Biomarkers Based on Single-Cell Expression Data. Laboratory Investigation, 103(8), p.100158.
Keywords: Cancer Biomarker, Distribution Quantiles, Functional Regression, Multiplex Immunofluorescence Immunohistochemistry, Single-Cell Imaging
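In symbols (notation mine, following the description and reference [1] above): if $Q_i(p)$, $p \in (0,1)$, denotes the CSI quantile function for subject $i$, then
$$\text{FR-QI}_i = \int_0^1 \beta(p)\, Q_i(p)\, dp, \qquad \text{nFR-QI}_i = \int_0^1 F\bigl(p, Q_i(p)\bigr)\, dp,$$
where the common weight function $\beta(\cdot)$ and the unspecified twice-differentiable bivariate function $F(\cdot,\cdot)$ are represented by penalized splines and estimated by regressing the clinical outcome on the resulting index.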
Time-aware tensor decomposition considering the relationships between viewpoints
Time-series data observed on multiple objects can be used to understand both the relationships between objects and how they change over time. For example, we can understand air pollution patterns using data consisting of features such as time points and the type of air pollutant at a given observation point. Such data can be represented as a tensor composed of features such as (time point) × (type of air pollutant) × (observation point). For such data, tensor decomposition can be used to capture the relationships among features.
Conventional tensor decomposition does not use temporal order information and does not sufficiently account for the fact that observations evolve with autocorrelation. Autoregressive tensor decomposition has been proposed to overcome this problem.
However, it is difficult to interpret the relationships using existing methods when external features are provided for the same observation object. The features collectively obtained from the same observation object are called views, and the data observed from multiple views are called multiview data. When understanding air pollution patterns using multiview data, such as the types of air pollutants observed at a certain observation point and meteorological data from that observation point, it is possible to clarify how specific meteorological conditions affect air pollution.
We propose a new tensor decomposition method based on existing autoregressive models for multiview data with spatiotemporal structures, such as air pollution data. To capture the relationship between views, we introduce a penalty term into the existing autoregressive tensor decomposition. The estimation results of the regression coefficients introduced by this penalty term are expected to clarify the influence of specific weather conditions on air pollution.
Keywords: Time-Series Data, Autoregression, Multiview Data
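One plausible way to write down the kind of objective described above is the following schematic formulation, which is my own illustration and not necessarily the authors' exact model. For a tensor $\mathcal{X} \in \mathbb{R}^{T \times I \times J}$ with CP factors and an external view $Z \in \mathbb{R}^{T \times q}$ (e.g., meteorological features at the same time points),
$$\min_{A,B,C,\Phi,\Gamma}\ \Bigl\|\mathcal{X} - \sum_{r=1}^{R} a_r \circ b_r \circ c_r\Bigr\|_F^2 + \lambda_1 \sum_{t=2}^{T} \bigl\|A_{t\cdot} - A_{t-1,\cdot}\,\Phi\bigr\|^2 + \lambda_2 \sum_{t=1}^{T} \bigl\|A_{t\cdot} - Z_{t\cdot}\,\Gamma\bigr\|^2,$$
where $A = [a_1,\ldots,a_R] \in \mathbb{R}^{T\times R}$ is the temporal factor matrix, $B$ and $C$ are the factor matrices of the other two modes, $\Phi \in \mathbb{R}^{R\times R}$ encodes the autoregressive dynamics of the temporal factors, and $\Gamma \in \mathbb{R}^{q\times R}$ regresses the temporal factors on the external view; estimates of $\Gamma$ would then indicate how, for example, meteorological conditions relate to the extracted pollution patterns.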
Copula-Based Regression with Discrete Covariates
In this project, we investigate a new approach to estimating a regression function based on copulas when the covariates are a mixture of continuous and discrete variables, which can be viewed as an extension of the framework of Noh et al. (2013). We also aim to extend the existing model to accommodate different types of data arising from experiments.
The main idea behind this approach is to write the regression function in terms of the copula and the marginal distributions, and then to estimate the copula and the marginals. Since various methods are available in the literature for estimating both the copula and the marginals, this approach offers a rich and flexible alternative to many existing regression estimators. We study the asymptotic behavior of the resulting estimators as well as their finite-sample performance, and illustrate their usefulness by analyzing real data. An adapted algorithm is applied to construct the copulas, and Monte Carlo simulations are carried out to replicate datasets, estimate the prediction model parameters, and validate them.
Keywords: Copula; Conditional Copula; Regression; OLS; Data Analysis
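For a single continuous covariate, the building block behind this approach is the standard copula identity (the extension to mixed continuous and discrete covariates is the contribution of this work): with copula density $c$ and marginals $F_X$, $F_Y$, $f_Y$,
$$f_{Y\mid X}(y\mid x) = c\bigl(F_X(x), F_Y(y)\bigr)\, f_Y(y), \qquad m(x) = E[Y\mid X=x] = \int y\, c\bigl(F_X(x), F_Y(y)\bigr)\, f_Y(y)\, dy,$$
so that estimating the copula density and the marginals, parametrically or nonparametrically, immediately yields an estimator of the regression function $m$.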
11:15 – 11:30 | Atrium & Jajodia Auditorium
Coffee Break
11:30 – 12:30 | Jajodia Auditorium (ENGR 1101) Keynote Address by Dr. Philippe Rigollet
Statistical Optimal Transport
Philippe Rigollet is a Professor of Mathematics at MIT, where he serves as Chair of Applied Mathematics. He works at the intersection of statistics, machine learning, and optimization, focusing primarily on the design and analysis of efficient statistical methods. His current research is on statistical optimal transport and the mathematical theory behind transformers. His research has been recognized by a CAREER award from the National Science Foundation and a Best Paper Award at the Conference on Learning Theory in 2013 for his pioneering work on statistical-to-computational tradeoffs. He is also a recognized speaker and was selected to present his work on statistical optimal transport in a Medallion Lecture of the Institute of Mathematical Statistics as well as at the 2019 St Flour Lectures in Probability and Statistics.
12:30 – 13:00 | Jajodia Auditorium (ENGR 1101)