Seong Ho Lee
Published: 2023
This dissertation develops statistical methods for semiparametric inference and its applications. Semiparametric theory provides statistical tools that are flexible and robust to model misspecification. Building on this theory, the work proposes robust estimation approaches that apply to several scenarios under mild conditions, and establishes their asymptotic properties for inference.
Chapter 1 provides a brief review of the literature related to this work. It first introduces the concept of semiparametric models and the efficiency bound. It then discusses two nonparametric techniques employed in the following chapters, kernel regression and B-spline approximation, and closes with the concept of dataset shift.
In Chapter 2, novel estimators of causal effects for categorical and continuous treatments are proposed, using an optimal covariate balancing strategy for inverse probability weighting. The resulting estimators are shown to be consistent for causal contrasts and asymptotically normal when either the model explaining the treatment assignment is correctly specified, or the correct set of bases for the outcome models has been chosen and the assignment model is sufficiently rich. The asymptotic results are complemented with simulations illustrating the finite-sample properties. A data analysis suggests a nonlinear effect of BMI on self-reported health decline among the elderly.
In Chapter 3, we consider a semiparametric generalized linear model and study estimation of both marginal mean effects and marginal quantile effects in this model. We propose an approximate maximum likelihood estimator and rigorously establish its consistency, asymptotic normality, and semiparametric efficiency for both marginal mean effect and marginal quantile effect estimation. Simulation studies illustrate the finite-sample performance, and we apply the new tool to non-labor income data, where it identifies an interesting new predictor.
In Chapter 4, we propose a procedure to select the best training subsample for a classification model. Identifying a patient's disease status from electronic health records (EHR) is a frequently encountered task in EHR-related research. However, assessing a patient's phenotype is costly and labor intensive, so a careful selection of EHR as a training set is desirable. We propose a procedure that tailors the training subsample for a classification model so as to minimize its mean squared error (MSE), and we provide theoretical justification for its optimality in terms of MSE. The performance gain from our method is illustrated through simulation and a real data example, and is often found to be satisfactory under criteria beyond MSE.
In Chapter 5, we study the label shift assumption and propose robust estimators for quantities of interest. In studies ranging from clinical medicine to policy research, the quantity of interest is often sought for a population from which only partial data is available, based on complete data from a related but different population. We consider this setting under the so-called label shift assumption. We propose an estimation procedure that needs only standard nonparametric techniques to approximate a conditional expectation and requires no estimates of other model components. We develop the large-sample theory for the proposed estimator and examine its finite-sample performance through simulation studies, as well as an application to the MIMIC-III database.
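The abstract refers to standard nonparametric techniques for approximating a conditional expectation, and names kernel regression as one of the tools used in the dissertation. As a rough illustration only, the sketch below shows a generic Nadaraya-Watson kernel regression estimate of E[Y | X = x0]. It is not the dissertation's estimator: the function names, the Gaussian kernel, the bandwidth value, and the toy data are all assumptions made for this example.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (an assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Estimate E[Y | X = x0] as a kernel-weighted average of the observed ys."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Observations with x close to x0 receive larger weights.
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical toy data: a noiseless linear relationship y = 2x + 1 on [0, 10].
xs = [i / 10 for i in range(101)]
ys = [2.0 * x + 1.0 for x in xs]

# Estimate the conditional mean at the center of the design.
estimate = nadaraya_watson(5.0, xs, ys, bandwidth=0.5)
print(estimate)
```

Because the toy design is symmetric around x0 = 5 and the relationship is linear, the weighted average should land very close to the true value 2 * 5 + 1 = 11. With real data, the bandwidth governs the bias-variance trade-off and would need to be chosen with care, for example by cross-validation.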