Learning And Validating Clinically Meaningful Phenotypes From Electronic Health Data

The ever-growing adoption of electronic health records (EHRs) to record patients' health journeys has resulted in vast amounts of heterogeneous, complex, and unwieldy information [Hripcsak and Albers, 2013]. Distilling this raw data into clinical insights presents great opportunities and challenges for the research and medical communities. One approach to this distillation is computational phenotyping: the process of extracting clinically relevant and interesting characteristics from clinical documentation such as that recorded in EHRs. Computational phenotyping can be viewed as a form of dimensionality reduction in which a set of phenotypes forms a latent space; clinicians can use it to reason about populations, identify patients for case-control studies, and extrapolate patient disease trajectories. In recent years, high-throughput computational approaches have made strides in extracting potentially clinically interesting phenotypes from data contained in EHR systems. Tensor factorization methods have shown particular promise in deriving phenotypes. However, phenotyping via tensor factorization has the following weaknesses: 1) the extracted phenotypes can lack diversity, which makes them more difficult for clinicians to reason about and use in practice; 2) many tensor factorization methods are unsupervised and do not exploit side information that may be available about the population or about the relationships between the clinical characteristics in the data (e.g., diagnoses and medications); and 3) validating the clinical relevance of the extracted phenotypes requires domain training and expertise. This dissertation addresses all three of these limitations. First, we present tensor factorization methods that discover sparse and concise phenotypes in unsupervised, supervised, and semi-supervised settings.
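The view of phenotyping as tensor factorization described above can be made concrete with a small sketch: a non-negative CP (PARAFAC) factorization of a toy patients x diagnoses x medications tensor via multiplicative updates, where each rank-1 component plays the role of a candidate phenotype. The tensor sizes, rank, and update scheme below are illustrative assumptions, not the dissertation's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def khatri_rao(U, V):
    # Column-wise Kronecker product: row (u, v) holds U[u, :] * V[v, :].
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def ntf(X, rank, n_iter=200, eps=1e-9):
    """Non-negative CP factorization via multiplicative updates."""
    I, J, K = X.shape
    A, B, C = rng.random((I, rank)), rng.random((J, rank)), rng.random((K, rank))
    for _ in range(n_iter):
        X0 = X.reshape(I, -1)                     # mode-0 unfolding (patients)
        A *= (X0 @ khatri_rao(B, C)) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        X1 = np.moveaxis(X, 1, 0).reshape(J, -1)  # mode-1 unfolding (diagnoses)
        B *= (X1 @ khatri_rao(A, C)) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        X2 = np.moveaxis(X, 2, 0).reshape(K, -1)  # mode-2 unfolding (medications)
        C *= (X2 @ khatri_rao(A, B)) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C

# Toy patients x diagnoses x medications tensor built from 2 "phenotypes".
A0, B0, C0 = rng.random((30, 2)), rng.random((8, 2)), rng.random((6, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

A, B, C = ntf(X, rank=2)
Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
rel_err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
```

Each column of B and C then lists how strongly every diagnosis and medication loads on one candidate phenotype, which is what a clinician would inspect for validity.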
Second, via two tools we built, we show how to leverage domain expertise in the form of publicly available medical articles to evaluate the clinical validity of the discovered phenotypes. Third, we combine tensor factorization and the phenotype validation tools to guide the discovery process to more clinically relevant phenotypes.
With the widespread adoption of electronic health records (EHRs), a large volume of EHR data has been accumulated, providing researchers and clinicians with valuable opportunities to accelerate clinical research and to improve the quality of care through advanced analysis of these data. One approach to transforming raw EHR data into actionable insights is computational phenotyping: the process of discovering meaningful combinations of clinical items (e.g., diagnoses and medications) from raw EHR data to characterize health conditions with minimal human supervision. Many data-driven approaches have been proposed to tackle the problem, among which non-negative tensor factorization (NTF) has proven effective for high-throughput discovery of phenotypes from structured EHR data. Despite these efforts, several open challenges limit the robustness of existing NTF-based computational phenotyping models. (1) The correspondence between different modalities (e.g., between diagnoses and medications) is often not recorded in EHR data, and existing models rely on unrealistic assumptions to construct input tensors, which introduces unavoidable errors. (2) EHR data are often recorded over time and exhibit serious temporal irregularity: patients have different lengths of stay, and the time gap between clinical visits can vary significantly. Existing models are limited in handling this temporal irregularity and the associated temporal dependency, which restricts their generalizability and robustness. (3) Heavy missingness is unavoidable in raw EHR data due to recording mistakes or operational reasons. Most existing models do not take missing data into account and assume the data are fully observed, which can greatly compromise their robustness. In this thesis, we propose a series of robust tensor factorization models to address these challenges.
First, we propose a hidden interaction tensor factorization (HITF) model to discover the inter-modal correspondence jointly with the learning of latent phenotypes; it is further extended to the multi-modal setting by the collective hidden interaction tensor factorization (cHITF) framework. Second, we propose a collective non-negative tensor factorization (CNTF) model to extract phenotypes from temporally irregular EHR data and to separate phenotypes that appear at different stages of disease progression. Third, we propose a temporally dependent PARAFAC2 factorization (TedPar) model that further captures the temporal dependency between phenotypes by modeling the transitions between them over time. Fourth, we propose a logistic PARAFAC2 factorization (LogPar) model to jointly complete the one-class missing data in a binary irregular tensor and learn phenotypes from it. Finally, we propose context-aware time series imputation (CATSI) to capture the overall health condition of patients and use it to guide the imputation of clinical time series. We empirically validate the proposed models using several real-world, large-scale, de-identified EHR datasets. The empirical results show that the proposed models are significantly more robust than existing ones. In clinician evaluations, HITF and cHITF discover more clinically meaningful inter-modal correspondences, CNTF learns phenotypes that better separate early and late stages of disease progression, TedPar captures meaningful phenotype transition patterns, and LogPar also derives clinically meaningful phenotypes. Quantitatively, LogPar and CATSI show significant improvements over baselines in tensor completion and time series imputation, respectively. In addition, HITF, cHITF, CNTF, and LogPar all significantly outperform baseline models on downstream prediction tasks.
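Challenge (3), fitting only the observed entries rather than assuming complete data, can be illustrated with a masked two-way factorization (the irregular-tensor case follows the same idea). This is a minimal sketch under assumed shapes, mask rate, and a simple weighted multiplicative-update rule; it is not the LogPar or CATSI algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact rank-2 non-negative "patient x clinical feature" matrix.
A0, B0 = rng.random((40, 2)), rng.random((20, 2))
X = A0 @ B0.T

# Observation mask: roughly 20% of entries are treated as missing.
W = rng.random(X.shape) > 0.2

def masked_nmf(X, W, rank, n_iter=300, eps=1e-9):
    """Weighted NMF (X ~= A @ B.T) that fits only the observed entries."""
    A = rng.random((X.shape[0], rank))
    B = rng.random((X.shape[1], rank))
    for _ in range(n_iter):
        A *= ((W * X) @ B) / ((W * (A @ B.T)) @ B + eps)
        B *= ((W * X).T @ A) / ((W * (A @ B.T)).T @ A + eps)
    return A, B

A, B = masked_nmf(X, W, rank=2)

# Judge the model where it never saw data: the held-out (missing) entries.
held_out = ~W
rel_err = np.linalg.norm((X - A @ B.T)[held_out]) / np.linalg.norm(X[held_out])
```

Because the mask zeroes out missing entries in both numerator and denominator, the updates never penalize the model for values it could not have observed, which is the robustness property the thesis targets.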
Healthcare applications of machine learning tend to carry stronger requirements for model transparency than most other applications. Yet the often high dimensionality of the data presents a significant impediment to meeting this requirement, particularly with respect to the underlying relationships contributing to an individual prediction. Thus emerged the concept of "data phenotypes": clinically relevant groupings that facilitate population statistics and reduce barriers in the development of quality machine learning models. However, the results of current phenotyping methods are often difficult to interpret and often require clarification from an experienced clinician to be useful. This is a problem for administration-level prediction tasks in particular, such as length-of-stay prediction, because those developing the models are not commonly clinicians and because results are often needed with a fast turnaround. With the above in mind, this thesis reviews the utility of four prominent phenotyping approaches: k-means, agglomerative clustering, non-negative matrix factorization, and non-negative tensor factorization. We propose variants of the four approaches with the goal of producing distinct feature membership. We then show that our proposals can produce easily understandable phenotypes with no detriment to prediction performance on several real healthcare tasks.
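Two of the four approaches reviewed can be contrasted in a few lines with scikit-learn: NMF yields soft, additive phenotype memberships whose top-weighted features are directly readable, while k-means assigns each patient to exactly one cluster. The synthetic count matrix and parameter choices here are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic patient-by-code count matrix built from two overlapping "phenotypes".
H_true = np.zeros((2, 10))
H_true[0, :5] = 1.0
H_true[1, 4:] = 1.0
W_true = rng.random((100, 2))
X = rng.poisson(5 * W_true @ H_true).astype(float)

# NMF: rows of components_ are additive phenotypes; the top-weighted codes
# give each one a human-readable description.
nmf = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)
W = nmf.fit_transform(X)
top_codes = [np.argsort(h)[::-1][:3] for h in nmf.components_]

# k-means: each patient gets exactly one cluster, a non-overlapping
# alternative to the soft NMF memberships.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The interpretability gap the thesis describes shows up here directly: a k-means label says only which group a patient is in, while the NMF loadings say which clinical codes define each group and how strongly each patient expresses it.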
This open access book explores ways to leverage information technology and machine learning to combat disease and promote health, especially in resource-constrained settings. It focuses on digital disease surveillance through the application of machine learning to non-traditional data sources. Developing countries are uniquely prone to large-scale emerging infectious disease outbreaks due to disruption of ecosystems, civil unrest, and poor healthcare infrastructure – and without comprehensive surveillance, delays in outbreak identification, resource deployment, and case management can be catastrophic. In combination with context-informed analytics, students will learn how non-traditional digital disease data sources – including news media, social media, Google Trends, and Google Street View – can fill critical knowledge gaps and help inform on-the-ground decision-making when formal surveillance systems are insufficient.
This User’s Guide is intended to support the design, implementation, analysis, interpretation, and quality evaluation of registries created to increase understanding of patient outcomes. For the purposes of this guide, a patient registry is an organized system that uses observational study methods to collect uniform data (clinical and other) to evaluate specified outcomes for a population defined by a particular disease, condition, or exposure, and that serves one or more predetermined scientific, clinical, or policy purposes. A registry database is a file (or files) derived from the registry. Although registries can serve many purposes, this guide focuses on registries created for one or more of the following purposes: to describe the natural history of disease, to determine clinical effectiveness or cost-effectiveness of health care products and services, to measure or monitor safety and harm, and/or to measure quality of care. Registries are classified according to how their populations are defined. For example, product registries include patients who have been exposed to biopharmaceutical products or medical devices. Health services registries consist of patients who have had a common procedure, clinical encounter, or hospitalization. Disease or condition registries are defined by patients having the same diagnosis, such as cystic fibrosis or heart failure. The User’s Guide was created by researchers affiliated with AHRQ’s Effective Health Care Program, particularly those who participated in AHRQ’s DEcIDE (Developing Evidence to Inform Decisions About Effectiveness) program. Chapters were subject to multiple internal and external independent reviews.
The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the primary priorities are patient care and billing. Because of this, the data are not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a wide variety of reasons, ranging from individual input styles to differences in clinical decision making, for example, which lab tests to order. Few patients are annotated at research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the EHR. In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research-quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning.
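The sequence-extraction step described for Part 4 can be sketched as follows: variable-length, timestamped code lists per patient are mapped to integer indices and padded into a fixed-shape array with an observation mask, the usual input form for sequence models. The toy records, codes, and padding convention are hypothetical stand-ins, not the dissertation's pipeline.

```python
import numpy as np

# Toy per-patient event streams: (day, code) pairs of unequal length,
# standing in for rows pulled from an EHR events table.
patients = {
    'p1': [(0, 'E11.9'), (3, 'I10'), (10, 'E11.9')],
    'p2': [(1, 'I10')],
    'p3': [(0, 'J45'), (2, 'J45'), (2, 'I10'), (9, 'E11.9')],
}

# Build a code vocabulary; index 0 is reserved for padding.
all_codes = sorted({c for events in patients.values() for _, c in events})
vocab = {c: i + 1 for i, c in enumerate(all_codes)}

max_len = max(len(events) for events in patients.values())
seqs = np.zeros((len(patients), max_len), dtype=int)   # padded code indices
mask = np.zeros((len(patients), max_len), dtype=bool)  # True where a real event exists
for row, events in enumerate(patients.values()):
    codes = [vocab[c] for _, c in sorted(events)]       # chronological order
    seqs[row, :len(codes)] = codes
    mask[row, :len(codes)] = True
```

A recurrent or transformer model would then consume `seqs` with `mask` telling it which positions are padding, preserving the per-patient progression the single-snapshot vector form discards.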
The use of Electronic Health Records (EHR)/Electronic Medical Records (EMR) data is becoming more prevalent in research. However, analysis of this type of data has many unique complications arising from how the data are collected and processed and from the types of questions they can answer. This book covers many important topics related to using EHR/EMR data for research, including data extraction, cleaning, processing, analysis, inference, and prediction, based on the authors' many years of practical experience. The book carefully evaluates and compares standard statistical models and approaches with machine learning and deep learning methods, and reports unbiased comparisons of these methods for predicting clinical outcomes from EHR data. Key features:
- Written from the hands-on experience of contributors to multidisciplinary EHR research projects, spanning methods and approaches from statistics, computing, informatics, data science, and clinical/epidemiological domains.
- Documents detailed experience with EHR data extraction, cleaning, and preparation.
- Provides a broad view of statistical approaches and machine learning prediction models for dealing with the challenges and limitations of EHR data.
- Considers the complete cycle of EHR data analysis.
The use of EHR/EMR analysis requires close collaboration between statisticians, informaticians, data scientists, and clinical/epidemiological investigators, and this book reflects that multidisciplinary perspective.
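The kind of head-to-head comparison the book describes, a standard statistical model versus a machine learning method on the same prediction task, can be sketched with scikit-learn. The synthetic data stands in for a cleaned EHR feature matrix with a binary outcome, and the two model choices are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a cleaned EHR feature matrix and a binary outcome.
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit both model families on identical splits and score them the same way,
# so any performance gap reflects the methods rather than the data handling.
auc = {}
for name, model in [('logistic', LogisticRegression(max_iter=1000)),
                    ('boosting', GradientBoostingClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    auc[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Holding the train/test split and the metric fixed across methods is the minimal requirement for the "unbiased comparison" the book emphasizes.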
This volume presents the proceedings of the International Conference on Biomedical and Health Informatics (ICBHI), a new special topic conference and a joint initiative of the International Federation for Medical and Biological Engineering (IFMBE) and the IEEE Engineering in Medicine and Biology Society (IEEE-EMBS). BHI 2015 was held in Haikou, China, 8-10 October 2015. The main theme of BHI 2015 was "The Convergence: Integrating Information and Communication Technologies with Biomedicine for Global Health". The proceedings examine enabling technologies of sensors, devices, and systems that optimize the acquisition, transmission, processing, storage, retrieval, and use of biomedical and health information, and report novel clinical applications of health information systems and the deployment of m-Health, e-Health, u-Health, p-Health, and Telemedicine.
Each year the National Institutes of Health spends over 12 billion dollars on patient-related medical research. Accurately classifying patients into categories representing diseases, exposures, or other medical conditions important to a study is critical when conducting patient-related research. Without rigorous characterization of patients, also referred to as phenotyping, relationships between exposures and outcomes could not be assessed, leading to non-reproducible study results. The focus of this research is developing tools to extract information from the electronic health record (EHR) and methods that can augment a team's perspective or reasoning capabilities to improve the accuracy of a phenotyping model. This thesis demonstrates that employing state-of-the-art computational methods makes it possible to accurately phenotype patients based entirely on data found within an EHR, even though the EHR data are not entered for that purpose. Three studies using the Marshfield Clinic EHR are described herein to support this research. The first study used a multi-modal phenotyping approach to identify cataract patients for a genome-wide association study. Structured query data mining, natural language processing, and optical character recognition were used to extract cataract attributes from the data warehouse, clinical narratives, and image documents. Using these methods increased the yield of cataract attribute information 3-fold while maintaining a high degree of accuracy. The second study demonstrates the use of relational machine learning as a computational approach for identifying unanticipated adverse drug events (ADEs). Matching and filtering methods were applied to training examples to enhance relational learning for ADE detection. The final study examines relational machine learning as a possible alternative for EHR-based phenotyping.
Several innovations, including identification of positive examples using ICD-9 codes and infusion of negative examples with borderline positive examples, were employed to minimize reference-expert effort and time and, to some extent, possible bias. The study found that relational learning performed significantly better than two popular decision tree learning algorithms for phenotyping when evaluated by area under the receiver operating characteristic curve. Findings from this research support my thesis statement: innovative use of computational methods makes it possible to more accurately characterize research subjects based on EHR data.
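The ICD-9-based identification of positive examples and the ROC-based evaluation mentioned above can be sketched together in a few lines. The patient records, the cataract prefix '366' (the ICD-9 cataract range), and the model scores are all hypothetical stand-ins, not the study's data.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical patient -> ICD-9 code lists.
records = {
    'p1': ['366.16', '401.9'],
    'p2': ['250.00'],
    'p3': ['366.9'],
    'p4': ['493.90', '401.9'],
}

# Positive examples: any patient carrying a code with the cataract prefix.
positives = {p for p, codes in records.items()
             if any(c.startswith('366') for c in codes)}

# Evaluate a (hypothetical) phenotyping model's scores against those labels
# by area under the receiver operating characteristic curve.
labels = [1 if p in positives else 0 for p in records]
scores = [0.9, 0.2, 0.7, 0.4]  # assumed model outputs, one per patient
auc = roc_auc_score(labels, scores)
```

In the real study the code-derived labels would come from the EHR rather than a literal dictionary, but the shape of the evaluation, labels from codes, scores from a learned model, AUC as the comparison metric, is the same.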