Publications
2024
- [bioRxiv] A deep profile of gene expression across 18 human cancers. Wei Qiu, Ayse B Dincer, Joseph D Janizek, Safiye Celik, and 3 more authors. Accepted to Nature Biomedical Engineering, 2024.
Clinically and biologically valuable information may reside untapped in large cancer gene expression data sets. Deep unsupervised learning has the potential to extract this information with unprecedented efficacy but has thus far been hampered by a lack of biological interpretability and robustness. Here, we present DeepProfile, a comprehensive framework that addresses current challenges in applying unsupervised deep learning to gene expression profiles. We use DeepProfile to learn low-dimensional latent spaces for 18 human cancers from 50,211 transcriptomes. DeepProfile outperforms existing dimensionality reduction methods with respect to biological interpretability. Using DeepProfile interpretability methods, we show that genes that are universally important in defining the latent spaces across all cancer types control immune cell activation, while cancer type-specific genes and pathways define molecular disease subtypes. By linking DeepProfile latent variables to secondary tumor characteristics, we discover that tumor mutation burden is closely associated with the expression of cell cycle-related genes. DNA mismatch repair and MHC class II antigen presentation pathway expression, on the other hand, are consistently associated with patient survival. We validate these results through Kaplan-Meier analyses and nominate tumor-associated macrophages as an important source of survival-correlated MHC class II transcripts. Our results illustrate the power of unsupervised deep learning for discovery of novel cancer biology from existing gene expression data.
- [JBI] Retrieval augmentation of large language models for lay language generation. Yue Guo*, Wei Qiu*, Gondy Leroy, Sheng Wang, and 1 more author. Journal of Biomedical Informatics, 2024.
The complex linguistic structures and specialized terminology of expert-authored content limit the accessibility of biomedical literature to the general public. Automated methods have the potential to render this literature more interpretable to readers with different educational backgrounds. Prior work has framed such lay language generation as a summarization or simplification task. However, adapting biomedical text for the lay public includes the additional and distinct task of background explanation: adding external content in the form of definitions, motivation, or examples to enhance comprehensibility. This task is especially challenging because the source document may not include the required background knowledge. Furthermore, background explanation capabilities have yet to be formally evaluated, and little is known about how best to enhance them. To address this problem, we introduce Retrieval-Augmented Lay Language (RALL) generation, which intuitively fits the need for external knowledge beyond that in expert-authored source documents. In addition, we introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. To evaluate RALL, we augmented state-of-the-art text generation models with information retrieval of either term definitions from the UMLS and Wikipedia, or embeddings of explanations from Wikipedia documents. Of these, embedding-based RALL models improved summary quality and simplicity while maintaining factual correctness, suggesting that Wikipedia is a helpful source for background explanation in this context. We also evaluated the ability of both an open-source Large Language Model (Llama 2) and a closed-source Large Language Model (GPT-4) in background explanation, with and without retrieval augmentation. Results indicate that these LLMs can generate simplified content, but that the summary quality is not ideal. 
Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the path for disseminating scientific knowledge to a broader audience. Our code and data are publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.
2023
- [Lancet HL] ExplaiNAble BioLogical Age (ENABL Age): an artificial intelligence framework for interpretable biological age. Wei Qiu, Hugh Chen, Matt Kaeberlein, and Su-In Lee. The Lancet Healthy Longevity, 2023.
Biological age is a measure of health that offers insights into ageing. The existing age clocks, although valuable, often trade off accuracy and interpretability. We introduce ExplaiNAble BioLogical Age (ENABL Age), a computational framework that combines machine-learning models with explainable artificial intelligence (XAI) methods to accurately estimate biological age with individualised explanations. To construct the ENABL Age clock, we first predicted an age-related outcome (eg, all-cause or cause-specific mortality), and then rescaled these predictions to estimate biological age, using UK Biobank and National Health and Nutrition Examination Survey (NHANES) datasets. We adapted existing XAI methods to decompose individual ENABL Ages into contributing risk factors. For broad accessibility, we developed two versions: ENABL Age-L, based on blood tests, and ENABL Age-Q, based on questionnaire characteristics. Finally, we validated diverse ageing mechanisms captured by each ENABL Age clock through genome-wide association study (GWAS) analyses. Our ENABL Age clock was significantly correlated with chronological age (r=0·7867, p<0·0001 for UK Biobank; r=0·7126, p<0·0001 for NHANES). These clocks distinguish individuals who are healthy (ie, their ENABL Age is lower than their chronological age) from those who are unhealthy (ie, their ENABL Age is higher than their chronological age), predicting mortality more effectively than existing clocks. Groups of individuals who were unhealthy showed approximately three to 12 times higher log hazard ratio than healthy groups, as per ENABL Age. The clocks achieved high mortality prediction power with an area under the receiver operating characteristic curve of 0·8179 for 5-year mortality and 0·8115 for 10-year mortality on the UK Biobank dataset, and 0·8935 for 5-year mortality and 0·9107 for 10-year mortality on the NHANES dataset.
The individualised explanations that revealed the contribution of specific characteristics to ENABL Age provided insights into the important characteristics for ageing. An association analysis with risk factors and ageing-related morbidities and GWAS results on ENABL Age clocks trained on different mortality causes showed that each clock captures distinct ageing mechanisms. ENABL Age brings an important leap forward in the application of XAI for interpreting biological age clocks. ENABL Age also carries substantial potential in practical settings, assisting medical professionals in untangling the complexity of ageing mechanisms, and potentially becoming a valuable tool in informed clinical decision-making processes.
- [IMLH] Explanation-guided dynamic feature selection for medical risk prediction. Nicasia Beebe-Wang, Wei Qiu, and Su-In Lee. In ICML 3rd Workshop on Interpretable Machine Learning in Healthcare (IMLH), 2023.
In medical risk prediction scenarios, machine learning methods have demonstrated an ability to learn complex and predictive relationships among rich feature sets. However, in practice, when faced with new patients, we may not have access to all the information expected by a trained risk model. We propose a framework to simultaneously provide flexible risk estimates for samples with missing features, as well as context-dependent feature recommendations to identify what piece of information may be most valuable to collect next. Our approach uses a fixed prediction model, a local feature explainer, and ensembles of imputed samples to generate risk prediction intervals and feature recommendations. Applied to a myocardial infarction risk prediction task in the UK Biobank dataset, we find that our approach can more efficiently predict risk of a heart attack with fewer observed features than traditional fixed imputation and global feature selection methods.
- [ICML] Learning to maximize mutual information for dynamic feature selection. Ian Connick Covert, Wei Qiu, Mingyu Lu, Na Yoon Kim, and 2 more authors. In International Conference on Machine Learning, 2023.
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning, but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality, and it outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
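To make the greedy policy described in this abstract concrete, here is a minimal sketch that assumes oracle access to a small discrete joint distribution. It is not the paper's learned (amortized) method: the functions `toy_joint`, `conditional_mutual_information`, and `greedy_select` are hypothetical names introduced for illustration, and the toy distribution is invented.

```python
import math
from collections import defaultdict


def toy_joint():
    """Hypothetical distribution over three binary features and a binary label y.

    x1 equals y exactly, x0 equals y with probability 0.8, x2 is independent noise.
    Returns a dict mapping ((x0, x1, x2), y) -> probability.
    """
    joint = {}
    for y in (0, 1):
        for x0 in (0, 1):
            for x2 in (0, 1):
                p = 0.5 * (0.8 if x0 == y else 0.2) * 0.5
                joint[((x0, y, x2), y)] = p
    return joint


def conditional_mutual_information(joint, feature, observed):
    """Return I(y; x_feature | x_S = observed values) in nats.

    observed maps feature index -> observed value and must have nonzero probability.
    """
    # Joint mass of (x_feature, y) restricted to outcomes consistent with `observed`.
    p_xy = defaultdict(float)
    for (x, y), p in joint.items():
        if all(x[i] == v for i, v in observed.items()):
            p_xy[(x[feature], y)] += p
    total = sum(p_xy.values())

    # Conditional marginals of x_feature and y.
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (xv, y), p in p_xy.items():
        p_x[xv] += p / total
        p_y[y] += p / total

    return sum(
        (p / total) * math.log((p / total) / (p_x[xv] * p_y[y]))
        for (xv, y), p in p_xy.items()
        if p > 0
    )


def greedy_select(joint, x, budget):
    """Query `budget` features of sample x, each time taking the highest-CMI feature."""
    observed, order = {}, []
    for _ in range(budget):
        candidates = [i for i in range(len(x)) if i not in observed]
        best = max(
            candidates,
            key=lambda i: conditional_mutual_information(joint, i, observed),
        )
        observed[best] = x[best]  # "acquire" the feature value
        order.append(best)
    return order


print(greedy_select(toy_joint(), x=(1, 1, 0), budget=1))  # prints [1]
```

In this toy setting the first query is always feature 1, because it determines the label; once it is observed, the remaining features carry no further information about y. The abstract's point is that such exact computation needs the true data distribution, which is why a learned approximation is used in practice.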
- [bioRxiv] Isolating structured salient variations in single-cell transcriptomic data with StrastiveVI. Wei Qiu, Ethan Weinberger, and Su-In Lee. bioRxiv, 2023.
Single-cell RNA sequencing (scRNA-seq) has provided deeper insights into biological processes by highlighting differences at the cellular level. Within these single-cell omics measurements, researchers are often interested in identifying variations associated with a specific covariate. For instance, in aging research, it becomes vital to differentiate variations related to aging. To address this, we introduce StrastiveVI (Structured Contrastive Variational Inference; https://github.com/suinleelab/StrastiveVI), which effectively separates the variations of interest from other dominant biological signals in scRNA-seq datasets. When deployed on aging and Alzheimer’s disease (AD) datasets, StrastiveVI efficiently isolates aging and AD-associated patterns, distinguishing them from dominant variations linked to sex, tissue, and cell type that are unrelated to aging or AD. In doing so, it underscores both well-known genes and potential novel genes related to aging or AD.
2022
- [Commun. Med.] Interpretable machine learning prediction of all-cause mortality. Wei Qiu, Hugh Chen, Ayse Berceste Dincer, Scott Lundberg, and 2 more authors. Communications Medicine, 2022.
Unlike linear models, which are traditionally used to study all-cause mortality, complex machine learning models can capture non-linear interrelations and provide opportunities to identify unexplored risk factors. Explainable artificial intelligence can improve prediction accuracy over linear models and reveal great insights into outcomes like mortality. This paper comprehensively analyzes all-cause mortality by explaining complex machine learning models. We propose the IMPACT framework, which uses XAI techniques to explain a state-of-the-art tree ensemble mortality prediction model. We apply IMPACT to understand all-cause mortality for 1-, 3-, 5-, and 10-year follow-up times within the NHANES dataset, which contains 47,261 samples and 151 features. We show that IMPACT models achieve higher accuracy than linear models and neural networks. Using IMPACT, we identify several overlooked risk factors and interaction effects. Furthermore, we identify relationships between laboratory features and mortality that may suggest adjusting established reference intervals. Finally, we develop highly accurate, efficient and interpretable mortality risk scores that can be used by medical professionals and individuals without medical expertise. We ensure generalizability by performing temporal validation of the mortality risk scores and external validation of important findings with the UK Biobank dataset. IMPACT’s unique strength is the explainable prediction, which provides insights into the complex, non-linear relationships between mortality and features, while maintaining high accuracy. Our explainable risk scores could help individuals improve self-awareness of their health status and help clinicians identify patients with high risk. IMPACT takes a consequential step towards bringing contemporary developments in XAI to epidemiology.
2021
- [AAAI] Automated lay language summarization of biomedical scientific reviews. Yue Guo*, Wei Qiu*, Yizhong Wang, and Trevor Cohen. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
Health literacy has emerged as a crucial factor in making appropriate health decisions and ensuring treatment outcomes. However, medical jargon and the complex structure of professional language in this domain make health information especially hard to interpret. Thus, there is an urgent unmet need for automated methods to enhance the accessibility of the biomedical literature to the general population. This problem can be framed as a type of translation problem between the language of healthcare professionals, and that of the general public. In this paper, we introduce the novel task of automated generation of lay language summaries of biomedical scientific reviews, and construct a dataset to support the development and evaluation of automated methods through which to enhance the accessibility of the biomedical literature. We conduct analyses of the various challenges in performing this task, including not only summarization of the key points but also explanation of background knowledge and simplification of professional language. We experiment with state-of-the-art summarization models as well as several data augmentation techniques, and evaluate their performance using both automated metrics and human assessment. Results indicate that automatically generated summaries produced using contemporary neural architectures can achieve promising quality and readability as compared with reference summaries developed for the lay public by experts (best ROUGE-L of 50.24 and Flesch-Kincaid readability score of 13.30). We also discuss the limitations of the current effort, providing insights and directions for future work.
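For readers unfamiliar with the readability figure quoted in this abstract, the sketch below computes the standard Flesch-Kincaid grade level from raw counts. It assumes the reported score is the grade-level variant; the function name `flesch_kincaid_grade` and the example counts are hypothetical, and counting syllables in real text (which needs a heuristic or dictionary) is deliberately left out.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from pre-computed text counts.

    Higher values mean the text needs more years of schooling to read comfortably.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
print(round(flesch_kincaid_grade(words=100, sentences=5, syllables=150), 2))  # prints 9.91
```

Under this formula, a score near 13 corresponds roughly to text pitched at a first-year university reading level.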
2020
- [IEEE BigData] IFGAN: missing value imputation using feature-specific generative adversarial networks. Wei Qiu*, Yangsibo Huang*, and Quanzheng Li. In 2020 IEEE International Conference on Big Data (Big Data), 2020.
Missing value imputation is a challenging and well-researched topic in data mining. In this paper, we propose IFGAN, a missing value imputation algorithm based on Feature-specific Generative Adversarial Networks (GAN). Our idea is intuitive yet effective: a feature-specific generator is trained to impute missing values, while a discriminator is expected to distinguish the imputed values from observed ones. The proposed architecture is capable of handling different data types, data distributions, missing mechanisms, and missing rates. It also improves post-imputation analysis by preserving inter-feature correlations. We empirically show on several real-life datasets that IFGAN outperforms current state-of-the-art algorithms under various missing conditions.
- [IEEE BigData] Multi-label detection and classification of red blood cells in microscopic images. Wei Qiu*, Jiaming Guo*, Xiang Li, Mengjia Xu, and 3 more authors. In 2020 IEEE International Conference on Big Data (Big Data), 2020.
Cell detection and cell type classification from biomedical images play an important role in high-throughput imaging and various clinical applications. While classification of single-cell samples can be performed with standard computer vision and machine learning methods, analysis of multi-label samples (regions containing congregating cells) is more challenging, as separation of individual cells can be difficult (e.g. touching cells) or even impossible (e.g. overlapping cells). As multi-instance images are common in analyzing Red Blood Cells (RBCs) for Sickle Cell Disease (SCD) diagnosis, we develop and implement a multi-instance cell detection and classification framework to address this challenge. The framework first trains a region proposal model based on a Region-based Convolutional Network (RCNN) to obtain bounding boxes of regions potentially containing single or multiple cells from input microscopic images, which are extracted as image patches. High-level image features are then calculated from the image patches through a pre-trained Convolutional Neural Network (CNN) with a ResNet-50 structure. Using these image features as inputs, six networks are then trained to make multi-label predictions of whether a given patch contains cells belonging to a specific cell type. As the six networks are trained with image patches consisting of both individual cells and touching/overlapping cells, they can effectively recognize cell types that are present in multi-instance image samples. Finally, for the purpose of SCD testing, we train another machine learning classifier to predict whether the given image patch contains an abnormal cell type based on outputs from the six networks. Test results show that the proposed framework achieves good performance in automatic cell detection and classification.
2019
- [IEEE BigData] Predicting Alzheimer’s disease by hierarchical graph convolution from positron emission tomography imaging. Jiaming Guo*, Wei Qiu*, Xiang Li, Xuandong Zhao, and 2 more authors. In 2019 IEEE International Conference on Big Data (Big Data), 2019.
Imaging-based early diagnosis of Alzheimer’s Disease (AD) has become an effective approach, especially with nuclear medicine imaging techniques such as Positron Emission Tomography (PET). Prior work has found that PET images can be better modeled as signals (e.g. uptake of florbetapir) defined on a network (non-Euclidean) structure, which is governed by the underlying graph patterns of pathological progression and metabolic connectivity. To apply deep learning to PET image analysis effectively and overcome its limitation to Euclidean grids, we develop a solution for 3D PET image representation and analysis under a generalized, graph-based CNN architecture (PETNet), which analyzes PET signals defined on a group-wise inferred graph structure. Computations in PETNet are defined in the non-Euclidean graph (network) domain: it performs feature extraction by convolution operations on spectral-filtered signals on the graph and pooling operations based on hierarchical graph clustering. The effectiveness of PETNet is evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, where it shows improved performance over both deep learning and other machine learning-based methods.
- [Circulation] Recurrent neural network enhance phenotyping in heart failure with preserved ejection fraction using electronic health record. Hui Ren, Wei Qiu, Aoxiao Zhong, Sijia Yu, and 3 more authors. Circulation, 2019.
- [Circulation] Personalized Treatment for Heart Failure With Preserved Ejection Fraction Using Deep Reinforcement Learning. Hui Ren, Sijia Yu, Xiang Li, Wei Qiu, and 3 more authors. Circulation, 2019.