057_ Unsupervised Clustering of Multiomics Data to Link Biotypes to Diagnosis and Outcome Parameters in PsyCourse
Research Question and Aims
In addition to deep phenotyping, many different levels of omics data are already available for PsyCourse. Besides genotyping data, these include epigenomic, transcriptomic, miRNAomic, lipidomic, and proteomic data. While several lines of analysis already use these data in single- or dual-level omics approaches, we propose to use as many layers as possible (missing data permitting) in an unsupervised machine learning approach. To our knowledge, predictive modeling that combines machine learning approaches with the entire wealth of biological data available in PsyCourse has only rarely been attempted to date. The aim of the proposed analysis is the generation of multiomic profiles. In a discovery sample, unsupervised clustering will be used to establish clusters of specific multiomic profiles, and we will then ask to what extent the identified clusters are related to specific diagnostic entities, disease severity, outcomes, and treatment response. We will use a replication sample to validate the identified clusters. We will also compare the multiomics-based clusters to clusters derived from phenotypic data (e.g. PMID 32049274).
Ultimately, this analysis will highlight biological pathways implicated in specific clusters of individuals with mental disorders from the affective and psychotic spectrum and, in doing so, will contribute to a better understanding of the common biology underlying these clusters. The analysis will also enable us to link biological profiles to specific outcome and severity measures, thus not only shedding light on the underlying biology but also opening up potential avenues for biomarker discovery and novel treatment options.
We will use unsupervised machine learning to cluster multiomic profiles consisting of the different available biological layers and link these to established diagnostic categories, to outcome and severity measures, and to clusters that have been identified from the PsyCourse phenotypic data in similar approaches (e.g. PMID 32049274). Wherever deemed necessary, we will use available data on sex, age, illness duration, time of day of sampling, BMI, medication use, etc. to correct the biological measurements for potential confounding effects. If necessary, we will use dimensionality reduction in order to incorporate data strata of higher dimensionality into our analysis, for example by reducing high-dimensional genotyping data to polygenic risk scores (PRS) for severe mental disorders such as schizophrenia, bipolar disorder, or major depressive disorder, or by focusing on specific pathways only. In this proposal, we will use biological data from visit 1 only, because that is the time point with the most comprehensive biological dataset available for all individuals. However, future analyses could also take longitudinal measurements into account.
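One common way to correct biological measurements for covariates such as age, sex, or BMI is to regress each feature on the covariates and keep the residuals. The following is a minimal sketch of that idea under stated assumptions, not the procedure fixed by this proposal: the function name `residualize`, the simulated covariates, and the toy feature matrix are all hypothetical, and a purely linear confounder model is assumed.

```python
import numpy as np

def residualize(features, covariates):
    """Remove the linear effect of the covariates from each feature column.

    Assumption: confounding is modeled as purely linear; this is one option
    among several, not necessarily the one the analysis will use.
    """
    features = np.asarray(features, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix: intercept column plus one column per covariate.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # Ordinary least squares fit for all feature columns at once.
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    # Residuals: the part of each feature not explained by the covariates.
    return features - design @ coef

# Hypothetical toy data: 200 individuals, 5 omics features, 3 covariates.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 12, n)
sex = rng.integers(0, 2, n)
bmi = rng.normal(26, 4, n)
covariates = np.column_stack([age, sex, bmi])

signal = rng.normal(size=(n, 5))
# Simulated confounding: every feature drifts with age and differs by sex.
features = signal + 0.05 * age[:, None] + 0.5 * sex[:, None]

adjusted = residualize(features, covariates)
```

After this step, the adjusted features are uncorrelated with the covariates in the sample, so downstream clustering is less likely to simply recover age or sex groups.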
Clusters with similar multiomic profiles will be established using unsupervised clustering approaches such as nonnegative matrix factorization (NNMF) consensus clustering. Multigroup supervised machine learning models will then be trained to predict cluster membership in the discovery data using a reduced feature set of critical variables. These models will be applied to the replication data to assess external validity. Finally, we will ask to what extent the identified clusters are related to established diagnostic categories, to outcome measures, and to the phenotypically established clusters from Dwyer et al., JAMA Psychiatry, 2020 (PMID 32049274).
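The idea behind NNMF consensus clustering can be sketched as follows: factorize the nonnegative data matrix several times from different random starts, assign each individual to its dominant factor in every run, and record how often each pair of individuals lands in the same cluster. The code below is an illustrative sketch only. The helper names `nmf` and `consensus_matrix`, the plain multiplicative-update solver, the number of runs, and the simulated two-group data are assumptions made for this example and are not taken from the proposal.

```python
import numpy as np

def nmf(X, k, n_iter=200, seed=0):
    """Basic NMF (X ~ W @ H) using multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    eps = 1e-10  # avoids division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    # Rescale so every row of H sums to 1; W absorbs the scale, which makes
    # the per-sample loadings in W comparable across factors.
    scale = H.sum(axis=1)
    return W * scale, H / scale[:, None]

def consensus_matrix(X, k, n_runs=20):
    """Fraction of runs in which each pair of samples shares a cluster."""
    n = X.shape[0]
    consensus = np.zeros((n, n))
    for seed in range(n_runs):
        W, _ = nmf(X, k, seed=seed)
        labels = W.argmax(axis=1)  # dominant factor per sample
        consensus += labels[:, None] == labels[None, :]
    return consensus / n_runs

# Hypothetical toy data: 30 samples (rows), 8 nonnegative features,
# with two groups that load on different feature blocks.
rng = np.random.default_rng(1)
group_a = rng.random((15, 8)) * 0.1
group_a[:, :4] += 1.0
group_b = rng.random((15, 8)) * 0.1
group_b[:, 4:] += 1.0
X = np.vstack([group_a, group_b])

C = consensus_matrix(X, k=2)
within = (C[:15, :15].mean() + C[15:, 15:].mean()) / 2
between = C[:15, 15:].mean()
```

Consensus values near 1 within a group and near 0 between groups indicate a stable solution; in practice, the number of clusters k is typically chosen by comparing such stability measures across candidate values.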
Raw medication data sets (v1_med_clin_orig, v1_med_con_orig)
Plasma lipidome data
Plasma proteome MS data
Plasma proteome antibody-based data
Lexogen whole blood transcriptomic data
Small RNAome sequencing data
Imputed genotypes for the calculation of PRS
Raw genotypes for the calculation of principal components (PCA)