PsyCourse

2022-05-18

057_ Unsupervised Clustering of Multiomics Data to Link Biotypes to Diagnosis and Outcome Parameters in PsyCourse

Research Question and Aims

In addition to deep phenotyping, PsyCourse already provides many levels of omics data. Besides genotyping data, these include epigenomic, transcriptomic, miRNAomic, lipidomic, and proteomic data. While several analyses already use these data in single- or dual-level omics approaches, we propose to use as many layers as possible (missing data permitting) in an unsupervised machine learning approach. To our knowledge, predictive modeling that combines machine learning approaches with the entire wealth of biological data available in PsyCourse has only rarely been attempted to date. The aim of the proposed analysis is to generate multiomic profiles. In a discovery sample, unsupervised clustering will be used to establish clusters of specific multiomic profiles. We will then ask to what extent the identified clusters are related to specific diagnostic entities, disease severity, outcomes, and treatment response. We will use a replication sample to validate the identified clusters. We will also compare the multiomics-based clusters to clusters derived from phenotypic data (e.g. PMID 32049274).
Ultimately, this analysis will highlight biological pathways implicated in specific clusters of individuals with mental disorders from the affective and psychotic spectrum and, in doing so, will contribute to a better understanding of the common biology underlying these clusters. The analysis will also enable us to link biological profiles to specific outcome and severity measures, thereby not only improving our understanding of the underlying biology but also opening up potential avenues to biomarker discovery and novel treatment options.

Analytic Plan

We will use unsupervised machine learning to cluster multiomic profiles consisting of the different available biological layers and link these clusters to established diagnostic categories, to outcome and severity measures, and to clusters that have been identified from the PsyCourse phenotypic data in similar approaches (e.g. PMID 32049274). Where necessary, we will use available data on sex, age, illness duration, time of day of sampling, BMI, medication use, etc. to correct the biological measurements for potential confounding effects. To incorporate data strata of higher dimensionality into our analysis, we will, if necessary, apply dimensionality reduction, for example by reducing high-dimensional genotyping data to polygenic risk scores (PRS) for severe mental disorders such as schizophrenia, bipolar disorder, or major depressive disorder, or by focusing on specific pathways only. In this proposal, we will use biological data from visit 1 only, because this is the time point with the most comprehensive biological dataset available for all individuals. However, future analyses could also take longitudinal measurements into account.
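The covariate correction and the PRS-based reduction described above could look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the analysis pipeline itself: the data are synthetic, the function names residualize and polygenic_score are hypothetical, the correction is assumed to be a simple linear regression on the covariates, and the PRS is assumed to be a plain weighted sum of allele dosages without any clumping or thresholding.

```python
import numpy as np

def residualize(features, covariates):
    """Remove linear covariate effects from every feature column (ordinary least squares)."""
    features = np.asarray(features, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix: intercept column plus one column per covariate.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    # The residuals are the part of each feature not explained by the covariates.
    return features - design @ coef

def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele dosages (0, 1 or 2) using per-variant effect sizes."""
    return np.asarray(dosages, dtype=float) @ np.asarray(effect_sizes, dtype=float)

# Synthetic stand-ins for visit 1 covariates (assumed values, not PsyCourse data).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 70, n)
sex = rng.integers(0, 2, n)
bmi = rng.normal(26, 4, n)
covariates = np.column_stack([age, sex, bmi])

# Synthetic omics features that carry an artificial age effect.
features = rng.normal(size=(n, 5)) + 0.05 * age[:, None]
adjusted = residualize(features, covariates)

# Synthetic genotype dosages and effect sizes for 50 hypothetical variants.
dosages = rng.integers(0, 3, size=(n, 50))
effect_sizes = rng.normal(0, 0.05, 50)
prs = polygenic_score(dosages, effect_sizes)
```

After residualization, the adjusted features are uncorrelated with the covariates in the sample, and the single PRS column can enter the multiomic profile in place of the full genotype matrix.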
Clusters with similar multiomic profiles will be established using unsupervised clustering approaches such as nonnegative matrix factorization (NNMF) consensus clustering. Multigroup supervised machine learning models will then be trained to predict cluster membership in the discovery data using a reduced feature set of critical variables. These models will be applied to the replication data to assess external validity. Finally, we will ask to what extent the identified clusters are related to established diagnostic categories, to outcome measures, and to the phenotypically established clusters from Dwyer et al., JAMA Psychiatry, 2020 (PMID 32049274).
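The clustering and validation steps described above can be sketched as follows. This is a simplified illustration on synthetic data, not the planned implementation. Several choices are assumptions made for brevity: the factorization uses basic multiplicative updates, final cluster labels are taken as connected components of the thresholded consensus matrix (hierarchical clustering of the consensus matrix is the more common choice), and a nearest-centroid classifier stands in for the multigroup supervised models. All function names are hypothetical.

```python
import numpy as np

def nmf(X, k, rng, n_iter=200, eps=1e-9):
    """Factorize nonnegative X (samples x features) as W @ H via multiplicative updates."""
    n_samples, n_features = X.shape
    W = rng.random((n_samples, k)) + eps
    H = rng.random((k, n_features)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def consensus_matrix(X, k, n_runs=20, seed=0):
    """Fraction of NMF runs in which each pair of samples shares a cluster."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    consensus = np.zeros((n_samples, n_samples))
    for _ in range(n_runs):
        W, H = nmf(X, k, rng)
        # Weight each factor by the total mass of its profile so that the
        # arbitrary scaling between W and H does not drive the assignment.
        labels = (W * H.sum(axis=1)).argmax(axis=1)
        consensus += labels[:, None] == labels[None, :]
    return consensus / n_runs

def labels_from_consensus(consensus, threshold=0.5):
    """Simplification: connected components of the thresholded consensus matrix."""
    n_samples = consensus.shape[0]
    labels = np.full(n_samples, -1)
    current = 0
    for start in range(n_samples):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero((consensus[i] > threshold) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

def nearest_centroid_predict(train_X, train_labels, new_X):
    """Assign each new sample to the cluster with the closest mean profile."""
    classes = np.unique(train_labels)
    centroids = np.stack([train_X[train_labels == c].mean(axis=0) for c in classes])
    dists = ((new_X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def make_synthetic(rng, n_per_group=20):
    """Two synthetic groups with opposite nonnegative feature profiles."""
    profile_a = np.array([5.0, 5.0, 5.0, 0.5, 0.5, 0.5])
    profile_b = profile_a[::-1]
    rows = np.vstack([np.tile(profile_a, (n_per_group, 1)),
                      np.tile(profile_b, (n_per_group, 1))])
    return rows + rng.random(rows.shape) * 0.5

rng = np.random.default_rng(1)
discovery = make_synthetic(rng)    # stand-in for the discovery sample
replication = make_synthetic(rng)  # stand-in for the replication sample

consensus = consensus_matrix(discovery, k=2)
discovery_labels = labels_from_consensus(consensus)
replication_labels = nearest_centroid_predict(discovery, discovery_labels, replication)
```

In a real analysis, the number of clusters k would be chosen from the stability of the consensus matrix across candidate values, and the omics layers would need to be scaled to nonnegative values before factorization.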

Resources needed

v1_id
v1_stat
v1_center
v1_sex
v1_age
v1_cntr_brth_m
v1_cntr_brth_f
v1_cur_psy_trm
v1_age_1st_out_trm
v1_age_1st_inpat_trm
v1_dur_illness
raw medication data sets (v1_med_clin_orig, v1_med_con_orig)
v1_fam_hist
v1_lith
v1_lith_prd
v1_bmi
v1_no_cig
v1_alc_pst12_mths
v1_scid_dsm_dx
v1_scid_dsm_dx_cat
v1_panss_sum_pos
v1_panss_sum_neg
v1_panss_sum_gen
v1_idsc_sum
v1_ymrs_sum
v1_cgi_s
v1_gaf
v1_whoqol
v1_tms_daypat_outpat_trm
v1_cat_daypat_outpat_trm

v2_wrk_abs_pst_6_mths
v2_clin_ill_ep_snc_lst
v2_clin_no_ep

v3_wrk_abs_pst_6_mths
v3_clin_ill_ep_snc_lst
v3_clin_no_ep

v4_wrk_abs_pst_6_mths
v4_clin_ill_ep_snc_lst
v4_clin_no_ep
v4_opcrit
v4_alda_A
v4_alda_B1
v4_alda_B2
v4_alda_B3
v4_alda_B4
v4_alda_B5
v4_panss_sum_pos
v4_panss_sum_neg
v4_panss_sum_gen
v4_idsc_sum
v4_ymrs_sum
v4_cgi_s
v4_gaf
v4_whoqol

gsa_id
gsa_imp_id
v1_smRNAome_id
v1_lexo_id
v1_prot_id
v1_ab_prof_id
v1_lip_id

Plasma lipidome data
Plasma proteome MS data
Plasma proteome antibody-based data
Lexogen whole blood transcriptomic data
Small RNAome sequencing data
Imputed genotypes for the calculation of PRS
Raw genotypes for principal component analysis (PCA)