Fig. 1
From: A validated heart-specific model for splice-disrupting variants in childhood heart disease

Schematic workflow for development, validation, and application of a random forest model for selecting high-confidence splice-disrupting variants for congenital heart disease. The selection strategy is shown for the identification of splice-disrupting variants in CHD genes. Model development: The CHD Discovery cohort (n = 106) was used to identify putative splice-disrupting variants in genome sequencing (GS) data and to confirm whether the variants were associated with a significant effect in RNA-sequencing (RNA-Seq) data derived from patient myocardium. These variants and their confirmed effects were then used to construct random forest models for predicting splice-disrupting variants with high confidence. Model validation: Model performance was validated using independent CHD validation (n = 48) and cardiomyopathy validation (n = 43) cohorts, where both GS and RNA-Seq profiles were available for all probands. Model application: The optimal random forest model was applied to a CHD Extension cohort (n = 947), where only GS data were available for all probands. One hundred thirty-two (12%) CHD probands harbored 133 rare, high-confidence splice-disrupting variants in CHD genes, including 47 variants in Tier 1 CHD genes and 86 variants in Tier 2 haploinsufficiency-intolerant CHD genes. RNA-Seq, RNA sequencing; GS, genome sequencing; FDR, false discovery rate; MAF, minor allele frequency; CHD, congenital heart disease.