Build Expression Predictors
This page illustrates command templates for training FGMB molecular-trait prediction models with the xqtl-protocol multi-context regression workflow. The examples are based on the StatFunGen/xqtl-protocol pipeline and focus on two sections:
- susie_twas: train univariate TWAS weights for one molecular context at a time. (Pipeline tutorial link)
- mnm: train multivariate / multi-context TWAS weights across multiple molecular contexts. (Pipeline tutorial link)
The paths below are templates. Replace cohort names, phenotype files, covariate files, association windows, and gene IDs with the files generated for each FGMB atlas context.
Required Inputs
The regression workflow expects harmonized genotype, phenotype, covariate, and region-definition files. The input conventions follow the mnm_regression.ipynb documentation in xqtl-protocol.
| Input | Purpose |
|---|---|
| --genoFile | PLINK genotype data (a .bed file in the examples on this page). |
| --phenoFile | Region-indexed molecular phenotype file list. For […] |
| --covFile | Covariate matrix matched to samples in the phenotype file. For […] |
| --customized-association-windows | Gene- or region-specific cis windows. The fourth column should match the molecular trait or gene IDs used in the phenotype list. |
| --phenotype-names | Names used to label one or more molecular contexts. For […] |
| --region-name | Optional subset of genes or regions to train. |
| --ld_reference_meta_file | LD reference metadata used by TWAS-weight cross-validation and variant annotation. |
|  | Optional sample subset file used to restrict model training to selected individuals. |
|  | Optional variant subset file used to restrict the genotype matrix. |
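The fourth-column requirement for the association windows can be checked before launching a run. The snippet below is a minimal sketch, not part of xqtl-protocol: `has_window` is a hypothetical helper, and it assumes the windows file is whitespace-delimited with the trait or gene ID in column 4.

```shell
# Hypothetical helper (not part of xqtl-protocol): succeed if an ID
# appears in column 4 of a whitespace-delimited windows file.
has_window() {
    # $1 = windows file, $2 = molecular trait or gene ID
    awk -v id="$2" '$4 == id { found = 1 } END { exit !found }' "$1"
}

# Usage sketch with the path and gene ID used in the examples on this page
windows=xqtl-analysis/resource/TADB_enhanced_cis.coding.bed
gene=ENSG00000073921
if [ -f "$windows" ]; then
    if has_window "$windows" "$gene"; then
        echo "$gene has an association window"
    else
        echo "$gene is missing from column 4 of $windows" >&2
    fi
fi
```

If the IDs in the phenotype list carry a suffix that the windows file lacks, or the reverse, the exact-match test in `awk` would need to be relaxed accordingly.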
The number of --phenoFile, --covFile, and --phenotype-names entries should match.
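A mismatch in these counts is easy to introduce when editing long command lines. The following sketch counts the entries before the command is assembled; `count_words` and `check_entry_counts` are hypothetical helper names, and the file names in the usage lines are shortened placeholders rather than real pipeline outputs.

```shell
count_words() {
    # Count whitespace-separated entries in a single string argument
    set -- $1
    echo "$#"
}

check_entry_counts() {
    # $1 = phenotype lists, $2 = covariate files, $3 = context names
    n_pheno=$(count_words "$1")
    n_cov=$(count_words "$2")
    n_names=$(count_words "$3")
    if [ "$n_pheno" -eq "$n_cov" ] && [ "$n_cov" -eq "$n_names" ]; then
        echo "ok: $n_names contexts"
    else
        echo "mismatch: $n_pheno phenoFile, $n_cov covFile, $n_names phenotype-names" >&2
        return 1
    fi
}

# Placeholder file names, shortened for illustration only
PHENO="Mic.region_list.txt Ast.region_list.txt Oli.region_list.txt"
COV="Mic.cov.gz Ast.cov.gz Oli.cov.gz"
NAMES="Mic_eQTL Ast_eQTL Oli_eQTL"
check_entry_counts "$PHENO" "$COV" "$NAMES"
```

The check only compares counts. It does not verify that the entries are listed in the same context order, which is presumably also expected.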
Univariate TWAS Weights: susie_twas example commands
Use susie_twas when training prediction weights for a single context, such as one brain region, cell type, or molecular modality. This step runs univariate fine-mapping and, unless disabled, generates TWAS weights for the selected molecular traits. For more details, refer to the susie_twas pipeline vignette.
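The example below trains weights for three ROSMAP cell-type contexts (Mic, Ast, and Oli) at a single gene, ENSG00000073921. Because it passes --skip-fine-mapping, the fine-mapping step is skipped in this run.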
sos run ./xqtl-protocol/code/mnm_analysis/mnm_methods/mnm_regression.ipynb susie_twas \
--name ROSMAP_DeJager --cwd ../output/ \
--genoFile ../input/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed \
--phenoFile eQTL/ROSMAP/DLPFC_Mic/analysis_ready/phenotype_preprocessing/Mic.log2cpm.region_list.txt \
eQTL/ROSMAP/DLPFC_Ast/analysis_ready/phenotype_preprocessing/Ast.log2cpm.region_list.txt \
eQTL/ROSMAP/DLPFC_Oli/analysis_ready/phenotype_preprocessing/Oli.log2cpm.region_list.txt \
--covFile covariate_preprocessing/Mic.log2cpm.Mic.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
covariate_preprocessing/Ast.log2cpm.Ast.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
covariate_preprocessing/Oli.log2cpm.Oli.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
--customized-association-windows xqtl-analysis/resource/TADB_enhanced_cis.coding.bed \
--no-skip-twas-weights --skip-fine-mapping --no-save_data \
--phenotype-names Mic_DeJager_eQTL Ast_DeJager_eQTL Oli_DeJager_eQTL \
--max_cv_variants 8000 \
--region-name ENSG00000073921 \
--ld_reference_meta_file resource/ADSP_R4_EUR_LD/ld_meta_file.tsv -s build
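To train more than one gene, the command above can be wrapped in a loop. This is a minimal sketch under stated assumptions: `train_one_gene` and `genes.txt` are hypothetical, only flags that already appear on this page are used, and one invocation per ID is just one option (the workflow may also accept several IDs in a single --region-name argument). By default the sketch prints each command instead of running it.

```shell
# RUN=echo (the default here) prints each command; set RUN="" to execute it.
RUN=${RUN-echo}
NOTEBOOK=./xqtl-protocol/code/mnm_analysis/mnm_methods/mnm_regression.ipynb

train_one_gene() {
    # $1 = gene or region ID; all remaining arguments are passed through unchanged
    gene=$1
    shift
    $RUN sos run "$NOTEBOOK" susie_twas --region-name "$gene" "$@"
}

# genes.txt is a hypothetical file with one gene or region ID per line.
# File descriptor 3 keeps the loop input separate from the command's stdin.
if [ -f genes.txt ]; then
    while read -r gene <&3; do
        train_one_gene "$gene" --name ROSMAP_DeJager --cwd ../output/
    done 3< genes.txt
fi
```

The usage lines pass only --name and --cwd for brevity. A real run would also need the --genoFile, --phenoFile, --covFile, --phenotype-names, and remaining arguments from the example above appended to each call.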
Multivariate TWAS Weights: mnm example commands
Use mnm when training weights jointly across multiple contexts. For example, single-nucleus pseudo-bulk cell types can be modeled together so that shared and context-specific regulatory effects are learned in one multi-context model. For more details, please refer to the mnm pipeline vignette.
sos run ./xqtl-protocol/code/mnm_analysis/mnm_methods/mnm_regression.ipynb mnm \
--name ROSMAP_DeJager --cwd ../output/ \
--genoFile ../input/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed \
--phenoFile eQTL/ROSMAP/DLPFC_Mic/analysis_ready/phenotype_preprocessing/Mic.log2cpm.region_list.txt \
eQTL/ROSMAP/DLPFC_Ast/analysis_ready/phenotype_preprocessing/Ast.log2cpm.region_list.txt \
eQTL/ROSMAP/DLPFC_Oli/analysis_ready/phenotype_preprocessing/Oli.log2cpm.region_list.txt \
--covFile covariate_preprocessing/Mic.log2cpm.Mic.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
covariate_preprocessing/Ast.log2cpm.Ast.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
covariate_preprocessing/Oli.log2cpm.Oli.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
--customized-association-windows xqtl-analysis/resource/TADB_enhanced_cis.coding.bed \
--no-skip-twas-weights --no-save_data \
--phenotype-names Mic_eQTL Ast_eQTL Oli_eQTL \
--mixture_prior analysis_result/mash/adgwas_20k_prune37_ed_bovy_1e3.EZ.prior.rosmap.rds \
--max_cv_variants 8000 --skip-analysis-pip-cutoff -1 \
--region-name ENSG00000073921 \
--ld_reference_meta_file resource/ADSP_R4_EUR_LD/ld_meta_file.tsv -s build
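Because the per-context arguments follow a regular naming pattern, they can be generated from a list of context labels, which keeps --phenoFile and --phenotype-names aligned. The sketch below assumes the ROSMAP path layout and the `_eQTL` suffix shown in the mnm example; `pheno_files` and `phenotype_names` are hypothetical helpers, and other cohorts will need their own patterns.

```shell
pheno_files() {
    # $1 = space-separated context labels, e.g. "Mic Ast Oli"
    out=""
    for ctx in $1; do
        out="$out eQTL/ROSMAP/DLPFC_${ctx}/analysis_ready/phenotype_preprocessing/${ctx}.log2cpm.region_list.txt"
    done
    echo "${out# }"   # drop the leading space
}

phenotype_names() {
    # $1 = space-separated context labels; names follow the <context>_eQTL pattern
    out=""
    for ctx in $1; do
        out="$out ${ctx}_eQTL"
    done
    echo "${out# }"
}

CONTEXTS="Mic Ast Oli"
echo "--phenoFile $(pheno_files "$CONTEXTS")"
echo "--phenotype-names $(phenotype_names "$CONTEXTS")"
```

Both lists are built from the same `CONTEXTS` string, so they always have the same length and order. The covariate file list could be generated the same way from its own naming pattern.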