Build Expression Predictors

This page provides command templates for training FGMB molecular-trait prediction models with the xqtl-protocol multi-context regression workflow. The examples are based on the StatFunGen/xqtl-protocol pipeline and cover two sections: univariate TWAS weights with susie_twas, and multivariate TWAS weights with mnm.

The paths below are templates. Replace cohort names, phenotype files, covariate files, association windows, and gene IDs with the files generated for each FGMB atlas context.

Required Inputs

The regression workflow expects harmonized genotype, phenotype, covariate, and region-definition files. The input conventions follow the mnm_regression.ipynb documentation in xqtl-protocol.

| Input | Purpose |
| --- | --- |
| --genoFile | PLINK bed genotype file or a chromosome-indexed genotype file list. |
| --phenoFile | Region-indexed molecular phenotype file list. For mnm, provide one phenotype list per context. |
| --covFile | Covariate matrix matched to samples in the phenotype file. For mnm, provide one covariate file per context in the same order as --phenoFile. |
| --customized-association-windows | Gene- or region-specific cis windows. The fourth column should match the molecular trait or gene IDs used in the phenotype list. |
| --phenotype-names | Names used to label one or more molecular contexts. For mnm, the names should match the order of phenotype and covariate files. |
| --region-name / --region-list | Optional subset of genes or regions to train. --region-name is convenient for smoke tests; --region-list is preferable for production runs. |
| --ld_reference_meta_file | LD reference metadata used by TWAS-weight cross-validation and variant annotation. |
| --keep-samples | Optional sample subset file used to restrict model training to selected individuals. |
| --keep-variants | Optional variant subset file used to restrict the genotype matrix. |
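The fourth-column requirement for --customized-association-windows can be checked before a run. The sketch below is a hypothetical helper and not part of xqtl-protocol. It assumes the windows file is tab-delimited with the trait or gene ID in column 4, and that the IDs to check have been written to a plain text file with one ID per line (an assumption made for illustration; the phenotype list itself may need to be reduced to that form first).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of xqtl-protocol): print every ID from an
# ID file (one ID per line) that does not appear in column 4 of a
# tab-delimited association-window file. Assumes the windows file is non-empty.
missing_window_ids() {
    local windows_file=$1 ids_file=$2
    awk -F'\t' '
        # First file: record column 4 of every non-comment line.
        NR == FNR { if ($0 !~ /^#/) seen[$4] = 1; next }
        # Second file: report non-empty IDs that were never seen.
        $1 != "" && !($1 in seen) { print $1 }
    ' "$windows_file" "$ids_file"
}
```

Empty output means every listed ID has a window; any printed ID would need a window added or should be dropped from the run.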

The number of --phenoFile, --covFile, and --phenotype-names entries should match.
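One way to enforce the matching-count rule is to keep the per-context entries in bash arrays and compare their lengths before calling sos. This is a minimal sketch; the function name check_context_counts and the shortened file names are placeholders, not part of the pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only when the three counts agree.
check_context_counts() {
    local n_pheno=$1 n_cov=$2 n_names=$3
    if [ "$n_pheno" -ne "$n_cov" ] || [ "$n_pheno" -ne "$n_names" ]; then
        echo "Mismatch: $n_pheno phenotype lists, $n_cov covariate files, $n_names names" >&2
        return 1
    fi
}

# Placeholder entries; replace with the real per-context paths and labels.
pheno_files=(Mic.region_list.txt Ast.region_list.txt Oli.region_list.txt)
cov_files=(Mic.cov.gz Ast.cov.gz Oli.cov.gz)
pheno_names=(Mic_eQTL Ast_eQTL Oli_eQTL)

check_context_counts "${#pheno_files[@]}" "${#cov_files[@]}" "${#pheno_names[@]}" \
    && echo "context counts agree"
```

The same arrays can then be expanded after --phenoFile, --covFile, and --phenotype-names, which also keeps the three lists in the same order.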

Univariate TWAS Weights: susie_twas example commands

Use susie_twas when training prediction weights for a single context, such as one brain region, cell type, or molecular modality. This step runs univariate fine-mapping and, unless disabled, generates TWAS weights for the selected molecular traits. For more details, please refer to the susie_twas pipeline vignette.

The example below illustrates three ROSMAP cell-type contexts.

sos run ./xqtl-protocol/code/mnm_analysis/mnm_methods/mnm_regression.ipynb susie_twas \
    --name ROSMAP_DeJager --cwd ../output/ \
    --genoFile ../input/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed \
    --phenoFile eQTL/ROSMAP/DLPFC_Mic/analysis_ready/phenotype_preprocessing/Mic.log2cpm.region_list.txt \
        eQTL/ROSMAP/DLPFC_Ast/analysis_ready/phenotype_preprocessing/Ast.log2cpm.region_list.txt \
        eQTL/ROSMAP/DLPFC_Oli/analysis_ready/phenotype_preprocessing/Oli.log2cpm.region_list.txt \
    --covFile covariate_preprocessing/Mic.log2cpm.Mic.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
        covariate_preprocessing/Ast.log2cpm.Ast.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
        covariate_preprocessing/Oli.log2cpm.Oli.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
    --customized-association-windows xqtl-analysis/resource/TADB_enhanced_cis.coding.bed \
    --no-skip-twas-weights --skip-fine-mapping --no-save_data \
    --phenotype-names Mic_DeJager_eQTL Ast_DeJager_eQTL Oli_DeJager_eQTL \
    --max_cv_variants 8000 \
    --region-name ENSG00000073921 \
    --ld_reference_meta_file resource/ADSP_R4_EUR_LD/ld_meta_file.tsv -s build
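Because the per-context paths in the command above differ only by cell-type label, the same command can be assembled from a short list of labels. The sketch below reuses the paths and flags from the example; the array names and the loop are illustrative choices, and the script only prints the command instead of running it.

```bash
#!/usr/bin/env bash
# Cell-type labels taken from the example above.
contexts=(Mic Ast Oli)

pheno_files=()
cov_files=()
pheno_names=()
for ct in "${contexts[@]}"; do
    pheno_files+=("eQTL/ROSMAP/DLPFC_${ct}/analysis_ready/phenotype_preprocessing/${ct}.log2cpm.region_list.txt")
    cov_files+=("covariate_preprocessing/${ct}.log2cpm.${ct}.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz")
    pheno_names+=("${ct}_DeJager_eQTL")
done

# Same flags as the susie_twas example; arrays expand to one argument per context.
cmd=(sos run ./xqtl-protocol/code/mnm_analysis/mnm_methods/mnm_regression.ipynb susie_twas
     --name ROSMAP_DeJager --cwd ../output/
     --genoFile ../input/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed
     --phenoFile "${pheno_files[@]}"
     --covFile "${cov_files[@]}"
     --customized-association-windows xqtl-analysis/resource/TADB_enhanced_cis.coding.bed
     --no-skip-twas-weights --skip-fine-mapping --no-save_data
     --phenotype-names "${pheno_names[@]}"
     --max_cv_variants 8000
     --region-name ENSG00000073921
     --ld_reference_meta_file resource/ADSP_R4_EUR_LD/ld_meta_file.tsv -s build)

# Print the command for review; run it with "${cmd[@]}" once it looks right.
printf '%q ' "${cmd[@]}"
echo
```

Building the three lists in one loop keeps phenotype files, covariate files, and names in the same order by construction.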

Multivariate TWAS Weights: mnm example commands

Use mnm when training weights jointly across multiple contexts. For example, single-nucleus pseudo-bulk cell types can be modeled together so that shared and context-specific regulatory effects are learned in one multi-context model. For more details, please refer to the mnm pipeline vignette.

sos run ./xqtl-protocol/code/mnm_analysis/mnm_methods/mnm_regression.ipynb mnm \
    --name ROSMAP_DeJager --cwd ~/output/ \
    --genoFile ../input/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed \
    --phenoFile eQTL/ROSMAP/DLPFC_Mic/analysis_ready/phenotype_preprocessing/Mic.log2cpm.region_list.txt \
        eQTL/ROSMAP/DLPFC_Ast/analysis_ready/phenotype_preprocessing/Ast.log2cpm.region_list.txt \
        eQTL/ROSMAP/DLPFC_Oli/analysis_ready/phenotype_preprocessing/Oli.log2cpm.region_list.txt \
    --covFile covariate_preprocessing/Mic.log2cpm.Mic.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
        covariate_preprocessing/Ast.log2cpm.Ast.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
        covariate_preprocessing/Oli.log2cpm.Oli.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
    --customized-association-windows xqtl-analysis/resource/TADB_enhanced_cis.coding.bed \
    --no-skip-twas-weights --no-save_data \
    --phenotype-names Mic_eQTL Ast_eQTL Oli_eQTL \
    --mixture_prior analysis_result/mash/adgwas_20k_prune37_ed_bovy_1e3.EZ.prior.rosmap.rds \
    --max_cv_variants 8000 --skip-analysis-pip-cutoff -1 \
    --region-name ENSG00000073921 \
    --ld_reference_meta_file resource/ADSP_R4_EUR_LD/ld_meta_file.tsv -s build
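The mnm command references many files (genotypes, per-context phenotype lists and covariates, the mixture prior, and the LD metadata), so it can help to confirm that they exist before launching. The helper below is a generic sketch with a hypothetical name, require_files; it only tests for existence and does not validate file contents.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report each missing path on stderr and
# return non-zero if at least one path does not exist.
require_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (paths are the templates used above):
# require_files \
#     analysis_result/mash/adgwas_20k_prune37_ed_bovy_1e3.EZ.prior.rosmap.rds \
#     resource/ADSP_R4_EUR_LD/ld_meta_file.tsv \
#     xqtl-analysis/resource/TADB_enhanced_cis.coding.bed \
#   && echo "inputs present; ready to run mnm"
```

For a PLINK genotype input, the .bim and .fam files that accompany the .bed file can be passed to the same check.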