Title: | Scalable Exact Algorithm for Large-Scale Set-Based Gene-Environment Interaction Tests |
---|---|
Description: | The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and rich collection in genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., gene). We present 'SEAGLE', a Scalable Exact AlGorithm for Large-scale Set-based GxE tests, to permit GxE VC test scalable to biobank data. 'SEAGLE' employs modern matrix computations to achieve the same “exact” results as the original GxE VC tests, and does not impose additional assumptions nor relies on approximations. 'SEAGLE' can easily accommodate sample sizes in the order of 10^5, is implementable on standard laptops, and does not require specialized equipment. The accompanying manuscript for this package can be found at Chi, Ipsen, Hsiao, Lin, Wang, Lee, Lu, and Tzeng. (2021+) <arXiv:2105.03228>. |
Authors: | Jocelyn Chi [aut, cre], Ilse Ipsen [aut], Jung-Ying Tzeng [aut] |
Maintainer: | Jocelyn Chi <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-11-23 03:33:13 UTC |
Source: | https://github.com/jocelynchi/seagle |
Function for applying t(A) on the left for REML EM
applyAt(qrXtilde, RHS)
applyAt(qrXtilde, RHS)
qrXtilde |
Object from QR decomposition of Xtilde |
RHS |
Object on right hand side of null of Xtilde^T |
Matrix or vector resulting from left multiplication of At with matrix or vector input RHS
A dataset containing 10,000 haplotypes of SNP sequences mimicking the European population generated from the COSI software
data(cosihap)
data(cosihap)
An object of class dgCMatrix
with 10000 rows and 604 columns.
https://genome.cshlp.org/content/15/11/1576.abstract
REML EM algorithm for estimating variance components
estimate.vc( y, Xtilde, qrXtilde, beta, G, init.sigma = 0.5, init.tau = 0.5, tol = 0.001, maxiters = 1000 )
estimate.vc( y, Xtilde, qrXtilde, beta, G, init.sigma = 0.5, init.tau = 0.5, tol = 0.001, maxiters = 1000 )
y |
Vector of observed phenotypes |
Xtilde |
Matrix of covariates (first column contains the intercept, last column contains the E factor for studying the GxE effect) |
qrXtilde |
Object containing QR decomposition of Xtilde |
beta |
Coefficient vector for covariate matrix Xtilde |
G |
Matrix of genotype markers |
init.sigma |
Initial sigma input (Default is 0.5) |
init.tau |
Initial tau input (Default is 0.5) |
tol |
Tolerance for convergence (Default is 1e-3) |
maxiters |
Maximum number of iterations (Default is 1000) |
Estimates for tau and sigma
This function generates synthetic from the fixed effects model described in the experimental studies portion of the paper.
makeSimData( H, n, L = 100, maf = 0.01, gamma0 = 1, gammaX = 1, gammaE = 1, gammaG, gammaGE, causal = 40, seed = 12345 )
makeSimData( H, n, L = 100, maf = 0.01, gamma0 = 1, gammaX = 1, gammaE = 1, gammaG, gammaGE, causal = 40, seed = 12345 )
H |
Matrix of haplotype data (e.g. cosihap) |
n |
Number of individuals |
L |
Number of SNPs in the G matrix (Default is 100), should be a value between 1 and 604 |
maf |
Minor allele frequency (Default is 0.01) |
gamma0 |
gamma0 Fixed effect coefficient for intercept (Default is 1) |
gammaX |
gammaX Fixed effect coefficient for confounding covariates (Default is 1) |
gammaE |
gammaE Fixed effect coefficient for E effect (Default is 1) |
gammaG |
gammaG Fixed effect coefficient for G main effect |
gammaGE |
gammaGE Fixed effect coefficient for GxE interaction effect |
causal |
Number of causal SNPs (default is 40) |
seed |
Seed (Default is 12345) |
Synthetic dataset containing y, X, E, G, epsilon, and number of causal SNPs
dat <- makeSimData(H=cosihap, n=500, L=10, gammaG=1, gammaGE=0, causal=4, seed=1)
dat <- makeSimData(H=cosihap, n=500, L=10, gammaG=1, gammaGE=0, causal=4, seed=1)
This function checks and formats data for input into SEAGLE function
prep.SEAGLE(y, X, intercept, E, G)
prep.SEAGLE(y, X, intercept, E, G)
y |
Vector of observed phenotypes |
X |
Matrix of covariates without genetic marker interactions |
intercept |
1 if the first column of X is the all ones vector, 0 otherwise |
E |
E Vector of environmental covariates |
G |
G Matrix of genotype data |
List object containing prepared data for input into SEAGLE function
dat <- makeSimData(H=cosihap, n=500, L=10, gammaG=1, gammaGE=0, causal=4, seed=1) objSEAGLE <- prep.SEAGLE(y=dat$y, X=dat$X, intercept=1, E=dat$E, G=dat$G)
dat <- makeSimData(H=cosihap, n=500, L=10, gammaG=1, gammaGE=0, causal=4, seed=1) objSEAGLE <- prep.SEAGLE(y=dat$y, X=dat$X, intercept=1, E=dat$E, G=dat$G)
Function for applying R inverse to AtG in REML EM algorithm
Rinv.AtG(G, AtG, GtAAtG, tau, sigma)
Rinv.AtG(G, AtG, GtAAtG, tau, sigma)
G |
Matrix of genotype markers (size n x L) |
AtG |
AtG from pre-computation |
GtAAtG |
GtAAtG from pre-computation |
tau |
Variance component from G main effect |
sigma |
Variance component from model noise epsilon |
Matrix resulting from left multiplication of Rinv with input matrix AtG
Function for applying R inverse to u in REML EM algorithm
Rinv.u(G, AtG, GtAAtG, GtAu, u, tau, sigma)
Rinv.u(G, AtG, GtAAtG, GtAu, u, tau, sigma)
G |
Matrix of genotype markers (size n x L) |
AtG |
AtG from precomputation |
GtAAtG |
GtAAtG from precomputation |
GtAu |
GtAu from precomputation |
u |
u=Aty from REML EM |
tau |
Variance component from G main effect |
sigma |
Variance component from model noise epsilon |
Vector resulting from left multiplication of Rinv with input vector u
This function computes the score test statistic and corresponding p-value for the GxE test with the SEAGLE algorithm with input data that have been prepared with the prep.SEAGLE function
SEAGLE(obj.SEAGLE, init.tau = 0.5, init.sigma = 0.5, pv = "liu")
SEAGLE(obj.SEAGLE, init.tau = 0.5, init.sigma = 0.5, pv = "liu")
obj.SEAGLE |
Input data prepared with prep.SEAGLE function |
init.tau |
Initial estimate for tau (Default is 0.5) |
init.sigma |
Initial estimate for sigma (Default is 0.5) |
pv |
Method of obtaining p-value (Either "liu" or "davies", Default is liu) |
Score-like test statistic T for the GxE effect and corresponding p-value
dat <- makeSimData(H=cosihap, n=500, L=10, gammaG=1, gammaGE=0, causal=4, seed=1) objSEAGLE <- prep.SEAGLE(y=dat$y, X=dat$X, intercept=1, E=dat$E, G=dat$G) res <- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5)
dat <- makeSimData(H=cosihap, n=500, L=10, gammaG=1, gammaGE=0, causal=4, seed=1) objSEAGLE <- prep.SEAGLE(y=dat$y, X=dat$X, intercept=1, E=dat$E, G=dat$G) res <- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5)
This function applies V inverse via the Woodbury matrix identity
Vinv(G, qrM, tau_over_sigma, sigma, RHS)
Vinv(G, qrM, tau_over_sigma, sigma, RHS)
G |
Matrix of genotype markers (size n x L) |
qrM |
Pre-computation for LxL linear system solve |
tau_over_sigma |
Tau over sigma from precomputation |
sigma |
Variance component from model noise epsilon |
RHS |
Matrix or vector on right-hand side of V inverse |
Matrix or vector resulting from left multiplication of Vinv with input RHS