This tutorial demonstrates how to
use the SEAGLE
package for chromosome-wide gene-based
studies when the genotype data are from GWAS studies (.bed + .bim +
.fam). We recommend using PLINK1.9 available from https://www.cog-genomics.org/plink/1.9/
to generate a .raw file for each of the SNP sets. For more information
on .raw files, please refer to https://www.cog-genomics.org/plink/2.0/formats#raw.
The .raw file records the allelic dosage for each SNP.
Examples files containing GWAS data (.bed, .bim, and .fam files) for
all the genes in chromosome 22 and a corresponding gene list in
glist-hg19
can be downloaded via the
SEAGLE-example4-data.zip
file from http://jocelynchi.com/SEAGLE/SEAGLE-example4-data.zip.
To follow along with the PLINK conversion procedures, you will first
need to install PLINK1.9 from https://www.cog-genomics.org/plink/1.9/
and unzip the example SEAGLE-example4-data.zip
file.
Afterwards, you can type the PLINK commands below in a command prompt or
terminal window in the folder where these files are located. This will
produce a .raw file for each gene that you can read into R to obtain a
relevant genetic marker matrix ${\bf
G}$ that you can input into SEAGLE
.
After unzipping the SEAGLE-example4-data.zip
file, you
will find two empty directories named plink_bash
and
R_codes
. The plink_bash
directory is where we
will write the PLINK1.9 commands that you will run to create the .raw
files for each gene. The R_codes
directory is where we will
write the R scripts for loading the resulting .raw files into R.
We will begin by telling R the location where you unzipped the
SEAGLE-example4-data.zip
file. Then we will read in the
gene list in glist-hg19
and subset for the genes in
chromosome 22. If you unzipped the file to your Downloads directory on a
Mac, your path might look like the following.
dir <- "/Users/user_name/Downloads/SEAGLE-example4-data/"
glist <- read.table(paste0(dir,"glist-hg19"), header = F)
glist_chr22 <- subset(glist, glist$V1 == 22)
Next, we will remove any duplicated genes for chromosome 22 in our gene list.
dup <- glist_chr22$V4[duplicated(glist_chr22$V4)] # find duplicated gene
dup
Notice that in this example, there are 4 duplicated genes given by the following.
# GSTT2: chr22:24,322,339-24,326,106(GRCh37/hg19 by Ensembl)
# GSTT2B: chr22:24,299,601-24,303,373(GRCh37/hg19 by Ensembl)
# RIMBP3B: chr22:21,738,040-21,743,458(GRCh37/hg19 by Entrez Gene)
# RIMBP3C: chr22:21,899,646-21,905,750(GRCh37/hg19 by Ensembl)
We will update our gene list by manually updating the positions of the duplicated genes.
glist_chr22 <- glist_chr22[!duplicated(glist_chr22$V4),]
glist_chr22[which(glist_chr22$V4 == "GSTT2"),2:3] <- c(24322339, 24326106)
glist_chr22[which(glist_chr22$V4 == "GSTT2B"),2:3] <- c(24299601,24303373)
glist_chr22[which(glist_chr22$V4 == "RIMBP3B"),2:3] <- c(21738040,21743458)
glist_chr22[which(glist_chr22$V4 == "RIMBP3C"),2:3] <- c(21899646,21905750)
write.table(glist_chr22, "glist-hg19_chr22_updated", col.names = F, row.names = F, quote = F)
# replace hyphen with underscore
glist_chr22$V4 <- sub("-", "_", glist_chr22$V4)
Now we will use PLINK1.9 to create a .raw file for each gene in
chromosome 22, which includes all SNPs for a given gene. The following
code will produce a sequence of bash files to produce .raw files for
each gene in glist_chr22
. It will also produce a
corresponding sequence of R code snippets for reaching each .raw file
into R. Each bash file will run PLINK1.9 for num_genes=50
genes at a time.
# of plink commands per job
num_genes <- 50
bash_plink <- NULL # bash command
command <- c("#! /bin/bash", "#SBATCH --job-name=plink") # plink command
codeR <- NULL # R code
for (i in 1:nrow(glist_chr22)) {
command <- c(command, paste0("plink --bfile 1kg_phase1_chr22 --make-set
glist-hg19_chr22_updated --gene ", glist_chr22$V4[i]," --out ",
glist_chr22$V4[i]," --export A") )
# write R codes for reading .raw files, one R file for one gene
codeR <- paste0(glist_chr22$V4[i], " <- read.delim(\"", glist_chr22$V4[i],".raw\", sep=\" \")")
fileConn<-file(paste0(dir,"R_codes/", glist_chr22$V4[i],".R"))
writeLines(codeR, fileConn)
close(fileConn)
if(i %% num_genes == 0 | i==nrow(glist_chr22)){
# write plink commands in bash file
fileConn<-file(paste0(dir,"plink_bash/plink_job",i-num_genes+1,"-",i,".sh"))
writeLines(command, fileConn)
close(fileConn)
command <- c("#! /bin/bash", "#SBATCH --job-name=plink")
bash_plink <- c(bash_plink, paste0("sbatch plink_job",i-num_genes+1,"-",i,".sh"))
}
}
# Write bash commands to a text file
fileConn<-file(paste0("bash_plink.txt"))
writeLines(bash_plink, fileConn)
close(fileConn)
Note that if you want to run the .sh files in the
plink_bash
directory on your local computer, you can employ
the bash
command instead of SBATCH
in terminal
or command prompt. Additionally, if you are using Linux, you may need to
specify plink1.9
rather than just plink
.
Running the above code will setup the code and files to produce a
single .raw file for each gene when running the commands in the
bash_plink.txt
file. The .raw file for each gene will
contain all the SNPs for that gene.
Now that we have data for each gene in the .raw format, we can load
the .raw files sequentially into R for input into SEAGLE
as
follows. We’ll first load the SEAGLE
package.
The following code shows how to test for the GxE interaction effect
using SEAGLE
on these genes. To illustrate how to use
SEAGLE, we will generate synthetic phenotype and covariate data for
n = 1092 study
participants.
As an example, we’ve included .raw files for the following subset of
genes from chromosome 22 in the extdata
folder of this
package: ACR, APOBEC3A, APOBEC3C, ARSA, ATF4, ATP5L2, BCRP2, BMS1P17,
and BMS1P18. We’ve also included a corresponding gene list in
glist-hg19_chr22_example
. In practice, you will want to
specify your own dir
for the directory where you’ve stored
your .raw files.
# Read in gene list
dir <- "../inst/extdata/"
genelist <- read.delim(paste0(dir, "glist-hg19_chr22_example"), sep=" ", header=FALSE)
# Identify number of genes in genelist
num_genes <- dim(genelist)[1]
# Generate synthetic phenotype and covariate data
n <- 1092 # number of study participants
set.seed(1)
y <- 2 * rnorm(n)
set.seed(2)
X <- as.matrix(rnorm(n))
set.seed(3)
E <- as.matrix(rnorm(n))
Now that we have ${\bf y}$, ${\bf X}$, ${\bf
E}$, and ${\bf G}$, we can run
SEAGLE
on each gene in genelist
. We will first
perform data checking procedures on ${\bf
G}$. Then we will input ${\bf
y}$, ${\bf X}$, ${\bf E}$, and ${\bf G}$ for each gene into the
prep.SEAGLE
function. The intercept = 0
parameter indicates that the first column of ${\bf X}$ is not the all ones vector for the
intercept.
The preparation procedure formats the input data for the
SEAGLE
function by checking the dimensions of the input
data. It also pre-computes a QR decomposition for $\widetilde{\bf X} = \begin{pmatrix} {\bf 1}_{n}
& {\bf X} & {\bf E} \end{pmatrix}$, where ${\bf 1}_{n}$ denotes the all ones vector of
length n.
Afterwards, we will input the prepared data for each gene into the
SEAGLE
function to compute the score-like test statistic
T and its corresponding
p-value. The init.tau
and init.sigma
parameters are the initial values for estimating τ and σ in the REML EM algorithm.
# Initialize output containers for T and p-value for each gene
T.list <- numeric(length=num_genes)
pv.list <- numeric(length=num_genes)
# Run SEAGLE on each gene in genelist
for (i in 1:num_genes) {
# Identify current gene
gene_name <- genelist[i,4]
# Obtain G
Gtemp <- read.delim(paste0(dir, gene_name, ".raw"), sep=" ")
G <- as.matrix(Gtemp[,-c(1:6)])
L <- dim(G)[2] ## number of SNPs
# Make weights
avg_newsnp <- colMeans(G)
fA = avg_newsnp/2 ## freq of allele "A"
fa = 1-fA ## freq of allele "a"
maf = fA; maf[fA>0.5]= fa[fA>0.5]## maf should be b/w 0 and 0.5
G = G[ ,maf>0] ## only keep those SNPs with MAF>0
maf <- maf[maf > 0] ## Keep only MAF > 0)
wt = sqrt(maf^(-3/4)) ## Note we take the square root here
if (length(wt) > 1) {
G_final = G %*% diag(wt) ## Input this G
} else {
T.list[i] <- NA
pv.list[i] <- NA
next
}
# Run SEAGLE
objSEAGLE <- prep.SEAGLE(y=y, X=X, intercept=0,
E=E, G=G_final)
res <- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5)
# Save T and p-value into output lists
T.list[i] <- res$T
pv.list[i] <- res$pv
}
The score-like test statistics T for the G×E effect and their corresponding p-values
can be found in T.list
and pv.list
,
respectively. We can take a look at the test statistics and p-values
computed for each of the genes in genelist
.
resMat <- cbind(T.list, pv.list)
colnames(resMat) <- c("T", "p-value")
resMat
#> T p-value
#> [1,] 7693.9187 0.6148780
#> [2,] 4022.4392 0.7990406
#> [3,] 3375.4804 0.9771967
#> [4,] 9698.0070 0.2089241
#> [5,] 3523.1752 0.3668638
#> [6,] 716.1017 0.3435655
#> [7,] 4792.4809 0.9336786
#> [8,] 139.5526 0.9186002
#> [9,] 139.5526 0.9186002
Many thanks to Yueyang Huang for generating the example data, PLINK1.9 code and bash files, and R scripts for reading in the .raw files for this tutorial.