Example 4: Using SEAGLE for Chromosome-Wide Gene-Based Analysis

This tutorial demonstrates how to use the SEAGLE package for chromosome-wide gene-based studies when the genotype data are from GWAS studies (.bed + .bim + .fam). We recommend using PLINK1.9 available from https://www.cog-genomics.org/plink/1.9/ to generate a .raw file for each of the SNP sets. For more information on .raw files, please refer to https://www.cog-genomics.org/plink/2.0/formats#raw. The .raw file records the allelic dosage for each SNP.

Examples files containing GWAS data (.bed, .bim, and .fam files) for all the genes in chromosome 22 and a corresponding gene list in glist-hg19 can be downloaded via the SEAGLE-example4-data.zip file from http://jocelynchi.com/SEAGLE/SEAGLE-example4-data.zip.

To follow along with the PLINK conversion procedures, you will first need to install PLINK1.9 from https://www.cog-genomics.org/plink/1.9/ and unzip the example SEAGLE-example4-data.zip file. Afterwards, you can type the PLINK commands below in a command prompt or terminal window in the folder where these files are located. This will produce a .raw file for each gene that you can read into R to obtain a relevant genetic marker matrix ${\bf G}$ that you can input into SEAGLE.

After unzipping the SEAGLE-example4-data.zip file, you will find two empty directories named plink_bash and R_codes. The plink_bash directory is where we will write the PLINK1.9 commands that you will run to create the .raw files for each gene. The R_codes directory is where we will write the R scripts for loading the resulting .raw files into R.

Writing .raw files for each gene in chromosome 22 with PLINK1.9

We will begin by telling R the location where you unzipped the SEAGLE-example4-data.zip file. Then we will read in the gene list in glist-hg19 and subset for the genes in chromosome 22. If you unzipped the file to your Downloads directory on a Mac, your path might look like the following.

dir <- "/Users/user_name/Downloads/SEAGLE-example4-data/"
glist <- read.table(paste0(dir,"glist-hg19"), header = F)
glist_chr22 <- subset(glist, glist$V1 == 22)

Next, we will remove any duplicated genes for chromosome 22 in our gene list.

dup <- glist_chr22$V4[duplicated(glist_chr22$V4)] # find duplicated gene
dup

Notice that in this example, there are 4 duplicated genes given by the following.

# GSTT2: chr22:24,322,339-24,326,106(GRCh37/hg19 by Ensembl)
# GSTT2B: chr22:24,299,601-24,303,373(GRCh37/hg19 by Ensembl)
# RIMBP3B: chr22:21,738,040-21,743,458(GRCh37/hg19 by Entrez Gene)
# RIMBP3C: chr22:21,899,646-21,905,750(GRCh37/hg19 by Ensembl)

We will update our gene list by manually updating the positions of the duplicated genes.

glist_chr22 <- glist_chr22[!duplicated(glist_chr22$V4),]
glist_chr22[which(glist_chr22$V4 == "GSTT2"),2:3] <- c(24322339, 24326106)
glist_chr22[which(glist_chr22$V4 == "GSTT2B"),2:3] <- c(24299601,24303373)
glist_chr22[which(glist_chr22$V4 == "RIMBP3B"),2:3] <- c(21738040,21743458)
glist_chr22[which(glist_chr22$V4 == "RIMBP3C"),2:3] <- c(21899646,21905750)
write.table(glist_chr22, "glist-hg19_chr22_updated", col.names = F, row.names = F, quote = F)

# replace hyphen with underscore
glist_chr22$V4 <- sub("-", "_", glist_chr22$V4)

Now we will use PLINK1.9 to create a .raw file for each gene in chromosome 22, which includes all SNPs for a given gene. The following code will produce a sequence of bash files to produce .raw files for each gene in glist_chr22. It will also produce a corresponding sequence of R code snippets for reaching each .raw file into R. Each bash file will run PLINK1.9 for num_genes=50 genes at a time.

# of plink commands per job
num_genes <- 50

bash_plink <- NULL # bash command
command <- c("#! /bin/bash", "#SBATCH --job-name=plink") # plink command
codeR <- NULL # R code
for (i in 1:nrow(glist_chr22)) {
  command <- c(command, paste0("plink --bfile 1kg_phase1_chr22 --make-set
  glist-hg19_chr22_updated --gene ", glist_chr22$V4[i]," --out ",
  glist_chr22$V4[i]," --export A") )

  # write R codes for reading .raw files, one R file for one gene
  codeR <- paste0(glist_chr22$V4[i], " <- read.delim(\"", glist_chr22$V4[i],".raw\", sep=\" \")")
  fileConn<-file(paste0(dir,"R_codes/", glist_chr22$V4[i],".R"))
  writeLines(codeR, fileConn)
  close(fileConn)

  if(i %% num_genes == 0 | i==nrow(glist_chr22)){
    # write plink commands in bash file
    fileConn<-file(paste0(dir,"plink_bash/plink_job",i-num_genes+1,"-",i,".sh"))
    writeLines(command, fileConn)
    close(fileConn)

    command <- c("#! /bin/bash", "#SBATCH --job-name=plink")
    bash_plink <- c(bash_plink, paste0("sbatch plink_job",i-num_genes+1,"-",i,".sh"))
  }
}

# Write bash commands to a text file
fileConn<-file(paste0("bash_plink.txt"))
writeLines(bash_plink, fileConn)
close(fileConn)

Note that if you want to run the .sh files in the plink_bash directory on your local computer, you can employ the bash command instead of SBATCH in terminal or command prompt. Additionally, if you are using Linux, you may need to specify plink1.9 rather than just plink.

Running the above code will setup the code and files to produce a single .raw file for each gene when running the commands in the bash_plink.txt file. The .raw file for each gene will contain all the SNPs for that gene.

Loading .raw files into R and extracting the genetic marker matrix ${\bf G}$

Now that we have data for each gene in the .raw format, we can load the .raw files sequentially into R for input into SEAGLE as follows. We’ll first load the SEAGLE package.

library(SEAGLE)

The following code shows how to test for the GxE interaction effect using SEAGLE on these genes. To illustrate how to use SEAGLE, we will generate synthetic phenotype and covariate data for n = 1092 study participants.

As an example, we’ve included .raw files for the following subset of genes from chromosome 22 in the extdata folder of this package: ACR, APOBEC3A, APOBEC3C, ARSA, ATF4, ATP5L2, BCRP2, BMS1P17, and BMS1P18. We’ve also included a corresponding gene list in glist-hg19_chr22_example. In practice, you will want to specify your own dir for the directory where you’ve stored your .raw files.

# Read in gene list
dir <- "../inst/extdata/"
genelist <- read.delim(paste0(dir, "glist-hg19_chr22_example"), sep=" ", header=FALSE)

# Identify number of genes in genelist
num_genes <- dim(genelist)[1]

# Generate synthetic phenotype and covariate data
n <- 1092 # number of study participants

set.seed(1)
y <- 2 * rnorm(n)

set.seed(2)
X <- as.matrix(rnorm(n))

set.seed(3)
E <- as.matrix(rnorm(n))

Now that we have ${\bf y}$, ${\bf X}$, ${\bf E}$, and ${\bf G}$, we can run SEAGLE on each gene in genelist. We will first perform data checking procedures on ${\bf G}$. Then we will input ${\bf y}$, ${\bf X}$, ${\bf E}$, and ${\bf G}$ for each gene into the prep.SEAGLE function. The intercept = 0 parameter indicates that the first column of ${\bf X}$ is not the all ones vector for the intercept.

The preparation procedure formats the input data for the SEAGLE function by checking the dimensions of the input data. It also pre-computes a QR decomposition for $\widetilde{\bf X} = \begin{pmatrix} {\bf 1}_{n} & {\bf X} & {\bf E} \end{pmatrix}$, where ${\bf 1}_{n}$ denotes the all ones vector of length n.

Afterwards, we will input the prepared data for each gene into the SEAGLE function to compute the score-like test statistic T and its corresponding p-value. The init.tau and init.sigma parameters are the initial values for estimating τ and σ in the REML EM algorithm.

# Initialize output containers for T and p-value for each gene
T.list <- numeric(length=num_genes)
pv.list <- numeric(length=num_genes)

# Run SEAGLE on each gene in genelist
for (i in 1:num_genes) {
  
  # Identify current gene
  gene_name <- genelist[i,4]
  
  # Obtain G
  Gtemp <- read.delim(paste0(dir, gene_name, ".raw"), sep=" ")
  G <- as.matrix(Gtemp[,-c(1:6)])
  L <- dim(G)[2]                ## number of SNPs
  
  # Make weights
  avg_newsnp <- colMeans(G)
  fA  = avg_newsnp/2            ## freq of allele "A"
  fa  = 1-fA                    ## freq of allele "a"
  maf = fA; maf[fA>0.5]= fa[fA>0.5]## maf should be b/w 0 and 0.5
  G = G[ ,maf>0]                ## only keep those SNPs with MAF>0
  maf <- maf[maf > 0]           ## Keep only MAF > 0)
  wt   = sqrt(maf^(-3/4))       ## Note we take the square root here
  if (length(wt) > 1) {
    G_final    = G %*% diag(wt) ## Input this G
  } else {
    T.list[i] <- NA
    pv.list[i] <- NA
    next
  }

  # Run SEAGLE
  objSEAGLE <- prep.SEAGLE(y=y, X=X, intercept=0, 
                           E=E, G=G_final)
  res <- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5)
  
  # Save T and p-value into output lists
  T.list[i] <- res$T
  pv.list[i] <- res$pv
}

The score-like test statistics T for the G×E effect and their corresponding p-values can be found in T.list and pv.list, respectively. We can take a look at the test statistics and p-values computed for each of the genes in genelist.

resMat <- cbind(T.list, pv.list)
colnames(resMat) <- c("T", "p-value")
resMat
#>               T   p-value
#>  [1,] 7693.9187 0.6148780
#>  [2,] 4022.4392 0.7990406
#>  [3,] 3375.4804 0.9771967
#>  [4,] 9698.0070 0.2089241
#>  [5,] 3523.1752 0.3668638
#>  [6,]  716.1017 0.3435655
#>  [7,] 4792.4809 0.9336786
#>  [8,]  139.5526 0.9186002
#>  [9,]  139.5526 0.9186002

Acknowledgments

Many thanks to Yueyang Huang for generating the example data, PLINK1.9 code and bash files, and R scripts for reading in the .raw files for this tutorial.