This tutorial demonstrates how to
use the SEAGLE
package when the genotype data are from GWAS
studies (.bed + .bim + .fam) or next generation sequencing studies
(.vcf). We recommend using PLINK1.9 available from https://www.cog-genomics.org/plink/1.9/
to generate a .raw file (see https://www.cog-genomics.org/plink/2.0/formats#raw)
for each of the SNP sets. The .raw file records the allelic dosage for
each SNP.
Examples files containing 1kg_phase1_chr22.bed
,
1kg_phase1_chr22.bim
, 1kg_phase1_chr22.fam
,
and 1kg_phase1_chr22.vcf
files and a corresponding gene
list in glist-hg19
can be downloaded via the
SEAGLE-example3-data.zip
file from http://jocelynchi.com/SEAGLE/SEAGLE-example3-data.zip.
To follow along with the PLINK conversion procedures, you will first
need to install PLINK1.9 from https://www.cog-genomics.org/plink/1.9/
and unzip the example SEAGLE-example3-data.zip
file.
Afterwards, you can type the PLINK commands below in a command prompt or
terminal window in the folder where these files are located. This will
produce .raw files that you can read into R to obtain the relevant
genetic marker matrix ${\bf G}$ for
input into SEAGLE
.
For GWAS data in the .bed + .bim + .fam format, we recommend first converting to the .raw format with PLINK1.9. To illustrate the analysis of a SNP set from a single gene, we employ the gene ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from gene ACR.
plink --bfile 1kg_phase1_chr22 --make-set glist-hg19 --gene ACR --out ACR --export A
To illustrate the analysis of a SNP set from multiple genes, we employ the sequence of genes from A4GALT and ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from genes A4GALT and ACR.
plink --bfile 1kg_phase1_chr22 --make-set glist-hg19 --gene A4GALT ACR --out A4GALT_ACR --export A
For next generation sequencing data in the .vcf format, we recommend first converting to the .raw format with PLINK1.9. To illustrate the analysis of a SNP set from a single gene, we employ the gene ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from gene ACR.
plink --vcf 1kg_phase1_chr22.vcf --make-set glist-hg19 --gene ACR --out ACR --export A
To illustrate the analysis of a SNP set from multiple genes, we employ the sequence of genes from A4GALT and ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from genes A4GALT and ACR.
plink --vcf 1kg_phase1_chr22.vcf --make-set glist-hg19 --gene A4GALT ACR --out A4GALT_ACR --export A
Now that we have the genotype data in the .raw format, we can load it
into R as follows. We’ll first load the SEAGLE
package.
The following code will load the ACR.raw
file produced
from the PLINK conversion for GWAS data for single gene analysis
described above. Of course, the actual file location will depend on
where you stored the results of the PLINK conversion on your local
drive.
acr <- read.delim("ACR.raw", sep=" ")
As an example, we’ve included the ACR.raw
file in the
extdata
folder of this package. The following code loads
that file so we can use it in this tutorial.
acr_loc <- system.file("extdata", "ACR.raw", package = "SEAGLE")
acr <- read.delim(acr_loc, sep=" ")
We can take a quick look at the resulting acr
object.
The first 6 columns correspond to identifying information.
dim(acr)
#> [1] 1092 107
head(acr)
#> FID IID PAT MAT SEX PHENOTYPE rs200755264_A rs145373249_G rs201146734_G
#> 1 0 HG00096 0 0 1 -9 0 0 0
#> 2 0 HG00097 0 0 2 -9 0 0 0
#> 3 0 HG00099 0 0 2 -9 0 0 0
#> 4 0 HG00100 0 0 2 -9 0 0 0
#> 5 0 HG00101 0 0 1 -9 0 0 0
#> 6 0 HG00102 0 0 2 -9 0 0 0
#> rs79314756_T rs200079264_C rs142494818_T rs77216309_T rs190253671_C
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs182163802_A rs73174437_T rs12165648_A rs116026001_G rs146910789_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 1 0 0 0
#> 6 0 0 0 0 0
#> rs185375060_A rs191165317_A rs140628153_T rs1064733_T rs114634361_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs150273293_T rs77313877_A rs189214490_A rs75385660_C rs147948077_A
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs200217266_G rs2285395_A rs143887109_A rs140364806_T rs182889325_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs6010067_C rs58109483_G rs141844965_T rs145800154_C rs148192955_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs2283675_A rs140237069_C rs191736410_C rs59998603_T rs142530125_G
#> 1 1 0 0 0 0
#> 2 1 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs184104408_T rs188103638_T rs146034851_C rs138777072_A rs192800892_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs2283676_A rs185550366_T rs190654794_C rs181504904_G rs186157598_A
#> 1 1 0 0 0 0
#> 2 1 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs188568125_T rs181184416_T rs185832541_A rs5770999_T rs73433406_T
#> 1 0 0 0 1 0
#> 2 0 0 0 2 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs150628453_A rs6010068_T rs190741616_T rs6010069_G rs9616824_T rs6010070_G
#> 1 0 1 0 0 1 0
#> 2 0 1 0 0 0 0
#> 3 0 1 0 1 0 1
#> 4 0 0 0 1 0 0
#> 5 0 1 0 1 0 1
#> 6 0 0 0 0 0 0
#> rs183029053_G rs150059289_C rs186986450_T rs201638160_T rs57763389_A
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs144808185_T rs138662706_C rs199878967_A rs141891250_T rs201174415_CT
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 1 1 1 0 0
#> 6 1 1 1 0 0
#> rs112131405_C rs13058392_G rs149554062_A rs6010072_G rs9616953_C rs6009960_G
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 1 0 1 1 1 1
#> 4 1 0 1 1 1 1
#> 5 1 0 0 1 1 1
#> 6 1 0 0 1 1 0
#> rs9616954_G rs13056271_C rs56807126_C rs34756634_G rs9616955_T rs13056621_A
#> 1 0 0 0 0 0 0
#> 2 0 0 0 1 0 1
#> 3 1 0 0 1 1 1
#> 4 0 1 1 0 0 0
#> 5 0 1 0 0 0 0
#> 6 0 0 0 0 0 1
#> rs201458351_G rs9616825_G rs182754197_G rs146416000_A rs6010073_C
#> 1 0 1 0 0 0
#> 2 0 2 1 0 0
#> 3 0 0 0 0 2
#> 4 0 0 0 0 0
#> 5 0 0 0 0 1
#> 6 1 0 0 0 0
#> rs187441595_T rs116139503_G rs190644757_G rs28516879_A rs182503014_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 1
#> 3 0 0 0 1 0
#> 4 0 0 0 0 0
#> 5 0 0 0 1 1
#> 6 0 0 0 0 1
#> rs188459010_C rs56238942_G rs6009961_A rs6010074_T rs139027620_C rs58651371_T
#> 1 0 0 1 0 0 1
#> 2 0 0 2 0 0 2
#> 3 0 0 0 0 1 0
#> 4 0 0 0 0 0 0
#> 5 0 0 0 0 1 1
#> 6 0 0 0 0 1 1
#> rs71232632_G rs5771002_A rs184517959_T rs201494682_T
#> 1 0 1 0 0
#> 2 0 1 0 0
#> 3 0 1 2 0
#> 4 0 1 0 0
#> 5 0 1 1 0
#> 6 0 2 0 0
The remaining columns contain our genetic marker matrix ${\bf G}$. We’ll extract these to input into
SEAGLE
.
As an example, we’ll generate synthetic phenotype, ${\bf y}$, and covariate data, ${\bf X}$ and ${\bf E}$.
# Determine number of individuals and loci
n <- dim(G)[1]
L <- dim(G)[2]
# Generate synthetic phenotype and covariate data
set.seed(1)
y <- 2 * rnorm(n)
set.seed(2)
X <- rnorm(n)
set.seed(3)
E <- rnorm(n)
Now that we have ${\bf y}$, ${\bf X}$, ${\bf
E}$, and ${\bf G}$, we can
prepare our data for input into SEAGLE
. We will input our
${\bf y}$, ${\bf X}$, ${\bf
E}$, and ${\bf G}$ into the
prep.SEAGLE
function. The intercept = 0
parameter indicates that the first column of ${\bf X}$ is not the all ones vector for the
intercept.
This preparation procedure formats the input data for the
SEAGLE
function by checking the dimensions of the input
data. It also pre-computes a QR decomposition for $\widetilde{\bf X} = \begin{pmatrix} {\bf 1}_{n}
& {\bf X} & {\bf E} \end{pmatrix}$, where ${\bf 1}_{n}$ denotes the all ones vector of
length n.
Now we can input the prepared data into the SEAGLE
function to compute the score-like test statistic T and its corresponding p-value. The
init.tau
and init.sigma
parameters are the
initial values for estimating τ and σ in the REML EM algorithm.
The score-like test statistic T for the G×E effect and its corresponding p-value can
be found in res$T
and res$pv
,
respectively.
Many thanks to Yueyang Huang for his help with generating the example data and PLINK1.9 code for this tutorial.