association studies (GWAS) have seen an increase in popularity and success due
to continued advancements in recent years. Such studies aim to provide a way to
identify common genetic variants that could account for the genetic risk components
of common human diseases. There are two common GWAS designs, a case-control
study and a population-based study. The study discussed here was based on the
case-control design. The disease in question for this study was
chronic kidney disease (CKD), a long-term condition that affects millions
worldwide. The aim of this GWAS was to identify SNPs that may be
associated with CKD. The
identification of such genetic variants may help to reveal the biological
processes underlying CKD and could aid in improving risk estimates and
detection for the disease. Following quality control and filtering,
the study identified two SNPs in the discovery dataset that also
appeared in the replication dataset. This could indicate a potential link between
these SNPs and the occurrence of CKD, but further investigation would need to
be carried out to verify these results. This review
also aims to describe the methods as well as some of the strengths and limitations
of GWAS for carrying out such studies.
Kidneys are responsible for the
life-sustaining functions of filtration, reabsorption, secretion and excretion within
the human body. The kidneys filter
and reabsorb over 220 litres of fluid to the bloodstream every 24 hours. Without
them toxic levels of waste products and excess water begin to build up in the body.
CKD is a long-term condition and usually refers to any renal condition whereby
the kidneys begin to gradually lose function. It is estimated that around 2.6
million people aged 16 and older suffer from CKD in England alone (1), and prevalence has been increasing
worldwide. It therefore presents a significant public health problem. In its most severe form,
end-stage renal disease (ESRD), dialysis is required. In 2009, it was estimated
that ESRD affected over 500,000 adults in the USA with expenditure reaching
$42.5 billion at that time (2). Typically, CKD is discovered by accident
when patients seek medical attention regarding something else or during routine
check-ups. If left undiscovered, by the time any outward symptoms appear it is
often too late for preventative measures and life-saving surgery may be the
While anyone can suffer from CKD it has
been found to be more prevalent among people of black and South Asian descent (3). CKD has many causes, but the two leading
and best-known causes are hypertensive nephropathy and diabetic kidney
disease (DKD), commonly referred to as diabetic nephropathy. DKD affects up
to 40% of diabetic patients and is the leading cause of ESRD (4). While ethnic background and
lifestyle have been established as risk factors for CKD, a portion of
the risk of CKD remains unexplained, which may suggest a genetic
contribution to CKD. Diabetic siblings of patients with ESRD due to diabetes were found to
be five times more at risk of developing ESRD than those without a family
history of the disease (5). The genetic
component of CKD has been shown in previous familial aggregation studies that
looked at families with a history of diabetes and hypertension. The heritability for glomerular filtration rate
(GFR) was estimated to range from 36 to 75% and from 16 to 49% for albuminuria (6)(7). Given the many potential genetic risk
factors for common diseases such as CKD, a genome-wide association study is an
excellent screening tool to discover genetic risk.
The study discussed here was based on the case-control design. Such studies typically compare the frequency of alleles or genotypes at
single-nucleotide polymorphisms (SNPs) (8) in order to determine if there is an
association between SNPs and disease phenotypes. The allele frequency of each SNP is compared between individuals
with a disease (known as the cases), and individuals without (known as the controls).
A precise definition of cases and controls is crucial, as case-control studies tend
to be prone to selection bias. This occurs when controls are not representative
of the population of cases, and this is one limitation often involved in GWAS.
The wider aim of such studies is to
identify sets of loci that may be linked to common complex diseases; these loci
require further analysis after GWAS. GWAS therefore act as an important
preliminary step in the gene identification process (9).
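The case-control comparison described above can be illustrated with a minimal sketch. It assumes genotypes are coded as the number of copies of the allele of interest (0, 1 or 2), with None for a missing call; the function name and the toy data are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence


def allele_frequencies(genotypes: Sequence[Optional[int]],
                       is_case: Sequence[bool]) -> dict:
    """Frequency of one allele in cases and in controls at a single SNP.

    genotypes: copies of the allele of interest per individual (0, 1 or 2),
               or None where the genotype call is missing (assumed coding).
    is_case:   True for cases, False for controls, in the same order.
    """
    # For each group keep [copies of the allele, total alleles observed].
    counts = {"case": [0, 0], "control": [0, 0]}
    for g, case in zip(genotypes, is_case):
        if g is None:
            continue  # a missing call contributes no alleles
        group = "case" if case else "control"
        counts[group][0] += g
        counts[group][1] += 2  # each diploid individual carries two alleles
    return {grp: (c / n if n else float("nan"))
            for grp, (c, n) in counts.items()}


# Hypothetical toy data: four cases followed by four controls.
genos = [2, 1, 1, None, 0, 1, 0, 0]
status = [True, True, True, True, False, False, False, False]
freqs = allele_frequencies(genos, status)
print(freqs)
```

A large difference between the two frequencies is what an association test then evaluates formally; the sketch itself performs no test.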
· Computer workstation with Windows GUI
software (10) for
genome-wide association analysis:
for data analysis and graphing:
Genotype quality control (QC) and filtering were
conducted at both the individual level and the SNP level. Filtering and QC are important
for removing potential sources of false-positive associations
within the datasets. Several QC steps must be carried out to
remove individuals or markers with particularly high error rates. Filtering and QC were carried out using gPLINK,
a Java-based program that provides a simplified interface for running common PLINK (10) commands. gPLINK
also allows results to be integrated into Haploview (11), which was used to produce the
Manhattan plots for this report.
The discovery dataset contained 39637 variants and 478 individuals,
of whom 233 were male and 245 were female. Per-individual QC of GWA data consists
of at least four steps, which involve the identification of individuals:
1. With discordant sex information
2. With outlying missing genotype or heterozygosity rates
3. That are duplicated or related
4. Of divergent ancestry (12)
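Step 2 of this list can be sketched as follows, using the cut-offs adopted in this study (a genotype failure rate of 0.03 and a heterozygosity rate 3 standard deviations from the mean). The sketch assumes the per-individual rates have already been computed, and it treats the failure-rate cut-off as "greater than or equal to"; the function name and the toy values are hypothetical.

```python
from statistics import mean, stdev


def flag_individuals(missing_rate, het_rate, max_missing=0.03, n_sd=3.0):
    """Return indices of individuals failing either per-individual filter.

    missing_rate: proportion of failed genotype calls per individual.
    het_rate:     heterozygosity rate per individual.
    """
    mu = mean(het_rate)
    sd = stdev(het_rate)  # sample standard deviation across individuals
    flagged = []
    for i, (miss, het) in enumerate(zip(missing_rate, het_rate)):
        too_missing = miss >= max_missing        # assumed ">=" comparison
        het_outlier = abs(het - mu) > n_sd * sd  # more than n_sd SDs away
        if too_missing or het_outlier:
            flagged.append(i)
    return flagged


# Hypothetical toy data for 20 individuals.
missing = [0.01] * 20
missing[2] = 0.05                      # one individual with many failed calls
het = [0.30] * 10 + [0.31] * 9 + [0.60]  # the last individual is an outlier
flagged = flag_individuals(missing, het)
print(flagged)
```

In practice the two rates are usually inspected together on a scatter plot before thresholds are fixed, so the hard-coded defaults here are only a starting point.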
The first step was to convert the MAP files into BED files and then check
the discovery dataset for potential sample identity problems. Each of these
steps was carried out in gPLINK following standard procedure for such GWAS.
The standard protocol and reasoning behind each step can be found in the
literature (12) (13). After
examining other GWAS, it was decided to
exclude all individuals with a genotype failure rate ≥ 0.03 and/or a
heterozygosity rate more than ± 3 standard deviations from the mean (12). To reduce computational
complexity, the SNPs used to create the identity by state (IBS) matrix
in the next step were taken from a pre-pruned dataset. Duplicated or related
individuals were filtered using an identity by descent (IBD) threshold of > 0.185. This value was chosen because it
is standard in the literature and lies approximately halfway between the
expected IBD for third- and second-degree relatives (0.125 and 0.25, respectively) (14). The five nearest neighbours were identified for each
individual based upon the pairwise IBS distance. The IBS distance to each of the
five nearest neighbours was then transformed into a Z score. Individuals whose minimum Z score among the five nearest neighbours was less than
-4 were excluded from analysis as population outliers (15).
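The nearest-neighbour outlier check can be sketched roughly as below. This is a simplified illustration and not the software's actual implementation: it assumes a symmetric matrix of pairwise IBS similarity (higher means more similar), standardises each neighbour rank across all individuals, and takes the minimum Z score per individual. The function name and the toy matrix are hypothetical.

```python
from statistics import mean, stdev


def neighbour_min_z(ibs, k=5):
    """Minimum Z score over each individual's k nearest neighbours.

    ibs: symmetric matrix (list of lists) of pairwise IBS similarity.
         Requires more than k individuals.
    """
    n = len(ibs)
    # Similarities to the k nearest neighbours, nearest first.
    nearest = [sorted((ibs[i][j] for j in range(n) if j != i),
                      reverse=True)[:k]
               for i in range(n)]
    min_z = [float("inf")] * n
    for rank in range(k):
        # Similarity to the (rank + 1)-th nearest neighbour, per individual.
        column = [row[rank] for row in nearest]
        mu, sd = mean(column), stdev(column)
        for i, value in enumerate(column):
            z = (value - mu) / sd if sd > 0 else 0.0
            min_z[i] = min(min_z[i], z)
    return min_z


# Hypothetical toy data: 29 similar individuals and one distant individual.
n = 30
ibs = [[1.0 if i == j else (0.60 if 29 in (i, j) else 0.80)
        for j in range(n)] for i in range(n)]
scores = neighbour_min_z(ibs)
outliers = [i for i, z in enumerate(scores) if z < -4]
print(outliers)
```

An individual who is unusually dissimilar even to its closest neighbours receives a strongly negative Z score, which is why the cut-off is applied to the minimum.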
Per-marker QC of GWA data consists of at least four steps, which involve the identification of SNPs:
1. With an excessive missing genotype rate
2. Demonstrating a significant deviation from Hardy-Weinberg equilibrium
3. With significantly different missing genotype rates between cases and controls
4. With a very low minor allele frequency (MAF) (12)
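The per-marker quantities in this list can be sketched for a single SNP as follows. The sketch assumes genotypes coded as 0, 1 or 2 copies of one allele with None for a missing call. For the Hardy-Weinberg check it swaps in a simple chi-square goodness-of-fit test with one degree of freedom, which is not necessarily the test used by the software in this study. The function name and data are hypothetical.

```python
from math import erfc, sqrt


def snp_summary(genotypes):
    """Return (missing rate, minor allele frequency, HWE p-value) for a SNP."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    n = len(called)
    if n == 0:
        return missing_rate, float("nan"), float("nan")
    observed = [called.count(0), called.count(1), called.count(2)]
    # Frequency of the counted allele, then the minor allele frequency.
    p = (2 * observed[2] + observed[1]) / (2 * n)
    maf = min(p, 1 - p)
    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    hwe_p = erfc(sqrt(chi2 / 2))
    return missing_rate, maf, hwe_p


# Hypothetical SNP in exact Hardy-Weinberg proportions with 5 missing calls.
balanced = [0] * 25 + [1] * 50 + [2] * 25 + [None] * 5
miss, maf, hwe_p = snp_summary(balanced)

# Hypothetical SNP where every individual is heterozygous (strong deviation).
_, _, hwe_p_het = snp_summary([1] * 100)
print(miss, maf, hwe_p, hwe_p_het)
```

Step 3 (differential missingness between cases and controls) would apply the same missing-rate calculation separately to each group and is omitted here for brevity.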
It should be noted that there are no universally accepted thresholds for the exclusion
criteria in QC, but all values used below were chosen based on other similar
GWAS literature (16)(17).
Variants were excluded if they did not meet the following thresholds:
SNP missingness: 0.05
Individual missingness: 0.03
This produced the clean GWA dataset, which was then used in the
association analysis. A conventional χ² test
for association was carried out, details of which can be found in the
literature (8). The
following criteria were selected:
required observation per cell: 5
intervals: 0.95
options: max(T) permutation mode: 10000
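The idea behind a max(T) permutation procedure can be sketched as below. This is an illustrative outline under stated assumptions, not the software's implementation: it uses an allelic chi-square statistic, assumes complete genotypes coded 0, 1 or 2, shuffles case/control labels, and compares each observed statistic with the largest statistic seen across all SNPs in each permutation, using the common (r + 1) / (n + 1) estimator. All names and the toy data are hypothetical.

```python
import random


def allelic_chi2(genotypes, is_case):
    """Allelic chi-square statistic from a 2 x 2 table of allele counts."""
    a = b = c = d = 0
    for g, case in zip(genotypes, is_case):
        if case:
            a += g
            b += 2 - g
        else:
            c += g
            d += 2 - g
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def max_t_pvalues(snps, is_case, n_perm=10000, seed=0):
    """Empirical p-values corrected across SNPs by the max(T) approach."""
    rng = random.Random(seed)
    observed = [allelic_chi2(g, is_case) for g in snps]
    exceed = [0] * len(snps)
    labels = list(is_case)
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any genotype-phenotype link
        max_stat = max(allelic_chi2(g, labels) for g in snps)
        for i, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[i] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]


# Hypothetical toy data: 10 cases then 10 controls, two SNPs.
status = [True] * 10 + [False] * 10
snp_assoc = [2] * 10 + [0] * 10   # allele present only in cases
snp_null = [1] * 20               # identical in everyone, no signal
pvals = max_t_pvalues([snp_assoc, snp_null], status, n_perm=200)
print(pvals)
```

Because each observed statistic is compared with the maximum over all SNPs, the resulting p-values account for the number of markers tested, at the cost of being conservative for weaker signals.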
The odds ratio (OR) was then
calculated using a logistic regression model without
covariates. The PLINK (10) option “--allow-no-sex” was used for each step
in the association analysis, together with the alternate phenotype file.
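As a worked illustration of the association test and odds ratio, the sketch below computes an allelic χ² statistic, its p-value, and a cross-product odds ratio with a 95% confidence interval from a 2 x 2 table of allele counts. The cross-product OR is swapped in here for simplicity and is not the logistic regression fit described above. The function name and the counts are hypothetical, and all four cells are assumed to be non-zero.

```python
from math import erfc, exp, log, sqrt


def allelic_test(a, b, c, d, z=1.96):
    """Chi-square test and odds ratio from allele counts.

    a, b: minor and major allele counts in cases.
    c, d: minor and major allele counts in controls.
    z:    normal quantile for the interval (1.96 gives about 95%).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = (exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se))
    return chi2, p_value, odds_ratio, ci


# Hypothetical allele counts: 100 cases and 100 controls (200 alleles each).
chi2, p_value, odds_ratio, ci = allelic_test(60, 140, 40, 160)
print(chi2, p_value, odds_ratio, ci)
```

An OR above 1 indicates that the minor allele is more frequent in cases than in controls; an interval that excludes 1 is consistent with a p-value below 0.05.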
The values in the P column of the
data produced were then filtered by p