association studies (GWAS) have seen an increase in popularity and success due
to continued advancements in recent years. Such studies aim to provide a way to
identify common genetic variants that could account for the genetic risk components
of common human diseases. There are two common GWAS designs, a case-control
study and a population-based study. The study discussed here was based on the
case-control design. The disease in question for this study was
chronic kidney disease (CKD), a long-term condition that affects millions
worldwide. The aim of this GWAS was to identify SNPs that may be
associated with CKD. The
identification of such genetic variants may help to reveal the biological
processes underlying CKD and could aid in improving risk estimates and
detection for the disease. Following quality control and filtering,
the study identified two SNPs in the discovery dataset that also
appeared in the replication dataset. This could indicate a potential link between
these SNPs and the occurrence of CKD, but further investigation would need to
be carried out to verify these results. This review
also aims to describe the methods as well as some of the strengths and limitations
of GWAS for carrying out such studies.
Kidneys are responsible for the
life-sustaining functions of filtration, reabsorption, secretion and excretion within
the human body. The kidneys filter
and reabsorb over 220 litres of fluid to the bloodstream every 24 hours. Without
them, toxic levels of waste products and excess water begin to build up in the body.
CKD is a long-term condition and usually refers to any renal condition whereby
the kidneys begin to gradually lose function. It is estimated that around 2.6
million people aged 16 and older suffer from CKD in England alone (1), and prevalence is increasing
worldwide. It therefore presents a significant public health problem. In its most severe form,
end-stage renal disease (ESRD), dialysis is required. In 2009, it was estimated
that ESRD affected over 500,000 adults in the USA with expenditure reaching
$42.5 billion at that time (2). Typically, CKD is discovered by accident
when patients seek medical attention regarding something else or during routine
check-ups. If left undiscovered, by the time any outward symptoms appear it is
often too late for preventative measures and life-saving surgery may be the
only option. While anyone can suffer from CKD, it has
been found to be more prevalent among people of black and South Asian descent (3). CKD has many causes; however, the two leading
and best-known causes are hypertensive nephropathy and diabetic kidney
disease (DKD), commonly referred to as diabetic nephropathy. DKD affects up
to 40% of diabetic patients and is the leading cause of ESRD (4). While ethnic background and
lifestyle have been established as risk factors for CKD, a portion of
the risk still remains unexplained. This may suggest a possible genetic
contribution to CKD. Diabetic siblings of patients with diabetes-related ESRD were found to
be five times more at risk of developing ESRD than those without a family
history of the disease (5). The genetic
component of CKD has been shown in previous familial aggregation studies that
looked at families with a history of diabetes and hypertension. The heritability for glomerular filtration rate
(GFR) was estimated to range from 36 to 75% and from 16 to 49% for albuminuria (6)(7). Given the many potential genetic risk
factors for common diseases such as CKD, a genome-wide association study is an
excellent screening tool to discover genetic risk.
The study discussed here was based on the case-control design. Such studies typically compare the frequency of alleles or genotypes at
single-nucleotide polymorphisms (SNPs) (8) in order to determine whether there is an
association between SNPs and disease phenotypes. The allele frequency of each SNP is compared between individuals
with a disease (known as the cases) and individuals without (known as the controls).
A precise definition of cases and controls is crucial, as case-control studies tend
to be prone to selection bias. This occurs when controls are not representative
of the population of cases, and this is one limitation often involved in GWAS.
The wider aim of such studies is to
identify sets of loci that may be linked to common complex diseases; these loci
require further analysis after GWAS. GWAS therefore act as an important
preliminary step in the gene identification process (9).
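The case-control comparison described above starts from allele frequencies in each group. The following is a minimal sketch of that counting step only, assuming genotype counts per SNP are already available; the function names and the counts are hypothetical and are not taken from the study data.

```python
def allele_counts(n_aa, n_ab, n_bb):
    """Return (copies of allele A, copies of allele B) from genotype counts."""
    return 2 * n_aa + n_ab, n_ab + 2 * n_bb


def allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of allele A among all chromosomes sampled."""
    a, b = allele_counts(n_aa, n_ab, n_bb)
    return a / (a + b)


# Hypothetical genotype counts (AA, AB, BB) at a single SNP.
cases = (30, 50, 20)
controls = (45, 45, 10)

freq_cases = allele_frequency(*cases)
freq_controls = allele_frequency(*controls)
print(f"A allele frequency: cases={freq_cases:.3f}, controls={freq_controls:.3f}")
```

A difference in allele frequency between the two groups is what the association test then assesses formally.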
· Computer workstation with Windows GUI
software (10) for
genome-wide association analysis:
for data analysis and graphing:
Genotype quality control (QC) and filtering were
conducted at both the individual level and the SNP level. Filtering and QC are important
for removing potential false positive associations
within the datasets. Several QC steps must be carried out to
remove individuals or markers with particularly high error rates. Filtering and QC were carried out using gPLINK.
gPLINK is a Java-based program that allows common PLINK (10) commands to be run through a simplified interface. gPLINK
allows for integration of results into Haploview (11), which was then used to produce the
Manhattan plots for this report.
The discovery dataset contained 39637 variants and 478 individuals,
233 of whom were male and 245 of whom were female. Per-individual QC of GWA data consists
of at least four steps, which involve the identification of individuals:
1. With discordant sex information
2. With outlying missing genotype or heterozygosity rates
3. Who are duplicated or related
4. Of divergent ancestry (12)
The first step was to convert the MAP files into BED files and then check
the discovery dataset for potential sample identity problems. Each of these
steps was carried out in gPLINK following standard procedure for such GWAS.
The standard protocol and reasoning behind each step can be found in the
literature (12) (13). After
carefully examining other GWAS, it was decided to
exclude all individuals with a genotype failure rate ≥ 0.03 and/or
heterozygosity rate ± 3 standard deviations from the mean (12). To reduce computational
complexity, the SNPs used to create the identity by state (IBS) matrix
in the next step were taken from a pre-pruned dataset. Duplicated or related
individuals were filtered using an identity by descent (IBD) threshold of > 0.185. This figure was chosen because it
is standard in the literature and is considered to be halfway between the
expected IBD for third- and second-degree relatives (14). Five nearest neighbours were identified for each
individual based upon the pairwise IBS distance. The IBS distance to each of the
five nearest neighbours was then transformed into a Z score. Individuals whose minimum Z score among the five nearest neighbours was less than
-4 were excluded from analysis as population outliers (15).
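The individual-level exclusions above can be illustrated with a short sketch. It assumes that a missing genotype rate and a heterozygosity rate have already been computed for each individual, and that pairwise IBD estimates are available; the function names and sample values are hypothetical, and this is not the gPLINK procedure itself.

```python
from statistics import mean, stdev

MISSING_MAX = 0.03     # genotype failure rate cut-off quoted in the text
HET_SD_LIMIT = 3.0     # heterozygosity limit in standard deviations, from the text
IBD_THRESHOLD = 0.185  # relatedness cut-off quoted in the text

# Expected IBD is 0.25 for second-degree and 0.125 for third-degree relatives,
# so the exact midpoint is 0.1875; the text adopts 0.185 from the literature.
IBD_MIDPOINT = (0.25 + 0.125) / 2


def flag_individuals(samples):
    """samples maps id -> (missing_rate, het_rate). Returns the ids to exclude."""
    het_values = [het for _, het in samples.values()]
    het_mean = mean(het_values)
    het_sd = stdev(het_values)
    excluded = set()
    for sample_id, (missing, het) in samples.items():
        too_missing = missing >= MISSING_MAX
        het_outlier = abs(het - het_mean) > HET_SD_LIMIT * het_sd
        if too_missing or het_outlier:
            excluded.add(sample_id)
    return excluded


def related_pairs(pairwise_ibd, threshold=IBD_THRESHOLD):
    """pairwise_ibd maps (id_1, id_2) -> estimated IBD proportion."""
    return [pair for pair, ibd in pairwise_ibd.items() if ibd > threshold]


# Hypothetical per-individual summaries; not taken from the study data.
samples = {f"S{i}": (0.01, 0.300 + 0.001 * (i % 3)) for i in range(30)}
samples["OUT"] = (0.01, 0.450)   # heterozygosity outlier
samples["MISS"] = (0.05, 0.301)  # too many missing genotypes
print(sorted(flag_individuals(samples)))
print(related_pairs({("S1", "S2"): 0.02, ("S3", "S4"): 0.26}))
```

In practice one individual from each related pair would be removed, typically the one with the lower call rate; that choice is not shown here.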
Per-marker QC of GWA data consists of at least four steps, which involve the identification of SNPs:
1. With an excessive missing genotype rate
2. Demonstrating a significant deviation from Hardy-Weinberg equilibrium (HWE)
3. With significantly different missing genotype rates between cases and controls
4. With a very low minor allele frequency (MAF), all of which are removed (12)
It should be
noted that there are no universally accepted thresholds for the exclusion
criteria in QC, but all values used below were chosen based on other similar
GWAS literature (16)(17).
Variants were excluded if they did not meet the following thresholds:
SNP missingness: 0.05
Individual missingness: 0.03
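As an illustration of the marker-level filters, the sketch below computes the missing genotype rate and MAF for one SNP and applies the SNP missingness threshold listed above. The MAF cut-off of 0.01 is an assumption made for the example, since no MAF threshold is stated here, and the genotype coding and function names are hypothetical.

```python
SNP_MISSING_MAX = 0.05  # SNP missingness threshold listed in the text
MAF_MIN = 0.01          # assumed value for illustration; not stated in the text


def snp_summary(genotypes):
    """genotypes: per-individual copies of one allele (0, 1 or 2), None if missing.

    Returns (missing_rate, minor_allele_frequency).
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    freq = sum(called) / (2 * len(called))
    return missing_rate, min(freq, 1 - freq)


def passes_marker_qc(genotypes):
    """True if the SNP meets both the missingness and the MAF criteria."""
    missing_rate, maf = snp_summary(genotypes)
    return missing_rate <= SNP_MISSING_MAX and maf >= MAF_MIN


# Hypothetical genotype vectors for two SNPs.
well_typed = [0, 1, 2, 1]
poorly_typed = [0, 0, None, 1]
print(passes_marker_qc(well_typed), passes_marker_qc(poorly_typed))
```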
This produced the clean GWA dataset, which was then used in the
association analysis. A conventional χ² test
for association was carried out, details of which can be found in the
literature (8). The
following criteria were selected:
· required observation per cell: 5
· intervals: 0.95
· options: max(T) permutation mode: 10000
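A conventional χ² test on allele counts can be sketched as follows. This is a simplified illustration rather than PLINK's implementation: it treats "required observation per cell" as a minimum observed count in the 2×2 table, which is an assumption, and it omits the max(T) permutation step entirely. The function name and counts are hypothetical.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b, min_cell=5):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value), or None if any observed cell is below min_cell.
    """
    if min(case_a, case_b, ctrl_a, ctrl_b) < min_cell:
        return None
    observed = [[case_a, case_b], [ctrl_a, ctrl_b]]
    row_totals = [case_a + case_b, ctrl_a + ctrl_b]
    col_totals = [case_a + ctrl_a, case_b + ctrl_b]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: (A, B) in cases, then (A, B) in controls.
result = allelic_chi_square(110, 90, 135, 65)
print(result)
```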
The odds ratio (OR) was then
calculated according to a model of logistic regression without considering
covariates. The PLINK (10) option “--allow-no-sex” was used for each step
in the association analysis, together with the alternate phenotype file.
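To show what an odds ratio with a 0.95 confidence interval looks like, the sketch below uses the cross-product ratio of a 2×2 allele count table with a Woolf (log-scale) interval. This is a stand-in chosen for brevity and is not the logistic regression fit used in the study, which models genotypes per individual; the function name and counts are hypothetical.

```python
import math


def odds_ratio_ci(case_a, case_b, ctrl_a, ctrl_b, z=1.96):
    """Cross-product odds ratio for allele A with a Woolf confidence interval.

    z=1.96 corresponds to a two-sided 95% interval. All counts must be > 0.
    """
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    se_log_or = math.sqrt(1 / case_a + 1 / case_b + 1 / ctrl_a + 1 / ctrl_b)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: (A, B) in cases, then (A, B) in controls.
estimate, lower, upper = odds_ratio_ci(110, 90, 135, 65)
print(f"OR={estimate:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

An interval that excludes 1 indicates that the allele is associated with either raised or lowered odds of disease at the chosen confidence level.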
The values in the P column of the data produced were then filtered by p < 10⁻⁵ to identify the statistically significant associations. While the current standard for genome-wide significance is p ≤ 5 × 10⁻⁸ (18), some argue that this threshold is too conservative (19) and that relaxing it may be appropriate for some studies (20). The National Human Genome Research Institute has used the cut-off value of p < 10⁻⁵ in over 700 GWA studies (21), and this value was chosen for this study also.

All of the QC, filtering and association analysis steps were repeated on a replication dataset. SNPs for replication were not ideally selected, but rather were genotypes available from another genotyping laboratory for a DKD cohort and were a 'best-case-scenario' available to follow up the results of the discovery GWAS. There were no clinical covariates available for the replication dataset. The replication dataset contained 7 variants and 96 individuals of unspecified sex. The PLINK (10) option "--allow-no-sex" was used for each step in the replication, as the sex of the individuals was ambiguous in this case.

Results

The discovery dataset contained 39637 variants and 478 individuals, 233 of whom were male and 245 of whom were female. After genotype QC and filtering, the clean GWA data contained 267868 variants and 465 individuals (229 cases and 235 controls). Following the logistic regression step, the values in the P column of the data produced were filtered by p < 10⁻⁵ to identify any statistically significant associations. This produced two SNPs, which can be seen in Table 1 below.

CHR  SNP        P
13   rs1591173  8.69E-06
13   rs4522294  5.00E-06

Table 1: SNPs remaining in the discovery dataset following QC and filtering

The replication dataset was much smaller and contained only 7 variants and 96 individuals (no specified sex), 61 of whom were cases and 35 controls. When QC was carried out on the replication cohort, the clean replication GWA data contained 7 variants and 92 individuals.
The 7 SNPs can be seen in Table 2 below.

CHR  SNP         P
1    rs12124937  1.73E-05
2    rs9287656   0.0005121
2    rs10173491  0.0005121
11   rs3740769   2.17E-05
13   rs1591173   0.0001181
13   rs4522294   0.0001181
14   rs8008661   1.73E-05

Table 2: SNPs remaining after QC on the replication dataset

Haploview (11) was used to produce a Manhattan plot of all SNPs from the discovery dataset following the logistic regression phase. This can be seen in Fig 1 accompanying this report. Fig 2 shows the 2 SNPs remaining after the discovery data were filtered by p < 10⁻⁵.

Discussion

The results from the discovery dataset appear to show that 2 SNPs out of 39637 were statistically significant following QC and association analysis. The same 2 SNPs were also present in the replication cohort: rs1591173 and rs4522294, both located on chromosome 13. This would seem to suggest that the results were replicated. Replication is essential for establishing the credibility of a genotype–phenotype association. However, there is still ongoing debate on what constitutes an adequate replication study (22).

Despite this result, the dataset provided for the replication phase of the study was far too small for QC to be carried out meaningfully. Running the QC on the replication dataset had essentially no impact on the final SNPs. There were not enough SNPs present in the replication data to identify any relationship between samples. Small sample size is a frequent problem in such GWAS and usually results in insufficient power to detect minor contributors of one or more alleles (22). Similarly, 'data dredging' is another significant problem in such GWAS (23). Because there are no defined criteria for the thresholds used during QC, data can be altered in order to achieve results that appear to be statistically significant and worthy of publication. As stated earlier, the replication dataset was a highly selected group of patients with DKD; genotype distributions would therefore not be expected to be within HWE.
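Whether a SNP's genotype distribution is within HWE can be checked with a goodness-of-fit test. The sketch below uses the simple χ² approximation with one degree of freedom rather than an exact test, and the function name and genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 df) of genotype counts against HWE.

    Returns (chi2, p_value). A monomorphic SNP returns (0.0, 1.0).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0, 1.0
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: the first SNP matches HWE proportions, the second has
# far fewer heterozygotes than expected.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(40, 20, 40))
```

In a highly selected case group, a departure from HWE may reflect the selection itself rather than genotyping error, which is why HWE filters are often applied to controls only.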
One solution to the problem of replicating results could be to require that replication studies use the same phenotype and definition of phenotype as the discovery cohort, so as to help avoid false positives (23). This was not the case in this study: the individuals in the discovery cohort were said to have CKD, with no mention of any underlying illness that may have caused it, whereas individuals in the replication cohort specifically had DKD. Around 40% of diabetics will develop DKD (24), and it would therefore be expected that instances of kidney disease would be much more prevalent in these individuals. Thus, it cannot be said that the results show a credible association rather than a chance finding.

Another issue with the replication data in this study was that the sex of the individuals was not specified. The "--allow-no-sex" option had to be used for the association and replication analyses. This was necessary because, when sex is not recorded, PLINK (10) sets the phenotypes of ambiguous-sex individuals to missing, and the process would not have generated any association results. In the field of GWAS the importance of QC has been well appreciated for some time, and even small sources of systematic or random error can result in false associations or obscure real ones. Allowing the sex to remain ambiguous would therefore be likely to have an impact on results. In fact, some have suggested that separate studies should be carried out for males and females altogether. There is mounting evidence of the importance of sex-differentiated effects in complex traits (25). Consider rs17810398 within the DAPL1 gene. Previous GWAS did not uncover any association between rs17810398 and age-related macular degeneration. However, when analyses were stratified by sex, the association was found to be highly significant for females (p = 2.6 × 10⁻⁸) but not males (p = 0.382) (26).
This shows that significant associations can be lost when combining male and female data into one dataset. It has also been reported that sex-differentiated or sex-specific effects may contribute to the "missing heritability" of complex traits in such GWAS (25).

Another issue with the datasets used in this study was that no ethnic background was given for any of the individuals. As mentioned at the beginning of the report, CKD is more prevalent among people of black and South Asian descent (3). It has been previously documented that, in association analysis, population structure can cause spurious findings if not accounted for, and it is one of the most common reasons that results are not replicated (27)(28).

GWAS does have its advantages and has allowed millions of SNPs to be genotyped and studied over the years. However, genetic effects due to common alleles are small, and detection often requires much larger sample sizes. Single GWAS do not tend to have enough power to detect such associations (29), and meta-analysis is therefore a popular alternative to a single GWAS. Meta-analysis works by combining data from several single GWAS in an effort to increase power and reduce false-positive associations (30). The main focus of many GWAS has been to identify common variants that are associated with complex diseases, and while many have successfully done so, normally only variants with a MAF greater than 1% - 5% are followed up. Rare and low-frequency variants typically have a MAF of < 1% and 1% - 5%, respectively, and are therefore often overlooked during GWAS meta-analyses (31), which is one of the main disadvantages of the approach. These omissions are usually due to poor genotyping quality and also act as a safety net against false-positive or other spurious findings.
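The MAF bands quoted above can be written as a small classifier. This is only an illustration of the definitions given in the text; the function name is hypothetical, and the treatment of the exact boundary values (1% and 5%) is an assumption.

```python
def classify_variant(maf):
    """Label a variant by minor allele frequency using the bands in the text."""
    if not 0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:
        return "low-frequency"
    return "common"


# Hypothetical MAF values.
for maf in (0.005, 0.03, 0.20):
    print(maf, classify_variant(maf))
```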
The underrepresentation of rare and low-frequency variants began to change with the help of projects such as the 1000 Genomes Project and the UK10K, along with the advent of next-generation sequencing (NGS) technologies. NGS has allowed for rapid whole-genome sequencing and RNA sequencing, as well as the analysis of epigenetic factors (32). There do, however, remain several limitations to NGS technologies, including cost. A single sequencing run of Illumina's Genome Analyser is capable of producing roughly 1.8Tb of data (33), and while these data have undoubtedly proved useful, there remain implications for storing such vast amounts of data.

Conclusion

In conclusion, 2 SNPs (rs1591173 and rs4522294) were identified that appeared to be associated with CKD. However, as previously stated, the replication dataset was not large enough to validate the initial findings. One solution would be to run the analysis again with a much larger replication dataset; ideally, replication datasets should be the same size as, if not larger than, the discovery dataset. Any significant associations would need to be strictly validated by follow-up studies before being reported. Finally, while there can be no doubt about the usefulness of GWAS in helping researchers gain insight into human variation and the role it plays in common complex diseases, GWAS-defined variants typically explain only a small proportion of trait heritability. More than 90 - 95% of the heritability of diseases is often left unexplained after extensive GWAS analysis (34). GWAS proved a useful tool in the early days, but a common view is that it should be replaced by NGS technologies, which are extending the reach of GWAS to ever lower ranges of MAFs (35).