Knome: A Model for Personal Genomics
by Greg Emmerich, UW Madison, M.S. Biotechnology Program, Early Drug Discovery Class. December 7th, 2012.
Predictive, preventative and personalized medicine is increasingly becoming a reality with recent advances in whole genome sequencing. Genome wide association studies yield impressive amounts of data, but due to differences in the populations tested and in analysis methodology, very few such studies are reproducible. Great strides are being made in academia and industry, and the more research and data are shared, the more robust future studies will be. Knome Inc. has a unique method for annotating and interpreting genomic data that aims to tackle some of these challenges.
Table of Contents:
- Genome wide association studies and their challenges
- Knome genome interpretation
Know thyself. This aphorism was inscribed on the Temple of Apollo at Delphi in Greece, a place honoring the Greek god of light, truth, medicine and healing. This simple statement illustrates a complex and challenging aspect of the human condition that remains relevant today. What started as a religious and philosophical pursuit has increasingly become the grounds for scientific insight.
The discovery that DNA represents the blueprints for all life on Earth has big implications for medicine. Each cell in our bodies has all the information to create a human being, but it is the regulation of gene expression that orchestrates the complex task of differentiating cells into tissues and organs to actually create a human being. However, things don’t always go according to plan. Sometimes there are errors in the regulation of cell growth and tumors form, or the molecular machinery for copying DNA makes a mistake and mutations arise, including single nucleotide polymorphisms (SNPs), that can result in disease. There are countless more examples of where biology can go wrong, but the underlying principle is that changes to the genetic code can result in changes to the observed phenotype. This is clearly demonstrated by sickle-cell anemia, which is caused by a single point mutation in the gene encoding the hemoglobin of red blood cells.
Typically it is not until symptoms are seen that treatment begins, but this approach may become secondary to more predictive and preventative measures made possible by advances in genomics and bioinformatics. The Human Genome Project ran from 1990 to 2003, and upon its completion many predicted a radical transformation in health care in which personalized treatments and therapies would overtake the “one pill cures all” mentality. The field of human genomic epidemiology sprang up, focused on how genomic variations influence human health. Some predictions have come true, as evidenced by the improved identification of women to receive early treatment for breast and ovarian cancer, and by genetic screening for tolerance of warfarin, a drug that prevents blood clots (Nelson et al., 2005; Higashi et al., 2002). The relationship between DNA sequence and phenotype is far from simple, so the “genomic revolution” may still be a ways off. However, there is still much promise for emerging technologies to improve the sequencing and analysis of patients’ DNA. The genome interpretation technology behind the company Knome Inc. will be investigated, but first, a scientific background will be given to cover some of the many nuances inherent in genomic bioinformatics.
Next generation sequencing is the first step towards personalized medicine. This technology has made it possible to sequence an entire human genome of approximately 3 billion nucleotides for $1000 in a single day (Life Tech’s Ion Proton, 2012), a task that took the Human Genome Project 13 years and $3.8 billion to complete. This drop in cost and time has vastly outpaced Moore’s Law. Genomic screening has become a more realistic option for medical care, but there are numerous ethical considerations for individual privacy that raise the question of whether whole genome sequencing could be opening Pandora’s box. Knowledge of genetic risk is of limited benefit unless subsequent treatment is available. The risk of Alzheimer’s disease can be estimated by APOE genotyping, but because there is no preventative treatment for Alzheimer’s, the only benefit of this knowledge comes from family planning or other lifestyle decisions. Furthermore, individuals who do not carry the APOE risk allele can also develop Alzheimer’s (Mayeux and Schupf, 1995). Current understanding of genetic risk is incomplete, which could give patients a false sense of security; conversely, a risk assessment that is not valid could cause patients much needless stress (the interested reader may refer to Clayton, 2003, for a more thorough discussion of the ethical considerations). Put simply, we can take control of our own health by understanding the biological processes that create us, so long as this information is treated properly.
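To put the cost comparison in rough numbers, the sketch below contrasts the drop between the two figures quoted above ($3.8 billion and $1000) with what a Moore's Law pace would give. The two-year halving period and the 2003 to 2012 window are assumptions chosen for illustration, and the $3.8 billion figure covers the whole 13-year project rather than a single genome, so this is an order-of-magnitude comparison only.

```python
# Rough comparison of the sequencing cost drop with a Moore's Law pace.
# Assumptions (not from the text): cost halves every 2 years under
# Moore's Law, and the comparison window runs from 2003 to 2012.
HGP_COST_USD = 3.8e9        # figure quoted in the text
COST_2012_USD = 1_000       # figure quoted in the text
YEARS = 2012 - 2003
HALVING_PERIOD_YEARS = 2.0  # assumed


def moores_law_fold_drop(years, halving_period):
    """Fold reduction expected if cost halves once per halving period."""
    return 2 ** (years / halving_period)


actual_fold_drop = HGP_COST_USD / COST_2012_USD
expected_fold_drop = moores_law_fold_drop(YEARS, HALVING_PERIOD_YEARS)

print(f"Actual drop: {actual_fold_drop:,.0f}-fold")
print(f"Moore's Law pace over {YEARS} years: {expected_fold_drop:.1f}-fold")
```

Under these assumptions the quoted cost drop is several orders of magnitude larger than a halving every two years would produce.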
The creation of biobanks and the completion of large collaborative research projects have laid the foundation for genome wide association studies (GWAS). The high cost and difficulty of obtaining enough patient DNA samples were a large barrier for GWAS, because data from thousands of patients and controls is needed to establish a true positive risk factor with statistical significance. For each patient, hundreds of thousands of DNA variations are tested. Biobanks serve as large repositories for biological specimens, often collected from patients at hospitals worldwide, and are a key resource for researchers. The International HapMap Project identified the majority of the common SNPs in the human genome in order to create a map describing patterns of genetic variation. The data from this project, along with genetic tools for researchers, have been made freely available. Additionally, the Human Genome Epidemiology Network (HuGENet) was created as an online, open access database chronicling all epidemiological studies of human genes published since October 1st, 2000, as well as the data from those studies.
A commonly used approach to ease the burden of the large number of tests required for gene-risk association is a tiered study design. An initial screening detects a subset of SNPs as significant, and this subset serves as a launching pad for deeper investigation. This subset is then genotyped, yielding a smaller subset of significant SNPs. Repeating this process helps to weed out false positive associations (Manolio, 2010). When the initial subset of SNPs is large, the risk of false negatives is reduced. To confirm that an association is a true positive, the most reliable method remains independent replication of the results, especially in a different population of subjects. The predictive value of cumulative genotypic scores will increase as sample sizes increase and more risk variants are identified (Khoury et al., 2007).
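The tiered design described above can be sketched as a two-stage filter. The SNP identifiers, p-values, and cutoffs below are made up for illustration, and a real study would genotype a fresh sample at each stage rather than read values from a dictionary.

```python
# Two-stage (tiered) screen: a lenient first pass, then a strict second
# pass on the survivors only. All ids, p-values and cutoffs are hypothetical.

def tiered_screen(stage1_p, stage2_p, stage1_cutoff=0.05, stage2_cutoff=0.001):
    """Return (candidates, confirmed) SNP lists from a two-stage screen."""
    # Stage 1: a lenient cutoff keeps the candidate set large, which lowers
    # the chance of discarding a true association (a false negative).
    candidates = [snp for snp, p in stage1_p.items() if p < stage1_cutoff]
    # Stage 2: only the candidates are re-tested, so far fewer tests are run.
    confirmed = [snp for snp in candidates
                 if stage2_p.get(snp, 1.0) < stage2_cutoff]
    return candidates, confirmed


stage1 = {"snp_a": 0.001, "snp_b": 0.03, "snp_c": 0.40, "snp_d": 0.01}
stage2 = {"snp_a": 0.0002, "snp_b": 0.25, "snp_d": 0.0005}

candidates, confirmed = tiered_screen(stage1, stage2)
print(candidates)  # ['snp_a', 'snp_b', 'snp_d']
print(confirmed)   # ['snp_a', 'snp_d']
```

Here snp_b passes the first screen but fails the second, which is the kind of false positive the repeated stages are meant to remove.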
Traditional, hypothesis-driven studies of one or two candidate genes (genetic linkage studies) are being replaced with “hypothesis-free” (or non-candidate-driven) genome-wide association studies. The freedom allowed with this method gives it the power to overcome uncertainty about disease pathophysiology and to detect genes previously thought unrelated to the phenotype (Kitsios and Zintzaras, 2009). Next gen sequencing has made it feasible to conduct genome-wide association studies that look at many individuals with a disease to see which alleles (two or more forms of a gene) they share that healthy individuals lack. This is made easier by studying blocks of alleles at adjacent locations that tend to be inherited together, called haplotypes. Alleles that frequently differ between the populations are then given a risk association for the particular disease the afflicted individuals have. When combined with epidemiological data, it becomes possible to look at gene-gene and gene-environment interactions that predispose individuals to certain diseases. These studies have provided insights into age-related macular degeneration, myocardial infarction, abnormal cardiac repolarization intervals, and type 2 diabetes (Christensen and Murray, 2007; Wheeler and Barroso, 2011).
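One common way to express how strongly an allele differs between affected and healthy groups is an allelic odds ratio. The text does not name a specific statistic, so the odds ratio here is an assumption, and the counts are invented for illustration.

```python
def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds of carrying the risk allele in cases divided by the odds in controls."""
    return (case_risk * control_other) / (case_other * control_risk)


# Hypothetical allele counts (two alleles per person, 1000 people per group).
odds_ratio = allelic_odds_ratio(case_risk=600, case_other=1400,
                                control_risk=400, control_other=1600)
print(round(odds_ratio, 2))  # 1.71
```

A value above 1 means the allele is more common among affected individuals. It says nothing by itself about causation, which is the caution raised later in the discussion of replication.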
Genome-wide association studies require translational research in order to be clinically valid. The pathway for diagnostics is well established through the four phases of clinical trials, but genetic tests do not have as clear a pathway. A problem with GWAS translational research is that the results from these studies are notoriously difficult to replicate. In a 2002 review of 600 genome association studies, only 166 of the studies were repeated three or more times, and only 6 of these were replicated consistently, a historical success rate of around 3.6% (Hirschhorn et al., 2002).
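The 3.6% figure follows directly from the counts quoted above, as this short calculation shows. It uses only the numbers given in the text.

```python
# Figures quoted in the text (Hirschhorn et al., 2002).
studies_reviewed = 600
repeated_three_or_more = 166
consistently_replicated = 6

share_repeated = repeated_three_or_more / studies_reviewed
replication_rate = consistently_replicated / repeated_three_or_more

print(f"{share_repeated:.1%} of reviewed studies were repeated 3+ times")  # 27.7%
print(f"{replication_rate:.1%} of those replicated consistently")          # 3.6%
```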
There are numerous reasons why replication of results remains difficult in genomic epidemiology research. Very few diseases are caused by a deficiency of a single gene. The identification of a common gene correlated with a particular disease does not imply causation: there could be many other molecular players in the physiological pathway that are controlled by different genes, and there are likely several regulatory elements controlled by other genes. Caution must also be taken to account for interference between multiple alleles, since one allele could be inhibiting a regulatory pathway while another allele is stimulating it, as is the case in the beta-2 adrenoceptor gene (B2AR) (Liggett, 1997). Epigenetic factors influencing gene activation, like DNA methylation, also come into play and add another layer of complication. The copy number of a gene or genes can vary between individuals (up to a few hundred thousand base pairs in duplication or deletion), which could result in different levels of expression and susceptibility to environmental stimuli. Furthermore, it is difficult to pick out a specific common mutation (haplotype) when there is much variation among alleles, and thus it can be very difficult to investigate some diseases by genome-wide association studies (Khoury et al., 2007). These are significant challenges to overcome, and so far most of the genetic risk factors identified are likely to contribute only moderate risk, in contrast to single-gene diseases, where the correlation between genotype and disease is very high. Improving the estimation of disease risk from genotype requires many levels of research collaboration.
Large research consortia of individual investigators, universities, hospitals, nonprofit and for-profit organizations, and cohorts of study participants have been created to conduct genome wide association studies more efficiently. The use of meta-analyses can help find more statistically significant results and standardize methodologies across numerous studies. These collaborations have identified an impressive number of genetic associations with complex traits. Disease conditions previously thought unrelated, like coronary disease, type 2 diabetes and invasive melanoma, have been shown to share common loci. Surprisingly, the majority of SNPs identified did not lie in protein coding gene regions (Manolio, 2010). The extent to which noncoding intron and intergenic DNA regions play a role in disease phenotypes, whether through regulatory functions in gene expression or through other functions, is largely unknown (Hardy and Singleton, 2009). This identification step of translational research is very helpful, but much more work is needed to determine the function of the identified genes in disease pathophysiology, and the applicable regulatory mechanisms, so that the drug development process can begin.
Large research consortia present other challenges beyond the scientific. The extent of data sharing, practices for publication and authorship, and ownership of intellectual property remain unclear. These matters are vitally important to settle within scientific collaborations if genome wide association studies are to continue to have a vibrant, prosperous future. In a 2012 study, only 25% of the 55 identified GWAS collaborations had publicly accessible research guidelines, and these policies varied considerably. Increased availability of guidelines would improve implementation of appropriate research standards in new forms of large scale collaboration (Austin et al., 2012).
The incredible amount of data generated from next gen sequencing presents challenges for the processing and interpretation of that data. Genomic sequences must be interpreted accurately and consistently before whole genome sequencing can become commonplace in individual healthcare. Many of the difficulties in achieving this were discussed in the previous section. Nevertheless, numerous companies have already sprung up around personalized genomics. This has drawn criticism from scientists and skeptics, who argue that the results carry more uncertainty than the general public may appreciate; individuals may take the results as absolutes, which could harm the future of the industry if those individuals come to distrust the predictions they get from whole genome sequencing. Special cases have proven quite fruitful, but it may be too early for whole genome sequencing to be broadly integrated into health care plans for individuals (NOVA, 2012). Industry has made some considerable advances in the interpretation of genomic sequences, and the company Knome Inc. will be investigated as an example. An overview of Knome’s core technology will be given, followed by how Knome disseminates derived genomic information to individuals.
Knome offers several options for genome processing to a wide array of consumers. These services include whole genome sequencing, informatics processing and interpretation of genomic data, and more specialized genomic cancer screening. The core technology behind Knome is kGAP, their bioinformatics software that processes and interprets digital genomic data. Knome currently has two patent applications worldwide that specifically cover this technology (Pearson and D’Aco, 2012; Conde, 2012). The kGAP genomics engine standardizes, annotates, and distills genomic data and compares it to reference data. Knome’s genome interpretation software can identify short substitutions, insertions, and deletions, as well as segmental copy number variants, inversions, translocations, and other long structural variants.
Genotypes at multiple SNPs are often combined into scores, calculated according to the number of risk alleles carried, in order to generate a quantitative risk factor for multiple diseases. This is usually done by encoding variants with an IUPAC code (A, T, C, G) or with an arbitrary value for a match (“a”) or mismatch (“b”) to a reference sequence. These methods are not very efficient for quickly finding sites on individuals’ genomes that have similar levels of variation compared to other genomes. There is also no distance metric between areas of variation. Knome’s solution is an improved annotation method that applies a numerical value to each genotype segment, together with a vector summary of the numerical values for that segment across a larger set of genomes, in order to obtain a pattern of genotypic variation. This pattern can be compared between healthy and disease-affected populations in order to determine specific gene risk factors for that disease. This process is explained in more detail in the following paragraphs.
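The conventional score mentioned at the start of this paragraph, a count of risk alleles carried, can be sketched as follows. The SNP identifiers and risk alleles are hypothetical, and the score is unweighted, which is the simplest possible form rather than any particular published method.

```python
# Unweighted risk-allele count score. SNP ids and risk alleles are hypothetical.
RISK_ALLELES = {"rs_example_1": "A", "rs_example_2": "T", "rs_example_3": "G"}


def risk_allele_score(genotypes, risk_alleles=RISK_ALLELES):
    """Sum the number of risk alleles (0, 1 or 2) carried at each SNP."""
    score = 0
    for snp, risk in risk_alleles.items():
        alleles = genotypes.get(snp, "")  # SNPs with no data contribute nothing
        score += alleles.count(risk)
    return score


person = {"rs_example_1": "AA", "rs_example_2": "CT", "rs_example_3": "CC"}
print(risk_allele_score(person))  # 3
```

The base-letter encoding works for a single score, but it gives no compact way to compare the same site across many genomes, which is the gap the numerical annotation is meant to fill.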
The first step in genomic indexing compares the nucleic acid base sequence from the subject genome(s) to that of the reference genome to determine whether there is a match, a mismatch, or a no-call. A match of the subject to the reference is assigned a numerical value of “0”, a mismatch is assigned a value of “1”, and a no-call is assigned a null value of “_”. This is done for each chromosome copy, so the highest possible value is “2”, for a genotype in which both copies differ from the reference (for example, a homozygous recessive allele). This value could be assigned to a segment of DNA as short as an individual SNP or to a segment multiple nucleotides in length (as in the case of a translocation, deletion, or insertion); the segment is hereafter referred to as a genotype. The assigned numerical values for all the genotypes in an individual’s genome are recorded in a delimited fashion (separated by commas, for example), with one total numerical value per genotype. This process is repeated for each genome in the entire set of genomes, creating a vector of total numerical values for each genotype. For a set of three genomes that are homozygous dominant, heterozygous, and homozygous recessive for a particular genotype, and a reference that carries the dominant allele, this vector would look like [0,1,2].
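The encoding described above can be sketched in a few lines. This is an illustration of the idea as summarized here, not Knome's implementation: the function names are hypothetical, a no-call is represented by None rather than “_”, and treating a no-call on either copy as a no-call for the whole genotype is an assumption.

```python
from typing import List, Optional


def encode_genotype(copies: List[Optional[str]], reference: str) -> Optional[int]:
    """Count the chromosome copies that differ from the reference (0, 1 or 2).

    A copy of None stands for a no-call. Assumption: a no-call on either
    copy makes the whole genotype a no-call (None).
    """
    if any(copy is None for copy in copies):
        return None
    return sum(1 for copy in copies if copy != reference)


def genotype_vector(genomes, reference):
    """Encode the same genotype in every genome of a set."""
    return [encode_genotype(copies, reference) for copies in genomes]


# Three genomes at one site whose reference base is "A":
# both copies match, one copy differs, both copies differ.
genomes = [["A", "A"], ["A", "G"], ["G", "G"]]
print(genotype_vector(genomes, "A"))      # [0, 1, 2]
print(encode_genotype(["A", None], "A"))  # None
```

Because each genotype reduces to a small integer, the vector for a site can be compared across genomes with ordinary list operations.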
It is then possible to create a pattern of genotype match/mismatch across the population based upon the sum of those vectors. For example, in the case of an unknown recessive disease, it is possible to sequence both parents’ genomes and the child’s genome and then search through the annotated genomes for a genotype whose match/mismatch values follow the pattern [1,1,2], that is, one mismatched copy in each parent and two in the child. While all matching patterns are potentially a risk source, the hits from this search are further filtered against the patterns of healthy individuals to eliminate the majority of the recessive genotypes that do not lead to this particular disease. This serves as a launching pad for further research into the role of the candidate gene or genes in the disease phenotype.
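The trio search described above can be sketched as a pattern match followed by a filter. The data, genotype identifiers, and function name are hypothetical, and the filter rule (discard a genotype if any healthy individual carries two mismatched copies) is one simple reading of "filtered against the patterns of healthy individuals", not a documented rule.

```python
def find_candidates(trio_vectors, healthy_vectors, pattern=(1, 1, 2)):
    """Return genotype ids whose (parent, parent, child) vector matches the
    pattern and which no healthy individual carries in two mismatched copies."""
    hits = []
    for genotype_id, vector in trio_vectors.items():
        if tuple(vector) != pattern:
            continue
        # Assumed filter: a genotype seen with value 2 in a healthy person
        # is unlikely to cause this recessive disease.
        if 2 in healthy_vectors.get(genotype_id, []):
            continue
        hits.append(genotype_id)
    return hits


# Hypothetical encoded data: genotype id -> [parent, parent, child].
trio = {"g1": [1, 1, 2], "g2": [0, 1, 1], "g3": [1, 1, 2], "g4": [2, 1, 2]}
# The same genotype ids encoded in three healthy, unrelated individuals.
healthy = {"g1": [0, 1, 0], "g2": [0, 0, 1], "g3": [2, 0, 1], "g4": [0, 0, 0]}

print(find_candidates(trio, healthy))  # ['g1']
```

Both g1 and g3 match the [1,1,2] pattern, but g3 is dropped because a healthy individual also carries two mismatched copies.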
Several factors can improve the results from genome pattern analyses. The larger the sample of healthy and affected individuals, the higher the resolution and the confidence that the identified genotype relates to the phenotype. It also follows that healthy individuals who match this genotype may be at a higher risk of developing the disease in the future. As noted earlier, results can be filtered based on previous patient data and on information from publications about the gene variant and associated risk factors. Results can also be filtered to determine whether the variant sequence directly encodes a protein or other functional molecule, or whether it is involved in gene regulation. The match/mismatch pattern can even be generated by comparing genomic data from two different tissue samples from the same individual, such as cancerous and healthy tissue.
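The two-tissue comparison mentioned at the end of this paragraph reduces to finding genotypes whose encoded values differ between the samples. The sketch below assumes each sample is already encoded as a mapping from a hypothetical genotype id to 0, 1, 2, or None for a no-call.

```python
def differing_genotypes(sample_a, sample_b):
    """Return ids of genotypes whose encoded values differ between two samples.

    Genotypes that are a no-call (None) in either sample are skipped.
    """
    shared = sample_a.keys() & sample_b.keys()
    return sorted(
        gid for gid in shared
        if sample_a[gid] is not None
        and sample_b[gid] is not None
        and sample_a[gid] != sample_b[gid]
    )


healthy_tissue = {"g1": 0, "g2": 1, "g3": 1, "g4": None}
tumor_tissue = {"g1": 0, "g2": 2, "g3": 1, "g4": 1}
print(differing_genotypes(healthy_tissue, tumor_tissue))  # ['g2']
```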
Displaying the results from whole genome sequencing and analysis in a clear, simple fashion is essential to make these services useful. Individuals who have their genome sequenced and then interpreted by Knome receive their results in the context of current scientific understanding. That is, when Knome identifies a certain gene as a risk factor for a particular disease, they index those results with information from public databases containing genetic information about the risk allele and consumer health information about the disease or condition. Science is an ever-growing body of knowledge in which new evidence can overturn previously held beliefs, so the identification of a particular gene as associated with a disease is subject to future findings. A platform for interpreting and displaying genomic data must be flexible enough to adapt to emerging knowledge. Knome uses a cloud-based reference database and is thus able to update their genomic interpretations for individuals more quickly.
The era of personalized genomics has arrived and shows great promise of becoming a key tool for individual health care in the future. The task of interpreting genomic data is quite challenging and very sensitive to the parameters used and the population selected. Knome Inc. has developed a whole genome interpretation service that improves the speed of analysis and the ability to filter results to better map genotype to phenotype. Knome is standing proof that science is starting to revolutionize how we come to know ourselves.
Austin, Melissa, Marilyn Hair, and Stephanie Fullerton. (2012). Research Guidelines in the Era of Large-scale Collaborations: An Analysis of Genome-wide Association Study Consortia. Am J Epidemiol 175(9).
Bracken, Michael B. (2005). Genomic epidemiology of complex disease: The need for an electronic evidence-based approach to research synthesis. Am J Epidemiol 162(4):297-301.
Burke, Wylie et al. (2006). The path from genome-based research to population health: Development of an international public health genomics network. Genet Med 8(7):451-458.
Christensen, Kaare and Jeffrey Murray. (2007). What genome-wide association studies can do for medicine. N Engl J Med 356(11):1094-1097.
Clayton, Ellen W. (2003). Ethical, Legal, and Social Implications of Genomic Medicine. N Engl J Med 349:562-569.
Conde, Jorge. (2012). Personal Genome Indexer. Knome Inc. Patent application WO 2012/030967.
Hardy, J. and A. Singleton. (2009). Genomewide association studies and human disease. N Engl J Med 360:1759-68.
Higashi, M.K., et al. (2002). Association between CYP2C9 genetic variants and anticoagulation-related outcomes during warfarin therapy. JAMA 287:1690-1698.
Hirschhorn, Joel N. et al. (2002). A comprehensive review of genetic association studies. Genet Med 4(2):45-61.
Hoover, R.N. (2007). The evolution of epidemiologic research: from cottage industry to “big” science. Epidemiology 18:13-7.
Khoury, Muin J. et al. (2007). The continuum of translation research in genomic medicine: how can we accelerate the appropriate integration of human genome discoveries into health care and disease prevention? Genet Med 9(10):665-674.
Kitsios, Georgios and Elias Zintzaras. (2009). Genome-wide association studies: hypothesis-free or engaged? Transl Res 154(4):161-164.
Liggett, S.B. (1997). Polymorphisms of the B2-adrenergic receptor. Am J Respir Crit Care Med 156:S156-62.
Manolio, Teri. (2010). Genomewide Association Studies and Assessment of the Risk of Disease. N Engl J Med 363:166-176.
Mayeux, R. and N. Schupf. (1995). Apolipoprotein E and Alzheimer’s disease: the implications of progress in molecular medicine. Am J Public Health 85:1280-1284.
Nelson, H.D., et al. (2005). Genetic Risk Assessment and BRCA Mutation Testing for Breast and Ovarian Cancer Susceptibility: Systematic Evidence Review for the U.S. Preventive Services Task Force. Ann Intern Med 143:362-379.
NOVA. (2012). Cracking your genetic code. Retrieved from http://www.pbs.org/wgbh/nova/body/cracking-your-genetic-code.html
Pearson, Nathaniel and Katherine D’Aco. (2012). Methods and Apparatus for Assigning a Meaningful Numeric Value to Genomic Variants, and Searching and Assessing Same. Knome Inc. Patent application WO 2012/100216.
Wang, William et al. (2005). Genome-wide association studies: theoretical and practical concerns. Nature Reviews 6:109-118.
Wheeler, Eleanor and Inês Barroso. (2011). Genome-wide association studies and type 2 diabetes. Briefings in Functional Genomics 10(2):52-60.