Statistical Genetic and Epidemiological Analyses for Complex Diseases

Over the last ten years, there has been an explosion in the ability to rapidly generate massive amounts of genomic data in order to study biological risk factors associated with human diseases. The number and size of these datasets have led to several new challenges in how to manage and process the data in a time-efficient manner, as well as how to answer important questions surrounding the development of complex disease. Over the last year, we worked on several projects that fall into three main categories:

1) Integration of high-resolution endophenotype data and next-generation whole-genome sequencing data. High-resolution endophenotypes are intermediate biological risk factors that are known to be associated with a disease. By studying these intermediate phenotypes, we feel we may have more opportunity to study the complexity of a disease. We are currently involved in two projects: one investigating over 600 lipidomic species and their relationship to cardiovascular disease using whole-genome sequence data in the Busselton Health Study, and one investigating instrumental endophenotypes known to be associated with the development of schizophrenia in the Western Australia Family Study of Schizophrenia.

2) Genome-wide and epigenome-wide DNA methylation studies in the developmental origins of health and disease. We are involved in several studies that utilize large-scale genome and DNA methylation array data to investigate the biological and environmental origins of human disease. The developmental origins of health and disease hypothesis postulates that an individual's early-life environment has a significant impact on that individual's risk of developing disease in later life. Through our association with the Raine Study and several international collaborations, we are investigating the impacts of both early-life environment and genetic and epigenetic risk in adolescence and into early adulthood. Through several meta-analyses in conjunction with international consortia, we have generated several key findings around neonatal birth weight, childhood asthma, and gestational age.

3) Machine learning approaches to disease risk. The massive amount of human genome data being produced requires efficient statistical approaches to better define risk prediction. Machine learning allows large data sets to be parsed into predictive patterns that can potentially allow earlier disease identification within an individual and lead to improved health outcomes.

Principal investigator

Phillip E. Melton
Area of science

Biological Science, Biological Sciences, Geosciences, Medical And Health Sciences, Statistics

Systems used

Magnus, Zeus and Nimbus, Topaz and Managed Storage

Applications used

R, plink, bwa, gatk, samtools, gemma, snpTools, Annovar, python, shifter, perl, tophat
Partner Institution: Curtin University| Project Code: pawsey0134

The Challenge

The aim of our endophenotype and genome-wide association approaches is to identify specific variants within the human genome that are associated with a specific endophenotype, trait, or disease. The human genome contains three billion letters; we need to sequence each person's genome many times over to ensure accuracy, and we need several thousand individuals to identify variation and reach the sample size required for adequate statistical power. This massive amount of raw data needs to be aligned against the human reference genome, and variants then need to be called and annotated, to produce an adequate data set for downstream analysis. Facilitating each of these steps through a pipeline requires a supercomputer with large storage.
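As a rough illustration of the scale described above, the sketch below multiplies the three-billion-letter genome by a sequencing depth and a cohort size. The 30x depth, the cohort of 2,000 people, and the one-byte-per-base storage figure are illustrative assumptions, not values reported for this project.

```python
# Approximate human genome size in base pairs (the "three billion letters" in the text).
GENOME_LENGTH_BP = 3_000_000_000


def sequenced_bases(coverage: int, n_individuals: int) -> int:
    """Total bases sequenced for a cohort at a given mean depth of coverage."""
    return GENOME_LENGTH_BP * coverage * n_individuals


def approx_terabytes(total_bases: int, bytes_per_base: float = 1.0) -> float:
    """Very rough raw data volume; bytes_per_base is an assumed storage cost."""
    return total_bases * bytes_per_base / 1e12


# Assumed values for illustration only: 30x coverage, 2,000 individuals.
total = sequenced_bases(coverage=30, n_individuals=2000)
print(f"{total:.2e} bases, roughly {approx_terabytes(total):.0f} TB before compression")
```

Real file sizes depend on compression and on the quality scores stored with each base, so the figure is only an order-of-magnitude guide.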

Machine learning presents several different challenges, including the need to process large data sets in different formats that require strict quality control and may still require independent, computationally intensive models for accurate processing. These methods often require GPUs to obtain accurate results in a time-efficient way.

The Solution

To assemble our whole-genome sequence data we have developed an efficient bioinformatic pipeline that takes advantage of the ability to parallelise different parts of the analysis. First, we divide an individual's genome into chunks and align those separately. We then bring the chunks back together and run quality control metrics across each genome, before calling and annotating the genome. Using Magnus, this allows us to obtain quality data in a short amount of time. We are then able to take a phenome-wide approach and analyse several phenotypes and genomes simultaneously to obtain our results. Accomplishing this in a short time frame requires several computers and large data storage.
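The divide-and-recombine step described above can be sketched as follows. This is a minimal illustration that assumes the chunks are fixed-size genomic intervals; the text does not say how chunks are defined, and the names `make_chunks` and `covers_exactly`, the contig names, and the lengths are all hypothetical.

```python
from typing import Dict, List, Tuple

# (contig name, 1-based start, inclusive end)
Region = Tuple[str, int, int]


def make_chunks(contig_lengths: Dict[str, int], chunk_size: int) -> List[Region]:
    """Scatter step: split each contig into consecutive fixed-size intervals."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    regions: List[Region] = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            regions.append((contig, start, end))
    return regions


def covers_exactly(regions: List[Region], contig_lengths: Dict[str, int]) -> bool:
    """Gather-side check: chunks (in order) tile every contig with no gaps or overlaps."""
    expected_next = {contig: 1 for contig in contig_lengths}
    for contig, start, end in regions:
        if start != expected_next[contig]:
            return False
        expected_next[contig] = end + 1
    return all(expected_next[c] == n + 1 for c, n in contig_lengths.items())


# Toy contigs, not real chromosome lengths.
toy_contigs = {"contigA": 250, "contigB": 100}
chunks = make_chunks(toy_contigs, chunk_size=100)
print(chunks)
print("complete coverage:", covers_exactly(chunks, toy_contigs))
```

Each interval could then be handed to a separate process or node, and the coverage check run before the per-chunk results are merged.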

For association studies we use several different analytical approaches to test for association between traits and genetic or epigenetic risk loci. Here we spread the work across multiple processors, so that analyses that would take days to weeks on a normal computer can be done within an hour.
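Association scans parallelise well because each locus is tested independently of the others. The sketch below assumes a simple per-variant linear regression of a trait on genotype dosage and uses a thread pool as a stand-in for the separate processors of a cluster; it is not the project's actual routine, and the function names and toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def regression_slope(genotypes: Sequence[float], trait: Sequence[float]) -> float:
    """Least-squares slope of trait on genotype dosage (0, 1 or 2 copies of an allele)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("variant has no genotype variation")
    return sxy / sxx


def scan_variants(
    variants: Dict[str, List[int]], trait: Sequence[float], max_workers: int = 4
) -> Dict[str, float]:
    """Test every variant independently, distributing the tests over a worker pool."""
    names = list(variants)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        slopes = pool.map(lambda name: regression_slope(variants[name], trait), names)
        return dict(zip(names, slopes))


# Toy data for six individuals; values are invented for illustration.
toy_trait = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
toy_variants = {
    "variant1": [0, 1, 2, 0, 1, 2],
    "variant2": [0, 0, 1, 1, 2, 2],
}
results = scan_variants(toy_variants, toy_trait)
print(results)
```

On a supercomputer the same pattern would normally run as separate processes or jobs, one per block of variants, since Python threads do not speed up CPU-bound arithmetic.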

Accurate machine learning requires large amounts of both memory and storage, as well as the ability to run several different models for benchmarking. This makes GPUs the most efficient strategy.

The Outcome

The data storage provided by Pawsey ensures that the data is stored securely and close to the processing site, so we don’t spend large amounts of time transferring data. As we are often working with several TBs of data, transfers could potentially take days, so this allows for more time-efficient analysis.

For assembly and calling of human whole-genome sequences we utilize Magnus, as this allows us to maximize the number of processes we run in parallel. This requires large numbers of nodes but not large amounts of computational time. We are also able to do job-packing, which reduces the analyst time spent repeatedly submitting the same scripts.

For our association analyses we utilise specific analytical routines that we or our collaborators have developed to rapidly analyse these data. This often requires several processors, but can be done quickly on only a few, so we use the workq on Zeus, where jobs can run on any open processor without using all the memory on a single node.

Machine learning requires large amounts of storage, which can be accommodated on the scratch directory at Pawsey, and GPUs, for which we currently use Nimbus but may need to move to Topaz if we need more processing capacity. This allows us to run these analyses across several models within days.

List of Publications

    1) Parmar P, Lowry E, Cugliari G, Suderman M, Wilson R, Karhunen V, Andrew T, Wiklund P, Wielscher M, Guarrera S, Teumer A, Lehne B, Milani L, de Klein N, Mishra PP, Melton PE, Mandaviya PR, Kasela S, Nano J, Zhang W, Zhang Y, Uitterlinden AG, Peters A, Schöttker B, Gieger C, Anderson D, Boomsma DI, Grabe HJ, Veldink JH, Meurs JB, van den Berg L, Beilin LJ, Franke L, Loh M, van Greevenbroek MMJ, Nauck M, Kähönen M, Hurme MA, Raitakari OT, Franco OH, Slagboom PE, van der Harst P, Kunje S, Felix SB, Zhang T, Chen W, Mori TA, Bonnefond A, Heijmans BT, Muka T, Kooner JS, Fischer K, Waldenberg M, Froguel P, Huang RC, Lehtimäki T, Rathmann W, Relton CL, Matullo G, Brenner H, Verweij N, Li S, Chambers JC, Järvelin MJ. 2018. Meta-analysis of maternal prenatal smoking GFI1-locus and cardio-metabolic phenotypes in adults. EBioMedicine. 38:206-216.

    2) Reese SE, Xu CJ, den Dekker HT, Lee MK, Ruiz-Arenas C, Merid SK, Rezwan FI, Page CM, Ullemar V, Melton PE, Oh SS, Yang IV, Burrows K, Söderhäll C, Jima DD, Gao L, Arathimos R, Küpers LK, Wielscher M, Rzehak P, Lahti J, Laprise C, Madore AM, Sikdar S, Ward J, Bennett BD, Wang T, Bell DA, The BIOS Consortium, Håberg SE, Zhao S, Karlsson R, Hollams E, Hu D, Richards AJ, Bergström A, Sharp GC, Felix JF, Bustamante M, Gruzieva O, Maguire RL, Gilliland F, Baïz N, Nohr EA, Corpeleijn E, Sebert S, Karmaus W, Grote V, Kajantie E, Magnus MC, Örtqvist AK, Eng C, Liu AH, Kull I, Jaddoe VWV, Sunyer J, Kere J, Annesi-Maesano CH, Arshad SH, Koletzko B, Brunekreef B, Binder EB, Räikkönen K, Reischl E, Holloway JW, Jarvelin MR, Snieder H, Kazmi N, Breton CV, Murphy SK, Pershagen G, Anto JM, Relton CL, Schwartz DA, Burchard EG, Huang RC, Nystad W, Almqvist C, Henderson AJ, Melén E, Duijts L, Koppelman GH, London SJ. 2019. Epigenome-wide consortium meta-analysis of DNA methylation and childhood asthma. J Allergy Clin Immunol. 143(6):2062-2074.

    3) Wallace H, Cadby G, Melton PE, Fear M, Falder S, Crowe M, Martin L, Marlow K, Wood F. 2019. Genetic influence on scar outcome after burn injury: genome-wide association study and pathway analysis. Burns. May;45(3):567-578.

    4) Melton PE, Johnson MP, Gokhale-Agashe D, Rea A, Ariff A, Peralta JP, McNab T, Allcock RA, Abraham L, Blangero J, Brennecke SP, Moses EK. 2019. Whole exome sequencing identifies novel candidate genes in multiplex preeclampsia families. Journal of Hypertension May;37(5):997-1011.

    5) Huang RC, Beilin LJ, Lillycrop KA, Godfrey KM, Anderson DA, Mori TA, Burdge GC, Oddy WH, Pennell CE, Holbrook JD, Melton PE. 2019. Epigenetic Age Acceleration in Adolescence Associates with BMI, Inflammation, and Risk Score for Middle Age Cardiovascular Disease. The Journal of Clinical Endocrinology and Metabolism. 104(7):3012-3024.

    6) Ariff A, Melton PE, Brennecke SP, Moses EK. 2019. Analysis of the epigenome in multiplex preeclampsia identifies SORD, DGKI and ICA1 as novel candidate risk genes. Frontiers in Genetics 19;10:227

    7) Küpers LK, Monnereau C, Sharp GC, Yousefi P, Salas LA, Ghantous A, Page CM, Reese SE, Wilcox AJ, Czamara D, Starling AP, Novoloaca A, Lent S, Roy R, Hoyo C, Breton CV, Allard C, Just AC, Bakulski KM, Holloway JW, Everson TM, Xu CJ, Huang RC, van der Plaat DA, Wielscher M, Merid SK, Ullemar V, Rezwan FI, Lahti J, van Dongen J, Langie SAS, Richardson TG, Magnus MC, Nohr EA, Xu Z, Duijts L, Zhao S, Zhang W, Plusquin M, DeMeo DL, Solomon O, Heimovaara JH, Jima DD, Gao L, Bustamante M, Perron P, Wright RO, Hertz-Picciotto I, Zhang H, Karagas MR, Gehring U, Marsit CJ, Beilin LJ, Vonk JM, Jarvelin MR, Bergström A, Örtqvist AK, Ewart S, Villa PM, Moore SE, Willemsen G, Standaert ARL, Håberg SE, Sørensen TIA, Taylor JA, Räikkönen K, Yang IV, Kechris K, Nawrot TS, Silver MJ, Gong YY, Richiardi L, Kogevinas M, Litonjua AA, Eskenazi B, Huen K, Mbarek H, Maguire RL, Dwyer T, Vrijheid M, Bouchard L, Baccarelli AA, Croen LA, Karmaus W, Anderson D, de Vries M, Sebert S, Kere J, Karlsson R, Arshad SY, Hämäläinen E, Routledge MN, Boomsma DI, Feinberg AP, Newschaffer CJ, Govarts E, Moisse M, Fallin MD, Melén E, Prentice AM, Kajantie E, Almqvist C, Oken E, Dabelea D, Boezen HM, Melton PE, Wright RJ, Koppelman GH, Trevisi L, Hivert MF, Sunyer J, Munthe-Kaas MC, Murphy SK, Corpeleijn E, Wiemels J, Holland N, Herceg Z, Binder EB, Smith GD, Jaddoe VWV, Lie RT, Nystad W, London SJ, Lawlor DA, Relton CL, Snieder H, Felix JF. 2019. A meta-analysis of epigenome-wide association studies in neonates reveals widespread differential methylation associated with birthweight. Nature Communications. 23;10(1):1893.

    8) Jones RM, Melton PE, Pinese M, Rea A, Ingley E, Ballinger ML, Wood DJ, Thomas DM, Moses EK. Identification of novel risk variants for sarcoma and other cancers by whole exome sequencing analysis in cancer cluster families. BMC Medical Genetics. 3;20(1):69.

    9) Rauschert S, Melton PE, Burdge G, Craig J, Godfrey K, Holbrook J, Lillycrop K, Mori TA, Beilin LJ, Oddy W, Pennell C, Huang RC. Maternal smoking during pregnancy induces persistent epigenetic changes into adolescence, associated with cardiovascular health. Frontiers in Genetics. 19: 10:770.

    10) Barton S, Melton PE, Titcombe P, Murray R, Huang RC, Holbrook J, Lillycrop K, Godfrey K. 2019. Including cell type adjustments in regression equations involving methylation data can lead to multicollinearity and subsequent reversal of direction of association of predictors of interest. Frontiers in Genetics. 19: 10:816.

    11) Warrington NM, Beaumont RN, Horikoshi M, et al. Maternal and fetal genetic effects on birth weight and their relevance to cardio-metabolic risk factors. Nat Genet. 2019;51(5):804-814.

    12) Justice AE, Karaderi T, Highland HM, et al. Protein-coding variants implicate novel genes related to lipid homeostasis contributing to body-fat distribution. Nat Genet. 2019;51(3):452-469.

    13) Spracklen CN, Karaderi T, Yaghootkar H, et al. Exome-Derived Adiponectin-Associated Variants Implicate Obesity and Lipid Biology [published correction appears in Am J Hum Genet. 2019 Sep 5;105(3):670-671]. Am J Hum Genet. 2019;105(1):15-28.

    14) Liu X, Helenius D, Skotte L, et al. Variants in the fetal genome near pro-inflammatory cytokine genes on 2q13 associate with gestational duration. Nat Commun. 2019;10(1):3927.

    15) Merino J, Dashti HS, Li SX, et al. Genome-wide meta-analysis of macronutrient intake of 91,114 European ancestry participants from the cohorts for heart and aging research in genomic epidemiology consortium. Mol Psychiatry.

    16) Bradfield JP, Vogelezang S, Felix JF, et al. A trans-ancestral meta-analysis of genome-wide association studies reveals loci associated with childhood obesity. Hum Mol Genet.

    17) Couto Alves A, De Silva NMG, Karhunen V, et al. GWAS on longitudinal growth traits reveals different genetic factors influencing infant, child, and adult BMI. Sci Adv. 2019;5(9):eaaw3095

Figure 1. Circos plot showing the Bonferroni-corrected results of the meta-analyses. Tracks 2–6: highlighted in red are CpGs that were not found in the 914 main meta-analysis hits (though note differences in sample size and hence statistical power for the analyses presented in the different tracks). From Küpers et al. (2019) A meta-analysis of epigenome-wide association studies in neonates reveals widespread differential methylation associated with birthweight. Nature Communications. 23;10(1):1893.