Analysis of complex genomes

The field of genomics has been revolutionised by advances in DNA sequencing technology. The resulting explosion in DNA sequence volume has created a challenge both to manage and interpret these data and to apply the technology to answer basic scientific questions about the organism under study. Last year we worked on several projects, mainly in three areas:

1. Pangenome construction: There is increasing awareness that a reference sequence representing the genome of a single individual cannot capture the full gene repertoire of a species. A pangenome is the whole gene repertoire of a study group of individuals and is an important source of genetic diversity for crop breeding. To include all genes of a species, the study group needs to be large, and constructing a pangenome from so many individuals has only become feasible with the advancement and low cost of next-generation sequencing (NGS) technology. Constructing the pangenome of a crop species provides important insights into its genomic composition and diversity, allowing breeders to breed better crops and improve food production.
2. Genome assembly and validation: A good reference genome is essential to answer important biological questions. Basic assemblies, which produce the sequence of all genes, promoters, and low-copy or unique regions, are relatively inexpensive and provide valuable biological insights, while more robust pseudomolecule assemblies have greater utility in identifying the gene variation underlying traits and in genomics-assisted breeding. We had projects to assemble different plant genomes, focusing mostly on crop genomes, including orphan crops such as yam bean, but also on Australian native plants such as seagrasses and Hakea.
3. Machine learning for crop improvement: Climate change is here and farmers need solutions for higher yield. We have collaborations around drone-based automated phenotyping of crops growing in Western Australian fields and around genomic prediction of crosses in crop breeding programs. The drone data are being used to predict crop yield in commercial varieties, and the genomic data are used to predict the disease resistance status of crop varieties.

Principal investigator

David Edwards dave.edwards@uwa.edu.au

Area of science

Bioinformatics

Systems used

Magnus, Zeus, Nimbus, Topaz and Managed Storage

Applications used

BLAST+, MaSuRCA, Samtools, Bowtie, SoapAligner, HISATs, Braker, Augustus, bamtools, RepeatMasker, jellyfish, interproscan, maker, orthomcl, trimmomatic, tophat, picard-tools, fastqc, cufflinks, bwa, scikit-learn, keras, pytorch, tensorflow, orthofinder, conda, bowtie2, Python, Perl, R, as well as custom scripts and software
Partner Institution: The University of Western Australia | Project Code: Pawsey0149

The Challenge

The aim of the pangenome projects is to construct a good pangenome reference that reflects the significant diversity within a species. Genomes are large: the wheat genome, for example, consists of 17 thousand million letters, 6 times larger than the human genome, and we need many copies to determine their order and variations. This massive quantity of data must be processed and analysed by aligning reads to a reference genome, assembling the unaligned reads, and annotating the assembled sequences against large gene databases. It is impossible to analyse such a massive amount of data without an HPC cluster running multiple jobs in parallel.
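To give a sense of scale, the raw data volume can be estimated from genome size, sequencing depth and the number of individuals. The sketch below uses the 17 thousand million letter wheat genome mentioned above; the 10x depth and the 100 individuals are illustrative assumptions, not figures from the project, and the function name is hypothetical.

```python
def sequenced_bases(genome_size_bp, depth, individuals):
    """Total bases sequenced: genome size x sequencing depth x individuals."""
    return genome_size_bp * depth * individuals


WHEAT_GENOME_BP = 17_000_000_000  # 17 thousand million letters (from the text)
ASSUMED_DEPTH = 10                # hypothetical sequencing depth per individual
ASSUMED_INDIVIDUALS = 100         # hypothetical size of the study group

total = sequenced_bases(WHEAT_GENOME_BP, ASSUMED_DEPTH, ASSUMED_INDIVIDUALS)
print(f"{total / 1e12:.0f} trillion bases to align, assemble and annotate")
```

Under these assumed numbers the study group alone yields 17 trillion bases, which illustrates why the work has to be split across many jobs running in parallel.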
Genome assembly faces different challenges. It involves several steps, mainly data quality control, data pre-processing, read assembly and annotation. Building a good reference relies on having enough data, all of which must be loaded into memory for de novo assembly. The main challenge of de novo assembly is therefore the need for large-memory computers.
Machine learning faces several challenges of its own: large quantities of data in hard-to-process formats such as binary image data, dirty data that needs strict quality control, and the need to test many different, computationally intensive models in order to arrive at the best-performing one, which requires powerful GPUs.
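The idea of testing several candidate models and keeping the best performer can be sketched with k-fold cross-validation. This is a toy illustration in plain Python, not the project's code: the two candidate models, the synthetic data and all function names are assumptions chosen to keep the example small, whereas the real work uses libraries such as scikit-learn, keras and pytorch on GPUs.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares straight line through the training data."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def cv_mse(fit, xs, ys, k=5):
    """Mean squared error on held-out folds, averaged over k folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in held_out))
    return mean(errors)


def select_model(candidates, xs, ys, k=5):
    """Score every candidate by cross-validation and return the best name."""
    scores = {name: cv_mse(fit, xs, ys, k) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Synthetic data (hypothetical): a trait that depends linearly on one feature.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

best, scores = select_model({"mean": fit_mean, "line": fit_line}, xs, ys)
print(best, scores)
```

Each candidate is trained k times, so the cost grows with both the number of models and the number of folds; with image-based deep learning models this is the step that makes GPU resources necessary.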

The Solution

For the construction of pangenomes, we have established an efficient pangenome construction pipeline and applied it to various species. Even with an efficient pipeline, a powerful, highly parallelised computer cluster is essential to analyse this massive amount of data within an acceptable timeframe.
The success of a genome assembly relies on good design and good resources, spanning project design, material preparation, data generation, data management and data analysis. It is important to analyse the data with sufficient computing resources.
Efficient machine learning relies on good experimental design and powerful graphics processing units (GPUs). This is a rapidly expanding field as the scope of diverse datasets and applications continues to grow.

The Outcome

The Data Stores provided by Pawsey allowed us to keep the data safe, and the data can be quickly downloaded to the scratch space for analysis on other Pawsey servers.
Large-memory machines are essential for de novo assembly of large plant genomes. In the past we used Zythos for our genome assembly projects, but as it has been discontinued we now use the highmemq on Zeus for the same outcome. We are also using Zeus’ longq for long-running jobs. The pangenome projects require less memory but need a large number of nodes to process the massive amount of data in parallel so that results can be obtained in an acceptable time. Magnus has thousands of nodes, and this highly parallelised processing allowed us to finish computationally demanding analyses in several weeks instead of years. The machine learning projects require storage space, which is handled by the /scratch system on Pawsey and data@pawsey, and powerful GPUs, which are provided by the Topaz cluster and by Nimbus GPU nodes. Together these enable us to train and analyse many models in a matter of days.

Commercial Advantage of this Project

We are exploring two new commercial projects this year. One is with an international genotyping company that wants support on a genotyping method we developed a few years ago based on whole genome skim sequencing. The second is with a European start-up company that is interested in our machine learning approaches to develop plant proteins for human and animal food.

List of Publications

1. Schliebs O, Chan K, Bayer PE, Petereit J, Singh A, Hassani-Pak K, Batley J, Edwards D. (2021) Daisychain: search and interactive visualisation of homologs in genome assemblies. Agronomy (accepted December 2021)
2. Varshney RK, Roorkiwal M, Sun S, Bajaj P, Chitikineni A, Thudi M, Singh NP, Du X, Upadhyaya HD, Khan AW, Wang Y, Garg V, Fan G, Cowling WA, Crossa J, Gentzbittel L, Voss-Fels KP, Valluri VK, Sinha P, Singh VK, Ben C, Rathore A, Punna R, Singh MK, Tar’an B, Bharadwaj C, Yasin M, Pithia MS, Singh S, Soren KR, Kudapa H, Jarquín D, Cubry P, Hickey LT, Dixit GP, Thuillet A-C, Hamwieh A, Kumar S, Deokar AA, Chaturvedi SK, Francis A, Howard R, Chattopadhyay D, Edwards D, Lyons E, Vigouroux Y, Hayes BJ, von Wettberg E, Datta SK, Yang H, Nguyen HT, Wang J, Siddique KHM, Mohapatra T, Bennetzen JL, Xu X, Liu X. (2021) A chickpea genetic variation map based on the sequencing of 3,366 genomes. Nature. https://doi.org/10.1038/s41586-021-04066-1
3. de Ronne M, Santhanam P, Labbé C, Lebreton A, Ye H, Vuong T, Hu H, Valliyodan B, Edwards D, Nguyen H, Belzile F, Belanger R. (2021) Mapping of partial resistance to Phytophthora sojae in soybean plant introductions using whole-genome sequencing reveals a major QTL. Plant Genome. (accepted November 2021)
4. Greer SF, Hackenberg D, Gegas V, Mitrousia G, Edwards D, Batley J, Teakle G, Barker GC, Walsh JA. (2021) QTL mapping of resistance to turnip yellows virus (TuYV) in Brassica rapa and Brassica oleracea and introgression of these resistances by resynthesis into allotetraploid plants for deployment in Brassica napus. Frontiers in Plant Science. (accepted November 2021)
5. Zanini S, Bayer PE, Wells R, Snowdon R, Varshney R, Nguyen H, Edwards D, Golicz AA. (2021) Pangenomics in crop improvement – from coding SVs to finding regulatory variants with pangenome graphs. Plant Genome. (accepted October 2021)
6. Danilevicz MF, Bayer PE, Boussaid F, Bennamoun M, Edwards D. (2021) Maize yield prediction at an early developmental stage using multispectral images and genotype data for preliminary hybrid selection in the field. Remote Sensing. 13 (19):3976. https://doi.org/10.3390/rs13193976
7. Li Y, Ruperao P, Batley J, Edwards D, Martin W, Hobson K, Sutton T. (2021) Genomic Prediction of Preliminary Yield Trials in Chickpea: Effect of Functional Annotation of SNP and Environment. Plant Genome. (accepted September 2021)
8. Hanifei M, Mehravi S, Khodadadi M, Severn-Ellis AA, Edwards D, Batley J. (2021) Detection of epistasis for fruit and some phytochemical traits in coriander under different irrigation regimes. Agronomy (accepted September 2021)
9. Ma Y, Zhao J, Fu H, Yang T, Dong J, Yang W, Chen L, Zhou L, Wang J, Liu B, Zhang S and Edwards D. (2021) Genome‑wide identification, expression and functional analysis reveal the involvement of FCS‑like zinc finger gene family in submergence response in rice. Rice. (accepted August 2021)
10. Gacek K, Bayer PE, Anderson R, Severn-Ellis AA, Wolko J, Łopatyńska A, Matuszczak M, Bocianowski J, Edwards D, Batley J. (2021) QTL genetic mapping study for traits affecting meal quality in winter oilseed rape (Brassica napus L.). Genes. (accepted August 2021)
11. Varshney RK, Bohra A, Roorkiwal M, Barmukh R, Cowling W, Chitikineni A, Lam HM, Hickey LT, Croser J, Edwards D, Farooq M, Crossa J, Weckwerth W, Millar H, Kumar A, Bevan MW, Siddique KHM. (2021) Rapid delivery systems for a food-secure future. Nature Biotechnology. (accepted August 2021)
12. Varshney RH, Bohra A, Roorkiwal M, Barmukh R, Cowling W, Chitikineni A, Lam HM, Hickey LT, Croser JS, Bayer P, Edwards D, Crossa J, Weckwerth W, Millar H, Kumar A, Bevan MW, Siddique KHM. (2021) Fast-forward breeding for a food-secure world. Trends in Genetics. (accepted August 2021)
13. Wang K, Hu H, Tian Y, Li J, Scheben A, Zhang C, Li Y, Wu J, Yang J, Fan X, Sun G, Li D, Zhang Y, Han R, Jiang R, Huang H, Yan F, Wang Y, Li Z, Li G, Liu X, Li W, Edwards D, Kang X. (2021) The chicken pan-genome reveals gene content variation and a regulatory region deletion in IGF2BP1 affecting body size. Molecular Biology and Evolution (accepted July 2021)
14. Hu H, Scheben A, Verpaalen B, Tirnaz S, Bayer PE, Hodel R, Batley J, Soltis D, Soltis P, Edwards D. (2021) Amborella gene presence/absence variation is associated with abiotic stress responses that may contribute to environmental adaptation. New Phytologist. (accepted July 2021)
15. Bayer P, Scheben A, Golicz A, Yuan Y, Faure S, Lee HT, Chawla H, Anderson R, Bancroft I, Raman H, Lim YP, Robbens S, Jiang L, Liu S, Barker M, Schranz E, Wang X, King G, Pires JC, Chalhoub B, Snowdon R, Batley J, Edwards D. (2021) Modelling of gene loss propensity in the pangenomes of three Brassica species suggests different mechanisms between polyploids and diploids. Plant Biotechnology Journal. (accepted July 2021)
16. Ewerea EE, Rosica N, Bayer PE, Ngangbama A, Edwards D, Kelahera BP, Mamoe LT, Benkendorffa K. (2021) Marine heatwaves have minimal influence on the quality of adult Sydney rock oyster flesh. Science of the Total Environment. (accepted June 2021)
17. Mehravi S, Ranjbar GA, Mirzaghaderi G, Severn-Ellis AA, Scheben A, Edwards D, Batley J. (2021) De novo SNP discovery and genotyping of Iranian Pimpinella Species using double digest restriction site-associated DNA sequencing. Agronomy. (accepted June 2021)
18. Yuxuan Y, Bayer PE, Batley J, Edwards D. (2021) Current status of structural variation studies in plants. Plant Biotechnology Journal. (accepted June 2021)
19. Danilevicz M, Bayer PE, Nestor B, Bennamoun M, Edwards D. (2021) Resources for image-based high throughput phenotyping in crops and data sharing challenges. Plant Physiology. (accepted June 2021)
20. Park SG, Noh E, Choi SR, Choi B, Shin IG, Yoo SI, Lee DJ, Ji S, Kim HS, Hwang YJ, Kim JS, Batley J, Lim YP, Edwards D, Hong CP (2021) Draft genome assembly and transcriptome dataset for European turnip (Brassica rapa L. ssp. rapifera), ECD4 carrying clubroot resistance. Frontiers in Genetics. 2 (12): 651298 doi: 10.3389/fgene.2021.651298
21. Amas J, Robyn R, Edwards D, Cowling W, Batley J. (2021) Status and advances in mining for blackleg (Leptosphaeria maculans) quantitative resistance (QR) in oilseed rape (Brassica napus). Theoretical and Applied Genetics. 134(10):3123-3145. doi: 10.1007/s00122-021-03877-0
22. Vranken S, Wernberg T, Scheben A, Severn-Ellis A, Batley J, Bayer PE, Edwards D, Wheeler D, Coleman M. (2021) Genotype–Environment mismatch of kelp forests under climate change. Molecular Ecology. 30: 3730–3746. https://doi.org/10.1111/mec.15993
23. Tay Fernandez C, Pati K, Severn-Ellis AA, Batley J, Edwards D. (2021) Studying the genetic diversity of yam bean using a new draft genome assembly. Agronomy. 11(5): 953 https://doi.org/10.3390/agronomy11050953
24. Bayer PE, Petereit J, Danilevicz MF, Anderson R, Batley J, Edwards D. (2021) The application of pangenomics and machine learning in genomic selection. Plant Genome. e20112. doi: 10.1002/tpg2.20112
25. Bayer PE, Valliyodan B, Hu H, Marsh J, Yuan Y, Vuong TD, Patil G, Song Q, Batley J, Varshney RK, Lam HM, Edwards D, Nguyen HY. (2021) Sequencing the USDA core soybean collection reveals gene loss during domestication and breeding. Plant Genome. e20109. https://doi.org/10.1002/tpg2.20109
26. Ruperao P, Thirunavukkarasu N, Gandham P, Selvanayagam S, Govindaraj M, Nebie B, Manyasa E, Gupta R, Das RR, Odeny DA, Gandhi H, Edwards D, Deshpande SP, Rathore A. (2021) Sorghum pan-genome explores the functional utility for genomic-assisted breeding to accelerate the genetic gain. Frontiers in Plant Science. 12:963
27. He Z, Ji R, Havlickova L, Wang L, Li Y, Lee HT, Song J, Koh C, Yang J, Zhang M, Parkin IAP, Wang X, Edwards D, King GJ, Zou J, Liu K, Snowdon RJ, Banga SS, Machackova I, Bancroft I. (2021). Genome structural evolution in Brassica crops. Nature Plants. 7(6):757-765. doi: 10.1038/s41477-021-00928-8
28. Rijzaani H, Bayer PE, Rouard M, Doležel J, Batley J, Edwards D. (2021) The pangenome of banana highlights differences between genera and genomes. Plant Genome. e20100 https://doi.org/10.1002/tpg2.20100
29. Marsh JI, Hu H, Gill M, Batley J, Edwards D. (2021) Crop breeding for a changing climate: integrating phenomics and genomics with bioinformatics. Theoretical and Applied Genetics. 134: 1677-1690
30. Yang H, Mohd Saad NS, Ibrahim MI, Bayer PE, Neik TX, Severn-Ellis AA, Pradhan A, Edwards D, Batley J. (2021) Candidate Rlm6 resistance genes against Leptosphaeria maculans identified through a genome-wide association study in Brassica juncea. Theoretical and Applied Genetics. 134 (7): 2035-2050
31. Mohd Saad NS, Severn-Ellis A, Pradhan A, Edwards D, Batley J. (2021) Genomics armed with diversity leads the way in Brassica improvement in a changing global environment. Frontiers in Genetics. 12: 600789
32. Valliyodan B, Brown A, Wang J, Patil G, Liu Y, Otyama P, Nelson R, Vuong T, Song Q, Musket T, Wagner R, Marri P, Reddy S, Sessions A, Wu X, Grant D, Bayer P, Roorkiwal M, Varshney R, Liu X, Edwards D, Xu D, Joshi T, Cannon S, Nguyen H. (2021) Genetic variation among 481 diverse soybean accessions, inferred from genomic re-sequencing. Scientific Data. 8, 50
33. Cantila AY, Mohd Saad NS, Amas JC, Edwards D, Batley J. (2021) Recent findings unravel genes and genetic factors underlying Leptosphaeria maculans resistance in Brassica napus and its relatives. International Journal of Molecular Sciences. 22(1): 313