Analysis of Complex Genomes

The field of genomics has been revolutionised by advances in DNA sequencing technology. The resulting explosion in DNA sequence volume makes it a challenge both to manage and interpret the data and to apply the technology to answer basic scientific questions about the organism under study. We worked on several projects last year, mainly in three areas:

1. Pangenome construction: There is an increasing awareness that a reference sequence representing the genome of a single individual cannot capture the whole gene repertoire found in a species. A pangenome of a species is the whole gene repertoire of a study group of individuals and is an important source of genetic diversity for crop breeding. To include all genes of a species, the study group needs to be large, and constructing a pangenome from such a group has only become feasible with the advancement and low cost of next generation sequencing (NGS) technology. Pangenome construction provides important insights into the genomic composition and diversity of economically important crops, allowing breeders to breed better crops and improve food production.
2. Genome assembly and validation: A good reference genome is essential to answer important biological questions. Basic assemblies, which produce the sequence of all genes, promoters, and low copy or unique regions, are relatively inexpensive and provide valuable biological insights, while more robust pseudomolecule assemblies have greater utility in the identification of gene variation underlying traits and in genomics-assisted breeding. We had projects to assemble several different plant genomes.
3. Machine learning for crop improvement: Climate change is here and farmers need solutions for higher yield. We have started collaborations around drone-based automated phenotyping of crops growing on Western Australian fields and around genomic prediction of crosses in crop breeding programs. The drone data is being used to predict frost in order to assign different levels of seed quality in a field, and the genomic prediction data is used to predict which crosses will perform best, in order to find optimal combinations of alleles.
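
To make the pangenome idea concrete, the following minimal sketch splits a gene repertoire into genes found in every individual (core) and genes found in only some (variable). The individual names, gene names and the `classify_genes` helper are hypothetical illustrations, not part of the project's pipeline.

```python
from collections import Counter


def classify_genes(presence):
    """Split genes into core (in every individual) and variable (in only some).

    `presence` maps an individual's name to the set of genes detected in it.
    """
    n_individuals = len(presence)
    # Count in how many individuals each gene was detected.
    counts = Counter(gene for genes in presence.values() for gene in set(genes))
    core = {gene for gene, count in counts.items() if count == n_individuals}
    variable = set(counts) - core
    return core, variable


# Hypothetical toy data: three individuals and the genes found in each.
presence = {
    "line_A": {"gene1", "gene2", "gene3"},
    "line_B": {"gene1", "gene2", "gene4"},
    "line_C": {"gene1", "gene3", "gene4", "gene5"},
}

core, variable = classify_genes(presence)
pangenome = core | variable  # the whole gene repertoire of the study group

print("core:", sorted(core))
print("variable:", sorted(variable))
print("pangenome size:", len(pangenome))
```

In this toy example no single individual carries all five genes, which is why a reference built from one individual would miss part of the repertoire.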

Principal investigator

David Edwards

Area of science

Biological Sciences, Geosciences

Systems used

Magnus, Zeus, Nimbus, Topaz and Managed Storage

Applications used

BLAST+, MaSuRCA, Samtools, Bowtie, SoapAligner, HISATs, Braker, Augustus, bamtools, RepeatMasker, jellyfish, interproscan, maker, orthomcl, trimmomatic, tophat, picard-tools, fastqc, cufflinks, bwa, scikit-learn, keras, pytorch, tensorflow, orthofinder, conda, bowtie2, Python, Perl, as well as custom scripts and software.
Partner Institution: The University of Western Australia | Project Code: Pawsey0149

The Challenge

The aim of the pangenome projects is to construct a good pangenome reference. This needs a large enough study group, and each individual in it contributes a massive amount of data. Genomes are large: the wheat genome, for example, consists of 17 thousand million letters, and we need many copies to determine their order and variations. This massive quantity of data has to be processed and analysed by aligning reads to a reference genome, assembling the unaligned reads, and annotating the assembled sequences against large gene databases. It is impossible to analyse such a massive amount of data without an HPC cluster running multiple jobs in parallel.
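
A rough back-of-the-envelope calculation shows how quickly the data volume grows. The genome size is the wheat figure quoted above; the sequencing depth of 30 copies and the study group of 100 individuals are assumptions chosen only for illustration, not figures from the project.

```python
def sequencing_bases(genome_size_bp, coverage, n_individuals):
    """Total number of sequenced bases for a study group."""
    return genome_size_bp * coverage * n_individuals


WHEAT_GENOME_BP = 17_000_000_000  # 17 thousand million letters (from the text)
COVERAGE = 30                     # assumed number of copies per position
N_INDIVIDUALS = 100               # assumed size of the study group

total = sequencing_bases(WHEAT_GENOME_BP, COVERAGE, N_INDIVIDUALS)
terabases = total / 1e12

print(f"{terabases:.0f} terabases of raw sequence")
```

Under these assumptions the study group produces about 51 terabases of raw sequence before any alignment, assembly or annotation takes place.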

Genome assembly faces different challenges. It involves several steps, mainly data quality control, data pre-processing, read assembly and annotation. Building a good reference relies on having enough data, all of which must be loaded into memory for de novo assembly. The main challenge of de novo assembly is therefore the need for large memory computers.

Machine learning faces several different challenges: large quantities of data in hard-to-process formats such as binary image data, dirty data that needs strict quality control, and the need to test many different, computationally intensive models in order to arrive at the best-performing model, which requires powerful GPUs.

The Solution

We have established an efficient pangenome construction pipeline that can be applied to various species. Even with an efficient pipeline, a powerful, highly parallelised computer cluster is essential to analyse this massive amount of data within an acceptable timeframe.

The success of a genome assembly relies on good design and good resources, including project design, material preparation, DNA extraction, DNA sequencing, data management and data analysis. Bioinformatics takes care of the data aspects, and it is important to analyse the data with sufficient computing resources.

Efficient machine learning relies on good experimental design and powerful graphics processing units (GPUs).

The Outcome

The data stores provided by Pawsey allowed us to keep the data safe, and the data can be quickly downloaded to the scratch space for analysis on other Pawsey servers.

Large memory machines are essential for de novo assembly of large plant genomes. In the past we used Zythos for our genome assembly projects, but as it has been discontinued we now use the highmemq on Zeus for the same outcome.

The pangenome projects require a smaller amount of memory but a large number of nodes to process the massive amount of data in parallel, so that the results can be obtained in an acceptable time. Magnus has thousands of nodes, and its highly parallelised processing allowed us to finish the computationally demanding analysis in several weeks instead of years.
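
The effect of spreading independent samples over many nodes can be sketched as follows. This is an idealised illustration that ignores scheduling and I/O overheads; the sample names, the node count of 50 and the two years of single-node compute are hypothetical values, and the two helper functions are not part of the project's pipeline.

```python
def split_into_batches(samples, n_nodes):
    """Distribute samples over nodes round-robin, one batch per node."""
    return [samples[i::n_nodes] for i in range(n_nodes)]


def ideal_wall_time(serial_hours, n_nodes):
    """Wall time if the work divides perfectly across the nodes."""
    return serial_hours / n_nodes


# Hypothetical study group of ten samples spread over four nodes.
samples = [f"sample_{i:03d}" for i in range(10)]
batches = split_into_batches(samples, 4)
print([len(batch) for batch in batches])

# Assumed workload: two years of compute on a single node, run on 50 nodes.
serial_hours = 2 * 365 * 24
wall_hours = ideal_wall_time(serial_hours, 50)
print(f"{wall_hours / 24:.1f} days")
```

Because each sample can be processed independently, the pangenome workload scales with the number of nodes rather than with the memory of any single node.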

The machine learning projects require storage space, which is handled by the /scratch system on Pawsey, and powerful GPUs, which are provided by the Topaz cluster and by Nimbus GPU nodes. Together these enable us to train and analyse many models in a matter of days.

List of Publications

1. Golicz AA, Bhalla PL, Edwards D, Singh MB. (2020) Rice topologically associated domains display elevated sequence variation and meiotic crossover rate. Communications Biology. (accepted April 2020)
2. Zhao J, Bayer PE, Ruperao P, Saxena RK, Khan AW, Golicz AA, Nguyen HT, Batley J, Edwards D, Varshney RK. (2020) Trait associations in the pangenome of pigeon pea (Cajanus cajan). Plant Biotechnology Journal. (accepted February 2020)
3. Danilevicz MF, Tay Fernandez CG, Marsh JI, Bayer PE, Edwards D. (2020) Plant Pangenomics: Approaches, Applications and Advancements. Current Opinion in Plant Biology. 54: 15-25
4. Anderson R, Bayer PE, Edwards D. (2020) Climate Change and the need for agricultural adaptation. Current Opinion in Plant Biology. (accepted December 2019)
5. Hussain Q, Shi J, Scheben, Zhan J, Wang X, Liu G, Yan G, King G, Edwards D, Wang H. (2020) Genetic and signaling pathways of dry fruit size: targets for genome editing based crop. Plant Biotechnology Journal. (accepted December 2019)
6. Golicz A, Bayer PE, Bhalla PL, Batley J, Edwards D. (2020) Pangenomics comes of age: From bacteria to plant and animal applications. Trends in Genetics. 36 (2): 132-145
7. Hackenberg D, Asare-Bediako E, Baker A, Walley P, Jenner C, Greer S, Bramham L, Batley J, Edwards D, Delourme R, Barker G, Teakle G, Walsh J. (2020) Identification and QTL mapping of resistance to Turnip yellows virus (TuYV) in oilseed rape, Brassica napus. Theoretical and Applied Genetics 133: 383-393
8. Feng K, Cui L, Wang L, Shan D, Tong W, Deng P, Yan Z, Wang M, Zhan H, Wu X, He W, Zhou X, Ji J, Zhang G, Mao L, Karafiátová M, Šimková H, Doležel J, Du X, Zhao S, Luo M‐C, Han D, Zhang C, Kang Z, Appels R, Edwards D, Nie X and Weining S. (2020) The improved assembly of 7DL chromosome provides insight into the structure and evolution of bread wheat. Plant Biotechnology Journal 18 (3): 732-742
9. Dolatabadian A, Bayer P, Tirnaz S, Hurgobin B, Edwards D, Batley J. (2020) Characterisation of disease resistance genes in the Brassica napus pangenome reveals significant structural variation. Plant Biotechnology Journal. 18 (4): 969-982
10. Valliyodan B, Cannon SB, Bayer PE, Shu S, Ren L, Jenkins J, Chung CYL, Chan TF, Daum CG, Plott C, Hastie A, Baruch K, Barry KW, Huang WH, Patil G, Varshney RK, Hu H, Batley J, Yuan Y, Song Q, Goodstein DM, Stacey G, Lam HM, Jackson SA, Schmutz J, Grimwood J, Edwards D, Nguyen HT. (2019) Construction and comparison of three new reference-quality genome assemblies for soybean. The Plant Journal. 100 (5): 1066-1082
11. Kreplak J, Madoui MA, Cápal P, Novák P, Labadie K, Aubert G, Bayer PE, Gali KK, Syme RA, Main D, Klein A, Bérard A, Vrbová I, Fournier C, d’Agata L, Belser C, Berrabah W, Toegelová H, Milec Z, Vrána J, Lee HT, Kougbeadjo A, Térézol M, Huneau C, Turo CT, Mohellibi N, Neumann P, Falque M, Gallardo K, McGee R, Tar’an B, Bendahmane A, Aury JM, Batley J, Le Paslier MC, Ellis N, Warkentin TD, Coyne CJ, Salse J, Edwards D, Lichtenzveig J, Macas J, Doležel J, Wincker P, Burstin J. (2019) A reference genome for pea provides insight into legume genome evolution. Nature Genetics 51, 1411-1422.
12. Varshney RK, Thudi M, Roorkiwal M, He W, Upadhyaya HD, Yang W, Bajaj P, Cubry P, Rathore A, Jian J, Doddamani D, Khan AW, Garg V, Chitikineni A, Xu D, Gaur PM, Singh NP, Chaturvedi SK, Nadigatla GVPR, Krishnamurthy L, Dixit GP, Fikre A, Kimurto PK, Sreeman SM, Bharadwaj C, Tripathi S, Wang J, Lee S-H, Edwards D, Polavarapu KKB, Penmetsa RV, Crossa J, Nguyen HT, Siddique KHM, Colmer TD, Sutton T, von Wettberg E, Vigouroux Y, Xu X and Liu X. (2019) Resequencing of 429 chickpea accessions from 45 countries provides insights into genome diversity, domestication and agronomic traits. Nature Genetics 51, 857-864.
13. Mousavi-Derazmahalleh M, Chang S, Thomas G, Derbyshire M, Bayer PE, Edwards D, Nelson M, Erskine W, Lopez-Ruiz FJ, Clements J, Hane. (2019) Prediction of pathogenicity genes involved in adaptation to a lupin host in the fungal pathogens Botrytis cinerea and Sclerotinia sclerotiorum via comparative genomics. BMC Genomics 20 (1): 385
14. Nock CJ, Hardner CM, Montenegro JD, Termizi AAA, Hayashi S, Playford J, Edwards D, Batley J. (2019) Wild origins of Macadamia domestication identified through intraspecific chloroplast genome sequencing. Frontiers in Plant Science. 10 (334)
15. Scheben A, Verpaalen B, Lawley C, Chan KCC, Bayer PE, Batley J, Edwards D. (2019) CropSNPdb: a database of SNP array data for Brassica crops and hexaploid bread wheat. The Plant Journal. 98 (1): 142-152
16. Anderson R, Edwards D, Batley J, Bayer PE. (2019) Genome-wide association studies in plants. eLS. (accepted November 2018)
17. Chan CKK, Rosic N, Lorenc MT, Visendi P, Lin M, Kaniewska P, Ferguson B, Gresshoff P, Batley J, Edwards D. (2019) A differential k-mer analysis pipeline for comparing RNA-Seq transcriptome and meta-transcriptome datasets without a reference. Functional and Integrative Genomics. 19 (2): 363-371
18. Yu J, Golicz A, Lu K, Dossa K, Zhang Y, Chen J, Wang L, You J, Fan D, Edwards D, Zhang X. (2019) Insight into the evolution and functional characteristics of the pan-genome assembly from sesame landraces and modern cultivars. Plant Biotechnology Journal. 17 (5): 881-892
19. Melonek J, Zhou R, Bayer PE, Edwards D, Stein N, Small I. (2019) High intraspecific diversity of Restorer-of-fertility-like genes in barley. The Plant Journal. 97 (2): 281-295
20. Bayer PE, Golicz A, Tirnaz S, Chan KCC, Edwards D, Batley J. (2019) Variation in abundance of predicted resistance genes in the Brassica oleracea pangenome. Plant Biotechnology Journal. 17 (4): 789-800
21. Scheben A and Edwards D. (2018) Bottlenecks for genome-edited crops on the road from lab to farm. Genome Biology. 19 (178)
22. Bayer PE, Edwards D, Batley J. (2018) Bias in resistance gene prediction due to repeat-masking. Nature Plants. 4: 762–765
23. Mousavi-Derazmahalleh M, Nevado B, Bayer PE, Filatov D, Hane JK, Edwards D, Erskine W, Nelson MN. (2018) The western Mediterranean region provided the founder population of domesticated narrow-leafed lupin. Theoretical and Applied Genetics. 131 (12): 2543-2554
24. Yuan Y, Milec Z, Bayer PE, Vrána J, Doležel J, Edwards D, Erskine W, Kaur P. (2018) Large-Scale Structural Variation Detection in Subterranean Clover Subtypes Using Optical Mapping. Frontiers in Plant Science. 9 (971)
25. The International Wheat Genome Sequencing Consortium (IWGSC). (2018) Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science. 361 (6403): eaar7191
26. Hu H, Scheben A, Edwards D. (2018) Advances in integrating genomics and bioinformatics in the plant breeding pipeline. Agriculture. 8 (6): 75
27. Scheben A and Edwards D. (2018) Towards a more predictable plant breeding pipeline with CRISPR/Cas-induced allelic series to optimize quantitative and qualitative traits. Current Opinion in Plant Biology. 45: 218-225
28. Lee HT, Golicz AA, Bayer PE, Severn-Ellis A, Chan CKK, Batley J, Kendrick GA and Edwards D. (2018) Genomic comparison of two independent seagrass lineages reveals habitat-driven convergent evolution. Journal of Experimental Botany. 69 (15): 3689-3702
29. Taylor CM, Kamphuis LG, Zhang W, Garg G, Berger JD, Mousavi-Derazmahalleh M, Bayer P, Edwards D, Singh KB, Cowling WA, Nelson MN. (2018) INDEL variation in the regulatory region of the major flowering time gene LanFTc1 is associated with vernalisation response and flowering time in narrow-leafed lupin (Lupinus angustifolius L.). Plant Cell and Environment. 42 (1): 174-187
30. Tulpová Z, Luo MC, Toegelová H, Visendi P, Hayashi S, Vojta P, Paux E, Kilian A, Abrouk M, Bartoš J, Hajdúch M, Batley J, Edwards D, Doležel J, Šimková H. (2018) Integrated physical map of bread wheat chromosome arm 7DS to facilitate gene cloning and comparative studies. New Biotechnology. 48: 12-19
31. Mousavi-Derazmahalleh M, Bayer P, Hane J, Valliyodan B, Nguyen HT, Nelson M, Erskine W, Varshney RK, Papa R, Edwards D. (2018) Adapting legume crops to climate change using genomic approaches. Plant Cell and Environment. 42 (1): 6–19
32. Scheben A, Chan CKK, Mansueto L, Mauleon R, Larmande P, Alexandrov N, Wing RA, McNally KL, Quesneville H, Edwards D. (2018) Progress in single-access information systems for wheat and rice crop improvement. Briefings in Bioinformatics. bby016
33. Watson A, Ghosh S, Williams MJ, Cuddy WS, Simmonds, Rey MD, Hatta MAMD, Hinchliffe A, Steed A, Reynolds D, Adamski NM, Breakspear A, Korolev A, Rayner T, Dixon LE, Riaz A, Martin W, Ryan M, Edwards D, Batley J, Raman H, Carter J, Rogers C, Domoney C, Moore G, Harwood W, Nicholson P, Dieters MJ, DeLacy IH, Zhou J, Uauy C, Boden SA, Park RF, Wulff BBH, Hickey LT. (2018) Speed breeding is a powerful tool to accelerate crop research and breeding. Nature Plants. 4: 23–29
34. Li Y, Ruperao P, Batley J, Edwards D, Khan T, Colmer TD, Pang J, Siddique KHM, Sutton T. (2018) Investigating drought tolerance in chickpea using genome-wide association mapping and genomic selection based on whole-genome resequencing data. Frontiers in Plant Science. 9: 190
35. Mousavi-Derazmahalleh M, Bayer PE, Nevado B, Hurgobin B, Filatov D, Kilian A, Kamphuis LG, Singh KB, Berger JD, Hane JK, Edwards D, Erskine W, Nelson MN. (2018) Exploring the genetic and adaptive diversity of a pan Mediterranean crop wild relative: narrow-leafed lupin. Theoretical and Applied Genetics. 131 (4): 887–901
36. Kaashyap M, Ford R, Kudapa H, Jain M, Edwards D, Varshney R, Mantri N. (2018) Differential regulation of genes involved in root morphogenesis and cell wall modification is associated with salinity tolerance in chickpea. Scientific Reports. 8 (1). doi:10.1038/s41598-018-23116-9
37. Yuan Y, Lee HT, Hu H, Scheben A, Edwards D. (2018) Single-cell genomic analysis in plants. Genes. 9 (1): 50
38. Hurgobin B, Golicz A, Bayer P, Chan K, Tirnaz S, Dolatabadian A, Schiessl S, Samans B, Montenegro J, Parkin I, Pires C, Chalhoub B, King G, Snowdon R, Batley J and Edwards D. (2018) Homoeologous exchange is a major cause of gene presence/absence variation in the amphidiploid Brassica napus. Plant Biotechnology Journal. 16 (7): 1265-1274
39. Alamery S, Tirnaz S, Bayer P, Tollenaere R, Chaloub B, Edwards D and Batley J. (2018) Genome wide identification and comparative analysis of NBS-LRR resistance genes in Brassica napus. Crop and Pasture Science. 69, 72-93

Figure 1. Genome-wide association results in the pigeon pea (Cajanus cajan) pangenome for seed weight: trait associations for variable genes (a) and single nucleotide polymorphisms (SNPs) (b). One gene and two SNPs on three different chromosomes show statistically significant association with the phenotype. Figure from Zhao J et al (2020) Trait associations in the pangenome of pigeon pea (Cajanus cajan). Plant Biotechnology Journal.
Figure 2. Genomic structure of the pea (Pisum sativum) genome assembly. Each bar represents chromosomes 1 to 7, with estimated centromere positions represented as black bars. Each track represents the density of retrotransposons, transposons, genes, ncRNA, tRNA and miRNA coding sequences (b–g). Lines in the inner circle represent links between synteny-selected paralogs. Figure from Kreplak J et al (2019) A reference genome for pea provides insight into legume genome evolution. Nature Genetics 51, 1411-1422.