Analysis of complex genomes

The field of genomics has been revolutionised by advances in DNA sequencing technology. The resulting explosion in DNA sequence volume makes it challenging both to manage and interpret the data and to apply the technology to basic scientific questions about the organism under study. We worked on several projects last year, mainly in two areas:

1. Pangenome construction: There is increasing awareness that a reference sequence representing the genome of a single individual cannot capture the full gene repertoire found in a species. A pangenome of a species is the whole gene repertoire of a study group of individuals and is an important source of genetic diversity for crop breeding. To include all genes of a species, the study group needs to be large, and constructing a pangenome at that scale has only become feasible with the advancement and falling cost of next-generation sequencing (NGS) technology. Constructing the pangenome of a species provides important insights into the genomic composition and diversity of economically important crops, allowing breeders to breed better crops and improve food production.

2. Genome assembly and validation: A good reference genome is essential for answering important biological questions. Basic assemblies, which produce the sequence of all genes, promoters, and low-copy or unique regions, are relatively inexpensive and provide valuable biological insights, while more robust pseudomolecule assemblies have greater utility in identifying the gene variation underlying traits and in genomics-assisted breeding. We had projects to assemble several different plant genomes.
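
The pangenome concept described above can be illustrated with a minimal Python sketch. It assumes a simple presence/absence table and splits the gene repertoire into genes found in every individual and genes missing from at least one. The line names and gene IDs are invented for illustration and are not project data, and this is not the project's actual pipeline.

```python
def split_core_variable(presence):
    """Split a gene repertoire into core and variable genes.

    presence maps an individual's name to the set of gene IDs found in it.
    """
    gene_sets = list(presence.values())
    # The pangenome is every gene seen in at least one individual.
    pangenome = set().union(*gene_sets)
    # Core genes are present in all individuals.
    core = set.intersection(*gene_sets) if gene_sets else set()
    # Variable genes are absent from at least one individual.
    variable = pangenome - core
    return pangenome, core, variable


# Hypothetical toy data: three individuals and five genes.
presence = {
    "line_A": {"g1", "g2", "g3"},
    "line_B": {"g1", "g2", "g4"},
    "line_C": {"g1", "g3", "g4", "g5"},
}

pangenome, core, variable = split_core_variable(presence)
print("pangenome:", sorted(pangenome))
print("core:", sorted(core))
print("variable:", sorted(variable))
```

In this toy case a single reference built from line_A alone would miss g4 and g5, which is the limitation a pangenome is meant to address.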

Principal investigator

David Edwards dave.edwards@uwa.edu.au

Area of science

Biology

Systems used

Magnus, Zeus, Nimbus, Zythos and Data Storage

Applications used

BLAST+, MaSuRCA, Samtools, Bowtie, SoapAligner, HISATs, Braker, Augustus, bamtools, RepeatMasker, jellyfish, interproscan, maker, orthomcl, trimmomatic, tophat, picard-tools, fastqc, cufflinks, bwa as well as custom scripts and software.
Partner Institution: The University of Western Australia | Project Code: Pawsey0149

The Challenge

The aim of the pangenome projects is to construct a good pangenome reference. A sufficiently large study group, with a massive amount of data for each individual, adds up to an enormous dataset. Genomes are large: the wheat genome, for example, consists of 17 thousand million letters (bases), and we need many copies to determine their order and variations. This massive quantity of data has to be processed and analysed by aligning reads to a reference genome, assembling the unaligned reads, and annotating the assembled sequences against large gene databases. It is impossible to analyse such a massive amount of data without an HPC cluster running multiple jobs in parallel.
Genome assembly faces different challenges. It involves several steps, mainly data quality control, data pre-processing, read assembly and annotation. Building a good reference relies on having enough data, all of which must be loaded into memory for de novo assembly. The main challenge of de novo assembly is therefore the need for large-memory computers to produce the required results.
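
To give a feel for the scale described above, the following Python sketch multiplies genome size by sequencing depth and study-group size. Only the wheat genome size comes from the text; the coverage and the number of individuals are assumptions chosen for illustration.

```python
def raw_bases(genome_size_bp, coverage, individuals):
    """Total number of sequenced bases for a study group."""
    return genome_size_bp * coverage * individuals


WHEAT_GENOME_BP = 17_000_000_000  # from the text: 17 thousand million letters
COVERAGE = 10       # assumed sequencing depth per individual
INDIVIDUALS = 20    # assumed study-group size

total = raw_bases(WHEAT_GENOME_BP, COVERAGE, INDIVIDUALS)
print(f"{total / 1e12:.1f} trillion bases")  # prints "3.4 trillion bases"
```

Even under these modest assumptions the raw data run to trillions of bases before any alignment, assembly or annotation output is counted.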

The Solution

For the construction of pangenomes, we have established an efficient pipeline that can be applied to various species. Even with an efficient pipeline, a powerful, highly parallelised computer cluster is essential for analysing this massive amount of data within an acceptable timeframe. Successfully assembling a genome relies on good design and good resources across project design, material preparation, DNA extraction, DNA sequencing, data management and data analysis. Bioinformatics handles the data aspects, and it is important to analyse the data with sufficient computing resources.

The Outcome

The Data Stores provided by Pawsey allowed us to keep the data safe, and the data can be quickly downloaded to scratch space for analysis on other Pawsey servers. A large-memory machine is essential for de novo assembly of large plant genomes. The Zythos server, which provides 6 TB of RAM, makes the assembly of large genomes feasible. We used Zythos for all of our large genome assemblies, such as the wheat chromosome assembly.
The pangenome projects require less memory but need a large number of nodes to process the massive amount of data in parallel, so that results can be obtained in an acceptable time. Magnus has thousands of nodes, and this highly parallelised processing allowed us to finish the computationally demanding analyses in several weeks instead of years.
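
The "weeks instead of years" effect can be sketched with simple arithmetic. The Python example below assumes a workload of independent, equally long jobs that are run in rounds across a fixed number of nodes. The job count, job length and node count are hypothetical and do not describe the project's actual workload or Magnus allocation.

```python
import math


def wall_time_days(n_jobs, hours_per_job, nodes):
    """Wall-clock days for independent jobs run one per node, in rounds."""
    rounds = math.ceil(n_jobs / nodes)
    return rounds * hours_per_job / 24


N_JOBS = 1000        # assumed number of independent analysis jobs
HOURS_PER_JOB = 24   # assumed run time of each job
NODES = 50           # assumed number of nodes used at once

serial_days = wall_time_days(N_JOBS, HOURS_PER_JOB, 1)
parallel_days = wall_time_days(N_JOBS, HOURS_PER_JOB, NODES)
print(f"one node: {serial_days:.0f} days, {NODES} nodes: {parallel_days:.0f} days")
```

With these numbers a single node would need 1000 days, while 50 nodes finish in 20 days. The model ignores queueing, I/O contention and uneven job lengths, so it only gives a rough lower bound on real run time.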

List of Publications

1. Scheben A, Verpaalen B, Lawley C, Chan CKK, Bayer PE, Batley J, Edwards D. (2018) CropSNPdb: a database of SNP array data for Brassica crops and hexaploid bread wheat. The Plant Journal. (accepted November 2018)
2. Anderson R, Edwards D, Batley J, Bayer PE. (2018) Genome-wide association studies in plants. eLS. (accepted November 2018)
3. Chan CKK, Rosic N, Lorenc MT, Visendi P, Lin M, Kaniewska P, Ferguson B, Gresshoff P, Batley J, Edwards D. (2018) A differential k-mer analysis pipeline for comparing RNA-Seq transcriptome and meta-transcriptome datasets without a reference. Functional and Integrative Genomics. (accepted November 2018)
4. Yu J, Golicz A, Lu K, Dossa K, Zhang Y, Chen J, Wang L, You J, Fan D, Edwards D, Zhang X. (2018) Insight into the evolution and functional characteristics of the pan-genome assembly from sesame landraces and modern cultivars. Plant Biotechnology Journal. (accepted October 2018)
5. Scheben A and Edwards D. (2018) Bottlenecks for genome-edited crops on the road from lab to farm. Genome Biology. (accepted October 2018)
6. Melonek J, Zhou R, Bayer PE, Edwards D, Stein N, Small I. (2018) High intraspecific diversity of Restorer-of-fertility-like genes in barley. The Plant Journal. (accepted September 2018)
7. Bayer PE, Golicz A, Tirnaz S, Chan CKK, Edwards D, Batley J. (2018) Variation in abundance of predicted resistance genes in the Brassica oleracea pangenome. Plant Biotechnology Journal. (accepted September 2018)
8. Bayer PE, Edwards D, Batley J. (2018) Bias in resistance gene prediction due to repeat-masking. Nature Plants. (accepted August 2018)
9. Mousavi-Derazmahalleh M, Nevado B, Bayer PE, Filatov D, Hane JK, Edwards D, Erskine W, Nelson MN. (2018) The western Mediterranean region provided the founder population of domesticated narrow-leafed lupin. Theoretical and Applied Genetics. (accepted August 2018)
10. Yuan Y, Milec Z, Bayer PE, Vrána J, Doležel J, Edwards D, Erskine W, Kaur P. (2018) Large-Scale Structural Variation Detection in Subterranean Clover Subtypes Using Optical Mapping. Frontiers in Plant Science. 9 (971)
11. The International Wheat Genome Sequencing Consortium (IWGSC). (2018) Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science. 361 (6403):xx-xx
12. Hu H, Scheben A, Edwards D. (2018) Advances in integrating genomics and bioinformatics in the plant breeding pipeline. Agriculture. (accepted May 2018)
13. Scheben A and Edwards D. (2018) Towards a more predictable plant breeding pipeline with CRISPR/Cas-induced allelic series to optimize quantitative and qualitative traits. Current Opinion in Plant Biology. (accepted April 2018)
14. Lee HT, Golicz AA, Bayer PE, Severn-Ellis A, Chan CKK, Batley J, Kendrick GA and Edwards D. (2018) Genomic comparison of two independent seagrass lineages reveals habitat-driven convergent evolution. Journal of Experimental Botany. (accepted April 2018)
15. Taylor CM, Kamphuis LG, Zhang W, Garg G, Berger JD, Mousavi-Derazmahalleh M, Bayer P, Edwards D, Singh KB, Cowling WA, Nelson MN. (2018) INDEL variation in the regulatory region of the major flowering time gene LanFTc1 is associated with vernalisation response and flowering time in narrow-leafed lupin (Lupinus angustifolius L.). Plant Cell and Environment. (accepted April 2018)
16. Tulpová Z, Luo MC, Toegelová H, Visendi P, Hayashi S, Vojta P, Paux E, Kilian A, Abrouk M, Bartoš J, Hajdúch M, Batley J, Edwards D, Doležel J, Šimková H. (2018) Integrated physical map of bread wheat chromosome arm 7DS to facilitate gene cloning and comparative studies. New Biotechnology. (accepted March 2018)
17. Mousavi-Derazmahalleh M, Bayer P, Hane J, Valliyodan B, Nguyen HT, Nelson M, Erskine W, Varshney RK, Papa R, Edwards D. (2018) Adapting legume crops to climate change using genomic approaches. Plant Cell and Environment. (accepted March 2018)
18. Scheben A, Chan CKK, Mansueto L, Mauleon R, Larmande P, Alexandrov N, Wing RA, McNally KL, Quesneville H, Edwards D. (2018) Progress in single-access information systems for wheat and rice crop improvement. Briefings in Bioinformatics. bby016, https://doi.org/10.1093/bib/bby016
19. Watson A, Ghosh S, Williams MJ, Cuddy WS, Simmonds, Rey MD, Hatta MAMD, Hinchliffe A, Steed A, Reynolds D, Adamski NM, Breakspear A, Korolev A, Rayner T, Dixon LE, Riaz A, Martin W, Ryan M, Edwards D, Batley J, Raman H, Carter J, Rogers C, Domoney C, Moore G, Harwood W, Nicholson P, Dieters MJ, DeLacy IH, Zhou J, Uauy C, Boden SA, Park RF, Wulff BBH, Hickey LT. (2018) Speed breeding is a powerful tool to accelerate crop research and breeding. Nature Plants. 4: 23–29
20. Li Y, Ruperao P, Batley J, Edwards D, Khan T, Colmer TD, Pang J, Siddique KHM, Sutton T. (2018) Investigating drought tolerance in chickpea using genome-wide association mapping and genomic selection based on whole-genome resequencing data. Frontiers in Plant Science. (accepted January 2018)
21. Mousavi-Derazmahalleh M, Bayer PE, Nevado B, Hurgobin B, Filatov D, Kilian A, Kamphuis LG, Singh KB, Berger JD, Hane JK, Edwards D, Erskine W, Nelson MN. (2018) Exploring the genetic and adaptive diversity of a pan-Mediterranean crop wild relative: narrow-leafed lupin. Theoretical and Applied Genetics. (accepted January 2018)
22. Yuan Y, Lee HT, Hu H, Scheben A, Edwards D. (2018) Single-cell genomic analysis in plants. Genes. (accepted January 2018)

Figures 1 and 2. A large class of resistance genes is the NBS-LRR (NLR) genes, which contain different combinations of NB-ARC, leucine-rich repeat (LRR), coiled-coil and TIR domains. These domains can be used to assign classes to resistance gene candidates mined from whole-genome assemblies. The two trees show how the number of predicted NLR genes varies across genomes. We mined these annotations by analysing all available Brassicaceae genome assemblies and annotations.
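
The domain-based class assignment mentioned in the caption can be sketched in Python as follows. This is a simplified illustration, not the method used for the figures: the labels follow the common TNL/CNL shorthand, the domain names are assumed to come from a prior domain search such as InterProScan, and giving TIR precedence over the coiled-coil domain is an assumption made to keep the example short.

```python
def classify_nlr(domains):
    """Assign a class label from the set of domains found in one protein.

    Expected domain names (assumed): "NB-ARC", "LRR", "CC", "TIR".
    """
    found = set(domains)
    # Without an NB-ARC domain the candidate is not treated as an NLR here.
    if "NB-ARC" not in found:
        return "non-NLR"
    label = ""
    # N-terminal domain: TIR takes precedence over coiled-coil (assumption).
    if "TIR" in found:
        label += "T"
    elif "CC" in found:
        label += "C"
    label += "N"
    if "LRR" in found:
        label += "L"
    return label


# Hypothetical candidates and their detected domains.
candidates = {
    "cand_1": ["TIR", "NB-ARC", "LRR"],
    "cand_2": ["CC", "NB-ARC", "LRR"],
    "cand_3": ["NB-ARC"],
    "cand_4": ["LRR"],
}
classes = {name: classify_nlr(doms) for name, doms in candidates.items()}
print(classes)
```

Counting the resulting labels per genome would give the kind of per-species totals that a tree figure could display.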