Maize Genome Size: Insights into Structure and Evaluation

Maize Genome Sequencing Projects

B73 Reference Genome

The B73 inbred line has been a cornerstone of maize genome sequencing efforts. The initial B73 reference genome, known as B73 RefGen_v1, was released in 2010. This assembly had a genome size of approximately 2.1 Gb and consisted of 10 chromosomes along with 2 organelle genomes. This version was sequenced with a combination of Sanger and 454 methods and assembled with the phredPhrap and ABySS software.

As sequencing technologies advanced, subsequent versions of the B73 reference genome were released. B73 RefGen_v3, published in 2013, marked a significant improvement in assembly quality. This version maintained the 2.1 Gb genome size but showed enhanced contiguity, with a scaffold N50 of 217.9 Mb and a contig N50 of 41.3 kb.
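Contiguity figures such as scaffold N50 and contig N50 can be computed directly from a list of sequence lengths: the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The sketch below is a minimal illustration in Python; the function name and the toy lengths are made up for the example and are not taken from any B73 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    half_total = sum(ordered) / 2             # half of the assembled bases
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy lengths (arbitrary units), purely illustrative.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70
```

Because the statistic depends only on the length distribution, a few very long scaffolds can give a high scaffold N50 even when the underlying contigs are short, which is why both values are usually reported together.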

A later iteration, B73 RefGen_v4, was released in 2017 and represented a major leap forward in maize genome assembly. This version used PacBio sequencing technology, which allowed for longer read lengths and improved resolution of repetitive regions. RefGen_v4 was assembled with the Celera Assembler v. CA 8.3rc2, resulting in a more accurate and complete representation of the maize genome.

Diversity Panels

To capture the genetic diversity of maize, several large-scale sequencing projects have focused on diverse panels of maize lines. These efforts have significantly expanded our understanding of maize genetic variation and have provided valuable resources for genome-wide association studies (GWAS).

One of the earliest diversity panels was the Maize Association Panel (MAP), which was initially genotyped with a modest number of Simple Sequence Repeat (SSR) markers. As sequencing technologies improved, this panel was sequenced as part of the Maize HapMap3 project, increasing the number of segregating genetic markers to 83 million.

Another important diversity panel is the Wisconsin Diversity (WiDiv) panel. Initially genotyped with 1,536 microarray-based markers, it later underwent RNA sequencing, which increased the number of segregating genetic markers to 900,000. A subset of lines from the WiDiv panel was subsequently resequenced, resulting in a set of 3.1 million SNPs scored across 511 genotypes.

The Shoot Apical Meristem (SAM) panel represents another significant contribution to maize diversity studies. This panel was genotyped using mRNA sequencing, enabling the identification and scoring of 1.2 million segregating single-nucleotide polymorphism (SNP) markers.

Pan-genome Initiatives

Recent efforts have focused on developing a maize pan-genome, which aims to capture the full genetic diversity present across multiple maize lines. These initiatives have led to the sequencing and assembly of numerous maize inbred lines beyond the B73 reference.

By 2019, MaizeGDB, the primary database for maize genomic information, hosted the genomes of six maize inbred lines and one teosinte. Since then, the number of available maize genomes has increased dramatically. MaizeGDB has incorporated 39 additional reference-quality genomes, including important individual inbred lines such as PH207, Mo17, and W22.

A significant contribution to the pan-genome effort came from the sequencing of the Nested Association Mapping (NAM) population founder lines. This project resulted in 26 high-quality PacBio genome assemblies. The NAM founder lines represent a large swath of maize’s considerable diversity and have been extensively used by researchers to study various agronomic traits.

These pan-genome initiatives have also led to the development of new tools for analyzing maize genetic diversity. For instance, PanEffect, an AI-driven platform, offers insights into the effects of genetic variants across a diverse set of 50 maize varieties. This tool uses Zm-B73-REFERENCE-NAM-5.0 as the reference genome, which contains 39,755 gene models and 75,539 transcripts.

Genome Assembly Strategies for Maize

The assembly of the maize genome has been a challenging endeavor due to its large size and complex structure. Scientists have employed various strategies to tackle this challenge, each with its own advantages and limitations.

BAC-by-BAC approach

The BAC-by-BAC approach has been a traditional method for sequencing complex genomes. This strategy involves creating bacterial artificial chromosome (BAC) libraries, which are large-insert genomic libraries that have become popular for structural genome research in plants. BACs have several advantages over yeast artificial chromosomes (YACs), including a low frequency of chimeras, high stability of clones, and ease of manipulation.

In the case of maize, with its genome size ranging from 2,300 to 2,700 Mb, the BAC-by-BAC approach has been particularly useful. This method has helped in constructing integrated genetic and physical maps, which are crucial for understanding the genome sequence, gene content, and structure of the maize genome.

However, the BAC-by-BAC approach can be relatively expensive. For instance, sequencing the 2.5 Gb maize genome using this method was estimated to cost around USD 50 million. This high cost is due to the expenses associated with making BAC libraries, fingerprinting BAC clones, and sequencing large numbers of overlapping BACs.

Whole genome shotgun

The whole genome shotgun (WGS) approach has emerged as a faster and more cost-effective alternative to BAC-by-BAC sequencing. This method involves fragmenting the entire genome into small pieces, sequencing these fragments, and then assembling them computationally.
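The fragment, sequence, and assemble cycle can be sketched with a toy example in Python. This is only an illustration under strong assumptions (error-free reads, a single strand, a tiny made-up sequence with no repeats, and a simple greedy overlap merge); it is not how production assemblers work, and all function names are hypothetical.

```python
def shotgun_reads(genome, read_len, step):
    """Cut overlapping fixed-length reads from a toy genome (no errors, one strand)."""
    return [genome[i:i + read_len] for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a, b, min_len):
    """Longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        # Discard sequences that are fully contained in another one.
        contigs = [c for c in contigs if not any(c != o and c in o for o in contigs)]
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


toy_genome = "ACGTTGCAATCGGA"  # made-up sequence in which every 3-mer is unique
reads = shotgun_reads(toy_genome, read_len=6, step=2)
print(greedy_assemble(reads))  # ['ACGTTGCAATCGGA']
```

The toy sequence reassembles cleanly only because every short substring in it is unique. Once a genome contains repeats longer than the read overlaps, the same overlap can join reads from different locations, and the merge step can no longer tell which join is correct.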

However, WGS assemblies are often of lower quality than BAC-by-BAC assemblies, particularly in complex genomes like maize. The high repeat content of the maize genome, estimated at approximately 50% to 73%, poses significant challenges for WGS assembly. These repetitive sequences can lead to gaps and misassemblies in the final genome sequence.

Despite these challenges, WGS assemblies can still be useful for many biological questions. They can be used for analyzing gene content, discovering and applying molecular genetic markers, and conducting evolutionary studies. However, for detailed studies of genome rearrangements or for resolving paralogues or homoeologues in complex and polyploid genomes, more complete assemblies are required.

Long-read technologies

Recent advancements in sequencing technologies, particularly long-read sequencing platforms, have revolutionized genome assembly strategies for complex genomes like maize. These technologies, such as PacBio Single Molecule Real-Time (SMRT) sequencing, have the ability to produce reads that are significantly longer than traditional short-read technologies.

Long-read technologies have shown promise in resolving repetitive regions that were previously challenging to assemble. For instance, a recent study using PacBio SMRT sequencing produced a maize genome assembly with fewer gaps than any previously sequenced maize genome. This assembly had a contig N50 of 6.99 Mb, which is comparable to other recent high-quality maize genome assemblies.

The combination of long-read sequencing with other technologies has further improved assembly quality. For example, a hybrid approach using PacBio SMRT sequencing, Illumina paired-end sequencing, and BioNano optical mapping resulted in a maize genome assembly with only 438 gaps and a final N50 contig size of 7.77 Mb. This represents a 5-fold improvement in contiguity compared to previous assemblies of maize inbred lines B73 and Mo17.

These advancements in long-read technologies and hybrid assembly approaches are bringing us closer to achieving gapless, telomere-to-telomere assemblies of complex genomes like maize. Such high-quality assemblies are crucial for understanding the full extent of genetic diversity, structural variations, and regulatory elements in the maize genome, ultimately contributing to crop improvement efforts.

Gene Annotation Challenges in Maize

The annotation of the maize genome presents unique challenges due to its large size and complex structure. These challenges have led to the development of various approaches and methodologies to accurately identify and characterize genes within the maize genome.

Gene prediction methods

Gene prediction in maize relies on two primary types of evidence: mathematical and biological. Mathematical evidence is developed ab initio, directly from the assembled genome sequence. Computer algorithms such as Genefinder, Fgenesh, Augustus, and GeneMark search for patterns in DNA sequence that define a gene, including start codons, amino acid codons, intron/exon boundaries, and stop codons. These pattern-based programs are typically trained on a set of representative known genes to develop a hidden Markov model (HMM), which identifies organismal biases for gene features hidden in DNA sequence.
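The simplest form of this pattern search is an open reading frame (ORF) scan: look for a start codon and follow the reading frame until a stop codon appears. The Python sketch below illustrates only that idea. It is a deliberately simplified assumption-laden example (forward strand only, no introns, no trained model) and does not reproduce what the programs named above do; the function name and the toy sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) slices of ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                     # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                    # look for the next ORF in this frame
    return orfs


toy_sequence = "CCATGAAATTTTAGGG"               # made-up sequence
print(find_orfs(toy_sequence))                  # [(2, 14)]
print(toy_sequence[2:14])                       # ATGAAATTTTAG
```

Real plant genes are interrupted by introns and occur on both strands, which is one reason ab initio predictors rely on trained statistical models rather than a plain codon scan.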

Biological evidence, on the other hand, is provided by experiments that yield mRNA and, to a lesser extent, protein sequences. Homology-based programs look for similarities between the genome sequence and independent RNA and protein evidence from the organism under study and from related organisms.

Evidence-based annotation

Evidence-based gene annotation has emerged as a rapid and cost-effective way to provide reliable gene annotations for newly sequenced genomes. However, one limitation of this approach is its requirement for transcriptional evidence, such as known proteins, full-length cDNAs, or expressed sequence tags (ESTs) in the species of interest.

To overcome this limitation, researchers have developed evidence-based gene build systems that can use transcriptional evidence across related species. For example, the Gramene pipeline has shown that cross-species ESTs from within monocot or dicot classes are a valuable source of evidence for gene predictions. Using only EST and cross-species evidence, this pipeline can generate a plant gene set comparable in quality to the human gene set built from known proteins and full-length cDNAs.

Recent advancements in genomic technologies have introduced new challenges and opportunities for gene annotation in maize. Genotyping-by-sequencing (GBS) has been used to construct genomic selection models for large and complex polyploid wheat breeding materials. However, applying GBS in maize breeding programs faces biological complications due to the dynamic nature of the maize genome, which exhibits extensive presence-absence variation. Studies suggest that 80-90% of the maize genome shows some presence-absence variation, with only 75-82% of sites present in samples, except for the reference genome B73, which exhibits near-complete coverage.

Manual curation

Manual annotation, or curation, involves a person evaluating one gene at a time, adding information and making corrections. This process has been supported by annotation jamborees, which provide intensive but sporadic annotation efforts. For example, the Drosophila melanogaster genome underwent an early round of annotation by a jamboree of volunteers.

In maize, manual curation has played a crucial role in integrating various data sources and improving gene annotations. Over the years, about 6,000 functional genes described in the literature have been curated into MaizeGDB. However, this count is significantly smaller than the 32,540 gene models predicted for the B73 genome by the Maize Genome Sequencing Consortium and the approximately 10,000 gene models thought to be present in other inbred lines but not in B73.

To integrate these data, tools like the ‘Locus Lookup Tool’ have been developed, which help researchers with genetically mapped genes identify the chromosomal window containing their gene of interest. This tool aids positional cloning efforts and ultimately connects theoretical gene models with biologically defined genes.

Manual curation also involves assigning cDNA sequences that align to the genome assembly to locus variations, which can be used to link classical genetic information with the genome sequence. This curated information is periodically shared with the NCBI and can be found along with gene and marker names and synonyms on NCBI gene records.

Comparative Analysis of Maize Inbreds

The comparative analysis of maize inbred lines has revealed an unprecedented level of structural diversity among higher eukaryotes. This diversity plays a crucial role in shaping the extraordinary phenotypic plasticity and adaptability of maize. Recent advancements in genomic technologies have allowed researchers to delve deeper into the complexities of maize genome structure, uncovering various types of structural variations that contribute to the genetic diversity of this important crop.

Structural Variations

Structural variations in the maize genome include rearrangements, copy number variations (CNV), and presence/absence variations (PAV). These variations have been identified through various techniques, including comparative genomic hybridization (CGH) and whole-genome sequencing. A study comparing the inbred lines B73 and Mo17 using array-based CGH revealed an extensive level of structural diversity. This analysis conservatively estimated several hundred CNV sequences and several thousand PAV sequences present in B73 but absent in Mo17.

The distribution of structural variants across the maize genome is not uniform. More variants are observed near the ends of chromosomes compared to the central centromeric regions, generally mirroring genic density. Interestingly, CNVs exhibit a significantly different distribution than expected, with higher levels in low recombination regions, while PAVs do not show altered rates in high and low recombination regions.

Presence-Absence Variations

Presence/absence variations (PAVs) represent a major source of genetic diversity in maize. A study comparing the B73 and Mo17 genomes identified several large PAVs, including three regions present only in B73 and two regions present only in Mo17. The largest PAV, designated as Regional, spans approximately 3.2 Mb on chromosome 6 and contains 70 protein-coding genes.

Another study focused on identifying PAVs present in Mo17 but absent from B73 using next-generation sequencing. This analysis revealed 119 PAVs, of which 57 were validated by PCR. These PAVs were dispersed across all ten chromosomes, suggesting the possibility of large genetic fragments in Mo17 being absent from the B73 genome.

Interestingly, some PAVs have been associated with disease resistance, indicating that these variations may play a role in conferring resistance to certain pathogens in maize. The majority of genes within PAVs were found to be transcriptionally silenced or expressed at low levels in some tissues, especially for PAVs related to disease resistance.

Copy Number Variations

Copy number variations (CNVs) represent another significant source of genetic diversity in maize. A study analyzing 33 genotypes, including 19 diverse maize genotypes and 14 teosinte genotypes, identified 479 Up CNV genes (consistently higher signal than the reference B73 genome) and 3,410 Down CNV/PAV genes. Of the Down CNV/PAV genes, 586 were classified as Down CNV candidates, while 2,824 were likely examples of PAV.

The largest Up CNV event identified included nine genes located on chromosome 7, observed in 6 out of 25 domesticated maize lines and 6 out of 14 teosinte lines. Individual genotypes differed from B73 at between 21 and 217 (mean = 114) Up CNVs and between 405 and 1,375 (mean = 917) Down CNV/PAVs.

These structural variations, including both PAVs and CNVs, contribute significantly to the genetic diversity of maize. They represent a source of variation that is not always discoverable by SNPs alone, with 21.9% of common pSVs showing low linkage disequilibrium with nearby SNPs. This underscores the importance of considering structural variations in genetic studies and breeding programs aimed at improving maize traits and adaptability.

Maize Genome Databases and Resources

MaizeGDB

MaizeGDB serves as the primary community database for maize researchers, providing essential data curation and informatics resources to support maize genetics, genomics, and breeding research. This platform has evolved significantly since its inception, tracing its lineage to the early twentieth century when maize research data was first collected and shared. In 2008, MaizeGDB shifted its focus to genomic data with the release of the first complete B73 genome assembly.

As of February 2024, MaizeGDB hosts 104 maize genome assemblies, including the representative reference genome for Zea mays ssp. mays, B73, and other important inbred lines such as PH207, Mo17, and W22. The database also includes a set of European flint lines, sweet corn, and 26 high-quality PacBio genome assemblies of the Nested Association Mapping (NAM) population founder lines. These NAM founder lines represent the broad diversity of domesticated maize and have been instrumental in elucidating various agronomic traits.

MaizeGDB offers three perspectives for each hosted genome: independent use with genome-specific data, association with the B73 reference genome, and presentation in a pan-genomic framework. The platform has expanded its resources to include an epigenetic atlas for the B73v5 reference genome, featuring ChIP-seq, ATAC-seq, and methyl-seq data. Additionally, MaizeGDB incorporates structural variant data for each NAM founder genome, enabling researchers to compare large-scale structural variants across all NAM founders against the reference genome.

Gramene

Gramene is a comparative plant genomics resource that focuses on crops and model organisms. The platform offers a range of tools and services to support maize research, including:

  1. Genome Browser: Provides genome annotations, variation data, and comparative tools.
  2. Plant Reactome: Allows users to browse and analyze metabolic and regulatory pathways.
  3. BLAST: Enables researchers to query Gramene’s genomes with DNA or protein sequences.
  4. Gramene Mart: An advanced genomic query interface powered by BioMart.

Gramene’s maize-specific resources include full genome sequences of the NAM founder lines and comparative genomic analyses of protein-coding gene families. The platform’s homology tab displays an interactive gene tree visualization, allowing researchers to explore differences among sequences in a gene family. Gramene also provides pathway information, such as the jasmonic acid biosynthesis pathway associated with the lox9 gene.

Conclusion

The exploration of the maize genome has a profound impact on our understanding of plant genetics and crop improvement. The intricate structure of the maize genome, with its complex arrangement of genes, transposons, and structural variations, provides a rich playground for genetic research. From the early days of BAC-by-BAC sequencing to the latest long-read technologies, advancements in genomic tools have paved the way to uncover the secrets hidden within maize DNA. These breakthroughs are crucial to develop more resilient and productive maize varieties.

To wrap up, the maize genome continues to surprise researchers with its dynamic nature and genetic diversity. The ongoing efforts to sequence and annotate various maize inbred lines are creating a more complete picture of the maize pan-genome. This knowledge is essential to tackle future challenges in agriculture and food security. As we delve deeper into the maize genome, we’re not just learning about one crop – we’re gaining insights that could revolutionize our approach to plant breeding and genomics across species.

FAQs

  1. What is the size of the maize genome?
    By one estimate, the maize genome is approximately 2,365 megabases (Mb). For comparison, the rice genome is about 389 Mb according to the International Rice Genome Sequencing Project (2005). Whitelaw et al. (2003) estimated that 63% of the maize genome consists of repetitive sequences, as also detailed by Messing et al. (2004).
  2. What does the size of a genome tell us?
    The genome size represents the total amount of DNA contained within one copy of a single complete genome.
  3. Can the size of a genome predict the complexity of an organism?
    There is no direct correlation between the size of a genome and the complexity of an organism. For example, some unicellular eukaryotes have genomes larger than those of many multicellular animals, and some amphibians have genomes larger than those found in mammals.
  4. How is the size of a genome determined?
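    One common way to estimate genome size from sequencing data is to divide the total number of k-mers counted in the reads (n) by the k-mer coverage depth (C), which is usually taken from the main peak of the k-mer frequency histogram. This gives a numerical approximation of the genome’s size.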
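The k-mer calculation in the last answer can be sketched in a few lines of Python. This is a minimal illustration, assuming error-free toy reads and taking the coverage C as the most common k-mer multiplicity; the function name and the toy data are made up, and real analyses use dedicated k-mer counters and also filter out low-frequency k-mers caused by sequencing errors.

```python
from collections import Counter


def estimate_genome_size(reads, k):
    """Estimate genome size as (total k-mers) / (peak k-mer coverage)."""
    # Count every k-mer occurrence across all reads.
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total_kmers = sum(counts.values())            # n
    # Histogram: multiplicity -> number of distinct k-mers with that multiplicity.
    histogram = Counter(counts.values())
    coverage = max(histogram, key=histogram.get)  # C, the histogram peak
    return total_kmers / coverage


# Toy data: a made-up 14-base sequence "sequenced" five times without errors.
toy_genome = "ACGTTGCAATCGGA"
toy_reads = [toy_genome] * 5
print(estimate_genome_size(toy_reads, k=4))       # 11.0
```

The result is 11 rather than 14 because a sequence of length L contains only L - k + 1 k-mers; for a genome of billions of bases this difference is negligible. In repeat-rich genomes such as maize, repeated k-mers form additional histogram peaks at multiples of C, which is why the main peak has to be chosen with care.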
