Genotype Calling | Tools for Polyploids

2024

Training Presentations

Genomic-assisted breeding of strawberries by Mitchel Feldmann

User Presentations

Development of mid-density genotyping platforms and microhaplotype database for specialty crops and animals in North America presented by Dongyan Zhao
A public mid-density genotyping platform for blueberry presented by Manoj Sapkota
Bayesian tests for segregation distortion in experimental tetraploid populations presented by David Gerard
Development of a mid-density genotyping platform for alfalfa and its application in a drought tolerance breeding program by Alex Sandercock
Development and validation of a mid-density genotyping platform for cranberry by Shufen Chen

Poster Presentations

Genome-wide association analyses reveal candidate genes associated with health components in blueberry

Estefania Tavares Flores, Gonzalo Casorzo, Paul Adunola, Mary Ann Lila, Mary Grace, Camila F. Azevedo, Luis Felipe Ferrao, Patricio Munoz. University of Florida, North Carolina State University, Universidade Federal de Vicosa.

Blueberry fruits (Vaccinium spp.) are recognized worldwide for their high content of antioxidants and phenolic compounds compared to other fresh fruits and vegetables. Anthocyanins have been described as the primary source of antioxidants in these berries. While demand for high-nutrient blueberries has increased significantly during the last two decades, a better understanding of how to genetically improve this trait is still required. We carried out a genome-wide association study toward this goal by gathering genomic and metabolic data from a southern highbush blueberry (SHB) breeding population of 369 genotypes. These were genotyped using a sequence capture methodology, including 50k SNP markers spanning all blueberry chromosomes. Target traits were phenotyped using high-performance liquid chromatography (HPLC), including total anthocyanin content, 20 anthocyanin types, and 40 phenolic and flavonoid non-anthocyanin compounds. A univariate linear mixed model was used for estimating SNP effects per trait after correcting for population structure. We found significant SNPs associated with health-component characteristics that can be further tested for marker-assisted selection, thus avoiding the cost and time required for phenotyping these traits. Our analyses revealed consistent hits across Chr. 1, 2, 4, and 8 for anthocyanin-related traits, supporting previous studies in blueberry. Additional new hits were detected in Chr. 3, 5, 7, 10, 11, and 12 for both anthocyanin types and flavonoid-related traits. Notably, we detected an association in Chr. 9 related to total anthocyanin content, which appears to be the first such report for this trait. Interestingly, several transcription factors have been annotated in this region of the genome, including a bHLH type. bHLH transcription factors have previously been described as playing a pivotal role in modulating expression of both flavonoid and anthocyanin late biosynthesis genes in plants, which makes this finding a good candidate for further validation. Altogether, identifying genomic regions potentially involved in anthocyanin and flavonoid biosynthesis is a meaningful advancement for understanding health component traits in blueberries; these regions can be further functionally validated and used for molecular breeding purposes.

Variation Graph Pangenomes Improve Read Mapping and SNP Calling Accuracy in Divergent Diploid And Allopolyploid Populations

Justin Conover, David Gerard, Ryan Gutenkunst, Michael Barker. University of Arizona, American University.

A fundamental first step in population genomic analyses is choosing a reference genome to map sequencing data against for variant discovery. When considering an analysis of divergent populations or species, the population that is more distantly related to the reference genome will typically have less accurate variant discovery than the more closely related population, a phenomenon known as “reference bias”. Allotetraploid genomes present a unique bioinformatic challenge. These polyploid genomes are composed of two complete chromosomal complements from two divergent progenitor species. Hence, using either diploid progenitor’s genome as the reference will necessarily lead to a reference bias for variants in the subgenome from the other parental species. Additionally, recombination between duplicated homologous chromosomes (homoeologous exchange) can alter the dosage expectation of a given chromosomal region, further complicating variant discovery in allotetraploids. Here, we explore reference biases in a population of allotetraploid Brassica napus and its diploid progenitors, B. rapa and B. oleracea. We find that although reference biases abound when mapping to the reference genome of either diploid progenitor, these biases are largely ameliorated by creating and mapping reads to a variation graph pangenome composed of both diploid reference genomes. We also find that the method used to construct the variation graph pangenome and the variant discovery pipeline both have significant effects on the accuracy of variant discovery.

Advances in QTL identification for resistance to spittlebug in an allotetraploid interspecific Urochloa mapping population

Paula Espitita Buitrago¹, Camilla Ryan², Claudia Perea¹, Jose de Vega², Rosa Jauregui¹. ¹Alliance Bioversity-CIAT, ²Earlham Institute.

Resistance to spittlebugs (Hemiptera: Cercopidae) has been improved at high rates in the interspecific Urochloa hybrid programme at the Alliance Bioversity-CIAT using phenotypic selection that assesses antibiosis and tolerance to nymphal attack with high accuracy. However, the assay takes 75 days and requires mass-rearing colonies and synchronising the insects’ life cycle with the trials. Consequently, it is challenging to evaluate ~800 plants, corresponding to 150 genotypes, in the second stage of the breeding scheme. We aim to integrate molecular techniques to accelerate the interspecific Urochloa breeding programme.
The development of markers for selection is challenging due to the allotetraploid and apomictic nature of the materials in the U. ruziziensis/U. brizantha/U. decumbens agamic complex used in the programme. To overcome these challenges, we developed a bioinformatics pipeline using new resources, such as a tetraploid reference of Urochloa decumbens and R-based software for polyploids, to identify QTL associated with resistance and tolerance to the nymphal stage of the spittlebug Aeneolamia varia in a multiparental mapping population. A total of 320 half-sib hybrids of four biparental families were sequenced using RADseq to a read depth of ~10X per haplotype. These reads were then preprocessed and aligned to two Urochloa references: a haploid reference of the diploid U. ruziziensis (n = 9) and the haplotype-resolved tetraploid U. decumbens (n = 9, 4n = 36). Both sets of aligned reads were used for SNP calling with GATK’s “germline short variant discovery” pipeline, setting the ploidy to 4 to obtain dosages. We also used allelic depth from GATK for dosage calling in the updog package, comparing the models for preferential pairing (f1pp) and normal distribution (norm). We obtained more SNP loci and higher mean read depth when aligning to the haploid U. ruziziensis reference (9 chromosomes) than to the haplotype-resolved reference (36 chromosomes). Genotype plots for both datasets show that high read depth is crucial for accurate genotype calling, and that markers often do not follow the segregation pattern expected under the models. This suggests unaccounted variation in preferential pairing during meiosis in some chromosomes. The next steps involve using Tools for Polyploids R packages for genetic mapping to elucidate meiotic behavior (mappoly); for haplotype reconstruction of the multiparental population (polyHaplotyper or polyOrigin); and finally, for identification of QTL related to spittlebug resistance (polyqtlR).
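
To illustrate why read depth matters for dosage calling in a tetraploid, the following is a minimal Python sketch of a plain binomial genotype caller with a flat prior. It is not the updog model used in the abstract (updog also accounts for allelic bias, overdispersion and the population-level genotype distribution); the function names, the 1% sequencing error rate and the example read counts are all assumptions chosen for illustration.

```python
from math import comb


def dosage_likelihoods(ref_reads, alt_reads, ploidy=4, error=0.01):
    """Binomial likelihood of the read counts for each alt-allele dosage 0..ploidy."""
    total = ref_reads + alt_reads
    likelihoods = []
    for dosage in range(ploidy + 1):
        frac = dosage / ploidy
        # Chance that one read shows the alt allele, allowing for sequencing error
        # (assumed symmetric error rate; an illustrative simplification).
        p_alt = frac * (1 - error) + (1 - frac) * error
        likelihoods.append(
            comb(total, alt_reads) * p_alt**alt_reads * (1 - p_alt) ** ref_reads
        )
    return likelihoods


def call_dosage(ref_reads, alt_reads, ploidy=4, error=0.01):
    """Return (most likely dosage, its posterior probability) under a flat prior."""
    lik = dosage_likelihoods(ref_reads, alt_reads, ploidy, error)
    norm = sum(lik)
    posterior = [value / norm for value in lik]
    best = max(range(ploidy + 1), key=posterior.__getitem__)
    return best, posterior[best]


# Same 3:1 read ratio at two depths (hypothetical counts).
low_depth = call_dosage(ref_reads=3, alt_reads=1)
high_depth = call_dosage(ref_reads=75, alt_reads=25)
print(low_depth, high_depth)
```

Both calls favour a dosage of one alternate allele, but only the deeper sample supports that call with near certainty; at four reads the neighbouring dosage classes remain plausible.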

2023

User presentations

Construction of a strawberry breeding core collection to capture and exploit genetic variation

Tim Koorevaar, Johan Willemsen, Paul Arens, Chris Maliepaard, Richard Visser. Wageningen University and Research - Plant Breeding, the Netherlands.

As genotyping by sequencing (GBS) methods become cheaper, their applications become broader. For genotyping, GBS has become an alternative to SNP arrays, which have certain limitations that can be overcome by GBS, such as genome coverage and ascertainment bias. Ideally, all material in a plant breeding program would be screened by using high coverage and deep sequencing. However, this is not cost-effective and probably not needed, because high-quality genotypes can also be obtained by more cost-effective GBS tools with lower depth and/or less coverage, which are imputed by utilizing a high-quality haplotype reference panel. A reference panel (core collection) of genotypes that represents the full breadth of the breeding program is essential for accurate imputation. In this study, we show a stepwise approach to obtain a representative core collection in a commercial plant breeding program that can be used as a reference panel for the utilization of cost-effective GBS methods. First, the most important crossing parents of advanced selections and specific genotypes (with specific traits) are identified and selected because they represent future genetic variation. Then, the core collection is finalized by maximizing the representativeness of the core collection compared to the current whole breeding program. Constructing representative core collections is commonly done by using genetic distances, but pedigree-genomic-based relationship coefficients allow for accurate relationship estimation without the need to genotype each genotype in the breeding program. These pedigree-genomic-based relationship coefficients can identify pedigree errors, correct for missing links, and estimate relationships among founder genotypes. Consequently, this pedigree-genomic-based relationship matrix was used to complement the core collection by maximizing the representativeness of the total core collection.
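
The two-step selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not give the representativeness criterion, so the sketch assumes it is the mean, over all genotypes, of the highest relationship coefficient to any core member, and it uses a simple greedy search. The function names and the toy relationship matrix are hypothetical.

```python
def representativeness(K, core):
    """Mean, over all genotypes, of the highest relationship to any core member."""
    return sum(max(K[i][j] for j in core) for i in range(len(K))) / len(K)


def complete_core(K, preselected, size):
    """Start from the preselected parents and greedily add genotypes up to `size`."""
    core = list(preselected)
    candidates = [i for i in range(len(K)) if i not in core]
    while len(core) < size and candidates:
        # Add the candidate that raises representativeness the most.
        best = max(candidates, key=lambda c: representativeness(K, core + [c]))
        core.append(best)
        candidates.remove(best)
    return core


# Toy relationship matrix: two unrelated families of three sibs each.
family = [0, 0, 0, 1, 1, 1]
K = [
    [1.0 if i == j else (0.5 if family[i] == family[j] else 0.0) for j in range(6)]
    for i in range(6)
]

# Genotype 0 stands in for a key crossing parent chosen in the first step.
core = complete_core(K, preselected=[0], size=2)
print(core, representativeness(K, core))
```

In this toy case the second slot goes to a member of the other family, because adding a sib of the preselected parent would leave half of the population unrepresented.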

Training presentations

2022

Multiploidy support in polyRAD - presented by Lindsay V. Clark, Joyce Njuguna, Alexander E. Lipka, and Erik J. Sacks. Department of Crop Sciences, University of Illinois, Urbana-Champaign, Urbana, IL.

polyRAD is an R package for Bayesian genotype calling from sequence read depth in diploid and polyploid organisms. It can use population structure or mapping population design to inform genotype calls and can export discrete or continuous genotypes. Although the original version of polyRAD allowed the inheritance mode to vary across the genome, it still required all individuals to have the same ploidy, limiting its use in staple crops such as banana and yam, in which breeding populations typically consist of a mixture of ploidies. polyRAD 2.0 will support multiploidy, allowing simultaneous genotyping of individuals of different ploidies. The “possiblePloidies” slot will still be used to indicate potential inheritance modes for loci. A new slot called “taxaPloidy” contains one integer for each individual to indicate its ploidy, and acts as a multiplier for the values stored in “possiblePloidies”. Examples of how to code this information in various crops will be presented in the digital poster. We will also present Miscanthus sacchariflorus as a use case, in which introgression has occurred among diploid, triploid, and tetraploid populations. The development version of polyRAD 2.0 can be installed from GitHub.
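
How a per-individual ploidy could scale a shared list of inheritance modes is sketched below in Python. This is not polyRAD code or its API: the function name and sample names are hypothetical, and the sketch assumes that the values in “possiblePloidies” are written relative to a diploid baseline, so that the product with “taxaPloidy” is halved. That convention should be checked against the polyRAD documentation.

```python
def effective_inheritance_modes(possible_ploidies, taxa_ploidy):
    """Scale each shared inheritance mode by every individual's ploidy.

    possible_ploidies: list of modes, each a list of copy numbers per subgenome,
                       written for a diploid baseline (assumption).
    taxa_ploidy:       dict mapping individual name to its ploidy.
    """
    modes = {}
    for taxon, ploidy in taxa_ploidy.items():
        scaled = []
        for mode in possible_ploidies:
            scaled_mode = []
            for copies in mode:
                value, remainder = divmod(copies * ploidy, 2)
                if remainder:
                    raise ValueError(f"{taxon}: mode {mode} does not scale to ploidy {ploidy}")
                scaled_mode.append(value)
            scaled.append(scaled_mode)
        modes[taxon] = scaled
    return modes


# Mixed-ploidy population loosely modelled on the Miscanthus use case (names are made up).
taxa = {"diploid_1": 2, "triploid_1": 3, "tetraploid_1": 4}
modes = effective_inheritance_modes([[2]], taxa)
print(modes)
```

A single shared mode of `[2]` then describes disomic inheritance in the diploid and scales to three and four copies in the triploid and tetraploid, so the per-locus modes need to be stated only once for the whole dataset.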

2021

2022

User presentations

Reads2Map: Practical and reproducible workflows to build linkage maps from sequencing data - presented by Cris Taniguti et al.

Cristiane H. Taniguti¹, Lucas M. Taniguti³, Gabriel S. Gesteira², Thiago P. Oliveira³, Jeekin Lau¹, Getulio C. Ferreira³, Rodrigo R. Amadeu³, David Byrne¹, Oscar Riera-Lizarazu¹, Guilherme S. Pereira², Marcelo Mollinari², and Augusto F. Garcia³. ¹Texas A&M University, College Station, TX. ²North Carolina State University, Raleigh, NC. ³University of Sao Paulo, Sao Paulo, Brazil.

High-throughput sequencing methods produce millions of sequence reads that need to be processed by bioinformatic tools before being applied in genetics research. For each step of the procedure, such as alignment of reads, SNP identification, and genotype calling, several tools are available, all with different methods and parameters to be selected by users. Changes in a single parameter in the pipeline can affect the quality of the downstream analysis. Because the genetic properties of meiotic events are well known, it is possible to identify low-quality markers using linkage analysis. Genotyping errors lead to an overestimation of the number of recombination events, inflated linkage map distances, and issues while grouping and ordering markers. Thus, good-quality genetic maps validate all upstream procedures and help to identify the best combinations of software and parameters. Here, we present the Reads2Map workflows to build linkage maps from sequencing data of experimental F1 outcrossing populations, testing combinations of upstream tools. The workflows are written in Workflow Description Language (WDL), which offers a comprehensive structure and metadata for each step, making it easier for users to adapt specific parameters. WDL also allows interfacing with containers to increase reproducibility, facilitate access to diverse software, and use in high-performance computing or cloud service environments. The final workflow output is the input for the Reads2MapApp, a Shiny app, which allows interactive visualization of the produced genetic maps and selection of the best pipeline. We demonstrate Reads2Map workflows and Reads2MapApp using both simulated and empirical RADseq data.
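
The inflation of map distances by genotyping errors can be made concrete with a small Python calculation. This is a simplified sketch, not part of Reads2Map: it assumes markers with two genotype classes (as in a backcross-type coding), errors that occur independently at each marker and switch a call to the other class, and the Haldane mapping function. All function names are illustrative.

```python
from math import log


def apparent_recombination(r, error):
    """Recombination fraction observed between two markers when each call
    is wrong independently with probability `error`."""
    flip = 2 * error * (1 - error)  # exactly one of the two calls is wrong
    return r * (1 - flip) + (1 - r) * flip


def haldane_cm(r):
    """Haldane map distance in centimorgans for recombination fraction r < 0.5."""
    return -50 * log(1 - 2 * r)


true_r = 0.01
for error in (0.0, 0.01, 0.05):
    observed = apparent_recombination(true_r, error)
    print(f"error={error:.2f}  apparent r={observed:.4f}  distance={haldane_cm(observed):.2f} cM")
```

Under these assumptions, a 1% error rate is already enough to make two markers about 1 cM apart appear roughly three times as far apart, and the effect accumulates over every interval of a dense map.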

Smooth Descent: a ploidy-agnostic algorithm to improve linkage mapping in the presence of genotyping errors - presented by Alejandro Thérèse Navarro et al.

Alejandro Thérèse Navarro, Peter Bourke, Eric van de Weg, Paul Arens, Richard Finkers, Chris Maliepaard. Wageningen University and Research, Wageningen, the Netherlands.

Linkage mapping is an approach to order markers based on recombination events. Mapping algorithms cannot easily handle genotyping errors, which are common in high-throughput genotyping data. To solve this issue, strategies have been developed, aimed mostly at identifying and eliminating spurious genotypes. One such strategy is SMOOTH (van Os et al. 2005), an iterative algorithm to detect genotyping errors. Unlike other approaches, SMOOTH can also be used to impute the most probable alternative genotypes, but its application is limited to diploid species and to markers heterozygous in only one of the parents. We adapted SMOOTH to expand its use to any marker type and to autopolyploids with the use of identity-by-descent probabilities, naming the updated algorithm Smooth Descent (SD). We applied SD to real and simulated data, showing that in the presence of genotyping errors this method produces better genetic maps in terms of marker order and map length. SD is particularly useful for error rates between 5% and 20% and when error rates are not homogeneous among markers or individuals. Moreover, the simplicity of the algorithm allows hundreds of thousands of markers to be efficiently processed, making it particularly useful for error detection in high-throughput data. We implemented SD within an R package, SmoothDescent, which can perform error detection, genotype imputation, and iterative mapping for diploids and autopolyploids.
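
The general idea of neighbour-based error detection can be sketched in Python as below. This is a deliberately simplified illustration, not the published SMOOTH weighting scheme and not Smooth Descent (which works on identity-by-descent probabilities): it assumes one individual's calls coded 0/1 in map order, weights neighbours by the inverse of their distance in marker positions, and uses an arbitrary window and threshold. The function name is hypothetical.

```python
def flag_suspect_calls(genotypes, window=3, threshold=0.8):
    """Return indices of calls that disagree strongly with their neighbours.

    genotypes: one individual's 0/1 calls in map order; None marks missing data.
    """
    suspects = []
    for i, observed in enumerate(genotypes):
        if observed is None:
            continue
        weighted_sum = 0.0
        weight_total = 0.0
        for j in range(max(0, i - window), min(len(genotypes), i + window + 1)):
            if j == i or genotypes[j] is None:
                continue
            weight = 1.0 / abs(i - j)  # closer markers count more
            weighted_sum += weight * genotypes[j]
            weight_total += weight
        if weight_total > 0 and abs(observed - weighted_sum / weight_total) > threshold:
            suspects.append(i)
    return suspects


# A lone 1 inside a run of 0s, followed by a genuine crossover to a run of 1s.
calls = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
suspects = flag_suspect_calls(calls)
print(suspects)
```

The isolated call would imply two recombination events in adjacent intervals, so it is flagged, while the calls around the sustained switch from 0 to 1 are left alone because their neighbours only partly disagree with them.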

Poster presentations

Identifying a Rose Germplasm Panel to Attain Optimal SNP Array Genotype Calling of Small Samples of Genotyped Individuals

Jeekin Lau, Cristiane H. Taniguti, David Byrne, and Oscar Riera-Lizarazu. Texas A&M University, College Station, TX.

Since genotyping with the Axiom WagRhSNP68K SNP array can be cost-prohibitive, we explored an approach that would permit robust genotyping of samples in one or two 96-well plates. We have observed that genotyping accuracy via SNP arrays increases as the number of individuals used for genotype calling increases. We reasoned that this increased accuracy may be due to greater sample size and allelic diversity. To test this idea, we conducted an experiment in which one biparental mapping population of 94 individuals plus two parents was clustered alone (one plate of genotyping) and in combination with sets of related biparental populations and unrelated germplasm with increasing numbers and various levels of genetic diversity. We then compared both marker statistics and the linkage map quality generated from genotype calls of the target mapping population using the various datasets. As the number of individuals used in clustering increased, the number of useful markers increased nominally. However, the resulting linkage maps revealed that the addition of other genotypes in the marker clustering step resulted in shorter total map length and smaller gap sizes as the number of individuals and diversity increased. The decreased map lengths and gap sizes indicate that the inclusion of other genotypes improved genotyping accuracy. The output of this study will be a core set of genotyped rose germplasm that may be used to improve genotype calling of small samples of genotyped materials.

Development of a Genotyping by Sequencing Pipeline in Tetraploid Roses (Rosa sp.)

Tessa Hochhaus, Cristiane H. Taniguti, Jeekin Lau, Patricia E. Klein, David H. Byrne, and Oscar Riera-Lizarazu. Department of Horticultural Sciences, Texas A&M University, College Station, TX.

Roses are highly heterozygous and are most commonly diploid, triploid, or tetraploid. Genotyping by sequencing (GBS) has been performed in diploid rose populations; however, it has not been done in populations with higher ploidy because of their increased complexity (autopolyploidy). This complexity is due to the greater number of genotypic classes and the difficulty in accurately calling allele dosage. GBS uses restriction enzymes to reduce genome complexity and adapter barcodes to allow the pooling of multiple samples to increase efficiency and to lower the cost per sample. In this study, we are optimizing a GBS protocol for tetraploid roses using three populations (Morden Blush x George Vancouver, Stormy Weather x Brite Eyes, and Brite Eyes x My Girl). The optimization will entail varying sequencing read depth and coverage while minimizing missing data, and using in-house workflows to test various combinations of open-source software for quality control, alignment of reads, SNP identification, and dosage calling. Through the development of this pipeline, we hope to facilitate cost-effective genotyping in polyploid roses and the use of genomic-assisted breeding.