PolyHaplotyper workshop

Introduction and preparation

PolyHaplotyper is an R package for haplotyping in polyploids as well as in diploids. Haplotyping is performed per locus, where a locus consists of a limited number of closely linked SNP markers. Such a locus is called a “haploblock”, and the different alleles of a haploblock (different combinations of the SNP alleles) are called haplotypes. PolyHaplotyper considers all samples in a population to perform the haplotyping. Populations may consist of any collection of individuals, but PolyHaplotyper is designed to take advantage of the presence of (multiple, possibly linked) Full-Sib populations if present.

For this workshop we need to install PolyHaplotyper first. As PolyHaplotyper is available from CRAN this is simple:

install.packages("PolyHaplotyper")

In this workshop we will use one of the data sets published in Voorrips & Tumino, 2022. This data set is included in the workshop materials, as file Dataset1_chrysanthemum.RData. Create a new folder, and copy this file there. Next we set the working directory to this folder and make the functions in PolyHaplotyper available:

setwd("c:/PolyHaplotyper workshop") # example; use the folder that you created
library(PolyHaplotyper)

Example: Hexaploid Chrysanthemum

Input data

This example contains actual data obtained with a SNP array in hexaploid chrysanthemum (courtesy of Deliflor Chrysanten, the Netherlands). The array is described in Van Geest et al, 2017. Load the Chrysanthemum dataset and list its contents:

print(load("Dataset1_chrysanthemum.RData"))
## [1] "dosmat"    "hapblocks" "parents"   "FS"        "ped"       "ahclist"

Matrix dosmat specifies the SNP dosages (the dosage of the Alt allele) for all SNPs and all individuals. SNPs are in rows, samples in columns:

dosmat[1:4,1:7]
##                36451 39287 51202 51203 51204 51205 51206
## WagCmDlf_30760     0     1     0     1     1     1     1
## WagCmDlf_76496     0     0     0     0     0     0     0
## WagCmDlf_78625     6     6     6     6     6     6     6
## WagCmDlf_71641     1     2     3     1     1     2     1

The list hapblocks specifies which SNP markers belong to each haploblock:

hapblocks[1:3]
## $contig_12393
## [1] "WagCmDlf_30760" "WagCmDlf_61099" "WagCmDlf_31762" "WagCmDlf_04714"
## 
## $contig_05603
## [1] "WagCmDlf_76496" "WagCmDlf_45201" "WagCmDlf_30694" "WagCmDlf_18150"
## 
## $contig_05651
## [1] "WagCmDlf_78625" "WagCmDlf_59195" "WagCmDlf_15684" "WagCmDlf_73229"
# show the distribution of the number of SNPs per haploblock:
table(sapply(hapblocks, length))
## 
##  4  5 
## 26 17

In this dataset the population includes four Full-Sib (FS) families. These are specified with matrix parents and list FS:

# matrix parents has 2 columns: for each FS family the two parental samples
parents
##       [,1]  [,2]
## [1,] 36451 39287
## [2,] 41234 40360
## [3,]  9656  9541
## [4,] 32141 39287
# list FS has one element per FS family: a vector of the FS members.
# show the sizes of the FS families:
sapply(FS, length)
## [1] 405  53  76  37

Matrix parents shows that FS family 1 and 4 share a parent.
Any samples in dosmat that are not a parent or member of a FS family are considered to be unrelated material.

Data.frame ped gives the entire set of samples as a pedigree. This pedigree is not used for haplotyping, but we will use it to check for conflicts between the assigned haplotype genotypes of all parent-offspring pairs.

ped[51:60,]
##    genotype mother father
## 89    41234  32466  28021
## 91    42680     NA  40360
## 92    46507     NA  40360
## 93    46537     NA  40360
## 94    50257  41234  40360
## 95    50259  41234  40360
## 96    50260  41234  40360
## 97    50261  41234  40360
## 98    50262  41234  40360
## 99    50263  41234  40360

We see that the mother and/or father can be unknown (NA). However all individuals listed as parents must also occur in the first column. The last 6 lines show part of the second FS family.

Finally, the list ahclist contains all haplotype combinations that correspond to the SNP dosage combinations. We will say more about ahclist and its extended version ahccompletelist at the end of this workshop vignette.

Haplotyping

Package PolyHaplotyper contains a function inferHaplotypes that does all the work. It performs haplotyping and returns a list with one element per haploblock. This takes about 2 minutes on the test pc.

PHresults <- 
  inferHaplotypes(mrkDosage=dosmat, ploidy=6, 
                  haploblock=hapblocks, 
                  parents=parents, FS=FS) 
saveRDS(PHresults, "Chrysanthemum-PHresults.rds")

Output data

The list PHresults has 43 elements: one for each haploblock. Each of these is itself a list with several components, as we can see by inspecting the results of the first haploblock:

length(PHresults)
## [1] 43
names(PHresults[[1]])
## [1] "hapdos"      "mrkdids"     "markers"     "FSfit"       "FSmessages"  "FSpval"      "imputedGeno"
## [8] "message"

The main result for each haploblock are the haplotype combinations assigned to each individual.

PHresults[[1]]$hapdos[,1:7]
##                 36451 39287 51202 51203 51204 51205 51206
## contig_12393_02     0     0     0    NA     0    NA     0
## contig_12393_03     0     1     0    NA     0    NA     0
## contig_12393_04     5     3     6    NA     4    NA     3
## contig_12393_07     0     0     0    NA     0    NA     0
## contig_12393_08     1     1     0    NA     1    NA     2
## contig_12393_16     0     1     0    NA     1    NA     1

The haplotypes present in the entire population are in rows. They are named for the haploblock (here: contig_12393), followed by a number. This first haploblock has 4 SNPs. With 4 SNPs, 16 (24) haplotypes are possible, of which only the 6 shown were inferred in the population.
The individuals are in columns. The individuals are hexaploid, so the haplotype dosages sum to 6 in each column. We can also see that some individuals could not be haplotyped. This could be because of missing or incorrect data in the SNP dosages or because multiple haplotype combinations were possible.

With function allHaplotypes we can see the composition of all 16 possible haplotypes:

allHaplotypes(PHresults[[1]]$markers)
##       WagCmDlf_30760 WagCmDlf_61099 WagCmDlf_31762 WagCmDlf_04714
##  [1,]              0              0              0              0
##  [2,]              0              0              0              1
##  [3,]              0              0              1              0
##  [4,]              0              0              1              1
##  [5,]              0              1              0              0
##  [6,]              0              1              0              1
##  [7,]              0              1              1              0
##  [8,]              0              1              1              1
##  [9,]              1              0              0              0
## [10,]              1              0              0              1
## [11,]              1              0              1              0
## [12,]              1              0              1              1
## [13,]              1              1              0              0
## [14,]              1              1              0              1
## [15,]              1              1              1              0
## [16,]              1              1              1              1

Here each row is a haplotype; the SNP markers are in columns. 0 and 1 represent the Ref and Alt alleles.

Another components of the results are the imputed SNP dosages for FS individuals where some SNP dosages within this haploblock were missing.

PHresults[[1]]$imputedGeno
##                51271 51285 50602 50620 51045 51060
## WagCmDlf_30760     0     0     2     1     1     0
## WagCmDlf_61099     0     0     2     1     1     0
## WagCmDlf_31762     6     6     6     6     6     6
## WagCmDlf_04714     5     5     6     6     6     5

The other components of the output are all discussed in the manual for this function (type ?inferHaplotypes). They are best viewed using function overviewByFS. This produces an overview of the results by haploblock and by FS family:

PHovwFS <- overviewByFS(haploblock=hapblocks, parents=parents, FS=FS,
                        hapresults=PHresults)
# show the first 3 haploblocks and first FS family:
PHovwFS$ovw[1:3, 1:8]
##              nmrk nhap parmrk1 fit1          P1 mrk1 imp1 hap1
## contig_12393    4    6       2    1 0.078828045  337    2  339
## contig_05603    4    5       2    1 0.161740622  290    0  290
## contig_05651    4    5       2    1 0.005111137  369    4  373

The first two columns show the number of SNP markers (nmrk) and the number of haplotypes inferred (nhap) for the haploblock.
Then for each FS family a set of 6 columns (we printed only the first family):

  • parmrk (0, 1 or 2: the number of parents with complete marker data)
  • fit (0 or 1): whether a solution for the FS was found, assuming polysomic inheritance
  • P: the chi-squared P-value of the best fitting solution
  • mrk: the number of FS progeny with complete SNP marker data
  • imp: the number of FS progeny where SNP marker data were imputed
  • hap: the number of FS progeny with assigned haplotype combinations

There are also columns mrk and hap for the non-FS individuals and for all individuals (not shown).

Checks and statistics

We have seen that function inferHaplotypes performs haplotyping, for individuals within FS families as well as for unrelated material. We would like to know if these results are correct. For a real, experimental data set like this one we don’t know the true genotypes so we cannot compare these with the inferred ones. But we do know the pedigree of the material, so something we can do is to check whether the haplotype combination assigned to each individual is possible, assuming polysomic inheritance and given the haplotype combinations assigned to its parent(s). If we find only few conflicts we can have confidence in the haplotyping results.

PolyHaplotyper has a functions pedigreeHapCheck to perform these inheritance checks, without or with allowing for double reduction (DR).

PHpedchk <- pedigreeHapCheck(ped=ped, mrkDosage=dosmat,
                             haploblock=hapblocks,
                             hapresults=PHresults)

The function returns a list with 2 items: ped_arr and parents_arr. Both are 3D arrays. ped_arr gives, for all haploblocks, information on each individual: whether it has SNP marker data and has been haplotyped, and if its haplotyping matches with its parents. parents_ arr gives information on all individuals that occur as parents in the pedigree (including, but not only, the parents of the specified FS families): how many progeny it has, how many of the progeny were haplotyped, and how many of those match with the parent’s haplotyping.

These are large 3D arrays and not easily viewed. For getting a summary of both the overviewByFS and the pedigreeHapCheck results we use function calcStatistics:

PHcst <- calcStatistics(pedchk=PHpedchk, ovwFS=PHovwFS)

The result of function calcStatistics is a list with 2 elements: data.frames pedstats and FSstats.

head(PHcst$pedstats)
##     haploblock mrk imp hap match.NA noDR.TRUE noDR.FALSE withDR.TRUE withDR.FALSE
## 1 contig_12393 513   6 507      189       440          2         442            0
## 2 contig_05603 440   0 440      214       411          6         417            0
## 3 contig_05651 509   5 434      215       414          2         416            0
## 4 contig_11164 386   3 325      313       312          6         318            0
## 5 contig_01638 377  85 459      188       437          6         443            0
## 6 contig_12300 378 125 499      159       472          0         472            0
colSums(PHcst$pedstats[, -1])
##    mrk  imp   hap match.NA noDR.TRUE noDR.FALSE withDR.TRUE withDR.FALSE
##  20258 1059 20322     7972     19026        135       19151           10

Data.frame pedstats gives for each haploblock the total numbers of individuals with complete SNP dosage data, with imputed SNP dosages, and which were haplotyped. Further it gives overall data on the match between individuals and their parent(s): the number of individuals where no matching could be done because they or their parents were not haplotyped, and the number of matches and conflicts assuming double reduction (DR) to be absent or present.

PHcst$FSstats
##   FSfam bothparmrk FSFit       mrk       imp       hap
## 1     1         43    43 332.62791 21.767442 354.00000
## 2     2         38    30  28.32558  0.744186  26.62791
## 3     3         34    30  48.48837  1.488372  41.58140
## 4     4         37    25  24.32558  0.627907  20.62791
## 5  rest         NA    NA  30.86047  0.000000  24.04651
## 6   all         NA    NA 471.11628 24.627907 472.60465

Data.frame FSstats gives for each FS family, and for all non-FS (“rest”) individuals and for all individuals together, the number of haploblocks where both parents had complete SNP dosage data, and the number of haploblocks where a valid segregation was found given the inferred parental haplotyping, and the average numbers of FS individuals with complete and with inferred SNP dosages and the average number of haplotyped FS individuals.

Finally we want to check that all assigned haplotype combinations match the observed SNP dosages. For this check the function compareHapMrkDosages is provided:

cmpdos <- compareHapMrkDosages(mrkDosage=dosmat, hapresults=PHresults)
table(cmpdos[,,"match"], useNA="always")
## 
##  TRUE  <NA> 
## 20322  6811

The total of the table is the number of individuals in the SNP dosage matrix x the number of haploblocks, here 27133. In general the table has 3 entries: FALSE, TRUE and NA representing the number of individuals over all haploblock where the haplotyping does not match or matches the SNP dosages or where the SNP dosages or haplotyping were not available. Here the FALSE (non-matching) category is missing. This is because PolyHaplotyper never assigns a non-matching haplotype combination (in contrast to some other haplotyping software).

ahclist and ahccompletelist

PolyHaplotyper needs to know which haplotype combinations match a given set of SNP dosages. It can compute these dosages when needed, but this is not efficient. When it does so, it stores these data in a list called ahclist, and this list is written to a file ahclist_4x.RData (for tetraploids; _6x for hexaploids etc). In the current workshop an ahclist was provided that already contained all SNP dosage combinations occurring in all haploblocks; since no new ones were computed, no new ahclist file was written.

It is more efficient to calculate all haplotype combinations matching all possible SNP dosage combinations in advance, for haploblocks up to some maximum number of SNPs. This is done by function build_ahccompletelist:

# DON'T TRY THIS DURING THE WORKHOP!
# It will take hours.
build_ahccompletelist(ploidy=6, maxmrk=5, overwrite=FALSE)

With these parameters a file ahccompletelist_6x.RData will be created in the working directory, containing all hexaploid haplotype combinations for haploblocks of up to 5 SNP markers.
Because the number of SNP dosage combinations, and even more so the number of haplotype combinations quickly become astronomical, the practical limits of the number of SNPs per haploblock are 8 in tetraploids, 6 in hexaploids and 5 in octaploids. This may seem a severe limitation given the total number of SNPs present in e.g. a gene. However, the purpose of PolyHaplotyper is not to accommodate full-gene haplotypes. Instead it aims to produce multi-allelic markers for use in downstream genetic analyses, e.g. GWAS, and for this a haploblock with 32 to 256 possible alleles is a significant improvement over bi-allelic SNPs.

Further information

Here we used one of the example data sets provided in the PolyHaplotyper article (Voorrips & Tumino 2022). This article also provides two other data sets, and scripts to analyze all three data sets with PolyHaplotyper and some other software.
Also, the package includes a vignette in which part of the data set shown here is analyzed. And finally, all functions have their own manual that you can access with the question-mark notation, e.g.

?inferHaplotypes

References

Van Geest G, Voorrips RE, Esselink E, Post A, Visser RGF (2017) Conclusive evidence for hexasomic inheritance in chrysanthemum based on analysis of a 183 k SNP array. BMC Genomics 18:585; https://doi.org/10.1186/s12864-017-4003-0

Voorrips RE, Tumino G (2022) PolyHaplotyper: haplotyping in polyploids based on bi‑allelic marker dosage data. BMC Bioinformatics 23:44; https://doi.org/10.1186/s12859-022-04989-0