PolyHaplotyper is an R package for haplotyping in polyploids as well as in diploids. Haplotyping is performed per locus, where a locus consists of a limited number of closely linked SNP markers. Such a locus is called a “haploblock”, and the different alleles of a haploblock (different combinations of the SNP alleles) are called haplotypes. PolyHaplotyper considers all samples in a population to perform the haplotyping. Populations may consist of any collection of individuals, but PolyHaplotyper is designed to take advantage of the presence of (multiple, possibly linked) Full-Sib populations if present.
For this workshop we need to install PolyHaplotyper first. As PolyHaplotyper is available from CRAN this is simple:
install.packages("PolyHaplotyper")
In this workshop we will use one of the data sets published in Voorrips & Tumino, 2022. This data set is included in the workshop materials, as file Dataset1_chrysanthemum.RData. Create a new folder, and copy this file there. Next we set the working directory to this folder and make the functions in PolyHaplotyper available:
setwd("c:/PolyHaplotyper workshop") # example; use the folder that you created
library(PolyHaplotyper)
This example contains actual data obtained with a SNP array in hexaploid chrysanthemum (courtesy of Deliflor Chrysanten, the Netherlands). The array is described in Van Geest et al, 2017. Load the Chrysanthemum dataset and list its contents:
print(load("Dataset1_chrysanthemum.RData"))
## [1] "dosmat" "hapblocks" "parents" "FS" "ped" "ahclist"
Matrix dosmat specifies the SNP dosages (the dosage of the Alt allele) for all SNPs and all individuals. SNPs are in rows, samples in columns:
dosmat[1:4,1:7]
## 36451 39287 51202 51203 51204 51205 51206
## WagCmDlf_30760 0 1 0 1 1 1 1
## WagCmDlf_76496 0 0 0 0 0 0 0
## WagCmDlf_78625 6 6 6 6 6 6 6
## WagCmDlf_71641 1 2 3 1 1 2 1
The list hapblocks specifies which SNP markers belong to each haploblock:
hapblocks[1:3]
## $contig_12393
## [1] "WagCmDlf_30760" "WagCmDlf_61099" "WagCmDlf_31762" "WagCmDlf_04714"
##
## $contig_05603
## [1] "WagCmDlf_76496" "WagCmDlf_45201" "WagCmDlf_30694" "WagCmDlf_18150"
##
## $contig_05651
## [1] "WagCmDlf_78625" "WagCmDlf_59195" "WagCmDlf_15684" "WagCmDlf_73229"
# show the distribution of the number of SNPs per haploblock:
table(sapply(hapblocks, length))
##
## 4 5
## 26 17
In this dataset the population includes four Full-Sib (FS) families. These are specified with matrix parents and list FS:
# matrix parents has 2 columns: for each FS family the two parental samples
parents
## [,1] [,2]
## [1,] 36451 39287
## [2,] 41234 40360
## [3,] 9656 9541
## [4,] 32141 39287
# list FS has one element per FS family: a vector of the FS members.
# show the sizes of the FS families:
sapply(FS, length)
## [1] 405 53 76 37
Matrix parents shows that FS family 1 and 4 share a parent.
Any samples in dosmat that are not a parent or member of a FS family are
considered to be unrelated material.
Data.frame ped gives the entire set of samples as a pedigree. This pedigree is not used for haplotyping, but we will use it to check for conflicts between the assigned haplotype genotypes of all parent-offspring pairs.
ped[51:60,]
## genotype mother father
## 89 41234 32466 28021
## 91 42680 NA 40360
## 92 46507 NA 40360
## 93 46537 NA 40360
## 94 50257 41234 40360
## 95 50259 41234 40360
## 96 50260 41234 40360
## 97 50261 41234 40360
## 98 50262 41234 40360
## 99 50263 41234 40360
We see that the mother and/or father can be unknown (NA). However all individuals listed as parents must also occur in the first column. The last 6 lines show part of the second FS family.
Finally, the list ahclist contains all haplotype combinations that correspond to the SNP dosage combinations. We will say more about ahclist and its extended version ahccompletelist at the end of this workshop vignette.
Package PolyHaplotyper contains a function inferHaplotypes that does all the work. It performs haplotyping and returns a list with one element per haploblock. This takes about 2 minutes on the test pc.
PHresults <-
inferHaplotypes(mrkDosage=dosmat, ploidy=6,
haploblock=hapblocks,
parents=parents, FS=FS)
saveRDS(PHresults, "Chrysanthemum-PHresults.rds")
The list PHresults has 43 elements: one for each haploblock. Each of these is itself a list with several components, as we can see by inspecting the results of the first haploblock:
length(PHresults)
## [1] 43
names(PHresults[[1]])
## [1] "hapdos" "mrkdids" "markers" "FSfit" "FSmessages" "FSpval" "imputedGeno"
## [8] "message"
The main result for each haploblock are the haplotype combinations assigned to each individual.
PHresults[[1]]$hapdos[,1:7]
## 36451 39287 51202 51203 51204 51205 51206
## contig_12393_02 0 0 0 NA 0 NA 0
## contig_12393_03 0 1 0 NA 0 NA 0
## contig_12393_04 5 3 6 NA 4 NA 3
## contig_12393_07 0 0 0 NA 0 NA 0
## contig_12393_08 1 1 0 NA 1 NA 2
## contig_12393_16 0 1 0 NA 1 NA 1
The haplotypes present in the entire population are in rows. They are
named for the haploblock (here: contig_12393), followed by a number.
This first haploblock has 4 SNPs. With 4 SNPs, 16 (24)
haplotypes are possible, of which only the 6 shown were inferred in the
population.
The individuals are in columns. The individuals are hexaploid, so the
haplotype dosages sum to 6 in each column. We can also see that some
individuals could not be haplotyped. This could be because of missing or
incorrect data in the SNP dosages or because multiple haplotype
combinations were possible.
With function allHaplotypes we can see the composition of all 16 possible haplotypes:
allHaplotypes(PHresults[[1]]$markers)
## WagCmDlf_30760 WagCmDlf_61099 WagCmDlf_31762 WagCmDlf_04714
## [1,] 0 0 0 0
## [2,] 0 0 0 1
## [3,] 0 0 1 0
## [4,] 0 0 1 1
## [5,] 0 1 0 0
## [6,] 0 1 0 1
## [7,] 0 1 1 0
## [8,] 0 1 1 1
## [9,] 1 0 0 0
## [10,] 1 0 0 1
## [11,] 1 0 1 0
## [12,] 1 0 1 1
## [13,] 1 1 0 0
## [14,] 1 1 0 1
## [15,] 1 1 1 0
## [16,] 1 1 1 1
Here each row is a haplotype; the SNP markers are in columns. 0 and 1 represent the Ref and Alt alleles.
Another components of the results are the imputed SNP dosages for FS individuals where some SNP dosages within this haploblock were missing.
PHresults[[1]]$imputedGeno
## 51271 51285 50602 50620 51045 51060
## WagCmDlf_30760 0 0 2 1 1 0
## WagCmDlf_61099 0 0 2 1 1 0
## WagCmDlf_31762 6 6 6 6 6 6
## WagCmDlf_04714 5 5 6 6 6 5
The other components of the output are all discussed in the manual for this function (type ?inferHaplotypes). They are best viewed using function overviewByFS. This produces an overview of the results by haploblock and by FS family:
PHovwFS <- overviewByFS(haploblock=hapblocks, parents=parents, FS=FS,
hapresults=PHresults)
# show the first 3 haploblocks and first FS family:
PHovwFS$ovw[1:3, 1:8]
## nmrk nhap parmrk1 fit1 P1 mrk1 imp1 hap1
## contig_12393 4 6 2 1 0.078828045 337 2 339
## contig_05603 4 5 2 1 0.161740622 290 0 290
## contig_05651 4 5 2 1 0.005111137 369 4 373
The first two columns show the number of SNP markers (nmrk) and the
number of haplotypes inferred (nhap) for the haploblock.
Then for each FS family a set of 6 columns (we printed only the first
family):
There are also columns mrk and hap for the non-FS individuals and for all individuals (not shown).
We have seen that function inferHaplotypes performs haplotyping, for individuals within FS families as well as for unrelated material. We would like to know if these results are correct. For a real, experimental data set like this one we don’t know the true genotypes so we cannot compare these with the inferred ones. But we do know the pedigree of the material, so something we can do is to check whether the haplotype combination assigned to each individual is possible, assuming polysomic inheritance and given the haplotype combinations assigned to its parent(s). If we find only few conflicts we can have confidence in the haplotyping results.
PolyHaplotyper has a functions pedigreeHapCheck to perform these inheritance checks, without or with allowing for double reduction (DR).
PHpedchk <- pedigreeHapCheck(ped=ped, mrkDosage=dosmat,
haploblock=hapblocks,
hapresults=PHresults)
The function returns a list with 2 items: ped_arr and parents_arr. Both are 3D arrays. ped_arr gives, for all haploblocks, information on each individual: whether it has SNP marker data and has been haplotyped, and if its haplotyping matches with its parents. parents_ arr gives information on all individuals that occur as parents in the pedigree (including, but not only, the parents of the specified FS families): how many progeny it has, how many of the progeny were haplotyped, and how many of those match with the parent’s haplotyping.
These are large 3D arrays and not easily viewed. For getting a summary of both the overviewByFS and the pedigreeHapCheck results we use function calcStatistics:
PHcst <- calcStatistics(pedchk=PHpedchk, ovwFS=PHovwFS)
The result of function calcStatistics is a list with 2 elements: data.frames pedstats and FSstats.
head(PHcst$pedstats)
## haploblock mrk imp hap match.NA noDR.TRUE noDR.FALSE withDR.TRUE withDR.FALSE
## 1 contig_12393 513 6 507 189 440 2 442 0
## 2 contig_05603 440 0 440 214 411 6 417 0
## 3 contig_05651 509 5 434 215 414 2 416 0
## 4 contig_11164 386 3 325 313 312 6 318 0
## 5 contig_01638 377 85 459 188 437 6 443 0
## 6 contig_12300 378 125 499 159 472 0 472 0
colSums(PHcst$pedstats[, -1])
## mrk imp hap match.NA noDR.TRUE noDR.FALSE withDR.TRUE withDR.FALSE
## 20258 1059 20322 7972 19026 135 19151 10
Data.frame pedstats gives for each haploblock the total numbers of individuals with complete SNP dosage data, with imputed SNP dosages, and which were haplotyped. Further it gives overall data on the match between individuals and their parent(s): the number of individuals where no matching could be done because they or their parents were not haplotyped, and the number of matches and conflicts assuming double reduction (DR) to be absent or present.
PHcst$FSstats
## FSfam bothparmrk FSFit mrk imp hap
## 1 1 43 43 332.62791 21.767442 354.00000
## 2 2 38 30 28.32558 0.744186 26.62791
## 3 3 34 30 48.48837 1.488372 41.58140
## 4 4 37 25 24.32558 0.627907 20.62791
## 5 rest NA NA 30.86047 0.000000 24.04651
## 6 all NA NA 471.11628 24.627907 472.60465
Data.frame FSstats gives for each FS family, and for all non-FS (“rest”) individuals and for all individuals together, the number of haploblocks where both parents had complete SNP dosage data, and the number of haploblocks where a valid segregation was found given the inferred parental haplotyping, and the average numbers of FS individuals with complete and with inferred SNP dosages and the average number of haplotyped FS individuals.
Finally we want to check that all assigned haplotype combinations match the observed SNP dosages. For this check the function compareHapMrkDosages is provided:
cmpdos <- compareHapMrkDosages(mrkDosage=dosmat, hapresults=PHresults)
table(cmpdos[,,"match"], useNA="always")
##
## TRUE <NA>
## 20322 6811
The total of the table is the number of individuals in the SNP dosage matrix x the number of haploblocks, here 27133. In general the table has 3 entries: FALSE, TRUE and NA representing the number of individuals over all haploblock where the haplotyping does not match or matches the SNP dosages or where the SNP dosages or haplotyping were not available. Here the FALSE (non-matching) category is missing. This is because PolyHaplotyper never assigns a non-matching haplotype combination (in contrast to some other haplotyping software).
PolyHaplotyper needs to know which haplotype combinations match a given set of SNP dosages. It can compute these dosages when needed, but this is not efficient. When it does so, it stores these data in a list called ahclist, and this list is written to a file ahclist_4x.RData (for tetraploids; _6x for hexaploids etc). In the current workshop an ahclist was provided that already contained all SNP dosage combinations occurring in all haploblocks; since no new ones were computed, no new ahclist file was written.
It is more efficient to calculate all haplotype combinations matching all possible SNP dosage combinations in advance, for haploblocks up to some maximum number of SNPs. This is done by function build_ahccompletelist:
# DON'T TRY THIS DURING THE WORKHOP!
# It will take hours.
build_ahccompletelist(ploidy=6, maxmrk=5, overwrite=FALSE)
With these parameters a file ahccompletelist_6x.RData will be created
in the working directory, containing all hexaploid haplotype
combinations for haploblocks of up to 5 SNP markers.
Because the number of SNP dosage combinations, and even more so the
number of haplotype combinations quickly become astronomical, the
practical limits of the number of SNPs per haploblock are 8 in
tetraploids, 6 in hexaploids and 5 in octaploids. This may seem a severe
limitation given the total number of SNPs present in e.g. a gene.
However, the purpose of PolyHaplotyper is not to accommodate full-gene
haplotypes. Instead it aims to produce multi-allelic markers for use in
downstream genetic analyses, e.g. GWAS, and for this a haploblock with
32 to 256 possible alleles is a significant improvement over bi-allelic
SNPs.
Here we used one of the example data sets provided in the
PolyHaplotyper article (Voorrips & Tumino 2022). This article also
provides two other data sets, and scripts to analyze all three data sets
with PolyHaplotyper and some other software.
Also, the package includes a vignette in which part of the data set
shown here is analyzed. And finally, all functions have their own manual
that you can access with the question-mark notation, e.g.
?inferHaplotypes
Van Geest G, Voorrips RE, Esselink E, Post A, Visser RGF (2017) Conclusive evidence for hexasomic inheritance in chrysanthemum based on analysis of a 183 k SNP array. BMC Genomics 18:585; https://doi.org/10.1186/s12864-017-4003-0
Voorrips RE, Tumino G (2022) PolyHaplotyper: haplotyping in polyploids based on bi‑allelic marker dosage data. BMC Bioinformatics 23:44; https://doi.org/10.1186/s12859-022-04989-0