library(mappoly)
MAPpoly is an R package to construct genetic maps in autopolyploids with even ploidy levels. In this quick start guide, we will present some basic functions to construct a tetraploid potato map. For a comprehensive description of all function, please refer to MAPpoly’s Reference Manual and the Extended Tutorial.
There are several functions to read discrete and probabilistic dosage-based genotype data sets in MAPpoly. You can read a data set from TXT, CSV, VCF, fitPoly-generated or import it from the R packages polyRAD, polymapR, and updog. The data set distributed along with MAPpoly is a subset of markers from (Pereira et al., 2021) in CSV format. Let us read it into MAPpoly
<- system.file("extdata/potato_example.csv", package = "mappoly")
file.name <- read_geno_csv(file.in = file.name, ploidy = 4)
dat #> Reading the following data:
#> Ploidy level: 4
#> No. individuals: 153
#> No. markers: 2225
#> No. informative markers: 2225 (100%)
#> ...
#> Done with reading.
#> Filtering non-conforming markers.
#> ...
#> Done with filtering.
print(dat, detailed = T)
#> This is an object of class 'mappoly.data'
#> Ploidy level: 4
#> No. individuals: 153
#> No. markers: 2225
#> Missing data under prob. threshold: 1.07%
#>
#> ----------
#> No. markers per chromosome:
#> seq No.mrk
#> 1 258
#> 10 114
#> 11 150
#> 12 122
#> 2 214
#> 3 197
#> 4 207
#> 5 168
#> 6 202
#> 7 221
#> 8 179
#> 9 179
#> ----------
#> Markers with no chromosome information: 14
#> ----------
#> No. of markers per dosage combination in both parents:
#> P1 P2 freq
#> 0 1 60
#> 0 2 24
#> 1 0 213
#> 1 1 195
#> 1 2 140
#> 1 3 52
#> 2 0 51
#> 2 1 153
#> 2 2 212
#> 2 3 203
#> 2 4 90
#> 3 1 62
#> 3 2 152
#> 3 3 252
#> 3 4 249
#> 4 2 27
#> 4 3 90
plot(dat)
The output figure shows a bar plot on the left-hand side with the number of markers in each allele dosage combination in \(P_1\) and \(P_2\), respectively. The upper-right plot contains the \(\log_{10}(p-value)\) from \(\chi^2\) tests for all markers, considering the expected segregation patterns under Mendelian inheritance.
Quality control (QC) procedures are extremely important to identify:
These procedures can be conducted in any order, depending on the data set. Let us first removing individuals from crosses other than \(P_1 \times P_2\). When using the interactive function, the user needs select a polygon around the individuals to be removed by clicking in its vertices and pressing Esc.
<- filter_individuals(dat) dat
#> Initial data:
#> Number of Individuals: 155
#> Number of Markers: 2225
#>
#> Missing data check:
#> Total SNPs: 2225
#> 0 SNPs dropped due to missing data threshold of 1
#> Total of: 2225 SNPs
#> MAF check:
#> No SNPs with MAF below 0
#> Monomorphic check:
#> No monomorphic SNPs
#> Summary check:
#> Initial: 2225 SNPs
#> Final: 2225 SNPs ( 0 SNPs removed)
#>
#> Completed! Time = 0.461 seconds
Now, let us filter out markers and individuals with more than 5% of missing data. You can update the threshold interactively
<- filter_missing(dat, type = "marker", filter.thres = .05)
dat <- filter_missing(dat, type = "individual", filter.thres = .05) dat
Finally, we can filter out markers with distorted segregation and redundant information. At this point, we do not consider preferential pairing and double reduction.
<- filter_segregation(dat, chisq.pval.thres = 0.05/dat$n.mrk)
seq.filt <- make_seq_mappoly(seq.filt)
seq.filt <- elim_redundant(seq.filt) seq.red
After filtering, 1990 markers left to be mapped. Now, let us create a sequence of ordered markers to proceed with the analysis. You can also plot the data set for a specific sequence of markers
<- make_seq_mappoly(seq.red)
seq.init plot(seq.init)
and check the distribution of the markers in the reference genome, if the information is available
<- get_genomic_order(input.seq = seq.init) ## get genomic order of the sequence
go plot(go)
The two-point analysis calculates the pairwise recombination fraction in a sequence of markers. At this point of the analysis, where we have a large number of markers, we use the function est_pairwise_rf2
which has a less detailed output then the original est_pairwise_rf
(which will be used later), but can handle tens of thousands of markers, even when using a personal computer. Nevertheless, depending on the number of markers, the analysis can take a while if few cores are available.
<- parallel::detectCores() - 1
ncores <- est_pairwise_rf2(seq.init, ncpus = ncores)
tpt #> Going RcppParallel mode.
#> ...
#> Done with pairwise computation.
#> ~~~~~~~
#> Reorganizing results
#> ...
<- rf_list_to_matrix(tpt) ## converts rec. frac. list into a matrix
m <- make_seq_mappoly(go) ## creates a sequence of markers in the genome order
sgo plot(m, ord = sgo, fact = 5) ## plots a rec. frac. matrix using the genome order, averaging neighbor cells in a 5 x 5 grid
We can cluster the markers in linkage groups by using function group_mappoly
. The function uses the recombination fraction matrix and UPGMA method to group markers. Use the option comp.mat = TRUE
to compare the results of the linkage based clustering with the chromosome information. If your data set does not contain chromosome information, use the option comp.mat = FALSE
. You also can use the interactive version to change the number of expected groups
<- group_mappoly(m, expected.groups = 12, comp.mat = TRUE)
g plot(g)
g#> This is an object of class 'mappoly.group'
#> ------------------------------------------
#> Criteria used to assign markers to groups:
#>
#> - Number of markers: 1988
#> - Number of linkage groups: 12
#> - Number of markers per linkage groups:
#> group n.mrk
#> 1 118
#> 2 159
#> 3 191
#> 4 166
#> 5 197
#> 6 180
#> 7 133
#> 8 197
#> 9 222
#> 10 105
#> 11 168
#> 12 152
#> ------------------------------------------
#> 10 9 4 5 2 3 11 7 1 12 6 8 NoChr
#> 1 88 0 3 5 7 5 3 1 2 2 0 1 1
#> 2 0 147 1 0 2 0 1 2 2 0 1 2 1
#> 3 0 1 166 3 0 3 3 2 2 0 4 2 5
#> 4 1 1 4 140 3 2 4 3 5 0 2 1 0
#> 5 1 1 3 1 174 5 0 1 2 0 0 8 1
#> 6 0 4 0 2 6 155 1 1 6 0 3 2 0
#> 7 6 2 1 1 0 0 118 0 2 1 1 1 0
#> 8 1 3 2 1 0 2 1 179 2 3 1 1 1
#> 9 1 1 2 5 2 0 0 6 204 0 0 1 0
#> 10 0 1 2 0 0 1 0 1 1 98 0 0 1
#> 11 2 0 0 0 1 1 0 0 0 1 160 2 1
#> 12 2 0 0 1 1 1 0 0 1 1 2 142 1
#> ------------------------------------------
In the table above, the rows indicate linkage groups obtained using linkage information and the columns are the chromosomes in the reference genome. Notice the diagonal indicating the concordance between the two sources of information.
Markers are ordered within linkage groups. In this tutorial, we are going to show the step-by-ptep procedure using Linkage Group 1 (LG1). You can do the same for the remaining linakge groups.
Since we had a good concordance between genome and linkage information, we will use only markers that were assigned to a particular linkage group using both sources of information. We will do that using genomic.info = 1
so the function uses the intersection of the markers assigned using linkage and the chromosome with highest number of allocated markers. To use only the linkage information, do not use the argument genomic.info
. You also need the recombination fraction matrix for that group.
<- make_seq_mappoly(g, 1, ## Select LG1
s1 genomic.info = 1)
<- make_mat_mappoly(m, s1) m1
Let us order the markers in the sequence using the MDS algorithm.
<- mds_mappoly(input.mat = m1)
mds.o1 #> Stress: 0.30656
#> Mean Nearest Neighbour Fit: 1.87646
<- make_seq_mappoly(mds.o1)
s1.mds plot(m1, ord = s1.mds)
Usually, at this point, the user can use diagnostic plots to remove markers that disturb the ordering procedure. We didn’t use that procedure in this tutorial, but we encourage the user to check the example in
?mds_mappoly
. Now, let us use the reference genome to order the markers.
You also can order the markers the reference genome
<- get_genomic_order(s1)
gen.o1 <- make_seq_mappoly(gen.o1)
s1.gen plot(m1, ord = s1.gen)
For the sake of this short tutorial, let us use the MDS order (s1.mds
) to phase the markers, but you can also try the genome order (s1.gen
) and compare the resulting maps using function compare_maps
and you will notice that the genomic order yield a better map since its likelihood is higher.
The estimation of the genetic map for a given order involves the computation of recombination fraction between adjacent markers and inferring the linkage phase configuration of those markers in both parents. The core function to perform these tasks in MAPpoly
is est_rf_hmm_sequential
. This function uses the pairwise recombination fraction as the first source of information to sequenlially position allelic variants in specific parental homologs. For situations where pairwise analysis has limited power, the algorithm relies on the likelihood obtained through a hidden Markov model (HMM). Once all markers are positioned, the final map is reconstructed using the HMM multipoint algorithm. For a detailed description of the est_rf_hmm_sequential
arguments, please refer to MAPpoly’s Reference Manual and the Extended Tutorial.
First we need to calculate the pairwise recombination fraction for markers in sequence s1.mds
using est_pairwise_rf
, which contains the information necessary to the proper working of the phasing algorithm.
<- est_pairwise_rf(s1.mds, ncpus = ncores)
tpt1 <- est_rf_hmm_sequential(input.seq = s1.mds,
lg1.map start.set = 3,
thres.twopt = 10,
thres.hmm = 20,
extend.tail = 50,
info.tail = TRUE,
twopt = tpt1,
sub.map.size.diff.limit = 5,
phase.number.limit = 20,
reestimate.single.ph.configuration = TRUE,
tol = 10e-3,
tol.final = 10e-4)
#> ════════════════════════════════════════════════════════════ Initial sequence ══
#> ══════════════════════════════════════════════════ Done with initial sequence ══
#> ══════════════════════════════════ Reestimating final recombination fractions ══
#> ════════════════════════════════════════════════════════════════════════════════
Now, use the functions print
and plot
to view the map results:
print(lg1.map)
#> This is an object of class 'mappoly.map'
#> Ploidy level: 4
#> No. individuals: 144
#> No. markers: 75
#> No. linkage phases: 1
#>
#> ---------------------------------------------
#> Number of linkage phase configurations: 1
#> ---------------------------------------------
#> Linkage phase configuration: 1
#> map length: 82.66
#> log-likelihood: -1014.32
#> LOD: 0
#> ~~~~~~~~~~~~~~~~~~
plot(lg1.map, mrk.names = TRUE, cex = 0.7)
Now let us update the recombination fractions by allowing a global error in the HMM recombination fraction re-estimation. Using this approach, the genetic map’s length will be updated by removing spurious recombination events. This procedure can be applied using either the probability distribution provided by the genotype calling software using function est_full_hmm_with_prior_prob
or assuming a global genotype error like the following example
<- est_full_hmm_with_global_error(input.map = lg1.map, error = 0.05,
lg1.map.up verbose = TRUE)
plot(lg1.map.up, mrk.names = TRUE, cex = 0.7)
We can also use the ordinary least squares (OLS) method and the weighted MDS followed by fitting a one dimensional principal curve (wMDS_to_1D_pc)
<- reest_rf(lg1.map, m1, method = "ols")
lg1.map.ols <- reest_rf(lg1.map, m1, method = "wMDS_to_1D_pc", input.mds = mds.o1) lg1.map.mds
Now let us create a list with the maps and plot the results
<- list(orig = lg1.map,
map.list.lg1 updt = lg1.map.up,
ols = lg1.map.ols,
mds = lg1.map.mds)
plot_map_list(map.list.lg1, col = "ggstyle", title = "Estimation method")
## Homolog probability and preferential pairing profile
In order to use the genetic map in conjunction with QTL analysis software, we need to obtain the homolog probability for all linkage groups for all individuals in the full-sib population. In this short guide we will proceed only with one linkage group, but in real situations this procedure should be applied to all chromosomes. Let us use the updated map
<- calc_genoprob_error(lg1.map.up, step = 1, error = 0.05)
g1 #> Ploidy level:4
#> Number of individuals:144
#> ..................................................
#> ..................................................
#> ............................................
<- export_qtlpoly(g1) #export to QTLpoly
to.qtlpoly #>
#> Linkage group 1 ...
<- calc_homologprob(g1)
h1 #>
#> Linkage group 1 ...
plot(h1, lg = 1, ind = 10)
Now let us compute the preferential pairing profile for linkage group 1
= calc_prefpair_profiles(g1)
p1 #>
#> Linkage group 1 ...
plot(p1, min.y.prof = 0.25, max.y.prof = 0.4, P1 = "Atlantic", P2 = "B1829.5")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It is possible to export a phased map to an external CSV file using
export_map_list(lg1.map.up, file = "output_file.csv")
You can also print the same output on the screen using
export_map_list(lg1.map.up, file = "")
For a script with a complete analysis of the data set presented here, please refer to YYYYYYYYY