Preparation

Create a new folder and copy the following files there: fitPolyTools_1.1.1.tar.gz, AxiomGT1.summary.txt, WagRhSNPAX_Sample_Table.csv, CsvAnnotationFile.v1.txt and B_scores.RData, and set the working directory to this folder:

# set the working directory to the folder where you copied the materials.
# In R, the separator in file paths is "/", not "\" as in Windows.
# For example:
setwd("c:/SCRI/workshops/fitPoly")

If the packages were not yet installed, do so now:

# install package fitPoly from CRAN:
install.packages("fitPoly")
# you may be asked for a CRAN mirror server at this point; choose any one from the list

# install package fitPolyTools from the local file:
install.packages("fitPolyTools_1.1.1.tar.gz")
# if this fails, the following will most likely work:
# 1. install the devtools package from CRAN if that was not done earlier:
# install.packages("devtools")
# 2. then install the fitPolyTools package using the install_local function
#    of devtools:
# devtools::install_local("fitPolyTools_1.1.1.tar.gz")

Now the packages are installed, but the functions in them are not yet accessible. For that we need the library command:

library(fitPoly)

## Warning: package 'fitPoly' was built under R version 3.6.3

library(fitPolyTools)

A final word about the code examples: several times the function knitr::kable() is used for pretty-printing of tables. If you don’t have the knitr package installed (which is not required for the data processing) you can just type the function argument:

# example: function knitr::kable is not required
# instead of
knitr::kable(head(datAX))
# you can also type
head(datAX)

Extract data from SNP array

In this workshop we work with tetraploid rose data from an Axiom array. The fitPolyTools package also provides functions to work with Illumina arrays, which are demonstrated in the package vignettes.
The raw data from Axiom arrays are processed with the Affymetrix Power Tools (APT) or the Genotyping Console software. After initial quality control steps all passing samples and all SNPs are processed by the program “apt probeset genotype”. If the option –summaries is included (as is normally the case) a file AxiomGT1.summary.txt file is produced. Usually this file is created by the service provider.

Our work in this workshops starts with this AxiomGT1.summary.txt file. It is a tab-separated text file with the markers in rows and the samples in columns. We have one column per sample and two rows, A and B, for each marker. The file starts with several hundred lines of text, and the last few thousand rows contain quality control (DQC) “pseudomarkers”. Also the sample names all have “.CEL” appended to them.

# show part of the Affymetrix AxiomGT1.summary.txt file:
dat <- readDatfile("AxiomGT1.summary.txt", comment.char="#", check.names=FALSE,
                   nrows=6)
dat[1:6, 1:4]

##     probeset_id a550450-4187610-051814-378_A06.CEL a550450-4187610-051814-379_A01.CEL
## 1 AX-86752744-A                           375.0776                           310.0749
## 2 AX-86752744-B                           239.8847                           393.0592
## 3 AX-86752753-A                           530.1507                           413.8058
## 4 AX-86752753-B                           693.3399                           296.1384
## 5 AX-86752767-A                          2103.9311                          1952.4893
## 6 AX-86752767-B                          1206.9029                           419.4540
##   a550450-4187610-051814-379_A02.CEL
## 1                           365.9014
## 2                           204.9692
## 3                           563.3449
## 4                           198.8771
## 5                          2117.5882
## 6                           435.1182

(Function readDatfile in fitPolyTools is a wrapper for the standard R function read.table that sets some required parameters for tab-separated files automatically.)

Read and convert the array data

We provide a demo file with 17244 SNP markers (much less than on the average Axiom array) and 176 samples. We will import this file, clean the sample names and convert them from “wide” format (all samples side by side) to “long” format (all samples under each other), with columns MarkerName, SampleName, X, Y, R, ratio. X and Y are the signal intensities of the two alleles, R = X+Y ( the total signal intensity), and ratio = Y/R: the fraction of the Y signal in the total signal.

# import the AxiomGT1.summary.txt file (takes about 15 sec):
datAX <- readAxiomSummary(AXdata="AxiomGT1.summary.txt", out=NA)
knitr::kable(head(datAX))

MarkerName	SampleName	X	Y	R	ratio
AX-86752744	a550450-4187610-051814-378_A06	375.0776	239.8847	614.9623	0.3900803
AX-86752744	a550450-4187610-051814-379_A01	310.0749	393.0592	703.1341	0.5590103
AX-86752744	a550450-4187610-051814-379_A02	365.9014	204.9692	570.8707	0.3590467
AX-86752744	a550450-4187610-051814-379_A03	422.0593	393.9801	816.0394	0.4827954
AX-86752744	a550450-4187610-051814-379_A04	414.6733	273.4405	688.1138	0.3973769
AX-86752744	a550450-4187610-051814-379_A05	473.2871	342.9835	816.2706	0.4201835

The reformatted data are now in data.frame datAX. As you can see, the -A and -B that were appended to the probeset-id, and the .CEL that was appended to the sample names were removed.

Rename the markers and the samples

Usually the marker names and/or the sample names in the data files produced by the service provider are different from the original names. Therefore we need files specifying the translation between the two sets of names. In this exercise we will convert the marker and sample names of the datAX data set to ones that are meaningful to us.

Markers

First we rename the markers. The translation is specified by file “CsvAnnotationFile.v1.txt”. This file is often supplied by the service provider and has a fixed format, starting with some comment lines, then some data on each probe (including the Probe.Set.ID and the customer_id which we will use here). It usually ends with several thousand lines that are used for quality control, which can be ignored here. Since the markers on the array are a fixed set these annotation files don’t change and can be assumed to be correct (i.e. all marker names in the array also occur in the annotation file and v.v., and all customer_id codes are unique). Therefore we don’t need to check this and can perform the translation directly:

# read the marker annotation file:
mrktable <- read.csv("CsvAnnotationFile.v1.txt", comment.char="#")
head(mrktable)

##   Probe.Set.ID   Affy.SNP.ID Allele.A Allele.B customer_id
## 1  AX-86752845 Affx-86841042        T        C G75957_739P
## 2  AX-86752898 Affx-86843748        A        C  M5639_684P
## 3  AX-86753237 Affx-86843020        T        C   M81_2578P
## 4  AX-86753271 Affx-86842379        T        G G48944_897P
## 5  AX-86753282 Affx-86838393        T        C  M4506_206P
## 6  AX-86753334 Affx-86842938        A        G  M8839_565P

# replace the original marker names in datAX by our own "customer_id":
datAX$MarkerName <-
  mrktable$customer_id[match(datAX$MarkerName, mrktable$Probe.Set.ID)]

Samples

Next we rename the samples. The samples are different between projects using the same array, and sample annotations are usually made by hand, in different formats, and cannot be assumed to be error-free. We perform some checks and rename the samples with function splitNrenameSamples. This function can also be used to split the dataset into parts with samples of different ploidy, but that is not the case here.

smptable <- read.csv("WagRhSNPAX_Sample_Table.csv", stringsAsFactors=FALSE)
head(smptable)

##   SampleName material                      BestArray
## 1     P540a1    P1_K5 a550450-4187610-051814-379_A01
## 2     P867a1    P2_K5 a550450-4187610-051814-379_A02
## 3     P867a2    P2_K5 a550450-4187610-051814-378_A06
## 4      P867b    P2_K5 a550450-4187610-051814-380_A04
## 5       K001    F1_K5 a550450-4187610-051814-379_A03
## 6       K002    F1_K5 a550450-4187610-051814-379_A04

# no split, return the original data.frame with substituted SampleNames:
datAX <- splitNrenameSamples(dat=datAX, sampletable=smptable,
                              SampleID="BestArray", CustomerID="SampleName",
                              out="datAX")
head(datAX)

##    MarkerName SampleName        X        Y        R     ratio
## 1 G44849_319P     P867a2 375.0776 239.8847 614.9623 0.3900803
## 2 G44849_319P     P540a1 310.0749 393.0592 703.1341 0.5590103
## 3 G44849_319P     P867a1 365.9014 204.9692 570.8707 0.3590467
## 4 G44849_319P       K001 422.0593 393.9801 816.0394 0.4827954
## 5 G44849_319P       K002 414.6733 273.4405 688.1138 0.3973769
## 6 G44849_319P       K003 473.2871 342.9835 816.2706 0.4201835

The result is also saved in a file datAX.RData in a compressed, R-specific format.

For this workshop we will keep all data. The package contains tools to filter markers, samples, and datapoints within markers based on signal intensities and these are demonstrated in the package vignettes, but we will skip these steps here.

Dosage scoring with fitPoly

The actual dosage calling is done using functions in package fitPoly, which is an extension of package fitTetra (Voorrips et al, 2011). The original package was specific for tetraploids and treated all samples as belonging to one population, where it tried to fit mixture models reflecting either Hardy-Weinberg segregation ratios or unrestricted ratios. The new fitPoly package works for any ploidy level, it allows to define multiple populations with different segregations, and it adds F1 segregation ratios to the HW and the unrestricted ratios.

Here we use a function called (for historical reasons) “saveMarkerModels”. This function is a wrapper function: with one call an entire set of markers can be genotyped and the results are saved to files.

We have one F1 population here which is defined as follows:

# the parents data.frame defines the population structure:
parents <- data.frame(
  pop=c("F1_K5", "P1_K5", "P2_K5"),
  parent1=c("P1_K5", NA, NA),
  parent2=c("P2_K5", NA, NA),
  stringsAsFactors=FALSE)
knitr::kable(parents)

pop	parent1	parent2
F1_K5	P1_K5	P2_K5
P1_K5	NA	NA
P2_K5	NA	NA

# the pop data.frame defines which sample belongs to which population;
# we extract it from smptable:
pop <- smptable[, 1:2]
names(pop) <- c("SampleName", "population")
knitr::kable(pop[1:6,])

SampleName	population
P540a1	P1_K5
P867a1	P2_K5
P867a2	P2_K5
P867b	P2_K5
K001	F1_K5
K002	F1_K5

The dosage scoring takes too much time to run in this workshop. Several hundred markers per hour are scored per computing core, but still with more than 17000 markers in the data set this would take too long. However we can run the first 3 to give the idea:

saveMarkerModels(ploidy=4,
                 markers=1:3, #for this demo
                 data=datAX,
                 pop.parents=parents,
                 population=pop,
                 p.threshold=0.9,
                 filePrefix="A",
                 rdaFiles=TRUE,
                 plot="fitted",
                 ncores=1)

## saveMarkerModels: batchsize = 3

This function produces several files. The most interesting one here is file A_scores.RData which contains a data.frame with for each marker (in this case only the first 3 markers) and each sample the assigned dosage, as well as some further data:

load("A_scores.RData")
knitr::kable(head(scores))

marker	MarkerName	SampleName	ratio	P1	P2	P3	maxgeno	maxP	geno
1	D10190_215P	K001	0.3748836	1.0000000	0e+00	0	1	1.0000000	1
1	D10190_215P	K002	0.5066971	0.0000000	1e+00	0	2	1.0000000	2
1	D10190_215P	K003	0.3289250	1.0000000	0e+00	0	1	1.0000000	1
1	D10190_215P	K004	0.5424403	0.0000000	1e+00	0	2	1.0000000	2
1	D10190_215P	K007	0.3971393	0.9999997	3e-07	0	1	0.9999997	1
1	D10190_215P	K008	0.7047116	0.0000000	0e+00	1	3	1.0000000	3

The important columns here are MarkerName, SampleName and geno. Geno is the assigned dosage; it is the most probable dosage if the probability of that dosage is above the p.threshold, and otherwise it is NA (missing).

In this example also some plot files are generated in a folder A_plots in the current working directory. The plots for marker 1 and 3 show a fitted model for a duplex x triplex marker (segregation 1:5:5:1) and a nulliplex x simplex marker (segregation 1:1). Marker 2 is rejected because samples are all in the same peak, while the default peak.threshold is 0.85 (max. 85% in the same peak allowed; this threshold exists because for a reliable model fit data of at least two peaks should be available). plot for marker 1

For further information on the input and output of fitPoly functions see the help in that package.

Post-processing of dosage scores based on FS family data

In the previous section we ran fitPoly for only 3 markers. Now we will assume that we have run the following (similar) command for all 17244 markers in the data set, producing a.o. the file B_scores.RData:

saveMarkerModels(ploidy=4,
                 data=datAX,
                 pop.parents=parents,
                 population=pop,
                 p.threshold=0.9,
                 filePrefix="B",
                 rdaFiles=TRUE,
                 ncores=1)

We load this file:

load("B_scores.RData")

Missing data

Some markers are of low quality and cannot be fitted or are rejected by fitPoly (based on several user-settable thresholds). For markers that are not rejected, some samples may be assigned a missing geno (dosage) value if the probability of the most probable dosage is below p.threshold.

We take a look at the number of missing values in the geno column per marker:

NAcounts <- tapply(scores$geno, INDEX=scores$MarkerName, 
                   FUN=function(x) sum(is.na(x)))
NAtab <- table(NAcounts)
head(NAtab)

## NAcounts
##    0    1    2    3    4    5 
## 2604 2046 1461 1023  698  594

tail(NAtab)

## NAcounts
##   66   67   68   69   70  176 
##    5    5    4    4    4 5305

The number of missing data per marker varies from 0 to 70, or is 176. In the last case all samples have a missing value: these are the non-fitted and rejected markers (5305 of the original 17244). The 11939 remaining markers have up to 70 (of 176) missing data.

Segregation types of all markers and quality checks

The dosage scores produced by fitPoly (or by other software) may contain errors, because the marker assay does not work well on the genotyped material or because the software arrives at a sub-optimal fit. Here we make use of a full-sib (F1) population to identify unreliable markers and/or to correct the scores where an incorrect model was used. If additional samples were genotyped in addition to the FS family, the filtering of unreliable markers and the corrections can be applied to them as well.

The initial check for segregation type is carried out by function checkF1:

par1 <- pop$SampleName[pop$population=="P1_K5"]
par2 <- pop$SampleName[pop$population=="P2_K5"]
F1 <- pop$SampleName[pop$population=="F1_K5"]
# this takes about one minute:
chk1 <- checkF1(scores, 
                parent1=par1, parent2=par2, F1=F1,
                polysomic=TRUE, disomic=TRUE, mixed=FALSE, 
                ploidy=4, outfile="chk1.dat")
# parts of the result:
knitr::kable(chk1[1:4,1:14])

m	MarkerName	parent1	parent2	F1_0	F1_1	F1_2	F1_3	F1_4	F1_NA	P540a1	P867a1	P867a2	P867b
1	D10190_215P	2	3	0	20	65	77	10	0	2	3	3	3
2	D10190_215Q	NA	NA	0	0	0	0	0	172	NA	NA	NA	NA
3	D10234_403P	0	1	80	89	3	0	0	0	0	1	1	1
4	D10234_403Q	0	1	79	89	0	0	0	4	0	1	1	1

knitr::kable(chk1[1:4,c(1:2, 19:22, 26)])

m	MarkerName	bestParentfit	frqInvalid_bestParentfit	Pvalue_bestParentfit	matchParent_bestParentfit	qall_mult
1	D10190_215P	1551_1	0.0000	0.2064	Yes	1.0000
2	D10190_215Q		NA	NA		NA
3	D10234_403P	11_0	0.0174	0.4887	Yes	0.9826
4	D10234_403Q	11_0	0.0000	0.4404	Yes	0.9419

This is a Rose dataset and we expect that both polysomic and disomic inheritance may occur. By setting both corresponding parameters to TRUE we consider both sets of segregation types. The result is a data.frame with one row for each marker. The first columns list the marker name, the inferred parental dosages, the observed F1 segregation and the observed parental sample dosages. The bestParentfit column lists the selected segregation type and the next 3 columns list 3 parameters describing how well this segregation fits with the observations. The qall_mult column is an overall quality assessment, from poor (0) to perfect (1). There are more columns in chk1, and with other parameter settings even more columns will be produced; for a full explanation refer to the help. The data.frame chk1 is saved to file chk1.dat, which is a tab-separated file that can be viewed in Excel.

The segregation codes need some explanation. They consist of 2 parts: the first part has the segregation ratios (e.g. 11 means 1:1, 141 means 1:4:1 and so on; a single 1 means no segregation, all F1 have the same dosage). An underscore separates the two parts and the second part is a single number specifying the lowest dosage occurring in the segregation. For example: 121_0 means 1 nulliplex : 2 simplex : 2 duplex; 11_3 means 1 triplex : 1 quadruplex. There is one tetrasomic segregation type 1:8:18:8:1 where one of the ratios is larger than 9 and cannot be represented by a digit; in this and similar cases we use letters A-Z to represent ratios 10-35, so this segregation is represented as 18I81_0, with the I meaning 18. For higher ploidy levels this system is extended, see the help for a full explanation.

“Shifted” markers

When not all dosages are present among the samples it is possible that fitPoly assigns the peaks in the signal ratio histogram to the wrong dosages. For example, if parents in reality have dosages 0 and 1, also the F1 progeny will (almost) only have dosages 0 or 1. If no other samples are analyzed, only two of the five dosages are present, and these might be mis-scored as being dosages 1 and 2 instead of 0 and 1. These mis-assignments are called “shifts”. fitPoly checks against possible shifts whether or not parents and F1 populations are defined but it doesn’t catch them all.

In package fitPolyTools a function correctDosages is provided that, based on the result of checkF1, checks if shifting the dosages up or down by 1 would be worth to try. The result is a set of suggested shifts, that should then again be checked (using checkF1) for their actual fit.

cordos <- correctDosages(chk1, scores, par1, par2, ploidy=4,
                        polysomic=TRUE, disomic=TRUE, mixed=FALSE)
# show the first few nonzero shifts:
knitr::kable(head(cordos[cordos$shift!=0,]))

	MarkerName	segtype	parent1	parent2	shift	ParNA
329	G10701_1435P	11_2	3	NA	1	1
492	G12049_166Q	11_1	1	NA	-1	1
504	G1217_1009Q	11_2	2	NA	1	1
568	G12636_1704Q	11_2	3	NA	1	1
632	G13490_151Q	11_2	3	2	1	0
673	G13900_549P	11_2	2	3	1	0

Column shift contains the suggested shifts to try: a 0 means no shift suggested, a 1 means: shift all dosages assigned by fitPoly up (0 becomes 1, 1 becomes 2, 2 becomes 3, 3 and 4 become 4 (3 and 4 are merged)). A shift of -1 means that all dosages should be decreased by 1 (merging previous dosages 0 and 1 into dosage 0).

This is a rather simplistic way to check for possibly shifted markers. If there are more samples available than only the one F1 population and its parents, those might also be informative about possible shifts. Also, sometimes both the unshifted and the shifted version of the marker might be acceptable. In these cases it is best to keep both versions and decide later on (after mapping or haplotyping steps) which version fits best.

The next step is to run checkF1 again with as additional parameter these shifts; it will then perform the indicated dosage shift on all samples (parental and F1) before selecting the most likely segregation type and quality assessment:

#select the markers where a shift should be tried:
cordos <- cordos[cordos$shift != 0,] 
subscores <- scores[scores$MarkerName %in% cordos$MarkerName,]
chk2 <- checkF1(subscores, par1, par2, F1,
                polysomic=TRUE, disomic=TRUE, mixed=FALSE, 
                ploidy=4, outfile=NA, shiftmarkers=cordos)

Data.frame chk2 has the same format as chk1 but has an additional column “shift” at the end. For the next steps it is useful to merge chk1 and chk2. In order to do that, chk1 needs also to get this extra column:

chk1$shift <- 0
chk <- rbind(chk1, chk2)
chk <- chk[order(chk$MarkerName),]
writeDatfile(chk, file="chk.dat")

Select acceptable markers

Data.frame chk now has all unshifted and shifted versions of the markers. Each of the (versions of the) markers has been checked for its quality, which is
summarized in its value of qall_mult. We would like to select only those (versions of) markers that are of sufficient quality. As a rule of thumb, all scores of markers with qall_mult==0 are bad and all others might be acceptable, but perhaps we can raise the threshold for qall_mult a bit for a more stringent selection, without losing valuable markers. The approach is to study XY-scatterplots for several markers at different levels of qall_mult and see from which level good (well-scored) markers appear.

In order to produce scatterplots with dots colored according to assigned dosages we first combine the X and Y signals with the geno (dosage) values:

XYgeno <- combineFiles(XYdata=datAX, scores=scores)
# define qall levels of 0, 0.05, 0.10 , .. , 1 where we want to inspect some SNPs:
qall.levels <- seq(0, 1, by=0.05)
# select six SNPs with qall values near each of these values and draw their plots:
chkx <- selMarkers_qall(chk,  qall.levels, mrkperlevel=6)
drawXYplots(dat=XYgeno, markers=chkx,
            out="XYgeno",
            genocol=get.genocol(ploidy=4), sample.groups=list(par1, par2),
            groups.col=c("red", "blue"), ploidy=4)

This generates a series of 21 pages, each with 6 plots representing markers with a qall_mult score at a certain level.

The drawXYplots command here is quite complex. A full explanation of all parameters can be found in the help. Some things to note here: genocol sets the colors assigned to each of the dosages, from red (nulliplex) through blue (duplex) to green (quadruplex), and grey for samples with missing dosage score. The parents are highlighted in red and blue via parameters sample.groups and groups.col. A conclusion from this series of plots is that already at qall_mult==0.05 most of the markers appear to be well-scored. A next step could be to try to find a qall_mult threshold between 0 and 0.05 to separate the well-scored and badly scored markers. For now we will accept all markers with qall > 0. XY-plots page 21

#select only the markers with qall_mult > 0:
chk <- chk[!is.na(chk$qall_mult) & chk$qall_mult > 0,]

However, of many SNPs we now have more than one “version”. For instance:

knitr::kable(chk[substr(chk$MarkerName,1,9)=="K14264_98", 
                 c("MarkerName", "parent1", "parent2", "bestParentfit", "qall_mult", "shift")])

	MarkerName	parent1	parent2	bestParentfit	qall_mult	shift
6358	K14264_98P	2	3	11_2	0.5462	0
17241	K14264_98P	3	4	11_3	0.5462	1
6359	K14264_98Q	2	3	11_2	0.5008	0
17242	K14264_98Q	3	4	11_3	0.5008	1

Here we have 4 “versions” of the same SNP. The P or Q after the SNP name indicate which probe was used to target the SNP: Axiom arrays need a probe on only one side of the SNP, and on this particular Rose array all SNPs were targeted with two probes (on the two opposite strands). Since the two probes detect the same SNP the scores should be mostly similar. Further, for this SNP the nonshifted and shifted versions were both acceptable.

The following functions check for each SNP which versions have passed the quality check (in our case, qall_mult > 0), and see if it is possible to merge the two probes. In order to merge, there should be only few scores over the entire FS family that conflict between the scores. If this is the case, the merging is performed, with missing values from one probe being filled in from the other probe.

The merged version get an R appended, and the second of the two functions removes the P and Q versions that were used to generate the R version. Also, n’s or s’s are appended to indicate if the nonshifted or the shifted version of the probes was used.

cpp <- compareProbes(chk, scores, parent1=par1, parent2=par2, F1=F1,
                     polysomic=TRUE, disomic=TRUE, mixed=FALSE,
                     ploidy=4, compfile=NA, combscorefile=NA)

## batch 1 of 33 
## batch 2 of 33 
## batch 3 of 33 
## batch 4 of 33 
## batch 5 of 33 
## batch 6 of 33 
## batch 7 of 33 
## batch 8 of 33 
## batch 9 of 33 
## batch 10 of 33 
## batch 11 of 33 
## batch 12 of 33 
## batch 13 of 33 
## batch 14 of 33 
## batch 15 of 33 
## batch 16 of 33 
## batch 17 of 33 
## batch 18 of 33 
## batch 19 of 33 
## batch 20 of 33 
## batch 21 of 33 
## batch 22 of 33 
## batch 23 of 33 
## batch 24 of 33 
## batch 25 of 33 
## batch 26 of 33 
## batch 27 of 33 
## batch 28 of 33 
## batch 29 of 33 
## batch 30 of 33 
## batch 31 of 33 
## batch 32 of 33 
## batch 33 of 33

knitr::kable(cpp$compstat[1:6,])

SNPname	segtypePn	qallPn	segtypePs	qallPs	segtypeQn	qallQn	segtypeQs	qallQs	segtypeRnn	segtypeRns	segtypeRsn	segtypeRss	countP	countQ	countR
D10190_215	1551_1	1.0000	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	0	0
D10234_403	11_0	0.9826	NA	NA	11_0	0.9419	NA	NA	11_0	NA	NA	NA	1	1	1
D10262_172	NA	NA	NA	NA	11_3	0.4971	NA	NA	NA	NA	NA	NA	0	1	0
D10433_789	NA	NA	NA	NA	11_3	0.9273	NA	NA	NA	NA	NA	NA	0	1	0
D11322_170	11_3	0.9018	NA	NA	11_3	0.9363	NA	NA	11_3	NA	NA	NA	1	1	1
D11323_451	11_0	1.0000	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	0	0

rr <- removeRedundant(compstat=cpp$compstat, combscores=cpp$combscores,
                      compfile=NA, combscorefile="rose_axiom_7195_mrk.dat")
knitr::kable(rr$combscores[1:6, 1:12])

	MarkerName	segtype	P540a1	P867a1	P867a2	P867b	parent1	parent2	K001	K002	K003	K004
1	D10190_215Pn	1551_1	2	3	3	3	2	3	1	2	1	2
4	D10234_403Rnn	11_0	0	1	1	1	0	1	0	0	0	1
5	D10262_172Qn	11_3	4	3	2	3	4	3	2	4	4	4
6	D10433_789Qn	11_3	3	4	4	4	3	4	4	3	3	4
9	D11322_170Rnn	11_3	3	4	4	4	3	4	4	4	3	3
10	D11323_451Pn	11_0	1	0	0	0	1	0	0	1	0	1

Both these functions return a list with two components: compstat and combscores, each a data.frame. These are also saved as tab-separated text files if filenames are specified for compfile and combscorefile. Data.frame compstat has an overview of the different versions of each SNP (two probes, each unshifted and/or shifted). For each of these (suffixed with P or Q for the probe, and with n or s for (non-)shifted) the segregation type and qall_mult quality score is listed. If (a version of) the P and the Q probe are sufficiently similar, a combined version of the SNP is produced which gets the suffix R, and a further suffix nn, ns, sn or ss depending on whether the nonshifted or the shifted version of the P and Q probe was used for merging. Also for such a merged (R) marker the segregation type is given, and finally the number of versions of the P, Q and R markers.
Two probes are considered sufficiently similar to merge if they have the same segregation type and if there are (by default) not more than 4% conflicting sample scores. For more details and additional parameters see the help.

The other component of the result of these functions is data.frame combscores. This lists for each remaining marker the segregation type and dosages. We save the combscores result of removeRedundant in file rose_axiom_7195_mrk.dat, which will be starting point for the workshop on linkage mapping with polymapR.

If there would be additional samples, not belonging to the FS family or its parents, they could benefit from the same shift corrections and merging of probes as applied to the family, by specifying them using parameter ‘other’ in the call to compareProbes. Their dosages would then appear in the combscores in columns after the FS family.

Dosage scoring / fitPoly workshop

Roeland E. Voorrips

2021-01-08

Introduction