Dosage scoring / fitPoly workshop

Roeland E. Voorrips

2021-01-08

Introduction

This workshop covers the process of obtaining dosage scores from SNP array data. This process consists of three steps: (1) pre-processing, including importing and reformatting the SNP array data and possibly filtering; (2) the actual dosage scoring, and (3) post-processing: identification and correction of incorrectly scored markers, comparison and combination of “versions” of markers (explained below), and exporting the acceptable markers in a format readable by downstream software.

For these three stages we will use two packages: fitPoly (available from CRAN) and fitPolyTools (unpublished, part of the Materials for this workshop). The actual dosage scoring is done with fitPoly, while the pre- and post-processing is done with functions in fitPolyTools.

Preparation

Create a new folder and copy the following files there: fitPolyTools_1.1.1.tar.gz, AxiomGT1.summary.txt, WagRhSNPAX_Sample_Table.csv, CsvAnnotationFile.v1.txt and B_scores.RData, and set the working directory to this folder:

# set the working directory to the folder where you copied the materials.
# In R, the separator in file paths is "/", not "\" as in Windows.
# For example:
setwd("c:/SCRI/workshops/fitPoly")

If the packages were not yet installed, do so now:

# install package fitPoly from CRAN:
install.packages("fitPoly")
# you may be asked for a CRAN mirror server at this point; choose any one from the list

# install package fitPolyTools from the local file:
install.packages("fitPolyTools_1.1.1.tar.gz")
# if this fails, the following will most likely work:
# 1. install the devtools package from CRAN if that was not done earlier:
# install.packages("devtools")
# 2. then install the fitPolyTools package using the install_local function
#    of devtools:
# devtools::install_local("fitPolyTools_1.1.1.tar.gz")

Now the packages are installed, but the functions in them are not yet accessible. For that we need the library command:

library(fitPoly)
## Warning: package 'fitPoly' was built under R version 3.6.3
library(fitPolyTools)

A final word about the code examples: several times the function knitr::kable() is used for pretty-printing of tables. If you don’t have the knitr package installed (which is not required for the data processing) you can just type the function argument:

# example: function knitr::kable is not required
# instead of
knitr::kable(head(datAX))
# you can also type
head(datAX)

Extract data from SNP array

In this workshop we work with tetraploid rose data from an Axiom array. The fitPolyTools package also provides functions to work with Illumina arrays, which are demonstrated in the package vignettes.
The raw data from Axiom arrays are processed with the Affymetrix Power Tools (APT) or the Genotyping Console software. After initial quality control steps all passing samples and all SNPs are processed by the program “apt probeset genotype”. If the option –summaries is included (as is normally the case) a file AxiomGT1.summary.txt file is produced. Usually this file is created by the service provider.

Our work in this workshops starts with this AxiomGT1.summary.txt file. It is a tab-separated text file with the markers in rows and the samples in columns. We have one column per sample and two rows, A and B, for each marker. The file starts with several hundred lines of text, and the last few thousand rows contain quality control (DQC) “pseudomarkers”. Also the sample names all have “.CEL” appended to them.

# show part of the Affymetrix AxiomGT1.summary.txt file:
dat <- readDatfile("AxiomGT1.summary.txt", comment.char="#", check.names=FALSE,
                   nrows=6)
dat[1:6, 1:4]
##     probeset_id a550450-4187610-051814-378_A06.CEL a550450-4187610-051814-379_A01.CEL
## 1 AX-86752744-A                           375.0776                           310.0749
## 2 AX-86752744-B                           239.8847                           393.0592
## 3 AX-86752753-A                           530.1507                           413.8058
## 4 AX-86752753-B                           693.3399                           296.1384
## 5 AX-86752767-A                          2103.9311                          1952.4893
## 6 AX-86752767-B                          1206.9029                           419.4540
##   a550450-4187610-051814-379_A02.CEL
## 1                           365.9014
## 2                           204.9692
## 3                           563.3449
## 4                           198.8771
## 5                          2117.5882
## 6                           435.1182

(Function readDatfile in fitPolyTools is a wrapper for the standard R function read.table that sets some required parameters for tab-separated files automatically.)

Read and convert the array data

We provide a demo file with 17244 SNP markers (much less than on the average Axiom array) and 176 samples. We will import this file, clean the sample names and convert them from “wide” format (all samples side by side) to “long” format (all samples under each other), with columns MarkerName, SampleName, X, Y, R, ratio. X and Y are the signal intensities of the two alleles, R = X+Y ( the total signal intensity), and ratio = Y/R: the fraction of the Y signal in the total signal.

# import the AxiomGT1.summary.txt file (takes about 15 sec):
datAX <- readAxiomSummary(AXdata="AxiomGT1.summary.txt", out=NA)
knitr::kable(head(datAX))
MarkerName SampleName X Y R ratio
AX-86752744 a550450-4187610-051814-378_A06 375.0776 239.8847 614.9623 0.3900803
AX-86752744 a550450-4187610-051814-379_A01 310.0749 393.0592 703.1341 0.5590103
AX-86752744 a550450-4187610-051814-379_A02 365.9014 204.9692 570.8707 0.3590467
AX-86752744 a550450-4187610-051814-379_A03 422.0593 393.9801 816.0394 0.4827954
AX-86752744 a550450-4187610-051814-379_A04 414.6733 273.4405 688.1138 0.3973769
AX-86752744 a550450-4187610-051814-379_A05 473.2871 342.9835 816.2706 0.4201835

The reformatted data are now in data.frame datAX. As you can see, the -A and -B that were appended to the probeset-id, and the .CEL that was appended to the sample names were removed.

Rename the markers and the samples

Usually the marker names and/or the sample names in the data files produced by the service provider are different from the original names. Therefore we need files specifying the translation between the two sets of names. In this exercise we will convert the marker and sample names of the datAX data set to ones that are meaningful to us.

Markers

First we rename the markers. The translation is specified by file “CsvAnnotationFile.v1.txt”. This file is often supplied by the service provider and has a fixed format, starting with some comment lines, then some data on each probe (including the Probe.Set.ID and the customer_id which we will use here). It usually ends with several thousand lines that are used for quality control, which can be ignored here. Since the markers on the array are a fixed set these annotation files don’t change and can be assumed to be correct (i.e. all marker names in the array also occur in the annotation file and v.v., and all customer_id codes are unique). Therefore we don’t need to check this and can perform the translation directly:

# read the marker annotation file:
mrktable <- read.csv("CsvAnnotationFile.v1.txt", comment.char="#")
head(mrktable)
##   Probe.Set.ID   Affy.SNP.ID Allele.A Allele.B customer_id
## 1  AX-86752845 Affx-86841042        T        C G75957_739P
## 2  AX-86752898 Affx-86843748        A        C  M5639_684P
## 3  AX-86753237 Affx-86843020        T        C   M81_2578P
## 4  AX-86753271 Affx-86842379        T        G G48944_897P
## 5  AX-86753282 Affx-86838393        T        C  M4506_206P
## 6  AX-86753334 Affx-86842938        A        G  M8839_565P
# replace the original marker names in datAX by our own "customer_id":
datAX$MarkerName <-
  mrktable$customer_id[match(datAX$MarkerName, mrktable$Probe.Set.ID)]

Samples

Next we rename the samples. The samples are different between projects using the same array, and sample annotations are usually made by hand, in different formats, and cannot be assumed to be error-free. We perform some checks and rename the samples with function splitNrenameSamples. This function can also be used to split the dataset into parts with samples of different ploidy, but that is not the case here.

smptable <- read.csv("WagRhSNPAX_Sample_Table.csv", stringsAsFactors=FALSE)
head(smptable)
##   SampleName material                      BestArray
## 1     P540a1    P1_K5 a550450-4187610-051814-379_A01
## 2     P867a1    P2_K5 a550450-4187610-051814-379_A02
## 3     P867a2    P2_K5 a550450-4187610-051814-378_A06
## 4      P867b    P2_K5 a550450-4187610-051814-380_A04
## 5       K001    F1_K5 a550450-4187610-051814-379_A03
## 6       K002    F1_K5 a550450-4187610-051814-379_A04
# no split, return the original data.frame with substituted SampleNames:
datAX <- splitNrenameSamples(dat=datAX, sampletable=smptable,
                              SampleID="BestArray", CustomerID="SampleName",
                              out="datAX")
head(datAX) 
##    MarkerName SampleName        X        Y        R     ratio
## 1 G44849_319P     P867a2 375.0776 239.8847 614.9623 0.3900803
## 2 G44849_319P     P540a1 310.0749 393.0592 703.1341 0.5590103
## 3 G44849_319P     P867a1 365.9014 204.9692 570.8707 0.3590467
## 4 G44849_319P       K001 422.0593 393.9801 816.0394 0.4827954
## 5 G44849_319P       K002 414.6733 273.4405 688.1138 0.3973769
## 6 G44849_319P       K003 473.2871 342.9835 816.2706 0.4201835

The result is also saved in a file datAX.RData in a compressed, R-specific format.

For this workshop we will keep all data. The package contains tools to filter markers, samples, and datapoints within markers based on signal intensities and these are demonstrated in the package vignettes, but we will skip these steps here.

Dosage scoring with fitPoly

The actual dosage calling is done using functions in package fitPoly, which is an extension of package fitTetra (Voorrips et al, 2011). The original package was specific for tetraploids and treated all samples as belonging to one population, where it tried to fit mixture models reflecting either Hardy-Weinberg segregation ratios or unrestricted ratios. The new fitPoly package works for any ploidy level, it allows to define multiple populations with different segregations, and it adds F1 segregation ratios to the HW and the unrestricted ratios.

Here we use a function called (for historical reasons) “saveMarkerModels”. This function is a wrapper function: with one call an entire set of markers can be genotyped and the results are saved to files.

We have one F1 population here which is defined as follows:

# the parents data.frame defines the population structure:
parents <- data.frame(
  pop=c("F1_K5", "P1_K5", "P2_K5"),
  parent1=c("P1_K5", NA, NA),
  parent2=c("P2_K5", NA, NA),
  stringsAsFactors=FALSE)
knitr::kable(parents)
pop parent1 parent2
F1_K5 P1_K5 P2_K5
P1_K5 NA NA
P2_K5 NA NA
# the pop data.frame defines which sample belongs to which population;
# we extract it from smptable:
pop <- smptable[, 1:2]
names(pop) <- c("SampleName", "population")
knitr::kable(pop[1:6,])
SampleName population
P540a1 P1_K5
P867a1 P2_K5
P867a2 P2_K5
P867b P2_K5
K001 F1_K5
K002 F1_K5

The dosage scoring takes too much time to run in this workshop. Several hundred markers per hour are scored per computing core, but still with more than 17000 markers in the data set this would take too long. However we can run the first 3 to give the idea:

saveMarkerModels(ploidy=4,
                 markers=1:3, #for this demo
                 data=datAX,
                 pop.parents=parents,
                 population=pop,
                 p.threshold=0.9,
                 filePrefix="A",
                 rdaFiles=TRUE,
                 plot="fitted",
                 ncores=1)
## saveMarkerModels: batchsize = 3

This function produces several files. The most interesting one here is file A_scores.RData which contains a data.frame with for each marker (in this case only the first 3 markers) and each sample the assigned dosage, as well as some further data:

load("A_scores.RData")
knitr::kable(head(scores))
marker MarkerName SampleName ratio P0 P1 P2 P3 P4 maxgeno maxP geno
1 D10190_215P K001 0.3748836 0 1.0000000 0e+00 0 0 1 1.0000000 1
1 D10190_215P K002 0.5066971 0 0.0000000 1e+00 0 0 2 1.0000000 2
1 D10190_215P K003 0.3289250 0 1.0000000 0e+00 0 0 1 1.0000000 1
1 D10190_215P K004 0.5424403 0 0.0000000 1e+00 0 0 2 1.0000000 2
1 D10190_215P K007 0.3971393 0 0.9999997 3e-07 0 0 1 0.9999997 1
1 D10190_215P K008 0.7047116 0 0.0000000 0e+00 1 0 3 1.0000000 3

The important columns here are MarkerName, SampleName and geno. Geno is the assigned dosage; it is the most probable dosage if the probability of that dosage is above the p.threshold, and otherwise it is NA (missing).

In this example also some plot files are generated in a folder A_plots in the current working directory. The plots for marker 1 and 3 show a fitted model for a duplex x triplex marker (segregation 1:5:5:1) and a nulliplex x simplex marker (segregation 1:1). Marker 2 is rejected because samples are all in the same peak, while the default peak.threshold is 0.85 (max. 85% in the same peak allowed; this threshold exists because for a reliable model fit data of at least two peaks should be available). plot for marker 1

For further information on the input and output of fitPoly functions see the help in that package.

Post-processing of dosage scores based on FS family data

In the previous section we ran fitPoly for only 3 markers. Now we will assume that we have run the following (similar) command for all 17244 markers in the data set, producing a.o. the file B_scores.RData:

saveMarkerModels(ploidy=4,
                 data=datAX,
                 pop.parents=parents,
                 population=pop,
                 p.threshold=0.9,
                 filePrefix="B",
                 rdaFiles=TRUE,
                 ncores=1)

We load this file:

load("B_scores.RData")

Missing data

Some markers are of low quality and cannot be fitted or are rejected by fitPoly (based on several user-settable thresholds). For markers that are not rejected, some samples may be assigned a missing geno (dosage) value if the probability of the most probable dosage is below p.threshold.

We take a look at the number of missing values in the geno column per marker:

NAcounts <- tapply(scores$geno, INDEX=scores$MarkerName, 
                   FUN=function(x) sum(is.na(x)))
NAtab <- table(NAcounts)
head(NAtab)
## NAcounts
##    0    1    2    3    4    5 
## 2604 2046 1461 1023  698  594
tail(NAtab)
## NAcounts
##   66   67   68   69   70  176 
##    5    5    4    4    4 5305

The number of missing data per marker varies from 0 to 70, or is 176. In the last case all samples have a missing value: these are the non-fitted and rejected markers (5305 of the original 17244). The 11939 remaining markers have up to 70 (of 176) missing data.

Segregation types of all markers and quality checks

The dosage scores produced by fitPoly (or by other software) may contain errors, because the marker assay does not work well on the genotyped material or because the software arrives at a sub-optimal fit. Here we make use of a full-sib (F1) population to identify unreliable markers and/or to correct the scores where an incorrect model was used. If additional samples were genotyped in addition to the FS family, the filtering of unreliable markers and the corrections can be applied to them as well.

The initial check for segregation type is carried out by function checkF1:

par1 <- pop$SampleName[pop$population=="P1_K5"]
par2 <- pop$SampleName[pop$population=="P2_K5"]
F1 <- pop$SampleName[pop$population=="F1_K5"]
# this takes about one minute:
chk1 <- checkF1(scores, 
                parent1=par1, parent2=par2, F1=F1,
                polysomic=TRUE, disomic=TRUE, mixed=FALSE, 
                ploidy=4, outfile="chk1.dat")
# parts of the result:
knitr::kable(chk1[1:4,1:14])
m MarkerName parent1 parent2 F1_0 F1_1 F1_2 F1_3 F1_4 F1_NA P540a1 P867a1 P867a2 P867b
1 D10190_215P 2 3 0 20 65 77 10 0 2 3 3 3
2 D10190_215Q NA NA 0 0 0 0 0 172 NA NA NA NA
3 D10234_403P 0 1 80 89 3 0 0 0 0 1 1 1
4 D10234_403Q 0 1 79 89 0 0 0 4 0 1 1 1
knitr::kable(chk1[1:4,c(1:2, 19:22, 26)])
m MarkerName bestParentfit frqInvalid_bestParentfit Pvalue_bestParentfit matchParent_bestParentfit qall_mult
1 D10190_215P 1551_1 0.0000 0.2064 Yes 1.0000
2 D10190_215Q NA NA NA
3 D10234_403P 11_0 0.0174 0.4887 Yes 0.9826
4 D10234_403Q 11_0 0.0000 0.4404 Yes 0.9419

This is a Rose dataset and we expect that both polysomic and disomic inheritance may occur. By setting both corresponding parameters to TRUE we consider both sets of segregation types. The result is a data.frame with one row for each marker. The first columns list the marker name, the inferred parental dosages, the observed F1 segregation and the observed parental sample dosages. The bestParentfit column lists the selected segregation type and the next 3 columns list 3 parameters describing how well this segregation fits with the observations. The qall_mult column is an overall quality assessment, from poor (0) to perfect (1). There are more columns in chk1, and with other parameter settings even more columns will be produced; for a full explanation refer to the help. The data.frame chk1 is saved to file chk1.dat, which is a tab-separated file that can be viewed in Excel.

The segregation codes need some explanation. They consist of 2 parts: the first part has the segregation ratios (e.g. 11 means 1:1, 141 means 1:4:1 and so on; a single 1 means no segregation, all F1 have the same dosage). An underscore separates the two parts and the second part is a single number specifying the lowest dosage occurring in the segregation. For example: 121_0 means 1 nulliplex : 2 simplex : 2 duplex; 11_3 means 1 triplex : 1 quadruplex. There is one tetrasomic segregation type 1:8:18:8:1 where one of the ratios is larger than 9 and cannot be represented by a digit; in this and similar cases we use letters A-Z to represent ratios 10-35, so this segregation is represented as 18I81_0, with the I meaning 18. For higher ploidy levels this system is extended, see the help for a full explanation.

“Shifted” markers

When not all dosages are present among the samples it is possible that fitPoly assigns the peaks in the signal ratio histogram to the wrong dosages. For example, if parents in reality have dosages 0 and 1, also the F1 progeny will (almost) only have dosages 0 or 1. If no other samples are analyzed, only two of the five dosages are present, and these might be mis-scored as being dosages 1 and 2 instead of 0 and 1. These mis-assignments are called “shifts”. fitPoly checks against possible shifts whether or not parents and F1 populations are defined but it doesn’t catch them all.

In package fitPolyTools a function correctDosages is provided that, based on the result of checkF1, checks if shifting the dosages up or down by 1 would be worth to try. The result is a set of suggested shifts, that should then again be checked (using checkF1) for their actual fit.

cordos <- correctDosages(chk1, scores, par1, par2, ploidy=4,
                        polysomic=TRUE, disomic=TRUE, mixed=FALSE)
# show the first few nonzero shifts:
knitr::kable(head(cordos[cordos$shift!=0,]))
MarkerName segtype parent1 parent2 shift fracNotOk ParNA
329 G10701_1435P 11_2 3 NA 1 0 1
492 G12049_166Q 11_1 1 NA -1 0 1
504 G1217_1009Q 11_2 2 NA 1 0 1
568 G12636_1704Q 11_2 3 NA 1 0 1
632 G13490_151Q 11_2 3 2 1 0 0
673 G13900_549P 11_2 2 3 1 0 0

Column shift contains the suggested shifts to try: a 0 means no shift suggested, a 1 means: shift all dosages assigned by fitPoly up (0 becomes 1, 1 becomes 2, 2 becomes 3, 3 and 4 become 4 (3 and 4 are merged)). A shift of -1 means that all dosages should be decreased by 1 (merging previous dosages 0 and 1 into dosage 0).

This is a rather simplistic way to check for possibly shifted markers. If there are more samples available than only the one F1 population and its parents, those might also be informative about possible shifts. Also, sometimes both the unshifted and the shifted version of the marker might be acceptable. In these cases it is best to keep both versions and decide later on (after mapping or haplotyping steps) which version fits best.

The next step is to run checkF1 again with as additional parameter these shifts; it will then perform the indicated dosage shift on all samples (parental and F1) before selecting the most likely segregation type and quality assessment:

#select the markers where a shift should be tried:
cordos <- cordos[cordos$shift != 0,] 
subscores <- scores[scores$MarkerName %in% cordos$MarkerName,]
chk2 <- checkF1(subscores, par1, par2, F1,
                polysomic=TRUE, disomic=TRUE, mixed=FALSE, 
                ploidy=4, outfile=NA, shiftmarkers=cordos)

Data.frame chk2 has the same format as chk1 but has an additional column “shift” at the end. For the next steps it is useful to merge chk1 and chk2. In order to do that, chk1 needs also to get this extra column:

chk1$shift <- 0
chk <- rbind(chk1, chk2)
chk <- chk[order(chk$MarkerName),]
writeDatfile(chk, file="chk.dat")

Select acceptable markers

Data.frame chk now has all unshifted and shifted versions of the markers. Each of the (versions of the) markers has been checked for its quality, which is
summarized in its value of qall_mult. We would like to select only those (versions of) markers that are of sufficient quality. As a rule of thumb, all scores of markers with qall_mult==0 are bad and all others might be acceptable, but perhaps we can raise the threshold for qall_mult a bit for a more stringent selection, without losing valuable markers. The approach is to study XY-scatterplots for several markers at different levels of qall_mult and see from which level good (well-scored) markers appear.

In order to produce scatterplots with dots colored according to assigned dosages we first combine the X and Y signals with the geno (dosage) values:

XYgeno <- combineFiles(XYdata=datAX, scores=scores)
# define qall levels of 0, 0.05, 0.10 , .. , 1 where we want to inspect some SNPs:
qall.levels <- seq(0, 1, by=0.05)
# select six SNPs with qall values near each of these values and draw their plots:
chkx <- selMarkers_qall(chk,  qall.levels, mrkperlevel=6)
drawXYplots(dat=XYgeno, markers=chkx,
            out="XYgeno",
            genocol=get.genocol(ploidy=4), sample.groups=list(par1, par2),
            groups.col=c("red", "blue"), ploidy=4)

This generates a series of 21 pages, each with 6 plots representing markers with a qall_mult score at a certain level.

The drawXYplots command here is quite complex. A full explanation of all parameters can be found in the help. Some things to note here: genocol sets the colors assigned to each of the dosages, from red (nulliplex) through blue (duplex) to green (quadruplex), and grey for samples with missing dosage score. The parents are highlighted in red and blue via parameters sample.groups and groups.col. A conclusion from this series of plots is that already at qall_mult==0.05 most of the markers appear to be well-scored. A next step could be to try to find a qall_mult threshold between 0 and 0.05 to separate the well-scored and badly scored markers. For now we will accept all markers with qall > 0. XY-plots page 21

#select only the markers with qall_mult > 0:
chk <- chk[!is.na(chk$qall_mult) & chk$qall_mult > 0,]

However, of many SNPs we now have more than one “version”. For instance:

knitr::kable(chk[substr(chk$MarkerName,1,9)=="K14264_98", 
                 c("MarkerName", "parent1", "parent2", "bestParentfit", "qall_mult", "shift")])
MarkerName parent1 parent2 bestParentfit qall_mult shift
6358 K14264_98P 2 3 11_2 0.5462 0
17241 K14264_98P 3 4 11_3 0.5462 1
6359 K14264_98Q 2 3 11_2 0.5008 0
17242 K14264_98Q 3 4 11_3 0.5008 1

Here we have 4 “versions” of the same SNP. The P or Q after the SNP name indicate which probe was used to target the SNP: Axiom arrays need a probe on only one side of the SNP, and on this particular Rose array all SNPs were targeted with two probes (on the two opposite strands). Since the two probes detect the same SNP the scores should be mostly similar. Further, for this SNP the nonshifted and shifted versions were both acceptable.

The following functions check for each SNP which versions have passed the quality check (in our case, qall_mult > 0), and see if it is possible to merge the two probes. In order to merge, there should be only few scores over the entire FS family that conflict between the scores. If this is the case, the merging is performed, with missing values from one probe being filled in from the other probe.

The merged version get an R appended, and the second of the two functions removes the P and Q versions that were used to generate the R version. Also, n’s or s’s are appended to indicate if the nonshifted or the shifted version of the probes was used.

cpp <- compareProbes(chk, scores, parent1=par1, parent2=par2, F1=F1,
                     polysomic=TRUE, disomic=TRUE, mixed=FALSE,
                     ploidy=4, compfile=NA, combscorefile=NA)
## batch 1 of 33 
## batch 2 of 33 
## batch 3 of 33 
## batch 4 of 33 
## batch 5 of 33 
## batch 6 of 33 
## batch 7 of 33 
## batch 8 of 33 
## batch 9 of 33 
## batch 10 of 33 
## batch 11 of 33 
## batch 12 of 33 
## batch 13 of 33 
## batch 14 of 33 
## batch 15 of 33 
## batch 16 of 33 
## batch 17 of 33 
## batch 18 of 33 
## batch 19 of 33 
## batch 20 of 33 
## batch 21 of 33 
## batch 22 of 33 
## batch 23 of 33 
## batch 24 of 33 
## batch 25 of 33 
## batch 26 of 33 
## batch 27 of 33 
## batch 28 of 33 
## batch 29 of 33 
## batch 30 of 33 
## batch 31 of 33 
## batch 32 of 33 
## batch 33 of 33
knitr::kable(cpp$compstat[1:6,])
SNPname segtypePn qallPn segtypePs qallPs segtypeQn qallQn segtypeQs qallQs segtypeRnn segtypeRns segtypeRsn segtypeRss countP countQ countR
D10190_215 1551_1 1.0000 NA NA NA NA NA NA NA NA NA NA 1 0 0
D10234_403 11_0 0.9826 NA NA 11_0 0.9419 NA NA 11_0 NA NA NA 1 1 1
D10262_172 NA NA NA NA 11_3 0.4971 NA NA NA NA NA NA 0 1 0
D10433_789 NA NA NA NA 11_3 0.9273 NA NA NA NA NA NA 0 1 0
D11322_170 11_3 0.9018 NA NA 11_3 0.9363 NA NA 11_3 NA NA NA 1 1 1
D11323_451 11_0 1.0000 NA NA NA NA NA NA NA NA NA NA 1 0 0
rr <- removeRedundant(compstat=cpp$compstat, combscores=cpp$combscores,
                      compfile=NA, combscorefile="rose_axiom_7195_mrk.dat")
knitr::kable(rr$combscores[1:6, 1:12])
MarkerName segtype P540a1 P867a1 P867a2 P867b parent1 parent2 K001 K002 K003 K004
1 D10190_215Pn 1551_1 2 3 3 3 2 3 1 2 1 2
4 D10234_403Rnn 11_0 0 1 1 1 0 1 0 0 0 1
5 D10262_172Qn 11_3 4 3 2 3 4 3 2 4 4 4
6 D10433_789Qn 11_3 3 4 4 4 3 4 4 3 3 4
9 D11322_170Rnn 11_3 3 4 4 4 3 4 4 4 3 3
10 D11323_451Pn 11_0 1 0 0 0 1 0 0 1 0 1

Both these functions return a list with two components: compstat and combscores, each a data.frame. These are also saved as tab-separated text files if filenames are specified for compfile and combscorefile. Data.frame compstat has an overview of the different versions of each SNP (two probes, each unshifted and/or shifted). For each of these (suffixed with P or Q for the probe, and with n or s for (non-)shifted) the segregation type and qall_mult quality score is listed. If (a version of) the P and the Q probe are sufficiently similar, a combined version of the SNP is produced which gets the suffix R, and a further suffix nn, ns, sn or ss depending on whether the nonshifted or the shifted version of the P and Q probe was used for merging. Also for such a merged (R) marker the segregation type is given, and finally the number of versions of the P, Q and R markers.
Two probes are considered sufficiently similar to merge if they have the same segregation type and if there are (by default) not more than 4% conflicting sample scores. For more details and additional parameters see the help.

The other component of the result of these functions is data.frame combscores. This lists for each remaining marker the segregation type and dosages. We save the combscores result of removeRedundant in file rose_axiom_7195_mrk.dat, which will be starting point for the workshop on linkage mapping with polymapR.

If there would be additional samples, not belonging to the FS family or its parents, they could benefit from the same shift corrections and merging of probes as applied to the family, by specifying them using parameter ‘other’ in the call to compareProbes. Their dosages would then appear in the combscores in columns after the FS family.

Summary and Conclusion

Dosage scoring using the fitPoly package requires that the data are reformatted from the output of the array-specific software. The supporting package fitPolyTools has functions that perform this reformatting for Axiom and for Illumina arrays. It also makes it easy to rename the markers samples and samples from the codes used in the array output to your own names.

The dosage calling is never error-free. One common type of error is that all dosages are scored one too high or too low, a so-called shift. Functions in package fitPolyTools can recognize badly scored markers and correct this shift, provided there is a FS family among the samples. It also has functions to compare and merge the results from two probes targeting the same SNP, which is possible on Axiom arrays.

These and other functionalities are further explained in the two vignettes of package fitPolyTools. More details on the functions in fitPoly and fitPolyTools can be found in the help manuals.