This workshop covers the process of obtaining dosage scores from SNP array data. This process consists of three steps: (1) pre-processing, including importing and reformatting the SNP array data and possibly filtering; (2) the actual dosage scoring, and (3) post-processing: identification and correction of incorrectly scored markers, comparison and combination of “versions” of markers (explained below), and exporting the acceptable markers in a format readable by downstream software.
For these three stages we will use two packages: fitPoly (available from CRAN) and fitPolyTools (unpublished, part of the Materials for this workshop). The actual dosage scoring is done with fitPoly, while the pre- and post-processing is done with functions in fitPolyTools.
Create a new folder and copy the following files there: fitPolyTools_1.1.1.tar.gz, AxiomGT1.summary.txt, WagRhSNPAX_Sample_Table.csv, CsvAnnotationFile.v1.txt and B_scores.RData, and set the working directory to this folder:
# set the working directory to the folder where you copied the materials. # In R, the separator in file paths is "/", not "\" as in Windows. # For example: setwd("c:/SCRI/workshops/fitPoly")
If the packages were not yet installed, do so now:
# install package fitPoly from CRAN: install.packages("fitPoly") # you may be asked for a CRAN mirror server at this point; choose any one from the list # install package fitPolyTools from the local file: install.packages("fitPolyTools_1.1.1.tar.gz") # if this fails, the following will most likely work: # 1. install the devtools package from CRAN if that was not done earlier: # install.packages("devtools") # 2. then install the fitPolyTools package using the install_local function # of devtools: # devtools::install_local("fitPolyTools_1.1.1.tar.gz")
Now the packages are installed, but the functions in them are not yet accessible. For that we need the library command:
## Warning: package 'fitPoly' was built under R version 3.6.3
A final word about the code examples: several times the function knitr::kable() is used for pretty-printing of tables. If you don’t have the knitr package installed (which is not required for the data processing) you can just type the function argument:
# example: function knitr::kable is not required # instead of knitr::kable(head(datAX)) # you can also type head(datAX)
In this workshop we work with tetraploid rose data from an Axiom array. The fitPolyTools package also provides functions to work with Illumina arrays, which are demonstrated in the package vignettes.
The raw data from Axiom arrays are processed with the Affymetrix Power Tools (APT) or the Genotyping Console software. After initial quality control steps all passing samples and all SNPs are processed by the program “apt probeset genotype”. If the option –summaries is included (as is normally the case) a file AxiomGT1.summary.txt file is produced. Usually this file is created by the service provider.
Our work in this workshops starts with this AxiomGT1.summary.txt file. It is a tab-separated text file with the markers in rows and the samples in columns. We have one column per sample and two rows, A and B, for each marker. The file starts with several hundred lines of text, and the last few thousand rows contain quality control (DQC) “pseudomarkers”. Also the sample names all have “.CEL” appended to them.
# show part of the Affymetrix AxiomGT1.summary.txt file: dat <- readDatfile("AxiomGT1.summary.txt", comment.char="#", check.names=FALSE, nrows=6) dat[1:6, 1:4]
## probeset_id a550450-4187610-051814-378_A06.CEL a550450-4187610-051814-379_A01.CEL ## 1 AX-86752744-A 375.0776 310.0749 ## 2 AX-86752744-B 239.8847 393.0592 ## 3 AX-86752753-A 530.1507 413.8058 ## 4 AX-86752753-B 693.3399 296.1384 ## 5 AX-86752767-A 2103.9311 1952.4893 ## 6 AX-86752767-B 1206.9029 419.4540 ## a550450-4187610-051814-379_A02.CEL ## 1 365.9014 ## 2 204.9692 ## 3 563.3449 ## 4 198.8771 ## 5 2117.5882 ## 6 435.1182
(Function readDatfile in fitPolyTools is a wrapper for the standard R function read.table that sets some required parameters for tab-separated files automatically.)
We provide a demo file with 17244 SNP markers (much less than on the average Axiom array) and 176 samples. We will import this file, clean the sample names and convert them from “wide” format (all samples side by side) to “long” format (all samples under each other), with columns MarkerName, SampleName, X, Y, R, ratio. X and Y are the signal intensities of the two alleles, R = X+Y ( the total signal intensity), and ratio = Y/R: the fraction of the Y signal in the total signal.
# import the AxiomGT1.summary.txt file (takes about 15 sec): datAX <- readAxiomSummary(AXdata="AxiomGT1.summary.txt", out=NA) knitr::kable(head(datAX))
The reformatted data are now in data.frame datAX. As you can see, the -A and -B that were appended to the probeset-id, and the .CEL that was appended to the sample names were removed.
Usually the marker names and/or the sample names in the data files produced by the service provider are different from the original names. Therefore we need files specifying the translation between the two sets of names. In this exercise we will convert the marker and sample names of the datAX data set to ones that are meaningful to us.
First we rename the markers. The translation is specified by file “CsvAnnotationFile.v1.txt”. This file is often supplied by the service provider and has a fixed format, starting with some comment lines, then some data on each probe (including the Probe.Set.ID and the customer_id which we will use here). It usually ends with several thousand lines that are used for quality control, which can be ignored here. Since the markers on the array are a fixed set these annotation files don’t change and can be assumed to be correct (i.e. all marker names in the array also occur in the annotation file and v.v., and all customer_id codes are unique). Therefore we don’t need to check this and can perform the translation directly:
# read the marker annotation file: mrktable <- read.csv("CsvAnnotationFile.v1.txt", comment.char="#") head(mrktable)
## Probe.Set.ID Affy.SNP.ID Allele.A Allele.B customer_id ## 1 AX-86752845 Affx-86841042 T C G75957_739P ## 2 AX-86752898 Affx-86843748 A C M5639_684P ## 3 AX-86753237 Affx-86843020 T C M81_2578P ## 4 AX-86753271 Affx-86842379 T G G48944_897P ## 5 AX-86753282 Affx-86838393 T C M4506_206P ## 6 AX-86753334 Affx-86842938 A G M8839_565P
# replace the original marker names in datAX by our own "customer_id": datAX$MarkerName <- mrktable$customer_id[match(datAX$MarkerName, mrktable$Probe.Set.ID)]
Next we rename the samples. The samples are different between projects using the same array, and sample annotations are usually made by hand, in different formats, and cannot be assumed to be error-free. We perform some checks and rename the samples with function splitNrenameSamples. This function can also be used to split the dataset into parts with samples of different ploidy, but that is not the case here.
smptable <- read.csv("WagRhSNPAX_Sample_Table.csv", stringsAsFactors=FALSE) head(smptable)
## SampleName material BestArray ## 1 P540a1 P1_K5 a550450-4187610-051814-379_A01 ## 2 P867a1 P2_K5 a550450-4187610-051814-379_A02 ## 3 P867a2 P2_K5 a550450-4187610-051814-378_A06 ## 4 P867b P2_K5 a550450-4187610-051814-380_A04 ## 5 K001 F1_K5 a550450-4187610-051814-379_A03 ## 6 K002 F1_K5 a550450-4187610-051814-379_A04
# no split, return the original data.frame with substituted SampleNames: datAX <- splitNrenameSamples(dat=datAX, sampletable=smptable, SampleID="BestArray", CustomerID="SampleName", out="datAX") head(datAX)
## MarkerName SampleName X Y R ratio ## 1 G44849_319P P867a2 375.0776 239.8847 614.9623 0.3900803 ## 2 G44849_319P P540a1 310.0749 393.0592 703.1341 0.5590103 ## 3 G44849_319P P867a1 365.9014 204.9692 570.8707 0.3590467 ## 4 G44849_319P K001 422.0593 393.9801 816.0394 0.4827954 ## 5 G44849_319P K002 414.6733 273.4405 688.1138 0.3973769 ## 6 G44849_319P K003 473.2871 342.9835 816.2706 0.4201835
The result is also saved in a file datAX.RData in a compressed, R-specific format.
For this workshop we will keep all data. The package contains tools to filter markers, samples, and datapoints within markers based on signal intensities and these are demonstrated in the package vignettes, but we will skip these steps here.
The actual dosage calling is done using functions in package fitPoly, which is an extension of package fitTetra (Voorrips et al, 2011). The original package was specific for tetraploids and treated all samples as belonging to one population, where it tried to fit mixture models reflecting either Hardy-Weinberg segregation ratios or unrestricted ratios. The new fitPoly package works for any ploidy level, it allows to define multiple populations with different segregations, and it adds F1 segregation ratios to the HW and the unrestricted ratios.
Here we use a function called (for historical reasons) “saveMarkerModels”. This function is a wrapper function: with one call an entire set of markers can be genotyped and the results are saved to files.
We have one F1 population here which is defined as follows:
# the parents data.frame defines the population structure: parents <- data.frame( pop=c("F1_K5", "P1_K5", "P2_K5"), parent1=c("P1_K5", NA, NA), parent2=c("P2_K5", NA, NA), stringsAsFactors=FALSE) knitr::kable(parents)
# the pop data.frame defines which sample belongs to which population; # we extract it from smptable: pop <- smptable[, 1:2] names(pop) <- c("SampleName", "population") knitr::kable(pop[1:6,])
The dosage scoring takes too much time to run in this workshop. Several hundred markers per hour are scored per computing core, but still with more than 17000 markers in the data set this would take too long. However we can run the first 3 to give the idea:
saveMarkerModels(ploidy=4, markers=1:3, #for this demo data=datAX, pop.parents=parents, population=pop, p.threshold=0.9, filePrefix="A", rdaFiles=TRUE, plot="fitted", ncores=1)
## saveMarkerModels: batchsize = 3
This function produces several files. The most interesting one here is file A_scores.RData which contains a data.frame with for each marker (in this case only the first 3 markers) and each sample the assigned dosage, as well as some further data:
The important columns here are MarkerName, SampleName and geno. Geno is the assigned dosage; it is the most probable dosage if the probability of that dosage is above the p.threshold, and otherwise it is NA (missing).
In this example also some plot files are generated in a folder A_plots in the current working directory. The plots for marker 1 and 3 show a fitted model for a duplex x triplex marker (segregation 1:5:5:1) and a nulliplex x simplex marker (segregation 1:1). Marker 2 is rejected because samples are all in the same peak, while the default peak.threshold is 0.85 (max. 85% in the same peak allowed; this threshold exists because for a reliable model fit data of at least two peaks should be available).
For further information on the input and output of fitPoly functions see the help in that package.
In the previous section we ran fitPoly for only 3 markers. Now we will assume that we have run the following (similar) command for all 17244 markers in the data set, producing a.o. the file B_scores.RData:
saveMarkerModels(ploidy=4, data=datAX, pop.parents=parents, population=pop, p.threshold=0.9, filePrefix="B", rdaFiles=TRUE, ncores=1)
We load this file:
Some markers are of low quality and cannot be fitted or are rejected by fitPoly (based on several user-settable thresholds). For markers that are not rejected, some samples may be assigned a missing geno (dosage) value if the probability of the most probable dosage is below p.threshold.
We take a look at the number of missing values in the geno column per marker:
NAcounts <- tapply(scores$geno, INDEX=scores$MarkerName, FUN=function(x) sum(is.na(x))) NAtab <- table(NAcounts) head(NAtab)
## NAcounts ## 0 1 2 3 4 5 ## 2604 2046 1461 1023 698 594
## NAcounts ## 66 67 68 69 70 176 ## 5 5 4 4 4 5305
The number of missing data per marker varies from 0 to 70, or is 176. In the last case all samples have a missing value: these are the non-fitted and rejected markers (5305 of the original 17244). The 11939 remaining markers have up to 70 (of 176) missing data.
The dosage scores produced by fitPoly (or by other software) may contain errors, because the marker assay does not work well on the genotyped material or because the software arrives at a sub-optimal fit. Here we make use of a full-sib (F1) population to identify unreliable markers and/or to correct the scores where an incorrect model was used. If additional samples were genotyped in addition to the FS family, the filtering of unreliable markers and the corrections can be applied to them as well.
The initial check for segregation type is carried out by function checkF1:
par1 <- pop$SampleName[pop$population=="P1_K5"] par2 <- pop$SampleName[pop$population=="P2_K5"] F1 <- pop$SampleName[pop$population=="F1_K5"] # this takes about one minute: chk1 <- checkF1(scores, parent1=par1, parent2=par2, F1=F1, polysomic=TRUE, disomic=TRUE, mixed=FALSE, ploidy=4, outfile="chk1.dat") # parts of the result: knitr::kable(chk1[1:4,1:14])
knitr::kable(chk1[1:4,c(1:2, 19:22, 26)])
This is a Rose dataset and we expect that both polysomic and disomic inheritance may occur. By setting both corresponding parameters to TRUE we consider both sets of segregation types. The result is a data.frame with one row for each marker. The first columns list the marker name, the inferred parental dosages, the observed F1 segregation and the observed parental sample dosages. The bestParentfit column lists the selected segregation type and the next 3 columns list 3 parameters describing how well this segregation fits with the observations. The qall_mult column is an overall quality assessment, from poor (0) to perfect (1). There are more columns in chk1, and with other parameter settings even more columns will be produced; for a full explanation refer to the help. The data.frame chk1 is saved to file chk1.dat, which is a tab-separated file that can be viewed in Excel.
The segregation codes need some explanation. They consist of 2 parts: the first part has the segregation ratios (e.g. 11 means 1:1, 141 means 1:4:1 and so on; a single 1 means no segregation, all F1 have the same dosage). An underscore separates the two parts and the second part is a single number specifying the lowest dosage occurring in the segregation. For example: 121_0 means 1 nulliplex : 2 simplex : 2 duplex; 11_3 means 1 triplex : 1 quadruplex. There is one tetrasomic segregation type 1:8:18:8:1 where one of the ratios is larger than 9 and cannot be represented by a digit; in this and similar cases we use letters A-Z to represent ratios 10-35, so this segregation is represented as 18I81_0, with the I meaning 18. For higher ploidy levels this system is extended, see the help for a full explanation.
When not all dosages are present among the samples it is possible that fitPoly assigns the peaks in the signal ratio histogram to the wrong dosages. For example, if parents in reality have dosages 0 and 1, also the F1 progeny will (almost) only have dosages 0 or 1. If no other samples are analyzed, only two of the five dosages are present, and these might be mis-scored as being dosages 1 and 2 instead of 0 and 1. These mis-assignments are called “shifts”. fitPoly checks against possible shifts whether or not parents and F1 populations are defined but it doesn’t catch them all.
In package fitPolyTools a function correctDosages is provided that, based on the result of checkF1, checks if shifting the dosages up or down by 1 would be worth to try. The result is a set of suggested shifts, that should then again be checked (using checkF1) for their actual fit.
cordos <- correctDosages(chk1, scores, par1, par2, ploidy=4, polysomic=TRUE, disomic=TRUE, mixed=FALSE) # show the first few nonzero shifts: knitr::kable(head(cordos[cordos$shift!=0,]))
Column shift contains the suggested shifts to try: a 0 means no shift suggested, a 1 means: shift all dosages assigned by fitPoly up (0 becomes 1, 1 becomes 2, 2 becomes 3, 3 and 4 become 4 (3 and 4 are merged)). A shift of -1 means that all dosages should be decreased by 1 (merging previous dosages 0 and 1 into dosage 0).
This is a rather simplistic way to check for possibly shifted markers. If there are more samples available than only the one F1 population and its parents, those might also be informative about possible shifts. Also, sometimes both the unshifted and the shifted version of the marker might be acceptable. In these cases it is best to keep both versions and decide later on (after mapping or haplotyping steps) which version fits best.
The next step is to run checkF1 again with as additional parameter these shifts; it will then perform the indicated dosage shift on all samples (parental and F1) before selecting the most likely segregation type and quality assessment:
#select the markers where a shift should be tried: cordos <- cordos[cordos$shift != 0,] subscores <- scores[scores$MarkerName %in% cordos$MarkerName,] chk2 <- checkF1(subscores, par1, par2, F1, polysomic=TRUE, disomic=TRUE, mixed=FALSE, ploidy=4, outfile=NA, shiftmarkers=cordos)
Data.frame chk2 has the same format as chk1 but has an additional column “shift” at the end. For the next steps it is useful to merge chk1 and chk2. In order to do that, chk1 needs also to get this extra column:
chk1$shift <- 0 chk <- rbind(chk1, chk2) chk <- chk[order(chk$MarkerName),] writeDatfile(chk, file="chk.dat")
Data.frame chk now has all unshifted and shifted versions of the markers. Each of the (versions of the) markers has been checked for its quality, which is
summarized in its value of qall_mult. We would like to select only those (versions of) markers that are of sufficient quality. As a rule of thumb, all scores of markers with qall_mult==0 are bad and all others might be acceptable, but perhaps we can raise the threshold for qall_mult a bit for a more stringent selection, without losing valuable markers. The approach is to study XY-scatterplots for several markers at different levels of qall_mult and see from which level good (well-scored) markers appear.
In order to produce scatterplots with dots colored according to assigned dosages we first combine the X and Y signals with the geno (dosage) values:
XYgeno <- combineFiles(XYdata=datAX, scores=scores) # define qall levels of 0, 0.05, 0.10 , .. , 1 where we want to inspect some SNPs: qall.levels <- seq(0, 1, by=0.05) # select six SNPs with qall values near each of these values and draw their plots: chkx <- selMarkers_qall(chk, qall.levels, mrkperlevel=6) drawXYplots(dat=XYgeno, markers=chkx, out="XYgeno", genocol=get.genocol(ploidy=4), sample.groups=list(par1, par2), groups.col=c("red", "blue"), ploidy=4)
This generates a series of 21 pages, each with 6 plots representing markers with a qall_mult score at a certain level.
The drawXYplots command here is quite complex. A full explanation of all parameters can be found in the help. Some things to note here: genocol sets the colors assigned to each of the dosages, from red (nulliplex) through blue (duplex) to green (quadruplex), and grey for samples with missing dosage score. The parents are highlighted in red and blue via parameters sample.groups and groups.col. A conclusion from this series of plots is that already at qall_mult==0.05 most of the markers appear to be well-scored. A next step could be to try to find a qall_mult threshold between 0 and 0.05 to separate the well-scored and badly scored markers. For now we will accept all markers with qall > 0.
#select only the markers with qall_mult > 0: chk <- chk[!is.na(chk$qall_mult) & chk$qall_mult > 0,]
However, of many SNPs we now have more than one “version”. For instance:
knitr::kable(chk[substr(chk$MarkerName,1,9)=="K14264_98", c("MarkerName", "parent1", "parent2", "bestParentfit", "qall_mult", "shift")])
Here we have 4 “versions” of the same SNP. The P or Q after the SNP name indicate which probe was used to target the SNP: Axiom arrays need a probe on only one side of the SNP, and on this particular Rose array all SNPs were targeted with two probes (on the two opposite strands). Since the two probes detect the same SNP the scores should be mostly similar. Further, for this SNP the nonshifted and shifted versions were both acceptable.
The following functions check for each SNP which versions have passed the quality check (in our case, qall_mult > 0), and see if it is possible to merge the two probes. In order to merge, there should be only few scores over the entire FS family that conflict between the scores. If this is the case, the merging is performed, with missing values from one probe being filled in from the other probe.
The merged version get an R appended, and the second of the two functions removes the P and Q versions that were used to generate the R version. Also, n’s or s’s are appended to indicate if the nonshifted or the shifted version of the probes was used.
cpp <- compareProbes(chk, scores, parent1=par1, parent2=par2, F1=F1, polysomic=TRUE, disomic=TRUE, mixed=FALSE, ploidy=4, compfile=NA, combscorefile=NA)
## batch 1 of 33 ## batch 2 of 33 ## batch 3 of 33 ## batch 4 of 33 ## batch 5 of 33 ## batch 6 of 33 ## batch 7 of 33 ## batch 8 of 33 ## batch 9 of 33 ## batch 10 of 33 ## batch 11 of 33 ## batch 12 of 33 ## batch 13 of 33 ## batch 14 of 33 ## batch 15 of 33 ## batch 16 of 33 ## batch 17 of 33 ## batch 18 of 33 ## batch 19 of 33 ## batch 20 of 33 ## batch 21 of 33 ## batch 22 of 33 ## batch 23 of 33 ## batch 24 of 33 ## batch 25 of 33 ## batch 26 of 33 ## batch 27 of 33 ## batch 28 of 33 ## batch 29 of 33 ## batch 30 of 33 ## batch 31 of 33 ## batch 32 of 33 ## batch 33 of 33
rr <- removeRedundant(compstat=cpp$compstat, combscores=cpp$combscores, compfile=NA, combscorefile="rose_axiom_7195_mrk.dat") knitr::kable(rr$combscores[1:6, 1:12])
Both these functions return a list with two components: compstat and combscores, each a data.frame. These are also saved as tab-separated text files if filenames are specified for compfile and combscorefile. Data.frame compstat has an overview of the different versions of each SNP (two probes, each unshifted and/or shifted). For each of these (suffixed with P or Q for the probe, and with n or s for (non-)shifted) the segregation type and qall_mult quality score is listed. If (a version of) the P and the Q probe are sufficiently similar, a combined version of the SNP is produced which gets the suffix R, and a further suffix nn, ns, sn or ss depending on whether the nonshifted or the shifted version of the P and Q probe was used for merging. Also for such a merged (R) marker the segregation type is given, and finally the number of versions of the P, Q and R markers.
Two probes are considered sufficiently similar to merge if they have the same segregation type and if there are (by default) not more than 4% conflicting sample scores. For more details and additional parameters see the help.
The other component of the result of these functions is data.frame combscores. This lists for each remaining marker the segregation type and dosages. We save the combscores result of removeRedundant in file rose_axiom_7195_mrk.dat, which will be starting point for the workshop on linkage mapping with polymapR.
If there would be additional samples, not belonging to the FS family or its parents, they could benefit from the same shift corrections and merging of probes as applied to the family, by specifying them using parameter ‘other’ in the call to compareProbes. Their dosages would then appear in the combscores in columns after the FS family.
Dosage scoring using the fitPoly package requires that the data are reformatted from the output of the array-specific software. The supporting package fitPolyTools has functions that perform this reformatting for Axiom and for Illumina arrays. It also makes it easy to rename the markers samples and samples from the codes used in the array output to your own names.
The dosage calling is never error-free. One common type of error is that all dosages are scored one too high or too low, a so-called shift. Functions in package fitPolyTools can recognize badly scored markers and correct this shift, provided there is a FS family among the samples. It also has functions to compare and merge the results from two probes targeting the same SNP, which is possible on Axiom arrays.
These and other functionalities are further explained in the two vignettes of package fitPolyTools. More details on the functions in fitPoly and fitPolyTools can be found in the help manuals.