FitPoly contains several functions to import from SNP array software, perform selections based on signal intensity and prepare input data for genotyping. The package offers more functions than shown here, and the functions we use have additional parameters; check the help for extra information.
Load the package and set the working directory to the data directory:
We cover here the cases for Illumina Infinium and GoldenGate arrays and for Affymetrix Axiom arrays. The aim of this step is to get the data into a uniform format, such that the further steps are not dependent on the source of the data (array platform).
The data from Illumina arrays are processed with Illumina’s
GenomeStudio software. This software can export a
Full Data Table.txt
file, which is a tab-separated text
file (i.e. columns in this file are separated by Tab stops). In this
file format every row corresponds to a marker. The first columns contain
general data for the marker including its name; next for each sample
there are a few columns. For our purposes we need the columns X and Y
for each sample. These are not exported by default; in order to export
them, within GenomeStudio we use the Column Chooser button to include
them in the output. All other sample columns besides X and Y are not
needed but are tolerated; however since they increase the file size and
memory load it is better to omit them when exporting from
GenomeStudio.
# show part of a GenomeStudio FullDataTable file (only some relevant columns):
fname <- system.file("extdata", "FullDataTable.txt", package="fitPoly")
dat <- readDatfile(fname, check.names=FALSE)
dat[1:3, c(2, 15,16, 21,22, 27,28)]
## Name 101765-0001.X 101765-0001.Y 101765-0002.X 101765-0002.Y 101765-0003.X 101765-0003.Y
## 1 SNP0001 0.10170589 0.7760399 0.10211141 0.7885740 0.35890719 0.7804398
## 2 SNP0002 0.04271171 1.3020643 0.04893748 1.2564085 0.02508135 1.1183847
## 3 SNP0003 0.01710911 0.9452760 0.02105922 0.8730388 0.02211778 0.8874161
In case you want to see the FullDataTable.txt
file
itself, you can find it in the extdata subdirectory of the main package
directory.
The data from Affymetrix Axiom arrays are processed with the
Affymetrix Power Tools (APT) or the Genotyping Console software. After
initial quality control steps all passing samples and all SNPs are
processed by the program “apt probeset genotype”. If the option
–summaries is included (as is normally the case) a file
AxiomGT1.summary.txt
file is produced. Like the
GenomeStudio file, this is a tab-separated text file with the markers in
rows and the samples in columns. There are also differences; most
importantly, instead of having columns X and Y for each sample, we now
have one column per sample but rows A and B for each marker. Also the
files start with several hundred lines of text, and the last few
thousand rows contain quality control (DQC) markers. Also the sample
names all have “.CEL” appended to them.
# show part of an Affymetrix AxiomGT1.summary.txt file:
fname <- system.file("extdata", "AxiomGT1.summary.txt", package="fitPoly")
dat <- readDatfile(fname, comment.char="#", check.names=FALSE)
dat[1:6, 1:4]
## probeset_id 814-379_A03.CEL 814-379_A04.CEL 814-379_A05.CEL
## 1 AX-86752740-A 532.8074 1089.7257 856.1627
## 2 AX-86752740-B 606.8601 726.3137 647.0679
## 3 AX-86752741-A 1596.1018 2228.1368 1820.3040
## 4 AX-86752741-B 871.0347 486.4657 877.3148
## 5 AX-86752742-A 2365.1000 2560.8275 2511.4948
## 6 AX-86752742-B 803.7539 981.7766 856.3433
Again, the AxiomGT1.summary.txt
file is located in the
extdata subdirectory of the main package directory.
We provide demo files, each with only 10 or 15 samples and 10 SNP
markers. We will import these files, clean the sample names and convert
them from wide format (all samples side by side) to
long format (all samples under each other, with columns
MarkerName
, SampleName
, X
,
Y
, R
, ratio
).
# import a FullDataTable.txt file from GenomeStudio:
fname <- system.file("extdata", "FullDataTable.txt", package="fitPoly")
datGS <- readFullDataTable(filename=fname, out="")
head(datGS)
## MarkerName SampleName X Y R ratio
## 1 SNP0001 101765-0001 0.101705887 0.77603991 0.8777458 0.88412831
## 2 SNP0002 101765-0001 0.042711711 1.30206429 1.3447760 0.96823879
## 3 SNP0003 101765-0001 0.017109110 0.94527599 0.9623851 0.98222218
## 4 SNP0004 101765-0001 0.008343576 0.86564662 0.8739902 0.99045347
## 5 SNP0005 101765-0001 1.158060329 0.09794167 1.2560020 0.07797891
## 6 SNP0006 101765-0001 0.033692960 1.10556404 1.1392570 0.97042550
## [1] "SNP0001" "SNP0002" "SNP0003" "SNP0004" "SNP0005" "SNP0006" "SNP0007" "SNP0008" "SNP0009"
## [10] "SNP0010"
## [1] "101765-0001" "101765-0002" "101765-0003" "101765-0004" "101765-0005" "101765-0006"
## [7] "101765-0007" "101765-0008" "101765-0009" "101765-0010"
# import an AxiomGT1.summary.txt file from Affymetrix Power Tools:
fname <- system.file("extdata", "AxiomGT1.summary.txt", package="fitPoly")
datAX <- readAxiomSummary(AXdata=fname, out="")
head(datAX)
## MarkerName SampleName X Y R ratio
## 1 AX-86752740 814-379_A03 532.8074 606.8601 1139.668 0.5324887
## 2 AX-86752740 814-379_A04 1089.7257 726.3137 1816.039 0.3999438
## 3 AX-86752740 814-379_A05 856.1627 647.0679 1503.231 0.4304515
## 4 AX-86752740 814-379_A06 721.9925 960.5359 1682.528 0.5708884
## 5 AX-86752740 814-379_A07 453.8059 1134.3632 1588.169 0.7142585
## 6 AX-86752740 814-379_A08 525.5640 1253.0268 1778.591 0.7045054
## [1] "AX-86752740" "AX-86752741" "AX-86752742" "AX-86752743" "AX-86752744" "AX-86752745"
## [7] "AX-86752746" "AX-86752747" "AX-86752748" "AX-86752749"
## [1] "814-379_A03" "814-379_A04" "814-379_A05" "814-379_A06" "814-379_A07" "814-379_A08"
## [7] "814-379_A09" "814-379_A10" "814-379_A11" "814-379_A12" "814-380_G12" "814-380_H01"
## [13] "814-380_H02" "814-380_H03" "814-380_H04"
These functions return the reformatted data as a data.frame. Normally
you would use the parameter out
to specify the name of an
RData file (or optionally a tab-separated text file, see the help) to
save the converted data.
Normally the marker names and/or the sample names as used in the
array software are different from the original names. In these cases
there should be a file specifying the translation between the two sets
of names. Such a file can also contain additional information on the
samples, such as the categories of the samples (e.g. F1 parent 1, F1
parent 2, F1 individual, association panel, wild, control) and ploidy
(e.g. 2x, 4x). In this exercise we will convert the marker and sample
names of the datAX
data.frame and split it into two parts:
one for the tetraploid and one for the diploid samples.
First we rename the markers. The translation is specified by file
CsvAnnotationFile.v1.txt
. Since the markers on the array
are a fixed set these annotation files don’t change and can usually be
assumed to be correct (i.e. all marker names in the array can be assumed
also to occur in the annotation file and v.v., and all
customer_id
codes can be assumed to be unique). Therefore
we don’t need to check this and can perform the translation
directly:
fname <- system.file("extdata", "CsvAnnotationFile.v1.txt", package="fitPoly")
mrktable <- read.csv(fname, comment.char="#")
levels(datAX$MarkerName) <-
mrktable$customer_id[match(levels(datAX$MarkerName), mrktable$Probe.Set.ID)]
Next we rename the samples. The samples are different between
projects using the same array, and sample annotations are usually made
by hand, in different formats, and cannot be assumed to be error-free.
Also it may be that there are samples of various ploidy levels, which we
may want to split into separate sets. For this we use function
splitNrenameSamples
:
fname <- system.file("extdata", "AX_sampletable.csv", package="fitPoly")
smptable <- read.csv(fname)
head(smptable)
## X SubmittedPlateName SubmittedWell PlateName Well SampleName Ploidy SampleSource
## 1 99 SMP4_0008814 A03 SMP4_0008814 A03 K001 4x Customer
## 2 100 SMP4_0008814 A04 SMP4_0008814 A04 K002 4x Customer
## 3 101 SMP4_0008814 A05 SMP4_0008814 A05 K003 4x Customer
## 4 102 SMP4_0008814 A06 SMP4_0008814 A06 K004 4x Customer
## 5 103 SMP4_0008814 A07 SMP4_0008814 A07 K007 4x Customer
## 6 104 SMP4_0008814 A08 SMP4_0008814 A08 K008 4x Customer
## SampleType InternalPico CVPercent BestArray FailureMode
## 1 Rose genomic DNA 16.46 6.90 814-379_A03 NA
## 2 Rose genomic DNA 14.87 3.63 814-379_A04 NA
## 3 Rose genomic DNA 8.32 6.97 814-379_A05 NA
## 4 Rose genomic DNA 23.68 1.52 814-379_A06 NA
## 5 Rose genomic DNA 20.28 3.61 814-379_A07 NA
## 6 Rose genomic DNA 6.78 2.29 814-379_A08 NA
# no split, return the original data.frame with substituted SampleNames:
datAX_unsplit <- splitNrenameSamples(dat=datAX, sampletable=smptable,
SampleID="BestArray", CustomerID="SampleName",
out=NA)
head(datAX_unsplit)
## MarkerName SampleName X Y R ratio
## 1 AX-86752740 K001 532.8074 606.8601 1139.668 0.5324887
## 2 AX-86752740 K002 1089.7257 726.3137 1816.039 0.3999438
## 3 AX-86752740 K003 856.1627 647.0679 1503.231 0.4304515
## 4 AX-86752740 K004 721.9925 960.5359 1682.528 0.5708884
## 5 AX-86752740 K007 453.8059 1134.3632 1588.169 0.7142585
## 6 AX-86752740 K008 525.5640 1253.0268 1778.591 0.7045054
# split samples based on ploidy, return a list of data.frames:
datAX_split <- splitNrenameSamples(dat=datAX, sampletable=smptable,
SampleID="BestArray", CustomerID="SampleName",
Ploidy="Ploidy", out=NA)
## sampletable contains ploidy levels 2x 4x
## the samples will be split per ploidy value
## MarkerName SampleName X Y R ratio
## 11 AX-86752740 V01 720.7784 2086.717 2807.496 0.7432664
## 12 AX-86752740 V02 427.6401 1864.825 2292.465 0.8134584
## 13 AX-86752740 V03 756.3445 1333.276 2089.621 0.6380470
## 14 AX-86752740 V04 415.7363 1850.588 2266.325 0.8165593
## 15 AX-86752740 V05 722.2508 1729.903 2452.154 0.7054627
## 26 AX-86752741 V01 864.9218 1615.226 2480.148 0.6512620
## MarkerName SampleName X Y R ratio
## 1 AX-86752740 K001 532.8074 606.8601 1139.668 0.5324887
## 2 AX-86752740 K002 1089.7257 726.3137 1816.039 0.3999438
## 3 AX-86752740 K003 856.1627 647.0679 1503.231 0.4304515
## 4 AX-86752740 K004 721.9925 960.5359 1682.528 0.5708884
## 5 AX-86752740 K007 453.8059 1134.3632 1588.169 0.7142585
## 6 AX-86752740 K008 525.5640 1253.0268 1778.591 0.7045054
Although we use out=NA
in this demo, normally you would
provide a filename to store the results.
The total signal intensity R = X + Y can be used as a measure whether
sample DNA is ok, the marker is designed well, and individual datapoints
are acceptable. Selection of data based on signal intensity has to be
done before dosage calling by fitPoly as fitPoly only works with the
signal ratio, not with absolute signal intensities. For this exercise we
will use an example data.frame XYdat
.
## MarkerName SampleName X Y R ratio
## 2013 RhK5_1000_451P K001 291.7016 853.7687 1145.4703 0.7453433
## 2014 RhK5_1000_451P K002 295.5672 392.0710 687.6382 0.5701705
## 2015 RhK5_1000_451P K003 260.1488 364.9967 625.1455 0.5838588
## 2016 RhK5_1000_451P K004 248.3574 314.0045 562.3619 0.5583673
## 2017 RhK5_1000_451P K007 265.1335 286.6872 551.8208 0.5195296
## 2018 RhK5_1000_451P K008 257.5974 337.2776 594.8750 0.5669723
Samples with bad DNA quality may be recognized by a low R (total
signal intensity), averaged over all markers. This information is saved
in a file produced by readFullDataTable
or
readAxiomSummary
if parameter out is specified. However it
is also quite easy to calculate directly if all data are in the same
data.frame:
It is clear that the mean R values for all samples are quite similar so there is no reason to exclude any samples at this point.
Next we will check if any markers should be excluded based on their overall signal intensity R. We calculate a number of statistics of the R distribution per marker and study the 95% quantiles (q95) of the markers:
## 0% 25% 50% 75% 100%
## 252.7965 2677.6312 3295.4407 3698.0998 5871.7522
In contrast to the samples, the markers vary widely in signal intensity. Probably there is a level below which no usable markers occur. We define a number of levels for q95 and at each level we select 3 markers. We then draw XY-scatterplots for these markers and decide from which level clear clustering pattern occur:
Rlevels <- seq(200, 6000, by=400)
sel.Rstats <- selMarkers_byR(Rstats, Rlevels, mrkperlevel=3)
drawXYplots(dat=XYdat, markers=sel.Rstats,
out="your-path-and filename", drawRthresholds=TRUE)
This generates a series of pages, each with 6 plots; here is part of
one such page:
These two markers show more or less clear clusters and have a q95
around 1600, so clearly the threshold below which markers should be
excluded must be lower than 1600. In this case we will exclude all
markers with q95 < 1400. This selection will be combined with a
within-marker selection of samples in function
makeFitPolyFiles
(see next exercise), but if no
within-marker selection is needed it can also be done as shown here:
In the plots as shown above the diagonal line represents R = 0.5 *
q95. Different, and even multiple, R threshold lines can be drawn by
using the Rthreshold.param
paprameter of
drawXYplots
(see the help). By looking at a series of
XY-plots we can decide if we want to exclude samples with R < 0.5 *
q95 (or any other threshold). Note that the within-marker threshold
should be based on the q95 (or another R statistic) of each marker, as
the signal intensity differs a lot between markers. The following
selects all markers where q95 (the 95% quantile of R) is >= 1400, and
within these markers it assigns a missing value (NA) to all markers
where R < 0.5 * q95:
dat.sel <- makeFitPolyFiles(XYdat, out=NA, Rquantile=0.95, marker.threshold=1400,
Rthreshold.param=c(0.95, 0.5, 0))
#dat.sel is in this case a list with only one useful element (the other element
#is NA); simplify:
dat.sel <- dat.sel[[1]]
head(dat.sel)
## MarkerName SampleName X Y R ratio
## 21173 RhK5_100_3510P K001 2704.714 369.3005 3074.015 0.1201362
## 21174 RhK5_100_3510P K002 2472.556 756.2809 3228.837 0.2342270
## 21175 RhK5_100_3510P K003 2281.848 759.0215 3040.869 0.2496068
## 21176 RhK5_100_3510P K004 2563.892 857.1673 3421.059 0.2505561
## 21177 RhK5_100_3510P K007 2796.043 459.6424 3255.686 0.1411814
## 21178 RhK5_100_3510P K008 2459.193 789.6996 3248.893 0.2430673
For more information on makeFitPoly
files see the
help.
This vignette is about preparing and selecting data for dosage
scoring by using the function fitMarkers
. We assume that
data.frame dat.sel
from the previous exercise is still in
memory.
The main result of this function call is a file
A_scores.RData
which contains a data.frame with for each
marker (in this case only the first 3 markers) and sample the assigned
dosage, as well as some further data. For further information on the
input and output of fitPoly functions see the vignette Segregation analysis
of dosage scores in this package.