Reference
Charon has three exported functions, MCMCsampler, exactposterior and unpackposterior, that can be used after typing
using Charon
Charon.Charon
— Modulemodule Charon
This package provides functions to sample from the posterior.
MCMCsampler
This is the MCMC sampler. It has seven methods.
Data formats:
You can input the data in four formats.
- With two (opened) (gzipped) files: a base count file, which is a CSV file where the first column are the number of derived reads, and the second column is the coverage, and a frequency file, which is a CSV file with one column, containing the frequencies. The data at line i in the base count file corresponds to the same locus as the data on line i of the frequency file.
- As an (opened) (gzipped) DICE file, or in the form of a DataFrame, also in DICE format. So the first column is the number of ancestral reads, the second column is the number of derived reads, the third column is the frequency in the anchor population, and the fourth column is the count of the number of loci where this combination of three numbers occur.
- Or, by providing three vectors:
coverages
,derivedreads
,frequencies
, of length equal to the number of SNPs, where at locusi
, there arederivedreads[i]
derived reads,coverages[i]
coverage andfrequencies[i]
frequency in the anchor population. - The third format is given with four vectors:
coverages
,derivedreads
,frequencies
,counts
. This means that there arecounts[i]
loci withcoverages[i]
coverage,derivedreads[i]
derived reads and frequencyfrequencies[i]
in the anchor population.
If you provide data in formats 1, 2, or 3, then the program will automatically convert it to format 4.
Parameters:
nchains
, number of chains, which is a positive integer. If you want to run all your chains in parallel, start julia with number of threads equal tonchains
.nsteps
number of samples per chain, which is a positive integer.prioronn
the prior on n, specified as a subtype ofDiscreteUnivariateDistribution
of the Distributions Julia package. Our implementation requires thatprioronn
has support on {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, because otherwise rounding errors will accumulate too much.prioronτCτA
is the prior on (τC, τA), which allows for correlation between τC and τA. Its type is a subtype of ContinuousMultivariateDistribution. It should have support contained in [0,∞)x[0,∞).prioronϵ
is the prior on ϵ. It should have support in [0, 0.5). It is a subtype ofContinuousUnivariateDistribution
in the Distributions package.coverages
a vector with coverages = ancestral reads + derived reads. Is a subtype ofAbstractVector{<:Integer}
. All coverages should be non-negative integers, and at least one should be positive, where the corresponding frequency is strictly between 0 and 1.derivedreads
a vector of derived reads. Is a subtype ofAbstractVector{<:Integer}
. All elements of the vector should be non-negative integers.frequencies
a vector of frequencies. Is a subtype of AbstractVector{<:Real}. Each frequency is between 0.0 and 1.0. At least one frequency should be strictly between 0.0 and 1.0 with corresponding positive coverage.counts
, a vector of positive integers. For each index,counts[index]
indicates how many loci there are withderivedreads[index]
derived reads, coveragecoverages[index]
and frequencyfrequencies[index]
.df
a DataFrame from the DataFrames package in the DICE-2 format. So the first column should be the number of ancestral reads, the second column the number of derived reads, the third column the frequencies in the anchor population, and the fourth column the counts of how many times this particular combination of ancestral reads, derived reads and frequency occurs.dicefile
is either an opened (gzipped) DICE file, or a path to a (gzipped) DICE file.
Keyword parameters. Keyword parameters should be given as keyword=value to the function, in case you want to set another value then the default.
messages
is an integer. Ifmessages
is non-positive, no message will be printed. Ifmessages
is positive, everymessages
steps a message will be printed with the progress of the sampler. The default value isnsteps÷100
, so every 1% progress a message is printed.scalingmessage
is eithertrue
(default) orfalse
. If true, a message will be printed when the scaling constant changes.header
isnothing
,true
(default) orfalse
. Has the dicefile a header? If nothing, the software tries to determine whether the dicefile has a header. This works only when you provide a path to a file.
The output is a vector with nchains items. Each item represents a chain. Each item is a tuple consisting of six columns, as described here.
Charon.MCMCsampler
— FunctionMCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, counts::AbstractVector{<:Integer}, messages::Integer=nsteps÷100, scalingmessages::Bool=true)
MCMC with simulated proposals. There are counts[i] loci with coverage coverages[i], derivedreads[i] derived alleles and frequency frequencies[i] in the anchor population.
MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, messages::Integer=nsteps÷100, scalingmessages::Bool=true)
coverages[i] is the coverage, derivedreads[i] the number of derived reads and frequencies[i] the frequency of the derived allele in the anchor population at locus i.
MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, df::AbstractDataFrame, messages::Integer=nsteps÷100, scalingmessages::Bool=true)
df is a DataFrame (from the DataFrames package) in DICE 2-pop format: It should contain exactly four columns, the first one contains the number of ancestral reads, the second the number of derived reads, the third the frequency of the derived allele in the anchor population and the fourth the number of loci with exactly this combination of data.
MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, dicefile::IO, messages::Integer=nsteps÷100, scalingmessages::Bool=true, header::Bool=true)
dicefile is an opened CSV-file in DICE 2-pop format: It should contain exactly 4 columns, the first one contains the number of ancestral reads, the second the number of derived reads, the third the frequency of the derived allele in the anchor population and the fourth the number of loci with exactly this combination of data.
If you have Julia 1.3 or higher, and the file is a zipped file, open it using a zip decompressor:
using CodecZlib
dicefile = GzipDecompressorStream(open("path/to/your/file.csv.gz"))
With the named argument "header" one should indicate whether the CSV file has a header. The default is true.
MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, dicefile::AbstractString, messages::Integer=nsteps÷100, scalingmessages::Bool=true)
dicefile is a (zipped) CSV-file in DICE 2-pop format: It should contain exactly four columns, the first one contains the number of ancestral reads, the second the number of derived reads, the third the frequency of the derived allele in the anchor population and the fourth the number of loci with exactly this combination of data.
MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, sedimentdata::AbstractString, frequencies::AbstractString, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}; messages::Integer=nsteps÷100, scalingmessages::Bool=true, headersedimentdata::Union{Nothing, Bool}, headerfrequencies::Union{Nothing, Bool})
sedimentdata is a path to a CSV file, where the first column are the derived reads, and the second column are the coverages. frequencies is a path to a CSV file with one column, containing frequencies of the anchor population. The number of rows of both files should be the same.
MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, sedimentdata::DataFrame, frequencies::DataFrame; messages::Integer=nsteps÷100, scalingmessages::Bool=true)
sedimentdata is a path to a DataFrame, where the first column are the derived reads, and the second column are the coverages. frequencies is a DataFrame with one column, containing frequencies of the anchor population. The number of rows of both files should be the same.
unpackposterior
This function is used to build the unconditional posterior from the MCMC samples conditioned on n, as described in the paper. It has two methods.
Arguments:
nsteps
is the number of MCMC samples, a positive integer.chains
this the tuple that is the output ofMCMCsampler
.
Charon.unpackposterior
— Functionunpackposterior(nsteps::Integer, chains::AbstractVector{<:Tuple{AbstractVector{<:Integer}, AbstractMatrix{<:Real}, AbstractMatrix{<:Real}, AbstractMatrix{<:Real}, AbstractMatrix{Bool}, AbstractMatrix{<:Real}}})
Calculate the unconditional posterior from the output of MCMCsampler. chains
is the output of the sampler, nsteps
is the number of MCMC samples.
exactposterior
exactposterior
is a function to calculate the posterior up to a fixed constant, only depending on the data, but not on the parameters. You can use this function for maximum posterior estimation. If you use uniform priors, you can use this function for maximum likelihood estimation over the support of the prior. MCMCsampler
uses this function to find a good starting point for the sampler. It has two methods.
The posterior is calculated for each combination of parameters (n, τC, τA, ϵ) with n in nrange
, τC in τCrange
, τA in τArange
and ϵ in ϵrange
. So make sure that length(nrange)*length(τCrange)*length(τArange)*length(ϵrange)
is not too large, as otherwise it will take a very long time and you might run out of memory.
Parameters:
nrange
vector of n values. Subtype of AbstractVector{<:Integer}. Should be a subset of {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.τCrange
vector of τC values. Subtype of AbstractVector{<:Real}. All values should be non-negative.τArange
vector of τA values. Subtype of AbstractVector{<:Real}. All values should be non-negative.ϵrange
vector of ϵ values. Is a subtype of AbstractVector{<:Real}. All values should be non-negative and smaller than 0.5.coverages
a vector with coverages = ancestral reads + derived reads. Is a subtype ofAbstractVector{<:Integer}
. All coverages should be non-negative integers, and at least one should be positive, with corresponding frequency strictly between 0 and 1.uniquecoverages
should be equal tounique(coverages)
.derivedreads
a vector of derived reads. Is a subtype ofAbstractVector{<:Integer}
. All elements of the vector should be non-negative integers.frequencies
a vector of frequencies. Is a subtype of AbstractVector{<:Real}. Each frequency is between 0.0 and 1.0. At least one frequency should be strictly between 0.0 and 1.0 with corresponding positive coverage.counts
, all elements should be non-negative. For each index,counts[index]
indicates how many loci there are withderivedreads[index]
derived reads, coveragecoverages[index]
and frequencyfrequencies[index]
.
Keyword argument.
messages
is an integer. Ifmessages
is non-positive, no message will be printed. Ifmessages
is positive, everymessages
steps a message will be printed with the progress of the calculations.
The output are 5 vectors: ns, τCs, τAs, ϵs, logliks
, of each of length length(n)*length(τCrange)*length(τArange)*length(ϵrange)
, where logliks[index]
is the log likelihood, up to an additive constant, with parameters ns[index]
, τCs[index]
, τAs[index]
and ϵs[index]
. The additive constant only depends on the data, but not on the parameters.
Maximum posterior estimation (MAP estimator) and maximum likelihood estimation
Suppose you have results
ns, τCs, τAs, ϵs, logliks = exactposterior(args...)
where args
are your arguments (data and priors, etc.). Then you can calculate the index where the log likelihood is maximised:
i_max = argmax(logliks)
So
ns[i_max], τCs[i_max], τAs[i_max], ϵs[i_max]
is the MAP estimator. If you provide uniform priors, then the MAP estimator is a maximum likelihood estimator as well, where you maximise over the support of the prior.
Charon.exactposterior
— Functionexactposterior(nrange::AbstractVector{<:Integer}, τCrange::AbstractVector{<:Real}, τArange::AbstractVector{<:Real}, ϵrange::AbstractVector{<:Real}, coverages::AbstractVector{<:Integer}, uniquecoverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, counts::AbstractVector{<:Integer}, messages::Integer)
Calculate the exact posterior up to a fixed unknown constant that depends on the data but does not depend on the parameters.
exactposterior(nrange::AbstractVector{<:Integer}, τCrange::AbstractVector{<:Real}, τArange::AbstractVector{<:Real}, ϵrange::AbstractVector{<:Real}, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, messages::Integer=length(n)*length(τCrange)*length(τArange)*length(ϵrange)÷100)
Charon EigenExpansion type
This type is not exported, but can be used after using Charon: EigenExpansion
, or via Charon.EigenExpansion
. It represents a matrix decomposition $M=P*D*P^{-1}$, where $P$ is an invertible matrix with inverse $P^{-1}$, and $D$ is a diagonal matrix. It is used to efficiently calculate the exponent of a matrix.
Charon.EigenExpansion
— TypeEigenExpansion{Q<:AbstractMatrix{<:Real}, R<:Real, S<:AbstractVector{R}, T<:AbstractMatrix{<:Real}}
Representation of a matrix decomposition M=PDP^{-1}, where D is a diagonal matrix and P an invertible matrix with inverse Pinv. The fields are P, D and Pinv of types Q, Diagonal{R, S}, and T, respectively.
This decomposition can be used for fast matrix exponentiation, using that exp(M) = Pexp(D)Pinv, and the fact that the exponent of a diagonal matrix is formed by exponentiating its diagonal elements.
Internal, non-exported functions
The following functions are not exported, and only available via Charon.functionname
, or using Charon: functionname
.
Charon.calcloglik
— Functioncalcloglik(coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, counts::AbstractVector{<:Integer}, M::AbstractMatrix{<:Real}, hvals::AbstractMatrix{<:Real}, q::AbstractDict{<:Tuple{Integer, Integer}, <:AbstractVector{<:Real}}, index::Integer)
Calculate the log-likelihood at locus index.
Charon.calcmatrix
— Functioncalcmatrix(τC::Real, τA::Real, Qꜜ::EigenExpansion, Q::EigenExpansion)
Calculate the matrix e^{Q*τ_A}e^{Qꜜτ_C}
as in Schraiber. Note that in general, for matrices, exp(A+B) ≠ exp(A)exp(B) (equality holds if and only if A and B commute, which is not the case for Q and Qꜜ).
Charon.filtervectorsandapplycountmap
— Functionfiltervectorsandapplycountmap(coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, allowedindices::AbstractVector{Bool})
This function does the following: coveragesfiltered = coverages[allowedindices] derivedreadsfiltered = derivedreads[allowedindices] frequencies_filtered = frequencies[allowedindices]
Next, it counts how many times (coveragesfiltered[i], derivedreadsfiltered[i], frequenciesfiltered[i]) occurs, for all triples, and next returns four vectors (coveragesfilteredunique, derivedreadsfilteredunique, frequenciesfilteredunique, counts), where each triple (coveragesfilteredunique[i], derivedreadsfilteredunique[i], frequenciesfilteredunique[i]) is unique and occurs counts[i] times in zip(coveragesfiltered, derivedreadsfiltered, frequenciesfiltered).
Charon.logprobderivedreads!
— Functionlogprobderivedreads!(q::AbstractDict{<:Tuple{Integer, Integer}, <:AbstractVector{<:Real}}, n::Integer, τC::Real, τA::Real, ϵ::Real, coverages::AbstractVector{<:Integer}, uniquecoverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, counts::AbstractVector{<:Integer}, Qꜜ::EigenExpansion, Q::EigenExpansion, binomialcoefficients::AbstractVector{<:Integer}, hvals::AbstractMatrix{<:Real})
Calculate the log probability of the data, a.k.a. the likelihood. This function mutates q, which is a dictionary, and is used for intermediate calculations.
Charon.makeq
— Functionmakeq(nmax::Integer, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer})
Construct a vector of length nmax, where the n-th item is a dictionary with keys (coverages[i], derivedreads[i]) and values real vectors of length 2n+1.
Charon.makeqfixedn
— Functionmakeqfixedn(n::Integer, uniquecoveragesandderivedreads::AbstractVector{<:Tuple{Integer, Integer}})
Construct a dictionary with keys (coverages[i], derivedreads[i]) and values real vectors of length 2n+1.
makeqfixedn(n::Integer, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer})
Charon.makeQ
— FunctionmakeQ(n::Integer)
Calculate the (2n+1)x(2n+1) tridiagonal matrix Q as in Schraiber 2018.
Charon.makeQꜜ
— FunctionmakeQꜜ(n::Integer)
Calculate the (2n+1)x(2n+1) tridiagonal matrix Qꜜ as in Schraiber 2018.
Charon.preparedata
— Functionpreparedata(n::Integer, frequencies::AbstractVector{<:Real})
Returns a named tuple (Qꜜ=Qꜜ, Q=Q, binomialcoefficients=binomialcoefficients, hvals=hvals), where Qꜜ and Q are EigenExpansions of Qꜜ and Q (as in Schraiber 2018), respectively. hvals is a (2n+1) x length(y) matrix of type SizedMatrix. SizedMatrix is used rather than SMatrix because this matrix is so large and SMatrix is slow for large matrices. The matrix hvals is defined as hvals[k+1,i] = frequencies[i]^k*(1-frequencies[i])^(2n-k) (note that Julia has 1-based indexing). binomialcoefficients is a vector (of type SVector) containing the binomial coefficients 2n over k, k=0,...,2n.
Charon.readcsvfile
— Functionreadcsvfile(filename::AbstractString)
Read CSV file and turn it into a DataFrame (from DataFrames package). It automatically detects whether the CSV file has a header.
Charon.updateq!
— Functionupdateq!(q::AbstractDict{<:Tuple{Integer, Integer}, <:AbstractVector{<:Real}}, n::Integer, ϵ::Real, uniquecoverages::AbstractVector{<:Integer}, binomialcoefficients::AbstractVector{<:Integer})
Update the dictionary q, so that q[(R, d)][k+1] is the probability of d derived reads out of R when you have k derived alleles with n inidividuals and error rate ϵ.
Extended Julia base functions.
I extended several Julia base functions to my custom EigenExpansion type.
Base.exp
— Functionexp(s::Real, F::EigenExpansion)
Calculate e^(s*M) in a numerically efficient way.
If F=EigenExpansion(M), then exp(sM)=exp(s, F)=F.Pexp(sF.D)F.Pinv, which makes use of the fast implementation of exp for diagonal matrices (it is just the exponentiation of the diagonal elements).
Base.show
— FunctionBase.show(io::IO, mime::MIME{Symbol("text/plain")}, F::EigenExpansion{Q, R, S, T}) where {Q<:AbstractMatrix{<:Real}, R<:Real, S<:AbstractVector{R}, T<:AbstractMatrix{<:Real}}
Pretty display of EigenExpansion types.
Base.:==
— Function==(E1::EigenExpansion, E2::EigenExpansion)
Test whether E1 and E2 have the same decomposition PDP^(-1), so whether E1.P == E2.P, E1.D == E2.D and E1.Pinv == E2.Pinv.