Reference

Charon has three exported functions, MCMCsampler, exactposterior and unpackposterior, that can be used after typing

using Charon

Charon.Charon — Module

module Charon

This package provides functions to sample from the posterior.

source

MCMCsampler

This is the MCMC sampler. It has seven methods.

Data formats:

You can input the data in four formats.

With two (opened) (gzipped) files: a base count file, which is a CSV file where the first column are the number of derived reads, and the second column is the coverage, and a frequency file, which is a CSV file with one column, containing the frequencies. The data at line i in the base count file corresponds to the same locus as the data on line i of the frequency file.
As an (opened) (gzipped) DICE file, or in the form of a DataFrame, also in DICE format. So the first column is the number of ancestral reads, the second column is the number of derived reads, the third column is the frequency in the anchor population, and the fourth column is the count of the number of loci where this combination of three numbers occur.
Or, by providing three vectors: coverages, derivedreads, frequencies, of length equal to the number of SNPs, where at locus i, there are derivedreads[i] derived reads, coverages[i] coverage and frequencies[i] frequency in the anchor population.
The third format is given with four vectors: coverages, derivedreads, frequencies, counts. This means that there are counts[i] loci with coverages[i] coverage, derivedreads[i] derived reads and frequency frequencies[i] in the anchor population.

If you provide data in formats 1, 2, or 3, then the program will automatically convert it to format 4.

Parameters:

nchains, number of chains, which is a positive integer. If you want to run all your chains in parallel, start julia with number of threads equal to nchains.
nsteps number of samples per chain, which is a positive integer.
prioronn the prior on n, specified as a subtype of DiscreteUnivariateDistribution of the Distributions Julia package. Our implementation requires that prioronn has support on {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, because otherwise rounding errors will accumulate too much.
prioronτCτA is the prior on (τC, τA), which allows for correlation between τC and τA. Its type is a subtype of ContinuousMultivariateDistribution. It should have support contained in [0,∞)x[0,∞).
prioronϵ is the prior on ϵ. It should have support in [0, 0.5). It is a subtype of ContinuousUnivariateDistribution in the Distributions package.
coverages a vector with coverages = ancestral reads + derived reads. Is a subtype of AbstractVector{<:Integer}. All coverages should be non-negative integers, and at least one should be positive, where the corresponding frequency is strictly between 0 and 1.
derivedreads a vector of derived reads. Is a subtype of AbstractVector{<:Integer}. All elements of the vector should be non-negative integers.
frequencies a vector of frequencies. Is a subtype of AbstractVector{<:Real}. Each frequency is between 0.0 and 1.0. At least one frequency should be strictly between 0.0 and 1.0 with corresponding positive coverage.
counts, a vector of positive integers. For each index, counts[index] indicates how many loci there are with derivedreads[index] derived reads, coverage coverages[index] and frequency frequencies[index].
df a DataFrame from the DataFrames package in the DICE-2 format. So the first column should be the number of ancestral reads, the second column the number of derived reads, the third column the frequencies in the anchor population, and the fourth column the counts of how many times this particular combination of ancestral reads, derived reads and frequency occurs.
dicefile is either an opened (gzipped) DICE file, or a path to a (gzipped) DICE file.

Keyword parameters. Keyword parameters should be given as keyword=value to the function, in case you want to set another value then the default.

messages is an integer. If messages is non-positive, no message will be printed. If messages is positive, every messages steps a message will be printed with the progress of the sampler. The default value is nsteps÷100, so every 1% progress a message is printed.
scalingmessage is either true (default) or false. If true, a message will be printed when the scaling constant changes.
header is nothing, true (default) or false. Has the dicefile a header? If nothing, the software tries to determine whether the dicefile has a header. This works only when you provide a path to a file.

The output is a vector with nchains items. Each item represents a chain. Each item is a tuple consisting of six columns, as described here.

Charon.MCMCsampler — Function

MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, counts::AbstractVector{<:Integer}, messages::Integer=nsteps÷100, scalingmessages::Bool=true)

MCMC with simulated proposals. There are counts[i] loci with coverage coverages[i], derivedreads[i] derived alleles and frequency frequencies[i] in the anchor population.

source

MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, messages::Integer=nsteps÷100, scalingmessages::Bool=true)

coverages[i] is the coverage, derivedreads[i] the number of derived reads and frequencies[i] the frequency of the derived allele in the anchor population at locus i.

source

MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, df::AbstractDataFrame, messages::Integer=nsteps÷100, scalingmessages::Bool=true)

df is a DataFrame (from the DataFrames package) in DICE 2-pop format: It should contain exactly four columns, the first one contains the number of ancestral reads, the second the number of derived reads, the third the frequency of the derived allele in the anchor population and the fourth the number of loci with exactly this combination of data.

source

MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, dicefile::IO, messages::Integer=nsteps÷100, scalingmessages::Bool=true, header::Bool=true)

dicefile is an opened CSV-file in DICE 2-pop format: It should contain exactly 4 columns, the first one contains the number of ancestral reads, the second the number of derived reads, the third the frequency of the derived allele in the anchor population and the fourth the number of loci with exactly this combination of data.

If you have Julia 1.3 or higher, and the file is a zipped file, open it using a zip decompressor:

using CodecZlib
dicefile = GzipDecompressorStream(open("path/to/your/file.csv.gz"))

With the named argument "header" one should indicate whether the CSV file has a header. The default is true.

source

MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, dicefile::AbstractString, messages::Integer=nsteps÷100, scalingmessages::Bool=true)

dicefile is a (zipped) CSV-file in DICE 2-pop format: It should contain exactly four columns, the first one contains the number of ancestral reads, the second the number of derived reads, the third the frequency of the derived allele in the anchor population and the fourth the number of loci with exactly this combination of data.

source

MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, sedimentdata::AbstractString, frequencies::AbstractString, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}; messages::Integer=nsteps÷100, scalingmessages::Bool=true, headersedimentdata::Union{Nothing, Bool}, headerfrequencies::Union{Nothing, Bool})

sedimentdata is a path to a CSV file, where the first column are the derived reads, and the second column are the coverages. frequencies is a path to a CSV file with one column, containing frequencies of the anchor population. The number of rows of both files should be the same.

source

MCMCsampler(nchains::Integer, nsteps::Integer, prioronn::DiscreteUnivariateDistribution, prioronτCτA::ContinuousMultivariateDistribution, prioronϵ::ContinuousUnivariateDistribution, sedimentdata::DataFrame, frequencies::DataFrame; messages::Integer=nsteps÷100, scalingmessages::Bool=true)

sedimentdata is a path to a DataFrame, where the first column are the derived reads, and the second column are the coverages. frequencies is a DataFrame with one column, containing frequencies of the anchor population. The number of rows of both files should be the same.

source

unpackposterior

This function is used to build the unconditional posterior from the MCMC samples conditioned on n, as described in the paper. It has two methods.

Arguments:

nsteps is the number of MCMC samples, a positive integer.
chains this the tuple that is the output of MCMCsampler.

Charon.unpackposterior — Function

unpackposterior(nsteps::Integer, chains::AbstractVector{<:Tuple{AbstractVector{<:Integer}, AbstractMatrix{<:Real}, AbstractMatrix{<:Real}, AbstractMatrix{<:Real}, AbstractMatrix{Bool}, AbstractMatrix{<:Real}}})

Calculate the unconditional posterior from the output of MCMCsampler. chains is the output of the sampler, nsteps is the number of MCMC samples.

source

exactposterior

exactposterior is a function to calculate the posterior up to a fixed constant, only depending on the data, but not on the parameters. You can use this function for maximum posterior estimation. If you use uniform priors, you can use this function for maximum likelihood estimation over the support of the prior. MCMCsampler uses this function to find a good starting point for the sampler. It has two methods.

The posterior is calculated for each combination of parameters (n, τC, τA, ϵ) with n in nrange, τC in τCrange, τA in τArange and ϵ in ϵrange. So make sure that length(nrange)*length(τCrange)*length(τArange)*length(ϵrange) is not too large, as otherwise it will take a very long time and you might run out of memory.

Parameters:

nrange vector of n values. Subtype of AbstractVector{<:Integer}. Should be a subset of {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
τCrange vector of τC values. Subtype of AbstractVector{<:Real}. All values should be non-negative.
τArange vector of τA values. Subtype of AbstractVector{<:Real}. All values should be non-negative.
ϵrange vector of ϵ values. Is a subtype of AbstractVector{<:Real}. All values should be non-negative and smaller than 0.5.
coverages a vector with coverages = ancestral reads + derived reads. Is a subtype of AbstractVector{<:Integer}. All coverages should be non-negative integers, and at least one should be positive, with corresponding frequency strictly between 0 and 1.
uniquecoverages should be equal to unique(coverages).
derivedreads a vector of derived reads. Is a subtype of AbstractVector{<:Integer}. All elements of the vector should be non-negative integers.
frequencies a vector of frequencies. Is a subtype of AbstractVector{<:Real}. Each frequency is between 0.0 and 1.0. At least one frequency should be strictly between 0.0 and 1.0 with corresponding positive coverage.
counts, all elements should be non-negative. For each index, counts[index] indicates how many loci there are with derivedreads[index] derived reads, coverage coverages[index] and frequency frequencies[index].

Keyword argument.

messages is an integer. If messages is non-positive, no message will be printed. If messages is positive, every messages steps a message will be printed with the progress of the calculations.

The output are 5 vectors: ns, τCs, τAs, ϵs, logliks, of each of length length(n)*length(τCrange)*length(τArange)*length(ϵrange), where logliks[index] is the log likelihood, up to an additive constant, with parameters ns[index], τCs[index], τAs[index] and ϵs[index]. The additive constant only depends on the data, but not on the parameters.

Maximum posterior estimation (MAP estimator) and maximum likelihood estimation

Suppose you have results

ns, τCs, τAs, ϵs, logliks = exactposterior(args...)

where args are your arguments (data and priors, etc.). Then you can calculate the index where the log likelihood is maximised:

i_max = argmax(logliks)

ns[i_max], τCs[i_max], τAs[i_max], ϵs[i_max]

is the MAP estimator. If you provide uniform priors, then the MAP estimator is a maximum likelihood estimator as well, where you maximise over the support of the prior.

Charon.exactposterior — Function

exactposterior(nrange::AbstractVector{<:Integer}, τCrange::AbstractVector{<:Real}, τArange::AbstractVector{<:Real}, ϵrange::AbstractVector{<:Real}, coverages::AbstractVector{<:Integer}, uniquecoverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, counts::AbstractVector{<:Integer}, messages::Integer)

Calculate the exact posterior up to a fixed unknown constant that depends on the data but does not depend on the parameters.

source

exactposterior(nrange::AbstractVector{<:Integer}, τCrange::AbstractVector{<:Real}, τArange::AbstractVector{<:Real}, ϵrange::AbstractVector{<:Real}, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, messages::Integer=length(n)*length(τCrange)*length(τArange)*length(ϵrange)÷100)

source

Charon EigenExpansion type

This type is not exported, but can be used after using Charon: EigenExpansion, or via Charon.EigenExpansion. It represents a matrix decomposition $M=P*D*P^{-1}$, where $P$ is an invertible matrix with inverse $P^{-1}$, and $D$ is a diagonal matrix. It is used to efficiently calculate the exponent of a matrix.

Charon.EigenExpansion — Type

EigenExpansion{Q<:AbstractMatrix{<:Real}, R<:Real, S<:AbstractVector{R}, T<:AbstractMatrix{<:Real}}

Representation of a matrix decomposition M=PDP^{-1}, where D is a diagonal matrix and P an invertible matrix with inverse Pinv. The fields are P, D and Pinv of types Q, Diagonal{R, S}, and T, respectively.

This decomposition can be used for fast matrix exponentiation, using that exp(M) = Pexp(D)Pinv, and the fact that the exponent of a diagonal matrix is formed by exponentiating its diagonal elements.

source

Internal, non-exported functions

The following functions are not exported, and only available via Charon.functionname, or using Charon: functionname.

Charon.calcloglik — Function

calcloglik(coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, counts::AbstractVector{<:Integer}, M::AbstractMatrix{<:Real}, hvals::AbstractMatrix{<:Real}, q::AbstractDict{<:Tuple{Integer, Integer}, <:AbstractVector{<:Real}}, index::Integer)

Calculate the log-likelihood at locus index.

source

Charon.calcmatrix — Function

calcmatrix(τC::Real, τA::Real, Qꜜ::EigenExpansion, Q::EigenExpansion)

Calculate the matrix e^{Q*τ_A}e^{Qꜜτ_C} as in Schraiber. Note that in general, for matrices, exp(A+B) ≠ exp(A)exp(B) (equality holds if and only if A and B commute, which is not the case for Q and Qꜜ).

source

Charon.filtervectorsandapplycountmap — Function

filtervectorsandapplycountmap(coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, frequencies::AbstractVector{<:Real}, allowedindices::AbstractVector{Bool})

This function does the following: coveragesfiltered = coverages[allowedindices] derivedreadsfiltered = derivedreads[allowedindices] frequencies_filtered = frequencies[allowedindices]

Next, it counts how many times (coveragesfiltered[i], derivedreadsfiltered[i], frequenciesfiltered[i]) occurs, for all triples, and next returns four vectors (coveragesfilteredunique, derivedreadsfilteredunique, frequenciesfilteredunique, counts), where each triple (coveragesfilteredunique[i], derivedreadsfilteredunique[i], frequenciesfilteredunique[i]) is unique and occurs counts[i] times in zip(coveragesfiltered, derivedreadsfiltered, frequenciesfiltered).

source

Charon.logprobderivedreads! — Function

logprobderivedreads!(q::AbstractDict{<:Tuple{Integer, Integer}, <:AbstractVector{<:Real}}, n::Integer, τC::Real, τA::Real, ϵ::Real, coverages::AbstractVector{<:Integer}, uniquecoverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer}, counts::AbstractVector{<:Integer}, Qꜜ::EigenExpansion, Q::EigenExpansion, binomialcoefficients::AbstractVector{<:Integer}, hvals::AbstractMatrix{<:Real})

Calculate the log probability of the data, a.k.a. the likelihood. This function mutates q, which is a dictionary, and is used for intermediate calculations.

source

Charon.makeq — Function

makeq(nmax::Integer, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer})

Construct a vector of length nmax, where the n-th item is a dictionary with keys (coverages[i], derivedreads[i]) and values real vectors of length 2n+1.

source

Charon.makeqfixedn — Function

makeqfixedn(n::Integer, uniquecoveragesandderivedreads::AbstractVector{<:Tuple{Integer, Integer}})

Construct a dictionary with keys (coverages[i], derivedreads[i]) and values real vectors of length 2n+1.

source

makeqfixedn(n::Integer, coverages::AbstractVector{<:Integer}, derivedreads::AbstractVector{<:Integer})

source

Charon.makeQ — Function

makeQ(n::Integer)

Calculate the (2n+1)x(2n+1) tridiagonal matrix Q as in Schraiber 2018.

source

Charon.makeQꜜ — Function

makeQꜜ(n::Integer)

Calculate the (2n+1)x(2n+1) tridiagonal matrix Qꜜ as in Schraiber 2018.

source

Charon.preparedata — Function

preparedata(n::Integer, frequencies::AbstractVector{<:Real})

Returns a named tuple (Qꜜ=Qꜜ, Q=Q, binomialcoefficients=binomialcoefficients, hvals=hvals), where Qꜜ and Q are EigenExpansions of Qꜜ and Q (as in Schraiber 2018), respectively. hvals is a (2n+1) x length(y) matrix of type SizedMatrix. SizedMatrix is used rather than SMatrix because this matrix is so large and SMatrix is slow for large matrices. The matrix hvals is defined as hvals[k+1,i] = frequencies[i]^k*(1-frequencies[i])^(2n-k) (note that Julia has 1-based indexing). binomialcoefficients is a vector (of type SVector) containing the binomial coefficients 2n over k, k=0,...,2n.

source

Charon.readcsvfile — Function

readcsvfile(filename::AbstractString)

Read CSV file and turn it into a DataFrame (from DataFrames package). It automatically detects whether the CSV file has a header.

source

Charon.updateq! — Function

updateq!(q::AbstractDict{<:Tuple{Integer, Integer}, <:AbstractVector{<:Real}}, n::Integer, ϵ::Real, uniquecoverages::AbstractVector{<:Integer}, binomialcoefficients::AbstractVector{<:Integer})

Update the dictionary q, so that q[(R, d)][k+1] is the probability of d derived reads out of R when you have k derived alleles with n inidividuals and error rate ϵ.

source

Extended Julia base functions.

I extended several Julia base functions to my custom EigenExpansion type.

Base.exp — Function

exp(s::Real, F::EigenExpansion)

Calculate e^(s*M) in a numerically efficient way.

If F=EigenExpansion(M), then exp(sM)=exp(s, F)=F.Pexp(sF.D)F.Pinv, which makes use of the fast implementation of exp for diagonal matrices (it is just the exponentiation of the diagonal elements).

source

Base.show — Function

Base.show(io::IO, mime::MIME{Symbol("text/plain")}, F::EigenExpansion{Q, R, S, T}) where {Q<:AbstractMatrix{<:Real}, R<:Real, S<:AbstractVector{R}, T<:AbstractMatrix{<:Real}}

Pretty display of EigenExpansion types.

source

Base.:== — Function

==(E1::EigenExpansion, E2::EigenExpansion)

Test whether E1 and E2 have the same decomposition PDP^(-1), so whether E1.P == E2.P, E1.D == E2.D and E1.Pinv == E2.Pinv.

source