Efficient forward simulation of whole genomes

Simulations are used extensively in population genetics, both for verifying analytical results and for exploring models that are mathematically intractable.  Simulation approaches can be broadly classified into forward-in-time and backward-in-time.

Backward-in-time (e.g. coalescent) simulations are extremely memory efficient for neutral loci, but selection is difficult to incorporate into them.  Forward simulations, on the other hand, allow very flexible selection models but tend to be memory intensive.  This is because the standard approach to forward simulation stores every variant carried by every individual in an array, whether or not the variant contributes to individual fitness.  With this approach, storing a population of diploid individuals takes 2NL bits, where N is the population size and L is the number of loci stored, assuming a single bit per variant (i.e. biallelic loci).  For a computer with 16 GB of memory, this translates to 10,000 individuals with 6.4 Mbp of sequence per individual, or 1 million individuals with 64 kbp per individual.
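
The arithmetic behind those two figures can be checked with a few lines of Python.  This is only a back-of-the-envelope sketch: the function name `max_loci` is made up for illustration, and it assumes 16 GB means 16 × 10^9 bytes and that all of the memory is available for storing variants.

```python
def max_loci(memory_bytes: int, n_individuals: int) -> int:
    """Loci per chromosome copy that fit in memory under the 2NL-bit model.

    Assumes one bit per variant (biallelic loci) and two chromosome
    copies per individual, as in the text.
    """
    total_bits = memory_bytes * 8
    return total_bits // (2 * n_individuals)


MEMORY_BYTES = 16 * 10**9  # assumption: 16 GB in decimal units

for n in (10_000, 1_000_000):
    print(f"N = {n:>9,}: L = {max_loci(MEMORY_BYTES, n):,} loci")
```

The memory needed grows with the product NL, so a tenfold increase in either population size or sequence length costs a tenfold reduction in the other.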

An alternative approach is to represent individual chromosomes as mosaics of haplotypes from a founder population.  This approach is extremely memory efficient, allowing simulation of whole genomes of large populations.  I have written a forward simulator, forqs (Forward simulation of Recombination, Quantitative traits, and Selection), that implements this approach:



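To make the mosaic idea concrete, here is a minimal Python sketch.  It is not the forqs implementation: the `Chunk` type and the `founder_at` and `recombine` functions are hypothetical names, and the representation (a sorted list of chunk start positions, each tagged with a founder haplotype id) is just one plausible way to encode a mosaic.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# Hypothetical representation: (start position, founder haplotype id).
# A chunk runs from its start to the start of the next chunk.
Chunk = Tuple[int, int]


def founder_at(chrom: List[Chunk], pos: int) -> int:
    """Return the founder haplotype id covering position `pos`."""
    starts = [start for start, _ in chrom]
    return chrom[bisect_right(starts, pos) - 1][1]


def recombine(a: List[Chunk], b: List[Chunk],
              breakpoints: Sequence[int], length: int) -> List[Chunk]:
    """Build a gamete that starts on `a` and switches parent at each breakpoint."""
    parents = (a, b)
    bounds = [0] + sorted(breakpoints) + [length]
    out: List[Chunk] = []
    for i in range(len(bounds) - 1):
        lo, hi = bounds[i], bounds[i + 1]
        src = parents[i % 2]
        for j, (start, fid) in enumerate(src):
            end = src[j + 1][0] if j + 1 < len(src) else length
            if end <= lo or start >= hi:
                continue  # chunk lies outside the segment [lo, hi)
            if out and out[-1][1] == fid:
                continue  # same founder as the previous chunk: merge
            out.append((max(start, lo), fid))
    return out


# Two founder chromosomes of length 1000, one chunk each.
founder0 = [(0, 0)]
founder1 = [(0, 1)]
child = recombine(founder0, founder1, breakpoints=[300, 700], length=1000)
print(child)
print(founder_at(child, 500))
```

In this sketch the storage per chromosome depends on the number of chunks, which grows with the number of recombination events, rather than on the number of loci.  Variant values at any position can be looked up from the founder haplotypes when they are needed.
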
forqs allows explicit modeling of quantitative traits and selection based on individual trait values.  Part of the motivation for writing the simulator was to simulate artificial selection experiments.

I’ve recently finished a simulation study using forqs to evaluate the power of artificial selection experiments to identify quantitative trait loci (QTLs) underlying a quantitative trait.  Here’s a link to the manuscript pre-print on bioRxiv:





