
Sara Sheehan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2015-94

May 14, 2015

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-94.pdf

Since the 1920s, researchers in population genetics have developed mathematical models to explain how a species evolves. With the rise of DNA sequencing over the past decade, we now have the data to use these models to answer real questions in evolutionary biology. However, the sheer amount of data and the time complexity of the models make inference extremely challenging. Computer science has therefore become an essential tool for bridging theoretical models and modern sequencing data.

In this thesis we present two novel algorithms that make use of DNA sequencing data in a principled yet practical way. The first method estimates the history of effective population sizes of a species using a coalescent hidden Markov model (HMM). Previous coalescent HMMs could only handle a few sequences, since the set of coalescent trees makes the state space prohibitively large. Our algorithm uses a modified state space to make inference computationally feasible while still retaining the essential genealogical features of a sample. We apply this algorithm, called diCal, to human data to learn more about major events in human history, such as the out-of-Africa migration. We also provide several extensions to diCal that make the computation faster, more automated, and applicable in a wider variety of scenarios.

The second method is an algorithm for jointly estimating effective population size changes and natural selection. These two factors can leave similar traces in genomic data, and the models that would describe both are computationally intractable. Our method uses a machine learning technique called deep learning to make the inference procedure robust and efficient. Deep learning automatically teases out important features of the data, but had not previously been used in population genetics. We apply this method to African Drosophila melanogaster data to jointly infer their population size changes and classify each region of their genome as neutral or under natural selection. We consider three types of selection: hard sweeps, soft sweeps, and balancing selection. A promising future direction is to combine machine learning algorithms with biologically inspired coalescent modeling to create a more sophisticated framework for population genomic inference.

Advisor: Yun S. Song

BibTeX citation:

@phdthesis{Sheehan:EECS-2015-94,
Author= {Sheehan, Sara},
Title= {Scalable Algorithms for Population Genomic Inference},
School= {EECS Department, University of California, Berkeley},
Year= {2015},
Month= {May},
Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-94.html},
Number= {UCB/EECS-2015-94},
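Abstract= {Since the 1920s, researchers in population genetics have developed mathematical models to explain how a species evolves. With the rise of DNA sequencing over the past decade, we now have the data to use these models to answer real questions in evolutionary biology. However, the sheer amount of data and the time complexity of the models make inference extremely challenging. Computer science has therefore become an essential tool for bridging theoretical models and modern sequencing data.

In this thesis we present two novel algorithms that make use of DNA sequencing data in a principled yet practical way. The first method estimates the history of effective population sizes of a species using a coalescent hidden Markov model (HMM). Previous coalescent HMMs could only handle a few sequences, since the set of coalescent trees makes the state space prohibitively large. Our algorithm uses a modified state space to make inference computationally feasible while still retaining the essential genealogical features of a sample. We apply this algorithm, called diCal, to human data to learn more about major events in human history, such as the out-of-Africa migration. We also provide several extensions to diCal that make the computation faster, more automated, and applicable in a wider variety of scenarios.

The second method is an algorithm for jointly estimating effective population size changes and natural selection. These two factors can leave similar traces in genomic data, and the models that would describe both are computationally intractable. Our method uses a machine learning technique called deep learning to make the inference procedure robust and efficient. Deep learning automatically teases out important features of the data, but had not previously been used in population genetics. We apply this method to African Drosophila melanogaster data to jointly infer their population size changes and classify each region of their genome as neutral or under natural selection. We consider three types of selection: hard sweeps, soft sweeps, and balancing selection. A promising future direction is to combine machine learning algorithms with biologically inspired coalescent modeling to create a more sophisticated framework for population genomic inference.}
}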

EndNote citation:

%0 Thesis
%A Sheehan, Sara 
%T Scalable Algorithms for Population Genomic Inference
%I EECS Department, University of California, Berkeley
%D 2015
%8 May 14
%@ UCB/EECS-2015-94
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-94.html
%F Sheehan:EECS-2015-94