Scalable Machine Learning Algorithms for Biological Sequence Data

Jeffrey Chan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-108

May 14, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-108.pdf

Recent advances in sequencing and synthesis technologies have sparked extraordinary growth in large-scale biological experimentation and data collection. This explosive growth necessitates the development of scalable yet accurate methods to investigate increasingly complex biological questions. Machine learning has become a vital tool for addressing the needs of computational biology, blending complex statistical models with efficient computation to uncover the underpinnings of biology.

In this dissertation, I develop three novel machine learning algorithms tailored to biological sequence data to help answer such questions. The first method is a general-purpose statistical framework for inference of population genetic parameters. Previous methods focused either on developing model approximations for a restricted class of models or on reducing datasets to a set of hand-crafted summary statistics and comparing them against simulated data. Our framework uses an exchangeable neural network, which respects the permutation-invariant symmetries of the data, to learn the mapping from simulated datasets to the population genetic parameters of interest.
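To make the permutation-invariance idea concrete, here is a minimal NumPy sketch. It is not the dissertation's architecture: the function name `exchangeable_net`, the weight shapes, the mean pooling, and the random binary genotype matrix are all assumptions chosen for illustration. The same feature map is applied to every row (individual), and a symmetric pooling step removes any dependence on row order.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def exchangeable_net(X, W_phi, W_rho):
    """Hypothetical permutation-invariant network.

    X has shape (n_individuals, n_sites); rows are treated as exchangeable.
    """
    # Apply the same feature map to every row independently.
    H = relu(X @ W_phi)          # shape (n_individuals, hidden)
    # Mean pooling over rows is symmetric, so row order cannot matter.
    pooled = H.mean(axis=0)      # shape (hidden,)
    # Map the pooled summary to the parameters of interest.
    return pooled @ W_rho        # shape (n_params,)

# Illustrative random data and weights (assumptions, not real genotypes).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(8, 20)).astype(float)
W_phi = rng.normal(size=(20, 16))
W_rho = rng.normal(size=(16, 2))

out = exchangeable_net(X, W_phi, W_rho)
out_perm = exchangeable_net(X[rng.permutation(8)], W_phi, W_rho)
invariant = np.allclose(out, out_perm)
```

Shuffling the rows of `X` leaves the output unchanged up to floating-point rounding, which is the property the abstract calls respecting the permutation-invariant symmetries of the data.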

The second method extends the ideas from the first to a more challenging setting in which the genotypes must be segmented to determine tracts of archaic admixture. In this setting, the data are permutation-equivariant, requiring a neural network architecture that results in accurate segmentation of archaic admixture tracts.
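Permutation equivariance can be illustrated with a similarly small sketch. This is an assumed layer in the style of pooled-context architectures, not the network used in the dissertation; `equivariant_layer` and its weights are hypothetical. Each row keeps its own output, so permuting the input rows permutes the output rows in the same way, which is what a per-individual segmentation needs.

```python
import numpy as np

def equivariant_layer(X, W_self, W_pool):
    """Hypothetical permutation-equivariant layer.

    X has shape (n_individuals, n_sites). Each output row depends on its own
    input row plus a row-order-independent summary of all rows.
    """
    pooled = X.mean(axis=0, keepdims=True)          # shape (1, n_sites)
    return np.tanh(X @ W_self + pooled @ W_pool)    # shape (n_individuals, n_out)

# Illustrative random data and weights (assumptions, not real genotypes).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(6, 10)).astype(float)
W_self = rng.normal(size=(10, 10))
W_pool = rng.normal(size=(10, 10))

perm = rng.permutation(6)
Y = equivariant_layer(X, W_self, W_pool)
Y_perm = equivariant_layer(X[perm], W_self, W_pool)
# Equivariance: permuting the inputs permutes the outputs identically.
equivariant = np.allclose(Y[perm], Y_perm)
```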

Finally, the third method addresses the search problem in protein engineering: discovering high-fitness protein sequences of interest. Standard bandit optimization methods often assume experimental feedback that is purely sequential. In protein engineering, advances in high-throughput synthesis and experimentation can yield batches of size up to 10^5, so the batch size is often much larger than the number of rounds of experimentation. We propose a family of parallel contextual linear bandit algorithms and analyze their regret bounds.
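The batched setting can be sketched with a simple batched variant of LinUCB. This is one generic illustration of a parallel linear bandit, not necessarily any of the algorithms analyzed in the dissertation; the function `batched_linucb`, its parameters (`alpha`, `lam`, `noise`), and the simulated linear rewards are all assumptions. Within a round the estimate is frozen, a whole batch is selected by upper confidence bound, and the model is updated only after all rewards in the batch are observed.

```python
import numpy as np

def batched_linucb(contexts, theta_true, n_rounds, batch_size,
                   alpha=1.0, lam=1.0, noise=0.1, seed=0):
    """Hypothetical batched LinUCB on simulated linear rewards."""
    rng = np.random.default_rng(seed)
    d = contexts.shape[1]
    A = lam * np.eye(d)   # regularized design matrix
    b = np.zeros(d)       # accumulated context-weighted rewards
    total_reward = 0.0
    for _ in range(n_rounds):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        # Confidence width x^T A^{-1} x for every candidate context.
        quad = np.einsum("ij,jk,ik->i", contexts, A_inv, contexts)
        scores = contexts @ theta_hat + alpha * np.sqrt(np.maximum(quad, 0.0))
        # Select the whole batch at once; no feedback arrives within a round.
        chosen = np.argsort(scores)[-batch_size:]
        Xb = contexts[chosen]
        rewards = Xb @ theta_true + noise * rng.normal(size=batch_size)
        # Update the statistics only after the full batch has been observed.
        A += Xb.T @ Xb
        b += Xb.T @ rewards
        total_reward += rewards.sum()
    return np.linalg.solve(A, b), total_reward

# Illustrative problem: few rounds, large batches (sizes are assumptions).
rng = np.random.default_rng(0)
contexts = rng.normal(size=(200, 5))
theta_true = rng.normal(size=5)
theta_hat, total = batched_linucb(contexts, theta_true, n_rounds=5, batch_size=50)
```

The design choice to freeze the estimate within a round mirrors the constraint described above: when batches are much larger than the number of rounds, most decisions must be made without intermediate feedback.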

Advisor: Yun S. Song

BibTeX citation:

@phdthesis{Chan:EECS-2021-108,
    Author= {Chan, Jeffrey},
    Title= {Scalable Machine Learning Algorithms for Biological Sequence Data},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-108.html},
    Number= {UCB/EECS-2021-108},
    Abstract= {Recent advances in sequencing and synthesis technologies have sparked extraordinary growth in large-scale biological experimentation and data collection. This explosive growth necessitates the development of scalable yet accurate methods to investigate increasingly complex biological questions. Machine learning has become a vital tool for addressing the needs of computational biology, blending complex statistical models with efficient computation to uncover the underpinnings of biology.

In this dissertation, I develop three novel machine learning algorithms tailored to biological sequence data to help answer such questions. The first method is a general-purpose statistical framework for inference of population genetic parameters. Previous methods focused either on developing model approximations for a restricted class of models or on reducing datasets to a set of hand-crafted summary statistics and comparing them against simulated data. Our framework uses an exchangeable neural network, which respects the permutation-invariant symmetries of the data, to learn the mapping from simulated datasets to the population genetic parameters of interest.

The second method extends the ideas from the first to a more challenging setting in which the genotypes must be segmented to determine tracts of archaic admixture. In this setting, the data are permutation-equivariant, requiring a neural network architecture that results in accurate segmentation of archaic admixture tracts.

Finally, the third method addresses the search problem in protein engineering: discovering high-fitness protein sequences of interest. Standard bandit optimization methods often assume experimental feedback that is purely sequential. In protein engineering, advances in high-throughput synthesis and experimentation can yield batches of size up to 10^5, so the batch size is often much larger than the number of rounds of experimentation. We propose a family of parallel contextual linear bandit algorithms and analyze their regret bounds.},
}

EndNote citation:

%0 Thesis
%A Chan, Jeffrey
%T Scalable Machine Learning Algorithms for Biological Sequence Data
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 14
%@ UCB/EECS-2021-108
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-108.html
%F Chan:EECS-2021-108