CherryML 2.0: Almost Linear Time Estimation of Phylogenetic Models

Wilson Wu

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-107

May 15, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-107.pdf

Phylogenetic models of protein evolution are parameterized by a rate matrix Q describing the rate at which amino acids mutate. Maximum likelihood based methods typically alternate between optimizing the phylogenetic trees over the multiple sequence alignments given a rate matrix and optimizing the rate matrix given the trees. Classically, this process was bottlenecked by the matrix estimation step. Recently, CherryML showed that performing matrix estimation using composite likelihood yielded accurate rate matrices while being orders of magnitude faster. In this paper, we present CherryML 2.0, which applies composite likelihood to the tree estimation step, speeding it up by orders of magnitude with minimal impact on accuracy. Furthermore, CherryML 2.0’s time complexity is almost linear in the sequence length and in the number of sequences per multiple sequence alignment. We show that CherryML 2.0 achieves similar accuracy to CherryML 1.0 on real datasets, and that it is able to reconstruct a rate matrix with less than 2% median error on simulated datasets.

Advisor: Yun S. Song

BibTeX citation:

@mastersthesis{Wu:EECS-2024-107,
    Author= {Wu, Wilson},
    Title= {CherryML 2.0: Almost Linear Time Estimation of Phylogenetic Models},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-107.html},
    Number= {UCB/EECS-2024-107},
    Abstract= {Phylogenetic models of protein evolution are parameterized by a rate matrix Q describing the rate at which amino acids mutate. Maximum likelihood based methods typically alternate between optimizing the phylogenetic trees over the multiple sequence alignments given a rate matrix and optimizing the rate matrix given the trees. Classically, this process was bottlenecked by the matrix estimation step. Recently, CherryML showed that performing matrix estimation using composite likelihood yielded accurate rate matrices while being orders of magnitude faster. In this paper, we present CherryML 2.0, which applies composite likelihood to the tree estimation step, speeding it up by orders of magnitude with minimal impact on accuracy. Furthermore, CherryML 2.0’s time complexity is almost linear in the sequence length and in the number of sequences per multiple sequence alignment. We show that CherryML 2.0 achieves similar accuracy to CherryML 1.0 on real datasets, and that it is able to reconstruct a rate matrix with less than 2% median error on simulated datasets.},
}

EndNote citation:

%0 Thesis
%A Wu, Wilson 
%T CherryML 2.0: Almost Linear Time Estimation of Phylogenetic Models
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 15
%@ UCB/EECS-2024-107
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-107.html
%F Wu:EECS-2024-107