Nanopore Methylation Calling from Limited Training Data

Brian Yao and Jennifer Listgarten

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2022-15

May 1, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-15.pdf

Nanopore sequencing platforms combined with machine learning models have been shown to be effective for detecting base modifications in DNA such as 5mC and 6mA. However, a challenge in building machine-learning-based callers is access to labelled training data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. Nanopore calling has historically been done with Hidden Markov Models (HMMs); because their emission distributions are independent of one another, these HMMs cannot make successful calls in k-mer contexts not seen during training. Deep neural networks (DNNs), however, are increasingly being used to make base and modification calls, often outperforming their HMM counterparts in the complete data setting. Moreover, it stands to reason that the DNN approach should generalize better to unseen examples because its parameters are more fully shared across all training examples. Herein, we demonstrate that a common DNN approach (DeepSignal) does indeed outperform a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid approach, AmortizedHMM, and demonstrate that it outperforms both the pure HMM and DNN approaches on methylation calling when the training data are incomplete.
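The contrast the abstract draws between independent emission distributions and shared parameters can be illustrated with a toy sketch. It is not the report's AmortizedHMM, nor Nanopolish or DeepSignal. The additive signal function `toy_mean_current`, the 3-mer alphabet, and the held-out set are all assumptions made up for illustration. A lookup table with one entry per k-mer has nothing to say about a k-mer it never saw, while a model whose weights are shared across k-mers can still produce an estimate.

```python
from itertools import product

BASES = "ACGT"
K = 3  # toy k-mer length (assumption; real pore models use longer k-mers)


def toy_mean_current(kmer):
    """Hypothetical ground-truth signal level: an additive per-position effect."""
    effect = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}
    return sum((pos + 1) * effect[base] for pos, base in enumerate(kmer))


all_kmers = ["".join(p) for p in product(BASES, repeat=K)]
held_out = {"ACG", "TTA", "GCA"}  # k-mers absent from the toy training set
train = [k for k in all_kmers if k not in held_out]

# Independent-parameter model: one separately stored mean per training k-mer.
table = {k: toy_mean_current(k) for k in train}


def fit_shared(kmers, targets, epochs=500, lr=0.05):
    """Fit weights indexed by (position, base), shared across all k-mers,
    using plain stochastic gradient descent on squared error."""
    weights = {(pos, base): 0.0 for pos in range(K) for base in BASES}
    bias = 0.0
    for _ in range(epochs):
        for kmer, y in zip(kmers, targets):
            pred = bias + sum(weights[(pos, b)] for pos, b in enumerate(kmer))
            err = pred - y
            bias -= lr * err
            for pos, b in enumerate(kmer):
                weights[(pos, b)] -= lr * err
    return weights, bias


def predict_shared(weights, bias, kmer):
    return bias + sum(weights[(pos, b)] for pos, b in enumerate(kmer))


weights, bias = fit_shared(train, [table[k] for k in train])

for kmer in sorted(held_out):
    in_table = kmer in table  # False: the table has no entry for this k-mer
    estimate = predict_shared(weights, bias, kmer)
    print(kmer, "in table:", in_table, "shared-model estimate:", round(estimate, 3))
```

The shared model recovers the held-out values here only because the toy signal is exactly additive. Real nanopore currents are not, which is part of why the report evaluates the approaches empirically.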

Advisor: Jennifer Listgarten

BibTeX citation:

@mastersthesis{Yao:EECS-2022-15,
    Author= {Yao, Brian and Listgarten, Jennifer},
    Title= {Nanopore Methylation Calling from Limited Training Data},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-15.html},
    Number= {UCB/EECS-2022-15},
    Abstract= {Nanopore sequencing platforms combined with machine learning models have been shown to be effective for detecting base modifications in DNA such as 5mC and 6mA. However, a challenge in building machine-learning-based callers is access to labelled training data that span all modifications on all possible DNA k-mer backgrounds, that is, a \emph{complete} training dataset. Nanopore calling has historically been done with Hidden Markov Models (HMMs); because their emission distributions are independent of one another, these HMMs cannot make successful calls in k-mer contexts not seen during training. Deep neural networks (DNNs), however, are increasingly being used to make base and modification calls, often outperforming their HMM counterparts in the complete data setting. Moreover, it stands to reason that the DNN approach should generalize better to unseen examples because its parameters are more fully shared across all training examples. Herein, we demonstrate that a common DNN approach (DeepSignal) does indeed outperform a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid approach, \emph{AmortizedHMM}, and demonstrate that it outperforms both the pure HMM and DNN approaches on methylation calling when the training data are incomplete.},
}

EndNote citation:

%0 Thesis
%A Yao, Brian 
%A Listgarten, Jennifer 
%T Nanopore Methylation Calling from Limited Training Data
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 1
%@ UCB/EECS-2022-15
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-15.html
%F Yao:EECS-2022-15