Interpreting batch correction of single-cell variational inference at scale

Katherine Wu

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2022-91

May 13, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-91.pdf

Single-cell RNA sequencing datasets often contain unwanted technical variation from differences in sample collection, protocol, sequencing depth, experimental labs, and biological factors. These nuisance factors, known as batch effects, are especially common in newer datasets that span multiple conditions and hundreds of donors. To correct for such batch effects, integration methods like single-cell variational inference (scVI) combine data from multiple samples and produce a self-consistent version for downstream analysis. In this thesis, we benchmark scVI's current performance on complex integration tasks involving datasets of more than 100 donors, evaluating its ability to both remove batch effects and retain important biological information. We further propose the addition of a donor embedding to the model architecture, and demonstrate that the embedding is effective at interpreting batch correction for confounding covariates. Finally, we assess scVI integration in relation to gene expression through a scoring protocol that measures the batch sensitivity of each gene.
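
The abstract does not specify how the per-gene batch sensitivity score is computed. As a rough illustration of the general idea only, the sketch below scores each gene by the share of its expression variance explained by batch label (eta-squared from a one-way ANOVA decomposition). This is a swapped-in, generic statistic and not the thesis's protocol; the function name `batch_sensitivity_scores` and its inputs are hypothetical.

```python
import numpy as np


def batch_sensitivity_scores(expr, batches):
    """Hypothetical per-gene score: share of variance explained by batch.

    expr:    (n_cells, n_genes) array of normalized expression values.
    batches: length-n_cells array of batch (e.g. donor) labels.
    Returns one value in [0, 1] per gene; higher means the gene's
    expression differs more between batches than within them.
    """
    expr = np.asarray(expr, dtype=float)
    batches = np.asarray(batches)

    grand_mean = expr.mean(axis=0)
    total_ss = ((expr - grand_mean) ** 2).sum(axis=0)

    # Between-batch sum of squares, accumulated one batch at a time.
    between_ss = np.zeros(expr.shape[1])
    for b in np.unique(batches):
        group = expr[batches == b]
        between_ss += group.shape[0] * (group.mean(axis=0) - grand_mean) ** 2

    # Genes with zero total variance get a score of 0 rather than NaN.
    return np.divide(
        between_ss, total_ss, out=np.zeros_like(between_ss), where=total_ss > 0
    )
```

Under this assumption, a gene whose expression is fully determined by batch scores 1, and a gene whose batch means coincide scores 0. Comparing such scores before and after integration would be one way to see which genes a correction method changes most.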

Advisor: Nir Yosef

BibTeX citation:

@mastersthesis{Wu:EECS-2022-91,
    Author= {Wu, Katherine},
    Title= {Interpreting batch correction of single-cell variational inference at scale},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-91.html},
    Number= {UCB/EECS-2022-91},
    Abstract= {Single-cell RNA sequencing datasets often contain unwanted technical variation from differences in sample collection, protocol, sequencing depth, experimental labs, and biological factors. These nuisance factors, known as batch effects, are especially common in newer datasets that span multiple conditions and hundreds of donors. To correct for such batch effects, integration methods like single-cell variational inference (scVI) combine data from multiple samples and produce a self-consistent version for downstream analysis. In this thesis, we benchmark scVI's current performance on complex integration tasks involving datasets of more than 100 donors, evaluating its ability to both remove batch effects and retain important biological information. We further propose the addition of a donor embedding to the model architecture, and demonstrate that the embedding is effective at interpreting batch correction for confounding covariates. Finally, we assess scVI integration in relation to gene expression through a scoring protocol that measures the batch sensitivity of each gene.},
}

EndNote citation:

%0 Thesis
%A Wu, Katherine
%T Interpreting batch correction of single-cell variational inference at scale
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 13
%@ UCB/EECS-2022-91
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-91.html
%F Wu:EECS-2022-91