Large-Scale Analysis of Modern Code Review Practices and Software Security in Open Source Software

Christopher Thompson

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2017-217

December 14, 2017

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-217.pdf

Modern code review is a lightweight, informal process for integrating changes into a software project, popularized by GitHub and pull requests. A rich empirical understanding of modern code review and its effects on software quality and security can help development teams make informed decisions, weigh the costs and benefits of implementing code review for their projects, and identify ways to support and improve its use.

This dissertation presents the results of our analyses of the relationships between modern code review practice and software quality and security across a large population of open source software projects. First, we describe our neural-network-based quantification model, which allows us to efficiently estimate the number of security bugs reported to a software project. Our model builds on prior quantification-optimized models with a novel regularization technique we call random proportion batching. We use our quantification model to perform association analysis of very large samples of code review data, confirming and generalizing prior work on the relationship between code review and software security and quality. We then leverage time-series changepoint detection techniques to mine for repositories that implemented code review in the middle of their development. We use this dataset to explore the causal treatment effect of implementing code review on software quality and security. We find that implementing code review may significantly reduce security issues for projects that are already prone to them, but may significantly increase overall issues filed against projects. Finally, we expand our changepoint detection to find and analyze the effect of using automated code review services, finding that their use may significantly decrease issues reported to a project. These findings provide evidence that modern code review is an effective tool for improving software quality and security. They also suggest that the development of better tools supporting code review, particularly for software security, could magnify this benefit while decreasing the cost of integrating code review into a team's development process.

Advisor: David Wagner

BibTeX citation:

@phdthesis{Thompson:EECS-2017-217,
    Author= {Thompson, Christopher},
    Title= {Large-Scale Analysis of Modern Code Review Practices and Software Security in Open Source Software},
    School= {EECS Department, University of California, Berkeley},
    Year= {2017},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-217.html},
    Number= {UCB/EECS-2017-217},
    Abstract= {Modern code review is a lightweight, informal process for integrating changes into a software project, popularized by GitHub and pull requests. A rich empirical understanding of modern code review and its effects on software quality and security can help development teams make informed decisions, weigh the costs and benefits of implementing code review for their projects, and identify ways to support and improve its use.

This dissertation presents the results of our analyses of the relationships between modern code review practice and software quality and security across a large population of open source software projects. First, we describe our neural-network-based quantification model, which allows us to efficiently estimate the number of security bugs reported to a software project. Our model builds on prior quantification-optimized models with a novel regularization technique we call random proportion batching. We use our quantification model to perform association analysis of very large samples of code review data, confirming and generalizing prior work on the relationship between code review and software security and quality. We then leverage time-series changepoint detection techniques to mine for repositories that implemented code review in the middle of their development. We use this dataset to explore the causal treatment effect of implementing code review on software quality and security. We find that implementing code review may significantly reduce security issues for projects that are already prone to them, but may significantly increase overall issues filed against projects. Finally, we expand our changepoint detection to find and analyze the effect of using automated code review services, finding that their use may significantly decrease issues reported to a project. These findings provide evidence that modern code review is an effective tool for improving software quality and security. They also suggest that the development of better tools supporting code review, particularly for software security, could magnify this benefit while decreasing the cost of integrating code review into a team's development process.},
}

EndNote citation:

%0 Thesis
%A Thompson, Christopher
%T Large-Scale Analysis of Modern Code Review Practices and Software Security in Open Source Software
%I EECS Department, University of California, Berkeley
%D 2017
%8 December 14
%@ UCB/EECS-2017-217
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-217.html
%F Thompson:EECS-2017-217