End-to-End Large Scale Machine Learning with KeystoneML

Evan Sparks

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2016-200
December 15, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-200.pdf

The rise of data center computing and Internet-connected devices has led to an unparalleled explosion in the volumes of data collected across a multitude of industries and academic disciplines. This data serves as fuel for statistical machine learning techniques that, in turn, enable some of today's most advanced applications, including those powered by image classification, speech recognition, and natural language understanding, which we broadly term machine learning applications.

Unfortunately, until recently, the tools and techniques needed to apply recent advances in machine learning at the scales demanded by modern datasets, and thus to develop these applications, have been available only to experts in fields such as distributed computing, statistics, and optimization.

I describe my efforts to render these tools accessible to a broader audience of application developers. I further demonstrate that, by taking a holistic approach and capturing end-to-end, high-level specifications of machine learning applications, the systems I present here can make novel, high-impact optimizations that decrease resource consumption while simultaneously increasing throughput. These improvements are designed to decrease machine learning application development time, increase quality, and increase the productivity of machine learning application developers. I demonstrate the viability of these optimizations via experiments on a number of real-world applications in domains such as collaborative filtering, computer vision, and natural language processing.

Many of the ideas presented in this thesis have already had practical impact, as embodied in the open-source software packages KeystoneML and Apache Spark MLlib.

Advisor: Michael Franklin


BibTeX citation:

@phdthesis{Sparks:EECS-2016-200,
    Author = {Sparks, Evan},
    Title = {End-to-End Large Scale Machine Learning with KeystoneML},
    School = {EECS Department, University of California, Berkeley},
    Year = {2016},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-200.html},
    Number = {UCB/EECS-2016-200},
    Abstract = {The rise of data center computing and Internet-connected devices has led to an unparalleled explosion in the volumes of data collected across a multitude of industries and academic disciplines. This data serves as fuel for statistical machine learning techniques that, in turn, enable some of today's most advanced applications, including those powered by image classification, speech recognition, and natural language understanding, which we broadly term machine learning applications.

Unfortunately, until recently, the tools and techniques needed to apply recent advances in machine learning at the scales demanded by modern datasets, and thus to develop these applications, have been available only to experts in fields such as distributed computing, statistics, and optimization.

I describe my efforts to render these tools accessible to a broader audience of application developers. I further demonstrate that, by taking a holistic approach and capturing end-to-end, high-level specifications of machine learning applications, the systems I present here can make novel, high-impact optimizations that decrease resource consumption while simultaneously increasing throughput. These improvements are designed to decrease machine learning application development time, increase quality, and increase the productivity of machine learning application developers. I demonstrate the viability of these optimizations via experiments on a number of real-world applications in domains such as collaborative filtering, computer vision, and natural language processing.

Many of the ideas presented in this thesis have already had practical impact, as embodied in the open-source software packages KeystoneML and Apache Spark MLlib.}
}

EndNote citation:

%0 Thesis
%A Sparks, Evan
%T End-to-End Large Scale Machine Learning with KeystoneML
%I EECS Department, University of California, Berkeley
%D 2016
%8 December 15
%@ UCB/EECS-2016-200
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-200.html
%F Sparks:EECS-2016-200