Fully Distributed EM for Very Large Datasets

Jason Wolfe, Aria Delier Haghighi and Daniel Klein

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2007-178
December 22, 2007

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-178.pdf

In EM and related algorithms, E-step computations distribute easily, because data items are independent given parameters. For very large data sets, however, even storing all of the parameters in a single node for the M-step can be prohibitive. We present a framework which exploits parameter sparsity to fully distribute the entire EM procedure. Each node interacts with only the subset of parameters relevant to its data, sending messages to other nodes along a junction-tree topology. We demonstrate the effectiveness of our framework over a MapReduce approach, on two tasks: word alignment for machine translation, and LDA for topic modeling.


BibTeX citation:

@techreport{Wolfe:EECS-2007-178,
    Author = {Wolfe, Jason and Haghighi, Aria Delier and Klein, Daniel},
    Title = {Fully Distributed EM for Very Large Datasets},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2007},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-178.html},
    Number = {UCB/EECS-2007-178},
    Abstract = {In EM and related algorithms, E-step computations distribute easily, because data items are independent given parameters.  For very large data sets, however, even storing all of the parameters in a single node for the M-step can be prohibitive.  We present a framework which exploits parameter sparsity to fully distribute the entire EM procedure.  Each node interacts with only the subset of parameters relevant to its data, sending messages to other nodes along a junction-tree topology.  We demonstrate the effectiveness of our framework over a MapReduce approach, on two tasks: word alignment for machine translation, and LDA for topic modeling.}
}

EndNote citation:

%0 Report
%A Wolfe, Jason
%A Haghighi, Aria Delier
%A Klein, Daniel
%T Fully Distributed EM for Very Large Datasets
%I EECS Department, University of California, Berkeley
%D 2007
%8 December 22
%@ UCB/EECS-2007-178
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-178.html
%F Wolfe:EECS-2007-178