Tech Reports | EECS at UC Berkeley

Alper Vural and Nikhil Narayen and Martin Gouy

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2016-50

May 11, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-50.pdf

Our capstone project seeks to find groupings of similar patents using two data sets covering 5 million previously filed patents. The first data set records which earlier patents each patent cited, and the second contains text from each patent. Our solution converts these two data sets into a “samples” and “features” format that can be used by a machine learning technique known as clustering. Clustering takes data items in this format and groups similar items together. It is a very common practice, and a literature review revealed many approaches, including spectral clustering, k-means, and hierarchical clustering. We measured the accuracy of each approach by checking how consistent its groupings were with a third data set, a text file containing pairs of patents that blocked each other from being filed. The best approach we found was a modified version of standard k-means. Due to issues with the citation data set, as well as time and server memory constraints, we achieved some success but did not reach the desired accuracy.
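The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the toy records, the bag-of-words plus citation-indicator features, the function names, and the "fraction of blocking pairs in the same cluster" score are all assumptions made for illustration, and the clustering step is plain Lloyd's k-means rather than the modified variant the authors found to work best.

```python
from collections import Counter

# Hypothetical toy records: patent id -> (text, ids of cited patents).
PATENTS = {
    "P1": ("battery electrode lithium cell", ["P0"]),
    "P2": ("lithium battery cell anode", ["P0", "P1"]),
    "P3": ("wireless antenna signal receiver", ["P9"]),
    "P4": ("antenna signal wireless transmitter", ["P9", "P3"]),
}
# Hypothetical "blocking" pairs, used only to evaluate the clusters.
BLOCKING_PAIRS = [("P1", "P2"), ("P3", "P4")]


def build_features(patents):
    """Return (ids, matrix): one row (sample) per patent, one column
    (feature) per word or cited patent id."""
    ids = sorted(patents)
    rows = []
    for pid in ids:
        text, cited = patents[pid]
        row = Counter("word:" + w for w in text.split())
        row.update("cites:" + c for c in cited)
        rows.append(row)
    columns = sorted(set().union(*rows))
    matrix = [[float(row[c]) for c in columns] for row in rows]
    return ids, matrix


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(matrix, k, iterations=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [matrix[0]]
    while len(centroids) < k:
        centroids.append(
            max(matrix, key=lambda r: min(sq_dist(r, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in matrix
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(matrix, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def blocking_consistency(ids, labels, pairs):
    """Fraction of blocking pairs whose two patents share a cluster."""
    cluster = dict(zip(ids, labels))
    same = sum(1 for a, b in pairs if cluster[a] == cluster[b])
    return same / len(pairs)


ids, matrix = build_features(PATENTS)
labels = kmeans(matrix, k=2)
score = blocking_consistency(ids, labels, BLOCKING_PAIRS)
print(dict(zip(ids, labels)), score)
```

On real data the feature matrix would be far too large for dense Python lists, which is consistent with the memory constraints the abstract mentions; a sparse representation would be the natural substitute.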

Advisor: Lee Fleming

BibTeX citation:

@mastersthesis{Vural:EECS-2016-50,
    Author= {Vural, Alper and Narayen, Nikhil and Gouy, Martin},
    Title= {Entrepreneurial Patent Data Analysis},
    School= {EECS Department, University of California, Berkeley},
    Year= {2016},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-50.html},
    Number= {UCB/EECS-2016-50},
    Abstract= {Our capstone project seeks to find groupings of similar patents using two data sets covering 5 million previously filed patents. The first data set records which earlier patents each patent cited, and the second contains text from each patent. Our solution converts these two data sets into a “samples” and “features” format that can be used by a machine learning technique known as clustering. Clustering takes data items in this format and groups similar items together. It is a very common practice, and a literature review revealed many approaches, including spectral clustering, k-means, and hierarchical clustering. We measured the accuracy of each approach by checking how consistent its groupings were with a third data set, a text file containing pairs of patents that blocked each other from being filed. The best approach we found was a modified version of standard k-means. Due to issues with the citation data set, as well as time and server memory constraints, we achieved some success but did not reach the desired accuracy.},
}

EndNote citation:

%0 Thesis
%A Vural, Alper 
%A Narayen, Nikhil 
%A Gouy, Martin 
%T Entrepreneurial Patent Data Analysis
%I EECS Department, University of California, Berkeley
%D 2016
%8 May 11
%@ UCB/EECS-2016-50
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-50.html
%F Vural:EECS-2016-50