The Berkeley Segmentation Dataset and Benchmark

New: The BSDS500, an extended version of the BSDS300 that includes 200 fresh test images, is now available here.

The goal of this work is to provide an empirical basis for research on image segmentation and boundary detection.  To this end, we have collected 12,000 hand-labeled segmentations of 1,000 Corel dataset images from 30 human subjects.  Half of the segmentations were obtained by presenting the subject with a color image; the other half by presenting a grayscale image. The public benchmark based on this data consists of all of the grayscale and color segmentations for 300 images. The images are divided into a training set of 200 images and a test set of 100 images.

We have also generated figure-ground labelings for a subset of these images, which may be found here.

We have used this data for both developing new boundary detection algorithms, and for developing a benchmark for that task.  You may download a MATLAB implementation of our boundary detector below, along with code for running the benchmark.  We are committed to maintaining a public repository of benchmark results in the spirit of cooperative scientific progress.

On-Line Browsing


  • By Image -- This page contains the list of all the images. Clicking on an image leads you to a page showing all the segmentations of that image.
  • By Human Subject -- Clicking on a subject's ID leads you to a page showing all of the segmentations performed by that subject.

Benchmark Results

  • By Algorithm -- This page shows the list of tested algorithms, ordered as they perform on the benchmark.
  • By Image -- This page shows the test images.  The images are ordered by how well any algorithm can find boundaries, so that it is easy to see which images are "easy" and which are "hard" for the machine.

On all of these pages, there are many cross-links between images, subjects, and algorithms.  Note that many of the smaller images are linked to full-size versions.


Segmentation Dataset

You are free to download a portion of the dataset for non-commercial research and educational purposes.  In exchange, we request only that you make available to us the results of running your segmentation or boundary detection algorithm on the test set as described below.  Work based on the dataset should cite our ICCV 2001 paper:

  @InProceedings{MartinFTM01,
    author = {D. Martin and C. Fowlkes and D. Tal and J. Malik},
    title = {A Database of Human Segmented Natural Images and its
             Application to Evaluating Segmentation Algorithms and
             Measuring Ecological Statistics},
    booktitle = {Proc. 8th Int'l Conf. Computer Vision},
    year = {2001},
    month = {July},
    volume = {2},
    pages = {416--423}
  }

You can download the [images] (22MB) and the [human segmentations] (27MB) separately.  If you download both of these, you can safely untar them on top of each other.

You can find a description of the segmentation file format here.

You can also download a tarball containing the Java application we used to construct the dataset. You may find it useful for creating ground truth segmentations of your own images.

Human Benchmark Results

If you want to generate web pages containing the benchmark results for your algorithm, then you'll need to download the benchmark results for the humans. Untar this file into a fresh directory, which will be your repository for benchmark results. You should then deposit your algorithm's results into this same directory, as per the directions in the Dataset/README file (which is in the code tarball that you also need; see below).

Benchmark and Boundary Detection Code

Here is the tarball of code, which you can also browse.  You should untar it in a fresh directory.  Running gmake install in that directory after untarring should build everything.  The makefile will create a lib/matlab directory that you should put in your MATLAB path.  Briefly, the subdirectory contents are listed below.  

  • Benchmark -- Code to run the benchmark and create web pages.
  • CSA++ -- C++ and MATLAB wrapping of Andrew Goldberg's CSA package for graph assignment problems.  This is the computational core of the benchmark, as it allows us to compare two boundary maps while both permitting localization error and avoiding over-counting.
  • Dataset -- Convenience routines for accessing images and segmentation data.  You should make sure to download the BSDS dataset (see above), and edit the file bsdsRoot.m to point to the data.
  • Detectors -- End-user routines for various boundary detectors.  Our brightness/color/texture gradient detectors are here (pbBGTG.m and pbCGTG.m), along with baseline detectors based on the image gradient magnitude and the second moment matrix.
  • Filters -- Routines for creating high quality filters and for filtering images quickly.
  • Gradients -- Routines to compute our brightness, color, and texture gradients efficiently.
  • Textons -- Code to compute and manipulate textons, which are the basis of the texture gradient.  The files unitex*.mat contain universal textons computed from the BSDS300 training set.
  • Util -- Miscellaneous support code for everything else.
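
As a rough illustration of the matching step that the CSA++ code performs, the sketch below pairs predicted boundary pixels with ground-truth pixels one-to-one under a localization tolerance, using SciPy's Hungarian solver as a stand-in for Goldberg's CSA package.  The function name, tolerance, and test points are illustrative, and the real benchmark's assignment formulation (with explicit outlier handling) is more elaborate than this.

```python
# Simplified stand-in for the CSA matching step: match predicted boundary
# pixels to ground-truth pixels one-to-one, permitting a small localization
# error while never counting a ground-truth pixel twice.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_boundaries(pred_pts, gt_pts, max_dist=2.0):
    """pred_pts (N,2) and gt_pts (M,2): (row, col) boundary pixel coordinates.
    Returns the number of one-to-one matches within max_dist pixels."""
    if len(pred_pts) == 0 or len(gt_pts) == 0:
        return 0
    # Pairwise Euclidean distances between predicted and ground-truth pixels.
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=2)
    cost = np.where(d <= max_dist, d, 1e6)   # forbid far-away matches
    rows, cols = linear_sum_assignment(cost)  # one-to-one assignment
    return int(np.sum(d[rows, cols] <= max_dist))

pred = np.array([[0, 0], [0, 1], [5, 5]])
gt   = np.array([[0, 0], [1, 1]])
print(match_boundaries(pred, gt))  # 2: both gt pixels matched; (5,5) is unmatched
```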


Submitting Benchmark Results

If you have a boundary detector or segmentation algorithm, submit your results on the test images as 8-bit grayscale BMP images.  These images should be the same size as the benchmark images (481x321 pixels) and should be named <iid>.bmp, where <iid> is the image ID number.  You should also create a name.txt file containing a 1-line text descriptor for your algorithm, and an optional about.html file with a short description.  The description can contain HTML links.
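
A minimal sketch of producing output in this form, assuming the Pillow library for image I/O; the image ID below is a hypothetical placeholder, not a real BSDS image number.

```python
# Write a soft boundary map as the benchmark expects: an 8-bit grayscale
# BMP, 481 pixels wide by 321 pixels tall, named <iid>.bmp.
import numpy as np
from PIL import Image

iid = 123456                        # hypothetical image ID
pb = np.random.rand(321, 481)       # soft boundary map, values in [0, 1]
img = Image.fromarray((pb * 255).round().astype(np.uint8), mode="L")
img.save(f"{iid}.bmp")
print(img.size)                     # (481, 321): PIL reports (width, height)
```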

In the downloads section above, you will find the code for running the benchmark, as well as scripts for generating web pages. This code is known to build and work on Intel/Linux platforms. We do not support Windows, although we know of at least one case where the code was built successfully on Windows using Cygwin. The code has also been built successfully on Mac Intel (see notes here). You will need MATLAB to run the benchmark. If you have the appropriate hardware and software, then please download the code and run the benchmark yourself. To submit results, tar up your algorithm directory and send us a URL from which we can download it.

If you are unable to run the benchmark yourself, then you may submit a tarball containing your algorithm's results with the name.txt and about.html files. We will run the benchmark for you, but we cannot guarantee quick turnaround.

Segmentation results should be in the form of binary images where a "1" marks the segment boundary pixels.  Boundary detection results can also be in this form, but we strongly encourage a "soft" boundary representation.  Submitting a soft output will remove the burden on you of choosing an optimal threshold, since the benchmark will find this threshold for you.  Note also that for best results, the boundaries should be thinned, e.g. by performing non-maxima suppression. The benchmark will handle thick boundaries, but the morphological thinning operation that we do to thin boundaries may not be optimal for your algorithm.
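
The thinning idea can be sketched as a one-dimensional non-maxima suppression: keep a pixel only if it is a local maximum along its row, which thins predominantly vertical boundaries to one pixel wide.  This is a simplification; the benchmark itself applies morphological thinning, and our pb detectors suppress along the local gradient orientation instead.

```python
# Minimal 1-D non-maxima suppression along rows of a soft boundary map.
import numpy as np

def nms_horizontal(pb):
    """Zero out pixels that are not a local maximum along their row."""
    p = np.pad(pb, ((0, 0), (1, 1)))             # zero-pad left/right columns
    # >= to the left but strictly > to the right, so a flat 2-pixel ridge
    # keeps only its left pixel.
    keep = (pb >= p[:, :-2]) & (pb > p[:, 2:])
    return np.where(keep, pb, 0.0)

pb = np.array([[0.1, 0.9, 0.2],
               [0.1, 0.8, 0.2],
               [0.1, 0.7, 0.2]])
print(nms_horizontal(pb))   # only the middle column of maxima survives
```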

Please Note: Although this should go without saying, we will say it anyway.  To ensure the integrity of results on the test data set, you may use the images and human segmentations in the training set for tuning your algorithms, but your algorithms should not have access to any of the data (images or segmentations) in the test set until you are finished designing and tuning your algorithm.

About the Benchmark

"When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of the meager and unsatisfactory kind." --Lord Kelvin

The goal of the benchmark is to produce a score for an algorithm's boundaries for two reasons:  (1) So that different algorithms can be compared to each other, and (2) So that progress toward human-level performance can be tracked over time.  We have spent a great deal of time working on a meaningful boundary detection benchmark, which we will explain briefly here.  See our NIPS 2002 and PAMI papers for additional details.  Note that the methodology we have settled on may be applied to any boundary dataset -- not just our dataset of human segmented natural images.  

The setup is as follows.  The human segmented images provide our ground truth boundaries.  We consider any boundary marked by a human subject to be valid.  Since we have multiple segmentations of each image by different subjects, it is the collection of these human-marked boundaries that constitutes the ground truth.  We are then presented the output of some algorithm for an image.  Let us assume that this output is a soft boundary map with one pixel wide boundaries, valued from zero to one where high values signify greater confidence in the existence of a boundary.  Our task is to determine how well this soft boundary map approximates the ground truth boundaries.

Traditionally, one would "binarize" the boundary map by choosing some threshold.  There are two problems with thresholding a boundary map: (1) The optimal threshold depends on the application, and we would like the benchmark to be useful across different applications, and (2) Thresholding a low-level feature like boundaries is likely to be a bad idea for most applications, since it destroys much information.  For these reasons, our benchmark operates on a non-thresholded boundary map.  

Nevertheless, we do need to threshold the boundary map in order to compare it to the ground truth boundaries, but we do so at many levels, e.g. 30.  At each level, we compute two quantities -- precision and recall -- and in this manner produce a precision-recall curve for the algorithm.  Precision and recall are similar to but different from the axes of ROC curves.  Precision is the probability that a machine-generated boundary pixel is a true boundary pixel.  Recall is the probability that a true boundary pixel is detected.  We consider these axes to be sensible and intuitive.  Precision is a measure of how much noise is in the output of the detector.  Recall is a measure of how much of the ground truth is detected.   The curve shows the inherent trade-off between these two quantities -- the trade-off between misses and false positives -- as the detector threshold changes.
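 
The thresholding loop can be sketched as follows.  For simplicity this sketch scores correspondence by exact pixel overlap; the real benchmark instead matches pixels with the CSA assignment code so that small localization errors are tolerated.  Function names and the toy data are illustrative.

```python
# Binarize a soft boundary map at many levels and compute precision and
# recall at each level, tracing out a precision-recall curve.
import numpy as np

def pr_curve(pb, gt, n_levels=30):
    """pb: soft boundary map in [0,1]; gt: binary ground-truth boundary map."""
    curve = []
    for t in np.linspace(0.01, 0.99, n_levels):
        det = pb >= t
        tp = np.sum(det & gt)               # detections that are true boundaries
        precision = tp / max(det.sum(), 1)  # fraction of detections that are true
        recall = tp / max(gt.sum(), 1)      # fraction of truth that is detected
        curve.append((t, precision, recall))
    return curve

gt = np.zeros((5, 5), bool)
gt[:, 2] = True                             # toy image: one vertical boundary
pb = np.where(gt, 0.8, 0.3)                 # detector is confident on the boundary
for t, p, r in pr_curve(pb, gt, n_levels=3):
    print(f"t={t:.2f}  P={p:.2f}  R={r:.2f}")
```

At a low threshold everything fires (high recall, low precision); at a high threshold nothing does; in between, this ideal detector achieves both perfect precision and perfect recall.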

Although the precision-recall curve for an algorithm is a rich descriptor of its performance, it is still desirable to distill the performance of an algorithm into a single number.  This is possible to do in a meaningful way for algorithms whose curves do not intersect and are roughly parallel.  When two precision-recall curves do not intersect, then the curve furthest from the origin dominates the other.  The summary statistic that we use is a measure of this distance.  It is the F-measure, which is the harmonic mean of precision and recall.  The F-measure is defined at all points on the precision-recall curve.  We report the maximum F-measure value across an algorithm's precision-recall curve as its summary statistic.
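
The summary statistic is then a few lines of arithmetic: F = 2PR/(P+R), maximized over the points of the curve.  The (P, R) points below are hypothetical.

```python
# F-measure: the harmonic mean of precision and recall, maximized over
# an algorithm's precision-recall curve.
def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

curve = [(0.9, 0.3), (0.7, 0.6), (0.4, 0.9)]   # hypothetical (P, R) points
best = max(f_measure(p, r) for p, r in curve)
print(round(best, 3))  # 0.646: the balanced middle point wins
```

Note how the harmonic mean punishes imbalance: the extreme points (0.9, 0.3) and (0.4, 0.9) both score lower than the balanced (0.7, 0.6).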

Why do we use precision-recall curves instead of ROC curves?  

Receiver operating characteristic (ROC) curves show, qualitatively, the same trade-off between misses and false positives that precision-recall curves show.  However, ROC curves are not appropriate for quantifying boundary detection.  The axes for an ROC curve are fallout and recall.  Recall is the same as above, and is also called hit rate.  Fallout, or false alarm rate, is the probability that a non-boundary (negative) pixel is incorrectly labeled as a boundary.  This is not a meaningful quantity for a boundary detector since it is not independent of the image resolution.  If we reduce the radius of the pixels by a factor of n so that the number of pixels grows as n^2, then the number of true negatives will grow quadratically in n while the number of true positives will grow only linearly in n.  Since boundaries are 1D objects, the number of false positives is most likely to also grow linearly in n, and so the fallout will decline by a factor of 1/n.  Precision does not have this problem, since instead of being normalized by the number of true negatives, it is normalized by the number of positives.
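
The scaling argument can be checked with a quick numeric illustration.  The counts below are made up for illustration, not measured: boundary-related counts (true and false positives) grow linearly with the resolution factor n, while the total pixel count grows quadratically.

```python
# Fallout shrinks roughly like 1/n as resolution grows; precision is stable.
rows = []
for n in (1, 2, 4):
    total = 10_000 * n**2          # all pixels: grows quadratically in n
    tp = 300 * n                   # true positives: boundaries are 1-D, ~linear
    fp = 100 * n                   # false positives: also ~linear in n
    tn = total - tp - fp           # nearly all pixels are true negatives
    fallout = fp / tn
    precision = tp / (tp + fp)
    rows.append((n, fallout, precision))
    print(f"n={n}: fallout={fallout:.5f}  precision={precision:.2f}")
```

Doubling n roughly halves the fallout while precision stays at 0.75, which is why the benchmark reports precision rather than fallout.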

This page is maintained by Pablo Arbelaez, Charless Fowlkes, and David Martin.  Last modified June 2007.