The Human Annotation Tool

Lubomir Bourdev and Jitendra Malik

Updated June 17, 2011

The Human Annotation Tool (HAT) allows one to annotate people - where their arms and legs are, what their 3D pose is, which body parts are occluded, etc. A database of annotated people would be invaluable for creating computer vision algorithms to detect and localize people.

Try the Human Annotation Tool

You may run a copy of the tool by clicking on the above image and agreeing to all the disclaimers. You need Java and a reasonably good graphics card. The tool has been tested on Mac OS X and on Windows.

Tutorial

The tool supports two kinds of annotations - labeling joints and extracting the 3D pose, and labeling the regions of the body (hair, face, upper clothes, etc).

PART 1: CREATING THE 3D POSE

Here is a video tutorial that shows you how to create the 3D pose:

 

The picture above shows all the controls. To annotate, please follow these steps.

STEP 1. Navigate to the person to annotate (blue controls)

To jump to a given annotation, put its index in the "Current Entry" box. Use the arrows to go to an image containing a person that is not annotated. Pan and zoom to the person. Press the New Entry button and choose Male, Female or Child.

STEP 2. Specify keypoints (picture controls)

Move the mouse over the location of each keypoint and press the corresponding key, indicated with a picture of the body part. The right shoulder, elbow and wrist correspond to keys Q, A, and Z, and the left ones - to W, S and X. The right hip, knee and ankle correspond to E, D, and C, and the left ones - to R, F and V. Select the ears with T and U, the eyes with G and H, and the nose with Y. You may also click and drag keypoints to adjust their locations.

You can also use J for the back of the head, and P and O for the right/left toes respectively. The keys can be changed from the configuration file.
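For reference, the default bindings described above can be collected into a simple key-to-keypoint table. The snippet below is only an illustrative Python summary; the authoritative mapping is whatever the configuration file specifies.

    # Default keypoint keys as described above. "Right" and "left" are from
    # the annotated person's point of view, not the viewer's.
    KEYPOINT_KEYS = {
        "Q": "right shoulder", "W": "left shoulder",
        "A": "right elbow",    "S": "left elbow",
        "Z": "right wrist",    "X": "left wrist",
        "E": "right hip",      "R": "left hip",
        "D": "right knee",     "F": "left knee",
        "C": "right ankle",    "V": "left ankle",
        "P": "right toes",     "O": "left toes",
        "Y": "nose",           "J": "back of head",
        "T": "ear",            "U": "ear",   # one key per ear; sides as set in the config file
        "G": "eye",            "H": "eye",   # one key per eye; sides as set in the config file
    }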

If a keypoint is occluded or falls outside the image but you have a rough guess where it should be, mark it as best as you can. Leave keypoints unmarked if you have no idea where they should lie. Both shoulders, however, must always be labelled.

Shoulders, Elbows, Wrists, Hips, Knees, Ankles: Approximate limbs as cylinders. The joint location in 3D is the intersection of the axes of adjoining cylinders.

Left vs Right: The keypoint is labelled as left or right from the point of view of the labelled person, not based on its location in the image. For example, if the person is facing the camera, his or her left keypoints lie on the right in image space.

Detail

Nose Tip: The location is the tip of the nose, regardless of frontal or profile view.

Eyes: In frontal view it is the midpoint of the two eye corners. The eye location does not depend on the pupils. In profile view it is the tip of the eye surface. Even if the eye is closed, we estimate the tip of the eye surface, ignoring the eyelids.

Ears: The tip of the tragus (the small pointed eminence of the external ear).

 

STEP 3. Specify keypoint z-order and visibility (red controls)

When you hover with the mouse over a keypoint, use the red keys to specify keypoint properties or to delete it.

Press N to mark a keypoint as occluded. Occluded keypoints are shown in green. The general rule is that a keypoint is visible if and only if the ray from the keypoint location reaches the camera, with the following exceptions:

Detail

Eyes: Glasses (including dark sunglasses) do not hide the eye. Closed eyes are still considered visible.

Ears: If the tragus is overlapped by the person's hair or clothes, it is considered hidden.

Shoulders, Elbows, Wrists, Hips, Knees, Feet: Clothes over the area of the joint do not hide the joint. The joint may, however, be hidden by the torso or limbs of the same person. For example, in right profile view, the left joints of the body are often occluded by the torso and/or by the right limbs.

Most keypoints have a reference keypoint, and for each keypoint we need to specify whether it is closer to the camera than its reference keypoint, roughly equidistant, or further away. By default keypoints are considered equidistant. When you hover over a keypoint, you will see a segment connecting it to its reference keypoint. Use the B key to toggle between the three depth states of the keypoint. When a keypoint is marked Far (or Near), it is shown smaller (or larger) than usual, and the segment to its reference keypoint also changes.

STEP 4. Adjusting the 3D pose (green controls)

Before adjusting, it is a good idea to press the Save button to record all of your changes. Now adjust the keypoints to approximate the correct 3D pose of the person. Orbit the right window using the left mouse button to see the person from another viewpoint. The 3D pose is usually far from correct initially; you may need to adjust a few keypoints manually and see how that affects the 3D pose. A few tips:

STEP 5. Save!

Be sure to press the Save button before going to the next annotation or you will lose all of your changes!

A pose labelling is acceptable if:

  1. The keypoints are near their ideal positions, and definitely over the body
  2. There are no bright red segments in the 3D view
  3. The 3D pose looks reasonable from multiple viewpoints
  4. Both shoulders are labelled
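The 3D pose is recovered from the 2D keypoints along the lines of Taylor's single-view reconstruction (see the Reference at the bottom of this page). The Python sketch below shows only the core depth-recovery step, assuming scaled orthographic projection and a known relative limb length; it is illustrative and not the tool's actual code.

    import math

    def limb_depth_difference(p1, p2, limb_length, scale):
        """Magnitude of the depth difference between two joints connected by a
        limb of known 3D length, under scaled orthographic projection
        (image coordinates = scale * world X and Y).

        p1 and p2 are (u, v) image positions of the joints. The sign of the
        difference (which joint is nearer the camera) cannot be recovered from
        a single view; the annotator's near/far flags from Step 3 resolve it.
        """
        du, dv = p2[0] - p1[0], p2[1] - p1[1]
        foreshortened_sq = (du * du + dv * dv) / (scale * scale)
        if foreshortened_sq > limb_length ** 2:
            # The projected limb is longer than its assumed 3D length, so the
            # scale or the keypoint positions need adjusting.
            raise ValueError("projected limb is longer than its 3D length")
        return math.sqrt(limb_length ** 2 - foreshortened_sq)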

PART 2: LABELING THE REGIONS

Some images have associated precomputed segmentations. When a segmentation is available, you can switch to region labelling mode using the "Pose3D / Segmentation" radio button.

Segmentation Tool

In this view the left window remains the image itself, while the right window shows the image broken down into segments and the associated label for each segment. You can pan and zoom the image from either window. The segment under the cursor is highlighted in red, as shown in the picture, and its label is shown in the info bar at the bottom - in the above example the cursor is over a region labelled "LowerClothes". The labelled region selected by this segment is shown in the left image (not displayed in this example).

Some files contain thousands of small segments, and it would be too time consuming to label each of them. HAT therefore allows us to hierarchically merge segments into fewer, larger ones and label them at once. Use the up/down arrow keys to subdivide or merge segments; holding down Shift while pressing up/down arrow makes larger segmentation steps. Here is the above example at three different segmentation levels:

[figure: the example above at three different segmentation levels]
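The segmentation files encode such a hierarchy of levels. As an illustration only, the sketch below assumes the hierarchy is stored as a boundary-strength (ultrametric contour) map, the kind of output the Berkeley segmentation engine can produce, where a higher threshold yields fewer, larger segments; HAT's actual on-disk format may differ.

    from scipy import ndimage

    def segments_at_level(boundary_map, level):
        """Segments at one level of a hierarchical segmentation, assuming the
        hierarchy is an ultrametric contour map: pixels are grouped whenever
        the boundary strength separating them is below `level`.

        Returns a label image and the number of segments. Raising `level`
        merges segments; lowering it subdivides them.
        """
        interior = boundary_map < level            # True where no strong boundary
        labels, num_segments = ndimage.label(interior)
        return labels, num_segments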

A preferred way to label images containing thousands of small segments is to start labelling at low subdivision levels and then increase the subdivision and refine the labelled regions.

[figure: region labelling keys]

Segments are labelled by pressing a key while the mouse is over the segment. Most of the keys are shown on the picture to the right and are as follows:

The following keys are not shown in the illustration on the right:

Other keys:

Note the following:

The region annotation tool now supports restricted labeling, which is very helpful in making sure new edits don't damage previous annotations. Specifically, the scope of a region labeling sequence is constrained to the label under the mouse at the beginning of the sequence. For example, if you press and hold the "d" (dress) key while the mouse is over an "l" (lower clothes) region, and then move the mouse, you will be marking regions as dress, but only the ones that used to be marked lower clothes. Similarly, if you press and hold the delete key while on a labelled region, your erasing operation will only affect regions with that label.
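A minimal sketch of that restriction, with the segmentation simplified to a dictionary from segment id to label (this models the behaviour described above, not the tool's actual code):

    def relabel_restricted(segment_labels, start_segment, new_label, hovered_segments):
        """Relabel segments during a press-and-drag sequence, but only those
        whose current label matches the label under the mouse when the key was
        first pressed. Passing new_label=None models the delete key: it erases
        labels, again only within the label the sequence started on.
        """
        scope = segment_labels.get(start_segment)   # label that bounds the edit
        for seg in hovered_segments:
            if segment_labels.get(seg) == scope:
                segment_labels[seg] = new_label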

The June 2011 version fixes several bugs that were introduced with the latest Java updates. Region labeling is now possible on Macs. However, on some computers region labeling is VERY slow. If you find it slow please try it on a different computer/operating system.

Using the tool on your own data, and beyond the person category

By default the tool uses demo images and cannot save annotations for them. To use it on your own data, create a directory, for example my_annotations, and place your images in a subdirectory my_annotations/images. In addition, place a configuration file at my_annotations/person_config.xml. Here is the default configuration file. You may change it to add new keypoints or regions, change the shortcuts, or even label a new category. On startup the tool asks you to open that configuration file. The tool accepts JPG images only.

Region labeling requires segmenting the images and placing the segmentations in a directory my_annotations/segmentations. To create the segmentation image from a given JPEG image you need to download the Berkeley segmentation engine. Use this file to create the segmentation image and save it as a PNG in the segmentations directory. Restart the HAT tool and go to that image. The radio button for segmentation should now be enabled.
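For instance, the expected layout can be prepared with a few lines of Python; my_annotations is just an example name, and only the directory names come from this page.

    from pathlib import Path

    root = Path("my_annotations")
    (root / "images").mkdir(parents=True, exist_ok=True)   # your JPG images
    (root / "segmentations").mkdir(exist_ok=True)          # optional PNG segmentations
    # Place your edited copy of the default configuration file at
    # my_annotations/person_config.xml. The tool itself creates
    # my_annotations/info when you save annotations.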

The annotations are saved in a new directory my_annotations/info that the tool creates. Each annotation is a separate XML file. The XML files generated by the tool can also be read by our Matlab tools, which can use the data to create poselets or to compute various statistics.
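The exact XML schema is not documented on this page, so a schema-agnostic way to start exploring the saved annotations from Python is simply to walk the info directory and inspect each file's structure before writing a real parser.

    import xml.etree.ElementTree as ET
    from pathlib import Path

    for xml_path in sorted(Path("my_annotations/info").glob("*.xml")):
        root = ET.parse(str(xml_path)).getroot()
        # Print the file name, the root tag, and the distinct top-level child
        # tags to see what the tool actually stores for each annotation.
        print(xml_path.name, root.tag, sorted({child.tag for child in root}))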

The H3D dataset version 1.01+ is compatible with this tool. You may download it and explore its directory structure.

Code

The source code of the annotation tool is now available here. Feel free to extend it or fix any bugs and let me know if you have created an improved version that we can use instead of this one. The code requires bravery. You are on your own.

Copyright

The images used in H3D are taken from Flickr under the Creative Commons Attribution license. It allows for redistribution and derivative work for non-commercial or commercial purposes as long as the authors are attributed accordingly. Please see the license for more detail.

Feedback

If you find bugs or have suggestions on how to improve the tool, please email me at lbourdev-at-eecs. Your feedback is much appreciated!

Reference

Camillo J. Taylor, "Reconstruction of Articulated Objects from Point Correspondences in a Single Image," Computer Vision and Image Understanding, vol. 80, no. 3, pp. 349-363, Dec. 2000.