Video from a pair of cameras is used to estimate the distance of people or other objects in the world using stereo correspondence techniques. The census correspondence algorithm [11] determines similarity between image regions, not based on inter-image intensity comparisons, but rather based on inter-image comparison of intra-image intensity ordering information. The census algorithm involves two steps: first the input images are transformed so that each pixel represents its local image structure; second, the elements of these transformed images are put into correspondence, producing a disparity image.
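The two steps can be illustrated with a minimal sketch. It is not the FPGA implementation described below: the 3 by 3 neighborhood, the border clamping, the single-pixel Hamming cost (practical systems typically sum the cost over a window), and all function names are assumptions made for illustration.

```python
def census_transform(img, radius=1):
    """Replace each pixel by a bit string recording which neighbours are darker."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            bits = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    # Clamp at the image border (an assumption of this sketch).
                    ny = min(max(y + dy, 0), h - 1)
                    nx = min(max(x + dx, 0), w - 1)
                    bits = (bits << 1) | (1 if img[ny][nx] < img[y][x] else 0)
            out[y][x] = bits
    return out


def hamming(a, b):
    """Number of differing bits between two census codes."""
    return bin(a ^ b).count("1")


def disparity_map(left, right, max_disp):
    """For each left pixel, pick the disparity whose census code matches best."""
    cl, cr = census_transform(left), census_transform(right)
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_cost, best_d = None, 0
            # Only disparities that stay inside the right image are tried.
            for d in range(min(max_disp, x + 1)):
                cost = hamming(cl[y][x], cr[y][x - d])
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y][x] = best_d
    return disp
```

Because the census code depends only on the ordering of intensities within the neighborhood, any monotonically increasing change of gain or offset in one camera leaves the codes unchanged.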
We have implemented the census algorithm on a single PCI card, multi-FPGA reconfigurable computing engine [12]. This stereo system is capable of computing 24 stereo disparities on 320 by 240 images at 42 frames per second, or approximately 77 million pixel-disparities per second. These processing speeds compare favorably with other real-time stereo implementations such as [4].
Given dense depth information, a silhouette of a user is found by selecting the nearest human-sized surface and then tracking that region until it disappears. Our segmentation and grouping technique proceeds in several stages of processing. We first smooth the raw range signal using a morphological closing operator to reduce the effect of low confidence stereo disparities. We then compute the response of a gradient operator on the smoothed range data. We threshold the gradient response above a critical value, and multiply the inverse of this thresholded gradient image (a binary quantity) with the smoothed range data. This creates regions of zero value in the image where abrupt transitions occur, such as between people. We finally apply a connected-components grouping analysis to this separated range image, marking contiguous regions with distinct integer labels.
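The stages above can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the 3 by 3 structuring element, the central-difference gradient, 4-connectivity, the treatment of zero as "no surface", and all function names are choices made for the example.

```python
from collections import deque


def _filter3(img, op):
    """Apply op (max or min) over each pixel's 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    return [[op(img[ny][nx]
                for ny in range(max(y - 1, 0), min(y + 2, h))
                for nx in range(max(x - 1, 0), min(x + 2, w)))
             for x in range(w)] for y in range(h)]


def closing(img):
    """Grey-scale morphological closing: dilation followed by erosion."""
    return _filter3(_filter3(img, max), min)


def gradient_magnitude(img):
    """Sum of absolute central differences in x and y."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            out[y][x] = abs(gx) + abs(gy)
    return out


def separate(range_img, grad_thresh):
    """Zero the smoothed range wherever the gradient exceeds the threshold."""
    smooth = closing(range_img)
    grad = gradient_magnitude(smooth)
    return [[0 if g > grad_thresh else s for s, g in zip(srow, grow)]
            for srow, grow in zip(smooth, grad)]


def label_regions(img):
    """Label 4-connected non-zero regions with integers 1..n; returns (labels, n)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] == 0 or labels[y][x]:
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and img[ny][nx] != 0 and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Two adjacent surfaces at different ranges form one connected component in the raw range image; after `separate`, the zeroed band along the depth discontinuity makes `label_regions` assign them distinct labels.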
This processing is repeated with each new set of video frames obtained from the video cameras. After a new set of regions is obtained, it is compared to the set obtained for the previous frame. Temporal correspondences are established between regions through time on the basis of true surface area and proximity. We mark a particular region as the target person and follow it until it leaves a defined workspace area; we then select a new target by choosing the nearest depth region.
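One plausible way to establish such correspondences is a greedy nearest-match under area and distance gates, sketched below. The region representation (label mapped to surface area and centroid), the greedy strategy, and the default thresholds are hypothetical; the text above specifies only that area and proximity are the matching criteria.

```python
import math


def match_regions(prev, curr, max_dist=50.0, max_area_ratio=1.5):
    """Match current regions to previous ones by proximity, gated by area.

    prev, curr: dict mapping label -> (surface_area, (cx, cy)), areas > 0.
    Returns a dict mapping current label -> previous label.
    The thresholds are illustrative placeholders, not values from the system.
    """
    candidates = []
    for c, (c_area, c_pos) in curr.items():
        for p, (p_area, p_pos) in prev.items():
            dist = math.dist(c_pos, p_pos)
            ratio = max(c_area, p_area) / min(c_area, p_area)
            if dist <= max_dist and ratio <= max_area_ratio:
                candidates.append((dist, c, p))
    matches, used_prev = {}, set()
    # Accept the closest pairs first; each region is matched at most once.
    for _, c, p in sorted(candidates):
        if c not in matches and p not in used_prev:
            matches[c] = p
            used_prev.add(p)
    return matches
```

A current region that finds no match is treated as new, and a previous region left unmatched is treated as having disappeared.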
This depth information is used to isolate figure from ground, so that
the color and face detection modules described below are not confused
by clutter from background content or other users who are not
currently being tracked. (We are also currently extending our system
to simultaneously track and process several users.) Specifically, we
use the recovered connected component target region as a boolean mask
which we apply to images from the primary camera before passing them
to the color and pattern matching modules.
It is
possible to estimate head location using just the peaks of a person's
silhouette computed from a range region
[10]; when color and face information are not available, we
use this estimate to determine head position. In all modes of the
system, we also use range to constrain estimated face size; if the
estimated real size of a face is not within one standard deviation of
average head size, we use the projected average size to set the
head size (but not position).
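The size constraint can be written out as a short sketch, assuming a simple pinhole camera model in which real size equals pixel size times depth divided by focal length. The function name and parameters are hypothetical, and the head-size mean and standard deviation must be supplied by the caller since no values are given here.

```python
def constrain_face_size(pixel_size, depth_m, focal_px, mean_m, std_m):
    """Return the face size in pixels, replaced by the projected average
    head size when the implied real size is implausible.

    Assumes a pinhole model: real_size = pixel_size * depth / focal_length.
    mean_m and std_m are the average head size and its standard deviation
    in metres (caller-supplied; not specified in the text).
    """
    real_size = pixel_size * depth_m / focal_px
    if abs(real_size - mean_m) <= std_m:
        return pixel_size
    # Project the average head size back into the image at this depth.
    return mean_m * focal_px / depth_m
```

Only the size is overridden; the estimated head position is left as it was.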
Within the foreground depth region of a particular user, it is useful to mark regions that correspond to skin color. We use a classification strategy which matches skin hue but is largely invariant to intensity or saturation, as this is robust to different lighting conditions and absolute amount of skin pigment in a particular person.
We apply color segmentation processing to images obtained from the primary camera. Each image is initially represented in terms of its red, green, and blue channels. It is converted directly into a "log color-opponent" space similar to that used by the human visual system. This space can directly represent the approximate hue of skin color, as well as its log intensity value. Specifically, (R,G,B) tuples are converted into tuples of the form (log(G), log(R)-log(G), log(B)-(log(R)+log(G))/2). We use a classifier with a Gaussian probability model; mean and full covariance are estimated from training examples for a "skin" class and a "non-skin" class. When a new pixel p is presented for classification, the likelihood ratio P(p|skin)/P(p|non-skin) is computed as a classification score. Our color representation is similar to that used in [3], but we estimate our classification criteria from examples rather than apply hand-tuned parameters.
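The conversion and the likelihood-ratio score can be sketched as below. Several simplifications are assumptions of the example rather than properties of our classifier: 1 is added before taking logs to avoid log(0), the Gaussian is fit only to the two opponent (hue) channels so that a 2 by 2 covariance suffices, and the score is returned as a log likelihood ratio. All function names are illustrative.

```python
import math


def log_opponent(r, g, b):
    """(R,G,B) -> (log G, log R - log G, log B - (log R + log G)/2)."""
    # The +1 offset avoids log(0) for 8-bit inputs (assumption of this sketch).
    lr, lg, lb = (math.log(v + 1.0) for v in (r, g, b))
    return lg, lr - lg, lb - (lr + lg) / 2.0


def fit_gaussian(samples):
    """Mean and full 2x2 covariance of a list of (rg, by) chroma pairs."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / n
    syy = sum((s[1] - my) ** 2 for s in samples) / n
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
    return (mx, my), (sxx, sxy, syy)


def log_density(p, model):
    """Log of the 2-D Gaussian density at p for model = (mean, covariance)."""
    (mx, my), (sxx, sxy, syy) = model
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mx, p[1] - my
    # Mahalanobis distance using the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * (maha + math.log(det)) - math.log(2.0 * math.pi)


def skin_score(rgb, skin, non_skin):
    """Log likelihood ratio log P(p|skin) - log P(p|non-skin)."""
    _, rg, by = log_opponent(*rgb)
    return log_density((rg, by), skin) - log_density((rg, by), non_skin)
```

Each model is obtained by passing the opponent channels of labeled training pixels to `fit_gaussian`; a positive score then favors the skin class. Both opponent channels are differences of logs, so a uniform scaling of R, G, and B leaves them nearly unchanged, which is the source of the intensity invariance.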
For computational efficiency at run-time, we precompute a lookup table
over all input values, quantizing the classification score (skin
similarity value) into 8 bits and the input color channel values to 6,
7, or 8 bits. This corresponds to a lookup table of between
256 KB and 16 MB in size. If the texture mapping hardware supports
"Pixel Textures", in which each pixel of an input image is
rendered with uniform color but with texture coordinates set according
to the pixel's RGB value, the table is stored as a texture map.
Otherwise a traditional lookup
table operation is performed on input images with the main CPU.
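A CPU version of the table can be sketched as follows. With 6 bits per channel the table has 2^18 one-byte entries (256 KB); with 8 bits per channel it has 2^24 (16 MB). The score range `lo`..`hi` used for 8-bit quantization, the use of bin centres as representative colors, and the function names are assumptions of the example; `score_fn` stands for any per-pixel classifier that maps an 8-bit (R,G,B) tuple to a real-valued score.

```python
def build_lut(score_fn, bits=6, lo=-10.0, hi=10.0):
    """Precompute an 8-bit score for every quantized (R,G,B) value.

    lo and hi bound the scores mapped to 0 and 255; they are arbitrary
    placeholders chosen for this sketch.
    """
    levels = 1 << bits
    shift = 8 - bits
    half_bin = (1 << shift) >> 1
    lut = bytearray(levels ** 3)
    for r in range(levels):
        for g in range(levels):
            for b in range(levels):
                # Evaluate the classifier at the centre of each color bin.
                rgb = tuple((q << shift) + half_bin for q in (r, g, b))
                s = min(max(score_fn(rgb), lo), hi)
                lut[(r * levels + g) * levels + b] = round(255 * (s - lo) / (hi - lo))
    return lut


def lookup(lut, rgb, bits=6):
    """Classify one 8-bit (R,G,B) pixel with a single table access."""
    shift = 8 - bits
    levels = 1 << bits
    r, g, b = (v >> shift for v in rgb)
    return lut[(r * levels + g) * levels + b]
```

At run time, classification costs one shift per channel and one memory access per pixel, regardless of how expensive `score_fn` is.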
After the lookup table has been applied, segmentation and grouping analysis are performed on the classification score image. The same algorithm as described above for range image processing is used, except that the definition of the target region is handled differently. The target region (the face) is defined based on the results of the face detector module, or via range shape analysis, described below. (Should color be the only module available, we fall back to simply choosing the highest color region whose size agrees with the proportion of an upright face.) Connected-component regions are tracked from frame to frame as in the range case, with the additional constraint that a size constancy requirement is enforced: temporal correspondence is not permitted between regions if their real size changes by more than a specified threshold amount. Overall, this algorithm allows us to identify the face region only infrequently using the face detection or range shape criteria, yet still track it through all frames using color information.
Stereo and color processing provide signals as to the location and shape of the foreground user's body and hands, faces, and other skin tone regions (clothing or bags are also possible sources of false positives). To distinguish head from hands and other body parts, and to localize the face within a region containing the head, we use pattern recognition methods which directly model the statistical appearance of faces.
In the simplest cases, the face detector identifies which flesh color regions correspond to the head, and which to other body parts. When a face is detected to overlap with a skin color region, we mark that region as the "target", and record the relative offset of the face detection result within the bounding box of the color region. The target label and relative offset persist as long as the region is tracked as in Section 2.2. Thus if the face detector cannot find the face in a subsequent frame, the system will still identify the target color region, unless it has left the scene, become occluded, or violated the size change constraint imposed on color region tracking.
When a color region does change size dramatically, we perform an additional test to see if two regions in fact performed a split or merge relative to the previous frame. If this has occurred (we simply compare the bounding box of the new region to the bounding boxes of the previous region), we attempt to maintain the face detection target label and subregion position information despite the merge or split. In this case we make the assumption that the face did not actually move, compute the absolute screen coordinates of the face subregion in the previous frame, and re-evaluate which region it is in in the current frame. We also update the subregion coordinates relative to the newly identified target region. The assumption of a stationary face is not ideal, but it works in many cases where users are intermittently touching their face with their hands.
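The bookkeeping for the stationary-face assumption can be sketched as follows. Boxes are (x0, y0, x1, y1) tuples in screen coordinates; testing whether the face centre falls inside a region's bounding box is an assumption of the example, as are the function names.

```python
def relative_offset(face_box, region_box):
    """Offset of the face's top-left corner within the region's bounding box."""
    return face_box[0] - region_box[0], face_box[1] - region_box[1]


def reassign_target(prev_region_box, offset, face_size, new_region_boxes):
    """After a split or merge, find the region now containing the face.

    Assumes the face did not move: its absolute position is recovered from
    the previous region box and the stored offset.
    new_region_boxes: dict mapping label -> (x0, y0, x1, y1).
    Returns (label, new_offset), or (None, None) if no region contains it.
    """
    fx = prev_region_box[0] + offset[0]
    fy = prev_region_box[1] + offset[1]
    cx = fx + face_size[0] / 2.0
    cy = fy + face_size[1] / 2.0
    for label, (x0, y0, x1, y1) in new_region_boxes.items():
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            # Re-express the face position relative to the new target region.
            return label, (fx - x0, fy - y0)
    return None, None
```

For example, when a hand touching the face merges the two color regions into one, the stored offset is simply re-expressed relative to the larger merged box, and the target label moves to that region.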
If there was no face pattern detected corresponding to a likely skin region, we optionally check to see if a face region can be inferred from the overall configuration of the skin color regions and near range regions. We test each color region to see if it corresponds to a peak in the selected range silhouette. If so, we label this region to be the target. If not, we use the estimate of the head computed from the range silhouette alone.
T. Darrell, G. Gordon, J. Woodfill, M. Harville, "A Virtual Mirror Interface using Real-time Robust Face Tracking", Proceedings of the Third International Conference on Face and Gesture Recognition, IEEE Computer Society Press, April 1998, Nara, Japan.