Jordan Kee
Video conferencing is commonplace nowadays. However, certain environments such as home offices may have distracting backgrounds. Silhouette extraction can remove the background from live footage without a green screen, which is troublesome to acquire and set up. A test dataset was created comprising 3 videos recorded under differing conditions. DeepLabv3 and FCN were used to perform semantic segmentation, in which the network labels each pixel of a given image with the class of what it represents. The 'human' category can be used to implement silhouette extraction from video by labelling pixels tagged as 'human' as silhouette pixels and everything else as background pixels. GMG and MOG2 were used to statistically model the background. Accuracy was then derived from confusion matrix data comparing the manually created ground-truth binary maps to each method's output.
The proposed ResNet-101-based solutions using DeepLabv3 and FCN demonstrate accurate results with an overall F-score of 0.96, albeit with a significant performance penalty, running at 1.00 and 1.20 FPS compared to 15.76 FPS for MOG2 and 8.23 FPS for GMG. A newer CNN-based approach such as Faster R-CNN or the Context Encoding Network (EncNet) could be explored to address the current limitations.
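As a concrete illustration of the segmentation step described above, the following is a minimal sketch using torchvision's pretrained DeepLabv3 (ResNet-101) model. The preprocessing constants and the Pascal VOC 'person' class index (15) follow the standard torchvision pretrained weights; the `frame_rgb` input is assumed to be an RGB image array and is illustrative rather than part of the report's actual implementation.

```python
import numpy as np
import torch
from torchvision import models, transforms

# Load the pretrained DeepLabv3 (ResNet-101) segmentation model.
# torchvision >= 0.13 accepts weights="DEFAULT"; older releases use pretrained=True.
model = models.segmentation.deeplabv3_resnet101(weights="DEFAULT").eval()

# Standard ImageNet normalisation used by the torchvision pretrained weights.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

PERSON_CLASS = 15  # Pascal VOC label index for 'person' in the torchvision weights

def silhouette_mask(frame_rgb):
    """Return a binary mask: 255 where a pixel is labelled 'person', 0 elsewhere."""
    batch = preprocess(frame_rgb).unsqueeze(0)        # shape (1, 3, H, W)
    with torch.no_grad():
        scores = model(batch)["out"][0]               # shape (21, H, W) per-class scores
    labels = scores.argmax(0).numpy()                 # per-pixel class labels
    return np.where(labels == PERSON_CLASS, 255, 0).astype(np.uint8)
```

The same sketch applies to FCN by swapping in `models.segmentation.fcn_resnet101`, which shares the output format.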
Process | Description |
---|---|
Initialize background | Generate a model of the background based on a predetermined number of frames. |
Detect foreground | Each incoming frame is compared with the generated background model, typically via subtraction, which extracts the foreground pixels. |
Maintain background | Based on the specified learning rate, the background model generated in the first process is updated with newly observed frames. Pixels that have remained static for a long time are usually absorbed into the background model. |
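A minimal sketch of these three processes with OpenCV is shown below, assuming an opencv-python install; the input file name and the `history` value are illustrative. MOG2 is used here, and the GMG variant is available as `cv2.bgsegm.createBackgroundSubtractorGMG` in opencv-contrib-python.

```python
import cv2

cap = cv2.VideoCapture("test1.mp4")  # illustrative input file

# 'history' controls how many frames are used to initialise and maintain the model.
subtractor = cv2.createBackgroundSubtractorMOG2(history=120, detectShadows=False)
# GMG equivalent (requires opencv-contrib-python):
# subtractor = cv2.bgsegm.createBackgroundSubtractorGMG()

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # apply() performs foreground detection and, through the learning rate,
    # background maintenance: pixels that stay static long enough are
    # absorbed into the background model.
    fg_mask = subtractor.apply(frame, learningRate=-1)  # -1 = automatic rate
    cv2.imshow("foreground mask", fg_mask)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```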
- Only human subject(s). Preferably the same human subject for each video to achieve consistency and fair comparison. Other categories of living or non-living objects are unnecessary for the purpose of this report.
- Continuous video with no breaks. The video must have a frame rate of 30 FPS, which is commonly used in video conferencing applications, and must contain at least 1 second of footage so that the methods tested can model the background.
- Binary map ground truth. Each video must have a corresponding binary map to act as the ground truth for accuracy calculations (see the sketch after this list). The binary maps may be selected frames at a fixed interval to reduce computational complexity as well as the manual rotoscoping effort.
- Variable difficulty in the form of camera movement, background complexity, illumination changes, etc. The first video should ideally be as simple as possible to act as a best-case scenario. The various difficulties aim to test the limits of each method and find the failure cases.
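The following is a minimal sketch of the confusion-matrix-based F-score calculation used against such a binary-map ground truth, assuming the predicted mask and the ground-truth map are binary NumPy arrays of equal size; the function name is illustrative.

```python
import numpy as np

def f_score(pred_mask, gt_mask):
    """Compute the F-score from confusion-matrix counts of two binary masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # silhouette pixels correctly found
    fp = np.logical_and(pred, ~gt).sum()   # background labelled as silhouette
    fn = np.logical_and(~pred, gt).sum()   # silhouette pixels missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```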
Test Video | Camera Movement | Illumination | Background Complexity | Autofocus & Autoexposure | Summary |
---|---|---|---|---|---|
1 | None | Fixed | Simple and uniform | Both locked | First frame is pure background. Subject walks into frame. |
2 | Extreme | Fixed | Complex | Both on auto | Simulates an extreme case of camera shake. Camera follows the moving subject. |
3 | None | Extreme | Complex | Both on auto | Simulates an extreme case of lighting changes. The room alternates between being illuminated by natural sunlight from the windows and having no lighting. |
Function | Description |
---|---|
Load video | Program can accept a URL for a video stream or a file reference for a video file. |
Capture video | Program can capture a live stream, typically from a webcam. |
Extract human silhouette | Extract only the human subject present in the video, treating everything else as background. |
Live preview | Display the silhouette mask generated as well as the silhouette extracted for each frame. |
FPS counter | Print the FPS for each frame as it is processed. |
Save processed video | Output processed file. |
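A minimal sketch of how these functions could fit together with OpenCV is shown below. The `run` function, the `mask_fn` callable (e.g. one of the segmentation or background-subtraction sketches above), and the output settings are illustrative assumptions rather than the report's actual implementation.

```python
import time
import cv2

def run(source, mask_fn, out_path="output.mp4"):
    cap = cv2.VideoCapture(source)               # URL, file path, or 0 for a webcam
    fps = cap.get(cv2.CAP_PROP_FPS) or 30        # fall back to 30 FPS for live capture
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    while True:
        start = time.time()
        ok, frame = cap.read()
        if not ok:
            break
        mask = mask_fn(frame)                                  # silhouette mask (0/255)
        silhouette = cv2.bitwise_and(frame, frame, mask=mask)  # keep only the subject
        cv2.imshow("mask", mask)                               # live preview
        cv2.imshow("silhouette", silhouette)
        writer.write(silhouette)                               # save processed video
        print(f"FPS: {1.0 / (time.time() - start):.2f}")       # per-frame FPS counter
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    writer.release()
    cv2.destroyAllWindows()
```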