Case Studies
How to Evaluate Background Removal Models with a Novel Method
Written by
Erik Harutyunyan

In an increasingly digital world, video conferencing has become an indispensable tool for remote work, virtual meetings, and social interaction. However, not everyone has the luxury of a distraction-free environment to conduct these virtual interactions. The cluttered bookshelf, the bustling café, or the chaotic living room can often serve as distractions that diminish the focus and professionalism of online meetings. That’s where background cancellation technology comes into play. Leveraging cutting-edge machine learning algorithms, background cancellation tasks aim to isolate the subject from their immediate environment, replacing the chaotic backdrops with a more neutral or professional setting.

In this article, we will explore how to evaluate the accuracy of background removal models and how Manot’s algorithm outperforms existing approaches.

Common Challenges and Issues Faced in the Background Segmentation Process

Background segmentation in computer vision faces several challenges, ranging from variable lighting conditions and dynamic backgrounds to real-time processing requirements. Algorithms may struggle when the foreground and background share similar colors or textures, or when the scene contains noise or occlusions. Additionally, issues like edge artifacts, where the boundary between foreground and background isn’t clear-cut, further complicate the task. These challenges call for a multi-faceted approach that combines improved algorithms with advanced insight-providing tools to reach and push beyond the state of the art in background segmentation.

The Importance of Precision and Recall in Background Removal

Precision and recall in background removal are pivotal for a range of applications, from enhancing user experience to maintaining professionalism in virtual meetings. High recall ensures that the subject isn’t inadvertently removed, while high precision ensures that unwanted background elements are effectively eliminated. These metrics are especially critical in real-time scenarios like video conferencing, where computational resources are limited and lighting conditions can vary. Poor precision and recall can result in distracting visual artifacts and may even affect the functionality and accessibility of technologies like augmented reality or assistive devices. Overall, the importance of these metrics extends beyond mere aesthetics, impacting the reliability, adaptability, and utility of background removal across different use cases.
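To make the two failure modes concrete, here is a minimal NumPy sketch of pixel-level precision and recall on binary foreground masks. The function name and the toy 4×4 masks are purely illustrative, not tied to any particular segmentation library:

```python
import numpy as np

def mask_precision_recall(pred: np.ndarray, gt: np.ndarray):
    """Pixel-level precision and recall for binary foreground masks.

    pred, gt: boolean arrays of shape (H, W), True = foreground.
    Low precision means background pixels leak into the cutout;
    low recall means parts of the subject get removed.
    """
    tp = np.logical_and(pred, gt).sum()    # correctly kept subject pixels
    fp = np.logical_and(pred, ~gt).sum()   # background wrongly kept
    fn = np.logical_and(~pred, gt).sum()   # subject wrongly removed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy 4x4 example: the prediction misses one subject pixel
# and wrongly keeps one background pixel.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True                       # 4 foreground pixels
pred = gt.copy()
pred[1, 1] = False                        # one false negative
pred[0, 0] = True                         # one false positive
p, r = mask_precision_recall(pred, gt)    # both come out to 0.75
```

Both metrics here evaluate a single mask; averaging them over a test set gives the dataset-level numbers usually reported.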

Background Removal: An Essential Subset of Image Segmentation

Image segmentation and semantic segmentation are cornerstone tasks in the field of computer vision, mostly employing deep learning models to accomplish their goals. These tasks operate on images or sequences of images, which serve as the input to the deep learning model. The input is typically a multi-dimensional tensor, where the dimensions represent the height, width, and color channels of the image.

In the case of image segmentation, the output is generally another image of the same dimensions, where the pixel values represent the segment to which each original pixel belongs. This segmented image essentially divides the original image into various regions based on certain attributes like color, intensity, or texture.

Semantic segmentation takes this a step further. The output here is also an image, but instead of generic labels, each pixel is labeled with a category that provides semantic meaning — such as “car,” “human,” or “building.” The output is essentially a “semantic map” that allows for a more nuanced interpretation of the scene captured in the original image.

A specialized subtask within these general segmentation tasks is background segmentation. In this task, the primary goal is to distinguish the background from the foreground objects of interest. The input remains the same — an image or a sequence of images — but the output specifically identifies which pixels belong to the background and which belong to the foreground objects. This is immensely valuable in applications like video conferencing, where isolating a person from their background can allow for real-time background blurring, replacement, or removal. Essentially, background segmentation serves as the technology underpinning various enhancements that make virtual interactions more focused and less distracted by the environment.
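As a sketch of how a background-segmentation mask gets used downstream, the background-replacement step amounts to alpha compositing and fits in a few lines of NumPy. The helper name `replace_background` and the toy frames are illustrative assumptions, not any product’s API:

```python
import numpy as np

def replace_background(frame: np.ndarray, mask: np.ndarray,
                       new_bg: np.ndarray) -> np.ndarray:
    """Composite the foreground of `frame` onto `new_bg`.

    frame, new_bg: uint8 images of shape (H, W, 3).
    mask: float array of shape (H, W) in [0, 1], where 1.0 = foreground;
          segmentation models typically output such a soft matte.
    """
    alpha = mask[..., None]                       # (H, W, 1) for broadcasting
    out = alpha * frame + (1.0 - alpha) * new_bg  # per-pixel blend
    return out.astype(np.uint8)

# Toy usage: a 2x2 gray frame whose left column is "foreground",
# composited onto a black replacement background.
frame = np.full((2, 2, 3), 200, dtype=np.uint8)
new_bg = np.zeros((2, 2, 3), dtype=np.uint8)
mask = np.array([[1.0, 0.0],
                 [1.0, 0.0]])
out = replace_background(frame, mask, new_bg)
```

Blurring instead of replacing the background follows the same pattern, with `new_bg` set to a blurred copy of `frame`.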

Manot: One Platform to Study Them All

Manot is an AI-powered insight-providing platform that studies any ML/DL model’s performance on a continuously expanding pool of computer vision tasks. Besides offering rich visualizations and informative metrics on the model’s performance, Manot can accurately separate the subset of the user’s raw data on which the model will perform worst. Moreover, Manot does this without access to the model itself, using only a sample of data with predictions and ground truths provided by the user.

The Practical Impact of Manot on Background Removal: A Case Study

Compared to other computer vision tasks, background removal is fairly easy to perform, as it is essentially a binary semantic segmentation task. But even in this case, trained models possess biases or weak spots that a company has to know about before rolling a model out to production.

To study the power of Manot’s insight suggestion system, we conducted an experiment on the DIS5K (Dichotomous Image Segmentation) dataset. We used the pre-trained IS-Net model presented in the dataset’s paper, which was trained on the 3,000 images of the DIS5K training set. For the Manot evaluation we used the DIS5K test set of 2,000 images that the model never saw during training or validation.

We split the DIS5K test set into a 1,000-image Manot setup set and a 1,000-image Manot evaluation set. The purpose of the setup set is to give the platform model predictions along with the input images and ground truths for analysis and insight generation. The raw images of the evaluation set are then fed to the platform, which separates out a subset on which the model will show significant failures.

To evaluate the quality of this subset selection mechanism, we first calculate the mIoU on the selected subset. We then draw 10 random subsets from the Manot evaluation set, each the same size as the insight subset the platform selected, and compute the mean and standard deviation of their mIoUs to serve as a baseline for our algorithm.
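The evaluation protocol above can be sketched as follows. The helper functions are hypothetical stand-ins for the experiment code, not Manot’s actual API, and the random toy masks merely stand in for real model predictions:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks (True = foreground)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def miou(preds, gts) -> float:
    """Mean IoU over paired lists of predicted and ground-truth masks."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def random_baseline(preds, gts, subset_size, n_trials=10, seed=0):
    """Mean and std of mIoU over `n_trials` random subsets of
    `subset_size` images: the baseline the insight subset's mIoU
    is compared against."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        idx = rng.choice(len(preds), size=subset_size, replace=False)
        scores.append(miou([preds[i] for i in idx], [gts[i] for i in idx]))
    return float(np.mean(scores)), float(np.std(scores))

# Toy usage: 20 random 8x8 masks stand in for real predictions;
# each "prediction" is a ground-truth mask with ~10% of pixels flipped.
rng = np.random.default_rng(1)
gts = [rng.random((8, 8)) > 0.5 for _ in range(20)]
preds = [g ^ (rng.random((8, 8)) > 0.9) for g in gts]
mean_iou, std_iou = random_baseline(preds, gts, subset_size=5)
```

An insight subset whose mIoU sits several of these standard deviations below the random-subset mean is what indicates the selection is picking genuinely hard images rather than a chance-level sample.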

The standard deviation of the mIoUs over the 10 random subsets was 0.028, which shows that the performance gap of Manot’s insight selection is significant. Beyond the bare numbers, let’s also look at visualizations of predictions and ground truths on some of the images in the insight subset.

The platform gave the highest insight score to the image above, hinting at a severe failure when the image is fed to the model, which indeed happened. The model considered an entire object, the plate, as foreground when in fact it should be part of the background.

In this example, in addition to extra background objects ending up in the foreground, we also have imprecise masks of the chairs in the narrow areas between the sticks, and a missing foreground object.

Key Takeaways

To conclude, background cancellation is a widespread task, given the number of video calls and online conferences happening worldwide, especially during and after the pandemic. Even with a very precise dataset like DIS5K and potent architectures, trained DL models can fail dramatically, which in these applications results in disappointed users with unwanted parts of their backgrounds exposed. To this end, the model-invariant Manot insight-providing algorithm is a powerful preventive mechanism for companies that depend on background cancellation models.

To chat about the platform and its capabilities, or for access to the experiment details, code, and a platform demo, you can reach out to me at erik@manot.ai
