SeeClear: Reliable Transparent Object Depth Estimation via Generative Opacification

Xiaoying Wang1*, Yumeng He1,2*, Jingkai Shi1,2*, Jiayin Lu1, Yin Yang3, Ying Jiang1, Chenfanfu Jiang1
1University of California, Los Angeles   2University of Southern California   3University of Utah
* equal contributions

SeeClear is a novel framework that converts transparent objects into generated opaque counterparts, enabling stable and accurate depth prediction for transparent objects.

[Teaser: paired examples of Transparent Input, Opaque Output, and the resulting Depth]

✨ Abstract ✨

Monocular depth estimation remains challenging for transparent objects, where refraction and transmission are difficult to model and break the appearance assumptions used by depth networks. As a result, state-of-the-art estimators often produce unstable or incorrect depth predictions for transparent materials. We propose SeeClear, a novel framework that converts transparent objects into generated opaque counterparts, enabling stable monocular depth estimation for transparent objects. Given an input image, we first localize transparent regions and transform their refractive appearance into geometrically consistent opaque shapes using a diffusion-based generative opacification module. The processed image is then fed into an off-the-shelf monocular depth estimator without retraining or architectural changes. To train the opacification model, we construct SeeClear-396k, a synthetic dataset containing 396k paired transparent-opaque renderings. Experiments on both synthetic and real-world datasets show that SeeClear significantly improves depth estimation for transparent objects.

🎯 Pipeline 🎯

Starting from an image, we first apply a segmentation model to obtain the transparent object mask. Guided by the mask and the image, a latent diffusion model generates an opacified image of the transparent object. A mask refinement module then predicts a soft blending mask to alpha-composite the generated opaque region with the original background, producing the final composited image. The composited image is finally fed into a depth model to estimate accurate depth.
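As a sketch of this composition, the snippet below wires the four stages together, assuming each stage (segmentation, opacification, mask refinement, depth estimation) is exposed as a callable; the function names and signatures are illustrative assumptions, not the released API.

```python
from typing import Callable
import numpy as np

def seeclear_depth(
    image: np.ndarray,  # (H, W, 3) float RGB in [0, 1]
    segment: Callable[[np.ndarray], np.ndarray],  # image -> (H, W) binary transparent mask
    opacify: Callable[[np.ndarray, np.ndarray], np.ndarray],  # (image, mask) -> opacified image
    refine_mask: Callable[[np.ndarray, np.ndarray], np.ndarray],  # (opaque, mask) -> (H, W) soft alpha
    depth_model: Callable[[np.ndarray], np.ndarray],  # image -> (H, W) depth
) -> np.ndarray:
    """Compose the SeeClear stages; the models themselves are plug-ins."""
    mask = segment(image)                         # 1. localize transparent regions
    opaque = opacify(image, mask)                 # 2. diffusion-based generative opacification
    alpha = refine_mask(opaque, mask)[..., None]  # 3. soft blending mask in [0, 1]
    composite = alpha * opaque + (1.0 - alpha) * image  # alpha-composite with background
    return depth_model(composite)                 # 4. off-the-shelf depth estimator
```

Because the depth estimator only ever sees the composited image, it requires no retraining or architectural changes, as noted in the abstract.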

♣️ Qualitative Comparison on In-the-Wild Images ♣️

MoGe-2 exhibits depth leakage through transparent objects (e.g., a water bottle), while our method produces more accurate and consistent depth predictions.


♠️ Comparison ♠️

We evaluate transparent-object depth estimation on the ClearGrasp and TransPhy3D datasets. Compared with the baselines, SeeClear produces more accurate transparent-object depth.
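For reference, here is a minimal sketch of how masked depth metrics are commonly computed over transparent regions. The affine (scale/shift) alignment step, typically applied to affine-invariant predictors such as Marigold-style models, and all function names are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np

def align_affine(pred, gt, valid):
    # Least-squares scale/shift fit of pred to gt over valid pixels,
    # as commonly done for affine-invariant depth predictors.
    a = np.stack([pred[valid], np.ones(valid.sum())], axis=1)
    scale, shift = np.linalg.lstsq(a, gt[valid], rcond=None)[0]
    return scale * pred + shift

def masked_depth_metrics(pred, gt, mask):
    # AbsRel and delta < 1.25, restricted to the transparent-object mask.
    valid = mask.astype(bool) & (gt > 0)
    pred = align_affine(pred, gt, valid)
    p = np.clip(pred[valid], 1e-6, None)  # guard against non-positive aligned depth
    g = gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    return abs_rel, delta1
```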

[Qualitative comparison grids: Input, Ground Truth, Depth4ToM, MODEST, D4RD, DKT, Marigold, GenPercept, GeoWizard, DA3, MoGe-2, and Ours]

🔷 TDoF20 Comparison 🔷

Comparison on the TDoF20 dataset. Rows show Input, Ground Truth, DKT, and Ours; each column is a different scene.
