Zero-shot open-vocabulary segmentation with FreeSeg-Diff. The image is passed to a diffusion
model and to an image captioning model to obtain visual features and a text
description, respectively. The features are used to generate class-agnostic masks, which are then associated
with the extracted text. The final segmentation map is obtained after a mask refinement step. All models
are kept frozen.
Our detailed pipeline for zero-shot semantic segmentation.
We extract features from several blocks of the diffusion model (DM) applied to a noisy version of the image.
These features are clustered to generate class-agnostic masks that are used to produce image crops.
These crops are then passed to a CLIP image encoder 𝓥 to associate each mask/crop
with a given text/class.
The textual classes are obtained by generating a caption with BLIP, from which textual entities are
extracted using NLP tools.
These entities are mapped to the most similar mask.
Finally, the masks are refined with a CRF to obtain the final segmentation map. All models are kept
frozen.
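The two core steps of this pipeline, clustering dense features into class-agnostic masks and matching each mask to a textual entity by embedding similarity, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the diffusion features, CLIP-style mask and entity embeddings are replaced by random placeholder arrays, the clustering is a plain k-means written in NumPy, and the captioner and CRF refinement are omitted.

```python
import numpy as np

def kmeans(feats, k, iters=10, seed=0):
    # Plain k-means over per-pixel features: a stand-in for the
    # clustering of diffusion features into class-agnostic masks.
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(0)
    return labels

def associate(mask_embs, text_embs):
    # Cosine similarity between mask-crop embeddings and entity
    # embeddings; each mask is assigned its most similar entity.
    a = mask_embs / np.linalg.norm(mask_embs, axis=1, keepdims=True)
    b = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return (a @ b.T).argmax(1)

# Toy example: an 8x8 "feature map" with 4-dim features, 3 clusters.
rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 4))
labels = kmeans(feats, k=3)
masks = [(labels == j).reshape(8, 8) for j in range(3)]  # binary masks

# Placeholder embeddings for 3 mask crops and 2 caption entities.
mask_embs = rng.normal(size=(3, 16))
text_embs = rng.normal(size=(2, 16))
assign = associate(mask_embs, text_embs)
print(labels.shape, len(masks), assign.shape)
```

In the actual pipeline, `feats` would come from intermediate DM blocks, `mask_embs` from the CLIP image encoder 𝓥 applied to mask crops, and `text_embs` from the CLIP text encoder applied to BLIP-derived entities, before CRF refinement of the assigned masks.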
Qualitative results on the Pascal VOC dataset.
From top to bottom: original image, raw clustering of DM features, and segmentation results with
FreeSeg-Diff.
Our pipeline filters out redundant clusters while retaining key objects, and refines coarse masks into
sharp segmentation maps.