FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

Anonymous Author
Figure 1: Zero-shot open-vocabulary segmentation with FreeSeg-Diff. The input image is passed to a diffusion model and to an image captioning model to obtain visual features and a text description, respectively. The features are used to produce class-agnostic masks, which are then associated with the extracted text. The final segmentation map is obtained after a mask refinement step. All models are kept frozen.
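To make the text branch of this pipeline concrete, the sketch below shows one way to turn an image into candidate class names: generate a caption with a frozen BLIP model, then keep the head noun of each noun chunk found by spaCy. The specific checkpoints ("Salesforce/blip-image-captioning-base", "en_core_web_sm") and the noun-chunk heuristic are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import spacy
import torch
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Frozen captioner; the checkpoint choice is an assumption for illustration.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).eval()

nlp = spacy.load("en_core_web_sm")  # small English pipeline for noun chunks


@torch.no_grad()
def caption_to_classes(image: Image.Image) -> list[str]:
    """Generate a caption, then keep the head noun of each noun chunk
    as a candidate open-vocabulary class name."""
    inputs = processor(images=image, return_tensors="pt")
    tokens = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(tokens[0], skip_special_tokens=True)
    # e.g. "a dog sitting on a sofa" -> ["dog", "sofa"]
    return sorted({chunk.root.lemma_ for chunk in nlp(caption).noun_chunks})


classes = caption_to_classes(Image.open("example.jpg").convert("RGB"))
```

On an image of a dog on a sofa, this would yield candidates such as ["dog", "sofa"], which the next stage matches against the class-agnostic masks.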

Figure 2: Our detailed pipeline for zero-shot semantic segmentation. We extract features from several blocks of the diffusion model (DM) applied to a noised version of the input image. These features are clustered to generate class-agnostic masks, which are used to produce image crops. The crops are passed to a CLIP image encoder 𝓥 to associate each mask/crop with a text/class. The textual classes are obtained by generating a caption with BLIP and applying NLP tools to extract textual entities. Each entity is mapped to its most similar mask. Finally, the masks are refined with a CRF to obtain the final segmentation map. All models are kept frozen.
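The visual branch can be sketched as follows: hook an intermediate up-block of a frozen Stable Diffusion UNet, run a single forward pass on a lightly noised latent of the input image, cluster the captured features with k-means into class-agnostic regions, and score a bounding-box crop of each region against the extracted classes with CLIP. The checkpoint names, the choice of up_blocks[1], the timestep t=100, and the number of clusters below are illustrative assumptions; the paper's exact blocks and hyperparameters may differ.

```python
import numpy as np
import torch
import torchvision.transforms as T
from diffusers import StableDiffusionPipeline
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Capture the output of one UNet up-block (the block choice is an assumption).
feats = []
hook = pipe.unet.up_blocks[1].register_forward_hook(
    lambda module, args, out: feats.append(out.detach())
)


@torch.no_grad()
def diffusion_masks(image: Image.Image, n_clusters: int = 5) -> np.ndarray:
    """Cluster UNet features of a noised latent into class-agnostic regions."""
    tf = T.Compose([T.Resize((512, 512)), T.ToTensor(),
                    T.Normalize([0.5] * 3, [0.5] * 3)])
    px = tf(image).unsqueeze(0)
    latents = pipe.vae.encode(px).latent_dist.sample() * pipe.vae.config.scaling_factor
    t = torch.tensor([100])  # mild noise level (assumption)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    empty = pipe.tokenizer([""], return_tensors="pt").input_ids
    feats.clear()
    pipe.unet(noisy, t, encoder_hidden_states=pipe.text_encoder(empty)[0])
    f = feats[0][0]                               # (C, h, w)
    c, h, w = f.shape
    x = f.permute(1, 2, 0).reshape(-1, c).numpy()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(x).reshape(h, w)
    # Nearest-neighbour upsample of the cluster map back to image resolution.
    return np.array(Image.fromarray(labels.astype(np.uint8))
                    .resize(image.size, Image.NEAREST))


@torch.no_grad()
def assign_classes(image: Image.Image, cluster_map: np.ndarray,
                   classes: list[str]) -> dict[int, str]:
    """Match a bounding-box crop of each cluster to its best CLIP class."""
    prompts = [f"a photo of a {c}" for c in classes]
    out = {}
    for k in np.unique(cluster_map):
        ys, xs = np.nonzero(cluster_map == k)
        crop = image.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))
        inp = clip_proc(text=prompts, images=crop, return_tensors="pt", padding=True)
        out[int(k)] = classes[int(clip(**inp).logits_per_image.argmax())]
    return out
```

Under these assumptions, `assign_classes(img, diffusion_masks(img), classes)` returns a cluster-to-class mapping that a refinement step can then sharpen into the final segmentation map.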


Figure 3: Qualitative results on the Pascal VOC dataset. From top to bottom: original image, raw clustering of DM features, and segmentation results with FreeSeg-Diff. Our pipeline filters out redundant clusters while retaining key objects, and refines coarse masks to yield sharp segmentation maps.
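The refinement step that turns the coarse cluster maps into the sharp masks shown above can be illustrated with a dense CRF, e.g. via the pydensecrf package. The kernel widths, the gt_prob confidence, and the number of mean-field iterations below are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_labels


def crf_refine(rgb: np.ndarray, coarse: np.ndarray, n_classes: int) -> np.ndarray:
    """Sharpen a coarse HxW label map against the HxWx3 uint8 RGB image."""
    h, w = coarse.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)
    # Unary terms from the coarse hard labels (gt_prob = trust in them).
    d.setUnaryEnergy(unary_from_labels(coarse, n_classes, gt_prob=0.7,
                                       zero_unsure=False))
    # Smoothness kernel (location only) and appearance kernel (location + colour).
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=50, srgb=13,
                           rgbim=np.ascontiguousarray(rgb), compat=10)
    q = np.array(d.inference(5)).reshape(n_classes, h, w)
    return q.argmax(axis=0)  # refined label map, same shape as `coarse`
```

The bilateral (appearance) kernel is what snaps mask boundaries to colour edges in the image, which matches the qualitative improvement from the second to the third row of the figure.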