FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model

360 AI Research

🔥 [NEW!] FG-CLIP 2 is a foundation model for fine-grained vision-language understanding in both English and Chinese. Across 29 datasets and 8 diverse tasks, it consistently surpasses recent strong baselines such as SigLIP 2 and MetaCLIP 2, achieving the best reported performance to date in both languages.

Abstract

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details such as object attributes, spatial relations, and subtle linguistic expressions, and they offer limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. The key ingredients of FG-CLIP 2 are summarized below.

  1. Rich Fine-Grained Supervision. Training combines region-text matching and long-caption modeling with multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions.
  2. Bilingual Multimodal Data. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance.
  3. Performance. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages.
  4. Chinese Multimodal Benchmark. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification.

Rich Fine-Grained Supervision

Our approach employs a two-stage hierarchical learning framework that progressively enhances vision-language alignment from global semantics to fine-grained details.

  • Stage 1: Global Semantic Alignment. We begin with large-scale image-text pairs, each annotated with both a short caption (for concise scene-level description) and a long caption (for rich contextual detail). Training on this bilingual corpus enables strong global alignment, establishing a robust foundation for cross-modal understanding in both English and Chinese.
  • Stage 2: Fine-Grained Visual-Language Learning. Building upon the globally aligned representation, we add region-level supervision and multiple fine-grained objectives to sharpen local correspondences. Specifically, this stage incorporates the following objectives (a sketch of two of them follows this list):
      – Fine-Grained Visual Learning: region–caption alignment via RoIAlign-extracted region features and phrase-level descriptions.
      – Fine-Grained Textual Learning: discrimination of subtle textual differences using hard negatives with perturbed attributes.
      – Cross-Modal Rank Loss with Global Threshold Synchronization: dynamic margin-based ranking with globally synchronized thresholds for stable hard negative mining.
      – Textual Intra-modal Contrastive Loss: intra-language contrastive learning to separate semantically similar but distinct region captions.
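
To make these objectives more concrete, the following minimal PyTorch sketch illustrates two of them: region–caption alignment over RoIAlign-pooled features and one possible instantiation of the TIC loss. The tensor shapes, the 3×3 pooling size, the symmetric InfoNCE form, and the push-apart formulation of TIC are illustrative assumptions rather than the exact losses used in training.

  import torch
  import torch.nn.functional as F
  from torchvision.ops import roi_align


  def region_text_contrastive(feat_map, boxes, region_text_emb, temperature=0.07):
      """Symmetric InfoNCE between RoIAlign-pooled region features and phrase embeddings.

      feat_map:        (B, C, H, W) dense visual features, assumed already projected
                       to the shared embedding dimension (projection layers omitted)
      boxes:           list of length B, each a (K_i, 4) tensor of boxes in
                       feature-map coordinates
      region_text_emb: (sum K_i, C) embeddings of the matching region captions
      """
      # Pool a fixed-size feature per box, then average it into a single vector.
      pooled = roi_align(feat_map, boxes, output_size=(3, 3), aligned=True)  # (K, C, 3, 3)
      region_emb = F.normalize(pooled.mean(dim=(2, 3)), dim=-1)              # (K, C)
      text_emb = F.normalize(region_text_emb, dim=-1)                        # (K, C)

      logits = region_emb @ text_emb.t() / temperature                       # (K, K)
      targets = torch.arange(logits.size(0), device=logits.device)
      # Symmetric cross-entropy over region-to-text and text-to-region directions.
      return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


  def textual_intra_modal_contrastive(region_text_emb, temperature=0.07):
      """One way to realize a TIC-style objective: push apart the embeddings of
      distinct (but often semantically similar) region captions in a batch."""
      emb = F.normalize(region_text_emb, dim=-1)
      sim = emb @ emb.t() / temperature
      # Mask self-similarity so each caption is contrasted only against the others.
      mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
      sim = sim.masked_fill(mask, float("-inf"))
      # Minimizing the log-sum-exp of pairwise similarities spreads captions apart.
      return torch.logsumexp(sim, dim=1).mean()

In practice these terms would be weighted and combined with the global image-text contrastive loss and the Cross-Modal Rank Loss; those weights and the hard-negative mining details are omitted here.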

Bilingual Multimodal Data

In the first stage, we train on image-text pairs from diverse sources. For English, we adopt an enhanced version of the LAION-2B dataset, where we augment the original short captions with detailed long captions generated by LMMs. For Chinese, we combine three datasets: Wukong (100 million pairs), Zero (250 million pairs), and a large-scale in-house dataset (500 million pairs). In the second stage, we extend training with fine-grained region-text pairs to further improve spatial grounding. For English, we use the FineHARD dataset, which includes 12 million images, 40 million bounding boxes with fine-grained region descriptions, and 10 million hard negative samples. For Chinese, we use an in-house dataset containing 12 million images.

Zero dataset: https://github.com/yuxie11/R2D2
FineHARD dataset: https://huggingface.co/datasets/qihoo360/FineHARD

Performance

FG-CLIP 2 achieves superior performance across 29 datasets and 8 tasks, including fine-grained understanding, bounding box classification, open-vocabulary object detection, long and short caption image-text retrieval, zero-shot image classification, open-vocabulary segmentation, and large multimodal model (LMM) tasks, demonstrating strong bilingual generalization in both English and Chinese. Detailed results are reported in the following tables.

We visualize the alignment between FG-CLIP 2's dense visual features and text in both Chinese and English. The results are shown in the following figure, where warmer colors indicate higher similarity between image regions and the matched text. Compared to the previous version, FG-CLIP 2 produces denser visual feature maps and achieves stronger bilingual semantic alignment and fine-grained perception.


Visualization of FG-CLIP 2's dense feature maps and semantic alignment capability in bilingual scenarios.
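
Heatmaps like the ones above can be reproduced from any model that exposes dense patch features: compute the cosine similarity between each patch feature and the query-text embedding, reshape the similarities to the patch grid, and upsample to the image resolution. The sketch below shows this procedure; the tensor shapes and the bilinear upsampling are assumptions for illustration, not the exact visualization code used here.

  import torch.nn.functional as F


  def similarity_heatmap(patch_feats, text_emb, grid_hw, image_hw):
      """Per-patch cosine-similarity map for one image and one text query.

      patch_feats: (P, D) dense visual features, P = grid_h * grid_w patches
      text_emb:    (D,)   embedding of the query text (English or Chinese)
      grid_hw:     (grid_h, grid_w) layout of the patch grid
      image_hw:    (H, W) target resolution for visualization
      """
      patch_feats = F.normalize(patch_feats, dim=-1)
      text_emb = F.normalize(text_emb, dim=-1)

      sims = patch_feats @ text_emb                      # (P,) cosine similarities
      grid = sims.reshape(1, 1, *grid_hw)                # (1, 1, grid_h, grid_w)
      heat = F.interpolate(grid, size=image_hw, mode="bilinear", align_corners=False)
      # Normalize to [0, 1] so that warmer colors correspond to higher similarity.
      heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
      return heat[0, 0]                                  # (H, W)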

Chinese Multimodal Benchmark

LIT-CN: A large-scale Chinese long-caption dataset for image-text retrieval

LIT-CN comprises 15,000 images from AI-Challenger Caption, 3,230 from MUGE, and 20,000 curated web images. All images are uniformly re-captioned using Qwen2.5-VL-32B-Instruct-AWQ, prompted to generate rich, context-aware descriptions with an average length of 131 tokens. Images below 256×256 resolution are filtered out, yielding 33,010 high-quality image-text pairs.

LIT-CN dataset: https://huggingface.co/datasets/qihoo360/LIT-CN

DCI-CN: Chinese translation of the Densely Captioned Images (DCI) dataset

DCI-CN is derived from the English Densely Captioned Images (DCI) dataset. Captions are translated into Chinese with an LMM and validated by native speakers to ensure linguistic fluency and semantic fidelity to the original content.

DCI-CN dataset: https://huggingface.co/datasets/qihoo360/DCI-CN

DOCCI-CN: Chinese version of the DOCCI dataset

DOCCI-CN is constructed from the original DOCCI dataset using the same translation and human validation pipeline as DCI-CN, ensuring consistent quality and cross-dataset comparability.

DOCCI-CN dataset: https://huggingface.co/datasets/qihoo360/DOCCI-CN

BoxClass-CN: Chinese region-text alignment benchmark

BoxClass-CN is a region classification dataset that evaluates the alignment between image regions and their corresponding Chinese textual descriptions. It provides fine-grained, region-level supervision and serves as a dedicated benchmark for assessing models’ ability to understand localized visual semantics in Chinese.

BoxClass-CN dataset: https://huggingface.co/datasets/qihoo360/BoxClass-CN
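
As a rough illustration of how such a region-classification benchmark can be evaluated with any CLIP-style model, the sketch below crops each annotated box, encodes the crop, and picks the Chinese class name whose text embedding is most similar. The encode_image and encode_text helpers are hypothetical stand-ins for a model's image and text towers, and the official BoxClass-CN protocol may differ (for example, by scoring dense region features instead of crops).

  import torch
  import torch.nn.functional as F


  @torch.no_grad()
  def classify_boxes(image, boxes, class_names, encode_image, encode_text):
      """Zero-shot box classification over Chinese class names.

      image:        a PIL.Image
      boxes:        list of (x1, y1, x2, y2) pixel coordinates
      class_names:  list of candidate Chinese class names
      encode_image / encode_text: hypothetical callables returning embeddings of
                    shape (1, D) and (C, D), respectively
      """
      # Encode all candidate class names once.
      text_emb = F.normalize(encode_text(class_names), dim=-1)         # (C, D)

      predictions = []
      for (x1, y1, x2, y2) in boxes:
          crop = image.crop((x1, y1, x2, y2))                          # region crop
          img_emb = F.normalize(encode_image(crop), dim=-1)            # (1, D)
          scores = img_emb @ text_emb.t()                              # (1, C)
          predictions.append(class_names[scores.argmax().item()])
      return predictions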


Examples of the BoxClass-CN Dataset

BibTeX


  @article{xie2025fg2,
    title={FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model},
    author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Ao, Ji and Leng, Dawei and Yin, Yuhui},
    journal={arXiv preprint arXiv:2510.10921},
    year={2025}
  }

  @inproceedings{xie2025fg,
    title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
    author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Zhang, Gengshen and Leng, Dawei and Yin, Yuhui},
    booktitle={ICML},
    year={2025}
  }