README

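Model Card: Fashion CLIP

Disclaimer: This model card is adapted from the model card here.

Model Details

Update (2023/03/10): We have updated the model! We found that the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint (thanks Bin!) performs better on fashion than the original OpenAI CLIP. We therefore fine-tuned a newer (and better!) version of FashionCLIP, henceforth FashionCLIP 2.0, from that checkpoint while keeping the architecture the same. We postulate that the gains from laion/CLIP-ViT-B-32-laion2B-s34B-b79K are due to the increased training data (5x the OpenAI CLIP data). The thesis of our paper remains the same, however: fine-tuning laion/CLIP on our fashion dataset improved zero-shot performance across our benchmarks. The table below compares weighted macro F1 scores across models.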

Model	FMNIST	KAGL	DEEP
OpenAI CLIP	0.66	0.63	0.45
FashionCLIP	0.74	0.67	0.48
Laion CLIP	0.78	0.71	0.58
FashionCLIP 2.0	0.83	0.73	0.62
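
For readers unfamiliar with the metric, here is a minimal sketch of a weighted F1 score. It assumes that "weighted macro F1" means the support-weighted average of per-class F1 scores (what scikit-learn calls `average="weighted"`); the card itself does not spell out the definition. The labels and predictions are toy values, not benchmark data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy labels for illustration only.
y_true = ["shirt", "shirt", "shirt", "dress", "dress", "bag"]
y_pred = ["shirt", "shirt", "dress", "dress", "dress", "shirt"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.6
```

Classes that never appear in `y_true` contribute nothing, because the weights come from the true-label counts.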

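FashionCLIP is a CLIP-based model developed to produce general product representations for fashion concepts. Leveraging the pre-trained ViT-B/32 checkpoint released by OpenAI, we train FashionCLIP on a large, high-quality, novel fashion dataset to study whether domain-specific fine-tuning of CLIP-like models is sufficient to produce product representations that transfer zero-shot to entirely new datasets and tasks. FashionCLIP was not developed for model deployment: before deploying it, researchers first need to carefully study its capabilities in the specific context in which it will be deployed.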
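
Zero-shot transfer with a CLIP-like model usually comes down to comparing one image embedding against the text embeddings of candidate labels. The sketch below shows only that scoring step. The 3-dimensional vectors are toy stand-ins for encoder outputs, and the function name `zero_shot_probs` and the `logit_scale` value of 100 are assumptions for illustration, not values taken from this card.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(txt)))
        for txt in text_embs
    ]
    # Subtract the max logit for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a striped shirt", "a leather bag", "a red dress"]
# Toy embeddings; in practice these would come from the image and text encoders.
image_emb = [0.9, 0.1, 0.2]
text_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.2, 1.0]]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)  # a striped shirt
```

No label-specific training happens here: adding a new class only requires embedding one more text prompt.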

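Model Date

March 2023

Model Type

The model uses a ViT-B/32 Transformer architecture as the image encoder and a masked self-attention Transformer as the text encoder. Starting from a pre-trained checkpoint, these encoders are trained with a contrastive loss to maximize the similarity of (image, text) pairs on a fashion dataset containing 800K products.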
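
To make the training objective more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a small batch. It assumes the standard formulation (cross-entropy over the image-to-text and text-to-image similarity matrices, with matching pairs on the diagonal) and a fixed temperature of 0.07. Neither detail is stated in this card, and the function names are illustrative.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    top = max(row)
    lse = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy; inputs are L2-normalized, pair i matches pair i."""
    n = len(image_embs)
    # sim[i][j] = scaled similarity between image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    loss_i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Transpose to score each text against all images.
    sim_t = [list(col) for col in zip(*sim)]
    loss_t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
# Matching pairs give a much lower loss than mismatched pairs.
print(clip_style_loss(aligned, aligned) < clip_style_loss(aligned, swapped))  # True
```

Minimizing this loss pulls each image toward its own caption and pushes it away from the other captions in the batch.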

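Documents

Data

The model was trained on (image, text) pairs obtained from the Farfetch dataset (awaiting official release), an English dataset comprising over 800K fashion products, with more than 3K brands across dozens of product types. The image used for encoding is the standard product image: a picture of the item on a white background, with no humans. The text used is a concatenation of the highlight (e.g., "stripes", "long sleeves", "Armani") and the short description (e.g., "80s styled t-shirt") available in the Farfetch dataset.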
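
As a rough illustration of the text construction described above, the sketch below joins highlights and a short description into a single caption. The helper `build_caption`, the comma separator, and the ordering (highlights first) are assumptions. The card only says the two fields are concatenated.

```python
def build_caption(highlights, short_description):
    """Join highlight phrases and the short description into one caption."""
    # Drop empty entries and surrounding whitespace.
    parts = [h.strip() for h in highlights if h.strip()]
    if short_description.strip():
        parts.append(short_description.strip())
    # Assumed separator: the actual one is not documented in the card.
    return ", ".join(parts)


caption = build_caption(["stripes", "long sleeves", "Armani"], "80s styled t-shirt")
print(caption)  # stripes, long sleeves, Armani, 80s styled t-shirt
```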

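Limitations, Bias and Fairness

We acknowledge that FashionCLIP has certain limitations and expect it to inherit some of the limitations and biases of the original CLIP model. We do not expect our fine-tuning to significantly alleviate these limitations: the fashion data we use makes explicit assumptions about the notion of gender, as in "blue shoes for a woman", which inevitably associate aspects of clothing with specific people.

Our investigations also suggest that the data used introduces certain limitations in FashionCLIP. In the text modality, because most captions derived from the Farfetch dataset are long, we observe that FashionCLIP may perform better on longer queries than on shorter ones. In the image modality, FashionCLIP is biased towards standard product images (centered, white background).

Model selection, i.e. choosing an appropriate stopping criterion during fine-tuning, remains an open challenge. We observe that the loss on an in-domain validation set (i.e. one with the same distribution as the test set) is a poor selection criterion when out-of-domain generalization (i.e. across different datasets) is desired, even when the dataset used is relatively diverse and large.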
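
The model-selection issue can be pictured with a small sketch that picks a checkpoint under two different criteria. The numbers in `history` are made up purely for illustration and are not results from the paper. The helper `select_checkpoint` and the metric names are hypothetical, and the card does not say which criterion the authors ultimately used.

```python
def select_checkpoint(history, key, higher_is_better):
    """Return the training step whose record is best under the given metric."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda record: record[key])["step"]


# Made-up numbers for illustration only; not results from the paper.
history = [
    {"step": 1000, "in_domain_val_loss": 1.90, "out_of_domain_f1": 0.61},
    {"step": 2000, "in_domain_val_loss": 1.40, "out_of_domain_f1": 0.66},
    {"step": 3000, "in_domain_val_loss": 1.10, "out_of_domain_f1": 0.63},
]

by_loss = select_checkpoint(history, "in_domain_val_loss", higher_is_better=False)
by_f1 = select_checkpoint(history, "out_of_domain_f1", higher_is_better=True)
print(by_loss, by_f1)  # 3000 2000
```

In this toy history the in-domain loss keeps improving after the out-of-domain score has peaked, so the two criteria select different checkpoints.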

Citation

@Article{Chia2022,
    title="Contrastive language and vision learning of general fashion concepts",
    author="Chia, Patrick John
            and Attanasio, Giuseppe
            and Bianchi, Federico
            and Terragni, Silvia
            and Magalh{\~a}es, Ana Rita
            and Goncalves, Diogo
            and Greco, Ciro
            and Tagliabue, Jacopo",
    journal="Scientific Reports",
    year="2022",
    month="Nov",
    day="08",
    volume="12",
    number="1",
    abstract="The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from general and transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model adapted for the fashion industry. We demonstrate the effectiveness of the representations learned by FashionCLIP with extensive tests across a variety of tasks, datasets and generalization probes. We argue that adaptations of large pre-trained models such as CLIP offer new perspectives in terms of scalability and sustainability for certain types of players in the industry. Finally, we detail the costs and environmental impact of training, and release the model weights and code as open source contribution to the community.",
    issn="2045-2322",
    doi="10.1038/s41598-022-23052-9",
    url="https://doi.org/10.1038/s41598-022-23052-9"
}

patrickjohncyh/fashion-clip

Author: patrickjohncyh

zero-shot-image-classification transformers

Created: 2023-02-21 19:51:47+00:00

Updated: 2024-09-17 15:19:43+00:00
