Model Card: Fashion CLIP

Disclaimer: This model card is adapted from the model card found here.

Model Details

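Update (March 10, 2023): We have updated the model! We found that the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint (thanks Bin!) performs better on fashion than the original OpenAI CLIP. We therefore fine-tuned a newer (and better!) version of FashionCLIP (henceforth FashionCLIP 2.0), keeping the architecture the same. We postulate that the performance gains afforded by laion/CLIP-ViT-B-32-laion2B-s34B-b79K are due to its larger training set (5x the OpenAI CLIP data). Our thesis, however, remains the same: fine-tuning laion/CLIP on our fashion dataset improved zero-shot performance across our benchmarks. The table below compares weighted macro F1 scores across models.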

Model	FMNIST	KAGL	DEEP
OpenAI CLIP	0.66	0.63	0.45
FashionCLIP	0.74	0.67	0.48
Laion CLIP	0.78	0.71	0.58
FashionCLIP 2.0	0.83	0.73	0.62
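
The gains from fine-tuning can be read directly off the table. The short Python sketch below recomputes them per benchmark; the dictionary layout and the helper name `gain` are illustrative choices, not part of the FashionCLIP code base.

```python
# Weighted macro F1 scores copied from the table above.
SCORES = {
    "OpenAI CLIP":     {"FMNIST": 0.66, "KAGL": 0.63, "DEEP": 0.45},
    "FashionCLIP":     {"FMNIST": 0.74, "KAGL": 0.67, "DEEP": 0.48},
    "Laion CLIP":      {"FMNIST": 0.78, "KAGL": 0.71, "DEEP": 0.58},
    "FashionCLIP 2.0": {"FMNIST": 0.83, "KAGL": 0.73, "DEEP": 0.62},
}


def gain(fine_tuned: str, base: str) -> dict:
    """Absolute F1 gain of a fine-tuned model over its base checkpoint, per benchmark."""
    return {
        bench: round(SCORES[fine_tuned][bench] - SCORES[base][bench], 2)
        for bench in SCORES[base]
    }


if __name__ == "__main__":
    # Each fine-tuned model is compared against the checkpoint it started from.
    for tuned, base in [("FashionCLIP", "OpenAI CLIP"), ("FashionCLIP 2.0", "Laion CLIP")]:
        print(f"{tuned} vs {base}: {gain(tuned, base)}")
```

In both pairings the fine-tuned model scores higher on all three benchmarks, which is the pattern the update note describes.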

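FashionCLIP is a CLIP-based model developed to produce general product representations for fashion concepts. Leveraging the pre-trained checkpoint (ViT-B/32) released by OpenAI, we train FashionCLIP on a large, high-quality, novel fashion dataset to study whether domain-specific fine-tuning of CLIP-like models is sufficient to produce product representations that transfer zero-shot to entirely new datasets and tasks. FashionCLIP was not developed for model deployment; before deploying it, researchers will first need to carefully study its capabilities in the specific context in which it will be used.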
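
Zero-shot transfer with a CLIP-like model amounts to embedding an image and one text prompt per candidate label, then picking the label whose prompt is most similar to the image. The sketch below shows that scoring step in plain Python. The function names, the toy three-dimensional vectors, and the fixed `logit_scale` are assumptions made for illustration; in practice the embeddings would come from the model's image and text encoders, and the scale is a learned parameter.

```python
import math


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, label_embs, logit_scale=100.0):
    """Return (best_label, {label: probability}) for one image embedding.

    label_embs maps each text prompt to its embedding.
    logit_scale is an assumed constant standing in for the learned temperature.
    """
    img = normalize(image_emb)
    labels = list(label_embs)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(label_embs[label])))
        for label in labels
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


if __name__ == "__main__":
    # Toy vectors standing in for real encoder outputs.
    prompts = {
        "a photo of a dress": [1.0, 0.0, 0.0],
        "a photo of a sneaker": [0.0, 1.0, 0.0],
        "a photo of a handbag": [0.0, 0.0, 1.0],
    }
    label, probs = zero_shot_classify([0.9, 0.1, 0.0], prompts)
    print(label, probs)
```

No classifier head is trained here: changing the task only means changing the set of prompts.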

Model Date

March 2023

Model Type

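The model uses a ViT-B/32 Transformer architecture as its image encoder and a masked self-attention Transformer as its text encoder. Starting from a pre-trained checkpoint, these encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss on a fashion dataset containing 800K products.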
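
The contrastive objective can be made concrete with a small example. The sketch below implements the symmetric cross-entropy formulation commonly used by CLIP-style models, in plain Python on a toy batch. The model card does not spell out the exact loss, so treat this as an assumption; the function names and the fixed `temperature` are also illustrative (CLIP learns its temperature during training).

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch where image k matches text k."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[r][c] = scaled cosine similarity between image r and text c.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: the same criterion applied to the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))  # aligned pairs
    print(clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

The loss is close to zero when each image is most similar to its own text and grows when the pairs are swapped, which is what "maximizing the similarity of (image, text) pairs" means in practice.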

Documentation

Data

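The model was trained on (image, text) pairs obtained from the Farfetch dataset[^1 Awaiting official release.], an English-language dataset comprising over 800K fashion products, with more than 3K brands across dozens of object types. The image used for encoding is the standard product image: a picture of the item on a white background, with no humans. The text used is the concatenation of the _highlights_ (e.g., "stripes", "long sleeves", "Armani") and the _short description_ ("80s styled t-shirt") available in the Farfetch dataset.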
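
A minimal sketch of how such a training caption might be assembled is shown below. The function name `build_caption`, the comma separator, and the ordering (highlights first, then the short description) are assumptions; the model card only states that the two fields are concatenated.

```python
def build_caption(highlights, short_description, separator=", "):
    """Join product highlights and the short description into one caption.

    Empty or whitespace-only fields are skipped. The separator and the field
    order are illustrative assumptions, not the documented preprocessing.
    """
    parts = [h.strip() for h in highlights if h and h.strip()]
    if short_description and short_description.strip():
        parts.append(short_description.strip())
    return separator.join(parts)


if __name__ == "__main__":
    # Field values taken from the examples in the paragraph above.
    print(build_caption(["stripes", "long sleeves", "Armani"], "80s styled t-shirt"))
```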

Limitations, Bias and Fairness

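We acknowledge certain limitations of FashionCLIP and expect that it inherits certain limitations and biases present in the original CLIP model. We do not expect our fine-tuning to significantly augment these limitations: we acknowledge that the fashion data we use makes explicit assumptions about the notion of gender, as in "blue shoes for a woman", that inevitably associate aspects of clothing with specific people.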

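Our investigations also suggest that the data used introduces certain limitations in FashionCLIP. On the text side, given that most captions derived from the Farfetch dataset are long, we observe that FashionCLIP may perform better on longer queries than on shorter ones. On the image side, FashionCLIP is also biased towards standard product images (centered, white background).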

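Model selection, i.e. selecting an appropriate stopping criterion during fine-tuning, remains an open challenge. We observed that the loss on an in-domain validation dataset (i.e. same distribution as the test set) is a poor selection criterion when out-of-domain generalization (i.e. across different datasets) is desired, even when the dataset used is relatively diverse and large.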
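
The sketch below illustrates how the two selection criteria can disagree. The helper `select_checkpoint`, the record layout, and every number in `HISTORY` are hypothetical and made up for illustration only; they are not results from the paper.

```python
def select_checkpoint(history, metric, higher_is_better):
    """Return the training step whose record is best under the given metric."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda record: record[metric])["step"]


# Made-up evaluation log: in-domain loss keeps falling while the
# out-of-domain score peaks earlier and then degrades.
HISTORY = [
    {"step": 1000, "in_domain_val_loss": 1.20, "out_of_domain_f1": 0.70},
    {"step": 2000, "in_domain_val_loss": 0.95, "out_of_domain_f1": 0.74},
    {"step": 3000, "in_domain_val_loss": 0.80, "out_of_domain_f1": 0.71},
]

if __name__ == "__main__":
    by_loss = select_checkpoint(HISTORY, "in_domain_val_loss", higher_is_better=False)
    by_f1 = select_checkpoint(HISTORY, "out_of_domain_f1", higher_is_better=True)
    print(f"picked by in-domain loss: step {by_loss}")
    print(f"picked by out-of-domain F1: step {by_f1}")
```

When the two criteria pick different steps, as in this toy log, stopping on in-domain loss alone would keep a checkpoint that generalizes less well to other datasets.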

Citation

@Article{Chia2022,
    title="Contrastive language and vision learning of general fashion concepts",
    author="Chia, Patrick John
            and Attanasio, Giuseppe
            and Bianchi, Federico
            and Terragni, Silvia
            and Magalh{\~a}es, Ana Rita
            and Goncalves, Diogo
            and Greco, Ciro
            and Tagliabue, Jacopo",
    journal="Scientific Reports",
    year="2022",
    month="Nov",
    day="08",
    volume="12",
    number="1",
    abstract="The steady rise of online shopping goes hand in hand with the development of increasingly complex ML and NLP models. While most use cases are cast as specialized supervised learning problems, we argue that practitioners would greatly benefit from general and transferable representations of products. In this work, we build on recent developments in contrastive learning to train FashionCLIP, a CLIP-like model adapted for the fashion industry. We demonstrate the effectiveness of the representations learned by FashionCLIP with extensive tests across a variety of tasks, datasets and generalization probes. We argue that adaptations of large pre-trained models such as CLIP offer new perspectives in terms of scalability and sustainability for certain types of players in the industry. Finally, we detail the costs and environmental impact of training, and release the model weights and code as open source contribution to the community.",
    issn="2045-2322",
    doi="10.1038/s41598-022-23052-9",
    url="https://doi.org/10.1038/s41598-022-23052-9"
}

Fluf22/fashion-clip-inference

Author: Fluf22

zero-shot-image-classification transformers

Created: 2024-10-21 07:18:50+00:00

Updated: 2024-10-21 09:04:34+00:00

Files (20)

.gitattributes

README.md

config.json

handler.py

merges.txt

model.safetensors

onnx/config.json

onnx/merges.txt

onnx/model.onnx

onnx/preprocessor_config.json

onnx/special_tokens_map.json

onnx/tokenizer.json

onnx/tokenizer_config.json

onnx/vocab.json

preprocessor_config.json

pytorch_model.bin

special_tokens_map.json

tokenizer.json

tokenizer_config.json

vocab.json