In October 2020, Dosovitskiy et al. applied a pure Transformer architecture to image classification for the first time (ViT) and achieved the best classification results at the time; it was the first attempt to completely replace standard convolutions with Transformers. After Google proposed ViT, a wave of vision transformer work swept across computer vision tasks.
OpenAI released DALL-E and CLIP in January 2021. DALL-E generates images from text, while CLIP uses text as the supervision signal to train transferable visual models. Like ViT, these two works set off a new wave of research.
Although the original paper only experiments with zero-shot classification using CLIP, CLIP's applications go far beyond that; Mu Li's team surveys the follow-up work in their walkthrough CLIP 改进工作串讲(上)【论文精读·42】.
Today, let's take a look at the CLIP model, which makes zero-shot classification possible. Multimodal large models are developing rapidly these days; the figure below summarizes the related development timeline.
Paper link: Learning Transferable Visual Models From Natural Language Supervision
Traditional visual models are trained with supervision and require large amounts of manual annotation, so the cost is high. In self-supervised learning, the targets the model predicts come from the data itself rather than from manual construction (for example, generative modeling), so no complex annotation is needed. Compared with BERT, OpenAI's GPT-series models can transfer zero-shot to downstream tasks. Likewise, CLIP can not only be pre-trained in a self-supervised manner; more importantly, it can also perform zero-shot classification. So how does CLIP achieve zero-shot classification?
CLIP is a multimodal model based on contrastive learning. Its training data consists of text-image pairs: an image together with its corresponding text description. Through contrastive learning, the model learns the matching relationship between texts and images.
As shown in the figure below, CLIP consists of two models: a Text Encoder and an Image Encoder. The Text Encoder extracts text features and can be a text transformer commonly used in NLP; the Image Encoder extracts image features and can be a standard CNN or a vision transformer.
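Before going through the training and inference procedures, here is a minimal sketch of the two encoders in action with the HuggingFace transformers implementation (the checkpoint id and image filename are placeholders; both encoders project into the same shared embedding space):

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Placeholder checkpoint id; a local copy of clip-vit-base-patch32 works the same way
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("./keyboard.png")
inputs = processor(text=["a photo of a keyboard"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    # Text Encoder output projected into the shared space: shape (1, 512)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    # Image Encoder output projected into the same space: shape (1, 512)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Matching is simply the cosine similarity between the two embeddings
print(torch.cosine_similarity(text_emb, image_emb))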
Pre-training process (left half of the figure below): for a batch of N text-image pairs, the N text features and N image features are combined pairwise, and CLIP computes the cosine similarity of all N×N candidate pairs. The training objective is to maximize the similarity of the N truly matched pairs (the diagonal elements of the matrix), while the remaining N^2 − N text-image pairs serve as negative samples.

Inference process (right half of the figure below): for each class label of the task, build a description text such as A photo of {object}, then feed these texts into the Text Encoder to obtain the corresponding text features; if there are N classes, this yields N text features. The image to be classified is fed into the Image Encoder, and the class whose text feature is most similar to the image feature is taken as the prediction.

The CLIP model is open source and is also integrated in the HuggingFace transformers library, so we can first use transformers to run a simple zero-shot image classification test:
Let's first classify an image of a common object, a keyboard:
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# 0. Prepare a test image
image = Image.open('./keyboard.png')
print('image', image)

# 1. Load the pre-trained model
model_path = '/root/autodl-fs/models/clip-vit-base-patch32'
model = CLIPModel.from_pretrained(model_path)
processor = CLIPProcessor.from_pretrained(model_path)

# 2. Candidate text descriptions
text = ["a photo of a computer", "a photo of a mouse", "a photo of a keyboard", "a photo of a cellphone"]

# 3. Model prediction
inputs = processor(text=text, images=image, return_tensors='pt', padding=True)
outputs = model(**inputs)

# 4. Normalize the predictions with softmax
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

# 5. Print the results
probs = probs.detach().numpy().tolist()
for i in range(len(text)):
    print(text[i], ':', probs[0][i])
a photo of a computer : 0.009659518487751484
a photo of a mouse : 0.000540732522495091
a photo of a keyboard : 0.9897673726081848 # the keyboard gets the highest probability, as expected
a photo of a cellphone : 3.2318232115358114e-05
Now let's switch to an anime angel image and change the prompts for classification:
# 2. Change the candidate text descriptions
# text = ["a photo of a computer", "a photo of a mouse", "a photo of a keyboard", "a photo of a cellphone"]
text = ["a photo of a angle", "a photo of a ghost", "a photo of a cat", "a photo of a airplane"]
# the 'angle' prompt gets the highest probability, whereas a conventional ResNet pre-trained on ImageNet can only classify the 1000 ImageNet classes
a photo of a angle : 0.9529269933700562
a photo of a ghost : 0.029617968946695328
a photo of a cat : 0.00526232598349452
a photo of a airplane : 0.012192689813673496
Training the CLIP model
Earlier work had already explored using text as the supervision signal to train visual models, but those methods struggled to reach high performance; the authors argue that the main reason was the small scale of the datasets. To train CLIP, OpenAI therefore collected 400 million text-image pairs from the internet, a dataset the paper calls WebImageText, showing once again that sheer scale works wonders.
Although CLIP is a multimodal model, it is mainly used to train transferable visual models. In the paper, the Text Encoder is fixed to a text transformer with 63M parameters, while the Image Encoder uses two different families of architectures: modified ResNets and Vision Transformers.
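For reference, the open-source clip package lists the released backbones; a quick check (assuming the clip package from the openai/CLIP repository is installed; the exact list depends on the installed version):

import clip

# Released checkpoints cover both families: ResNet variants and ViT variants
print(clip.available_models())
# e.g. ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64',
#       'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']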
The core of the CLIP implementation can be summarized in a few lines of pseudocode:
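The paper presents this as numpy-style pseudocode; below is a minimal runnable PyTorch rendering of the same training objective (the function and argument names here are my own):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over a batch of N matched image-text pairs.

    image_features, text_features: (N, d) outputs of the two encoders,
    already projected into the shared embedding space.
    logit_scale: learnable temperature tensor, stored in the model as log(1/0.07).
    """
    # L2-normalize both modalities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of scaled pairwise cosine similarities
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The i-th image matches the i-th text: positives lie on the diagonal,
    # the remaining N^2 - N entries act as negatives
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2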
Prompt engineering has recently become a hot research topic in NLP. Its core idea is to construct a suitable prompt so that a pre-trained model can be applied directly to downstream tasks, which differs from the earlier pre-train + fine-tune paradigm. For zero-shot classification, CLIP uses the template A photo of {label}, but other choices are possible; for example, we could use the class label directly. However, if the bare class label is used as the text description, most texts are just a single word, lacking concrete context, and this is also inconsistent with CLIP's training data, so the results are worse than with A photo of {label}, as the sketch below illustrates.
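As a small illustration, the template is just applied to each class name before encoding; the sketch below (the checkpoint path and test image are the same placeholders used earlier, and the class names are only examples) compares bare labels with templated prompts:

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model_path = '/root/autodl-fs/models/clip-vit-base-patch32'  # same checkpoint as above
model = CLIPModel.from_pretrained(model_path)
processor = CLIPProcessor.from_pretrained(model_path)
image = Image.open('./keyboard.png')

classes = ["computer", "mouse", "keyboard", "cellphone"]
# Bare labels vs. the "a photo of a {label}" template
for prompts in (classes, [f"a photo of a {c}" for c in classes]):
    inputs = processor(text=prompts, images=image, return_tensors='pt', padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=1)
    print(prompts, probs.detach().numpy().round(4))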
The main body of the CLIP paper alone runs to 27 pages, with extensive experiments on more than 30 datasets; only a few of them are shown here.
Besides transformers, we can also run inference with OpenAI's official clip package:

import clip
import torch
from PIL import Image

if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print('loading model ...')
    model, preprocess = clip.load("ViT-B/32", device=device, download_root='/root/autodl-fs/models/clip_vit')

    # Image preprocessing: the input is CLIP's architecture diagram; shape after preprocessing = (1, 3, 224, 224)
    image = preprocess(Image.open("./CLIP.png")).unsqueeze(0).to(device)
    # Text preprocessing: shape after tokenization = (3, 77)
    text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

    with torch.no_grad():
        # logits_per_image shape = (1, 3)
        logits_per_image, logits_per_text = model(image, text)  # forward pass
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()  # softmax normalization

    print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]
The preprocess transform returned by clip.load is defined in clip/clip.py:

# clip/clip.py
def _transform(n_px):
    return Compose([
        Resize(n_px, interpolation=BICUBIC),  # resize the shorter side to n_px
        CenterCrop(n_px),                     # center-crop to n_px x n_px
        _convert_image_to_rgb,
        ToTensor(),
        # normalize with CLIP's image mean and std
        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ])
# clip/clip.py
def tokenize(texts: Union[str, List[str]], context_length: int = 77, truncate: bool = False) -> Union[torch.IntTensor, torch.LongTensor]:
    sot_token = _tokenizer.encoder["<|startoftext|>"]
    eot_token = _tokenizer.encoder["<|endoftext|>"]
    # Wrap each text with the start/end tokens and BPE-encode it
    all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]
    # Pad every sequence to a fixed context length (77 by default)
    result = torch.zeros(len(all_tokens), context_length, dtype=torch.int)

    for i, tokens in enumerate(all_tokens):
        if len(tokens) > context_length:
            if truncate:
                tokens = tokens[:context_length]
                tokens[-1] = eot_token
            else:
                raise RuntimeError(f"Input {texts[i]} is too long for context length {context_length}")
        result[i, :len(tokens)] = torch.tensor(tokens)

    return result
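A quick usage sketch of tokenize (the shape matches the (3, 77) text batch used earlier; the exact token ids depend on the released BPE vocabulary):

import clip

tokens = clip.tokenize(["a diagram", "a dog", "a cat"])
print(tokens.shape)   # torch.Size([3, 77])
print(tokens[0, :5])  # [<|startoftext|>, BPE ids for "a diagram", <|endoftext|>, zero padding]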
# clip/model.py
class CLIP(nn.Module):
    def __init__(self,
                 embed_dim: int,  # images and text are embedded into the same dimension
                 # vision
                 image_resolution: int,
                 vision_layers: Union[Tuple[int, int, int, int], int],
                 vision_width: int,
                 vision_patch_size: int,
                 # text
                 context_length: int,
                 vocab_size: int,
                 transformer_width: int,
                 transformer_heads: int,
                 transformer_layers: int
                 ):
        super().__init__()

        self.context_length = context_length

        if isinstance(vision_layers, (tuple, list)):
            # Image encoder option 1: a modified ResNet
            vision_heads = vision_width * 32 // 64
            self.visual = ModifiedResNet(
                layers=vision_layers,
                output_dim=embed_dim,
                heads=vision_heads,
                input_resolution=image_resolution,
                width=vision_width
            )
        else:
            # Image encoder option 2: a Vision Transformer (ViT)
            vision_heads = vision_width // 64
            self.visual = VisionTransformer(
                input_resolution=image_resolution,
                patch_size=vision_patch_size,
                width=vision_width,
                layers=vision_layers,
                heads=vision_heads,
                output_dim=embed_dim
            )

        # Text encoder: a Transformer with a causal attention mask
        self.transformer = Transformer(
            width=transformer_width,
            layers=transformer_layers,
            heads=transformer_heads,
            attn_mask=self.build_attention_mask()
        )

        self.vocab_size = vocab_size
        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, transformer_width)
        # Learnable positional embedding
        self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
        # LayerNorm
        self.ln_final = LayerNorm(transformer_width)
        # Text projection into the shared embedding space
        self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
        # Learnable logit scale (temperature), initialized to log(1/0.07)
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

        # Weight initialization
        self.initialize_parameters()
# clip/model.py
def forward(self, image, text):
    # 1. Encode the input image and text separately
    image_features = self.encode_image(image)  # image_features: (1, 512)
    text_features = self.encode_text(text)     # text_features:  (3, 512)

    # 2. Normalize the features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)

    # 3. Cosine similarity as logits: image-text matching scores
    logit_scale = self.logit_scale.exp()  # scale the logits
    # [batch_img, batch_text]; here logits_per_image shape = [1, 3]
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # shape = [global_batch_size, global_batch_size]
    return logits_per_image, logits_per_text
The image encoder turns an image into a token sequence with a convolution; for details see: 当CV遇上transformer(一)ViT模型

# clip/model.py
def encode_image(self, image):
    return self.visual(image.type(self.dtype))  # cast to the model's dtype (fp16 for the released weights)
# clip/model.py
def encode_text(self, text):
    # 1. Embed the tokenized text
    x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

    # 2. Add the positional embedding
    x = x + self.positional_embedding.type(self.dtype)
    x = x.permute(1, 0, 2)  # NLD -> LND

    # 3. Stack of ResidualAttentionBlocks
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD

    # 4. Final layer norm
    x = self.ln_final(x).type(self.dtype)

    # x.shape = [batch_size, n_ctx, transformer.width] = [3, 77, 512]
    # Take the features at the eot embedding (the eot token has the highest id in each sequence)
    # [batch_size=3, transformer.width=512] @ [512, 512] = [3, 512]
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

    return x
Finally, CLIP's image features can also be used to build a simple search-by-image engine: extract features for a corpus with either CLIP or a timm ResNet, then retrieve the most similar images by cosine similarity.

"""Image search engine"""
import glob
import os
import numpy as np
from PIL import Image
import torch
import argparse
import timm
import torchvision
from tqdm import tqdm
from transformers import CLIPProcessor, CLIPModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def get_mean_std_of_dataset(dataset_dir):
    """Compute the per-channel mean and std of the dataset"""
    train_files = glob.glob(os.path.join(dataset_dir, "*.jpg"))
    print(f'total {len(train_files)} files for training')
    result = []
    # iterate over all images
    for file in train_files:
        img = Image.open(file).convert('RGB')
        img = np.array(img).astype(np.uint8)
        # scale pixels to [0, 1]
        img = img / 255.
        result.append(img)
    # result shape = [BS, H, W, C]
    # per-channel mean and std
    mean = np.mean(result, axis=(0, 1, 2))
    std = np.std(result, axis=(0, 1, 2))
    print(f'mean = {mean}, std = {std}')
    return mean, std


def get_args():
    parser = argparse.ArgumentParser(description='Image Search Task')
    parser.add_argument('--input_size', type=int, default=128, help='images input size')
    parser.add_argument('--dataset_dir', default="/root/autodl-fs/data/fruit20/dataset/train", help='images path')
    parser.add_argument('--test_image_dir', default="/root/autodl-fs/data/fruit20/dataset/val", help='test images path')
    parser.add_argument('--save_dir', default="output_dir", help='path to save')
    parser.add_argument('--model_name', default="clip", help='model name: resnet50 or resnet152 or clip')
    parser.add_argument('--feature_dict_file', default="corpus_feature_dict.npy", help='filename where to save image representations')
    parser.add_argument('--topk', type=int, default=7, help='k most similar images')
    parser.add_argument('--mode', default="extract", help='extract or predict')
    args = parser.parse_args()
    return args


def extract_feature_by_clip(model, preprocess, image_file_path):
    # 1. Read the image and preprocess it
    image = Image.open(image_file_path)
    inputs = preprocess(images=image, return_tensors='pt')
    # 2. Feed the inputs to the model to get the image features
    with torch.no_grad():
        features = model.get_image_features(**inputs)
        vec_features = features.squeeze().cpu().numpy()
    return vec_features


def extract_feature_single(args, model, image_file_path):
    image_rgb = Image.open(image_file_path).convert('RGB')
    image = image_rgb.resize((args.input_size, args.input_size))
    image = torchvision.transforms.ToTensor()(image)
    image = torchvision.transforms.Normalize(mean=[0.47, 0.43, 0.32], std=[0.37, 0.36, 0.34])(image).unsqueeze(0)
    with torch.no_grad():
        features = model.forward_features(image)
        vec_features = model.global_pool(features)
        vec_features = vec_features.squeeze().cpu().numpy()
    return vec_features


def extract_features(args, model, img_path, preprocess):
    all_vectors = {}
    train_files_path = glob.glob(os.path.join(img_path, "*.jpg"))
    train_files_path += glob.glob(os.path.join(img_path, "*.png"))
    for image_file_path in tqdm(train_files_path):
        if args.model_name == "clip":
            # 1. Extract features with CLIP
            all_vectors[image_file_path] = extract_feature_by_clip(model, preprocess, image_file_path)
        else:
            # 2. Extract features with a ResNet
            all_vectors[image_file_path] = extract_feature_single(args, model, image_file_path)
    # Save the extracted image features
    os.makedirs(f"./{args.save_dir}/{args.model_name}", exist_ok=True)
    np.save(f"{args.save_dir}/{args.model_name}/{args.feature_dict_file}", all_vectors)
    return all_vectors


def get_similar_matrix(vectors_dict):
    """Compute the pairwise similarity between all vectors in the dictionary, using cosine similarity"""
    # 1. Each row is one vector
    v = np.array(list(vectors_dict.values()))  # [NUM, H]
    # 2. Numerator of the similarity matrix
    numerator = np.matmul(v, v.T)  # [NUM, NUM]
    # 3. Denominator: product of the norms of each pair of vectors
    denominator = np.matmul(
        np.linalg.norm(v, axis=1, keepdims=True),
        np.linalg.norm(v, axis=1, keepdims=True).T
    )  # [NUM, NUM]
    # 4. sim[i, j] is the similarity between vector i and vector j
    sim = numerator / denominator
    keys = list(vectors_dict.keys())
    return sim, keys


if __name__ == '__main__':
    args = get_args()
    model = None
    processor = None

    if args.model_name != "clip":
        # Use resnet50 / resnet152 as the feature extractor
        model = timm.create_model(args.model_name, pretrained=True)
        model.eval()
    else:
        # Load the OpenAI CLIP pre-trained model
        model_path = '/root/autodl-fs/models/clip-vit-base-patch32'
        model = CLIPModel.from_pretrained(model_path)
        processor = CLIPProcessor.from_pretrained(model_path)

    if args.mode == "extract":
        # 1. Extract features with the pre-trained model and save them
        print(f'use pretrained model {args.model_name} to extract features')
        extract_features(args, model, img_path=args.dataset_dir, preprocess=processor)
    else:
        # 2. Search images by image
        print(f'use pretrained model {args.model_name} to search {args.topk} similar images from corpus')
        test_images = glob.glob(os.path.join(args.test_image_dir, "*.jpg"))
        test_images += glob.glob(os.path.join(args.test_image_dir, "*.png"))

        # 2-1 Load the saved corpus feature vectors
        all_vectors = np.load(f"./{args.save_dir}/{args.model_name}/{args.feature_dict_file}", allow_pickle=True)
        all_vectors = all_vectors.item()

        # 2-2 Extract features for the query images
        for image_file_path in tqdm(test_images):
            print(f'reading {image_file_path} ......')
            if args.model_name == "clip":
                all_vectors[image_file_path] = extract_feature_by_clip(model, processor, image_file_path)
            else:
                all_vectors[image_file_path] = extract_feature_single(args, model, image_file_path)

        # 2-3 Compute the similarity matrix and the corresponding image paths
        sims, keys = get_similar_matrix(all_vectors)

        # 2-4 Take the topk most similar images for each query
        result = {}
        for image_file in tqdm(test_images):
            index = keys.index(image_file)
            sim_vec = sims[index]
            # Sort by similarity in descending order, skipping the query itself at position 0
            indexs = np.argsort(sim_vec)[::-1][1:args.topk]
            sim_imgs, sim_scores = [], []
            for ind in indexs:
                sim_imgs.append(keys[ind])
                sim_scores.append(sim_vec[ind])
            result[image_file] = (sim_imgs, sim_scores)
        print(result)
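Assuming the script above is saved as image_search.py (the filename and dataset paths below are placeholders), a typical workflow is to first build the feature index and then query it:

# 1. Extract and cache CLIP features for the image corpus
python image_search.py --mode extract --model_name clip --dataset_dir /path/to/corpus

# 2. Search the corpus with query images
python image_search.py --mode predict --model_name clip --test_image_dir /path/to/queries --topk 7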