Do the images used to train a Stable Diffusion (XL) LoRA need captions?

Author: 小桥流水78 | 2024-06-27 20:34:20

Intro

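Many style LoRA models are currently trained on captioned images. Does the image-captioning step actually help when training a style LoRA?
In the SDXL code, LoRA is inserted into the Q/K/V linear projections of the cross-attention layers in the UNet, and cross-attention mainly governs the alignment between text tokens and image latents. If every image has a caption, training ties the image style to all the words in the captions. Without captions, training ties the style to a single trigger word, which in this article is 'gfzz'. The comparison experiment that follows tests which of the two approaches works better.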
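
To make the idea of "inserting LoRA into a linear projection" concrete, here is a minimal sketch in plain Python. It is an illustration only, not the kohya_ss or diffusers implementation: the names `matmul`, `apply_lora`, and `scale` are hypothetical, the toy matrices are made up, and it assumes the common LoRA formulation W' = W + scale * (alpha / rank) * (up @ down), where `scale` stands for the LoRA strength chosen at inference time.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_down, lora_up, rank, alpha, scale=1.0):
    """Return W + scale * (alpha / rank) * (lora_up @ lora_down).

    weight:    out_features x in_features (frozen base projection, e.g. Q, K or V)
    lora_down: rank x in_features         (trainable low-rank factor)
    lora_up:   out_features x rank        (trainable low-rank factor)
    """
    delta = matmul(lora_up, lora_down)
    factor = scale * alpha / rank
    return [
        [w + factor * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 projection with a rank-1 update (made-up numbers).
W = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 2.0]]  # 1 x 2
up = [[0.5], [0.0]]  # 2 x 1

full = apply_lora(W, down, up, rank=1, alpha=1, scale=1.0)
weaker = apply_lora(W, down, up, rank=1, alpha=1, scale=0.6)
print(full)    # base weights plus the full low-rank update
print(weaker)  # same update, applied at 60% strength
```

Only the two small factors are trained; the base projection stays frozen, which is why the text prompt that reaches the cross-attention layers during training determines which tokens the learned style ends up attached to.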

Comparison experiment

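Dataset: 200 anime images are used to train a style LoRA on sdxl base 1.0. About 80% of the images show characters and 20% show landscapes, architecture, animals, and other subjects. The style resembles 3D Chinese animation such as 凡人修仙传, and every image is 1024x1024. Each image caption is the merged output of blip2 and wd14 tagger.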
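
The article does not say how the two captioning outputs were merged. One plausible approach, sketched here under that assumption, is to keep the BLIP-2 sentence and append the de-duplicated WD14 tags. The function name `merge_caption`, the optional `trigger_word` argument, the underscore replacement, and the example inputs are all hypothetical.

```python
def merge_caption(blip_caption, wd14_tags, trigger_word=None):
    """Join a sentence caption and a tag list into one comma-separated caption."""
    parts = [blip_caption.strip()]
    if trigger_word:
        # Optionally put the trigger word first (an assumption, not the author's setup).
        parts.insert(0, trigger_word)
    seen = set()
    for tag in wd14_tags:
        tag = tag.strip().replace("_", " ")  # booru-style tags use underscores
        if tag and tag.lower() not in seen:
            seen.add(tag.lower())
            parts.append(tag)
    return ", ".join(parts)


example = merge_caption(
    "a man holding a sword in a forest",
    ["1boy", "sword", "long_hair", "sword"],
)
print(example)
# a man holding a sword in a forest, 1boy, sword, long hair
```

In kohya_ss-style training the resulting string is usually saved in a text file with the same base name as the image; the exact file extension depends on how the trainer is configured.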
Training script: kohya_ss
Training parameters:
class images=None
repeat=20
instance='gfzz'
lr = 0.0001
batchsize=4
epoch=20 (checkpoint-00016 is the best)
LR Scheduler='constant'
Optimizer='Adafactor'
resolution= 1024,1024
network alpha = 8
network rank = 8
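
For a sense of scale, the settings above can be turned into a rough step count. This is a back-of-the-envelope sketch, assuming a single GPU, no gradient accumulation, no regularization images (class images=None), and that checkpoint-00016 was saved at the end of epoch 16; the function name `training_steps` is hypothetical.

```python
import math


def training_steps(num_images, repeats, batch_size, epochs):
    """Return (steps per epoch, total steps) for a repeat-based dataset."""
    samples_per_epoch = num_images * repeats
    steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = training_steps(
    num_images=200, repeats=20, batch_size=4, epochs=20
)
best_checkpoint_step = steps_per_epoch * 16  # assumes one checkpoint per epoch

# With network alpha = 8 and network rank = 8, the usual alpha / rank
# multiplier on the LoRA update is 1.0.
lora_multiplier = 8 / 8

print(steps_per_epoch, total_steps, best_checkpoint_step, lora_multiplier)
# 1000 20000 16000 1.0
```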

This gives three models to compare:
sdxl-base: images generated by the original model
sdxl-base-gfzz: sdxl-base + style LoRA trained without captions
sdxl-base-gfzz-tag: sdxl-base + style LoRA trained with captions

The test prompts are drawn from PartiPrompts, the evaluation set of Parti, and include both simple and complex prompts.

   a rabbit
   a green pepper
   a portrait of an old man
   a close-up photo of a wombat wearing a red backpack and raising both arms in the air. Mount  is in the background.
   a Saint Bernard standing up with its paws in the air. A young girl is seated on the dog's shoulders.
   a man and a woman standing in the back up an old pickup truck
   a wooden deck overlooking a mountain valley
   a man riding a camel on the beach
   a volcano spewing fish into the sky
   a young man wielding a sword, moonlight sprinkled around trees, engraved sword marks, determined expression, trees, sword

negative prompt:
blurry, low quality, worst quality, ugly, duplicate, mutated body parts, extra arms, (extra heads), extra legs, fused fingers, extra fingrers, bad anatomy, bad proportions, lowres, fewer digits, cloned face, repeated person, unclear eyes, blurry eyes, malformed limbs, out of focus, cropped, monochrome, text, JPEG artifacts

Results

In the image, the left side is sdxl-base, the middle is sdxl-base-gfzz-tag (trained with captions), and the right side is sdxl-base-gfzz (trained without captions).
Seed: 40551640821; sampling: 30 steps with Euler_a
[Figure: generated images from the three models]

Summary

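LoRAs trained either way fail to generate certain subjects, such as the green pepper. The earlier checkpoint-00008 has the same problem; lowering the LoRA scale to 0.6 mitigates it.
In terms of image quality and text-image alignment, training with captions comes out slightly ahead.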
