[Image Editing 08] InstructPix2Pix

InstructPix2Pix: Learning to Follow Image Editing Instructions

From: CVPR 2023

Motivation

Differences from earlier work:

  • the text condition is an editing "instruction" rather than a caption
  • the training dataset is synthetically generated

Method

  • Generating a Multi-modal Training Dataset

    • Generating Instructions and Paired Captions

      Concretely, 700 input captions were first sampled from the LAION-Aesthetics V2 6.5+ dataset, and human annotators then wrote, for each of these captions, a triplet: (1) input caption, (2) edit instruction, (3) edited caption.

      (Figure: an example of a human-written triplet)

      These 700 examples are then used to fine-tune GPT-3, teaching it to generate the edit instruction and edited caption from the input caption alone.

      (Figure: instruction and edited-caption generation with the fine-tuned GPT-3)

      Running the fine-tuned model over input captions drawn from the full LAION dataset then produces the large-scale caption corpus:

      (Figure: triplets generated at scale from LAION input captions)
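The fine-tuning setup above can be sketched as a data-preparation step. This is a minimal, hypothetical sketch of packing the human-written triplets into prompt/completion records (the field names and record layout are illustrative, not the paper's exact fine-tuning format); the example triplet is the one used in the paper.

```python
import json

# Human-written triplets: (input caption, edit instruction, edited caption).
triplets = [
    {
        "input_caption": "photograph of a girl riding a horse",
        "edit_instruction": "have her ride a dragon",
        "edited_caption": "photograph of a girl riding a dragon",
    },
]

def to_finetune_record(t):
    """Map one triplet to a prompt/completion pair: the model sees only the
    input caption and must produce the instruction plus the edited caption."""
    return {
        "prompt": t["input_caption"],
        "completion": t["edit_instruction"] + "\n" + t["edited_caption"],
    }

records = [to_finetune_record(t) for t in triplets]
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

After fine-tuning on such records, sampling the model on a new input caption yields a plausible instruction and edited caption, which is exactly the generation step applied to the rest of LAION.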

    • Generating Paired Images from Paired Captions

      Prompt-to-Prompt is used to generate the before/after image pair from each input caption and edited caption, which keeps the two images consistent everywhere except the edited content.

      A CLIP-based metric then measures the similarity between the generated before/after images, and low-quality pairs are filtered out.
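The filtering step can be sketched as cosine-similarity thresholding on CLIP embeddings. This is a minimal sketch that assumes the embeddings are already computed as plain vectors; the threshold values are illustrative, not the paper's exact settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def keep_pair(img0_emb, img1_emb, cap0_emb, cap1_emb,
              image_sim_min=0.75, text_match_min=0.2):
    """Keep a generated (before, after) pair only if the two images are
    similar enough (the edit is not a total scene change) and each image
    matches its own caption. Thresholds here are illustrative."""
    if cosine(img0_emb, img1_emb) < image_sim_min:
        return False
    if cosine(img0_emb, cap0_emb) < text_match_min:
        return False
    if cosine(img1_emb, cap1_emb) < text_match_min:
        return False
    return True
```

A real pipeline would obtain the embeddings from an actual CLIP image/text encoder; only the thresholding logic is shown here.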

  • InstructPix2Pix

    The model minimizes the following loss:

    $$L = \mathbb{E}_{\mathcal{E}(x),\, \mathcal{E}(c_I),\, c_T,\, \epsilon \sim \mathcal{N}(0,1),\, t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, \mathcal{E}(c_I), c_T) \right\|_2^2 \right]$$

    where $\mathcal{E}$ is the encoder, $c_I$ is the image conditioning, and $c_T$ is the text instruction conditioning.
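As a toy sketch of this objective: the network predicts the noise added to the latent, and the loss is the mean squared error between true and predicted noise. Everything below is a 1-D stand-in (the noise schedule and the concatenation of $\mathcal{E}(c_I)$ to the UNet input are omitted); `eps_theta` is a placeholder for the denoising network.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def training_step(z, eps_theta, t, c_image, c_text):
    """One toy training step: sample noise, noise the latent, predict the
    noise from (z_t, t, image conditioning, text conditioning), return MSE."""
    eps = [random.gauss(0.0, 1.0) for _ in z]      # sampled Gaussian noise
    z_t = [zi + ei for zi, ei in zip(z, eps)]      # noised latent (schedule omitted)
    eps_hat = eps_theta(z_t, t, c_image, c_text)   # noise prediction
    return mse(eps, eps_hat)
```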

    Pretraining is All You Need for Image-to-Image Translation argues that, when training data is limited, fine-tuning a large image diffusion model outperforms training a model from scratch on image translation tasks, so this work fine-tunes a pretrained diffusion model.

    • Classifier-free Guidance for Two Conditionings

      In standard classifier-free guidance, the model is trained jointly on conditional and unconditional denoising, and at inference the two score estimates are combined:

      $$\tilde{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \varnothing) + s \cdot \left( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \right)$$

      Setting $s > 1$ at inference biases sampling toward the conditional prediction.
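The single-condition guidance rule is a one-line combination of the two noise estimates; a minimal sketch, treating the estimates as plain vectors:

```python
def cfg(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward the conditional one with guidance scale s (s > 1 amplifies the
    conditional direction)."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With $s = 1$ this recovers the plain conditional estimate; with $s > 1$ it overshoots it in the conditional direction.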

      Borrowing from Compositional Visual Generation with Composable Diffusion Models, this paper fuses the two conditionings, modifying the score estimate to:

      $$\tilde{\epsilon}_\theta(z_t, c_I, c_T) = \epsilon_\theta(z_t, \varnothing, \varnothing) + s_I \left( \epsilon_\theta(z_t, c_I, \varnothing) - \epsilon_\theta(z_t, \varnothing, \varnothing) \right) + s_T \left( \epsilon_\theta(z_t, c_I, c_T) - \epsilon_\theta(z_t, c_I, \varnothing) \right)$$

      where $s_I$ and $s_T$ control how strongly the original image and the edit instruction, respectively, influence the generated image.
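The two-conditioning rule above can be sketched directly: start from the fully unconditional estimate, add $s_I$ times the image-conditioning direction, then $s_T$ times the text-conditioning direction (computed with the image condition held fixed). A minimal sketch over plain vectors:

```python
def cfg2(eps_none, eps_img, eps_img_txt, s_image, s_text):
    """Two-scale classifier-free guidance: eps_none is the fully
    unconditional estimate, eps_img conditions on the input image only,
    and eps_img_txt conditions on both the image and the instruction."""
    return [
        n + s_image * (i - n) + s_text * (it - i)
        for n, i, it in zip(eps_none, eps_img, eps_img_txt)
    ]
```

Raising `s_image` keeps the output closer to the input image, while raising `s_text` makes the edit instruction apply more strongly, matching the roles of $s_I$ and $s_T$.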

Thanks for reading!


Wed Mar 26 2025