Stylized text-to-image generation aims to create images from textual descriptions while adhering to a style specified by a few reference images. However, subtle style variations across reference images can hinder a model from accurately learning the target style.
In this paper, we propose InstaStyle, a novel approach that excels in generating high-fidelity stylized images from only a single reference image. Our approach is based on the finding that the inversion noise of a stylized reference image inherently carries the style signal, as evidenced by its non-zero signal-to-noise ratio.
We employ DDIM inversion to extract this noise from the reference image and leverage a diffusion model to generate new stylized images from the "style" noise. Additionally, the inherent ambiguity and bias of textual prompts impede the precise conveyance of style during image inversion. To address this, we devise prompt refinement, which learns a style token with the assistance of human feedback.
Qualitative and quantitative experimental results demonstrate that InstaStyle outperforms existing methods. Furthermore, our approach demonstrates its capability in the creative task of style combination by mixing inversion noises. We will release the code upon publication.
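As a rough illustration of the style-combination idea (a hedged sketch; the paper's exact mixing scheme may differ), one can blend the DDIM-inverted noises of two references before sampling, reusing the helpers from the stage-one sketch in the method overview below:

```python
# Hypothetical sketch of style combination (the blending weight `lam` and
# the linear mix are our assumptions, not necessarily the paper's scheme).
# `latents_a`/`latents_b` are VAE latents of two reference images, and
# `ddim_invert`/`encode_text`/`pipe` come from the stage-one sketch below.
lam = 0.5
style_noise_a = ddim_invert(latents_a, encode_text("a house in watercolor style"))
style_noise_b = ddim_invert(latents_b, encode_text("a house in oil painting style"))
mixed_noise = lam * style_noise_a + (1.0 - lam) * style_noise_b
image = pipe(prompt="a cat", latents=mixed_noise, num_inference_steps=50).images[0]
```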
Our motivation stems from the observation that inversion noise retains style information, which we substantiate both theoretically and experimentally. Building on this observation, our method involves two main stages.
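To sketch the theoretical side in standard DDIM notation (with cumulative schedule $\bar{\alpha}_t$ and noise predictor $\epsilon_\theta$), each inversion step maps $x_t$ to $x_{t+1}$ while keeping a scaled copy of the predicted clean image:

$$
x_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\,\hat{x}_0(x_t) + \sqrt{1-\bar{\alpha}_{t+1}}\,\epsilon_\theta(x_t, t),
\qquad
\hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}.
$$

Because standard noise schedules terminate with $\bar{\alpha}_T > 0$, the final latent $x_T$ retains a signal component with signal-to-noise ratio $\bar{\alpha}_T / (1-\bar{\alpha}_T) > 0$: the "noise" produced by inversion is not pure Gaussian noise but still encodes the reference image, including its style.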
In the first stage, we employ DDIM inversion to transform the reference image into noise. Notably, this inversion noise exhibits a non-zero signal-to-noise ratio, indicating that it preserves style signals from the reference image. We then generate M stylized images from the inversion noise, conditioned on the given textual prompts.
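A minimal sketch of stage one using Hugging Face diffusers, assuming Stable Diffusion v1.5 as the backbone; the model ID, prompts, and step counts are illustrative, not the paper's exact settings:

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, DDIMScheduler

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

def encode_text(prompt):
    tokens = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    return pipe.text_encoder(tokens)[0]

@torch.no_grad()
def ddim_invert(latents, text_emb, num_steps=50):
    """Walk the deterministic DDIM trajectory backwards: image -> noise."""
    pipe.scheduler.set_timesteps(num_steps)
    alphas = pipe.scheduler.alphas_cumprod
    timesteps = list(reversed(pipe.scheduler.timesteps))  # ascending t
    for i, t in enumerate(timesteps):
        # Conditional noise prediction (no classifier-free guidance here).
        eps = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
        a_cur = (alphas[timesteps[i - 1]].item() if i > 0
                 else pipe.scheduler.final_alpha_cumprod.item())
        a_next = alphas[t].item()
        x0_hat = (latents - (1 - a_cur) ** 0.5 * eps) / a_cur ** 0.5
        latents = a_next ** 0.5 * x0_hat + (1 - a_next) ** 0.5 * eps
    return latents  # x_T: the "style" noise (non-zero SNR)

# Encode the stylized reference image into VAE latents.
img = Image.open("reference.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
x = x.permute(2, 0, 1).unsqueeze(0).to(device)
with torch.no_grad():
    latents = pipe.vae.encode(x).latent_dist.mean * pipe.vae.config.scaling_factor

style_noise = ddim_invert(latents, encode_text("a house in watercolor style"))

# Generate a new stylized image from the "style" noise with a new content prompt.
image = pipe(prompt="a cat in watercolor style", latents=style_noise,
             num_inference_steps=50).images[0]
```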
Because textual prompts describing a style are inherently ambiguous, it is challenging to convey the desired style precisely. To address this, the second stage incorporates human feedback to select N high-quality images generated in the first stage; the selected images are then used to learn a style token via prompt refinement.
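A textual-inversion-style sketch of stage two, reusing `pipe`, `device`, and `encode_text` from the sketch above; the placeholder token, prompt template, and hyperparameters are our assumptions rather than the paper's exact prompt-refinement procedure. `selected_latents` stands for the VAE latents of the N human-selected images:

```python
import torch.nn.functional as F

# Register a placeholder token whose embedding will represent the style.
placeholder = "<style>"
pipe.tokenizer.add_tokens([placeholder])
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids(placeholder)

# Freeze everything except the token-embedding table.
pipe.unet.requires_grad_(False)
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
emb_layer = pipe.text_encoder.get_input_embeddings()
emb_layer.weight.requires_grad_(True)
optimizer = torch.optim.AdamW([emb_layer.weight], lr=5e-4)

for step in range(1000):
    # `selected_latents`: assumed list of latents of the N selected images.
    latents = selected_latents[step % len(selected_latents)]
    noise = torch.randn_like(latents)
    t = torch.randint(0, pipe.scheduler.config.num_train_timesteps, (1,),
                      device=device)
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    text_emb = encode_text(f"a painting in the style of {placeholder}")
    pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)  # standard epsilon-prediction objective
    loss.backward()
    # Keep every embedding frozen except the new style token.
    mask = torch.arange(emb_layer.weight.shape[0], device=device) != token_id
    emb_layer.weight.grad[mask] = 0
    optimizer.step()
    optimizer.zero_grad()
```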
@inproceedings{cui2024instastyle,
  title={InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser},
  author={Cui, Xing and Li, Zekun and Li, Peipei and Huang, Huaibo and Liu, Xuannan and He, Zhaofeng},
  booktitle={ECCV},
  year={2024}
}