Personalized Text-to-Image Generative Models
Problem of Interest
- Recently, models that generate reliable images from text, such as Stable Diffusion, have emerged. In parallel, methods such as DreamBooth and Textual Inversion personalize text-to-image models by binding a few images of a specific subject to a unique text identifier. These fine-tuning methods have proven effective at learning and rendering images of concrete objects such as dogs or buildings.
- However, stylistic attributes such as lines, shapes, textures, and colors are abstract and broad, so learning and personalizing a specific artistic style remains challenging. This paper addresses that gap by focusing on personalization of these broader, more abstract style attributes.
Our Method & Results
- In this paper, we enhance the training process of the existing DreamBooth. DreamBooth introduces class images to mitigate overfitting and language drift; they are primarily employed to ensure that the meaning of the class (e.g., dog) is not lost during training. We found that when training styles rather than objects, class images not only alleviate overfitting and language drift but also help bind the style during learning; a sketch of this objective follows.
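To make the role of the class (Aux) images concrete, below is a minimal PyTorch-style sketch of the DreamBooth prior-preservation objective, where the reconstruction loss on StyleRef images is combined with a weighted loss on Aux images. The `unet`, `noise_scheduler`, and batch fields are assumptions standing in for a full diffusion training loop, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dreambooth_step(unet, noise_scheduler, batch, prior_loss_weight=1.0):
    """One hedged sketch of a DreamBooth training step with prior preservation.

    `batch` is assumed to hold latents and text embeddings for StyleRef
    ("A [V] style") and Aux ("A style") images concatenated along dim 0.
    """
    latents = batch["latents"]                    # (2B, C, H, W): StyleRef + Aux
    encoder_hidden_states = batch["text_embeds"]  # matching prompt embeddings

    # Standard diffusion noising step.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Predict the noise for both halves of the batch in one pass.
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

    # Split: first half = StyleRef (instance), second half = Aux (prior).
    pred_instance, pred_prior = torch.chunk(model_pred, 2, dim=0)
    target_instance, target_prior = torch.chunk(noise, 2, dim=0)

    instance_loss = F.mse_loss(pred_instance, target_instance)
    prior_loss = F.mse_loss(pred_prior, target_prior)
    return instance_loss + prior_loss_weight * prior_loss
```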
Figure 1. The architecture of StyleBoost. StyleRef images of the target style, paired with the text prompt ("A [V] style"), and Aux images, collected from the Internet and paired with the prompt ("A style"), are provided as input. After fine-tuning, the text-to-image model can generate diverse images in the target style under the guidance of text prompts.
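As a small illustration of the input pairing in Figure 1, the sketch below pairs each image set with its prompt; the directory names and the `[V]` identifier token are hypothetical placeholders.

```python
from pathlib import Path

# Hypothetical locations for the two image sets described in Figure 1.
STYLEREF_DIR = Path("data/styleref")  # a few images of the target style
AUX_DIR = Path("data/aux")            # style images collected from the Internet

STYLEREF_PROMPT = "A [V] style"       # unique identifier bound to the target style
AUX_PROMPT = "A style"                # generic class prompt (prior preservation)

def build_pairs(image_dir: Path, prompt: str):
    """Pair every image in a directory with its training prompt."""
    return [(path, prompt) for path in sorted(image_dir.glob("*.jpg"))]

train_pairs = build_pairs(STYLEREF_DIR, STYLEREF_PROMPT) + build_pairs(AUX_DIR, AUX_PROMPT)
```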
- We modified the training process of the existing DreamBooth and conducted experiments on three common styles (anime, SureB, realism). Evaluation used FID (Fréchet Inception Distance) and CLIP (Contrastive Language-Image Pre-training) scores. By comparing against the original DreamBooth, we quantify the improvement and show how well our approach binds abstract style concepts.
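For reference, evaluation along these lines could be scripted with torchmetrics' FID and CLIP-score implementations; the tensors and prompts below are placeholders, and the metric settings are assumptions rather than the paper's exact protocol.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# FID between real style images and generated images (uint8, NCHW).
fid = FrechetInceptionDistance(feature=2048)
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

# CLIP score between generated images and their text prompts.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
prompts = ["A dog in [V] style"] * 16  # placeholder prompts
print("CLIP:", clip_score(fake_images, prompts).item())
```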
Figure 2. Text-to-image synthesis with StyleBoost. Personalized images generated by StyleBoost compared to the existing DreamBooth for three different styles. Across the person, animal, and background (landscape) categories, our model generates high-fidelity images closely aligned with the target style.
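As an illustration of generation after fine-tuning, here is a minimal sketch using diffusers' StableDiffusionPipeline; the checkpoint path and prompts are hypothetical and stand in for a StyleBoost-tuned model.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path to a fine-tuned StyleBoost/DreamBooth checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/styleboost-finetuned", torch_dtype=torch.float16
).to("cuda")

# The learned [V] token steers generation toward the target style
# across the person / animal / background categories shown in Figure 2.
for prompt in [
    "A portrait of a woman in [V] style",
    "A dog running on grass in [V] style",
    "A mountain landscape in [V] style",
]:
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save(f"{prompt[:24].replace(' ', '_')}.png")
```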
Table 1. Comparison of FID for each style under different compositions of StyleRef and Aux images. For the Aux-image comparison, the StyleRef composition is fixed to Background+Person.
Publications & GitHub
GitHub: ION-dgu/StyleBoost (ICTC conference)