Negative Prompts in Stable Diffusion

CV
Deep Learning
Author

Vishwas Pai

Published

October 31, 2022

In this post, we will see how negative prompts work in Stable Diffusion.

The intuition: let's say input_text = "man in blue dress" generates an output latent p1, and a second input input_text2 = "blue" generates an output latent p2. At a high level, p1 - p2 –> an image of a man in a non-blue dress.

This is similar to what we saw in the Components of Stable Diffusion blog: we generated 2 outputs, one aligned with the text and one creative, and combined them with a simple rule to get the final output.

In the Introduction blog we saw the code snippet below, which generates images from text.

import torch
from diffusers import StableDiffusionPipeline

pipe2 = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16).to("cuda")

prompt = "a photograph of an astronaut riding a horse"

pipe2(prompt).images[0]

Along with prompt, the above pipe2 also accepts a parameter called negative_prompt.

Let's look at some examples and see how well this works.

torch.manual_seed(1000)
prompt = "man riding blue bike with red dress"
pipe2(prompt).images[0]

Now let's change the color of the bike with the negative prompt "blue".

torch.manual_seed(1000)
pipe2(prompt, negative_prompt="blue").images[0]

As you can see, both the blue bike and the blue pants have changed.

Let's change the shirt color using the negative prompt "red".

Let's change from riding to something else using the negative prompt "riding".

Now let's change the vehicle using the negative prompt "cycle".

Now let's change both the vehicle and the colour using the negative prompt "blue cycle".

Finally, let's give the complete opposite of the input prompt and see what the output looks like: negative prompt "man riding blue bike".
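Each of these experiments reuses the same pipeline call as before, changing only negative_prompt. A sketch (assuming pipe2 and prompt are defined as in the earlier snippets; the filenames are made up here, and the exact images depend on the seed):

```python
import torch

# Negative prompts tried in the experiments above
experiments = ["red", "riding", "cycle", "blue cycle", "man riding blue bike"]

for neg in experiments:
    torch.manual_seed(1000)  # same seed each time so results are comparable
    image = pipe2(prompt, negative_prompt=neg).images[0]
    image.save(f"negative_{neg.replace(' ', '_')}.png")
```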

As we can see, the prompt "man riding blue bike with red dress" combined with the negative prompt "man riding blue bike" yielded a girl in a red dress.

How does this work?

In the previous post we saw a method for generating good images using guidance_scale.

The equation was:

final_prediction = creative_prediction + guidance_scale * prediction_based_on_text

Now let's add negative_prompt to this equation:

final_prediction = creative_prediction + guidance_scale * (prediction_based_on_text - prediction_based_on_negative_text)
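To make the arithmetic concrete, here is a toy sketch with random tensors standing in for the three UNet noise predictions (the names mirror the equation; none of this touches the real model):

```python
import torch

torch.manual_seed(0)
guidance_scale = 7.5

# Random stand-ins for the three noise predictions
creative_prediction = torch.randn(4)       # unconditional ("creative") branch
prediction_text     = torch.randn(4)       # conditioned on the prompt
prediction_neg      = torch.randn(4)       # conditioned on the negative prompt

# Without a negative prompt (the equation from the previous post)
without_neg = creative_prediction + guidance_scale * prediction_text

# With a negative prompt: the negative-text prediction is subtracted first
with_neg = creative_prediction + guidance_scale * (prediction_text - prediction_neg)
```

Subtracting prediction_neg pushes the denoising direction away from whatever the negative prompt describes, and guidance_scale controls how strongly.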

So all the code from the previous post stays the same, except that instead of 2 noise latents we will now use 3.

# We remove noise step by step over the scheduler's timesteps
for i, t in enumerate(tqdm(scheduler.timesteps)):
    # Three copies of the latents: unconditional (creative), text, and negative text
    latent_input = torch.cat([latents] * 3)
    latent_input = scheduler.scale_model_input(latent_input, t)

    # Given to the UNet for noise prediction
    with torch.no_grad():
        pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample

    # Perform guidance: split the three predictions and combine them
    pred_uncond, pred_text, pred_neg = pred.chunk(3)
    pred = pred_uncond + guidance_scale * (pred_text - pred_neg)

    # compute the "previous" noisy sample
    latents = scheduler.step(pred, t, latents).prev_sample
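To see the mechanics of the three-way split without a GPU, here is a runnable toy version of the loop. fake_unet, branch_emb, and the update step are stand-ins I made up so the three branches produce different predictions; a real run uses the UNet, the CLIP text embeddings, and scheduler.step as above.

```python
import torch

torch.manual_seed(0)
guidance_scale = 7.5

# Toy per-branch "embeddings": unconditional, text, negative text. In the
# real pipeline these are CLIP text-encoder outputs.
branch_emb = torch.tensor([0.0, 1.0, -1.0]).view(3, 1, 1, 1)

# Stand-in for the UNet noise prediction (t is accepted only to mirror the
# real call signature; this toy ignores it)
def fake_unet(x, t, emb):
    return 0.1 * x + emb

latents = torch.randn(1, 4, 8, 8)

for t in [3, 2, 1]:
    # One latent copy per branch, matching torch.cat([latents] * 3) above
    model_input = torch.cat([latents] * 3)
    pred = fake_unet(model_input, t, branch_emb)

    # Split into the three predictions and combine as in the guidance equation
    pred_uncond, pred_text, pred_neg = pred.chunk(3)
    guided = pred_uncond + guidance_scale * (pred_text - pred_neg)

    # Stand-in for scheduler.step(...)
    latents = latents - 0.1 * guided
```

Because the three latent copies are identical, any difference between pred_text and pred_neg comes entirely from the per-branch embeddings, which is exactly what the guidance term amplifies.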