In this post we will see how negative prompts work in Stable Diffusion.
The intuition: say input_text = "man in blue dress" generates an output latent p1, and a second input input_text2 = "blue" generates an output latent p2. Then, at a high level, p1 - p2 gives an image of a man in a non-blue colour.
This is also similar to what we saw in the Components of Stable Diffusion blog: there we generated two outputs, one aligned with the text and one creative, and combined them with a simple rule to get the final output.
In the Introduction Blog we saw the below code snippet, which generates images from text.
import torch
from diffusers import StableDiffusionPipeline

pipe2 = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16).to("cuda")
prompt = "a photograph of an astronaut riding a horse"
pipe2(prompt).images[0]
Along with prompt, the above pipe2 also accepts a param called negative_prompt. Let's look at some examples and see how well this works.
torch.manual_seed(1000)
prompt = "man riding blue bike with red dress"
pipe2(prompt).images[0]
Now let's change the color of the bike with the negative prompt "blue".
torch.manual_seed(1000)
pipe2(prompt, negative_prompt="blue").images[0]
As you can see, both the blue bike and the blue pants have changed.
Let's change the shirt color using the negative prompt "red".
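The exact call isn't shown above, but it would presumably mirror the earlier one, reusing the same seed:

torch.manual_seed(1000)
pipe2(prompt, negative_prompt="red").images[0]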
Let's change from riding to something else using the negative prompt "riding".
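Again, following the same pattern:

torch.manual_seed(1000)
pipe2(prompt, negative_prompt="riding").images[0]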
Now let's change the vehicle using the negative prompt "cycle".
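Likewise:

torch.manual_seed(1000)
pipe2(prompt, negative_prompt="cycle").images[0]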
Now let's change both the vehicle and the colour using the negative prompt "blue cycle".
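Presumably:

torch.manual_seed(1000)
pipe2(prompt, negative_prompt="blue cycle").images[0]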
Let's give the complete opposite of the input prompt and see what the output looks like: negative prompt "man riding blue bike".
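The call, assuming the same seed as before:

torch.manual_seed(1000)
pipe2(prompt, negative_prompt="man riding blue bike").images[0]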
So with p1 = "man riding blue bike with red dress" and negative prompt "man riding blue bike", the result is a girl in a red dress.
How does this work?
We saw in the Previous post a method for generating good images using guidance_scale.
The equation was:
final_prediction = creative_prediction + guidance_scale * prediction_based_on_text
Now let's add negative_prompt to this equation:
final_prediction = creative_prediction + guidance_scale * (prediction_based_on_text - prediction_based_on_negative_text)
So all the code from the previous post stays the same. The only change: instead of 2 noise predictions per step, we will now compute 3.
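Before the loop, text_embeddings must hold three embeddings stacked in the order unconditional, text, negative, to match the pred.chunk(3) below. This setup isn't shown here in the original; a minimal sketch, assuming tokenizer and text_encoder are loaded as in the previous post and negative_prompt holds the negative text:

# Encode a piece of text into CLIP embeddings (hypothetical helper)
def encode(text):
    tokens = tokenizer(text, padding="max_length", max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tokens.input_ids.to("cuda"))[0]

# Order must match pred.chunk(3) in the loop: uncond (creative), text, negative
text_embeddings = torch.cat([encode(""), encode(prompt), encode(negative_prompt)])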
# We will remove noise step by step, once per scheduler timestep
for i, t in enumerate(tqdm(scheduler.timesteps)):
    # Three copies of the latents: one each for the creative (unconditional),
    # text, and negative-prompt predictions
    input = torch.cat([latents] * 3)
    input = scheduler.scale_model_input(input, t)
    # Given to the unet for noise prediction
    with torch.no_grad():
        pred = unet(input, t, encoder_hidden_states=text_embeddings).sample
    # Perform guidance: move towards the text and away from the negative prompt
    pred_uncond, pred_text, pred_neg = pred.chunk(3)
    pred = pred_uncond + guidance_scale * (pred_text - pred_neg)
    # Compute the "previous" noisy sample
    latents = scheduler.step(pred, t, latents).prev_sample
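After the loop, the final latents can be decoded into an image with the VAE, as in the previous post. A sketch, assuming vae is the loaded autoencoder (0.18215 is the SD v1 latent scaling factor):

with torch.no_grad():
    # Undo the latent scaling and decode back to pixel space
    image = vae.decode(latents / 0.18215).sample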