StackGAN

Field
Text to Image Generation
Review date
2021/02/20
I first wrote this post for the Humanscape tech blog and later moved it here.
In this post I review a paper that builds on GAN, which I covered in an earlier paper review, to present a network that generates the image a piece of text describes. The title of the paper is:
"StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks"
If you would like to go through the paper yourself, the original paper is a good reference.

Objective

The paper starts from the observation that generating high-resolution, photo-realistic images from a text description has been difficult. Simply attaching upsampling layers to state-of-the-art GAN models produced images that make no sense, like the ones below.
Vanilla GAN 256x256 Images
The first two images are meant to be birds and the third a flower, and you can see that all of them are quite unstable. According to another paper cited by the authors, generating high-resolution images with a GAN is hard because the distribution of natural (not generated) images and the distribution of images produced by the model do not have overlapping Support in the high-resolution pixel space. In other words, the sets of values that the pixels can take differ between the two distributions, and when pixel values differ that much, the images cannot look alike.
To get around this problem, the paper splits the problem it focuses on, "Text to High-resolution Realistic Images", into two more tractable sub-problems: "Text to Low-resolution Images" and "Text conditional Low-resolution Images to High-resolution Images". The networks that handle them are called Stage-I GAN and Stage-II GAN, respectively.
In addition, the number of text-image training pairs available for the "Text to Low-resolution Images" task is small, which makes it hard to end up with images that are properly conditioned on the text, so the paper uses Conditioning Augmentation. As I explain later, this method adds small random perturbations to the values on the conditioning manifold to give variety to the images generated by Stage-I GAN.
Following the flow of the network, I will go through the details in the order Conditioning Augmentation, Stage-I GAN, Stage-II GAN.

Conditioning Augmentation

As shown on the right of the figure above, Conditioning Augmentation sits at the very beginning of the network.
The network starts with a text description t given as input. The input text description goes through a process called word embedding, which, put simply, converts words into a vector. The most basic approach is a one-hot style encoding: put a 1 at the position of every word that appears in the text description and a 0 at the position of every word that does not (strictly speaking, this gives a multi-hot or bag-of-words vector). The paper itself uses the word embedding developed in another paper.
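To make the basic 0/1 encoding above concrete, here is a minimal sketch. The vocabulary and the `multi_hot` helper are hypothetical names made up for this illustration; this is not the learned embedding that the paper uses.

```python
def multi_hot(description, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the description."""
    words = set(description.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical toy vocabulary, chosen only for illustration.
vocabulary = ["bird", "flower", "red", "yellow", "wings", "petals"]
vector = multi_hot("A red bird with red wings", vocabulary)
print(vector)  # [1, 0, 1, 0, 1, 0]
```

Note that the repeated word "red" still yields a single 1, and word order is lost, which is one reason learned embeddings are preferred.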
A word embedding produces the vector phi, as in the figure above. However, phi is usually high-dimensional, so the input data end up highly discontinuous in that space, which is not suitable for training the generator. To address this, the paper passes phi through a fully connected layer to obtain mu_0 and sigma_0, and samples from the Gaussian Distribution that has them as its mean and standard deviation. The sampling is completed by drawing epsilon from the Standard Normal Distribution and computing the final conditioning vector c_0 as follows.
$$\hat{c}_0 = \mu_0 + \sigma_0 \odot \epsilon$$
The symbol between sigma_0 and epsilon denotes element-wise multiplication. The fully connected layer produces mu_0 and sigma_0; the sampling step itself only scales and shifts epsilon element-wise, so gradients can still flow back to the fully connected layer.
To sum up, Conditioning Augmentation lets you adjust the dimension of the conditioning vector while adding a small amount of random perturbation to it.
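The sampling step can be sketched in NumPy as follows. This is only an illustration under my own assumptions: the weights are random rather than trained, the sizes (1024 and 128) are illustrative, and predicting the log-variance to keep sigma positive is a common convention that I am assuming here, not something stated in this post.

```python
import numpy as np

def conditioning_augmentation(phi, w_mu, b_mu, w_logvar, b_logvar, rng):
    """Sample c_hat = mu + sigma * eps, with mu and sigma produced from phi by linear maps."""
    mu = phi @ w_mu + b_mu
    log_var = phi @ w_logvar + b_logvar    # assumption: predict log-variance so sigma > 0
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)    # eps ~ N(0, I)
    return mu + sigma * eps, mu, sigma

rng = np.random.default_rng(0)
embed_dim, cond_dim = 1024, 128            # illustrative sizes
phi = rng.standard_normal(embed_dim)       # stand-in for a text embedding
w_mu = rng.standard_normal((embed_dim, cond_dim)) * 0.01
w_logvar = rng.standard_normal((embed_dim, cond_dim)) * 0.01
b_mu = np.zeros(cond_dim)
b_logvar = np.zeros(cond_dim)

c_hat, mu, sigma = conditioning_augmentation(phi, w_mu, b_mu, w_logvar, b_logvar, rng)
print(c_hat.shape)  # (128,)
```

Calling the function again with the same phi gives the same mu and sigma but a different c_hat, which is exactly the small random perturbation described above.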
One last point to note here is that the paper adds an extra loss term for training stability. You can think of it as a term that stabilizes the training of the fully connected layer that extracts mu_0 and sigma_0 from phi, the vector produced by the word embedding, under the assumption of a Normal Distribution.
$$D_{KL}(N(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, N(0, I))$$
I briefly covered D_KL in the GAN review as well. It is the Kullback-Leibler Divergence, which you can think of as a measure of how different two probability distributions are. The paper does not explain it in detail, but it adds this term on the grounds that it enforces smoothness and helps prevent overfitting while mu and sigma are being learned.
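For a Gaussian with a diagonal covariance, this KL term has a well-known closed form, 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1). The sketch below assumes a diagonal covariance (so sigma is a vector of standard deviations); the function name is my own.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    var = sigma ** 2
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0)

print(kl_to_standard_normal(np.zeros(4), np.ones(4)))  # 0.0
print(kl_to_standard_normal(np.ones(4), np.ones(4)))   # 2.0
```

The term is zero only when mu is 0 and sigma is 1, so minimizing it pulls the conditioning distribution toward the Standard Normal Distribution.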

Stage-I GAN

Stage-I GAN takes the conditioning vector that comes out of the Conditioning Augmentation block as input and generates a low-resolution image. It starts with the Generator. As in an ordinary conditional GAN, a noise vector z is concatenated with the conditioning vector to form the Generator input.
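As a small illustration of how the Generator input is formed, the sketch below concatenates a conditioning vector with a noise vector for a batch. The batch size and both dimensions are assumptions for the example, and the conditioning vectors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, noise_dim, cond_dim = 4, 100, 128         # illustrative sizes

z = rng.standard_normal((batch, noise_dim))      # noise vector z ~ N(0, I)
c_hat = rng.standard_normal((batch, cond_dim))   # stand-in for the conditioning vectors

# One input row per sample: [conditioning vector | noise vector]
generator_input = np.concatenate([c_hat, z], axis=1)
print(generator_input.shape)  # (4, 228)
```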
Next comes the upsampling block, but the paper does not describe its structure in detail. I was curious, so I visited the repository referenced by the paper and worked out the structure from the code. If you are curious, you can check the repository yourself; the main point to take away seems to be that upsampling is done with deconvolution layers.
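To get a feel for how a stack of deconvolution (transposed convolution) layers grows the spatial size, here is a small arithmetic sketch. It uses the standard output-size formula for a transposed convolution without dilation or output padding. The kernel, stride, and padding values and the 4x4 starting size are my own illustrative assumptions, not values taken from the paper or the repository.

```python
def transposed_conv_output_size(size, kernel, stride, padding):
    """Output size of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

size = 4                      # assumed starting feature-map size
sizes = [size]
for _ in range(4):            # four upsampling steps, each doubling the size
    size = transposed_conv_output_size(size, kernel=4, stride=2, padding=1)
    sizes.append(size)
print(sizes)  # [4, 8, 16, 32, 64]
```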
After that comes the Discriminator. It takes the fake image produced by the Generator and a real image as input and reduces their dimensions through downsampling blocks. It then performs Spatial Replication, which copies the vector phi produced by the word embedding to every spatial position and concatenates it with the image features.
Spatial Replication is how the condition enters the Discriminator. In a basic GAN, the Discriminator's job is to correctly classify images produced by the Generator as Fake and real-world images as Real. In a cGAN, however, the generated image also has to match the condition, so a conditional term is fed into the Discriminator that does the evaluation, which makes training more effective.
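Spatial Replication can be sketched as tiling the text vector over the feature map's height and width and then concatenating along the channel axis. The shapes below (a 4x4 feature map with 512 channels and a 128-dimensional text vector) are assumptions for illustration, and `text_vector` stands for whatever text representation the Discriminator receives.

```python
import numpy as np

def spatially_replicate_and_concat(features, text_vector):
    """features: (H, W, C); text_vector: (N,). Returns an (H, W, C + N) array."""
    h, w, _ = features.shape
    # Copy the same text vector to every spatial position.
    tiled = np.broadcast_to(text_vector, (h, w, text_vector.shape[0]))
    return np.concatenate([features, tiled], axis=-1)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 4, 512))   # stand-in for downsampled image features
text_vector = rng.standard_normal(128)        # stand-in for the text representation
combined = spatially_replicate_and_concat(features, text_vector)
print(combined.shape)  # (4, 4, 640)
```

Every spatial position now sees the same text information next to its local image features, so later layers can judge whether image and text match.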
Once Spatial Replication is done, a fully connected layer is used to produce the final value that separates Fake(0) from Real(1). Again, the paper does not describe the structure in detail, so you can check it in the repository.
The final form of Stage-I GAN is shown above.
$$L_{D_0} = E_{(I_0,t)\sim p_{data}}[\log D_0(I_0,\varphi_t)] + E_{z\sim p_z,\, t\sim p_{data}}[\log(1 - D_0(G_0(z,\hat{c}_0),\varphi_t))]$$
$$L_{G_0} = E_{z\sim p_z,\, t\sim p_{data}}[\log(1 - D_0(G_0(z,\hat{c}_0),\varphi_t))] + \lambda D_{KL}(N(\mu_0(\varphi_t),\Sigma_0(\varphi_t)) \,\|\, N(0,I))$$
The losses used in Stage-I GAN are shown above. If you have read my earlier GAN paper review, they should make sense right away, but I will briefly go over them again.
The Generator wants to fool the Discriminator into judging its images as Real. It therefore wants the value output by D for its images to be close to 1 (Real), which makes log(1 - D(...)) small, so its goal is to minimize L_G. The Kullback-Leibler Divergence mentioned earlier is part of the same objective: training the Gaussian Distribution built from mu and sigma to stay close to the Standard Normal Distribution adds stability, so this term is included in the L_G that is minimized.
The Discriminator wants to correctly classify images produced by the Generator as Fake(0) and real-world images as Real(1). It therefore wants its output to be close to 1 (Real) for a Real Image and close to 0 (Fake) for a Fake Image. That is why its goal is to maximize L_D.
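The two objectives can be evaluated on a batch of Discriminator outputs as in the sketch below. The expectations are replaced by batch means, the Discriminator outputs are made-up numbers, and the function names and the value of lambda are assumptions for illustration.

```python
import numpy as np

def discriminator_objective(d_real, d_fake):
    """L_D = mean log D(real) + mean log(1 - D(fake)); the Discriminator maximizes this."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def generator_objective(d_fake, kl_term, lam):
    """L_G = mean log(1 - D(fake)) + lambda * KL; the Generator minimizes this."""
    return np.mean(np.log(1.0 - d_fake)) + lam * kl_term

# A Discriminator that separates real from fake scores higher than one that guesses 0.5.
good_d = discriminator_objective(np.array([0.9, 0.8]), np.array([0.2, 0.1]))
guessing_d = discriminator_objective(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(good_d > guessing_d)  # True

# A Generator whose images are rated close to Real (1) gets a lower L_G.
fooling_g = generator_objective(np.array([0.9, 0.9]), kl_term=0.0, lam=1.0)
failing_g = generator_objective(np.array([0.1, 0.1]), kl_term=0.0, lam=1.0)
print(fooling_g < failing_g)  # True
```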

Stage-II GAN

Stage-II GAN takes the image generated by Stage-I GAN and the vector phi produced by the word embedding as input and generates a high-resolution image. It also starts with a Generator, which receives the vector phi and the low-resolution image as input.
It differs from the Stage-I GAN Generator in two ways: it takes the low-resolution image s_0 as input instead of a noise vector z, and although the word embedding is shared, the fully connected layer is not, so that Stage-II GAN separately learns the information that Stage-I GAN missed.
To build a Generator that turns an image into a high-resolution image, the paper uses an encoder-decoder structure, again with Spatial Replication. It also needs layers that learn the relationship between the image and the text, and it uses Residual blocks for this, which pass gradients effectively through the now deeper network.
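The idea behind a Residual block can be shown with a minimal sketch: the block adds its input back to the output of a learned transform, so there is always an identity path for gradients. The transform below is a tiny matrix multiply with ReLU, used only as a stand-in; the convolutional layers in the actual network are not reproduced here.

```python
import numpy as np

def residual_block(x, transform):
    """Return x + transform(x); the identity path lets gradients bypass the transform."""
    return x + transform(x)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)) * 0.1         # hypothetical weights for the stand-in transform
x = rng.standard_normal((4, 8))               # stand-in for joint image-text features

y = residual_block(x, lambda v: np.maximum(v @ w, 0.0))
print(y.shape)  # (4, 8)
```

If the transform outputs zeros, the block reduces to the identity, which is why stacking several of them does not make the deeper network harder to train.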
This produces the high-resolution fake image.
After that comes the Discriminator. Its structure is the same as in Stage-I GAN, so I will skip the detailed explanation.
The final form of Stage-II GAN is shown above.
$$L_D = E_{(I,t)\sim p_{data}}[\log D(I,\varphi_t)] + E_{s_0\sim p_{G_0},\, t\sim p_{data}}[\log(1 - D(G(s_0,\hat{c}),\varphi_t))]$$
$$L_G = E_{s_0\sim p_{G_0},\, t\sim p_{data}}[\log(1 - D(G(s_0,\hat{c}),\varphi_t))] + \lambda D_{KL}(N(\mu(\varphi_t),\Sigma(\varphi_t)) \,\|\, N(0,I))$$
The losses used in Stage-II GAN are shown above. You can see that they are very similar to those of Stage-I GAN. The notable differences are that the Real Image used in L_D is now a high-resolution image, and that the Generator receives a low-resolution image as input instead of a noise vector.

Validation

To evaluate the proposed network, the paper uses the Inception Score (IS) together with visually judged image quality and text-conditionality. The baselines are the GAN-INT-CLS and GAWWN networks.
Briefly, the Inception Score is a measure of whether a network generates realistic-looking images and whether it generates diverse images. These two properties are called Fidelity and Diversity, and a network scores high when both are high.
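The usual definition of the Inception Score is exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's class distribution for one generated image and p(y) is the average over all generated images. The sketch below applies that formula to class probabilities passed in directly; in practice they come from a pretrained Inception network, which is not included here, and the two toy inputs are made up.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (num_images, num_classes), each row is p(y|x). Returns exp(mean KL(p(y|x) || p(y)))."""
    marginal = probs.mean(axis=0, keepdims=True)          # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

confident_and_diverse = np.eye(3)             # each image is clearly one class, all classes covered
uninformative = np.full((3, 3), 1.0 / 3.0)    # classifier cannot tell what any image shows

print(round(inception_score(confident_and_diverse), 6))  # 3.0
print(round(inception_score(uninformative), 6))          # 1.0
```

Confident per-image predictions reflect Fidelity and an evenly spread p(y) reflects Diversity, so the score is high only when both hold.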
Let's now look at the validation results presented in the paper.
The 64x64 images generated by GAN-INT-CLS capture only the general shape and lack vivid parts and convincing details. As a result, they look neither like real photos nor like high-resolution images.
The 128x128 images generated by GAWWN look higher in resolution than those generated by GAN-INT-CLS. However, these images are not generated from the text description alone, which is a drawback compared with the paper's StackGAN.
The figure above shows the images generated by the two GANs inside StackGAN. Stage-I GAN is able to express the rough shapes and colors described by the text description, but most of its images are blurry and lack detail. Stage-II GAN makes up for these shortcomings.
For example, in the 5th column of the figure, the description says reddish brown crown, but the Stage-I image has a bluish crown, and Stage-II GAN corrects this. Also, as in the 7th column, even when the Stage-I GAN result does not show an identifiable object, Stage-II GAN can still successfully generate an image based on it.
The figure shows that both GANs do exactly what they were designed to do.
The next experiment shows diversity visually. The figure shows that the network does not simply memorize training examples and replay them, but learns a probability distribution similar to theirs. The images on the left are generated from text descriptions, and the images on the right are their nearest neighbors in the training set. They are similar to each other and yet different.
The comparison by Inception Score is presented in the table above. After training on several datasets, the paper reports that StackGAN achieved the highest score.
Next, the paper validates what Conditioning Augmentation contributes. The results of Stage-I GAN with CA and without CA differ greatly in diversity. Even so, Stage-I GAN with CA was not as realistic as the best-performing StackGAN.
The paper then runs a quantitative validation of Conditioning Augmentation and of feeding the text embedding vector only into Stage-I GAN (Text once) versus also into Stage-II GAN (Text twice).
As the table above shows, using CA and using Text twice both lead to a higher Inception Score.
Personally, I found this next part quite interesting. To show that StackGAN learns a smooth latent data manifold, the paper interpolates the word embedding vectors phi produced by two sentences, feeds the result into the network, and shows that the generated images also look interpolated. In other words, the network can learn a data space in which different texts are not completely isolated from each other but connect naturally.
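The interpolation experiment can be sketched as follows. I am assuming plain linear interpolation between the two embedding vectors, since the exact scheme is not stated in this post, and the two vectors are random stand-ins for real sentence embeddings.

```python
import numpy as np

def interpolate_embeddings(phi_a, phi_b, steps):
    """Return `steps` vectors moving linearly from phi_a to phi_b (both endpoints included)."""
    weights = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - w) * phi_a + w * phi_b for w in weights])

rng = np.random.default_rng(0)
phi_a = rng.standard_normal(1024)   # stand-in for the embedding of the first sentence
phi_b = rng.standard_normal(1024)   # stand-in for the embedding of the second sentence

path = interpolate_embeddings(phi_a, phi_b, steps=5)
print(path.shape)  # (5, 1024)
```

Each row of `path` would be fed to the network in place of phi; if the learned manifold is smooth, the generated images should change gradually from one description to the other.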

Supplementary Materials

That concludes the review of the paper. For those who are curious about the text-image data, here are some additional pictures attached to the paper.

Conclusion

That was a brief summary of the paper "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks".
It was my first time reading about this kind of topic, so it felt fresh, although it was a pity that the paper does not describe the network structure in detail. That said, the images presented in the paper were especially fun to look at.
Personally, I think this paper made me even more interested in GANs. If you are interested in GANs too, I recommend giving it a read.