Gradient Descent Algorithms

Field: Algorithms
Review date: 2021/01/16
๋ณธ ํฌ์ŠคํŠธ๋Š” ์ œ๊ฐ€ ํœด๋จผ์Šค์ผ€์ดํ”„ ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ์— ๋จผ์ € ์ž‘์„ฑํ•˜๊ณ  ์˜ฎ๊ธด ํฌ์ŠคํŠธ์ž…๋‹ˆ๋‹ค.
์ด๋ฒˆ ํฌ์ŠคํŠธ์—์„œ๋Š” Machine Learning์— ์‚ฌ์šฉ๋˜๋Š” ๋‹ค์–‘ํ•œ optimizer๊ฐ€ ๋ฐฐ๊ฒฝ์œผ๋กœ ํ•˜๋Š” gradient descent algorithms์— ๋Œ€ํ•ด์„œ ์ •๋ฆฌํ•œ ๋…ผ๋ฌธ์— ๋Œ€ํ•ด์„œ ๋ฆฌ๋ทฐํ•˜๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๋ฆฌ๋ทฐํ•˜๋ ค๋Š” ๋…ผ๋ฌธ์˜ ์ œ๋ชฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
โ€œAn overview of gradient descent optimization algorithmsโ€
๋…ผ๋ฌธ์˜ ๋ชฉ์ ์€ ๋…์ž๋“ค์—๊ฒŒ ๋‹ค์–‘ํ•œ Gradient Descent Algorithm์— ๋Œ€ํ•œ ์ง๊ด€์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์— ๋Œ€ํ•œ ๋‚ด์šฉ์„ ์ง์ ‘ ๋ณด์‹œ๊ณ  ์‹ถ์œผ์‹  ๋ถ„์€ย ์ด๊ณณ์„ ์ฐธ๊ณ ํ•˜์‹œ๋ฉด ์ข‹์Šต๋‹ˆ๋‹ค.
๊ฐ๊ฐ์˜ Gradient Descent Algorithm์„ ์„ค๋ช…ํ•˜๊ธฐ ์ „์— ์œ„ ๊ทธ๋ฆผ์˜ ๊ด€๊ณ„๋กœ ๊ฐ๊ฐ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ํƒ„์ƒํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค ์ •๋„๋ฅผ ๊ฐ€๋ณ๊ฒŒ ๋ณด๊ณ  ๋„˜์–ด๊ฐ€์‹œ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๊ฒ€์€์ƒ‰ ํ™”์‚ดํ‘œ์˜ ๊ฒฝ์šฐ DataSet Size์— ๊ด€๋ จํ•œ ๋ณ€ํ™”์ด๋ฉฐ ๋ถ‰์€์ƒ‰ ํ™”์‚ดํ‘œ์˜ ๊ฒฝ์šฐ Step Size์— ๊ด€๋ จํ•œ ๋ณ€ํ™”์ด๊ณ  ํ‘ธ๋ฅธ์ƒ‰ ํ™”์‚ดํ‘œ์˜ ๊ฒฝ์šฐ Step Direction์— ๊ด€๋ จํ•œ ๋ณ€ํ™”๋กœ ๋ณด์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.์ž์„ธํ•œ ์„ค๋ช…์„ ์ง€๊ธˆ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

BGD (Batch Gradient Descent)

Batch Gradient Descent is the gradient descent we all know as the most basic form. At a given step, each parameter theta changes by its own component of the gradient of the objective function J; we know this all too well as gradient descent. When the objective function J is the sum of the cost function over the entire dataset, we call this Batch Gradient Descent. The point worth noting about this algorithm is that it uses an objective function defined over the entire dataset.
$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$$
์ „์ฒด ๋ฐ์ดํ„ฐ์…‹...?? Batchโ€ฆ??
๋‹ค๋งŒ, โ€œBatchโ€๋ผ๋Š” ์šฉ์–ด ๋•Œ๋ฌธ์— ์ € ๋˜ํ•œ ๊ทธ๋žฌ๊ณ , ๋งŽ์€ ๋ถ„๋“ค์ด ์ฒ˜์Œ์— ํ˜ผ๋™์„ ๊ฒช๊ณ ๋Š” ํ•ฉ๋‹ˆ๋‹ค. ํ”ํžˆ Batch๋Š” ํ•™์Šต ์šฉ์–ด๋กœ ์ „์ฒด dataset์„ ๋‚˜๋ˆˆ ๋ฐ์ดํ„ฐ ์…‹์˜ ํ•œ ๋‹จ์œ„๋ฅผ ์ƒ๊ฐํ•˜๊ธฐ ์‰ฝ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, Batch ์ž์ฒด๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ ์…‹์„ ์˜๋ฏธํ•˜๋ฉฐ, ์‹ค์ œ๋กœ ์ €ํฌ๊ฐ€ Batch๋กœ ๋งŽ์ด ์ƒ๊ฐํ•˜๋Š” ๊ฒƒ์€ mini-batch๋กœ ๋ณด์‹œ๋Š” ๊ฒƒ์ด ๋งž์Šต๋‹ˆ๋‹ค.
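To make the update concrete, here is a minimal sketch of batch gradient descent on a toy least-squares problem; the dataset, learning rate, and step count are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal batch gradient descent sketch on a toy least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # the full dataset (the "batch")
y = X @ np.array([1.0, -2.0, 0.5])      # targets from a known linear model
theta, eta = np.zeros(3), 0.1           # parameters and learning rate (illustrative)

for _ in range(200):
    grad = X.T @ (X @ theta - y) / len(X)   # gradient of J over the entire dataset
    theta -= eta * grad                      # theta = theta - eta * grad_theta J(theta)

print(theta)  # should be close to [1.0, -2.0, 0.5]
```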

SGD (Stochastic Gradient Descent)

BGD was sufficient as the foundational algorithm of machine learning. However, a single parameter update required computing the objective function over the entire dataset, so training was slow.
Stochastic Gradient Descent appeared to improve on this.
$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$$
BGD์™€์˜ ์ฐจ์ด์ ์€ objective function J์˜ ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค. BGD๊ฐ€ ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ cost function์˜ ํ•ฉ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์—ˆ๋‹ค๋ฉด SGD๋Š” ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋กœ๋งŒ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ํ•˜๋‚˜์˜ ์—…๋ฐ์ดํŠธ๋ฅผ ์œ„ํ•œ ๊ณ„์‚ฐ๊ณผ์ •์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ์ง€ ์•Š๊ณ ย ๋น ๋ฅด๊ฒŒย ์ง„ํ–‰๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.๋ถ€๊ฐ€์ ์œผ๋กœ, SGD์—์„œ๋Š” ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด ๊ณ„์‚ฐํ•˜๊ธฐ ๋•Œ๋ฌธ์— parameter์˜ fluctuation์ด ํฐ ํŽธ์ด๋ฉฐ ์ด ๋•Œ๋ฌธ์— ๊ธฐ์กด BGD์—์„œ ์ž์ฃผ ๋ฐœ์ƒํ•˜๋Š”ย local minimum์— ๋น ์ ธ์„œ ๋‚˜์˜ค์ง€ ๋ชปํ•˜๋Š” ํ˜„์ƒ์ด SGD์—์„œ๋Š” ์ž˜ ๋‚˜ํƒ€๋‚˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ๋„ ์žฅ์ ์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Mini-Batch Gradient Descent

SGD was a suitable fix for BGD's slow training, but it had its own drawback: the large fluctuation at every update sometimes hindered the algorithm's convergence.
It is a pattern we run into often: in most cases, taking a bit from each of two extreme designs and mixing them compensates nicely for their major weaknesses.
Mini-Batch Gradient Descent is one such example. It uses neither the whole dataset nor a single example, but a moderately sized subset.
$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i;i+n)}, y^{(i;i+n)})$$
์œ„์˜ ์ˆ˜์‹์€ objective function J์˜ ์ •์˜๋ฅผ ์œ„ํ•ดย n๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•œ ๊ฒฝ์šฐ์ž…๋‹ˆ๋‹ค. ์ „์ฒด๋ฅผ ๋‹ค ์ด์šฉํ•˜์ง€ ์•Š์•„์„œ BGD๋ณด๋‹ค ์ƒ๋Œ€์ ์œผ๋กœ ๋น ๋ฅด๊ณ , ์–ด๋Š์ •๋„์˜ ์ง‘๋‹จ์„ ๋งŒ๋“ค์–ด ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— SGD๋ณด๋‹ค ์ˆ˜๋ ด์„ฑ์ด ์ข‹์Šต๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ, ์ด๋Ÿฌํ•œ ์‹œ๋„๋ฅผ ํ†ตํ•ด์„œย ๋น ๋ฅด๋ฉด์„œ๋„ ์ž˜ ์ˆ˜๋ ด๋˜๋Š”ย gradient descent algorithm์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค.

Momentum

From here the story takes a slightly different turn. The three algorithms above were only about the range of the dataset over which the objective function J is defined.
I mentioned earlier that one of SGD's problems is its large fluctuation. Back then I only said it can hinder convergence itself; in addition, the closer we get to a local minimum, the more likely SGD is to produce steep, rapidly changing gradients, which also slows convergence.
Momentum is the concept that appeared to rein in these rapidly changing gradients near local minima. As you may guess from the word, it adds a kind of inertia to the gradient: past gradients are consulted and reflected when producing the current one.
Suppose I am running east at 100 m/s and the next command suddenly tells me to run west at 100 m/s; I cannot switch instantly. To account for this, the fact that I was running east at 100 m/s is reflected when the next command is produced, softening it to something like running west at only 20 m/s.
$$v_t = \gamma v_{t-1} + \eta\nabla_\theta J(\theta)$$
$$\theta = \theta - v_t$$
Of course... the influence of past gradients should not end up larger than that of the current gradient, right??
That is why a momentum term gamma < 1 is applied. With this setting, the further back a gradient lies in the past, the less influence it has on the current gradient. In this way the direction of previous learning is partly preserved, which made training faster.
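A minimal sketch of the update above; grad_fn, gamma, eta, and the toy quadratic objective are illustrative placeholders, not values from the paper.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, gamma=0.9, eta=0.01):
    """One momentum update: v_t = gamma*v_{t-1} + eta*grad, theta = theta - v_t."""
    v = gamma * v + eta * grad_fn(theta)
    return theta - v, v

# Toy objective J(theta) = 0.5*||theta||^2, so grad_fn(theta) = theta.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, lambda th: th)
print(theta)  # moves toward the minimum at the origin
```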

Nesterov Accelerated Gradient (NAG)

Momentum์—์„œ ๋น ๋ฅธ ํ•™์Šต์„ ์œ„ํ•ด ์ด์ „ gradient์˜ ๋ฐฉํ–ฅ์„ฑ์„ ์œ ์ง€ํ•œ๋‹ค๋Š” ์•„์ด๋””์–ด๋ฅผ ์ œ์‹œํ–ˆ์—ˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋Ÿฐ๋ฐ, ๊ณผ๊ฑฐ์˜ gradient์˜ ๋ฐฉํ–ฅ์„ฑ์„ ์•Œ๊ณ  ๋ฐ˜์˜ํ•˜๋Š” ๊ฒƒ ๋ฟ๋งŒ์ด ์•„๋‹ˆ๋ผย ์ž์‹ ์ด ํ˜„์žฌ ์˜ˆ์ƒํ•œ ์ž์‹ ์˜ gradient๋ฅผ ์ด์šฉํ•ด ์ด๋™๋˜์—ˆ์„ ๋•Œ์˜ ์œ„์น˜์—์„œ์˜ gradient๋„ ์•Œ ์ˆ˜ ์žˆ๋‹ค๋ฉด, ๋”์šฑ ๋น ๋ฅธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•˜์ง€ ์•Š์„๊นŒ๋ผ๋Š” ์•„์ด๋””์–ด๊ฐ€ ๋‚˜์™”์Šต๋‹ˆ๋‹ค.
์ œ๊ฐ€ ๋™์ชฝ์œผ๋กœ 100m/s ์˜ ์†๋ ฅ์œผ๋กœ ๋‹ฌ๋ฆฌ๊ณ  ์žˆ๋Š”๋ฐ ๊ฐ‘์ž๊ธฐ ๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์„œ์ชฝ์œผ๋กœ 100m/s ๋กœ ๋‹ฌ๋ฆฌ๋ ค๊ณ  ํ•˜๋ฉด ๋ฐ”๋กœ ๋ฐ”๊พธ๊ธฐ๋Š” ์‰ฝ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.์ด๋Ÿฌํ•œ ์ ์„ ๋ฐ˜์˜ํ•˜์—ฌ ์ œ๊ฐ€ ๋™์ชฝ์œผ๋กœ 100m/s๋กœ ๋‹ฌ๋ฆฌ๊ณ  ์žˆ์—ˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ๋‹ค์Œ ๋ช…๋ น์„ ์‚ฐ์ถœํ•  ๋•Œย ๋ฐ˜์˜ํ•˜์—ฌ ์„œ์ชฝ์œผ๋กœ 20m/s ์ •๋„๋กœ๋งŒ ๋‹ฌ๋ฆฌ๋ผ๋Š” ๊ฒƒ์œผ๋กœ ์™„ํ™”ํ•  ์ƒ๊ฐ์ž…๋‹ˆ๋‹ค.์ด๋ ‡๊ฒŒ ์™„ํ™”ํ•˜์—ฌ ๋‹ฌ๋ฆด ๊ฒƒ์„ ๊ฐ€์ •ํ•˜๊ณ  ์ด๋ ‡๊ฒŒ ์ผ์ • ์‹œ๊ฐ„ ๋‹ฌ๋ ธ์„ ๋•Œ ์žˆ๊ฒŒ ๋  ์ œ ์œ„์น˜์—์„œ ๋ฐœ๊ฒฌํ•  ๋ช…๋ น์ด ๋ถ์ชฝ์œผ๋กœ 15m/s๋กœ ๋‹ฌ๋ฆฌ๋Š” ๊ฒƒ์ด๋ผ๋Š” ๊ฒƒ์„ ์ €๋Š”ย ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.์ด๋Ÿฌํ•œ ์˜ˆ์ธก์„ ํ˜„์žฌ ์ œ ํŒ๋‹จ์— ์–ด๋Š์ •๋„ย ๋ฐ˜์˜ํ•˜์—ฌ ์ตœ์ข…์ ์œผ๋กœ ๋ถ์„œ์ชฝ์œผ๋กœ 25m/s ์ •๋„๋กœ ๋‹ฌ๋ฆฌ๋Š” ๊ฒƒ์œผ๋กœ ์ œ ๋ช…๋ น์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณ€๊ฒฝํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
$$v_t = \gamma v_{t-1} + \eta\nabla_\theta J(\theta - \gamma v_{t-1})$$
$$\theta = \theta - v_t$$
Momentum๊ณผ์˜ ์ฐจ์ด์ ์€ objective function J์—์„œ ํ˜„์žฌ์˜ paramter theta๊ฐ€ ์•„๋‹Œ, ๊ณผ๊ฑฐ์˜ gradient์™€ momentum term gamma๋กœ๋งŒ ๊ณ„์‚ฐํ•œ ๋ณ€๊ฒฝ๋  parameter๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ„์‚ฐ์„ ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.
์‹์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด ํ˜„์žฌ์˜ gradient๋Š” ์‚ฐ์ถœํ•˜๊ธฐ ์ „์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ณผ๊ฑฐ์˜ gradient๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ฏธ๋ž˜์˜ parameter๋ฅผ ๋ฏธ๋ฆฌ ์˜ˆ์ธกํ•˜๊ณ  ๊ทธ objective function์„ ๊ณ„์‚ฐํ•œ ํ›„ theta๋กœ์˜ ๋ฐฉํ–ฅ ๋ฏธ๋ถ„์„ ์ตœ์ข…์ ์œผ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ฏธ๋ž˜์˜ gradient๋ฅผ ๋ฏธ๋ฆฌ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๊ณ ,ย ๋ฏธ๋ž˜ ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ฑ์„ ์–ด๋Š์ •๋„ ๊ฐ์•ˆํ•˜๊ณ  ๋ฐ˜์˜ํ•˜์—ฌ ํ•™์Šต์˜ ์†๋„๋ฅผ ๋น ๋ฅด๊ฒŒ ํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.
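A minimal NAG sketch: the only change from the momentum sketch is that the gradient is evaluated at the look-ahead point theta - gamma*v. Names and values are illustrative placeholders.

```python
import numpy as np

def nag_step(theta, v, grad_fn, gamma=0.9, eta=0.01):
    lookahead = theta - gamma * v              # predicted future position
    v = gamma * v + eta * grad_fn(lookahead)   # v_t = gamma*v_{t-1} + eta*grad J(theta - gamma*v_{t-1})
    return theta - v, v

# Toy objective J(theta) = 0.5*||theta||^2, so grad_fn(theta) = theta.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, v = nag_step(theta, v, lambda th: th)
print(theta)  # moves toward the minimum at the origin
```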

Adaptive Gradient (Adagrad)

From here the story changes direction once more. Momentum and NAG were designed to make training faster by adjusting the direction of learning.
SGD and every algorithm described so far used a parameter-independent learning rate. That design is not very flexible, because there are cases where different parameters should be updated by different amounts.
A representative example is data with sparse features. In that case many zero values appear, so the corresponding parameters influence the objective function, and hence the gradient, only slightly. If this goes on for a long time, the other parameters approach their convergence points and we want to shrink the learning rate, while the parameters tied to the sparse features may still be far from theirs.
For this reason, Adaptive Gradient is designed to decide the learning rate from the history of how much each parameter has been updated so far.
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}+\epsilon}} \cdot g_{t,i}$$
์œ„์˜ ์‹์—์„œ G_t์˜ ๊ฒฝ์šฐ paramter theta_i๊ฐ€ t๋ฒˆ์˜ update ๋™์•ˆ ๊ฒช์—ˆ๋˜ gradient์˜ square sum์„ ๋Œ€๊ฐ์„  ์„ฑ๋ถ„์œผ๋กœ ๊ฐ€์ง„ ํ•ญ๋ชฉ์ž…๋‹ˆ๋‹ค. ์ฆ‰ ์œ„ ๊ฒฝ์šฐ์—๋Š” theta_i๊ฐ€ ๊ฒช์—ˆ๋˜ gradient์˜ square sum์˜ root์„ฑ๋ถ„์ด ๋ถ„๋ชจ๋กœ ๊ฐ€ ์žˆ๋Š” ์ƒํ™ฉ์ธ ๊ฒƒ์ž…๋‹ˆ๋‹ค. Epsilon์˜ ๊ฒฝ์šฐ ์ฒ˜์Œ์— ๋ถ„๋ชจ๊ฐ€ 0์ด ๋˜๋Š” ๊ฒƒ์„ ๋ง‰์•„์ฃผ๋Š” ํ•ญ๋ชฉ์ž…๋‹ˆ๋‹ค.
์ด๋ ‡๊ฒŒ ์„ค๊ณ„๋ฅผ ํ•ด์„œ Adagrad์—์„œ ํ•  ์ˆ˜ ์žˆ์—ˆ๋˜ ๊ฒƒ์€ย paramter-dependent learning rate design์ด์—ˆ์Šต๋‹ˆ๋‹ค. Update๊ฐ€ ๋งŽ์ด ์ด๋ฃจ์–ด์ง„ parameter์— ๋Œ€ํ•ด์„  ์ˆ˜๋ ด์ ๊ณผ ๊ฐ€๊น๋‹ค๊ณ  ํŒ๋‹จํ•˜์—ฌ learning rate๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ์—ˆ๊ณ , update๊ฐ€ ์ ๊ฒŒ ์ด๋ฃจ์–ด์ง„ parameter์— ๋Œ€ํ•ด์„  ์ˆ˜๋ ด์ ๊ณผ ๋ฉ€๋‹ค๊ณ  ํŒ๋‹จํ•˜์—ฌ learning rate๋ฅผ ๋Š˜๋ฆด ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๋•Œ๋ฌธ์—ย ๊ฐ™์€ update ํšŸ์ˆ˜๋กœ ์–ด๋Š paramter๋Š” ํ•™์Šต์ด ์ˆ˜๋ ด์ ์— ๊ฐ€๊น๊ฒŒ ์ง„ํ–‰๋˜๊ณ  ์–ด๋Š paramter๋Š” ํ•™์Šต์ด ๊ฑฐ์˜ ๋˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ๋ฅผ ํ•ธ๋“ค๋งํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.
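Here is a minimal per-parameter Adagrad sketch; the accumulator G corresponds to the diagonal of G_t, and the toy objective and hyperparameters are illustrative assumptions.

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.1, eps=1e-8):
    G = G + grad ** 2                              # running sum of squared gradients per parameter
    theta = theta - eta / np.sqrt(G + eps) * grad  # parameter-dependent step size
    return theta, G

# Toy objective J(theta) = 0.5*||theta||^2, so grad = theta.
theta, G = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    theta, G = adagrad_step(theta, G, theta)
print(theta)  # each coordinate creeps toward the origin at its own rate
```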

RMSProp

Adagrad worked well for designing a parameter-dependent learning rate. However, because it accumulates every past update, it has the drawback that convergence slows down sharply when the distance from the starting point to the convergence point is large.
In his Coursera lecture, Geoff Hinton applies two main ideas to fix this. First, use only the last w gradients instead of all of them, which shrinks the denominator term of the original Adagrad formula overall. Second, instead of a plain sum of squares, use an exponentially decaying average, a way of averaging that puts more weight on recent data.
$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)g_t^2$$
์ด ์‹์„ sigma sum์œผ๋กœ ํŽผ์ณ๋ณด์‹  ๋ถ„๋“ค์€ ์•„์‹œ๊ฒ ์ง€๋งŒ, gtg_t์˜ square sum์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋Š”๋ฐ t๊ฐ€ ์ž‘์„ ์ˆ˜๋ก ๊ฐ€์ค‘์น˜๋„ ์ž‘์•„์ง€๋Š” ๊ฒƒ์„ ๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} g_t$$
์ด๋ฅผ ์ด์šฉํ•ด ๊ธฐ์กด์˜ Adagrad์—์„œ G_t,ii๊ฐ€ ๋“ค์–ด๊ฐ”๋˜ ๋ถ€๋ถ„์—ย exponential decaying average๋ฅผ ๋„ฃ์–ด์ค€ ํ˜•ํƒœ๋กœ ์„ค๊ณ„๋ฅผ ํ•˜๋Š”๋ฐ, ์ด๋Ÿฌํ•œ ์„ค๊ณ„๋ฅผ ํ†ตํ•ด์„œ ์ „์ฒด์ ์œผ๋กœ G_t,ii๊ฐ€ ๊ฐ€์กŒ๋˜ ํฐ scale์„ ์ค„์ด๊ณ ย gradient๊ฐ€ ์‚ฌ๋ผ์ง€๋Š” ํ˜„์ƒ์„ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ฐธ๊ณ ๋กœ ์œ„์˜ ์‹์— ๋Œ€ํ•œ ๋‹ค๋ฅธ ํ‘œํ˜„์ž…๋‹ˆ๋‹ค.
$$\Delta\theta_t = -\frac{\eta}{RMS[g]_t} g_t$$
Here RMS does stand for Root Mean Square, but keep in mind that the "mean" in question is the exponentially decaying average.
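A minimal RMSProp sketch: E_g2 keeps the exponentially decaying average of squared gradients in place of Adagrad's unbounded sum. gamma = 0.9 follows the value commonly quoted from Hinton's lecture; the learning rate and toy objective are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(theta, E_g2, grad, eta=0.01, gamma=0.9, eps=1e-8):
    E_g2 = gamma * E_g2 + (1 - gamma) * grad ** 2      # E[g^2]_t
    theta = theta - eta / np.sqrt(E_g2 + eps) * grad   # divide by RMS[g]_t
    return theta, E_g2

# Toy objective J(theta) = 0.5*||theta||^2, so grad = theta.
theta, E_g2 = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(1000):
    theta, E_g2 = rmsprop_step(theta, E_g2, theta)
print(theta)  # ends up near the origin
```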

Adaptive Delta (Adadelta)

Adadelta is similar to RMSProp but adds something new:
a "correction of units."
The Adagrad and RMSProp algorithms introduced above produce updates that carry no units, because the weight applied to the gradient is itself defined in terms of the gradient.
The Adadelta algorithm argues that the update should have the same units as delta theta, and as a correction it replaces the term that played the role of eta in the earlier formulas with a new one.
First, applying Newton's method at second order gives the following:
$$\frac{\partial^2 J}{\partial\theta_i^2}\,\Delta\theta_i = \frac{\partial J}{\partial\theta_i}$$
Rearranging this slightly gives
$$\Delta\theta_i = \frac{\frac{\partial J}{\partial\theta_i}}{\frac{\partial^2 J}{\partial\theta_i^2}}$$
๋ถ„์ž์— ์žˆ๋Š” ํ•ญ์€ ๋ณ€ํ™”๋Ÿ‰์˜ ๋์— ๊ณฑํ•ด์ง„ g term๊ณผ ์ผ์น˜ํ•˜๋Š” ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค.
gโˆโˆ‚Jโˆ‚ฮธig\propto\frac{\partial J}{\partial\theta_i}
So the remaining factors, everything except the numerator, need to be adjusted so that their units match the weight placed in front of g:
$$\frac{1}{\frac{\partial^2 J}{\partial\theta_i^2}} = \frac{\Delta\theta_i}{\frac{\partial J}{\partial\theta_i}}$$
๋ถ„์ž๋ฅผ ์ œ์™ธํ•œ ๋‚˜๋จธ์ง€ ํ•ญ๋ชฉ๋“ค์€ ์•ž์„œ ๋‘ ๋ฒˆ์งธ๋กœ ์–ธ๊ธ‰ํ•œ ์‹์—์„œ ์œ„์™€ ๊ฐ™์ด ๋ณ€ํ˜•ํ•˜์—ฌ ์œ ๋„ํ•ด๋‚ผ ์ˆ˜ ์žˆ๊ณ , ์œ„ ํ•ญ๋ชฉ์˜ ์šฐํ•ญ์˜ ๋ถ„๋ชจ๋Š” ํ˜„์žฌ RMSProp์˜ ๊ฐ€์ค‘์น˜์˜ ๋ถ„๋ชจ์™€ ๋‹จ์œ„๊ฐ€ ๊ฐ™๊ธฐ ๋•Œ๋ฌธ์—, ๋ถ„์ž์ธ eta๋ฅผ delta theta์™€ ๋‹จ์œ„๋ฅผ ๋งž์ถฐ์ฃผ๋ฉด ๋ฉ๋‹ˆ๋‹ค.
์ด๋ฅผ ์œ„ํ•ด์„œ Adadelta์—์„œ๋Š” delta theta์— ๋Œ€ํ•œ (๋ณ€ํ™”๋Ÿ‰์— ๋Œ€ํ•œ)ย exponential decaying average๋ฅผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ๊ตฌํ•˜์—ฌ ๋ถ„๋ชจ์— ๋„ฃ๋Š” ํ˜•ํƒœ๋กœ ๋‹จ์œ„๋ฅผ ๋งž์ถฐ์ค๋‹ˆ๋‹ค.
$$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma)\Delta\theta_t^2$$
However, since delta theta_t is exactly the value we are trying to compute, it is replaced by the average taken only up to step t-1.
$$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t$$
$$\theta_{t+1} = \theta_t + \Delta\theta_t$$
์‚ฌ์‹ค ์ด๋ ‡๊ฒŒ ๋‹จ์œ„๋ฅผ ๋งž์ถฐ์ฃผ๋Š” ํ–‰์œ„๊ฐ€ ์‹ค์งˆ์ ์œผ๋กœ ์–ด๋– ํ•œ ์ด๋“์„ ๊ฐ€์ ธ์˜ค๋Š”์ง€์— ๋Œ€ํ•ด์„œ ์„ค๋ช…์ด ๋ถ€์กฑํ•˜๊ธด ํ•ฉ๋‹ˆ๋‹ค. Adadelta ๋…ผ๋ฌธ์—์„œ๋Š”
"Hessian Matrix provides additional curvature information useful for optimization, computing accurate second order information is often expensive"
์™€ ๊ฐ™์€ ํ˜•ํƒœ๋กœ ๊ทธ๋“ค์˜ ์žฅ๋‹จ์ ์„ ํ‘œํ˜„ํ•ด์ฃผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ์ •๋„๋กœ ๊ฐ„๋‹จํ•˜๊ฒŒ๋งŒ ์งš๊ณ  ๋„˜์–ด๊ฐ€๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

Adaptive Momentum (Adam)

๋‹ค์Œ์œผ๋กœ ์†Œ๊ฐœ๋“œ๋ฆด algorithm์€ Adaptive Momentum์ž…๋‹ˆ๋‹ค.
์•ž์„œ ํ–ˆ๋˜ ํ‘œํ˜„์„ ์ž ์‹œ ๋‹ค์‹œ ๋นŒ๋ ค์„œ ์“ฐ์ž๋ฉด,
์ €ํฌ๊ฐ€ ๋งŽ์ด ๊ฒช๋Š” ๋ ˆํผํ† ๋ฆฌ์ด์ง€๋งŒ, ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ์—์„œ ์–ด๋–ค ์„ค๊ณ„์˜ ์–‘๊ทน๋‹จ์ด ๊ฐ€์ง€๋Š” ํฐ ๋‹จ์ ๋“ค์„ ์ ์ ˆํžˆ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋ฐ˜์”ฉ ๊ฐ€์ ธ๋‹ค๊ฐ€ ์„ž๋Š” ๊ฒƒ์ด ์ข‹์€ ํšจ์œจ์„ ๋ณด์ด๊ณค ํ•ฉ๋‹ˆ๋‹ค.
์ด๋Ÿฐ ๋ง์”€์„ ๋“œ๋ฆฐ ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.์ด์™€๋Š” ๋น„์Šทํ•˜๊ฒŒ ์ €ํฌ๊ฐ€ ๋งŽ์ด ๊ฒช๋Š” ๋ ˆํผํ† ๋ฆฌ์ด์ง€๋งŒ, ์„œ๋กœ ๋‹ค๋ฅธ ๋‘ ๊ฐ€์ง€ ๊ฐœ์„  ๋ฐฉํ–ฅ์„ ๊ฐ€์ ธ์™€ ํ•œ ๊ณณ์— ๊ทธ ๋‘˜์„ ๋ชจ์•„ ๋ชจ๋‘ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ข‹์€ ํšจ์œจ์„ ๋ณด์ด๊ณค ํ•ฉ๋‹ˆ๋‹ค.
RMSProp๊ณผ Momentum์˜ ํ•ฉ์ž‘์„ Adam์œผ๋กœ ์†Œ๊ฐœ๋“œ๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
RMSProp์—์„œ๋Š” square gredients ๋“ค์˜ exponential decaying average๋ฅผ ๊ตฌํ•ด์„œ ์ „์ฒด์ ์ธ learning rate๊ฐ€ ์—„์ฒญ ์ž‘์•„์ง€๋Š” ํ˜„์ƒ์„ ํ•ธ๋“ค๋ง ํ–ˆ์—ˆ์Šต๋‹ˆ๋‹ค.Momentum์—์„œ๋Š” ๊ณผ๊ฑฐ์˜ gradient๋ฅผ ํ˜„์žฌ gradient๋ฅผ ์‚ฐ์ถœํ•˜๋Š”๋ฐ ๋ฐ˜์˜ํ–ˆ์—ˆ์Šต๋‹ˆ๋‹ค.
์ด ๋‘˜์„ ๋ชจ๋‘ ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด adam์—์„œ๋Š” first momentum๊ณผ second momentum์„ ๋ชจ๋‘ ์ •์˜ํ•˜์—ฌ learning rate์™€ gradient ๋ถ€๋ถ„์„ ์ˆ˜์ •ํ•ด์ค๋‹ˆ๋‹ค.
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
์œ„์™€ ๊ฐ™์ด first momentum m_t์™€ second momentum vtv_t ๋ฅผ ์ •์˜ํ•˜๊ณ  ์‚ฌ์šฉํ•˜๋ ค๊ณ  ๋ณด์•˜๋”๋‹ˆ, ๊ฐ€์ค‘์น˜์ธ ฮฒ1\beta_1, ฮฒ2\beta_2๊ฐ€ 1์— ๊ฐ€๊น๊ณ  m0m_0, v0v_0 ๊ฐ€ 0์— ๊ฐ€๊นŒ์šฐ๋ฉด ์ดˆ๊ธฐ ๋ณ€ํ™”๋Ÿ‰์ด 0์— ๊ฐ€๊นŒ์›Œ์„œ ํ•™์Šต์ด ๋Š๋ฆฌ๋‹ค๋Š” ์ ์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.
๋•Œ๋ฌธ์— ๋…ผ๋ฌธ์—์„œ๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ bias-corrected version์„ ์ œ์‹œํ–ˆ์Šต๋‹ˆ๋‹ค.
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
๋ถ„์ž์˜ ํ•ญ๋ชฉ์ด 0์— ๊ฐ€๊นŒ์šธ๋•Œ ๊ฐ€์ค‘์น˜ ํ•ญ๋ชฉ์„ ํฌํ•จํ•œ ๋ถ„๋ชจ๋„ 0์— ๊ฐ€๊นŒ์›Œ์ ธ ํ•ญ๋ชฉ์ด ์ž‘์•„์ง€๋Š” ํ˜„์ƒ์„ ๋ฐฉ์ง€ํ•ด ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t$$
Finally, as in RMSProp above, the second momentum goes into the denominator under a square root, and as in Momentum, the first momentum takes the place of the gradient. Adam keeps Momentum's advantage of preserving the direction of previous learning and RMSProp's advantage of a parameter-dependent learning rate without gradient vanishing, so training can proceed very fast. This is why, when people are not sure which optimizer or gradient descent algorithm to use, they very often just reach for Adam.
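A minimal Adam sketch with the bias-corrected moments; beta_1 = 0.9, beta_2 = 0.999, and eta = 0.001 follow the defaults suggested in the Adam paper, while the toy objective is an illustrative assumption.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first momentum
    v = beta2 * v + (1 - beta2) * grad ** 2      # second momentum
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v

# Toy objective J(theta) = 0.5*||theta||^2, so grad = theta; t starts at 1.
theta, m, v = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 6001):
    theta, m, v = adam_step(theta, m, v, theta, t)
print(theta)  # close to the origin
```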

Adamax

Adamax is a gradient descent algorithm described in the extension section of the Adam paper. Its main change is to generalize the L2 norm in Adam's second momentum term to a general Lp norm.
That is, the square root of an average of squares becomes the p-th root of an average of p-th powers. As I keep mentioning, what counts as the "average" can differ from case to case.
Lp norms tend to become numerically unstable as p grows, which is why the L1 and L2 norms are the ones in general use; the Adam paper, however, shows that in the limit where p goes to infinity, stability and convergence can still be guaranteed.
$$v_t = \beta_2^p v_{t-1} + (1-\beta_2^p)|g_t|^p = (1-\beta_2^p)\sum_{i=1}^t \beta_2^{p(t-i)}\cdot|g_i|^p$$
์œ„๋Š” Adam์˜ second momentum์— Lp norm์„ ์ ์šฉํ•œ ์‹์ž…๋‹ˆ๋‹ค.
vt=limโกpโ†’โˆž((1โˆ’ฮฒ2p)โˆ‘i=1tฮฒ2p(tโˆ’i)โ‹…โˆฃgiโˆฃp)1/p=limโกpโ†’โˆž(1โˆ’ฮฒ2p)1/p(โˆ‘i=1tฮฒ2p(tโˆ’i)โ‹…โˆฃgiโˆฃp)1/p=limโกpโ†’โˆž(โˆ‘i=1t(ฮฒ2tโˆ’iโ‹…โˆฃgiโˆฃ)p)1/p=maxโก(ฮฒ2tโˆ’1โˆฃg1โˆฃ,ฮฒ2tโˆ’2โˆฃg2โˆฃ,...,ฮฒ2โˆฃgtโˆ’1โˆฃ,โˆฃgtโˆฃ)v_t=\lim_{p\to\infty}((1-\beta_2^p)\sum_{i=1}^t\beta_2^{p(t-i)}\cdot|g_i|^p)^{1/p}\\=\lim_{p\to\infty}(1-\beta_2^p)^{1/p}(\sum_{i=1}^t\beta_2^{p(t-i)}\cdot|g_i|^p)^{1/p}\\=\lim_{p\to\infty}(\sum_{i=1}^t(\beta_2^{t-i}\cdot|g_i|)^p)^{1/p}=\\\max(\beta_2^{t-1}|g_1|,\beta_2^{t-2}|g_2|,...,\beta_2|g_{t-1}|,|g_t|)
The above derives the convergence of the Lp norm as p goes to infinity.
The step from the first line to the second is simple, so I will skip it. Going from the second to the third line, the leading factor converges to 1 as p goes to infinity, so it is dropped, and what remains is grouped under a single p-th power. Going from the third to the fourth line is the maximum-norm argument: if g_max is the largest term inside the parentheses, the expression is no larger than what you get by replacing every term with g_max and no smaller than keeping g_max alone; both bounds tend to g_max as p goes to infinity, so the whole expression converges to g_max.
$$u_t = \beta_2^{\infty} v_{t-1} + (1-\beta_2^{\infty})|g_t|^{\infty} = \max(\beta_2\cdot v_{t-1},\ |g_t|)$$
์œ„์˜ ์‹์„ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ณ€ํ˜•ํ•ด ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. Beta_2*v_t-1 ํ•ญ๋ชฉ์ด ์•ž์„  max๋‚ด๋ถ€์˜ g_t๋ฅผ ์ œ์™ธํ•œ ๋‚˜๋จธ์ง€ ํ•ญ๋ชฉ๋“ค์„ ํฌํ•จํ•œ๋‹ค๊ณ  ๋ณด์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค. (์ „๊ฐœ๋ฅผ ํ•˜๊ธฐ ์ „์˜ ๋ฒ„์ „์œผ๋กœ ๋ณด์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.)
$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t}\hat{m}_t$$
๋ฐ”๋€ norm ์„ ๊ธฐ์กด norm์˜ ์ž๋ฆฌ๋กœ ๋Œ€์ฒดํ•ด์ฃผ๋ฉด ์œ„์™€ ๊ฐ™์€ ์‹์ด ๋ฉ๋‹ˆ๋‹ค. Adam๊ณผ ๋‹ค๋ฅธ ์ ์ด ์žˆ๋‹ค๋ฉด ์ด๋ ‡๊ฒŒ ๋ฐ”๊ฟ”์คŒ์œผ๋กœ์จ max operation๋งŒ์œผ๋กœ u_t๋ฅผ ์ •์˜ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€์ค‘์น˜์ธ beta_1, beta_2๊ฐ€ 1์— ๊ฐ€๊น๊ณ  m_0, v_0์ด 0์— ๊ฐ€๊นŒ์›Œ๋„ย bias-correction ์—†์ด gradient๊ฐ€ ์ดˆ๊ธฐ์— ์œ ์ง€๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

Nesterov Accelerated Adaptive Momentum (NAdam)

Adam์—์„œ Momentum๊ณผ RMSProp์„ ์„ž์–ด ์ด์ „ ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ฑ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๊ณ , paramter-dependent learning rate ์„ gradient vanishing ์—†์ด ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ง์”€๋“œ๋ฆฐ ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋Ÿฐ๋ฐ, ์ €ํฌ๊ฐ€ ๋ฐฉํ–ฅ์„ฑ์— ๋Œ€ํ•œ ๊ณ ๋ ค๋ฅผ ํ•˜๋ฉด์„œ ๊ณผ๊ฑฐ์™€ ํ˜„์žฌ, ๊ทธ๋ฆฌ๊ณ  ๋ฏธ๋ž˜์˜ ๋ฐฉํ–ฅ์„ฑ๊นŒ์ง€ ๊ณ ๋ ค๋ฅผ ํ–ˆ์—ˆ๋Š”๋ฐ Adam์—๋Š” ๊ทธ๊ฒƒ๊นŒ์ง€ ๊ณ ๋ ค๊ฐ€ ๋˜์–ด ์žˆ์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค. ๋ฏธ๋ž˜ ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ฑ์„ ๋ฏธ๋ฆฌ ๊ณ ๋ คํ•˜๋Š” ์„ค๊ณ„๋Š” ์ง„ํ–‰๋˜์–ด ์žˆ์ง€ ์•Š์€ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
์ด์ฏค๋˜๋ฉด ๋ˆˆ์น˜์ฑ„์…จ์„ ์ˆ˜๋„ ์žˆ์œผ์…จ๊ฒ ์ง€๋งŒ NAG์™€ Adam์„ ์„ž์€ ์„ค๊ณ„, NAdam์— ๋Œ€ํ•ด์„œ ์†Œ๊ฐœ๋“œ๋ฆฌ๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
๋จผ์ € Momentum์„ ๋‹ค์‹œ ์ƒ๊ธฐํ•ด ๋ด…์‹œ๋‹ค.
$$g_t = \nabla_{\theta_t} J(\theta_t)$$
$$m_t = \gamma m_{t-1} + \eta g_t$$
$$\theta_{t+1} = \theta_t - m_t$$
์œ„์˜ ์‹์„ ์•ž์„œ ๋ณด์‹  ์ ์ด ์žˆ์œผ์‹ค ๊ฒ๋‹ˆ๋‹ค. ์ด๋ฅผ ์ด์šฉํ•ด,
$$\theta_{t+1} = \theta_t - (\gamma m_{t-1} + \eta g_t)$$
์œ„์™€ ๊ฐ™์€ ์‹์œผ๋กœ ์ตœ์ข…์ ์œผ๋กœ ๋งˆ๋ฌด๋ฆฌ ํ–ˆ์—ˆ์Šต๋‹ˆ๋‹ค.
๋‹ค์Œ์œผ๋กœ NAG๋ฅผ ๋‹ค์‹œ ์ƒ๊ธฐํ•ด ๋ด…์‹œ๋‹ค.
$$g_t = \nabla_{\theta_t} J(\theta_t - \gamma m_{t-1})$$
$$m_t = \gamma m_{t-1} + \eta g_t$$
$$\theta_{t+1} = \theta_t - m_t$$
Here, NAdam keeps this way of accumulating past momentum, but transforms NAG's last formula, written with m_t, as follows in order to get the effect of using m_{t+1}:
$$g_t = \nabla_{\theta_t} J(\theta_t)$$
$$m_t = \gamma m_{t-1} + \eta g_t$$
$$\theta_{t+1} = \theta_t - (\gamma m_t + \eta g_t)$$
The point to note here is that this achieves the effect of replacing m_t with m_{t+1}. (Since an eta * g_{t+1} term cannot be formed yet, eta * g_t is kept in its place.)
Next, let's recall Adam.
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t$$
์œ„์˜ ์‹๋“ค์„ ์ด์šฉํ•ด ์ตœ์ข…์ ์œผ๋กœ ์‹์„ ์ „๊ฐœํ•ด๋ณด๋ฉด,
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\frac{\beta_1 m_{t-1}}{1-\beta_1^t} + \frac{(1-\beta_1)g_t}{1-\beta_1^t}\right)$$
as above, and rewriting it using the bias-correction formula gives
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\beta_1\hat{m}_{t-1} + \frac{(1-\beta_1)g_t}{1-\beta_1^t}\right)$$
the form above. Then, replacing m_(t-1) with m_t to apply the future-momentum effect described earlier,
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\beta_1\hat{m}_t + \frac{(1-\beta_1)g_t}{1-\beta_1^t}\right)$$
์ตœ์ข…์ ์œผ๋กœ ์œ„์™€ ๊ฐ™์€ ์‹์œผ๋กœ ์ •๋ฆฌํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.Nadam์€ Adam์— ์กด์žฌํ•˜์ง€ ์•Š์•˜๋˜ย ๋ฏธ๋ž˜์˜ ํ•™์Šต ๋ฐฉํ–ฅ์„ฑ์„ ๋ฏธ๋ฆฌ ์˜ˆ์ธกํ•˜์—ฌ ๊ณ ๋ คํ•˜๊ณ  ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

Conclusion

๊ธ€๋„ ๊ธธ๊ณ , ๋‚ด์šฉ๋„ ๋งŽ์•„ ์ •๋ฆฌ๊ฐ€ ์•ˆ๋˜์‹ค ๋ถ„๋“ค์„ ์œ„ํ•ด ์ œ๊ฐ€ ๋ช‡ ๊ฐ€์ง€ ์ค‘์š” ์‚ฌํ•ญ๋“ค์„ ์ •๋ฆฌํ•ด๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค.
๊ฐ€์žฅ ๋จผ์ €, BGD, SGD, Mini-Batch GD์— ๋Œ€ํ•ด์„œ ์‚ดํŽด๋ณด์•˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋“ค์€ objective function์ด ์ •์˜๋˜๋Š” ๋ฐ์ดํ„ฐ์…‹์˜ ๋ฒ”์œ„์— ๋”ฐ๋ผ์„œ ํ˜•ํƒœ๊ฐ€ ๋‹ฌ๋ž์—ˆ๊ณ , ํ•™์Šต ์†๋„์˜ SGD์™€ ํ•™์Šต์˜ ์ •ํ™•์„ฑ BGD(local minima์— ๋น ์ง€์ง€ ์•Š๋Š” ๋“ฑ)์˜ trade-off๋กœ ์ตœ์ข…์ ์œผ๋กœ Mini-Batch GD๋ฅผ ์†Œ๊ฐœ๋“œ๋ ธ์Šต๋‹ˆ๋‹ค.
๋‹ค์Œ์œผ๋กœ ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ฑ์„ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ๋Š” Momentum, Nesterov Accelerated Gradient์— ๋Œ€ํ•ด์„œ ์‚ดํŽด๋ณด์•˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋“ค์€ ๊ฐ๊ฐ ๊ณผ๊ฑฐ์˜ gradient๋ฅผ ์ด์šฉ, ๋ฏธ๋ž˜์˜ gradient๋ฅผ ์˜ˆ์ธกํ•˜์—ฌ ํ˜„์žฌ์˜ gradient ์‚ฐ์ถœ์— ๋ฐ˜์˜ํ•˜๋Š” ํ˜•ํƒœ์˜€์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ํ•™์Šต์˜ ์†๋„๋ฅผ ๋น ๋ฅด๊ฒŒ ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์˜ ๊ฒฐ์ด ๋Š˜์–ด๋‚ฌ์—ˆ์Šต๋‹ˆ๋‹ค.
๋‹ค์Œ์œผ๋กœ ํ•™์Šต์˜ ์Šคํ… ์‚ฌ์ด์ฆˆ๋ฅผ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ๋Š” Adagrad, RMSProp, Adadelta์— ๋Œ€ํ•ด์„œ ์‚ดํŽด๋ณด์•˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋“ค์€ ๊ธฐ๋ณธ์ ์œผ๋กœ parameter-dependent learning rate๋ฅผ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๋ฐฉ๋ฒ•๋“ค์ด๋ฉฐ, RMSProp์—์„œ๋Š” gradient vanishing ๋ฌธ์ œ๋ฅผ, Adadelta์—์„œ๋Š” unit์— ๋Œ€ํ•œ ๋ณด์ •์„ ์ง„ํ–‰ํ•˜์—ฌ ๊ฐ๊ฐ ํ•™์Šต์— ๊ฐœ์„ ์„ ์ง„ํ–‰ํ–ˆ์—ˆ์Šต๋‹ˆ๋‹ค.
๋งˆ์ง€๋ง‰์œผ๋กœ ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ฑ๊ณผ ํ•™์Šต์˜ ์Šคํ… ์‚ฌ์ด์ฆˆ๋ฅผ ๋ชจ๋‘ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ์—ˆ๋˜ ํ˜•ํƒœ์ธ Adam, Adamax, Nadam์— ๋Œ€ํ•ด์„œ ์‚ดํŽด๋ณด์•˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋“ค์€ Momentum๊ณผ RMSProp์„ ํ•ฉ์ณค๋˜ Adam์—์„œ ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ฑ๊ณผ ์Šคํ… ์‚ฌ์ด์ฆˆ๋ฅผ ๋ชจ๋‘ ๊ณ ๋ คํ•œ ๊ฒƒ์„ ๋ฐ”ํƒ•์œผ๋กœ ๊ฐœ์„ ์‚ฌํ•ญ์„ ๋งŒ๋“ค์–ด๋‚ธ ๊ฒƒ๋“ค์ž…๋‹ˆ๋‹ค. Adam์—์„œ second momentum์„ L_infinity norm์œผ๋กœ ๋ณ€๊ฒฝํ•œ Adamax์—์„œ๋Š” bias-correction์—†์ด ํ•™์Šต์„ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ Adam์— NAG๋ฅผ ํ•ฉ์ณค๋˜ Nadam์œผ๋กœ ๋ฏธ๋ž˜์˜ ํ•™์Šต ๋ฐฉํ–ฅ์„ฑ๋„ ์ถ”๊ฐ€๋กœ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.
์ด ๊ธ€์„ ์ฝ์œผ์‹  ์—ฌ๋Ÿฌ๋ถ„๋“ค, ์ด ๊ธ€์˜ ์ฒ˜์Œ์œผ๋กœ ๋Œ์•„๊ฐ€ ์ธ๋„ค์ผ ์‚ฌ์ง„์„ ๋‹ค์‹œ ํ•œ ๋ฒˆ ๋Œ์•„๋ณด์‹œ๊ณ  ์ฐจ๊ทผ์ฐจ๊ทผ ์ •๋ฆฌํ•ด๋ณด์‹œ๋ฉด ๋„์›€์ด ๋งŽ์ด ๋˜์‹ค ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.