ResNet

Field
Network Architecture
Review date
2020/08/31
๋ณธ ํฌ์ŠคํŠธ๋Š” ์ œ๊ฐ€ ํœด๋จผ์Šค์ผ€์ดํ”„ ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ์— ๋จผ์ € ์ž‘์„ฑํ•˜๊ณ  ์˜ฎ๊ธด ํฌ์ŠคํŠธ์ž…๋‹ˆ๋‹ค.
๋ณธ ํฌ์ŠคํŠธ์—์„œ๋Š” semantic segmentation ๋ถ„์•ผ์—์„œ ์ธ์šฉ ์ˆ˜ 54000+์„ ์œก๋ฐ•ํ•˜๋Š” ๋…ผ๋ฌธ์— ๋Œ€ํ•ด์„œ ๋ฆฌ๋ทฐํ•˜๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ ์ด ๋…ผ๋ฌธ์€ deep neural network์˜ ํ•™์Šต ํšจ์œจ์— ๋Œ€ํ•œ ๊ธฐ์—ฌ๋กœ CVPR 2016์— ์‹ค๋ ธ์Šต๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์˜ ์ œ๋ชฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
"Deep Residual Learning for Image Recognition"
๋…ผ๋ฌธ์— ๋Œ€ํ•œ ๋‚ด์šฉ์„ ์ง์ ‘ ๋ณด์‹œ๊ณ  ์‹ถ์œผ์‹  ๋ถ„์€ย ์ด๊ณณ์„ ์ฐธ๊ณ ํ•˜์‹œ๋ฉด ์ข‹์Šต๋‹ˆ๋‹ค.
Basic concept of Residual Learning

Objective

๋…ผ๋ฌธ์—์„œ ๋ชฉ์ ์œผ๋กœ ํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์€ deeper neural network ๊ตฌ์กฐ์—์„œ ๋‚˜ํƒ€๋‚˜๋Š”ย gradient vanishing ํ˜„์ƒ์œผ๋กœ ์ธํ•œ degradation(ํ•™์Šตํšจ๊ณผ ์ €ํ•ด)์„ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
์ผ๋ฐ˜์ ์œผ๋กœ gradient vanishing์€ ์‹ ๊ฒฝ๋ง์ด ๊นŠ์–ด์ง์— ๋”ฐ๋ผ์„œย layer์˜ ํ•œ weight ๊ฐ’์ด ์ „์ฒด ์—ฐ์‚ฐ ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์ •๋„๊ฐ€ ๊ต‰์žฅํžˆ ์ž‘์•„์ง€๋Š” ํ˜„์ƒ์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” neural network์—์„œ ์‚ฌ์šฉํ•˜๋Š” activation function์ด ๋ฏธ๋ถ„์„ ๊ฑฐ์น˜๊ฒŒ ๋˜์—ˆ์„ ๋•Œ ๋‚˜์˜ค๋Š” output์˜ scale์ด ์ค„์–ด๋“ ๋‹ค๋Š” ์  ๋•Œ๋ฌธ์— ๋ฐœ์ƒํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, back propagation์„ ์ง„ํ–‰ํ•  ๋•Œ ์˜ค๋ฅ˜ ํ•ญ๋ชฉ์— ์ง€์†์ ์œผ๋กœ activation function์˜ ๋ฏธ๋ถ„ ํ•ญ์ด ํฌํ•จ๋˜๋ฉด์„œ ์‹ค์ œ output layer๋กœ๋ถ€ํ„ฐ ๋จผ ์ชฝ(์ดˆ๊ธฐ layer)์˜ ๊ฒฝ์šฐ, ๊ทธ ์˜ค๋ฅ˜์˜ scale์ด ๊ต‰์žฅํžˆ ์ž‘์•„์ง€๋ฉด์„œ gradient descent๊ฐ€ ์‹ค์ œ output์— ๋ฐ˜์˜๋˜๊ธฐ ํž˜๋“  ๊ตฌ์กฐ๊ฐ€ ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
ํ”ํžˆ ์ด๋Ÿฐ deep neural network์—์„œ ๋‚˜ํƒ€๋‚˜๋Š” ํ•™์Šตํšจ๊ณผ ์ €ํ•ด๋ฅผ overfitting(ํŠน์ • ๋ฐ์ดํ„ฐ์…‹์˜ ํŠน์„ฑ์„ ๊ณผ๋„ํ•˜๊ฒŒ ๋ฐ˜์˜ํ•˜์—ฌ ์ผ๋ฐ˜์ ์œผ๋กœ ์—๋Ÿฌ๊ฐ€ ์ปค์ง€๋Š” ํ˜„์ƒ)์— ์˜ํ•œ ๊ฒƒ์œผ๋กœ ๋‹จ์ˆœํžˆ ํ•ด์„ํ•˜๊ณ  ๋„˜์–ด๊ฐˆ ์ˆ˜ ์žˆ๋Š”๋ฐ, ๋…ผ๋ฌธ์—์„œ ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ๋Š” ํ˜„์ƒ์€ ์ด๋Ÿฌํ•œ ์›์ธ์ด ์•„๋‹ˆ๋ผ๋Š” ๊ฒƒ์„ ๋ช…ํ™•ํžˆ ์งš๊ณ  ๋„˜์–ด๊ฐ‘๋‹ˆ๋‹ค.
Reason why the problem is not caused by "overfitting"
Overfitting์ด ์›์ธ์ธ ํ•™์Šตํšจ๊ณผ ์ €ํ•ด์˜ ๊ฒฝ์šฐ, training error๋Š” ์ž‘์ง€๋งŒ test error๋Š” ํฌ๊ฒŒ ๋“ฑ์žฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์–‘์ชฝ ๊ทธ๋ž˜ํ”„์˜ ๊ฒฝ์šฐ ๋ชจ๋‘ layer ์ˆ˜๊ฐ€ ๋” ๋งŽ์€ ์ชฝ์ด error๊ฐ€ ํฌ๊ฒŒ ๋“ฑ์žฅํ•œ๋‹ค๋Š” ๊ฒƒ์„ ํ†ตํ•ด, ์ผ๋ฐ˜์ ์œผ๋กœ ๊นŠ์€ ์‹ ๊ฒฝ๋ง์— ๋Œ€ํ•œ ํ•™์Šตํšจ๊ณผ ์ €ํ•ด์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Residual Learning

์•ž์—์„œ ์„ค๋ช…๋“œ๋ฆฐ degradation ํ˜„์ƒ์€ ๋ณต์žกํ•œ layer๋ฅผ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๊ณ  ๋‹จ์ˆœํžˆ identity layer ๋งŒ์„ ๊ธฐ์กด์˜ ์‹ ๊ฒฝ๋ง์— ๋ถ™์ด๋”๋ผ๋„ ๋ฐœ์ƒํ–ˆ๋˜ ํ˜„์ƒ์ž…๋‹ˆ๋‹ค. ResNet์€ ์ด๋Ÿฐ identity mapping์„ ์กฐ๊ธˆ ๋” ์ž˜ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ residual learning์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
๊ธฐ์กด์— ์กด์žฌํ–ˆ๋˜ neural network๋Š” ์ขŒ์ธก์˜ plain net์˜ ๊ตฌ์กฐ๋กœ, input x์— ๋Œ€ํ•ด์„œ ๋ชฉํ‘œ๋กœ ํ•˜๋Š” H(x)๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ๋ชฉ์ ์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, residual net ๊ตฌ์กฐ๋Š” input x ์— ๋Œ€ํ•œ ๋ชฉํ‘œ H(x)์— identity mapping์ธ x๋ฅผ ํฌํ•จ์‹œํ‚จ ํ˜•ํƒœ๋กœ ๋‚จ์€ F(x)๋ฅผ optimizeํ•˜๋Š” ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
Identity mapping์˜ ๊ด€์ ์—์„œ ์œ„ ๊ตฌ์กฐ๋ฅผ ์‚ดํŽด๋ณด๋ฉด ์ขŒ์ธก์˜ plain net์€ H(x)๋ฅผ x์™€ ๋™์ผํ•˜๋„๋ก ํ•™์Šต์‹œ์ผœ์•ผ ํ•˜๋Š” ๋ฐ˜๋ฉด, ์šฐ์ธก์˜ ๊ฒฝ์šฐ F(x)=H(x)-x๋ฅผ 0์— ์ˆ˜๋ ดํ•˜๋„๋ก ํ•™์Šต์‹œ์ผœ์•ผ ํ•จ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. layer๋ฅผ ํŠน์ • ๊ฐ’ input x๋ฅผ ๊ฐ€์ง€๋„๋ก ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒฝ์šฐ๋ณด๋‹ค๋Š”, ์–ด๋–ค x๊ฐ€ ๋“ค์–ด์˜ค๋”๋ผ๋„ residual(์ž”์ฐจ)๋ฅผ 0์„ ๊ฐ€์ง€๋„๋ก ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๋”์šฑ ์ข‹์€ ํ•™์Šตํšจ๊ณผ๋ฅผ ๋ถˆ๋Ÿฌ์˜ฌ ๊ฒƒ์ด๋ผ๋Š” ๊ฒƒ์—์„œ ์‹œ์ž‘ํ•œ ๋ฐœ์ƒ์ž…๋‹ˆ๋‹ค.
y = F(x, \{W_i\}) + W_s x
To elaborate, the residual net described above can be expressed as the equation above. Here W_s is a linear projection matrix used to match dimensions when the input and output dimensions differ.
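As a concrete illustration, below is a minimal PyTorch sketch of the 2-layer residual block discussed above (the class and parameter names are mine, not the paper's). The 1x1 convolution plays the role of W_s when the dimensions differ; otherwise the shortcut is the identity.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal 2-layer residual block sketch: y = F(x, {W_i}) + W_s x."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # F(x, {W_i}): two 3x3 convolutions with batch norm, as in the 34-layer net
        self.f = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # W_s: identity when shapes already match, 1x1 projection otherwise
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y = F(x) + W_s x, followed by the block's final activation
        return self.relu(self.f(x) + self.shortcut(x))
```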

How Residual Learning Solves the Problem

Although it began from the observation that identity mappings are easier to realize with stacked layers, residual learning turned out to be well suited to solving the degradation problem at its root.
This is because the x that is linearly added to the H(x) term makes it easy for the value of x to be reflected in the output even at layers far from the output layer.
์ด๋Ÿฌํ•œ ๊ฐ layer์˜ feature ์˜ output์— ๋Œ€ํ•œ ๋ฐ˜์˜์— ๋Œ€ํ•œ ์ธก๋ฉด์œผ๋กœ ResNet์˜ ํšจ์œจ์ ์ธ ์ธก๋ฉด์„ ๋ถ„์„ํ•œย โ€œIdentity Mappings in Deep Neural Networksโ€ย ๋…ผ๋ฌธ์— ๋Œ€ํ•ด์„œ๋„ ํ•œ ๋ฒˆ ์ฝ์–ด๋ณด์‹œ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ๋‚ด์šฉ์— ๋Œ€ํ•ด์„œ ๊ฐ„๋žตํžˆ ์†Œ๊ฐœํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)
First, following the block that the residual net forms, the input-output mapping can be written as above. Here h is the function providing the shortcut connection, F is the term computed from the weights and features, and f is the activation function.
x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)
์ดํ›„ ๋…ผ๋ฌธ์—์„œ๋„ ๋“ฑ์žฅํ•˜๊ฒ ์ง€๋งŒ, h๋ฅผ identity mapping์œผ๋กœ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์œผ๋กœ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉฐ, activation function ๋˜ํ•œ ReLU๋“ฑ์˜ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ–ˆ์„ ๊ฒฝ์šฐ๋ฅผ ๋ฐ˜์˜ํ•œ identity๋กœ ๊ฐ€์ •ํ•˜๊ณ  ์ˆ˜์‹์„ ์ „๊ฐœํ•˜๋ฉด ์œ„์™€ ๊ฐ™์€ ํ˜•ํƒœ๋กœ ๋‚˜ํƒ€๋‚˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด ์ˆ˜์‹์—์„œ๋ถ€ํ„ฐ feature์˜ ์˜จ์ „ํ•œ ์ „๋‹ฌ์ด ์ด๋ฃจ์–ด์ง„๋‹ค๋Š” ๋Š๋‚Œ์„ ๊ฐ•ํ•˜๊ฒŒ ๋ฐ›์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
\frac{\partial\epsilon}{\partial x_l} = \frac{\partial\epsilon}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial\epsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right)
ํŠน์ • layer์˜ feature๊ฐ€ ๊ธฐ์—ฌํ•˜๋Š” error ํ•ญ๋ชฉ์„ ๊ตฌํ•˜๋ฉด ์œ„์™€ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค. ์•ž์„œ์„œ feature๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ธ forward propagation์‹์„ ์ด์šฉํ•ด ์ „๊ฐœ๋ฅผ ํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ์ฃผ๋ชฉํ•ด์•ผ ํ•  ๊ฒƒ์€ ์‹์˜ ์ฒซ ๋ฒˆ์งธ ํ•ญ์ž…๋‹ˆ๋‹ค. ์ด ๊ฐ’์€ layer์˜ ๊นŠ์ด์— ๊ด€๊ณ„์—†์ด ์ผ์ •ํ•˜๊ฒŒ backpropagation์„ ํ†ตํ•ด ์ „๋‹ฌ๋˜๋Š” ๊ฐ’์ž…๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ์ตœ์†Œํ•œ์˜ gradient๋ฅผ ๋ณด์žฅํ•ด ์คŒ์œผ๋กœ์จ feature์— ๋Œ€ํ•œ error๋ฅผ ์ผ์ • ์ˆ˜์ค€์œผ๋กœ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๊ณ , ์ด๋ฅผ ํ†ตํ•ด์„œ ๊ฐ weight๊ฐ€ total output์— ๊ธฐ์—ฌํ•˜๋Š” ์ •๋„๋„ ์ผ์ •์ˆ˜์ค€ ์ด์ƒ์œผ๋กœ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.(์ด ๋ถ€๋ถ„์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ์ž˜ ๋˜์ง€ ์•Š๋Š”๋‹ค๋ฉด backpropagation formula์— ๋Œ€ํ•ด์„œ ๋ณด๊ณ  ์˜ค์…”๋„ ์ข‹์Šต๋‹ˆ๋‹ค)

Deeper Bottleneck Architecture

์ด๋ ‡๊ฒŒ ๊นŠ์€ ์‹ ๊ฒฝ๋ง์„ ๊ตฌํ˜„ํ•˜๋ฉด ๋ณต์žกํ•ด์ง„ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ๋งŒํผ ๋งŽ์€ ์–‘์˜ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. GoogleNet์€ ์ด๋Ÿฌํ•œ ์ ์„ ์ด์šฉํ•ด Inception v1์ด๋ผ๋Š” ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ•ด ์—ฐ์‚ฐ์˜ ์ˆ˜๋ฅผ ๋Œ€ํญ ์ค„์ด๋ฉด์„œ๋„ ์„ฑ๋Šฅ์„ ์–ด๋Š์ •๋„ ์œ ์ง€ํ•˜๋Š” ๋ฐฉ์•ˆ์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ResNet์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ตฌ์กฐ๋ฅผ ๋ฐ˜์˜ํ•˜์—ฌ ๊ธฐ์กด์˜ residual net block์˜ ์–‘ ๋์— 1x1 convolutional layer๋ฅผ ๋ถ™์—ฌ์„œ feature์˜ depth๋ฅผ ์ค„์ธ ํ›„ ์—ฐ์‚ฐ์„ ๊ฑฐ์นœ๋‹ค์Œ ๋‹ค์‹œ ๋Š˜๋ฆฌ๋Š” ํ˜•ํƒœ๋ฅผ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. (์—ฐ์‚ฐ ํšŸ์ˆ˜์— ๋Œ€ํ•œ ๊ณ„์‚ฐ ๋‚ด์šฉ์€ ๋‹จ์ˆœ ์ƒ๋žตํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.)

Experiment 1: ImageNet

CVPR์— ๊ฒŒ์žฌ๋œ ๋…ผ๋ฌธ๋‹ต๊ฒŒ, ๋…ผ๋ฌธ์—์„œ๋Š” ๊ตฌํ˜„ํ•œ ๋„คํŠธ์›Œํฌ์— ๋Œ€ํ•œ ๋‹ค์–‘ํ•˜๊ณ  ์„ธ์‹ฌํ•œ ์‹คํ—˜๋“ค์„ ํ†ตํ•ด ์ œ์‹œํ•œ ๊ตฌ์กฐ๊ฐ€ ์ผ๋ฐ˜์ ์œผ๋กœ ํšจ์œจ์„ ๊ฐœ์„ ํ•˜๋Š” ๊ฒƒ์— ์œ ์˜๋ฏธํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ ธ์˜จ๋‹ค๋Š” ๊ฒƒ์„ ์ฆ๋ช…ํ•ฉ๋‹ˆ๋‹ค.
The first result presented compares a 34-layer plain net and a 34-layer residual net, both designed with reference to VGG-19, on the ImageNet 2012 classification dataset.
The graph above plots error against training iterations; the thin curves are training error and the bold curves are validation error.
The left graph, covering only the plain nets, confirms that the degradation problem appears. Interestingly, the authors report that when they inspected the actual gradient values, they were not small enough to count as gradient vanishing, so they conclude the degradation is not caused by gradient vanishing and defer this optimization difficulty to future work.
๋‹ค์Œ์œผ๋กœ ์šฐ์ธก ๊ทธ๋ž˜ํ”„์˜ ๊ฒฝ์šฐ ResNet์„ ๋‚˜ํƒ€๋‚ด๋Š”๋ฐ, ๊นŠ์€ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์—์„œ์˜ error๊ฐ€ ๋” ์ž‘๊ฒŒ ๋‚˜ํƒ€๋‚ฌ๋‹ค๋Š” ์‚ฌ์‹ค์„ ๋ณผ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.
Top-1 error
Top 1 error์˜ ๊ด€์ ์—์„œ๋„ ResNet์ด PlainNet ๋ณด๋‹ค ๋” ์ข‹์€ ํ•™์Šตํšจ๊ณผ๋ฅผ ๋‚˜ํƒ€๋ƒˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฒฐ๊ณผ๋Š” ๋ชจ๋“  shortcut connection์˜ increasing dimension์— ๋Œ€ํ•ด zero padding์„ ์ง„ํ–‰ํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด residual net์„ ์ด์šฉํ•ด degradation problem์„ ํ•ด๊ฒฐํ•œ ๊ฒƒ์œผ๋กœ ๋ณด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
Finally, the 18-layer plain net and the 18-layer ResNet converge to a similar level of error, but the ResNet reaches its convergence point much faster because the optimization described earlier (driving the residual to 0 rather than learning toward a specific value) is easier.

Experiment 2: Dimension Increasing Option & Deeper Bottleneck Architecture

Using the deeper bottleneck architecture described above, the paper builds and evaluates ResNet-50, ResNet-101, and ResNet-152. It also proposes three options for dimension increasing (what to do when a W_s-style dimension change is needed) and evaluates the performance of each.
Dimension increasing option & Deeper bottleneck architecture
The three options A, B, and C are, respectively: use zero padding for dimension increasing; use projection shortcuts only where dimension increasing is required; and use projection shortcuts in every case. In the results, C, B, and A performed best in that order, but the paper judged the differences to be small and dropped option C to avoid complicating the model.
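For completeness, here is a small sketch of what the parameter-free option A shortcut might look like (my own minimal rendition of the idea, not the paper's code):

```python
import torch
import torch.nn.functional as F

def zero_pad_shortcut(x, out_channels, stride=2):
    """Option A sketch: parameter-free shortcut for increasing dimensions."""
    x = x[:, :, ::stride, ::stride]        # spatial downsampling by strided slicing
    pad = out_channels - x.size(1)         # number of channels to add
    return F.pad(x, (0, 0, 0, 0, 0, pad))  # zero-pad the channel dimension

# e.g. an (N, 16, 32, 32) feature map becomes (N, 32, 16, 16)
y = zero_pad_shortcut(torch.randn(8, 16, 32, 32), out_channels=32)
print(y.shape)  # torch.Size([8, 32, 16, 16])
```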
The 50-layer ResNet replaces the 2-layer residual blocks of the 34-layer network with 3-layer bottleneck blocks, and the 101- and 152-layer networks increase the layer count by using more of these blocks. In all of these cases no degradation problem was observed, and performance improved further.
์—ฌ๊ธฐ์„œ ์ง์ ‘์ ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ์‹œํ•˜์ง€ ์•Š๊ฒ ์ง€๋งŒ, ๋…ผ๋ฌธ์—์„œ๋Š” ์œ„ ๊ตฌ์กฐ๋“ค์„ ์กฐํ•ฉํ•ด์„œ ImageNet validation์„ ์ง„ํ–‰ํ•œ ๊ฒฐ๊ณผ ์ตœ์ข…์ ์œผ๋กœ top 5 error 3.57%์˜ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์˜ ๊ตฌ์กฐ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

Experiment 3: CIFAR-10 dataset

๋‹ค์Œ์œผ๋กœ ๋…ผ๋ฌธ์—์„œ๋Š” ์ƒ๋‹นํžˆ ์œ ๋ช…ํ•œ dataset์ธ CIFAR-10์œผ๋กœ ๊ตฌ์กฐ๋ฅผ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. 32x32์˜ pixel image input์„ ๊ฐ€์ง€๊ณ  ์•„๋ž˜์™€ ๊ฐ™์ด 3x3 convolutional layer์˜ ๊ฐœ์ˆ˜๋ฅผ ๋ถ„ํฌํ•˜์—ฌ ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•ฉ๋‹ˆ๋‹ค.
A 10-way fully connected layer is then placed at the end, producing a family of architectures with 6n+2 layers in total, whose performance is then evaluated.
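Here is how that 6n+2 construction might look in code, reusing the ResidualBlock sketched earlier (again an illustrative sketch; the paper's exact initialization, augmentation, and training schedule are omitted):

```python
import torch.nn as nn

def cifar_resnet(n):
    """Sketch of the 6n+2-layer CIFAR-10 ResNet (e.g. n=3 gives ResNet-20)."""
    layers = [nn.Conv2d(3, 16, 3, padding=1, bias=False),
              nn.BatchNorm2d(16), nn.ReLU(inplace=True)]  # the first conv layer
    in_ch = 16
    for out_ch in (16, 32, 64):              # feature map sizes 32, 16, 8
        for i in range(n):                   # n blocks = 2n conv layers per size
            stride = 2 if (i == 0 and out_ch != 16) else 1
            layers.append(ResidualBlock(in_ch, out_ch, stride))
            in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(64, 10)]            # the 10-way fully connected layer
    return nn.Sequential(*layers)

model = cifar_resnet(n=3)  # 6*3 + 2 = 20 layers
```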
CIFAR-10 Evaluation
์ ์  layer์˜ ์ˆ˜๋ฅผ ๋Š˜๋ ค๊ฐ€๋ฉด์„œ error๋ฅผ ์ธก์ •ํ–ˆ๊ณ , ๊ทธ ๊ฒฐ๊ณผ ์›ํ•˜๋Š” ๋Œ€๋กœ degradation problem์„ ํ•ด๊ฒฐํ•˜์—ฌ layer์˜ ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ์„œ error๊ฐ€ ๊ฐ์†Œํ–ˆ์Šต๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” 1202๊ฐœ์˜ layer์— ๋Œ€ํ•ด์„œ๋„ ์‹คํ—˜์„ ํ•˜๋Š”๋ฐ, ์ด ๊ฒฝ์šฐ๋Š” overfitting์˜ ๋ฌธ์ œ๋กœ error๊ฐ€ ์ฆ๊ฐ€ํ•œ ๊ฒƒ์œผ๋กœ ๋ณด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
PlainNet vs. ResNet
The same result can be observed in the error-versus-iteration graphs: the plain nets' error grew with more layers, while the ResNets' error shrank with more layers.
๋”๋ถˆ์–ด ๋…ผ๋ฌธ์—์„œ๋Š” batch normalization ์ดํ›„, ๊ทธ๋ฆฌ๊ณ  activation์ด์ „์˜ ๊ฐ’๋“ค์ธ response์— ๋Œ€ํ•œ ๊ทธ๋ž˜ํ”„๋ฅผ layer index์— ๋Œ€ํ•ด ํ‘œํ˜„ํ•œ ๋‚ด์šฉ์„ ์ œ์‹œํ–ˆ์Šต๋‹ˆ๋‹ค.
์œ„ ๊ทธ๋ฆผ์„ ํ†ตํ•ด plain๋ณด๋‹ค๋Š” residual์„ ์‚ฌ์šฉํ•œ ๊ฒฝ์šฐ๊ฐ€ response๊ฐ€ ์ž‘์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๊ณ , ์ด๋Š” ์ด๋ฏธ shortcut ํ•ญ๋ชฉ์„ ํฌํ•จํ•œ residual์˜ ๊ฒฝ์šฐ๊ฐ€ optimal์„ ํ–ฅํ•ด ๊ฐ€๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•œ ๋ณ€ํ™”๋Ÿ‰์ด ์ผ๋ฐ˜์ ์œผ๋กœ ์ ์„ ๊ฒƒ์ด๋ผ๋Š” ์„ค๊ณ„์™€ ๋งž์•„๋–จ์–ด์ง€๋Š” ๋ถ€๋ถ„์ธ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

Conclusion

This concludes my brief summary of "Deep Residual Learning for Image Recognition". Perhaps because it is a CVPR paper with detailed validation of both the design and the evaluation, it was genuinely enjoyable to read.
์™œ ์ด ๋…ผ๋ฌธ์ด ์ €๋ ‡๊ฒŒ ๋งŽ์€ ์ธ์šฉ์ˆ˜๋ฅผ ๊ฐ€์ง€๊ณ , semantic segmentation์— ์žˆ์–ด์„œ ์ค‘์š”ํ•œ ๋…ผ๋ฌธ์œผ๋กœ ํ‰๊ฐ€๋ฐ›๋Š”์ง€๋ฅผ ์ถฉ๋ถ„ํžˆ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋˜ ๋…ผ๋ฌธ์ด์—ˆ๊ณ , ์ด ๋ถ„์•ผ์— ๋Œ€ํ•œ ํฅ๋ฏธ๋ฅผ ์ผ๊นจ์›Œ์ฃผ๋Š” ๋…ผ๋ฌธ์ด์—ˆ๋˜ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.
Although not covered here, the paper's appendix additionally evaluates Object Detection on PASCAL and MS COCO; anyone interested may want to give it a read.