Back Propagation Through Time (BPTT)
States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ).
Loss function
Let's say we are using the cross-entropy loss.
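For reference in the derivations below, here is one common way to write the forward pass and the loss; the specific notation (tanh activation, weights Wxh, Whh, Wyh, biases bh, by) is an assumption, since the post does not spell it out.

```latex
\begin{align*}
h_t &= \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) && \text{hidden state}\\
o_t &= W_{yh} h_t + b_y                          && \text{output (logits)}\\
\hat{y}_t &= \operatorname{softmax}(o_t)         && \text{prediction}\\
L &= \sum_{t=1}^{\tau} L_t, \qquad
L_t = -\sum_{k} y_{t,k}\,\log \hat{y}_{t,k}      && \text{cross-entropy loss}
\end{align*}
```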
Derivative of loss with respect to Wyh
The derivative of the cross-entropy loss with respect to ot (taken through the softmax) is (ŷt - yt).
Chaining this derivative of the loss with respect to the softmax output with the derivative of ot with respect to Wyh gives the gradient for Wyh, written out below.
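Under the notation assumed above, the two pieces of the chain rule combine into a sum over time-steps:

```latex
\begin{align*}
\frac{\partial L_t}{\partial o_t} &= \hat{y}_t - y_t
  && \text{softmax combined with cross-entropy}\\
\frac{\partial L}{\partial W_{yh}}
  &= \sum_{t=1}^{\tau} \frac{\partial L_t}{\partial o_t}\,
     \frac{\partial o_t}{\partial W_{yh}}
   = \sum_{t=1}^{\tau} (\hat{y}_t - y_t)\, h_t^{\top}
\end{align*}
```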
Derivative of loss with respect to bias by
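Since ot depends on by through an identity map, the same error term simply accumulates over time (again under the assumed notation):

```latex
\frac{\partial L}{\partial b_y}
  = \sum_{t=1}^{\tau} \frac{\partial L_t}{\partial o_t}\,
    \frac{\partial o_t}{\partial b_y}
  = \sum_{t=1}^{\tau} (\hat{y}_t - y_t)
```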
Derivative of loss with respect to hidden state
At time-step t+1 we can compute the gradient, and then backpropagate through time from t+1 back to 1 to obtain the overall gradient with respect to Whh.
Aggregating the gradients with respect to Whh over all time-steps with backpropagation gives the sum sketched below.
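One way to write this out under the assumed tanh RNN: the gradient flowing into ht comes both from the output at t and from the next hidden state, and the Whh gradient sums the resulting terms over all time-steps.

```latex
\begin{align*}
\frac{\partial L}{\partial h_t}
  &= W_{yh}^{\top}(\hat{y}_t - y_t)
   + W_{hh}^{\top}\operatorname{diag}\!\big(1 - h_{t+1}^2\big)\,
     \frac{\partial L}{\partial h_{t+1}}
  && \text{recursion from } t{+}1 \text{ back to } 1\\
\frac{\partial L}{\partial W_{hh}}
  &= \sum_{t=1}^{\tau}
     \operatorname{diag}\!\big(1 - h_t^2\big)\,
     \frac{\partial L}{\partial h_t}\; h_{t-1}^{\top}
\end{align*}
```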
Derivative of loss with respect to Wxh
Similarly, starting from the gradient at time-step t+1, we can take the derivative with respect to Wxh over the whole sequence, as sketched below.
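The derivation mirrors the one for Whh, with xt in place of ht-1:

```latex
\frac{\partial L}{\partial W_{xh}}
  = \sum_{t=1}^{\tau}
    \operatorname{diag}\!\big(1 - h_t^2\big)\,
    \frac{\partial L}{\partial h_t}\; x_t^{\top}
```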
Vanishing gradient and exploding gradient
A large increase in the norm of the gradient during training -> exploding gradient.
The opposite -> vanishing gradient, which makes it practically impossible for the model to learn correlations between temporally distant events.
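The reason is visible in the recursion for the hidden-state gradient: sending the gradient from time-step t back to time-step k multiplies a chain of Jacobians, so its magnitude grows or shrinks roughly geometrically with the gap t - k.

```latex
\frac{\partial h_t}{\partial h_k}
  = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
  = \prod_{i=k+1}^{t} \operatorname{diag}\!\big(1 - h_i^2\big)\, W_{hh}
```

If the norms of these Jacobians stay above 1 the product explodes, and if they stay below 1 it vanishes.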
To deal with this, we can use a truncated RNN (truncated BPTT), which cuts off the backpropagated gradients after an appropriate number of time-steps; a minimal sketch follows below.
For the same reason, LSTM and GRU are generally used instead of the vanilla RNN.
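As a rough illustration (not from the original post), this is how truncated BPTT is often done in PyTorch: run the RNN in chunks of k steps, backpropagate within each chunk, and detach the hidden state so gradients stop flowing further back. The model, data, and chunk length here are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a single-layer RNN trained with truncated BPTT.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 4)                      # maps hidden state to logits
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(2, 100, 8)                   # dummy input: (batch, time, features)
y = torch.randint(0, 4, (2, 100))            # dummy targets per time-step
k = 20                                       # truncation length

h = torch.zeros(1, 2, 16)                    # initial hidden state
for start in range(0, 100, k):
    chunk_x = x[:, start:start + k]
    chunk_y = y[:, start:start + k]

    out, h = rnn(chunk_x, h)                 # forward through k time-steps
    loss = criterion(head(out).reshape(-1, 4), chunk_y.reshape(-1))

    optimizer.zero_grad()
    loss.backward()                          # BPTT only within this chunk
    optimizer.step()

    h = h.detach()                           # cut the graph: gradients stop here
```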