

Deep Learning - 2.4 Softmax Regression


 

Regression is the hammer we reach for when we want to answer “how much?” or “how many?” questions.

In practice, we are more often interested in classification: asking not “how much” but “which one”:

  • Classification
  • Network Architecture
  • Initializing Model Parameters
  • Parameterization Cost of Fully-Connected Layers
  • Softmax Operation
  • Vectorization for Minibatches
  • Loss Function
  • Softmax and Derivatives
  • Cross-Entropy Loss
  • Information Theory Basics

 

1. Classification

As a running example, suppose each input consists of a 2 × 2 grayscale image. We can represent each pixel value with a single scalar, giving us four features x1, x2, x3, x4.

For the labels we can use one-hot encoding: a vector with as many components as we have categories. The component corresponding to a particular instance's category is set to 1 and all other components are set to 0.
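As a quick illustration (a minimal sketch assuming PyTorch and three hypothetical categories mapped to the indices 0, 1, 2):

```python
import torch
import torch.nn.functional as F

# Three hypothetical categories mapped to the indices 0, 1, 2.
labels = torch.tensor([0, 2, 1])

# One-hot encoding: one component per category,
# 1 at the instance's category and 0 everywhere else.
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```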

 

2. Network Architecture

In order to estimate the conditional probabilities associated with all the possible classes, we need a model with multiple outputs, one per class.
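As a minimal sketch (assuming PyTorch, the 2 × 2 image example above with d = 4 features, and a hypothetical q = 3 classes), a single fully-connected layer already gives one output per class:

```python
import torch
from torch import nn

# d = 4 pixel features in, q = 3 classes out: one unnormalized score (logit) per class.
net = nn.Linear(4, 3)

x = torch.randn(1, 4)   # a single example with four pixel features
logits = net(x)
print(logits.shape)     # torch.Size([1, 3])
```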

 

3. Parameterization Cost of Fully-Connected Layers

In practice, we can use a framework's predefined layers to build fully-connected layers, which lets us focus on the model architecture rather than low-level implementation details.

 

Specifically, for any fully-connected layer with d inputs and q outputs, the parameterization cost is O(dq), which can be prohibitively high in practice. Fortunately, this cost of transforming d inputs into q outputs can be reduced to O(dq/n), where the hyperparameter n can be flexibly specified by us to balance between parameter saving and model effectiveness in real-world applications.
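For concreteness, a rough sketch of the O(dq) count (the sizes d = 784 and q = 10 are just illustrative, not from the text above):

```python
from torch import nn

# A fully-connected layer with d = 784 inputs and q = 10 outputs:
# d * q weights plus q biases, i.e. O(dq) parameters overall.
layer = nn.Linear(784, 10)
num_params = sum(p.numel() for p in layer.parameters())
print(num_params)  # 7850 = 784 * 10 + 10
```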


 

5. Softmax Operation

Our goal is to interpret the outputs of our model as probabilities: we would like any output ŷj to be interpreted as the probability that a given item belongs to class j.

For example, if ŷ1, ŷ2, and ŷ3 are 0.1, 0.8, and 0.1, respectively, then we predict category 2.

But directly interpreting the output of the linear layer as a probability is a bit dangerous, because nothing constrains these numbers to sum to 1 and, depending on the inputs, they can take negative values.

So to transform our logits such that they become nonnegative and sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):

$$\hat{y}_j = \mathrm{softmax}(\mathbf{o})_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$$

This is the softmax function.

Hence ŷ = softmax(o) is a proper probability distribution.

Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an affine transformation of input features; thus, softmax regression is a linear model.

6. Vectorization for Minibatches

Assume that we are given a minibatch X of examples with feature dimensionality (number of inputs) d and batch size n. Moreover, assume that we have q categories in the output. Then the minibatch features satisfy X ∈ R^{n×d}, the weights satisfy W ∈ R^{d×q}, and the bias satisfies b ∈ R^{1×q}.

Since each row in X represents a data example, the softmax operation itself can be computed rowwise: for each row of O, exponentiate all entries and then normalize them by their sum. Broadcasting is triggered during the summation XW + b in (3.4.5), and both the minibatch logits O and the output probabilities Ŷ are n × q matrices.
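A minimal sketch of this rowwise computation (assuming PyTorch; the sizes n = 2, d = 4, q = 3 are illustrative):

```python
import torch

n, d, q = 2, 4, 3
X = torch.randn(n, d)        # minibatch features
W = torch.randn(d, q)        # weights
b = torch.randn(1, q)        # bias, broadcast across the n rows

O = X @ W + b                # minibatch logits, shape (n, q)

# Rowwise softmax: exponentiate every entry, then normalize each row by its sum.
# (Real implementations subtract the rowwise max first for numerical stability.)
Y_hat = torch.exp(O) / torch.exp(O).sum(dim=1, keepdim=True)
print(Y_hat.sum(dim=1))      # each row sums to 1
```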

7. Loss Function

We will rely on maximum likelihood estimation, the very same concept that we encountered when providing a probabilistic justification for the mean squared error objective in linear regression.

The softmax function gives us a vector ŷ, which we can interpret as estimated conditional probabilities of each class given any input x, e.g., ŷ1 = P(y = cat | x).

Our goal is to maximize P(Y | X). According to maximum likelihood estimation, this is equivalent to minimizing the negative log-likelihood

$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^{n} -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^{n} l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),$$

where for any pair of label y and model prediction ŷ over q classes, the loss function l is

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j.$$

The loss function above is commonly called the cross-entropy loss.

Since y is a one-hot vector of length q, the sum over all its coordinates j vanishes for all but one term. Since all ŷj are predicted probabilities, their logarithm is never larger than 0. Consequently, the loss function cannot be minimized any further if we correctly predict the actual label with certainty, i.e., if the predicted probability P(y | x) = 1 for the actual label y.
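As a small worked example (reusing the prediction (0.1, 0.8, 0.1) from above and assuming the true label is category 2), the loss reduces to the negative log of the probability assigned to the correct class:

```python
import math

y = [0.0, 1.0, 0.0]        # one-hot label: the true class is category 2
y_hat = [0.1, 0.8, 0.1]    # predicted probabilities

loss = -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat))
print(loss)                # -log(0.8) ≈ 0.223
```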

 

8. Softmax and Derivatives

Plugging softmax into the definition of the loss, we obtain:

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \frac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)} = \log \sum_{k=1}^{q} \exp(o_k) - \sum_{j=1}^{q} y_j o_j.$$

Now consider the derivative with respect to any logit oj. We get:

$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.$$

In other words, the derivative is the difference between the probability assigned by our model, as expressed by the softmax operation, and what actually happened, as expressed by elements in the one-hot label vector.
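This identity is easy to check numerically; a minimal sketch (assuming PyTorch and illustrative logits over q = 3 classes):

```python
import torch
import torch.nn.functional as F

o = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)  # logits
y = torch.tensor([0.0, 1.0, 0.0])                      # one-hot label

# Cross-entropy of softmax(o) against y, written via log-softmax.
loss = -(y * F.log_softmax(o, dim=0)).sum()
loss.backward()

print(o.grad)                    # gradient computed by autograd
print(F.softmax(o, dim=0) - y)   # softmax(o) - y: the two match
```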

9. Cross-Entropy Loss

We can use the same representation as before for the label y. The only difference is that rather than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability vector, say (0.1, 0.2, 0.7). The math that we used previously to define the loss l in (3.4.8) still works out fine, just that the interpretation is slightly more general. It is the expected value of the loss for a distribution over labels. This loss is called the cross-entropy loss and it is one of the most commonly used losses for classification problems.
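A short sketch with a soft label (the vector (0.1, 0.2, 0.7) is from the paragraph above; the prediction (0.2, 0.3, 0.5) is a made-up, hypothetical example):

```python
import torch

y = torch.tensor([0.1, 0.2, 0.7])      # a generic probability vector over the labels
y_hat = torch.tensor([0.2, 0.3, 0.5])  # hypothetical predicted probabilities

# Same formula as before: the expected value of -log(prediction) under y.
loss = -(y * torch.log(y_hat)).sum()
print(loss)
```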

 

10. Information Theory Basics

Entropy

The central quantity of information theory, for a distribution P, is

$$H[P] = \sum_j -P(j) \log P(j).$$

This is called the entropy of a distribution P.

This quantity places a hard limit on our ability to compress the data.
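For instance (a quick sketch reusing the probability vector (0.1, 0.2, 0.7) that appears earlier in this post; entropy measured in nats, i.e. with the natural logarithm):

```python
import math

P = [0.1, 0.2, 0.7]
H = sum(-p * math.log(p) for p in P)
print(H)  # ≈ 0.80 nats
```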

 

Surprisal

If it is always easy for us to predict the next token, then this data is easy to compress!

However, if we cannot perfectly predict every event, then we might sometimes be surprised.

The entropy is then the expected surprisal when one assigns the correct probabilities that truly match the data-generating process.
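Written out, the surprisal of an event j to which we assigned probability P(j) is −log P(j) (low-probability events are more surprising), and the entropy is exactly its expectation under the true distribution:

$$H[P] = \sum_j P(j) \log \frac{1}{P(j)} = E_{j \sim P}\big[-\log P(j)\big].$$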

Cross-Entropy Revisited

So if entropy is the level of surprise experienced by someone who knows the true probability, then you might be wondering, what is cross-entropy? The cross-entropy from P to Q, denoted H(P, Q), is the expected surprisal of an observer with subjective probabilities Q upon seeing data that were actually generated according to probabilities P. The lowest possible cross-entropy is achieved when P = Q; in this case, the cross-entropy from P to Q is H(P, P) = H(P).
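In formula form, the cross-entropy from P to Q is

$$H(P, Q) = \sum_j -P(j) \log Q(j),$$

which indeed reduces to H(P) when Q = P.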

In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.
