

Deep Learning - 2.4 Softmax Regression


 

Regression is the hammer we reach for when we want to answer “how much?” or “how many?” questions.

In practice, we are more often interested in classification: asking not “how much” but “which one”:

  • Classification
  • Network Architecture
  • Initializing Model Parameters
  • Parameterization Cost of Fully-Connected Layers
  • Softmax Operation
  • Vectorization for Minibatches
  • Loss Function
  • Softmax and Derivatives
  • Cross-Entropy Loss
  • Information Theory Basics

 

1. Classification

As a running example, suppose each input consists of a 2 × 2 grayscale image. We can represent each pixel value with a single scalar, giving us four features x1, x2, x3, x4.

For the labels we can use one-hot encoding: a vector with as many components as we have categories. The component corresponding to a particular instance's category is set to 1 and all other components are set to 0.
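As a quick illustration (a minimal sketch assuming PyTorch and three hypothetical categories mapped to the indices 0, 1, 2):

```python
import torch
import torch.nn.functional as F

# Three hypothetical categories mapped to the indices 0, 1, 2.
labels = torch.tensor([0, 2, 1])

# One-hot encoding: one component per category,
# 1 at the instance's category and 0 everywhere else.
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```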

 

2. Network Architecture

In order to estimate the conditional probabilities associated with all the possible classes, we need a model with multiple outputs, one per class.
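As a minimal sketch (assuming PyTorch, the 2 × 2 image example above with d = 4 features, and a hypothetical q = 3 classes), a single fully-connected layer already gives one output per class:

```python
import torch
from torch import nn

# d = 4 pixel features in, q = 3 classes out: one unnormalized score (logit) per class.
net = nn.Linear(4, 3)

x = torch.randn(1, 4)   # a single example with four pixel features
logits = net(x)
print(logits.shape)     # torch.Size([1, 3])
```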

 

3. Parameterization Cost of Fully-Connected Layers

In practice, we can use a framework's predefined layers to build fully-connected layers, which lets us focus on the model architecture rather than low-level implementation details.

 

Specifically, for any fully-connected layer with d inputs and q outputs, the parameterization cost is O(dq), which can be prohibitively high in practice. Fortunately, this cost of transforming d inputs into q outputs can be reduced to O(dq/n), where the hyperparameter n can be flexibly specified by us to balance between parameter saving and model effectiveness in real-world applications.
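For concreteness, a rough sketch of the O(dq) count (the sizes d = 784 and q = 10 are just illustrative, not from the text above):

```python
from torch import nn

# A fully-connected layer with d = 784 inputs and q = 10 outputs:
# d * q weights plus q biases, i.e. O(dq) parameters overall.
layer = nn.Linear(784, 10)
num_params = sum(p.numel() for p in layer.parameters())
print(num_params)  # 7850 = 784 * 10 + 10
```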


 

5. Softmax Operation

Our goal is to interpret the outputs of our model as probabilities: we would like any output ŷj to be interpreted as the probability that a given item belongs to class j.

For example, if ŷ1, ŷ2, and ŷ3 are 0.1, 0.8, and 0.1, respectively, then we predict category 2.

But directly interpreting the output of the linear layer as a probability is a bit dangerous, because nothing constrains these numbers to sum to 1 and, depending on the inputs, they can take negative values.

So to transform our logits such that they become nonnegative and sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):

$$\hat{y}_j = \mathrm{softmax}(\mathbf{o})_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$$

This is the softmax function.

Hence ŷ = softmax(o) is a proper probability distribution.

Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an affine transformation of input features; thus, softmax regression is a linear model.

6. Vectorization for Minibatches

Assume that we are given a minibatch X of examples with feature dimensionality (number of inputs) d and batch size n. Moreover, assume that we have q categories in the output. Then the minibatch features satisfy X ∈ R^{n×d}, the weights satisfy W ∈ R^{d×q}, and the bias satisfies b ∈ R^{1×q}.

Since each row in X represents a data example, the softmax operation itself can be computed rowwise: for each row of O, exponentiate all entries and then normalize them by their sum. Broadcasting is triggered during the summation XW + b in (3.4.5), and both the minibatch logits O and the output probabilities Ŷ are n × q matrices.
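A minimal sketch of this rowwise computation (assuming PyTorch; the sizes n = 2, d = 4, q = 3 are illustrative):

```python
import torch

n, d, q = 2, 4, 3
X = torch.randn(n, d)        # minibatch features
W = torch.randn(d, q)        # weights
b = torch.randn(1, q)        # bias, broadcast across the n rows

O = X @ W + b                # minibatch logits, shape (n, q)

# Rowwise softmax: exponentiate every entry, then normalize each row by its sum.
# (Real implementations subtract the rowwise max first for numerical stability.)
Y_hat = torch.exp(O) / torch.exp(O).sum(dim=1, keepdim=True)
print(Y_hat.sum(dim=1))      # each row sums to 1
```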

7. Loss Function

We will rely on maximum likelihood estimation, the very same concept that we encountered when providing a probabilistic justification for the mean squared error objective in linear regression.

The softmax function gives us a vector ŷ, which we can interpret as estimated conditional probabilities of each class given any input x, e.g., ŷ1 = P(y = cat | x).

Our goal is to maximize P(Y | X). According to maximum likelihood estimation, this is equivalent to minimizing the negative log-likelihood

$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^{n} -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^{n} l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),$$

where for any pair of label y and model prediction ŷ over q classes, the loss function l is

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j.$$

The loss function above is commonly called the cross-entropy loss.

Since y is a one-hot vector of length q, the sum over all its coordinates j vanishes for all but one term. Since all ŷj are predicted probabilities, their logarithm is never larger than 0. Consequently, the loss function cannot be minimized any further if we correctly predict the actual label with certainty, i.e., if the predicted probability P(y | x) = 1 for the actual label y.
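As a small worked example (reusing the prediction (0.1, 0.8, 0.1) from above and assuming the true label is category 2), the loss reduces to the negative log of the probability assigned to the correct class:

```python
import math

y = [0.0, 1.0, 0.0]        # one-hot label: the true class is category 2
y_hat = [0.1, 0.8, 0.1]    # predicted probabilities

loss = -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat))
print(loss)                # -log(0.8) ≈ 0.223
```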

 

8. Softmax and Derivatives

Plugging softmax into the definition of the loss, we obtain:

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \frac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)} = \log \sum_{k=1}^{q} \exp(o_k) - \sum_{j=1}^{q} y_j o_j.$$

Now consider the derivative with respect to any logit oj. We get:

$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.$$

In other words, the derivative is the difference between the probability assigned by our model, as expressed by the softmax operation, and what actually happened, as expressed by elements in the one-hot label vector.
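This identity is easy to check numerically; a minimal sketch (assuming PyTorch and illustrative logits over q = 3 classes):

```python
import torch
import torch.nn.functional as F

o = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)  # logits
y = torch.tensor([0.0, 1.0, 0.0])                      # one-hot label

# Cross-entropy of softmax(o) against y, written via log-softmax.
loss = -(y * F.log_softmax(o, dim=0)).sum()
loss.backward()

print(o.grad)                    # gradient computed by autograd
print(F.softmax(o, dim=0) - y)   # softmax(o) - y: the two match
```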

9. Cross-Entropy Loss

We can use the same representation as before for the label y. The only difference is that rather than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability vector, say (0.1, 0.2, 0.7). The math that we used previously to define the loss l in (3.4.8) still works out fine, just that the interpretation is slightly more general. It is the expected value of the loss for a distribution over labels. This loss is called the cross-entropy loss and it is one of the most commonly used losses for classification problems.
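A short sketch with a soft label (the vector (0.1, 0.2, 0.7) is from the paragraph above; the prediction (0.2, 0.3, 0.5) is a made-up, hypothetical example):

```python
import torch

y = torch.tensor([0.1, 0.2, 0.7])      # a generic probability vector over the labels
y_hat = torch.tensor([0.2, 0.3, 0.5])  # hypothetical predicted probabilities

# Same formula as before: the expected value of -log(prediction) under y.
loss = -(y * torch.log(y_hat)).sum()
print(loss)
```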

 

10. Information Theory Basics

Entropy

The central quantity of information theory, for a distribution P, is

$$H[P] = \sum_j -P(j) \log P(j).$$

This is called the entropy of a distribution P.

This quantity places a hard limit on our ability to compress the data.
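For instance (a quick sketch reusing the probability vector (0.1, 0.2, 0.7) that appears earlier in this post; entropy measured in nats, i.e. with the natural logarithm):

```python
import math

P = [0.1, 0.2, 0.7]
H = sum(-p * math.log(p) for p in P)
print(H)  # ≈ 0.80 nats
```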

 

Surprisal

If it is always easy for us to predict the next token, then this data is easy to compress!

However, if we cannot perfectly predict every event, then we might sometimes be surprised.

The entropy is then the expected surprisal when one assigns the correct probabilities that truly match the data-generating process.
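Written out, the surprisal of an event j to which we assigned probability P(j) is −log P(j) (low-probability events are more surprising), and the entropy is exactly its expectation under the true distribution:

$$H[P] = \sum_j P(j) \log \frac{1}{P(j)} = E_{j \sim P}\big[-\log P(j)\big].$$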

Cross-Entropy Revisited

So if entropy is the level of surprise experienced by someone who knows the true probability, then you might be wondering, what is cross-entropy? The cross-entropy from P to Q, denoted H(P, Q), is the expected surprisal of an observer with subjective probabilities Q upon seeing data that were actually generated according to probabilities P. The lowest possible cross-entropy is achieved when P = Q; in this case, the cross-entropy from P to Q is H(P, P) = H(P).
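In formula form, the cross-entropy from P to Q is

$$H(P, Q) = \sum_j -P(j) \log Q(j),$$

which indeed reduces to H(P) when Q = P.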

In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.
