Regression is the hammer we reach for when we want to answer "how much?" or "how many?" questions.
In practice, we are more often interested in classification: asking not “how much” but “which one”:
- Classification
- Network Architecture
- Initializing Model Parameters
- Parameterization Cost of Fully-Connected Layers
- Softmax Operation
- Vectorization for Minibatches
- Loss Function
- Softmax and Derivatives
- Cross-Entropy Loss
- Information Theory Basics
1. Classification
Here, each input consists of a 2 × 2 grayscale image. We can represent each pixel value with a single scalar, giving us four features x1, x2, x3, x4.
For the labels we can use one-hot encoding: a vector with as many components as we have categories, where the component corresponding to a particular instance's category is set to 1 and all other components are set to 0.
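A minimal sketch of one-hot encoding in NumPy (the class count and label index below are made up for illustration):

```python
import numpy as np

num_classes = 3        # illustrative: three possible categories
label = 1              # category index of one particular example

# One-hot encoding: a vector with num_classes components,
# with a 1 at the example's category and 0 everywhere else.
y = np.zeros(num_classes)
y[label] = 1.0
print(y)  # [0. 1. 0.]
```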
2. Network Architecture
In order to estimate the conditional probabilities associated with all the possible classes, we need a model with multiple outputs, one per class.
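Concretely, for the toy example above (four features and, say, three classes, the class count being illustrative), the model computes one affine function per class. Writing x as a row vector, to match the minibatch convention used later:

$$\mathbf{o} = \mathbf{x}\mathbf{W} + \mathbf{b}, \qquad \mathbf{x} \in \mathbb{R}^{1 \times 4},\; \mathbf{W} \in \mathbb{R}^{4 \times 3},\; \mathbf{b} \in \mathbb{R}^{1 \times 3},$$

so each output $o_j = \sum_i x_i w_{ij} + b_j$ depends on all four inputs; in other words, the output layer is a fully-connected layer.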
3. Parameterization Cost of Fully-Connected Layers
We can use a framework's predefined layers.
Specifically, for any fully-connected layer with d inputs and q outputs, the parameterization cost is O(dq), which can be prohibitively high in practice. Fortunately, this cost of transforming d inputs into q outputs can be reduced to O(dq/n), where the hyperparameter n can be flexibly specified by us to balance between parameter saving and model effectiveness in real-world applications.
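For a rough, illustrative sense of scale: a single fully-connected layer mapping a flattened 28 × 28 image (d = 784) to q = 10 classes already requires 784 × 10 = 7,840 weights plus 10 biases.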
5. Softmax Operation
Our goal is to interpret the outputs of our model as probabilities: we would like any output ŷ_j to be interpreted as the probability that a given item belongs to class j. For example, if ŷ_1, ŷ_2, and ŷ_3 are 0.1, 0.8, and 0.1, respectively, then we predict category 2.
However, directly interpreting the outputs of the linear layer as probabilities is problematic, because nothing constrains these numbers to sum to 1 and, depending on the inputs, they can take negative values.
So to transform our logits such that they become nonnegative and sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}.$$

This is the softmax function, and ŷ is a proper probability distribution.
Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an affine transformation of input features; thus, softmax regression is a linear model.
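As a quick illustrative check with made-up logits o = (1, −1, 2): exponentiation gives roughly (2.72, 0.37, 7.39), whose sum is about 10.48, so softmax(o) ≈ (0.259, 0.035, 0.705): nonnegative, summing to 1, and the largest logit keeps the largest probability.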
6. Vectorization for Minibatches
Assume that we are given a minibatch X of examples with feature dimensionality (number of inputs) d and batch size n. Moreover, assume that we have q categories in the output. Then the minibatch features X are in R^{n×d}, the weights satisfy W ∈ R^{d×q}, and the bias satisfies b ∈ R^{1×q}.
Since each row in X represents a data example, the softmax operation can be computed rowwise: for each row of O, exponentiate all entries and then normalize them by their sum. The minibatch version of the model is

$$\mathbf{O} = \mathbf{X}\mathbf{W} + \mathbf{b}, \qquad \hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O}),$$

where broadcasting is triggered during the summation XW + b; both the minibatch logits O and the output probabilities Ŷ are n × q matrices.
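A minimal NumPy sketch of this vectorized computation (shapes and random values are made up for illustration):

```python
import numpy as np

n, d, q = 4, 5, 3                      # batch size, input features, classes (illustrative)
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))            # minibatch features, shape (n, d)
W = rng.normal(size=(d, q)) * 0.01     # weights, shape (d, q)
b = np.zeros((1, q))                   # bias, shape (1, q); broadcast over the n rows

O = X @ W + b                          # logits, shape (n, q)

# Row-wise softmax: exponentiate, then normalize each row by its own sum.
# (Subtracting the row-wise max first is a common trick for numerical stability.)
O_shift = O - O.max(axis=1, keepdims=True)
Y_hat = np.exp(O_shift) / np.exp(O_shift).sum(axis=1, keepdims=True)

print(Y_hat.shape)        # (4, 3)
print(Y_hat.sum(axis=1))  # each row sums to 1
```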
7. Loss Function
We will rely on maximum likelihood estimation, the very same concept that we encountered when providing a probabilistic justification for the mean squared error objective in linear regression.
The softmax function gives us a vector ŷ, which we can interpret as the estimated conditional probabilities of each class given any input x, e.g., ŷ_1 = P(y = cat | x).
Our goal is to maximize P(Y | X). By maximum likelihood estimation, this is equivalent to minimizing the negative log-likelihood:

$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^{n} -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^{n} l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),$$

where for any pair of label y and model prediction ŷ over q classes, the loss function l is

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j.$$
The loss function above is commonly called the cross-entropy loss.
Since y is a one-hot vector of length q, the sum over all its coordinates j vanishes for all but one term. Since all ŷ_j are predicted probabilities, their logarithm is never larger than 0. Consequently, the loss function cannot be minimized any further if we correctly predict the actual label with certainty, i.e., if the predicted probability P(y | x) = 1 for the actual label y.
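A minimal sketch of this loss for a single example, assuming y_hat already comes out of a softmax (function name and values are illustrative):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy -sum_j y_j * log(y_hat_j) for one example.

    y     : label as a probability vector over q classes (one-hot here)
    y_hat : predicted probabilities over q classes (softmax output)
    eps   : small constant to avoid log(0)
    """
    return -np.sum(y * np.log(y_hat + eps))

y     = np.array([0.0, 1.0, 0.0])        # true label: category 2 (index 1)
y_hat = np.array([0.259, 0.035, 0.705])  # example predicted probabilities

print(cross_entropy(y, y_hat))  # -log(0.035) ≈ 3.35
```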
8. Softmax and Derivatives
Plugging softmax into the definition of the loss, we obtain:

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \frac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)} = \sum_{j=1}^{q} y_j \log \sum_{k=1}^{q} \exp(o_k) - \sum_{j=1}^{q} y_j o_j = \log \sum_{k=1}^{q} \exp(o_k) - \sum_{j=1}^{q} y_j o_j,$$

using the fact that the y_j sum to 1. Now consider the derivative with respect to any logit o_j. We get:

$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.$$
In other words, the derivative is the difference between the probability assigned by our model, as expressed by the softmax operation, and what actually happened, as expressed by elements in the one-hot label vector.
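A small numerical sanity check of this identity, comparing a finite-difference gradient against softmax(o) - y (the logits and label below are made up):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())                  # shift by the max for numerical stability
    return e / e.sum()

def loss(o, y):
    return -np.sum(y * np.log(softmax(o)))   # cross-entropy of softmax(o) against y

o = np.array([1.0, -1.0, 2.0])               # arbitrary logits
y = np.array([0.0, 1.0, 0.0])                # one-hot label

analytic = softmax(o) - y                    # the derivative claimed above

# Central finite-difference approximation of the gradient.
eps = 1e-6
numeric = np.zeros_like(o)
for j in range(len(o)):
    o_plus, o_minus = o.copy(), o.copy()
    o_plus[j] += eps
    o_minus[j] -= eps
    numeric[j] = (loss(o_plus, y) - loss(o_minus, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```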
9. Cross-Entropy Loss
We can use the same representation as before for the label y. The only difference is that rather than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability vector, say (0.1, 0.2, 0.7). The math that we used previously to define the loss l still works out fine; the interpretation is just slightly more general: it is the expected value of the loss for a distribution over labels. This loss is called the cross-entropy loss, and it is one of the most commonly used losses for classification problems.
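For instance, with the soft label y = (0.1, 0.2, 0.7) the same formula reads l = −(0.1 log ŷ_1 + 0.2 log ŷ_2 + 0.7 log ŷ_3): each class's log-probability is weighted by how much probability the label assigns to it. Note that the cross_entropy sketch above never assumes y is one-hot, so it already covers this case.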
10. Information Theory Basics
Entropy
The amount of information contained in data drawn from a distribution P is quantified by

$$H[P] = \sum_j -P(j) \log P(j).$$

This is called the entropy of a distribution P.
This quantity places a hard limit on our ability to compress the data.
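As a quick illustrative example, a fair coin, P = (0.5, 0.5), has entropy H[P] = −0.5 log 0.5 − 0.5 log 0.5 = log 2 (about 0.69 nats with the natural logarithm, or exactly 1 bit with the base-2 logarithm), while a deterministic outcome, P = (1, 0), has entropy 0: there is nothing left to communicate.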
Surprisal
If it is always easy for us to predict the next token, then this data is easy to compress!
However, if we cannot perfectly predict every event, then we might sometimes be surprised. Our surprisal at observing an event j, having assigned it a (subjective) probability P(j), is quantified as −log P(j): the smaller the probability we assigned, the greater the surprisal.
The entropy is then the expected surprisal when one assigns the correct probabilities that truly match the data-generating process.
Cross-Entropy Revisited
So if entropy is level of surprise experienced by someone who knows the true probability, then you might be wondering, what is cross-entropy? The cross-entropy from P to Q, denoted H(P,Q), is the expected surprisal of an observer with subjective probabilities Q upon seeing data that were actually generated according to probabilities P . The lowest possible cross-entropy is achieved when P = Q. In this case, the cross-entropy from P to Q is H(P,P) = H(P).
In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.
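In symbols, H(P, Q) = Σ_j −P(j) log Q(j). A minimal NumPy sketch (the two distributions below are made up) illustrating that a mismatched Q gives a larger value than H(P), while Q = P attains the minimum H(P):

```python
import numpy as np

def entropy(p):
    """H[P] = sum_j -P(j) * log P(j), here in nats (natural log)."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(P, Q) = sum_j -P(j) * log Q(j)."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

p = np.array([0.1, 0.2, 0.7])   # distribution that actually generates the data
q = np.array([0.3, 0.3, 0.4])   # an observer's mismatched subjective distribution

print(entropy(p))           # ≈ 0.80 nats
print(cross_entropy(p, q))  # ≈ 1.00 nats: larger than H(P)
print(cross_entropy(p, p))  # equals H(P): the lowest possible cross-entropy
```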