Now we take a closer look at a specific CNN architecture, depicted in Figure 3. The CNN consists of several layers connected in sequence. The first is the embedding layer, which takes integer-encoded texts as input and initializes the word embedding weights either randomly or from pre-trained embeddings. These weights are updated in each epoch of training (an epoch is one full pass through the training data). In Figure 3, the word embeddings are shown for the words in the text “Green shirt for men regular fit”, where for illustration purposes the word embeddings have dimension 5.
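As a minimal illustration, the sketch below builds such an embedding layer with a Keras-style API; the vocabulary size and the integer word indices are hypothetical, and the 5-dimensional embeddings mirror Figure 3.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size = 10_000   # assumed vocabulary size
embedding_dim = 5     # dimension used for illustration in Figure 3

# The embedding weights start random and are updated during training.
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)

# Integer-encoded text (hypothetical indices for the six words of
# "Green shirt for men regular fit").
text = np.array([[17, 402, 3, 56, 981, 74]])
vectors = embedding(text)   # shape: (1, 6, 5) -- one 5-dim vector per word
```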
The next layer is the convolutional layer. It takes the sequences of word embeddings as input and creates feature vectors by analysing the word embeddings of each text. This is done using convolution filters. A convolution filter is a matrix of weights that considers multiple consecutive words simultaneously and slides through the whole text to create a feature map. This operation is performed for every text with multiple convolution filters, so that various relationships between the words can be detected. The filters may also differ from each other in height, which determines how many consecutive words a filter covers in each step. To obtain the feature vectors, a bias term is added to each feature map resulting from the convolution operation, and an activation function is applied to introduce non-linearity (ReLU is the most commonly used activation function). This non-linearity enables the network to learn more complex relationships.
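The following sketch illustrates this step, assuming Keras Conv1D layers; the filter heights (2 and 3 words) and the number of filters are illustrative choices.

```python
import numpy as np
from tensorflow.keras.layers import Conv1D

# Stand-in for the embedded text: 6 words, each a 5-dim vector.
embeddings = np.random.rand(1, 6, 5).astype("float32")

# Two filter heights: bigrams (2 words) and trigrams (3 words),
# each with 4 filters; ReLU adds the non-linearity described above.
conv_bigram = Conv1D(filters=4, kernel_size=2, activation="relu")
conv_trigram = Conv1D(filters=4, kernel_size=3, activation="relu")

feature_map_2 = conv_bigram(embeddings)   # shape: (1, 5, 4)
feature_map_3 = conv_trigram(embeddings)  # shape: (1, 4, 4)
```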
The pooling layer takes the variable-length feature vectors from the convolutional layer as input and turns them into fixed-length vectors. In creating these fixed-length feature vectors, the less relevant local information is discarded.
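A common way to do this is max-over-time pooling, which keeps only the largest value of each feature map; the sketch below assumes this choice, again in Keras.

```python
import numpy as np
from tensorflow.keras.layers import GlobalMaxPooling1D

# Stand-in feature map: 5 positions for each of 4 filters.
feature_map = np.random.rand(1, 5, 4).astype("float32")

# One value per filter survives, so the output length no longer
# depends on the length of the input text.
pooled = GlobalMaxPooling1D()(feature_map)   # shape: (1, 4)
```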
The final layer of the CNN is the softmax layer. It takes the fixed-length feature vectors as input and first passes them through a fully-connected layer, which is an efficient way of learning non-linear combinations of the features. The output of this fully-connected layer is a numerical score for each class. To give these numbers a straightforward interpretation, the softmax function is applied, which forces the output of the CNN to represent predicted probabilities for each of the classes. The class with the highest predicted probability is the predicted class of the CNN.
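Putting the layers together, a minimal sketch of the full architecture could look as follows; all sizes (vocabulary, number of filters, number of classes) are illustrative.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D,
                                     GlobalMaxPooling1D, Dense)

num_classes = 10  # assumed number of classes

model = Sequential([
    Embedding(input_dim=10_000, output_dim=5),              # embedding layer
    Conv1D(filters=100, kernel_size=3, activation="relu"),  # convolutional layer
    GlobalMaxPooling1D(),                                   # pooling layer
    Dense(64, activation="relu"),                           # fully-connected layer
    Dense(num_classes, activation="softmax"),               # predicted probabilities
])
```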
During training of the CNN, the weights in the embedding, convolutional and softmax layers are updated in each epoch using the categorical cross-entropy loss function. This process of updating the weights is called back-propagation, and it is the essence of neural network training. Back-propagation propagates the total loss obtained from the forward pass in that epoch (i.e., the run through the network) back into the CNN to determine what portion of the loss each node is responsible for, and subsequently updates the weights so as to minimize the loss, giving the nodes with higher error contributions lower weights and vice versa.
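A minimal training sketch, rebuilding the illustrative architecture above so that it runs on its own and using purely random stand-in data, could look as follows:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D,
                                     GlobalMaxPooling1D, Dense)

model = Sequential([
    Embedding(input_dim=10_000, output_dim=5),
    Conv1D(filters=100, kernel_size=3, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(64, activation="relu"),
    Dense(10, activation="softmax"),
])

X_train = np.random.randint(0, 10_000, size=(500, 6))  # integer-encoded texts
y_train = np.eye(10)[np.random.randint(0, 10, 500)]    # one-hot class labels

# Categorical cross-entropy drives the updates: each epoch performs a
# forward pass, computes the loss, and back-propagates the gradients.
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X_train, y_train, epochs=10, batch_size=32)
```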
Model 3: Soft Vote
Until now, we have seen two different text classification methods, the logistic regression and CNN text classifiers. We can try to improve on both individual models by combining the two into an ensemble text classifier. The two classifiers can be combined through a so-called Soft Voting (SV) classifier, which predicts classes by taking a weighted average of the class probabilities produced by the LR and CNN text classifiers. Such a voting classifier can be useful for similarly performing classifiers, as it balances out their individual weaknesses. For the SV text classifier, we have to determine the weights to assign to the predicted probabilities of the LR and CNN classifiers. To do so, we use a validation set of texts that were not used for training the LR and CNN text classifiers: the SV weights are chosen to maximize the number of correct class predictions on this validation set.
Once the weights have been determined, the soft vote model is ready to calculate the weighted predicted probabilities for each class. The class predicted by the SV text classifier is then the class with the highest weighted predicted probability.
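A minimal sketch of this procedure is given below; the grid search over the LR weight is one simple way to maximize validation accuracy, and the probability matrices `lr_proba` and `cnn_proba` are assumed to come from the two trained classifiers.

```python
import numpy as np

def soft_vote(lr_proba, cnn_proba, w):
    """Weighted average of the two class-probability matrices."""
    return w * lr_proba + (1 - w) * cnn_proba

def fit_weight(lr_proba, cnn_proba, y_true, grid=np.linspace(0, 1, 101)):
    """Pick the LR weight that maximizes accuracy on the validation set."""
    accuracies = [
        np.mean(np.argmax(soft_vote(lr_proba, cnn_proba, w), axis=1) == y_true)
        for w in grid
    ]
    return grid[int(np.argmax(accuracies))]

# Usage (validation/test probabilities assumed available):
# best_w = fit_weight(lr_proba_val, cnn_proba_val, y_val)
# preds  = np.argmax(soft_vote(lr_proba_test, cnn_proba_test, best_w), axis=1)
```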
Risk Level Prediction
As mentioned earlier, there is uncertainty about the correctness of the class predictions produced by a text classifier. For this reason, we describe a method to assess the risk of these class predictions: Risk Level Prediction (RLP). This method provides a risk level for every class prediction of the text classifier. The goal of the risk levels is to provide a critical view on the class predictions, indicating how uncertain the classifier is about a particular prediction. In this way, only the class predictions with a higher risk level have to be checked manually, since these are the uncertain ones.
In multi-class classification, when predicting the class corresponding to a text, the class achieving the highest predicted probability from the text classifier is assigned to the text. One might argue that this predicted probability could serve as an appropriate indication of how risky the class prediction is: the higher the predicted probability, the more certain the text classifier is that its prediction is correct. Although the predicted probability of the assigned class sounds like an appropriate risk indicator on its own, it may be even more beneficial to combine several indicators of risk to predict the risk level of a class prediction more accurately.
The idea of RLP is to train a second classifier that takes certain risk indicators as input and outputs whether the class prediction is ‘correct’ or ‘wrong’. Based on these predictions, we assign the risk levels. Since this classifier assigns either ‘correct’ or ‘wrong’ to a class prediction, it is a binary classifier. To train it, we split the training data set into two subsets: the first subset is used to train the text classifier, and the second subset is used to train the classifier of the RLP. The training procedure, of which a code sketch follows the list below, is as follows:
- The text classifier is trained on the first subset of the training data in the usual way.
- Since the text classifier has now been trained, we can apply it to predict the classes of the texts in the second subset of the training data. For each text we then have both the predicted class from the text classifier and the true pre-assigned class, and by comparing the two we can check whether the prediction was correct or wrong. This binary variable, ‘correct’ or ‘wrong’, is the target variable of the RLP classifier. Using risk indicators as features, we train this binary classifier so that it can predict for new texts whether the class assigned by the text classifier is correct or wrong.
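The sketch below illustrates this two-step procedure, assuming scikit-learn and a generic `text_clf` object with fit/predict/predict_proba methods; the risk-indicator features used here (the two highest predicted probabilities and the text length) are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

def train_rlp(text_clf, texts, labels):
    """texts: list of strings, labels: numpy array of class labels."""
    # Split the training data: subset 1 for the text classifier,
    # subset 2 for the binary classifier of the RLP.
    X1, X2, y1, y2 = train_test_split(texts, labels, test_size=0.3)

    # Step 1: train the text classifier in the usual way.
    text_clf.fit(X1, y1)

    # Step 2: predict on subset 2 and compare with the true classes.
    proba = text_clf.predict_proba(X2)
    predicted = text_clf.predict(X2)
    correct = (predicted == y2).astype(int)   # binary target: correct/wrong

    # Risk indicators as features for the binary classifier.
    top2 = np.sort(proba, axis=1)[:, -2:]     # two highest probabilities
    lengths = np.array([len(t.split()) for t in X2]).reshape(-1, 1)
    features = np.hstack([top2, lengths])

    return RandomForestClassifier().fit(features, correct)
```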
We want the classifier of the RLP to accurately predict whether a class prediction of the text classifier is correct or wrong; when it does, we can assign accurate risk levels based on the predictions of this binary classifier. To achieve this, we have to select features that are able to indicate the risk of a class prediction. Useful features include the top-k predicted probabilities (i.e., the k highest predicted probabilities), sentence statistics (e.g., the length of the text), and features on the behaviour of the classes (e.g., the F1-score of the class).
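As an illustration of the last type of feature, the sketch below computes a per-class F1-score on held-out data and attaches it to each prediction; the helper name is hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

def class_f1_feature(y_true, y_pred):
    """Map each predicted class to its F1-score on held-out data."""
    classes = np.unique(y_true)
    f1 = f1_score(y_true, y_pred, labels=classes, average=None)
    f1_by_class = dict(zip(classes, f1))
    # A class that the text classifier handles poorly contributes a
    # low F1 value, signalling a riskier prediction.
    return np.array([f1_by_class[c] for c in y_pred])
```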