Improving automatic product classification for e-commerce with image recognition

E-commerce is booming and the field of machine learning is maturing, so it is no wonder that these two fields are becoming more interwoven than ever before. For consumers this is most noticeable in product recommendations, which are driven by intelligent algorithms. A less noticeable aspect driving e-commerce is its ability to facilitate easy search, that is, enabling consumers to find the products they’re looking for (in the place they would expect them to be). Behind the scenes these products are stored in e-catalogs, also called product information management (PIM) systems. This brings us to the phenomenon of product taxonomies: tree structures that are responsible for allocating the different products to their respective categories. These taxonomies can be graphically represented as a hierarchical structure (Kim, Lee, Chun, & Lee, 2006), as shown in figure 1.

Figure 1: Example of a product taxonomy

A product taxonomy simply narrows down from categories to their subcategories, which in the figure above would lead a customer to all Adidas Predator football shoes, for example. One of the key strengths of e-commerce businesses is their virtually infinite shelf space, which allows consumers to browse through an immense number of products (Elberse, 2008). However, as you might have realized already, these products all need to be categorized in the taxonomy first. For many companies this is still a manual task performed by their employees, which can take up quite a lot of time. An increasing number of companies are therefore making the shift to automated product classification, that is, using artificial intelligence, and more specifically machine learning, to do the classification task. Often these companies outsource this task to companies specialized in machine learning, companies such as ours! Automating the classification process can save a huge amount of time and therefore cost.

Figure 2: from product information to product category

Lately, we have applied one of our newest algorithms to the product classification task, for a company that was able to hand us not only textual product data, but also each product’s image. In general, most product classification tasks are performed using the textual data only, due to the satisfactory performance that this approach delivers. However, the use of images in combination with textual product information is promising! By using a hybrid approach, combining several algorithms to deal with the text and image data, an increase in classification performance can be achieved.

You might wonder: ‘Why is it necessary to combine several algorithms to do the classification task?’ Well, this has to do with the heterogeneous nature of the data; learning from such data is also called multimodal learning (Lahat, Adali, & Jutten, 2014; Ngiam, Khosla, Kim, Nam, Lee, & Ng, 2011). The difference between these types of data is pointed out in a striking manner by Srivastava and Salakhutdinov (2012): ‘Text is usually represented as discrete sparse word count vectors, whereas an image is represented using pixel intensities or outputs of feature extractors which are real-valued and dense’ (p.1). Therefore, an approach that makes these data sources equal or complementary is required. Complementarity is one of the characteristics of multimodal data, meaning that each data source adds value to the whole that cannot be deduced from any single other source (Lahat et al., 2014). Thus, to wrap up the intuition behind multimodality, we are aiming to achieve synergy between sources, in which the whole is greater than the sum of its parts.

The hybrid approach that is responsible for achieving synergy between text and image can be represented as a pyramid, with a base and top layer. The base layer consists of two algorithms, one specifically for text data and one specifically for image data. The results of this base layer flow into the top layer, which consists of a single algorithm that provides us with the final predictions of product classes. A simple representation of the hybrid model is shown in figure 3; a more in-depth elaboration will follow along the way.

The representation in figure 3 simply shows that we have two separate inputs, each of which is fed to a different classifier. The outputs of both classifiers are fed to a meta-learner, which is another classification algorithm, and the outputs of this last algorithm are the predicted classes for the products.

What we actually see visualized in figure 3 is an example of an ensemble model. Ensemble models stem from the field of ensemble learning (Džeroski & Ženko, 2004), which focuses on creating ensemble models based on existing classifiers. From a non-technical perspective, ensemble models can be compared to the human mind. When we have an important decision to make, we often evaluate several viewpoints before providing a conclusive answer. A similar kind of reasoning underlies ensemble models, as ‘the main idea behind the ensemble methodology is to weigh several individual classifiers, and combine them in order to obtain a classifier that outperforms every one of them’ (Rokach, 2010, p. 1). So, with an ensemble model we want to outperform the individual classifiers on which it is built. That is exactly what we as a company want to achieve: providing a better service for our customers, by all means possible!

Figure 3: Simple representation of the hybrid model

The field of ensemble learning offers several architectures to form an ensemble, such as bagging, boosting, and stacking. Here only stacking will be discussed, but an abundance of information about the other two architectures can be found with a quick Google search. Let us move on to stacking, the ensemble architecture we use for our product classification task. In a stacking ensemble, the outputs of the classifiers of the base layer are used as inputs for the classifier in the top layer. The base layer classifiers can be heterogeneous (Graczyk, Lasota, Trawiński, & Trawiński, 2010), in the sense that different models can be used, which also allows each model to work on a different dataset. The outputs generated by the base level classifiers are either each product’s predicted class, or the probabilities of the product belonging to each possible class. An example of each output is shown in figure 4.

Figure 4: The two possible types of outputs

Let’s zoom in on our approach…

As you can see, the output of the base level classifiers can either be all or nothing, meaning that they predict the one class they think a product belongs to, or they can return the probability for each possible class. The next step is that these outputs form the input for the meta-learner. This means that the predictions for each class, in one of the two formats above, are used as input. The example in figure 4 has 5 different classes, so each base classifier would contribute 5 features to the input of the meta-learner. Based on these features the final predictions are made. By following these steps, we have combined textual data and image data, and thereby established a synergy that led to an improved product classification service!
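To make the two output formats concrete, here is a minimal sketch using NumPy. The class names and probability values are made up for illustration and are not taken from our client's data; the helper `to_hard_output` is a hypothetical name.

```python
import numpy as np

CLASSES = ["A", "B", "C", "D", "E"]  # five hypothetical product classes

# Made-up probability outputs of the two base classifiers for two products.
text_proba = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.50, 0.10, 0.10],
])
image_proba = np.array([
    [0.60, 0.25, 0.05, 0.05, 0.05],
    [0.05, 0.60, 0.25, 0.05, 0.05],
])


def to_hard_output(proba):
    """Turn probabilities into an all-or-nothing (one-hot) prediction."""
    hard = np.zeros_like(proba)
    hard[np.arange(len(proba)), proba.argmax(axis=1)] = 1.0
    return hard


# Each base classifier contributes one feature per class, so the
# meta-learner sees 5 + 5 = 10 features per product in either format.
meta_features_soft = np.hstack([text_proba, image_proba])
meta_features_hard = np.hstack(
    [to_hard_output(text_proba), to_hard_output(image_proba)]
)
```

Note how the probability format keeps the information that the two base classifiers disagree on the second product, and by how much, whereas the all-or-nothing format only records the two winning classes.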

As you understand by now, our ensemble model uses a different classifier for each type of data and then on top of that uses a classifier to combine both data types. In the coming paragraphs an elaboration will follow on each classifier, to provide some more detail of our approach.

Text classifier

The textual data that is available to us can be classified relatively easily, either with simple algorithms such as k-nearest neighbors or with more complex algorithms such as neural networks. In our approach we made a setup that evaluates several different classifiers and picks the best performing one (in terms of final accuracy of the complete ensemble model) to be used as text classifier in the ensemble model. In practice it can happen that, for example, logistic regression performs best on its own, but the highest final accuracy of the complete ensemble model is reached when k-nearest neighbors is used as text classifier.
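The selection idea can be sketched as follows, assuming scikit-learn. Everything here is synthetic and illustrative: the data is random, the image classifier is replaced by made-up probabilities, and the linear SVM on top as well as the helper `ensemble_accuracy` are assumptions for this sketch rather than our production setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_products = 3, 300
labels = rng.integers(0, n_classes, size=n_products)

# Synthetic stand-ins: noisy "text" features and fixed "image" probabilities.
one_hot = np.eye(n_classes)[labels]
text_features = one_hot + rng.normal(scale=0.8, size=one_hot.shape)
image_scores = one_hot + rng.normal(scale=1.0, size=one_hot.shape)
image_proba = np.exp(image_scores) / np.exp(image_scores).sum(axis=1, keepdims=True)

# Base classifiers learn on `train`, the meta-learner on `val`,
# and the complete ensemble is scored on `test`.
train, val, test = slice(0, 100), slice(100, 200), slice(200, 300)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(),
}


def ensemble_accuracy(text_clf):
    """Accuracy of the whole ensemble when `text_clf` is the text classifier."""
    text_clf.fit(text_features[train], labels[train])
    meta_x = np.hstack([text_clf.predict_proba(text_features), image_proba])
    meta = SVC(kernel="linear").fit(meta_x[val], labels[val])
    return float((meta.predict(meta_x[test]) == labels[test]).mean())


scores = {name: ensemble_accuracy(clf) for name, clf in candidates.items()}
best_text_classifier = max(scores, key=scores.get)
```

The point of the design is that each candidate is judged by the accuracy of the ensemble it ends up in, not by its own stand-alone accuracy.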

To use these algorithms for classification, several text preprocessing steps have to be applied first. For text classification tasks it is common practice to use the bag-of-words (BoW) method, which simply is a ‘bag’ of all the unique words occurring in the dataset. To make sure that the most informative words are included in this bag, we first clean the text by removing stop words, applying lemmatization and similar practices. This leads to a more informative BoW, and should thus lead to an improved classification accuracy. To turn the words in the bag into numerical features, well-known scikit-learn tools such as CountVectorizer and TfidfTransformer are applied. The former simply counts each word’s occurrence per product, while the latter calculates a weighted frequency score based on each word’s occurrence per product and this same word’s occurrence throughout all products. Training and testing each classifier happens via a typical setup, such as the one shown in figure 5.
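As a small illustration of these two steps, the sketch below runs scikit-learn's `CountVectorizer` and `TfidfTransformer` on three made-up product descriptions. The descriptions are hypothetical, and lemmatization is left out because it would need an additional library.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical product descriptions, not taken from the client's data.
products = [
    "adidas predator football shoes",
    "nike football shoes",
    "adidas running shoes",
]

# Step 1: count each word's occurrence per product, dropping English stop words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(products)

# Step 2: reweight the counts so that words occurring in many products
# (such as "shoes") weigh less than words that are specific to a few.
tfidf = TfidfTransformer().fit_transform(counts)

# The bag of words, ordered by column index of the feature matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
first_product = dict(zip(vocabulary, tfidf[0].toarray()[0]))
```

For the first product, `first_product["predator"]` ends up higher than `first_product["shoes"]`, because "shoes" appears in every description and therefore says little about the category.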

Figure 5: The typical text classification setup (Credits).

In this setup the label corresponds to the product category, the input document is the textual product information, and the features are the unique words in each sentence (represented via their TF-IDF score). One peculiarity is the use of the One-Versus-Rest (OVR) method, meaning that a product is classified via the intuition of one class against all other classes. This is different from the standard approach of considering all classes at once. So instead of asking: “Does this product belong to class A, B, C or D?”, the question now is: “Does this product belong to class A or to one of the other classes?”, and this question is asked once for every class.
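A minimal sketch of the OVR intuition, assuming scikit-learn's `OneVsRestClassifier` wrapped around logistic regression. The two-dimensional synthetic features stand in for real TF-IDF features, and the four classes A to D are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for TF-IDF features: four well-separated classes.
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5]], dtype=float)
labels = np.repeat(["A", "B", "C", "D"], 25)
features = np.repeat(centers, 25, axis=0) + rng.normal(scale=0.5, size=(100, 2))

# One binary "this class or one of the others?" model is trained per class.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(features, labels)

n_binary_models = len(ovr.estimators_)  # one per class
proba = ovr.predict_proba(features[:1])  # per-class probabilities for one product
predicted = ovr.predict(features[:1])[0]
```

Each of the binary models answers the question for its own class, and the class whose model is most confident becomes the prediction.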

Figure 6: The change in intuition

Once the classifier is trained and tested, actual predictions are made on the validation dataset. The predictions can either be in the form of a single predicted class, or a probability for each class (as mentioned previously). In our case we observed that the probabilities led to a higher accuracy score of the ensemble model, so we use the probabilities for each class per product as input for the meta-learner.

Image classifier

Next in line is the image dataset. For this data type we picked a rather complex algorithm for classification. The enormous advances made in the field of computer vision in recent years are mainly due to neural network architectures such as AlexNet, which sparked the renewed interest in neural networks, and more recently to networks such as Inception ResNet V2. For most computer vision tasks these models are superior to old-school approaches such as logistic regression.

One obstacle that must be tackled when using these models is the fact that they require massive amounts of images to be trained, on the order of thousands per class. Our client’s dataset did not come close to this: its largest classes contained hundreds of products, and many classes had fewer than one hundred. However, models such as AlexNet are available pre-trained on millions of images coming from thousands of classes. Therefore, the feature extraction parameters are already well-trained, which allows us to use a technique called transfer learning. See figure 7.

Figure 7: Training a neural network with transfer learning (Credits).

In short, this means that we can reuse such a pre-trained model and apply it to a new dataset (in a different field), even if this dataset is much smaller! We can still fine-tune the model to learn to recognize domain-specific features, but the most generic, low-level features such as edges and corners have already been learned by the model. This often leads to satisfactory results in different domains, which is also the case here!

The image classifier that we have chosen for our ensemble is the Inception ResNet V2 model, mainly because the model is easily accessible, has a pythonic TensorFlow implementation, and is a state-of-the-art model. This model requires input images of a fixed size, namely 299×299 pixels (height × width). As each image in our dataset was of variable size, image preprocessing steps have been taken to make this happen. Every image is resized to the specified format, but not by simply stretching the height and width to these dimensions, as this would distort the proportions of most images. Instead, a technique called padding is applied. This technique maintains the aspect ratio and fills up the remaining empty space (in either height or width) with black pixels, which should not influence the neural network’s ability to classify these images.
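The padding step could look roughly like the sketch below, which assumes the Pillow library. The function name `pad_and_resize` and the choice to center the image on the canvas are assumptions made for illustration; the post does not specify where the black pixels are placed.

```python
from PIL import Image

TARGET_SIZE = 299  # height and width expected by the model


def pad_and_resize(image, target=TARGET_SIZE):
    """Resize so the longest side equals `target`, then pad with black pixels."""
    width, height = image.size
    scale = target / max(width, height)
    new_size = (max(1, round(width * scale)), max(1, round(height * scale)))
    resized = image.convert("RGB").resize(new_size)

    # Black square canvas; the resized image is pasted in the middle (assumed).
    canvas = Image.new("RGB", (target, target), (0, 0, 0))
    offset = ((target - new_size[0]) // 2, (target - new_size[1]) // 2)
    canvas.paste(resized, offset)
    return canvas


# A white 600x300 image stands in for a product photo.
wide_image = Image.new("RGB", (600, 300), (255, 255, 255))
padded = pad_and_resize(wide_image)
```

For this wide example image the result is a 299×299 picture with the photo scaled to full width and black bars above and below it, so the proportions of the product are preserved.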

Once the images are padded, and thus resized while maintaining their aspect ratio, the time has come to train the model. As we apply transfer learning, we are only fine-tuning the existing model. However, this doesn’t mean that we should leave its hyperparameters as they are. Arguably the most important hyperparameter of a neural network is its learning rate, which determines how well (and how fast) the network will converge to an optimum. It is often argued that the learning rate can be determined via a trial-and-error approach. While this is not that bad, more ‘structured’ approaches exist. One of these approaches is the Cyclical Learning Rate (Smith, 2017). By applying this method, one can find suitable boundaries for the learning rate, namely a minimum and maximum learning rate value. When training the neural network, the learning rate will cycle between these values, thereby often obtaining good results faster than with a single fixed learning rate value.

In figure 8, the learning rate was set to cycle between a value of 0.0001 and 0.0010, while performing 6000 iterations over the data.
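A minimal sketch of such a schedule is given below, using the triangular policy described by Smith (2017) and the bounds mentioned above. The step size of 1000 iterations (half a cycle) is an assumption for illustration; the post does not state the value that was actually used.

```python
import math


def triangular_learning_rate(iteration, base_lr=0.0001, max_lr=0.0010,
                             step_size=1000):
    """Triangular cyclical learning rate; `step_size` is half a cycle (assumed)."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)


# Learning rate for each of the 6000 iterations: three full cycles
# when the assumed step size is 1000.
schedule = [triangular_learning_rate(i) for i in range(6000)]
```

With these settings the learning rate climbs linearly from the lower to the upper bound in the first 1000 iterations, falls back over the next 1000, and then repeats.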

Once the image classifier is trained and tested, the validation set is fed to it, to generate the actual predictions. Again, these predictions are in the form of predicted probabilities for each class per product, which will be used as input for the meta-learner. Now, let’s move on to the final part of the ensemble model, the meta-learner.

Meta-learner


The base level classifiers, responsible for the text and images, have been elaborated on. Once they have done their job, meaning they have provided predicted probabilities, these probabilities are used as input for the meta-learner. As the dataset of our client consisted of approximately 400 unique classes, and each of the two base classifiers returns one probability per class, we now have about 800 features which are used as input. As the meta-learner is simply another classification algorithm, our pick for this task is a Support Vector Machine (SVM). The main reason is that an SVM has proven to perform well on classification tasks where the number of features is very large (Joachims, 1998). This leaves us with a simple classification task, this time generating the actual class predictions instead of probabilities.
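The sketch below illustrates this last step with scikit-learn's `SVC`. It is a toy version under stated assumptions: four classes instead of roughly 400, simulated base-classifier probabilities produced by the hypothetical helper `fake_base_probabilities`, and a linear kernel, which the post does not specify.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_products = 4, 200  # toy sizes; the real data had ~400 classes
true_class = rng.integers(0, n_classes, size=n_products)


def fake_base_probabilities(true_class, sharpness):
    """Simulate a base classifier's per-class probabilities (illustrative only)."""
    scores = rng.normal(size=(len(true_class), n_classes))
    scores[np.arange(len(true_class)), true_class] += sharpness
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)


text_proba = fake_base_probabilities(true_class, sharpness=2.0)
image_proba = fake_base_probabilities(true_class, sharpness=1.5)

# One probability per class from each base classifier: 2 * n_classes features.
meta_features = np.hstack([text_proba, image_proba])

train, test = slice(0, 150), slice(150, None)
meta_learner = SVC(kernel="linear")  # kernel choice is an assumption
meta_learner.fit(meta_features[train], true_class[train])

predicted_class = meta_learner.predict(meta_features[test])  # actual classes
accuracy = float((predicted_class == true_class[test]).mean())
```

With 400 classes the same construction yields the roughly 800 input features mentioned above, and the output is a single predicted class per product rather than a probability vector.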

Figure 8: The learning rate cycling between its lower and upper bound (Credits).

…which brings us to our final model!

Earlier on in this blogpost we mentioned that several different text classifiers are put to the test, to see which combination of classifiers gives the best overall classification accuracy. The reason we do this is the no free lunch theorem, which says that there simply is no ‘one algorithm to rule them all’. Algorithms respond differently to different datasets, and therefore it never hurts to try several algorithms. The optimal combination of classifiers that forms our stacking ensemble is displayed in figure 9.

This stacking ensemble has allowed us to combine the text and image data of our client, which enabled us to improve our service and thereby increase our value proposition for this client! This is just one example of a service we as Squadra Machine Learning Company offer, so don’t hesitate to check out our other services!

Figure 9: The stacking ensemble
