The way consumers and businesses shop has changed dramatically in recent years: more and more purchases are made online. Global retail e-commerce sales for this year (2022) are estimated to reach 5.55 trillion dollars, accounting for 21.0% of total retail spending, and are expected to increase steadily [1]. To fulfil this demand, business owners need to move into e-commerce or expand their existing online product portfolio. An important condition for such strategies is the digital availability of product information. Through online channels, consumers must be able to find what they are seeking and must be convinced by the product information to make the purchase. This data is typically stored in product information management systems that feed the e-commerce shop and catalogue systems. In practice, however, much of the required data is missing or only partly available and can only be obtained through intensive manual work. As a result, many organizations employ data stewards who work full time gathering the right product data from various sources, which is costly, tedious, and inefficient.

From a holistic point of view, the internet is perhaps the largest product database in existence. Numerous online stores offer a tremendous number of products. Typically, each product has its own product detail page (PDP) that presents the relevant information for that product. This includes, but is not limited to, the product images, specifications, price, shipping, and description. This information supports consumers in their decision making and is for this reason publicly available. As a result, business owners that want to move into e-commerce already have a lot of product information at their disposal. There is a high chance that the products they want to offer are already offered on the internet by other business owners that do have the right product information. The challenge is to automatically find, retrieve, and store this product information so it can be used for the enrichment process. For this, we need to figure out how to make sense of the enormous, diverse, and poorly structured database that stores this data. In other words, how can we overcome the challenges of the internet for large-scale automated e-commerce product data enrichment? We define two main challenges for this purpose: the heterogeneity in structure among product detail pages and the syntactic discrepancy between descriptions of identical product specifications.

The first problem is the heterogeneity in structure among PDPs of different online stores. As mentioned, the PDPs provide the product data that can be used for the enrichment process. The way this data is presented and structured on PDPs, however, varies between online stores. Online stores are not obligated to present their product information in a specified format. As a result, the product information can be contained in different structures and placed in different positions on the PDP. The second problem is the syntactic discrepancy between descriptions of identical product specifications. Online stores typically use their own attribute names and standard values to describe their products. There will be numerous similar but differently named attributes with inconsistent values due to the use of synonyms, abbreviations, misspellings, etcetera. For large-scale data enrichment, we must recognize these duplicates and standardize the product specifications to guarantee high-quality data.

In this article we will demonstrate how transformer-based language models can be leveraged to overcome both challenges. First, we will briefly introduce transformer-based language models. Subsequently, we will explain how such language models can detect product specifications on PDPs regardless of their structure. Finally, we will show how these language models can be used to match product specifications.

Transformer-Based Language Models

One of the first and most influential transformer-based language models, BERT (Bidirectional Encoder Representations from Transformers), was introduced by Devlin et al. (2018) [4]. BERT has significantly changed the NLP landscape. In contrast to previous contextual language models like ELMo, BERT takes all the words of a sentence into account simultaneously, in both directions, when determining the embedding of each word. This enables BERT to develop a deeper understanding of language. Since its introduction, numerous fine-tuned and derivative versions of BERT have surfaced and been shown to outperform vanilla BERT. These language models have been used for a wide variety of NLP tasks, including semantic textual similarity, which we will use to solve the problem at hand.

Detecting Product Specifications

We mentioned the problem of heterogeneity in structure among PDPs of different online stores. Product specifications can be presented anywhere on the PDP but are, in general, contained in HTML tables [5]. These product specifications describe properties of the product and are, therefore, related to the product and its category. Other HTML tables on PDPs typically contain the contact details of the store, their opening hours, their social media links, and other information that is unrelated to the product. This indicates that, in general, product specification tables are the only HTML tables on a PDP that relate to the product. Given this, we can formulate our detection problem as a semantic search task: we want to search for, and retrieve only, the tables that are related to the product. Semantic search closely resembles sentence similarity and semantic textual similarity. In essence, different sentences are compared by computing their semantic similarity. A high semantic similarity indicates that two sentences hold the same meaning or are closely related.

We can easily convert each HTML table on a PDP to a single sentence by concatenating all the words in the table. This results in a number of distinct sentences that can be compared to a certain input sentence. See for example the conversion of a product specification table into a single sentence below:

Product Specifications (Amazon)

Printing Technology             Inkjet
Brand                           HP
Connectivity Technology         Wi-Fi;Cloud Printing
Model Name                      J9V92A#B1H
Compatible Devices              Smartphones, PC, Tablets
Recommended Uses For Product    Office, Home
Sheet Size                      3 x 5 to 8.5 x 14, Letter, Legal, Envelope
Color                           Seagrass
Printer Output                  Color
Item Weight                     5.13 pounds
Product Line                    HP DeskJet
“Printing Technology Inkjet Brand HP Connectivity Technology Wi-Fi;Cloud Printing Model Name J9V92A#B1H Compatible Devices Smartphones, PC, Tablets Recommended Uses For Product Office, Home Sheet Size 3 x 5 to 8.5 x 14, Letter, Legal, Envelope Color Seagrass Printer Output Color Item Weight 5.13 pounds Product Line HP DeskJet”

And, similarly, the conversion of a non-specification HTML table into a single sentence (Amazon presents some categories in an HTML table in the footer of the page):

“Amazon Music Stream millions of songs Amazon Advertising Find, attract, and engage customers Amazon Drive Cloud storage from Amazon 6pm Score deals on fashion brands AbeBooks Books, art & collectibles Sell on Amazon Start a Selling Account Amazon Business Everything For Your Business AmazonGlobal Ship Orders Internationally Home …”
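
To make this concrete, here is a minimal sketch of the conversion using BeautifulSoup, assuming the PDP HTML is available as a string (the function name tables_to_sentences is our own):

```python
from bs4 import BeautifulSoup

def tables_to_sentences(html: str) -> list[str]:
    """Convert every HTML table on a page into one flat sentence."""
    soup = BeautifulSoup(html, "html.parser")
    sentences = []
    for table in soup.find_all("table"):
        # get_text flattens all cell contents; the separator keeps adjacent
        # attribute names and values from running together
        text = table.get_text(separator=" ", strip=True)
        sentences.append(" ".join(text.split()))
    return sentences
```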

We can do this for each HTML table on the PDP, obtaining n sentences where n equals the number of HTML tables on the PDP. But how do we find the sentence(s) that relate(s) to product specifications? For this, we need an input sentence that is semantically related to the sentence holding the product specifications. Product specifications can be quite domain-specific but are inherently related to the product and its category. Take for example the specification table above. This table is retrieved from Amazon [2] and contains the specifications of a specific printer. Accordingly, we observe attributes and values related to the printer such as ‘printing technology’, ‘cloud printing’, ‘office’, ‘sheet size’, ‘letter’, ‘printer output’, etcetera. These are clearly related to the product category ‘printer’. This implies that a sentence containing the product category can be semantically related to the product specifications and could, therefore, be used for the automatic semantic search. Typically, the product category is contained in the ‘breadcrumbs’ of the PDP, also known as the navigation chain. Hence, we can use the breadcrumbs to find the product specifications. For the same printer from Amazon we observe the following breadcrumbs, which can also be converted into a single sentence:

“Office Products › Office Electronics › Printers & Accessories › Printers”
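
Extracting breadcrumbs is more site-specific than extracting tables, but many stores expose them through an element with a breadcrumb class or aria-label. A best-effort sketch under that assumption (the selectors are illustrative, not a universal rule):

```python
from bs4 import BeautifulSoup

def breadcrumbs_to_sentence(html: str) -> str | None:
    """Best-effort breadcrumb extraction; the selectors are illustrative guesses."""
    soup = BeautifulSoup(html, "html.parser")
    crumbs = (soup.find(attrs={"aria-label": "breadcrumb"})
              or soup.find(class_="breadcrumb")
              or soup.find("nav"))
    if crumbs is None:
        return None
    # Drop separator glyphs that some stores render as literal text nodes
    parts = [t for t in crumbs.stripped_strings if t not in {"›", ">", "/"}]
    return " › ".join(parts) if parts else None
```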

With the breadcrumbs and HTML tables converted to distinct sentences, we now have all the ingredients to conduct the semantic search. The breadcrumbs sentence serves as the input sentence (query) and will be compared to all the HTML table sentences. To do this, we first vectorize all the sentences using a transformer-based language model. Since we are comparing sentences, the Sentence-BERT (SBERT) architecture [6] is the most suitable: it is optimized for deriving semantically meaningful sentence embeddings. Numerous fine-tuned BERT models on this architecture have surfaced, but for this article we will use the one that has been downloaded the most on HuggingFace: multi-qa-MiniLM-L6-cos-v1 [3], a fine-tuned variant of BERT optimized for semantic search. Because this model leverages the SBERT architecture, sentence pairs can be easily compared. Specifically, two sentences are vectorized and their semantic similarity is calculated with the cosine similarity. A value close to 1 means high semantic similarity, while a value close to 0 means low semantic similarity. Let’s compute the similarities of the sentences that we have created so far.
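
As a minimal sketch with the sentence-transformers library (reusing tables_to_sentences from the earlier sketch; the variable html is assumed to hold the PDP source):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

breadcrumbs = "Office Products › Office Electronics › Printers & Accessories › Printers"
table_sentences = tables_to_sentences(html)  # sketch from the previous section

# Encode the query (breadcrumbs) and every table sentence into dense vectors
query_emb = model.encode(breadcrumbs)
table_embs = model.encode(table_sentences)

# Cosine similarity between the breadcrumbs and each table sentence
scores = util.cos_sim(query_emb, table_embs)[0]
for sentence, score in zip(table_sentences, scores):
    print(f"{float(score):.2f}  {sentence[:60]}")
```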

Sentence 1: “Printing Technology Inkjet Brand HP Connectivity Technology Wi-Fi;Cloud Printing Model Name J9V92A#B1H Compatible Devices Smartphones, PC, Tablets Recommended Uses For Product Office, Home Sheet Size 3 x 5 to 8.5 x 14, Letter, Legal, Envelope Color Seagrass Printer Output Color Item Weight 5.13 pounds Product Line HP DeskJet”
Sentence 2: “Office Products › Office Electronics › Printers & Accessories › Printers”
Semantic similarity: 0.60 (multi-qa-MiniLM-L6-cos-v1 + cosine similarity)

Sentence 1: “Amazon Music Stream millions of songs Amazon Advertising Find, attract, and engage customers Amazon Drive Cloud storage from Amazon 6pm Score deals on fashion brands AbeBooks Books, art & collectibles Sell on Amazon Start a Selling Account Amazon Business Everything For Your Business AmazonGlobal Ship Orders Internationally Home …”
Sentence 2: “Office Products › Office Electronics › Printers & Accessories › Printers”
Semantic similarity: 0.22 (multi-qa-MiniLM-L6-cos-v1 + cosine similarity)

We observe significant differences in the similarity scores! Where the product specification table attains a cosine similarity of 0.60 with the breadcrumbs, the non-specification table only reaches 0.22. Given these scores, we can use them directly for retrieval by setting a threshold, or we can use them as a feature for a classification model, similar to Petrovski and Bizer (2017) [5]. We can repeat this process for each PDP and collect the product specifications this way. Because our approach only uses HTML tables and breadcrumbs, it is insensitive to website heterogeneity, which makes it a suitable solution for large-scale automated e-commerce product data enrichment.
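
Putting the pieces together, here is a hedged sketch of the per-PDP detection loop with a similarity threshold (the 0.5 cutoff is an illustrative choice, not a tuned value; the helper functions and model come from the sketches above):

```python
def detect_specification_tables(html: str, threshold: float = 0.5) -> list[str]:
    """Return the table sentences whose similarity to the breadcrumbs
    meets the threshold; reuses the sketches defined earlier."""
    query = breadcrumbs_to_sentence(html)
    tables = tables_to_sentences(html)
    if not query or not tables:
        return []
    scores = util.cos_sim(model.encode(query), model.encode(tables))[0]
    return [t for t, s in zip(tables, scores) if float(s) >= threshold]
```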

Matching Product Specifications

Once the product information is retrieved from different sources, we need to combine it into one consistent list of specifications that can directly be used. We do not want to present duplicate information just because the same specification is described differently by different sources. This means that we must match product specifications. We will demonstrate how the same transformer-based language model can be used to solve this problem.

Consider the product specifications that Amazon and eBay [7] present for the same product: the printer from the previous section. Amazon and eBay describe some identical product specifications in slightly different ways. For each specification from one source, we want to find the specification from the other source that carries the same meaning, if it is present. Matching product specifications is essentially a sentence similarity task and can also be treated as a semantic search task. We approach the matching task in the same way as the detection task. Specifically, we convert each specification (attribute + value) into a distinct sentence that can be compared. For example, “Printing Technology Inkjet”, which originates from Amazon, is compared with “Brand HP”, “Model J9V92A#B1H”, “Memory 64 MB”, etcetera, which originate from eBay. We again use the multi-qa-MiniLM-L6-cos-v1 model with the cosine similarity to derive the semantic similarity of the specifications. Identical specifications attain high similarity scores, while dissimilar product specifications generally score much lower. “Printing Technology Inkjet” and “Brand HP”, for instance, only give a similarity score of 0.32. As such, the transformer-based language model can be used to match product specifications.
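
A sketch of the matching step under the same assumptions, using the specification sentences quoted above (the 0.7 cutoff is again illustrative; model and util come from the earlier sketch):

```python
# Specification sentences taken from the Amazon and eBay examples above
amazon_specs = ["Printing Technology Inkjet", "Brand HP", "Model Name J9V92A#B1H"]
ebay_specs = ["Brand HP", "Model J9V92A#B1H", "Memory 64 MB"]

# Pairwise cosine similarities: len(amazon_specs) x len(ebay_specs) matrix
scores = util.cos_sim(model.encode(amazon_specs), model.encode(ebay_specs))

threshold = 0.7  # illustrative cutoff, not a tuned value
for i, spec_a in enumerate(amazon_specs):
    j = int(scores[i].argmax())
    if float(scores[i][j]) >= threshold:
        print(f"{spec_a!r}  matches  {ebay_specs[j]!r}  (score {float(scores[i][j]):.2f})")
```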

Conclusion

In this article we have demonstrated the effectiveness of a transformer-based language model for large-scale automated e-commerce product data enrichment. First, we provided a solution for detecting product specifications that is independent of website structure: we approached the detection problem as a semantic search task and retrieved the specifications by computing the similarity of the HTML tables with the breadcrumbs. Second, we demonstrated that the same transformer-based language model can match identical product specifications despite syntactic discrepancies. All in all, transformer-based language models are extremely powerful and can be applied to a wide variety of NLP tasks. As we have demonstrated in this article, this includes semantic search (or sentence similarity) for large-scale automated e-commerce product data enrichment.