eCommerce Taxonomy Automation

Automate Your Ecommerce Product Categorization in Five Easy Steps with AI

Matt Payne
·
November 23, 2022

As a product catalog manager or content specialist, the last thing you want to be doing is spending your time on data entry and Excel lookups. You’d rather spend your time on copywriting, product listing optimization, and other enrichment tasks that require the domain expertise you have. Wouldn't it be nice if you could do that while all the little mundane tasks that take up so much time and effort just ran by themselves?

One such task is categorizing products accurately to your taxonomy. Using modern AI and machine learning, it can indeed run quietly in the background while you focus on more critical tasks. In this article, find out how Pumice.ai automates ecommerce product categorization.

Why Proper Product Categorization Is Crucial

In both business-to-customer (B2C) and business-to-business (B2B) ecommerce, correct product categorization is critical for your customers and your business:

  • You can convert more of your website visitors or app users to customers. Accurate categorization makes it easier for potential customers to find what they’re looking for in as few searches as possible. Get it wrong and your bounce rate becomes too high for comfort. As many as 79% of your visitors may head to a competitor.
  • Not only do you get more buyers but those buyers are likely to buy more. Forrester found that poorly structured sites sell 50% less than organized sites with solid product taxonomy. Bottom line: when customers find what they're looking for in as few steps as possible, they convert better.
  • You automatically increase your organic search traffic because correct product categories mean better SEO. Search engine optimization is critical: the top-ranking site for a keyword gets as much as 49% of all search traffic. Failure to rank and search invisibility can cost you thousands of customers. A low ranking is not much better: it results in low click-through rates, given that a mere 0.78% of Google searchers make it to the second page. Correct categories make it easier for search engines and shoppers to find what they're looking for.
  • Many external selling channels like Walmart, Amazon, and eBay require correct categories. These categories also tie to specific product attributes that must be assigned correctly alongside the category.

Wrong or inexact categories can create additional business, financial, and legal issues in almost every modern ecommerce channel as explained below.

Google Product Taxonomy

product to google shopping page

The Google product taxonomy has about 5,600 unique categories organized as hierarchical categories that get more granular the further you go down the tree. It enables your products to show up in Google Shopping, as shown above, and in Google search, Google Maps, YouTube, and other Google results. It's also used for targeted advertising campaigns in Google Ads.

Miscategorization can lead to lost leads and reduced conversions, as shown in the above illustration. As mentioned before, external channels require correct categorization; some assign categories for you, while others require you to categorize your products yourself.

Even if you’re not selling products through Google Shopping, this taxonomy is a great one to use for your site or marketplace, as it covers a very large range of products, and is granular where it’s needed.

Amazon Taxonomy

amazon taxonomy for electronics

Amazon's product taxonomy is visible to sellers when uploading products and to buyers on the Amazon search page as shown above. Popular products in a specific category are found on the category pages.

Setting the correct category is important for sellers because your products can be removed and your account flagged if categories are incorrect. The right category lets you correctly fill in all required product fields. While some products might seem like a candidate for multiple categories, it's important to choose the correct one. 

Shopify Standard Product Taxonomy

The Shopify product taxonomy is a list of categories that is publicly available. It's important because it:

  • enables category-specific metafields in your storefront for users to search and filter more conveniently
  • facilitates better product management by setting smart collection conditions or filtering product lists
  • eases selling on other channels that require a standardized product type such as Facebook or Google
  • determines the tax rate for a product because some get special rates and getting the sales tax wrong can create financial or legal liabilities

Meta Product Taxonomy

On Meta properties like Facebook, Instagram, and WhatsApp, setting the right category is important for the following reasons:

  • On your Facebook and Instagram shops, product categories are used to create product sets which can be featured as special collections, like new or seasonal products.
  • The product sets decide which products are shown in Meta Advantage+ catalog ads.
  • The category enables adding category-specific fields.
  • The category determines whether the product requires a size and should show size charts.
  • The category decides the return window for a product.
  • A missing or wrong category can adversely affect the user experience, impact conversion to purchases, or mislead and erode trust in your shop.
  • The category decides the tax rates and taxability for each product. Getting it wrong can create financial and legal liabilities.

B2B Taxonomies

For global B2B ecommerce, standardized taxonomies like the United Nations Standard Products and Services Code (UNSPSC) and GS1 enable streamlining procurement, regulatory compliance, taxation, and more.

How do Ecommerce Businesses Categorize Products? What are Some Common Problems?

Many ecommerce companies use a combination of rules-based automation, data entry employees, and knowledge process outsourcing to categorize products.

The rules-based automation suffers from several drawbacks:

  • Brittleness: Over time, the number of rules can build up to tens of thousands. They become difficult to change, maintain, and evaluate. Categorization decisions may even rely on the order of rule processing. All this can lead to subtle logical errors as the number of products and categories grows. Since teams never know what else may get affected, they shy away from touching any rules. But this prevents any user experience improvements from being proposed or implemented, potentially losing leads and conversions.
  • Performance: The large number of rules can take a long time to process as new products get added to the inventory. If you have products that belong in multiple categories, these rule systems can struggle to recognize when a multi-category assignment should apply.
  • Mandatory human review: Since the rules-based logic is known to be brittle, it necessitates human review by category managers and merchandising experts. This dilutes the purpose of automation.
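
The order dependence mentioned under brittleness is easy to reproduce. The sketch below is a toy first-match rule engine; the keyword rules and category names are made up for illustration and are not taken from any real system.

```python
# Toy first-match rule engine. Rules and categories are hypothetical.
Rule = tuple[str, str]  # (keyword, category)

def categorize(title: str, rules: list[Rule], default: str = "Uncategorized") -> str:
    """Return the category of the first rule whose keyword appears in the title."""
    lowered = title.lower()
    for keyword, category in rules:
        if keyword in lowered:
            return category
    return default

rules_a = [("case", "Phone Accessories > Cases"), ("phone", "Electronics > Phones")]
rules_b = list(reversed(rules_a))  # same rules, opposite order

title = "Leather phone case"
result_a = categorize(title, rules_a)
result_b = categorize(title, rules_b)
print(result_a)
print(result_b)
```

The same product lands in two different categories depending only on which rule is checked first. With tens of thousands of rules, such interactions become very hard to audit.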

All the manual data entry and review workflows come with their own problems:

  • Product and category nuances: Even people may not be able to categorize correctly when subtle cultural or domain-specific nuances are involved. Workers must have extremely good domain knowledge here.
  • Training costs: Ensuring that workers categorize correctly and reliably requires dedicated training. They need a strong understanding of product similarities, product category differentiation, and high-level business goals. However, in practice, many workers may never develop that level of business understanding due to poor training as well as cognitive and motivational limitations.
  • Data quality errors: Reviewing and categorizing products every day is monotonous work, and bored workers can easily introduce all kinds of data quality issues.
  • Taxonomy complications: Adding, deleting, or moving branch or leaf categories can easily create a cascade of errors.
  • Labor costs: Hiring, supervising, and outsourcing add considerable business costs.
  • Poor productivity: The classification speeds of workers will always remain slow while the number of products and categories grows fast.

Is ChatGPT Enough?

Perhaps you're considering using general chatbots like ChatGPT or large language models (LLMs) like o1 or Gemini to solve such product categorization bottlenecks. Will a prompt like the one below work?

chatgpt example for product categorization
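
A prompt of this kind could be assembled along the following lines. This is a minimal sketch that only builds the prompt text; the wording, the sample categories, and the helper name `build_prompt` are illustrative assumptions, and sending the prompt to a model is left out.

```python
def build_prompt(title: str, description: str, categories: list[str]) -> str:
    """Build a plain-text categorization prompt listing every allowed category."""
    category_lines = "\n".join(f"- {c}" for c in categories)
    return (
        "Pick the single best category for the product below.\n"
        "Answer with one category from the list, copied exactly.\n\n"
        f"Categories:\n{category_lines}\n\n"
        f"Product title: {title}\n"
        f"Product description: {description}\n"
    )

# Hypothetical three-category taxonomy; a real one may have thousands of lines.
categories = [
    "Hardware > Fasteners > Screws",
    "Hardware > Fasteners > Nails",
    "Hardware > Tools > Screwdrivers",
]
prompt = build_prompt(
    "M4 x 20 mm stainless machine screw",
    "Pan head, Phillips drive, pack of 100.",
    categories,
)
print(prompt)
```

Note that every category is repeated in every prompt, so the prompt grows with the size of the taxonomy.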

If you're categorizing just a few products, this will work with decent accuracy and efficiency.

However, they have the following drawbacks for heavier or more nuanced tasks:

  • Nature of chatbots and LLMs: Chatbots and LLMs are trained to generate natural language text for a large variety of questions on a vast number of topics. While they can do specific tasks like categorizing products, you must supply the full taxonomy along with examples that show how to categorize correctly. With large or deep taxonomies, the number of hallucinations also goes up. With custom taxonomies using GPT-4o Mini, we’ve seen about 75% accuracy on a benchmark with categories five levels deep.
  • Flagship models versus open-source private models: Competence and correctness are often limited to the flagship models from OpenAI, Google, Anthropic, and other top AI companies. If you're using any of the open-source models for privacy and lower costs, you may find that their accuracy is lower and that errors and hallucinations are more frequent.
  • Subtle product and category nuances: Generic chatbots and LLMs are trained on a large variety of questions in different areas. Since they aren't trained on how your product data relates to your taxonomy, they struggle to reason correctly about your store's categorization logic. Especially for deep taxonomies like hardware supplies, electronics components, or cultural products, they frequently fail to identify the best category because they haven't been trained to pay attention to your product and category nuances.
    The reasoning LLMs are often more accurate on such tasks. However, they too aren't trained on your product data relationships to your taxonomy nor is their reasoning based on your business logic for categorizing a product under a particular category. The reasoning these models generate could be completely unrelated to how you think about placing products in categories. If the reasoning is wrong, your category will be wrong. 
  • Performance and quota limits: All the flagship models take a lot of time per request, especially if an entire taxonomy is included in the context. The reasoning models take even more time on average. If you have thousands of products, you'll have to wait a very long time. Plus, all the flagship LLMs place many limits on the number of requests and tokens sent per minute.
  • Costs: Costs can go high if you're including entire taxonomies in the system or user prompt.
  • Limited context windows: On a related note, trying to stuff entire taxonomies and examples in a single prompt may be impossible. While the million-token context windows of flagship models like Gemini are sufficient, open-source models like Llama 3 are still severely lacking in this area. Research also indicates that LLMs get progressively worse at understanding their context as it grows (supporting research), meaning that as your taxonomy gets larger and more granular, your accuracy can go down.
  • Input format limitations: You may have to convert your product and taxonomy data into a special text syntax to suit the LLM's quirks.
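
To get a feel for the performance and quota points above, a back-of-the-envelope estimate helps. All numbers in the sketch below (catalog size, latency, rate limit, concurrency) are hypothetical placeholders, not measurements of any particular model or provider.

```python
def hours_to_categorize(num_products: int,
                        seconds_per_request: float,
                        requests_per_minute: int,
                        concurrency: int = 1) -> float:
    """Estimate wall-clock hours, bounded by both latency and the rate limit."""
    # Throughput achievable from latency alone, in requests per minute.
    latency_bound = concurrency * 60.0 / seconds_per_request
    # The provider's rate limit caps whatever latency would allow.
    effective_rpm = min(latency_bound, requests_per_minute)
    return num_products / effective_rpm / 60.0

# Hypothetical scenario: 100,000 products, 6 s per request,
# 10 parallel requests, and a limit of 500 requests per minute.
estimate = hours_to_categorize(100_000, seconds_per_request=6.0,
                               requests_per_minute=500, concurrency=10)
print(f"{estimate:.1f} hours")
```

Under these assumptions latency, not the quota, is the bottleneck. Lowering the rate limit or lengthening the prompt shifts the estimate accordingly.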

Categorize Millions of Products Accurately Using Pumice Categorization Models

Pumice's AI models are highly specialized and tuned just for product categorization. Not only do they overcome all the problems and costs of manual workflows but they also overcome all the above drawbacks of general-purpose chatbots and LLMs since they are explicitly trained for product categorization.

We explain the different types of models we use, their training philosophy, and their benefits below.

Baseline Categorization Models for Any Standard or Custom Taxonomy

Our baseline models are generic categorization models. Their training includes a large number of products of different types categorized under large generic taxonomies like Google's.

This allows our baseline models to generalize quite well to any product and taxonomy as long as there aren't subtle nuances between products or categories.

You can even specify your company custom taxonomies and get accurate results. For example, if your online shop specializes in hardware products, you can bring your custom categories extracted from your inventory. Our baseline models will identify the best fit category for your products with high accuracy.

Our model looks at text data such as title and description, as well as the image.

But what if you have half a million products and 5,000 categories? Then, even high accuracy means thousands of miscategorized products that need manual reviews. Do you have the time, staff, and budget for that?
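
The arithmetic is worth spelling out. The accuracy levels in this sketch are illustrative, not measured results for any model:

```python
def expected_miscategorized(num_products: int, accuracy: float) -> int:
    """Expected number of wrong categories at a given accuracy (0 to 1)."""
    return round(num_products * (1.0 - accuracy))

for accuracy in (0.90, 0.95, 0.99):
    wrong = expected_miscategorized(500_000, accuracy)
    print(f"{accuracy:.0%} accurate -> about {wrong:,} products to review")
```

Even at 99% accuracy, a catalog of half a million products leaves thousands of items for manual review.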

For such scenarios, you need models with extremely high accuracies whose error rates are far lower. You need our fine-tuned models!

What are Fine-Tuned Models?

Fine-tuning means training a model to pay close attention to subtle nuances between different product details as well as different categories. These models learn the exact relationships between your product data and the taxonomy. 

Unlike our baseline models that were trained on generic products and taxonomies, the training data for fine-tuning consists of very carefully selected products and categories whose nuances can confuse even human workers.

By deliberately training our models on such difficult datasets, we force a fine-tuned model to pay attention to all product and category nuances for deciding the best possible category. If a model miscategorizes a product, we optimize it so that it learns to categorize that product more carefully.

Fine-tuning is also great for multi-language use cases. Many categorization services struggle with catalogs that are written in other languages or that mix multiple languages. We use fine-tuning in these instances to help the model understand language differences and still achieve high accuracy.

Such stringent training enables our fine-tuned models to achieve far higher accuracies compared to our baseline models. Our fine-tuned models guarantee at least 90% accuracy; if it's less than 90%, you don't have to pay!

Fine-Tuned Models for Standard Taxonomies

Our fine-tuned and enterprise plans provide carefully fine-tuned models for the following standard taxonomies:

  • Google product taxonomy
  • Shopify product taxonomy

These models can accurately categorize your products even if the categories are confusingly similar with subtle cultural nuances or domain-specific differences like the ones below.

Snippet of the Google product taxonomy category tree

Fine-Tuned Models for Your Custom Taxonomies

amazon category example

We even fine-tune models for your specific ecommerce site products and categories!

Perhaps you run a specialized ecommerce B2C business that sells only tea products from all over the world.

Perhaps it's a B2B business that sells all kinds of electronic components or industrial hardware supplies with a business-specific taxonomy of website categories.

For such use cases, you'll likely have a specialized and deep taxonomy. For example, Google's taxonomy has just one category for all kinds of screws while a specialized supplier like McMaster-Carr has 17-20 different categories just for screws. AI needs to understand such product and category nuances.

To facilitate this, our data scientists work with your experts to carefully design stringent training datasets consisting of your specific products and categories. The dataset is designed to confuse even human experts unless they pay very close attention to all product details. We then rigorously fine-tune our models on such data to achieve extremely high accuracies. We guarantee at least 90% categorization accuracy from our fine-tuned models even on your custom taxonomies; if it's less than 90%, you don't have to pay for usage.

Our enterprise plan supports any number of such fine-tuned categorization models. 

How to Completely Automate Your Ecommerce Product Categorization with Pumice.ai

In this section, you'll learn to use Pumice.ai to automate ecommerce product categorization.

Step 1. Prepare Your Product Data for Upload

Pumice needs only 2 product details:

  • Product name or title
  • Product description

These two details are almost always more than enough for Pumice to accurately categorize all your products.

As mentioned above, you can also provide images, but they are optional.

Other details like fabric type, ingredients, or MPN codes are completely optional. However, don't hesitate to include them if:

  • doing so is more convenient, or 
  • the names and descriptions are vague, have spelling mistakes, or other quality problems

Including more details may even bump up the accuracy slightly for some products with lots of variants with minor differences.

Export these product details from your existing product information management (PIM) or ecommerce service in comma-separated values (CSV) files.

Step 2. Upload Your Product Data to Pumice.ai

For bulk categorization, you can either upload a CSV file using the "Upload" button or import directly from your Shopify shop as shown below.

pumice.ai dashboard

Make sure your CSV file has a header row with "title" and "description" columns. Product titles and descriptions are mandatory. All other details are optional but it doesn't hurt to include them.
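
Before uploading, a quick local check of the header row can save a failed run. The sketch below uses Python's standard `csv` module; the exact-match check on lowercase column names and the sample rows are assumptions made for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = {"title", "description"}

def missing_columns(csv_text: str) -> set[str]:
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = {name.strip() for name in (reader.fieldnames or [])}
    return REQUIRED_COLUMNS - header

# Made-up sample with one optional extra column (mpn).
sample = (
    "title,description,mpn\n"
    '"Oolong tea, 100 g","Loose-leaf oolong from Taiwan.",TEA-001\n'
)
print(missing_columns(sample))            # nothing missing
print(missing_columns("name,details\n"))  # both required columns missing
```

For a real export, read the file with `open(path, newline="", encoding="utf-8")` and pass the file object to `csv.DictReader` instead of using `io.StringIO`.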

Alternatively, if you have a dev team taking care of your ecommerce properties or PIM, you can ask them to upload product CSV files using Pumice's application programming interface (API). They can use the CSV upload endpoint and get back an upload ID for each file. 

Step 3. Select Your Preferred Category Taxonomy

Pumice can categorize products either using one of the built-in taxonomies or using one of your custom taxonomies. The first approach is called static categorization and the second is dynamic categorization.

The built-in static taxonomies include pre-defined taxonomy models for Google product category taxonomy and Shopify product taxonomy.

For dynamic taxonomies, upload a text file containing all your custom categories using one of the options shown below. If you're using the API, upload your custom taxonomy using the upload tree endpoint and get a tree ID.

A typical taxonomy text file looks something like this:

taxonomy example
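
The exact file layout Pumice expects is not reproduced here. As a rough illustration only, the sketch below assumes one category path per line with levels separated by " > ", which is the layout Google uses for its published taxonomy file; the sample categories are made up, and your own file may need a different format.

```python
def parse_taxonomy(text: str, separator: str = " > ") -> dict:
    """Turn one-path-per-line text into a nested dict of category names."""
    tree: dict = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        node = tree
        for level in line.split(separator):
            node = node.setdefault(level.strip(), {})
    return tree

def max_depth(tree: dict) -> int:
    """Depth of the deepest category path (0 for an empty tree)."""
    return 0 if not tree else 1 + max(max_depth(child) for child in tree.values())

# Hypothetical custom taxonomy for a tea shop.
sample = """\
Tea
Tea > Green Tea
Tea > Green Tea > Sencha
Tea > Black Tea
Teaware > Teapots
"""
tree = parse_taxonomy(sample)
print(tree)
print(max_depth(tree))
```

Parsing the file locally like this is a cheap way to spot duplicate or misspelled branches before uploading it.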

Step 4. Let Pumice.ai Automatically Identify the Best Categories for Your Products

Pumice supports different ways to do the categorization.

If you're a non-technical user, you can do either bulk categorization ("batch run") or item-by-item categorization ("single run") from Pumice's dashboard.

Alternatively, your dev team can integrate the categorization into your existing business workflow using the Pumice API.

Let's look at all these ways below.

Bulk Categorization in the Dashboard

categorization of bulk records

For a batch run, upload your product details CSV file and select a built-in or custom taxonomy. Click "Upload & Generate" to start the categorization. The results of each run are available in the "Generated Entries" section.

Single Product Categorization in the Dashboard

If you have just one or a few products, enter the product title and description in the "Single Run" section. Select a built-in or custom taxonomy and click "Generate." The product's details and category are displayed.

Bulk Categorization Using the API

If you have a dev team, they can use the API in different ways from your applications or ecommerce plugins.

The batch categorization endpoint can either do dynamic categorization using your custom taxonomy or static categorization using a built-in taxonomy.

For batch dynamic categorization using your custom taxonomy, call the endpoint as follows:

  • Set the run type to "dynamic."
  • Supply the products CSV as a URL or upload ID.
  • Specify the model ID of one of the models provided to you. These will either be baseline generic models that work with any product taxonomy or fine-tuned models trained for your custom taxonomies.
  • Specify the tree ID of your custom product taxonomy.

For batch static categorization using a built-in taxonomy, call the endpoint as follows:

  • Set the run type to "static."
  • Supply the products CSV as a URL or upload ID.
  • Specify the model ID of one of the models provided to you. These are either baseline generic models that work with any product taxonomy or fine-tuned models trained on a built-in taxonomy.

In either case, the endpoint returns a task ID to track the categorization. Use the get results endpoint to get back the categorization results.
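
The two call shapes above can be summarized in a small helper. This is a sketch only: the field names (`run_type`, `csv_url`, `upload_id`, `model_id`, `tree_id`) and all ID values are hypothetical stand-ins, and no request is sent. Consult the Pumice API reference for the real endpoint paths and parameter names.

```python
from typing import Optional

def build_batch_request(run_type: str,
                        model_id: str,
                        csv_url: Optional[str] = None,
                        upload_id: Optional[str] = None,
                        tree_id: Optional[str] = None) -> dict:
    """Assemble a batch categorization payload (hypothetical field names)."""
    if run_type not in ("dynamic", "static"):
        raise ValueError("run_type must be 'dynamic' or 'static'")
    if (csv_url is None) == (upload_id is None):
        raise ValueError("supply exactly one of csv_url or upload_id")
    if run_type == "dynamic" and tree_id is None:
        raise ValueError("dynamic runs need the tree ID of a custom taxonomy")
    if run_type == "static" and tree_id is not None:
        raise ValueError("static runs use a built-in taxonomy, not a tree ID")

    payload = {"run_type": run_type, "model_id": model_id}
    if csv_url is not None:
        payload["csv_url"] = csv_url
    else:
        payload["upload_id"] = upload_id
    if tree_id is not None:
        payload["tree_id"] = tree_id
    return payload

# Placeholder IDs and URL, for illustration only.
dynamic_run = build_batch_request("dynamic", model_id="model-123",
                                  upload_id="upload-456", tree_id="tree-789")
static_run = build_batch_request("static", model_id="model-123",
                                 csv_url="https://example.com/products.csv")
print(dynamic_run)
print(static_run)
```

Validating the combination of parameters before sending avoids burning a request on a run that cannot succeed, such as a dynamic run without a tree ID.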

Single Item Categorization Using the API

For item-by-item categorization, use the single categorization endpoint. Like the batch endpoint, it can categorize either dynamically against a custom taxonomy or statically against a built-in taxonomy. Instead of a CSV, supply a single product title and description in a "data" object. Everything else is similar to the batch endpoint.
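
A request body for a single product might look like the output of the sketch below. Only the existence of a "data" object holding the title and description comes from the description above; every field name, key, and ID value here is a hypothetical stand-in, and no request is sent.

```python
import json
from typing import Optional

def build_single_request(title: str, description: str, run_type: str,
                         model_id: str, tree_id: Optional[str] = None) -> dict:
    """Assemble a single-item payload with the product inside a "data" object."""
    if run_type == "dynamic" and tree_id is None:
        raise ValueError("dynamic runs need the tree ID of a custom taxonomy")
    payload = {
        "run_type": run_type,   # hypothetical field name
        "model_id": model_id,   # hypothetical field name
        "data": {"title": title, "description": description},
    }
    if tree_id is not None:
        payload["tree_id"] = tree_id  # hypothetical field name
    return payload

request_body = build_single_request(
    title="Jasmine green tea, 50 g tin",
    description="Loose-leaf green tea scented with jasmine blossoms.",
    run_type="static",
    model_id="model-123",  # placeholder ID
)
print(json.dumps(request_body, indent=2))
```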

Product and Image Similarity

product similarity in pumice.ai

Our product and image similarity features can evaluate the similarity of two products based on their details or images. They allow you to judge the best categories for new products by comparing them with existing categorized products in your catalog.

They're available on the dashboard and via the similarity API.
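
To illustrate the general idea of similarity-based categorization, the sketch below compares products with a simple bag-of-words cosine similarity and borrows the category of the closest catalog item. This is a stand-in technique chosen for brevity, not a description of how Pumice computes similarity, and the catalog entries are made up.

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def nearest_category(new_product: str, catalog: dict[str, str]) -> tuple[str, float]:
    """Return the category of the most similar catalog product and its score."""
    query = vectorize(new_product)
    best = max(catalog, key=lambda text: cosine(query, vectorize(text)))
    return catalog[best], cosine(query, vectorize(best))

# Hypothetical, already-categorized catalog: product text -> category.
catalog = {
    "Stainless steel pan head machine screw M4": "Fasteners > Machine Screws",
    "Zinc plated wood screw countersunk 4 x 40": "Fasteners > Wood Screws",
    "Cast iron teapot with infuser 800 ml": "Teaware > Teapots",
}
category, score = nearest_category("M4 machine screw, stainless, pan head", catalog)
print(category, round(score, 2))
```

Word overlap is a weak signal on its own, since it misses synonyms and ignores images, but it shows how an existing categorized catalog can guide the placement of new products.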

Step 5. Integrate the Category Data With All Your Ecommerce Channels

You can upload the identified product categories back to your PIM or ecommerce service manually or using their APIs through custom integrations. Our most popular integrations include:

  • Shopify
  • WooCommerce
  • Jasper PIM
  • Salsify
  • Sales Layer PIM
  • Your custom product data storage system
  • Physical store POS systems
  • Online web shop database

Streamline Your Ecommerce Product Categorization

As you saw, Pumice.ai is a PIM enhancement service that streamlines product information tasks at scale.

Contact us today to learn how you can get started automating your product categorization and improve your customer experience.