Fine-tuning a Fashion Model: Testing approach

Introduction

In the fast-growing world of e-commerce, personal style and fit play a crucial role in purchase decisions. However, many users encounter the limitations of traditional keyword-based search, which relies heavily on exact keyword matches and often limits product discovery. As e-commerce continues to grow, integrating advanced tools to improve product recommendations and the customer experience is becoming increasingly important.

 

To tackle these challenges, we explored machine learning models that connect textual descriptions to images. In the first phase of our project, we tested two models: CLIP, a versatile model trained on diverse images and text, and FashionCLIP, a version of CLIP fine-tuned specifically for fashion. During testing, we implemented a scoring system to rate results based on color and category accuracy. After aggregating the scores, it was clear that FashionCLIP outperformed CLIP for our use case. However, during the review of the results, we identified consistent inaccuracies in test cases involving the collar, shoelace color, and shoe sole. We agreed that fine-tuning one of these cases would be a focus of the next project iteration, which is the subject of this article.

 

Fine-tuning is the process of adapting a pre-trained model to a specific task with the aim of increasing its performance and accuracy for that task. Since FashionCLIP showed better results for our case, we decided to fine-tune that model. This process, a form of transfer learning, adapts the model to a specialized dataset, improving its performance on specific tasks. By fine-tuning FashionCLIP, we aimed to improve its ability to recognize shoelace color. In this article, we describe the process of testing fine-tuned models and the metrics and testing techniques used to compare the performance of different models.

Fine-tuning model validation

The fine-tuning process aimed to optimize the FashionCLIP model for a specific task: predicting shoelace color. This process involved several epochs, as detailed in the whitepaper Deep Dive in Methodology, which describes the fine-tuning process and techniques in more depth. During each epoch, the model was fine-tuned on the training dataset and validated on the validation dataset to determine whether further parameter optimization was required. At the same time, we monitored for signs of underfitting, such as low performance, which would indicate that the model was not effectively capturing the underlying patterns in the data. Fine-tuning led to multiple candidate models, which were then rigorously tested on a testing dataset to identify the best-performing ones. Various metrics and scoring techniques were applied to assess each model; a comprehensive explanation, along with the validation results of the best-performing model, is provided in the next chapter. In addition to model validation, we worked on detecting overfitting by comparing performance between the training and validation datasets. A significant performance gap signaled overfitting and prompted an early intervention. More details about overfitting can be found in the Deep Dive in Methodology whitepaper.
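To illustrate the kind of per-epoch monitoring described above, here is a minimal, self-contained sketch; the placeholder training and evaluation functions and the threshold values are purely illustrative assumptions, not the actual fine-tuning code.

```python
# Illustrative per-epoch monitoring of training vs. validation performance.
# The two helpers below are placeholders standing in for the real fine-tuning code.

def train_one_epoch(epoch: int) -> float:
    """Placeholder: fake training F1 score that keeps improving."""
    return min(0.95, 0.55 + 0.05 * epoch)

def evaluate_on_validation(epoch: int) -> float:
    """Placeholder: fake validation F1 score that plateaus earlier."""
    return min(0.78, 0.55 + 0.04 * epoch)

best_val_f1, best_epoch = 0.0, -1
for epoch in range(10):
    train_f1 = train_one_epoch(epoch)
    val_f1 = evaluate_on_validation(epoch)

    # Underfitting: the model scores poorly on both splits.
    if train_f1 < 0.5 and val_f1 < 0.5:
        print(f"epoch {epoch}: low scores on both splits -> possible underfitting")

    # Overfitting: a widening gap between training and validation performance.
    if train_f1 - val_f1 > 0.10:
        print(f"epoch {epoch}: train/validation gap {train_f1 - val_f1:.2f} -> possible overfitting")

    if val_f1 > best_val_f1:
        best_val_f1, best_epoch = val_f1, epoch

print(f"best validation F1 {best_val_f1:.2f} at epoch {best_epoch}")
```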

Dataset

One of the most important steps when fine-tuning a model is choosing the right dataset. The dataset needs to closely match the domain and tasks the model will encounter in real-world applications. When choosing our dataset, the main requirement was to have images of shoes together with information about the shoelace color.

 

The Fashion Product Images Dataset from Kaggle was chosen, which contains 44,441 diverse clothing images. From this dataset, we selected only footwear, 7,344 items in total. This dataset is imbalanced, representing a real-life scenario where some classes are underrepresented. The original dataset contained basic details such as brand, product name, gender, and primary color. To enrich the dataset with information about shoelace color, Bakllava AI was utilized, and in cases where data was still incomplete, we manually filled in the missing details.

 

To ensure effective model evaluation, we split the dataset into three subsets: training, validation, and testing. Following best practices, we allocated 80% of the dataset for training and 20% for testing. In addition, 10% of the training data was set aside for validation. The validation dataset was used for tuning hyperparameters, while the testing dataset was used for testing the final fine-tuned model.
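As a rough illustration of such a split (the placeholder labels and the ratios below simply follow the description above; this is not the exact preprocessing code), a scikit-learn sketch could look like this:

```python
# Illustrative 80/20 train-test split, with 10% of the training portion held out for validation.
from sklearn.model_selection import train_test_split

image_ids = list(range(7344))        # placeholder for the 7,344 footwear items
labels = ["red"] * 7344              # placeholder for the shoelace-color annotations

# 80% training, 20% testing (stratify=labels could be added to preserve class ratios).
train_ids, test_ids, train_labels, test_labels = train_test_split(
    image_ids, labels, test_size=0.20, random_state=42
)

# 10% of the training data reserved for validation (hyperparameter tuning).
train_ids, val_ids, train_labels, val_labels = train_test_split(
    train_ids, train_labels, test_size=0.10, random_state=42
)

print(len(train_ids), len(val_ids), len(test_ids))   # sizes of the three splits
```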

Metrics and Model Evaluation

When evaluating a machine learning model, it is important to use the right metrics to understand how well the model is performing. Since the original model, FashionCLIP, is designed for fashion image classification, we concentrated on classification metrics to assess the discrete labels the model produces when classifying the given data.

Standard Evaluation Metrics

The first metric to consider is accuracy, which represents the percentage of correctly classified items. When using an imbalanced dataset like ours, relying solely on accuracy can lead to a misleading overall evaluation of performance, as the model may perform well on the majority class but poorly on the minority classes.

 

To better assess performance, we included precision, recall and F1 Score. F1 Score is a measure that combines both precision and recall. Precision is the ratio of true positive predictions to the total number of positive predictions made, while recall (or sensitivity) is the ratio of true positive predictions to the total number of actual positive cases. The F1 Score is the harmonic mean of precision and recall, calculated as:
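F1 = 2 × (Precision × Recall) / (Precision + Recall)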

This metric is especially valuable for imbalanced datasets, as it ensures the model balances both precision and recall, performing well across both minority and majority classes. After computing the F1 Score for each class individually, we average them to get a single value representing the model’s overall performance. There are different ways to average these scores, and the choice of averaging method can impact the interpretation of the results. When working with an imbalanced dataset, using the weighted average and the macro average can give a more accurate representation of the model’s performance.

 

The weighted average is based on the support (number of instances) of each class, so classes with more examples have a greater influence on the final metric. The macro average treats each class equally, regardless of how many instances belong to each class.

 

To validate the model using the metrics explained above, the ‘classification_report’ function from the sklearn library was used. It takes the arrays of actual and predicted categories as input and provides a detailed performance summary including metrics such as precision, recall, F1 Score, and support for each class, along with the weighted and macro averages.
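A minimal usage sketch, with made-up label arrays standing in for the real ones:

```python
# Illustrative call to classification_report with made-up shoelace-color labels.
from sklearn.metrics import classification_report

actual = ["red", "white", "black", "white", "red", "black", "white"]
predicted = ["red", "white", "white", "white", "black", "black", "white"]

# Prints per-class precision, recall, F1 and support, plus macro and weighted averages.
print(classification_report(actual, predicted, zero_division=0))
```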

 

From our dataset, the array of actual categories is already available. The following process resulted in an array of predicted categories for every image from the test dataset:
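As a rough sketch of how such predictions can be obtained with a CLIP-style model (the Hugging Face checkpoint name, candidate colors, and prompt template below are assumptions made for illustration, not necessarily the exact setup we used):

```python
# Illustrative shoelace-color prediction with a CLIP-style model.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "patrickjohncyh/fashion-clip"   # assumed FashionCLIP checkpoint on Hugging Face
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

candidate_colors = ["red", "white", "black", "brown", "beige"]
prompts = [f"shoes with {color} shoelaces" for color in candidate_colors]

def predict_shoelace_color(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image   # similarity of the image to each prompt
    return candidate_colors[logits.argmax(dim=-1).item()]

# predicted = [predict_shoelace_color(path) for path in test_image_paths]
```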

The image below shows the results of the classification_report function for both the original FashionCLIP and the fine-tuned model:

The fine-tuned model demonstrates higher precision, recall, and F1 Score compared to FashionCLIP, and the macro and weighted averages also reflect the improved performance.

Zero-shot and top-K

Besides the metrics mentioned above, tests for zero-shot classification and top-K were created.

 

Zero-shot classification is a machine learning technique in which a model recognizes and categorizes objects without having been trained on labeled examples of those categories. For each photo in the testing dataset, we compared the actual shoelace color from the metadata with the shoelace color predicted by the model. The model either correctly predicted the shoelace color or it did not. The FashionCLIP model achieved an accuracy of 59.66%, while the fine-tuned model achieved an accuracy of 69.76%.

 

Top-K measures how often the relevant item appears within the top K predictions or retrieved results from the model. In retrieval systems like ours, we want to make sure that the best items are shown in the top-K results. The accuracy score for the FashionCLIP model when K=1 is 59.66%, while the fine-tuned model achieves an accuracy of 69.62%. Results for various values of K are displayed in the image Zero-shot, top-K and custom metrics results at the end of this chapter.
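A minimal sketch of such a top-K check, assuming each image’s candidate colors have already been ranked from most to least likely:

```python
# Illustrative top-K accuracy: a prediction counts as a hit if the actual color
# appears among the K highest-ranked candidates for that image.
def top_k_accuracy(ranked_predictions, actual_labels, k):
    hits = sum(1 for ranked, actual in zip(ranked_predictions, actual_labels) if actual in ranked[:k])
    return hits / len(actual_labels)

# Made-up rankings for two images:
ranked = [["pink", "red", "brown"], ["white", "black", "red"]]
actual = ["red", "red"]
print(top_k_accuracy(ranked, actual, k=1))   # 0.0
print(top_k_accuracy(ranked, actual, k=2))   # 0.5
```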

Custom metric

In addition to the metrics already explained, we defined our own scoring technique as an additional metric for assessing the model. This scoring system relied on the top 5 category guesses, with the most points awarded when the actual color appeared in first place and the fewest when it appeared beyond fifth place.

 

Example: Actual shoelace color is Red

Position   Predicted color
1          Pink shoelace
2          Red shoelace
3          Brown shoelace
4          Black shoelace
5          Purple shoelace

 

The score for this test case is 4, since the actual color, Red, appears in the second position. The final score is determined by summing the scores for each image in our dataset.

 

We compared the fine-tuned models to FashionCLIP. For every model and each epoch, scores were computed using automated scripts and then compared. Since the image descriptions included shoelace color, we were able to fully automate these tests. The tests, written in Python, followed the same scoring method as manual testing, adhering to the approach described above. Having automated scripts was crucial for efficiently validating multiple models at each epoch; without them, the validation process would have been too time-consuming and ineffective.
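A minimal sketch of this custom scoring (5 points for first place down to 1 point for fifth; treating a color missing from the top five as 0 points is an assumption made here):

```python
# Illustrative custom metric: 5 points if the actual color is ranked first,
# 4 if second, ..., 1 if fifth, and 0 if it does not appear in the top five.
def position_points(top5_colors, actual_color):
    if actual_color in top5_colors[:5]:
        return 5 - top5_colors.index(actual_color)
    return 0

# Example from the table above: actual color Red ranked second -> 4 points.
top5 = ["pink", "red", "brown", "black", "purple"]
print(position_points(top5, "red"))   # 4

# The final score sums the points over every image in the dataset:
# total = sum(position_points(ranked[i], actual[i]) for i in range(len(actual)))
```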

 

The maximum possible score was 7,325 points (5 points per image in our dataset). In this evaluation, FashionCLIP scored 5,622 points, while the fine-tuned model under test achieved a score of 6,290 points. A representation of the aforementioned metrics and scores is shown in the image below.

Zero-shot, top-K and custom metrics results

The validation results displayed above represent the performance of the best-performing model from the fine-tuning phase. During the fine-tuning phase, several models were evaluated for their recognition capabilities. Based on these results, other models with promising performance were identified and subjected to more extensive testing in the subsequent phase.

 

In the fine-tuning phase, models were primarily assessed on their ability to predict the category from an image. However, in the next phase, they were tested in real-world scenarios. This involved defining test cases with various input values, given as text or images. The outputs, which included recognized similar products, were then validated. Further details about this phase can be found in the next chapter.

Testing approach - image retrieval

After the model validation phase, which indicated that the fine-tuned model showed better results, we needed to test image retrieval and evaluate how well our model works with text and image search queries. The goal was to simulate user behavior by providing a shoe image or a text prompt as input and then validating the retrieved results. The system was expected to return similar products, with a focus on matching the shoelace color, as this was the primary goal of the fine-tuning process. To ensure we didn’t disrupt core search functionality, we also used the shoe category as a criterion. For example, a test case could involve a prompt like “Casual shoes with beige shoelace” or an image of casual shoes with beige laces. These test cases were designed for image retrieval evaluations.

 

It is important to note that all models demonstrating good results in the validation phase underwent testing using the approach outlined in the following pages. The results displayed reflect the performance of the best-performing model. More details about the model results can be found in the chapter Final Model Evaluation.

 

While performing the test, the following steps were executed:

Image retrieval test steps
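As an illustration of the retrieval step behind these test cases, here is a minimal sketch; the use of cosine similarity over precomputed embeddings and all names below are assumptions for illustration, with random vectors standing in for real FashionCLIP embeddings.

```python
# Illustrative text-to-image retrieval: embed the query, compare it against
# precomputed image embeddings, and return the five most similar products.
import numpy as np

def top5_products(query_embedding, image_embeddings, product_ids):
    # Cosine similarity between the query and every catalog image.
    q = query_embedding / np.linalg.norm(query_embedding)
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    similarities = imgs @ q
    best = np.argsort(similarities)[::-1][:5]
    return [product_ids[i] for i in best]

# Made-up example with random vectors in place of real embeddings.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(100, 512))
product_ids = [f"shoe_{i}" for i in range(100)]
query_embedding = rng.normal(size=512)   # e.g. the embedding of “Casual shoes with beige shoelace”
print(top5_products(query_embedding, image_embeddings, product_ids))
```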

Scoring

Multiple scoring methods were used to evaluate the model’s predictions as accurately as possible. For each test case we took the top 5 images from the results and independently scored whether the model correctly guessed the shoe category and the shoelace color. These scoring techniques were implemented by our team, and their purpose was to measure how successful image retrieval is for our specific test cases.

 

All scoring methods are applied to the same set of 26 test cases selected for validating image retrieval. A representation of the test cases:

Test cases for testing of the fine-tuned model

The first scoring method validates the shoe category and the shoelace color, with priority given to the shoe category.

 

 

                          Correct shoe category   Wrong shoe category
Correct shoelace color    3                       1
Wrong shoelace color      2                       0

 

The steps for applying the first scoring method to each of the 26 test cases are as follows:

 

  1. Enter text prompt, for example: “Casual shoes with Brown shoelace”
  2. Check top 5 images in the output.

 

The result is:

 

      Shoe category   Shoelace color   Points
1.    Casual shoes    Brown            3
2.    Casual shoes    White            2
3.    Casual Shoes    Brown            3
4.    Casual shoes    Brown            3
5.    Casual shoes    Beige            2
Sum                                    13

 

The score for this test case is 13 points. The maximum possible score per test case is 15 points (5 images in the output × maximum of 3 points per image).

 

  3. Sum the points for all test cases described above; the total score represents the result for this test iteration.
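A minimal sketch of this first scoring method applied to the worked example above (representing each result as a (category, color) pair is an assumption for illustration):

```python
# Illustrative first scoring method: 3 points for correct category and color,
# 2 for correct category only, 1 for correct color only, 0 for neither.
def score_result(expected_category, expected_color, result_category, result_color):
    points = 0
    if result_category.lower() == expected_category.lower():
        points += 2
    if result_color.lower() == expected_color.lower():
        points += 1
    return points

# Worked example: prompt “Casual shoes with Brown shoelace”.
results = [("Casual shoes", "Brown"), ("Casual shoes", "White"), ("Casual Shoes", "Brown"),
           ("Casual shoes", "Brown"), ("Casual shoes", "Beige")]
total = sum(score_result("Casual shoes", "Brown", category, color) for category, color in results)
print(total)   # 13
```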

 

To make a comparison, we gathered scores for all test cases from both our fine-tuned model and FashionCLIP. The maximum possible score was 390 points (15 points per test case × 26 test cases). The fine-tuned model achieved better results, as displayed in the table below.

 

 

                    Score (max. 390)
Fashion CLIP        247
Fine-tuned model    274

 

The second scoring method focuses on the shoelace color, because our model was specifically fine-tuned to recognize and identify features related to shoelaces. By prioritizing the correct identification of shoelace colors, this method directly evaluates the model’s performance in the area it was trained on. In this method, 1 point was awarded for each result with a correctly identified shoelace color, with no points given for incorrect guesses. With a maximum possible score of 130 points (5 points per test case × 26 test cases), the FashionCLIP model scored 35 points, while our fine-tuned model scored 50 points.

 

 

                    Score (max. 130)
Fashion CLIP        35
Fine-tuned model    50

 

The third scoring method, unlike the first two, considers the position of the image results, emphasizing the importance of a ‘good’ result appearing higher in the search results. When a text prompt is entered and 5 results are returned, accurate results that rank higher receive more points: 5 points for 1st position, 4 points for 2nd position, and so on.

 

To compare the fine-tuned model with FashionCLIP, we summed the scores of all test cases, with a maximum possible score of 390 points (15 points per test case × 26 test cases). Results:

 

 

                    Score (max. 390)
Fashion CLIP        84
Fine-tuned model    127

 

The fourth scoring method also scores results based on their position, but focuses solely on the correct shoelace color. The results are displayed in the table below:

 

 

                    Score (max. 390)
Fashion CLIP        111
Fine-tuned model    163

 

All four scoring methods were automated through Python scripts that fully reflect the scoring and calculation process outlined above. This automation was applied to all test cases where the number of corresponding product images in the dataset was 5 or more; all 26 test cases listed above meet this criterion.
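As a sketch of the position-weighted scoring behind the third and fourth methods (the correctness predicate is passed in, since the two methods differ only in what they check; treating an ‘accurate’ result in the third method as matching both category and color is an assumption here):

```python
# Illustrative position-weighted scoring: the top result is worth 5 points, the second 4,
# and so on, but only results that pass the correctness check earn points.
def position_weighted_score(results, is_correct):
    return sum(5 - position for position, result in enumerate(results[:5]) if is_correct(result))

results = [("Casual shoes", "Brown"), ("Casual shoes", "White"), ("Casual shoes", "Brown"),
           ("Casual shoes", "Brown"), ("Casual shoes", "Beige")]

# Third method: category and shoelace color must both match the prompt.
method3 = position_weighted_score(results, lambda r: r == ("Casual shoes", "Brown"))
# Fourth method: only the shoelace color has to match.
method4 = position_weighted_score(results, lambda r: r[1] == "Brown")
print(method3, method4)   # 10 10 for this made-up example
```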

 

Since the process was semi-automated, all models recognized as good performers at the end of the validation phase underwent these tests. After running the automated tests, manual testing was performed on the models that showed good results, specifically for test cases where the number of images per product was fewer than five. For these cases, testing was done manually to ensure consistent results and to verify that the core functionality remained intact.

Benchmarks

Beyond the metrics outlined above, as the final step in evaluating the performance of the fine-tuned models, we assessed their accuracy on three entirely new datasets: Fashion-MNIST, DeepFashion, and a custom fashion dataset. Fashion-MNIST and DeepFashion are publicly available datasets containing diverse fashion images, while the custom fashion dataset was created using images provided by a fashion retail company.

 

The goal was to perform a final, general check to ensure that our fine-tuned model functions correctly for broader tasks, rather than just the specific task for which it was fine-tuned. At this stage, we did not focus on the color of the shoelaces. Instead, we concentrated solely on whether the model correctly predicted the category of each product in the dataset. Since these datasets were balanced, accuracy was a good choice, given its simplicity and its clear measure of how often predictions match the true values.

 

The results of our benchmark comparison are summarized in the chart provided. The results demonstrate that our model’s performance in the general test is not significantly inferior to FashionCLIP, indicating that our fine-tuning process has successfully maintained the high standards set by the original model. While retaining the performance of FashionCLIP, we tailored it to better suit our specific use case of predicting shoelace colors.

Benchmark results

Final Model Evaluation

As the final step of this fine-tuning journey, after rigorous testing and the use of various metrics, it was time to compare the models and see which one was the best of the group. At first glance, some models performed better on the tests used in the training phase, such as accuracy, top-K, and the other metrics. Others were better at image retrieval, and some at the benchmarks. To decide which one was the best, we had to establish priorities and set clear criteria, which are:

 

  • Better results than FashionCLIP in all tests explained in the document
  • A maximum of ±5% deviation in the benchmark tests

 

Applying both of the criteria established above, we arrived at a list of six models. To make the final decision, we gave the highest priority to image retrieval, as it provided the most relevant demonstration of the model’s performance in implementing product search.

 

The table below displays the top six models, with the results of FashionCLIP and the winning model highlighted in green:

Model analytics

Conclusion

Testing fine-tuned models is a critical step in the machine learning pipeline, ensuring that these models perform optimally on their specialized tasks. By developing robust testing approaches, we can accurately compare different models and identify the superior one.

 

Throughout this document, we have emphasized the importance of both automated and manual evaluations, as well as comprehensive scoring systems. By rigorously testing and comparing fine-tuned models, we can drive advancements in machine learning and deliver more reliable and effective solutions.

Written by:
Dragana Momčilović, Bojana Gudurić Maksimović, Nemanja Romanić,
Danijela Šavija, Nikolina Pavković, Jana Terzić, Milica Milivojević
Levi9
Published:
23 May 2025
