sm4llVTONs: A Family of Specialized Virtual Try-On Models

Andrea Baioni, Alex Puliatti

andrea@yourmirror.io, alex@yourmirror.io

YourMirror AI


Abstract

sm4llVTONs (same methodology 4 all VTON) is a new family of highly efficient, specialized diffusion models for virtual try-on (VTON) applications and adjacent tasks such as face swapping and background replacement. This page provides an overview of our current models, methodology, and performance benchmarks. The full details of our training methodology and architecture will be published in our upcoming research paper.

Introduction

Our Current Models

The sm4llVTONs family consists of several lightweight models, each an expert in a specific VTON domain. This specialization allows them to achieve state-of-the-art results while being trained on relatively small, targeted datasets.

Model Name     Task                      Status
sm4ll-eye      Sunglasses & Eyewear      Pre-release
sm4ll-shoes    Shoes & Footwear          Pre-release
sm4ll-face     Face Swapping             Beta
sm4ll-top      Upper Body Garments       Alpha
sm4ll-bottom   Lower Body Garments       Alpha
sm4ll-dress    Dresses                   Alpha
sm4ll-bg       Background Replacement    Alpha

Key Philosophy & Features

Our work is guided by a core philosophy that distinguishes it from general-purpose VTON and image editing models. Instead of a single, large model that handles many tasks, sm4llVTONs are experts fine-tuned for a single purpose. This results in higher fidelity, better detail preservation, and more intuitive control. Our methodology is built around a "train-like-you-infer" principle, ensuring that our models perform reliably on in-the-wild images, not just curated datasets.

Methodology

While the complete methodology will be detailed in our paper, we can share a high-level overview of our three-stage process.

Data Curation & Preparation

Each model is trained on a dataset constructed from paired image sources and various augmentation techniques, such as cropping to mask size and automasking. To improve robustness to real-world use cases, we employ an inference-aware masking strategy during training: instead of applying the best possible masking pipeline for the specific dataset, we apply the same masking pipeline that will be used at inference time. This way, we generate masks that simulate user behaviour in production pipelines, rather than relying on best-case scenarios that try to replicate training conditions at inference. The result is a model that is more robust to both the variance and the artifacts common in user-uploaded content.
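
To make the idea concrete, the sketch below shows how a training pair could be assembled with the same automasking routine that runs at inference time. This is a minimal sketch under stated assumptions: the function names (automask, build_training_pair) and the coarse box-shaped placeholder mask are illustrative, not our actual implementation.

import numpy as np
from PIL import Image

def automask(person_img: Image.Image, category: str) -> np.ndarray:
    # Stand-in for the production automasking pipeline (e.g. a category-
    # specific segmentation model). A coarse box mask keeps the sketch
    # runnable; the real pipeline is model- and category-specific.
    h, w = person_img.height, person_img.width
    mask = np.zeros((h, w), dtype=np.float32)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = 1.0
    return mask

def build_training_pair(person_img: Image.Image, product_img: Image.Image, category: str):
    # person_img / product_img are assumed to be RGB PIL images.
    # Key point: the mask used to build the training sample comes from the
    # same automasking pipeline that will run in production, not from
    # dataset-specific ground-truth masks.
    mask = automask(person_img, category)
    person = np.asarray(person_img, dtype=np.float32) / 255.0
    masked_person = person * (1.0 - mask[..., None])
    return masked_person, np.asarray(product_img, dtype=np.float32) / 255.0, mask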

The Training Process

All models in the sm4llVTONs family are trained using a unified methodology derived from foundational instruction-based models. We have introduced significant modifications to the training loop, including an optimized loss calculation. This technique focuses the model’s learning exclusively on the relevant image regions, which dramatically improves sample efficiency and the quality of the final result.
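
As an illustration of region-focused training, the snippet below restricts a standard epsilon-prediction diffusion loss to the masked try-on region. This is a plausible sketch of what such an optimized loss calculation can look like, assuming an epsilon-prediction objective; it is not the exact loss we use.

import torch
import torch.nn.functional as F

def masked_diffusion_loss(eps_pred: torch.Tensor, eps_target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # eps_pred / eps_target: (B, C, H, W) predicted and target noise.
    # mask: (B, 1, H, W) binary mask, 1.0 inside the region being edited.
    per_pixel = F.mse_loss(eps_pred, eps_target, reduction="none")
    # Average only over masked pixels so gradients are not diluted by the
    # unchanged background.
    denom = mask.sum() * eps_pred.shape[1] + 1e-8
    return (per_pixel * mask).sum() / denom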

Inference & Validation

Inference is handled via ComfyUI, in an environment that mirrors the training conditions. The most critical component is a custom automasking process, tailored to each model's objective, that is consistent between training and real-world use. This end-to-end consistency is fundamental to our models' success and addresses a common pitfall in VTON systems, where models fail to generalize because their training data does not reflect real-world input.
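
Schematically, the inference path can reuse the exact automasking callable from training, as in the sketch below. The function names are assumptions for illustration (the actual pipeline is implemented as a ComfyUI workflow); run_vton_inpaint stands in for the model-specific inpainting step.

def try_on(person_img, product_img, category, automask, run_vton_inpaint):
    # automask is the same callable used when building training samples;
    # keeping it identical between training and inference is the end-to-end
    # consistency described above.
    mask = automask(person_img, category)
    return run_vton_inpaint(person_img, product_img, mask)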

Results

Glasses

Top

Bottom

Dresses

Shoes

Qualitative results and creative production pipeline

While we tried our best to get Flux Kontext to deliver acceptable results, we could not find a prompt structure that produced the desired type of image modification consistently. As such, these results are the best we could generate, but we don't necessarily think they are indicative of what the model is capable of in the hands of Black Forest Labs. It is, however, in the nature of instruction-based models to be somewhat limited in their ability to properly execute prompts written by users who either don't have access to the training dataset or have limited access to proper documentation.

On another note, OpenAI's GPT-4o is probably the best model in terms of aesthetic scoring, but, as all current autoregressive models do, it "cheats" by delivering an image that is not a direct modification of either of the two input images, but rather a third, similar image. Such an output cannot be used in a production environment, as even small changes to the product and/or the underlying approved assets are usually not acceptable in a creative production pipeline.

In terms of autosegmentation, FASHN.ai is the only one among the benchmarked models that takes a predictive approach to the segmentation problem. As such, some of the worst generations from FASHN should not be taken as a limitation of the underlying model, but, more likely than not, as a limitation of the autosegmentation system itself.

CatVTON Flux still performs rather well in terms of generalization, even considering that the dataset it is based on, DressCode, covers Top, Bottom, and Full Body garments only. Regardless, it can still perform decently when tasked with out-of-scope items such as eyewear and shoes.

Real World Fashion Production Pipelines

There are currently multiple ways to generate images of people wearing products, but most of them do not integrate fully into traditional production pipelines. As shown in the example below, VTON models should be able to fit into a complex system where each image is approved across different stages and by different people in a corporate environment.

Some of the most commonly known models, like GPT-4o, cannot satisfy this need, as the images they generate are always slight variations of both the product and the underlying asset. Other models, like CatVTON or other commercially available VTONs, do not reach the desired level of fidelity when generating products, or suffer from poor masking quality that ends up changing the full outfit in a significant way.

Although these benchmarks are from a preliminary testing phase and a more in-depth evaluation is ongoing, the sm4ll model family has demonstrated consistently stronger performance compared to the other models in the following areas:

  • Delivering accurate results through models specialized in specific product categories.
  • Being able to minimally affect the underlying input image(s) while retaining product precision, which is, in our experience, a relevant concern in real-world production pipelines.

Based on our time spent working with and talking to the Creative, Marketing, and Product departments at global brands, we believe that expert models satisfy both of the above needs.

Another way to address these needs would be to train a separate LoRA for every product. While our results are highly competitive with single-product, single-view LoRAs, the slight trade-off in quality is strategically outweighed by the scalability of not needing to train a unique model for every product. Training LoRAs is time- and resource-intensive, and often requires a higher degree of automation than VTON models if companies wish to develop a truly scalable system.

The necessary degree of precision and fidelity also differs depending on whether the models are used in a B2B or a B2B2C (or B2C) setting. For bigger campaigns and marketing efforts, LoRAs provide a higher degree of fidelity than VTON models, however specialized the latter are. On the other hand, if the effort is geared towards letting marketplace users try on products, specialized VTONs are faster to deploy than both LoRAs and regular AR solutions.

Outside of traditional VTON use-cases

Face-swap

Currently in its initial stages, sm4ll-face has undergone preliminary testing against best-in-class, publicly available face swapping models: ACE++ and ReActor.

From the early tests we have conducted so far, while ACE++ came close to our results in terms of scoring and qualitative similarity, it struggled to adapt the subject's lighting to the surrounding context (see example 1 and example 2, where a cold, blue-ish light is carried over from the subject's picture). sm4ll-face, instead, achieves a slightly better FID score in around 60% of the benchmark generations while consistently relighting the subject's face to match the surrounding context.

While preliminary tests with the Beta version of the sm4ll-face model have not consistently surpassed ACE++ and ReActor in FID and CLIP scores, we will run more comprehensive quantitative tests once the model is ready for release.

Conclusion

The sm4llVTONs family of models represents a significant step forward in the field of specialized virtual try-on applications. By focusing on lightweight, expert models and a "train-like-you-infer" methodology, we achieve high-fidelity results that are robust to real-world conditions. Future work will involve the release of the full research paper with detailed benchmarks and continued development of the alpha and beta models.

BibTeX

@misc{sm4llVTONs2025,
  title={sm4llVTONs: A Family of Specialized Virtual Try-On Models},
  author={Andrea Baioni and Alex Puliatti},
  year={2025}
}

Acknowledgement

This work builds upon the foundational research of many, including:

Brooks, T., Holynski, A. and Efros, A.A. (2023). InstructPix2Pix: Learning to Follow Image Editing Instructions. arXiv preprint arXiv:2211.09800.

Chong, Z., et al. (2025). CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. arXiv preprint arXiv:2407.15886.

Huang, L., et al. (2024). In-Context LoRA for Diffusion Transformers. arXiv preprint arXiv:2410.23775.