sm4llVTONs (same methodology 4 all VTON) is a new family of highly efficient and specialized diffusion models for virtual try-on (VTON) applications, and more. This page provides an overview of our current models, methodology, and performance benchmarks. The full details of our training methodology and architecture will be published in our upcoming research paper.
The sm4llVTONs family consists of several lightweight models, each an expert in a specific VTON domain. This specialization allows them to achieve state-of-the-art results on relatively small, targeted datasets.
| Model Name | Task | Status |
| --- | --- | --- |
| sm4ll-eye | Sunglasses & Eyewear | Pre-release |
| sm4ll-shoes | Shoes & Footwear | Pre-release |
| sm4ll-face | Face Swapping | Beta |
| sm4ll-top | Upper Body Garments | Alpha |
| sm4ll-bottom | Lower Body Garments | Alpha |
| sm4ll-dress | Dresses | Alpha |
| sm4ll-bg | Background Replacement | Alpha |
Our work is guided by a core philosophy that distinguishes it from general-purpose VTON and image editing models. Instead of a single, large model that handles many tasks, sm4llVTONs are experts fine-tuned for a single purpose. This results in higher fidelity, better detail preservation, and more intuitive control. Our methodology is built around a "train-like-you-infer" principle, ensuring that our models perform reliably on in-the-wild images, not just curated datasets.
While the complete methodology will be detailed in our paper, we can share a high-level overview of our three-stage process.
The model is trained on a dataset constructed from paired image sources and various augmentation techniques, such as cropping to mask size and automasking. To improve robustness to real-world use cases, we employ an inference-aware masking strategy during training: instead of applying the best possible masking pipeline for the specific dataset, we apply the same masking pipeline that will be used at inference. This way, we generate masks that simulate user behaviour in production pipelines, rather than relying on best-case scenarios that try to replicate training conditions at inference. This enhances the model's robustness against both the variance and the artifacts common in user-uploaded content.
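As a rough illustration of this strategy (not our actual pipeline, whose details will appear in the paper), the sketch below generates training masks by calling a stand-in `automask` function, i.e. the same segmentation step an inference pipeline would run, and then perturbs them to mimic the looser masks seen in production. The function body, augmentations, and parameters are illustrative assumptions only.

```python
# Minimal sketch of the "train-like-you-infer" masking idea.
# `automask` stands in for the segmentation pipeline used at inference
# time; the dummy mask and augmentation parameters are illustrative.
import random
import numpy as np
import cv2


def automask(image: np.ndarray, target: str) -> np.ndarray:
    """Placeholder for the production automasking pipeline.

    In a real system this would call the same segmentation model that
    runs at inference time and return a binary mask for `target`
    (e.g. "eyewear", "shoes"). Here it returns a dummy centered region
    so the sketch stays runnable.
    """
    h, w = image.shape[:2]
    mask = np.zeros((h, w), np.uint8)
    mask[h // 3:2 * h // 3, w // 3:2 * w // 3] = 1
    return mask


def training_mask(image: np.ndarray, target: str) -> np.ndarray:
    """Build a training mask with the inference-time pipeline, then
    perturb it to simulate the variance of masks seen in production."""
    mask = automask(image, target)

    # Random dilation: users and automatic tools rarely produce tight masks.
    size = random.choice([5, 9, 15])
    mask = cv2.dilate(mask, np.ones((size, size), np.uint8), iterations=1)

    # Occasionally fall back to the mask's bounding box, mimicking the
    # coarse rectangular selections common in real pipelines.
    if random.random() < 0.3:
        ys, xs = np.nonzero(mask)
        if len(xs) > 0:
            box = np.zeros_like(mask)
            box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
            mask = box
    return mask
```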
All models in the sm4llVTONs family are trained using a unified methodology derived from foundational instruction-based models. We have introduced significant modifications to the training loop, including an optimized loss calculation. This technique focuses the model’s learning exclusively on the relevant image regions, which dramatically improves sample efficiency and the quality of the final result.
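As an example of what such a region-focused loss can look like, here is a minimal PyTorch sketch of a masked noise-prediction objective. The tensor names, weighting, and normalization are assumptions for illustration, not the exact loss used by sm4llVTONs.

```python
# Hedged sketch of a region-focused diffusion loss, assuming a standard
# noise-prediction (epsilon) objective computed in latent space.
import torch
import torch.nn.functional as F


def masked_diffusion_loss(
    model_pred: torch.Tensor,   # (B, C, H, W) predicted noise
    noise: torch.Tensor,        # (B, C, H, W) target noise
    latent_mask: torch.Tensor,  # (B, 1, H, W), 1 = editable region
    eps: float = 1e-6,
) -> torch.Tensor:
    """MSE restricted to the masked (editable) region.

    Pixels outside the mask contribute nothing to the gradient, so the
    model spends its capacity on the garment/accessory area instead of
    relearning the unchanged parts of the image.
    """
    per_pixel = F.mse_loss(model_pred, noise, reduction="none")
    masked = per_pixel * latent_mask
    # Normalize by the masked area so small masks are not under-weighted.
    return masked.sum() / (latent_mask.sum() * model_pred.shape[1] + eps)
```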
Inference is handled via ComfyUI, in an environment that mirrors the training conditions. Our most critical component is a custom automasking process, tailored to each model's objective, that is consistent between training and real-world use. This end-to-end consistency is fundamental to our models' success and addresses a common pitfall in VTON systems, where models fail to generalize because their training data does not reflect real-world input.
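ComfyUI workflows are node graphs rather than code, so the sketch below uses a generic diffusers inpainting pipeline purely as a stand-in for the actual workflow and the sm4ll checkpoints (which are not public and additionally condition on a reference product image). The checkpoint name and file paths are placeholders; the point is the structure: the mask fed to the model at inference comes from the same automasking step used during training.

```python
# Illustrative stand-in for the inference setup: a generic inpainting
# pipeline from diffusers plus the same automasking step used in training.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline


def automask(image: Image.Image, target: str) -> Image.Image:
    """Stand-in for the shared train/inference automasking pipeline
    (see the training sketch above); returns a binary PIL mask."""
    w, h = image.size
    mask = np.zeros((h, w), np.uint8)
    mask[h // 3:2 * h // 3, w // 3:2 * w // 3] = 255
    return Image.fromarray(mask)


# Placeholder checkpoint; the real system runs the sm4ll models in ComfyUI.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

person = Image.open("person.png").convert("RGB")
mask = automask(person, target="eyewear")  # same masking logic as training

result = pipe(
    prompt="person wearing the reference sunglasses",
    image=person,
    mask_image=mask,
).images[0]
result.save("tryon.png")
```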
While we tried our best to get Flux Kontext to deliver acceptable results, we did not find a consistent prompt structure that could reliably produce the desired type of image modifications. As such, these results are the best we could generate, but we do not necessarily think they are indicative of what the model is capable of in the hands of Black Forest Labs. It is, in any case, in the nature of instruct-based models to be somewhat limited in their ability to properly execute on prompts written by users who either do not have access to the training dataset or have limited access to proper documentation.
On another note, OpenAI's GPT-4o is probably the best model in terms of aesthetic scoring, but, like all current autoregressive models, it "cheats" by delivering an image that is not a direct modification of either of the two input images, but rather a third, similar image. This cannot be used in a production environment, as even small changes to the product and/or the underlying approved assets are usually not acceptable in a creative production pipeline.
In terms of autosegmentation, FASHN.ai is the only one among the benchmarked models that takes a predictive approach to the segmentation problem. As such, some of the worst generations from FASHN should not be taken as a limitation of the underlying model, but more likely than not as a limitation of the autosegmentation system itself.
CatVTON Flux still performs rather well in terms of generalization, even considering that the dataset it is based on, DressCode, is optimized for Top, Bottom, and Full Body garments only. Despite this, it can still perform decently when tasked with out-of-scope items, such as eyewear and shoes.
There are currently multiple ways to generate images of people wearing a product, but most of them do not fully integrate into traditional production pipelines. As shown in the example below, VTON models should be able to fit into a complex system where each image is approved across different stages and by different people in a corporate environment.
Some of the most commonly known models, like GPT-4o, cannot satisfy this need, as the images they generate are always slight variations of both the product and the underlying asset. Other models, like CatVTON or other commercially available VTONs, do not reach the desired level of fidelity when generating products, or suffer from poor masking quality that ends up changing the full outfit in a significant way.
Although these benchmarks are from a preliminary testing phase and a more in-depth evaluation is ongoing, the sm4ll model family has demonstrated consistently stronger performance than the other models tested.
Based on our time spent working with and talking to the Creative, Marketing, and Product departments at global brands, we believe that expert models can satisfy both of the needs above.
Another way to solve these needs would be to train single LoRAs on every product. While our results are highly competitive with single-product, single-view LoRAs, the slight trade-off in quality is strategically outweighed by the scalability factor of not needing to train a unique model for every product. Training LoRAs is time and resource intensive, and often necessitates a higher degree of automation than VTON models if companies wish to develop a truly scalable system.
The necessary degree of precision and fidelity also differs depending on whether the models are used in a B2B or a B2B2C (or B2C) setting. For bigger campaigns and marketing efforts, LoRAs provide a higher degree of fidelity than VTON models, however specialized the latter are. On the other hand, if the effort is geared towards letting marketplace users try on products, specialized VTONs are faster to deploy than both LoRAs and regular AR solutions.
Currently in its initial stages, sm4ll-face has undergone preliminary testing against best-in-class, publicly available face swapping models: ACE++ and ReActor.
In the early tests we have conducted so far, ACE++ came close to our results in terms of scoring and qualitative similarity, but it struggled to adapt the subject's lighting to the surrounding context (see example 1 and example 2, where a cold, blue-ish light carries over from our subject's picture). sm4ll-face, in contrast, achieves a slightly better FID score in around 60% of the benchmark generations while consistently achieving good relighting of the subject's face based on the surrounding context.
While preliminary tests with the Beta version of sm4ll-face have not consistently surpassed ACE++ and ReActor in FID and CLIP scores, we will run more comprehensive quantitative tests once the model is ready for release.
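For reference, the snippet below shows one way such metrics can be computed with off-the-shelf tooling: clean-fid for FID between a folder of generations and a folder of references, and a CLIP image-image similarity for individual pairs. Folder paths and the CLIP checkpoint are placeholders and do not reflect our actual benchmark setup.

```python
# Illustrative metric computation with public tools; paths are placeholders.
import torch
from cleanfid import fid
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# FID between a folder of generated face swaps and a folder of references.
fid_score = fid.compute_fid("outputs/sm4ll_face", "references/ground_truth")
print(f"FID: {fid_score:.2f}")

# CLIP image-image similarity between a single generation and its reference.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_image_similarity(path_a: str, path_b: str) -> float:
    images = [Image.open(path_a), Image.open(path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] @ feats[1]).item())


print(f"CLIP-I: {clip_image_similarity('outputs/face_001.png', 'references/face_001.png'):.3f}")
```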
The sm4llVTONs family of models represents a significant step forward in the field of specialized virtual try-on applications. By focusing on lightweight, expert models and a "train-like-you-infer" methodology, we achieve high-fidelity results that are robust to real-world conditions. Future work will involve the release of the full research paper with detailed benchmarks and continued development of the alpha and beta models.
@inproceedings{sm4llVTONs2025,
  title={sm4llVTONs: A Family of Specialized Virtual Try-On Models},
  author={Andrea Baioni and Alex Puliatti},
  year={2025}
}
This work builds upon the foundational research of many, including:
Brooks, T., Holynski, A. and Efros, A.A. (2023). InstructPix2Pix: Learning to Follow Image Editing Instructions. arXiv preprint arXiv:2211.09800.
Chong, Z., et al. (2025). CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. arXiv preprint arXiv:2407.15886.
Huang, L., et al. (2024). In-Context LoRA for Diffusion Transformers. arXiv preprint arXiv:2410.23775.