In my final year at university, we received some servers, but one was special: it contained an NVIDIA RTX 6000 Ada, a card that would give any gamer a heart attack if they learned its power (and its price).

This was a card built for heavy computation and artificial intelligence: thanks to its 48GB of VRAM, it could handle huge models, from large language models all the way to video generation.

I installed Debian on it and fought with the NVIDIA drivers for a whole day, but eventually I was able to run some test models with PyTorch! We could already see the endless possibilities of this machine.
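If you are ever fighting the same fight, a quick sanity check like this one (plain PyTorch, not our exact script) tells you when the drivers have finally surrendered:

```python
import torch

# If this prints the fallback message, the NVIDIA driver or CUDA runtime is still broken.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(torch.cuda.get_device_name(0))              # e.g. "NVIDIA RTX 6000 Ada Generation"
    print(f"{props.total_memory / 1024**3:.0f} GiB of VRAM")
else:
    print("CUDA not available - check the driver installation")
```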

We wanted a solution that would let anybody send any kind of request to the server and have it automatically load and unload models depending on those requests, so that any model could be made available whenever we wanted.

Sadly, there was no off-the-shelf solution. NVIDIA offers its “Triton Inference Server”, but it is hard to set up and keeps all the models loaded at once. vLLM and PyTorch offer simple ways to expose an LLM through a web interface, but that still would not let us serve several models easily, and it is restricted to LLM-style models, while we also wanted to try image and video generation.

I ended up building my own solution: models would automatically load and unload depending on the user’s requests, each one would be easy to use through a custom Gradio interface, easy to automate through an API, and easy to read about thanks to FastAPI’s auto-generated documentation.
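To give an idea of its shape, here is a heavily simplified sketch rather than the real code: the model names, the stand-in loaders and the single-slot logic are placeholders I made up for illustration.

```python
import gradio as gr
import torch
from fastapi import FastAPI, HTTPException

app = FastAPI(title="GPU model server")

# Map each model name to a callable that builds it on the GPU.
# (Tiny stand-in modules here; the real thing would load an LLM,
# a diffusion pipeline, and so on.)
LOADERS = {
    "text": lambda: torch.nn.Linear(8, 8).cuda(),
    "image": lambda: torch.nn.Conv2d(3, 3, 3).cuda(),
}

current = {"name": None, "model": None}  # only one model resident at a time


def ensure_loaded(name: str):
    """Load `name`, unloading whatever currently occupies the GPU."""
    if name not in LOADERS:
        raise HTTPException(status_code=404, detail=f"unknown model: {name}")
    if current["name"] != name:
        current["model"] = None        # drop the previous model first
        torch.cuda.empty_cache()       # give its VRAM back to the driver
        current["name"], current["model"] = name, LOADERS[name]()
    return current["model"]


@app.post("/models/{name}/generate")
def generate(name: str, prompt: str):
    model = ensure_loaded(name)
    # ... run the actual inference with `model` and `prompt` here ...
    return {"model": name, "prompt": prompt}


# A Gradio UI can be mounted on the same FastAPI app, so each model gets a
# web page on top of the documented endpoints (FastAPI serves docs at /docs).
demo = gr.Interface(fn=lambda p: f"echo: {p}", inputs="text", outputs="text")
app = gr.mount_gradio_app(app, demo, path="/ui")
```

Run it with uvicorn (for example `uvicorn server:app` if the file is called server.py) and the interactive documentation shows up at /docs for free.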

After a week of work, this ended up working pretty well. It could only handle one model at a time, since reliably measuring how much VRAM a model will need is harder than it sounds, but that was more than enough for our usage.
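The problem is that the numbers you can actually query don’t tell you whether a second model would fit. A rough illustration (generic PyTorch, made-up sizes):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a real model

weights = torch.cuda.memory_allocated()      # what the parameters take
x = torch.randn(64, 4096, device="cuda")
y = model(x)                                 # inference allocates more
peak = torch.cuda.max_memory_allocated()     # includes temporary buffers
reserved = torch.cuda.memory_reserved()      # what the allocator grabbed from the driver

print(f"weights:  {weights / 2**20:.1f} MiB")
print(f"peak:     {peak / 2**20:.1f} MiB")
print(f"reserved: {reserved / 2**20:.1f} MiB")
# The peak depends on batch size, input size and the model's internals,
# so predicting whether two models can share the card is mostly guesswork.
```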

The hardest part was that some models require different Python versions and dependencies. This forced me to add a subprocess mechanism based on the Pyro5 library, which starts each such model in its own Conda environment with the right Python version and dependencies. This led to some tough-to-find bugs, but it ended up working really well.
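In practice it looks something like the two snippets below, a rough reconstruction rather than the project’s code (the environment name, port and worker module are placeholders): the server starts a worker inside the model’s own Conda environment with `conda run`, and the two processes talk over Pyro5.

```python
# worker.py - runs inside the model's own Conda environment,
# with whatever Python version and dependencies that model needs.
import Pyro5.api


@Pyro5.api.expose
class ModelWorker:
    def generate(self, prompt: str) -> str:
        # ... load the model and run the real inference here ...
        return f"generated from: {prompt}"


daemon = Pyro5.api.Daemon(host="127.0.0.1", port=50001)
daemon.register(ModelWorker(), objectId="worker")
print("worker ready", flush=True)
daemon.requestLoop()
```

```python
# server side - spawn the worker in its Conda environment and call it remotely
import subprocess
import time

import Pyro5.api
import Pyro5.errors

proc = subprocess.Popen(
    ["conda", "run", "-n", "videogen-env", "python", "worker.py"]
)

worker = Pyro5.api.Proxy("PYRO:worker@127.0.0.1:50001")
for _ in range(60):                     # wait until the daemon is listening
    try:
        worker._pyroBind()
        break
    except Pyro5.errors.CommunicationError:
        time.sleep(1)

print(worker.generate("a cat on a server rack"))
proc.terminate()                        # unloading is just stopping the worker
```

A nice side effect of this design is that unloading such a model is trivial: terminating the worker process releases its GPU memory, no matter which PyTorch version it was running.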