How OpenAI efficiently manages the fine-tuned models of millions of customers
GPU hardware is really expensive, and OpenAI's pricing model is based on model usage.
One GPU cluster per customer?
Thousands of customers doing fine-tuning would mean thousands of fine-tuning clusters.
And what if customers train models but never use them?
How would OpenAI make money then?
OpenAI does this smartly by never letting any of the base model's billions of parameters get fine-tuned by any customer.
Instead, only the parameters of an adapter specific to each customer are updated.
Adapters that can be “plugged” into the base model.
The most common and efficient adapter type is the Low-Rank Adapter (LoRA): small, simple modules with very few trainable parameters.
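A minimal PyTorch sketch of the idea (illustrative only, not OpenAI's actual implementation): the base weights stay frozen, and only two small low-rank matrices A and B are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # base model stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # The only trainable parameters: rank * (in + out) instead of in * out.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: training starts at the base model
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

For rank 8 on a 4096x4096 layer, that is about 65K trainable parameters instead of roughly 16.8M.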
One adapter per task. Adapters become active only at serving time, and each customer query is routed through that customer's adapter.
Each adapter is trained separately and plugged in at serving time.
With multiple LoRA adapters, it becomes easy for OpenAI to deploy many fine-tuned models on the same GPU cluster.
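A sketch of what this serving-time routing could look like, reusing the LoRALinear class above (the names serve, customer_a, etc. are made up for illustration):

```python
base_layer = nn.Linear(512, 512)        # stands in for a layer of the shared base model
base_layer.requires_grad_(False)

# One adapter per customer, all wrapping the SAME frozen base layer.
adapters: dict[str, LoRALinear] = {
    "customer_a": LoRALinear(base_layer, rank=8),
    "customer_b": LoRALinear(base_layer, rank=8),
}

def serve(customer_id: str, x: torch.Tensor) -> torch.Tensor:
    """Route a query through the customer's adapter; fall back to the plain base model."""
    adapter = adapters.get(customer_id)
    return adapter(x) if adapter is not None else base_layer(x)

out = serve("customer_a", torch.randn(1, 512))
```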
Trained LoRA weights are stored in a model registry.
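Because the base weights never change, only the adapter's own factors need to go to the registry. A hedged sketch, with a made-up storage path:

```python
# Save: keep only the adapter's own parameters (a few MB, not the full model).
adapter = adapters["customer_a"]
lora_state = {k: v for k, v in adapter.state_dict().items() if not k.startswith("base.")}
torch.save(lora_state, "registry/customer_a_lora.pt")   # hypothetical registry path

# Load at serving time: wrap the same frozen base layer, then restore A and B.
restored = LoRALinear(base_layer, rank=8)
restored.load_state_dict(lora_state, strict=False)       # strict=False: base.* stays as-is
```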
OpenAI monitors customer usage behavior and accordingly autoscales or scales out the compute associated with each adapter.
Source: AIedge newsletter