ML Deployment: When the Big Problem is… Going Big

After weeks of training and fine-tuning a model, you are finally ready to deploy to production on the cloud for everyone to use. This will be a piece of cake.

Famous last words.

The modern-day ML engineer will immediately be bombarded with requirements. "This uses too many GPUs, and the network costs to pass the data back and forth are too high. We can't afford this," says the cloud infra team. "The calls are too spiky, the worker scale-up and scale-down take too long, and the throughput isn't high enough," says the compute team. "Packaging is hard, our fault tolerance isn't generic enough to support this, and we have to maintain both CPU and GPU versions," says the ML infra team.

Great, you think, time to spend the next couple months (if lucky) simply trying to get the model into production. Or wait, maybe there is a third party solution that can help with model deployment and orchestration!

Not minutes after raising this idea, a flood of concerns pours in. "Do we have to give up our IP?" ask the legal and security teams. "We need to make sure we use our cloud provider credits," says finance. "This better not take a lot of time to integrate and maintain," says your manager. Heck, even you have concerns:

Will the results be reproducible?
Will it work for the size and throughput of my model?
Will it constrain what kinds of architectures and custom ops I can use in the future?
Will it adapt code specialized for my backend of choice?
Will I get support from the solution's creators?
And most importantly, will this service even solve all of the cost, orchestration, and other pains that made me consider this solution in the first place?

You do your research, you check whether all of the pains and concerns are addressed, and then you make a decision - all third party solutions seem incomplete, they might lead to more work than benefit, and they might constrain future development. You'll just roll up your sleeves and build out the model deployment from scratch. Trying to deploy your ML workflow at a big scale has, quite literally, become a big headache.

If you're an ML practitioner, you have likely experienced the above situation. Or maybe you are one of the people who have felt the pains or raised the concerns. In either case, at Exafunction, we started with the thesis that the last decision could have a different outcome - highly effective and efficient model deployment can be automated without any additional engineering effort. So we rolled up our sleeves and started to dissect what would be needed to do so, talking to practitioners across the ML product landscape. And to all of our friends that have been in situations like the above before, we learned a few things:

Your pains exist. And they will only get bigger.

At Exafunction, we've seen companies go through the process of doing more and more complex ML at larger and larger scales. There are more models. There are more custom ops. There is stateful computation. There are bigger inputs. There are both realtime and batch workloads. There is more and more rogue GPU code. The models need to start spanning multiple GPUs.

Costs go up. If you want the same accuracy, latency goes up.

You'll need a dedicated team to root cause, speed up the models, set up work scheduling, investigate other types of accelerators, and more.

Costs go up more.

From our experiences and conversations with others, we are certain that for any company that is serious about doing ML, this evolution happens rapidly, often within a year. And through no fault of anyone, it often comes as a surprise (after all, ML acceleration is not the company's actual product!).

Imagine what you could do if you needed just a tenth of your current compute usage for your ML workloads, without having to sacrifice accuracy and latency or hire an entire team to try to get there - you could parlay cost savings into other areas, run your workloads at the larger scale you always desired, and be ready for the increased demand as your team and projects grow.

The pains are more complex and nuanced than at first glance.

Take something like "our GPU spend is too high." It's a pain we heard over and over, but diving deeper, the causes were varied:

Some workloads, like e-commerce applications during holiday seasons, were spiky, and the difficulty and time needed to scale workers up and down meant people kept their GPU instances running all the time.
Some workloads truly just had poor utilization of GPUs, such as in autonomous vehicle simulation, where the sheer amount of CPU work before and after the inference stages meant GPUs lay idle for vast stretches of time.*
Some workloads, such as fraud detection, must provide high availability and so have to pay a premium for scarce, expensive GPU machines rather than spot or preemptible instances (and also have to implement the code to support warm GPU, warm CPU, or fast startup!).
Some workloads have a few models with high throughput and poor autoscaling (one model to many machines), such as processing satellite imagery or DNA sequences. Other workloads have many models with low throughput and poor colocation (many models to one machine), such as in the NLP space where there might be tens to hundreds of fine-tuned models.
Some workloads actually get burned on the network costs, passing large intermediate outputs back and forth across the network, such as in video processing.
Different workloads have different combinations of these causes. Solutions that address just one or two of these causes don't actually help most developers, and they often constrain developers from modifying their workloads in ways that the solution cannot accommodate.
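To make the "many models to one machine" case concrete, here is a minimal sketch of colocation as a packing problem. It uses a plain first-fit-decreasing heuristic on GPU memory only; the function name, the model names, and the memory figures are all hypothetical, and this is an illustration of the general idea rather than a description of how any particular product places models. A real scheduler would also have to account for compute contention, throughput, and latency targets.

```python
def colocate_first_fit(model_mem_gb: dict[str, int], gpu_mem_gb: int) -> list[list[str]]:
    """Pack models onto as few GPUs as a first-fit-decreasing pass finds.

    Only GPU memory is considered (a simplifying assumption).
    Returns one list of model names per GPU.
    """
    gpus: list[list] = []  # each entry is [free_memory_gb, [model names]]
    # Place the largest models first, which tends to waste less space.
    for name, mem in sorted(model_mem_gb.items(), key=lambda kv: kv[1], reverse=True):
        if mem > gpu_mem_gb:
            raise ValueError(f"{name} needs {mem} GB, more than one GPU has")
        for gpu in gpus:
            if gpu[0] >= mem:  # first GPU with enough free memory
                gpu[0] -= mem
                gpu[1].append(name)
                break
        else:
            # No existing GPU had room, so start a new one.
            gpus.append([gpu_mem_gb - mem, [name]])
    return [names for _, names in gpus]


# Hypothetical fleet: five small fine-tuned models, 10 GB GPUs.
models = {"a": 6, "b": 5, "c": 4, "d": 3, "e": 2}
placement = colocate_first_fit(models, gpu_mem_gb=10)
print(placement)
```

With one model per machine this toy fleet would need five GPUs; packed by memory alone it fits on two.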

All of your concerns are valid.

Even if something exists that can address all of the pains, there are still concerns with taking a third party solution. And while the identity of the concerns might change from workload to workload and company to company, we understand the concerns that you have are concerns for a reason.

You've spent months or years creating a proprietary model, of course you want to control your IP. You're on a small team with a million things to do, of course you don't want high integration overhead on your end (and definitely don't want continuous maintenance work). You have a product that you want to keep improving, of course you don't want to be constrained on that development by some third party solution that cannot accommodate or keep up with your roadmap.

The tl;dr: Don't be fooled by the small annoyances with your ML deployments today, there's more coming...

We know we aren't the first to think about revolutionizing model deployment, and we probably won't be the last, so why hasn't the solution been developed yet? The more we talked to practitioners, the more we realized that for someone to actually use a third party solution, all of the pains and all of the concerns must be addressed.

We questioned why people needed, as opposed to just wanted, a solution that addresses everything. It turns out that individuals and companies have investigated (and even tried) other solutions that do solve a subset of the pains as promised, but then at least one of two things happened:

The other pains still existed! While products might have been easy to set up and use, no effort was put into integrating with other products (mostly because it is really hard to integrate with products that have conflicting underlying architectures, assumptions, and design principles). This difficulty in mixing and matching products often caused teams to revert to in-house solutions where they had "full control."
The unaddressed concerns ended up becoming more painful than the original pains, or were even nonstarters from the beginning! We heard everything from the very nature of a managed service being a nonstarter with the security team to constraints imposed by these solutions (e.g., on model size and custom ops) eventually butting heads with the real cutting-edge model development that the company actually cared about.

The observant will point out that a third party solution can never solve the self-definitional concern of using a third party service! We understand that this is a valid concern we cannot completely solve, but we strongly believe it is outweighed by the benefits of (a) not needing dedicated internal headcount to eternally maintain your ML workload deployments and orchestration, which distracts from your actual product, (b) unparalleled SLAs and support from world class engineers, and (c) a team dedicated to squeezing out every cent and millisecond of savings possible, constantly pushing the boundary of what's possible even when the status quo seems acceptable. And we have already proven our trustworthiness - we run workloads on thousands of GPUs concurrently and serve more inferences monthly than all of SageMaker.

With an understanding of what ML practitioners actually need and why, we've developed a single product that lets ML companies run workloads on their terms at unprecedented efficiency and cost without putting any constraints on what those workloads look like. We call it ExaDeploy.

Your cluster so your IP and cloud credits stay yours. No limits on model size, model architecture, or types of custom ops. Integration in as little as one engineer day. Optimized autoscaling and model colocation. Maximized utilization. Reproducible results. Spot instance support with no data or computation loss. Minimized network egress costs. Your framework/backend of choice. Simple code packaging and fault tolerance. Strong SLAs and a world class team ready to support.

Bottom line: Save on costs without conceding on latency and accuracy.

And while there is always more to be done, we're excited to share where we've reached. Already, we are seeing some of our ExaDeploy customers going from needing 200 GPU nodes for their inference needs down to only 7 by bumping utilization to over 80% - an overall 80% cost reduction on their cloud bill**, no significant runtime differences, and potential to parallelize even more. We are looking forward to sharing more statistics and case studies in the near future!

We have leading ML companies in multiple industries trusting ExaDeploy to execute their workloads at unparalleled efficiency and cost, and we are excited to bring this technology to you. We want to solve your existing pains and develop brand new technology to push the boundaries of ML deployment even further. If you want to accelerate your ML deployments (or just find this interesting), we want to chat with you - send us a message here!

Also look out for future blog posts, where we will go into more technical details and introduce more features that have made our partners go "wow"!

* It actually turns out that nvidia-smi shows that you are utilizing the GPU during memory transfer, even though the model might be doing nothing. It even counts if the kernels are executing on only one of the GPU's SMs. These factors likely mean that the "utilization" people often measure is a vast overestimation of the true amount of power that can be juiced out of the GPU resource. We will deep dive into utilization statistics in a future blog post!
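As a rough illustration of why a time-based utilization number can overstate real usage, the toy model below compares two ways of scoring the same made-up trace. It does not call nvidia-smi and is not a description of how that tool is implemented; the `Sample` type, the 80-SM GPU, and the trace itself are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    kernel_running: bool  # was at least one kernel executing in this sample?
    active_sms: int       # how many SMs had work (hypothetical measurement)


def time_based_utilization(samples: list[Sample]) -> float:
    """Fraction of samples in which any kernel at all was running."""
    return sum(s.kernel_running for s in samples) / len(samples)


def sm_weighted_utilization(samples: list[Sample], total_sms: int) -> float:
    """Fraction of available SM-samples that were actually doing work."""
    return sum(s.active_sms for s in samples) / (len(samples) * total_sms)


TOTAL_SMS = 80  # hypothetical GPU size

# Made-up trace: a tiny kernel occupies a single SM for 9 of 10 samples.
samples = [Sample(True, 1)] * 9 + [Sample(False, 0)]

print(f"time-based:  {time_based_utilization(samples):.1%}")
print(f"SM-weighted: {sm_weighted_utilization(samples, TOTAL_SMS):.3%}")
```

In this trace the time-based score looks healthy while almost all of the chip sits idle, which is the gap the footnote is pointing at.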

** Cost reduction isn't exactly proportional to the GPU node reduction because of CPU nodes, but clearly GPUs are a big cost center. The actual reduction in GPU instance costs is closer to 97%. We will go into more of the cost considerations in another blog post!
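The arithmetic behind these figures can be checked in a few lines. The node counts (200 and 7) and the 80% overall saving come from the post; the sketch assumes GPU cost scales with node count, and the last step additionally assumes that every non-GPU cost stays unchanged, which is a simplification made here for illustration and not something the post states.

```python
def fractional_reduction(before: float, after: float) -> float:
    """Fraction saved when a quantity drops from `before` to `after`."""
    return 1.0 - after / before


# Figures from the post: 200 GPU nodes before, 7 after.
gpu_reduction = fractional_reduction(200, 7)

# Assumption (not from the post): non-GPU costs are unchanged, so
# overall_reduction = gpu_share_of_bill * gpu_reduction.
overall_reduction = 0.80
implied_gpu_share = overall_reduction / gpu_reduction

print(f"GPU node reduction: {gpu_reduction:.1%}")
print(f"Implied GPU share of the original bill: {implied_gpu_share:.1%}")
```

Under that assumption, GPUs would have made up a bit over four fifths of the original bill, which is consistent with calling them "a big cost center."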