Serverless GPUs: Unlocking Software Margins for ML Companies

We hope you’re sitting down. Starting today, Banana is offering Serverless GPUs for Inference Hosting. There is some pretty cool stuff to unpack with this launch, but if you are pressed for time and want to get back to building, here is the gist.

TL;DR - Banana’s Serverless GPU Inference Hosting:

  • Supports models of any size and loads large models up to 99% faster. Example: Warmup time for GPT-J is ~10sec instead of 25 minutes
  • 90% cost savings for cloud compute on average.
  • Autoscaling is included.
  • Easy to use with just 2 lines of code through our SDK.
  • Eliminates many weeks of manual infrastructure work.
  • Available for all Banana customers starting today.

The Dark Ages: Life before Serverless GPUs

Many of us instinctively know the perils of building ML products without serverless GPUs for inference, because that has been the norm when deploying models to production for quite some time. But for fun, let’s recap.

Expensive as SH*T - Yup. Paying for “always-on” GPUs really stings the frontal lobe of your brain because it defies all logic. And then every time you muster the courage to peek at your cloud bill, you can’t help but look to the sky and start yelling:

“Can’t I only pay for the GPU resources I actually use?!” “Why does it have to take 25+ minutes to warm up my model?!”

Hard to Scale - The Goldilocks dilemma. Either you keep extra GPUs running and wait for traffic spikes to happen, or you YOLO it and run the bare minimum of GPUs at the risk of a poor user experience during traffic spikes.

Run extra GPUs = refer to the “Expensive as SH*T” point above.

YOLO your GPUs = lose customers, negative product experience, and it can still be pricey.

Annoying to Build Infra - Congrats! You’ve been putting in the meaningful work of building and training models to get them ready for deployment, only to find out when they are ready that you have weeks more of grunt work ahead to build production infrastructure? No thanks. Isn’t there a better way?

I thought Serverless GPUs already existed?

Meh, sort of. Serverless GPU hosting for inference does exist, but it has been really difficult to self-implement. In a serverless environment, most ML models have such a long cold start that serverless isn’t viable, and trying to make it work is a nightmare.

If you decided to avoid the self-implementation headache and work with a hosting provider to go serverless, you ran into other headaches. Very few companies can offer truly serverless GPU hosting. Most of the time you’ll run into limitations such as no autoscaling or no support for large models like GPT-J.

Banana solves this.

The 4 Keys to Serverless GPU Inference Hosting

To make serverless GPU inference hosting valuable to the ML community, we knew it had to meet four key criteria:

  1. Substantially Reduce Hosting Costs
  2. Support Models of any Size
  3. Autoscaling
  4. Easy to Use

Not to brag, but we feel like we’ve achieved this.

Substantially Reduce Hosting Costs - Customers are experiencing 90% or more cost savings on cloud compute with Banana’s serverless GPU product. How?

You only pay for the GPU resources you use (utilization time) rather than paying always-on GPU costs. Plus, we can decrease the warmup time of large models by 99% on average.
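To see where savings of that size can come from, here is a rough back-of-the-envelope sketch in Python. The hourly rate, the number of busy hours, and the function names are all made-up assumptions for illustration, not Banana’s actual pricing. It also assumes the per-hour rate is the same in both setups.

```python
# Back-of-the-envelope comparison: always-on GPU vs. paying only for utilization.
# Every number here is hypothetical and is NOT Banana's pricing.

HOURS_PER_MONTH = 720  # 30 days * 24 hours


def always_on_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping one GPU running for the whole period."""
    return hourly_rate * hours


def pay_per_use_cost(hourly_rate: float, busy_hours: float) -> float:
    """Cost when you are billed only for the hours the GPU is doing work."""
    return hourly_rate * busy_hours


def savings_fraction(baseline: float, actual: float) -> float:
    """Fraction of the baseline cost that is saved."""
    return 1.0 - actual / baseline


if __name__ == "__main__":
    rate = 2.00      # hypothetical $/GPU-hour
    busy_hours = 72  # hypothetical: GPU serves requests 10% of the month
    baseline = always_on_cost(rate)
    actual = pay_per_use_cost(rate, busy_hours)
    print(f"always-on: ${baseline:.2f}, pay-per-use: ${actual:.2f}, "
          f"savings: {savings_fraction(baseline, actual):.0%}")
```

With these made-up numbers, a GPU that is only busy 10% of the month costs 90% less when you pay for utilization time alone. Your own savings depend on how spiky your traffic is.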

Support Models of any Size - Enjoy serverless GPUs for all of your models, regardless of size. We decreased the warmup time for GPT-J from 25 minutes to ~10 seconds.
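For the curious, here is the arithmetic behind that GPT-J figure as a tiny Python sketch. The function name is ours, and ~10 seconds is an approximate number, so treat the result as a ballpark.

```python
# Relative reduction in warmup time, using the GPT-J figures quoted above.


def reduction(before_seconds: float, after_seconds: float) -> float:
    """Fraction by which the warmup time shrank."""
    return (before_seconds - after_seconds) / before_seconds


before = 25 * 60  # 25 minutes, in seconds
after = 10        # ~10 seconds (approximate)

print(f"warmup reduced by {reduction(before, after):.1%}")
```

Going from 1,500 seconds to about 10 seconds works out to a reduction of a bit over 99%.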

Autoscaling - Bring on the traffic spikes. Our serverless GPUs offer real-time autoscaling so you don’t have to worry about a poor user experience when your product goes viral. This is truly a solution that scales up with you as you grow, and scales back down during periods of lower usage.
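If you have never thought about what an autoscaler decides, here is a minimal, generic sketch in Python. This is not Banana’s autoscaler. The function name, the per-replica concurrency, and the replica cap are hypothetical, and a real system would also account for cold starts and cooldown periods.

```python
import math


def desired_replicas(in_flight_requests: int,
                     per_replica_concurrency: int,
                     max_replicas: int) -> int:
    """How many GPU replicas to run for the current load (illustrative only).

    Returns 0 when there is no traffic (scale to zero). Otherwise returns
    enough replicas to cover the load, capped at max_replicas.
    """
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    if in_flight_requests <= 0:
        return 0
    needed = math.ceil(in_flight_requests / per_replica_concurrency)
    return min(needed, max_replicas)


if __name__ == "__main__":
    # Hypothetical traffic pattern: idle, a trickle, then a viral spike.
    for load in (0, 3, 9, 500):
        print(load, "requests ->", desired_replicas(load, 4, 10), "replicas")
```

The point of the sketch is the shape of the decision: zero replicas when idle, so you pay nothing, and more replicas as requests pile up, so users don’t wait.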

Easy to Use - Say goodbye to weeks of building hosting infrastructure. With just two lines of code you get scalable inference hosting for your models. It’s that easy! And if you ever have any questions or need help with implementation, just ping our MLOps team in your private Slack channel. Our response-time SLA is under 24 hours.

We’re pumped to launch this to customers!

If you have any questions about the technical challenges we had to overcome for serverless GPUs, head over to our Discord and we’d love to chat. Otherwise, hit us up on Twitter and let us know what you think about the product.