Scaling a Machine Learning Product to 1,000,000+ Users

June 01, 2022

This is a guide to building scalable production infrastructure for ML products, and to the decision points you will encounter as your product grows from zero to one million+ users.

We split this article into four stages of user growth and map out the common pathways for production hosting at each stage.

Our goal was to keep this guide as unbiased as possible, but keep in mind Banana's core product is serverless GPU hosting so some of the content is from our experience building production infra ourselves and working with customers. Enjoy!

Congrats! You are ready to launch and deploy your product to customers. Let’s discuss the pathways for hosting your ML product on production.

Stage 1 (0-1000 Users) - Product Launch

Pathway 1: Self-deploy on a single “always-on” GPU

This is the pathway most startups have historically chosen when launching their ML product to customers. One reason is that serverless GPUs didn’t exist as an option for early-stage companies until recently, so launching with a single “always-on” GPU was the cheapest option available.

If your team is set on building production infrastructure in-house, this pathway will achieve that. At small scale it can work, provided you monitor latency and customer experience. Keep in mind it will be more expensive than going serverless at launch, so weigh that if costs are top of mind for you. At scale, this pathway can break down quickly, and you will need to either deploy additional “always-on” GPUs to prevent latency spikes or make the jump to serverless.

Choose this solution when:

Scale isn’t the highest priority at product launch (small number of customers, beta users).
You have engineering resources to dedicate to this (estimated engineer time: 1-3 days).
You are not interested in using external MLOps tools to support production infrastructure.
Cost is not the primary concern for you.

Pathway 2: Deploy with a Serverless GPU solution

If you value flexibility, cost savings, and speed to market, launching with a serverless GPU solution is a great decision. Before we dive deeper, it’s worth explaining the core benefits of serverless GPUs for production hosting.

Cost Savings

Going serverless means you only pay for the GPU resources you use (utilization time) rather than “always-on” GPU costs. In other words, if your product only needs the compute of 1/4 of a GPU you are only paying for 1/4 of a GPU. When your product has no usage at moments, you don’t pay for those times either. We are seeing companies experiencing upwards of 90% cost savings on cloud compute by going serverless.

Contrast this with an “always-on” GPU, where you pay for the entire GPU to run 24/7, regardless of what percentage of its compute power you actually use.
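To make the comparison concrete, here is a minimal sketch of the arithmetic. The hourly rate, the 730 hours per month, and the 25% utilization are hypothetical numbers chosen for illustration, and the sketch assumes both options bill at the same rate per hour of GPU time, which may not hold for a given provider.

```python
def always_on_monthly_cost(hourly_rate: float, hours_per_month: float = 730.0) -> float:
    """Cost of one GPU billed around the clock, whether or not it is busy."""
    return hourly_rate * hours_per_month


def serverless_monthly_cost(hourly_rate: float, utilization: float,
                            hours_per_month: float = 730.0) -> float:
    """Cost when billing only covers the fraction of time the GPU is in use."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return hourly_rate * hours_per_month * utilization


# Hypothetical numbers: a $1.00/hour GPU that is busy 25% of the time.
always_on = always_on_monthly_cost(1.00)
serverless = serverless_monthly_cost(1.00, utilization=0.25)
savings = 1 - serverless / always_on
print(f"always-on: ${always_on:.2f}, serverless: ${serverless:.2f}, savings: {savings:.0%}")
```

Under these assumptions the savings simply equal the idle fraction of the GPU, so the lower your utilization, the larger the gap.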

The other cost savings trick that serverless tools can provide you is decreased warmup time of models, regardless of size. At Banana we are decreasing the warmup time of large models by 99% on average. With GPT-J, we decreased the warmup time from 25 minutes to ~10 seconds.

Autoscaling

How many GPUs should you run at a given moment if you are using “always-on” GPUs? It’s a bit of a guessing game. Either you keep extra GPUs running and wait for traffic spikes to happen, or you YOLO it and run the bare minimum GPUs at the risk of a poor user experience during traffic spikes.

When you use a serverless GPU solution that offers autoscaling, this guessing game disappears. As your usage begins to grow you will be allocated additional GPU compute to handle the increased traffic load in real-time. Latency doesn’t spike and queueing is much less likely to happen, maintaining a quality user experience for customers. The reverse is also true. If usage slows down your allocated GPU compute will decrease.
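As a rough illustration of the idea, an autoscaler can be reduced to a function that maps current load to a GPU count. This is a simplified sketch, not how any particular provider implements it; the function name, the `requests_per_gpu` capacity figure, and the min/max bounds are all assumptions made for the example.

```python
import math


def desired_gpu_count(in_flight_requests: int, requests_per_gpu: int,
                      min_gpus: int = 0, max_gpus: int = 10) -> int:
    """Return how many GPUs to run for the current load, within fixed bounds."""
    if requests_per_gpu <= 0:
        raise ValueError("requests_per_gpu must be positive")
    # Round up so a partially loaded GPU still gets allocated.
    needed = math.ceil(in_flight_requests / requests_per_gpu)
    # Clamp between the floor (can be zero when idle) and the cap.
    return max(min_gpus, min(needed, max_gpus))


# Hypothetical capacity: each GPU handles 4 concurrent requests.
for load in (0, 3, 9, 50):
    print(load, "requests ->", desired_gpu_count(load, requests_per_gpu=4), "GPUs")
```

With no traffic the count drops to zero, which is where the pay-for-what-you-use savings come from, and the cap keeps a traffic spike from allocating unbounded compute.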

Extras

The other notable benefit of going serverless with your production hosting is the speed to market you gain when you deploy with a tooling partner. Instead of hiring and investing engineering resources to build this infrastructure, you get a fast track to cutting-edge production infrastructure with minimal investment.

Choose this solution when:

You need the ability to scale at launch based on customer demand
Flexibility and speed to market are important to you
An “always-on” GPU is too expensive
Working with external MLOps tooling is acceptable for you
You lack engineering resources to dedicate to infrastructure

Pathway 3: Deploy with multiple “always-on” GPUs

For a detailed explanation of this pathway, skip to Stage 2.

Choose this solution when:

You expect to have customer scale at launch
Must have the lowest latency possible, regardless of cost
Cost is not a concern for you
Speed to market is not a priority
Possess excess engineering resources to dedicate to infrastructure

Stage 2 (1000-100k Users) - Customer Demand Rises, Latency Issues Arrive

(skip to Stage 3 if you implemented serverless GPUs in Stage 1)

The product launch was a success, and customers are consistently using your product. You are growing at a healthy rate, and the new issue that has bubbled up for your team is latency. As customer usage increases, so does your latency and it is starting to create a poor user experience. You have reached another decision point. Let’s look at your pathways!

Pathway 1: Implement multiple “always-on” GPUs

This is the most expensive pathway in Stage 2, both in pure compute costs and in engineering time. If you value building infrastructure in-house and the lowest latency is your highest priority, you should consider this pathway. Implementing it yourself requires significant engineer time (20-25 days), and you will likely need engineers with production infrastructure expertise. You will start to see the price gap between this option and serverless GPUs widen as you grow from here onward.

What kind of tooling will your engineering team need to build out? To name a few pieces: multiple GPUs running servers that listen for inference requests, load balancing across those servers, queueing infra to track calls (e.g., Redis), auto-restart logic that runs health checks on servers and restarts them if they crash, and logging of calls for errors. As you add these tasks up and overlay them on a timeline, you will see the scope of this project grow.
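To give a feel for a few of these pieces, here is a toy sketch of round-robin load balancing with health checks, auto-restart, and call logging. It runs entirely in memory: the `GpuServer` and `RoundRobinRouter` names are hypothetical, the `restart` method is a stand-in for relaunching a real process or container, and queueing (Redis in the list above) is left out.

```python
import itertools
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("inference-router")


@dataclass
class GpuServer:
    name: str
    healthy: bool = True

    def restart(self) -> None:
        # Stand-in for real restart logic (relaunching a process or container).
        self.healthy = True


class RoundRobinRouter:
    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)

    def health_check(self) -> None:
        """Restart every server currently marked unhealthy."""
        for server in self.servers:
            if not server.healthy:
                log.warning("restarting %s", server.name)
                server.restart()

    def route(self, request_id: str) -> str:
        """Pick the next healthy server in rotation and log the call."""
        # Try each server at most once before giving up.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server.healthy:
                log.info("request %s -> %s", request_id, server.name)
                return server.name
        raise RuntimeError("no healthy GPU servers available")


router = RoundRobinRouter([GpuServer("gpu-0"), GpuServer("gpu-1")])
router.servers[1].healthy = False   # simulate a crash on gpu-1
first = router.route("a")           # goes to gpu-0
second = router.route("b")          # gpu-1 is skipped, goes to gpu-0 again
router.health_check()               # brings gpu-1 back
third = router.route("c")           # rotation reaches gpu-1
```

Even this toy version leaves out retries, timeouts, persistence, and the queue itself, which is part of why the real project scope grows quickly.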

Choose this solution when:

Must have the lowest latency possible, regardless of cost
Cost is not a concern for you
Speed to market is not a priority
Set on building infrastructure in-house

Pathway 2: Implement a Serverless GPU solution

If you already implemented serverless GPUs in Stage 1, then this problem doesn’t exist for you. If you didn’t, jump back to the section on serverless GPUs in Stage 1 to understand why it matters for you right now. Going serverless with a tooling partner generally takes less than 7 days of implementation time and very little engineer investment.

Choose this solution when:

Unlimited customer scale is a requirement
Running “always-on” GPUs is too expensive
Speed to market is important to you
Latency needs to be really low for customers, but with wiggle room
Looking to save engineer resources for other parts of the product

Pathway 3: Build a Serverless GPU solution yourself

For a detailed explanation of this pathway, skip to Stage 3.

Choose this solution when:

Unlimited customer scale is a requirement
Speed to market is not a priority
Latency needs to be really low for customers, but with wiggle room
Set on building infrastructure in-house and have engineering resources to allocate to this
Looking to gain some cost-efficiencies, but not looking for the lowest cost option

It’s worth noting that there is an edge-case scenario where one “always-on” GPU could be the most effective option for you from a performance and cost point of view. If your calls never overlap and they keep the GPU busy for the full hour, you may be better off with one “always-on” GPU instead of serverless, because the GPU has no idle time during that hour.
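One way to reason about this edge case is a break-even utilization. The sketch below assumes, purely for illustration, that serverless time is billed at a higher rate per hour of actual use than an “always-on” GPU; both rates are hypothetical and you should substitute your provider's real pricing.

```python
def break_even_utilization(always_on_hourly: float, serverless_hourly: float) -> float:
    """Fraction of the hour a GPU must be busy before always-on becomes cheaper.

    Always-on cost per hour is fixed; serverless cost per hour is
    serverless_hourly * utilization. The two are equal when
    utilization == always_on_hourly / serverless_hourly.
    """
    if always_on_hourly <= 0 or serverless_hourly <= 0:
        raise ValueError("rates must be positive")
    # Utilization cannot exceed 100%, so cap the result at 1.0.
    return min(1.0, always_on_hourly / serverless_hourly)


# Hypothetical rates: $1.00/hour always-on vs $1.25 per hour of serverless use.
threshold = break_even_utilization(1.00, 1.25)
print(f"always-on wins above {threshold:.0%} utilization")
```

If the two rates are equal, the threshold is 100%, meaning the options only tie when the GPU is busy for the entire hour.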

For these edge cases, Banana offers an “always-on” GPU that is price-matched with AWS. The added benefit is that when customer demand starts to go beyond the compute of one “always-on” GPU, we can seamlessly transition your hosting to serverless with minimal engineering burden.

Stage 3 (100k-1M Users) - Unit Economics are SH*T, Desperate for Autoscaling

(skip to Stage 4 if you implemented serverless GPUs with autoscaling in Stage 1 or 2)

The good news: your company is crushing it from a customer growth POV. The bad news: you can’t afford to scale any further given the compute costs that are piling up. Your unit economics are trash. Doing nothing to improve your cost structure at this point means disaster. Welcome to Stage 3. Let’s look at your pathways.

Pathway 1: Build a Serverless GPU solution with autoscaling yourself

For the teams that value building their infrastructure in-house over all other trade-offs, this can work. What kind of tooling will your engineering team need to build out and consider here?

You'll need an autoscaling strategy and a way to execute rollup and teardown, a plan for handling cold and warm starts in a timely way, a monitoring strategy around cap limits and scale spikes, and health checks on all of the above so you don't blow up your cloud account. Once GPU servers are actually running, you'll need a way for servers to listen for inference requests, load balancing of jobs across servers, queueing infra to track calls (e.g., Redis), auto-restart and health check logic in case of crashes, and logging of calls and errors.
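Two of these concerns, the cap limit and teardown of idle capacity, can be sketched in a few lines. This is a simplified in-memory model under stated assumptions: the `Autoscaler` and `Replica` names, the cap of two replicas, and the 60-second idle timeout are all hypothetical, and timestamps are passed in by the caller rather than read from a clock so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    last_used: float  # timestamp (seconds) of the last request served


@dataclass
class Autoscaler:
    max_replicas: int    # cap limit that protects the cloud bill
    idle_timeout: float  # seconds a replica may sit idle before teardown
    replicas: list = field(default_factory=list)

    def scale_up(self, now: float) -> bool:
        """Start one replica (a cold start) unless the cap is reached."""
        if len(self.replicas) >= self.max_replicas:
            return False
        self.replicas.append(Replica(last_used=now))
        return True

    def tear_down_idle(self, now: float) -> int:
        """Remove replicas idle longer than idle_timeout; return how many."""
        keep = [r for r in self.replicas if now - r.last_used <= self.idle_timeout]
        removed = len(self.replicas) - len(keep)
        self.replicas = keep
        return removed


scaler = Autoscaler(max_replicas=2, idle_timeout=60)
started = [scaler.scale_up(now=t) for t in (0, 10, 20)]  # third request hits the cap
removed = scaler.tear_down_idle(now=65)                  # only the replica from t=0 is idle too long
print(started, removed, len(scaler.replicas))
```

A production version would also need the monitoring, health checks, and warm-start handling described above; the point here is only that a hard cap and an idle timeout are what keep spend bounded in both directions.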

Choose this solution when:

Looking for gains on unit economics to scale with good margins
Set on building infrastructure in-house and willing to accept trade-offs
Speed to market is not a priority

Pathway 2: Implement a Serverless GPU solution with autoscaling

Jump back to Stage 1 and read the autoscaling section if you missed it. It’s essential that your serverless tooling solution can offer autoscaling at this stage of user growth.

Choose this solution when:

You need better unit economics so your company can scale with great margins
Speed to market is important to you
You lack the engineering resources or interest to deal with production infrastructure

Stage 4 (1M+ Users) - Life is likely Serverless

Welcome to the promised land. By now, most teams will have made the transition or are about to make the transition to serverless GPUs to improve their unit economics and maintain robust infrastructure. At this point if you are serverless your primary constraint is making sure you have enough available machines to keep up with product usage.

It’s worth noting that it’s possible for a company to get to Stage 4 while running “always-on” GPUs and ignore the challenges of unit economics they encounter at Stage 3. You would need a deep coffer of cash and to not be focused on unit economics whatsoever to go this route.

The likely outcome is that you are running serverless GPUs in Stage 4. The challenge with scaling an ML company historically has been that compute costs really hurt margins. With serverless lowering those costs, we have seen ML companies become able to scale with software margins.