4 Ways to Deploy Machine Learning Models to Production

August 17, 2022

Deciding which method your team will use for deploying ML models to production is a lot like that meme where there's a fork in the road: depending on which route you choose, you could end up in the part of town with the nice castles and uplifting energy, or in the doom and gloom neighbourhood.

It's important to understand the tradeoffs you can make and the traps you can fall into on this journey. With all of that in mind, we have narrowed the choices for deploying ML models to production down to 4 viable paths.

Let's begin.

Option 1: Build Your Own Inference Engine from Scratch

The first option is to write your own server at the lowest level possible. Like, we're talking about writing a C++, Go, or Rust server where you implement the execution graph from scratch and plug it into the GPU. This is extremely expensive and not practical for most teams.

The only companies that generally do this are those with flagship models, like OpenAI or Cohere, where they are implementing everything from the ground up around one very, very specific architecture that they've trained elsewhere and have now ported to a faster-executing language. They generally write that as an HTTP or gRPC server, and it is very engineering intensive and doesn't translate to other model architectures.

If you have a model that's already trained, engineers who know how to re-implement graph-style architectures in other languages, and you don't expect your model to change a lot, this can be a viable route. We'll call this the ultra-high-performance server route.

Option 2: Build Your Own Python-based Server

The second option is to take a more generalized approach. Using a framework like PyTorch and a basic Python HTTP server like Flask, you can do a lot of the same stuff in a much more generalizable way.

Basically, load the model ahead of time as a global variable, start the HTTP server, and when calls come in you run them against the already warm, preloaded model. It's a lot more modular, and you can swap the model in and out, so it gives you much more flexibility. But because you're using the Python layer on top of PyTorch as your execution environment, you're limited to the speeds those can reach. Granted, these speeds are quite good, just maybe not as fast as option 1 mentioned above.
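To make that concrete, here is a minimal sketch of the warm-global-model pattern with Flask and PyTorch. The Hugging Face pipeline standing in for "your model" and the /infer route are illustrative assumptions, not part of any particular stack.

```python
# A minimal sketch of the "warm global model" pattern with Flask and PyTorch.
import torch
from flask import Flask, jsonify, request
from transformers import pipeline  # illustrative stand-in for your model

app = Flask(__name__)

# Load once at import time, before the server starts handling requests,
# so every call hits an already-initialized model.
device = 0 if torch.cuda.is_available() else -1
model = pipeline("sentiment-analysis", device=device)

@app.route("/infer", methods=["POST"])
def infer():
    text = request.get_json()["text"]
    with torch.no_grad():
        result = model(text)
    return jsonify(result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

The key design choice is that the load happens at module scope rather than inside the request handler, which is exactly the caching concern described further down.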

The challenge is, you're basically hand-rolling a Python server, so you need to figure out the infrastructure side of things. This means getting the server set up on Kubernetes, having it replicated out, and making sure that you have enough replicas running (both nodes and pods).

Take this route if you want to manage your own infrastructure and you have a wide enough range of use cases that implementing your model inference from scratch doesn't make sense.

How bad could it be? The lengthy to-do list:

Let's get into it! At the application level, you need to know Python and whatever ML framework you are using, be it PyTorch or TensorFlow. We'd certainly suggest PyTorch because it executes faster in our experience. Regardless, you select your frameworks and you need to be very good in those frameworks for two reasons:

  1. You need to know how to train the model in the first place.
  2. In a full-stack engineering sense, you need to know how to load these resources in a way that's different from what most data scientists are used to in, for example, a Google Colab or Jupyter Notebook environment.

With an HTTP server, you need to understand caching (loading the model in advance, before the server starts accepting requests, so that you don't have to do redundant loads for every call). That's a non-trivial aspect of hand-rolling that generally sits within the full-stack knowledge domain rather than the machine learning domain, where folks are used to simply pressing run on a Google Colab.

Make sure you generally understand how these processes start up, how the HTTP handler starts up, and how those are accessing the global state. There are plenty of "gotchas" along this journey. Be careful not to share global-state memory between different processes, because the GPU generally throws an error in that case. Be mindful of processes sharing information and of multi-threading errors or race conditions.
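One concrete gotcha worth naming: PyTorch refuses to re-initialize CUDA in a process created with the default fork start method, so each worker process should create its own CUDA context and load its own copy of the model. A minimal sketch, assuming PyTorch and a CUDA-capable GPU:

```python
# Each worker process gets its own CUDA context and its own model copy.
import torch
import torch.multiprocessing as mp

def worker(rank):
    # CUDA state created in the parent cannot safely be reused after a
    # plain fork, so the model is loaded here, inside the child process.
    model = torch.nn.Linear(16, 4).cuda()
    x = torch.randn(1, 16, device="cuda")
    with torch.no_grad():
        print(rank, model(x).shape)

if __name__ == "__main__":
    # "spawn" (or "forkserver") avoids the "Cannot re-initialize CUDA in a
    # forked subprocess" error you can hit with the default "fork" method.
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```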

That's just some of the application level work required. Going beyond, we move to the environment.

Within the environment you have the Docker container, so you need to know Docker. If you want to deploy it, you will likely want replication, which requires an orchestration platform (Kubernetes) and knowing how to operate a load balancer.

From there, you have the routing. Most companies prefer an event-based architecture, something like Kafka or a pub/sub queue (which is what we use). Tasks enter a queue and are picked up by workers. It's not a straight HTTP call, for two reasons. First is fault tolerance: if a process fails, you want the task to be picked up and worked on again. Second is execution time: if you have a long-running process, you can't afford to have it live inside a single HTTP call. You can start to see how there's plenty of middleware that needs to be built as well.
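As a rough illustration of that queue-and-worker pattern, here is a sketch using Celery with a Redis broker; both choices, and the task name, are assumptions rather than a description of any specific stack. The retry settings are what buy you the fault tolerance described above.

```python
# tasks.py -- a minimal queue-and-worker sketch using Celery with a Redis
# broker (illustrative choices, not a description of any particular setup).
from celery import Celery

app = Celery(
    "inference",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task(bind=True, autoretry_for=(Exception,), max_retries=3, retry_backoff=True)
def run_inference(self, payload: dict) -> dict:
    # The long-running model call lives here, outside any HTTP
    # request/response cycle; if the worker dies or the call raises,
    # the task is retried up to 3 times with backoff.
    result = {"echo": payload}  # stand-in for the real model call
    return result
```

The HTTP layer (or any producer) then just enqueues work with run_inference.delay(payload) and hands back a task id that the caller can poll.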

Finally, you have the call site. You need to build a client or implement bare REST calls from wherever you're calling the service from.
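On the call-site side, a bare client can be as thin as an HTTP POST plus polling for the async result. Everything below, from the base URL to the /tasks routes, is hypothetical; it just shows the shape of the calls.

```python
# A bare REST client: submit a task, then poll until the worker finishes.
import time
import requests

BASE_URL = "http://inference.internal:8000"  # hypothetical service address

def predict(payload: dict, timeout_s: float = 60.0) -> dict:
    task_id = requests.post(f"{BASE_URL}/tasks", json=payload).json()["task_id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{BASE_URL}/tasks/{task_id}").json()
        if status["state"] == "done":
            return status["result"]
        time.sleep(0.5)
    raise TimeoutError(f"task {task_id} did not finish in {timeout_s}s")
```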

Option 3: Use an Infrastructure Platform (like Banana)

Third option: use a tool like Banana. Banana does the Python inference server work for you; we have a boilerplate that handles the setup and routing. All you have to worry about is the inference and the model-loading paths. This runs on the same Python-based execution environment as option 2, which means it's really fast and perfect for most teams.

Basically, you get to kiss goodbye to the massive to-do list you'd face in option 2 if you built your own Python server. Banana handles 90%+ of that.

Another benefit of our template is that the speed to production is FAST. It generally takes teams 1 hour to 2 days (depending on engineering skills) to deploy to production with Banana. With engineering iteration time that quick, you don't need to think about infrastructure, and we scale in and out as you need.

You also don't need to think about how the server interacts with the model. Our boilerplate already sets that up for you. To use Banana, you just have to be an ML engineer and drop in your ML code; no need to worry about the rest of the software stack going on behind the scenes.
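To give a rough sense of the shape, here is a sketch modeled on Banana's public serverless template; the exact file layout, function names, and signatures may differ from the current boilerplate. The code you drop in boils down to a load step and an inference step.

```python
# app.py -- rough sketch of the two hooks a Banana-style template expects
# (names and signatures are illustrative; check the actual boilerplate).
import torch
from transformers import pipeline  # illustrative model choice

model = None

def init():
    # Called once when the replica starts: load the model into a global
    # so every request hits a warm model.
    global model
    device = 0 if torch.cuda.is_available() else -1
    model = pipeline("sentiment-analysis", device=device)

def inference(model_inputs: dict) -> dict:
    # Called per request with the JSON payload; return a JSON-serializable dict.
    text = model_inputs.get("text", "")
    return {"output": model(text)}
```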

From there, we handle the Dockerization, Kubernetes, load balancing, Client SDK, and the async worker system. Basically, Banana handles "the pipes" that bring the call to the server and the autoscaling logic to make sure there's a server warm and ready to go.

Additional perks: we have custom execution environments that speed up both model inference and model loading, as well as intelligent auto-scaling that lets us go from zero to one to many replicas on a call-by-call basis, meaning you never have to over-provision GPUs. Pretty sweet.

Option 4: Use a Pre-Existing Inference Run-Time Environment (not recommended)

We include this on the list because it is technically an option, but for most teams it is likely a route to avoid. There are pre-built inference runtime environments for machine learning models. An example would be TensorFlow Serving, where you export a TensorFlow model to a standard format and it serves that model behind a gRPC or REST server.
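For concreteness, the TensorFlow Serving workflow looks roughly like this: export a SavedModel into a versioned directory, run the tensorflow/serving container against it, and hit the REST endpoint. The model, paths, and model name below are illustrative.

```python
# Export a model in SavedModel format, then query a running TensorFlow
# Serving instance over REST. Paths and the model name are illustrative.
import requests
import tensorflow as tf

# 1) Export: TF Serving expects a numbered version directory.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
tf.saved_model.save(model, "/models/my_model/1")

# 2) Query: assumes a tensorflow/serving container is running with
#    --model_name=my_model and the REST API exposed on port 8501.
resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json={"instances": [[1.0, 2.0, 3.0, 4.0]]},
)
print(resp.json())  # {"predictions": [[...]]}
```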

This route could be suitable for teams that have 5-10 different versions of the same model and want to perform split testing. The value of these servers is not that they are fast. The value is that you can upload a bunch of different versions of the same model and swap versions in and out easily. It's a viable option if you want to play around with weights and not think at all about the inference server side of things.

In our experience, this option is significantly slower than any of the Python implementations. A raw Flask and PyTorch setup will run 3-4x faster than a TensorFlow Serving module.