The time has come to deploy your ML models to production. Before you do, you need to figure out how you're going to do it. This process generally starts with asking yourself questions and/or running proofs of concept to understand the constraints your deployment has to fit within. Once you understand your constraints, you can explore which ML model deployment solution is best for your team.
The goal of this article is to prompt you with the core guiding questions you'll need to answer in order to refine your model deployment constraints.
What is acceptable latency for your model?
Latency is the amount of time it takes for an inference to run. Latency constraints can vary greatly depending on what type of application you're building.
For example, if it's an asynchronous use case that is simply chugging through geographic data, you don't need real-time latency. If you're building a chatbot, 1-2 seconds of latency might be sufficient. If you're working on a self-driving car, you need latency in the milliseconds. The first question to ask yourself is: what's the absolute maximum latency we can tolerate with this product?
The primary tradeoff with latency is cost, which we look at further down in this article. If you want really low latency, cost increases substantially. The goal is to find a balance that is acceptable for your product between latency and cost.
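One way to ground the latency question is to time your model directly and compare the result against your budget. The sketch below is a minimal illustration, not a full benchmarking setup. `predict` stands in for whatever callable runs a single inference, and the names `measure_latency_ms` and `within_budget` are hypothetical helpers introduced here. Comparing against the 95th percentile rather than the average is an assumption about what "maximum tolerable latency" should mean for your product.

```python
import statistics
import time


def measure_latency_ms(predict, sample, runs=100, warmup=5):
    """Time repeated single-input inferences and return p50/p95/max in ms.

    `predict` is any callable that runs one inference on `sample`
    (a placeholder for your own model call).
    """
    for _ in range(warmup):
        predict(sample)  # warm-up calls are not timed

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the list.
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "p50": statistics.median(timings),
        "p95": timings[p95_index],
        "max": timings[-1],
    }


def within_budget(stats, budget_ms):
    """True if the measured p95 latency fits the maximum tolerable latency."""
    return stats["p95"] <= budget_ms
```

For a chatbot with a 2 second ceiling, you would call `within_budget(measure_latency_ms(my_model, example_input), 2000)`, where `my_model` and `example_input` are your own model call and a representative input.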
What is the size of your model?
This is a pretty straightforward question, and the answer will mostly dictate whether you deploy your model "on the edge" (on-device) or in the cloud.
If you have a smaller model, for example under 1GB, you could likely deploy it on the edge and fit it onto a phone, NVIDIA Jetson, Raspberry Pi, or similar device.
For models that are considered large, 8-16GB+ or roughly GPT-J sized, you would need to deploy in the cloud, as deploying on the edge is likely not viable in this case.
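The rule of thumb above can be sketched as a small helper. The 1GB and 8GB thresholds are the rough figures from this article, not hard limits. The middle band, where neither guideline clearly applies, is an assumption added here, and the function names are hypothetical.

```python
import os

GB = 1024 ** 3


def model_size_bytes(path):
    """Size on disk of a model file, or the total of all files in a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def suggest_target(size_bytes, edge_limit_gb=1.0, cloud_floor_gb=8.0):
    """Map model size to a starting point for the edge-vs-cloud decision.

    Thresholds are rough guidelines and should be tuned to your hardware.
    """
    size_gb = size_bytes / GB
    if size_gb < edge_limit_gb:
        return "edge"
    if size_gb >= cloud_floor_gb:
        return "cloud"
    # Assumption: sizes between the two thresholds need a case-by-case look.
    return "benchmark both"
```

Size on disk is only a proxy. Memory use at inference time can be higher than the file size, so treat the result as a first filter rather than a final answer.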
How much time, money, and engineering resources can you spend on this?
This is the most critical question of all because it will affect your answers to the questions above.
You need to scope out how much time your company is willing to spend on building production infrastructure. Can you afford to spend 6+ months building out a production server platform? If not, what timeline is acceptable for your team? If you have a shorter timeline for deployment, you will likely need to lean towards using an inference framework (like Banana) to get deployed in a time-efficient manner. If you can afford multiple months to build the production infrastructure yourself, you can look at building your own inference framework from scratch.
Cost plays a huge role in how you deploy your ML models to production. If your product requires very low latency, you should expect production infrastructure that meets that constraint to cost much more than it would if your product could accept some latency. If costs are too high, positive unit economics become a challenge, and for some use cases nearly impossible. You may need to ask yourself whether you can accept higher latency in order to hit a cost target.
Similarly, you can gain cost efficiencies based on the size of your model. For example, you may want to look critically to see if you can reduce model size to fit onto an edge device if that gains you economic efficiencies.
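To make the unit economics concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simple model: one dedicated instance billed by the hour, serving one request at a time, busy for some fraction of each hour. Real deployments with batching, autoscaling, or per-request pricing will behave differently, and the function names and example figures are hypothetical.

```python
def cost_per_inference(hourly_instance_cost, latency_seconds, utilization=1.0):
    """Approximate cost of one inference on a dedicated, hourly-billed instance.

    Assumes requests are served one at a time and the instance is busy
    for `utilization` (0 < utilization <= 1) of each hour.
    """
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    inferences_per_hour = 3600.0 / latency_seconds * utilization
    return hourly_instance_cost / inferences_per_hour


def unit_margin(revenue_per_inference, hourly_instance_cost,
                latency_seconds, utilization=1.0):
    """Revenue minus infrastructure cost for a single inference."""
    return revenue_per_inference - cost_per_inference(
        hourly_instance_cost, latency_seconds, utilization
    )
```

Under these assumptions, an instance costing 3.60 per hour that takes 1 second per inference at full utilization works out to 0.001 per inference. At 50% utilization the same instance costs 0.002 per inference, which is why idle capacity matters as much as raw speed.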
There are two components to engineering resources that you need to consider.
Do you have sufficient engineering bandwidth available to take on this project? Are your engineers tied up in other work for your product? Or is it higher value for them to focus on the parts of your product unrelated to production infrastructure?
What are the skills of your engineering team right now? If they are a pure ML team, their expertise may not be in production infrastructure. In that case it would make sense to lean on a framework for deployment, such as TensorFlow Serving that you set up yourself, or something like Banana that takes care of all the deployment infrastructure for you so that you simply connect your models to it.
By contrast, if your engineering team has full-stack and systems-level engineers capable of implementing models from scratch in production environments, you could have them investigate building infrastructure in-house, provided your timeline and cost constraints allow it.
Once you have answered these questions, your team should have a clear understanding of your deployment constraints. This will make your life much easier when you take on the task of deploying models to production, and hopefully you'll avoid the common mistakes we see teams make when deploying their models. Your next step is to take these constraints and investigate which ML model deployment solution is best for your use case. Good luck!