How to Deploy FLAN-T5 to Production on Serverless GPUs

In this tutorial, we're going to demonstrate how you can deploy FLAN-T5 to production. The content is beginner-friendly: Banana's deployment framework gives you the "rails" to easily run ML models like FLAN-T5 on serverless GPUs in production.

For reference, we're using the FLAN-T5 base model (google/flan-t5-base) from HuggingFace in this demonstration. Time to dive in!

What is FLAN-T5?

FLAN-T5 is an open-source text generation model developed by Google AI. One of the features that has been helping FLAN-T5 gain popularity in the ML community is its ability to reason about and explain the answers it provides. Instead of just spitting out an answer to a question, it can describe how it arrived at that answer. Five versions of FLAN-T5 have been released (small, base, large, XL, and XXL); in this tutorial, we focus on the base model.

Some interesting use cases of FLAN-T5 include translation across more than 60 languages, answering historical and general-knowledge questions, and text summarization.
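If you'd like a feel for the model before deploying it, here's a quick local test using HuggingFace's transformers library, mirroring the example on the model card (CPU is fine for the base model):

```python
# Quick local sanity check of google/flan-t5-base with transformers.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# FLAN-T5 is instruction-tuned, so you prompt it in plain English.
prompt = "Translate English to German: How old are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```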

How to Deploy FLAN-T5 on Serverless GPUs

1. Fork Banana's Serverless Framework Repo

The first step is to take Banana's Serverless Framework and fork it as your own private repository. Consider this to be your base repository that you will use to deploy FLAN-T5 to Banana.

2. Customize Repository to run FLAN-T5

By default, the serverless framework is set up to deploy BERT as the model of choice. Since we're here to deploy FLAN-T5, not BERT, we need to modify the repository in a few spots to swap BERT out for FLAN-T5.

There are a few places within the repository that you'll need to adapt to run FLAN-T5. To summarize (a sketch of these changes follows the list):

  • Make sure the download.py file downloads FLAN-T5
  • Load FLAN-T5 within the init() block
  • Update the inference() block to run FLAN-T5
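Here's a minimal sketch of what those changes might look like, assuming your fork keeps the template's standard layout: a download.py that runs at build time, and an app.py exposing init() and inference(). The "prompt" input field below is our own choice for illustration; adjust the details to match your fork.

```python
# --- download.py ---
# Runs at build time so the FLAN-T5 weights are baked into the image
# (transformers caches them when the model is first instantiated).
from transformers import T5Tokenizer, T5ForConditionalGeneration

def download_model():
    T5Tokenizer.from_pretrained("google/flan-t5-base")
    T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

if __name__ == "__main__":
    download_model()


# --- app.py ---
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

def init():
    # Load FLAN-T5 once, at container startup, onto the GPU if available.
    global model, tokenizer
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained(
        "google/flan-t5-base"
    ).to(device)

def inference(model_inputs: dict) -> dict:
    # Run one generation per request; "prompt" is our illustrative field name.
    global model, tokenizer
    prompt = model_inputs.get("prompt")
    if prompt is None:
        return {"message": "No prompt provided"}

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)
    return {"output": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```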

It's highly recommended that you review the documentation of our Serverless Framework. We break down and explain the framework components so you can better understand how it all operates.

3. Create Banana Account and Deploy FLAN-T5

Once you have adapted the repo to run FLAN-T5, make sure you test your code before deploying to production. We suggest using Brev (follow this tutorial) to test.

Next, login to your Banana Dashboard and click the "New Model" button.

A popup will appear:

[Screenshot: Banana model deployment options]

Select "GitHub Repo" and choose your FLAN-T5 repository. Click "Deploy" and the model will start to build. The build process can take up to an hour, so please be patient.

You'll see the Model Status change from "Building" to "Deployed" when it's ready to be called.

[Screenshot: model status showing "Building"]

[Screenshot: model status showing "Deployed"]

You can also monitor the status of your build in the Model Logs tab.

[Screenshot: Banana model build logs]

4. Call your FLAN-T5 Model

After your model has built, it's ready to run in production! Choose the programming language you plan to use (Python, Node, or Go) and then jump over to the Banana SDK. Within the SDK docs you'll find example code snippets showing how to call your FLAN-T5 model.
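For example, a call from Python might look like the sketch below, using the banana_dev SDK's run helper. The exact signature can differ between SDK versions, and the input fields depend on how you wrote inference() above, so treat this as illustrative and check the SDK docs.

```python
import banana_dev as banana

api_key = "YOUR_API_KEY"      # from your Banana dashboard
model_key = "YOUR_MODEL_KEY"  # from your deployed model's page

# "prompt" matches the illustrative field used in inference() above.
model_inputs = {"prompt": "Explain why the sky is blue."}

# run() sends the inputs to your deployed inference() and returns its output.
out = banana.run(api_key, model_key, model_inputs)
print(out)
```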

That's it! Congratulations on running FLAN-T5 on serverless GPUs. You are officially deployed in production!


Wrap Up

Reach out to us if you have any questions or want to talk about FLAN-T5. You can find us on our Discord or tweet at us on Twitter. What other machine learning models would you like to see a deployment tutorial for? Let us know!