Banana V2.1 infra: moving to A5000s

October 25, 2023

This upcoming Wednesday, Nov 1, we will be moving all workloads from A100 to A5000 GPUs as Banana infra version 2.1.

No changes* are needed from your end.

  1. This is not a breaking change on the API, so there is no need to update any of your client-side or server-side code.
  2. We’ve tested all models that were called in the last 30 days and verified that they fit on the new machines (after running init()), so unless you have very large memory bloat from inference-time tensors, you can trust your models will run as they did before.

*note: if you version pin your potassium in requirements.txt, you will need to upgrade it to >=0.3.1 for builds to pass.
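
If you are unsure whether an existing pin satisfies the new minimum, a quick check along these lines may help. This is only a minimal sketch: the helper name `pinned_potassium_ok` is made up for illustration, and it only understands plain `potassium==X.Y.Z` pins (no pre-release tags, extras, or other specifier forms).

```python
import re

MIN_POTASSIUM = (0, 3, 1)  # minimum version stated in the note above


def pinned_potassium_ok(requirements_text: str) -> bool:
    """Return False only if potassium is pinned with '==' below 0.3.1.

    Hypothetical helper: handles plain 'potassium==X.Y.Z' lines only.
    Unpinned entries and other specifier forms are not inspected.
    """
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.fullmatch(r"potassium\s*==\s*(\d+(?:\.\d+)*)", line, re.IGNORECASE)
        if match:
            version = tuple(int(part) for part in match.group(1).split("."))
            return version >= MIN_POTASSIUM
    return True
```

Calling it on the contents of your requirements.txt returns `False` for a pin such as `potassium==0.2.0`, which is the case that needs upgrading before Nov 1.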

To move to the new infra:

  1. Let us do it for you Nov 1st (10m downtime expected). We’ll trigger a build and mark the model as invalid on the A100 infra, so you may experience around 10m of downtime while the build runs.
  2. Manually update (10m downtime expected). In app.banana.dev you’ll now see an “Update Required” badge on your projects. If you click into a project and open its settings, you’ll see a button to upgrade the model to the A5000 cluster. This allows you to be present in case the project has unexpected issues on the A5000s.
  3. Manually rolling deploy (no downtime expected). As of yesterday, all new projects created are deployed to the A5000 cluster, so you can manually deploy duplicates of your production projects and have a side-by-side test between the two, with the ability to repoint your client code to the new project url when you’re ready to transition.
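
For the rolling deploy option, one way to transition gradually is to send a small fraction of calls to the duplicate project before repointing everything. The sketch below is an assumption about how you might do that in your own client code, not part of the Banana client: the URLs are placeholders and `choose_project_url` is a hypothetical helper.

```python
import random

# Placeholder URLs: substitute the URLs shown for your own projects.
OLD_PROJECT_URL = "https://old-project.example.invalid"
NEW_PROJECT_URL = "https://new-project.example.invalid"


def choose_project_url(new_fraction: float, rng: random.Random) -> str:
    """Pick the new project's URL with probability `new_fraction`."""
    if not 0.0 <= new_fraction <= 1.0:
        raise ValueError("new_fraction must be between 0 and 1")
    return NEW_PROJECT_URL if rng.random() < new_fraction else OLD_PROJECT_URL


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [choose_project_url(0.1, rng) for _ in range(1000)]
    print(picks.count(NEW_PROJECT_URL), "of 1000 calls went to the new project")
```

Raising `new_fraction` step by step to 1.0 completes the switch; setting it back to 0.0 returns all traffic to the original project.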

Why we are excited:

With this change we are rolling out a significant infra overhaul, reverting many of the heavy-handed decisions we made in the V2 rollout two months ago.

We’re confident that this is a step in the right direction.

With this change, you get:

  1. A much higher capacity ceiling. We chose A5000s because they are relatively easy to acquire, allowing us to run 10x more replicas than before.
  2. The ability to use arbitrary dockerfile images, python versions, pytorch versions, etc. You can now finally run pytorch 2.0 or bring your own image base. These restrictions were from V2 and ended up being a pain that many of you felt, so we’re excited to move back to arbitrary container environments.
  3. Higher throughput (and lower cost to you). Workers now pull from a shared project queue, so a new cold boot no longer locks up the call that triggered it; the first available replica handles the job, reducing the total amount of machine time you need. This is only relevant with concurrency > 1.
  4. VM-level isolation between every replica. A massive security update and also a tool to avoid noisy neighbor issues.
  5. Faster builds
  6. Unmodified images. No more automagical stuff injected into your images. They’ll just be whatever you give us.
  7. (future) GET requests
  8. (future) on-cluster object store persistence
  9. (future) websocket & streaming support
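
To make the shared-queue point (item 3) concrete, here is a toy model of the difference. It is not Banana's scheduler: the round-robin "pinned" dispatch, the function names, and the timings are all assumptions chosen for illustration. Jobs are all queued at t=0, and `ready_at` gives the time at which each replica can start work (0 for a warm replica, the boot time for a cold one).

```python
import heapq


def pinned_finish_times(durations, ready_at):
    """Each job is fixed to a replica up front (round-robin), boot time included."""
    free_at = list(ready_at)
    finished = []
    for i, duration in enumerate(durations):
        replica = i % len(free_at)
        free_at[replica] += duration
        finished.append(free_at[replica])
    return finished


def shared_queue_finish_times(durations, ready_at):
    """Jobs wait in one queue; whichever replica is free first takes the next one."""
    heap = [(t, idx) for idx, t in enumerate(ready_at)]
    heapq.heapify(heap)
    finished = []
    for duration in durations:
        free_time, idx = heapq.heappop(heap)
        done = free_time + duration
        finished.append(done)
        heapq.heappush(heap, (done, idx))
    return finished


if __name__ == "__main__":
    jobs = [1, 1, 1, 1]   # four one-second jobs, all queued at t=0
    replicas = [0, 10]    # one warm replica, one that needs 10s to cold boot
    print(pinned_finish_times(jobs, replicas))        # [1, 11, 2, 12]
    print(shared_queue_finish_times(jobs, replicas))  # [1, 2, 3, 4]
```

In this toy run the warm replica drains the whole queue before the cold one is even ready, so no job has to wait on the boot.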

Everything has tradeoffs, so consider that:

  1. A5000s are smaller than A100s.
    They have 24 GB GPU RAM, 30 vCPUs, and 16 GB CPU RAM.
    Since none of you are running more than 24 GB of RAM (in part due to our build boxes being A10s), this downsizing shouldn’t be a concern. Inference speed will take a hit (~50% slower, though highly dependent on model implementation). For the sort of traffic most of you run, the throughput increases above should help offset the inference slowdown. To further offset it, we plan to build batching and parallelization patterns into the API over time.
  2. Cold boot times can be expected to increase, though we now have significantly more surface area for R&D to bring them closer to what you’ve been seeing, so please be patient with us as we drive them back down.
  3. Prices will remain unchanged. If you purchased credits expecting specifically A100s, please email me at erik@banana.dev and I’ll get you refunded.
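
On the memory point in the first tradeoff, a back-of-the-envelope estimate can tell you whether a model is anywhere near the 24 GB limit. The sketch below counts weights only and adds an arbitrary headroom for inference-time tensors; the function names and the 4 GB default are illustrative assumptions, and real usage also depends on activations, the CUDA context, and your framework.

```python
A5000_GPU_GB = 24  # GPU RAM figure quoted above


def weights_gb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Rough size of the model weights alone, in GB (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


def weights_fit(num_parameters: int, bytes_per_parameter: int,
                headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed headroom fit in 24 GB.

    `headroom_gb` is an arbitrary illustrative allowance, not a measured value.
    """
    return weights_gb(num_parameters, bytes_per_parameter) + headroom_gb <= A5000_GPU_GB


if __name__ == "__main__":
    # A 7-billion-parameter model at 2 bytes per parameter (fp16/bf16).
    print(weights_gb(7_000_000_000, 2))   # 14.0
    print(weights_fit(7_000_000_000, 2))  # True
    print(weights_fit(7_000_000_000, 4))  # False (fp32 weights alone are 28 GB)
```

Treat the result as a first filter only; measuring actual peak memory on the new machines remains the reliable check.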

This is all intended to be a change that you don’t need to notice or care about, but if you have any concerns, we’re happy to connect you with our friends at other GPU providers with A100 capacity who will take care of you.

Again, to move to the new infra:

  1. Let us do it for you Nov 1st (10m downtime expected)
  2. Manually update (10m downtime expected)
  3. Manually rolling deploy (no downtime expected)

Our team is very excited about these changes, because they give us a platform to better serve our users. Please don't hesitate to reach out and let us know what you think!

Thank you for running on Banana.

Erik

Cofounder & CEO

erik@banana.dev