In this tutorial we’ll show you how to transcribe audio files to text using OpenAI’s Whisper model and a function written in Python. We’ll start off with GPU acceleration using an NVIDIA GPU and OpenFaaS installed on a K3s cluster, but we’ll also show you how to run the same code using a CPU, which is more commonly available.
Why is audio transcription useful?
Common use-cases for transcribing audio include a bot that summarises customer complaints during a Zoom call, collects negative product feedback from YouTube reviews, or generates a set of timestamps for YouTube videos, which are later attached via API. You could even take traditional voice or VoIP recordings from a customer service center, and transcribe each one to look for training issues or high-performing telephone agents. If you listen to podcasts on a regular basis and have ever read the show notes, they could have been generated by a transcription model.
GPU is generally faster than CPU, but CPU can also be very effective if you are able to batch up requests via the OpenFaaS Asynchronous invocations system, and collect the results later on. To collect results from async invocations, you can supply a callback URL to the initial request, or have the function store its result in S3. We have some tutorials in the conclusion that show this approach for other use-cases like PDF generation.
Here’s what we’ll cover:
- Prepare a K3s cluster with Nvidia GPU support
- Install OpenFaaS with a GPU Profile
- Create a Python function to run OpenAI Whisper
- Make sure the function has a long enough timeout
- Limit concurrent requests to the function to prevent overloading
- Run the function with CPU inference, without a GPU.
Prepare a k3s cluster with NVIDIA container runtime support
Kubernetes has support for managing GPUs across different nodes using device plugins. The setup in your cluster will depend on your platform and GPU vendor. We will be setting up a k3s cluster with NVIDIA container runtime support.
k3sup is a light-weight CLI utility that lets you quickly set up k3s on any local or remote VM. If you already have a k3s cluster, you can also use k3sup to join additional agents to it.
You can use our article on how to set up a production-ready Kubernetes cluster with k3s on Akamai cloud computing as an additional reference.
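As a rough sketch, a minimal two-node setup with k3sup could look like this. We're assuming Ubuntu hosts reachable over SSH as the ubuntu user; adjust the IPs, user and SSH key flags to your own environment:

# Create the server (control plane) node
k3sup install --ip $SERVER_IP --user ubuntu

# Join an agent node that has the GPU installed
k3sup join --ip $AGENT_IP --server-ip $SERVER_IP --user ubuntu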
I would suggest setting up the cluster first. Once that is done, SSH into any agent or server with a GPU to prepare the host OS by installing the NVIDIA drivers and the container runtime package.
- Install the NVIDIA drivers, for example:

apt install -y cuda-drivers-fabricmanager-515 nvidia-headless-515-server

This example uses driver version 515, but you should select the appropriate driver version for your hardware.

Make sure the GPU is detected on the system by running the nvidia-smi command:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC  |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M.  |
|                                         |                      |               MIG M.  |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GT 1030         On  | 00000000:01:00.0 Off |                  N/A  |
| 35%   19C    P8              N/A /  19W |     92MiB /  2048MiB |      0%      Default  |
|                                         |                      |                  N/A  |
+-----------------------------------------+----------------------+----------------------+
- Install the NVIDIA container runtime packages.
Add the NVIDIA Container Toolkit package repository by following the instructions at: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt
Install the NVIDIA container runtime:
apt install -y nvidia-container-runtime
- Install K3s, or restart it if already installed:
curl -ksL get.k3s.io | sh -
- Confirm that the nvidia container runtime has been found by k3s:
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
Once the hosts have been prepared and your cluster is running, apply the NVIDIA runtime class in the cluster:
cat > nvidia-runtime.yaml <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
kubectl apply -f nvidia-runtime.yaml
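If you want to double check that the RuntimeClass was registered in the cluster, you can list it with kubectl:

kubectl get runtimeclass nvidia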
Install OpenFaaS with a GPU profile
Next, install OpenFaaS in your cluster. GPU support is a feature that is only available in the commercial version of OpenFaaS.
Follow the installation instructions in the docs to install OpenFaaS using the official Helm Chart.
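As an outline only, the installation follows the usual Helm workflow. The exact values you need (license secret, Pro components) depend on your subscription, so refer to the docs for the full set of options:

# Add the OpenFaaS chart repository
helm repo add openfaas https://openfaas.github.io/faas-netes/
helm repo update

# Create the namespaces and the license secret for the commercial version
kubectl apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/namespaces.yml
kubectl create secret generic openfaas-license \
  -n openfaas \
  --from-file license=$HOME/.openfaas/LICENSE

# Install the chart with the Pro components enabled
helm upgrade openfaas openfaas/openfaas \
  --install \
  --namespace openfaas \
  --set openfaasPro=true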
Add a GPU profile
Function deployments that require a GPU will need to have the nvidia
runtimeClass set. OpenFaaS uses profiles to support adding additional Kubernetes-specific configuration to function deployments.
Create a new OpenFaaS Profile to set the runtimeClass:
cat > gpu-profile.yaml <<EOF
kind: Profile
apiVersion: openfaas.com/v1
metadata:
  name: gpu
  namespace: openfaas
spec:
  runtimeClassName: nvidia
EOF
kubectl apply -f gpu-profile.yaml
Profiles can be applied to a function through annotations. To apply the gpu
profile to a function you need to add an annotation com.openfaas.profile: gpu
to the function configuration.
Create a GPU accelerated function
In this section we will create a function that runs the Whisper speech recognition model to transcribe an audio file.
Every OpenFaaS function is built into an Open Container Initiative (OCI) format container image and published to a container registry. When it’s deployed, a fully qualified image reference is sent to the Kubernetes node, which then pulls down that image and starts a Pod from it for the function.
OpenFaaS supports various languages through the use of its own templates concept. The job of a template is to help you create a container image, whilst abstracting away most of the boiler-plate code and implementation details.
The Whisper model is available as a Python package. We will be using a slightly adapted version of the python3-http template called python3-http-cuda to scaffold our function. To provide the CUDA Toolkit from NVIDIA, the python3-http-cuda template uses nvidia/cuda instead of Debian as the base image.
Create a new function with the OpenFaaS CLI, then rename its YAML file to stack.yaml. We do this so we don’t need to specify the name using --yaml or -f on every command.
# Change this line to your own registry
export OPENFAAS_PREFIX="ttl.sh/of-whisper"
# Pull the python templates
faas-cli template pull https://github.com/skatolo/python-flask-template
# Scaffold a new function using the python3-http-cuda template
faas-cli new whisper --lang python3-http-cuda
# Rename the function configuration file to stack.yaml
mv whisper.yaml stack.yaml
The function handler whisper/handler.py
is where we write our custom code. In this case the function retrieves an audio file from a url that is passed in through the request body. Next the whisper model transcribes the audio file and the transcript is returned in the response.
import tempfile
from urllib.request import urlretrieve

import whisper

def handle(event, context):
    models_cache = '/tmp/models'
    model_size = "tiny.en"

    # The request body contains the URL of the audio file to transcribe
    url = str(event.body, "UTF-8")

    # Download the audio file to a temporary location
    audio = tempfile.NamedTemporaryFile(suffix=".mp3", delete=True)
    urlretrieve(url, audio.name)

    # Load the model, caching it under /tmp/models, then transcribe the file
    model = whisper.load_model(name=model_size, download_root=models_cache)
    result = model.transcribe(audio.name)

    return (result["text"], 200, {'Content-Type': 'text/plain'})
The first time the function is invoked it will download the model and save it to the location set in the models_cache
variable, /tmp/models
. Subsequent invocations of the function will not need to refetch the model.
It is good practice to make your function only write to the /tmp folder. This way you can make the function file system read-only. OpenFaaS supports this by setting readonly_root_filesystem: true in the stack.yaml file. Only the temporary /tmp folder will still be writable. This prevents the function from writing to or modifying the filesystem and provides tighter security for your functions.
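As a sketch, enabling this for our function would look like the following in stack.yaml (using the image name we build later in this post):

functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
    readonly_root_filesystem: true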
Before we can build, deploy and run the function there are a couple of configuration settings that we need to run through.
Add runtime dependencies
Our function handler uses the openai-whisper Python package. Edit the whisper/requirements.txt file and add the following line:
openai-whisper
The whisper package also requires the command-line tool ffmpeg for audio transcoding. It needs to be installed in the function container. The OpenFaaS python3 templates support specifying additional packages that will be installed with apt through the ADDITIONAL_PACKAGE build argument.
Update the stack.yaml
file:
functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
+   build_args:
+     ADDITIONAL_PACKAGE: "ffmpeg"
Apply profiles
The function will need to use the alternative nvidia
runtime class in order to use the GPU. This can be applied by using the OpenFaaS gpu
profile created earlier. Add the com.openfaas.profile: gpu
annotation to the stack.yaml
file:
functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
+   annotations:
+     com.openfaas.profile: gpu
Configure timeouts
It is common for inference and other machine learning workloads to be long-running jobs. In this example, transcribing the audio file can take some time depending on the size of the file and the GPU speed. To ensure the function can run to completion, timeouts for the function and the OpenFaaS components need to be configured correctly.
For more info see: Expanding timeouts.
functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
+   environment:
+     write_timeout: 5m5s
+     exec_timeout: 5m
Build and deploy the function
Once the function is configured, you can deploy it straight to the Kubernetes cluster using the faas-cli:
faas-cli up whisper
Then, invoke the function when ready.
curl -i http://127.0.0.1:8080/function/whisper -d https://example.com/track.mp3
Limit concurrent requests to the function
Depending on the number of GPUs available in your cluster and the available memory for each GPU, you might want to limit the number of requests that can go to the whisper function at once. Kubernetes doesn’t implement any kind of request limiting for applications, but OpenFaaS can help here.
To prevent overloading the Pod and GPU, we can set a hard limit on the number of concurrent requests the function can handle. This is done by setting the max_inflight
environment variable on the function.
For example, if your GPU has enough memory to handle 6 concurrent requests, you can set max_inflight: 6. Any subsequent requests would be dropped and receive a 429 response. This assumes the producer can buffer the requests to retry them later on. Fortunately, when using async in OpenFaaS, the queue-worker does just that; you can learn how here: How to process your data the resilient way with back pressure
functions:
  whisper:
    lang: python3-http-cuda
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
    environment:
      write_timeout: 5m5s
      exec_timeout: 5m
+     max_inflight: 6
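If you'd rather queue requests than have the producer retry on a 429, you can invoke the function through the asynchronous endpoint instead of the synchronous one. The callback URL below is just a placeholder for your own receiver:

curl -i http://127.0.0.1:8080/async-function/whisper \
  -H "X-Callback-Url: https://example.com/receive-transcript" \
  -d https://example.com/track.mp3

The gateway responds immediately with a 202 Accepted, and the transcript is posted to the callback URL once the transcription has completed.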
How to run with CPU inference, without a GPU
You can still try out the Whisper inference function even if you don’t have a GPU available or when you don’t have the commercial version of OpenFaaS. With only a couple of changes the function can run with CPU inference.
The function handler does not need to change. The openai-whisper
package automatically detects whether a GPU is available and will fall back to using the CPU by default.
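If you want to confirm which device will be used, a quick check with PyTorch (which Whisper runs on top of) looks like this. This snippet is purely illustrative and not part of the function handler:

import torch
import whisper

# Whisper uses PyTorch under the hood, so this reflects the device it will pick
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running inference on: {device}")

# The device can also be set explicitly when loading the model
model = whisper.load_model("tiny.en", device=device)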
Change the template of the function in the stack.yaml
file to python3-http
and remove the gpu
profile annotation.
  whisper:
-   lang: python3-http-cuda
+   lang: python3-http
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
-   annotations:
-     com.openfaas.profile: gpu
Pull the python3-http
template.
faas-cli template store pull python3-http
Deploy the function and invoke it with curl as shown in the previous section. The function will now run the inference on the CPU instead. Depending on your hardware, this will probably increase the execution time compared to running on a GPU. Make sure to adjust your timeouts as required.
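As an illustration only, a stack.yaml for CPU inference might bump the timeouts to something like the values below. The 15 minute figure is an arbitrary assumption, so pick whatever suits your hardware and the length of your audio files:

functions:
  whisper:
    lang: python3-http
    handler: ./whisper
    image: ttl.sh/of-whisper:0.0.1
    environment:
      write_timeout: 15m5s
      exec_timeout: 15m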
Take it further
Take a look at some other patterns that can be useful for running ML workflows and pipelines with OpenFaaS.
- Asynchronous functions
- Exploring the Fan out and Fan in pattern with OpenFaaS
- Improving long-running jobs for OpenFaaS users
- How to process your data the resilient way with back pressure
Conclusion
In this tutorial we showed how a K3s cluster can be configured with NVIDIA container runtime support to run GPU-enabled containers. OpenFaaS was installed in the cluster along with an additional gpu Profile, which is required to run functions with the alternative nvidia runtimeClass. Using a custom Python template that includes the CUDA Toolkit from NVIDIA, we created a function to transcribe audio files with the OpenAI Whisper model.
We ran through several configuration steps for the function to set appropriate timeouts and applied the OpenFaaS gpu
profile to make the GPU available in the function container. Additionally we discussed how OpenFaaS features like async invocations and retries can be used together with concurrency limiting to prevent overloading your GPU while still making sure all requests can run to completion.
For people who don’t have a GPU available or that are running the Community Edition of OpenFaaS, we showed how the same function can be deployed to run with CPU inference.
Future work
We showed you how to apply concurrency limiting to make sure the GPU wasn’t overwhelmed with requests; however, Kubernetes does have a very basic way of scheduling Pods to GPUs. The approach taken is to exclusively dedicate at least one GPU to a Pod, so if you wanted the function to scale, you’d need several nodes, each with at least one GPU.
In Kubernetes this is done by passing in an additional value to the Pod under the requests/limits section i.e.
resources:
  limits:
    nvidia.com/gpu: 1
We’re looking into the best way to add this for OpenFaaS functions - either directly for each Function Custom Resource, or via a Profile, so feel free to reach out if that’s of interest to you.