FFmpeg 8
This release adds support for FFmpeg 8.
RVMedia now supports FFmpeg versions 1 through 8.
The RVMedia installation now includes an FFmpeg 8.1.1 build for Windows 64-bit with Whisper support (see below). This build is compatible with the LGPL license.
Options that require a GPL-licensed FFmpeg build have been removed, because such builds can only be used in open-source applications.
Speech to text: Overview
Whisper: AI speech recognition
Whisper is a free, open-source AI model for speech recognition and transcription developed by OpenAI. It converts spoken language into text.
RVMedia can use the Whisper implementation integrated into FFmpeg 8 and later.
The Whisper code is part of FFmpeg itself, but a separate model file is also required.
Speech-to-text conversion is performed entirely on the user's computer and does not require any online services or API keys. All that is needed is a speech recognition model file.
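To illustrate what local recognition looks like at the FFmpeg level, here is a minimal sketch that builds a command line for the ffmpeg tool. It is not RVMedia's own API. It assumes an FFmpeg 8 build with Whisper support on the PATH and the option names documented for FFmpeg's whisper audio filter (model, language, destination, format). The function name and all file names are hypothetical, and paths containing ":" (such as Windows drive letters) would need escaping in the filter string.

```python
def build_whisper_command(input_path, model_path, output_path, language="en"):
    """Return an ffmpeg argument list that writes a transcript as SRT subtitles.

    Assumption: the ffmpeg build includes the whisper audio filter (FFmpeg 8+).
    """
    # Filter options are joined with ":"; this sketch does no path escaping.
    filter_spec = (
        f"whisper=model={model_path}"
        f":language={language}"
        f":destination={output_path}"
        ":format=srt"
    )
    # -vn ignores video; "-f null -" discards the decoded audio, because the
    # transcript is written by the filter itself.
    return ["ffmpeg", "-i", input_path, "-vn", "-af", filter_spec, "-f", "null", "-"]


command = build_whisper_command("talk.mp4", "ggml-tiny.en.bin", "talk.srt")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True). No network access or API key is involved; the only inputs are the media file and the model file.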
The RVMedia installation includes the smallest available English-only model. Although it is not well suited to real-world use, it lets you test the speech recognition functionality and runs even on relatively low-end computers.
Additional models can be downloaded here:
https://huggingface.co/ggerganov/whisper.cpp/tree/main
Larger model files provide better recognition accuracy, but they also require more powerful hardware. Ideally, the user should have a modern high-performance GPU. However, even without a GPU, the smaller models can be used on the CPU.
The available model files are divided into:
- English-only models (their filenames contain ".en", for example ggml-tiny.en.bin),
- multilingual models, which support many languages.
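An application that lets the user pick a model file could use the naming convention above to decide whether a language choice makes sense. The following is a minimal sketch, assuming the whisper.cpp naming style (ggml-tiny.en.bin, ggml-base.bin, and so on). The function name is hypothetical, and the check relies on the file name only, not on the file contents.

```python
from pathlib import Path


def is_english_only(model_filename: str) -> bool:
    """Guess from the file name whether a Whisper model is English-only.

    The name is split on "." and "-" and checked for a standalone "en" token,
    so names that merely contain the letters "en" are not misclassified.
    """
    name = Path(model_filename).name.lower()
    tokens = name.replace("-", ".").split(".")
    return "en" in tokens


for candidate in ["ggml-tiny.en.bin", "ggml-base.bin", "ggml-large-v3.bin"]:
    kind = "English-only" if is_english_only(candidate) else "multilingual"
    print(f"{candidate}: {kind}")
```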
Voice Activity Detection (Optional)
In addition to the main models that perform speech recognition, FFmpeg can optionally use VAD (Voice Activity Detection) AI models.
These models detect when speech starts and ends in the audio stream, allowing the main recognition model to run only when necessary. This provides two important benefits:
- more efficient use of CPU/GPU resources;
- reduced risk of recognizing noise as speech (so-called hallucinations of the speech recognition model). Unfortunately, Whisper is prone to this problem, especially when using multilingual models.
The drawback of this approach is that it requires significantly more audio to be buffered before recognition can begin. As a result, recognized text becomes available with greater latency.
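The trade-off can be seen in how the filter is configured. The sketch below builds a filter string for FFmpeg's whisper audio filter with and without a VAD model. It assumes the option names documented for that filter (model, language, queue, vad_model), where queue is the amount of audio, in seconds, buffered before recognition runs. The function name and the model file names are hypothetical placeholders, and this is not RVMedia's own API.

```python
def whisper_filter(model, language="auto", queue_seconds=3, vad_model=None):
    """Build a filter string for ffmpeg's whisper audio filter.

    Assumption: option names follow the FFmpeg filter documentation.
    A larger queue means more buffered audio and therefore more latency.
    """
    options = {"model": model, "language": language, "queue": queue_seconds}
    if vad_model is not None:
        # VAD is optional; when set, it is typically paired with a larger queue.
        options["vad_model"] = vad_model
    return "whisper=" + ":".join(f"{key}={value}" for key, value in options.items())


# Low latency, no VAD: recognition runs on short chunks of audio.
print(whisper_filter("ggml-base.bin", queue_seconds=3))

# With VAD: more audio is buffered, so text arrives later.
print(whisper_filter("ggml-base.bin", queue_seconds=20, vad_model="vad-model.bin"))
```

The resulting string is what would be passed to ffmpeg after the -af option.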