
Vector search vs. model retraining: a comparison of audio similarity search methods
In my search for an interesting vector database use case I recently came across the following post which suggests using a vector database for an audio similarity search: https://medium.com/@zilliz_learn/scaling-audio-similarity-search-with-vector-databases-30bccfd70279
As I have worked on an audio classifier performing similarity search in the past, I thought it would be interesting to combine that previous work with an exploration of vector database functionality for the same purpose: how well does a vector database stack up against a classical audio classification approach when performing audio similarity search?
High-level approaches
A tried and tested way of performing audio similarity search is training an audio classification model or finetuning an existing one. A popular approach uses VGGish to convert the audio file into a log-mel spectrogram and a custom classifier on top of the spectrograms to classify the audio samples. This works pretty well, but requires you to gather a sufficiently large catalog of audio samples and train a classifier on them. That is feasible if all your classes are known upfront and the number of classes is not too large, but what if you cannot know all classes upfront, or it is impossible to train on them all, as in for instance music similarity search?
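To make the classical pipeline a bit more concrete, the spectrogram step can be sketched as follows. This is a minimal example using librosa (my assumption here; the file name is hypothetical and the exact VGGish preprocessing uses its own frame and mel-band settings):

import librosa
import numpy as np

# Load a clip and compute its log-mel spectrogram, the typical input
# for a VGGish-style audio classifier
audio, sr = librosa.load('example.wav', sr=16000, mono=True)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, number of time frames)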
This is where a vector database comes in. By converting your audio samples to vectors through a process called embeddings extraction, you can load all converted samples into a vector database and let it perform the similarity search with one of its built-in Approximate Nearest Neighbour (ANN) algorithms.
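Under the hood, such a similarity search boils down to comparing embedding vectors. A brute-force version of what an ANN index approximates could look like this toy NumPy sketch (random data, purely illustrative):

import numpy as np

# Toy setup: 1000 stored embeddings of dimension 512 plus one query embedding
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 512))
query = rng.normal(size=512)

# Normalize so the inner product equals cosine similarity,
# matching the IP metric used with Milvus later in this post
db /= np.linalg.norm(db, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = db @ query                  # similarity of the query to every stored vector
top_3 = np.argsort(scores)[::-1][:3]
print(top_3)                         # indices of the three most similar samples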
Experiment setup
I will be comparing the following two approaches to audio similarity search:
- A vector database-oriented approach allowing for similarity search based on embeddings generated by the Wav2Vec2 model
- An audio classifier based on a finetuned version of the same Wav2Vec2 model.
The experiment should show which of the two approaches is better suited for an audio similarity search use case.
The pros and cons of both approaches are listed here:
Vector database
Pros | Cons |
---|---|
Does not require manual labeling of samples | Possibly worse accuracy |
No manual model training | Will only retrieve a specifically requested number of samples |
Classifier
Pros | Cons |
---|---|
Possibly better accuracy | Requires manual labeling of samples |
Usage of pre-trained models | Requires manual model finetuning |
The steps I will be taking are as follows. First let’s examine the steps needed for the vector database search:
- Download and prepare a dataset, split into train, test and validation sets
- Generate audio embeddings for the gathered samples
- Insert all audio embeddings of the train set into the vector database
- Use each validation sample to retrieve similar samples from the train set
- Check whether the class of the validation sample matches the class of the top 1 (best matching) result, and whether it appears among the top 3 results
- Repeat this for all validation samples; the share of correctly classified samples is the accuracy number
Validating the classifier is more straightforward; the data preparation has already been done as part of the vector search experiment.
- Fine tune an existing audio classification model on the dataset
- Run the validation sample of each class through the classifier
- Measure the accuracy, i.e. the share of correctly identified classes
At this point both accuracy numbers can be compared to get a rough impression of which approach will be better. If you would like to follow along, go to the following Github repository and follow the README to prepare your data and environment: https://github.com/kemperd/audio-similarity-search
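To make the accuracy metric concrete, here is a minimal sketch of how top-k accuracy over the validation set can be computed (a hypothetical helper with made-up labels, not code from the repository):

def top_k_accuracy(true_labels, retrieved_labels, k):
    # Share of samples whose true class appears among the top-k retrieved classes
    hits = sum(1 for true, retrieved in zip(true_labels, retrieved_labels)
               if true in retrieved[:k])
    return hits / len(true_labels)

# Toy example: 2 of the 3 validation samples have their class among the top 3 results
print(top_k_accuracy(
    ['dog', 'rain', 'siren'],
    [['dog', 'cat', 'rain'], ['wind', 'rain', 'sea_waves'], ['car_horn', 'train', 'frog']],
    k=3))  # 0.666...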
Preparing the dataset
Before starting the experiments, we need to prepare a dataset that will be used by both approaches. I will be using the ESC-50 dataset, which can be downloaded at https://github.com/karolpiczak/ESC-50 and consists of 50 classes of environmental sounds with 40 examples each. After downloading and unzipping, I use the process_esc50.ipynb notebook to arrange the files in a directory structure that can be readily consumed by the HuggingFace datasets package, with one directory per split and one subdirectory per class:
esc50/
    train/
        class/
            item1.wav
            item2.wav
    test/
        class/
            item1.wav
    val/
        class/
            item.wav
The script creates the following three-way split per class: 34 train samples, 5 test samples and 1 validation sample. The validation sample is only used for manually validating the model and is not used during the finetuning phase. It would be nice to have more validation examples, but some experimenting revealed that the model needs enough training samples to become sufficiently accurate.
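For reference, the gist of what process_esc50.ipynb does can be sketched as follows. This is my own simplified version, assuming the standard ESC-50 layout with an audio/ directory and a meta/esc50.csv file listing filename and category; the actual notebook may differ:

import os
import shutil
import pandas as pd

# Split each of the 50 ESC-50 classes into 34 train, 5 test and 1 validation sample
meta = pd.read_csv('ESC-50-master/meta/esc50.csv')
for category, group in meta.groupby('category'):
    files = sorted(group['filename'].tolist())
    splits = {'train': files[:34], 'test': files[34:39], 'val': files[39:40]}
    for split, split_files in splits.items():
        target_dir = os.path.join('esc50', split, category)
        os.makedirs(target_dir, exist_ok=True)
        for f in split_files:
            shutil.copy(os.path.join('ESC-50-master/audio', f), target_dir)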
Experiment 1 – Similarity search using vector database
A vector database needs your data to be represented as a vector, i.e. a fixed-length numerical array. To generate these vectors for the input audio files, I will use the same model as for the classification experiment, but take the internal state of the neural network as the numerical representation of the audio. This process is referred to as retrieving the embeddings for a piece of data and can equally be applied to, for example, images or texts.
The concept of retrieving embeddings from the Wav2vec2 model is described here: https://stackoverflow.com/questions/69266293/getting-embeddings-from-wav2vec2-models-in-huggingface
In this case, the embeddings are taken from an internal layer of the Wav2Vec2 network (the extract_features output of its feature encoder) and can be seen as a numerical representation of the information contained in the audio file. The size of these embeddings depends on the model used; for the Wav2Vec2 model used here each embedding has 512 dimensions. Note that such a 512-dimensional vector is produced at every frame of the audio file, which can be of arbitrary length, so a single file yields a long sequence of these vectors.
The idea is of course that processing similar audio fragments through the Wav2Vec2 network produces embedding vectors that are close to each other, which makes a similarity search using a vector database an effective choice.
Let’s first have a look at the function that converts the audio file into an embeddings vector:
import librosa
import numpy as np
import torch

SAMPLE_RATE = 16000  # sample rate expected by Wav2Vec2

def retrieve_embeddings_for_audiofile(filename, feature_extractor, model):
    # Load the audio file and resample it to 16 kHz mono
    input_audio, sample_rate = librosa.load(filename, sr=SAMPLE_RATE, mono=True)
    input_features = feature_extractor(
        input_audio,
        return_tensors='pt',
        sampling_rate=SAMPLE_RATE
    )
    with torch.no_grad():
        output = model(input_features.input_values)
    file_length = len(output.extract_features.squeeze().numpy()) - 1

    # Maximum Milvus dimensions is 32768.
    # Wav2Vec2 feature vector is 512, so we can fit 64 feature vectors into Milvus
    indexes = np.linspace(0, file_length, 64, dtype=int)
    feature_list = []
    for index in indexes:
        feature_list.append(output.extract_features.squeeze().numpy()[index])

    # Concatenate the sampled frames and normalize to unit length
    feature_vector = np.concatenate(feature_list)
    return feature_vector / np.linalg.norm(feature_vector)
You’ll notice I am using librosa to load the file and immediately resample it to 16000 Hz, the sample rate Wav2Vec2 expects. The model then converts the audio into embeddings: each embedding vector has 512 dimensions, and because the audio files in this dataset are always 5 seconds long there will be roughly 250 of these vectors per file. This number of course varies with the length of the file.
The issue at this point is that Milvus can only store vectors with up to 32768 dimensions (and no matrices), so we need to convert the 250 x 512 matrix extracted from the file into a single vector that fits in Milvus while retaining the properties of the audio file. Since 32768 / 512 = 64, we can keep 64 embedding vectors per audio file, sampled at linearly spaced positions along the file. These 64 vectors are concatenated into one 32768-dimensional vector describing the audio file.
Note that this approach becomes more problematic as the audio file length increases, since the same 64 embeddings have to be spread over a longer file.
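The sampling-and-concatenation trick can be illustrated on a dummy embeddings matrix (toy NumPy sketch, random data):

import numpy as np

# Pretend we extracted ~250 frame embeddings of 512 dimensions from a 5-second clip
embeddings = np.random.rand(250, 512)

# Pick 64 frames spread linearly over the clip and concatenate them into
# a single 64 * 512 = 32768-dimensional vector that fits in Milvus
indexes = np.linspace(0, len(embeddings) - 1, 64, dtype=int)
vector = np.concatenate([embeddings[i] for i in indexes])
print(vector.shape)  # (32768,)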
Now that the embeddings generation function has been defined, we can insert these embeddings into a vector database. I will be using Milvus in its lightweight local mode (Milvus Lite), which stores the data in a local file and is accessible from a single Python process only, so make sure not to access the database from multiple notebooks in parallel.
First initialize a new database and create a schema:
from pymilvus import DataType, MilvusClient

MILVUS_DATABASE = 'esc50.db'
MILVUS_COLLECTION_NAME = 'esc50'
EMBEDDINGS_DIMENSIONS = 4608

def init_milvus(milvus_client):
    # Define the collection schema: an auto-generated id, the source filename
    # and the embeddings vector itself
    schema = MilvusClient.create_schema(auto_id=False)
    schema.add_field(field_name='id', datatype=DataType.INT64, is_primary=True, auto_id=True)
    schema.add_field(field_name='filename', max_length=500, datatype=DataType.VARCHAR)
    schema.add_field(field_name='embeddings', datatype=DataType.FLOAT_VECTOR,
                     dim=EMBEDDINGS_DIMENSIONS, description='sample embeddings vector')

    # Index the embeddings field using the inner product (IP) metric
    index_params = milvus_client.prepare_index_params()
    index_params.add_index(
        field_name="embeddings",
        index_type="AUTOINDEX",
        metric_type="IP",
    )

    milvus_client.create_collection(
        collection_name=MILVUS_COLLECTION_NAME,
        schema=schema,
        index_params=index_params,
    )
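With the schema function in place, the client and collection can be set up as follows. This is a minimal usage sketch; passing a local .db file to MilvusClient is how pymilvus starts the embedded Milvus Lite instance:

from pymilvus import MilvusClient

# Open (or create) the local database file and create the collection if needed
milvus_client = MilvusClient(MILVUS_DATABASE)
if not milvus_client.has_collection(MILVUS_COLLECTION_NAME):
    init_milvus(milvus_client)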
The following helper function will be used for inserting an embeddings vector into the database:
def insert_embeddings_into_db(feature_vector, filename, milvus_client):
    data = [ { 'filename': filename, 'embeddings': feature_vector } ]
    milvus_client.insert(collection_name=MILVUS_COLLECTION_NAME, data=data)
The full script loops over the training files, extracts their embeddings and inserts each of them into Milvus. Note that Milvus also exposes a bulk insert function for really large datasets, which is not used here.
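A simplified version of that loop could look like this (a sketch based on the helpers defined above, assuming the esc50/train/<class>/ directory layout; the repo code may differ slightly):

import glob

train_files = glob.glob('esc50/train/**/*.wav', recursive=True)
for file in train_files:
    # Convert each training file into a single embeddings vector and store it
    feature_vector = retrieve_embeddings_for_audiofile(file, feature_extractor, model)
    insert_embeddings_into_db(feature_vector, file, milvus_client)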
After a few minutes of inserting all embeddings into Milvus, we can calculate the accuracy over the whole validation set using the code below. I distinguish between top 1 accuracy, indicating whether the first result has the correct class, and top 3 accuracy, indicating whether any of the first 3 results has the correct class.
val_files = glob.glob('esc50/val/**/*.wav', recursive=True)

top_1_scores_list = []
top_3_scores_list = []
for file in val_files:
    target_category = file.split('/')[2]

    feature_vector = embeddings_util.retrieve_embeddings_for_audiofile(
        file, feature_extractor, model)
    result_json = embeddings_util.retrieve_by_sample(feature_vector, milvus_client)

    # Top 1: does the best match have the same class as the validation sample?
    inferred_category = result_json[0][0]['entity']['filename'].split('/')[2]
    top_1_scores_list.append(1 if target_category == inferred_category else 0)

    # Top 3: does the class appear among the three best matches?
    top_3_classes = []
    for r in result_json[0][0:3]:
        top_3_classes.append(r['entity']['filename'].split('/')[2])
    top_3_scores_list.append(1 if target_category in top_3_classes else 0)

print('Top 1 accuracy: {}'.format(top_1_scores_list.count(1) / len(top_1_scores_list)))
print('Top 3 accuracy: {}'.format(top_3_scores_list.count(1) / len(top_3_scores_list)))
This outputs a top 1 accuracy of 0.22 and a top 3 accuracy of 0.36, which is a bit disappointing.
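The retrieve_by_sample helper from embeddings_util is not shown above; a minimal sketch of how such a lookup could be implemented with the MilvusClient search API (my assumption, not necessarily the exact repo code) is:

def retrieve_by_sample(feature_vector, milvus_client, limit=3):
    # Return the entries whose embeddings have the highest inner product with the query
    return milvus_client.search(
        collection_name=MILVUS_COLLECTION_NAME,
        data=[feature_vector],
        limit=limit,
        output_fields=['filename'],
    )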
Experiment 2 – Similarity search using finetuned model
Let’s now have a look at how a finetuned audio classification model performs on our dataset. For this experiment I will be using the same Wav2Vec2 model as used for extracting the embeddings in order to make a fair comparison of the two methodologies.
The full code for the audio classification setup is in this notebook; I will walk through the main steps here.
First read the local directory structure into a Huggingface Dataset object and convert it to the correct sample rate:
import datasets

ds = datasets.load_dataset('esc50')
ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))
Initialize the feature extractor:
from transformers import AutoFeatureExtractor

MODEL_NAME = 'facebook/wav2vec2-large'

feature_extractor = AutoFeatureExtractor.from_pretrained(
    MODEL_NAME, do_normalize=True, return_attention_mask=True
)
At this point the “audio” key of the Dataset contains a numerical representation of the waveform. To start the training, we convert this waveform representation into a feature vector using the Wav2Vec2 feature extractor, which can be done at scale using the map function. Note the max_duration constant, which can be set to 5 seconds for all of the audio files, a luxury we have because all files in the dataset have the same length.
max_duration = 5.0

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * max_duration),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs

ds_encoded = ds.map(
    preprocess_function,
    remove_columns=["audio"],
    batched=True,
    batch_size=100,
    num_proc=1,
)
The following helper creates a dict to look up the text class label from the numerical labels used in the Dataset:
id2label_fn = ds["train"].features['label'].int2str
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(ds_encoded["train"].features["label"].names))
}
label2id = {v: k for k, v in id2label.items()}
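As a quick sanity check, the mappings can be used like this (the exact class returned for a given id depends on the folder ordering):

print(id2label['0'])         # e.g. 'airplane'
print(label2id['airplane'])  # '0'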
At this point we are ready to start the actual finetuning of an existing model. First initialize the model using the weights downloaded from the Huggingface hub:
from transformers import AutoModelForAudioClassification

num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)
Set the training arguments:
from transformers import TrainingArguments

model_name = MODEL_NAME.split("/")[-1]
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 20

training_args = TrainingArguments(
    f"{model_name}-finetuned-gtzan",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    push_to_hub=False,
)
Create a helper function for properly evaluating accuracy during training:
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
Now finally start the actual model finetuning as follows:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=ds_encoded["train"],
    eval_dataset=ds_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model('audio_classification_model')
The output of the training run was the following on my end:
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | 3.649700 | 3.566805 | 0.108000 |
2 | 3.117400 | 3.122789 | 0.196000 |
3 | 2.729100 | 2.300732 | 0.424000 |
4 | 1.935000 | 2.110009 | 0.408000 |
5 | 1.427300 | 1.812006 | 0.504000 |
6 | 1.451700 | 1.652343 | 0.564000 |
7 | 1.249300 | 1.470184 | 0.644000 |
8 | 0.824600 | 1.414082 | 0.652000 |
9 | 0.732000 | 1.136560 | 0.724000 |
10 | 0.389300 | 1.254131 | 0.676000 |
11 | 0.286900 | 1.177638 | 0.724000 |
12 | 0.178800 | 1.122814 | 0.744000 |
13 | 0.143100 | 1.146752 | 0.752000 |
14 | 0.031500 | 1.320253 | 0.768000 |
15 | 0.103300 | 1.154021 | 0.796000 |
16 | 0.012800 | 1.275106 | 0.784000 |
17 | 0.125700 | 1.162060 | 0.804000 |
18 | 0.021200 | 1.132426 | 0.816000 |
19 | 0.030600 | 1.193988 | 0.800000 |
20 | 0.049300 | 1.165005 | 0.792000 |
This indicates an accuracy on the test set of about 0.8, which should be sufficient for an initial test. Given the limited number of training samples I feel the classifier is already performing fine and would probably need more samples to really improve.
Now we can calculate the custom accuracy metric on the validation set as defined in the experiment setup:
import glob

from transformers import pipeline

pipe = pipeline('audio-classification', model='audio_classification_model')

# Classify the audio file at the given path and return the top predicted label
def classify_audio(filepath):
    preds = pipe(filepath)
    return preds[0]["label"]

val_files = glob.glob('esc50/val/**/*.wav', recursive=True)

scores_list = []
for file in val_files:
    target_category = file.split('/')[2]
    inferred_category = classify_audio(file)
    scores_list.append(1 if target_category == inferred_category else 0)

print('Accuracy: {}'.format(scores_list.count(1) / len(scores_list)))
This outputs an accuracy of 0.82, in line with the accuracy found during training and obviously much better than the vector search approach.
Conclusions
As this experiment shows, the classical approach of finetuning a model tailored to audio classification performs much better than a vector database similarity search, with an accuracy of 0.82 for the finetuned model versus 0.22 for the vector search approach.
While the experiment can be generalized to other similarity search use cases such as image or video similarity search, that does not mean the same conclusion will hold for those other types of similarity search. In other words: a vector database may well perform better for those use cases.
Note that the performance of the vector database search hinges heavily on a proper conversion of the audio file into an embeddings vector, and I ran into several limitations there:
- The Milvus database accepts vectors with a maximum length of 32768 dimensions, which is insufficient to store all Wav2Vec2 embeddings extracted from a 5-second audio file, let alone longer files. This required a workaround in the form of sampling a subset of embeddings along the file.
- The approach of linearly sampling 64 embeddings vectors along the audio file and concatenating them into a single vector is my own ad hoc solution; I did not find any documented approaches specifically tackling this problem. It will also not scale to larger audio files or music tracks.
If these limitations could be overcome in a proper way, I am confident a vector database search could bring value to this use case as well.
At the same time, finetuning a pre-existing audio classification model on a custom dataset is straightforward, does not require that much time or many audio samples, and is a tried-and-tested, easy to understand approach for working with audio files that should not be overlooked.
Dirk Kemper