nataS13 Jun 2026 06:24

Building a small semantic search over maybe fifty thousand documents. I generate embeddings from an API and now need somewhere to put them and do nearest-neighbor search. Postgres with pgvector is the obvious answer, but our stack is MySQL and I would rather not add another database just for this if I can avoid it.

For this scale, is brute-force cosine over a column of serialized vectors in MySQL actually fine, or does it fall apart fast? And if you went pgvector, was the index quality worth running a second datastore? Trying to make a sober call, not chase the fanciest option.

Replies (5)
artem_ml13 Jun 2026 06:54

At fifty thousand vectors, brute force is completely fine and you are overthinking it. A full scan computing cosine against fifty thousand short vectors is milliseconds of CPU if you store them efficiently. The trap is storing them as JSON strings and decoding fifty thousand times per query. Store the raw float bytes in a BLOB and unpack with the binary functions, and you avoid the parsing tax that makes people think MySQL cannot do this.

0
mdemir13 Jun 2026 08:24

The pgvector advantage only really shows up when you need an approximate index because the dataset is too large to scan, think millions, not tens of thousands. Below that, the exact brute force you can do anywhere gives you perfect recall and pgvector gives you the same answer but with operational overhead. Adding a second datastore for fifty thousand rows is not a good trade. Revisit it if you hit seven figures of vectors.

0
katedev13 Jun 2026 09:54

If you precompute and store the vector norm alongside each embedding, your per-query work drops to a dot product plus a divide, since the query norm is constant for the whole scan. Normalize the vectors at insert time and cosine similarity becomes a plain dot product, no per-row sqrt at all. That single change took our scan from sluggish to instant on a similar row count, all in MySQL.

0
nataS13 Jun 2026 11:34

Normalizing at insert and storing raw float bytes in a BLOB is the combination I was missing. I had them as JSON and yes, the decode was the whole cost. Switching to packed floats and pre-normalized vectors makes the brute force trivial at this scale, and I get to keep a single database. Will reconsider pgvector only if we ever cross into the millions.

0
cgomez13 Jun 2026 13:14

One caution on packed floats: pin the byte order and the float width explicitly when you pack and unpack, or a future platform change silently corrupts every vector. Use a fixed format code in pack and the matching one in unpack, and write a tiny round-trip test that asserts a known vector survives a pack then unpack. Cheap insurance against a very confusing data-corruption bug later.

0
Write a reply
Markdown. ```php blocks are runnable.