artem_ml15 May 2026 16:01

If you are working with LLM embeddings and need to compare them, you do not need a vector database for small datasets. Cosine similarity is straightforward to implement in PHP and runs fine for thousands of vectors.

Here is a working implementation you can run directly:

PHP
<?php
function cosineSimilarity(array $a, array $b): float
{
if (count($a) !== count($b)) {
throw new InvalidArgumentException("Vectors must have the same dimension");
}
$dot = 0.0;
$normA = 0.0;
$normB = 0.0;
foreach ($a as $i => $val) {
$dot += $val * $b[$i];
$normA += $val * $val;
$normB += $b[$i] * $b[$i];
}
$denom = sqrt($normA) * sqrt($normB);
if ($denom < 1e-10) return 0.0;
return $dot / $denom;
}
// Simulate small embedding vectors (normally 1536 dims from OpenAI or 1024 from Claude)
// Using 8 dims here just to make it readable
$embeddingA = [0.2, 0.5, 0.1, 0.8, 0.3, 0.9, 0.4, 0.6];
$embeddingB = [0.2, 0.5, 0.1, 0.8, 0.3, 0.9, 0.4, 0.6]; // identical
$embeddingC = [0.9, 0.1, 0.8, 0.2, 0.7, 0.1, 0.6, 0.3]; // different
printf("A vs B (identical): %.6f\n", cosineSimilarity($embeddingA, $embeddingB)); // ~1.0
printf("A vs C (different): %.6f\n", cosineSimilarity($embeddingA, $embeddingC)); // ~0.5ish
// Simple nearest neighbor search
$query = [0.3, 0.4, 0.2, 0.7, 0.4, 0.8, 0.5, 0.5];
$corpus = [
'doc1' => $embeddingA,
'doc2' => $embeddingB,
'doc3' => $embeddingC,
];
$scores = [];
foreach ($corpus as $id => $vec) {
$scores[$id] = cosineSimilarity($query, $vec);
 
הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

For real embeddings you would store the vectors in a table as JSON blobs and load them into PHP arrays. Up to about 50k vectors this is fast enough for most use cases without any external tooling.

Replies (1)
mdemir15 May 2026 16:56

Good implementation. One optimization: if your vectors are pre-normalized (unit length, which most embedding APIs return), you can skip the norm computation entirely because for unit vectors cosine similarity equals the dot product.

PHP
<?php
// If vectors are guaranteed unit length:
function dotProduct(array $a, array $b): float
{
$dot = 0.0;
foreach ($a as $i => $val) {
$dot += $val * $b[$i];
}
return $dot;
}
// Check if a vector is roughly unit length
function isNormalized(array $v, float $tolerance = 0.001): bool
{
$norm = sqrt(array_sum(array_map(fn($x) => $x * $x, $v)));
return abs($norm - 1.0) < $tolerance;
}
$v = [0.6, 0.8]; // unit vector: sqrt(0.36 + 0.64) = 1.0
var_dump(isNormalized($v)); // bool(true)
$w = [0.6, 0.8];
printf("dot product (unit vecs): %.4f\n", dotProduct($v, $w)); // 1.0000
הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

The dot product version is about 30% faster on large vectors since it avoids two sqrt() calls.

0
Write a reply
Markdown. ```php blocks are runnable.