Ollama local LLMs from PHP: setup and practical limits
Set up Ollama on a dev server (32GB RAM, no GPU) to run local LLMs from our PHP application. We use llama3.2:3b for low-latency tasks and mistral:7b for more complex ones.
The Ollama REST API is simple enough that a direct Guzzle client works without a dedicated library.
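A minimal sketch of such a client, assuming guzzlehttp/guzzle is installed and Ollama is listening on its default port (the function names and base URI here are ours, not from a library):

```php
<?php
// Direct Ollama client sketch using Guzzle. Assumes Ollama at
// http://localhost:11434 (its default port).

use GuzzleHttp\Client;

// Pure helper: build the /api/generate request body.
function buildGeneratePayload(string $model, string $prompt, bool $stream = false): array
{
    return [
        'model'  => $model,
        'prompt' => $prompt,
        'stream' => $stream, // false = one JSON object with the full response
    ];
}

// Blocking, non-streaming call: fine for background jobs.
function generate(Client $http, string $model, string $prompt): string
{
    $response = $http->post('/api/generate', [
        'json' => buildGeneratePayload($model, $prompt),
    ]);
    $body = json_decode((string) $response->getBody(), true);
    return $body['response'] ?? '';
}

// Usage (hypothetical):
// $http = new Client(['base_uri' => 'http://localhost:11434', 'timeout' => 120]);
// echo generate($http, 'llama3.2:3b', 'Classify this ticket: ...');
```

A generous timeout matters here: CPU inference on the 7B model can take minutes for long outputs, well past Guzzle's defaults.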
CPU inference with 7B models is usable but slow: about 3-5 tokens/second on a 16-core server. For real-time user-facing features this is too slow. For background jobs (summarization, classification) it is fine.
The streaming endpoint (/api/generate with stream: true) matters for anything user-facing: output starts rendering before the full response is done, which makes slow CPU inference feel much more responsive.
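With stream: true, Ollama returns one JSON object per line, each carrying a "response" fragment until "done": true. A sketch of a consumer, again assuming Guzzle (the helper names are ours):

```php
<?php
// Streaming /api/generate consumer sketch. Ollama emits newline-delimited
// JSON; we split on newlines and hand each token fragment to a callback.

use GuzzleHttp\Client;

// Pure helper: extract the token fragment from one NDJSON line.
function tokenFromLine(string $line): string
{
    $chunk = json_decode(trim($line), true);
    return is_array($chunk) ? ($chunk['response'] ?? '') : '';
}

function streamGenerate(Client $http, string $model, string $prompt, callable $onToken): void
{
    $response = $http->post('/api/generate', [
        'json'   => ['model' => $model, 'prompt' => $prompt, 'stream' => true],
        'stream' => true, // Guzzle option: don't buffer the whole body
    ]);

    $body = $response->getBody();
    $buffer = '';
    while (!$body->eof()) {
        $buffer .= $body->read(1024);
        while (($nl = strpos($buffer, "\n")) !== false) {
            $line = substr($buffer, 0, $nl);
            $buffer = substr($buffer, $nl + 1);
            if ($line !== '') {
                $onToken(tokenFromLine($line)); // e.g. echo + flush() to the browser
            }
        }
    }
}
```

The callback is where you echo and flush to the browser (or push over SSE), so the user sees tokens at the 3-5/sec rate instead of staring at a spinner.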
Quantized models (Q4_K_M) fit in less RAM and are faster with minimal quality loss for most tasks. llama3.2:3b-q4 takes about 3GB RAM vs 7GB for the full model. Worth it.
Running Ollama in Docker and exposing it to the PHP container via internal network is cleaner than running it on the host. The official Docker image handles model storage and GPU passthrough.
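A compose-style sketch of that wiring (service, network, and volume names are placeholders, not from any official example). Ollama gets no ports: entry, so it is reachable only from the internal network:

```yaml
services:
  app:
    build: .
    environment:
      OLLAMA_URL: http://ollama:11434   # resolved on the internal network
    networks: [backend]
  ollama:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama     # persist pulled models across restarts
    networks: [backend]
networks:
  backend:
volumes:
  ollama-models:
```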
For production cost savings on high-volume tasks: classify the task complexity first, route simple tasks to the local 3B model, complex ones to GPT-4. Our API bill dropped 60%.
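The router itself can be very dumb and still capture most of the savings. A sketch with an illustrative heuristic (the task names, threshold, and model labels here are placeholders; a tuned classifier per task type would replace them):

```php
<?php
// Complexity-based model router sketch. Cheap structural signals stand in
// for a real classifier: known-simple task types with short inputs go local.

function routeModel(string $task, string $input): string
{
    $simpleTasks = ['classify', 'extract', 'summarize-short'];
    $isShort = strlen($input) < 2000; // illustrative cutoff

    if (in_array($task, $simpleTasks, true) && $isShort) {
        return 'ollama:llama3.2:3b';  // local, free, slower
    }
    return 'openai:gpt-4';            // remote, paid, capable
}
```

Logging which route each request took makes it easy to audit the split and verify the savings claim against the API bill.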
wait, so if two users hit the endpoint simultaneously they both just… wait in line? no parallel processing at all?
Yes, Ollama queues requests and processes them serially by default. You can set OLLAMA_NUM_PARALLEL to serve concurrent requests but it increases per-request memory usage proportionally.
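In the Docker setup this is just an environment variable on the Ollama service, e.g. two parallel slots at roughly double the per-model memory footprint:

```yaml
services:
  ollama:
    image: ollama/ollama
    environment:
      OLLAMA_NUM_PARALLEL: "2"
```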