artem_ml · 18 Apr 2025 05:42

Set up Ollama on a dev server (32GB RAM, no GPU) to run local LLMs from our PHP application. Using llama3.2:3b for low-latency tasks and mistral:7b for more complex ones.

The Ollama REST API is simple enough that a direct Guzzle client works without a dedicated library.
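A dependency-free sketch of that direct call, using PHP's HTTP stream wrapper instead of Guzzle so it runs with no Composer install; the host/port and model name are assumptions, and a real app would swap the transport for a Guzzle client:

```php
<?php
// Minimal non-streaming call to Ollama's /api/generate endpoint.

/** Build the request payload (pure, so it is easy to unit-test). */
function ollama_payload(string $model, string $prompt): array
{
    return [
        'model'  => $model,
        'prompt' => $prompt,
        'stream' => false, // one JSON object back instead of NDJSON chunks
    ];
}

function ollama_generate(
    string $model,
    string $prompt,
    string $base = 'http://localhost:11434' // assumed Ollama address
): string {
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\n",
        'content' => json_encode(ollama_payload($model, $prompt)),
        'timeout' => 120, // CPU inference is slow; allow long generations
    ]]);
    $raw  = file_get_contents($base . '/api/generate', false, $context);
    $body = json_decode($raw, true);
    return $body['response'] ?? '';
}
```

The non-streaming response is a single JSON object whose `response` field holds the full completion.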

Replies (7)
alex_petrov · 18 Apr 2025 06:02

CPU inference with 7B models is usable but slow: about 3-5 tokens/second on a 16-core server. For real-time user-facing features this is too slow. For background jobs (summarization, classification) it is fine.

dmitry_kv · 18 Apr 2025 06:44

The streaming endpoint (/api/generate with "stream": true) matters for user-facing features: you can start showing output before the full response is done, which makes CPU inference feel much faster than it is.

artem_ml · 18 Apr 2025 07:11

Quantized models (Q4_K_M) fit in less RAM and are faster with minimal quality loss for most tasks. llama3.2:3b-q4 takes about 3GB RAM vs 7GB for the full model. Worth it.

petr_sys · 18 Apr 2025 08:19

Running Ollama in Docker and exposing it to the PHP container via internal network is cleaner than running it on the host. The official Docker image handles model storage and GPU passthrough.
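Something like this compose layout is what I mean; the service names, network name, and volume path are illustrative, not a definitive setup:

```yaml
# Sketch: Ollama on an internal network, reachable only from the app container.
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama-models:/root/.ollama   # persist pulled models across restarts
    networks: [backend]               # no ports published to the host

  app:
    build: .
    environment:
      OLLAMA_URL: http://ollama:11434 # PHP reaches Ollama by service name
    networks: [backend]

networks:
  backend:

volumes:
  ollama-models:
```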

vova · 18 Apr 2025 09:07

For production cost savings on high-volume tasks: classify the task complexity first, route simple tasks to the local 3B model, complex ones to GPT-4. Our API bill dropped 60%.
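A sketch of that routing step, assuming a cheap heuristic classifier up front; the thresholds, keywords, and model names are illustrative, not what we actually use:

```php
<?php
// Route tasks by estimated complexity: local 3B model for simple work,
// remote GPT-4 for the rest.

function classify_complexity(string $task): string
{
    // Hypothetical heuristic: long prompts or multi-step instructions
    // count as "complex"; everything else is "simple".
    $wordCount = str_word_count($task);
    $multiStep = (bool) preg_match('/\b(then|step|compare|analyze)\b/i', $task);
    return ($wordCount > 200 || $multiStep) ? 'complex' : 'simple';
}

function route_task(string $task): string
{
    return classify_complexity($task) === 'simple'
        ? 'llama3.2:3b'  // local, free, fine for background-style work
        : 'gpt-4';       // paid API, reserved for the hard cases
}
```

In practice the classifier itself can also be the local 3B model; a keyword heuristic is just the cheapest possible first cut.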

katedev · 18 Apr 2025 10:13

wait, so if two users hit the endpoint simultaneously they both just… wait in line? no parallel processing at all?

artem_ml · 18 Apr 2025 11:35

Yes, Ollama queues requests and processes them serially by default. You can set OLLAMA_NUM_PARALLEL to serve concurrent requests but it increases per-request memory usage proportionally.
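For reference, it's just an environment variable on the Ollama process; the value 2 here is an example, not a recommendation:

```yaml
# Compose fragment (illustrative): allow 2 in-flight requests per model.
# Memory use grows roughly in proportion to this number.
services:
  ollama:
    image: ollama/ollama
    environment:
      OLLAMA_NUM_PARALLEL: "2"
```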
