PHP for ML data preprocessing: viable, or always use Python?
We have an existing PHP infrastructure, and our ML pipeline needs data preprocessing (normalization, tokenization, feature extraction). I am making a case to the team for keeping it in PHP rather than spinning up a Python service.
Looking for arguments either way and practical experience.
For preprocessing that is mostly string manipulation, JSON parsing, and basic math: PHP is fine. For anything involving matrix operations, convolutions, or libraries that have no PHP equivalent: Python is the right tool. Do not fight the ecosystem.
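To make the "PHP is fine" side concrete, here is a minimal sketch of the string-and-basic-math kind of preprocessing in plain PHP. The function names (`tokenize`, `minMaxNormalize`, `bagOfWords`) and the tokenization rule (lowercase, drop punctuation, split on whitespace) are my own assumptions for illustration, not anything from a specific library.

```php
<?php
declare(strict_types=1);

/** Lowercase, replace punctuation with spaces, split on whitespace. */
function tokenize(string $text): array
{
    // Fall back to strtolower() if the mbstring extension is not installed.
    $lower = function_exists('mb_strtolower') ? mb_strtolower($text) : strtolower($text);
    // preg_replace() returns null on invalid UTF-8; treat that as empty input.
    $clean = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $lower) ?? '';
    return preg_split('/\s+/u', trim($clean), -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Min-max scale numbers into [0, 1]; a constant column maps to 0.0. */
function minMaxNormalize(array $values): array
{
    if ($values === []) {
        return [];
    }
    $min = min($values);
    $range = max($values) - $min;
    return array_map(
        static fn ($v): float => $range == 0 ? 0.0 : ($v - $min) / $range,
        $values
    );
}

/** Count how often each vocabulary term occurs in the token list. */
function bagOfWords(array $tokens, array $vocabulary): array
{
    $counts = array_count_values($tokens);
    return array_map(static fn ($term): int => $counts[$term] ?? 0, $vocabulary);
}
```

Something like `bagOfWords(tokenize($row['body']), $vocabulary)` per DB row (with `$row` and `$vocabulary` being whatever your app provides) covers a lot of simple feature extraction. Once you need anything matrix-shaped, this style stops scaling and the Python advice applies.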
The question is whether your preprocessing is tightly coupled to the rest of the PHP app. If it just takes DB records and outputs feature vectors, a Python script that reads from the same DB is clean and not a big operational addition.
PHP FFI can call into native C libraries, and indirectly into Python extensions, but it is fragile and hard to debug. It is not worth it unless you have a very specific bottleneck you cannot solve otherwise.
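For a sense of where the fragility comes from, here is a rough sketch of calling one C function through FFI. It assumes the FFI extension is loaded and that the C math library is available under the name `libm.so.6` (typical for glibc Linux, not guaranteed elsewhere). `nativeSqrt` is a made-up helper name, and it falls back to PHP's built-in `sqrt()` when any of that fails.

```php
<?php
declare(strict_types=1);

/**
 * Square root via libm through FFI when possible, else PHP's own sqrt().
 * Assumption: the library is named 'libm.so.6' on this system.
 */
function nativeSqrt(float $x): float
{
    if (extension_loaded('ffi')) {
        try {
            // The C signature is declared by hand; a wrong declaration is
            // not caught here and can crash the process at call time.
            $libm = FFI::cdef('double sqrt(double x);', 'libm.so.6');
            return $libm->sqrt($x);
        } catch (Throwable $e) {
            // Library not found or FFI disabled: fall through to plain PHP.
        }
    }
    return sqrt($x);
}
```

Note how much has to line up: the extension, the `ffi.enable` setting (by default FFI is only usable from the CLI and preloaded scripts), the exact library filename, and a hand-written C declaration. That is for a one-argument function, and real numeric libraries are a lot more involved.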
We keep ingestion and chunking in PHP (DB access, file reading, text splitting). Embedding generation goes to the OpenAI API. Vector storage and retrieval use the Qdrant client. There is no Python anywhere in the pipeline. Whether this works for you depends on whether you use external APIs or run your own models.
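The text splitting step is the easy part to show. Below is a minimal sketch of word-based chunking with overlap. The function name `chunkWords`, the default sizes, and the choice to split on whitespace are assumptions for illustration, not the setup described above.

```php
<?php
declare(strict_types=1);

/**
 * Split text into chunks of at most $size words, where consecutive chunks
 * share $overlap words. Returns a list of strings.
 */
function chunkWords(string $text, int $size = 200, int $overlap = 20): array
{
    if ($size < 1 || $overlap < 0 || $overlap >= $size) {
        throw new InvalidArgumentException('Require size >= 1 and 0 <= overlap < size.');
    }

    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $count = count($words);
    $step = $size - $overlap; // how far the window advances each time
    $chunks = [];

    for ($start = 0; $start < $count; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $size));
        if ($start + $size >= $count) {
            break; // this window already reached the end of the text
        }
    }

    return $chunks;
}
```

Each chunk would then go to whatever embedding endpoint you use over plain HTTP, so nothing in this step needs Python.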
If you eventually need to run local models for cost or privacy reasons, Python becomes harder to avoid. Starting with hosted APIs means you can stay in PHP longer before hitting that wall.
```php blocks are runnable.