LLM context window management in PHP applications
Building a chat feature where users have multi-turn conversations with an LLM. The challenge is keeping conversation history within the context window without losing important early messages.
How are people handling this in PHP? Interested in both technical strategies and any PHP-specific considerations.
Simplest approach: keep the last N messages where N is tuned so total tokens stay under the limit. Use tiktoken-php or a rough heuristic (chars/4) for token counting. Not perfect but works for most chat use cases.
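A minimal sketch of that strategy, assuming messages are arrays with `role` and `content` keys (the `estimateTokens` and `trimToBudget` helper names are illustrative, not from any library):

```php
<?php
// Rough chars/4 heuristic for English text, as mentioned above.
function estimateTokens(string $text): int
{
    return (int) ceil(strlen($text) / 4);
}

// Keep as many of the most recent messages as fit in the token budget.
function trimToBudget(array $messages, int $budget): array
{
    $kept = [];
    $used = 0;
    // Walk from the newest message backwards, stopping when the budget
    // would be exceeded.
    foreach (array_reverse($messages) as $message) {
        $cost = estimateTokens($message['content']);
        if ($used + $cost > $budget) {
            break;
        }
        $kept[] = $message;
        $used += $cost;
    }
    return array_reverse($kept); // restore chronological order
}
```

Breaking (rather than skipping and continuing) keeps the window contiguous, so the model never sees a conversation with holes in the middle.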
Better approach: sliding window with the system prompt pinned at the beginning. Always include the first user message and last assistant response. Summarize older turns with a separate LLM call when the window fills up.
We store messages with their token count in the DB (estimated at insert time). When building the context, select messages from the end up to the budget. The system prompt token count is subtracted from the budget first.
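A sketch of that pattern with PDO, assuming a `messages` table with `conversation_id`, `role`, `content`, `token_count`, and `created_at` columns (all names illustrative). Selecting newest-first lets the loop stop as soon as the budget runs out instead of loading the whole conversation:

```php
<?php
function buildContext(PDO $db, int $conversationId, string $systemPrompt, int $windowTokens): array
{
    // Subtract the system prompt's (estimated) token count from the
    // budget first, as described above.
    $budget = $windowTokens - (int) ceil(strlen($systemPrompt) / 4);

    $stmt = $db->prepare(
        'SELECT role, content, token_count
           FROM messages
          WHERE conversation_id = :id
          ORDER BY created_at DESC'
    );
    $stmt->execute(['id' => $conversationId]);

    $selected = [];
    while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        if ($row['token_count'] > $budget) {
            break; // oldest messages that no longer fit are dropped
        }
        $budget -= $row['token_count'];
        $selected[] = ['role' => $row['role'], 'content' => $row['content']];
    }

    // Rows arrived newest-first; restore chronological order and
    // prepend the system prompt.
    return array_merge(
        [['role' => 'system', 'content' => $systemPrompt]],
        array_reverse($selected)
    );
}
```

Since the stored counts are estimates made at insert time, it's worth leaving some headroom in `$windowTokens` rather than budgeting right up to the model's limit.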
Large context windows (128k for GPT-4 Turbo, 200k for Claude 3) make this less urgent for most use cases. The bigger problem at scale is cost: you end up either charging users based on conversation length or truncating aggressively.
Token counting in PHP: the tiktoken-php library is the most accurate option; for OpenAI models it gives exact counts. For rough estimates, string length divided by 3.5–4 is close enough for English text (3.5 is the more conservative divisor; 4 is the commonly cited rule of thumb).
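For the rough-estimate path, a one-liner helper is enough (the function name and default divisor are my choices, not from tiktoken-php):

```php
<?php
// Rough token estimator for when pulling in tiktoken-php is overkill.
// The chars-per-token ratio is a heuristic for English text; real
// counts vary by tokenizer, and code or non-English text usually
// tokenizes less efficiently (more tokens per character).
function estimateTokens(string $text, float $charsPerToken = 4.0): int
{
    // mb_strlen so multibyte characters count once, not per byte.
    return (int) ceil(mb_strlen($text) / $charsPerToken);
}
```

Passing `3.5` as the second argument gives the more conservative estimate, which errs toward under-filling the window rather than overflowing it.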