What is Latency?
Latency — The time delay between a user submitting a prompt and the AI model returning a response.
Typical cloud LLM responses take 500 ms to 5 s, depending on model size and output length. For real-time applications like chatbots, keeping latency under roughly 2 seconds is critical for a responsive user experience.
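As a rough illustration, end-to-end latency is simply the wall-clock time around a request. The `call_model` function below is a stand-in stub, not a real API client; the timing pattern is what matters:

```python
import time

def call_model(prompt):
    """Stub standing in for a real LLM API call (assumption for this sketch)."""
    time.sleep(0.05)  # simulate network round-trip plus inference delay
    return "response"

def timed_call(prompt):
    """Return the model response and the end-to-end latency in seconds."""
    start = time.perf_counter()
    response = call_model(prompt)
    latency = time.perf_counter() - start
    return response, latency

response, latency = timed_call("Hello")
print(f"latency: {latency * 1000:.0f} ms")
```

In a real system you would record this per request and track percentiles (p50, p95, p99), since averages hide the slow tail that users actually notice.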
Frequently Asked Questions
What causes AI latency?
Model size, input and output length, server load, network distance, and whether responses are streamed or returned in one batch. Larger models generating longer outputs take more time.
How do I reduce AI latency?
Use smaller models, enable streaming responses, deploy models closer to users (edge), use caching for common queries, and optimize prompt length to reduce token processing.
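Of the techniques above, caching is the simplest to sketch: repeated queries are served from memory, skipping the model call entirely. This is a minimal in-memory sketch; the `call_model` stub and the normalization rule are illustrative assumptions:

```python
import time

def call_model(prompt):
    """Stub for a slow LLM call (assumption for this sketch)."""
    time.sleep(0.05)  # simulate model latency
    return f"answer to: {prompt.strip()}"

cache = {}

def cached_call(prompt):
    """Serve repeated prompts from memory instead of re-querying the model."""
    key = prompt.strip().lower()  # normalize so trivial variants hit the cache
    if key not in cache:
        cache[key] = call_model(prompt)
    return cache[key]

t0 = time.perf_counter()
first = cached_call("What is latency?")
cold = time.perf_counter() - t0  # cache miss: pays full model latency

t0 = time.perf_counter()
second = cached_call("what is latency? ")
warm = time.perf_counter() - t0  # cache hit: near-instant
```

Production caches add expiry and size limits, and semantic caches match paraphrased queries by embedding similarity rather than exact strings.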
What is time-to-first-token?
The time until the first token of the response arrives. With streaming enabled, users see output progressively rather than waiting for the complete response, which improves perceived speed even when total generation time is unchanged.
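The gap between time-to-first-token and total latency can be shown with a simulated streaming generator; the token list and per-token delay below are made up for the example:

```python
import time

def stream_tokens():
    """Simulated streaming response: yields tokens with an artificial delay."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.02)  # stand-in for per-token generation time
        yield token

start = time.perf_counter()
ttft = None
tokens = []
for token in stream_tokens():
    if ttft is None:
        ttft = time.perf_counter() - start  # time-to-first-token
    tokens.append(token)
total = time.perf_counter() - start

print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```

Here the first token appears after roughly one token delay while the full response takes four, which is why streaming feels faster even though total latency is identical.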