What is Inference?
Inference — The phase where a trained AI model processes new data to make predictions or generate outputs.
Every API call to ChatGPT is an inference request: the model receives a prompt and produces an output using its frozen, already-trained weights. Inference cost, latency, and throughput are the primary operational concerns for production AI systems.
Frequently Asked Questions
How is inference different from training?
Training updates the model's weights by processing massive datasets, often over days or weeks on large GPU clusters. Inference applies the frozen, trained weights to individual requests, typically in milliseconds to seconds.
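The split can be seen in a toy example (a hypothetical one-parameter linear model, not a real LLM): training is a one-off, expensive pass over the whole dataset, while inference reuses the learned weight cheaply per request.

```python
# Toy illustration of training vs. inference (hypothetical model, not a real LLM).
# Training: fit a one-parameter linear model y = w * x by least squares.
# Inference: reuse the frozen weight to score new inputs in constant time.

def train(xs, ys):
    """One-off, expensive phase: learn w from the whole dataset."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den  # closed-form least-squares weight

def infer(w, x):
    """Per-request, cheap phase: apply the frozen weight to new data."""
    return w * x

w = train([1, 2, 3, 4], [2, 4, 6, 8])  # training happens once
print(infer(w, 10))  # inference runs on every new request -> 20.0
```

Production systems follow the same shape at vastly larger scale: `train` runs once on a cluster; `infer` runs millions of times behind an API.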
What drives inference costs?
Model size, input (prompt) length, output length, and hardware. Larger models running on premium GPUs cost more per request, and most providers bill per token for both input and output. Optimization techniques such as quantization and batching reduce cost per request.
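Per-token billing makes cost easy to estimate up front. A minimal sketch, with placeholder prices (the rates below are assumptions for illustration, not any provider's actual pricing):

```python
# Back-of-envelope inference cost estimate.
# Prices are HYPOTHETICAL placeholders -- substitute your provider's real rates.
INPUT_PRICE_PER_1K = 0.0005   # assumed $ per 1K input (prompt) tokens
OUTPUT_PRICE_PER_1K = 0.0015  # assumed $ per 1K output (completion) tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one inference request: both prompt and completion are billed."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# A 2,000-token prompt with a 500-token response:
print(f"${request_cost(2000, 500):.5f}")  # -> $0.00175
```

Note that output tokens are typically priced higher than input tokens, so capping response length is often the quickest cost lever.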
What is inference latency?
The delay between sending a request and receiving a response. For streaming LLMs it is usually measured as time to first token (TTFT): the gap between submitting a prompt and the first token arriving. For real-time applications such as chat, sub-second TTFT is critical; batch processing can tolerate much higher latency.
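Time to first token can be measured by timing how long the first item of a streaming response takes to arrive. A sketch using a simulated stream (`fake_stream` is a stand-in for a real provider's streaming API, with invented delays):

```python
import time

def fake_stream():
    """Hypothetical stand-in for a streaming inference API."""
    time.sleep(0.05)  # simulated model work before the first token
    yield "Hello"
    for tok in [",", " world"]:
        time.sleep(0.01)  # simulated steady per-token decode time
        yield tok

start = time.perf_counter()
stream = fake_stream()
first = next(stream)                # block until the first token arrives
ttft = time.perf_counter() - start  # time to first token: what users feel
rest = "".join(stream)              # remaining tokens stream in afterward
print(f"TTFT: {ttft * 1000:.0f} ms, text: {first + rest!r}")
```

Streaming matters here: even when the full response takes seconds, a low TTFT keeps the interface feeling responsive because text starts appearing immediately.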