Ray Serve Autoscaling
Each Ray Serve deployment has one replica
by default. This means there is one worker process running the model
and serving requests. When traffic to your deployment increases, the
single replica can become overloaded. To maintain high performance of
your service, you need to scale out your deployment.
Manual Scaling
Before turning to autoscaling, which is more complex, consider the
simpler option: manual scaling. You can increase the number of
replicas by setting a higher value for num_replicas in the deployment
options and applying it through an in-place update. By default, num_replicas
is 1. Increasing the number of replicas scales your deployment out
horizontally, which improves latency and throughput under higher levels
of traffic.
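As a minimal sketch of manual scaling, the snippet below keeps the deployment options in a plain dictionary and applies them to a deployment. The Echo class, the MANUAL_OPTIONS name, and the build_app helper are hypothetical and exist only for illustration; the Ray import is deferred so the snippet loads even where Ray is not installed.

```python
# Manual scaling: a fixed replica count. num_replicas defaults to 1 when omitted.
MANUAL_OPTIONS = {"num_replicas": 3}


class Echo:
    """Hypothetical deployment class, used only for illustration."""

    def __call__(self, request) -> str:
        return "ok"


def build_app(options: dict):
    """Wrap Echo as a Serve deployment with the given options.

    Requires Ray Serve; the import is deferred so that this module can be
    loaded without Ray installed.
    """
    from ray import serve

    return serve.deployment(Echo).options(**options).bind()
```

With Ray Serve installed, `serve.run(build_app(MANUAL_OPTIONS))` should start three replicas of Echo. Raising the value in MANUAL_OPTIONS and deploying again is one way to scale out by hand.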
Autoscaling Basic Configuration
Instead of setting a fixed number of replicas for a deployment and
manually updating it, you can configure a deployment to autoscale based
on incoming traffic. The Serve autoscaler reacts to traffic spikes by
monitoring queue sizes and making scaling decisions to add or remove
replicas. Turn on autoscaling for a deployment by setting num_replicas="auto". You can further configure it by tuning the autoscaling_config in deployment options.
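The sketch below shows what such a configuration might look like, together with a deliberately simplified model of the scaling decision. The keys min_replicas, max_replicas, and target_ongoing_requests are assumed to match the autoscaling_config fields of recent Ray Serve releases (older releases used a different name for the target), and the values are arbitrary. The desired_replicas function is a hypothetical illustration, not Serve's implementation: the real autoscaler also applies delays and smoothing before acting.

```python
import math

# Deployment options that turn on autoscaling and bound the replica count.
# Assumed field names; check them against the Ray version you run.
AUTOSCALING_OPTIONS = {
    "num_replicas": "auto",
    "autoscaling_config": {
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 2,  # desired in-flight requests per replica
    },
}


def desired_replicas(total_ongoing_requests: int, config: dict) -> int:
    """Simplified model: enough replicas to hit the per-replica target,
    clamped to the configured minimum and maximum."""
    raw = math.ceil(total_ongoing_requests / config["target_ongoing_requests"])
    return max(config["min_replicas"], min(config["max_replicas"], raw))


config = AUTOSCALING_OPTIONS["autoscaling_config"]
for load in (0, 7, 100):
    print(load, "ongoing requests ->", desired_replicas(load, config), "replicas")
```

Under this model, the replica count follows the load up and down but never leaves the range set by min_replicas and max_replicas. The options dictionary can be passed to a deployment in the same way as a fixed num_replicas value.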
https://docs.ray.io/en/latest/serve/autoscaling-guide.html