About instance autoscaling in Cloud Run services | Cloud Run Documentation

In Cloud Run, each revision is automatically scaled to the number of instances needed to handle all incoming requests, events, or CPU utilization.

When a revision does not receive any traffic, by default it is scaled in to zero instances. However, if desired, you can change this default to specify an instance to be kept idle or "warm" using the minimum instances setting. If you are using CPU outside of requests, you should set minimum instances equal to 1.

In addition to the rate of incoming requests, events, or CPU utilization, the number of instances scheduled is impacted by:

The average CPU utilization of existing instances over a one-minute window, with the aim of keeping scheduled instances at 60% CPU utilization.
The current request concurrency, compared to the maximum concurrency over a one-minute window.
The maximum number of instances setting
The minimum number of instances setting

The Cloud Run autoscaler evaluates these every 5 seconds.
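
The exact algorithm is not described here, so the following is only an illustrative sketch of how the signals listed above could combine into an instance count. The function name, its parameters, and the formula are assumptions made for illustration; only the 60% CPU target and the minimum and maximum bounds come from the text.

```python
import math

CPU_TARGET = 0.60  # target average CPU utilization stated above


def estimate_instances(current_instances, avg_cpu_utilization,
                       in_flight_requests, max_concurrency,
                       min_instances=0, max_instances=100):
    """Illustrative estimate only; not Cloud Run's actual algorithm."""
    # Instances needed to bring average CPU back toward the 60% target.
    by_cpu = math.ceil(current_instances * avg_cpu_utilization / CPU_TARGET)
    # Instances needed so that no instance exceeds its maximum concurrency.
    by_concurrency = math.ceil(in_flight_requests / max_concurrency)
    wanted = max(by_cpu, by_concurrency)
    # The minimum and maximum instances settings bound the result.
    return max(min_instances, min(wanted, max_instances))
```

For example, four instances averaging 80% CPU would suggest six instances under this model, unless the maximum instances setting is lower.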

CPU always allocated and autoscaling

If you configure your Cloud Run service to have CPU always allocated, you should be aware of how it scales to and from zero.

CPU always allocated scaling from zero. Scaling from zero can only be triggered by a request, so a service that is not processing requests cannot scale from zero. For these workloads, you can either set minimum instances to a value greater than 0, or include a "wake-up request" in your design to restart processing after scaling to zero.

CPU always allocated scaling to zero. Given that no instance is ever at 0% CPU, looking at all CPU usage would result in never scaling to zero. This means the decision to scale from one to zero can only be made by checking to see if the instance is processing a request.
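
A "wake-up request" can be as simple as an HTTP GET sent to the service on a schedule. The sketch below uses only the Python standard library. The service URL and the `/wake` path are placeholders, not names from Cloud Run, and authentication is omitted, so treat it as a starting point rather than a complete client.

```python
import urllib.request


def build_wake_up_request(service_url, path="/wake"):
    """Build a GET request that can trigger a scale-out from zero."""
    # service_url and the /wake path are placeholders for your own service.
    return urllib.request.Request(service_url.rstrip("/") + path, method="GET")


def send_wake_up(service_url, timeout_seconds=30, opener=urllib.request.urlopen):
    """Send the wake-up request and return the HTTP status code."""
    request = build_wake_up_request(service_url)
    # opener is injectable so the function can be exercised without a network.
    with opener(request, timeout=timeout_seconds) as response:
        return response.status
```

The timeout is deliberately generous because the first request after scaling to zero also waits for an instance to start.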

About maximum instances

In some cases you may want to limit the total number of instances that can be started, for cost control reasons, or for better compatibility with other resources used by your service. For example, your Cloud Run service might interact with a database that can only handle a certain number of concurrent open connections.

You can use the maximum instances setting to limit the total number of instances that can be started in parallel, as documented in Setting a maximum number of instances.
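
For the database case, one way to pick a value is to divide the connection limit by the number of connections each instance may open. The helper below is a hypothetical sketch; the parameter names and the idea of reserving some connections for other clients are assumptions, not part of Cloud Run.

```python
def max_instances_for_database(db_connection_limit, connections_per_instance,
                               reserved_connections=0):
    """Largest instance count that keeps total connections within the limit."""
    if connections_per_instance <= 0:
        raise ValueError("connections_per_instance must be positive")
    # Leave room for other clients (admin tools, batch jobs) if needed.
    usable = db_connection_limit - reserved_connections
    return max(usable // connections_per_instance, 0)
```

With a limit of 100 connections, 10 of them reserved, and 5 connections per instance, this gives a maximum of 18 instances.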

Exceeding maximum instances

Under normal circumstances, your revision scales out by creating new instances to handle incoming traffic load. But when you set a maximum instances limit, in some scenarios there will be insufficient instances to meet that traffic load. In that case, incoming requests are queued (pending) as follows:

If new instances are starting up, such as during a scale-out, requests will pend for at least the average startup time of container instances of this service. This includes when the request initiates a scale-out, such as when scaling from zero.
If the startup time is less than 10 seconds, requests will pend for up to 10 seconds.
If there are no instances in the process of starting, and the request does not initiate a scale-out, requests will pend for up to 10 seconds.

During this time window, if an instance finishes processing requests, it becomes available to process the queued pending requests. If no instances become available during the window, the request fails with a 429 error code.
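
Callers of a service with a maximum instances limit may want to retry when they receive a 429. The sketch below shows one common approach, exponential backoff with jitter. It is not something this document prescribes, and `send` is a hypothetical zero-argument callable that performs the request and returns the HTTP status code.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out."""
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    # Every attempt returned 429; hand the last status back to the caller.
    return status
```

Retries add load of their own, so keeping `max_attempts` small is usually the safer choice when the service is already at its limit.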

Scaling guarantees

The maximum instances limit is an upper limit per revision and it means that the number of instances for this revision shouldn't exceed the maximum.

Under normal circumstances, Cloud Run is able to scale out to the maximum instances limit very fast to handle all incoming requests or events. However, setting a high limit does not mean that your revision will be able to scale out to the specified number of instances at any given moment. In exceptional circumstances, Cloud Run can throttle scaling to ensure good service for all customers.

Exceeding maximum instances due to traffic spikes

In some cases, such as rapid traffic surges or system maintenance, Cloud Run might, for a short period of time, create more instances than are specified in the maximum instances setting. New instances can be started in excess of the maximum instances setting to replace existing instances and to provide a grace period for inflight requests to finish processing.

The maximum instance limit can be exceeded under normal operation a few times per week. The grace period usually lasts up to 15 minutes, or up to the value specified in the request timeout setting. These extra instances are destroyed within 15 minutes after they become idle.

If many replacements are needed, the updates are usually spread out over many minutes or hours, but each replacement has an excess instance for just the grace period. Instances in excess of the maximum instance value are normally less than twice the configured maximum instances limit, but can be much larger for sudden large traffic spikes.

Load tests experience more instances exceeding the maximum instances setting because the system may change where traffic spikes are served to preserve capacity for existing workloads that have sustained load patterns.

If your service cannot tolerate this temporary behavior, you may want to factor in a safety margin and set a lower maximum instances value.
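
One way to choose that lower value is to start from the instance count your service can actually tolerate and divide by an assumed overshoot factor. The helper below is a hypothetical sketch. The default factor of 2 reflects one reading of the "normally less than twice" statement above, and it does not cover the larger overshoots that sudden spikes can cause.

```python
import math


def max_instances_with_margin(tolerable_instances, overshoot_factor=2.0):
    """Configured maximum that keeps typical overshoot within a tolerable count."""
    if overshoot_factor < 1:
        raise ValueError("overshoot_factor must be at least 1")
    configured = math.floor(tolerable_instances / overshoot_factor)
    # A maximum below 1 is not meaningful, so never return less than 1.
    return max(configured, 1)
```

If a backing resource tolerates at most 40 instances, this suggests configuring a maximum of 20.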

Traffic splits

Because the maximum instances limit is a limit for each revision, if the service splits traffic across multiple revisions, the total number of instances for the service can exceed the maximum instances per revision. This can be observed in the Instance Count metrics.

Deployments

When you deploy a new revision to serve 100% of the traffic, Cloud Run starts enough instances of the new revision before directing traffic to it. This reduces the impact of new revision deployments on request latencies, notably when serving high levels of traffic. Because the maximum instances limit is a limit for each revision, during a deployment, the total number of instances for the service can exceed the maximum instances per revision. This can be observed in the Instance Count metrics.

Idle instances and minimizing cold starts

Cloud Run does not immediately shut down instances once they have handled all requests. To minimize the impact of cold starts, Cloud Run may keep some instances idle for a maximum of 15 minutes. These instances are ready to handle requests in case of a sudden traffic spike.

For example, when an instance has finished handling requests, it may remain idle for a period of time in case another request needs to be handled. An idle instance may retain resources, such as open database connections. Note that CPU is only allocated during request processing unless you explicitly configure your service to have CPU always allocated.

To keep idle instances permanently available, use the min-instance setting. Note that using this feature will incur cost even when the service is not actively serving requests.

Autoscaling and pending requests

If new instances are starting up, such as during a scale-out, requests will pend for at least the average startup time of container instances of this service. This includes when the request initiates a scale-out, such as when scaling from zero.
If the startup time is less than 10 seconds, requests will pend for up to 10 seconds.
If there are no instances in the process of starting, and the request does not initiate a scale-out, requests will pend for up to 10 seconds.
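
The three rules above can be read as a single pending window: the longer of the average startup time and 10 seconds while instances are starting, and 10 seconds otherwise. The sketch below encodes that reading. The function and parameter names are hypothetical, and the interpretation itself is an assumption rather than a documented formula.

```python
DEFAULT_PENDING_SECONDS = 10.0  # the 10-second window stated above


def pending_window_seconds(instances_starting, average_startup_seconds):
    """Approximate how long a request may stay queued before it fails."""
    if instances_starting:
        # Wait for at least the average startup time, and never less than 10 s.
        return max(average_startup_seconds, DEFAULT_PENDING_SECONDS)
    # No instance is starting and the request does not trigger a scale-out.
    return DEFAULT_PENDING_SECONDS
```

A client-side timeout shorter than this window would give up on requests that the service could still have served.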

Autoscaling impact on backing services

As the number of instances automatically increases, your Cloud Run service might encounter limits with its backing services. For example, Cloud SQL has an API quota limit. Make sure these backing services have enough quota and can handle connections from all instances of your Cloud Run service. Consider setting a maximum number of instances to avoid overloading backing services.

Autoscaling and Pub/Sub

Google recommends using push subscriptions to consume messages from a Pub/Sub topic on Cloud Run. Pushed messages are received like HTTP requests by the container, thus triggering the same autoscaling behavior.
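
Because a pushed message arrives as an ordinary HTTP POST, the container only needs to parse the request body. The sketch below decodes the JSON envelope that Pub/Sub push delivery uses, with the payload base64-encoded in `message.data`. The function name is hypothetical and the HTTP server around it is left out.

```python
import base64
import json


def decode_push_message(body):
    """Return (message_id, payload_bytes, attributes) from a push request body."""
    envelope = json.loads(body)
    message = envelope["message"]
    # The payload is base64-encoded; a message may carry attributes only.
    payload = base64.b64decode(message.get("data", ""))
    return message.get("messageId"), payload, message.get("attributes", {})
```

The handler that calls this function would typically respond with a success status once the message has been processed, so that it is not delivered again.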

Autoscaling and multiple containers (sidecars)

Cloud Run considers the CPU utilization of instances for autoscaling, where the CPU utilization of an instance is the percentage of allocated CPU in use.

Note that you allocate CPU when you set CPU limits at the container level. If you use multiple containers per instance, the actual CPU allocation for that instance is the sum of the CPU limits you set on each container.
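
As a worked illustration of the definition above, instance utilization is the total CPU in use divided by the sum of the per-container CPU limits. The function and the container names in the example are hypothetical, and the inputs are assumed to be in the same unit (for example, vCPUs).

```python
def instance_cpu_utilization(cpu_used_by_container, cpu_limit_by_container):
    """Fraction of the instance's allocated CPU that is currently in use."""
    # Allocated CPU for the instance is the sum of each container's limit.
    allocated = sum(cpu_limit_by_container.values())
    if allocated <= 0:
        raise ValueError("total CPU limit must be positive")
    used = sum(cpu_used_by_container.get(name, 0.0)
               for name in cpu_limit_by_container)
    return used / allocated
```

An instance whose main container uses 1.0 of a 2.0 vCPU limit, with a sidecar using 0.5 of a 1.0 vCPU limit, is at 50% utilization under this definition.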

What's next

To manage the maximum number of instances of your Cloud Run services, see Setting a maximum number of instances.
To manage the maximum number of simultaneous requests handled by each instance, see Setting concurrency.
To optimize your concurrency setting, see development tips for tuning concurrency.
To specify an idle instance to keep running to minimize latency or cold starts on first requests, see Using min-instance to enable idle instances.

About instance autoscaling in Cloud Run services | Cloud Run Documentation | Google Cloud (2024)