Retaining Amazon SageMaker Instance Capacity with SageMaker Managed Warm Pools

An Alternative Solution to Cloud Instance Reservation

Feb 27, 2024

In previous posts (e.g., here and here), we covered some of the pros and cons of training ML workloads using Amazon SageMaker. In this post we address one of its more inconvenient limitations: its lack of support (as of the time of this writing) for training on reserved Amazon EC2 instances. This limitation has become more and more restrictive of late due to the increasing difficulty of acquiring the required instance types in a reliable and timely fashion. Recent advances in the field of generative AI have led to unprecedented demand for AI compute while challenges in the global supply chain continue to linger. In this post we propose a partial mitigation to this limitation using SageMaker managed warm pools. By using SageMaker managed warm pools you can, under certain circumstances, retain access to provisioned instance capacity for successive training workloads. Not only does this let you hold on to acquired capacity for as long as you need (up to four weeks), but it can also reduce the latency between experiments.

Training With Managed Warm Pools - Example

In the example below, we start up a PyTorch training job on a single instance of type ml.p5.48xlarge using the Amazon SageMaker Python SDK (version 2.208). We use the keep_alive_period_in_seconds control to configure the instance to remain warm for one minute (sixty seconds) after the job completes.

from sagemaker.pytorch import PyTorch

# define job
estimator = PyTorch(
    role='<sagemaker role>',
    entry_point='train.py',
    instance_type='ml.p5.48xlarge',
    instance_count=1,
    framework_version='2.0.1',
    py_version='py310',
    keep_alive_period_in_seconds=60 # keep warm for 1 minute
)

# start job
estimator.fit()

As long as we start up another job with matching settings within the sixty seconds allotted, the same instance will be retained for the next job. Thus, by configuring the use of SageMaker warm pools we have guaranteed instance capacity for our next workload. As an added bonus, the start-up time of the second workload will be noticeably reduced since the instance has already been provisioned.
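One way to confirm that the instance was in fact retained is to inspect the warm pool status of the job that just finished. The sketch below assumes a client that behaves like boto3.client("sagemaker"), whose describe_training_job response carries a WarmPoolStatus field for jobs configured with a keep-alive period. The helper names warm_pool_status and instance_is_retained, as well as the job name in the usage comment, are our own placeholders and not part of any SDK.

```python
from typing import Any, Optional


def warm_pool_status(sm_client: Any, job_name: str) -> Optional[str]:
    """Return the warm pool status of a training job, or None if it has none.

    sm_client is assumed to behave like boto3.client("sagemaker").
    The reported status is one of "Available", "InUse", "Reused"
    or "Terminated".
    """
    description = sm_client.describe_training_job(TrainingJobName=job_name)
    # Jobs that were not configured with a keep-alive period lack this field.
    return description.get("WarmPoolStatus", {}).get("Status")


def instance_is_retained(sm_client: Any, job_name: str) -> bool:
    """True while the job's instance(s) sit idle in the warm pool."""
    return warm_pool_status(sm_client, job_name) == "Available"


# Example usage (requires AWS credentials; the job name is a placeholder):
#   import boto3
#   sm_client = boto3.client("sagemaker")
#   print(instance_is_retained(sm_client, "<previous training job name>"))
```

A status of "Available" means the instance is idle and waiting for a matching job, while "Reused" indicates that a subsequent job has already picked it up.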

Limitations of Managed Warm Pools

Although this technique offers an instance capacity guarantee similar to the one provided by Amazon EC2 reservations (and without the long-term commitment!!), it is important to note its significant limitations.

The method relies on our ability to secure instance capacity for the first training job. Generally speaking, this is a safe assumption — sooner or later, we will succeed in securing an instance, but it is hard to know how much time and patience will be required.
The method assumes that our workloads have matching settings, particularly with regard to the number and types of instances. Although AI development teams will frequently have multiple workloads with similar instance requirements, the inability to share resources between jobs with different settings (e.g., with different SageMaker Roles) is limiting.
The method works only if the subsequent workload is started within the specified warm pool duration, with the maximum duration being one hour. Unless we want to constantly monitor our training jobs to detect when they stop, we will need to implement an automated system for submitting new jobs when our provisioned instances become available.
In cases where a matching training job is not found during the warm pool duration, we still need to pay for the provisioned instance. Thus, there is a certain risk of waste associated with this method, and the way it is used (e.g., the most appropriate warm pool duration setting) should be planned accordingly.
The maximum period of time during which an instance can be retained in this manner is twenty-eight days.
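The automation requirement noted above can be illustrated with a simple polling loop. This is only a minimal sketch under stated assumptions, not a production design: it assumes a client that behaves like boto3.client("sagemaker") and relies on the WarmPoolStatusEquals filter of list_training_jobs. The function names, the submit_next_job callback, and the polling defaults are hypothetical. It reads only the first page of results and does not verify that the submitted job's settings match the retained instance.

```python
import time
from typing import Any, Callable, List


def available_warm_pool_jobs(sm_client: Any) -> List[str]:
    """Names of finished jobs whose warm pool instances are currently idle.

    Only the first page of results is read (no pagination handling).
    """
    response = sm_client.list_training_jobs(WarmPoolStatusEquals="Available")
    return [
        summary["TrainingJobName"]
        for summary in response.get("TrainingJobSummaries", [])
    ]


def submit_when_available(
    sm_client: Any,
    submit_next_job: Callable[[], None],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll for an idle warm pool and submit the next job once one appears.

    submit_next_job is a hypothetical callback that should start a job
    whose settings match the retained instance (e.g., estimator.fit()).
    Returns True if a job was submitted, False if polling gave up.
    """
    for _ in range(max_polls):
        if available_warm_pool_jobs(sm_client):
            submit_next_job()
            return True
        sleep(poll_seconds)
    return False
```

For this to work, the polling interval must be shorter than the configured keep-alive period; otherwise the instance may be released between two polls.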

Please see the official documentation for more details on how warm pooling works as well as additional considerations associated with its use.

Reducing Cost with SageMaker Savings Plans

The method we have described for retaining control of instances is relevant for AI teams that have a consistent requirement for AI compute. This is manifested as a continuous backlog of training experiments waiting to be processed. In situations where this requirement is expected to last for an extended period of time, Amazon SageMaker Savings Plans may provide a great opportunity for training-cost savings. SageMaker Savings Plans offer significant discounts in exchange for a commitment to pay for consistent usage. The instance types offered under this plan can vary; please refer to the documentation for the most up-to-date details. Importantly, despite some similarities to Amazon EC2 reservations, SageMaker Savings Plans do not guarantee instance capacity. However, the method described in this post for retaining control of provisioned instance capacity can help you make the most of the instances you have committed to using.

SageMaker Savings Plans is not for everyone. Make sure to fully understand all the terms of the offering before deciding whether it is the right solution for your team.
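To get a rough feel for when such a commitment pays off, it can help to work through the arithmetic. The sketch below uses a deliberately simplified model that we are assuming for illustration: a single instance type, a flat fractional discount, and a commitment that is paid in full whether or not it is used. The function names and all numbers in the usage comment are hypothetical and are not actual AWS rates.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the committed hours that must actually be used
    for the commitment to cost no more than paying on demand.

    discount is the fractional discount off the on-demand rate
    (e.g., 0.25 for a hypothetical 25% discount).
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in the range [0, 1)")
    return 1.0 - discount


def net_savings(
    on_demand_hourly: float,
    discount: float,
    committed_hours: float,
    used_hours: float,
) -> float:
    """Savings relative to on-demand pricing (negative means a loss).

    Simplified model: the full commitment is paid regardless of usage,
    and used_hours may not exceed committed_hours.
    """
    if not 0.0 <= used_hours <= committed_hours:
        raise ValueError("used_hours must be between 0 and committed_hours")
    on_demand_cost = on_demand_hourly * used_hours
    plan_cost = on_demand_hourly * (1.0 - discount) * committed_hours
    return on_demand_cost - plan_cost


# Example with made-up numbers: a hypothetical rate of 100.0 per hour and
# a hypothetical 25% discount over 1000 committed hours.
#   break_even_utilization(0.25)             # 0.75
#   net_savings(100.0, 0.25, 1000.0, 500.0)  # negative: capacity sat unused
```

Under this model, the higher the utilization of the committed hours, the more the plan saves, which is where holding on to provisioned capacity between jobs can help.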

Summary

A common approach for dealing with the difficulty of acquiring AI/ML compute resources in the cloud is to guarantee capacity by purchasing instance reservations. Unfortunately, as of the time of this writing, Amazon SageMaker does not support instance reservation. In this post, we have demonstrated how SageMaker warm pools can be used to maintain control over instance capacity for successive training workloads. We noted that for this type of solution to be effective, we require some form of mechanism for automating the detection of available warm pool instances and triggering a job with matching settings. In a future post we will propose a solution that addresses this challenge.