Pros and Cons of Amazon SageMaker Asynchronous Inference

March 22, 2022
Ali Yazdizadeh Kharrazi

What is SageMaker Asynchronous Inference?

Introduced in August 2021, Asynchronous Inference is a machine learning model deployment option on SageMaker. Instead of processing incoming requests in real time, Asynchronous Inference queues them and processes them asynchronously. Decoupling request submission from the response can be very helpful when you have a large payload size and/or a long processing time. Asynchronous Inference, along with Real-Time and Batch Transform, is one of the available inference options on SageMaker. The table below shows how these options differ in their time and payload limits:

                             Real-Time       Asynchronous   Batch Transform
Max Allowed Processing Time  60 seconds      15 minutes     Unlimited
Suitable Payload Size        Less than 6 MB  Up to 1 GB     Unlimited

Video version

The video version of this post is available on our YouTube channel:

What is the architecture compared to the other options?

SageMaker Asynchronous Inference uses an architecture similar to a conventional Real-Time endpoint; the main difference is the way requests are processed. A Real-Time endpoint accepts a direct POST request and returns the model's prediction in the response, so the POST request's latency depends on the model's processing time.

To resolve this issue, Asynchronous Inference uses an internal queue to process requests one by one. Instead of returning the result in the POST response, it immediately returns an S3 path; the result is stored at that path once processing has completed.
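Under the hood, submitting a request is a single API call: `invoke_endpoint_async` takes a pointer to a payload already uploaded to S3 and immediately returns the output location. A minimal sketch with boto3 (the endpoint name and S3 URIs are illustrative):

```python
def submit_async_request(endpoint_name, input_s3_uri):
    """Queue one prediction and return the S3 URI where the result
    will appear; the call returns before the model has run."""
    import boto3  # local import so the sketch loads without boto3 installed
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,  # the payload must already be in S3
        ContentType="text/csv",
    )
    return response["OutputLocation"]


def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key) for later polling."""
    bucket, _, key = uri.removeprefix("s3://").partition("/")
    return bucket, key
```

Note that the response contains only the future output path, so the caller needs some way to fetch the result later, which is where `parse_s3_uri` comes in handy.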

The image below shows the Async Inference architecture [1]:

[Figure: different inference options]

When should you use Asynchronous Inference?

When you deploy a machine learning model, you usually interact with it through an API (typically a REST API): you send the prediction payload in a POST request and wait for the response. With large models (e.g. Transformers) or large payloads (e.g. HD images or videos), this latency can be high. A REST call is normally expected to finish quickly, and if you are working with services like API Gateway, there is a 30-second hard limit for an API call to return a response; exceed it and API Gateway returns a 504 Gateway Timeout error. So, as a rule of thumb: if the maximum allowed latency for your API call is shorter than your model's processing time plus the payload processing time, you should consider Asynchronous Inference.
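The rule of thumb above is simple enough to write down as code; a toy sketch (all the numbers are illustrative):

```python
def should_use_async(max_api_latency_s, model_latency_s, payload_overhead_s):
    """Rule of thumb from this post: if the model's processing time plus the
    payload transfer/processing overhead exceeds the strictest latency limit
    on the calling path (e.g. API Gateway's 30 s hard limit), prefer
    Asynchronous Inference over a Real-Time endpoint."""
    return model_latency_s + payload_overhead_s > max_api_latency_s


# A 45 s Transformer forward pass behind API Gateway's 30 s limit:
should_use_async(30, 45, 2)      # -> True, go asynchronous
# A 200 ms XGBoost prediction fits comfortably in real time:
should_use_async(30, 0.2, 0.1)   # -> False, Real-Time is fine
```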

Case study

In this part we will train an XGBoost model and deploy it using Asynchronous Inference. At the end, we will compare it with a normal Real-Time endpoint.

How to deploy an Asynchronous endpoint?

To see Asynchronous Inference in action, we trained and deployed an XGBoost model using the SageMaker XGBoost Algorithm. The full notebook for training and deploying the model is available in our GitHub Repository.

The following code creates an XGBoost model and uses the deploy method with the async_inference_config option to deploy an async endpoint.

from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.serializers import CSVSerializer

# Create an AsyncInferenceConfig object pointing at the S3 output location
async_config = AsyncInferenceConfig(output_path=f"s3://{bucket}/{prefix}/output")

# Deploy the model as an asynchronous endpoint
xgb_async_predictor = xgb.deploy(
    initial_instance_count=1,              # number of instances
    instance_type='ml.m4.xlarge',          # instance type
    serializer=CSVSerializer(),            # serialize input rows to CSV
    async_inference_config=async_config,   # makes the endpoint asynchronous
)

Here we deployed the model using the SageMaker Python SDK; this can also be done with the AWS SDK for Python (boto3). For that approach, check out the official example.
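When deployed this way, the SDK returns an `AsyncPredictor`, so submitting a request and waiting for the S3 result takes only a few lines. A hedged sketch (the input path and waiter settings are illustrative, and `async_predictor` stands for the object returned by `deploy`):

```python
def predict_and_wait(async_predictor, input_s3_path, max_wait_s=600):
    """Submit a request via the SageMaker SDK's AsyncPredictor and block
    until the result object appears in S3."""
    from sagemaker.async_inference import WaiterConfig  # needs the sagemaker SDK
    response = async_predictor.predict_async(input_path=input_s3_path)
    # get_result polls the output path until the object exists (or gives up)
    waiter = WaiterConfig(max_attempts=max_wait_s // 15, delay=15)
    return response.get_result(waiter)
```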

What is the latency, and how much time does each step take?

After the deployment is finished, you can see the logs of this endpoint on the CloudWatch page, under Log Groups, with the name /aws/sagemaker/Endpoints/<YOUR_ENDPOINT_NAME>.

The diagram below shows the steps of a typical SageMaker asynchronous prediction, along with example times for a small model (the SageMaker XGBoost Algorithm, model artifact ~1 MB) and a large model (GPT-J by Hugging Face, model artifact ~22.5 GB).

[Figure: latency details]

Note that in a Real-Time endpoint, the latency consists almost entirely of the Model Latency part of the diagram above; the other parts are latency added by the asynchronous architecture.
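Because the POST response only carries the output path, the caller has to check S3 itself (or subscribe to the optional SNS success/error topics that AsyncInferenceConfig supports). A minimal polling sketch, assuming a boto3-style S3 client; the function works with any object exposing `head_object`/`get_object`, which also makes it easy to test locally:

```python
import time


def wait_for_result(s3_client, bucket, key, timeout_s=900, poll_s=5):
    """Poll S3 until the async endpoint has written the result object,
    then return its bytes. Raises TimeoutError if it never appears."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            s3_client.head_object(Bucket=bucket, Key=key)
        except Exception:          # boto3 raises ClientError on a 404
            time.sleep(poll_s)     # not there yet; wait and retry
            continue
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
        return body.read()
    raise TimeoutError(f"no result at s3://{bucket}/{key} after {timeout_s}s")
```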


Conclusion

When it comes to inference types, SageMaker gives you many options, and you need to know them to choose wisely. In this post, we described how SageMaker Asynchronous Inference works and how to deploy it. You can find the full code in our GitHub Repository.

Finally, here we summarize the pros and cons of this service, as well as the scenarios in which you should consider it.


Pros

  • It is perfect when your API call has a latency limit (like API Gateway's 30-second limit)
  • It suits large models with long processing times
  • It suits large request payloads that are already located on S3
  • It uses the same pricing as Real-Time inference, with no extra cost
  • The instance count can scale down to zero when there are no requests. See the official example for more information.
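The scale-to-zero behavior is not automatic: you attach an autoscaling policy that tracks the queue backlog. A hedged sketch of the pattern from the official example, with every name (endpoint, variant, capacities, target value) illustrative:

```python
def scaling_resource_id(endpoint_name, variant="AllTraffic"):
    """Resource-id format Application Auto Scaling expects for endpoint variants."""
    return f"endpoint/{endpoint_name}/variant/{variant}"


def attach_scale_to_zero_policy(endpoint_name, max_instances=5):
    """Register the endpoint variant with Application Auto Scaling and track
    the queue backlog per instance; MinCapacity=0 lets the endpoint scale
    all the way down when the queue is empty."""
    import boto3  # local import so the sketch loads without boto3 installed
    client = boto3.client("application-autoscaling")
    resource_id = scaling_resource_id(endpoint_name)
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=0,                 # allow scaling down to zero instances
        MaxCapacity=max_instances,
    )
    client.put_scaling_policy(
        PolicyName="backlog-per-instance",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 5.0,        # desired queued requests per instance
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 300,    # wait before removing instances
            "ScaleOutCooldown": 60,    # react quickly to a growing queue
        },
    )
```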


Cons

  • For many scenarios where you don't need an immediate response, you can simply use Batch Transform (perhaps scheduled to run periodically) and avoid the complexity of running an async endpoint and configuring an autoscaling policy.
  • Even with an autoscaling policy, scaling latency means there are times when no requests are coming in but an instance is still running, and you pay for it. If you want to pay only for what you use, consider SageMaker Serverless Inference, which we discussed in our previous post.
  • Unlike a Real-Time endpoint, a REST API tool (e.g. curl, Postman) is not enough to send a request and get predictions; you need a companion piece of code that retrieves the results from S3 once they are ready.

Figure Reference:

