Amazon SageMaker is AWS's end-to-end machine learning platform. It provides the tools to label data, build and train models, tune hyperparameters, deploy to managed endpoints, and monitor models in production — all without provisioning or managing the underlying GPU/CPU infrastructure directly.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role that lets SageMaker pull the model artifact from S3 and run the endpoint
role = sagemaker.get_execution_role()

# Point at a packaged model artifact and pin the Hugging Face DLC versions
model = HuggingFaceModel(
    model_data="s3://my-bucket/models/distilbert.tar.gz",
    role=role,
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

# Deploy to a managed real-time endpoint on a single GPU instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="distilbert-sentiment",
)

print(predictor.predict({"inputs": "SageMaker makes model deployment straightforward."}))
Bedrock is the managed-API path for consuming foundation models; SageMaker is the full ML platform for teams that need to train, host, and operate their own models. Many production architectures combine both — Bedrock for generic text/embedding tasks and SageMaker for custom models and specialized inference.
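As a rough sketch of that split (the Bedrock model choice, endpoint name, and payload shapes are assumptions for illustration), the same application can call both services through boto3:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")
smr = boto3.client("sagemaker-runtime")

# Generic embedding task handled by a Bedrock foundation model (assumed model ID)
resp = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    body=json.dumps({"inputText": "SageMaker and Bedrock are complementary."}),
)
embedding = json.loads(resp["body"].read())["embedding"]

# Custom model served from the SageMaker endpoint deployed above
resp = smr.invoke_endpoint(
    EndpointName="distilbert-sentiment",
    ContentType="application/json",
    Body=json.dumps({"inputs": "The checkout flow feels much faster now."}),
)
print(len(embedding), json.loads(resp["Body"].read()))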
Studio is the web-based IDE — notebooks, experiment tracking, model registry UI, pipeline visualization — where data scientists do interactive work. Endpoints are managed inference infrastructure (real-time, serverless, async, batch transform) that hosts trained models behind an HTTPS API. Pipelines are SageMaker's native CI/CD DAGs — versioned, parameterized, reproducible workflows that wire preprocessing, training, evaluation, conditional deploy, and model-registry steps. Studio is where work happens; Pipelines is how it gets shipped; Endpoints is how it gets served.
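A minimal Pipelines definition might look like the sketch below; the training script, bucket paths, and parameter defaults are assumptions, and a real pipeline would add processing, evaluation, and conditional model-registration steps:

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.pipeline import Pipeline

role = sagemaker.get_execution_role()

# Pipeline parameter so each run can point at a different training dataset
train_data = ParameterString(
    name="TrainData", default_value="s3://my-bucket/data/train/"  # assumed path
)

# Training job definition (script and versions are illustrative)
estimator = PyTorch(
    entry_point="train.py",
    role=role,
    instance_type="ml.g5.xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
)

step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data=train_data)},
)

pipeline = Pipeline(
    name="sentiment-training-pipeline",
    parameters=[train_data],
    steps=[step_train],
)
pipeline.upsert(role_arn=role)  # create or update the versioned pipeline definition
execution = pipeline.start()    # kick off one parameterized run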
Real-time endpoints keep instances always-on with sub-100ms latency and auto-scaling — best for interactive APIs. Serverless inference scales to zero, cold-starts in seconds, and bills per invocation — best for spiky low-volume workloads where idle cost dominates. Async inference queues requests and writes results to S3, supporting payloads up to 1 GB and processing times measured in minutes — best for large documents or video. Batch transform spins up an ephemeral cluster, scores a whole S3 dataset, and shuts down — best for nightly scoring jobs. Pick by traffic shape and payload size, not by feature familiarity.
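For instance, reusing the model object from the opening example, the same artifact can be deployed serverless or async just by swapping the deploy-time config; the memory size, concurrency, and S3 output path below are assumptions, and each call creates a separate endpoint:

from sagemaker.serverless import ServerlessInferenceConfig
from sagemaker.async_inference import AsyncInferenceConfig

# Serverless: scales to zero, billed per invocation (sizes are illustrative)
serverless_predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=4096,
        max_concurrency=5,
    ),
)

# Async: requests are queued and results land in S3 (output path is an assumption)
async_predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-results/",
    ),
)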
The Model Registry is a versioned catalog of trained models with approval status (PendingManualApproval / Approved / Rejected), lineage back to the training job and source data, and metadata for metrics. Pipelines write candidate models to the registry; an approval step (manual or automated based on metrics) flips status to Approved; a deployment Lambda or pipeline picks up Approved models and ships them to an endpoint. It gives you auditable model promotion analogous to artifact promotion in software CI/CD.
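A sketch of that promotion flow with boto3 (the model package group name is an assumption, and the approval here is manual where a pipeline step could apply a metric threshold instead):

import boto3

sm = boto3.client("sagemaker")
group = "sentiment-models"  # assumed model package group name

# List candidate versions awaiting review
pending = sm.list_model_packages(
    ModelPackageGroupName=group,
    ModelApprovalStatus="PendingManualApproval",
)["ModelPackageSummaryList"]

# Flip the newest candidate to Approved
if pending:
    sm.update_model_package(
        ModelPackageArn=pending[0]["ModelPackageArn"],
        ModelApprovalStatus="Approved",
    )

# Deployment automation then picks up the latest Approved version
approved = sm.list_model_packages(
    ModelPackageGroupName=group,
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
)["ModelPackageSummaryList"]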
Match instance to workload: ml.m5/ml.c5 for classical ML and small neural networks; ml.g5 (A10G) for fine-tuning small and mid-size LLMs and most computer vision; ml.p4d/ml.p5 (A100/H100) for large-model pretraining or distributed training; ml.trn1 (Trainium) for cost-optimized large-scale training when the framework is supported. Always start with a single small instance to validate the training script, then scale to multi-GPU with SageMaker Distributed Data Parallel or Model Parallel. Use managed Spot training (with checkpoints to S3) for up to 90% savings on long jobs.
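Spot training is a pair of estimator flags plus a checkpoint location; a minimal sketch with assumed script and bucket paths:

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

estimator = PyTorch(
    entry_point="train.py",          # script must save and restore its own checkpoints
    role=role,
    instance_type="ml.g5.xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    use_spot_instances=True,         # run on spare capacity at a discount
    max_run=3600 * 4,                # max training time in seconds
    max_wait=3600 * 8,               # max total time, including waits for spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # assumed path; survives interruptions
)
estimator.fit({"train": "s3://my-bucket/data/train/"})  # assumed dataset path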
Multi-Model Endpoints (MME) load model artifacts from S3 on demand into a single endpoint, sharing the underlying instance — best when you have many low-traffic models (per-customer or per-region) and don't want to pay for an idle endpoint each. Multi-Container Endpoints host different containers behind one endpoint and route by header — useful for an A/B between two frameworks or a preprocessing + inference pipeline in one hop. Inference Pipelines chain containers serially in a single request. Pick MME for many models, MCE for heterogeneous serving, Pipelines for transform-then-predict.
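An MME sketch with the SDK's MultiDataModel, assuming the serving container supports multi-model hosting and using illustrative names for the S3 prefix and per-model artifacts:

from sagemaker.multidatamodel import MultiDataModel

# All model artifacts live under one S3 prefix and are loaded lazily on first request
mme = MultiDataModel(
    name="per-customer-sentiment",
    model_data_prefix="s3://my-bucket/mme-models/",  # assumed prefix
    model=model,  # reuses the container/role config from the earlier HuggingFaceModel
)

predictor = mme.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Route each request to a specific artifact under the prefix
print(predictor.predict(
    {"inputs": "Great support experience."},
    target_model="customer-a.tar.gz",  # assumed artifact name
))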
Use SageMaker deployment guardrails — blue/green or canary on the endpoint with CloudWatch alarms as the rollback trigger. The pipeline registers a new model version, an approval step verifies offline metrics, then EndpointConfig updates with a canary slice (e.g., 10% traffic for 15 minutes) while alarms watch latency, error rate, and a custom data-quality metric from Model Monitor. If any alarm trips, SageMaker auto-rolls back to the previous EndpointConfig; on success, traffic shifts to 100%. Pair with shadow testing for risky model changes — replicate live traffic to the new variant without affecting users and compare predictions offline.
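In boto3 terms the guardrail lives in the DeploymentConfig passed when the endpoint is updated to a new config; the config name, canary size, wait time, and alarm names below are assumptions:

import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="distilbert-sentiment",
    EndpointConfigName="distilbert-sentiment-v2",  # assumed new EndpointConfig
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 900,  # hold the 10% canary for 15 minutes
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {"AlarmName": "distilbert-p99-latency"},  # assumed CloudWatch alarms
                {"AlarmName": "distilbert-5xx-errors"},
            ]
        },
    },
)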