Since the introduction of OpenAI's ChatGPT in late 2022, generative AI has advanced rapidly across the tech industry. Amazon Web Services (AWS) already offered services like SageMaker for building and deploying machine learning models. Following the success of ChatGPT, AWS launched its serverless generative AI service, Amazon Bedrock, in 2023. Bedrock enables businesses to create generative AI applications by providing access to leading large language models. Additionally, Amazon is developing its own family of foundation models, known as Amazon Titan.
As Amazon Bedrock becomes increasingly important, it's essential to understand its fundamentals and pricing model. This knowledge enables businesses to effectively leverage Bedrock's capabilities while managing costs, ensuring they maximize the benefits of AWS’s powerful AI tools. This blog will explore the basics of Amazon Bedrock, its core components, pricing model, and the optimization strategies for Bedrock costs.
Foundation models are an important enabling technology in generative AI. Previously, machine learning models had to be trained from scratch for each specific task, which was both slow and expensive.
A foundational model in generative AI can be thought of as a super smart starting point for creating new things with artificial intelligence. Imagine it as a comprehensive tool that has absorbed a vast amount of information, similar to reading every book in a library. This model recognizes patterns in various types of data, including language, images, and sounds (multimodal). Because it has learned so extensively, the foundational model can be applied to many different tasks without the need to start from zero each time.
For instance, if the model has learned the structure of sentences, it can assist in creating chatbots, translating languages, or even writing stories. In terms of images, it can generate new artwork or aid in graphic design.
When working with generative AI models, it's essential to understand tokens, as they directly impact resource consumption and pricing. Input data is converted into tokens, each of which can represent a word, part of a word, or a single character. Each token is transformed into a numerical vector (also known as an embedding) that captures its semantic meaning. The models process sequences of these token embeddings to understand context and generate responses.
More tokens allow the model to consider a longer context, resulting in better responses. However, a higher number of tokens leads to increased computation and memory usage, which in turn affects processing time and overall resource consumption.
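To make this concrete, below is a minimal sketch of tokenization using the open-source tiktoken library. The cl100k_base encoding is one common BPE scheme; exact token boundaries and counts vary from model to model, so treat the output as illustrative.

```python
# Minimal tokenization sketch using tiktoken. Token counts are
# illustrative; each model family uses its own tokenizer.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # a common BPE encoding

text = "Amazon Bedrock provides access to foundation models via an API."
tokens = encoding.encode(text)                   # text -> integer token IDs

print(f"Token count: {len(tokens)}")             # billable input tokens
print(encoding.decode(tokens))                   # IDs decode back to text
```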
Model inference refers to the process of using a trained machine-learning model to make predictions or decisions based on new, unseen data. During this process, the model applies the patterns and knowledge it acquired during training to analyze the incoming data. Model inference takes place each time new data is processed.
Amazon Bedrock is a fully managed service that enables users to leverage advanced foundation models from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, and Stability AI. It offers a straightforward API for accessing these models, allowing developers to focus on experimentation without worrying about infrastructure. This flexibility enables developers to fine-tune the models with custom data and use techniques such as Retrieval Augmented Generation (RAG). Consequently, Amazon Bedrock facilitates the development of AI agents that can interact with specific datasets.
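As a quick illustration, here is a minimal sketch of invoking a model through the Bedrock runtime API with boto3. The model ID and the request body shape are assumptions; every model family on Bedrock defines its own JSON schema, so check the documentation for the model you use.

```python
# Minimal Bedrock InvokeModel sketch with boto3. The model ID and body
# schema below are assumptions (Anthropic-style messages format).
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user",
                  "content": "Summarize what Amazon Bedrock does."}],
})

response = client.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed example ID
    body=body,
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```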
Like many AWS services, Bedrock integrates seamlessly with other AWS offerings, such as S3 for data storage, Amazon CloudWatch for monitoring, AWS Lambda, and Amazon SageMaker. This integration simplifies AI agent development and allows developers to iterate quickly. Additionally, Bedrock includes advanced features like prompt management and workflow integrations.
Inference requires compute resources to run AI models using input data, referred to as a dataset. This dataset needs to be stored, and in some cases, a custom model may also require storage. It is essential to transfer the dataset from storage to compute, and this data transfer is a significant cost factor. Top AI companies develop and train the models, with each model priced differently. Therefore, the key factors impacting pricing are: compute, storage, data transfer, and the model itself.
It's important to understand the Bedrock pricing model before using the service. This model is designed to help developers explore and test various foundational models while also considering their production workload. Like many AWS services, the pricing comes primarily in two forms: a pay-per-use option and a guaranteed provisioned throughput option.
In the Pay-As-You-Go model, pricing is based on the number of input and output tokens processed. This model does not require any long-term commitment, making it ideal for experimentation and development phases. It is especially beneficial for applications with significant and frequent fluctuations in request volume, such as a customer service chatbot that handles varying inquiries. However, there are some important details to note. On-demand pricing for text generation models charges separately for input tokens (which include words, punctuation marks, etc., provided to the model) and output tokens (which are generated by the model). For embedding models, charges are based solely on input tokens, while image generation models have a fixed cost for each generated image.
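A back-of-the-envelope calculation makes this billing model concrete. The per-1,000-token rates below are placeholders, not actual Bedrock prices; look up the current rates for your model and region on the Bedrock pricing page.

```python
# Rough on-demand cost estimate. Rates are placeholders, not real prices.
INPUT_PRICE_PER_1K = 0.00025   # assumed $ per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.00125  # assumed $ per 1,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the on-demand cost of a single request, in dollars."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# e.g. a chatbot turn with an 800-token prompt and a 300-token reply
print(f"${estimate_cost(800, 300):.6f}")
```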
Additionally, On-Demand pricing supports cross-region model inference, allowing users to manage traffic spikes without incurring extra cross-region charges. In these cases, the pricing is determined by the source region from which the request originates.
With Provisioned Throughput pricing, you reserve dedicated capacity in units called "model units," each of which guarantees a specified throughput of input and output tokens per minute. This pricing model is ideal for applications that require predictable and consistent workloads. By committing to a certain capacity for a one-month or six-month term, you benefit from a discounted hourly rate, and the reserved capacity ensures reliable performance for large, steady workloads. It's important to note that you are billed for the reserved capacity for the entire commitment period, regardless of your actual usage. Additionally, performing inference on customized models requires purchasing Provisioned Throughput.
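A simple break-even sketch helps decide between the two pricing models. All rates and volumes below are assumptions; substitute the real figures for your model, region, and commitment term.

```python
# Break-even sketch: on-demand token charges vs. an hourly Provisioned
# Throughput commitment. All numbers are assumptions.
ON_DEMAND_COST_PER_1K_TOKENS = 0.0008  # assumed blended $/1K tokens
MODEL_UNIT_HOURLY_RATE = 20.0          # assumed $/hour for one model unit

def monthly_on_demand(tokens_per_month: int) -> float:
    return tokens_per_month / 1000 * ON_DEMAND_COST_PER_1K_TOKENS

def monthly_provisioned(model_units: int, hours: int = 730) -> float:
    # Reserved capacity is billed for every hour of the term,
    # regardless of how much of it is actually used.
    return model_units * MODEL_UNIT_HOURLY_RATE * hours

tokens = 25_000_000_000  # 25B tokens/month: a large, steady workload
print(f"On-demand:   ${monthly_on_demand(tokens):,.0f}")
print(f"Provisioned: ${monthly_provisioned(model_units=1):,.0f}")
```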
Batch processing lets you submit a large set of prompts as a single input file, significantly lowering costs. The responses are stored in S3 buckets, making them accessible to other parts of the application. Amazon Bedrock offers select foundation models (FMs) from leading AI providers such as Anthropic, Meta, Mistral AI, and Amazon for batch inference at a price that is 50% lower than on-demand inference pricing. Note that batch processing is not supported for all models; refer to the AWS documentation for the list of supported models.
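Here is a minimal sketch of submitting a batch inference job with boto3. The job name, model ID, IAM role ARN, and S3 paths are placeholders; the input file is JSONL with one model request per line (see the Bedrock batch inference documentation for the exact record format).

```python
# Minimal batch inference job sketch. Names, ARNs, and S3 URIs are
# placeholders; the model must support batch inference.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_invocation_job(
    jobName="sentiment-batch-job",                       # assumed name
    modelId="anthropic.claude-3-haiku-20240307-v1:0",    # assumed model ID
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",  # placeholder
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch/input.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch/output/"}
    },
)
print(response["jobArn"])  # track progress with get_model_invocation_job
```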
Amazon Bedrock enables you to customize foundation models (FMs) using your own data to provide tailored responses. When you customize a foundation model through techniques like fine-tuning or Retrieval Augmented Generation (RAG), you will incur costs associated with training, storage, and inference.
Fine-tuning can be performed using labeled data or by continued pretraining on unlabeled data. The costs involved include charges for model training based on the total number of tokens processed, as well as monthly storage fees for your custom models. Additionally, using customized models for inference requires the purchase of Provisioned Throughput, with charges based on the number of hours the models are in use.
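The sketch below shows how such a fine-tuning job might be started with boto3. The base model ID, role ARN, S3 paths, and hyperparameters are all placeholders; training is billed per token processed, and the resulting custom model then incurs monthly storage charges.

```python
# Minimal fine-tuning job sketch. All identifiers and hyperparameters
# are placeholders; check which base models support customization.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_customization_job(
    jobName="support-tone-finetune",                          # assumed name
    customModelName="support-tone-v1",                        # assumed name
    roleArn="arn:aws:iam::123456789012:role/BedrockFtRole",   # placeholder
    baseModelIdentifier="amazon.titan-text-express-v1",       # assumed ID
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/ft-output/"},
    hyperParameters={"epochCount": "2", "batchSize": "1"},    # assumed values
)
print(response["jobArn"])
```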
This is a general cost optimization strategy applicable to any generative AI application, not just Bedrock. As mentioned earlier, the number of input tokens significantly impacts costs, so reducing input tokens can lead to substantial savings. Understanding prompt engineering is crucial for crafting concise, clear prompts. Additionally, applications can use BPE (byte pair encoding) tokenizer libraries such as tiktoken to measure and trim prompts before sending them, as sketched below.
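For example, the sketch below uses tiktoken to compare the billable size of a verbose prompt with a concise one; exact counts depend on the model's tokenizer, but the relative saving is what matters.

```python
# Comparing the token cost of a verbose vs. a concise prompt.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

verbose = (
    "I would really appreciate it if you could possibly take a moment "
    "to provide me with a summary of the following customer review."
)
concise = "Summarize this customer review:"

print(len(enc.encode(verbose)), "tokens")  # the padded phrasing costs more
print(len(enc.encode(concise)), "tokens")  # the concise phrasing costs less
```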
Processing large datasets as a single batch can be more cost-effective than handling them individually on demand. This approach is particularly beneficial for tasks such as sentiment analysis or language translation, where all the data can be processed together instead of analyzing each text separately. By grouping multiple requests into a single batch, you can reduce token usage and make your workflows more efficient.
Understanding the usage pattern of your workload is essential, and Amazon CloudWatch can give you better insight into it. For large organizations with predictable and consistent workloads, reserving capacity through Provisioned Throughput may be more cost-effective than On-Demand pricing. However, Provisioned Throughput pricing might be too high for small, independent use cases. By analyzing the usage pattern, you can make an informed decision about whether to switch to Provisioned Throughput.
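As a sketch, the snippet below pulls a week of hourly Bedrock token-usage data from CloudWatch; the model ID is a placeholder, and it assumes the AWS/Bedrock metric namespace with the InputTokenCount metric is available in your region.

```python
# Pull hourly input-token usage for one model from CloudWatch.
# Namespace/metric names assume the AWS/Bedrock runtime metrics.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

resp = cw.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId",
                 "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],  # placeholder
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,          # hourly buckets reveal daily usage patterns
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```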
AI companies release new foundation models (FMs) almost daily. Before selecting one, it's important to experiment with its capabilities and consider its pricing. Amazon Bedrock offers model evaluation jobs, which let you compare different models and their inference outputs. This feature is very helpful in choosing the model best suited to your generative AI application, and it can save you both processing and token costs without compromising on the quality your application requires.
When customizing the model, using batch mode processing, or conducting cross-region model inference, it is important to carefully consider the storage costs associated with your dataset, batch responses, and data transfer. To manage these costs effectively, utilize the appropriate Amazon S3 storage classes and implement S3 Lifecycle policies to transition data between different storage classes. Additionally, you can use S3 gateway VPC endpoints or PrivateLink to help reduce data transfer expenses.
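For instance, here is a minimal sketch of an S3 Lifecycle rule that tiers aged batch outputs to cheaper storage classes and eventually expires them. The bucket name, prefix, and timings are assumptions to adapt to your retention requirements.

```python
# Tier aged batch outputs to cheaper storage, then expire them.
# Bucket, prefix, and day thresholds are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bedrock-artifacts",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-and-expire-batch-output",
            "Status": "Enabled",
            "Filter": {"Prefix": "batch/output/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }]
    },
)
```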
By analyzing your usage patterns and other business information, you can estimate costs in advance. Use this estimate to set budget alerts with appropriate thresholds.
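A minimal sketch of creating such a budget with the AWS Budgets API is shown below; the account ID, dollar limit, service filter value, and email address are placeholders.

```python
# Monthly cost budget with an alert at 80% of the limit.
# Account ID, limit, filter value, and email are placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account ID
    Budget={
        "BudgetName": "bedrock-monthly",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},  # assumed limit
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        "CostFilters": {"Service": ["Amazon Bedrock"]},   # assumed value
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,               # percent of the budget
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL",
                         "Address": "team@example.com"}],  # placeholder
    }],
)
```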
Managing and optimizing costs for Amazon Bedrock can be challenging. CloudYali offers comprehensive resource inventory and real-time cost monitoring, giving you contextual visibility into your cloud spending. CloudYali's AI cost tracking feature keeps track of AI costs, models, tokens, and more. Customizable budget alerts help you stay on top of your expenses, identify inefficiencies, and take immediate action. Together, these features empower you to manage Amazon Bedrock costs effectively, optimize your budget, and maximize the value of your cloud investment.
Not a CloudYali user yet? Sign up here to get started with comprehensive cloud cost control.