Amazon Bedrock: foundations, systems, and scaling

This article assumes familiarity with Transformers, probabilistic inference, and optimization. The focus is the Amazon Bedrock service layer and how its components connect to a modern generative AI stack.

1) What Amazon Bedrock is at the system level

Amazon Bedrock is a control/data plane for foundation model (FM) inference. In simplified terms:

  • Control plane: model selection, access control, versioning, metrics, and policies.
  • Data plane: inference execution with isolation, governance, and integration with AWS services.

Formally, inference can be seen as an operator:

$$\mathcal{I}_{\theta}: (x, h) \mapsto y$$

where $x$ is the prompt, $h$ is the set of generation hyperparameters (temperature, top-$p$, top-$k$, etc.), and $y$ is the generated sequence sampled from a model parameterized by $\theta$.
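
As a minimal sketch of a data-plane call, the Converse API in boto3 maps directly onto $(x, h)$. This assumes AWS credentials are configured; the region and model ID are illustrative.

```python
# One inference call I_theta(x, h) via the Bedrock Converse API (boto3).
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    messages=[{"role": "user",
               "content": [{"text": "Summarize Amazon Bedrock in one sentence."}]}],
    inferenceConfig={  # the hyperparameters h
        "temperature": 0.7,
        "topP": 0.9,
        "maxTokens": 256,
    },
)
print(response["output"]["message"]["content"][0]["text"])  # the sample y
```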

2) Mathematical foundations of generation

2.1 Autoregressive Markov chain

Text generation is an autoregressive process:

$$P(y_{1:T} \mid x) = \prod_{t=1}^{T} P(y_t \mid x, y_{<t}).$$

Inference is a sampling problem over $P(y_t \mid x, y_{<t})$. Bedrock exposes this dynamic via sampling parameters.

2.2 Temperature, top-$k$, and top-$p$

If $\ell_i$ are the model logits for the next token, then:

$$P(i) = \frac{\exp(\ell_i / \tau)}{\sum_j \exp(\ell_j / \tau)}$$

  • Temperature $\tau$ controls entropy. As $\tau \to 0$, the distribution collapses to the argmax.
  • Top-$k$ restricts support to the $k$ most probable tokens.
  • Top-$p$ (nucleus sampling) chooses the smallest set $S$ such that $\sum_{i \in S} P(i) \ge p$.

Mathematically, top-$p$ yields a truncated, renormalized distribution:

$$P_p(i) = \frac{P(i) \cdot \mathbf{1}[i \in S]}{\sum_{j \in S} P(j)}.$$
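
The following NumPy sketch applies all three knobs to raw logits; tie handling in top-$k$ is simplified (tokens tied at the cutoff probability are all kept).

```python
# Temperature, top-k, and top-p sampling over raw logits, mirroring
# the truncated, renormalized distribution P_p above.
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, seed=None):
    rng = np.random.default_rng(seed)
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()                      # softmax with temperature
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]       # k-th largest probability
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        order = np.argsort(probs)[::-1]       # tokens by descending probability
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]  # smallest S with mass >= p
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    probs /= probs.sum()                      # renormalize over truncated support
    return rng.choice(len(probs), p=probs)

print(sample_next_token(np.array([2.0, 1.0, 0.1]), temperature=0.7, top_p=0.9, seed=0))
```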

2.3 Perplexity and cross-entropy

Language model quality is commonly analyzed via cross-entropy:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid x, y_{<t}).$$

Perplexity is:

$$\mathrm{PPL} = \exp(\mathcal{L}).$$

In evaluation, reducing $\mathcal{L}$ implies higher predictability and lower uncertainty in generation.
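
In code, given per-token log-probabilities from an evaluation run:

```python
# Cross-entropy L and perplexity PPL from per-token log-probabilities.
import math

def perplexity(token_logprobs):
    ce = -sum(token_logprobs) / len(token_logprobs)  # L = -(1/T) sum log P
    return math.exp(ce)                              # PPL = exp(L)

print(perplexity([-0.2, -1.1, -0.5]))  # ~1.82
```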

3) Attention: the Transformer core

For a multi-head attention block:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$

For $h$ heads:

$$\mathrm{MHA}(X)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W^O,$$

with

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V).$$

Per-layer complexity is $O(T^2 d)$, which explains latency costs for long sequences. In Bedrock, this translates into higher time/cost for large prompts and long generations.
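
A single-head reference implementation in NumPy makes the quadratic term explicit; shapes and sizes are illustrative.

```python
# Scaled dot-product attention (single head).
# Q, K: (T, d_k); V: (T, d_v). The (T, T) score matrix is the O(T^2 d) cost.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) similarity matrix
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
T, d = 8, 16
out = attention(rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d)))
print(out.shape)  # (8, 16)
```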

4) RAG (Retrieval-Augmented Generation) in Bedrock

A typical RAG pipeline can be viewed as a composition:

$$\hat{y} = \mathcal{I}_{\theta}\big(x \oplus \mathrm{Retrieve}(x, \mathcal{D}),\ h\big)$$

where $\mathcal{D}$ is the indexed corpus and $\oplus$ is a concatenation or fusion operator.

4.1 Embedding and retrieval

The embedding $e(x) \in \mathbb{R}^d$ is produced by an encoder:

$$e(x) = f_\phi(x).$$

Retrieval uses similarity, e.g., cosine:

$$\mathrm{sim}(x, z) = \frac{e(x) \cdot e(z)}{\|e(x)\|\,\|e(z)\|}.$$

The top-$k$ documents $\{z_i\}$ are the $k$ highest-scoring candidates under:

$$\arg\max_{z \in \mathcal{D}} \; \mathrm{sim}(x, z).$$
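
A minimal version of this retrieval step over precomputed embeddings; any embedding model can stand in for $f_\phi$, and the vectors here are random placeholders.

```python
# Top-k retrieval by cosine similarity over a matrix of document embeddings.
import numpy as np

def retrieve_top_k(query_emb, doc_embs, k=3):
    q = query_emb / np.linalg.norm(query_emb)
    D = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = D @ q                       # cosine similarity for every document
    idx = np.argsort(sims)[::-1][:k]   # indices of the k best matches
    return idx, sims[idx]

rng = np.random.default_rng(1)
docs = rng.normal(size=(100, 64))      # placeholder corpus embeddings
idx, scores = retrieve_top_k(rng.normal(size=64), docs, k=3)
print(idx, np.round(scores, 3))
```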

4.2 Optimal context mixing

To mitigate hallucinations, one strategy is to weight retrieved chunks by score:

$$C = \sum_{i=1}^{k} w_i c_i, \qquad w_i = \frac{\exp(\alpha s_i)}{\sum_j \exp(\alpha s_j)}$$

where $s_i$ is the similarity score and $c_i$ is the content. This induces soft routing of context.
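
A sketch of the weight computation; in practice, with text chunks, these weights usually drive ordering and truncation of the prompt rather than a literal weighted sum.

```python
# Softmax weights over retrieval scores: soft routing of context.
# alpha sharpens (large) or flattens (small) the distribution.
import numpy as np

def context_weights(scores, alpha=5.0):
    w = np.exp(alpha * (scores - np.max(scores)))  # numerically stable softmax
    return w / w.sum()

print(np.round(context_weights(np.array([0.82, 0.74, 0.51])), 3))
```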

5) Routing and model selection

Bedrock lets you choose different FMs. We can model the choice as a risk minimization problem:

$$\theta^* = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(\mathcal{I}_\theta(x,h), y)\big] + \lambda \cdot \mathrm{Cost}(\theta).$$

This balances quality (loss $\ell$) and cost. For production applications, this tradeoff is central.
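
A toy router makes the objective concrete; the model IDs, loss estimates, and per-1k-token prices below are illustrative placeholders, not real Bedrock figures.

```python
# Pick the model minimizing estimated loss + lambda * cost (hypothetical data).
def route(candidates, lam=0.5):
    return min(candidates, key=lambda m: m["est_loss"] + lam * m["cost_per_1k"])

models = [
    {"id": "large-model",  "est_loss": 0.10, "cost_per_1k": 0.0150},
    {"id": "medium-model", "est_loss": 0.16, "cost_per_1k": 0.0030},
    {"id": "small-model",  "est_loss": 0.25, "cost_per_1k": 0.0004},
]
print(route(models, lam=0.5)["id"])   # quality-heavy: picks large-model
print(route(models, lam=50.0)["id"])  # cost-heavy: picks small-model
```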

6) Latency and cost: a simplified model

Total latency can be approximated as:

$$T_{\text{total}} = T_{\text{tokenize}} + T_{\text{forward}}(n_{\text{in}}) + T_{\text{decode}}(n_{\text{out}}).$$

If $C_\text{in}$ and $C_\text{out}$ are (hypothetical) per-token costs and $n_{\text{in}}, n_{\text{out}}$ are the input/output token counts:

$$\mathrm{Cost} = C_\text{in} \cdot n_{\text{in}} + C_\text{out} \cdot n_{\text{out}}.$$

Practical optimization involves:

  • reducing $n_{\text{in}}$ via prompt compression
  • limiting $n_{\text{out}}$ via max_tokens
  • choosing $\theta$ with the best cost/quality tradeoff (see the sketch after this list)
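
The cost formula translates directly into a helper; the per-token prices are hypothetical placeholders, not Bedrock pricing.

```python
# Cost = C_in * n_in + C_out * n_out, with hypothetical per-token prices.
def request_cost(n_in, n_out, c_in=3e-6, c_out=15e-6):
    return c_in * n_in + c_out * n_out

print(f"${request_cost(n_in=4000, n_out=500):.4f}")  # $0.0195
```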

7) Evaluation and calibration

To evaluate generated answers, you can use metrics based on semantic distance and factual consistency. A simple model:

$$\mathrm{Score}(y) = \beta_1 \cdot \mathrm{sim}(y, y^*) - \beta_2 \cdot \mathrm{Risk}(y)$$

where $y^*$ is a reference answer. For probabilistic calibration, reliability can be measured via the Expected Calibration Error (ECE):

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\right|.$$
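
A direct implementation of ECE with equal-width confidence bins, given per-answer confidences and correctness labels:

```python
# Expected Calibration Error over M equal-width bins.
import numpy as np

def ece(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap  # (|B_m| / n) * |acc(B_m) - conf(B_m)|
    return total

print(ece([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))  # 0.2375
```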

8) Safety, policies, and mitigation

A safety classifier can be modeled as $g_\psi(x) \in [0,1]$. The policy can be:

$$\mathrm{Allow}(x) = \mathbf{1}[g_\psi(x) \leq \delta].$$

In robust pipelines, the classifier acts before and after generation (pre- and post-filter), reducing the risk of undesired outputs.
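
A minimal pre/post-filter sketch; classify (the score $g_\psi$) and generate are stand-ins for a real classifier and model call.

```python
# Gate both the prompt and the output with a risk score g_psi in [0, 1].
def guarded_generate(prompt, classify, generate, delta=0.5):
    if classify(prompt) > delta:       # pre-filter: block risky prompts
        return "[blocked: input policy]"
    output = generate(prompt)
    if classify(output) > delta:       # post-filter: block risky outputs
        return "[blocked: output policy]"
    return output

print(guarded_generate("What is Bedrock?",
                       classify=lambda text: 0.1,            # stand-in classifier
                       generate=lambda p: "A managed FM service."))
```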

9) Numerical example: temperature effect

Consider logits for three tokens: $\ell = [2.0, 1.0, 0.1]$.

For $\tau = 1$:

$$P = \mathrm{softmax}([2.0, 1.0, 0.1]) \approx [0.659, 0.242, 0.099].$$

For $\tau = 0.5$:

$$P = \mathrm{softmax}([4.0, 2.0, 0.2]) \approx [0.864, 0.117, 0.019].$$

Entropy drops from $H \approx 0.85$ to $H \approx 0.45$ nats, making generation more deterministic.
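
The section's numbers can be reproduced in a few lines (entropy in nats):

```python
# Softmax at two temperatures and the resulting Shannon entropy (nats).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for tau in (1.0, 0.5):
    p = softmax(logits / tau)
    H = -np.sum(p * np.log(p))
    print(tau, np.round(p, 3), round(float(H), 2))
# 1.0 [0.659 0.242 0.099] 0.85
# 0.5 [0.864 0.117 0.019] 0.45
```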

10) Technical production checklist

  1. Define quantitative quality and cost targets.
  2. Model latency and token usage with observable metrics.
  3. Implement RAG with vector retrieval and re-ranking.
  4. Apply safety policies with calibrated thresholds.
  5. Run offline evaluations and continuous A/B tests.

Possible extensions to this article include a benchmarks section and a practical tutorial using the AWS SDK (Python or TypeScript).
