<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Mark Sayson</title>
    <description>Notes on software development, technology, and life.</description>
    <link>https://www.marksayson.com/</link>
    <atom:link href="https://www.marksayson.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sat, 28 Feb 2026 23:06:52 +0000</pubDate>
    <lastBuildDate>Sat, 28 Feb 2026 23:06:52 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Strategies for querying periodic S3 data snapshots</title>
        <description>&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;A common AWS analytics use case is making aggregate queries across multiple data sets stored in S3.  For example, one partner team may store product metadata, while another team stores purchase order metadata, and we may want to join these data sets to determine which products are most popular across each marketplace.&lt;/p&gt;

&lt;p&gt;In this post I’ll cover a few options for syncing data to S3 and retrieving data snapshots for use in aggregate queries, with a focus on the use case where consumers need uninterrupted access to complete, recent snapshots, even while a data sync is in progress.&lt;/p&gt;

&lt;h2 id=&quot;a-few-options-for-syncing-data-to-s3&quot;&gt;A few options for syncing data to S3&lt;/h2&gt;

&lt;h3 id=&quot;1-syncing-each-record-to-a-unique-stable-file-path&quot;&gt;1. Syncing each record to a unique, stable file path&lt;/h3&gt;
&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Provides a stable “latest” data set that always represents all records.&lt;/li&gt;
  &lt;li&gt;Enables O(1) look-ups of specific records if data consumers query by S3 key (“filepath”).&lt;/li&gt;
  &lt;li&gt;Low storage costs since only a single copy of each record is stored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Poor performance and higher costs for aggregate queries at scale. Retrieving tens of thousands of small files is much slower than retrieving a few multi-MB files.&lt;/li&gt;
  &lt;li&gt;Limits design choices for data providers and may increase complexity of their implementation (eg. AWS Glue defaults to writing aggregated partial result files).&lt;/li&gt;
  &lt;li&gt;No historical data is retained unless explicitly backed up elsewhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is best suited to when data consumers retrieve specific records, only need their latest state, and don’t make aggregated queries across a large number of records.&lt;/p&gt;

&lt;h3 id=&quot;2-appending-all-events-as-new-data-without-overwriting-prior-events&quot;&gt;2. Appending all events as new data without overwriting prior events&lt;/h3&gt;
&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Provides a complete history of events, allowing for time-series analysis.&lt;/li&gt;
  &lt;li&gt;If data is partitioned by time, supports efficient queries of specific time periods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Increased storage costs as historical data accumulates.&lt;/li&gt;
  &lt;li&gt;Increased complexity, latency, and cost to retrieve the latest version of each record.&lt;/li&gt;
  &lt;li&gt;If data isn’t partitioned in a way that aligns with query use cases, scans may be very inefficient.&lt;/li&gt;
  &lt;li&gt;If upstream workflows fail, there may be no way to recover the data, leaving records missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is best suited to queries of time-partitioned events, rather than needing the latest state of records.&lt;/p&gt;

&lt;h3 id=&quot;3-overwriting-one-or-more-files-that-include-multiple-records&quot;&gt;3. Overwriting one or more files that include multiple records&lt;/h3&gt;
&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Consumers always access the most up-to-date version of the dataset.&lt;/li&gt;
  &lt;li&gt;More efficient aggregate queries than retrieving thousands of single-record files.&lt;/li&gt;
  &lt;li&gt;Low storage costs since only a single copy of each record is stored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Risk of data loss if a failure occurs during the overwrite process.&lt;/li&gt;
  &lt;li&gt;Data consumers querying during a data sync may receive incomplete or duplicate data.&lt;/li&gt;
  &lt;li&gt;No historical data is retained unless explicitly backed up elsewhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This can work for aggregated queries of the latest records, but if we can’t accept the risk of corrupt/partial/duplicate data during sync failures or read/write race conditions, we may prefer writing to separate snapshot partitions.&lt;/p&gt;

&lt;h3 id=&quot;4-periodically-writing-complete-data-to-time-based-partitions&quot;&gt;4. Periodically writing complete data to time-based partitions&lt;/h3&gt;
&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Consumers always access a complete dataset when querying a completed/past partition.&lt;/li&gt;
  &lt;li&gt;More efficient aggregate queries than retrieving thousands of single-record files.&lt;/li&gt;
  &lt;li&gt;Maintain historical data for as long as prior time partitions are kept in storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;More complex to identify the latest complete partition.&lt;/li&gt;
  &lt;li&gt;Data consumers querying the most recent partition during a data sync may receive incomplete data.  Mitigating this requires some type of completeness signal.&lt;/li&gt;
  &lt;li&gt;Increased storage costs as historical data accumulates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can mitigate storage costs by setting a lifecycle rule to automatically delete S3 files past a certain age, eg. auto-delete files over 2 weeks old, if we only need to query recent snapshots.&lt;/p&gt;
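&lt;p&gt;As a rough sketch of such a lifecycle rule, the snippet below builds a configuration in the shape accepted by the S3 lifecycle API (for example via boto3’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;put_bucket_lifecycle_configuration&lt;/code&gt;). The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;upstream_dataset/&lt;/code&gt; prefix, the rule ID, and the helper name are assumptions for illustration, not part of any real setup.&lt;/p&gt;

```python
import json


def snapshot_expiration_rule(prefix: str, max_age_days: int) -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`.

    The prefix and rule ID are hypothetical; adjust them to the real bucket layout.
    """
    return {
        "Rules": [
            {
                "ID": f"expire-snapshots-after-{max_age_days}-days",
                "Status": "Enabled",
                # Only objects whose key starts with this prefix are affected
                "Filter": {"Prefix": prefix},
                # Objects are expired once they are older than this many days
                "Expiration": {"Days": max_age_days},
            }
        ]
    }


# Two weeks of retention, matching the example in the text
config = snapshot_expiration_rule("upstream_dataset/", 14)
print(json.dumps(config, indent=2))
```

&lt;p&gt;The retention period should be comfortably longer than any look-back window used by downstream queries, so that a query never depends on a partition that has already been expired.&lt;/p&gt;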

&lt;h2 id=&quot;case-study-aggregate-queries-where-completeness-is-important&quot;&gt;Case study: Aggregate queries where completeness is important&lt;/h2&gt;
&lt;p&gt;For this post, we’ll consider the case where we want to optimize for aggregated queries against hundreds of thousands of records, with filter criteria applied across all records.  Our business requirements are that data consumers must always have access to complete and accurate data (no duplicates or missing records), but data does not need to be real-time as long as it’s up to date within a few hours.&lt;/p&gt;

&lt;p&gt;In this case, single-record files and time-series events are not a good fit due to increased latency to query latest state across this number of records.  We may prefer time-based partitions with complete data written to a new partition every N hours, eg. hourly, to mitigate partial/duplicate data issues during concurrent read/write operations.&lt;/p&gt;
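&lt;p&gt;A minimal sketch of how a writer might derive the S3 prefix for each hourly snapshot is shown below. It assumes Hive-style keys named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshot_date&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshot_hour&lt;/code&gt; (the partition columns used by the queries in this post), UTC timestamps, and a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;upstream_dataset&lt;/code&gt; table prefix.&lt;/p&gt;

```python
from datetime import datetime, timezone


def snapshot_prefix(table_prefix: str, snapshot_time: datetime) -> str:
    """Return the Hive-style S3 prefix for the snapshot covering `snapshot_time`.

    Zero-padded date and hour values keep string ordering consistent
    with chronological ordering.
    """
    utc = snapshot_time.astimezone(timezone.utc)
    return (
        f"{table_prefix}/"
        f"snapshot_date={utc:%Y-%m-%d}/"
        f"snapshot_hour={utc:%H}/"
    )


example = snapshot_prefix(
    "upstream_dataset",
    datetime(2025, 1, 31, 1, 15, tzinfo=timezone.utc),
)
print(example)  # upstream_dataset/snapshot_date=2025-01-31/snapshot_hour=01/
```

&lt;p&gt;Writing every sync to a fresh prefix means readers of earlier partitions are never affected by an in-progress write.&lt;/p&gt;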

&lt;h2 id=&quot;a-few-options-for-querying-time-partitioned-snapshots&quot;&gt;A few options for querying time-partitioned snapshots&lt;/h2&gt;

&lt;p&gt;A common challenge with time-partitioned snapshots is identifying the latest complete partition, especially if upstream data syncs are not 100% successful.&lt;/p&gt;

&lt;h3 id=&quot;1-retrieve-a-specific-snapshot-relative-to-the-current-time&quot;&gt;1. Retrieve a specific snapshot relative to the current time&lt;/h3&gt;

&lt;p&gt;Assuming we have hourly snapshots that are partitioned by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshot_date&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshot_hour&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshot_date&lt;/code&gt; is formatted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;2025-01-31&quot;&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshot_hour&lt;/code&gt; is formatted as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;01&quot;&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;23&quot;&lt;/code&gt; to support consistent string comparisons, the following Athena SQL query retrieves data from the last hour’s time partition:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_FORMAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_TIMESTAMP&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;1&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HOUR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;%Y-%m-%d&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_FORMAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_TIMESTAMP&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;1&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HOUR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;%H&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Simple, easy to understand.&lt;/li&gt;
  &lt;li&gt;Very efficient since only querying data from a single partition, and it’s O(1) to locate that partition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;May fail to get any data if there were upstream sync issues for the selected partition. &lt;strong&gt;This is a blocker for us.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;May get partial data if we query the most recent partition during an ongoing sync.&lt;/li&gt;
  &lt;li&gt;May need to query older and more out-of-date partitions to avoid the above race condition.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-retrieve-most-recent-snapshot&quot;&gt;2. Retrieve most recent snapshot&lt;/h3&gt;

&lt;p&gt;We can improve on the prior option by using an initial query to identify the most recent snapshot partition that has data.&lt;/p&gt;

&lt;p&gt;Since it could be expensive to search across all partitions, we can set a maximum look-back period, for example, the last 7 days, to allow for occasional upstream failures that may take a few days to resolve.&lt;/p&gt;

&lt;p&gt;Example Athena SQL query:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MostRecentSnapshotPartition&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DISTINCT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;-- Set a maximum look-back period to reduce query search space while allowing a few days of upstream failures&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_DATE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;7&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DAY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MostRecentSnapshotPartition&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MostRecentSnapshotPartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MostRecentSnapshotPartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Relatively simple and easy to understand.&lt;/li&gt;
  &lt;li&gt;Guaranteed to get most recent snapshot with data within the given look-back period.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;May get partial data if we query the most recent partition during an ongoing sync.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-retrieve-the-second-most-recent-snapshot&quot;&gt;3. Retrieve the second most recent snapshot&lt;/h3&gt;

&lt;p&gt;If we need to mitigate race conditions where the latest partition may have partial data, we could always query for the second most recent partition.&lt;/p&gt;

&lt;p&gt;We can maintain the same maximum look-back period to limit the query search space for performance reasons, while accounting for a few days of upstream failures.&lt;/p&gt;

&lt;p&gt;Example Athena SQL query:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RecentSnapshotPartitions&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DISTINCT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;-- Set a maximum look-back period to reduce query search space while allowing a few days of upstream failures&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_DATE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;7&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DAY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;-- Get the two most recent partitions with data&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;RankedSnapshotPartitions&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;-- Set a row number so we can select the second most recent partition&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ROW_NUMBER&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OVER&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recency_rank&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RecentSnapshotPartitions&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RankedSnapshotPartitions&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RankedSnapshotPartitions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RankedSnapshotPartitions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RankedSnapshotPartitions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;recency_rank&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;If the second most recent partition is always complete, this guarantees complete data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;More complex and difficult to understand than other options.&lt;/li&gt;
  &lt;li&gt;Less efficient than other options.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-retrieve-the-most-recent-snapshot-older-than-the-period-between-data-syncs&quot;&gt;4. Retrieve the most recent snapshot older than the period between data syncs&lt;/h3&gt;

&lt;p&gt;If we need to account for temporary upstream failures and cannot accept race conditions where the most recent partition may sometimes be incomplete, we can simplify from Option 3 by querying for the most recent partition older than the time between syncs, eg. the most recent partition more than 1 hour old.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RecentCompleteSnapshotPartition&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DISTINCT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
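    &lt;span class=&quot;c1&quot;&gt;-- Get snapshots from more than 1 hour ago, to avoid querying a partition with an ongoing sync near the start of an hour.&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;-- Earlier dates qualify regardless of hour; on the cutoff date, only earlier hours qualify.&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_FORMAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_TIMESTAMP&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;1&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HOUR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;%Y-%m-%d&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;OR&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_FORMAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_TIMESTAMP&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;1&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HOUR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;%Y-%m-%d&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DATE_FORMAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_TIMESTAMP&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;1&apos;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HOUR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;%H&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;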
    &lt;span class=&quot;c1&quot;&gt;-- Set a maximum look-back period to reduce query search space while allowing a few days of upstream failures&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_DATE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;7&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DAY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RecentCompleteSnapshotPartition&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RecentCompleteSnapshotPartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RecentCompleteSnapshotPartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;If populated partitions that are older than the period between syncs are always complete, guarantees complete data.&lt;/li&gt;
  &lt;li&gt;Simpler and more efficient than Option 3, while covering the flaws of Options 1-2.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Slightly more complex query than Options 1-2.&lt;/li&gt;
&lt;/ul&gt;
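&lt;p&gt;The selection rule for this option can also be sketched outside SQL, which makes the comparison logic easier to check. The example below is only an illustration: it assumes partitions are available as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(snapshot_date, snapshot_hour)&lt;/code&gt; string pairs in UTC, and the function name is hypothetical. Date and hour are compared as a pair, so a partition from an earlier date qualifies regardless of its hour.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def latest_partition_before(partitions, now, sync_period=timedelta(hours=1)):
    """Return the newest (snapshot_date, snapshot_hour) older than the cutoff hour.

    Returns None when no partition qualifies.
    """
    cutoff = now.astimezone(timezone.utc) - sync_period
    cutoff_key = (cutoff.strftime("%Y-%m-%d"), cutoff.strftime("%H"))
    # Tuple comparison checks the date first and only compares hours on equal dates
    eligible = [p for p in partitions if cutoff_key > p]
    return max(eligible, default=None)


partitions = [
    ("2025-01-30", "22"),
    ("2025-01-30", "23"),
    ("2025-01-31", "00"),
    ("2025-01-31", "01"),
]
# At 01:30 UTC the cutoff hour is 00 on the same day, so the newest
# qualifying partition is the last hour of the previous day.
now = datetime(2025, 1, 31, 1, 30, tzinfo=timezone.utc)
print(latest_partition_before(partitions, now))
```

&lt;p&gt;The case just after midnight is worth testing explicitly, since filtering on date and hour independently would exclude the previous day’s late-evening partitions.&lt;/p&gt;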

&lt;h3 id=&quot;5-send-data-completeness-events-to-trigger-downstream-queries&quot;&gt;5. Send data completeness events to trigger downstream queries&lt;/h3&gt;

&lt;p&gt;If we can request that our data provider pushes a data completeness signal file such as an empty file named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_SUCCESS&lt;/code&gt; to the latest S3 directory after completing a data sync, we can set up S3 PutObject notifications that filter for this specific filename and trigger downstream workflows whenever this signal is received.&lt;/p&gt;

&lt;p&gt;This is ideal for event-based workflows that only need to listen for a single trigger, while it may not be sufficient if there are multiple upstream datasets that we need to query that have different schedules and may not all be complete at the same time.&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Guarantees completeness of the given data set at the time we receive the event.&lt;/li&gt;
  &lt;li&gt;Allows downstream workflows to run with the most recent data possible, as soon as data is received.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;More complex to support multiple upstream data sets, best for single-dependency workflows.&lt;/li&gt;
  &lt;li&gt;May not work with workflows that are strictly schedule-based and cannot be easily triggered by an event.&lt;/li&gt;
&lt;/ul&gt;
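&lt;p&gt;As an illustration of the consumer side, the sketch below extracts the completed partition prefix from an S3 event notification payload. It assumes the standard notification structure (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Records[].s3.object.key&lt;/code&gt;, with URL-encoded keys) and a Hive-style partition layout; the function name, bucket name, and sample key are hypothetical.&lt;/p&gt;

```python
from urllib.parse import unquote_plus

MARKER = "_SUCCESS"


def completed_partitions(event: dict) -> list:
    """Return the partition prefixes whose completeness marker appears in `event`."""
    prefixes = []
    for record in event.get("Records", []):
        # Object keys in S3 event notifications are URL-encoded
        key = unquote_plus(record["s3"]["object"]["key"])
        prefix, _, filename = key.rpartition("/")
        # Ignore any object that is not the completeness marker itself
        if filename == MARKER:
            prefixes.append(f"{prefix}/" if prefix else "")
    return prefixes


sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {
                    "key": "upstream_dataset/snapshot_date%3D2025-01-31/snapshot_hour%3D01/_SUCCESS"
                },
            }
        }
    ]
}
print(completed_partitions(sample_event))
```

&lt;p&gt;Checking the filename in the handler as well as in the notification filter is a defensive choice, in case the filter is later loosened or the handler is reused for other events.&lt;/p&gt;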

&lt;h3 id=&quot;6-leverage-aws-glue-catalog-trigger-glue-crawler-from-glue-job-event&quot;&gt;6. Leverage AWS Glue Catalog, trigger Glue Crawler from Glue job event&lt;/h3&gt;

&lt;p&gt;If we populate our S3 bucket with an AWS Glue job, we can set up a Glue Catalog and a Glue Crawler that updates the catalog with an abstracted representation of the source data and its available partitions for downstream services such as AWS Athena to query.&lt;/p&gt;

&lt;p&gt;If we set up the Glue Catalog Table as the data source for our downstream queries, we will only query partitions that it has crawled.&lt;/p&gt;

&lt;p&gt;We can set up a &lt;a href=&quot;https://docs.aws.amazon.com/glue/latest/dg/about-triggers.html&quot;&gt;Glue trigger&lt;/a&gt; to automatically run the Crawler after the upstream Glue job that populates new S3 partitions has succeeded, or after &lt;a href=&quot;https://docs.aws.amazon.com/glue/latest/dg/starting-workflow-eventbridge.html&quot;&gt;some event is received via EventBridge&lt;/a&gt;, such as creation of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_SUCCESS&lt;/code&gt; file, to automatically make new partitions available only after their sync jobs have completed.&lt;/p&gt;

&lt;p&gt;We could then use the Athena SQL query from Option 2 but with the Glue Table as our source, to simplify querying the most recent complete partition.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;WITH&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MostRecentSnapshotPartition&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DISTINCT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;-- Set a maximum look-back period to reduce query search space while allowing a few days of upstream failures&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;CAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;CURRENT_DATE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;INTERVAL&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;7&apos;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DAY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AS&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;VARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DESC&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;LIMIT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;INNER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;JOIN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MostRecentSnapshotPartition&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;ON&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MostRecentSnapshotPartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_date&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MostRecentSnapshotPartition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;snapshot_hour&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Hides the logic for identifying which partitions are complete from data consumers, who only see partitions that have been crawled.&lt;/li&gt;
  &lt;li&gt;Enables always querying the most recent complete partition.&lt;/li&gt;
  &lt;li&gt;Avoids the incomplete/duplicate data issues from Options 1-2 and the query complexity from Options 3-4.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Requires additional infrastructure setup and service costs for Glue resources and event-based triggers.&lt;/li&gt;
  &lt;li&gt;May add several minutes of data latency from events -&amp;gt; Glue trigger -&amp;gt; Glue crawler -&amp;gt; Glue catalog update, compared to directly querying S3 for the most recent partition more than N hours old.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;7-leverage-aws-glue-catalog-use-aws-lambda-to-update-a-latest-table-after-completeness-events&quot;&gt;7. Leverage AWS Glue Catalog, use AWS Lambda to update a “latest” table after completeness events&lt;/h3&gt;

&lt;p&gt;If we have multiple upstream data sets, which rules out Option 5, and we’re willing to invest a few developer weeks to optimize and simplify queries for data consumers, we could create a custom Lambda function that programmatically updates Glue resources to point to a “latest” S3 partition whenever a data sync completes.&lt;/p&gt;

&lt;p&gt;We could require data providers to write an empty &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_SUCCESS&lt;/code&gt; file to the new partition after a successful sync.  We can then set up an S3 PutObject event notification that filters for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_SUCCESS&lt;/code&gt; files, and use it to trigger our Lambda function.&lt;/p&gt;

&lt;p&gt;The Lambda function will then:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;List the files under the new partition’s S3 prefix and scan them to infer their partition values and table schema.&lt;/li&gt;
  &lt;li&gt;Call Glue’s &lt;a href=&quot;https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html&quot;&gt;UpdateTable API&lt;/a&gt; to update the Glue Catalog Table that represents the latest version of the snapshot-based data set, so that it has the latest data schema and points to the new partition.&lt;/li&gt;
&lt;/ol&gt;
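
&lt;p&gt;A minimal sketch of the parsing and table-update steps is shown below. It assumes Hive-style partition keys such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snapshot_date=2025-05-31/snapshot_hour=12&lt;/code&gt;, uses hypothetical bucket and function names, and leaves out schema inference, which is the expensive part of this option. The second helper only prepares a table definition; it assumes the existing definition was fetched from Glue beforehand and that the result would be sent to UpdateTable.&lt;/p&gt;

```python
from urllib.parse import unquote_plus


def parse_success_marker(event):
    """Return the partition location and values for a _SUCCESS notification.

    Assumes Hive-style keys such as
    upstream_dataset/snapshot_date=2025-05-31/snapshot_hour=12/_SUCCESS.
    """
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded.
    key = unquote_plus(s3_info["object"]["key"])
    prefix, _, filename = key.rpartition("/")
    if filename != "_SUCCESS":
        raise ValueError(f"Not a completion marker: {key}")
    partition_values = dict(
        segment.split("=", 1) for segment in prefix.split("/") if "=" in segment
    )
    return {
        "location": f"s3://{bucket}/{prefix}/",
        "partition_values": partition_values,
    }


def point_table_at_partition(existing_table_input, columns, location):
    """Return a copy of a Glue TableInput with a new schema and S3 location."""
    storage = dict(existing_table_input["StorageDescriptor"])
    storage["Columns"] = columns
    storage["Location"] = location
    updated = dict(existing_table_input)
    updated["StorageDescriptor"] = storage
    return updated


# Hypothetical event, trimmed to the fields used above.
example_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {
                    "key": "upstream_dataset/snapshot_date%3D2025-05-31/snapshot_hour%3D12/_SUCCESS"
                },
            }
        }
    ]
}
marker = parse_success_marker(example_event)
```

&lt;p&gt;Starting from the existing table definition, rather than building a new one, keeps settings such as the input format and SerDe intact, since UpdateTable replaces the stored definition with whatever is passed in.&lt;/p&gt;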

&lt;p&gt;Data consumers can then simplify their queries to the following, without needing to select snapshot partition attributes:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;upstream_dataset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Significantly simplifies queries for data consumers.&lt;/li&gt;
  &lt;li&gt;Enables always querying the most recent complete partition.&lt;/li&gt;
  &lt;li&gt;Avoids the incomplete/duplicate data issues from Options 1-2 and the query complexity from Options 3-6.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Adds multiple weeks of developer effort to replicate AWS Glue’s functionality for inferring schemas and updating Glue Catalog Tables.&lt;/li&gt;
  &lt;li&gt;Increases backend architecture and code complexity, with more points of failure that need to be maintained and supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this works, the end result may be great from the data consumer side. In many cases, though, this developer time is better spent on other problems, and we’re willing to accept either slightly more complex but well-documented queries for data consumers, or occasional race conditions where recent snapshots may be incomplete if we happen to read while a data sync is in progress.&lt;/p&gt;

&lt;h2 id=&quot;case-study-retrieving-recent-data-when-completeness-is-critical-and-developer-time-is-limited&quot;&gt;Case study: Retrieving recent data when completeness is critical and developer time is limited&lt;/h2&gt;

&lt;p&gt;Take the case where we are limited to a few days of developer time to set up associated infrastructure, while we have the business requirements that:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Data consumers must always be able to query complete and accurate data that is no more than a few hours old.  Missing or duplicate data due to read/write race conditions is not acceptable.&lt;/li&gt;
  &lt;li&gt;Data consumers must be able to aggregate data from multiple upstream sources while satisfying the completeness requirement above, without seeing partial data when querying during data syncs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this scenario, Option 6 may be an acceptable tradeoff, where we leverage existing AWS Glue functionality to automatically surface new partitions and schemas after receiving a data sync completion signal, and point data consumers to the Glue Catalog Table to query partitions that have completed syncs.&lt;/p&gt;

&lt;p&gt;Data consumers will then need to be aware of snapshot partition attributes to select data from the most recent partition, but they can simply query for the most recent partition since we will only surface partitions after receiving a data sync completion signal.&lt;/p&gt;
</description>
        <pubDate>Sat, 31 May 2025 00:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/strategies-for-querying-snapshot-s3-data/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/strategies-for-querying-snapshot-s3-data/</guid>
        
        
        <category>aws</category>
        
      </item>
    
      <item>
        <title>Granting AWS Organization member accounts access to Cost Explorer</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;By default, adding accounts to an AWS Organization results in consolidated billing and cost management in the Organization management account, and Organization member accounts lose access to Cost Explorer, Billing, and other cost management services unless access is explicitly enabled from the Organization management account.&lt;/p&gt;

&lt;p&gt;This post walks through how to allow member accounts to access cost management services, to enable each team to review and manage their AWS spending.&lt;/p&gt;

&lt;h2 id=&quot;enabling-cost-explorer-and-cost-optimization-hub&quot;&gt;Enabling Cost Explorer and Cost Optimization Hub&lt;/h2&gt;

&lt;p&gt;Cost Explorer allows you to analyze AWS service costs and usage changes over time, and is free to access through the AWS UI.  You can filter and group costs and usage across multiple dimensions including AWS service, resource, usage type, region, and tag.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostExplorer.png&quot; alt=&quot;Cost Explorer Screenshot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After logging into the Organization management account using the root user, navigate to the Cost Explorer service to automatically enable it from the current time onwards.  You will need to do this once per account in your Organization, logging in as the root user for each account.&lt;/p&gt;

&lt;p&gt;Cost Optimization Hub is a free service that provides cost optimization recommendations across multiple services.  To give member accounts access, we’ll need to enable the service from the AWS Organizations management account, and explicitly add permissions to the Permission Sets that grant users access.&lt;/p&gt;

&lt;p&gt;To enable Cost Optimization Hub, navigate to the Cost Optimization Hub landing page, scroll down, select “Enable Cost Optimization Hub for this account and all member accounts”, and click “Enable”.&lt;/p&gt;

&lt;p&gt;As indicated by the prompt, to fully benefit from this service you’ll also need to opt into the free AWS Compute Optimizer service to import service rightsizing recommendations.  Follow the link to navigate to AWS Compute Optimizer, and click “Get started”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostOptimizationHub.png&quot; alt=&quot;Prompt to enable Compute Optimizer&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_ComputeOptimizerEnablePage.png&quot; alt=&quot;Compute Optimizer Enable Link&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_ComputeOptimizerEnablePage2.png&quot; alt=&quot;Compute Optimizer Enable Page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Select to opt in all member accounts, and click “Opt in”.&lt;/p&gt;

&lt;h2 id=&quot;enabling-access-for-member-accounts&quot;&gt;Enabling access for member accounts&lt;/h2&gt;

&lt;p&gt;At the top right of the AWS console, select your account name to open the account dropdown, and click “Account”, or directly navigate to &lt;a href=&quot;https://us-east-1.console.aws.amazon.com/billing/home?region=us-east-1#/account&quot;&gt;https://us-east-1.console.aws.amazon.com/billing/home?region=us-east-1#/account&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_AwsAccountDropdown.png&quot; alt=&quot;AWS account dropdown&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Scroll down to the “IAM user and role access to Billing information” section and click “Edit”.  Select “Activate IAM Access” and click “Update”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_AccountSettings_EnableIamAccessToBilling.png&quot; alt=&quot;Enable IAM access to Billing&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Navigate to “Billing and Cost Management” &amp;gt; “Cost Management Preferences”.&lt;/p&gt;

&lt;p&gt;Under the “General” tab, enable the options below, then select “Save preferences” at the bottom of the page and confirm changes.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Linked account access&lt;/li&gt;
  &lt;li&gt;Linked account refunds and credits&lt;/li&gt;
  &lt;li&gt;Linked account discounts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostManagementPrefs_GeneralPrefs.png&quot; alt=&quot;Enable linked accounts access to cost management services&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Under the “Cost Explorer” tab, enable “Granular data” with “Resource-level data at daily granularity”, and select “All services”, then click “Save preferences” at the bottom of the page and confirm changes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostManagementPrefs_CostExplorerPrefs.png&quot; alt=&quot;Enable Cost Explorer data at daily granularity&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Under the “Cost Optimization Hub” tab, under “Organization and member account settings”, select “Enable Cost Optimization Hub for all member accounts” and “Allow member account discount visibility”, then click “Save preferences” and confirm changes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostManagementPrefs_CostOptimizationHubPrefs.png&quot; alt=&quot;Enable linked accounts access to Cost Optimization Hub&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;accessing-cost-explorer-from-member-accounts&quot;&gt;Accessing Cost Explorer from member accounts&lt;/h2&gt;

&lt;p&gt;At this point, team members logging into Organization member accounts to view Cost Explorer will still see access denied errors across all widgets, even if their IAM permissions grant access to all Cost Explorer actions.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostExplorerAccessDenied.png&quot; alt=&quot;Cost Explorer Access Denied&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When clicking on one of the errors, we can see the following message:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;You don’t have permission to [Cost Explorer:]. To request access, copy the following text and send it to your AWS administrator. Learn more about troubleshooting access denied errors&lt;/p&gt;

  &lt;p&gt;User: [REDACTED_ACCOUNT_ID]&lt;/p&gt;

  &lt;p&gt;Service: [Cost Explorer]&lt;/p&gt;

  &lt;p&gt;Name: [AccessDeniedException]&lt;/p&gt;

  &lt;p&gt;HTTP status code: [400]&lt;/p&gt;

  &lt;p&gt;Context: [IAM user access not activated]&lt;/p&gt;

  &lt;p&gt;Request ID: [REDACTED_REQUEST_ID]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To resolve this error, we need to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Log into the Organization member account as the root user.&lt;/li&gt;
  &lt;li&gt;Visit AWS Cost Explorer once to enable this service if we haven’t done this before.&lt;/li&gt;
  &lt;li&gt;Navigate to Account settings via the top-right account dropdown &amp;gt; “Account”.&lt;/li&gt;
  &lt;li&gt;Scroll down to the “IAM user and role access to Billing information” section and click “Edit”.&lt;/li&gt;
  &lt;li&gt;Select “Activate IAM Access” and click “Update”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_AccountSettings_EnableIamAccessToBilling_EditPage.png&quot; alt=&quot;Enable IAM access to billing&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Non-root users with billing permissions should now have access to Cost Explorer, although the page may initially show zero costs and only show correct spending from the current time onwards.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostExplorerZeroCosts.png&quot; alt=&quot;Cost Explorer page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After waiting for a few days, I noticed that the default view still showed zero spending. However, after changing the time period to the last 7 days and the granularity to daily, I could see correct costs for the past few days up to the date when Cost Explorer was enabled from the Organization management account.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostExplorerLast7Days.png&quot; alt=&quot;Cost Explorer filtered to last 7 days&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can now add any filters or groupings we like, such as grouping costs by usage type to see which specific AWS service usage types are contributing to our daily spending.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostExplorerLast7DaysGroupByUsageType.png&quot; alt=&quot;Cost Explorer filtered to last 7 days, grouped by usage type&quot; /&gt;&lt;/p&gt;
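
&lt;p&gt;The same view can also be requested programmatically. The sketch below builds the parameters for Cost Explorer’s GetCostAndUsage API, covering the last 7 days at daily granularity, grouped by usage type. The function name and the choice of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UnblendedCost&lt;/code&gt; metric are assumptions for illustration. Note that, unlike the console, Cost Explorer API requests are billed per request.&lt;/p&gt;

```python
from datetime import date, timedelta


def build_daily_usage_type_query(today, days=7):
    """Build keyword arguments for Cost Explorer's GetCostAndUsage API."""
    return {
        "TimePeriod": {
            # Start is inclusive and End is exclusive, both as YYYY-MM-DD.
            "Start": (today - timedelta(days=days)).isoformat(),
            "End": today.isoformat(),
        },
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


# A fixed date is used here so the example is reproducible.
query = build_daily_usage_type_query(date(2025, 5, 10))
# With boto3 installed and credentials configured:
#   boto3.client("ce").get_cost_and_usage(**query)
```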

&lt;h2 id=&quot;granting-cross-account-user-groups-access-to-cost-explorer&quot;&gt;Granting cross-account user groups access to Cost Explorer&lt;/h2&gt;

&lt;p&gt;We need to explicitly grant user groups access to Cost Explorer, otherwise they will be denied by default.&lt;/p&gt;

&lt;p&gt;Users with full read permissions across all AWS services will be able to view Cost Explorer after the earlier steps in this post, but we may want to allow additional user roles access as well.&lt;/p&gt;

&lt;p&gt;See my last post’s &lt;a href=&quot;/blog/aws-organizations/&quot;&gt;“Creating cross-account role-based permission groups”&lt;/a&gt; section for how to create a Permission Set and assign user groups these permissions on specific linked accounts.&lt;/p&gt;

&lt;p&gt;To manage your Permission Sets, navigate to “IAM Identity Center” &amp;gt; “Multi-account permissions” &amp;gt; “Permission sets”.&lt;/p&gt;

&lt;p&gt;Assuming you have created a Permission Set that does not have explicit permissions to Cost Explorer, and you want that group to have read access to all cost management services, select that Permission Set.&lt;/p&gt;

&lt;p&gt;In this case, I have a SupportUser Permission Set where I want my support team to be able to view cost and usage data for their assigned accounts.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_SupportUserPermissionSetWithoutBillingAccess.png&quot; alt=&quot;SupportUser Permission Set without billing access&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While this Permission Set only has the SupportUser policy assigned, users logging in with this role will be denied access to all Cost Explorer widgets with the error:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;User: [REDACTED_ACCOUNT_ID]&lt;/p&gt;

  &lt;p&gt;Service: [Cost Explorer]&lt;/p&gt;

  &lt;p&gt;Name: [AccessDeniedException]&lt;/p&gt;

  &lt;p&gt;HTTP status code: [400]&lt;/p&gt;

  &lt;p&gt;Context: [User: arn:aws:sts::REDACTED_ACCOUNT_ID:assumed-role/REDACTED_ROLE_ID is not authorized to perform: ce:GetCostAndUsage on resource: arn:aws:ce:us-east-1:REDACTED_ACCOUNT_ID:/GetCostAndUsage because no identity-based policy allows the ce:GetCostAndUsage action]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Under the Permission Set’s “Permissions” &amp;gt; “Inline policy” section, select “Edit”.&lt;/p&gt;

&lt;p&gt;Enter the following IAM policy document, or the scoped permissions you want to grant:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Version&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2012-10-17&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Statement&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Sid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;CostManagementViewAccess&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Effect&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Allow&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Action&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;account:GetAccountInformation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;aws-portal:ViewBilling&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;billing:Get*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;billing:List*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;budgets:ViewBudget&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;budgets:Describe*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ce:Describe*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ce:Get*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ce:List*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;consolidatedbilling:Get*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;consolidatedbilling:List*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;cur:Describe*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;cur:Get*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;freetier:Get*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
                &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sustainability:GetCarbonFootprintSummary&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Resource&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Scroll to the bottom of the page and save your changes.&lt;/p&gt;

&lt;p&gt;Validate that users assigned this Permission Set on a given account now have read access to Billing and Cost Management pages.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250510_CostExplorerAccessAsSupportUser.png&quot; alt=&quot;Support user now able to view Cost Explorer pages&quot; /&gt;&lt;/p&gt;
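
&lt;p&gt;To see why this policy resolves the earlier error, the sketch below checks action names against the wildcard patterns in the Allow statement. It is a simplified illustration only: real IAM evaluation also considers explicit denies, service control policies, and conditions, and the helper name is hypothetical.&lt;/p&gt;

```python
from fnmatch import fnmatchcase

# Action patterns copied from the inline policy above.
ALLOWED_ACTION_PATTERNS = [
    "account:GetAccountInformation",
    "aws-portal:ViewBilling",
    "billing:Get*",
    "billing:List*",
    "budgets:ViewBudget",
    "budgets:Describe*",
    "ce:Describe*",
    "ce:Get*",
    "ce:List*",
    "consolidatedbilling:Get*",
    "consolidatedbilling:List*",
    "cur:Describe*",
    "cur:Get*",
    "freetier:Get*",
    "sustainability:GetCarbonFootprintSummary",
]


def is_action_allowed(action, patterns=ALLOWED_ACTION_PATTERNS):
    """Check an action name against Allow patterns, ignoring case as IAM does."""
    return any(
        fnmatchcase(action.lower(), pattern.lower()) for pattern in patterns
    )


# The action from the earlier AccessDeniedException is covered by "ce:Get*".
read_allowed = is_action_allowed("ce:GetCostAndUsage")
# Write actions are not covered by any of the patterns.
write_allowed = is_action_allowed("ce:CreateCostCategoryDefinition")
```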

&lt;h2 id=&quot;posts-in-this-series&quot;&gt;Posts in this series&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/aws-organizations/&quot;&gt;Using AWS Organizations to standardize security controls across AWS accounts&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;(Current post) Granting AWS Organization member accounts access to Cost Explorer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;AWS Organizations User Guide: &lt;a href=&quot;https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html&quot;&gt;https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Cost Explorer Access User Guide: &lt;a href=&quot;https://docs.aws.amazon.com/cost-management/latest/userguide/ce-access.html&quot;&gt;https://docs.aws.amazon.com/cost-management/latest/userguide/ce-access.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS IAM Identity Center Permission Sets User Guide: &lt;a href=&quot;https://docs.aws.amazon.com/singlesignon/latest/userguide/permissionsetsconcept.html&quot;&gt;https://docs.aws.amazon.com/singlesignon/latest/userguide/permissionsetsconcept.html&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 10 May 2025 00:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/aws-organization-members-cost-explorer-access/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/aws-organization-members-cost-explorer-access/</guid>
        
        
        <category>aws</category>
        
      </item>
    
      <item>
        <title>Using AWS Organizations to standardize security controls across AWS accounts</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;AWS Organizations provides a helpful way to centralize management of AWS accounts, with support for consolidated billing, role-based permission sets, service and resource control policies, and AWS service configurations.&lt;/p&gt;

&lt;p&gt;Service control policies (SCPs) can be used to enforce security controls such as:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Requiring multi-factor authentication (MFA) to complete certain actions&lt;/li&gt;
  &lt;li&gt;Blocking use of the root user outside of the AWS Organization management account&lt;/li&gt;
  &lt;li&gt;Blocking certain changes such as leaving the AWS Organization or disabling security tools&lt;/li&gt;
  &lt;li&gt;Blocking access to specific regions or setting a region allow-list, if your organization has policies restricting where services can be deployed&lt;/li&gt;
  &lt;li&gt;Blocking granting VPCs direct Internet access&lt;/li&gt;
&lt;/ul&gt;
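
&lt;p&gt;As a minimal illustration of one of these controls, an SCP that blocks member accounts from leaving the Organization could look like the policy built below. The variable name and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sid&lt;/code&gt; are arbitrary, and the sketch only prints the JSON document, which would then be pasted into the AWS Organizations console.&lt;/p&gt;

```python
import json

# SCPs use the same policy grammar as IAM policies, without a Principal.
scp_deny_leave_organization = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyLeaveOrganization",
            "Effect": "Deny",
            "Action": "organizations:LeaveOrganization",
            "Resource": "*",
        }
    ],
}

scp_document = json.dumps(scp_deny_leave_organization, indent=4)
print(scp_document)
```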

&lt;p&gt;Resource control policies (RCPs) are supported by S3, STS, KMS, SQS, and Secrets Manager, and can be used to enforce security controls such as:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Requiring all queries against in-scope resources to be over HTTPS&lt;/li&gt;
  &lt;li&gt;Restricting access to your S3 buckets, KMS encryption keys, and Secrets Manager secrets to principals within your AWS Organization&lt;/li&gt;
  &lt;li&gt;Restricting sts:AssumeRoleWithWebIdentity requests to allow-listed OpenID Connect (OIDC) Identity Providers and identities&lt;/li&gt;
  &lt;li&gt;Requiring KMS encryption to be used for all S3 objects&lt;/li&gt;
&lt;/ul&gt;
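
&lt;p&gt;For example, the HTTPS requirement could be expressed with an RCP along the lines of the sketch below, which denies any request to the supported services that is not sent over a secure transport. The variable name and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sid&lt;/code&gt; are arbitrary, and the policy should be reviewed against your own workloads before being attached.&lt;/p&gt;

```python
import json

# Unlike SCPs, RCP statements include a Principal element.
rcp_require_https = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceSecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:*", "sts:*", "kms:*", "sqs:*", "secretsmanager:*"],
            "Resource": "*",
            # Deny whenever the request was not made over HTTPS.
            "Condition": {"BoolIfExists": {"aws:SecureTransport": "false"}},
        }
    ],
}

rcp_document = json.dumps(rcp_require_https, indent=4)
print(rcp_document)
```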

&lt;p&gt;SCPs and RCPs are provided at no cost, although at most 5 SCPs and 5 RCPs can be attached to a given AWS account or AWS Organizational Unit.&lt;/p&gt;

&lt;p&gt;Each AWS Organization has a single management account that can’t be changed, and AWS recommends using an account that has no other resources or workloads for this purpose.  Access to the management account should be limited to admin users that need to make organization changes.  SCPs and RCPs do not apply to the management account.&lt;/p&gt;

&lt;p&gt;There are no costs associated with using AWS Organizations, so while you do need to take care when using it (with great power comes great responsibility), it’s an awesome tool to simplify managing multi-account organizations and services.&lt;/p&gt;

&lt;h2 id=&quot;creating-an-aws-organization&quot;&gt;Creating an AWS Organization&lt;/h2&gt;

&lt;p&gt;Once you’ve created an AWS account that will be specifically for AWS Organization management, log into that account, navigate to the AWS Organizations service and click “Create an Organization”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_01_StarterLandingPage.png&quot; alt=&quot;AWS Organizations Landing Page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Review the linked recommendations, and click “Create an Organization” to proceed.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_02_CreateOrgPage.png&quot; alt=&quot;Create AWS Organization Page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can then view and modify Organization settings and Invite Accounts to join the Organization.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_03_NewOrgPage.png&quot; alt=&quot;New AWS Organization Page&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;standardizing-user-access-via-iam-identity-center&quot;&gt;Standardizing user access via IAM Identity Center&lt;/h2&gt;

&lt;p&gt;From the AWS Organization management account, navigate to the IAM Identity Center service, and enable Identity Center for your Organization if not yet enabled.&lt;/p&gt;

&lt;p&gt;Then, under “Settings” &amp;gt; “Identity source”, configure an Identity Source that matches your use case.&lt;/p&gt;

&lt;p&gt;You can choose between:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Identity Center directory, where you create users and groups in IAM Identity Center, and users sign in through the AWS access portal with usernames and passwords.&lt;/li&gt;
  &lt;li&gt;Active Directory, which many companies already have set up to manage network access controls.&lt;/li&gt;
  &lt;li&gt;Other external identity providers that implement supported identity federation protocols.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For my proof-of-concept, Active Directory and external identity providers aren’t relevant, so I’ll use the native AWS Identity Center directory.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_IamIdentityCenter_01_ChooseIdentitySource.png&quot; alt=&quot;Choose Identity Source&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;enforcing-multi-factor-authentication-mfa&quot;&gt;Enforcing multi-factor-authentication (MFA)&lt;/h2&gt;

&lt;p&gt;Especially if using traditional usernames and passwords, multi-factor authentication is critical to protect your accounts from attackers.&lt;/p&gt;

&lt;p&gt;MFA requires that after a user enters a correct username and password, they also provide additional verification that they are who they say they are.&lt;/p&gt;

&lt;p&gt;Companies such as &lt;a href=&quot;https://security.googleblog.com/2019/05/new-research-how-effective-is-basic.html&quot;&gt;Google&lt;/a&gt; and &lt;a href=&quot;https://www.microsoft.com/en-us/security/blog/2019/08/20/one-simple-action-you-can-take-to-prevent-99-9-percent-of-account-attacks/&quot;&gt;Microsoft&lt;/a&gt; that have enforced MFA have reported 99%+ decreases in successful account take-overs when requiring on-device prompts, and 100% decreases when requiring physical security keys.&lt;/p&gt;

&lt;p&gt;To enforce MFA for your Organization:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Navigate to “IAM Identity Center” &amp;gt; “Settings” &amp;gt; “Authentication”.&lt;/li&gt;
  &lt;li&gt;Under “Multi-factor authentication”, click “Configure”.&lt;/li&gt;
  &lt;li&gt;Under “Prompt users for MFA”, select “Every time they sign in”.&lt;/li&gt;
  &lt;li&gt;Under “Users can authenticate with these MFA types”, select one or more of the following:
    &lt;ul&gt;
      &lt;li&gt;Security keys (eg. YubiKey) and built-in authenticators (eg. Apple TouchID) - recommend always enabling so that users can choose the strongest MFA type available; or&lt;/li&gt;
      &lt;li&gt;Authenticator apps (eg. Authy, Google Authenticator) - good back-up if not all users can use the first option.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Under “If a user does not yet have a registered MFA device”, select “Require them to register an MFA device at sign in”.&lt;/li&gt;
  &lt;li&gt;Click “Save changes”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_IamIdentityCenter_02_EnforceMfa.png&quot; alt=&quot;Enforce MFA&quot; /&gt;&lt;/p&gt;

&lt;p&gt;MFA will now be automatically enforced for all users.&lt;/p&gt;

&lt;h2 id=&quot;adding-aws-accounts-to-an-aws-organization&quot;&gt;Adding AWS accounts to an AWS Organization&lt;/h2&gt;

&lt;p&gt;From the AWS Organization’s “AWS accounts” page, click “Add an AWS account”.&lt;/p&gt;

&lt;p&gt;From here, you can either create new AWS accounts under the Organization, or invite existing AWS accounts to join the Organization.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_04_InviteAccountPage.png&quot; alt=&quot;Invite AWS Account Page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To invite an existing account, click “Invite an existing AWS account”, enter the account’s details, and click “Send invitation”.&lt;/p&gt;

&lt;p&gt;You can then log into the invited AWS account, select “Invitations”, and accept or decline the invite.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_05_MemberAccountInvitation.png&quot; alt=&quot;View Invite to AWS Organization&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After accepting the invite, billing for the member account will be managed by the Organization’s management account, and any Organization-level security controls and AWS service configurations will be automatically applied.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_06_MemberAccountInviteAccepted.png&quot; alt=&quot;Invite Accepted Page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Member accounts can leave the Organization at any time from their AWS Organizations dashboard, unless the Organization has enforced a service control policy (SCP) blocking this action.  It’s a good practice to implement such an SCP to prevent actors who’ve compromised an account from leaving the Organization to disable its security controls.&lt;/p&gt;

&lt;h2 id=&quot;creating-cross-account-role-based-permission-groups&quot;&gt;Creating cross-account role-based permission groups&lt;/h2&gt;

&lt;p&gt;A permission set is a collection of IAM policies and permission boundaries that defines the access that will be granted to a logged in user.&lt;/p&gt;

&lt;p&gt;Multiple permission sets can be created and assigned to user groups, so that those users can log into an AWS account with any one of the permission sets that have been made available to them.&lt;/p&gt;

&lt;p&gt;To manage your permission sets, navigate to “IAM Identity Center” &amp;gt; “Multi-account permissions” &amp;gt; “Permission sets”.&lt;/p&gt;

&lt;p&gt;If you click “Create permission set”, you can select from a number of AWS-defined template permission sets, or create a custom permission set with any combination of managed and inline IAM policies and an optional permission boundary.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_IamIdentityCenter_04_CreatePermissionSet.png&quot; alt=&quot;Create Permission Set&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For example, we could create the following predefined permission sets for our Organization:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;AdministratorAccess: provides full access to AWS services and resources&lt;/li&gt;
  &lt;li&gt;PowerUserAccess: provides full access to AWS services and resources, but does not allow management of Users and groups&lt;/li&gt;
  &lt;li&gt;SupportUser: provides permissions to troubleshoot and resolve issues in an AWS account, and contact AWS support to create and manage cases&lt;/li&gt;
  &lt;li&gt;ReadOnlyAccess: provides read-only access to AWS services and resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_IamIdentityCenter_03_PermissionSets.png&quot; alt=&quot;Permission Sets List View&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can then create a user group and grant that group a subset of these permission sets on specific member accounts in our Organization, depending on the level of access they need on the given accounts for their business functions.&lt;/p&gt;

&lt;p&gt;Under “IAM Identity Center” &amp;gt; “Groups”, you can manage the groups to which you assign role-based permissions. These groups can be assigned to any selection of Organizational Units or accounts within your AWS Organization.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_IamIdentityCenter_05_GroupsList.png&quot; alt=&quot;Groups List View&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To create a new group, select “Create group”, enter a group name and optional description, optionally add users, and click “Create group”.&lt;/p&gt;

&lt;p&gt;Then open that group and under AWS accounts, select “Assign accounts”.&lt;/p&gt;

&lt;p&gt;Select the accounts and the permission sets that the group should have on those accounts, then click “Assign” to apply the permissions.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_IamIdentityCenter_06_GroupAssignPermissionSets.png&quot; alt=&quot;Assign permission sets and accounts to a group&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can then add users to the group, and after they log into their provided AWS access portal, they will be able to select an account and a permission set out of those assigned to their group for that account.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_UserLoginPermissionSetsView.png&quot; alt=&quot;User portal with accounts and permission sets&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A common use case for setting up multiple permission sets on a group is to allow developers to log into accounts with read-only access by default to validate workflows, and only have them log in with admin permissions when necessary to manually remediate issues.&lt;/p&gt;

&lt;h2 id=&quot;setting-up-service-control-policies&quot;&gt;Setting up Service Control Policies&lt;/h2&gt;

&lt;p&gt;From your AWS Organization management account, navigate to “AWS Organizations” &amp;gt; “Policies”.&lt;/p&gt;

&lt;p&gt;Click “Service control policies” and enable this feature.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_07_SecurityControlPoliciesPage.png&quot; alt=&quot;Service Control Policies page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Click “Create policy” and create a policy based on your use cases.&lt;/p&gt;

&lt;p&gt;For example, to create a policy that blocks member accounts from leaving the AWS Organization, disabling or modifying GuardDuty config, or attaching their VPCs directly to the Internet, we can create an SCP with:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Policy name: DenyBypassOrgSecurityControls
    &lt;ul&gt;
      &lt;li&gt;Note: You will likely want a more specific policy and name; I’m using this one for testing purposes.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Description: Prevent member accounts from leaving the AWS Organization, disabling or modifying GuardDuty configurations, or opening direct VPC access to the Internet.&lt;/li&gt;
  &lt;li&gt;Policy JSON:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Version&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2012-10-17&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Statement&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Sid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;DenyLeaveOrganization&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Effect&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Deny&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Action&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;organizations:LeaveOrganization&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Resource&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Sid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;DenyModifyGuardDuty&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Effect&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Deny&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Action&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:AcceptInvitation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:ArchiveFindings&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:CreateDetector&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:CreateFilter&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:CreateIPSet&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:CreateMembers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:CreatePublishingDestination&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:CreateSampleFindings&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:CreateThreatIntelSet&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DeclineInvitations&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DeleteDetector&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DeleteFilter&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DeleteInvitations&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DeleteIPSet&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DeleteMembers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DeletePublishingDestination&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DeleteThreatIntelSet&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DisassociateFromMasterAccount&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:DisassociateMembers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:InviteMembers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:StartMonitoringMembers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:StopMonitoringMembers&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:TagResource&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:UnarchiveFindings&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:UntagResource&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:UpdateDetector&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:UpdateFilter&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:UpdateFindingsFeedback&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:UpdateIPSet&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:UpdatePublishingDestination&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;guardduty:UpdateThreatIntelSet&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Resource&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Sid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;DenyOpenVpcInternetAccess&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Effect&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Deny&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Action&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ec2:AttachInternetGateway&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ec2:CreateInternetGateway&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ec2:CreateEgressOnlyInternetGateway&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ec2:CreateVpcPeeringConnection&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ec2:AcceptVpcPeeringConnection&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;globalaccelerator:Create*&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;globalaccelerator:Update*&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;Resource&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;*&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Click “Create policy” to save your changes.&lt;/p&gt;

&lt;p&gt;Then, from the “Service control policies” page, select your created policy and select “Actions” &amp;gt; “Attach Policy”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_08_SecurityControlPoliciesPage_AttachPolicyDropdown.png&quot; alt=&quot;Service Control Policies actions&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Select the Organizational Units or specific accounts you want to apply the policy to, and click “Attach policy”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrgs_09_SecurityControlPolicies_AttachPolicyPage.png&quot; alt=&quot;Service Control Policies attach policy page&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The security controls will now be automatically enforced for the associated member accounts.&lt;/p&gt;

&lt;p&gt;We can test this by logging into a member account using the AdministratorAccess role, navigating to AWS Organizations, and selecting to leave the Organization.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20250507_AwsOrg_ScpValidation_ErrorLeavingOrg.png&quot; alt=&quot;Error leaving AWS Organization&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As shown above, the Service Control Policy successfully blocks member accounts from leaving the Organization.  Removing member accounts can now only be initiated from the Organization management account, which reduces the risk of a compromised account bypassing Organization security policies.&lt;/p&gt;
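
&lt;p&gt;As a complement to the console test, the policy JSON itself can be sanity-checked locally before it is attached. The following is a minimal Python sketch rather than any AWS-provided tooling: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summarize_scp&lt;/code&gt; helper and the abbreviated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POLICY_JSON&lt;/code&gt; text are hypothetical, and it assumes a deny-list style SCP in which every statement uses the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Deny&lt;/code&gt; effect.&lt;/p&gt;

```python
import json
from collections import defaultdict

# Abbreviated, illustrative copy of the SCP from this post.
# In practice, paste the full policy JSON here.
POLICY_JSON = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLeaveOrganization",
      "Effect": "Deny",
      "Action": ["organizations:LeaveOrganization"],
      "Resource": "*"
    },
    {
      "Sid": "DenyOpenVpcInternetAccess",
      "Effect": "Deny",
      "Action": ["ec2:AttachInternetGateway", "ec2:CreateInternetGateway"],
      "Resource": "*"
    }
  ]
}
"""


def summarize_scp(policy_text):
    """Hypothetical helper: map each Sid to its denied actions, grouped by service prefix.

    Raises ValueError if the JSON is malformed or a statement is not a Deny
    (an assumption that only fits deny-list style SCPs).
    """
    policy = json.loads(policy_text)  # json.JSONDecodeError is a ValueError
    summary = {}
    for statement in policy["Statement"]:
        sid = statement["Sid"]
        if statement["Effect"] != "Deny":
            raise ValueError("Statement " + sid + " is not a Deny statement")
        actions = statement["Action"]
        if isinstance(actions, str):
            # IAM policy syntax allows a single action as a bare string.
            actions = [actions]
        by_service = defaultdict(list)
        for action in actions:
            service, _, name = action.partition(":")
            by_service[service].append(name)
        summary[sid] = dict(by_service)
    return summary


if __name__ == "__main__":
    for sid, services in summarize_scp(POLICY_JSON).items():
        print(sid, services)
```

&lt;p&gt;The result is a dictionary keyed by statement ID, which makes it easier to eyeball which services each statement covers before attaching the policy to an Organizational Unit or account.&lt;/p&gt;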

&lt;h2 id=&quot;posts-in-this-series&quot;&gt;Posts in this series&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;(Current post) Using AWS Organizations to standardize security controls across AWS accounts&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/blog/aws-organization-members-cost-explorer-access/&quot;&gt;Granting AWS Organization member accounts access to Cost Explorer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;AWS Organizations User Guide: &lt;a href=&quot;https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html&quot;&gt;https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Organizations Best Practices: &lt;a href=&quot;https://docs.aws.amazon.com/organizations/latest/userguide/orgs_best-practices_mgmt-acct.html&quot;&gt;https://docs.aws.amazon.com/organizations/latest/userguide/orgs_best-practices_mgmt-acct.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Organizations Service Control Policies User Guide: &lt;a href=&quot;https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html&quot;&gt;https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Organizations Service Control Policies Best Practices: &lt;a href=&quot;https://aws.amazon.com/blogs/industries/best-practices-for-aws-organizations-service-control-policies-in-a-multi-account-environment/&quot;&gt;https://aws.amazon.com/blogs/industries/best-practices-for-aws-organizations-service-control-policies-in-a-multi-account-environment/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS IAM Identity Center MFA User Guide: &lt;a href=&quot;https://docs.aws.amazon.com/singlesignon/latest/userguide/mfa-configure.html&quot;&gt;https://docs.aws.amazon.com/singlesignon/latest/userguide/mfa-configure.html&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Wed, 07 May 2025 00:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/aws-organizations/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/aws-organizations/</guid>
        
        
        <category>aws</category>
        
      </item>
    
      <item>
        <title>Reducing Lambda latency by 76% with AWS Lambda Power Tuning</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Optimizing AWS Lambda memory capacity can cut customer-facing latencies by a factor of 2 to 5 without significantly increasing hardware costs.  However, this takes trial and error, and many teams just pick an amount of memory and stick with it, leaving their services several times slower than necessary.&lt;/p&gt;

&lt;p&gt;Other teams spend hours setting up custom code and metrics to measure latencies for each of their service’s use cases, benchmark each use case against various memory capacities, and use the AWS Cost Estimator or AWS Lambda pricing documentation to estimate costs and choose the amount of memory with the best latency-to-cost tradeoff.&lt;/p&gt;

&lt;p&gt;This is no longer necessary with the AWS Lambda Power Tuning tool, which can be run against any Lambda function in your AWS account to automatically determine the optimal memory capacity that minimizes execution latency and/or hardware costs.&lt;/p&gt;

&lt;p&gt;There is no cost to deploy and run the tool beyond its underlying hardware costs, which are likely free if you only run it a few times before deleting it from your account.&lt;/p&gt;

&lt;p&gt;Since it only relies on AWS-infrastructure-level API calls, the tool works regardless of which programming language your Lambda function uses, and doesn’t require any modifications to your service infrastructure or code.&lt;/p&gt;

&lt;h2 id=&quot;set-up&quot;&gt;Set up&lt;/h2&gt;

&lt;p&gt;The AWS Lambda Power Tuning &lt;a href=&quot;https://github.com/alexcasalboni/aws-lambda-power-tuning&quot;&gt;GitHub repo&lt;/a&gt; documents multiple ways to deploy the tool, using either the AWS Serverless Application Repository (simplest), AWS SAM CLI, AWS CDK, or Terraform.&lt;/p&gt;

&lt;p&gt;I used the AWS Serverless Application Repository since this reduces set-up to a few button clicks, and I planned to tear down the tool after optimizing my Lambda function.&lt;/p&gt;

&lt;p&gt;To use this deployment option, you can simply log into your AWS account, visit &lt;a href=&quot;https://serverlessrepo.aws.amazon.com/applications/arn:aws:serverlessrepo:us-east-1:451282441545:applications~aws-lambda-power-tuning&quot;&gt;https://serverlessrepo.aws.amazon.com/applications/arn:aws:serverlessrepo:us-east-1:451282441545:applications~aws-lambda-power-tuning&lt;/a&gt;, and click Deploy.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20240714_AWSLambdaPowerTuning_AWSServerlessRepoAppPage.png&quot; alt=&quot;AWS Serverless Repo application page for the AWS Lambda Power Tuning tool&quot; title=&quot;AWS Serverless Repo application page for the AWS Lambda Power Tuning tool&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This will create an AWS Lambda Application that encapsulates all the infrastructure for the tuning tool, including the AWS Step Functions State Machine that you’ll invoke to run the benchmark tests.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20240714_AWSLambdaPowerTuning_AppResources.png&quot; alt=&quot;AWS Lambda Power Tuning application resources&quot; title=&quot;AWS Lambda Power Tuning application resources&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Click on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;powerTuningStateMachine&lt;/code&gt; resource to open the state machine, click &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Start Execution&lt;/code&gt;, and enter the JSON payload to run the benchmark test with.  The input parameters are documented in the tool’s &lt;a href=&quot;https://github.com/alexcasalboni/aws-lambda-power-tuning&quot;&gt;GitHub README&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, the following payload runs the tool against the given Lambda function, with 15 executions each for 512, 1024, 1536, 2048, and 3008 MB of memory, with a function payload specific to my API service, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;balanced&lt;/code&gt; optimization strategy.&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;lambdaARN&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;arn:aws:lambda:us-west-2:123456789012:function:TestLambdaFunctionName&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;powerValues&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;512&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1024&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1536&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2048&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3008&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;num&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;payload&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;resource&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/v1/consent-management/services/{serviceId}/users/{userId}/consents&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;path&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/v1/consent-management/services/TestServiceId/users/TestUserId/consents&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;httpMethod&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;GET&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;pathParameters&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;serviceId&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;TestServiceId&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;userId&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;TestUserId&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;requestContext&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;resourceId&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1abc2d&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;resourcePath&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/v1/consent-management/services/{serviceId}/users/{userId}/consents&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;operationName&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ListServiceUserConsent&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;httpMethod&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;GET&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;path&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/v1/consent-management/services/{serviceId}/users/{userId}/consents&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;accountId&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;123456789012&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;protocol&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;HTTP/1.1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;stage&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;test&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;parallelInvocation&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;strategy&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;balanced&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallelInvocation&lt;/code&gt; to false after observing Lambda throttling errors with it set to true, since my test Lambda isn’t currently provisioned for high load. I set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strategy&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;balanced&lt;/code&gt; to weight latency and cost equally; you can also configure the tool to optimize for only one of them or to use a different weighting.&lt;/p&gt;

&lt;h2 id=&quot;analyzing-results&quot;&gt;Analyzing results&lt;/h2&gt;

&lt;p&gt;Once the execution completes, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Execution input and output&lt;/code&gt; tab will display the recommended amount of memory as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;power&lt;/code&gt; value, the resulting average latency in milliseconds and cost per execution, and the URL to a more detailed visualization.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20240714_AWSLambdaPowerTuning_ExecutionOutput.png&quot; alt=&quot;AWS Lambda Power Tuning execution output&quot; title=&quot;AWS Lambda Power Tuning execution output&quot; /&gt;&lt;/p&gt;

&lt;p&gt;By navigating to that URL, we can view a graph of average latency and execution costs for each amount of memory measured, along with summarized best and worst memories for latency and cost.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20240714_AWSLambdaPowerTuningResults.png&quot; alt=&quot;AWS Lambda Power Tuning results visualization&quot; title=&quot;AWS Lambda Power Tuning results visualization&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this case, for my Lambda function, which is written in Java and queries a DynamoDB table, 2048 MB of memory resulted in the lowest average latency, while 1024 MB of memory had the lowest runtime costs.&lt;/p&gt;

&lt;p&gt;We can see that 512 MB actually costs more than 1024 MB. This is because the duration at 512 MB is several times higher, which results in higher GB-second charges despite the smaller memory allocation.&lt;/p&gt;
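
&lt;p&gt;As a rough sketch of why this can happen: Lambda’s duration charge is approximately the memory in GB, multiplied by the billed duration in seconds, multiplied by a per-GB-second price. The snippet below assumes the x86 price of $0.0000166667 per GB-second (check the pricing page for your region and architecture), and the durations in its main method are hypothetical rather than measurements from this run.&lt;/p&gt;

```java
/** Rough estimate of AWS Lambda duration charges; excludes per-request fees. */
public class LambdaCostEstimate {
    // Assumed price per GB-second (x86); verify against the AWS Lambda pricing page.
    static final double PRICE_PER_GB_SECOND = 0.0000166667;

    /** Duration cost in USD of one invocation at the given memory size and billed duration. */
    static double costPerInvocation(final int memoryMb, final double billedDurationMs) {
        final double memoryGb = memoryMb / 1024.0;
        final double durationSeconds = billedDurationMs / 1000.0;
        return memoryGb * durationSeconds * PRICE_PER_GB_SECOND;
    }

    public static void main(final String[] args) {
        // Hypothetical durations: the smaller allocation runs three times longer.
        System.out.println(costPerInvocation(512, 60.0));
        System.out.println(costPerInvocation(1024, 20.0));
    }
}
```

&lt;p&gt;With these hypothetical inputs, the 512 MB configuration is billed for 0.03 GB-seconds per invocation versus 0.02 GB-seconds for 1024 MB, so halving memory only saves money if the duration less than doubles.&lt;/p&gt;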

&lt;p&gt;This was only run for 15 iterations per memory allocation, so I increased the sample size and reran against 1024, 1536, and 2048 MB by setting powerValues and num to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;powerValues&quot;: [1024, 1536, 2048], &quot;num&quot;: 50&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I executed the Lambda function a couple of times first with a test payload to eliminate cold starts as a confounding factor, and then ran the state machine with the new config, which resulted in the following output and visualization:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;power&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1536&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;cost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;3.2760000000000005e-7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;duration&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;12.266666666666667&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;stateMachine&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;executionCost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.00023&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;lambdaCost&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.00012891480000000002&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;visualization&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;https://lambda-power-tuning.show/#AAQABgAI;3t2NQUREREFERExB;ilmiNADhrzRWgeo0&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The more detailed visualization indicates that for our particular use case, we’re unlikely to see significant performance improvements from increasing memory above 1536 MB, and the marginal cost increase from 1024 MB to 1536 MB is acceptable for us.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20240714_AWSLambdaPowerTuningResultsRun2.png&quot; alt=&quot;AWS Lambda Power Tuning results visualization for second run&quot; title=&quot;AWS Lambda Power Tuning results visualization for second run&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can see a more detailed table view of the underlying data by going to the step function execution’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Detail&lt;/code&gt; tab, selecting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Table&lt;/code&gt; view, selecting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Analyzer&lt;/code&gt; task, and selecting the Analyzer panel’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Output&lt;/code&gt; tab.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20240714_AWSLambdaPowerTuningResultsAnalyzerDetails.png&quot; alt=&quot;AWS Lambda Power Tuning results table view for second run&quot; title=&quot;AWS Lambda Power Tuning results table view for second run&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;tear-down&quot;&gt;Tear-down&lt;/h2&gt;

&lt;p&gt;When you no longer need the tool, you can open the AWS CloudFormation console and delete the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;serverlessrepo-aws-lambda-power-tuning&lt;/code&gt; CloudFormation stack.&lt;/p&gt;

&lt;h2 id=&quot;outcome&quot;&gt;Outcome&lt;/h2&gt;

&lt;p&gt;The tool took under 10 minutes to deploy, execute, and fine-tune, and resulted in me changing my test Lambda’s memory allocation from 512 MB to 1536 MB.&lt;/p&gt;

&lt;p&gt;This lowered my API’s average latency from 50ms to 12ms, a 4.17x improvement, or equivalently a 76% latency reduction.  Duration costs increased by 8% to $0.3276/million executions, which is minimal at my service’s scale.&lt;/p&gt;
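
&lt;p&gt;For reference, these figures follow from simple arithmetic on the rounded latencies and the per-execution cost reported in the tool’s output. A small sketch, with class and method names chosen only for illustration:&lt;/p&gt;

```java
/** Illustrative arithmetic behind the latency and cost figures quoted above. */
public class TuningOutcome {
    /** How many times faster the new latency is than the old one. */
    static double speedup(final double beforeMs, final double afterMs) {
        return beforeMs / afterMs;
    }

    /** Percentage by which latency dropped. */
    static double reductionPercent(final double beforeMs, final double afterMs) {
        return (1.0 - afterMs / beforeMs) * 100.0;
    }

    /** Scales a per-execution cost in USD to a cost per million executions. */
    static double costPerMillion(final double costPerExecution) {
        return costPerExecution * 1_000_000;
    }

    public static void main(final String[] args) {
        System.out.println(String.format("%.2fx faster", speedup(50.0, 12.0)));
        System.out.println(String.format("%.0f%% lower latency", reductionPercent(50.0, 12.0)));
        System.out.println(String.format("$%.4f per million executions", costPerMillion(3.276e-7)));
    }
}
```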

&lt;p&gt;Given the latency improvements of choosing the right amount of memory, and how easy this tool is to use, I’d recommend it to anyone building services on AWS Lambda.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;AWS Lambda docs introducing AWS Lambda Power Tuning: &lt;a href=&quot;https://docs.aws.amazon.com/lambda/latest/operatorguide/profile-functions.html&quot;&gt;https://docs.aws.amazon.com/lambda/latest/operatorguide/profile-functions.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Lambda Power Tuning GitHub repository with usage details: &lt;a href=&quot;https://github.com/alexcasalboni/aws-lambda-power-tuning&quot;&gt;https://github.com/alexcasalboni/aws-lambda-power-tuning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Lambda pricing: &lt;a href=&quot;https://aws.amazon.com/lambda/pricing/&quot;&gt;https://aws.amazon.com/lambda/pricing/&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Sun, 14 Jul 2024 00:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/lambda-power-tuning/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/lambda-power-tuning/</guid>
        
        
        <category>aws</category>
        
      </item>
    
      <item>
        <title>Serializing and deserializing DynamoDB pagination tokens to support paginated APIs</title>
        <description>&lt;p&gt;When using AWS’s Java 2.x SDK, DynamoDB scan and query responses provide pagination tokens in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Map&amp;lt;String, AttributeValue&amp;gt; lastEvaluatedKey&lt;/code&gt; object, which represents the primary key of the last processed DynamoDB item.  You can then pass this value as the “exclusive start key” for the next query to get the next page of results.&lt;/p&gt;

&lt;p&gt;When your service retrieves all pages of results locally, this isn’t a problem.  However, when you want to provide a paginated API backed by DynamoDB, you’ll need to convert this attribute value map into a format that can be passed over HTTP, AKA “serialize” the object into a string.&lt;/p&gt;

&lt;p&gt;When your client requests the next page of results with that string pagination token, you’ll also need to convert that string back into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Map&amp;lt;String, AttributeValue&amp;gt;&lt;/code&gt; format that the AWS SDK expects, AKA “deserialize” the string to the original data structure.&lt;/p&gt;

&lt;h2 id=&quot;prior-method-for-serializingdeserializing-pagination-tokens&quot;&gt;Prior method for serializing/deserializing pagination tokens&lt;/h2&gt;

&lt;p&gt;Before May 2023, building paginated APIs backed by DynamoDB was not very convenient, as you’d have to build your own custom serialization and deserialization code.&lt;/p&gt;

&lt;p&gt;Example implementation using Immutables and Jackson, with a sample DynamoDB table primary key that has both a partition key and a sort key:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.fasterxml.jackson.core.JsonParseException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.fasterxml.jackson.core.JsonProcessingException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.fasterxml.jackson.databind.ObjectMapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.fasterxml.jackson.databind.annotation.JsonDeserialize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;com.fasterxml.jackson.databind.annotation.JsonSerialize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.immutables.value.Value.Immutable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.immutables.value.Value.Parameter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.immutables.value.Value.Style&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;software.amazon.awssdk.services.dynamodb.model.AttributeValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.io.IOException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.Base64&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/**
 * Serializable representation of a Product DynamoDB pagination token.
 * Using Immutables to generate safe, immutable value objects.
 * @see https://immutables.github.io/
 */&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@JsonDeserialize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProductNextTokenBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@JsonSerialize&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Style&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;visibility&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Style&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;ImplementationVisibility&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;PRIVATE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Immutable&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;interface&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProductNextToken&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;@Parameter&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getPartitionKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;nd&quot;&gt;@Parameter&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getSortKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;cm&quot;&gt;/**
 * Class encapsulating logic to convert DynamoDB pagination tokens between attribute value
 * maps used by the AWS SDK, and string values that can be passed over HTTP.
 */&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProductSerializer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;PRODUCT_TABLE_PARTITION_KEY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;YourDynamoDBTablePartitionKeyName&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;PRODUCT_TABLE_SORT_KEY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;YourDynamoDBTableSortKeyName&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ObjectMapper&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;objectMapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;nc&quot;&gt;ProductSerializer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ObjectMapper&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;objectMapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;objectMapper&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;objectMapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/**
     * Serialize a lastEvaluatedKey from an attribute value map to a string.
     *
     * @param lastEvaluatedKey attribute map returned by paginated DynamoDB queries.
     * @return serialized String token that can be passed over HTTP.
     * @throws JsonProcessingException exception thrown if unable to serialize the key.
     */&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;serializeLastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AttributeValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;JsonProcessingException&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lastEvaluatedKey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

        &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProductNextToken&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokenObject&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProductNextTokenBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;partitionKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;PRODUCT_TABLE_PARTITION_KEY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sortKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;PRODUCT_TABLE_SORT_KEY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Base64&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getUrlEncoder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;encodeToString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;objectMapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;writeValueAsBytes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenObject&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/**
     * Deserialize a lastEvaluatedKey from a string to an attribute value map.
     *
     * @param encodedLastEvaluatedKey serialized String token received over HTTP.
     * @return attribute value map that can be passed to DynamoDB as the exclusive start key.
     * @throws IOException exception thrown if unable to decode encodedLastEvaluatedKey.
     * @throws JsonParseException exception thrown if unable to deserialize the decoded key into a ProductNextToken.
     */&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AttributeValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;deserializeLastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encodedLastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;IOException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;JsonParseException&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encodedLastEvaluatedKey&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

      &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProductNextToken&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deserializedToken&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;objectMapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;readValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
          &lt;span class=&quot;nc&quot;&gt;Base64&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getUrlDecoder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encodedLastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt;
          &lt;span class=&quot;nc&quot;&gt;ProductNextToken&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

      &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AttributeValue&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partitionKeyValue&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AttributeValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;deserializedToken&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getPartitionKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;

      &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AttributeValue&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sortKeyValue&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AttributeValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;deserializedToken&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getSortKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;

      &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
          &lt;span class=&quot;no&quot;&gt;PRODUCT_TABLE_PARTITION_KEY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partitionKeyValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;no&quot;&gt;PRODUCT_TABLE_SORT_KEY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sortKeyValue&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a lot of code to maintain and test, with multiple exception cases.  We can remove the dependency on specific key structure by generalizing the code to iterate over the map and JSON key-value pairs, as shown in &lt;a href=&quot;https://github.com/aws/aws-sdk-java-v2/issues/3224&quot;&gt;https://github.com/aws/aws-sdk-java-v2/issues/3224&lt;/a&gt;, but this is still more complex than should be necessary for what we’d prefer to be simple “stringify” and “unstringify” methods.&lt;/p&gt;

&lt;h2 id=&quot;serializationdeserialization-with-the-dynamodb-enhanced-document-library&quot;&gt;Serialization/deserialization with the DynamoDB Enhanced Document library&lt;/h2&gt;

&lt;p&gt;Since May 2023, AWS’s Java 2.x SDK includes an Enhanced Document library that simplifies converting pagination tokens between the AWS SDK’s objects and JSON strings that can be passed over HTTP.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/enhanced/dynamodb/document/EnhancedDocument.html&quot;&gt;software.amazon.awssdk.enhanced.dynamodb.document.EnhancedDocument&lt;/a&gt; class includes utility methods that make serialization and deserialization one-liners.&lt;/p&gt;

&lt;p&gt;AWS blog post demonstrating use cases: &lt;a href=&quot;https://aws.amazon.com/blogs/devops/introducing-the-enhanced-document-api-for-dynamodb-in-the-aws-sdk-for-java-2-x/&quot;&gt;https://aws.amazon.com/blogs/devops/introducing-the-enhanced-document-api-for-dynamodb-in-the-aws-sdk-for-java-2-x/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sample code for converting between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Map&amp;lt;String, AttributeValue&amp;gt;&lt;/code&gt; pagination tokens and JSON strings:&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;software.amazon.awssdk.enhanced.dynamodb.document.EnhancedDocument&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;software.amazon.awssdk.services.dynamodb.model.AttributeValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

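&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.io.UncheckedIOException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;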

&lt;span class=&quot;cm&quot;&gt;/**
 * Class encapsulating logic to convert DynamoDB pagination tokens between attribute value
 * maps used by the AWS SDK, and string values that can be passed over HTTP.
 */&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ProductSerializer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;cm&quot;&gt;/**
      * Convert a DynamoDB attribute value map to a JSON string.
      * @param attributeValueMap DynamoDB item key represented as a map from attribute names to attribute values
      * @return String JSON string representation of the DynamoDB item key
      */&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;serializeLastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AttributeValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attributeValueMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;EnhancedDocument&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;fromAttributeValueMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attributeValueMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toJson&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;cm&quot;&gt;/**
      * Convert a JSON string representation of a DynamoDB pagination token to the format required by DynamoDB API calls.
      * @param paginationTokenJson JSON string representing the last paginated API call&apos;s last evaluated record key
      * @return Map&amp;lt;String, AttributeValue&amp;gt; exclusive start key for the next paginated DynamoDB scan/query API call
      * @throws UncheckedIOException thrown if the pagination token cannot be parsed
      */&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AttributeValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;deserializeLastEvaluatedKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;paginationTokenJson&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;UncheckedIOException&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;EnhancedDocument&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;fromJson&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;paginationTokenJson&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is much more manageable, with serialization and deserialization functionality now provided out-of-the-box as part of the standard AWS SDK.&lt;/p&gt;

&lt;p&gt;We can pass this serialized JSON string to our API clients as the client-facing pagination token.  Optionally, if it’s important to us to obfuscate our internal DynamoDB key structure from clients, we can add back a Base64 encode/decode layer on top of the JSON strings using the same code snippets from the earlier example.&lt;/p&gt;
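&lt;p&gt;As a minimal sketch of that optional layer, assuming the JSON string produced by the serializer above, the JDK’s java.util.Base64 URL-safe encoder can wrap and unwrap the token.  The class name, method names, and sample key shape are illustrative assumptions, not taken from the earlier example.&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper that wraps a JSON pagination token in URL-safe Base64. */
final class PaginationTokenCodec {
    /** Encode the JSON token so it can be passed as an opaque query parameter. */
    static String encode(final String paginationTokenJson) {
        return Base64.getUrlEncoder()
                .withoutPadding()
                .encodeToString(paginationTokenJson.getBytes(StandardCharsets.UTF_8));
    }

    /** Decode a client-supplied token; throws IllegalArgumentException if it is not valid Base64. */
    static String decode(final String encodedToken) {
        return new String(Base64.getUrlDecoder().decode(encodedToken), StandardCharsets.UTF_8);
    }

    public static void main(final String[] args) {
        // The key attributes below are an assumed shape, for illustration only.
        final String json = "{\"productId\":\"123\",\"createdAt\":\"2024-05-18\"}";
        final String token = encode(json);
        System.out.println(token);
        System.out.println(decode(token).equals(json));
    }
}
```

&lt;p&gt;Base64 is an encoding rather than encryption, so this only hides the key structure from casual inspection.  A client can still decode the token, and the service should validate whatever it receives.&lt;/p&gt;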
</description>
        <pubDate>Sat, 18 May 2024 17:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/serializing-deserializing-dynamodb-pagination-tokens/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/serializing-deserializing-dynamodb-pagination-tokens/</guid>
        
        
        <category>aws</category>
        
      </item>
    
      <item>
        <title>Concurrency from single host applications up to massively distributed services</title>
        <description>&lt;p&gt;Concurrency is when multiple software threads or programs are run at the same time, and is a key aspect of many modern applications.&lt;/p&gt;

&lt;p&gt;Web browsers run dozens of concurrent processes based on your activity, querying servers, downloading files, and executing scripts all at once.&lt;/p&gt;

&lt;p&gt;Online services with millions of active users run a scaled up number of concurrent processes across thousands of servers, with various distributed system design patterns to support this.&lt;/p&gt;

&lt;p&gt;This post will describe several levels of concurrency, how they’re commonly applied, and pros and cons of each approach.&lt;/p&gt;

&lt;h2 id=&quot;levels-of-concurrency&quot;&gt;Levels of concurrency&lt;/h2&gt;
&lt;h3 id=&quot;multi-threaded-applications&quot;&gt;Multi-threaded applications&lt;/h3&gt;
&lt;p&gt;Within application code, we can run multiple threads concurrently.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20231130_DistributedComputeToHandleMillionTps-MultiThreadedApp.png&quot; alt=&quot;alt text&quot; title=&quot;Diagram of a multi-threaded application&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This approach can be locally applied regardless of whether an application is run on a single host or in a distributed service.  However, multi-threaded code increases code complexity and introduces thread safety issues, and an error in one thread may take down the entire application.&lt;/p&gt;

&lt;p&gt;A common use case for multi-threading is when we need to make multiple requests to other services that may each take multiple seconds to complete.  We can trigger each request in a separate thread to run them concurrently, and collect the results at the end of the longest running call, rather than synchronously making one request at a time after the prior response has returned.&lt;/p&gt;
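&lt;p&gt;The pattern above can be sketched roughly as follows, using plain threads and a sleep as a stand-in for each slow request.  The class name, the fetch method, and the durations are hypothetical, and a production service would more likely use a bounded thread pool than one thread per call.&lt;/p&gt;

```java
final class ConcurrentRequests {
    /** Hypothetical stand-in for a slow remote call; it sleeps instead of using the network. */
    static String fetch(final String name, final long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during request " + name, e);
        }
        return name + " done";
    }

    /** Starts one thread per request, waits for all of them, and returns the results in order. */
    static String[] fetchAll(final String[] names, final long[] millis) {
        final String[] results = new String[names.length];
        final Thread[] threads = new Thread[names.length];
        for (int i = 0; i != names.length; i++) {
            final int index = i;
            threads[i] = new Thread(() -> results[index] = fetch(names[index], millis[index]));
            threads[i].start();
        }
        for (final Thread thread : threads) {
            try {
                thread.join(); // join() also makes the thread's write to results visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for requests", e);
            }
        }
        return results;
    }

    public static void main(final String[] args) {
        final String[] names = {"a", "b", "c", "d"};
        final long[] millis = {200, 500, 500, 100}; // scaled-down versions of multi-second calls
        final long start = System.nanoTime();
        final String[] results = fetchAll(names, millis);
        final long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(String.join(", ", results));
        System.out.println("elapsed ms: " + elapsedMillis);
    }
}
```

&lt;p&gt;The elapsed time should land near the longest single call rather than the sum of all four, which is the effect quantified in the next section.&lt;/p&gt;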

&lt;h4 id=&quot;latency-trade-offs&quot;&gt;Latency trade-offs&lt;/h4&gt;

&lt;p&gt;Example 1: Suppose we run 4 requests that take 2 seconds, 5 seconds, 5 seconds, and 1 second respectively, and each thread adds 20 milliseconds of overhead to start and close.  We’ll exclude the time to combine results, since it is equivalent between the multi-threaded and synchronous approaches.  Our runtime with multi-threading will be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max(2, 5, 5, 1) + 0.02*4&lt;/code&gt; = 5.08 seconds, compared to the synchronous approach taking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2 + 5 + 5 + 1&lt;/code&gt; = 13 seconds to make all requests.  In this scenario, multi-threading reduces our latency by 7.92 seconds.&lt;/p&gt;

&lt;p&gt;Example 2: Splitting tasks into threads does not come for free and may not be worthwhile for very short-lived requests.  For example, if we have 1000 requests that each take 0.01 seconds to complete, running each request in a separate thread would take &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max(0.01) + 0.02*1000&lt;/code&gt; = 20.01 seconds, compared to the synchronous approach taking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1000*0.01&lt;/code&gt; = 10 seconds.  In this case, the synchronous approach is about twice as fast as multi-threading.&lt;/p&gt;

&lt;p&gt;Since starting a thread per request is this costly, in practice we’ll typically break the workflow up into batches of requests per thread, such as 200 requests per thread.&lt;/p&gt;

&lt;p&gt;Example 2b: Given 1000 requests that each take 0.01 seconds to complete, if we split the work into 5 batches of 200 requests per thread, computing all the results would take &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max(200*0.01, 200*0.01, 200*0.01, 200*0.01, 200*0.01) + 0.02*5&lt;/code&gt; = 2.1 seconds, compared to the synchronous approach taking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1000*0.01&lt;/code&gt; = 10 seconds.  By batching the work before applying multi-threading, we can reduce latency compared to synchronous calls by 7.9 seconds.&lt;/p&gt;
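&lt;p&gt;The arithmetic in these examples can be reproduced with a small model.  This is only a sketch under the same simplifying assumptions as the text: a fixed 20 millisecond overhead per thread, overheads that add up serially, and no contention between threads.  The class and method names are made up for illustration.&lt;/p&gt;

```java
import java.util.Arrays;

final class LatencyModel {
    /** Estimated seconds when each batch runs in its own thread: slowest batch plus per-thread overhead. */
    static double threaded(final double[] batchSeconds, final double overheadPerThread) {
        double longest = 0.0;
        for (final double seconds : batchSeconds) {
            longest = Math.max(longest, seconds);
        }
        return longest + overheadPerThread * batchSeconds.length;
    }

    /** Estimated seconds when every batch runs one after another on a single thread. */
    static double sequential(final double[] batchSeconds) {
        double total = 0.0;
        for (final double seconds : batchSeconds) {
            total += seconds;
        }
        return total;
    }

    /** Splits equal-cost requests evenly across threads; assumes requests divides evenly by threads. */
    static double[] evenBatches(final int requests, final double secondsPerRequest, final int threads) {
        final double[] batches = new double[threads];
        Arrays.fill(batches, (requests / threads) * secondsPerRequest);
        return batches;
    }

    public static void main(final String[] args) {
        final double overhead = 0.02;
        final double[] example1 = {2, 5, 5, 1};
        final double[] example2 = evenBatches(1000, 0.01, 1000);
        final double[] example2b = evenBatches(1000, 0.01, 5);
        System.out.printf("Example 1:  threaded %.2f s, sequential %.2f s%n",
                threaded(example1, overhead), sequential(example1));
        System.out.printf("Example 2:  threaded %.2f s, sequential %.2f s%n",
                threaded(example2, overhead), sequential(example2));
        System.out.printf("Example 2b: threaded %.2f s, sequential %.2f s%n",
                threaded(example2b, overhead), sequential(example2b));
    }
}
```

&lt;p&gt;Varying the thread count passed to evenBatches shows the trade-off directly: more threads shrink the longest batch but grow the overhead term.&lt;/p&gt;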

&lt;p&gt;Multi-threading provides the most latency reduction when we’re able to run multiple long-running tasks in parallel, especially multi-second tasks, whether each task is a single long-running request or a series of requests adding up to seconds.&lt;/p&gt;

&lt;h3 id=&quot;multi-container-hosts&quot;&gt;Multi-container hosts&lt;/h3&gt;
&lt;p&gt;Within a host, we can run multiple containers which each receive allocated memory and run an isolated instance of application code.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20231130_DistributedComputeToHandleMillionTps-MultiContainerHost.png&quot; alt=&quot;alt text&quot; title=&quot;Diagram of a multi-container host&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This allows us to fully utilize a host’s CPU and memory, but we will eventually reach a point where the host no longer has sufficient CPU or memory capacity to add more containers, or where performance begins to drop due to increased context switching and IO bottlenecks.&lt;/p&gt;

&lt;p&gt;Isolating concurrent applications in separate containers also improves system reliability, since regardless of individual application failures, the other containers can continue running.  However, we are still vulnerable to host-level failures.&lt;/p&gt;

&lt;p&gt;A single host can be sufficient for some small-scale services that only have a few hundred concurrent requests and can acceptably be taken offline periodically for maintenance.  For services that need to provide 24/7 availability or handle more traffic, we will graduate to distributed services where this host will be a single unit of a larger architecture, leading us to the multi-host cluster.&lt;/p&gt;

&lt;h3 id=&quot;multi-host-clusters-behind-a-load-balancer&quot;&gt;Multi-host clusters behind a load balancer&lt;/h3&gt;
&lt;p&gt;When we require high availability or more concurrency than a single host can support, we can set up a load balancer that distributes traffic across multiple hosts, forming a cluster of hosts.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20231130_DistributedComputeToHandleMillionTps-MultiHostCluster.png&quot; alt=&quot;alt text&quot; title=&quot;Diagram of a multi-host cluster&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This allows us to horizontally scale, that is, add or remove servers to our resource pool as needed.  Horizontal scaling makes our service more robust to individual host failures and enables more flexibility in our infrastructure, allowing us to swap out different types of hosts at will, patch or update individual hosts without affecting service availability, and pay for just as many hosts as are needed to meet current demand.&lt;/p&gt;
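&lt;p&gt;To make the distribution step concrete, here is a small sketch of one possible policy, round-robin selection that skips hosts marked unhealthy.  The text above does not prescribe a balancing algorithm, so round-robin is an assumption chosen for simplicity, and the class, method, and host names are hypothetical.  Real load balancers also handle health checks, connection draining, and weighting.&lt;/p&gt;

```java
import java.util.Arrays;

final class RoundRobinBalancer {
    private final String[] hosts;
    private final boolean[] healthy;
    private int next = 0;

    RoundRobinBalancer(final String[] hosts) {
        this.hosts = hosts.clone();
        this.healthy = new boolean[hosts.length];
        Arrays.fill(this.healthy, true); // assume every host starts healthy
    }

    /** Record the outcome of a (hypothetical) health check for one host. */
    synchronized void setHealthy(final int index, final boolean isHealthy) {
        healthy[index] = isHealthy;
    }

    /** Returns the next healthy host in rotation, trying each host at most once. */
    synchronized String pickHost() {
        for (int attempts = 0; attempts != hosts.length; attempts++) {
            final int index = next;
            next = (next + 1) % hosts.length;
            if (healthy[index]) {
                return hosts[index];
            }
        }
        throw new IllegalStateException("no healthy hosts available");
    }

    public static void main(final String[] args) {
        final RoundRobinBalancer balancer =
                new RoundRobinBalancer(new String[] {"host-a", "host-b", "host-c"});
        for (int i = 0; i != 4; i++) {
            System.out.println(balancer.pickHost());
        }
        balancer.setHealthy(1, false); // simulate host-b failing a health check
        System.out.println(balancer.pickHost());
    }
}
```

&lt;p&gt;Because callers only ever ask for the next host, hosts can be patched or replaced by marking them unhealthy without any change on the caller’s side.&lt;/p&gt;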

&lt;p&gt;This is often the go-to design pattern for services that need to process thousands of concurrent requests, which a single host may no longer be able to handle.&lt;/p&gt;

&lt;h3 id=&quot;multi-cluster-services&quot;&gt;Multi-cluster services&lt;/h3&gt;
&lt;p&gt;When we have more traffic than a single load balancer can handle, we can set up a DNS load balancer to distribute traffic across multiple clusters.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/20231130_DistributedComputeToHandleMillionTps-MultiClusterService.png&quot; alt=&quot;alt text&quot; title=&quot;Diagram of a multi-cluster service&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is rarely the starting point for a new service.  We only want to add this level of complexity when absolutely necessary, such as when scaling up to millions of concurrent requests, or after hitting infrastructure restrictions on load balancer concurrent connections or maximum attached endpoints.&lt;/p&gt;

&lt;p&gt;Many cloud providers provide distributed DNS load balancers that remove the single point of failure of a traditional load balancer, scale to millions of concurrent users, and automatically route traffic to the closest regional cluster.&lt;/p&gt;

&lt;h4 id=&quot;dns-load-balancer-trade-offs&quot;&gt;DNS load balancer trade-offs&lt;/h4&gt;

&lt;p&gt;DNS load balancers are more limited in functionality than many specialized load balancers.  For example, AWS network load balancers can support more granular access controls and security configurations, and integrate with compute services to automatically replace unhealthy hosts that fail to respond to the load balancer.&lt;/p&gt;

&lt;p&gt;DNS also requires its connected endpoints to be accessible to the Internet, which is not always ideal.  Following the security principle of defence-in-depth, when protecting critical data or infrastructure, anything that doesn’t need to be connected to the Internet, shouldn’t be.  Network load balancers can be set up in protected virtual private networks to only allow access from allow-listed hosts or other trusted networks.&lt;/p&gt;

&lt;p&gt;For these reasons, in some scenarios it will make sense to have the added complexity of both a frontend DNS load balancer to distribute traffic to the closest cluster, and backend application load balancers that provide more functionality and integration with your local infrastructure.&lt;/p&gt;

&lt;p&gt;If you don’t need any functionality that isn’t supported by a DNS load balancer, can live with your servers being accessible from the Internet, and already manage your own health monitoring and host replacement strategy, then you can simplify your architecture by having a DNS load balancer directly route traffic to your backend servers.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;We’ve discussed how concurrency can be applied at multiple levels:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Multi-threaded applications that run multiple tasks in parallel, such as querying several websites simultaneously&lt;/li&gt;
  &lt;li&gt;Multi-container or multi-process hosts that run multiple applications in isolation from one another, so that a given application can continue running if others fail&lt;/li&gt;
  &lt;li&gt;Multi-host clusters that enable horizontally scaling a service to process hundreds of thousands of concurrent requests&lt;/li&gt;
  &lt;li&gt;Multi-cluster services that enable routing traffic to local load-balanced clusters that can be independently scaled, to process millions of concurrent requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many distributed services now start with multi-host clusters for reliability and scalability reasons, so that any given host can be replaced without impacting customer service, and additional hosts can be added as needed.&lt;/p&gt;

&lt;p&gt;A single load balancer and backend compute cluster can often handle hundreds of thousands of concurrent requests or more, but the load balancer may become a single point of failure for your service.  Distributed DNS load balancers can help to mitigate this concern when it’s acceptable for your servers to be accessible from the Internet.&lt;/p&gt;

&lt;p&gt;For applications where you need to handle millions of concurrent requests and have business requirements not met by a single DNS load balancer, such as needing granular access control for your backend servers or integrations with other infrastructure, a DNS load balancer in front of multiple load-balanced clusters can meet these demands with the trade-off of an additional layer of complexity.&lt;/p&gt;

&lt;h2 id=&quot;addendum&quot;&gt;Addendum&lt;/h2&gt;

&lt;p&gt;Before scaling your service to process millions of concurrent requests and paying hundreds of thousands of dollars to do so, make sure this is really necessary.&lt;/p&gt;

&lt;p&gt;Would it be more efficient to extract some of your use cases to a separate microservice?&lt;/p&gt;

&lt;p&gt;Are your hosts really doing unique work on every call?  Could some of that work be deduplicated, or could the right application of a caching layer reduce your traffic and/or average latency by orders of magnitude?&lt;/p&gt;

&lt;p&gt;Also, note that millions of concurrent users do not always translate into millions of transactions per second.  If each user only needs to make a server request every few seconds, with multiple seconds between where they locally interact with rendered results, you may only have tens to hundreds of thousands of transactions per second, which while still high, lowers the required complexity of the system.&lt;/p&gt;
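&lt;p&gt;As a rough back-of-the-envelope sketch of that relationship, assuming requests are spread evenly over time, average throughput is the number of concurrent users divided by the interval between each user’s requests.  The user counts and intervals below are illustrative, and real traffic is burstier than this average suggests.&lt;/p&gt;

```java
final class TpsEstimate {
    /** Average requests per second if every concurrent user sends one request per interval. */
    static double transactionsPerSecond(final long concurrentUsers, final double secondsBetweenRequests) {
        if (!(secondsBetweenRequests > 0)) {
            throw new IllegalArgumentException("secondsBetweenRequests must be positive");
        }
        return concurrentUsers / secondsBetweenRequests;
    }

    public static void main(final String[] args) {
        final long users = 1_000_000L; // assumed number of concurrent users
        final double[] intervals = {5, 10, 30}; // assumed seconds between one user's requests
        for (final double interval : intervals) {
            System.out.printf("%d users, one request every %.0f s: about %.0f TPS%n",
                    users, interval, transactionsPerSecond(users, interval));
        }
    }
}
```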

&lt;p&gt;Software architecture design is an iterative process, and the optimal design will change along with the business, so it’s often worth starting with the simplest approach that meets current needs and can be scaled up or down as needed based on customer traffic.  There’s no prize for building the most expensive service that no one uses.&lt;/p&gt;
</description>
        <pubDate>Fri, 01 Dec 2023 03:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/concurrency-from-app-to-massively-distributed-service/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/concurrency-from-app-to-massively-distributed-service/</guid>
        
        
        <category>distributed-systems</category>
        
        <category>system-design</category>
        
      </item>
    
      <item>
        <title>Process for designing distributed systems</title>
        <description>&lt;p&gt;In this post I’ll step through my process for designing distributed systems, with example questions and artifacts associated with each step.&lt;/p&gt;

&lt;h2 id=&quot;step-by-step-process&quot;&gt;Step by step process&lt;/h2&gt;

&lt;h3 id=&quot;1-validate-whether-this-service-needs-to-exist&quot;&gt;1. Validate whether this service needs to exist&lt;/h3&gt;
&lt;p&gt;Before building any complex system, we should ensure there’s a compelling project motivation.  If we can’t identify an underlying customer problem and how this service will address it, we should pause to make sure we’re working on the right thing.&lt;/p&gt;

&lt;p&gt;Example questions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;What specific problem or customer need are we trying to address?&lt;/li&gt;
  &lt;li&gt;How will the customer need be addressed by this service?&lt;/li&gt;
  &lt;li&gt;What will the end state be after this is completed?&lt;/li&gt;
  &lt;li&gt;Why are existing solutions not sufficient?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few hours of research to check what aspects of the problem could be solved by existing services may save both money and months of engineering hours.  If we can leverage existing solutions, we should make sure they are well supported and well documented.&lt;/p&gt;

&lt;p&gt;If we have a compelling justification for the service after answering the above questions, we’ll continue with the design work.  Otherwise, this may be an indication that we should put the project aside to focus on more impactful work.&lt;/p&gt;

&lt;p&gt;Artifacts of this step:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Project justification including problem statement and brief summary of how the service will solve that problem in a way that isn’t satisfied by existing solutions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-clarify-business-requirements&quot;&gt;2. Clarify business requirements&lt;/h3&gt;
&lt;p&gt;Before making design decisions, it helps to take the time to understand the business use cases and identify what needs to be supported in the first release, and what major features are anticipated in the near future.  This way we can make appropriate choices that keep our system as simple as possible while making it easy to extend to future needs.&lt;/p&gt;

&lt;p&gt;Example questions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;What are the different user personas our system needs to support?  Are they internal employees or external customers, human or programmatic?&lt;/li&gt;
  &lt;li&gt;What latency needs to be supported for each use case?  Some APIs may need to be in the 100ms range, others may not be latency-sensitive.&lt;/li&gt;
  &lt;li&gt;What level of availability is required for this service?  Some services are only used during business hours, while others are critical to keep running 24/7 with severe consequences for even an hour of downtime a year.&lt;/li&gt;
  &lt;li&gt;What are the security requirements for this service?  E.g. who should be allowed to access different APIs, and are there authorization requirements for who can access what data?  Is it acceptable for anyone on the Internet to be able to query the service, or does it need to be restricted to only allow-listed services/users?  What data needs to be encrypted in transit and at rest?  Do we need to protect against malicious users?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Artifacts of this step:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Requirements document including service-level business requirements, specific use cases that must be supported, latency requirements for each use case, and security requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We should identify stakeholders who should have a say in how the service works, and get their feedback so we can drive alignment and make required changes early in the design process while major changes are less costly.&lt;/p&gt;

&lt;h3 id=&quot;3-estimate-scale&quot;&gt;3. Estimate scale&lt;/h3&gt;
&lt;p&gt;The scale of data and traffic makes a big difference to the architecture needed to support it.  Services that only receive a few dozen requests at a time can be very simple, but as we scale up to millions of concurrent requests, we have very different needs around load balancing, host scaling, caching, and data management.&lt;/p&gt;

&lt;p&gt;Example questions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;How many users of each type do we expect to have, and how many will be active at a given time?&lt;/li&gt;
  &lt;li&gt;How much data do we expect our system to have, and how quickly will it grow over time?&lt;/li&gt;
  &lt;li&gt;What frequency of read vs write operations do we expect on different types of data?&lt;/li&gt;
  &lt;li&gt;What network bandwidth will we need to support the anticipated traffic?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Artifacts of this step:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Summary of expected scale of total/active users, data, and network traffic.&lt;/li&gt;
  &lt;li&gt;Summary of expected transactions per second (TPS) per user operation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-define-system-interfaces-and-data-models&quot;&gt;4. Define system interfaces and data models&lt;/h3&gt;
&lt;p&gt;In this step we define how callers will interact with our service, which will typically be through API interfaces.&lt;/p&gt;

&lt;p&gt;Example questions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;How can we translate our business use cases into API interfaces that are simple, decoupled from implementation (so we can iterate on our backend design and data models without impacting customers), and future-proof when considering expected new features?&lt;/li&gt;
  &lt;li&gt;How can we name and structure our API interfaces to be self-explanatory when accompanied by API documentation, to third parties who have no knowledge of how our internal systems work?&lt;/li&gt;
  &lt;li&gt;How can we organize our interfaces around resources and HTTP methods?  Following REST API conventions will make it easier for third parties to integrate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Artifacts of this step:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Draft API spec including method names, request structures, and response structures.&lt;/li&gt;
&lt;/ul&gt;
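&lt;p&gt;As an illustration of what such a draft might look like, here is a sketch of a hypothetical product catalogue API expressed as a Java interface with request and response records (Java 16 or later).  Every name, field, and path is invented for the example.  The in-memory implementation exists only to show that callers depend on the interface and an opaque page token, not on how the data is stored.&lt;/p&gt;

```java
import java.util.Arrays;

/** Demo entry point; callers only see ProductApi and its request/response types. */
final class ProductApiDemo {
    public static void main(final String[] args) {
        final ProductApi api = new InMemoryProductApi(new Product[] {
            new Product("p1", "Kettle"), new Product("p2", "Toaster"), new Product("p3", "Mug")
        });
        ListProductsResponse page = api.listProducts(new ListProductsRequest(2, null));
        System.out.println(page.products().length + " products, next token: " + page.nextPageToken());
        page = api.listProducts(new ListProductsRequest(2, page.nextPageToken()));
        System.out.println(page.products().length + " products, next token: " + page.nextPageToken());
    }
}

/** Hypothetical draft API; the comments note the REST resource each method would map to. */
interface ProductApi {
    /** GET /products/{productId} */
    GetProductResponse getProduct(GetProductRequest request);

    /** GET /products, with pageSize and pageToken as query parameters */
    ListProductsResponse listProducts(ListProductsRequest request);
}

record Product(String productId, String name) {}
record GetProductRequest(String productId) {}
record GetProductResponse(Product product) {}
record ListProductsRequest(int pageSize, String pageToken) {}
record ListProductsResponse(Product[] products, String nextPageToken) {}

/** Stand-in backend; a real service could swap in a database without changing ProductApi. */
final class InMemoryProductApi implements ProductApi {
    private final Product[] products;

    InMemoryProductApi(final Product[] products) {
        this.products = products.clone();
    }

    @Override
    public GetProductResponse getProduct(final GetProductRequest request) {
        for (final Product product : products) {
            if (product.productId().equals(request.productId())) {
                return new GetProductResponse(product);
            }
        }
        return new GetProductResponse(null); // a real API would likely return a 404 instead
    }

    @Override
    public ListProductsResponse listProducts(final ListProductsRequest request) {
        // The token is opaque to callers; here it happens to hold the next start index.
        final int start = request.pageToken() == null ? 0 : Integer.parseInt(request.pageToken());
        final int end = Math.min(start + request.pageSize(), products.length);
        final Product[] page = Arrays.copyOfRange(products, start, end);
        final String nextToken = end == products.length ? null : String.valueOf(end);
        return new ListProductsResponse(page, nextToken);
    }
}
```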

&lt;p&gt;Once we’ve defined our API spec, we can validate it with stakeholders and iterate until we’re confident we have an interface that will meet user requirements and be easily extended to future use cases.&lt;/p&gt;

&lt;h3 id=&quot;5-define-data-flow-and-storage&quot;&gt;5. Define data flow and storage&lt;/h3&gt;
&lt;p&gt;While still treating system internals as a black box, we can define how data will flow in and out of our system and be stored.&lt;/p&gt;

&lt;p&gt;Example questions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Who will we consume data from, and how?&lt;/li&gt;
  &lt;li&gt;Who will consume data from our service, and how?&lt;/li&gt;
  &lt;li&gt;What data does our service need to store, and what data structures will best support our use cases?&lt;/li&gt;
  &lt;li&gt;What is the end-to-end lifecycle for data entering our system, for each type of data we store or process?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Artifacts of this step:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Description of how data flows from upstream services/users, to our service, to downstream services/users.&lt;/li&gt;
  &lt;li&gt;Description of data models we will store and process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step may be done in parallel with defining API interfaces, and we’ll similarly want to validate the data workflows with stakeholders, including data producers and consumers, to ensure our contract makes sense before designing a system that doesn’t match reality.&lt;/p&gt;

&lt;h3 id=&quot;6-define-high-level-system-components&quot;&gt;6. Define high-level system components&lt;/h3&gt;
&lt;p&gt;Now that we’ve aligned on our use cases, system interface, data models, and inter-service data flows, we can build a picture of the high-level components of our system and how they’ll interact.&lt;/p&gt;

&lt;p&gt;Example questions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;What logical components make sense for dividing responsibilities?  How will data flow between them?&lt;/li&gt;
  &lt;li&gt;Should client calls go through a load balancer pointing to multiple backend servers, or will a single server suffice for our scaling and reliability needs?&lt;/li&gt;
  &lt;li&gt;Do we need to implement new data stores, and if so, which components will be retrieving or writing data to them?&lt;/li&gt;
  &lt;li&gt;Do we have static content that should be separated from our other client/service interactions, for example, with clients querying a Content Delivery Network backed by a distributed file service?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Artifacts of this step:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;A simple diagram of labelled blocks representing logical components such as load balancers, computational microservices, and data stores, with arrows pointing in the direction of data flow.&lt;/li&gt;
  &lt;li&gt;Workflow diagrams corresponding to business use cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This provides the baseline for designing each logical component and choosing technologies to use in the next step.&lt;/p&gt;

&lt;h3 id=&quot;7-design-individual-components&quot;&gt;7. Design individual components&lt;/h3&gt;
&lt;p&gt;As we dive into the design of each logical component, we can make technology choices based on our business, security, and scaling requirements, comparing options based on how they meet our anticipated current and future needs.&lt;/p&gt;

&lt;p&gt;If we’ve separated out logical components into their own microservices, we should be able to independently update and scale individual components going forward.&lt;/p&gt;

&lt;p&gt;Our detailed system design will narrow down:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Compute types - see &lt;a href=&quot;/blog/aws-compute-options/&quot;&gt;post on AWS compute options&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Data store types - the type and scale of our data and its access patterns will guide whether we choose a NoSQL key-value store such as DynamoDB, an object store such as S3, a traditional SQL database such as PostgreSQL on RDS, or another type of data store entirely&lt;/li&gt;
  &lt;li&gt;Caching layers - see &lt;a href=&quot;/blog/aws-caching-options/&quot;&gt;post on AWS caching options&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;API access controls and how they will be enforced&lt;/li&gt;
  &lt;li&gt;Replication strategies for servers and data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Authorization controls may include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Restricting who can access specific API methods - this can be enforced through role or resource based access policies.&lt;/li&gt;
  &lt;li&gt;Restricting non-admin users to only access their own data - this may leverage some combination of service code and database row-level security.&lt;/li&gt;
&lt;/ul&gt;
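&lt;p&gt;The two kinds of control in the list above can be sketched in service code as follows.  This is a simplified illustration with made-up roles and API method names.  In practice the method-level check is often delegated to the platform’s access policies, and the ownership check may be backed by database row-level security.&lt;/p&gt;

```java
final class AccessPolicy {
    enum Role { ADMIN, USER }

    /** Role-based check: which roles may call a given (hypothetical) API method. */
    static boolean canCallMethod(final Role role, final String apiMethod) {
        switch (apiMethod) {
            case "GetOrder":
            case "ListOrders":
                return true; // any authenticated role
            case "DeleteOrder":
                return role == Role.ADMIN;
            default:
                return false; // deny unknown methods by default
        }
    }

    /** Ownership check: non-admin users may only read records they own. */
    static boolean canReadRecord(final Role role, final String callerId, final String ownerId) {
        return role == Role.ADMIN || callerId.equals(ownerId);
    }

    public static void main(final String[] args) {
        System.out.println(canCallMethod(Role.USER, "DeleteOrder"));
        System.out.println(canCallMethod(Role.ADMIN, "DeleteOrder"));
        System.out.println(canReadRecord(Role.USER, "alice", "bob"));
        System.out.println(canReadRecord(Role.USER, "alice", "alice"));
    }
}
```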

&lt;p&gt;Artifacts of this step:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;A detailed system architecture diagram specifying component names, compute/database types (e.g. Lambda, DynamoDB), and data flow between components.&lt;/li&gt;
  &lt;li&gt;Workflow diagrams for each logical component.&lt;/li&gt;
  &lt;li&gt;Summary of why each technology choice was made.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we’ve done our job well, other software engineers should be able to skim the design document and have a general understanding of what needs to be built for this system to work as intended, how the system components will interact, and how upstream and downstream services and users will interact with the service.&lt;/p&gt;

&lt;p&gt;If we’ve documented how we made each decision, they should also be able to understand what parts of the design will apply to future projects, and where and how they should deviate based on their use cases and scaling requirements.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;We will take the following steps to design a new service:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Validate whether this service needs to exist&lt;/li&gt;
  &lt;li&gt;Clarify business requirements&lt;/li&gt;
  &lt;li&gt;Estimate scale&lt;/li&gt;
  &lt;li&gt;Define system interfaces and data models&lt;/li&gt;
  &lt;li&gt;Define data flow and storage&lt;/li&gt;
  &lt;li&gt;Define high-level system components&lt;/li&gt;
  &lt;li&gt;Design individual components&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The artifacts of each step can be validated with stakeholders to ensure we’re on the right track before continuing.  They collectively add to a design document that can be referred to both while building the service, and afterwards to understand its inner workings.&lt;/p&gt;
</description>
        <pubDate>Wed, 31 May 2023 03:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/step-by-step-process-designing-distributed-systems/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/step-by-step-process-designing-distributed-systems/</guid>
        
        
        <category>distributed-systems</category>
        
        <category>system-design</category>
        
      </item>
    
      <item>
        <title>Choosing between AWS compute services</title>
        <description>&lt;p&gt;When building a new service in AWS, it can be difficult to decide between all the available compute services.  In this post I’ll give a brief overview of the main options and describe how I compare and choose between them for a given project.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;overview-of-aws-compute-services&quot;&gt;Overview of AWS compute services&lt;/h2&gt;

&lt;p&gt;AWS compute services include AWS Lambda, EC2 (Elastic Compute Cloud), and Fargate, where EC2 and Fargate can both be run through the container orchestration services ECS (Elastic Container Service) or EKS (Elastic Kubernetes Service).&lt;/p&gt;

&lt;h3 id=&quot;lambda&quot;&gt;Lambda&lt;/h3&gt;
&lt;p&gt;Lambda provides one of the simplest ways to run code on-demand.  You can configure Lambda functions to be automatically triggered via other AWS services or events, or invoke them directly through API calls.&lt;/p&gt;

&lt;p&gt;Lambda functions are intended for short-lived operations and have a maximum runtime of 15 minutes.&lt;/p&gt;

&lt;h3 id=&quot;ec2&quot;&gt;EC2&lt;/h3&gt;
&lt;p&gt;EC2 is the underlying compute service for most other AWS services including Lambda and Fargate, and offers a broad range of instance types that support different memory, storage, and networking capacities.  You can set up long-lived servers directly with EC2, managing provisioning and infrastructure yourself, or use higher-level services like Fargate or ECS that take care of host management for you.&lt;/p&gt;

&lt;h3 id=&quot;fargate&quot;&gt;Fargate&lt;/h3&gt;
&lt;p&gt;Fargate is a serverless computing environment that allows you to specify how much memory and processing power you need, provide a Docker container image, and let AWS take care of host management.&lt;/p&gt;

&lt;p&gt;Fargate has less maintenance overhead than EC2 since AWS automatically chooses instance types optimized to your resource requirements and provisions, patches, and replaces hosts as needed.&lt;/p&gt;

&lt;h3 id=&quot;brief-comparison-of-compute-services&quot;&gt;Brief comparison of compute services&lt;/h3&gt;

&lt;table class=&quot;table-small-bordered&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;Lambda&lt;/th&gt;
      &lt;th&gt;EC2&lt;/th&gt;
      &lt;th&gt;Fargate&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Who manages infrastructure&lt;/td&gt;
      &lt;td&gt;AWS&lt;/td&gt;
      &lt;td&gt;You&lt;/td&gt;
      &lt;td&gt;AWS&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Tenancy options&lt;/td&gt;
      &lt;td&gt;Shared&lt;/td&gt;
      &lt;td&gt;Shared/Dedicated&lt;/td&gt;
      &lt;td&gt;Shared&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Maintenance overhead&lt;/td&gt;
      &lt;td&gt;Lowest&lt;/td&gt;
      &lt;td&gt;Highest&lt;/td&gt;
      &lt;td&gt;Low&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Max execution time&lt;/td&gt;
      &lt;td&gt;15 minutes&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Lambda instances are simplest to use for workflows that take under 15 minutes, and both Lambda and Fargate instances are managed by AWS to provide low-maintenance options for customers.&lt;/p&gt;

&lt;p&gt;All three compute services are by default “shared tenancy”, meaning that multiple AWS customers may have their software running on virtual machines that share a physical server.  For most customers, this is a non-issue, but for highly regulated organizations that need their software running on hardware dedicated only to them, EC2 also supports “dedicated tenancy” hosts.&lt;/p&gt;

&lt;p&gt;Quincy Mitchell wrote a good post comparing the pricing of Lambda, EC2, and Fargate across a few instance types at &lt;a href=&quot;https://blogs.perficient.com/2021/06/17/aws-cost-analysis-comparing-lambda-ec2-fargate/&quot;&gt;https://blogs.perficient.com/2021/06/17/aws-cost-analysis-comparing-lambda-ec2-fargate/&lt;/a&gt;.  The general conclusions were that:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Lambda is less expensive than EC2 when run &amp;lt;= 50% of the time, and less expensive than Fargate when run &amp;lt;= 25% of the time.&lt;/li&gt;
  &lt;li&gt;Fargate’s flexibility for resource sizing can save money compared to EC2 if you need fewer resources than provided by the next larger EC2 instance type.&lt;/li&gt;
  &lt;li&gt;EC2 is least expensive when right-sized to resource requirements and highly utilized.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;container-management-services&quot;&gt;Container management services&lt;/h3&gt;
&lt;p&gt;Both EC2 and Fargate can be run via the following container management services:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;ECS (Elastic Container Service) is the simplest way to run containerized servers in AWS, with most deployment and networking details managed by AWS.&lt;/li&gt;
  &lt;li&gt;EKS (Elastic Kubernetes Service) runs containerized servers through Kubernetes, which is more complex and supports more granular configuration than ECS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When building services entirely in AWS, I prefer ECS because of how easy it is to manage and integrate with other AWS services.  EKS may be preferable to teams that already work with Kubernetes and want to leverage specific features that ECS doesn’t support.&lt;/p&gt;

&lt;h2 id=&quot;comparison-matrix&quot;&gt;Comparison matrix&lt;/h2&gt;

&lt;table class=&quot;table-small-bordered&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;Lambda&lt;/th&gt;
      &lt;th&gt;Fargate on ECS&lt;/th&gt;
      &lt;th&gt;EC2 on ECS&lt;/th&gt;
      &lt;th&gt;EC2/Fargate on EKS&lt;/th&gt;
      &lt;th&gt;EC2&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Max execution time&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;15 minutes&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Warm-up time&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Seconds&lt;/strong&gt;*&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Execution latency SLA&lt;/td&gt;
      &lt;td&gt;Seconds&lt;/td&gt;
      &lt;td&gt;100ms range&lt;/td&gt;
      &lt;td&gt;100ms range&lt;/td&gt;
      &lt;td&gt;100ms range&lt;/td&gt;
      &lt;td&gt;100ms range&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Availability SLA&lt;/td&gt;
      &lt;td&gt;99.95%&lt;/td&gt;
      &lt;td&gt;99.99%&lt;/td&gt;
      &lt;td&gt;99.99%&lt;/td&gt;
      &lt;td&gt;99.95%&lt;/td&gt;
      &lt;td&gt;99.99%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Requires OS customization&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Maintenance overhead&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Medium&lt;/td&gt;
      &lt;td&gt;Medium-High&lt;/td&gt;
      &lt;td&gt;High&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Automatic rollback support&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Handles sharp traffic spikes&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;**&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;For services where executions are expected to always complete in under 15 minutes, AWS Lambda is the simplest and lowest-maintenance compute service to leverage for API services.&lt;/p&gt;

&lt;h3 id=&quot;managing-lambda-warm-up-time&quot;&gt;*Managing Lambda warm-up time&lt;/h3&gt;
&lt;p&gt;AWS Lambda can take up to a few seconds to “warm up” and execute a function when no recent requests have been made, or when new workers are being provisioned to meet scaling demands.&lt;/p&gt;

&lt;p&gt;This delay can be partially mitigated by scheduling periodic function calls (“pings”) every few minutes to keep the function “warm”.&lt;/p&gt;

&lt;p&gt;You can also pay for &lt;a href=&quot;https://aws.amazon.com/blogs/compute/new-for-aws-lambda-predictable-start-up-times-with-provisioned-concurrency/&quot;&gt;provisioned concurrency&lt;/a&gt; to always have a minimum number of workers provisioned and ready to accept traffic.&lt;/p&gt;

&lt;p&gt;Execution latency SLAs are influenced by the above warm-up time issues, and with periodic pings and provisioned concurrency you can expect execution latencies to be comparable to EC2/ECS.&lt;/p&gt;
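
&lt;p&gt;As a minimal sketch of the periodic ping approach, the handler below returns early when it receives a keep-warm event, so scheduled pings don’t run the real request logic.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;warmup&lt;/code&gt; event key and the response shapes are assumptions made for illustration; the scheduled trigger would need to be configured to send a matching payload.&lt;/p&gt;

```python
def handler(event, context):
    """Hypothetical Lambda handler that short-circuits keep-warm pings."""
    # Assumption: the scheduled ping sends a payload like {"warmup": true}.
    if isinstance(event, dict) and event.get("warmup"):
        return {"warmed": True}

    # Placeholder for the real request handling.
    return {"statusCode": 200, "body": "ok"}
```

&lt;p&gt;Returning before any downstream calls keeps the pings cheap while still keeping a worker initialized.&lt;/p&gt;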

&lt;h3 id=&quot;handling-sharp-traffic-spikes&quot;&gt;**Handling sharp traffic spikes&lt;/h3&gt;
&lt;p&gt;Lambda has the limitation that its auto-scaling takes a few minutes to adjust to sharp spikes in traffic beyond its base capacity of 1000 calls/second.  If your service needs to handle 50%+ traffic spikes above this threshold without throwing “Rate Exceeded” errors for a few minutes, then ECS Fargate or the other ECS/EKS options should be considered instead.&lt;/p&gt;

&lt;p&gt;Many AWS compute services support automatic scaling policies based on factors you specify such as time period and memory usage, which you can use to automatically adjust to major traffic increases with some lag time.  However, to handle sharp traffic spikes without interim throttling errors, you need to estimate your service’s maximum call rate in advance and set the minimum number of running instances in your ECS/EKS cluster or EC2 auto-scaling group to match it.&lt;/p&gt;

&lt;p&gt;It can be expensive to maintain a large number of hosts year-round if you only have traffic surges on specific dates, so many teams track service usage over time and periodically adjust their scaling limits to align with expected seasonal/event-based traffic, with updated projections and load testing done before known peak periods.&lt;/p&gt;
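
&lt;p&gt;The sizing step above can be sketched as simple arithmetic.  This is a rough illustration rather than a capacity-planning method: the function name, the per-instance throughput figure, and the 20% headroom default are all assumptions, and real values would come from your own load testing and traffic projections.&lt;/p&gt;

```python
import math


def min_instances_for_peak(peak_rps, per_instance_rps, headroom=1.2):
    """Instances to keep running so the estimated peak fits without scaling up.

    All inputs are estimates; the default headroom is an arbitrary illustration.
    """
    if not per_instance_rps > 0:
        raise ValueError("per_instance_rps must be positive")
    return math.ceil(peak_rps * headroom / per_instance_rps)


# Illustrative numbers only: 3000 requests/second at peak, 250 per instance.
print(min_instances_for_peak(3000, 250))  # 15
```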

&lt;h2 id=&quot;recommendations&quot;&gt;Recommendations&lt;/h2&gt;

&lt;p&gt;Lambda is the simplest compute service to use for short-lived operations, and is an easy choice for services that process under 1000 requests/second.  It can also handle much higher traffic, though it can take a few minutes to scale up for sharp traffic spikes of over 50% when above the 1000 requests/second base capacity.  If traffic increases are generally less spiky than this, or temporary throttling is acceptable, then Lambda is still a good choice.&lt;/p&gt;

&lt;p&gt;When Lambda is not an option due to worst-case latency or traffic expectations, Fargate on ECS offers the simplest set-up and management as a fully-managed “serverless” solution, and is my go-to alternative.&lt;/p&gt;

&lt;p&gt;I would only suggest using EC2 (stand-alone or on ECS) when there is a specific need for EC2’s additional OS or runtime environment configurations, since it otherwise adds unnecessary maintenance overhead.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;AWS blog post on ECS vs EKS: &lt;a href=&quot;https://aws.amazon.com/blogs/containers/amazon-ecs-vs-amazon-eks-making-sense-of-aws-container-services&quot;&gt;https://aws.amazon.com/blogs/containers/amazon-ecs-vs-amazon-eks-making-sense-of-aws-container-services&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;AWS blog post on Lambda provisioned concurrency: &lt;a href=&quot;https://aws.amazon.com/blogs/compute/new-for-aws-lambda-predictable-start-up-times-with-provisioned-concurrency&quot;&gt;https://aws.amazon.com/blogs/compute/new-for-aws-lambda-predictable-start-up-times-with-provisioned-concurrency&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;ECS documentation: &lt;a href=&quot;https://docs.aws.amazon.com/ecs&quot;&gt;https://docs.aws.amazon.com/ecs&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;EKS documentation: &lt;a href=&quot;https://docs.aws.amazon.com/eks&quot;&gt;https://docs.aws.amazon.com/eks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Fargate documentation: &lt;a href=&quot;https://docs.aws.amazon.com/AmazonECS/latest/userguide/what-is-fargate.html&quot;&gt;https://docs.aws.amazon.com/AmazonECS/latest/userguide/what-is-fargate.html&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Lambda documentation: &lt;a href=&quot;https://docs.aws.amazon.com/lambda&quot;&gt;https://docs.aws.amazon.com/lambda&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Lambda scaling: &lt;a href=&quot;https://aws.amazon.com/blogs/compute/understanding-aws-lambda-scaling-and-throughput&quot;&gt;https://aws.amazon.com/blogs/compute/understanding-aws-lambda-scaling-and-throughput&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Quincy Mitchell cost comparison of Lambda, EC2, and Fargate: &lt;a href=&quot;https://blogs.perficient.com/2021/06/17/aws-cost-analysis-comparing-lambda-ec2-fargate/&quot;&gt;https://blogs.perficient.com/2021/06/17/aws-cost-analysis-comparing-lambda-ec2-fargate/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Fri, 28 Apr 2023 03:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/aws-compute-options/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/aws-compute-options/</guid>
        
        
        <category>aws</category>
        
      </item>
    
      <item>
        <title>AWS caching options</title>
        <description>&lt;p&gt;Caching is a technique used to store frequently accessed data for fast retrieval, reducing the load on backend services and improving application performance.  AWS provides several caching options that can be used at different layers of the infrastructure stack.&lt;/p&gt;

&lt;h3 id=&quot;caching-best-practices&quot;&gt;Caching best practices&lt;/h3&gt;
&lt;p&gt;Before integrating a cache, it’s important to evaluate use cases that would benefit from caching.  Here are some caching best practices to keep in mind:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Caching is most helpful for data that is frequently accessed and slow to retrieve&lt;/li&gt;
  &lt;li&gt;A cache should not be used if responses need to be strongly consistent with the backend&lt;/li&gt;
  &lt;li&gt;Use cache expiry times appropriate to the use case&lt;/li&gt;
  &lt;li&gt;Handle cache misses gracefully&lt;/li&gt;
  &lt;li&gt;Avoid cache stampedes by using locks or random delays&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;cloudfront&quot;&gt;CloudFront&lt;/h3&gt;
&lt;p&gt;Content Delivery Networks (CDNs) are perfect for static data that is frequently accessed across users and doesn’t change based on the request, such as multimedia assets, scripts, and other global files. CloudFront is an AWS-managed CDN that has a large global network of endpoints to allow low-latency calls regardless of customer location.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Low-latency data transfers worldwide&lt;/li&gt;
  &lt;li&gt;Integrations with other AWS services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;: Video streaming services (YouTube, Netflix, etc) will almost always want to leverage a CDN with endpoints in multiple regions, since this scenario often involves large volumes of concurrent reads to common data and customers are hyper-sensitive to latency.&lt;/p&gt;

&lt;h3 id=&quot;api-gateway-cache&quot;&gt;API Gateway cache&lt;/h3&gt;
&lt;p&gt;AWS API Gateway’s REST APIs have built-in cache support that can be enabled for specific API methods at the infrastructure level.  This is one of the simplest ways to set up caching if you have API GET methods that are frequently called with the same parameters.&lt;/p&gt;

&lt;p&gt;When an API method has a high rate of cache hits, caching at this level allows you to absorb increased traffic with minimal load to your backend services, reducing data transfer and hardware costs.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Built-in cache support&lt;/li&gt;
  &lt;li&gt;No code changes required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;: You plan on using API Gateway for your service, and one of your API methods has a limited set of expected incoming request values and is frequently called by other services.  Enabling API Gateway’s cache for this method will reduce the worst case load to your backend to N calls per cache timeout period, given N unique request values.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;I’ve used API Gateway’s cache for this exact use case to allow APIs to handle millions of calls per day at minimal cost, without requiring backend changes.&lt;/li&gt;
&lt;/ul&gt;
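
&lt;p&gt;To make the “N calls per cache timeout period” bound concrete, here is a small worked calculation.  The function name and the numbers are illustrative assumptions, and the bound assumes every distinct request value stays cached for the full timeout (no early eviction or manual invalidation).&lt;/p&gt;

```python
import math


def worst_case_backend_calls(unique_requests, ttl_seconds, window_seconds=86400):
    """Upper bound on backend calls in a window when every distinct request
    is cached for ttl_seconds: each value can miss once per TTL period."""
    if not ttl_seconds > 0:
        raise ValueError("ttl_seconds must be positive")
    periods = math.ceil(window_seconds / ttl_seconds)
    return unique_requests * periods


# Illustrative numbers: 50 distinct request values, 5-minute cache timeout.
print(worst_case_backend_calls(50, 300))  # 14400 backend calls per day at most
```

&lt;p&gt;The bound depends only on the number of distinct request values and the timeout, not on how many calls the API receives.&lt;/p&gt;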

&lt;h3 id=&quot;custom-cache&quot;&gt;Custom cache&lt;/h3&gt;
&lt;p&gt;Regardless of which AWS services you use, you can always write server-side code that manually accesses a custom cache.&lt;/p&gt;

&lt;p&gt;Using a local cache on the host running the service is simple but has availability issues if you need to reboot or replace the host, and consistency issues if you run concurrent servers with independent caches.  You can avoid these issues by running an independently hosted cache service that all API workers call, which you can either run yourself or through AWS.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;: You have your own cache solution that you prefer over AWS cache services, and your cache use cases are not satisfied by API Gateway or DynamoDB.&lt;/p&gt;

&lt;h3 id=&quot;elasticache&quot;&gt;ElastiCache&lt;/h3&gt;
&lt;p&gt;ElastiCache is useful when you need a hosted general-purpose cache that can be queried by concurrent compute instances, and API Gateway and DynamoDB Accelerator don’t meet your caching needs.  You can choose to leverage either Redis or Memcached as the backend implementation for your ElastiCache instances.&lt;/p&gt;

&lt;p&gt;This requires maintaining additional AWS infrastructure and making code changes to wrap your data access code with logic to read from and write to the cache service.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Flexible use cases independent of backend implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;: A method that is invoked across your service many times per minute involves an expensive SQL query that is always run with the same parameters, where it’s acceptable for the results to be 15-30 minutes out of date.  By leveraging ElastiCache with a 15-minute time-to-live value, you can make this query nearly instant across all servers and clients for all but one call per 15-minute period.  Note: If you have a fixed set of global queries that are important to optimize for all users, you can remove the worst-case runtime from end customers by setting up a periodic job that updates the data in the cache more frequently than the time-to-live period.&lt;/p&gt;
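
&lt;p&gt;The read-through pattern in this use case can be sketched as below.  This is not ElastiCache client code: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TTLCache&lt;/code&gt; is a hypothetical in-memory stand-in so the example runs on its own, and in a real service the dictionary would be replaced by calls to a shared Redis or Memcached endpoint.&lt;/p&gt;

```python
import time


class TTLCache:
    """Hypothetical in-memory stand-in for a shared cache such as ElastiCache."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the expensive call
        value = compute()  # miss or expired entry: run the slow query
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

&lt;p&gt;With a 15-minute time-to-live, a call such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cache.get_or_compute(&quot;report&quot;, run_expensive_query)&lt;/code&gt; runs the query once per period and serves the cached value otherwise, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_expensive_query&lt;/code&gt; is a placeholder for your own function.&lt;/p&gt;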

&lt;h3 id=&quot;dynamodb-accelerator-dax&quot;&gt;DynamoDB Accelerator (DAX)&lt;/h3&gt;
&lt;p&gt;DynamoDB Accelerator is specific to DynamoDB databases, and can be easily set up in your AWS account and leveraged in code by swapping out the client you use to query DynamoDB.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Automatic caching of DynamoDB read operations that use the DAX code client&lt;/li&gt;
  &lt;li&gt;Minimal code changes required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;: Your application has a few specific DynamoDB queries that take longer to complete than is acceptable to end users.  By enabling DAX and configuring those use cases to use the DAX code client rather than the DynamoDB code client, you can automatically cache the query results so that the majority of users experience nearly instant responses.&lt;/p&gt;

&lt;h3 id=&quot;comparison-of-aws-caching-services&quot;&gt;Comparison of AWS caching services&lt;/h3&gt;

&lt;table class=&quot;table-small-bordered&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;CloudFront&lt;/th&gt;
      &lt;th&gt;API Gateway cache&lt;/th&gt;
      &lt;th&gt;ElastiCache&lt;/th&gt;
      &lt;th&gt;DAX&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Use case&lt;/td&gt;
      &lt;td&gt;Static resources&lt;/td&gt;
      &lt;td&gt;API responses&lt;/td&gt;
      &lt;td&gt;General-purpose&lt;/td&gt;
      &lt;td&gt;DynamoDB query responses&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Cache layer&lt;/td&gt;
      &lt;td&gt;CDN&lt;/td&gt;
      &lt;td&gt;API Gateway&lt;/td&gt;
      &lt;td&gt;Compute code&lt;/td&gt;
      &lt;td&gt;Database&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Layer accessing cache&lt;/td&gt;
      &lt;td&gt;Server or client code&lt;/td&gt;
      &lt;td&gt;API Gateway&lt;/td&gt;
      &lt;td&gt;Server code&lt;/td&gt;
      &lt;td&gt;Server code&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Infrastructure set-up &amp;amp; maintenance&lt;/td&gt;
      &lt;td&gt;Medium&lt;/td&gt;
      &lt;td&gt;Low&lt;/td&gt;
      &lt;td&gt;High&lt;/td&gt;
      &lt;td&gt;Low&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Code changes&lt;/td&gt;
      &lt;td&gt;Medium&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;High&lt;/td&gt;
      &lt;td&gt;Low&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;API Gateway caching is simplest to set up and maintain, followed closely by DAX.  I’d recommend considering these over more complex caching solutions when they fit your use case.&lt;/p&gt;

&lt;p&gt;DAX and ElastiCache are comparable in price with the difference depending on the configuration options chosen for ElastiCache, and DAX is much simpler to integrate with, so DAX would be my recommendation if all of your cache use cases are specific to DynamoDB queries.&lt;/p&gt;

&lt;p&gt;ElastiCache has the most overhead and most flexibility as AWS’ general-purpose cache service.&lt;/p&gt;

&lt;p&gt;CloudFront addresses a somewhat different use case as a CDN, and is appropriate to use whenever you want to optimize access to static resources across regions.&lt;/p&gt;

&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;
&lt;p&gt;AWS provides several ways to cache data depending on your use case and infrastructure requirements.  In many cases, you don’t need to reinvent the wheel and can use a fully-managed solution that does not require significant code changes.&lt;/p&gt;

&lt;p&gt;By caching at the appropriate layer, you can optimize latencies while minimizing unnecessary load to your backend services, allowing you to scale at a reasonable cost.&lt;/p&gt;

&lt;h3 id=&quot;references&quot;&gt;References&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html&quot;&gt;CloudFront documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-caching.html&quot;&gt;API Gateway caching documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/elasticache/&quot;&gt;ElastiCache documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.html&quot;&gt;DynamoDB Accelerator documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 26 Apr 2023 03:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/aws-caching-options/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/aws-caching-options/</guid>
        
        
        <category>aws</category>
        
      </item>
    
      <item>
        <title>AWS S3 bucket creation dates and S3 master regions</title>
        <description>&lt;p&gt;While working on functionality that depended on AWS S3 bucket ages, I noticed that published bucket CreationDate values didn’t always reflect when the buckets were created.&lt;/p&gt;

&lt;p&gt;For example, when I called the S3 ListBuckets API a few minutes after updating a bucket access policy, the CreationDate value returned for that bucket was the time that I had modified the policy rather than the time that I had created the bucket.  This was also reproduced when using the AWS CLI via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws s3api list-buckets&lt;/code&gt; command.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;It turns out that the &lt;a href=&quot;https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3api/list-buckets.html&quot;&gt;S3 CLI documentation for list-buckets&lt;/a&gt; explicitly states that CreationDate values can change when you make changes to your bucket:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;CreationDate -&amp;gt; (timestamp)&lt;/p&gt;

  &lt;blockquote&gt;
    &lt;p&gt;Date the bucket was created. This date can change when making changes to your bucket, such as editing its bucket policy.&lt;/p&gt;
  &lt;/blockquote&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, in &lt;a href=&quot;https://github.com/aws/aws-cli/issues/3597&quot;&gt;this GitHub issue&lt;/a&gt;, an AWS engineer confirmed there was in fact one AWS region you could query to get the original creation times, the “us-east-1” region, and that this was a feature of how S3 was designed.  Other regions’ CreationDate values would change when key bucket attributes or access policies were modified.&lt;/p&gt;

&lt;p&gt;Why does only one region give the correct creation time? It turns out that all S3 buckets are created in one master region, and then replicated globally.  Each region’s replica is only aware of its own creation date.  When bucket changes are propagated across regions via new replication events, the new replica creation dates are what are reflected in regions other than the master region.&lt;/p&gt;

&lt;p&gt;I followed up with the S3 team for more details since my team interacts with services across multiple AWS partitions and regions, and from tests it looked like the master region differed between partitions.  As of September 4, 2021, &lt;a href=&quot;https://docs.aws.amazon.com/sdk-for-ruby/v2/api/Aws/Partitions.html&quot;&gt;AWS documentation&lt;/a&gt; indicates that there are currently three AWS partitions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;“aws”, the classic AWS partition&lt;/li&gt;
  &lt;li&gt;“aws-cn”, AWS China&lt;/li&gt;
  &lt;li&gt;“aws-us-gov”, AWS GovCloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The S3 engineer confirmed that each AWS partition has a single S3 master region, and that querying S3 from that master region would be reliable for retrieving original bucket creation dates.  The master regions are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;“us-east-1” for the “aws” partition&lt;/li&gt;
  &lt;li&gt;“cn-north-1” for the “aws-cn” partition&lt;/li&gt;
  &lt;li&gt;“us-gov-west-1” for the “aws-us-gov” partition&lt;/li&gt;
&lt;/ul&gt;
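
&lt;p&gt;In code, the mapping above can be captured as a small lookup.  The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S3_MASTER_REGIONS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s3_master_region&lt;/code&gt; are hypothetical helpers for illustration, and the table only covers the three partitions listed in this post.&lt;/p&gt;

```python
# Partition-to-master-region mapping as listed in this post.
S3_MASTER_REGIONS = {
    "aws": "us-east-1",
    "aws-cn": "cn-north-1",
    "aws-us-gov": "us-gov-west-1",
}


def s3_master_region(partition):
    """Return the region to query for original bucket creation dates."""
    try:
        return S3_MASTER_REGIONS[partition]
    except KeyError:
        raise ValueError(f"Unknown AWS partition: {partition!r}") from None
```

&lt;p&gt;Assuming boto3 and valid credentials are available, the result could be passed as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;region_name&lt;/code&gt; argument when creating the S3 client that calls ListBuckets.&lt;/p&gt;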

&lt;p&gt;Demo:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Queries from us-west-2 yield an incorrect creation time which is actually the time when the bucket policy was updated.&lt;/span&gt;
% aws configure &lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;region &lt;span class=&quot;s2&quot;&gt;&quot;us-west-2&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; aws s3api list-buckets | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-A&lt;/span&gt; 1 &lt;span class=&quot;s2&quot;&gt;&quot;masayson-creation-date-test-20210829-1725&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;CreationDate
            &lt;span class=&quot;s2&quot;&gt;&quot;CreationDate&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;2021-08-30T20:12:58+00:00&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Queries from us-east-1 yield the original bucket creation time&lt;/span&gt;
% aws configure &lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;region &lt;span class=&quot;s2&quot;&gt;&quot;us-east-1&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; aws s3api list-buckets | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-A&lt;/span&gt; 1 &lt;span class=&quot;s2&quot;&gt;&quot;masayson-creation-date-test-20210829-1725&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;CreationDate
            &lt;span class=&quot;s2&quot;&gt;&quot;CreationDate&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;2021-08-29T17:25:45+00:00&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;S3 ListBuckets API documentation: &lt;a href=&quot;https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListBuckets.html&quot;&gt;https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListBuckets.html&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;S3 list-buckets CLI documentation: &lt;a href=&quot;https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3api/list-buckets.html&quot;&gt;https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3api/list-buckets.html&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;GitHub issue describing changing S3 bucket creation dates: &lt;a href=&quot;https://github.com/aws/aws-cli/issues/3597&quot;&gt;https://github.com/aws/aws-cli/issues/3597&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Sun, 05 Sep 2021 03:00:00 +0000</pubDate>
        <link>https://www.marksayson.com/blog/s3-bucket-creation-dates-s3-master-regions/</link>
        <guid isPermaLink="true">https://www.marksayson.com/blog/s3-bucket-creation-dates-s3-master-regions/</guid>
        
        
        <category>aws</category>
        
      </item>
    
  </channel>
</rss>
