Mark Sayson

Reducing Lambda latency by 76% with AWS Lambda Power Tuning

Sun, 14 Jul 2024 00:00:00 +0000

Introduction

Optimizing AWS Lambda memory allocation can reduce customer-facing latency by a factor of 2 to 5 without significantly increasing hardware costs. However, finding the right setting takes trial and error, so many teams just pick an amount of memory and stick with it, leaving their services several times slower than necessary.

Other teams spend hours setting up custom code and metrics to measure latencies for each of their service’s use cases, benchmark each use case against various memory capacities, and use the AWS Cost Estimator or AWS Lambda pricing documentation to estimate costs and choose the amount of memory with the best latency-to-cost tradeoff.

This is no longer necessary with the AWS Lambda Power Tuning tool, which can be run against any Lambda function in your AWS account to automatically determine the optimal memory capacity that minimizes execution latency and/or hardware costs.

There is no cost to deploy and run this tool besides its underlying hardware costs, which are likely to be free if you only run it a few times before deleting it from your account.

Since it only relies on AWS-infrastructure-level API calls, the tool works regardless of which programming language your Lambda function uses, and doesn’t require any modifications to your service infrastructure or code.

Set up

The AWS Lambda Power Tuning GitHub repo documents multiple ways to deploy the tool: the AWS Serverless Application Repository (simplest), AWS SAM CLI, AWS CDK, or Terraform.

I used the AWS Serverless Application Repository since this reduces set-up to a few button clicks, and I planned to tear down the tool after optimizing my Lambda function.

To use this deployment option, you can simply log into your AWS account, visit https://serverlessrepo.aws.amazon.com/applications/arn:aws:serverlessrepo:us-east-1:451282441545:applications~aws-lambda-power-tuning, and click Deploy.

This will create an AWS Lambda Application that encapsulates all the infrastructure for the tuning tool, including the AWS Step Functions State Machine that you’ll invoke to run the benchmark tests.

Click on the powerTuningStateMachine resource to open the state machine, click Start Execution, and enter the JSON payload to run the benchmark test with. The input parameters are documented in the tool’s GitHub README.

For example, the following payload runs the tool against the given Lambda function, with 15 executions each for 512, 1024, 1536, 2048, and 3008 MB of memory, with a function payload specific to my API service, and the balanced optimization strategy.

{
  "lambdaARN": "arn:aws:lambda:us-west-2:123456789012:function:TestLambdaFunctionName",
  "powerValues": [
    512,
    1024,
    1536,
    2048,
    3008
  ],
  "num": 15,
  "payload": {
    "resource": "/v1/consent-management/services/{serviceId}/users/{userId}/consents",
    "path": "/v1/consent-management/services/TestServiceId/users/TestUserId/consents",
    "httpMethod": "GET",
    "pathParameters": {
      "serviceId": "TestServiceId",
      "userId": "TestUserId"
    },
    "requestContext": {
      "resourceId": "1abc2d",
      "resourcePath": "/v1/consent-management/services/{serviceId}/users/{userId}/consents",
      "operationName": "ListServiceUserConsent",
      "httpMethod": "GET",
      "path": "/v1/consent-management/services/{serviceId}/users/{userId}/consents",
      "accountId": "123456789012",
      "protocol": "HTTP/1.1",
      "stage": "test"
    }
  },
  "parallelInvocation": false,
  "strategy": "balanced"
}

I set parallelInvocation to false after observing Lambda throttling errors with it set to true, since my test Lambda isn’t currently provisioned for high load. I set strategy to balanced to weight minimizing latency and minimizing costs equally; you can also configure the tool to consider only one of them, or to use a different weighted average.

Analyzing results

Once the execution completes, the Execution input and output tab will display the recommended amount of memory as the power value, the resulting average latency in milliseconds and cost per execution, and the URL to a more detailed visualization.

By navigating to that URL, we can view a graph of average latency and execution costs for each amount of memory measured, along with summarized best and worst memories for latency and cost.

In this case, for my Lambda function, which is written in Java and queries a DynamoDB table, 2048 MB of memory resulted in the lowest average latency, while 1024 MB of memory had the lowest runtime costs.

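We can see that 512 MB actually costs more than 1024 MB. Lambda bills duration in GB-seconds (allocated memory multiplied by execution time), and the duration at 512 MB is several times higher, which more than cancels out the savings from the smaller allocation.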
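
To make that arithmetic concrete, here is a minimal sketch of how duration charges scale with memory. The class name is mine, the price per GB-second is passed in as a parameter, and the durations in main are hypothetical placeholders rather than my measured values, so check the AWS Lambda pricing page for your region and architecture before relying on the numbers.

```java
/**
 * Rough sketch of Lambda duration pricing:
 * cost = memory (GB) x duration (seconds) x price per GB-second.
 * The price and durations used in main are illustrative assumptions.
 */
final class LambdaCostSketch {
    /** Estimated duration charge in USD for a single invocation. */
    static double costPerInvocation(int memoryMb, double durationMs, double pricePerGbSecond) {
        double memoryGb = memoryMb / 1024.0;
        double durationSeconds = durationMs / 1000.0;
        return memoryGb * durationSeconds * pricePerGbSecond;
    }

    public static void main(String[] args) {
        // Assumption: verify this rate against current AWS Lambda pricing.
        double assumedPricePerGbSecond = 0.0000166667;
        // Hypothetical durations: the smaller allocation runs 2.5x longer.
        double small = costPerInvocation(512, 50.0, assumedPricePerGbSecond);
        double large = costPerInvocation(1024, 20.0, assumedPricePerGbSecond);
        System.out.printf("512 MB:  $%.10f per invocation%n", small);
        System.out.printf("1024 MB: $%.10f per invocation%n", large);
    }
}
```

Whenever halving the memory more than doubles the duration, the smaller allocation ends up with more GB-seconds and therefore a higher bill.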

This was only run for 15 iterations per memory allocation, so I increased the sample size and reran against 1024, 1536, and 2048 MB by setting powerValues and num to "powerValues": [1024, 1536, 2048], "num": 50.

I executed the Lambda function a couple of times first with a test payload to eliminate cold starts as a confounding factor, and then ran the state machine with the new config, which resulted in the following output and visualization:

{
  "power": 1536,
  "cost": 3.2760000000000005e-7,
  "duration": 12.266666666666667,
  "stateMachine": {
    "executionCost": 0.00023,
    "lambdaCost": 0.00012891480000000002,
    "visualization": "https://lambda-power-tuning.show/#AAQABgAI;3t2NQUREREFERExB;ilmiNADhrzRWgeo0"
  }
}

The more detailed visualization indicates that for our particular use case, we’re unlikely to see significant performance improvements from increasing memory above 1536 MB, and the marginal cost increase from 1024 MB to 1536 MB is acceptable for us.

You can see a more detailed table view of the underlying data by going to the step function execution’s Detail tab, selecting the Table view, selecting the Analyzer task, and selecting the Analyzer panel’s Output tab.

Tear-down

When you no longer need the tool, you can open the AWS CloudFormation console and delete the serverlessrepo-aws-lambda-power-tuning CloudFormation stack.

Outcome

The tool took under 10 minutes to deploy, execute, and fine-tune, and resulted in me changing my test Lambda’s memory allocation from 512 MB to 1536 MB.

This lowered my API’s average latency from 50ms to 12ms, a 4.17x improvement, AKA 76% latency reduction. Duration costs increased by 8% to $0.3276/million executions, which is minimal for my service’s scale.
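
For anyone who wants to check the ratios for their own before and after measurements, a small sketch follows. The class and method names are hypothetical and the inputs are the rounded averages quoted above.

```java
/** Helper for comparing a latency measurement before and after a change. */
final class LatencyImprovement {
    /** Speed-up factor, for example 4.0 means four times faster. */
    static double speedup(double beforeMs, double afterMs) {
        return beforeMs / afterMs;
    }

    /** Percentage reduction in latency relative to the starting value. */
    static double percentReduction(double beforeMs, double afterMs) {
        return (beforeMs - afterMs) / beforeMs * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("speed-up: %.2fx%n", speedup(50.0, 12.0));
        System.out.printf("reduction: %.0f%%%n", percentReduction(50.0, 12.0));
    }
}
```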

Given the latency improvements of choosing the right amount of memory, and how easy this tool is to use, I’d recommend it to anyone building services on AWS Lambda.

References

AWS Lambda docs introducing AWS Lambda Power Tuning: https://docs.aws.amazon.com/lambda/latest/operatorguide/profile-functions.html

AWS Lambda Power Tuning GitHub repository with usage details: https://github.com/alexcasalboni/aws-lambda-power-tuning

AWS Lambda pricing: https://aws.amazon.com/lambda/pricing/

Serializing and deserializing DynamoDB pagination tokens to support paginated APIs

Sat, 18 May 2024 17:00:00 +0000

When using AWS’s Java 2.x SDK, DynamoDB scan and query responses provide pagination tokens in a Map<String, AttributeValue> lastEvaluatedKey object, which represents the primary key of the last processed DynamoDB item. You can then pass this value as the “exclusive start key” for the next query to get the next page of results.

When your service retrieves all pages of results locally, this isn’t a problem. However, when you want to provide a paginated API backed by DynamoDB, you’ll need to convert this attribute value map into a format that can be passed over HTTP, AKA “serialize” the object into a string.

When your client requests the next page of results with that string pagination token, you’ll also need to convert that string back into the Map<String, AttributeValue> format that the AWS SDK expects, AKA “deserialize” the string to the original data structure.

Prior method for serializing/deserializing pagination tokens

Before May 2023, building paginated APIs backed by DynamoDB was not very convenient, as you’d have to build your own custom serialization and deserialization code.

Example implementation using Immutables and Jackson, with a sample DynamoDB table primary key that has both a partition key and a sort key:

import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
import com.fasterxml.jackson.databind.annotation.JsonSerialize;
import org.immutables.value.Value.Immutable;
import org.immutables.value.Value.Parameter;
import org.immutables.value.Value.Style;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

import java.io.IOException;
import java.util.Base64;
import java.util.Map;

/**
 * Serializable representation of a Product DynamoDB pagination token.
 * Using Immutables to generate safe, immutable value objects.
 * @see https://immutables.github.io/
 */
@JsonDeserialize(builder = ProductNextTokenBuilder.class)
@JsonSerialize
@Style(visibility = Style.ImplementationVisibility.PRIVATE)
@Immutable
interface ProductNextToken {
    @Parameter
    String getPartitionKey();
    @Parameter
    String getSortKey();
}

/**
 * Class encapsulating logic to convert DynamoDB pagination tokens between attribute value
 * maps used by the AWS SDK, and string values that can be passed over HTTP.
 */
class ProductSerializer {
    private static final String PRODUCT_TABLE_PARTITION_KEY = "YourDynamoDBTablePartitionKeyName";
    private static final String PRODUCT_TABLE_SORT_KEY = "YourDynamoDBTableSortKeyName";

    // Jackson mapper used to convert the token object to and from JSON.
    private final ObjectMapper objectMapper;

    ProductSerializer(final ObjectMapper objectMapper) {
        this.objectMapper = objectMapper;
    }

    /**
     * Serialize a lastEvaluatedKey from an attribute value map to a string.
     *
     * @param lastEvaluatedKey attribute map returned by paginated DynamoDB queries.
     * @return serialized String token that can be passed over HTTP.
     * @throws JsonProcessingException exception thrown if unable to parse the key.
     */
    public String serializeLastEvaluatedKey(final Map<String, AttributeValue> lastEvaluatedKey) throws JsonProcessingException {
        if (lastEvaluatedKey == null) {
            return null;
        }

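        // Unwrap the string values from the AttributeValue objects (assumes both keys are of type S).
        final ProductNextToken tokenObject = new ProductNextTokenBuilder()
            .partitionKey(lastEvaluatedKey.get(PRODUCT_TABLE_PARTITION_KEY).s())
            .sortKey(lastEvaluatedKey.get(PRODUCT_TABLE_SORT_KEY).s())
            .build();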

        return Base64.getUrlEncoder().encodeToString(objectMapper.writeValueAsBytes(tokenObject));
    }

    /**
     * Deserialize a lastEvaluatedKey from a string to an attribute value map.
     *
     * @param encodedLastEvaluatedKey Base64-encoded string token received over HTTP.
     * @return attribute value map that can be passed as the exclusive start key of the next DynamoDB query.
     * @throws IOException exception thrown if unable to decode encodedLastEvaluatedKey.
     * @throws JsonParseException exception thrown if unable to deserialize the decoded key into a ProductNextToken.
     */
    public Map<String, AttributeValue> deserializeLastEvaluatedKey(final String encodedLastEvaluatedKey) throws IOException, JsonParseException {
      if (encodedLastEvaluatedKey == null) {
          return null;
      }

      final ProductNextToken deserializedToken = objectMapper.readValue(
          Base64.getUrlDecoder().decode(encodedLastEvaluatedKey),
          ProductNextToken.class
      );

      final AttributeValue partitionKeyValue = AttributeValue.builder()
          .s(deserializedToken.getPartitionKey())
          .build();

      final AttributeValue sortKeyValue = AttributeValue.builder()
          .s(deserializedToken.getSortKey())
          .build();

      return Map.of(
          PRODUCT_TABLE_PARTITION_KEY, partitionKeyValue,
          PRODUCT_TABLE_SORT_KEY, sortKeyValue
      );
    }
}

This is a lot of code to maintain and test, with multiple exception cases. We can remove the dependency on specific key structure by generalizing the code to iterate over the map and JSON key-value pairs, as shown in https://github.com/aws/aws-sdk-java-v2/issues/3224, but this is still more complex than should be necessary for what we’d prefer to be simple “stringify” and “unstringify” methods.

Serialization/deserialization with the DynamoDB Enhanced Document library

Since May 2023, AWS’s Java 2.x SDK includes an Enhanced Document library that simplifies converting pagination tokens between the AWS SDK’s objects and JSON strings that can be passed over HTTP.

The software.amazon.awssdk.enhanced.dynamodb.document.EnhancedDocument class includes utility methods that make serialization and deserialization one-liners.

AWS blog post demonstrating use cases: https://aws.amazon.com/blogs/devops/introducing-the-enhanced-document-api-for-dynamodb-in-the-aws-sdk-for-java-2-x/

Sample code for converting between Map<String, AttributeValue> pagination tokens and JSON strings:

import software.amazon.awssdk.enhanced.dynamodb.document.EnhancedDocument;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

import java.io.UncheckedIOException;
import java.util.Map;

/**
 * Class encapsulating logic to convert DynamoDB pagination tokens between attribute value
 * maps used by the AWS SDK, and string values that can be passed over HTTP.
 */
class ProductSerializer {
    /**
      * Convert a DynamoDB attribute value map to a JSON string.
      * @param attributeValueMap DynamoDB item key represented as a map from attribute names to attribute values
      * @return String JSON string representation of the DynamoDB item key
      */
    public String serializeLastEvaluatedKey(final Map<String, AttributeValue> attributeValueMap) {
        return EnhancedDocument.fromAttributeValueMap(attributeValueMap).toJson();
    }

    /**
      * Convert a JSON string representation of a DynamoDB pagination token to the format required by DynamoDB API calls.
      * @param paginationTokenJson JSON string representing the last paginated API call's last evaluated record key
      * @return Map<String, AttributeValue> exclusive start key for the next paginated DynamoDB scan/query API call
      * @throws UncheckedIOException exception thrown if fail to parse pagination token
      */
    public Map<String, AttributeValue> deserializeLastEvaluatedKey(final String paginationTokenJson) throws UncheckedIOException {
        return EnhancedDocument.fromJson(paginationTokenJson).toMap();
    }
}

This is much more manageable, with serialization and deserialization functionality now provided out-of-the-box as part of the standard AWS SDK.

We can pass this serialized JSON string to our API clients as the client-facing pagination token. Optionally, if it’s important to us to obfuscate our internal DynamoDB key structure from clients, we can add back a Base64 encode/decode layer on top of the JSON strings using the same code snippets from the earlier example.
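
As a rough sketch of that optional obfuscation layer using only the JDK: the class name and the sample JSON below are hypothetical, and the JSON string stands in for whatever your serializer produces. Keep in mind that Base64 only hides the key structure from casual inspection; it is not encryption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Wraps a JSON pagination token in URL-safe Base64 so it can travel in a query string. */
final class PaginationTokenCodec {
    /** Encodes the JSON token; returns null when there is no next page. */
    static String encode(String json) {
        if (json == null) {
            return null;
        }
        return Base64.getUrlEncoder().withoutPadding()
            .encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    /** Decodes a client-supplied token; throws IllegalArgumentException if it is not valid Base64. */
    static String decode(String token) {
        if (token == null) {
            return null;
        }
        return new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = "{\"productId\":\"p-123\"}"; // hypothetical key
        String token = encode(json);
        System.out.println(token);
        System.out.println(decode(token));
    }
}
```

Since the token comes back from the client, it’s worth treating a decoding failure as a bad request rather than an internal error.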

Concurrency from single host applications up to massively distributed services

Fri, 01 Dec 2023 03:00:00 +0000

Concurrency is when multiple software threads or programs are run at the same time, and is a key aspect of many modern applications.

Web browsers run dozens of concurrent processes based on your activity, querying servers, downloading files, and executing scripts all at once.

Online services with millions of active users run a scaled up number of concurrent processes across thousands of servers, with various distributed system design patterns to support this.

This post will describe several levels of concurrency, how they’re commonly applied, and pros and cons of each approach.

Levels of concurrency

Multi-threaded applications

Within application code, we can run multiple threads concurrently.

This approach can be locally applied regardless of whether an application is run on a single host or in a distributed service. However, multi-threaded code increases code complexity and introduces thread safety issues, and an error in one thread may take down the entire application.

A common use case for multi-threading is when we need to make multiple requests to other services that may each take multiple seconds to complete. We can trigger each request in a separate thread to run them concurrently, and collect the results at the end of the longest running call, rather than synchronously making one request at a time after the prior response has returned.
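
A minimal sketch of this pattern with the JDK’s CompletableFuture is shown below. The fakeRequest method is a hypothetical stand-in for a slow remote call (it just sleeps), and the class name and thread pool sizing are my assumptions rather than a recommendation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ConcurrentRequests {
    /** Hypothetical stand-in for a slow remote call: sleeps, then returns a label. */
    static String fakeRequest(String name, long delayMs) {
        try {
            Thread.sleep(delayMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + name, e);
        }
        return name + ":done";
    }

    /** Starts every request on its own pool thread, then waits for all of them. */
    static List<String> fetchAll(List<String> names, long delayMs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, names.size()));
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String name : names) {
                futures.add(CompletableFuture.supplyAsync(() -> fakeRequest(name, delayMs), pool));
            }
            // join() blocks until each future completes; results keep the input order.
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(Arrays.asList("inventory", "pricing", "reviews"), 200));
    }
}
```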

Latency trade-offs

Example 1: Suppose we will run 4 requests that take 2 seconds, 5 seconds, 5 seconds, and 1 second respectively to complete, and each thread adds 20 milliseconds of overhead to start and close. We’ll exclude the time to combine results as equivalent between multi-threaded and synchronous approaches. Our runtime with multi-threading will be max(2, 5, 5, 1) + 0.02*4 = 5.08 seconds, compared to the synchronous approach taking 2 + 5 + 5 + 1 = 13 seconds to make all requests. In this scenario, multi-threading reduces our latency by 7.92 seconds.

Example 2: Splitting tasks into threads does not come for free and may not be worthwhile for very short-lived requests. For example, if we have 1000 requests that each take 0.01 seconds to complete, running each request in a separate thread would take max(0.01) + 0.02*1000 = 20.01 seconds, compared to the synchronous approach taking 1000*0.01 = 10 seconds. In this case, the synchronous approach is roughly twice as fast as multi-threading.

Since the overhead of starting so many threads is high, in reality, we’ll typically break this workflow up into batches of requests per thread, such as 200 requests per thread.

Example 2b: Given 1000 requests that each take 0.01 seconds to complete, if we split the work into 5 batches of 200 requests per thread, computing all the results would take max(200*0.01, 200*0.01, 200*0.01, 200*0.01, 200*0.01) + 0.02*5 = 2.1 seconds, compared to the synchronous approach taking 1000*0.01 = 10 seconds. By batching the work before applying multi-threading, we can reduce latency compared to synchronous calls by 7.9 seconds.
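
The simplified model used in these examples (slowest thread plus a fixed overhead per thread, versus the sum of all requests) can be written down in a few lines. This is only a sketch of that back-of-envelope model with hypothetical names; real thread scheduling is less tidy.

```java
/** Back-of-envelope latency model from the examples above. */
final class ThreadingLatencyModel {
    /** Synchronous latency: each request starts after the previous one finishes. */
    static double sequentialSeconds(double[] requestSeconds) {
        double total = 0.0;
        for (double seconds : requestSeconds) {
            total += seconds;
        }
        return total;
    }

    /** Threaded latency: the slowest thread plus a fixed start/close overhead per thread. */
    static double threadedSeconds(double[] perThreadSeconds, double overheadPerThread) {
        double slowest = 0.0;
        for (double seconds : perThreadSeconds) {
            slowest = Math.max(slowest, seconds);
        }
        return slowest + overheadPerThread * perThreadSeconds.length;
    }

    public static void main(String[] args) {
        double[] exampleOne = {2.0, 5.0, 5.0, 1.0};
        System.out.println("Example 1 threaded:   " + threadedSeconds(exampleOne, 0.02));
        System.out.println("Example 1 sequential: " + sequentialSeconds(exampleOne));

        // Example 2b: five threads, each working through 200 requests of 0.01 s.
        double[] batches = {2.0, 2.0, 2.0, 2.0, 2.0};
        System.out.println("Example 2b threaded:  " + threadedSeconds(batches, 0.02));
    }
}
```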

Multi-threading provides the most latency reduction when we’re able to run multiple long-running tasks in parallel, especially multi-second tasks, whether each task is a single long-running request or a series of requests adding up to seconds.
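
Here is one way the batching approach could look in code, as a sketch under assumptions: handle is a hypothetical placeholder for a single short request, and the batch size is whatever you pass in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BatchedWorker {
    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Hypothetical placeholder for one short request; here it just doubles the id. */
    static int handle(int requestId) {
        return requestId * 2;
    }

    /** Runs each batch sequentially on its own thread and returns results in input order. */
    static List<Integer> processInBatches(List<Integer> requestIds, int batchSize) {
        List<List<Integer>> batches = partition(requestIds, batchSize);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, batches.size()));
        try {
            List<Callable<List<Integer>>> tasks = new ArrayList<>();
            for (List<Integer> batch : batches) {
                tasks.add(() -> {
                    List<Integer> out = new ArrayList<>();
                    for (int id : batch) {
                        out.add(handle(id));
                    }
                    return out;
                });
            }
            // invokeAll returns futures in the same order as the submitted tasks.
            List<Integer> results = new ArrayList<>();
            for (Future<List<Integer>> future : pool.invokeAll(tasks)) {
                results.addAll(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing batches", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a batch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count here equals the batch count, so choosing the batch size is also how you cap the number of threads.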

Multi-container hosts

Within a host, we can run multiple containers which each receive allocated memory and run an isolated instance of application code.

This allows us to fully utilize a host’s CPU and memory. However, we will eventually reach a point where the host no longer has sufficient CPU or memory capacity to add more containers, or where performance begins to drop due to increased context switching and IO bottlenecks.

Isolating concurrent applications in separate containers also improves system reliability, since regardless of individual application failures, the other containers can continue running. However, we are still vulnerable to host-level failures.

A single host can be sufficient for some small-scale services that only have a few hundred concurrent requests and can tolerate being taken offline periodically for maintenance. For services that need to provide 24/7 availability or handle more traffic, we will graduate to distributed services where this host will be a single unit of a larger architecture, leading us to the multi-host cluster.

Multi-host clusters behind a load balancer

When we require high availability or more concurrency than a single host can support, we can set up a load balancer that distributes traffic across multiple hosts, forming a cluster of hosts.

This allows us to horizontally scale, that is, add or remove servers to our resource pool as needed. Horizontal scaling makes our service more robust to individual host failures and enables more flexibility in our infrastructure, allowing us to swap out different types of hosts at will, patch or update individual hosts without affecting service availability, and pay for just as many hosts as are needed to meet current demand.

This is often the go-to design pattern for services that need to process thousands of concurrent requests, which a single host may no longer be able to handle.
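
To illustrate the core idea of distributing traffic across hosts, here is a toy round-robin selector. It is a sketch only: the class name and host names are hypothetical, and production load balancers add health checks, connection draining, and other routing algorithms that are omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy round-robin host selector; real load balancers also track host health. */
final class RoundRobinBalancer {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        this.hosts = new ArrayList<>(hosts);
    }

    /** Returns the next host in rotation; safe to call from multiple threads. */
    String pick() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), hosts.size());
        return hosts.get(index);
    }

    public static void main(String[] args) {
        RoundRobinBalancer balancer =
            new RoundRobinBalancer(Arrays.asList("host-a", "host-b", "host-c"));
        for (int i = 0; i < 5; i++) {
            System.out.println(balancer.pick());
        }
    }
}
```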

Multi-cluster services

When we have more traffic than a single load balancer can handle, we can set up a DNS load balancer to distribute traffic across multiple clusters.

This is rarely the starting point for a new service. We only want to add this level of complexity when absolutely necessary, such as when scaling up to millions of concurrent requests, or after hitting infrastructure restrictions on load balancer concurrent connections or maximum attached endpoints.

Many cloud providers provide distributed DNS load balancers that remove the single point of failure of a traditional load balancer, scale to millions of concurrent users, and automatically route traffic to the closest regional cluster.

DNS load balancer trade-offs

DNS load balancers are more limited in functionality than many specialized load balancers. For example, AWS network load balancers can support more granular access controls and security configurations, and integrate with compute services to automatically replace unhealthy hosts that fail to respond to the load balancer.

DNS also requires its connected endpoints to be accessible to the Internet, which is not always ideal. Following the security principle of defence-in-depth, when protecting critical data or infrastructure, anything that doesn’t need to be connected to the Internet, shouldn’t be. Network load balancers can be set up in protected virtual private networks to only allow access from allow-listed hosts or other trusted networks.

For these reasons, in some scenarios it will make sense to have the added complexity of both a frontend DNS load balancer to distribute traffic to the closest cluster, and backend application load balancers that provide more functionality and integration with your local infrastructure.

If you don’t need any functionality that isn’t supported by a DNS load balancer, can live with your servers being accessible from the Internet, and already manage your own health monitoring and host replacement strategy, then you can simplify your architecture by having a DNS load balancer directly route traffic to your backend servers.

Summary

We’ve discussed how concurrency can be applied at multiple levels:

Multi-threaded applications that run multiple tasks in parallel, such as querying several websites simultaneously
Multi-container or multi-process hosts that run multiple applications in isolation from one another, so that a given application can continue running if others fail
Multi-host clusters that enable horizontally scaling a service to process hundreds of thousands of concurrent requests
Multi-cluster services that enable routing traffic to local load-balanced clusters that can be independently scaled, to process millions of concurrent requests

Many distributed services now start with multi-host clusters for reliability and scalability reasons, so that any given host can be replaced without impacting customer service, and additional hosts can be added as needed.

A single load balancer and backend compute cluster can often handle hundreds of thousands of concurrent requests or more, but the load balancer may become a single point of failure for your service. Distributed DNS load balancers can help to mitigate this concern when it’s acceptable for your servers to be accessible from the Internet.

For applications where you need to handle millions of concurrent requests and have business requirements not met by a single DNS load balancer, such as needing granular access control for your backend servers or integrations with other infrastructure, a DNS load balancer in front of multiple load-balanced clusters can meet these demands with the trade-off of an additional layer of complexity.

Addendum

Before scaling your service to process millions of concurrent requests and paying hundreds of thousands of dollars to do so, make sure this is really necessary.

Would it be more efficient to extract some of your use cases to a separate microservice?

Are your hosts really doing unique work on every call? Could some of that work be deduplicated, or could the right application of a caching layer reduce your traffic and/or average latency by orders of magnitude?
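
As a small illustration of how a cache deduplicates repeated work, here is a sketch of an in-process least-recently-used cache built on the JDK’s LinkedHashMap. The class name and capacity are assumptions, and a real deployment would more likely use a dedicated caching library or service.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Small least-recently-used cache that counts how often the loader had to run. */
final class LruCache<K, V> {
    private final Map<K, V> entries;
    private int misses = 0;

    LruCache(final int capacity) {
        // accessOrder=true keeps the least recently used entry first, ready for eviction.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, or computes and stores it on a miss. */
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            misses++;
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    /** Number of calls that had to fall through to the loader. */
    synchronized int misses() {
        return misses;
    }
}
```

Comparing misses against total calls gives a quick read on how much backend traffic the cache is absorbing.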

Also, note that millions of concurrent users do not always translate into millions of transactions per second. If each user only needs to make a server request every few seconds, with multiple seconds between where they locally interact with rendered results, you may only have tens to hundreds of thousands of transactions per second, which while still high, lowers the required complexity of the system.
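
As a quick sanity check on that conversion, assuming requests are spread evenly over time (which real traffic rarely is), the estimate is just users divided by the seconds between each user’s requests. The names and numbers below are hypothetical.

```java
/** Converts a concurrent user count into an average request rate. */
final class ThroughputEstimate {
    /** Average requests per second if each active user sends one request every N seconds. */
    static double transactionsPerSecond(long concurrentUsers, double secondsBetweenRequests) {
        return concurrentUsers / secondsBetweenRequests;
    }

    public static void main(String[] args) {
        // Hypothetical: 2 million active users, each making a request every 10 seconds.
        System.out.println(transactionsPerSecond(2_000_000L, 10.0));
    }
}
```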

Software architecture design is an iterative process, and the optimal design will change along with the business, so it’s often worth starting with the simplest approach that meets current needs and can be scaled up or down as needed based on customer traffic. There’s no prize for building the most expensive service that no one uses.

Process for designing distributed systems

Wed, 31 May 2023 03:00:00 +0000

In this post I’ll step through my process for designing distributed systems, with example questions and artifacts associated with each step.

Step by step process

1. Validate whether this service needs to exist

Before building any complex system, we should ensure there’s a compelling project motivation. If we can’t identify an underlying customer problem and how this service will address it, we should pause to make sure we’re working on the right thing.

Example questions:

What specific problem or customer need are we trying to address?
How will the customer need be addressed by this service?
What will the end state be after this is completed?
Why are existing solutions not sufficient?

A few hours of research to check what aspects of the problem could be solved by existing services may save both money and months of engineering hours. If we can leverage existing solutions, we should make sure they are well supported and well documented.

If we have a compelling justification for the service after answering the above questions, we’ll continue with the design work, otherwise this may be an indication we should put the project aside to focus on more impactful work.

Artifacts of this step:

Project justification including problem statement and brief summary of how the service will solve that problem in a way that isn’t satisfied by existing solutions.

2. Clarify business requirements

Before making design decisions, it helps to take the time to understand the business use cases and identify what needs to be supported in the first release, and what major features are anticipated in the near future. This way we can make appropriate choices that keep our system as simple as possible while making it easy to extend to future needs.

Example questions:

What are the different user personas our system needs to support? Are they internal employees or external customers, human or programmatic?
What latency needs to be supported for each use case? Some APIs may need to be in the 100ms range, others may not be latency-sensitive.
What level of availability is required for this service? Some services are only used during business hours, while others are critical to keep running 24/7 with severe consequences for even an hour of downtime a year.
What are the security requirements for this service? Eg. Who should be allowed to access different APIs, and are there authorization requirements for who can access what data? Is it acceptable for anyone on the Internet to be able to query the service, or does it need to be restricted to only allow-listed services/users? What data needs to be encrypted in transit and at rest? Do we need to protect against malicious users?

Artifacts of this step:

Requirements document including service-level business requirements, specific use cases that must be supported, latency requirements for each use case, and security requirements.

We should identify stakeholders who should have a say in how the service works, and get their feedback so we can drive alignment and make required changes early in the design process while major changes are less costly.

3. Estimate scale

The scale of data and traffic makes a big difference to the architecture needed to support it. Services that only receive a few dozen requests at a time can be very simple, but as we scale up to millions of concurrent requests, we have very different needs around load balancing, host scaling, caching, and data management.

Example questions:

How many users of each type do we expect to have, and how many will be active at a given time?
How much data do we expect our system to have, and how quickly will it grow over time?
What frequency of read vs write operations do we expect on different types of data?
What network bandwidth will we need to support the anticipated traffic?

Artifacts of this step:

Summary of expected scale of total/active users, data, and network traffic.
Summary of expected transactions per second (TPS) per user operation.
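
These estimates don’t need to be precise; a few lines of arithmetic are usually enough. The sketch below uses decimal units (1 GB = 10^9 bytes) and hypothetical inputs, and ignores replication, indexes, and protocol overhead.

```java
/** Back-of-envelope helpers for data volume and network bandwidth. */
final class ScaleEstimate {
    private static final double SECONDS_PER_DAY = 86_400.0;

    /** Raw data written over the given number of days, in gigabytes (10^9 bytes). */
    static double storageGbAfterDays(double writesPerSecond, double bytesPerRecord, int days) {
        double bytes = writesPerSecond * bytesPerRecord * SECONDS_PER_DAY * days;
        return bytes / 1_000_000_000.0;
    }

    /** Outbound bandwidth in megabits per second for a steady response rate. */
    static double bandwidthMbps(double responsesPerSecond, double bytesPerResponse) {
        return responsesPerSecond * bytesPerResponse * 8.0 / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 100 writes/s of 1 KB records, 1000 reads/s of 2 KB responses.
        System.out.printf("storage after a year: %.1f GB%n", storageGbAfterDays(100, 1000, 365));
        System.out.printf("read bandwidth: %.1f Mbps%n", bandwidthMbps(1000, 2000));
    }
}
```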

4. Define system interfaces and data models

In this step we define how callers will interact with our service, which will typically be through API interfaces.

Example questions:

How can we translate our business use cases into API interfaces that are simple, decoupled from implementation (so we can iterate on our backend design and data models without impacting customers), and future-proof when considering expected new features?
How can we name and structure our API interfaces to be self-explanatory when accompanied by API documentation, to third parties who have no knowledge of how our internal systems work?
How can we organize our interfaces around resources and HTTP methods? Following REST API conventions will make it easier for third parties to integrate.

Artifacts of this step:

Draft API spec including method names, request structures, and response structures.

Once we’ve defined our API spec, we can validate it with stakeholders and iterate until we’re confident we have an interface that will meet user requirements and be easily extended to future use cases.
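
As an illustration of what a draft spec can look like when expressed in code, here is a sketch of a paginated list operation. Every name in it is hypothetical, and the in-memory implementation exists only so the contract can be exercised before any backend is built. Note that the pagination token is an opaque string, so the storage layer can change without breaking callers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical draft of a paginated "list orders" API; all names are illustrative. */
final class DraftOrderApi {
    /** Request structure: nextToken is null on the first call. */
    static final class ListOrdersRequest {
        final int maxResults;
        final String nextToken;

        ListOrdersRequest(int maxResults, String nextToken) {
            this.maxResults = maxResults;
            this.nextToken = nextToken;
        }
    }

    /** Response structure: nextToken is null when there are no more pages. */
    static final class ListOrdersResponse {
        final List<String> orderIds;
        final String nextToken;

        ListOrdersResponse(List<String> orderIds, String nextToken) {
            this.orderIds = orderIds;
            this.nextToken = nextToken;
        }
    }

    /** The contract callers depend on, independent of any backend. */
    interface OrderService {
        ListOrdersResponse listOrders(ListOrdersRequest request);
    }

    /** In-memory stand-in that uses a list offset as its token. */
    static final class InMemoryOrderService implements OrderService {
        private final List<String> orders;

        InMemoryOrderService(List<String> orders) {
            this.orders = new ArrayList<>(orders);
        }

        @Override
        public ListOrdersResponse listOrders(ListOrdersRequest request) {
            int start = request.nextToken == null ? 0 : Integer.parseInt(request.nextToken);
            int end = Math.min(start + request.maxResults, orders.size());
            String next = end < orders.size() ? String.valueOf(end) : null;
            return new ListOrdersResponse(new ArrayList<>(orders.subList(start, end)), next);
        }
    }

    public static void main(String[] args) {
        OrderService service = new InMemoryOrderService(Arrays.asList("o-1", "o-2", "o-3"));
        ListOrdersResponse page = service.listOrders(new ListOrdersRequest(2, null));
        System.out.println(page.orderIds + " next=" + page.nextToken);
    }
}
```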

5. Define data flow and storage

While still treating system internals as a black box, we can define how data will flow in and out of our system and be stored.

Example questions:

Who will we consume data from, and how?
Who will consume data from our service, and how?
What data does our service need to store, and what data structures will best support our use cases?
What is the end-to-end lifecycle for data entering our system, for each type of data we store or process?

Artifacts of this step:

Description of how data flows from upstream services/users, to our service, to downstream services/users.
Description of data models we will store and process.

This step may be done in parallel with defining API interfaces, and we’ll similarly want to validate the data workflows with stakeholders, including data producers and consumers, to ensure our contract makes sense before designing a system that doesn’t match reality.

6. Define high-level system components

Now that we’ve aligned on our use cases, system interface, data models, and inter-service data flows, we can build a picture of the high-level components of our system and how they’ll interact.

Example questions:

What logical components make sense for dividing responsibilities? How will data flow between them?
Should client calls go through a load balancer pointing to multiple backend servers, or will a single server suffice for our scaling and reliability needs?
Do we need to implement new data stores, and if so, which components will be retrieving or writing data to them?
Do we have static content that should be separated from our other client/service interactions, for example, with clients querying a Content Delivery Network backed by a distributed file service?

Artifacts of this step:

A simple diagram of labelled blocks representing logical components such as load balancers, computational microservices, and data stores, with arrows pointing in the direction of data flow.
Workflow diagrams corresponding to business use cases.

This provides the baseline for designing each logical component and choosing technologies to use in the next step.

7. Design individual components

As we dive into the design of each logical component, we can make technology choices based on our business, security, and scaling requirements, comparing options based on how they meet our anticipated current and future needs.

If we’ve separated out logical components into their own microservices, we should be able to independently update and scale individual components going forward.

Our detailed system design will narrow down:

Compute types - see post on AWS compute options
Data store types - the data type and scale of data and access patterns will guide whether we choose a NoSQL key-value store such as DynamoDB, an object store such as S3, a traditional SQL database such as PostgreSQL on RDS, or another type of data store entirely
Caching layers - see post on AWS caching options
API access controls and how they will be enforced
Replication strategies for servers and data

Authorization controls may include:

Restricting who can access specific API methods - this can be enforced through role or resource based access policies.
Restricting non-admin users to only access their own data - this may leverage some combination of service code and database row-level security.
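
As a rough illustration of the two controls above, the sketch below combines a role-based method check with an ownership check. The role names, method names, and User type are hypothetical; a real service would typically delegate the first check to role or resource based access policies and the second to service code or row-level security.

```python
from dataclasses import dataclass

# Hypothetical mapping of API methods to the roles allowed to call them.
METHOD_ROLES = {
    "GetOrder": {"admin", "customer"},
    "DeleteOrder": {"admin"},
}


@dataclass(frozen=True)
class User:
    user_id: str
    role: str


def is_authorized(user: User, method: str, resource_owner_id: str) -> bool:
    """Return True if the user may call `method` on a resource owned by `resource_owner_id`."""
    # Control 1: restrict who can access specific API methods.
    if user.role not in METHOD_ROLES.get(method, set()):
        return False
    # Control 2: non-admin users may only access their own data.
    return user.role == "admin" or user.user_id == resource_owner_id
```

Unknown methods are denied by default, which keeps newly added API methods closed until a role is explicitly granted access.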

Artifacts of this step:

A detailed system architecture diagram specifying component names, compute/database types (e.g. Lambda, DynamoDB), and data flow between components.
Workflow diagrams for each logical component.
Summary of why each technology choice was made.

If we’ve done our job well, other software engineers should be able to skim the design document and have a general understanding of what needs to be built for this system to work as intended, how the system components will interact, and how upstream and downstream services and users will interact with the service.

If we’ve documented how we made each decision, they should also be able to understand what parts of the design will apply to future projects, and where and how they should deviate based on their use cases and scaling requirements.

Summary

We will take the following steps to design a new service:

Validate whether this service needs to exist
Clarify business requirements
Estimate scale
Define system interfaces and data models
Define data flow and storage
Define high-level system components
Design individual components

The artifacts of each step can be validated with stakeholders to ensure we’re on the right track before continuing. They collectively add to a design document that can be referred to both while building the service, and afterwards to understand its inner workings.

Choosing between AWS compute services

Fri, 28 Apr 2023 03:00:00 +0000

When building a new service in AWS, it can be difficult to decide between all the available compute services. In this post I’ll give a brief overview of the main options and describe how I compare and choose between them for a given project.

Overview of AWS compute services

AWS compute services include AWS Lambda, EC2 (Elastic Compute Cloud), and Fargate, where EC2 and Fargate can both be run through container orchestration services ECS (Elastic Container Service) or EKS (Elastic Kubernetes Service).

Lambda

Lambda provides one of the simplest ways to run code on-demand. You can configure Lambda functions to be automatically triggered via other AWS services or events, or invoke them directly through API calls.

Lambda functions are intended for short-lived operations and have a maximum runtime of 15 minutes.

EC2

EC2 is the underlying compute service for most other AWS services including Lambda and Fargate, and offers a broad range of instance types that support different memory, storage, and networking capacities. You can set up long-lived servers directly with EC2, managing provisioning and infrastructure yourself, or use higher-level services like Fargate or ECS that take care of host management for you.

Fargate

Fargate is a serverless computing environment that allows you to specify how much memory and processing power you need, provide a container image, and let AWS take care of host management.

Fargate has less maintenance overhead than EC2 since AWS automatically chooses instance types optimized to your resource requirements and provisions, patches, and replaces hosts as needed.

Brief comparison of compute services

	Lambda	EC2	Fargate
Who manages infrastructure	AWS	You	AWS
Tenancy options	Shared	Shared/Dedicated	Shared
Maintenance overhead	Lowest	Highest	Low
Max execution time	15 minutes	N/A	N/A

Lambda instances are simplest to use for workflows that take under 15 minutes, and both Lambda and Fargate instances are managed by AWS to provide low-maintenance options for customers.

All three compute services are by default “shared tenancy”, meaning that multiple AWS customers may have their software running on virtual machines that share a physical server. For most customers, this is a non-issue, but for highly regulated organizations that need their software running on hardware dedicated only to them, EC2 also supports “dedicated tenancy” hosts.

Quincy Mitchell wrote a good post comparing the pricing of Lambda, EC2, and Fargate across a few instance types at https://blogs.perficient.com/2021/06/17/aws-cost-analysis-comparing-lambda-ec2-fargate/. The general conclusions were that:

Lambda is less expensive than EC2 when run <= 50% of the time, and less expensive than Fargate when run <= 25% of the time.
Fargate’s flexibility for resource sizing can save money compared to EC2 if you need less resources than provided by the next larger EC2 instance type.
EC2 is least expensive when right-sized to resource requirements and highly utilized.

Container management services

Both EC2 and Fargate can be run via the following container management services:

ECS (Elastic Container Service) is the simplest way to run containerized servers in AWS, with most deployment and networking details managed by AWS.
EKS (Elastic Kubernetes Service) runs containerized servers through Kubernetes, which is more complex and supports more granular configuration than ECS.

When building services entirely in AWS, I prefer ECS because of how easy it is to manage and integrate with other AWS services. EKS may be preferable to teams that already work with Kubernetes and want to leverage specific features that ECS doesn’t support.

Comparison matrix

	Lambda	Fargate on ECS	EC2 on ECS	EC2/Fargate on EKS	EC2
Max execution time	15 minutes	N/A	N/A	N/A	N/A
Warm-up time	Seconds*	N/A	N/A	N/A	N/A
Execution latency SLA	Seconds	100ms range	100ms range	100ms range	100ms range
Availability SLA	99.95%	99.99%	99.99%	99.95%	99.99%
Requires OS customization	No	No	Yes	Yes	Yes
Maintenance overhead	Low	Low	Medium	Medium-High	High
Automatic rollback support	Yes	Yes	Yes	No	Yes
Handles sharp traffic spikes	No**	Yes	Yes	Yes	Yes

For services where executions are expected to always complete in under 15 minutes, AWS Lambda is the simplest and lowest-maintenance compute service to leverage for API services.

*Managing Lambda warm-up time

AWS Lambda can take up to a few seconds to “warm up” and execute a function when no recent requests have been made, or when new workers are being provisioned to meet scaling demands.

This delay can be partially mitigated by scheduling periodic function calls (“pings”) every few minutes to keep the function “warm”.
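
A minimal sketch of the ping approach, assuming the pings come from an EventBridge scheduled rule (whose events carry a "source" of "aws.events") and that the handler returns early for them so pings stay cheap. The handle_request helper and the response shapes are hypothetical.

```python
def handle_request(event):
    # Hypothetical placeholder for the function's real work.
    return {"statusCode": 200, "body": "ok"}


def handler(event, context):
    """Lambda entry point that short-circuits scheduled keep-warm pings."""
    # EventBridge scheduled rules deliver events with source "aws.events".
    if isinstance(event, dict) and event.get("source") == "aws.events":
        return {"warmed": True}
    return handle_request(event)
```

Note that a ping only exercises a worker that is already running, so this does not help when additional workers are provisioned under load.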

You can also pay for provisioned concurrency to always have a minimum number of workers provisioned and ready to accept traffic.

Execution latency SLAs are influenced by the above warm-up time issues, and with periodic pings and provisioned concurrency you can expect execution latencies to be comparable to EC2/ECS.

**Handling sharp traffic spikes

Lambda has the limitation that its auto-scaling takes a few minutes to adjust to sharp spikes in traffic beyond its base capacity of 1000 calls/second. If your service needs to handle 50%+ traffic spikes above this threshold without throwing “Rate Exceeded” errors for a few minutes, then ECS Fargate or the other ECS/EKS options should be considered instead.

Many AWS compute services support automatic scaling policies based on factors you specify such as time period and memory usage, which you can use to automatically adjust to major traffic increases with some lag time. However, to handle sharp traffic spikes without interim throttling errors, you need to estimate your service’s maximum call rate in advance and set the minimum number of running instances in your ECS/EKS cluster or EC2 auto-scaling group to match it.
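
As a worked example of sizing that minimum, the sketch below divides an estimated peak call rate by the rate a single instance can sustain and adds some headroom. All numbers are hypothetical and would come from your own projections and load tests.

```python
import math


def min_instances(peak_calls_per_sec: float, calls_per_sec_per_instance: float,
                  headroom: float = 0.2) -> int:
    """Instances needed to absorb the peak without waiting for auto-scaling."""
    if calls_per_sec_per_instance <= 0:
        raise ValueError("per-instance capacity must be positive")
    required = peak_calls_per_sec * (1 + headroom) / calls_per_sec_per_instance
    return max(1, math.ceil(required))


# Hypothetical: 3000 calls/second at peak, 250 calls/second per instance.
print(min_instances(3000, 250))  # 15
```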

It can be expensive to maintain a large number of hosts year-round if you only have traffic surges on specific dates, so many teams track service usage over time and periodically adjust their scaling limits to align with expected seasonal/event-based traffic, with updated projections and load testing done before known peak periods.

Recommendations

Lambda is the simplest compute service to use for short-lived operations, and is an easy choice for services that process under 1000 requests/second. It can also handle much higher traffic, although it can take a few minutes to scale up for sharp traffic spikes of over 50% when above the 1000 requests/second base capacity. If traffic increases are generally less spiky than this, or temporary throttling is acceptable, then Lambda is still a good choice.

When Lambda is not an option due to worst-case latency or traffic expectations, Fargate on ECS offers the simplest set-up and management as a fully-managed “serverless” solution, and is my go-to alternative.

I would only suggest using EC2 (stand-alone or on ECS) when there is a specific need for EC2’s additional OS or runtime environment configurations, since it otherwise adds unnecessary maintenance overhead.

References

AWS blog post on ECS vs EKS: https://aws.amazon.com/blogs/containers/amazon-ecs-vs-amazon-eks-making-sense-of-aws-container-services
AWS blog post on Lambda provisioned concurrency: https://aws.amazon.com/blogs/compute/new-for-aws-lambda-predictable-start-up-times-with-provisioned-concurrency
ECS documentation: https://docs.aws.amazon.com/ecs
EKS documentation: https://docs.aws.amazon.com/eks
Fargate documentation: https://docs.aws.amazon.com/AmazonECS/latest/userguide/what-is-fargate.html
Lambda documentation: https://docs.aws.amazon.com/lambda
Lambda scaling: https://aws.amazon.com/blogs/compute/understanding-aws-lambda-scaling-and-throughput
Quincy Mitchell cost comparison of Lambda, EC2, and Fargate: https://blogs.perficient.com/2021/06/17/aws-cost-analysis-comparing-lambda-ec2-fargate/

AWS caching options

Wed, 26 Apr 2023 03:00:00 +0000

Caching is a technique used to store frequently accessed data for fast retrieval, reducing the load on backend services and improving application performance. AWS provides several caching options that can be used at different layers of the infrastructure stack.

Caching best practices

Before integrating a cache, it’s important to evaluate use cases that would benefit from caching. Here are some caching best practices to keep in mind:

Caching is most helpful for data that is frequently accessed and slow to retrieve
A cache should not be used if responses need to be strongly consistent with the backend
Use cache expiry times appropriate to the use case
Handle cache misses gracefully
Avoid cache stampedes by using locks or random delays
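
To illustrate the last three practices together, here is a minimal in-process sketch: entries expire after a TTL plus random jitter so that keys don't all expire at once, a miss falls back to the loader, and a per-key lock lets only one caller recompute a missing value. The JitteredCache class and its parameters are hypothetical, and a cache shared across hosts would need a distributed lock instead.

```python
import random
import threading
import time


class JitteredCache:
    def __init__(self, ttl_seconds: float, jitter_seconds: float = 0.0):
        self._ttl = ttl_seconds
        self._jitter = jitter_seconds
        self._entries = {}  # key -> (expires_at, value)
        self._locks = {}    # key -> lock guarding recomputation
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        # Only one thread per key runs the loader; the rest wait and reuse its result.
        with self._lock_for(key):
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = loader()
            expires_at = time.monotonic() + self._ttl + random.uniform(0, self._jitter)
            self._entries[key] = (expires_at, value)
            return value
```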

CloudFront

Content Delivery Networks (CDNs) are perfect for static data that is frequently accessed across users and doesn’t change based on the request, such as multimedia assets, scripts, and other global files. CloudFront is an AWS-managed CDN that has a large global network of endpoints to allow low-latency calls regardless of customer location.

Key features:

Low-latency data transfers worldwide
Integrations with other AWS services

Example use case: Video streaming services (YouTube, Netflix, etc) will almost always want to leverage a CDN with endpoints in multiple regions, since this scenario often involves large volumes of concurrent reads to common data and customers are hyper-sensitive to latency.

API Gateway cache

AWS API Gateway’s REST APIs have built-in cache support that can be enabled for specific API methods at the infrastructure level. This is one of the simplest ways to set up caching if you have API GET methods that are frequently called with the same parameters.

When an API method has a high rate of cache hits, caching at this level allows you to absorb increased traffic with minimal load to your backend services, reducing data transfer and hardware costs.

Key features:

Built-in cache support
No code changes required

Example use case: You plan on using API Gateway for your service, and one of your API methods has a limited set of expected incoming request values and is frequently called by other services. Enabling API Gateway’s cache for this method will reduce the worst case load to your backend to N calls per cache timeout period, given N unique request values.
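
To make the worst-case bound concrete: with N unique request values and a cache timeout of T seconds, the backend sees roughly N calls per T seconds at most, regardless of client traffic. The sketch below is an approximation that ignores concurrent requests arriving during a cache miss, and the figures are hypothetical.

```python
def max_backend_calls_per_day(unique_requests: int, ttl_seconds: int) -> int:
    """Approximate upper bound on backend calls per day when responses are cached for ttl_seconds."""
    seconds_per_day = 24 * 60 * 60
    # Each unique request value can miss at most once per TTL window (rounded up).
    windows_per_day = -(-seconds_per_day // ttl_seconds)
    return unique_requests * windows_per_day


# Hypothetical: 50 unique request values cached for 300 seconds.
print(max_backend_calls_per_day(50, 300))  # 14400
```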

I’ve used API Gateway’s cache for this exact use case to allow APIs to handle millions of calls per day at minimal cost, without requiring backend changes.

Custom cache

Regardless of which AWS services you use, you can always write server-side code that manually accesses a custom cache.

Using a local cache on the host running the service is simple but has availability issues if you need to reboot or replace the host, and consistency issues if you run concurrent servers with independent caches. You can avoid these issues by running an independently hosted cache service that all API workers call, which you can either run yourself or through AWS.

Example use case: You have your own cache solution that you prefer over AWS cache services, and your cache use cases are not satisfied by API Gateway or DynamoDB.

ElastiCache

ElastiCache is useful when you need a hosted general-purpose cache that can be queried by concurrent compute instances, and API Gateway and DynamoDB Accelerator don’t meet your caching needs. You can choose to leverage either Redis or Memcached as the backend implementation for your ElastiCache instances.

This requires maintaining additional AWS infrastructure and making code changes to wrap your data access code with logic to read from and write to the cache service.

Key features:

Flexible use cases independent of backend implementation

Example use case: A method that is invoked across your service many times per minute involves an expensive SQL query that is always run with the same parameters, where it’s acceptable for the results to be 15-30 minutes out of date. By leveraging ElastiCache with a 15-minute time-to-live value, you can make this query nearly instant across all servers and clients for all but one call per 15-minute period. Note: If you have a fixed set of global queries that are important to optimize for all users, you can remove the worst-case runtime from end customers by setting up a periodic job that updates the data in the cache more frequently than the time-to-live period.

DynamoDB Accelerator (DAX)

DynamoDB Accelerator is specific to DynamoDB databases, and can be easily set up in your AWS account and leveraged in code by swapping out the client you use to query DynamoDB.

Key features:

Automatic caching of DynamoDB read operations that use the DAX code client
Minimal code changes required

Example use case: Your application has a few specific DynamoDB queries that take longer to complete than is acceptable to end users. By enabling DAX and configuring those use cases to use the DAX code client rather than the DynamoDB code client, you can automatically cache the query results so that the majority of users experience nearly instant responses.

Comparison of AWS caching services

	CloudFront	API Gateway cache	ElastiCache	DAX
Use case	Static resources	API responses	General-purpose	DynamoDB query responses
Cache layer	CDN	API Gateway	Compute code	Database
Layer accessing cache	Server or client code	API Gateway	Server code	Server code
Infrastructure set-up & maintenance	Medium	Low	High	Low
Code changes	Medium	N/A	High	Low

API Gateway caching is simplest to set up and maintain, followed closely by DAX. I’d recommend considering these over more complex caching solutions when they fit your use case.

DAX and ElastiCache are comparable in price with the difference depending on the configuration options chosen for ElastiCache, and DAX is much simpler to integrate with, so DAX would be my recommendation if all of your cache use cases are specific to DynamoDB queries.

ElastiCache has the most overhead and most flexibility as AWS’ general-purpose cache service.

CloudFront addresses a somewhat different use as a CDN, and is appropriate to use whenever you want to optimize access of static resources across regions.

Summary

AWS provides several ways to cache data depending on your use case and infrastructure requirements. In many cases, you don’t need to reinvent the wheel and can use a fully-managed solution that does not require significant code changes.

By caching at the appropriate layer, you can optimize latencies while minimizing unnecessary load to your backend services, allowing you to scale at a reasonable cost.

AWS S3 bucket creation dates and S3 master regions

Sun, 05 Sep 2021 03:00:00 +0000

While working on functionality that depended on AWS S3 bucket ages, I noticed that published bucket CreationDate values didn’t always reflect when the buckets were created.

For example, when I called the S3 ListBuckets API a few minutes after updating a bucket access policy, the CreationDate value returned for that bucket was the time that I had modified the policy rather than the time that I had created the bucket. This was also reproduced when using the AWS CLI via the aws s3api list-buckets command.

It turns out that the S3 CLI documentation for list-buckets explicitly states that CreationDate values can change when you make changes to your bucket:

CreationDate -> (timestamp)

Date the bucket was created. This date can change when making changes to your bucket, such as editing its bucket policy.

However, in this GitHub issue, an AWS engineer confirmed there was in fact one AWS region you could query to get the original creation times, the “us-east-1” region, and that this was a feature of how S3 was designed. Other regions’ CreationDate values would change when key bucket attributes or access policies were modified.

Why does only one region give the correct creation time? It turns out that all S3 buckets are created in one master region, and then replicated globally. Each region’s replica is only aware of its own creation date. When bucket changes are propagated across regions via new replication events, the new replica creation dates are what are reflected in regions other than the master region.

I followed up with the S3 team for more details since my team interacts with services across multiple AWS partitions and regions, and from tests it looked like the master region differed between partitions. As of September 4, 2021, AWS documentation indicates that there are currently three AWS partitions:

“aws”, the classic AWS partition
“aws-cn”, AWS China
“aws-us-gov”, AWS GovCloud

The S3 engineer confirmed that each AWS partition has a single S3 master region, and that querying S3 from that master region would be reliable for retrieving original bucket creation dates. The master regions are:

“us-east-1” for the “aws” partition
“cn-north-1” for the “aws-cn” partition
“us-gov-west-1” for the “aws-us-gov” partition
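
A small helper capturing this mapping might look like the following. The region-prefix rule used to infer the partition ("cn-" and "us-gov-") is an assumption based on region naming conventions, not something the S3 team stated.

```python
# Master regions per partition, as listed above.
S3_MASTER_REGIONS = {
    "aws": "us-east-1",
    "aws-cn": "cn-north-1",
    "aws-us-gov": "us-gov-west-1",
}


def partition_for_region(region: str) -> str:
    # Assumption: the partition can be inferred from the region name prefix.
    if region.startswith("cn-"):
        return "aws-cn"
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    return "aws"


def s3_master_region(region: str) -> str:
    """Region to query for original bucket creation dates."""
    return S3_MASTER_REGIONS[partition_for_region(region)]
```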

Demo:

# Queries from us-west-2 yield an incorrect creation time which is actually the time when the bucket policy was updated.
% aws configure set region "us-west-2" && aws s3api list-buckets | grep -A 1 "masayson-creation-date-test-20210829-1725" | grep CreationDate
            "CreationDate": "2021-08-30T20:12:58+00:00"

# Queries from us-east-1 yield the original bucket creation time
% aws configure set region "us-east-1" && aws s3api list-buckets | grep -A 1 "masayson-creation-date-test-20210829-1725" | grep CreationDate
            "CreationDate": "2021-08-29T17:25:45+00:00"

References:

S3 ListBuckets API documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListBuckets.html
S3 list-buckets CLI documentation: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3api/list-buckets.html
GitHub issue describing changing S3 bucket creation dates: https://github.com/aws/aws-cli/issues/3597

Choosing a logging library for Kotlin or Java AWS Lambda functions

Sun, 09 Aug 2020 03:00:00 +0000

There are a lot of logging libraries to choose from when writing AWS Lambda functions in Kotlin or Java. Since Kotlin is fully interoperable with Java, Kotlin projects have access to both Kotlin-based and Java-based logging libraries.

This post compares some of the major options and evaluates which are most suitable for Lambda functions.

Criteria for evaluating libraries

1. Include Lambda request IDs in log statements

The most useful logs provide contextual information that allows us to quickly look up events associated with an issue, for example, the user ID or request ID, descriptions of actions taken for a given request, and error messages along with their stack traces.

For AWS Lambda functions, every time a Lambda function is called, a unique request ID is generated that is associated with all actions within the given function execution. By including this request ID in all log messages, we can easily search for logs associated with that request if needed.

Logging libraries that make it easy to automatically include Lambda request IDs in log statements will therefore be rated higher than other libraries that force us to put in more effort to construct the same contextual information.

2. Support multiple log levels

It’s also helpful for logs to differentiate between informational, warning, and error statements. Common Java logging APIs such as SLF4J have log levels of trace, debug, info, warn, and error (Log4j2 additionally defines fatal), where you can usually configure the minimum level that should be logged for a given environment. Trace and debug statements are often only enabled in test environments.

3. Handle multi-line log statements

Finally, when we include multiple lines of text in a log statement, they should be aggregated into a single CloudWatch log event, otherwise our service will be making unnecessary CloudWatch API calls and we’ll have to dig through multiple CloudWatch events that should all really be a single log statement.

Comparing Kotlin and Java logging libraries

stdout and stderr
- System print statements are forwarded to CloudWatch logs
- Lambda request IDs are not logged by default, we have to add request IDs ourselves -> not ideal
- Multi-line log statements become multiple CloudWatch log events -> disqualifies this option
LambdaLogger via the aws-lambda-java-core library
- Lambda request IDs are included in logs
- Simple interface, given the Lambda Context, we can access public void log(String message) and public void log(byte[] message) methods through context.getLogger()
- Supports combining multi-line log statements into a single CloudWatch log event
Log4j2 via the aws-lambda-java-log4j2 library
- Lambda request IDs are included in logs
- Log4j2 provides standard log levels of trace, debug, info, warn, error, and fatal
- Multi-line log statements are combined into a single CloudWatch log event
Logback via the lambda-logging library
- Lambda request IDs are included in logs
- Logback provides standard log levels of trace, debug, info, warn, and error
- Multi-line log statements are combined into a single CloudWatch log event
kotlin-logging
- A lightweight Kotlin-based wrapper for the SLF4J API, supports plugging in your choice of logging implementation such as Log4j2 or Logback
- Lambda request IDs are not logged by default, we have to add request IDs ourselves -> not ideal
- SLF4J provides standard log levels of trace, debug, info, warn, and error

Since we want Lambda request IDs to be included in logs by default, we won’t use System print statements or kotlin-logging, and we’ll also drop the basic LambdaLogger since we’d like to differentiate between logging levels.

aws-lambda-java-log4j2 and lambda-logging are good options that both support the SLF4J API, which makes this a choice between the underlying logging implementations of Log4j2 in aws-lambda-java-log4j2 vs Logback in lambda-logging.

Log4j2 has significantly better performance than both its earlier iteration of Log4j and Logback [1, 2], and is also smaller in size than Logback, which leads me to prefer aws-lambda-java-log4j2.

Conclusion

There are many options for logging in Kotlin AWS Lambda functions since we have access to both Kotlin and Java libraries.

For either Kotlin or Java, aws-lambda-java-log4j2 is my preferred choice for now since it includes Lambda request IDs in logs, supports standard log levels, handles multi-line log statements, and uses an underlying logging implementation that is more performant than the one used by the other top contender.

See the AWS documentation on Lambda function logging in Java with Log4j 2 and SLF4J for more details on how to use it.

Resources

AWS documentation for Lambda function logging in Java: https://docs.aws.amazon.com/lambda/latest/dg/java-logging.html
Source code for aws-lambda-java-core: https://github.com/aws/aws-lambda-java-libs/tree/master/aws-lambda-java-core
Source code for aws-lambda-java-log4j2: https://github.com/aws/aws-lambda-java-libs/tree/master/aws-lambda-java-log4j2
Source code for kotlin-logging: https://github.com/MicroUtils/kotlin-logging
Source code for lambda-logging: https://github.com/symphoniacloud/lambda-monitoring/tree/master/lambda-logging

Lessons learned from using AWS Data Pipeline

Sat, 06 Jun 2020 03:00:00 +0000

One of the projects I worked on last year had a requirement to sync daily snapshots of a subset of data from an Amazon RDS database to Amazon S3 in order to support other internal services that ingested from S3 data providers.

I negotiated requirements with the stakeholders who were requesting the data, and decided to use AWS Data Pipeline since it seemed to be a good fit for our use cases and worked well in a proof-of-concept.

This turned out to be a less than ideal solution although it did the job, and I learned several lessons from the experience.

Lesson 1: Security requirements should be explicitly defined at the start of a project and incorporated into proof-of-concepts.

We had user requirements for the data syncs, and knew that we would encrypt data at rest in S3 and encrypt in transit between S3 and downstream services.

However, I found that integrating with internal services that had specific assumptions for how security requirements would be implemented introduced more work than expected, and a number of Data Pipeline limitations came up over the course of the project.

After several days of debugging an integration with an internal security template for S3 buckets, I found that the template assumed that KMS encryption would be used for all files saved to S3. The template blocked all other forms of encryption, while AWS Data Pipeline supported AES but not KMS encryption.

I confirmed with our security team that AES was an acceptable encryption protocol to use, and was able to build a custom template that satisfied security requirements. However, the extra work to debug and resolve the issue added about a week of engineering time to the project.

Next, I found that AWS Data Pipeline does not by default encrypt data in transit between Amazon RDS and itself, while it does encrypt data between itself and S3.

AWS Support suggested a possible workaround to force encryption between RDS and AWS Data Pipeline that would involve:

setting up a script to pre-install RDS certificates on the Data Pipeline compute instances,
maintaining a JRE 7-compatible version of the JDBC driver in S3,
referring Data Pipeline instances to the custom JDBC driver, and
configuring the Data Pipeline JDBC connection to use the RDS SSL certificate.

While it would be ideal to encrypt data in transit between any AWS services, I confirmed with our security team that for the data we were handling, this extra work would not be needed if we ensured that data would only be unencrypted in transit within a secured AWS Virtual Private Cloud (VPC) and would always be encrypted when leaving the VPC. I configured the Data Pipeline compute instances to run in the same VPC as the RDS instance so that this was not an issue.

Lesson 2: AWS Data Pipeline has several functional limitations to be aware of.

Some of these limitations can be worked around, while others make it less suitable for many projects:

Data Pipeline does not encrypt data in transit between a source RDS instance and itself, although there is a potential workaround as described earlier in this post.
Data Pipeline does not support KMS encryption for writing data to S3, while it does support AES encryption.
Data Pipeline does not support including column headers in exports of RDS tables to S3 by default. A workaround for PostgreSQL databases is to update the SQL SELECT statement to force a SQL UNION between column headers and data fields cast to strings. However, this workaround requires all columns to be specified in the query, which adds manual effort to update data schemas.
A number of Data Pipeline attributes cannot be updated after creation. A workaround is to update the CloudFormation template to no longer include the Data Pipeline resource while retaining other template resources. This deletes the Data Pipeline without affecting any data. Then, update the CloudFormation template to include the Data Pipeline with the new attributes. This creates a new Data Pipeline which continues to sync data to the existing S3 bucket.
Data Pipeline supports CSV but not JSON or Parquet for S3 exports.
Data Pipeline compute instances must be explicitly defined by users, meaning that you manage the types and numbers of compute instances to set up for your pipeline.
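
As an illustration of the column-header workaround for PostgreSQL, the sketch below builds such a query from an explicit column list. The table and column names are hypothetical, and the extra sort_key column is one possible way to keep the header row first, since a UNION on its own does not guarantee row order. This is not the exact query used in the project.

```python
def build_export_query(table: str, columns: list[str]) -> str:
    """Build a SELECT that emits a header row followed by the data rows as text."""
    # Only accept simple identifiers, since the names are interpolated into SQL.
    for name in [table, *columns]:
        if not name.isidentifier():
            raise ValueError(f"unexpected identifier: {name!r}")
    header = ", ".join(f"'{c}' AS {c}" for c in columns)
    data = ", ".join(f"CAST({c} AS text)" for c in columns)
    selected = ", ".join(columns)
    return (
        f"SELECT {selected} FROM ("
        f"SELECT 0 AS sort_key, {header} "
        f"UNION ALL "
        f"SELECT 1, {data} FROM {table}"
        f") AS export ORDER BY sort_key"
    )


# Hypothetical table and columns.
print(build_export_query("users", ["id", "name"]))
```

The explicit column list is what makes the export schema-aware: adding a field to the table means updating the list passed to this function.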

I also found from my interactions with AWS Support that AWS Data Pipeline is being maintained but not actively developed, as resources have been allocated to the newer and more featureful AWS Glue service. This indicates that many of these limitations are unlikely to change in the near future.

Lesson 3: When producing data for partners, talk directly with the engineers on their team.

Two limitations that made life more difficult for this project were Data Pipeline not having default support for including column headers in S3 data exports, and Data Pipeline not supporting JSON or Parquet file formats for data exports.

Our customer required column headers to be present in order to process data, and the workaround implemented to support this meant that data syncs were no longer agnostic of data schemas and that manual effort would be needed to sync new fields.

The initial requirements included that data be provided to them in CSV format. However, engineers on the team later reported to us that their system didn’t fully support standards-compliant CSV, while they had better support for JSON and Parquet.

In order to support their use cases, we had to add pre-processing of data to strip out or convert a number of characters that their system had issues with, and we had to continue to iterate on this as records came up including additional characters that their CSV parsers were not able to handle. These issues were not present in their JSON or Parquet parsers.

Both of these issues could have been discovered earlier if I had scheduled an in-depth design discussion with the engineers on our partner team at the start of the project rather than taking the requirements at face value.

After our integration with them, we agreed that CSV exports should not be recommended for subsequent data providers onboarding to their service.

Lesson 4: AWS Glue is a more suitable solution for many modern data sync use cases.

Three of the solutions I considered at the start of the project were AWS Database Migration Service, AWS Data Pipeline, and AWS Glue.

AWS Database Migration Service is easy to set up and works well for many use cases that require syncing data from one AWS service to another. However, some of our specific use cases were not supported by it, such as requiring distinct S3 paths for each sync so that customers could quickly query for data from a specific day’s snapshot as well as compare data between snapshots.
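
To make the per-sync path requirement concrete, here is a small Python sketch of one way to derive a distinct S3 prefix for each day's snapshot. The `snapshot_date=` layout, the function names, and the bucket and prefix values are hypothetical and are not the paths used in the actual project.

```python
from datetime import date


def snapshot_prefix(base_prefix, snapshot_date):
    """Build a distinct S3 key prefix for one day's snapshot (assumed layout)."""
    return f"{base_prefix.strip('/')}/snapshot_date={snapshot_date.isoformat()}/"


def snapshot_uri(bucket, base_prefix, snapshot_date):
    """Build the full s3:// URI for one day's snapshot."""
    return f"s3://{bucket}/{snapshot_prefix(base_prefix, snapshot_date)}"
```

With a layout along these lines, each sync writes to its own prefix, so a consumer can query a single day's snapshot or compare two snapshots by pointing at two prefixes.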

AWS Data Pipeline looked like a perfect match at the start of the project. However, as discussed in this post, it has several feature limitations that we only discovered after working with it, and these narrow the set of projects I would recommend it for.

AWS Glue is a newer service that runs on a Spark environment managed by AWS, which removes the overhead of having to manage compute instances on your end, and users can develop ETL (extract, transform, load) jobs in Python or Scala to run on it.

When I first looked at AWS Data Pipeline and AWS Glue, I found that I was able to stand up a Data Pipeline proof-of-concept within a day, while Glue appeared to have a steeper learning curve. Since Data Pipeline was the simplest solution that appeared to satisfy our requirements, I recommended we use it instead of Glue, particularly since no one on our team had used Glue before.

Given what I know today, and now that we have more examples from partners who have successfully used Glue for their workflows, I would tend to recommend going straight to AWS Glue instead of AWS Data Pipeline in use cases that require more complex workflows than are supported by AWS Database Migration Service.

Glue has better support for dynamic discovery of schemas, has built-in support for data encryption in transit and at rest, and supports a variety of file formats including CSV, JSON, Parquet, and others.

Lesson 5: AWS Support is an excellent resource for assisting with issues that don’t have obvious solutions.

There were a number of issues I came across when working with AWS Data Pipeline where there was minimal information provided in error messages, logs, or public documentation.

These came up when implementing use cases more complicated than examples provided in documentation, such as integrating with VPCs with locked down security groups, or with RDS and S3 resources which have specific security configurations in place.

After spending a few days unsuccessfully troubleshooting Data Pipeline issues that produced no error logs, I reached out to AWS Support. The support staff also had difficulty determining the root cause, but they were able to reach out to the relevant AWS engineering teams, who had the background to either diagnose the issue or provide options for moving forward.

One particularly tricky Data Pipeline issue took over two days to work through with AWS Support due to the lack of error logs, while another was resolved within a couple of hours since they quickly recognized it as being a symptom of an encryption protocol mismatch between our security template and what was being used by Data Pipeline’s backend.

I hadn’t often leveraged AWS Support before, even though I knew that we had access to enterprise support. This project demonstrated the value of AWS Support when other resources are insufficient for root-causing an issue independently, particularly when working with AWS services in a way that applies restrictions beyond the typical use case.

Summary of lessons learned

Security requirements should be explicitly defined at the start of a project and incorporated into proofs of concept.
AWS Data Pipeline has several functional limitations to be aware of that make it unsuitable for some use cases that appear to match on the surface.
When producing data for partners, talk directly with the engineers on their team to validate the format that they will be able to work with, since they know their services best.
AWS Glue is a more suitable solution than AWS Data Pipeline in many cases that are too complex for AWS Database Migration Service to support.
AWS Support is an excellent resource, particularly if you have access to enterprise support.

Lessons learned from playing Go

Fri, 05 Jun 2020 02:30:00 +0000

Managing too many choices

Go forces us to ask ourselves those all-encompassing questions:

What should I do now?

What things are most important to me?

What should I be focusing on?

Or combined into a more immediate, actionable form: What is the most important thing for me to do right now?

You have a dizzying number of options, where sometimes none of them look good and sometimes far too many beg for consideration, and you can only choose one.

Maybe the game will play out in a way that allows you to revisit many of those earlier ideas, and maybe the game will advance in a way that walls them off.

But for now, you have to choose just one move, see how your opponent responds, and then choose the next. The clock is ticking, and while you do have some time, if you spend it all on trivialities, you leave less time for the high stakes decisions that will come.

As a natural consequence of playing Go, you develop ways to think through options, identify opportunities, risks, and critical areas, differentiate between important and unimportant moves, and learn how to negotiate difficult decisions where you won’t necessarily find a perfect play.

Coexisting with uncertainty and continually learning

You don’t know exactly how things will turn out; there is too much variation to accurately predict the future. But you can get better at reading out the likely consequences of local actions, build an intuition of how each choice influences the rest of the board, and feel out the other player’s goals to adjust your strategy to flow with them.

There are many ways to play the game, and it’s fine if you don’t know exactly what you want or how to find the optimal path to success. You can narrow things down to a few ideas that you want to explore, think ahead a few moves in each direction, and go with the path that you like most. Win or lose, you can learn something.

As in real life, the benefit of hindsight is most realized when intentionally exercising it, by reviewing the past with people with alternative viewpoints and potentially with mentors who can point out specific areas where you can improve. You can also learn a lot from watching other skilled players, working on situational problems, or reading books.

One thing that you quickly find is that at every stage of the game and at every skill level, there is more to discover. You are never complete as a player, and people are always finding new plays and strategies for this simple yet deep and beautiful game. Whether you’re a beginner or a world champion, you can always learn something.

In the end, you improve your game by playing, one stone at a time.

Resources

Online materials for learning how to play Go: https://www.usgo.org/learn-play
Sensei’s Library, a popular English-language wiki for Go: https://senseis.xmp.net/?PagesForBeginners
Local clubs listed with the Canadian Go Association: https://canadiango.org/club/list
Local clubs listed with the American Go Association: https://www.usgo.org/where-play-go
Online Go servers: https://www.usgo.org/go-internet