Matthew Bonig


Developing Step Functions with the AWS CDK

February 19, 2022

cdk, step functions

State machines aren't new, and AWS Step Functions have now been around for more than five years. But I've never spoken to anyone who was using them outside a few narrow use cases.

Then, Step Functions got two major updates last year. The first was the release of the Workflow Studio in June 2021. It brought about a much nicer interface for building state machines. Then, in September they announced the AWS SDK integration, opening up Step Functions to work with over 9,000 API calls.

With that second update I got very curious about Step Functions as a replacement for a lot of simple code I was writing in Lambda functions. They weren't complex orchestrations, but just some sequenced calls to various AWS APIs and occasionally, a third party API. If I could ditch owning any of that code, I'd be happy.

At re:Invent 2021 I saw Sam Dengler talk twice about using Step Functions and was hoping it could keep me from writing more Lambda functions. Thursday was a long day as we launched The CDK Book and I needed to unwind, so I started writing some code in my hotel room that night. That led to this blog and a new construct.

I had a number of automation tasks in my backlog. I could have built it all in a Lambda function, but I've been there, done that, and generally feel Lambda functions are not where you want your orchestration logic to reside. While the Workflow Studio looked good, I'm wary of any UI editor and their knack for falling over when it comes to Infrastructure as Code (IaC).

Visual designers for AWS resources, like the NoSQL Workbench, often offer a good development experience but then completely drop the ball when it comes to managing those resources. Often they require you to manage the resource entirely within the tool, with no escape hatch to good IaC. IaC is a core practice in the Well-Architected Framework's Operational Excellence pillar ("perform operations as code"), and for good reason. Repeatable, predictable, configurable infrastructure has proven time and time again to provide huge value in the long-term development of any project.

The Workflow Studio aims to be only an editor for the Amazon States Language (ASL), and doesn't attempt to manage related Lambda functions, SNS topics, or the other AWS resources you will interact with in your state machines. Workflow Studio requires you to bring your own resources, further reinforcing the need for good IaC integration.

While I love the AWS CDK for IaC, and it does offer a specialized set of constructs for creating definitions, I feel my code always ended up unreadable and hard to follow or visualize. I want to use the Workflow Studio editor for creating workflows but the CDK to maintain the long-term definition through multiple environment deploys, along with the related resources like Lambda functions.

So, is there a way to have our cake and eat it too?

There are two goals I had here:

  1. Creation, testing, updating, and debugging can all be done in the Workflow Studio without having to loop through an IaC deployment each iteration.
  2. The CDK code ultimately holds the source of truth for the ASL, including setting all related ARNs, names, and references in the ASL to resources it manages or has references to using .from*() lookups.

Testing the Theory

The following is a workflow I built recently to rotate the access keys for a service account. I began by going into Workflow Studio and putting the first draft together:

[Image: rotate creds workflow]

Now, the astute observer might be noticing an issue with this workflow. It will deactivate and delete ALL keys for the service account before creating new ones. This is not how you'd probably build one of these yourself.

This workflow has a very specific purpose for an unusual service account which rarely accesses the system. It's fine for us that we delete previous keys and generate new ones. Additionally, it allows me to use a Map state which will make for a more interesting example below.

When I first built this workflow I just picked a random UserName on all the IAM calls, because this is a value that the CDK will control and set on the ASL. Also, I picked a random Lambda function I already had in the environment, because I hadn't written the code yet to update the application secret that would reside in a non-AWS system.

To the CDK code!

My CDK code will be responsible for deploying the state machine definition and all the resources that it interacts with. In this case, that's the IAM User for the application to use and a Lambda function to ship new keys for the application to read.

Creating those two resources is already well documented, so let's get right to the state machine. I can't use any of the chainable constructs mentioned previously, as none of them accept an existing ASL definition; I'd have to rebuild the workflow from scratch. The L1 CfnStateMachine construct does, though: it simply takes a definition and some other basic parameters.

I start by copying the ASL from the UI and saving it to a file in my project. Something like:

.
├── src
│   ├── MyStack.ts
│   ├── recycle-access-keys-asl.json

While copying and pasting is fine for now, if I spent a long time developing the workflow I'd likely create a small Node module to read the ASL using the API and write the contents to the file.
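A minimal sketch of the file-writing half of such a module, under some assumptions: saveAsl is a hypothetical helper name, and it expects the definition as a JSON string, which is the shape the Step Functions DescribeStateMachine API returns in its definition field. How the string is fetched is left out on purpose.

```typescript
import * as fs from "fs";

// Hypothetical helper (not part of the original project): takes an ASL
// definition as a JSON string and writes it, pretty-printed, to the file
// that the CDK stack reads.
function saveAsl(definition: string, file: string): void {
  // Parse first so a truncated or invalid definition fails loudly instead
  // of silently landing in the repository.
  const parsed = JSON.parse(definition);
  // Stable two-space formatting keeps diffs small between pulls.
  fs.writeFileSync(file, JSON.stringify(parsed, null, 2) + "\n");
}
```

One way to get the string without writing any SDK code is the CLI: aws stepfunctions describe-state-machine --state-machine-arn <arn> --query definition --output text.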

The ASL doesn't have correct references to my IAM user and Lambda function. I need a way to set those that is repeatable. I'm probably going to be pulling this ASL from Workflow Studio a lot and don't want to have to edit the file each time.

I hit Stack Overflow looking for a way to easily edit an existing JS object. Sure, I could use JsonPath, but that felt a little heavy. I came across some code to merge two objects together, no matter how complex they were:

/**
 * Performs a deep merge of objects and returns new object. Does not modify
 * objects (immutable) and merges arrays via concatenation.
 *
 * @param {...object} objects - Objects to merge
 * @returns {object} New object with merged key/values
 */
function mergeDeep(...objects: any[]) {
  const isObject = (obj: any) => obj && typeof obj === 'object';

  return objects.reduce((prev, obj) => {
    Object.keys(obj).forEach(key => {
      const pVal = prev[key];
      const oVal = obj[key];

      if (Array.isArray(pVal) && Array.isArray(oVal)) {
        prev[key] = pVal.concat(...oVal);
      } else if (isObject(pVal) && isObject(oVal)) {
        prev[key] = mergeDeep(pVal, oVal);
      } else {
        prev[key] = oVal;
      }
    });

    return prev;
  }, {});
}

With this function I can take the ASL and set any values by providing a similarly shaped object. I call that smashing objects, so I smash the States field of the ASL:

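function smash(definition: any, overrides: any) {
  // Work on a shallow copy so the caller's definition object is not mutated.
  const states = { ...definition.States };
  for (const key in overrides) {
    // Only states that already exist in the ASL are overridden.
    if (states[key]) {
      states[key] = mergeDeep(states[key], overrides[key]);
    }
  }
  return { ...definition, States: states };
}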

The first parameter is the existing ASL definition object and the second is an object that partially matches the same schema, only providing new values in the States property.


// from the JSON...
const asl = {
  "Comment": "A description of my state machine",
  "StartAt": "List Existing Access Keys",
  "States": {
    "List Existing Access Keys": {
      "Type": "Task",
      "Parameters": {
        "UserName": "whatever"
      },
      "Resource": "arn:aws:states:::aws-sdk:iam:listAccessKeys",
      "ResultPath": "$.existingAccessKeys",
      "Next": "For each key"
    },
    ...
  }
}
// provide overridden values:
const overrides = {
  "List Existing Access Keys": {
    Parameters: {
      UserName: applicationServiceUser.userName
    }
  }
}
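To make the merge semantics concrete, here is a small self-contained sketch. It carries a condensed copy of the two helpers (this copy of smash works on a shallow copy of States so the input is left alone) and runs them against a trimmed-down definition. The user name "app-service-user" and the state "No Such State" are made up for illustration.

```typescript
type Json = Record<string, any>;

// Condensed copy of the deep merge: later objects win, arrays are concatenated.
function mergeDeep(...objects: Json[]): Json {
  const isObject = (obj: unknown) => !!obj && typeof obj === "object";
  return objects.reduce((prev, obj) => {
    Object.keys(obj).forEach((key) => {
      const pVal = prev[key];
      const oVal = obj[key];
      if (Array.isArray(pVal) && Array.isArray(oVal)) {
        prev[key] = pVal.concat(...oVal);
      } else if (isObject(pVal) && isObject(oVal)) {
        prev[key] = mergeDeep(pVal, oVal);
      } else {
        prev[key] = oVal;
      }
    });
    return prev;
  }, {} as Json);
}

// Applies overrides only to states that already exist in the definition.
function smash(definition: Json, overrides: Json): Json {
  const states: Json = { ...definition.States };
  for (const key of Object.keys(overrides)) {
    if (states[key]) {
      states[key] = mergeDeep(states[key], overrides[key]);
    }
  }
  return { ...definition, States: states };
}

const asl: Json = {
  StartAt: "List Existing Access Keys",
  States: {
    "List Existing Access Keys": {
      Type: "Task",
      Parameters: { UserName: "whatever" },
      Resource: "arn:aws:states:::aws-sdk:iam:listAccessKeys",
      ResultPath: "$.existingAccessKeys",
      Next: "For each key",
    },
  },
};

const result = smash(asl, {
  "List Existing Access Keys": { Parameters: { UserName: "app-service-user" } },
  "No Such State": { Parameters: { UserName: "ignored" } }, // no match: skipped
});

const listKeys = result.States["List Existing Access Keys"];
console.log(listKeys.Parameters.UserName);      // "app-service-user"
console.log(listKeys.Next);                     // "For each key" (untouched)
console.log("No Such State" in result.States);  // false

// Caveat: arrays are appended to, not replaced.
console.log(mergeDeep({ Retry: ["a"] }, { Retry: ["b"] }).Retry.length); // 2
```

The array behavior is worth keeping in mind: an override for an array-valued field such as Retry or Choices would add entries to whatever Workflow Studio produced rather than replace them.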

With this new function, I can wrap it all up in an L2 construct:

export class StateMachine extends CfnStateMachine implements IGrantable {
  public role: Role;

  constructor(scope: Construct, id: string, props: StateMachineProps) {
    const role = new Role(scope, `${id}-Role`, { assumedBy: new ServicePrincipal('states') });
    role.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('CloudWatchEventsFullAccess'));
    super(scope, id, {
      stateMachineType: props.express ? 'EXPRESS' : 'STANDARD',
      stateMachineName: props.stateMachineName,
      roleArn: role.roleArn,
      definitionString: JSON.stringify(StateMachine.smash(props.definition, props.overrides)),
    });
    this.role = role;
  }

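  public static smash(definition: any, overrides: any) {
    // Work on a shallow copy so the caller's definition object is not mutated.
    const states = { ...definition.States };
    for (const key in overrides) {
      // Only states that already exist in the ASL are overridden.
      if (states[key]) {
        states[key] = mergeDeep(states[key], overrides[key]);
      }
    }
    return { ...definition, States: states };
  }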

  ...
}

/**
 * Performs a deep merge of objects and returns new object. Does not modify
 * objects (immutable) and merges arrays via concatenation.
 *
 * @param {...object} objects - Objects to merge
 * @returns {object} New object with merged key/values
 */
function mergeDeep(...objects: any[]) {
  const isObject = (obj: any) => obj && typeof obj === 'object';

  return objects.reduce((prev, obj) => {
    Object.keys(obj).forEach(key => {
      const pVal = prev[key];
      const oVal = obj[key];

      if (Array.isArray(pVal) && Array.isArray(oVal)) {
        prev[key] = pVal.concat(...oVal);
      } else if (isObject(pVal) && isObject(oVal)) {
        prev[key] = mergeDeep(pVal, oVal);
      } else {
        prev[key] = oVal;
      }
    });

    return prev;
  }, {});
}

Full usage of the construct now looks like this:

const myUser = new User(/*...*/);
const myLambda = new Function(/*...*/);
const myWorkflow = new StateMachine(this, 'ResetAccessKeys', {
  definition: JSON.parse(fs.readFileSync(path.join(__dirname, 'recycle-access-keys-asl.json')).toString()),
  overrides: {
    'List Existing Access Keys': {
      Parameters: {
        UserName: myUser.userName,
      },
    },
    'For each key': {
      Iterator: {
        States: {
          'Deactivate existing key': {
            Parameters: {
              UserName: myUser.userName,
            },
          },
          'Delete existing key': {
            Parameters: {
              UserName: myUser.userName,
            },
          },
        },
      },
    },
    'Create Access Key': {
      Parameters: {
        UserName: myUser.userName,
      },
    },
    'Update Keys in Application Secret': {
      Parameters: {
        FunctionName: myLambda.functionArn
      }
    }
  },
  stateMachineName: `ResetApplicationAccessKeys`,
});

grantKeyManagement(myUser, myWorkflow); // since there isn't a user.grantManageKeys I have to do it myself.
myLambda.grantInvoke(myWorkflow);

The construct is given a parsed ASL definition as a JS object, the overrides, and a name.

Finally, I grant some standard permissions to the workflow to be able to manage and invoke my User and Lambda function, respectively. Once the CDK code is built and deployed I now have a fully functioning workflow in IaC that I originally designed with Workflow Studio.
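The grantKeyManagement helper isn't shown here, so the following is only a guess at the permissions it would need, written as a plain IAM policy statement rather than CDK code. The function name keyManagementStatement and the example ARN are hypothetical; the four actions are the IAM calls this workflow makes (deactivating a key goes through UpdateAccessKey).

```typescript
// IAM actions behind the workflow's SDK integrations: list, deactivate
// (an update of the key's status), delete, and create access keys.
const KEY_MANAGEMENT_ACTIONS = [
  "iam:ListAccessKeys",
  "iam:UpdateAccessKey",
  "iam:DeleteAccessKey",
  "iam:CreateAccessKey",
];

interface PolicyStatementJson {
  Effect: "Allow";
  Action: string[];
  Resource: string[];
}

// Hypothetical helper: builds a statement scoped to a single IAM user so the
// state machine role cannot touch anyone else's keys.
function keyManagementStatement(userArn: string): PolicyStatementJson {
  return {
    Effect: "Allow",
    Action: [...KEY_MANAGEMENT_ACTIONS],
    Resource: [userArn],
  };
}

// Example with a made-up account and user name.
const statement = keyManagementStatement("arn:aws:iam::111122223333:user/app-service-user");
console.log(JSON.stringify(statement, null, 2));
```

In a CDK stack, the same actions and user ARN could be handed to the state machine's role, for example through Grant.addToPrincipal, which is presumably close to what a helper like grantKeyManagement does.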

But what about updates?

So I've built and deployed it, but what if it doesn't work? I guarantee you I didn't get it right the first time. I have a bug in my ASL, or my logic, and now I need to make changes.

I can make changes directly in Workflow Studio to the state machine. Hopefully, it was just a typo in a mapping, or my output wasn't set up to carry over the input. Chances are it's a minor change, so I make that directly in Workflow Studio and re-test my workflow in the UI. I can edit, execute, debug and iterate entirely within the Workflow Studio at this point, assuming I'm not adding more states which might require more resources.

If I did need new resources, I'd take the latest version of my ASL, copy it to the repository again, add any additional AWS resources with CDK code and set up the overrides accordingly.

Once I'm satisfied that things are working as expected, I re-save my ASL to the local .json file in my project one last time. I wrap a snapshot test around the whole stack and call it a day.
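One thing a snapshot test won't flag on its own is a state that was renamed in Workflow Studio: smash skips top-level override keys that no longer match, and a stale name nested under a Map state's Iterator gets merged in as a brand-new, bogus state. A cheap guard, sketched here with a hypothetical findUnmatchedOverrides helper and made-up sample data, is to list the override names that have no counterpart in the ASL.

```typescript
type Json = Record<string, any>;

// Hypothetical helper: returns the override keys (including ones nested under
// a Map state's Iterator.States) that do not name a state in the ASL.
function findUnmatchedOverrides(states: Json, overrides: Json, prefix = ""): string[] {
  const missing: string[] = [];
  for (const name of Object.keys(overrides)) {
    const path = prefix + name;
    if (!(name in states)) {
      missing.push(path);
      continue;
    }
    // Descend into Map states so nested overrides are checked as well.
    const nestedOverrides = overrides[name]?.Iterator?.States;
    if (nestedOverrides) {
      const nestedStates = states[name]?.Iterator?.States ?? {};
      missing.push(...findUnmatchedOverrides(nestedStates, nestedOverrides, `${path} > `));
    }
  }
  return missing;
}

// Sample data: one nested name and one top-level name do not match.
const aslStates = {
  "List Existing Access Keys": { Type: "Task" },
  "For each key": {
    Type: "Map",
    Iterator: { States: { "Deactivate existing key": { Type: "Task" } } },
  },
};
const sampleOverrides = {
  "List Existing Access Keys": { Parameters: { UserName: "u" } },
  "For each key": {
    Iterator: { States: { "Delete existing key": { Parameters: { UserName: "u" } } } },
  },
  "Create Acess Key": { Parameters: { UserName: "u" } }, // deliberate typo
};

const unmatched = findUnmatchedOverrides(aslStates, sampleOverrides);
console.log(unmatched);
// Contains "For each key > Delete existing key" and "Create Acess Key".
```

Throwing when this list is non-empty, either in the construct or in a unit test, would turn a silently ignored override into a failed build.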

Now I have the best of both worlds. I can rapidly iterate over my workflow definitions in the Workflow Studio and UI to create, test, update, debug over and over, while ensuring that my workflows are ultimately under IaC and are repeatable and consistently deployed to multiple environments.

Conclusion

Over the last month I've spent time rebuilding my client's infrastructure as code and deployment pipelines. I had about a half-dozen tasks that I could have automated a number of different ways, using CodeBuild, CodePipeline, GitHub Actions, or other automation tools (my boss almost had me convinced to try Airflow). But I knew after watching Sam Dengler's talks about Step Functions at re:Invent 2021, and how he built some workflows live during his presentations, that I wanted to invest my skills in Step Functions.

After repeating this process a half-dozen times now, I can say Step Functions are here to stay in my toolkit. Most workflows were built in less than a day and leveraged Secrets, Lambda functions, SES, and CodeBuild, as well as control flow states like Pass and Map. I was quite happy with not only the speed but also the observability I was getting out of the executions.

Finally, the StateMachine construct I built is available publicly if you'd like to use it.