If you were ever curious how many words there are in each sentence of Wikipedia’s entry on the legendary band Queen, you are about to find out. Even better, you’re about to learn how to find this out for yourself, using the power of automation and AWS Step Functions. Andrei Elefterescu, Levi9 JavaScript Tech Lead, has 10 years of experience with AWS and worked with Step Functions even before their launch, during the Beta program. Andrei will be your guide to the melodic orchestration of triggers, functions, validations, transition states, conditional routes, and callbacks. It sounds complicated, but trust us, it’s a kind of magic.

What are AWS Step Functions?

To find out how many words are in each sentence of Queen’s Wikipedia biography, you need a few common-sense actions: read the text, split it into sentences, count the words in each sentence, then combine the results for all sentences into a single text.
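As a rough illustration, the four actions above can be sketched in plain Python before any orchestration is involved. The function names and the naive regex-based sentence splitter are assumptions made for this example; they are not taken from Andrei's workflow.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ".", "!" or "?" when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def count_words(sentence: str) -> int:
    # Words are whatever is separated by whitespace.
    return len(sentence.split())


def summarize(text: str) -> str:
    # Combine the per-sentence counts into one single text.
    lines = []
    for index, sentence in enumerate(split_sentences(text), start=1):
        lines.append(f"Sentence {index}: {count_words(sentence)} words")
    return "\n".join(lines)


sample = "Queen are a British rock band. They formed in London in 1970."
print(summarize(sample))
```

Each function maps to one step of the plan, which is what makes the task easy to break into separate workflow states later on.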

Surely you don’t need automation for that. It’s a straightforward, simple task. But what if you needed to count all the words in the Wikipedia entries of the Top 100 British bands of all time?

This is where Andrei Elefterescu steps in to explain what AWS Step Functions are.

AWS Step Functions is a serverless orchestration service that helps us integrate multiple AWS services, such as Lambda, S3, and SNS, to build applications. Going back to the Queen example, those steps (read, split, count, combine) define a “Step Function”, which AWS calls a state machine.

While a workflow is easier to visualize as a diagram, Step Functions are written in the Amazon States Language, a JSON-based language that is proprietary to AWS.
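To give a feel for that JSON-based language, here is a minimal sketch of what a definition for the Queen example might look like, built as a Python dictionary and serialized to JSON. The state names and the Lambda ARNs (account ID, region, function names) are hypothetical placeholders, not the ones used by Andrei.

```python
import json

# Hypothetical ARN prefix: account ID, region and function names are placeholders.
LAMBDA_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:"

definition = {
    "Comment": "Count the words in each sentence of a text",
    "StartAt": "SplitIntoSentences",
    "States": {
        "SplitIntoSentences": {
            "Type": "Task",
            "Resource": LAMBDA_ARN + "split-sentences",
            "Next": "CountWords",
        },
        "CountWords": {
            "Type": "Task",
            "Resource": LAMBDA_ARN + "count-words",
            "Next": "JoinResults",
        },
        "JoinResults": {
            "Type": "Task",
            "Resource": LAMBDA_ARN + "join-results",
            "End": True,
        },
    },
}

# This JSON string is what would be pasted into the workflow definition.
asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

Every state either names the state that follows it with `Next` or closes the workflow with `End`, which is how the diagram and the JSON stay in sync.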

Of course, AWS Step Functions are useful for much more than counting words on a Wikipedia page. The service can be used for image processing in S3, ETL (Extract, Transform, Load) processes, machine learning, microservice orchestration, IT and security automation, as well as Continuous Integration and Continuous Deployment (CI/CD), a process that automates the integration and deployment of code.

“Step Functions has around 250 integrations with other AWS services and around 11,000 API calls that can be called from it,” says Andrei. “In 2016, when they launched, they only had an integration with S3 and 3–4 other services.”

Key concepts

Before we delve deeper into how Step Functions work, here are some key concepts, as explained by Andrei:

Workflow is the breakdown of what needs to be done and in what order. It can be built visually, like a diagram, in the very intuitive editor of AWS Step Functions. In our example, it’s the mental plan of how we are going to grab the information from Queen’s Wikipedia article. A good plan needs to include some steps where you validate the inputs and a way to handle errors.

Triggers are what start the workflow. You can call the Step Functions API through the SDK, start the workflow directly from the AWS console, or react to an event, for example an S3 event, an EventBridge rule, or a Lambda function that starts the workflow. For our planned mini-Queen automation, we’d just run it through the console interface.
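Starting the workflow through the SDK could look roughly like the sketch below. It relies on the `start_execution` call of the boto3 Step Functions client; the helper name, the execution naming scheme, and the shape of the input are assumptions made for illustration.

```python
import json
import uuid


def start_word_count(sfn_client, state_machine_arn: str, article_title: str) -> str:
    """Start one execution of the workflow and return the execution ARN."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # Execution names must be unique per state machine, hence the UUID.
        name=f"word-count-{uuid.uuid4()}",
        # The workflow input is always passed as a JSON string.
        input=json.dumps({"article": article_title}),
    )
    return response["executionArn"]


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   arn = start_word_count(boto3.client("stepfunctions"), "<state machine ARN>", "Queen (band)")
```

Passing the client in as a parameter keeps the helper easy to test without touching a real AWS account.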

States are the individual stages of the workflow. Step Functions has several types of states, such as Task states, where a resource is assigned to do the work, and Choice states, where a variable is checked and the automation behaves differently depending on whether the condition is true or not. In our mini-automation, the states include grabbing, parsing, and joining the information, but also a validation filter and a choice related to what happens when the automation throws an error.
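The validation filter mentioned above could be expressed as a Choice state along these lines. This is a sketch under assumptions: the field name `text` and the state names `SplitIntoSentences` and `RejectInput` are hypothetical.

```python
import json

# Choice state: continue only if the input carries a non-empty text string.
validate_input = {
    "Type": "Choice",
    "Choices": [
        {
            "And": [
                {"Variable": "$.text", "IsPresent": True},
                {"Variable": "$.text", "IsString": True},
                {"Not": {"Variable": "$.text", "StringEquals": ""}},
            ],
            "Next": "SplitIntoSentences",  # hypothetical next state
        }
    ],
    # Anything that does not match the rule above is rejected.
    "Default": "RejectInput",
}

# Fail state that ends the workflow with a named error.
reject_input = {
    "Type": "Fail",
    "Error": "InvalidInput",
    "Cause": "No article text was provided",
}

print(json.dumps({"ValidateInput": validate_input, "RejectInput": reject_input}, indent=2))
```

Only the top-level rule carries a `Next`; the nested comparisons inside `And` just describe the condition.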

Tasks are the actual actions and steps that are performed in each state. A task could be, for example, to split a text into sentences, and another task would be to count the words in each sentence.

Transitions are links between two steps, and each one is recorded in the execution history. For each execution of a Lambda, there are three entries: the transition before, the step itself, and the transition after. For passing information from one step to the next, AWS has a limit of 256 KB. In our particular case, the information “There are 10 words in the second sentence” must be carried from one step to the next.
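A quick way to check whether a payload stays under that limit is to measure its size as UTF-8 encoded JSON, as in the sketch below. The helper names are made up for this example, and the limit is assumed here to be 256 × 1024 bytes.

```python
import json

# Assumption: the limit on data passed between states is 256 KiB of UTF-8 text.
MAX_PAYLOAD_BYTES = 256 * 1024


def payload_size(payload) -> int:
    # Size of the payload once serialized to JSON and encoded as UTF-8.
    return len(json.dumps(payload, ensure_ascii=False).encode("utf-8"))


def fits_in_state(payload) -> bool:
    return payload_size(payload) <= MAX_PAYLOAD_BYTES


print(fits_in_state({"sentence": 2, "words": 10}))  # a tiny payload fits easily
print(fits_in_state({"text": "x" * 300_000}))       # a whole long article may not
```

Running such a check before handing data to the next step makes it possible to fall back to another transport when the payload is too large.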

Item execution is the transition from one state to another. The limit per workflow is 25,000 executions, counted as entries in the execution history. For example, to find out how many words there are in each sentence, there are three executions: the first grabs the actual sentence, the second gets the numbers, and the third sends those numbers to the following step.

Express vs Standard: Step Functions offers two types of workflows. A Standard workflow can stay active for up to one year, while an Express workflow can run for up to five minutes and supports start rates of over 100,000 executions per second. Express workflows are used in web development, while Standard workflows are used for larger volumes of data.

The history of the execution is in the log, which you can easily check in the Step Functions dashboard. There is a limit of 100,000 logs per workflow.

Main advantages: parallel run and serverless

But is this melodic, yet complicated, orchestration needed? Couldn’t a Lambda function do the job instead?

For one thing, Step Functions are faster, Andrei points out. This is due to their ability to run work in parallel, such as the Inline Map mode with up to 40 concurrent iterations or the Distributed Map mode with up to 10,000 parallel executions. On top of that, “Step Functions is a serverless service that allows for stateless states without having to have a separate database or cache, decoupling the logical side from the business side.” In other words, there is no need to worry about sizing the resources to fit the process. This is done automatically. Some other advantages that Andrei appreciates are that it’s easy to see the state of your workflows and that retry mechanisms and exponential back-off are built in.
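The parallel run can be sketched as a Map state that counts the words of many sentences at once. The input field `sentences`, the state names, and the Lambda ARN are hypothetical; `MaxConcurrency` is set to 40 to mirror the figure quoted above.

```python
import json

# Map state: run the inner workflow once per sentence, up to 40 at a time.
count_words_map = {
    "Type": "Map",
    "ItemsPath": "$.sentences",  # hypothetical array of sentences in the input
    "MaxConcurrency": 40,
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "INLINE"},
        "StartAt": "CountWords",
        "States": {
            "CountWords": {
                "Type": "Task",
                # Hypothetical Lambda that counts the words in one sentence.
                "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:count-words",
                "End": True,
            }
        },
    },
    "ResultPath": "$.counts",  # collect the per-sentence results here
    "Next": "JoinResults",     # hypothetical follow-up state
}

print(json.dumps(count_words_map, indent=2))
```

Switching `Mode` to `DISTRIBUTED` is what selects the larger-scale mode, although that mode needs additional configuration not shown here.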

This “Plan B” for errors plays a crucial role in workflows. For example, if we sent a blank text instead of the real Queen biography to Step Functions, an error mechanism would be triggered. Without it, the workflow would simply stop working. Errors can also happen while processing large amounts of data, and some of them can stop the process completely. This is one of several drawbacks that Andrei mentions.

Disadvantages: hard limits on executions and history

There is a hard limit of 25,000 executions per workflow, counted as entries in the execution history. While analyzing a Wikipedia article doesn’t seem like much, when you need to count the words in 10,000 sentences, things change. Remember, each step normally means three executions: the state before the step, the step, and the state after the step. “So, with 10,000 sentences, it’s pretty easy to get there”, emphasizes Andrei. “And if you go over 25,000, the workflow will stop, and there is nothing that you can do about it.”
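The arithmetic behind that warning is easy to check. The sketch below uses the article's rule of thumb of three entries per step; the function names are made up for this example.

```python
ENTRIES_PER_STEP = 3    # rule of thumb from the text: before, step, after
HISTORY_LIMIT = 25_000  # hard limit quoted above


def history_entries(items: int, steps_per_item: int = 1) -> int:
    # Estimated number of entries produced by processing all items.
    return items * steps_per_item * ENTRIES_PER_STEP


def max_items(steps_per_item: int = 1) -> int:
    # Largest number of items that still fits under the limit.
    return HISTORY_LIMIT // (steps_per_item * ENTRIES_PER_STEP)


print(history_entries(10_000))  # 30000, already over the limit
print(max_items())              # 8333 sentences at one step each
```

With a single step per sentence, the limit is reached after roughly 8,300 sentences, so 10,000 sentences would not fit in one workflow run.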

Also, as Step Functions decouple the business logic from the orchestration logic of the workflow, the resulting code is more complex and harder to understand. A relatively simple image processing workflow can have clear inputs and outputs, but the connection between the various microservices is challenging to understand. Another disadvantage is limited portability. The state machine is written in the Amazon States Language, which is a proprietary language of AWS. This means that if you want to move to Azure, for example, you need to start from scratch.

Another disadvantage is the 256 KB limit on the data that can be passed between states in a Step Function. Plus, the execution history is kept for only 90 days, and the maximum duration of a workflow execution is one year.

Wait, a workflow that takes one year? “It’s possible”, says Andrei. “The Step Functions have callbacks, which allow the workflow to pause and wait until a human provides some input we are waiting for. Then, the callback is invoked, and the workflow resumes.”
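A callback step could be sketched as follows: a Task state that hands a task token to a Lambda and then waits, plus a small helper that resumes the workflow once the human has answered. The function name `notify-reviewer`, the payload fields, the one-week timeout, and the helper name are assumptions for illustration; `send_task_success` is the boto3 call that reports the result back.

```python
import json

# Task state that pauses until someone reports back with the task token.
wait_for_approval = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "notify-reviewer",  # hypothetical Lambda that contacts a human
        "Payload": {
            "taskToken.$": "$$.Task.Token",  # token taken from the context object
            "summary.$": "$.summary",        # hypothetical input field
        },
    },
    "TimeoutSeconds": 7 * 24 * 3600,  # assumed: give up after one week
    "Next": "JoinResults",            # hypothetical follow-up state
}


def resume_workflow(sfn_client, task_token: str, decision: str) -> None:
    """Invoke the callback so the paused workflow continues."""
    sfn_client.send_task_success(
        taskToken=task_token,
        output=json.dumps({"decision": decision}),
    )


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   resume_workflow(boto3.client("stepfunctions"), "<token received by the Lambda>", "approved")
```

The workflow consumes no compute while it waits, which is why a pause of days or months is practical.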

Best practices: retry mechanism, keeping an eye on limits

Most of these drawbacks can be overcome with best practices. For example, the 256 KB limit on data passed between states is restrictive, but AWS itself came up with a solution: you can write the payload to a file in S3 and process it with a Lambda function.
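One way to apply that workaround is to store the payload in S3 and pass only a small pointer between states. The sketch below uses the boto3 S3 calls `put_object` and `get_object`; the helper names, the key prefix, and the pointer format are assumptions made for this example.

```python
import json
import uuid


def offload_payload(s3_client, bucket: str, payload) -> dict:
    """Store the payload in S3 and return a small pointer to pass between states."""
    key = f"payloads/{uuid.uuid4()}.json"  # hypothetical key naming scheme
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
    )
    return {"bucket": bucket, "key": key}


def load_payload(s3_client, pointer: dict):
    """Read the payload back inside the Lambda that processes it."""
    obj = s3_client.get_object(Bucket=pointer["bucket"], Key=pointer["key"])
    return json.loads(obj["Body"].read())


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   s3 = boto3.client("s3")
#   pointer = offload_payload(s3, "<bucket name>", {"text": "..."})
#   data = load_payload(s3, pointer)
```

The pointer is only a few dozen bytes, so the size of the article no longer matters for the data passed between states.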

The native retry mechanism is also very useful. Step Functions provides retries with exponential back-off, and it can catch Lambda exceptions as well as specific thrown errors. It also lets you make a specific decision when a Lambda fails with a particular error, such as “not found”.
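In the workflow definition, this is configured with `Retry` and `Catch` blocks on a Task state. The sketch below is illustrative: the Lambda ARN, the custom error name `NotFoundError`, the handler state names, and the retry numbers are all assumptions. The small helper only shows how the waiting time grows with the back-off rate.

```python
import json

count_words_task = {
    "Type": "Task",
    # Hypothetical Lambda that counts the words in one sentence.
    "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:count-words",
    "Retry": [
        {
            "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
            "IntervalSeconds": 2,  # wait before the first retry
            "MaxAttempts": 3,      # number of retries
            "BackoffRate": 2.0,    # multiply the wait after each retry
        }
    ],
    "Catch": [
        # Hypothetical custom error thrown by the Lambda when the article is missing.
        {"ErrorEquals": ["NotFoundError"], "Next": "HandleNotFound"},
        # Catch-all for everything else; it has to come last.
        {"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"},
    ],
    "Next": "JoinResults",
}


def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list[float]:
    # Wait before each retry: interval * rate^0, interval * rate^1, ...
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


print(json.dumps(count_words_task, indent=2))
print(retry_delays(2, 3, 2.0))  # [2.0, 4.0, 8.0]
```

Retries deal with transient failures, while the `Catch` entries route known errors to dedicated states instead of stopping the whole workflow.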

There is also a way to avoid hitting the limit of 100,000 objects to process. “Yes, we did get there at some point, and it was not pleasant,” recalls Andrei. The solution might be to keep an eye on the process and trigger an alert when you are close to the limit.

Even so, the low costs offset all the small inconveniences of Step Functions. In Levi9’s experience, even complex processes for large clients do not go over $5 per month.