When an AWS Lambda function crashes, it doesn't get a chance to raise an exception. We will use filter patterns on AWS CloudWatch Logs to trigger backup processing when Lambdas fail.
In CloudWatch, go to "Log groups," select the log group, and open "Subscription filters." There are two filters we will need:
Filter for when memory exceeds allocation
Pattern: "Error: Runtime exited with error: signal: killed"
Filter for when processing time exceeds allocation
Pattern: "Task timed out"
The Destination ARN will point to arn:aws:lambda:us-west-2:118234403147:function:delete-rudy-testing-cloudwatch, but it can be any Lambda you want. In this example I set it up to publish an SNS message that emails me.
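If you'd rather create the filters with code than in the console, a minimal boto3 sketch looks like the following. The log group and filter names are assumptions; the destination ARN is the one from this example. Note that when subscribing a Lambda through the API you may also need to grant CloudWatch Logs permission to invoke it, which the console does for you.

import boto3

logs = boto3.client("logs")

LOG_GROUP = "/aws/lambda/delete-rudy-error-generating-lambda"
DESTINATION_ARN = "arn:aws:lambda:us-west-2:118234403147:function:delete-rudy-testing-cloudwatch"

# One subscription filter per failure mode; the patterns are the literal strings above.
for name, pattern in [
    ("out-of-memory", '"Error: Runtime exited with error: signal: killed"'),
    ("out-of-time", '"Task timed out"'),
]:
    logs.put_subscription_filter(
        logGroupName=LOG_GROUP,
        filterName=name,
        filterPattern=pattern,
        destinationArn=DESTINATION_ARN,
    )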
import logging
import os
import random
import time

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)


def test_out_of_time():
    # Sleep for 2-5 seconds so a short Lambda timeout gets exceeded.
    random_number = random.randint(2, 5)
    logger.info(f"Testing out of time, random_number is: {random_number}")
    time.sleep(random_number)


def test_out_of_memory():
    # Keep appending strings until the allocated memory is exhausted.
    s = []
    logger.info("Testing out of memory")
    for i in range(1000):
        for j in range(1000):
            for k in range(1000):
                s.append("More")


def lambda_handler(event: dict, context: dict) -> dict:
    print(f"Lambda function ARN: {context.invoked_function_arn}")
    print(f"CloudWatch log stream name: {context.log_stream_name}")
    print(f"CloudWatch log group name: {context.log_group_name}")
    print(f"Lambda Request ID: {context.aws_request_id}")
    print(f"Lambda function memory limits in MB: {context.memory_limit_in_mb}")
    print(f"Lambda time remaining in MS: {context.get_remaining_time_in_millis()}")
    #
    logger.setLevel(logging.DEBUG)
    logger.info(f"_HANDLER: {os.environ['_HANDLER']}")
    logger.info(f"AWS_EXECUTION_ENV: {os.environ['AWS_EXECUTION_ENV']}")
    AWS_LAMBDA_FUNCTION_MEMORY_SIZE = os.environ['AWS_LAMBDA_FUNCTION_MEMORY_SIZE']
    print(f"AWS_LAMBDA_FUNCTION_MEMORY_SIZE: {AWS_LAMBDA_FUNCTION_MEMORY_SIZE}")
    #
    input_bucket = event["input_bucket"]  # 'noaa-wcsd-pds'
    print(f"input bucket: {input_bucket}")
    logger.info(f"input bucket: {input_bucket}")
    input_key = event["input_key"]  # 'data/raw/Henry_B._Bigelow/HB20ORT/EK60/D20200225-T163738.raw'
    print(f"input key: {input_key}")
    logger.info(f"input key: {input_key}")
    output_bucket = event["output_bucket"]  # 'noaa-wcsd-zarr-pds'
    print(f"output bucket: {output_bucket}")
    output_key = event["output_key"]
    print(f"output key: {output_key}")  # data/raw/Henry_B._Bigelow/HB20ORT/EK60/D20200225-T163738.zarr
    #
    # Uncomment exactly one of the calls below to force the corresponding failure.
    test_out_of_time()
    # test_out_of_memory()
    #
    logger.info("This is a sample INFO message.. !!")
    logger.debug("This is a sample DEBUG message.. !!")
    logger.error("This is a sample ERROR message.... !!")
    logger.critical("This is a sample 6xx error message.. !!")
The code can be made to run out of time or out of memory depending on which of the two test calls is left uncommented.
The logging sends its output to CloudWatch's "/aws/lambda/delete-rudy-error-generating-lambda" log group.
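To exercise the function, invoke it with a test event carrying the bucket and key fields the handler expects. Below is a sketch of an asynchronous invocation with boto3; the function name is an assumption based on the log group name above, and the bucket/key values simply mirror the comments in the handler.

import json

import boto3

lam = boto3.client("lambda")

# Hypothetical test event; the values mirror the comments in the handler above.
event = {
    "input_bucket": "noaa-wcsd-pds",
    "input_key": "data/raw/Henry_B._Bigelow/HB20ORT/EK60/D20200225-T163738.raw",
    "output_bucket": "noaa-wcsd-zarr-pds",
    "output_key": "data/raw/Henry_B._Bigelow/HB20ORT/EK60/D20200225-T163738.zarr",
}

response = lam.invoke(
    FunctionName="delete-rudy-error-generating-lambda",  # assumed name, taken from the log group
    InvocationType="Event",  # asynchronous, so the timeout happens in the background
    Payload=json.dumps(event).encode("utf-8"),
)
print(response["StatusCode"])  # 202 for an asynchronous invoke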
[2] Testing CloudWatch
This step monitors the stream of CloudWatch Logs and detects when a specific string appears; in this case we are looking for "Task timed out" and "Error: Runtime exited with error: signal: killed."
The process goes as follows. [1] The error-generating lambda gets invoked. The Lambda times out and, among its logs, returns a message:
Response { "errorMessage": "2023-04-13T22:07:39.311Z 2e6d33d8-af65-4822-b974-782f2de5950a Task timed out after 3.01 seconds" }
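For reference, the subscribed Lambda (delete-rudy-testing-cloudwatch in this example) receives the matched log lines as a base64-encoded, gzipped payload. Here is a minimal sketch of unpacking it, pulling out the request ID, and forwarding an SNS notification; the topic ARN is a placeholder and the regex is only an assumption about where the request ID sits in the message.

import base64
import gzip
import json
import re

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-west-2:123456789012:lambda-failure-alerts"  # placeholder topic


def lambda_handler(event: dict, context: dict) -> dict:
    # CloudWatch Logs subscriptions deliver base64-encoded, gzipped JSON.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    for log_event in payload["logEvents"]:
        message = log_event["message"]
        # The timeout line embeds the request ID of the failed invocation.
        match = re.search(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", message)
        request_id = match.group(0) if match else "unknown"
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Lambda failure detected",
            Message=f"{payload['logGroup']}: request {request_id}\n{message}",
        )
    return {"statusCode": 200}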
There isn't enough information in this message alone; we need the key. The message gets captured by CloudWatch Logs, and from there we can use Logs Insights to scan with a filter pattern for keywords. We can now search Logs Insights for the string "2e6d33d8-af65-4822-b974-782f2de5950a", specifying /aws/lambda/error-generating-lambda with:
fields @message | filter @requestId like '2e6d33d8-af65-4822-b974-782f2de5950a'
Run the query. It returns roughly ten messages, including the one about the task timing out. One of them contains the input key; we can search for that message and extract the input_key for further processing.
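The same lookup can be done programmatically from the subscribed Lambda or a follow-up job. A rough sketch, assuming the request ID has already been extracted as above and that the key appears in an "input key: ..." log line:

import time

import boto3

logs = boto3.client("logs")


def find_input_key(request_id):
    # Run the same Logs Insights query as above, but through the API.
    query = logs.start_query(
        logGroupName="/aws/lambda/delete-rudy-error-generating-lambda",
        startTime=int(time.time()) - 3600,  # look back one hour
        endTime=int(time.time()),
        queryString=f"fields @message | filter @requestId like '{request_id}'",
    )
    # Poll until the query finishes; results can lag the log events (see the timing note below).
    while True:
        response = logs.get_query_results(queryId=query["queryId"])
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(5)
    # Each result row is a list of {"field": ..., "value": ...} dicts.
    for row in response["results"]:
        for field in row:
            if field["field"] == "@message" and "input key:" in field["value"]:
                return field["value"].split("input key:")[1].strip()
    return None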
Note: the biggest problem is timing. The logs are created immediately, but Logs Insights doesn't return reliable results until, sometimes, five minutes after the Lambda was run.