In this second lesson of the Scalable EC2 consuming servers for SQS series, we will create CloudWatch alarms.
CloudWatch monitors the metrics of other services, such as CPU usage in EC2, and message count or age in SQS.
We can rely on these metrics to trigger alarms. These alarms can then trigger actions, like increasing the number of EC2 instances that serve the queue.
Creating useful alarms is tricky. You need to make sure that the alarm criteria do not cause strange behavior in your architecture. For example, you do not want your EC2 instance count to increase but then fail to decrease!
To get an idea of the metrics that can be used for SQS, head to the SQS service and select the desired queue. Try adding a message, then wait ten minutes and add another one. From the tabs below, open the Monitoring tab and look at the metrics ApproximateNumberOfMessagesVisible and ApproximateAgeOfOldestMessage.
The first metric is the number of messages visible to the consuming servers. If this number is zero, then we know that the consuming servers are keeping up, and there is no need to increase the server count. In fact, there might be a need to decrease it.
The second metric is the age of the oldest message. This age becomes very high if messages enter the queue faster than the servers consume them. It is a good idea to use this metric to trigger the creation of new EC2 consuming servers.
We will use the visible message count to trigger decreasing the EC2 instances, and the age of the oldest message to trigger increasing them.
Please study your options, as these metrics may not suit your case, and may not work well even in common ones. I chose them because they illustrate the point and do the job, though perhaps not in the best way.
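To make the rule above concrete, here is a minimal Python sketch of the decision it describes. The function name scaling_decision and the 60-second default are my own assumptions for illustration; in the real setup CloudWatch evaluates these conditions for you.

```python
def scaling_decision(visible_messages, oldest_message_age_seconds, max_age_seconds=60):
    """Return 'increase', 'decrease', or 'hold' for the consuming servers.

    Hypothetical helper: it only mirrors the rule described in the text.
    """
    # Old messages mean the queue fills faster than the servers drain it.
    if oldest_message_age_seconds >= max_age_seconds:
        return "increase"
    # An empty queue means there may be more servers than needed.
    if visible_messages <= 0:
        return "decrease"
    # Messages are present but still young: leave the fleet alone.
    return "hold"
```

Note that the age check comes first, so a backlog always wins over a momentarily empty reading.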
Creating Increasing Alarm
Head to the CloudWatch service.
- Click on Alarms, then click to create a new one.
- You will be asked to select a metric. Select the queue metrics from the SQS metrics.
- Search for the metrics related to the queue we created, and select the ApproximateAgeOfOldestMessage metric. Click Next.
- Pick a name. I called it “VeryOldMessageAlarm”.
- In the “Whenever” section, you can select the value of the “approximate age of oldest message” that will trigger the alarm. For example, you can set this alarm to trigger if this value is larger than 60 seconds. Likewise, you can create another alarm that triggers if this value is less than 15 seconds, and use that alarm to decrease the count of the EC2 consuming servers. For this tutorial, I will pick larger than or equal to 60 seconds.
- Make sure to delete the default action created.
- Create the alarm 🙂
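The console steps above can also be expressed in code. The sketch below only builds the keyword arguments for the boto3 CloudWatch call put_metric_alarm. The function name is hypothetical, and the Statistic, Period, and EvaluationPeriods values are assumptions of mine, since the steps above do not state them.

```python
def build_increase_alarm_params(queue_name, threshold_seconds=60):
    """Build keyword arguments for CloudWatch put_metric_alarm (increasing alarm)."""
    return {
        "AlarmName": "VeryOldMessageAlarm",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        # SQS metrics are identified by the queue name dimension.
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",      # assumption: react to the worst age seen
        "Period": 300,               # assumption: five-minute evaluation window
        "EvaluationPeriods": 1,      # assumption: one breaching period is enough
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No AlarmActions key: this matches deleting the default action.
    }
```

With boto3 installed and credentials configured, something like boto3.client("cloudwatch").put_metric_alarm(**build_increase_alarm_params("my-queue")) should create the alarm, where "my-queue" is a placeholder for your queue name.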
Once the alarm is created, its status will be “Insufficient data”. That is okay. Just wait several minutes, and it will be updated.
If the queue is empty, the alarm status will be “OK”. Add a message to the queue, leave it unconsumed, and after several minutes the status will change to “Alarm”.
If the status of this alarm is “Alarm”, we know that the queue is not being served quickly enough, and it is a good idea to increase the number of consuming servers. However, if the status is “OK”, that does not mean that we can decrease the number of servers. It only means that we do not need to increase it.
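If you prefer to check the alarm status from code rather than the console, here is a small sketch. It expects a boto3 CloudWatch client to be passed in and uses its describe_alarms call. The function names are my own, and the mapping simply restates the reasoning above.

```python
def read_alarm_state(cloudwatch, alarm_name):
    """Return 'OK', 'ALARM', 'INSUFFICIENT_DATA', or None if the alarm is not found."""
    response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    alarms = response.get("MetricAlarms", [])
    if not alarms:
        return None
    return alarms[0]["StateValue"]


def should_increase(state):
    """Only the ALARM state says the queue is served too slowly.

    OK does not mean we may decrease; it only means no increase is needed.
    """
    return state == "ALARM"
```

For example, should_increase(read_alarm_state(boto3.client("cloudwatch"), "VeryOldMessageAlarm")) would tell you whether more consuming servers are needed right now.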
Creating Decreasing Alarm
We will follow the same steps used for the increasing alarm to create the decreasing one, except that we will pick the metric “ApproximateNumberOfMessagesVisible”, and set the “Whenever” to less than or equal to 0. Please note that this choice may not be the best, but it makes sense. You might pick something better. It is worth monitoring how the EC2 instance count varies, along with the status of the SQS queue, to understand whether the selected metrics really work well.
If there are no messages in the queue, then this alarm status will be “Alarm”, meaning that the queue is empty, and you can decrease the instance count.
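As a rough code equivalent of this decreasing alarm, the sketch below builds the keyword arguments for the boto3 CloudWatch call put_metric_alarm. The alarm name "EmptyQueueAlarm" and the function name are hypothetical, and Statistic, Period, and EvaluationPeriods are assumptions, not values taken from the steps above.

```python
def build_decrease_alarm_params(queue_name, alarm_name="EmptyQueueAlarm"):
    """Build keyword arguments for CloudWatch put_metric_alarm (decreasing alarm)."""
    return {
        "AlarmName": alarm_name,     # hypothetical name, pick your own
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",      # assumption: queue stayed empty for the whole period
        "Period": 300,               # assumption: five-minute evaluation window
        "EvaluationPeriods": 1,      # assumption: one period is enough
        "Threshold": 0.0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
    }
```

Using the maximum over the period is a deliberately cautious choice: the alarm fires only if no visible message was observed at any point in the window.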
I think the best metric for decreasing the instance count can be generated manually, i.e. by the application itself. If the consuming server tries to pick something from the queue and finds it empty for a long time, then we do not need this instance. I will leave it to you to figure out how to do it :p
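As one possible starting point for that idea, here is a small sketch of an idle tracker the consuming server could keep. The class name, the ten-minute default, and the injectable clock are all my own assumptions. It is not the only way to do this, and it leaves out what to do once the instance decides it is idle.

```python
import time


class IdleTracker:
    """Track how long the consumer has gone without receiving any message."""

    def __init__(self, idle_limit_seconds=600, clock=time.monotonic):
        self._clock = clock                  # injectable so the logic is easy to test
        self._idle_limit = idle_limit_seconds
        self._last_message_at = clock()

    def record_poll(self, message_count):
        """Call after each poll of the queue with the number of messages received."""
        if message_count > 0:
            self._last_message_at = self._clock()

    def is_idle(self):
        """True when no message has arrived for at least the idle limit."""
        return self._clock() - self._last_message_at >= self._idle_limit
```

In a consumer loop you would call record_poll with the number of messages returned by each receive, and treat is_idle() returning True as a hint that this instance is no longer needed.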
Next, we will write the Python code to consume the queue.