How to crawl AWS’s What’s New blog with a Python Lambda function and Slack

Alan Newcomer
5 min read · Jun 23, 2019


I want an update to my Slack channel with all the new announcements from AWS, once a day at 8:00am Central. Of course, we could subscribe to the RSS feed, but I really want all announcements to come in once a day. Let’s make a good old-fashioned crawler with Python 3.7, Lambda, and Slack.

What’s required

  1. Make a Lambda layer for Beautiful Soup and Python 3.7
  2. Create a Slack channel and add a web crawler app
  3. Create the Lambda function
  4. Add a CloudWatch Events rule and set it to a scheduled cron job

Make a Lambda layer for Beautiful Soup and Python 3.7

The following directions were executed on my Mac but are fairly straightforward.

The first thing to do is create a temp directory.

cd ~
mkdir -p temp/python

Next, download the required package using pip.

Make sure that pip is configured to work with Python 3.7. Executing pip --version will output something similar to:

pip 19.1.1 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)

Run the following command to download Beautiful Soup:

pip install bs4 -t temp/python/

Change into the temp directory and zip up the files into a single archive called bs4.zip.

cd temp
zip -r9 bs4.zip .

Now use the AWS CLI to publish the Lambda layer. Make sure the appropriate credentials are used and the AWS CLI is up to date.

aws lambda publish-layer-version \
--layer-name bs4-python37 \
--description "bs4 access by python 3.7" \
--zip-file fileb://bs4.zip \
--compatible-runtimes python3.7

Create a Slack channel and add a web crawler app

If you don’t have a Slack account, sign up for a free one. Once you have an account, create a new channel. I am calling mine AWS-whats-new-bot.

Once the channel is created, go to the channel settings on the top right and select View channel details.

Go to the Apps section and click Add app.

Search for incoming-webhook, add the app, and add an Incoming WebHooks integration.

Now copy down the Webhook URL. This will be used in the lambda function.
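Before wiring it into Lambda, you can sanity-check the webhook from your own machine. Here is a minimal sketch, assuming the requests package is installed locally and the placeholder URL is replaced with the one you just copied:

#!/usr/bin/python3
# quick local check that the incoming webhook posts to the channel
import requests

webhook_url = 'https://hooks.slack.com/services/'  # paste your Webhook URL here

# Slack expects a JSON payload with at least a "text" field
response = requests.post(webhook_url, json={'text': 'Test message from the AWS whats-new crawler'})
print(response.status_code, response.text)  # 200 and "ok" mean the webhook works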

Create the Lambda function

Go to the AWS console and create a new Lambda function with the runtime set to Python 3.7.

No special permissions are needed: the default execution role that Lambda creates (basic CloudWatch Logs access for logging) is enough, since the function only makes outbound HTTPS requests.

Add the newly created layer to the lambda function.
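If you prefer to script this step rather than use the console, a rough boto3 equivalent looks like the following (the function name and layer version ARN are placeholders for your own values):

# optional: attach the published layer from a script instead of the console
import boto3

lam = boto3.client('lambda')
lam.update_function_configuration(
    FunctionName='aws-whats-new-crawler',  # placeholder function name
    Layers=['arn:aws:lambda:us-east-1:123456789012:layer:bs4-python37:1'])  # placeholder layer version ARN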

Now add the following code. Ensure that the webhook_url variable is set to the Webhook URL you copied earlier.


#!/usr/bin/python3
# -*- coding: utf-8 -*-
import bs4
import json
import time
from datetime import date, timedelta
from botocore.vendored import requests

# getting yesterday's date and month
yesterday = (date.today() - timedelta(days=1)).strftime("%b %d, %Y")
yesterday_month = str((date.today() - timedelta(days=1)).strftime("%m"))

# only pulling from the monthly blog page (note: the year is hard-coded here)
blog_url = f"https://aws.amazon.com/about-aws/whats-new/2019/{yesterday_month}/"

# webhook url from the slack app integration
webhook_url = 'https://hooks.slack.com/services/'  # add webhook url


def blog_scrape():
    # setting variables
    n = 0
    title_html = None
    date_html = None
    description_html = None
    link_html = None
    icon = None
    bot_name = None

    page = requests.get(blog_url)
    content = bs4.BeautifulSoup(page.content, 'html.parser')

    # Beautiful Soup logic to get the title, date, description and link
    for link in content.find_all('li', class_='directory-item text whats-new'):
        link_html = 'https:' + str(link.find_all('div', class_='reg-cta')[0].find_all('a')[0].get('href')).replace('\n', '')
        for every in link.find_all('a'):
            every.replaceWithChildren()
        title_html = str(link.find_all('h3')[0].get_text()).replace('\n', '')
        date_html = str(link.find_all('div', class_='date')[0].get_text()).replace('\n', '')

        # if the entry is not from yesterday, skip it
        if yesterday not in date_html:
            pass
        else:
            description_html = str(link.find_all('div', class_='description')[0].find_all('p')[0].get_text()).replace('\n', '')

            # alternating the name and icon keeps Slack from grouping the messages
            if (n % 2) == 0:
                icon = "ghost"
                bot_name = "AWSUpdate"
            else:
                icon = "new"
                bot_name = "AWSUpdateAgain"

            # building the JSON payload for the webhook
            parsed = json.loads('{"text": "*' + title_html + '*\\n_' + date_html
                                + '_\\n `Description:` ' + description_html
                                + '\\n`Link:` ' + link_html + '","username": "'
                                + bot_name + '","mrkdwn": "true", '
                                + '"icon_emoji": ":' + icon + ':"}')

            # posting the blog entry to Slack
            response = requests.post(webhook_url, json=parsed)
            print(response)
            n += 1

            # sleep to keep the messages in order
            time.sleep(3)

    return '{0} new updates were reported on {1}'.format(n, yesterday)


def lambda_handler(event, context):
    http_reply = blog_scrape()
    return http_reply

Now test the Lambda function. The contents of the test event do not matter, since the handler ignores them.
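If you would rather invoke it from your own machine than use the console’s test button, here is a small boto3 sketch (the function name is a placeholder):

# optional: invoke the function locally with boto3 instead of the console test button
import boto3

client = boto3.client('lambda')
resp = client.invoke(FunctionName='aws-whats-new-crawler', Payload=b'{}')  # placeholder function name
print(resp['Payload'].read().decode())  # e.g. "2 new updates were reported on Jun 22, 2019"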

Add a CloudWatch Events rule and set it to a scheduled cron job

For the last step, go to the console and search for CloudWatch.

Select Events and Rules on the left and click Create rule.

Select Schedule and Cron expression.

Enter 0 13 * * ? * for the cron expression. This runs the Lambda function at 13:00 UTC every day, which is 8:00am Central during daylight saving time (7:00am CST in winter).

Click Add target and select the Lambda function that was created in the last step.

Click Configure details and Create Rule.
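These console steps can also be scripted. Here is a rough boto3 sketch of the same schedule (the rule name, function name, and ARNs are placeholders):

# sketch of the same daily schedule created with boto3 instead of the console
import boto3

events = boto3.client('events')
lam = boto3.client('lambda')

function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:aws-whats-new-crawler'  # placeholder ARN

# daily at 13:00 UTC (8:00am Central during daylight saving time)
rule = events.put_rule(Name='aws-whats-new-daily',
                       ScheduleExpression='cron(0 13 * * ? *)')

# allow CloudWatch Events to invoke the function, then point the rule at it
lam.add_permission(FunctionName='aws-whats-new-crawler',
                   StatementId='allow-cloudwatch-events',
                   Action='lambda:InvokeFunction',
                   Principal='events.amazonaws.com',
                   SourceArn=rule['RuleArn'])
events.put_targets(Rule='aws-whats-new-daily',
                   Targets=[{'Id': 'lambda-target', 'Arn': function_arn}])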

The Slack channel is now in place and will be updated every day with new announcements from AWS. This is a good example of how to build a serverless web crawler with Python, and of how to use Lambda layers for Python packages simply and quickly. Add your people to the Slack channel and let me know if you found this useful. Also, leave a comment if you build a similar channel for Google Cloud or Azure.

The views I express are mine alone and do not necessarily reflect the views of my employer.
