Hosting simple scripts for cheap on GCP
On giving an old MacBook Air some rest
Over the past few weeks I wanted to get a sense of how many remote-friendly jobs are posted on Stack Overflow on a given day. To do this, I wrote a small Python utility that parses their XML feed and uploads the results to a spreadsheet on Google Drive. So far so good.
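To give an idea of what the parsing step involves, here is a minimal sketch using only the standard library. The sample feed, the parse_jobs and is_remote helpers, and the rule of spotting the word "remote" in a title are all assumptions made for illustration; they are not the actual script or the actual feed layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample jobs feed</title>
    <item>
      <title>Backend Engineer (allows remote)</title>
      <link>https://example.com/jobs/1</link>
    </item>
    <item>
      <title>Data Analyst</title>
      <link>https://example.com/jobs/2</link>
    </item>
  </channel>
</rss>"""


def parse_jobs(feed_xml):
    """Return one dict per <item> element found in the feed."""
    root = ET.fromstring(feed_xml)
    jobs = []
    for item in root.iter("item"):
        jobs.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return jobs


def is_remote(job):
    """Assumed heuristic: the word 'remote' appears in the title."""
    return "remote" in job["title"].lower()


jobs = parse_jobs(SAMPLE_FEED)
remote_jobs = [job for job in jobs if is_remote(job)]
print(len(jobs), len(remote_jobs))
```

In the real script the feed would be downloaded first and the resulting rows appended to the spreadsheet.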
I initially hosted this script as a crontab job on my old MacBook Air, which sits collecting dust on a bookshelf. Everything went smoothly until I forgot to attach the power cord for a few days and consequently missed some data collections. It wasn’t the end of the world, and no human life depends on it, but it was probably the right time to study how to host simple scripts on a cloud provider for cheap (or, better still, for free).
I chose GCP for no specific reason beyond the fact that I already had an account and a few euros’ worth of free credits. I am aware that the same end result can be achieved on AWS, either with Lambda functions or by spinning up a VM.
The requirements were quite basic:
- Something cheap (the script will run 4 times a day)
- More efficient than a VM (without using k8s auto-scaling I’d end up paying for idle time). With Cloud Functions I’ll be billed only for the time my function has actually run for
- Easy and quick to deploy
I ended up choosing Cloud Functions because, at my usage levels, I’d stay within the free plan’s usage limits (at the time of writing).
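As a rough check of that claim, the arithmetic can be sketched as follows. The free-tier figures (2 million invocations and 400,000 GB-seconds per month) are assumptions based on the published Cloud Functions pricing and may have changed; the 128MB memory size and the 300 second worst-case run time are taken from the deploy command shown later in this post.

```python
# Usage figures taken from this post.
RUNS_PER_DAY = 4
DAYS_PER_MONTH = 31
MEMORY_GB = 128 / 1024        # 128MB expressed in GB
WORST_CASE_SECONDS = 300      # every run hits the configured timeout

# Assumed monthly free-tier allowances; check the current pricing page.
FREE_INVOCATIONS = 2_000_000
FREE_GB_SECONDS = 400_000

invocations = RUNS_PER_DAY * DAYS_PER_MONTH
gb_seconds = invocations * WORST_CASE_SECONDS * MEMORY_GB

print(f"invocations: {invocations} of {FREE_INVOCATIONS}")
print(f"GB-seconds: {gb_seconds} of {FREE_GB_SECONDS}")
```

Even in this pessimistic scenario the monthly usage is a tiny fraction of either allowance.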
Preparing and hosting the repo
To successfully host my function in the cloud, my original script needed a couple of adjustments.
First, as the function will be invoked by a GET request, I had to add request as an argument of my main function in order to make it callable.
    def main(request):
        # Do stuff
        pass

    if __name__ == "__main__":
        # No HTTP request exists when running locally, so pass a placeholder.
        main(None)
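A slightly fuller sketch of the same idea is shown below. The Python runtime passes the incoming request object to the entry point, and whatever the function returns becomes the HTTP response. The run_job helper and the response text are hypothetical stand-ins for the real logic.

```python
def main(request):
    """HTTP entry point: Cloud Functions passes in the incoming request object."""
    # The scheduled job needs nothing from the request, so it is ignored here.
    rows_written = run_job()
    # The returned string becomes the body of the HTTP response.
    return f"OK: {rows_written} rows written"


def run_job():
    """Hypothetical stand-in for the actual parsing and upload logic."""
    return 0


if __name__ == "__main__":
    # No HTTP request exists when running locally, so pass a placeholder.
    print(main(None))
```

Keeping the real work in a separate function makes it easy to run the job locally without faking a request.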
Second, gspread (a Python module for interacting with Google Sheets) requires a JSON file with API auth keys for authentication. I have put that file in a storage bucket and fetch it when the function is invoked.
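One way this could look is sketched below, assuming the google-cloud-storage package and a recent gspread version that provides gspread.service_account. The function names and the bucket_name parameter are hypothetical; the original script may well do this differently.

```python
import os


def local_key_path(key_name):
    """Cloud Functions only allows writes under /tmp, so store the key there."""
    return "/tmp/" + os.path.basename(key_name)


def open_worksheet(bucket_name, key_name, sheet_id, worksheet_name):
    """Download the key file from the bucket and open the target worksheet."""
    # Imported lazily so the helper above stays usable without these packages.
    import gspread
    from google.cloud import storage

    # Copy the JSON key from the bucket to the writable /tmp directory.
    key_path = local_key_path(key_name)
    blob = storage.Client().bucket(bucket_name).blob(key_name)
    blob.download_to_filename(key_path)

    # Authenticate with the downloaded key and open the worksheet by name.
    client = gspread.service_account(filename=key_path)
    return client.open_by_key(sheet_id).worksheet(worksheet_name)
```

Inside the function, open_worksheet would be called once per invocation with the bucket name and the values of the environment variables described further down.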
At this point there are mainly 3 ways to host the script code:
- Deploy directly from a local folder.
- Compress the script folder in a .zip archive and pass it in as an argument during deployment.
- Mirror my GitHub repo on a GCP Source Repository. This will play nicely with Cloud Build if and when I decide to implement some basic CI that triggers a refresh of the function upon new commits.
I decided to go with option 3, as the process is straightforward enough.
Functions can be easily deployed via the gcloud command-line utility. Depending on the use case, a number of arguments can be set. In this specific case, it is pretty important to set:
- --set-env-vars: sets the environment variables the script needs to run properly:
  - JOB_SHEET_ID: ID of the Google Sheet that will store the parsed job listings.
  - SHEET_NAME: name of the worksheet within the spreadsheet mentioned above.
  - GOOGLE_AUTH_KEY: name of the file storing the Google API keys. When run locally, this pointed to a file in the script folder. In the Cloud Function, it points to the right file in the storage bucket.
- --no-allow-unauthenticated: this argument prevents the function from being invoked from the outside world. It is particularly critical, as it ensures I won’t incur any charges generated by fraudulent activity. This has been the aspect I struggled to understand the most, as I kept finding different authentication approaches. After asking a question on Stack Overflow, I decided to stick with the approach described in this post, as it relies on GCP service accounts and doesn’t require implementing auth flows in the function.
- --timeout: it is safe to set this a tad longer than the average function run time, as execution is killed once it exceeds the timeout (the default is 60 seconds). It is worth pointing out that Cloud Functions have a maximum execution time of 9 minutes.
    gcloud functions deploy so-parse \
      --memory 128MB \
      --runtime python37 \
      --entry-point main \
      --source https://source.developers.google.com/projects/your-project/repos/repo-name/path/to/function \
      --timeout 300 \
      --set-env-vars JOB_SHEET_ID="your-gsheet-id" \
      --set-env-vars SHEET_NAME="worksheet-name" \
      --set-env-vars GOOGLE_AUTH_KEY="json-file-with-google-auth-keys.json" \
      --trigger-http \
      --no-allow-unauthenticated
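On the Python side, the variables passed with --set-env-vars are available through os.environ. The sketch below shows one way to read them and fail early if any is missing; the load_config helper and its error message are hypothetical and not part of the original script.

```python
import os

# Names of the variables set at deploy time.
REQUIRED_VARS = ("JOB_SHEET_ID", "SHEET_NAME", "GOOGLE_AUTH_KEY")


def load_config(environ=os.environ):
    """Collect the required variables, failing early if any are missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Accepting the mapping as a parameter makes the helper easy to exercise locally, for example by passing a plain dict instead of the real environment.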
To run the function automatically at set intervals, it is possible to rely on Cloud Scheduler. The main logic is to set up a cron job that fires a GET request to the function’s trigger URL at a given interval.
Before setting the cron job up we need to create a service account and make sure it has the right permissions to invoke Cloud Functions.
    gcloud iam service-accounts create cloud-scheduler

    gcloud functions add-iam-policy-binding \
      --member=serviceAccount:cloud-scheduler@your-project.iam.gserviceaccount.com \
      --role=roles/cloudfunctions.invoker so-parse

    gcloud scheduler jobs create http stack-scrape \
      --schedule="*/3 * * * *" \
      --uri=https://your-func-trigger-url \
      --oidc-service-account-email=cloud-scheduler@your-project.iam.gserviceaccount.com
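For a manual test, what the scheduler does on each tick can be imitated from Python: obtain an identity token whose audience is the function URL and send it as a bearer token. This is only a sketch, assuming the google-auth and requests packages are installed and that credentials of an account holding the invoker role are available; bearer_header and invoke are hypothetical helper names.

```python
def bearer_header(token):
    """Build the Authorization header expected by a non-public function."""
    return {"Authorization": f"Bearer {token}"}


def invoke(function_url):
    """Call the function with an identity token minted for its URL."""
    # Imported lazily; requires the google-auth and requests packages.
    import google.auth.transport.requests
    import google.oauth2.id_token
    import requests

    auth_request = google.auth.transport.requests.Request()
    # The audience of the token is the URL of the function being called.
    token = google.oauth2.id_token.fetch_id_token(auth_request, function_url)
    response = requests.get(function_url, headers=bearer_header(token), timeout=300)
    # Raise an exception on 4xx/5xx so a rejected call is not silently ignored.
    response.raise_for_status()
    return response.text
```

A request without the header, or with a token from an account lacking the invoker role, should be rejected before the function code runs.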
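To confirm that everything went according to plan, I quickly peeked at the logs after waiting a few minutes (to be sure the function had time to run at least once). Note that the "*/3 * * * *" schedule above fires every three minutes, which is convenient for this check; for the intended four runs a day, a schedule such as "0 */6 * * *" would fit.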