A data pipeline for SISTIC Singapore that extracts data from the Instagram API, transforms it on Google Cloud Run, and loads it into Google BigQuery.
This project was conceived to enable easy analysis of SISTIC Singapore's Instagram posts, comments, and post insights.
It builds skills in API integration, cloud technologies, and orchestration tools.
- Extract Instagram data from the Facebook Graph API (sketched after this list)
- A containerized Cloud Run job transforms the data
- The processed data is loaded into BigQuery
- The pipeline is orchestrated with Airflow on Cloud Composer; an email is sent on each successful run
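For the extraction step, the job pages through the Graph API's media edge for the Instagram account. Below is a minimal sketch using `requests`; the field list and API version are assumptions, and the account ID and access token are read from environment variables (see the local setup notes further down):

```python
import os
import requests

GRAPH_URL = "https://graph.facebook.com/v19.0"  # Graph API base; version may differ
ACCOUNT_ID = os.environ["INSTAGRAM_ACCOUNT_ID"]
ACCESS_TOKEN = os.environ["INSTAGRAM_ACCESS_TOKEN"]

def fetch_posts() -> list[dict]:
    """Page through the account's media edge and collect raw post records."""
    url = f"{GRAPH_URL}/{ACCOUNT_ID}/media"
    params = {
        "fields": "id,caption,media_type,timestamp,permalink",  # assumed fields
        "access_token": ACCESS_TOKEN,
    }
    posts = []
    while url:
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        posts.extend(payload.get("data", []))
        # The Graph API returns a fully-qualified 'next' URL for pagination.
        url = payload.get("paging", {}).get("next")
        params = None  # 'next' already embeds the query string
    return posts
```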
Above is the output: three tables under the instagram_analytics dataset (a load sketch follows the list):
- Posts (PK: post_id)
- Insights (PK: post_id)
- Comments (PK: comment_id)
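For the load step, here is a minimal sketch of how one of these tables might be written with the `google-cloud-bigquery` client. The project ID is a placeholder and schema autodetection is an assumption; the real job may define an explicit schema:

```python
from google.cloud import bigquery

def load_posts(rows: list[dict]) -> None:
    """Load transformed post records into the posts table, replacing old data."""
    client = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
    table_id = "my-project.instagram_analytics.posts"  # placeholder project ID
    job_config = bigquery.LoadJobConfig(
        autodetect=True,  # assumption; an explicit schema is also possible
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()  # block until the load job completes or raises
```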
Due to the distributed nature of compute and scheduling on the cloud, I opted to containerize only my Cloud Run job rather than containerizing an entire Airflow instance and volume-mounting my scripts. However, the option to do so is available with the official Airflow docker-compose.yaml file.
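With only the job containerized, the Composer DAG needs just two tasks: execute the Cloud Run job and send the notification email. A minimal sketch, assuming the `CloudRunExecuteJobOperator` from the apache-airflow-providers-google package; the project, region, job name, and recipient are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.providers.google.cloud.operators.cloud_run import (
    CloudRunExecuteJobOperator,
)

with DAG(
    dag_id="instagram_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_job = CloudRunExecuteJobOperator(
        task_id="run_transform_job",
        project_id="my-project",         # placeholder
        region="asia-southeast1",        # placeholder
        job_name="instagram-transform",  # placeholder Cloud Run job name
    )
    notify = EmailOperator(
        task_id="email_on_success",
        to="team@example.com",           # placeholder recipient
        subject="Instagram pipeline succeeded",
        html_content="The Cloud Run job completed successfully.",
    )
    # Downstream task only runs on success, so the email doubles as the alert.
    run_job >> notify
```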
To test the job locally, I used my own docker-compose.yaml file, although this was not required for the eventual cloud deployment. The environment variables INSTAGRAM_ACCOUNT_ID and INSTAGRAM_ACCESS_TOKEN, as well as the Google Service Account Key, are not included in this directory due to their sensitive nature.
Locally, with the full directory and account key in place, only one command is required after navigating to the project directory in a terminal:
docker compose up
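Inside the container, the job reads these values from the environment at startup. A minimal sketch, assuming the variable names above; the fallback key path is a placeholder for wherever the compose file mounts the service-account key:

```python
import os

# Secrets are injected by docker-compose locally (or by Cloud Run in the cloud).
# Failing fast on a missing variable beats a cryptic API error later.
ACCOUNT_ID = os.environ["INSTAGRAM_ACCOUNT_ID"]
ACCESS_TOKEN = os.environ["INSTAGRAM_ACCESS_TOKEN"]
# google-cloud-* clients pick this path up automatically when it is set.
KEY_PATH = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "/secrets/key.json")
```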
Detailed setup and workflow instructions are available in the docs folder: