The project begins by generating synthetic data with the Faker Python library, providing a controlled environment for testing the pipeline's real-time processing capabilities.
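A minimal sketch of this generation step, assuming a simple customer schema written to CSV; the column names here are illustrative, not necessarily the project's actual fields:

```python
import csv
import uuid
from faker import Faker

fake = Faker()

# Assumed customer schema -- the project's real columns may differ.
FIELDS = ["customer_id", "first_name", "last_name", "email", "city", "created_at"]

def generate_rows(n):
    """Yield n fake customer records as dictionaries."""
    for _ in range(n):
        yield {
            "customer_id": str(uuid.uuid4()),
            "first_name": fake.first_name(),
            "last_name": fake.last_name(),
            "email": fake.email(),
            "city": fake.city(),
            "created_at": fake.iso8601(),
        }

# Write a batch of synthetic records for NiFi to pick up.
with open("customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(generate_rows(100))
```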
Apache NiFi ingests and transforms the data in real time, reading from the Faker data source. The processed records are written to an S3 bucket in CSV format, which serves as a staging area before the data moves into Snowflake.
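Within the pipeline, the S3 write is handled by NiFi itself (typically a PutS3Object processor). The snippet below is only a hypothetical boto3 equivalent, useful for spot-checking that files are landing in the staging area; the bucket name and prefix are assumptions:

```python
import boto3

# Hypothetical bucket and prefix -- substitute the values configured in NiFi.
BUCKET = "my-nifi-staging-bucket"
PREFIX = "customer/"

s3 = boto3.client("s3")

# Upload one generated file the same way the NiFi flow would.
s3.upload_file("customers.csv", BUCKET, PREFIX + "customers.csv")

# List staged objects to confirm Snowpipe has something to pick up.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```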
The entire system runs on an AWS EC2 instance, with Docker providing containerization. Docker Compose orchestrates the deployment of the core components, Zookeeper and NiFi, simplifying setup.
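A minimal docker-compose.yml sketch for such a setup; the image tags, ports, and environment variables are assumptions rather than the project's actual file (NiFi images before 1.14 serve plain HTTP, configured via NIFI_WEB_HTTP_PORT):

```yaml
version: "3"
services:
  zookeeper:
    image: zookeeper:3.7        # assumed tag; any recent release should work
    ports:
      - "2181:2181"
  nifi:
    image: apache/nifi:1.13.2   # assumed tag; pre-1.14 images expose HTTP
    ports:
      - "8080:8080"
    environment:
      - NIFI_WEB_HTTP_PORT=8080
    depends_on:
      - zookeeper
```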
Snowflake, a cloud data warehousing platform, stores and manages the processed data. Snowpipe, Snowflake's continuous data ingestion service, performs delta loads from the S3 stage into the customer_raw staging table, the entry point for the structured data.
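A hedged sketch of what the Snowpipe wiring could look like. The stage, file format, and pipe names, the credentials placeholders, and the bucket URL are all assumptions; only customer_raw comes from the description above:

```sql
-- Assumed external stage over the S3 staging bucket.
CREATE OR REPLACE STAGE customer_s3_stage
  URL = 's3://my-nifi-staging-bucket/customer/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

CREATE OR REPLACE FILE FORMAT csv_format
  TYPE = 'CSV'
  SKIP_HEADER = 1;

-- AUTO_INGEST relies on S3 event notifications being routed to the
-- pipe's SQS queue, so new files trigger delta loads automatically.
CREATE OR REPLACE PIPE customer_pipe AUTO_INGEST = TRUE AS
  COPY INTO customer_raw
  FROM @customer_s3_stage
  FILE_FORMAT = (FORMAT_NAME = 'csv_format');
```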
A Snowflake Stream on the production table (customer) captures change data, which drives the Slowly Changing Dimension Type 2 (SCD-2) logic. Scheduled Snowflake Tasks run at one-minute intervals, automating the movement of data from the staging table into both the production and historical tables.
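A sketch of the stream and one possible staging-to-production task, under the assumptions that customer_id is the merge key and compute_wh is the warehouse (neither is stated in the description; column names follow the illustrative schema above):

```sql
-- Stream on the production table: records inserts, updates, and deletes
-- so the SCD-2 task can replay them into the history table.
CREATE OR REPLACE STREAM customer_stream ON TABLE customer;

-- One-minute task that merges staged rows into production.
CREATE OR REPLACE TASK load_customer
  WAREHOUSE = compute_wh
  SCHEDULE = '1 MINUTE'
AS
  MERGE INTO customer c
  USING customer_raw r
    ON c.customer_id = r.customer_id
  WHEN MATCHED THEN UPDATE SET
    email = r.email,
    city  = r.city
  WHEN NOT MATCHED THEN INSERT
    (customer_id, first_name, last_name, email, city)
    VALUES (r.customer_id, r.first_name, r.last_name, r.email, r.city);

-- Tasks are created suspended; start the schedule explicitly.
ALTER TASK load_customer RESUME;
```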
These scheduled tasks make the pipeline hands-off: data flows from the raw stage (customer_raw) into the production table (customer) without manual intervention. Two Slowly Changing Dimension (SCD) strategies are applied: SCD-1 (Type 1) keeps only the latest version of each record in the production table, while SCD-2 (Type 2) preserves the full change history in the historical table (customer_historical).
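The SCD-2 leg could consume the stream along these lines. This is only a partial sketch: it appends each captured change as a new current version, and a complete implementation would also close out the superseded row (setting valid_to and is_current), typically via a stored procedure, since a task body holds a single statement. Column names and everything beyond customer_historical are assumptions:

```sql
-- One-minute task that appends captured changes to the history table,
-- but only when the stream actually has new rows to consume.
CREATE OR REPLACE TASK load_customer_historical
  WAREHOUSE = compute_wh
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('customer_stream')
AS
  INSERT INTO customer_historical
    (customer_id, first_name, last_name, email, city,
     valid_from, valid_to, is_current)
  SELECT customer_id, first_name, last_name, email, city,
         CURRENT_TIMESTAMP(), NULL, TRUE
  FROM customer_stream
  WHERE METADATA$ACTION = 'INSERT';  -- consuming the stream advances its offset

ALTER TASK load_customer_historical RESUME;
```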