Deployed at: https://dataminingproject-2lpg.onrender.com/
Demo at: Video Demo
This pipeline integrates several cloud platforms and tools to move, transform, and analyze data, ending with a deployed application for end users.
The diagram illustrates the complete architecture, including data flow and integration between various platforms.
- Local System: Raw data originates on a local system and is uploaded to GCP.
- Buckets in GCP: Data is stored in GCP buckets for initial processing and staging.
- SQL Server on GCP: Structured data is stored and managed using SQL Server hosted on GCP.
- JDBC Extraction: Data is extracted from SQL Server via JDBC for further processing (see the sketch after this list).
- Azure Databricks: Data processing and transformation are performed on Azure Databricks, hosted within a secure Virtual Network.
- Notebook (Transformation): Data is transformed using Databricks Notebooks, which apply advanced processing and preparation logic.
- Delta Lake (Data Lake House): The processed data is stored in Delta Lake for efficient querying, updates, and management.
- Analysis and Deployment: Analytical models are developed, tested, and deployed for end-use.
- The codebase and notebooks used for data processing and analytics are version-controlled on GitHub.
- Final analysis and deployments are pushed to the Render Cloud, enabling integration with end-user applications and services.
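For concreteness, the JDBC extraction step from the hosted SQL Server into a Databricks notebook might look like the sketch below. The host, database, table, and secret names are placeholders rather than values from this project; `spark` and `dbutils` are predefined in Databricks notebooks.

```python
# Sketch of the JDBC extraction step inside a Databricks notebook.
# Host, database, table, and secret names are placeholders.
jdbc_url = (
    "jdbc:sqlserver://<gcp-sql-host>:1433;"
    "databaseName=<database>;encrypt=true;trustServerCertificate=true"
)

df_raw = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.transactions")  # hypothetical source table
    .option("user", dbutils.secrets.get("pipeline", "sql-user"))
    .option("password", dbutils.secrets.get("pipeline", "sql-password"))
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

df_raw.printSchema()  # verify the extracted schema before transforming
```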
- Data Ingestion: Raw data is uploaded from the local system to GCP buckets.
- Data Storage: Data is structured and stored in an SQL Server on GCP.
- Data Extraction: Using JDBC, the data is extracted from the SQL Server to the Databricks Workspace.
- Data Transformation: Transformation and data wrangling are conducted in Databricks Notebooks.
- Data Storage in Delta Lake: Transformed data is stored in Delta Lake, enabling a lakehouse architecture (a write sketch follows this list).
- Analytics and Deployment: The processed data is analyzed, and results are deployed for further use.
- Version Control: All code and notebooks are maintained in GitHub.
- App Creation: A Streamlit app is created.
- Final Deployment: App is deployed to the Render Cloud for accessibility and integration.
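As a sketch of the Delta Lake storage step, writing the transformed DataFrame out and registering it as a table could look like this. `df_transformed` stands in for the notebook's output, and the mount path and table name are assumptions, not the project's actual locations.

```python
# Sketch of persisting transformed data to Delta Lake from a Databricks
# notebook; the path and table name below are placeholders.
(
    df_transformed.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/datalake/transactions_clean")
)

# Register the Delta files as a table so they can be queried with SQL.
spark.sql(
    "CREATE TABLE IF NOT EXISTS transactions_clean "
    "USING DELTA LOCATION '/mnt/datalake/transactions_clean'"
)
```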
- Google Cloud Platform: Data storage and SQL server hosting.
- Azure Databricks: Data processing and transformation.
- Delta Lake: Efficient storage and querying.
- GitHub: Version control for development and collaboration.
- Render Cloud: Final deployment for application integration.
- Backend and Frontend: Streamlit
- Language: Python 3.11
- Set up the pipeline as per the architecture above.
- Clone the repository:

  ```bash
  git clone https://github.com/codingmukul/data_mining_project
  cd data_mining_project
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the Streamlit app (an illustrative skeleton of `app.py` follows these steps):

  ```bash
  streamlit run app.py
  ```

- Access the Application: Open your browser and navigate to http://localhost:5000 (or the URL printed in the terminal; Streamlit serves on port 8501 by default unless configured otherwise).
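For orientation, `app.py` in an app of this shape is typically a small Streamlit script that collects the mining thresholds and displays results. The skeleton below is illustrative only; the widget names and defaults are assumptions, not the repository's actual code.

```python
# Illustrative skeleton of a Streamlit entry point for this kind of app;
# widget names and defaults are assumptions, not the repository's code.
import pandas as pd
import streamlit as st

st.title("Association Rule Mining")

# Collect the mining thresholds from the user.
min_support = st.sidebar.slider("Minimum support", 0.01, 1.0, 0.05)
min_confidence = st.sidebar.slider("Minimum confidence", 0.01, 1.0, 0.5)
min_lift = st.sidebar.number_input("Minimum lift", value=1.0)

uploaded = st.file_uploader("Upload transaction data (CSV)", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.subheader("Data preview")
    st.dataframe(df.head())
    # ...mine rules with the chosen thresholds and display them here.
```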
- Algorithm Findings: Finds association rules that satisfy user-supplied minimum values of support, confidence, and lift (see the sketch after this list).
- Plots and Analysis: Key plots and exploratory analysis of the data.
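A minimal sketch of that mining step, using mlxtend's Apriori implementation as one plausible choice (the project's actual algorithm, dataset, and column names may differ):

```python
# Sketch of rule mining with mlxtend on a one-hot encoded basket;
# the toy data and thresholds below are illustrative only.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Rows are transactions, columns are items, values mark item presence.
basket = pd.DataFrame(
    [
        [True, True, False],
        [True, False, True],
        [True, True, True],
    ],
    columns=["bread", "butter", "milk"],
)

frequent = apriori(basket, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
rules = rules[rules["lift"] >= 1.0]  # apply the minimum-lift cutoff
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```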
