Deployed at: https://dataminingproject-2lpg.onrender.com/
Demo at: Video Demo
This pipeline integrates several cloud platforms and tools to move, transform, and analyze data, ending with a deployed application for end users.
The diagram illustrates the complete architecture, including data flow and integration between various platforms.
- Local System: Raw data originates on a local system and is uploaded to GCP.
- Buckets in GCP: Data is stored in GCP buckets for initial processing and staging.
- SQL Server on GCP: Structured data is stored and managed using SQL Server hosted on GCP.
- JDBC Extraction: Data is extracted from SQL Server via JDBC for further processing (see the sketch after this list).
- Azure Databricks: Data processing and transformation are performed on Azure Databricks, hosted within a secure Virtual Network.
- Notebook (Transformation): Data is transformed using Databricks Notebooks, which apply advanced processing and preparation logic.
- Delta Lake (Data Lake House): The processed data is stored in Delta Lake for efficient querying, updates, and management.
- Analysis and Deployment: Analytical models are developed, tested, and deployed for end-use.
- The codebase and notebooks used for data processing and analytics are version-controlled on GitHub.
- Final analysis and deployments are pushed to the Render Cloud, enabling integration with end-user applications and services.
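For concreteness, the JDBC extraction step from the hosted SQL Server into a Databricks notebook might look like the sketch below. The host, database, table, and secret names are placeholders rather than values from this project; `spark` and `dbutils` are predefined in Databricks notebooks.

```python
# Sketch of the JDBC extraction step inside a Databricks notebook.
# Host, database, table, and secret names are placeholders.
jdbc_url = (
    "jdbc:sqlserver://<gcp-sql-host>:1433;"
    "databaseName=<database>;encrypt=true;trustServerCertificate=true"
)

df_raw = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.transactions")  # hypothetical source table
    .option("user", dbutils.secrets.get("pipeline", "sql-user"))
    .option("password", dbutils.secrets.get("pipeline", "sql-password"))
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

df_raw.printSchema()  # verify the extracted schema before transforming
```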
- Data Ingestion: Raw data is uploaded from the local system to GCP buckets.
- Data Storage: Data is structured and stored in an SQL Server on GCP.
- Data Extraction: Using JDBC, the data is extracted from the SQL Server to the Databricks Workspace.
- Data Transformation: Transformation and data wrangling are conducted in Databricks Notebooks.
- Data Storage in Delta Lake: Transformed data is stored in Delta Lake, enabling a lakehouse architecture (a write sketch follows this list).
- Analytics and Deployment: The processed data is analyzed, and results are deployed for further use.
- Version Control: All code and notebooks are maintained in GitHub.
- App Creation: A Streamlit app is created.
- Final Deployment: App is deployed to the Render Cloud for accessibility and integration.
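As a sketch of the Delta Lake storage step, writing the transformed DataFrame out and registering it as a table could look like this. `df_transformed` stands in for the notebook's output, and the mount path and table name are assumptions, not the project's actual locations.

```python
# Sketch of persisting transformed data to Delta Lake from a Databricks
# notebook; the path and table name below are placeholders.
(
    df_transformed.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/datalake/transactions_clean")
)

# Register the Delta files as a table so they can be queried with SQL.
spark.sql(
    "CREATE TABLE IF NOT EXISTS transactions_clean "
    "USING DELTA LOCATION '/mnt/datalake/transactions_clean'"
)
```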
- Google Cloud Platform: Data storage and SQL server hosting.
- Azure Databricks: Data processing and transformation.
- Delta Lake: Efficient storage and querying.
- GitHub: Version control for development and collaboration.
- Render Cloud: Final deployment for application integration.
- Backend and Frontend: Streamlit
- Language: Python 3.11
- Set up the pipeline as per the architecture above.
- Clone the repository:

  ```bash
  git clone https://github.com/codingmukul/data_mining_project
  cd data_mining_project
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the Streamlit app (an illustrative skeleton of `app.py` follows these steps):

  ```bash
  streamlit run app.py
  ```

- Access the Application: Open your browser and navigate to http://localhost:5000 (or the URL printed in the terminal; Streamlit serves on port 8501 by default unless configured otherwise).
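For orientation, `app.py` in an app of this shape is typically a small Streamlit script that collects the mining thresholds and displays results. The skeleton below is illustrative only; the widget names and defaults are assumptions, not the repository's actual code.

```python
# Illustrative skeleton of a Streamlit entry point for this kind of app;
# widget names and defaults are assumptions, not the repository's code.
import pandas as pd
import streamlit as st

st.title("Association Rule Mining")

# Collect the mining thresholds from the user.
min_support = st.sidebar.slider("Minimum support", 0.01, 1.0, 0.05)
min_confidence = st.sidebar.slider("Minimum confidence", 0.01, 1.0, 0.5)
min_lift = st.sidebar.number_input("Minimum lift", value=1.0)

uploaded = st.file_uploader("Upload transaction data (CSV)", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.subheader("Data preview")
    st.dataframe(df.head())
    # ...mine rules with the chosen thresholds and display them here.
```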
- Algorithm Findings: Finds association rules that satisfy user-supplied minimum values of support, confidence, and lift (see the sketch after this list).
- Plots and Analysis: Key plots and exploratory analysis of the data.
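A minimal sketch of that mining step, using mlxtend's Apriori implementation as one plausible choice (the project's actual algorithm, dataset, and column names may differ):

```python
# Sketch of rule mining with mlxtend on a one-hot encoded basket;
# the toy data and thresholds below are illustrative only.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Rows are transactions, columns are items, values mark item presence.
basket = pd.DataFrame(
    [
        [True, True, False],
        [True, False, True],
        [True, True, True],
    ],
    columns=["bread", "butter", "milk"],
)

frequent = apriori(basket, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
rules = rules[rules["lift"] >= 1.0]  # apply the minimum-lift cutoff
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```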
