This GitHub repository contains the final project submission for the Data-driven Computing Architectures course (2025). The project implements a data pipeline to ingest, process, and visualize a student depression dataset using Snowflake, followed by training a machine learning model to predict depression. The pipeline adheres to the Medallion Architecture (Bronze, Silver, Gold layers) and provides actionable insights into student mental health.
- Name: Md Aslam Hossain
- Contribution: Sole contributor, responsible for designing, implementing, and documenting the pipeline and ML model. All work is tracked via a clear history of commits in this repository.
This project focuses on building a data pipeline to analyze student mental health data (student_depression_dataset.csv) through four stages:
- Ingestion: Loads raw CSV data into Snowflake's bronze layer (`BRONZE_STUDENT_DATA`) and tracks lineage in `DATA_LINEAGE` using `ingest.py`.
- Processing: Cleans and aggregates data into silver (`SILVER_STUDENT_DATA`) and gold (`GOLD_STUDENT_INSIGHTS`) layers with `process.py`.
- Visualization: Generates visual insights (e.g., depression rates by gender, CGPA vs. pressure) saved in `example/` using `visualize.py`.
- Modeling: Trains a Random Forest Classifier to predict depression, saved as `model/depression_model.joblib` with `model.py`.
The pipeline leverages Snowflake for scalable data storage and Python for processing and analysis, culminating in both visual outputs and a predictive model.
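The Bronze-to-Silver-to-Gold flow above can be sketched locally in pandas. This is a minimal illustration only: the column names (`gender`, `cgpa`, `depression`) are assumptions about the dataset's schema, and the real `process.py` operates on Snowflake tables rather than in-memory frames.

```python
import pandas as pd

# Toy stand-in for bronze-layer rows; the real data lives in
# BRONZE_STUDENT_DATA in Snowflake. Column names are illustrative.
bronze = pd.DataFrame({
    "gender": ["Male", "Female", None, "Female"],
    "cgpa": ["3.5", "2.8", "3.1", "bad"],
    "depression": [1, 0, 0, 1],
})

# Silver: enforce types and drop rows that cannot be used.
silver = bronze.copy()
silver["cgpa"] = pd.to_numeric(silver["cgpa"], errors="coerce")
silver = silver.dropna(subset=["gender", "cgpa"])

# Gold: an aggregated insight, e.g. depression rate by gender.
gold = silver.groupby("gender")["depression"].mean().rename("depression_rate")
print(gold)
```

The point of the layering is that each table is strictly cleaner than the one before it: bronze keeps the raw rows for lineage, silver is typed and de-duplicated, and gold holds the small aggregates the visualizations read.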
- `code/`: Core pipeline scripts and ML model training. See `code/README.md` for details.
- `data/`: Sample input data (`student_depression_dataset.csv`). See `data/README.md`.
- `docs/`: Additional scripts or notebooks (placeholder). See `docs/README.md`.
- `example/`: Output visualizations and pipeline run examples. See `example/README.md`.
- `model/`: Trained ML model file (`depression_model.joblib`) generated by `model.py`.
1. Clone the repository:

   ```bash
   git clone https://github.com/aa-it-vasa/ddca2025-project-group-24.git
   cd ddca2025-project-group-24
   ```

2. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```
3. Run the ETL pipeline and train the model:

   ```bash
   # 1. Ingest raw data
   python code/ingest.py
   # 2. Process to Silver/Gold layers
   python code/process.py
   # 3. Generate visualizations
   python code/visualize.py
   # 4. Train the prediction model
   python code/model.py
   ```
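The modeling step (`model.py`) can be sketched as below. This is a self-contained toy version, not the repository's actual script: the feature columns and hyperparameters are illustrative assumptions, and the real script reads its training data from the Snowflake silver layer rather than an in-memory frame.

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for silver-layer training data; column names are illustrative.
df = pd.DataFrame({
    "cgpa": [3.1, 2.4, 3.8, 2.0, 3.5, 2.2],
    "academic_pressure": [2, 5, 1, 4, 2, 5],
    "depression": [0, 1, 0, 1, 0, 1],
})

X = df[["cgpa", "academic_pressure"]]
y = df["depression"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Train a Random Forest and persist it with joblib, mirroring how
# model/depression_model.joblib is produced.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
joblib.dump(clf, "depression_model.joblib")
```

Once trained, the saved model can be reloaded with `joblib.load(...)` and applied to new student records with the same feature columns.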