andreysgit/College-Data-Analysis

Andrey Sokolov
Completed February 2025


TLDR

Look through course_project.ipynb for a quick sense of what this project covers.


Project Overview

This project explores the relationship between demographic and socioeconomic factors and individual income, leveraging a public dataset to uncover actionable insights for a hypothetical college’s marketing and outreach strategies. The primary objective is to understand how attributes such as education, occupation, relationship status, and work habits correlate with annual earnings, specifically identifying the factors that distinguish higher earners from the rest.

The analysis supports strategic decision-making by revealing patterns that can inform targeted communications, recruitment, and program development for prospective students.


Approach

  • Data Preparation
    Cleaned and preprocessed a complex, real-world dataset with missing values, categorical inconsistencies, and outlier issues. Special attention was given to encoding, binning, and filtering to ensure meaningful analysis and robust visualizations.

  • Exploratory Analysis & Visualization
    Conducted a broad exploratory analysis to identify which variables most strongly relate to income differences. Created several visual representations (including histograms, heatmaps, boxplots, and point plots) to illustrate how demographic and work-related factors interact with income brackets. Each visualization was tailored for clarity, fairness, and interpretability, so that the most influential variables can be identified quickly.

  • User-Centered Insights
    Structured the analysis around key “user stories,” each representing the perspective of a stakeholder (e.g., a marketing strategist, a workforce planner, or an economic analyst). For each scenario, the project surfaces unique findings about how income relates to combinations of demographic factors, employment type, and financial metrics.

  • Methodological Adjustments
    Addressed common data visualization challenges, such as handling skewed data, visual clutter from high-cardinality categories, and the best way to represent missing information. When needed, adjustments to the visualization approach were made to maintain interpretability and analytical rigor.
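
The preparation steps described above (flagging missing values, normalizing inconsistent category labels, and binning a numeric field) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the notebook's actual code. The field names (workclass, hours_per_week), the missing-value markers, and the bin edges are all assumptions made for the example.

```python
from typing import Optional

# Assumed markers for missing values; the real dataset may use others.
MISSING_MARKERS = {"", "?", "na", "n/a"}


def clean_category(value: str) -> Optional[str]:
    """Normalize a categorical label; return None when it is missing."""
    normalized = value.strip().lower().replace("-", "_")
    return None if normalized in MISSING_MARKERS else normalized


def bin_hours(hours: float) -> str:
    """Map weekly hours to a coarse band (bin edges are illustrative)."""
    if hours < 35:
        return "part_time"
    if hours <= 45:
        return "full_time"
    return "overtime"


# Hypothetical raw rows, invented for this sketch.
raw_rows = [
    {"workclass": " Private", "hours_per_week": "40"},
    {"workclass": "?", "hours_per_week": "20"},
    {"workclass": "Self-Emp", "hours_per_week": "60"},
]

cleaned = [
    {
        "workclass": clean_category(row["workclass"]),
        "hours_band": bin_hours(float(row["hours_per_week"])),
    }
    for row in raw_rows
]
```

Keeping missing values as None, rather than dropping the rows, leaves open the later choice of whether to filter them out or show them as their own category.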


Key Findings

  • Educational Attainment & Income:
    Higher educational levels correlate with an increased probability of higher income, though the groups overlap substantially, so education alone does not guarantee higher earnings.

  • Work & Relationship Factors:
    Both relationship status and hours worked per week interact in nuanced ways with earning potential, often varying significantly by gender and age group.

  • Country and Occupation:
    Income distributions differ notably by country of origin and work class, revealing geographic and occupational patterns relevant to institutional strategy.

  • Financial Indicators:
    High-income individuals often exhibit distinct patterns in non-wage earnings, such as capital gains, reinforcing the importance of financial literacy and investment opportunities for career growth.
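
Findings of this kind rest on comparing the share of higher earners across groups. A minimal sketch of that comparison is shown here, assuming each record carries a boolean high_income flag and a grouping field such as education. Both names are hypothetical, and the toy records are invented for illustration; they are not results from this project.

```python
from collections import defaultdict


def high_income_share(records, group_key):
    """Return the fraction of high-income records within each group."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec["high_income"]:
            high[group] += 1
    return {group: high[group] / totals[group] for group in totals}


# Toy records, invented for this sketch (not project data).
toy_records = [
    {"education": "bachelors", "high_income": True},
    {"education": "bachelors", "high_income": False},
    {"education": "hs_grad", "high_income": False},
    {"education": "hs_grad", "high_income": False},
    {"education": "hs_grad", "high_income": True},
    {"education": "hs_grad", "high_income": False},
]

shares = high_income_share(toy_records, "education")
```

Comparing shares rather than raw counts keeps groups of very different sizes on the same scale.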


Challenges & Solutions

  • Dealt with real-world messiness: missing data, outliers, and non-uniform categories.
  • Used a combination of statistical binning and focused filtering to clarify patterns in high-variance data.
  • Balanced sample size concerns and missing-value representation to avoid misleading conclusions.
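
One way to balance sample-size concerns against missing-value representation is to count missing entries as their own category and to set aside groups that are too small to summarize reliably. The sketch below assumes that approach. The unknown label and the default threshold of 30 are illustrative choices, not values taken from the notebook.

```python
from collections import Counter


def split_by_sample_size(values, min_count=30, missing_label="unknown"):
    """Count each category, treating None as its own labeled group.

    Returns two dicts: groups with at least min_count observations,
    and groups below that threshold.
    """
    counts = Counter(missing_label if v is None else v for v in values)
    reliable = {g: n for g, n in counts.items() if n >= min_count}
    too_small = {g: n for g, n in counts.items() if n < min_count}
    return reliable, too_small


# Hypothetical category values; a low threshold keeps the example short.
values = ["private"] * 5 + [None] * 3 + ["federal_gov"]
reliable, too_small = split_by_sample_size(values, min_count=3)
```

Groups that land in too_small can then be merged into an "other" bucket or annotated in a plot instead of being drawn as if they were as well supported as the larger groups.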

Limitations & Future Work

  • The scope was intentionally focused on exploratory data analysis and insight generation, rather than predictive modeling.
  • Future directions might include integrating additional socioeconomic data, time-series analysis, or developing interactive dashboards for deeper exploration.

Acknowledgments

This project was completed as part of the requirements for graduate school in Spring 2025. All code, figures, and results are original work by Andrey Sokolov. The dataset is derived from public sources.
