Andrey Sokolov
Completed February 2025
See course_project.ipynb for the full analysis.
This project explores the relationship between demographic and socioeconomic factors and individual income, leveraging a public dataset to uncover actionable insights for a hypothetical college’s marketing and outreach strategies. The primary objective is to understand how attributes such as education, occupation, relationship status, and work habits correlate with annual earnings, specifically identifying the factors that distinguish higher earners from the rest.
The analysis supports strategic decision-making by revealing patterns that can inform targeted communications, recruitment, and program development for prospective students.
- Data Preparation: Cleaned and preprocessed a real-world dataset with missing values, inconsistent category labels, and outliers. Special attention was given to encoding, binning, and filtering so that the analysis and visualizations remain meaningful.
- Exploratory Analysis & Visualization: Conducted a broad exploratory analysis to identify which variables relate most strongly to income differences. Created several visualizations (including histograms, heatmaps, boxplots, and point plots) to show how demographic and work-related factors interact with income brackets. Each visualization was designed for clarity, fairness, and interpretability, so that the most influential variables are quick to identify.
- User-Centered Insights: Structured the analysis around key “user stories,” each representing the perspective of a stakeholder (e.g., a marketing strategist, a workforce planner, or an economic analyst). For each scenario, the project surfaces unique findings about how income relates to combinations of demographic factors, employment type, and financial metrics.
- Methodological Adjustments: Addressed common data visualization challenges, such as handling skewed data, reducing visual clutter from high-cardinality categories, and choosing how to represent missing information. Where needed, the visualization approach was adjusted to maintain interpretability and analytical rigor.
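
The data preparation step described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the column names, the `"?"` missing-value placeholder, the `">50K"` bracket label, and the helper `prepare` are all assumptions made for the example, and the rows are toy data.

```python
import pandas as pd

# Toy rows standing in for the real dataset.
# Column names and values are hypothetical, chosen only for illustration.
raw = pd.DataFrame({
    "education": [" Bachelors", "HS-grad ", "?", "Masters"],
    "hours_per_week": [40, 60, 35, 99],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Trim category labels, mark placeholders as missing, encode the target."""
    out = df.copy()
    for col in ["education", "income"]:
        # Remove stray whitespace so identical categories compare equal.
        out[col] = out[col].str.strip()
        # Assumed placeholder "?" becomes a real missing value (NaN).
        out[col] = out[col].mask(out[col] == "?")
    # Encode the income bracket as a boolean flag (assumed label ">50K").
    out["high_income"] = out["income"] == ">50K"
    return out


clean = prepare(raw)
print(clean)
```

Keeping missing values as real `NaN` entries, rather than as a placeholder string, lets later steps decide explicitly whether to drop them or show them as their own category.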
- Educational Attainment & Income: Higher educational levels correlate with a higher probability of high income, though there is significant overlap, indicating that education alone does not guarantee higher earnings.
- Work & Relationship Factors: Relationship status and hours worked per week both relate to earning potential, and these relationships often vary substantially by gender and age group.
- Country and Occupation: Income distributions differ notably by country of origin and work class, revealing geographic and occupational patterns relevant to institutional strategy.
- Financial Indicators: High-income individuals often exhibit distinct patterns in non-wage earnings, such as capital gains, reinforcing the importance of financial literacy and investment opportunities for career growth.
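
Findings like these typically come from a grouped comparison of the share of higher earners. The sketch below shows one way such a comparison might be computed. The data is invented toy data, not the project's results, and the column names `education` and `high_income` are hypothetical.

```python
import pandas as pd

# Invented toy data, NOT results from the project.
toy = pd.DataFrame({
    "education": [
        "HS-grad", "HS-grad", "HS-grad", "HS-grad",
        "Bachelors", "Bachelors",
        "Masters", "Masters", "Masters",
    ],
    "high_income": [
        False, False, False, True,
        False, True,
        True, True, False,
    ],
})

# Share of higher earners per education level, with the group size
# alongside so that small groups are not over-interpreted.
summary = (
    toy.groupby("education")["high_income"]
    .agg(share_high="mean", n="size")
    .sort_values("share_high")
)
print(summary)
```

Reporting the group size `n` next to each share makes the overlap between groups easier to judge, since a high share in a tiny group carries little weight.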
- Dealt with real-world messiness: missing data, outliers, and non-uniform categories.
- Used a combination of statistical binning and focused filtering to clarify patterns in high-variance data.
- Balanced sample size concerns and missing-value representation to avoid misleading conclusions.
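
As a rough illustration of the binning, filtering, and missing-value handling listed above, the sketch below uses fixed bin edges and a minimum group size. The bin edges, bin labels, the `MIN_GROUP_SIZE` threshold, the `"Missing"` label, and the column names are all assumptions for the example; the notebook may use different choices (for instance, quantile-based bins).

```python
import pandas as pd

# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Private", None, None, "Self-emp"],
    "hours_per_week": [20, 40, 45, 38, 60, 80],
})

# Bin a continuous variable into labelled ranges.
# Edges are assumed; intervals are right-inclusive, e.g. (34, 40].
toy["hours_bin"] = pd.cut(
    toy["hours_per_week"],
    bins=[0, 34, 40, 59, 99],
    labels=["part-time", "full-time", "overtime", "extreme"],
)

# Represent missing categories explicitly instead of silently dropping them.
toy["workclass"] = toy["workclass"].fillna("Missing")

# Keep only categories with enough rows to support a comparison.
MIN_GROUP_SIZE = 2  # hypothetical threshold
counts = toy["workclass"].value_counts()
keep = counts[counts >= MIN_GROUP_SIZE].index
filtered = toy[toy["workclass"].isin(keep)]
print(filtered)
```

Filling missing values before counting means the `"Missing"` group is subject to the same size threshold as every other category.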
- The scope was intentionally focused on exploratory data analysis and insight generation, rather than predictive modeling.
- Future directions might include integrating additional socioeconomic data, time-series analysis, or developing interactive dashboards for deeper exploration.
This project was completed as part of the requirements for graduate school in Spring 2025. All code, figures, and results are original work by Andrey Sokolov. The dataset is derived from public sources.