People have asked me several times how to set up a good, clean code organization for a Python project with PySpark. I didn't find a fully featured example project, so this is my attempt at one. It also includes a simple integration with Jupyter Notebook inside the project.
Table of Contents
- https://mungingdata.com/pyspark/chaining-dataframe-transformations/
 - https://medium.com/albert-franzi/the-spark-job-pattern-862bc518632a
 - https://pawamoy.github.io/copier-poetry/
 - https://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure
 
All you need is the following configuration already installed:
- Git
 - The project was tested with Python 3.10.18 managed by pyenv:
   - Use the `make pyenv` goal to launch the automated install of pyenv
- `JAVA_HOME` environment variable configured with a Java JDK 11
- `SPARK_HOME` environment variable configured with the Spark version `spark-3.5.6-bin-hadoop3` package
- `PYSPARK_PYTHON` environment variable configured with `"python3.10"`
- `PYSPARK_DRIVER_PYTHON` environment variable configured with `"python3.10"`
- Install Make to run the `Makefile` file
- Why `Python 3.10`? Because `PySpark 3.5.6` doesn't seem to work with `Python 3.11` at the moment (I haven't tried with Python 3.12)
- pyenv prerequisites for Ubuntu. Check the prerequisites for your OS.
    sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
      libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
      libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
- `pyenv` installed and available in path: see pyenv installation with Prerequisites
- Install python 3.10 with pyenv on homebrew/linuxbrew
    CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl)" pyenv install 3.10
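
Before running anything, it can help to confirm that the environment variables listed in the prerequisites are actually set. The snippet below is only a minimal sketch and is not part of the project: the helper name `missing_env_vars` is hypothetical, and it only checks that each variable is non-empty, not that it points to a valid JDK or Spark install.

```python
import os

# Variable names taken from the prerequisites list above.
REQUIRED_VARS = (
    "JAVA_HOME",
    "SPARK_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to the current process environment,
    but any mapping can be passed in, which makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a mapping instead of reading `os.environ` directly keeps the function free of side effects.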
- Auto format via IDE https://github.com/psf/black#pycharmintellij-idea
- [Optional] You could set up a pre-commit hook to enforce Black formatting before each commit https://github.com/psf/black#version-control-integration
- Or remember to run `black .` to apply the Black formatting rules to all sources before committing
- Add integration with Jenkins: it will complain and tests will fail if Black formatting is not applied
- Add same mypy option for vscode in `Preferences: Open User Settings`
- Use the option to lint/format with black and flake8 on editor save in vscode

Checked optional type with Mypy PEP 484
Configure Mypy to help with type annotations/hints in Python code. It's very useful for the IDE and for catching errors/bugs early.
- Install mypy plugin for intellij
- Adjust the plugin with the following options:
    "--follow-imports=silent", "--show-column-numbers", "--ignore-missing-imports", "--disallow-untyped-defs", "--check-untyped-defs"
- Documentation: Type hints cheat sheet (Python 3)
- Add same mypy option for vscode in `Preferences: Open User Settings`
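
To give an idea of what these options ask for, here is a small sketch of fully annotated functions. With `--disallow-untyped-defs`, mypy reports any function definition that lacks annotations, so every parameter and return type is spelled out. The function names and the sample data are hypothetical and do not come from this project.

```python
from typing import List, Optional


def count_mentions(titles: List[str], drug: str) -> int:
    """Count the titles that mention `drug`, ignoring case."""
    needle = drug.lower()
    return sum(1 for title in titles if needle in title.lower())


def first_mention(titles: List[str], drug: str) -> Optional[str]:
    """Return the first title mentioning `drug`, or None if there is none."""
    needle = drug.lower()
    for title in titles:
        if needle in title.lower():
            return title
    return None
```

The `Optional[str]` return type makes the "no match" case explicit, so a type checker can flag callers that forget to handle `None`.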
- isort is the default on pycharm
- isort with vscode
- Lint/format/sort import on save with vscode in `Preferences: Open User Settings`:
{
    "editor.formatOnSave": true,
    "python.formatting.provider": "black",
    "[python]": {
        "editor.codeActionsOnSave": {
            "source.organizeImports": true
        }
    }
}
- isort configuration for pycharm. See Set isort and black formatting code in pycharm
- You can use the `make lint` command to check flake8/mypy rules and automatically apply black and isort formatting to the code with the previous configuration
    isort .
- Show a way to handle an erroneous JSON file like `data/pubmed.json`
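
As an illustration of one way to tolerate a malformed JSON file, the sketch below retries parsing after removing trailing commas. This is an assumption: the text above does not say what is wrong with `data/pubmed.json`, and the helper name `load_lenient_json` is hypothetical rather than the project's actual approach.

```python
import json
import re

# Matches a comma followed only by whitespace before a closing ] or }.
_TRAILING_COMMA = re.compile(r",(\s*[\]}])")


def load_lenient_json(text):
    """Parse JSON, retrying once with trailing commas removed.

    Assumption: the only defect is a trailing comma before a closing
    bracket or brace. Any other syntax error still raises JSONDecodeError.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Naive repair: this would also rewrite a ",]" or ",}" sequence
        # that happens to appear inside a string value.
        repaired = _TRAILING_COMMA.sub(r"\1", text)
        return json.loads(repaired)
```

Strict parsing is tried first, so well-formed input is never modified. The repair step only runs on files that fail to parse.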
- Create a poetry env with python 3.10
    poetry env use 3.10
- Install pyenv
    make pyenv
- Install dependencies in poetry env (virtualenv)
    make deps
- Lint & Test
    make build
- Lint, Test & Run
    make run
- Run dev
    make dev
- Build binary/python wheel
    make dist

poetry run drugs_gen --help
Usage: drugs_gen [OPTIONS]
Options:
  -d, --drugs TEXT             Path to drugs.csv
  -p, --pubmed TEXT            Path to pubmed.csv
  -c, --clinicals_trials TEXT  Path to clinical_trials.csv
  -o, --output TEXT            Output path to result.json (e.g
                               /path/to/result.json)
  --help                       Show this message and exit.
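
For readers who want to see how such an interface maps to code, here is a minimal sketch of the same four options. The source of the real `drugs_gen` entry point is not shown here, so this version uses the standard library's `argparse` purely as an illustration. The `build_parser` name and the example paths are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser exposing the options listed in the help text above."""
    parser = argparse.ArgumentParser(prog="drugs_gen")
    parser.add_argument("-d", "--drugs", help="Path to drugs.csv")
    parser.add_argument("-p", "--pubmed", help="Path to pubmed.csv")
    parser.add_argument(
        "-c", "--clinicals_trials", help="Path to clinical_trials.csv"
    )
    parser.add_argument("-o", "--output", help="Output path to result.json")
    return parser


# Parse an explicit argument list instead of sys.argv (hypothetical paths).
args = build_parser().parse_args(
    ["-d", "data/drugs.csv", "-o", "out/result.json"]
)
print(args.drugs, args.output)
```

Options that are not passed default to `None`, so the calling code can decide whether to fall back to a default path or report an error.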
- Use `spark-submit` with the Python Wheel file built by the `make dist` command in the `dist` folder.