This repository contains the code and data for our paper on using large language models (LLMs) for stylometric analysis. We demonstrate that GPT-2 models trained on individual authors' works can capture unique writing styles, enabling accurate authorship attribution through cross-entropy loss comparison.
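At its core, the approach scores a text under each author-specific model and attributes the text to the author whose model assigns the lowest loss. Below is a minimal sketch of that comparison using Hugging Face transformers; the checkpoint paths and author subset are hypothetical (see models/README.md for the actual layout):

```python
# Minimal sketch of the attribution idea. Checkpoint paths and the author
# subset are hypothetical; see models/README.md for the real layout.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def cross_entropy(model, tokenizer, text):
    # When labels are supplied, GPT2LMHeadModel returns the mean
    # token-level cross-entropy as out.loss.
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        out = model(input_ids=ids, labels=ids)
    return out.loss.item()

text = open("unknown.txt").read()       # text of unknown authorship
losses = {}
for author in ["baum", "thompson"]:     # hypothetical subset of the 8 authors
    path = f"models/{author}"           # hypothetical checkpoint directory
    model = GPT2LMHeadModel.from_pretrained(path).eval()
    tokenizer = GPT2Tokenizer.from_pretrained(path)
    losses[author] = cross_entropy(model, tokenizer, text)

print("predicted author:", min(losses, key=losses.get))
```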
```
llm-stylometry/
├── llm_stylometry/        # Python package (analysis, visualization, data loading)
├── code/                  # Scripts (training, figures, stats) - see code/README.md
├── data/                  # Texts and results - see data/README.md
├── models/                # 320 trained GPT-2 models - see models/README.md
├── paper/                 # LaTeX source and figures - see paper/README.md
├── tests/                 # Test suite
├── run_llm_stylometry.sh  # Main CLI wrapper
├── remote_train.sh        # GPU cluster training
├── check_remote_status.sh # Monitor remote training
└── sync_models.sh         # Download trained models
```
See folder-specific README files for detailed documentation.
The easiest way to get started is using the comprehensive CLI script:
```bash
# Clone the repository
git clone https://github.com/ContextLab/llm-stylometry.git
cd llm-stylometry

# Run the CLI (automatically sets up conda environment if needed)
./run_llm_stylometry.sh
```

The script will:
- Check for conda and install Miniconda if needed (platform-specific)
- Create and configure the conda environment
- Install all dependencies including PyTorch with CUDA support
- Generate all paper figures from pre-computed results
If you prefer manual setup:
```bash
# Create environment
conda create -n llm-stylometry python=3.10
conda activate llm-stylometry

# Install PyTorch (adjust for your CUDA version)
conda install -c pytorch -c nvidia pytorch pytorch-cuda=12.1

# Install other dependencies
pip install "numpy<2" scipy transformers matplotlib seaborn pandas tqdm
pip install cleantext plotly scikit-learn

# Install the package
pip install -e .
```

The easiest way to use the toolbox is via the CLI wrapper scripts:
```bash
# Generate all figures from pre-computed results
./run_llm_stylometry.sh

# Generate specific figure
./run_llm_stylometry.sh -f 1a    # Figure 1A only
./run_llm_stylometry.sh -l       # List available figures

# Compute statistical analyses
./run_stats.sh

# Get help
./run_llm_stylometry.sh -h
```

For training models from scratch, see Training Models from Scratch.
Python API: You can also use Python directly for programmatic access:
```python
from llm_stylometry.visualization import generate_all_losses_figure

# Generate a figure
fig = generate_all_losses_figure(
    data_path='data/model_results.pkl',
    output_path='figure.pdf'
)
```

See the Package API section for all available functions.
Note: T-test calculations (Figure 2) take 2-3 minutes due to statistical computations across all epochs and authors.
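To give a flavor of those computations, here is a toy sketch with placeholder numbers (not the repository's actual statistics) of the kind of comparison behind the t-test figures: whether a model's loss on its own author's texts is reliably lower than its loss on other authors' texts.

```python
# Toy illustration with placeholder numbers -- not the repository's actual
# statistics. Compares a model's losses on its own author's texts against
# its losses on other authors' texts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
own_author = rng.normal(3.2, 0.1, size=10)      # e.g., one loss per seed (10 seeds)
other_authors = rng.normal(3.6, 0.1, size=10)

t, p = stats.ttest_ind(own_author, other_authors)
print(f"t = {t:.2f}, p = {p:.2g}")
```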
Downloading pre-trained weights (optional): Model weight files are gitignored due to their size. Download them if you want to explore or run the trained models:
```bash
./download_model_weights.sh --all    # Download all variants (~26.6GB)
./download_model_weights.sh -b       # Baseline only (~6.7GB)
```

See models/README.md for details. Pre-trained weights are not required for generating figures.
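Once downloaded, a checkpoint can be loaded like any Hugging Face model. A minimal sketch, assuming a hypothetical directory layout (check models/README.md for the actual one):

```python
# Minimal sketch of loading a downloaded checkpoint; the path below is an
# assumption -- check models/README.md for the actual directory layout.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

path = "models/baum"  # hypothetical local checkpoint directory
model = GPT2LMHeadModel.from_pretrained(path).eval()
tokenizer = GPT2Tokenizer.from_pretrained(path)

# Sample a continuation to get a feel for the captured style.
ids = tokenizer("Dorothy walked", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```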
Author datasets on HuggingFace: Cleaned text corpora for all 8 authors are publicly available. See data/README.md for dataset links and usage.
The paper analyzes three linguistic variants (Supplemental Figures S1-S8); a toy illustration of each transform follows this list:
- Content-only: function words masked → tests vocabulary/word choice (Supp. Figs. S1, S4, S7A, S8A)
- Function-only: content words masked → tests grammatical structure (Supp. Figs. S2, S5, S7B, S8B)
- Part-of-speech: words replaced with POS tags → tests syntactic patterns (Supp. Figs. S3, S6, S7C, S8C)
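The sketch below illustrates the three transforms on a toy sentence. It is not the repository's preprocessing pipeline: the function-word list is a small stand-in, and NLTK's tagger substitutes for whatever tagger the paper used.

```python
# Toy illustration of the three variants -- NOT the repository's actual
# preprocessing. The function-word list and NLTK tagger are stand-ins.
import nltk
for res in ("averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)  # resource name differs across NLTK versions

FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "it", "was"}  # toy list
MASK = "<mask>"

def content_only(text):
    # Mask function words; what remains reflects vocabulary/word choice.
    return " ".join(MASK if w.lower() in FUNCTION_WORDS else w for w in text.split())

def function_only(text):
    # Mask content words; what remains reflects grammatical structure.
    return " ".join(w if w.lower() in FUNCTION_WORDS else MASK for w in text.split())

def pos_only(text):
    # Replace every word with its part-of-speech tag (syntactic patterns).
    return " ".join(tag for _, tag in nltk.pos_tag(text.split()))

sentence = "It was the best of times"
print(content_only(sentence))   # <mask> <mask> <mask> best <mask> times
print(function_only(sentence))  # It was the <mask> of <mask>
print(pos_only(sentence))       # e.g., PRP VBD DT JJS IN NNS
```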
Generate supplemental figures:
```bash
./run_llm_stylometry.sh -f s1a    # Supp. Fig. S1A (content-only, Fig 1A format)
./run_llm_stylometry.sh -f s4b    # Supp. Fig. S4B (content-only, Fig 2B format)
./run_llm_stylometry.sh -f s7c    # Supp. Fig. S7C (POS confusion matrix)
```

Training variants (each variant trains 80 models: 8 authors × 10 seeds):
```bash
./run_llm_stylometry.sh --train -co    # Content-only
./remote_train.sh -fo                  # Function-only on GPU cluster
```

Statistical analysis:

```bash
./run_stats.sh            # All variants (default)
```

Fairness-based loss thresholding: automatically ensures a fair comparison when variant models converge to different final losses; a conceptual sketch appears below. Disable with --no-fairness if needed.
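The following is a conceptual sketch of the thresholding idea only, not the package's implementation, using placeholder loss curves: truncate every variant's training curve at the worst (highest) final loss, so no variant is evaluated beyond a loss level that all variants actually reached.

```python
# Conceptual sketch only -- not the package's implementation. Placeholder
# loss-per-epoch curves; the threshold is the worst (highest) final loss.
import numpy as np

curves = {
    "baseline": np.linspace(5.0, 2.8, 50),  # converges further
    "content":  np.linspace(5.0, 3.4, 50),  # converges to a higher final loss
}
threshold = max(curve[-1] for curve in curves.values())  # 3.4 here

# Keep each curve only up to the first epoch at which it reaches the threshold.
truncated = {name: curve[: int(np.argmax(curve <= threshold)) + 1]
             for name, curve in curves.items()}
print({name: len(c) for name, c in truncated.items()})  # {'baseline': 37, 'content': 50}
```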
Training 320 models (baseline + 3 variants) requires a CUDA GPU. See models/README.md for details.
Local training:
```bash
./run_llm_stylometry.sh --train           # Baseline (80 models)
./run_llm_stylometry.sh --train -co       # Content-only variant
./run_llm_stylometry.sh -t -r             # Resume from checkpoints
```

Remote training:
Requires a GPU cluster with SSH access. Create .ssh/credentials_mycluster.json:

```json
{"server": "hostname", "username": "user", "password": "pass"}
```

Then, from your local machine:
```bash
./remote_train.sh --cluster mycluster           # Train baseline
./remote_train.sh -co --cluster mycluster -r    # Resume content variant
./check_remote_status.sh --cluster mycluster    # Monitor progress
./sync_models.sh --cluster mycluster -a         # Download when complete
```

Training runs in a detached screen session on the GPU server. See each script's help for full options.
We analyze texts from 8 authors:
- L. Frank Baum
- Ruth Plumly Thompson
- Jane Austen
- Charles Dickens
- F. Scott Fitzgerald
- Herman Melville
- Mark Twain
- H.G. Wells
For Baum and Thompson models, we include additional evaluation sets:
- non_oz_baum: Non-Oz works by Baum
- non_oz_thompson: Non-Oz works by Thompson
- contested: The 15th Oz book, whose authorship is disputed
Our analysis shows that:
- Models achieve lower cross-entropy loss on texts from the author they were trained on
- The approach correctly attributes the contested 15th Oz book to Thompson
- Stylometric distances between authors can be visualized using MDS (a minimal sketch follows this list)
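The MDS step can be reproduced conceptually with scikit-learn (already in the dependency list). The loss matrix below is random placeholder data; in the real analysis, entry (i, j) would reflect model i's loss on author j's texts.

```python
# Minimal sketch of the MDS idea with random placeholder data -- in the real
# analysis, entry (i, j) would reflect model i's loss on author j's texts.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
loss = rng.uniform(3.0, 4.0, size=(8, 8))  # placeholder 8x8 loss matrix

# Symmetrize the (generally asymmetric) losses into dissimilarities.
dissim = (loss + loss.T) / 2
np.fill_diagonal(dissim, 0)

mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)  # one 3D point per author
print(coords.shape)  # (8, 3)
```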
The repository includes comprehensive tests for all functionality:
```bash
# Install test dependencies
pip install pytest pytest-timeout

# Run all tests
pytest tests/

# Run specific test modules
pytest tests/test_visualization.py  # Figure generation
pytest tests/test_cli.py            # CLI functionality
pytest tests/test_model_training.py # Model operations
```

Tests run automatically on GitHub Actions (Linux, macOS, Windows, Python 3.10). See CONTRIBUTING.md for detailed testing guidelines and philosophy.
The llm_stylometry package provides functions for all analyses:
```python
# Visualization functions
from llm_stylometry.visualization import (
    generate_all_losses_figure,      # Figure 1A: Training curves
    generate_stripplot_figure,       # Figure 1B: Loss distributions
    generate_t_test_figure,          # Figure 2A: Individual t-tests
    generate_t_test_avg_figure,      # Figure 2B: Average t-test
    generate_loss_heatmap_figure,    # Figure 3: Confusion matrix
    generate_3d_mds_figure,          # Figure 4: MDS visualization
    generate_oz_losses_figure        # Figure 5: Oz analysis
)

# Fairness-based loss thresholding (for variant comparisons)
from llm_stylometry.analysis.fairness import (
    compute_fairness_threshold,      # Compute fairness threshold
    apply_fairness_threshold         # Truncate data at threshold
)
```

All visualization functions support variant and apply_fairness parameters (except the t-test figures).
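A hedged usage sketch follows; the variant identifier and output path are assumptions, so check the function docstrings for the exact accepted values:

```python
# Usage sketch -- the variant name and output path are assumptions; check
# the docstrings for the exact accepted values.
from llm_stylometry.visualization import generate_loss_heatmap_figure

fig = generate_loss_heatmap_figure(
    data_path='data/model_results.pkl',  # same results file as above
    output_path='figure3_content.pdf',
    variant='content',                   # assumed identifier for content-only
    apply_fairness=True,                 # truncate at the fairness threshold
)
```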
If you use this code or data in your research, please cite:
```bibtex
@article{StroEtal25,
  title={A Stylometric Application of Large Language Models},
  author={Stropkay, Harrison F. and Chen, Jiayi and Jabelli, Mohammad J. L. and Rockmore, Daniel N. and Manning, Jeremy R.},
  journal={arXiv preprint arXiv:2510.21958},
  year={2025}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
For questions or issues, please open a GitHub issue or contact Jeremy R. Manning (jeremy.r.manning@dartmouth.edu).