The official repository for The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling. This repository implements SESA (Sequential Sampling), a simple and effective framework that boosts exploration and prevents policy collapse in RL-trained LLMs. It is built on top of RAGEN and VERL.
SESA replaces parallel i.i.d. sampling in RL algorithms like GRPO with history-aware sequential sampling to actively diversify rollouts. Baseline parallel rollout (left) samples all solutions i.i.d. from the same distribution, while our sequential rollout (right) first generates diverse methods sequentially, then expands each into full solutions in parallel.
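As a rough illustration of the two rollout schemes (a minimal Python sketch; `generate` is a hypothetical prompt-to-completion function, not this repo's API):

# `generate(prompt) -> str` stands in for one sample from the policy.
def parallel_rollout(generate, prompt, n):
    # Baseline: n i.i.d. samples from the same distribution.
    return [generate(prompt) for _ in range(n)]

def sequential_rollout(generate, prompt, n):
    # SESA-style: propose methods one at a time, each conditioned on the
    # methods already proposed, to actively diversify the rollout set.
    methods = []
    for _ in range(n):
        history = "".join(f"\nAlready proposed: {m}" for m in methods)
        methods.append(generate(prompt + history + "\nPropose a different method:"))
    # Then expand each method into a full solution (these calls can run in parallel).
    return [generate(f"{prompt}\nSolve using this method: {m}") for m in methods]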
SESA is validated on synthetic path exploration, practical tasks (Sudoku, AIME24), and agent benchmarks (Sokoban, Countdown, FrozenLake).
We recommend using Conda. The steps below install dependencies, pull submodules, and download example data. Please follow the RAGEN Quick Start to install from source.
Recommended:
conda create -n sesa python=3.12 -y
conda activate sesa
git clone git@github.com:kang-0909/sesa.git
cd sesa
pip install -e .
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# Optional: to install flash-attn, you may need to install cuda-toolkit first if you don't have it
conda install -c "nvidia/label/cuda-12.4.0" cuda-toolkit -y
export CUDA_HOME=$CONDA_PREFIX # e.g., /opt/conda/envs/sesa
pip3 install flash-attn --no-build-isolation
pip install -r requirements.txt
git submodule init
git submodule update
cd verl
pip install -e .
cd ..

The default training config aggregates ppo_trainer.yaml and envs.yaml via config/base.yaml.
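To inspect the aggregated result, you can compose the config outside of training (a minimal sketch using Hydra's compose API, assuming the configs live under config/ as the path above suggests):

from hydra import compose, initialize
from omegaconf import OmegaConf

# Compose config/base.yaml the same way train.py's Hydra entry point does.
with initialize(config_path="config", version_base=None):
    cfg = compose(config_name="base")
    print(OmegaConf.to_yaml(cfg))  # merged ppo_trainer.yaml + envs.yaml defaults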
To start a training run on the Sokoban task, use:
export SWANLAB_API_KEY=''  # set your SwanLab API key if logging to swanlab
export USE_GRPO="algorithm.adv_estimator=grpo agent_proxy.reward_normalization.method=mean_std actor_rollout_ref.actor.use_kl_loss=True"
export USE_BASE="algorithm.kl_ctrl.kl_coef=0.001 actor_rollout_ref.actor.kl_loss_coef=0.001 actor_rollout_ref.actor.clip_ratio_high=0.2 actor_rollout_ref.rollout.rollout_filter_ratio=1"
export HYDRA_FULL_ERROR=1
export ENABLE_SERIAL_GENERATION=1
MKL_SERVICE_FORCE_INTEL=1 python train.py --config-name _2_sokoban system.CUDA_VISIBLE_DEVICES=\"0,1,2,3\" trainer.n_gpus_per_node=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
trainer.experiment_name=sokoban $USE_GRPO $USE_BASE \
es_manager.train.env_groups=8 es_manager.train.group_size=4 es_manager.train.env_configs.n_groups=[8] \
trainer.nnodes=1 \
trainer.logger=['console','swanlab'] \
actor_rollout_ref.rollout.tp_size_check=False \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
trainer.val_before_train=True \
trainer.save_freq=40 \
trainer.test_freq=5 \
trainer.resume_mode=auto \
trainer.project_name=sesa
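The mean_std setting in USE_GRPO refers to GRPO-style group normalization: rewards are standardized within each rollout group to form advantages. A minimal sketch of that computation (illustrative only, not this repo's implementation):

import numpy as np

def group_normalize(rewards, eps=1e-6):
    # GRPO-style advantage: standardize rewards within one rollout group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example with group_size=4 as configured above.
print(group_normalize([1.0, 0.0, 0.0, 1.0]))  # ~ [ 1., -1., -1.,  1.]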
This work is built upon RAGEN and VERL. We thank the original authors and community contributors.