
SESA: Sequential Sampling for Exploration in LLMs

The official repository of The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling. This repository implements SESA (Sequential Sampling), a simple, effective framework to boost exploration and prevent policy collapse in RL-trained LLMs. It is built on top of RAGEN and VERL.

What is SESA?

Training overview

SESA replaces parallel i.i.d. sampling in RL algorithms like GRPO with history-aware sequential sampling to actively diversify rollouts. Baseline parallel rollout (left) samples all solutions i.i.d. from the same distribution, while our sequential rollout (right) first generates diverse methods sequentially, then expands each into full solutions in parallel.
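The contrast above can be sketched in a few lines of toy Python. This is an illustrative sketch only, not the repository's implementation: `propose_method` and `METHODS` are hypothetical stand-ins for the LLM proposing a solution method, and the history-aware step simply avoids methods already drawn in the current batch.

```python
# Toy contrast: parallel i.i.d. rollout vs. SESA-style sequential rollout.
# `propose_method` is a hypothetical stand-in for the LLM; it conditions
# each draw on the methods already proposed (history-aware sampling).
import random

METHODS = ["bfs", "dfs", "greedy", "dynamic-programming"]

def propose_method(history, rng):
    # History-aware: prefer methods not yet proposed in this rollout.
    unseen = [m for m in METHODS if m not in history]
    pool = unseen if unseen else METHODS
    return rng.choice(pool)

def parallel_rollout(n, rng):
    # Baseline: n i.i.d. draws from the same distribution; duplicates are likely.
    return [rng.choice(METHODS) for _ in range(n)]

def sequential_rollout(n, rng):
    # SESA: draw methods one by one, each conditioned on the history.
    # Each distinct method is then expanded into a full solution in parallel.
    history = []
    for _ in range(n):
        history.append(propose_method(history, rng))
    return history

rng = random.Random(0)
print(sorted(sequential_rollout(4, rng)))  # all 4 methods are distinct
```

With four methods and four draws, the sequential rollout always covers all methods, while the parallel baseline frequently repeats one; this is the diversification effect SESA exploits.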

SESA is validated on synthetic path exploration, practical tasks (Sudoku, AIME24), and agent benchmarks (Sokoban, Countdown, FrozenLake).

Quick Start

We recommend using Conda. The script below installs dependencies, pulls submodules, and downloads example data. You can also follow the RAGEN Quick Start to install from source.

Recommended:

conda create -n sesa python=3.12 -y
conda activate sesa

git clone git@github.com:kang-0909/sesa.git
cd sesa

pip install -e .
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# Optional: to install flash-attn, you may need to install the CUDA toolkit first if you don't have one
conda install -c "nvidia/label/cuda-12.4.0" cuda-toolkit -y
export CUDA_HOME=$CONDA_PREFIX  # e.g. /opt/conda/envs/sesa
pip install flash-attn --no-build-isolation

pip install -r requirements.txt

git submodule init
git submodule update
cd verl
pip install -e .
cd ..

Train

The default training config aggregates ppo_trainer.yaml and envs.yaml via config/base.yaml.

To start training on the Sokoban task, run:

export SWANLAB_API_KEY=''
export USE_GRPO="algorithm.adv_estimator=grpo agent_proxy.reward_normalization.method=mean_std actor_rollout_ref.actor.use_kl_loss=True"
export USE_BASE="algorithm.kl_ctrl.kl_coef=0.001 actor_rollout_ref.actor.kl_loss_coef=0.001 actor_rollout_ref.actor.clip_ratio_high=0.2 actor_rollout_ref.rollout.rollout_filter_ratio=1"
export HYDRA_FULL_ERROR=1
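The `USE_GRPO` overrides above select GRPO advantage estimation with `mean_std` reward normalization. As a rough orientation, group-relative mean-std normalization looks like the following minimal sketch (an assumption about the general technique, not the repository's code; `mean_std_normalize` and `eps` are illustrative names):

```python
# Minimal sketch of mean-std reward normalization for GRPO-style
# advantages: each rollout's reward is centered and scaled by the mean
# and std of its own rollout group.
import statistics

def mean_std_normalize(rewards, eps=1e-6):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(mean_std_normalize([0.0, 1.0, 1.0, 0.0]))
```

Successful rollouts in a group get positive advantages and failed ones negative, which is what makes diverse rollouts (as produced by SESA's sequential sampling) informative for the update.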



export ENABLE_SERIAL_GENERATION=1
MKL_SERVICE_FORCE_INTEL=1 python train.py --config-name _2_sokoban system.CUDA_VISIBLE_DEVICES=\"0,1,2,3\" trainer.n_gpus_per_node=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    trainer.experiment_name=sokoban $USE_GRPO $USE_BASE \
    es_manager.train.env_groups=8 es_manager.train.group_size=4 es_manager.train.env_configs.n_groups=[8] \
    trainer.nnodes=1 \
    trainer.logger=['console','swanlab'] \
    actor_rollout_ref.rollout.tp_size_check=False \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    trainer.val_before_train=True \
    trainer.save_freq=40 \
    trainer.test_freq=5 \
    trainer.resume_mode=auto \
    trainer.project_name=sesa

Acknowledgements

This work is built upon RAGEN and VERL. We thank the original authors and community contributors.
