The official repository for The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling. This repository implements SESA (Sequential Sampling), a simple and effective framework that boosts exploration and prevents policy collapse in RL-trained LLMs. It is built on top of RAGEN and VERL.
SESA replaces parallel i.i.d. sampling in RL algorithms like GRPO with history-aware sequential sampling to actively diversify rollouts. Baseline parallel rollout (left) samples all solutions i.i.d. from the same distribution, while our sequential rollout (right) first generates diverse methods sequentially, then expands each into full solutions in parallel.
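As a rough illustration of the two rollout schemes (a minimal Python sketch; `generate` is a hypothetical prompt-to-completion function, not this repo's API):

# `generate(prompt) -> str` stands in for one sample from the policy.
def parallel_rollout(generate, prompt, n):
    # Baseline: n i.i.d. samples from the same distribution.
    return [generate(prompt) for _ in range(n)]

def sequential_rollout(generate, prompt, n):
    # SESA-style: propose methods one at a time, each conditioned on the
    # methods already proposed, to actively diversify the rollout set.
    methods = []
    for _ in range(n):
        history = "".join(f"\nAlready proposed: {m}" for m in methods)
        methods.append(generate(prompt + history + "\nPropose a different method:"))
    # Then expand each method into a full solution (these calls can run in parallel).
    return [generate(f"{prompt}\nSolve using this method: {m}") for m in methods]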
SESA is validated on synthetic path exploration, practical tasks (Sudoku, AIME24), and agent benchmarks (Sokoban, Countdown, FrozenLake).
We recommend using Conda. The steps below install dependencies, pull submodules, and download example data. Please follow the RAGEN Quick Start to install from source.
Recommended:
conda create -n sesa python=3.12 -y
conda activate sesa
git clone git@github.com:kang-0909/sesa.git
cd sesa
pip install -e .
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# Optional: to install flash-attn, you may need to install cuda-toolkit first if you don't have it
conda install -c "nvidia/label/cuda-12.4.0" cuda-toolkit -y
export CUDA_HOME=$CONDA_PREFIX # e.g., /opt/conda/envs/sesa
pip3 install flash-attn --no-build-isolation
pip install -r requirements.txt
git submodule init
git submodule update
cd verl
pip install -e .
cd ..

The default training config aggregates ppo_trainer.yaml and envs.yaml via config/base.yaml.
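To inspect the aggregated result, you can compose the config outside of training (a minimal sketch using Hydra's compose API, assuming the configs live under config/ as the path above suggests):

from hydra import compose, initialize
from omegaconf import OmegaConf

# Compose config/base.yaml the same way train.py's Hydra entry point does.
with initialize(config_path="config", version_base=None):
    cfg = compose(config_name="base")
    print(OmegaConf.to_yaml(cfg))  # merged ppo_trainer.yaml + envs.yaml defaults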
To start a training run on the Sokoban task, use:
export SWANLAB_API_KEY=''  # set your SwanLab API key if logging to swanlab
export USE_GRPO="algorithm.adv_estimator=grpo agent_proxy.reward_normalization.method=mean_std actor_rollout_ref.actor.use_kl_loss=True"
export USE_BASE="algorithm.kl_ctrl.kl_coef=0.001 actor_rollout_ref.actor.kl_loss_coef=0.001 actor_rollout_ref.actor.clip_ratio_high=0.2 actor_rollout_ref.rollout.rollout_filter_ratio=1"
export HYDRA_FULL_ERROR=1
export ENABLE_SERIAL_GENERATION=1
MKL_SERVICE_FORCE_INTEL=1 python train.py --config-name _2_sokoban system.CUDA_VISIBLE_DEVICES=\"0,1,2,3\" trainer.n_gpus_per_node=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
trainer.experiment_name=sokoban $USE_GRPO $USE_BASE \
es_manager.train.env_groups=8 es_manager.train.group_size=4 es_manager.train.env_configs.n_groups=[8] \
trainer.nnodes=1 \
trainer.logger=['console','swanlab'] \
actor_rollout_ref.rollout.tp_size_check=False \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
trainer.val_before_train=True \
trainer.save_freq=40 \
trainer.test_freq=5 \
trainer.resume_mode=auto \
trainer.project_name=sesa
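The mean_std setting in USE_GRPO refers to GRPO-style group normalization: rewards are standardized within each rollout group to form advantages. A minimal sketch of that computation (illustrative only, not this repo's implementation):

import numpy as np

def group_normalize(rewards, eps=1e-6):
    # GRPO-style advantage: standardize rewards within one rollout group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example with group_size=4 as configured above.
print(group_normalize([1.0, 0.0, 0.0, 1.0]))  # ~ [ 1., -1., -1.,  1.]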
This work is built upon RAGEN and VERL. We thank the original authors and community contributors.