
Reproducibility Issue: Performance degrades in sequential runs within a single script, but is stable across separate script executions #116

@Luo-Zhongwei

Describe the bug
I am observing a performance degradation issue with the RDT-1B model when conducting multiple experimental runs. The model's accuracy drops if the experiments are run sequentially within the same Python script. In contrast, the results are perfectly stable when each experiment is launched as a new Python process.

To Reproduce
Steps to reproduce the behavior:

Sequential Runs (Performance Drop):

Create a Python script that contains a loop to run the training/evaluation experiment multiple times.

Set a fixed random seed at the beginning of each loop iteration.

Execute the script.

Observed behavior: The accuracy/performance metric will be lower in the second, third, and subsequent runs compared to the first run.

Separate Runs (Stable Performance):

Create a Python script to run the experiment just once.

Set a fixed random seed at the beginning of the script.

Execute this script multiple times from the command line (e.g., python run_experiment.py, then run it again).

Observed behavior: The accuracy is highly stable across all runs, with a variance of 0.
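
For concreteness, the sequential pattern is essentially the following minimal, self-contained sketch (run_experiment is only a stand-in for the actual RDT-1B evaluation, not code from this repository):

import random

import numpy as np
import torch

base_seed = 42

def run_experiment() -> float:
    # Stand-in for the real training/evaluation run; it only draws from each RNG
    # so that the effect of re-seeding is visible.
    return random.random() + np.random.rand() + float(torch.rand(1))

for run_idx in range(3):
    # Re-seed at the top of every iteration, as in the steps above.
    random.seed(base_seed)
    np.random.seed(base_seed)
    torch.manual_seed(base_seed)
    print(f"run {run_idx}: metric = {run_experiment():.6f}")

With this stub all three iterations print identical numbers; with the real model, the later iterations come out lower, which is exactly the discrepancy described above.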

Question
The discrepancy between these two methods suggests that some state is not being fully reset between the sequential runs within the same script.

My primary question is: Is this likely an issue with how I am setting the random seed? I am trying to reset the seed before each experiment, but perhaps some components are not being correctly re-initialized. Or, could there be another underlying cause related to the model's state, data loaders, or environment that persists across runs within the same process?

Any guidance on ensuring complete state reset for repeatable experiments in a single script would be greatly appreciated.
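
For reference, this is the fullest per-run reset I can think of (the helper name is my own; the cuDNN and determinism flags are standard PyTorch). I am not sure whether it is sufficient, or whether something inside the policy, data loaders, or environment also needs to be rebuilt:

import os
import random

import numpy as np
import torch

def reset_state_for_run(seed: int) -> None:
    # Re-seed every RNG source I am aware of.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # covers all GPUs, not just the current one
    # Pin PyTorch to deterministic behaviour; cudnn.benchmark in particular
    # keeps autotuning state across runs within a single process.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Optionally: torch.use_deterministic_algorithms(True)

Below is the relevant loop from my evaluation script (imports and setup such as env, policy, global_profiler, base_seed, and total_episodes are defined earlier in the file):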

# alpha = [1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0]
alpha = [0.8]
# temp = np.logspace(np.log10(0.01), np.log10(10.0), num=5)
# [0.01, 0.05623413, 0.31622777, 1.77827941, 10.]
temp = [np.logspace(np.log10(0.01), np.log10(10.0), num=5)[1]]
print(temp)

# 2. Run the model and collect the results
success_data = []
# 729  182*1
keep_img_token_nums = 182 * 3
# pruning_mode = "no_img_token_pruning"  # score
# pruning_mode = "score"
# pruning_mode = PruningMode.SCORE  # use the enum member
pruning_mode = PruningMode.RANDOM
repeat_exp_num = 5
for a, b in tqdm_it.product(alpha, temp,
                            desc='grid search'):
    rep_suc = []
    for re_it in tqdm.trange(repeat_exp_num):
        # --- Mark the start of a new experiment run ---
        global_profiler.start_experiment_run(run_idx=re_it)  # <--- call the new method here
        # success_rate = 0
        success_count = 0
        # --- Mark the start of a new trial ---
        # --- Reset the seeds here, at the start of each trial ---
        print(f"Running trial with alpha={a}, temp={b}. Re-seeding with {base_seed}.")
        random.seed(base_seed)
        os.environ['PYTHONHASHSEED'] = str(base_seed)
        np.random.seed(base_seed)
        torch.manual_seed(base_seed)
        torch.cuda.manual_seed(base_seed)
        # If you use multiple GPUs, you may also need torch.cuda.manual_seed_all(base_seed)
        for episode in tqdm.trange(total_episodes):
            global_profiler.start_trial(trial_idx=episode) # <--- 在这里调用
            obs_window = deque(maxlen=2)
            obs, _ = env.reset(seed = episode + base_seed)
            policy.reset()
            policy.alpha = a
            policy.temp = b
            policy.keep_img_token_nums = keep_img_token_nums
            policy.vis_count = 0
            policy.pruning_mode = pruning_mode
            policy.save_vis_doc_name = str(args.env_id) + str(policy.pruning_mode) + str(keep_img_token_nums) + '_729 token'

            img = env.render().squeeze(0).detach().cpu().numpy()
            obs_window.append(None)
            obs_window.append(np.array(img))
            proprio = obs['agent']['qpos'][:, :-1]

            global_steps = 0
            video_frames = []

            success_time = 0
            done = False
            
            while global_steps < MAX_EPISODE_STEPS and not done:
                image_arrs = []
                for window_img in obs_window:
                    image_arrs.append(window_img)
                    image_arrs.append(None)
                    image_arrs.append(None)
                images = [Image.fromarray(arr) if arr is not None else None
                        for arr in image_arrs]
                # actions: [64, 8]
                actions = policy.step(proprio, images, text_embed).squeeze(0).cpu().numpy()

                # Subsample the chunk since RDT is trained to predict 64 interpolated steps (actual 14 steps)
                # resulting shape: [16, 8]
                actions = actions[::4, :]
                
                for idx in range(actions.shape[0]):
                    action = actions[idx]
                    obs, reward, terminated, truncated, info = env.step(action)
                    img = env.render().squeeze(0).detach().cpu().numpy()
                    obs_window.append(img)
                    proprio = obs['agent']['qpos'][:, :-1]
                    video_frames.append(img)
                    global_steps += 1
                    if terminated or truncated:
                        assert "success" in info, sorted(info.keys())
                        if info['success']:
                            success_count += 1
                            done = True
                            break 
            print(f"Trial {episode+1} finished, success: {info['success']}, steps: {global_steps}")
            print(f"policy.alpha = {policy.alpha}"+f"policy.temp = {policy.temp}")
            
            
            
                
            
            # avg_diffuion_time = sum(policy.runing_time) / len(policy.runing_time)
            # print("avg_diffuion_time: ", avg_diffuion_time)
            # content = f"{str(policy.runing_time)}\n"
            # precision = 3                                  # number of decimal places to keep
            # formatted = [f"{t:.{precision}f}" for t in policy.runing_time]
            # content = f"[{', '.join(formatted)}]\n"
            # content += f"avg_diffuion_time={avg_diffuion_time:.3f}\n"  # e.g. "avg_diffuion_time=0.930", plus a newline
            # with lzw_log_PATH.open('a', encoding='utf-8') as f:
            #     f.write(content)

            # avg_avg_time_list.append(avg_diffuion_time)
        # success_rate = 8 + random.uniform(1.5, 2)
        success_rate = success_count / total_episodes * 100
        # print(f"Success rate: {success_rate}%")
        s = colored(f"Success rate: {success_rate}%", "yellow", attrs=["bold"])
        print(s)
        rep_suc.append(success_rate)
    print(f"Success rates over repeats: {rep_suc}")
    rep_suc_array = np.array(rep_suc)
    mean_value = np.mean(rep_suc_array)

    # 3. Compute the variance
    #    By default, np.var computes the population variance (ddof=0).
    #    If the data are a sample and you want the sample variance, set ddof=1.
    #    ddof (Delta Degrees of Freedom) is subtracted from the divisor N.
    variance_value = np.var(rep_suc_array)  # population variance (divide by N)
    sample_variance_value = np.var(rep_suc_array, ddof=1)  # sample variance (divide by N-1)

    print(f"Raw data (rep_suc): {rep_suc}")
    print(f"Mean: {mean_value:.2f}")
    print(f"Population variance: {variance_value:.2f}")
    print(f"Sample variance: {sample_variance_value:.2f}")
    std_dev = np.std(rep_suc_array)  # default ddof=0
    print(f"Standard deviation: {std_dev:.2f}")

    # avg_avg_diffuion_time = sum(avg_avg_time_list) / len(avg_avg_time_list)
    # precision = 3                                  # number of decimal places to keep
    # formatted = [f"{t:.{precision}f}" for t in avg_avg_time_list]
    # content = f"[{', '.join(formatted)}]\n"
    # content += f"avg_avg_diffuion_time={avg_avg_diffuion_time}\n"
    # with lzw_log_PATH.open('a', encoding='utf-8') as f:
    #     f.write(content)
    # print(f"avg_avg_diffuion_time: {avg_avg_diffuion_time}ms")

    success_data.append({'alpha': a, 'temp': b, 'success_rate': success_rate})

You can ignore the extra parameters I added; they don't affect the issue I'm reporting. Please just focus on the loop and my core question.
