After much discussion, we are converging on the following approach:
The cascade delete is nice; this design will eliminate the possibility of "orphaned" jobs, removing the need for the orphan-cleanup step in `refresh_jobs()`. This means users with delete/drop privilege on the Autopopulate tables must also have delete/drop privilege on the corresponding jobs table.
Problem Statement:

The current DataJoint-python approach for job reservation, orchestration, and execution (i.e. `autopopulate`) faces scalability limitations. While its original design effectively handled job reservation/distribution for parallelization, it falls short when building a comprehensive data platform.

Limitations of the `jobs` table

The existing `jobs` table functions more as an error/reserve table than a true jobs queue:
- It only tracks `error` (failed jobs) and `reserved` (jobs in progress) states. It lacks crucial statuses such as:
  - `pending`/`scheduled` (jobs not yet started)
  - `success` (a record of successfully completed jobs and their duration)
- Workers must query the `key_source` to get a list of jobs, which, while ensuring up-to-date information, strains the database (see the example below).
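For context, this is roughly how the current per-schema `jobs` table is inspected today (`my_pipeline` is a hypothetical schema name):

```python
import datajoint as dj

schema = dj.schema('my_pipeline')  # hypothetical schema name

# The existing jobs table only distinguishes errored and in-progress jobs:
failed = (schema.jobs & 'status = "error"').fetch('table_name', 'error_message')
running = (schema.jobs & 'status = "reserved"').fetch('table_name', 'key_hash')
# There is no record of scheduled jobs, nor of successful runs and their durations.
```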
Limitations of `key_source` Behavior/Usage

The default `key_source` (an inner-join of parent tables) is intended to represent all possible jobs for a given table:
- Users frequently customize the `key_source` (e.g., restricting by `paramset` or other tables).
- Customized `key_source` settings are only visible to the local code executing the pipeline, not globally at the database level. This leads to inconsistent `key_source` definitions across workers and deployments.
- `(Table.key_source - Table).fetch('KEY')` is DataJoint's method for retrieving the job queue and can be an expensive operation, especially when called frequently by multiple workers. This significantly strains the database server, as observed by other users (see the sketch below).
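To make the visibility problem concrete, here is a minimal sketch of a customized `key_source` and the job-queue query described above (the `Recording`, `ParamSet`, and `Analysis` tables, the schema name, and the restriction are all hypothetical):

```python
import datajoint as dj

schema = dj.schema('my_pipeline')  # hypothetical schema name


@schema
class Recording(dj.Manual):  # hypothetical upstream table
    definition = """
    recording_id : int
    """


@schema
class ParamSet(dj.Lookup):  # hypothetical parameter table
    definition = """
    paramset_idx : int
    """


@schema
class Analysis(dj.Computed):
    definition = """
    -> Recording
    -> ParamSet
    ---
    result : float
    """

    # A locally customized key_source, e.g. restricted to one parameter set.
    # The restriction lives only in the Python code of whoever runs this worker;
    # it is invisible at the database level, so different deployments can end up
    # computing different job queues for the same table.
    @property
    def key_source(self):
        return (Recording * ParamSet) & 'paramset_idx = 1'

    def make(self, key):
        self.insert1(dict(key, result=0.0))


# Retrieving the job queue -- expensive when issued frequently by many workers:
analysis = Analysis()
keys = (analysis.key_source - analysis).fetch('KEY')
```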
Proposed Solution: New Jobs Table

Step 1: A New `JOB2` Table (Name TBD)

A new table, tentatively named `JOB2`, would be introduced with the following schema:
- `table_name`: varchar(255) - The `className` of the table.
- `key_hash`: char(32) - A hash of the job's key.
- `status`: enum('reserved', 'error', 'ignore', 'scheduled', 'success') - The current status of the job.
- `key`: json - A JSON structure containing the job's key.
- `status_message`: varchar(2000) - e.g., error message if failed.
- `error_stack`: mediumblob - The error stack if the job failed.
- `timestamp`: timestamp - The scheduled time (UTC) for the job to run.
- `run_duration`: float - The run duration in seconds.
- `run_version`: json - Representation of the code/environment version of the run (e.g., git commit hash).
- `user`: varchar(255) - The database user.
- `host`: varchar(255) - The system hostname.
- `pid`: int unsigned - The system process ID.
- `connection_id`: bigint unsigned - The database connection ID.
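For illustration, the proposed schema could be expressed as a DataJoint table definition along these lines (a sketch only: the schema name, the `(table_name, key_hash)` primary key, and the nullable/default values are assumptions, not final design decisions):

```python
import datajoint as dj

schema = dj.schema('common_jobs')  # hypothetical schema for the shared queue


@schema
class Job2(dj.Manual):
    definition = """
    # proposed jobs queue (name TBD)
    table_name        : varchar(255)      # className of the autopopulated table
    key_hash          : char(32)          # hash of the job's key
    ---
    status            : enum('reserved', 'error', 'ignore', 'scheduled', 'success')
    key               : json              # the job's key
    status_message="" : varchar(2000)     # e.g. error message if failed
    error_stack=null  : mediumblob        # error stack if the job failed
    timestamp=CURRENT_TIMESTAMP : timestamp  # scheduled time (UTC) for the job to run
    run_duration=null : float             # run duration in seconds
    run_version=null  : json              # code/environment version (e.g. git commit hash)
    user=""           : varchar(255)      # database user
    host=""           : varchar(255)      # system hostname
    pid=0             : int unsigned      # system process ID
    connection_id=0   : bigint unsigned   # database connection ID
    """
```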
Step 2: Mechanism to "Hydrate"/"Refresh" the `JOB2` Table

A new class method, `refresh_jobs()`, would be introduced for every Autopopulate table. This method would:
- query the `key_source` of the table;
- insert any jobs not yet listed into `JOB2`;
- remove orphaned jobs from `JOB2` due to upstream record deletions.

The key challenge here is how and when to trigger `refresh_jobs()`. If triggered by every `populate(reserved_jobs=True)` call, it could become a bottleneck due to read/write operations to `JOB2` and potential race conditions/deadlocks. A rough sketch of such a method is shown below.
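This sketch uses the `Job2` definition above; the exact insert/delete semantics, and the use of `datajoint.hash.key_hash` for the key hash, are assumptions rather than settled design:

```python
from datajoint.hash import key_hash  # DataJoint's helper for hashing primary keys


def refresh_jobs(table_class):
    """Synchronize the Job2 queue with table_class.key_source (illustrative sketch)."""
    table, job2 = table_class(), Job2()
    name = table_class.__name__

    # Jobs that could run but have not yet produced results.
    pending = (table.key_source - table).fetch('KEY')

    # Insert keys not yet tracked for this table, with status 'scheduled'.
    known = set((job2 & {'table_name': name}).fetch('key_hash'))
    job2.insert(
        {'table_name': name, 'key_hash': key_hash(key), 'status': 'scheduled', 'key': key}
        for key in pending
        if key_hash(key) not in known
    )

    # Drop orphaned entries whose keys no longer appear in the key_source
    # (e.g., after upstream record deletions).
    valid = {key_hash(key) for key in table.key_source.fetch('KEY')}
    orphans = [{'key_hash': h} for h in known if h not in valid]
    if orphans:
        ((job2 & {'table_name': name}) & orphans).delete_quick()
```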
Step 3: New/Updated `populate()` Function

The `populate()` function would be updated to:
- query `JOB2` for a list of "scheduled" jobs;
- call `populate1(key)` as usual for each job;
- update the job's entry in `JOB2` to `success` and add additional information (e.g., run duration, code version).

A simplified sketch of this loop follows.
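In this hypothetical sketch, reservation, error handling, and transaction details are only indicative; `populate1` is the per-key call named in Step 3:

```python
import time


def populate_from_queue(table_class, run_version=None):
    """Work through the 'scheduled' jobs in Job2 for one table (illustrative sketch)."""
    table, job2 = table_class(), Job2()
    name = table_class.__name__

    job_keys, keys = (job2 & {'table_name': name, 'status': 'scheduled'}).fetch('KEY', 'key')
    for job_key, key in zip(job_keys, keys):
        # Reserve the job so other workers skip it.
        job2.update1({**job_key, 'status': 'reserved'})
        start = time.time()
        try:
            table.populate1(key)  # per-key populate call, as proposed in Step 3
        except Exception as err:
            job2.update1({**job_key, 'status': 'error',
                          'status_message': str(err)[:2000]})
        else:
            job2.update1({**job_key, 'status': 'success',
                          'run_duration': time.time() - start,
                          'run_version': run_version})
```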
Considerations
- `refresh_jobs()` Frequency and Staleness: How often should `refresh_jobs()` be called, and what level of staleness in `JOB2` is acceptable? A centralized process could refresh jobs for each research project on a schedule (e.g., every 10, 15, or 30 minutes), similar to current worker-manager cron jobs (see the sketch after this list). This would address the performance issues related to `key_source` that many users have experienced.
- `refresh_jobs()` without Pipeline Code: Should `refresh_jobs()` be callable without the pipeline code installed (i.e., from a "virtual module")? Yes, to avoid the complexity and expense of requiring full code installation.
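As a rough illustration of the centralized-refresh idea (the interval, the table list, and the `refresh_jobs()` sketch from Step 2 are all assumptions):

```python
import time

# Hypothetical centralized refresher for one research project, run as a
# long-lived process or cron-like worker.
PROJECT_TABLES = [Analysis]        # the project's autopopulated tables
REFRESH_INTERVAL = 15 * 60         # seconds, e.g. every 15 minutes

while True:
    for table_class in PROJECT_TABLES:
        refresh_jobs(table_class)  # sketch from Step 2 above
    time.sleep(REFRESH_INTERVAL)
```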
Notes

We have considered adopting and integrating with other industry standards for workflow orchestration, such as Airflow, Flyte, or Prefect, and have produced and evaluated multiple working prototypes.
However, we think the additional burden of deploying and maintaining those tools is too great for a Python open-source project such as DataJoint: the enhanced features come with significant DevOps requirements and burden.