Skip to content

Conversation

@sofyalaski
Copy link
Member

@sofyalaski sofyalaski commented Oct 3, 2025

Description

This is a big PR that introduces two changes.

  1. This is a big change and it introduces many new components and a dependency on another backend. This, however, can be turned off in the config.json file by setting:
  "ingestorComponent": {
    "ingestorEnabled": false,
    }
  1. At the SciCatCon 2025, we discussed a new project to simplify ingestion. This PR also introduces a change to how the button "Create Dataset" below the filter side bar at the Dataset page behaves according to Change "create dataset" button #1912 . ( Option to control it already present in config - addDatasetEnabled)

Motivation

At PSI with OpenEM we have been working on a new ingestor backend that will allow data ingestion from sites different from the host of SciCat Catalog. This is represented by Point 1. An addition of the Ingestor backend repo into SciCatProject is planned as well.

Changes:

For point 2:

  • when user wants to ingest dataset into SciCat from frontend, the dialog opens, where user enters dataset-specific information.
  • User can provide a url to json schema for scientific metadata. We only check if provided JSON is a valid object.
  • If user provided schema, an additional set of questions is created based on that schema ( with Json forms) and user can specify the scientific metadata details based on it
  • Conformation page where user can review entered metadata
  • Dataset is being added after the submission of the form

For point 1:

  • config.json changes include this new object:
"ingestorComponent": {
    "ingestorEnabled": true,
    "ingestorAutodiscoveryOptions": [
      {
        "mailDomain": "university.org",
        "description": "University/facility of Choice",
        "facilityBackend": "https://facility-ingestor.facility.org"
      }
    ]
  },
  • The main option to turn off the component entirely is controlled by the ingestorEnabled value. This will redirect call to ingestor to 404. When turned on, the ingestor component is available at /ingestor/ with a link in the hamburger menu.

  • ingestorAutodiscoveryOptions is an optional argument and constitutes an array of available facilities running ingestor software.

  • facilityBackend is a reachable backend of the ingestor service.

  • mailDomain is used to match the email of logged-in user against the mailDomain value as a regular expression and in case of success, automatically connect to the respective backend. A regular expression is used to connect to the email of form "staff.university.org" or similar.

  • description is optional, but in case of the match with mailDomain will prefill the creationLocation property in the dataset schema

  • Ingestor component ( when used with the backend ) looks similar to the Point 2 and represents a set of dialogs for SciCat dataset and scientific metadata ingestion, with most of the information prefilled. For this, it interacts with ingestor backend, which does all the hard work such as:

    • loading set of available methods, which correspond to the available metadata extractors.
    • upon method selction, extraction of the metadata into a newly generated json object, that is used in the scientificMetadata
    • creation of a dataset on SciCat
    • creation of a transferring job to the tape.

Tests included

  • Included for each change/fix?
  • Passing? (Merge will not be approved unless this is checked)

Documentation

  • swagger documentation updated [required]
  • official documentation updated [nice-to-have]

official documentation info

If you have updated the official documentation, please provide PR # and URL of the pages where the updates are included

Backend version

  • Does it require a specific version of the backend
  • which version of the backend is required:

Ingestor backend:

https://github.com/SwissOpenEM/Ingestor

sofyalaski and others added 30 commits December 17, 2024 13:18
…ent (SciCatProject#1673)

* fix: optimize condition editing logic in DatasetsFilterSettingsComponent

* if user creates duplicated condition, do nothing

* add snackbar notification for duplicate condition in DatasetsFilterSettingsComponent

* remove unused import

* remove panelClass from snackBar

* added e2e test for the change
* feat: add the new auth service to prepare for the new sdk

* try to fix some ai-bot review suggestions

* add the note for the good review suggestion from ai-bot

* remove old sdk and adjust types against the new one

* fix more types and issues against the new sdk

* finalize type error fixes

* remove prefix

* add the new sdk generation script for local development

* start fixing TODOs after newly generated sdk

* fixed sdk local generation for linux

* update the sdk package version and fix some more types

* detect the OS and use the right current directory path

* improve types and fix more TODOs

* improve types and fix TODOs after backend improvements

* finalize TODOs and FIXMEs fixes and type improvements with the new sdk

* fix some sourcery-ai comments

* fix some of the last TODOs

* adapted sdk generation to unix environment

* ignore the @SciCatProject that is generated with the sdk

* start fixing tests with the new sdk

* add needed stub classes and fix some more tests

* continue fixing unit tests

* try to fix e2e tests and revert some changes that need more attention for now

* changes to just run the tests

* use latest sdk

* update package-lock file

* fixing unit tests

* fix more unit tests

* continue fixing tests

* update the sdk

* fix last e2e test

* fix thumbnail unit tests

* revert some change

* finalize fixing unit tests

* revert the backend image changes after the tests pass

* add some improvements in the mocked objects for unit tests based on ai bot suggestion

* remove encodeURIComponent in the effects as it seems redundant

* fix test files after some changes

* try to use mock objects as much as possible

* update the sdk version

* update package-lock file

* update the sdk to latest

* BREAKING CHANGE: new sdk release

---------

Co-authored-by: martintrajanovski <martin.trajanovski@gmail.com>
Co-authored-by: Jay <b331998513@gmail.com>
This commit rebases and squashes SciCatProject#1585 (7d2a872).
It contains the following commits:
- update job view to match release-jobs
- update job schema and timestamp fields
- update jobs-detail page
- fix testing and linting
@sofyalaski sofyalaski changed the title Ingestor component feat: ingestor component for datasets Oct 6, 2025
@sofyalaski sofyalaski marked this pull request as ready for review October 7, 2025 12:49
Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @sofyalaski, your pull request is larger than the review limit of 150000 diff characters

@sbliven
Copy link
Member

sbliven commented Oct 8, 2025

Great work putting this together @sofyalaski!

Copy link
Member

@sbliven sbliven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sofyalaski, can you add some screenshots of the two options to the description? I think that would make it easier for people to review, since the ingestor isn't that easy to run yourself (yet).

We should also start working on user and operator docs for this feature.

I will leave off formally approving this, as I think that should be done by someone outside of OpenEM. It looks good though, and I'm excited to get this merged into main!

@@ -0,0 +1,183 @@
import { Injectable } from "@angular/core";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually it is planned to publish the Ingestor SDK as a separate package. However, for the time being the API definition lives here in the frontend.

@sofyalaski
Copy link
Member Author

Here is an example of dataset ingestion from the main "Datasets" view
https://github.com/user-attachments/assets/44cab5a9-3823-404a-8502-ca5fd9990c3a

@sofyalaski
Copy link
Member Author

And with the ingestor component connected on the backend. It's not configured properly, so the trsnafer hasn't succeeded

Screen.Recording.2025-10-20.at.17.05.45.mov

@Junjiequan
Copy link
Member

Junjiequan commented Nov 5, 2025

I ran it locally and it works great, thanks for the excellent work!

  1. I wasnt sure what does ingestorComponent do initially, and just noticed it's still work in progress. Ideally it'd be nice to have that part in a seperated PR for easier reviews
  2. I struggled to generate scientificMetadata using a schema, I used schema example from jsonforms doc. the validation said Loaded schema is valid! but nothing was renderd, it just showed No applicable renderer found!.
  3. It would be nice to have an generic example for scientific metadata schema e.g., provide a link to the jsonforms schema documentation etc.
  4. It'd also be helpful if we can load a default schema from backend (e.g, jsonform-schema.json) instead of having to provide a url as mandatory
  5. What is the use case of "@jsonforms/react": "^3.5.1" package?

@@ -0,0 +1,39 @@
.stepper {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change this file to .scss?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@nitrosx
Copy link
Member

nitrosx commented Nov 6, 2025

What is the purpose of the item Ingestor in the main menu which take the user to this following page:
image

Unless absolutely needed, I suggest to move the URL(s) required here to the frontend configuration and present here just a drop down with the configured options?

It it applies, I suggest to change the word backend to something different so it is not to be confused with SciCat beckend.

@nitrosx
Copy link
Member

nitrosx commented Nov 6, 2025

Can we move the button Create Dataset within the configurable actions, once they are ready for use?
image

@nitrosx
Copy link
Member

nitrosx commented Nov 6, 2025

Can we style the ingestion popup more inline with the rest of the application?
Also I feel that we are using a lot of space and it requires a lot of scrolling.
image

@nitrosx
Copy link
Member

nitrosx commented Nov 6, 2025

Scientific Metadata Validation

  • Can we add in configuration a default schema?
  • Should we save the schemas in config e provide a drop down list to select from in the ingestion form?
  • What is the purpose of the validate button?
  • If no schema is provided, can we include the table that is used in the default dataset detail view, so user can still add metadata although is more free form?
  • Would it be possible to add an additional step to add a list of files that are already present on the storage?
  • Can we provide some examples for the schema directly on the form?
image

@nitrosx
Copy link
Member

nitrosx commented Nov 6, 2025

Last step of the process "Confirm Inputs"

  • I wonder if showing the json is the best format. Would it be possible to show a rendered format like the Datasets Custom view?
  • Can we change Configure Metadata to Configure Dataset Information? ...or at least make it configurable?
  • Can we close the popup when Create Dataset is clicked and the creation is successful?
image

Copy link
Member

@nitrosx nitrosx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few observations and questions:

  • why we have so many files for the ingestor page?
  • why do we have so many custom renderers?
  • is the ingestor service specific for your use case?
  • I'm a little afraid to include the SDK here and not provide it as an external library?
  • should we define an API for the ingestor service, so everybody can create their own if needed?

Unfortunately, I have hard time to provide a honest review on the code. I might be able to have a better understanding after a demo.

That said, I am not opposed to merge the PR into master, with the conditions that this feature is clearly marked as experimental and with the warning use at your own risk

package.json Outdated
"@jsonforms/angular": "^3.5.1",
"@jsonforms/angular-material": "^3.5.1",
"@jsonforms/core": "^3.5.1",
"@jsonforms/react": "^3.5.1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using a react package?

@@ -0,0 +1,39 @@
.stepper {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this file used for?

Comment on lines 32 to 41
"ingestorComponent": {
"ingestorEnabled": false,
"ingestorAutodiscoveryOptions": [
{
"mailDomain": "university.ch",
"description": "University/facility of Choice",
"facilityBackend": "http://localhost:8888"
}
]
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the option to show the button on the datasets list here and , also, renamed it accordingly to the new options?

@nitrosx
Copy link
Member

nitrosx commented Nov 6, 2025

@sofyalaski and the CH team great work!!!
I know I left a lot of comments, nevertheless I do appreciate the contribution, admire the amount of work and the quality of the PR

@sofyalaski sofyalaski requested a review from a team as a code owner November 7, 2025 14:05
@sbliven
Copy link
Member

sbliven commented Nov 7, 2025

Thanks for all your comments, @Junjiequan and @nitrosx! I will collect your comments here as either todo items (which I agree should be changed) or things I want to respond to.

Minor things that should be added before merging

  • convert css to scss
  • rename 'Backend' to 'Ingestor Service' in the Control Center
  • remove 'validate' button & instead automatically validate the scientificMetadataSchema when pasted
  • change 'Confirm Metadata' to 'Configure Dataset Information'
  • Close the popup directly after confirming creation
  • Mark the ingestor and create dataset features as experimental in the documentation

Bigger things that should have subsequent PRs

Should we save the schemas in config e provide a drop down list to select from in the ingestion form?

Yes, but this can be done after the current POC.

  • Open an issue: provide a configurable drop-down for schemas

Would it be possible to show a rendered format like the Datasets Custom view?

  • Open an issue: Use the datasets custom view for displaying json in create dataset confirmation

  • Open an issue: migrate the ingestor SDK to an external dependency

scientific metadata schema

It'd also be helpful if we can load a default schema from backend (e.g, jsonform-schema.json) instead of having to provide a url as mandatory

Can we provide some examples for the schema directly on the form?

Good idea. I suggest we open a follow-up issue for a configurable default scientificMetadataSchema URLs. The first item would probably be a general schema served from the backend.

  • Open an issue: Add configurable list of scientificMetadataSchema URLs.

If no schema is provided, can we include the table that is used in the default dataset detail view, so user can still add metadata although is more free form?

The first page ('Dataset information') already should correspond to this, right? Or do you mean that you want a key/value schema for scientificMetadata, eg

{
  "$schema": "https://json-schema.org/draft/2020-12/schema"
  "type": "object",
  "additionalProperties": {
    "anyOf": [
      { "type": "string" },
      { "type": "boolean" },
      { "type": "number" }
    ]
  }
}
  • Open an issue: Serve a scientificMetadataSchema from the backend corresponding to key/value pairs

Comments/Responses

I wasnt sure what does ingestorComponent do initially, and just noticed it's still work in progress. Ideally it'd be nice to have that part in a seperated PR for easier reviews

We will continue adding features, but I think it's ready to be merged. Since this PR is already reviewed, let's keep it together but try for more atomic PRs in the future.

Unless absolutely needed, I suggest to move the URL(s) required here to the frontend configuration and present here just a drop down with the configured options?

Note that you do get the list of 'Last used facility backends' below this, and if your email matches one of the mailDomains in the configuration then the list will include that by default.

We designed this way so that it would be possible for some labs or users to install the ingestor service locally and connect to it without registering centrally with scicat. However there are issues with this, eg due to SSL requirements and OIDC callbacks. We should discuss this again, and might want to go back to a drop-down as a clearer (if less flexible) user experience.

Can we move the button Create Dataset within the configurable actions, once they are ready for use?

I though that other configurable action buttons would act on the selection. If so, it might make sense to visually distinguish selection-actions from non-selection-actions. However, configuring all of the frontend buttons consistently seems good.

@sofyalaski Feel free to edit this comment and/or check off todos

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants