@t3t5u
Contributor

@t3t5u t3t5u commented Jul 27, 2025

Purpose

There are two kinds of modes:

  • Modes such as INSERT: create an intermediate table for each task
  • Modes such as REPLACE: create only one intermediate table

In modes such as INSERT, Embulk's mechanism generates one task per input file, and each task gets its own intermediate table.
Therefore, when there are a large number of input files, a large number of intermediate tables are created.

We want modes such as INSERT to create only one intermediate table, as modes such as REPLACE do.
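
To illustrate the difference described above, here is a minimal Java sketch that counts intermediate tables under each strategy. The class name, method names, and the `prefix_index` naming scheme are hypothetical and are not taken from embulk-output-jdbc; the only assumption carried over from the description is that one task is generated per input file.

```java
import java.util.ArrayList;
import java.util.List;

public class IntermediateTablePlan {
    // Hypothetical naming scheme, for illustration only:
    // one intermediate table per task (as in modes such as INSERT).
    public static List<String> perTask(String prefix, int taskCount) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            names.add(prefix + "_" + i);
        }
        return names;
    }

    // A single intermediate table shared by all tasks (as in modes such as REPLACE).
    public static List<String> single(String prefix) {
        return List.of(prefix);
    }

    public static void main(String[] args) {
        int inputFiles = 1500; // one task per input file
        System.out.println(perTask("tmp", inputFiles).size()); // prints 1500
        System.out.println(single("tmp").size());              // prints 1
    }
}
```

With the per-task strategy, the number of tables to create and later drop grows linearly with the number of input files; with the shared strategy it stays at one.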

@t3t5u t3t5u requested a review from a team as a code owner July 27, 2025 22:57
@hiroyuki-sato
Member

Hello, @t3t5u

Thank you for proposing the new PR.

Is this PR a one-off? Or does your organization want to contribute to the Embulk project itself?
As you know, @dmikurube stepped down as the active maintainer. There is no active maintainer now.

We are still looking for long-term maintainers around the Embulk ecosystem.

As far as I know, your organization has already forked embulk-output-jdbc and extended some features on your branch.
For example, it added Azure Synapse Analytics to the supported products.
Will your organization maintain both plugins?

May I ask about your organization's policy?

@t3t5u
Contributor Author

t3t5u commented Aug 1, 2025

@hiroyuki-sato

Okay, I'll take a little time to reply. 🙏

@t3t5u
Contributor Author

t3t5u commented Aug 8, 2025

@hiroyuki-sato

Sorry for the late response.

We would like to continue contributing in the future with the following policy:


For various Embulk plugins

  • Bug fixes
    • We will submit PRs as appropriate.
  • Feature additions
    • For the following, we will submit PRs as appropriate:
      • Those that seem to have general demand
      • Those with small changes and no side effects
    • For the following, we will basically maintain them in the fork, but will also submit PRs to the origin if requested:
      • Those with large changes, potential side effects, compatibility issues, etc.

For Embulk itself

  • Bug fixes
    • We will submit PRs as appropriate.
  • Feature additions
    • At present, due to a lack of resources, we basically don't plan to work on these.

Please confirm.

@dmikurube
Member

dmikurube commented Aug 8, 2025

Currently, the Embulk project has no active maintainers. I am no longer an "active" maintainer of Embulk, and I am not willing to continue maintaining it "actively", which includes reviewing and merging PRs even if you submit PRs, unless someone is willing to take over Embulk's leadership and ownership. Therefore, please note and accept that your PRs may be ignored for a long time. Just submitting PRs would not help the maintenance, to be honest.

When I feel like it, I may review and accept some PRs. However, it may take a long time for this kind of PR, for which I have to look carefully at compatibility and consistency with existing things, and at safety for future expansions.

@hiroyuki-sato
Member

Hello, @t3t5u

At what file count would it make sense to enable the proposed option? (For example, 1,000 files?)

If the number of input files exceeds 1,000, is concurrent insertion into a single intermediate table transaction-safe in all modes?

I assume this would result in multiple threads inserting data in parallel into a single intermediate table—would the transactions handle this safely?

To the best of my recollection, there have been no such requests in the past decade. If we were to enable it, I believe we would need to investigate the potential impacts.

I have primarily been involved in this project by supporting the active maintainers. My contribution has been 100% volunteer work.
I think we need someone to step up and take the lead in driving this change forward.

@NamedPython

@hiroyuki-sato @dmikurube
Hi, I'm a technical lead of the team that maintains Embulk plugins in trocco-io, primeNumber Inc.
Sorry for the late response.

We recognize that @dmikurube is no longer an active maintainer, but there was a lack of communication within our company that led to submitting this PR without making a decision about our Embulk contribution policy. We apologize for this oversight.

While not a final decision, here are our current policies:

  • We will primarily fork for future development
    • We cannot assign an active maintainer in our company's current situation
  • We will close this PR
    • Sorry for taking your time

We appreciate your understanding and the time you've invested in reviewing this.

@t3t5u
Contributor Author

t3t5u commented Aug 18, 2025

@hiroyuki-sato

As stated in @NamedPython's response above, we will close this PR, but we will answer your question.

As stated in the purpose of this PR, in REPLACE mode the tasks for each input file already write in parallel to a single intermediate table.

In the past, there was an issue where intermediate table names would collide when executing INSERT mode with over 1,000 input files.

As a temporary workaround, we have continued operating by writing the results to a working table in REPLACE mode and then transferring them to the destination table using after_load.
The purpose of this PR is to avoid this temporary workaround operation.

Although we have not verified it with all databases, we have confirmed that this temporary workaround allows Redshift and Snowflake to work without problems with over 1,000 input files.
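
The second step of the workaround described above (copying rows from the working table to the destination table via after_load) can be sketched as follows. This is only an illustration: the table names `events_work` and `events` are hypothetical, and the plain `INSERT INTO ... SELECT *` statement assumes both tables have compatible columns; the SQL actually used in the workaround is not shown in this thread.

```java
public class AfterLoadWorkaround {
    // Builds a SQL statement of the kind that could be passed to after_load:
    // it copies every row from the working table into the destination table.
    // Table names are assumed to be trusted identifiers (no quoting is done here).
    public static String transferSql(String workTable, String destTable) {
        return "INSERT INTO " + destTable + " SELECT * FROM " + workTable;
    }

    public static void main(String[] args) {
        // Hypothetical table names for illustration.
        System.out.println(transferSql("events_work", "events"));
        // prints: INSERT INTO events SELECT * FROM events_work
    }
}
```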

We think your concern is reasonable, but our basic thinking is that if no problems occur in REPLACE mode, then no problems should occur in INSERT mode either.

@dmikurube
Member

@NamedPython You don't need to close/cancel this pull request, although it may take a long time for us to review and merge it. I meant I'm not fully involved in Embulk, but I'm still a part-time voluntary (inactive) maintainer. I didn't mean I'll reject any pull request to Embulk.

If you or your company intend to take it over, I'll give it a higher priority. Otherwise, it's just not so high a priority for me.

@hiroyuki-sato
Member

Hello, @t3t5u , @NamedPython

Thank you for explaining the details of this PR.
In summary, this PR is a workaround, and it would be better if someone implemented #299. Is this correct?

It seems reasonable to support more than 1,000 files while keeping the current behavior.
(For example, add a new option such as use_new_intermediate_table_type, with a default of false. If this option is enabled, more than 1,000 files are accepted.)

Do you think the option proposed in this PR would still be necessary once such an option is available?

@NamedPython

@dmikurube Thank you for clarifying! We appreciate that we can keep this PR open.

Since we'll merge this into our TROCCO fork first, there's no time pressure from our side. We think it's a good feature for embulk-output-jdbc too, so we'll leave this PR open.

If we find any issues in production, we'll share them with the community.

Thanks for your continued work on the project.

@t3t5u
Contributor Author

t3t5u commented Aug 20, 2025

@hiroyuki-sato

With #299 resolved, we can now process more than 1,000 input files even in INSERT mode.
However, as stated in the purpose of this PR, we want to avoid creating a large number of intermediate tables, so we continue to need the functionality of this PR.

The reason we don't want to create a large number of intermediate tables is that when the Embulk process itself terminates abnormally for some reason, the process to delete intermediate tables doesn't execute properly, leaving a large number of intermediate tables as garbage.
This problem occurs occasionally and has become an operational burden.
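
As a rough illustration of the manual cleanup this burden implies, the sketch below builds `DROP TABLE` statements for leftover tables. It is not part of this PR or of embulk-output-jdbc: the class, the method, and the idea of recognizing intermediate tables by a name prefix are assumptions made for the example, and the prefix and table names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LeftoverTableCleanup {
    // Returns one DROP statement per table whose name starts with the given prefix.
    // Recognizing intermediate tables by prefix is an assumption for this sketch.
    public static List<String> dropStatements(List<String> tableNames, String intermediatePrefix) {
        return tableNames.stream()
                .filter(name -> name.startsWith(intermediatePrefix))
                .map(name -> "DROP TABLE IF EXISTS " + name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical table listing after an abnormal termination.
        List<String> tables = List.of("events", "events_tmp_0", "events_tmp_1", "users");
        dropStatements(tables, "events_tmp_").forEach(System.out::println);
    }
}
```

With one intermediate table per task, a list like this can contain as many entries as there were input files; with a single shared intermediate table, at most one table is left behind per failed run.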

Please confirm. 🙏

@dmikurube
Member

@NamedPython Thanks for your confirmation. :)

Please note one point -- we may finally conclude that this pull request needs some changes before merging. For example, changes in the configuration parameter name, in the specification, in the behavior details, or something else.

In that case, this embulk-official open-source version may conflict, or may not match, your fork. Once we conclude that the changes are needed, we wouldn't change our mind only to keep the compatibility with TROCCO's. Please be prepared for such a case.

(The light side of being a maintainer is that you'll be one of the decision makers in such a case -- while the dark side is that you'll have to make the decisions for others. ;) )

@hiroyuki-sato
Member

@t3t5u Thank you for your reply.
