Added single_intermediate_table option #353

t3t5u · 2025-07-27T22:57:54Z

Purpose

There are two modes:

INSERT mode etc.: creates intermediate tables for each task
REPLACE mode etc.: creates only one intermediate table

In INSERT mode and similar modes, an intermediate table is created for each task.
Because of how Embulk works, one task is generated per input file.
Therefore, a large number of input files produces a large number of intermediate tables.

We want INSERT mode etc. to create only one intermediate table like REPLACE mode etc. does.
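
To make the difference concrete, here is a minimal Python sketch of how the number of intermediate tables grows with the task count under each strategy. It is an illustrative model only: the function name, the `single` flag, and the table-name format are assumptions made for this example and do not reflect the plugin's actual naming scheme.

```python
def intermediate_table_names(base: str, task_count: int, single: bool) -> list[str]:
    """Return the intermediate table names a run would create (hypothetical model)."""
    if task_count < 0:
        raise ValueError("task_count must be non-negative")
    if single:
        # One shared table, regardless of how many tasks write to it.
        return [f"{base}_tmp"]
    # One table per task: the count grows linearly with the number of input files.
    return [f"{base}_tmp{i:04d}" for i in range(task_count)]


per_task = intermediate_table_names("sales", task_count=1500, single=False)
shared = intermediate_table_names("sales", task_count=1500, single=True)
print(len(per_task), len(shared))
```

With 1,500 input files, the per-task strategy in this model yields 1,500 tables and the shared strategy yields one, which is the behavior this PR asks for in INSERT mode.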

hiroyuki-sato · 2025-07-30T10:54:52Z

Hello, @t3t5u

Thank you for proposing the new PR.

Is this PR a one-shot contribution? Or does your organization want to contribute to the Embulk project itself?
As you know, @dmikurube stepped down as the active maintainer. There is no active maintainer now.

We are still looking for long-term maintainers around the Embulk ecosystem.

As far as I know, your organization has already forked embulk-output-jdbc and extended some features on your branch.
For example, "Added Azure Synapse Analytics to supported products".
Will your organization maintain both plugins?

Can I ask about your organization's policy?

t3t5u · 2025-08-01T03:28:40Z

@hiroyuki-sato

Okay, I'll take a little time to reply. 🙏

t3t5u · 2025-08-08T01:25:03Z

@hiroyuki-sato

Sorry for the late response.

We would like to continue contributing in the future with the following policy:

For various Embulk plugins

Bug fixes
- We will submit PRs as appropriate.
Feature additions
- For the following, we will submit PRs as appropriate:
  - Those that seem to have general demand
  - Those with small changes and no side effects
- For the following, we will basically maintain them in the fork, but will also submit PRs to the origin if requested:
  - Those with large changes, potential side effects, compatibility issues, etc.

For Embulk itself

Bug fixes
- We will submit PRs as appropriate.
Feature additions
- At present, due to lack of resources, we basically don't plan to work on these.

Please confirm.

dmikurube · 2025-08-08T07:20:24Z

Currently, the Embulk project has no active maintainers. I am no longer an "active" maintainer of Embulk, and I am not willing to continue maintaining it "actively", which includes reviewing and merging PRs even if you submit PRs, unless someone is willing to take over Embulk's leadership and ownership. Therefore, please note and accept that your PRs may be ignored for a long time. Just submitting PRs would not help the maintenance, to be honest.

When I feel like it, I may review and accept some PRs. However, it may take a long time for PRs like this one, where I have to look carefully at compatibility and consistency with existing behavior, and at safety for future expansion.

hiroyuki-sato · 2025-08-13T11:56:28Z

Hello, @t3t5u

At what file count would it make sense to enable the proposed option? (For example, 1,000 files?)

If the number of input files exceeds 1,000, is concurrent insertion into a single intermediate table transaction-safe in all modes?

I assume this would result in multiple threads inserting data in parallel into a single intermediate table—would the transactions handle this safely?

To the best of my recollection, there have been no such requests in the past decade. If we were to enable it, I believe we would need to investigate the potential impacts.

I have primarily been involved in this project by supporting the active maintainers. My contribution has been 100% volunteer work.
I think we need someone to step up and take the lead in driving this change forward.

NamedPython · 2025-08-18T10:52:58Z

@hiroyuki-sato @dmikurube
Hi, I'm the technical lead of the team that maintains Embulk plugins at trocco-io, primeNumber Inc.
Sorry for the late response.

We recognize that @dmikurube is no longer an active maintainer, but there was a lack of communication within our company that led to submitting this PR without making a decision about our Embulk contribution policy. We apologize for this oversight.

While not a final decision, here are our current policies:

We will primarily fork for future development
- We cannot assign an active maintainer given our company's current situation
We will close this PR
- Sorry for taking your time

We appreciate your understanding and the time you've invested in reviewing this.

t3t5u · 2025-08-18T11:03:18Z

@hiroyuki-sato

As stated in @NamedPython's response above, we will close this PR, but we will answer your question.

As stated in the purpose of this PR, in REPLACE mode the tasks for each input file already write in parallel to a single intermediate table.

In the past, there was an issue where intermediate table names would collide when executing INSERT mode with over 1,000 input files.

Should be able to create more than 1000 temporary tables #299

As a temporary workaround, we have continued operating by writing the results to a working table in REPLACE mode and then transferring them to the destination table using after_load.
The purpose of this PR is to avoid this temporary workaround operation.

Although we have not verified it with all databases, we have confirmed that this temporary workaround allows Redshift and Snowflake to work without problems with over 1,000 input files.

We think your concern is reasonable, but our basic thinking is that if no problems occur in REPLACE mode, then no problems should occur in INSERT mode either.
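
As a rough illustration of the pattern under discussion (many tasks, each with its own connection and transaction, inserting into one shared intermediate table), here is a small Python sketch using the standard-library `sqlite3` module. It only demonstrates the shape of the workload. SQLite serializes writers with file locks, so it says nothing about how Redshift, Snowflake, or any other target database behaves. The table name, column names, and constants are assumptions for this example.

```python
import os
import sqlite3
import tempfile
import threading

TASKS = 8            # stand-in for the number of input files / tasks
ROWS_PER_TASK = 100


def run_task(db_path: str, task_index: int) -> None:
    # Each task uses its own connection and its own transaction.
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO intermediate (task_index, value) VALUES (?, ?)",
                [(task_index, n) for n in range(ROWS_PER_TASK)],
            )
    finally:
        conn.close()


def load_in_parallel() -> int:
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        setup = sqlite3.connect(db_path)
        setup.execute("CREATE TABLE intermediate (task_index INTEGER, value INTEGER)")
        setup.commit()
        setup.close()

        threads = [
            threading.Thread(target=run_task, args=(db_path, i)) for i in range(TASKS)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Count the rows that all tasks committed to the shared table.
        check = sqlite3.connect(db_path)
        try:
            (count,) = check.execute("SELECT COUNT(*) FROM intermediate").fetchone()
        finally:
            check.close()
        return count


total_rows = load_in_parallel()
print(total_rows)
```

If every task commits, the row count equals `TASKS * ROWS_PER_TASK`. Whether the same holds on a given production database would still need to be verified per product, as noted above.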

dmikurube · 2025-08-19T01:12:49Z

@NamedPython You don't need to close/cancel this pull-request, although it may take a long time for us to review and merge it. I meant I'm not fully involved in Embulk, but I'm still a part-time voluntary (inactive) maintainer. I didn't mean I'll reject any pull-request to Embulk.

If you or your company have an intention to take it over, I'll give it a higher priority. Otherwise, it's just not so high priority for me.

hiroyuki-sato · 2025-08-19T08:29:19Z

Hello, @t3t5u , @NamedPython

Thank you for explaining the details of this PR.
In summary, this PR is a workaround, and it would be better if someone implemented #299. Is this correct?

It seems reasonable to support more than 1,000 files while keeping the current behavior.
(For example, introduce a new option such as use_new_intermediate_table_type: true, with a default of false. If this option is enabled, more than 1,000 files are accepted.)

Do you think the option in this PR is still necessary once that option is available?

NamedPython · 2025-08-20T07:20:13Z

@dmikurube Thank you for clarifying! We appreciate that we can keep this PR open.

Since we'll merge this into our TROCCO fork first, there's no time pressure from our side. We think it's a good feature for embulk-output-jdbc too, so we'll leave this PR open.

If we find any issues in production, we'll share them with the community.

Thanks for your continued work on the project.

t3t5u · 2025-08-20T07:33:27Z

@hiroyuki-sato

With #299 resolved, we can now process more than 1,000 input files even in INSERT mode.
However, as stated in the purpose of this PR, we want to avoid creating a large number of intermediate tables, so we continue to need the functionality of this PR.

The reason we don't want to create a large number of intermediate tables is that when the Embulk process itself terminates abnormally for some reason, the process to delete intermediate tables doesn't execute properly, leaving a large number of intermediate tables as garbage.
This problem occurs occasionally and has become an operational burden.

Please confirm. 🙏
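
The operational argument above can be put as a tiny Python model: if cleanup never runs, every intermediate table created so far is left behind. The function and parameter names are hypothetical and exist only for this illustration. The model also assumes the worst case, in which all tables were created before the process died.

```python
def leftover_tables(task_count: int, single: bool, cleanup_ran: bool) -> int:
    """Count intermediate tables left behind after a run (hypothetical model)."""
    created = 1 if single else task_count
    # Cleanup drops every intermediate table; an abnormal exit skips it entirely.
    return 0 if cleanup_ran else created


for single in (False, True):
    print(single, leftover_tables(task_count=1500, single=single, cleanup_ran=False))
```

Under these assumptions, an abnormal exit with 1,500 tasks leaves 1,500 tables to clean up by hand in the per-task case, and one table in the single-table case.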

dmikurube · 2025-08-20T09:31:40Z

@NamedPython Thanks for your confirmation. :)

Please note one point -- we may finally conclude that this pull request needs some changes before merging. For example, changes in the configuration parameter name, in the specification, in the behavior details, or something else.

In that case, this embulk-official open-source version may conflict, or may not match, your fork. Once we conclude that the changes are needed, we wouldn't change our mind only to keep the compatibility with TROCCO's. Please be prepared for such a case.

(The light side of being a maintainer is that you'll be one of the decision makers in such a case -- while the dark side is that you'll have to make the decisions for others. ;) )

hiroyuki-sato · 2025-08-21T11:43:19Z

@t3t5u Thank you for your reply.

Added single_intermediate_table option

79cb423

t3t5u requested a review from a team as a code owner July 27, 2025 22:57
