Added single_intermediate_table option #353
|
Hello, @t3t5u Thank you for proposing the new PR. Is this PR a one-shot contribution, or does your organization want to contribute to the Embulk project itself? We are still looking for long-term maintainers around the Embulk ecosystem. As far as I know, your organization has already forked embulk-output-jdbc and extended some features on your branch. May I ask about your organization's policy? |
|
Okay, I'll take a little time to reply. 🙏 |
|
Sorry for the late response. We would like to continue contributing in the future with the following policy: For various Embulk plugins
For Embulk itself
Please confirm. |
|
Currently, the Embulk project has no active maintainers. I am no longer an "active" maintainer of Embulk, and I am not willing to continue maintaining it "actively", which includes reviewing and merging PRs even if you submit PRs, unless someone is willing to take over Embulk's leadership and ownership. Therefore, please note and accept that your PRs may be ignored for a long time. Just submitting PRs would not help the maintenance, to be honest. When I feel like it, I may review and accept some PRs. However, it may take a long time for this kind of PR, for which I have to take a careful look at compatibility and consistency with existing things, and at safety for future expansion. |
|
Hello, @t3t5u At what file count would it make sense to enable the proposed option? (For example, 1,000 files?) If the number of input files exceeds 1,000, is concurrent insertion into a single intermediate table transaction-safe in all modes? I assume this would result in multiple threads inserting data in parallel into a single intermediate table—would the transactions handle this safely? To the best of my recollection, there have been no such requests in the past decade. If we were to enable it, I believe we would need to investigate the potential impacts. I have primarily been involved in this project by supporting the active maintainers. My contribution has been 100% volunteer work. |
|
@hiroyuki-sato @dmikurube We recognize that @dmikurube is no longer an active maintainer, but there was a lack of communication within our company that led to submitting this PR without making a decision about our Embulk contribution policy. We apologize for this oversight. While not a final decision, here are our current policies:
We appreciate your understanding and the time you've invested in reviewing this. |
|
As stated in @NamedPython's response above, we will close this PR, but we will answer your question. As stated in the purpose of this PR, currently in REPLACE mode, the tasks for each input file write in parallel to a single intermediate table. In the past, there was an issue where intermediate table names would collide when executing INSERT mode with over 1,000 input files. As a temporary workaround, we have continued operating by writing the results to a working table in REPLACE mode and then transferring them to the destination table using after_load. Although we have not verified it with all databases, we have confirmed that this temporary workaround allows Redshift and Snowflake to work without problems with over 1,000 input files. We think your concern is reasonable, but our basic thinking is that if no problems occur in REPLACE mode, then no problems should occur in INSERT mode either. |
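A minimal sketch of the workaround described above, written as an Embulk output configuration. It assumes a PostgreSQL-style plugin from the embulk-output-jdbc family purely for brevity (the comment itself reports Redshift and Snowflake); the connection values and the table names `work_table` and `dest_table` are hypothetical placeholders, and the exact SQL used in production is not stated in this thread.

```yaml
out:
  type: postgresql        # assumption: any embulk-output-jdbc based plugin
  host: localhost         # placeholder connection settings
  user: embulk_user
  password: ""
  database: example_db
  table: work_table       # hypothetical working table, rebuilt on every run
  mode: replace           # all tasks write to one intermediate table
  # after_load runs once after all records are loaded; here it copies the
  # working table into the (hypothetical) real destination table.
  after_load: "INSERT INTO dest_table SELECT * FROM work_table"
```

The point of the sketch is that the per-file tasks never create their own intermediate tables; the only append into the destination happens in the single `after_load` statement.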
|
@NamedPython You don't need to close/cancel this pull request, although it may take a long time for us to review and merge it. I meant I'm not fully involved in Embulk, but I'm still a part-time voluntary (inactive) maintainer. I didn't mean I'll reject any pull request to Embulk. If you or your company have an intention to take it over, I'll give it a higher priority. Otherwise, it's just not so high priority for me. |
|
Hello, @t3t5u , @NamedPython Thank you for writing up the details of this PR. It seems reasonable to support more than 1,000 files while keeping the current behavior. Do you think this option is still necessary after this option is available? |
|
@dmikurube Thank you for clarifying! We appreciate that we can keep this PR open. Since we'll merge this into our TROCCO fork first, there's no time pressure from our side. We think it's a good feature for embulk-output-jdbc too, so we'll leave this PR open. If we find any issues in production, we'll share them with the community. Thanks for your continued work on the project. |
|
With #299 resolved, we can now process more than 1,000 input files even in INSERT mode. The reason we don't want to create a large number of intermediate tables is that when the Embulk process itself terminates abnormally for some reason, the process to delete intermediate tables doesn't execute properly, leaving a large number of intermediate tables as garbage. Please confirm. 🙏 |
|
@NamedPython Thanks for your confirmation. :) Please note one point -- we may finally conclude that this pull request needs some changes before merging. For example, changes in the configuration parameter name, in the specification, in the behavior details, or else. In that case, this (The light side of being a maintainer is that you'll be one of the decision makers in such a case -- while the dark side is that you'll have to make the decisions for others. ;) ) |
|
@t3t5u Thank you for your reply. |
Purpose
There are two modes:
In INSERT mode and similar modes, an intermediate table is created for each task, and Embulk generates one task for each input file.
Therefore, when there are a large number of input files, a large number of intermediate tables are created.
We want INSERT mode and similar modes to be able to create only one intermediate table, as REPLACE mode and similar modes do.
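As a rough illustration, enabling the proposed option might look like the configuration below. Only the option name `single_intermediate_table` comes from this PR's title; that it is a boolean, the plugin type, the connection values, and the table name `dest_table` are all assumptions made for this sketch.

```yaml
out:
  type: postgresql                  # assumption: any embulk-output-jdbc based plugin
  host: localhost                   # placeholder connection settings
  user: embulk_user
  password: ""
  database: example_db
  table: dest_table                 # hypothetical destination table
  mode: insert                      # normally one intermediate table per task
  single_intermediate_table: true   # proposed option; boolean type is assumed
```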