@pedrorfdez

Feature or Bugfix

  • Feature

Detail

  • Implemented full support for Iceberg MERGE INTO operations in _write_iceberg.py (to_iceberg).
  • Added new parameters:
    • merge_on_clause: Custom ON expression for MERGE INTO ... USING ... ON [custom_expression], allowing the <, <=, > and >= operators. Until now, only column equality was allowed. The docstring warns about the risk of getting more than one match in the target table.
    • merge_condition: Now also accepts the value conditional_merge.
    • merge_conditional_clauses: List of dictionaries specifying custom conditional clauses for the MERGE INTO statement.
      Each dictionary should have:
      - 'when': One of ['MATCHED', 'NOT MATCHED', 'NOT MATCHED BY SOURCE']
      - 'action': One of ['UPDATE', 'DELETE', 'INSERT']
      - 'condition': (optional) Additional SQL condition for the clause
      - 'columns': (optional) List of columns to update or insert
      Used only when merge_condition is 'conditional_merge'.
  • Added argument validation for the mutually exclusive and required parameters: merge_cols, merge_on_clause, merge_match_nulls, merge_condition and merge_conditional_clauses.
  • Added and updated unit tests to cover the new validation logic and merge scenarios.
  • Updated docstrings for the new parameters and behaviors.
  • Preserved backward compatibility.
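
To make the shape of merge_conditional_clauses concrete, here is a minimal sketch of how such a list could be turned into the WHEN clauses of a MERGE INTO statement. The function name render_when_clauses and the exact SQL layout are assumptions made for illustration; this is not the rendering code from this PR.

```python
from typing import Any, Dict, List, Sequence


def render_when_clauses(clauses: Sequence[Dict[str, Any]], columns: Sequence[str]) -> str:
    """Render WHEN ... THEN ... clauses (hypothetical helper, not the PR's code)."""
    rendered: List[str] = []
    for clause in clauses:
        when, action = clause["when"], clause["action"]
        # Fall back to all DataFrame columns when 'columns' is not given.
        cols = list(clause.get("columns") or columns)
        # The optional 'condition' is appended to the WHEN part with AND.
        condition = f" AND {clause['condition']}" if clause.get("condition") else ""
        if action == "UPDATE":
            body = "UPDATE SET " + ", ".join(f'"{c}" = source."{c}"' for c in cols)
        elif action == "INSERT":
            names = ", ".join(f'"{c}"' for c in cols)
            values = ", ".join(f'source."{c}"' for c in cols)
            body = f"INSERT ({names}) VALUES ({values})"
        else:  # DELETE takes no column list
            body = "DELETE"
        rendered.append(f"WHEN {when}{condition} THEN {body}")
    return "\n".join(rendered)


example_clauses = [
    {"when": "MATCHED", "action": "UPDATE", "condition": "source.ts > target.ts", "columns": ["value"]},
    {"when": "NOT MATCHED", "action": "INSERT"},
]
sql_fragment = render_when_clauses(example_clauses, columns=["id", "value"])
```

The sketch assumes the clause list has already been validated, so it does not re-check the 'when' and 'action' values.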

Relates

This is a first draft of the implementation; feel free to suggest changes to the approach.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@pedrorfdez changed the title from "Add full-featured Iceberg MERGE INTO conditional merges support and argument validation" to "feat: Add full-featured Iceberg MERGE INTO conditional merges support and argument validation" Sep 10, 2025

@pedrorfdez
Author

Currently failing 1 Athena test:

FAILED tests/unit/test_athena_iceberg.py::test_to_iceberg_conditional_merge_happy_path - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="id") are different

Attribute "dtype" are different
[left]: int64
[right]: Int64

I plan to fix this in the next commit.
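
For context, the failure above is what pandas reports when one frame holds a NumPy int64 column and the other holds the nullable Int64 extension dtype. The following minimal sketch reproduces the mismatch and shows one possible way to align the dtypes before comparing. Which side of the real test carries which dtype, and whether casting is the right fix for it, are assumptions.

```python
import pandas as pd

# One frame uses NumPy int64, the other the nullable Int64 extension dtype.
expected = pd.DataFrame({"id": [1, 2, 3]})
actual = pd.DataFrame({"id": pd.array([1, 2, 3], dtype="Int64")})

try:
    pd.testing.assert_frame_equal(expected, actual)
    dtype_mismatch = False
except AssertionError:
    # Values are equal, but the dtype attribute differs (int64 vs Int64).
    dtype_mismatch = True

# One option: cast the expected frame to the nullable dtype before comparing.
pd.testing.assert_frame_equal(expected.astype({"id": "Int64"}), actual)
```

Passing check_dtype=False to assert_frame_equal is another option, at the cost of no longer testing the dtypes at all.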

@pedrorfdez marked this pull request as ready for review on September 11, 2025, 17:18

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: e7acaf0
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@kukushking
Contributor

Hi @pedrorfdez, it looks like there is an error in test_to_iceberg_conditional_merge_happy_path:

FAILED tests/unit/test_athena_iceberg.py::test_to_iceberg_conditional_merge_happy_path - AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="id") are different
Attribute "dtype" are different
[left]:  int64
[right]: Int64
= 1 failed, 2312 passed, 72 skipped, 9 xfailed, 4 xpassed, 143 warnings, 1 rerun in 729.43s (0:12:09) =

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: e7acaf0
  • Result: FAILED
  • Build Logs (available for 30 days)

        "Cannot specify both merge_cols and merge_on_clause. Use either merge_cols for simple equality matching or merge_on_clause for custom logic."
    )

if merge_on_clause and merge_match_nulls:

It may be better to also check merge_cols here, since that would validate that the merge columns are actually set.
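
One possible reading of this suggestion, as a minimal sketch: merge_match_nulls is only accepted when merge_cols is set and merge_on_clause is not. The function name validate_merge_args, the stand-in exception class, and the second error message are assumptions for illustration, not code from this PR.

```python
from typing import List, Optional


class InvalidArgumentValue(Exception):
    """Stand-in for exceptions.InvalidArgumentValue, so the sketch runs on its own."""


def validate_merge_args(
    merge_cols: Optional[List[str]] = None,
    merge_on_clause: Optional[str] = None,
    merge_match_nulls: bool = False,
) -> None:
    """Check mutually exclusive merge arguments (hypothetical helper)."""
    if merge_cols and merge_on_clause:
        raise InvalidArgumentValue(
            "Cannot specify both merge_cols and merge_on_clause. Use either merge_cols "
            "for simple equality matching or merge_on_clause for custom logic."
        )
    # Checking merge_cols as well catches the case where neither option is set.
    if merge_match_nulls and (merge_on_clause or not merge_cols):
        raise InvalidArgumentValue(
            "merge_match_nulls requires merge_cols and cannot be combined with merge_on_clause."
        )
```

Because merge_match_nulls defaults to False in this sketch, calls that pass none of the three arguments still validate.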

        raise exceptions.InvalidArgumentValue(f"merge_conditional_clauses[{i}] must contain 'action' field.")
    if clause["when"] not in ["MATCHED", "NOT MATCHED", "NOT MATCHED BY SOURCE"]:
        raise exceptions.InvalidArgumentValue(
            f"merge_conditional_clauses[{i}]['when'] must be one of ['MATCHED', 'NOT MATCHED', 'NOT MATCHED BY SOURCE']."

Maybe indicate in the error message that the 'when' values are case-sensitive.

        )
    if clause["action"] not in ["UPDATE", "DELETE", "INSERT"]:
        raise exceptions.InvalidArgumentValue(
            f"merge_conditional_clauses[{i}]['action'] must be one of ['UPDATE', 'DELETE', 'INSERT']."

Maybe indicate in the error message that the action is case-sensitive.
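
A minimal sketch of what the two case-sensitivity suggestions could look like when applied together. The function name validate_clauses, the stand-in exception class, and the exact message wording are assumptions, and the order of the checks differs slightly from the code under review.

```python
from typing import Any, Dict, Sequence


class InvalidArgumentValue(Exception):
    """Stand-in for exceptions.InvalidArgumentValue, so the sketch runs on its own."""


_VALID_WHEN = ("MATCHED", "NOT MATCHED", "NOT MATCHED BY SOURCE")
_VALID_ACTIONS = ("UPDATE", "DELETE", "INSERT")


def validate_clauses(clauses: Sequence[Dict[str, Any]]) -> None:
    """Validate 'when' and 'action' of each clause (hypothetical helper)."""
    for i, clause in enumerate(clauses):
        for field, allowed in (("when", _VALID_WHEN), ("action", _VALID_ACTIONS)):
            if field not in clause:
                raise InvalidArgumentValue(
                    f"merge_conditional_clauses[{i}] must contain '{field}' field."
                )
            if clause[field] not in allowed:
                # Spell out that the comparison is case-sensitive and echo the bad value.
                raise InvalidArgumentValue(
                    f"merge_conditional_clauses[{i}]['{field}'] must be one of {list(allowed)} "
                    f"(values are case-sensitive); got {clause[field]!r}."
                )
```

Echoing the offending value makes a lowercase 'matched' or 'update' easy to spot in the error.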

- 'update': Update matched rows and insert non-matched rows.
- 'ignore': Only insert non-matched rows.
- 'conditional_merge': Use custom conditional clauses for merge actions.
merge_conditional_clauses : List[dict], optional

Shouldn't the type hint be list[_MergeClause]?

UPDATE SET {", ".join([f'"{x}" = source."{x}"' for x in df.columns])}"""
else:
match_condition = ""
if merge_cols or merge_on_clause:

The if logic is becoming unwieldy, IMHO; maybe refactor it into a helper function that builds the conditions.
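
A minimal sketch of the kind of helper this comment points at, covering only the ON condition. The name build_on_condition and the null-matching SQL are assumptions for illustration and are not taken from the PR.

```python
from typing import List, Optional


def build_on_condition(
    merge_cols: Optional[List[str]] = None,
    merge_on_clause: Optional[str] = None,
    merge_match_nulls: bool = False,
) -> str:
    """Build the ON expression of a MERGE INTO statement (hypothetical helper)."""
    if merge_on_clause:
        # A custom expression is used verbatim.
        return merge_on_clause
    if not merge_cols:
        raise ValueError("Either merge_cols or merge_on_clause must be provided.")
    if merge_match_nulls:
        # Treat NULL on both sides as a match (assumed semantics of merge_match_nulls).
        parts = [
            f'(target."{c}" = source."{c}" OR (target."{c}" IS NULL AND source."{c}" IS NULL))'
            for c in merge_cols
        ]
    else:
        parts = [f'target."{c}" = source."{c}"' for c in merge_cols]
    return " AND ".join(parts)


on_condition = build_on_condition(merge_cols=["id", "day"])
```

With the branching isolated like this, the calling code only has to decide whether to call the helper at all.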

4 participants