Skip to content

Should append sometimes or always create UNION ALL BY NAME? #5165

@kgutwin

Description

@kgutwin

What's up?

Currently a PRQL append transform will result in UNION ALL:

from tbl_a
append tbl_b

gives

SELECT * FROM tbl_a UNION ALL SELECT * FROM tbl_b -- Generated by PRQL compiler version:0.13.3-39-ge393ab4d (https://prql-lang.org)

However, this results in issues like #4724, #2680, and #3184, where the underlying cause is that UNION ALL is interpreted by the database as "unify by column position" rather than "unify by column name". See the DuckDB docs:

Traditional set operations unify queries by column position, and require the to-be-combined queries to have the same number of input columns. If the columns are not of the same type, casts may be added. The result will use the column names from the first query.
DuckDB also supports UNION [ALL] BY NAME, which joins columns by name instead of by position. UNION BY NAME does not require the inputs to have the same number of columns. NULL values will be added in case of missing columns.

Questions:

  1. Should append always behave as UNION ALL BY NAME to simplify semantics from the user perspective, and also make the compiler's job easier? This would resolve all of the linked issues above without needing to dive into compiler details, but would be a breaking change for users expecting traditional UNION ALL behavior.
  2. If "no" to the above question, can we add a by:name or by:position argument to append to allow users to use UNION ALL BY NAME when that makes sense for their use case?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions