Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 64 additions & 2 deletions datafusion/expr-common/src/interval_arithmetic.rs
Original file line number Diff line number Diff line change
Expand Up @@ -583,7 +583,9 @@ impl Interval {
upper: ScalarValue::Boolean(Some(upper)),
})
}
_ => internal_err!("Incompatible data types for logical conjunction"),

// Return UNCERTAIN when intervals don't have concrete boolean bounds
_ => Ok(Self::UNCERTAIN),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This make sense to me, under the classic 3-valued logic, in such cases it's fine to return unknown.

This method should be used with care as it would probably be unsafe for simplifications, as the TODO suggests we are not handling cases like FALSE AND UNKNOWN which is FALSE, or TRUE OR UNKNOWN which is TRUE.

Depending on where we invoke this method, we need to be aware of the unknown-as-false semantics (filters) and unknown-as-null semantics (order by, projections, etc.), and decide if it is safe to use this method or not.

Since we were previously erroring out, I think it's in any case an improvement but just wanted to point that out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to find places that would be confused by this change, but there is quite a bit of logic around Interval and tracing the implications are a bit beyond me as a first time datafusion contributor. In a good way it seems Interval::UNCERTAIN is handled uniformly from what I found. There also seems to be test coverage in the optimizer if those TODOs were executed on (test_simplify_with_guarantee, test_inequalities_maybe_null, etc).

@alamb @asolimando I'm happy to explore adding those bounds here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting...I started by adding tests for and/or:

#[test]
fn test_and_null_boolean_intervals() -> Result<()> {
    let null_interval =
        Interval::try_new(ScalarValue::Boolean(None), ScalarValue::Boolean(None))?;

    let and_result = null_interval.and(&Interval::CERTAINLY_FALSE)?;
    assert_eq!(and_result, Interval::CERTAINLY_FALSE);

    let and_result = null_interval.and(&Interval::CERTAINLY_TRUE)?;
    assert_eq!(and_result, Interval::UNCERTAIN);

    let and_result = null_interval.and(&null_interval)?;
    assert_eq!(and_result, Interval::UNCERTAIN);

    Ok(())
}

#[test]
fn test_or_null_boolean_intervals() -> Result<()> {
    let null_interval =
        Interval::try_new(ScalarValue::Boolean(None), ScalarValue::Boolean(None))?;

    let or_result = null_interval.or(&Interval::CERTAINLY_FALSE)?;
    assert_eq!(or_result, Interval::UNCERTAIN);

    let or_result = null_interval.or(&Interval::CERTAINLY_TRUE)?;
    assert_eq!(or_result, Interval::CERTAINLY_TRUE);

    let or_result = null_interval.or(&null_interval)?;
    assert_eq!(or_result, Interval::UNCERTAIN);

    Ok(())
}

And they already worked without any code changes, so I think my TODO statements were premature and unnecessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@watford-ep sounds good! Can you also add the reverse (true or null where you now test null or true) just to make sure that we short-circuit the right way independently from the operands' order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

Expand All @@ -606,7 +608,9 @@ impl Interval {
upper: ScalarValue::Boolean(Some(upper)),
})
}
_ => internal_err!("Incompatible data types for logical disjunction"),

// Return UNCERTAIN when intervals don't have concrete boolean bounds
_ => Ok(Self::UNCERTAIN),
}
}

Expand Down Expand Up @@ -2517,6 +2521,64 @@ mod tests {
Ok(())
}

#[test]
fn test_and_or_with_normalized_boolean_intervals() -> Result<()> {
// Verify that NULL boolean bounds are normalized and don't cause errors
let from_nulls =
Interval::try_new(ScalarValue::Boolean(None), ScalarValue::Boolean(None))?;

assert!(from_nulls.or(&Interval::CERTAINLY_TRUE).is_ok());
assert!(from_nulls.and(&Interval::CERTAINLY_FALSE).is_ok());

Ok(())
}

#[test]
fn test_and_null_boolean_intervals() -> Result<()> {
let null_interval =
Interval::try_new(ScalarValue::Boolean(None), ScalarValue::Boolean(None))?;

let and_result = null_interval.and(&Interval::CERTAINLY_FALSE)?;
assert_eq!(and_result, Interval::CERTAINLY_FALSE);

let and_result = Interval::CERTAINLY_FALSE.and(&null_interval)?;
assert_eq!(and_result, Interval::CERTAINLY_FALSE);

let and_result = null_interval.and(&Interval::CERTAINLY_TRUE)?;
assert_eq!(and_result, Interval::UNCERTAIN);

let and_result = Interval::CERTAINLY_TRUE.and(&null_interval)?;
assert_eq!(and_result, Interval::UNCERTAIN);

let and_result = null_interval.and(&null_interval)?;
assert_eq!(and_result, Interval::UNCERTAIN);

Ok(())
}

#[test]
fn test_or_null_boolean_intervals() -> Result<()> {
let null_interval =
Interval::try_new(ScalarValue::Boolean(None), ScalarValue::Boolean(None))?;

let or_result = null_interval.or(&Interval::CERTAINLY_FALSE)?;
assert_eq!(or_result, Interval::UNCERTAIN);

let or_result = Interval::CERTAINLY_FALSE.or(&null_interval)?;
assert_eq!(or_result, Interval::UNCERTAIN);

let or_result = null_interval.or(&Interval::CERTAINLY_TRUE)?;
assert_eq!(or_result, Interval::CERTAINLY_TRUE);

let or_result = Interval::CERTAINLY_TRUE.or(&null_interval)?;
assert_eq!(or_result, Interval::CERTAINLY_TRUE);

let or_result = null_interval.or(&null_interval)?;
assert_eq!(or_result, Interval::UNCERTAIN);

Ok(())
}

#[test]
fn intersect_test() -> Result<()> {
let possible_cases = vec![
Expand Down
37 changes: 37 additions & 0 deletions datafusion/sqllogictest/test_files/union.slt
Original file line number Diff line number Diff line change
Expand Up @@ -953,3 +953,40 @@ drop table u1;

statement count 0
drop table u2;

# repro for https://github.com/apache/datafusion/issues/18327
# should not error
query TITT
WITH typ(oid, typnamespace, typname, typtype) AS (
SELECT * FROM (VALUES (1, 10, 't1', 'b'))
UNION ALL SELECT * FROM (VALUES (2, NULL, 't2', 'b'))
UNION ALL SELECT * FROM (VALUES (3, 12, 't3', NULL))
)
, ns(oid, nspname) AS (VALUES (1, 'ns1'), (2, 'ns2'))
SELECT ns.nspname, typ.oid, typ.typname, typ.typtype
FROM typ JOIN ns ON (ns.oid = typ.typnamespace)
WHERE typ.typtype IN ('b','r','m','e','d')
ORDER BY CASE WHEN typ.typtype IN ('b','e','p') THEN 0
WHEN typ.typtype = 'r' THEN 1
END
----

# Add another row with a non-NULL value `m` which is retained by the
# filter but not matching any WHEN branch m?
query TITT
WITH typ(oid, typnamespace, typname, typtype) AS (
SELECT * FROM (VALUES (1, 10, 't1', 'b'))
UNION ALL SELECT * FROM (VALUES (2, NULL, 't2', 'b'))
UNION ALL SELECT * FROM (VALUES (3, 12, 't3', NULL))
UNION ALL SELECT * FROM (VALUES (4, 40, 't3', 'm'))
), ns(oid, nspname) AS (
VALUES (1, 'ns1'), (2, 'ns2'), (40, 'ns3')
)
SELECT ns.nspname, typ.oid, typ.typname, typ.typtype
FROM typ JOIN ns ON (ns.oid = typ.typnamespace)
WHERE typ.typtype IN ('b','r','m','e','d')
ORDER BY CASE WHEN typ.typtype IN ('b','e','p') THEN 0
WHEN typ.typtype = 'r' THEN 1
END
----
ns3 4 t3 m