Skip to content

ESQL: Skip LOOKUP JOIN when join key is missing from index #125577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
alex-spies opened this issue Mar 25, 2025 · 3 comments · May be fixed by #127583
Open

ESQL: Skip LOOKUP JOIN when join key is missing from index #125577

alex-spies opened this issue Mar 25, 2025 · 3 comments · May be fixed by #127583
Assignees
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@alex-spies
Copy link
Contributor

ReplaceMissingFieldWithNull will check if the indices of a data node even have a given field, and if not, will define a literal NULL in its place to avoid extracting null blocks.

When the join key for a LOOKUP JOIN is missing on the data node, we could optimize away the whole join rather than extracting null blocks and then performing lookups with them.

Or, at least we should check that the current implementation is sufficiently cheap - but I think it's likely not in cases where we fan out to loads of data nodes, many of which just don't have values for the join key; like in FROM * | LOOKUP JOIN lu_idx ON some_rather_rare_field | SORT whatever_field.

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 25, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@costin
Copy link
Member

costin commented Apr 23, 2025

The lookup can be completely reworked into a simple projection (due to being a left join) that references the left and right fields. The right side would have had the fields already added as null so

FROM * | LOOKUP JOIN lu_idx ON some_rather_rare_field
// becomes
FROM * | KEEP <left-fields>, <lu_idx fields>
// which would be further optimized (if the stats for some_rather_rare_field are properly read) to
FROM * | EVAL <lu_idx fields> = NULL | KEEP <left_fields> , <lu_idx fields>

One potential complication is in returning data from the optimized data node in a format suitable for the consumer on the coordinator.

P.S.
Ideally we won't broadcast the call at all however that requires some extra infrastructure that we don't have yet at the moment.
Another approach would be to look at the intersection of the keys first before deciding on the execution strategy - no keys is an extreme variant of that.

@costin costin assigned bpintea and unassigned costin Apr 23, 2025
@alex-spies
Copy link
Contributor Author

The lookup can be completely reworked into a simple projection

This is only true if the names on the left/right hand sides are qualified (once we have qualifiers) due to name conflict handling. (There is always at least 1 name conflict, by design, specified via the ON clause.) Right now, the projection has to be baked into the lookup join because duplicate names are forbidden in plans' outputs.
However, once we have qualifiers, you are correct that it may become more convenient to refactor our planning such that LOOKUP JOIN ... ON field is represented as a LOOKUP JOIN ... AS right ON field == right.field | KEEP ...

FROM * | LOOKUP JOIN lu_idx ON some_rather_rare_field becomes FROM * | EVAL <lu_idx fields> = NULL | KEEP <left_fields> , <lu_idx fields>

That's what I had in mind. The last KEEP is likely not even needed due to LOOKUP JOIN ... ON field working just like EVAL in terms of shadowing.

Ideally we won't broadcast the call at all

Not sure that's true in general, because the LOOKUP JOIN is not the only work that's being done on the data nodes. It's true if the query has a WHERE command that can only be true if a lookup field is non-null, though.

Another approach would be to look at the intersection of the keys first before deciding on the execution strategy - no keys is an extreme variant of that.

Yep, there's some optimization potential here when we run LOOKUP JOIN ... ON field and we know that the values for field in the current shard can't have any matches at all with the lookup index. Unfortunately, the lookup index is non-local but maybe it can still be cheap enough to figure this out, or we can make it cheap enough via e.g. caching of metadata.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants