Allow selective schema mapping check to avoid mapping conflicts in multi-mapping indices #2385

Open
zeotuan opened this issue May 9, 2025 · 1 comment · May be fixed by #2386

zeotuan commented May 9, 2025

Feature Description:

I am currently encountering the following error when querying a subset of columns from an Elasticsearch index that contains conflicting mappings for the same field:

```
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Incompatible types found in multi-mapping: Field [field1] has conflicting types of [OBJECT] and [TEXT].
```

I understand that type coercion support has been introduced (see #1074), and that coercion from complex types (e.g., OBJECT) to simple types (e.g., STRING) is not allowed.

However, my query does not reference field1 at all. I am only selecting a small subset of fields, and that projection is handled correctly during column pruning in org/elasticsearch/spark/sql/DefaultSource.scala.
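
For concreteness, a minimal sketch of the kind of read that hits this error (the index pattern and field names are placeholders, and it assumes the elasticsearch-spark connector is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-multi-mapping-repro").getOrCreate()

// field1 carries conflicting mappings (OBJECT vs TEXT) across the matched indices,
// but it is never selected here.
val df = spark.read
  .format("es")                                    // elasticsearch-spark SQL source
  .option("es.resource", "multi-mapping-index-*")  // placeholder index pattern
  .load()
  .select("field2", "field3")                      // only the fields the job needs

df.show()
// Fails while resolving the schema from the index mapping with:
// org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Incompatible types
// found in multi-mapping: Field [field1] has conflicting types of [OBJECT] and [TEXT].
```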

Despite this, at line 300, we instantiate ScalaEsRowRDD using the full inferred schema:

```scala
new ScalaEsRowRDD(sqlContext.sparkContext, paramWithScan, lazySchema)
```

This triggers a call to the _mapping API, which attempts to retrieve the entire mapping for the index — including problematic fields that aren't actually required by the query.

Proposal:

I suggest one of the following improvements:

  1. Use the user-specified schema from spark.read.schema(...).options(...).load() directly to construct ScalaEsRowRDD, bypassing the need to fetch or validate unused fields (sketched below).

  2. Modify lazySchema logic to only resolve and validate the projected subset of columns that are actually used in the query.

This change would prevent failures caused by irrelevant mapping conflicts and would align with how Spark typically prunes columns during logical planning.
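
To illustrate option 1 from the caller's side (a sketch, reusing the placeholder names from above): the schema supplied by the user would be handed to ScalaEsRowRDD as-is, so nothing about field1 would need to be fetched or validated.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical projected schema covering only the fields the job needs;
// the conflicting field1 is deliberately absent.
val projectedSchema = StructType(Seq(
  StructField("field2", StringType),
  StructField("field3", LongType)
))

val dfWithSchema = spark.read
  .format("es")
  .schema(projectedSchema)                         // user-specified schema
  .option("es.resource", "multi-mapping-index-*")  // same placeholder index pattern
  .load()
```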

Willing to Contribute:

I'm happy to submit a PR to explore this improvement if the approach is acceptable.

zeotuan linked a pull request (#2386) May 10, 2025 that will close this issue

zeotuan commented May 11, 2025

It looks like the above is not enough to get this working. In AbstractEsRDD, when accessing esPartitions,

```scala
RestService.findPartitions(esCfg, logger)
```

is also invoked, which again attempts to retrieve the full mapping.
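
To illustrate with the placeholder read from the first comment: running any action still goes through partition planning, so the full mapping is fetched again (the call path below is my reading of the code, not an exact trace):

```scala
// Any action triggers partition planning for the underlying RDD:
dfWithSchema.count()
// DataFrame action
//   -> AbstractEsRDD.getPartitions / esPartitions
//   -> RestService.findPartitions(esCfg, logger)
//   -> fetches the full index mapping and hits the same multi-mapping conflict
```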
