The identity partition path of timestamp type is inconsistent with java api #1735

Open
sharkdtu opened this issue Feb 27, 2025 · 4 comments · May be fixed by #1736

Comments

@sharkdtu

sharkdtu commented Feb 27, 2025

Feature Request / Improvement

The human string of a timestamp value is inconsistent with the Java API, which causes different partition paths.

For example, let's say I have a table with the following representation:

table {
  1: hour: optional timestamptz
  2: id: optional long
  3: name: optional string
},
partition by: [hour],
...

the data path written by pyiceberg is:

/path/to/namespace/table/data/hour=2025-02-27T00%3A00%3A00%2B00%3A00

however, the data path written by Spark is:

/path/to/namespace/table/data/hour=2025-02-27T00%3A00Z

Java API ref: https://github.com/apache/iceberg/blob/a582968975dd30ff4917fbbe999f1be903efac02/api/src/main/java/org/apache/iceberg/transforms/TransformUtil.java#L57
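The divergence above can be reproduced in plain Python. This is a minimal sketch (not pyiceberg's actual implementation) assuming the Python side derives the human string via `datetime.isoformat()` and then URL-encodes it, while the Java side (per `OffsetDateTime.toString()`) omits the zero seconds field and writes UTC as `Z`:

```python
from datetime import datetime, timezone
from urllib.parse import quote

ts = datetime(2025, 2, 27, 0, 0, 0, tzinfo=timezone.utc)

# Python's isoformat() always includes seconds and a "+00:00" offset;
# URL-encoding the result turns ':' into %3A and '+' into %2B.
py_human = ts.isoformat()            # '2025-02-27T00:00:00+00:00'
py_path = f"hour={quote(py_human)}"

# Java's OffsetDateTime.toString() omits the zero seconds field and
# renders the UTC offset as 'Z', producing a different path segment.
java_human = "2025-02-27T00:00Z"
java_path = f"hour={java_human}"

print(py_path)   # hour=2025-02-27T00%3A00%3A00%2B00%3A00
print(java_path) # hour=2025-02-27T00:00Z
```

The two `hour=` segments match the observed pyiceberg and Spark paths, respectively.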

@sungwy
Collaborator

sungwy commented Feb 27, 2025

Hi @sharkdtu, my understanding is that Iceberg does not make any guarantees on the paths of the data files, as it relies on links to connect the data files of a snapshot together (as opposed to Hive-style partitioning).

Is there a reason why you need consistent file paths for your use case? I think it would be helpful to understand your motivation so we can think of a holistic way of solving the problem.

Previous discussion: #429

@sharkdtu
Author

sharkdtu commented Feb 28, 2025

@sungwy Yes, Iceberg does not use file paths to distinguish partitions. While this does not affect correctness, I believe it is best to keep the behavior of the different APIs consistent; otherwise, using the Python and Java APIs together can create a misleading impression.

In real production systems, DevOps personnel need to monitor the storage usage and file counts of tables. Although this information can be obtained from Iceberg metadata, the actual physical storage may differ from the metadata due to orphan files, residual deleted files, and other causes. Therefore, it is often necessary to check the physical storage corresponding to the table/partition paths. If the files of one partition are scattered across multiple paths, it causes significant trouble for operations and maintenance.
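If path consistency were desired, one option would be for the Python side to format the human string the way the Java API does. The following is a hypothetical sketch (the helper name is mine, not pyiceberg's API) that approximates Java's `OffsetDateTime.toString()` for whole-second UTC timestamps, ignoring sub-second precision:

```python
from datetime import datetime, timezone

def to_java_style_human(ts: datetime) -> str:
    # Hypothetical helper: approximate Java's OffsetDateTime.toString()
    # for UTC timestamps -- omit the seconds field when it is zero, and
    # write the UTC offset as 'Z' instead of '+00:00'.
    s = ts.strftime("%Y-%m-%dT%H:%M")
    if ts.second:
        s += f":{ts.second:02d}"
    return s + "Z"

print(to_java_style_human(datetime(2025, 2, 27, tzinfo=timezone.utc)))
# 2025-02-27T00:00Z
```

Because the result contains no `:` at second precision only when seconds are zero, the URL-encoded path segment would then match the Spark-written one for midnight timestamps like the example above.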

@sungwy
Collaborator

sungwy commented Mar 3, 2025

Hi @sharkdtu, thank you for the explanation! That's an interesting use of the location prefix. If I understand correctly, are you running the DeleteOrphanFiles Spark procedure with location configured as partition paths?

@sharkdtu
Author

sharkdtu commented Mar 4, 2025

@sungwy
There may be some misunderstanding. The trouble is not with maintenance procedures; our maintenance procedures do not have location configured.

Beyond maintenance procedures, we also collect physical storage information per partition, which has to be obtained through the partition location.
