The identity partition path of timestamp type is inconsistent with java api #1735

Open
sharkdtu opened this issue Feb 27, 2025 · 4 comments · May be fixed by #1736

Comments

@sharkdtu

sharkdtu commented Feb 27, 2025

Feature Request / Improvement

The human string of a timestamp value is inconsistent with the Java API, which causes different partition paths.

For example, let's say I have a table with the following representation:

table {
  1: hour: optional timestamptz
  2: id: optional long
  3: name: optional string
},
partition by: [hour],
...

the data path written by pyiceberg is:

/path/to/namespace/table/data/hour=2025-02-27T00%3A00%3A00%2B00%3A00

however, the data path written by Spark is:

/path/to/namespace/table/data/hour=2025-02-27T00%3A00Z

Java API ref: https://github.com/apache/iceberg/blob/a582968975dd30ff4917fbbe999f1be903efac02/api/src/main/java/org/apache/iceberg/transforms/TransformUtil.java#L57
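The divergence above can be reproduced in plain Python. This is a minimal sketch (not pyiceberg's actual implementation) assuming the Python side derives the human string via `datetime.isoformat()` and then URL-encodes it, while the Java side (per `OffsetDateTime.toString()`) omits the zero seconds field and writes UTC as `Z`:

```python
from datetime import datetime, timezone
from urllib.parse import quote

ts = datetime(2025, 2, 27, 0, 0, 0, tzinfo=timezone.utc)

# Python's isoformat() always includes seconds and a "+00:00" offset;
# URL-encoding the result turns ':' into %3A and '+' into %2B.
py_human = ts.isoformat()            # '2025-02-27T00:00:00+00:00'
py_path = f"hour={quote(py_human)}"

# Java's OffsetDateTime.toString() omits the zero seconds field and
# renders the UTC offset as 'Z', producing a different path segment.
java_human = "2025-02-27T00:00Z"
java_path = f"hour={java_human}"

print(py_path)   # hour=2025-02-27T00%3A00%3A00%2B00%3A00
print(java_path) # hour=2025-02-27T00:00Z
```

The two `hour=` segments match the observed pyiceberg and Spark paths, respectively.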

@sungwy
Collaborator

sungwy commented Feb 27, 2025

Hi @sharkdtu, my understanding is that Iceberg does not make any guarantees on the paths of the data files, as it relies on links to connect the data files of a snapshot together (as opposed to Hive-style partitioning).

Is there a reason why you need consistent file paths for your use case? I think it would be helpful to understand your motivation so we can think of a holistic way of solving the problem.

Previous discussion: #429

@sharkdtu
Author

sharkdtu commented Feb 28, 2025

@sungwy Yes, Iceberg does not use file paths to distinguish partitions. While this does not affect correctness, I believe it is best to keep the behavior of the different APIs consistent; otherwise, using the Python and Java APIs together can create a misleading impression.

In real production systems, DevOps personnel need to monitor the storage usage and file counts of tables. Although this information can be obtained from Iceberg metadata, the actual physical storage may differ from the metadata due to orphan files, residual deleted files, and other causes. Therefore, it is often necessary to check the physical storage corresponding to the table/partition paths. If the files of one partition are scattered across multiple paths, it causes significant trouble for operations and maintenance.
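If path consistency were desired, one option would be for the Python side to format the human string the way the Java API does. The following is a hypothetical sketch (the helper name is mine, not pyiceberg's API) that approximates Java's `OffsetDateTime.toString()` for whole-second UTC timestamps, ignoring sub-second precision:

```python
from datetime import datetime, timezone

def to_java_style_human(ts: datetime) -> str:
    # Hypothetical helper: approximate Java's OffsetDateTime.toString()
    # for UTC timestamps -- omit the seconds field when it is zero, and
    # write the UTC offset as 'Z' instead of '+00:00'.
    s = ts.strftime("%Y-%m-%dT%H:%M")
    if ts.second:
        s += f":{ts.second:02d}"
    return s + "Z"

print(to_java_style_human(datetime(2025, 2, 27, tzinfo=timezone.utc)))
# 2025-02-27T00:00Z
```

Because the result contains no `:` at second precision only when seconds are zero, the URL-encoded path segment would then match the Spark-written one for midnight timestamps like the example above.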

@sungwy
Collaborator

sungwy commented Mar 3, 2025

Hi @sharkdtu, thank you for the explanation! That's an interesting use of the location prefix. If I understand correctly, are you running the DeleteOrphanFiles Spark procedure with location configured as partition paths?

@sharkdtu
Author

sharkdtu commented Mar 4, 2025

@sungwy
There may be some misunderstanding. The trouble is not with maintenance procedures; our maintenance procedures do not have location configured.

Beyond maintenance procedures, we also collect physical storage information per partition, which has to be obtained through the partition location.
