The identity partition path of timestamp type is inconsistent with java api #1735
Comments
Hi @sharkdtu, my understanding is that Iceberg does not make any guarantees about the paths of the data files, since it relies on links in the metadata to connect the data files of a snapshot together (as opposed to Hive partitioning). Is there a reason why you need consistent file paths for your use case? I think it would be helpful to understand your motivation so we can think of a holistic way of solving the problem. Previous discussion: #429
@sungwy Yes, Iceberg does not use file paths to distinguish partitions. While this does not affect correctness, I believe it is best to keep the behavior of the different APIs consistent; otherwise, using the Python and Java APIs side by side may create a misleading impression. In actual production systems, DevOps personnel need to monitor the storage usage and number of files for tables. Although this information can be obtained through Iceberg metadata, the actual physical storage may differ from the Iceberg metadata due to orphan files, residual deleted files, and other reasons. Therefore, it is often necessary to check the physical storage information corresponding to the table/partition paths. If the files of a partition are scattered across multiple paths, it can cause significant trouble for operations and maintenance.
Hi @sharkdtu, thank you for the explanation! I think that's an interesting use of the location prefix. If I understand correctly, are you running the DeleteOrphanFiles Spark procedure with location configured as partition paths?
@sungwy In addition to maintenance procedures, we also collect physical storage info for partitions, which has to be obtained through the location.
Feature Request / Improvement
The human-readable string of a timestamp value is inconsistent with the Java API, which causes different partition paths.
For example, let's say I have a table with the following representation:
the data path written by pyiceberg is:
however, the data path written by Spark is:
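As a rough illustration of how two timestamp-to-string conversions can disagree (this is not the table or the paths from the report), here is a minimal sketch. It assumes the Python side formats with `datetime.isoformat()` and the Java side with `java.time.LocalDateTime.toString()`; whether these are the exact calls each library uses is an assumption, and `java_style` is a hypothetical helper name. `LocalDateTime.toString()` emits the shortest form that keeps the value intact (it drops zero seconds and trims the fraction to 3 or 6 digits), while `isoformat()` always keeps the seconds and prints any nonzero fraction as 6 digits.

```python
from datetime import datetime


def python_style(ts: datetime) -> str:
    # Default isoformat(): seconds are always present, and a nonzero
    # fraction is always printed as six digits.
    return ts.isoformat()


def java_style(ts: datetime) -> str:
    # Hypothetical helper that mimics java.time.LocalDateTime.toString()
    # for microsecond-precision values: pick the shortest representation.
    if ts.microsecond % 1000:
        spec = "microseconds"   # e.g. 12:00:00.000001
    elif ts.microsecond:
        spec = "milliseconds"   # e.g. 12:00:00.123
    elif ts.second:
        spec = "seconds"        # e.g. 12:00:05
    else:
        spec = "minutes"        # e.g. 12:00
    return ts.isoformat(timespec=spec)
```

Under these assumptions, `datetime(2023, 1, 1, 12, 0, 0)` gives `2023-01-01T12:00:00` in the Python style but `2023-01-01T12:00` in the Java style, so an identity partition directory derived from each string would differ.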