Delete orphan files #1200
Hi @sungwy
Hey, sure thing! I'll assign it to you @omkenge
What is your opinion on this?
That looks generally correct to me. There are a few caveats, though. This assumes that the entire Iceberg table (metadata and data files) lives in a single location and that no other files should exist there. I think a good first step is to figure out all the files belonging to an Iceberg table: given a table, return all metadata and data file paths, including historical lineage, branches, and tags.
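As a rough sketch of that first step, the metadata walk could look like the following. The attribute and method names (`metadata_location`, `metadata_log`, `snapshots`, `manifests`, `fetch_manifest_entry`) mirror pyiceberg's public API as I understand it, so verify them against your version; `known_files` itself is a hypothetical helper, not an existing method:

```python
def known_files(table) -> set:
    """Collect every path the table's metadata references: metadata JSONs,
    manifest lists, manifests, and data/delete files across all snapshots
    (which covers historical lineage, branches, and tags)."""
    paths = {table.metadata_location}
    # Historical metadata.json files from the metadata log
    paths.update(entry.metadata_file for entry in table.metadata.metadata_log)
    for snapshot in table.metadata.snapshots:
        paths.add(snapshot.manifest_list)
        for manifest in snapshot.manifests(table.io):
            paths.add(manifest.manifest_path)
            # Keep deleted entries too; their files may still exist on disk
            for e in manifest.fetch_manifest_entry(table.io, discard_deleted=False):
                paths.add(e.data_file.file_path)
    return paths
```

Anything in the table location that is not in this set is a candidate orphan.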
@omkenge I believe you will need to wait for #1285 to merge. In the meantime, I will work on the partition statistics over the next few weeks. Before that, I believe we will be tracking all the files in the metadata (this needs to be double-checked); with that, you will be able to verify what can be removed. Another point is the filesystem that will be responsible for scanning the directory: FileIO does not solve this, so we will need something else. Perhaps OpenDAL would be a good candidate. As a reference, the Java implementation uses the Hadoop filesystem.
Hello @omkenge, you can start development, but please note that we need the partition statistics. I'll start working on this feature this week. The merge for the orphan files removal implementation will be blocked until we have these statistics, but you can begin the development work. |
Hello @ndrluis @kevinjqliu |
Hi @omkenge, I don't have direct experience with OpenDAL; my suggestion is based on how iceberg-rust is currently using it. For the implementation, I'd recommend using the Java implementation as a reference. Check out these two key files: DeleteOrphanFilesSparkAction.java
I think we want to avoid depending directly on OpenDAL, since that is another dependency. FileIO deliberately does not support listing directories, because listing a directory performs poorly on object stores: it produces a paged response that can have a lot of pages. A catalog might provide a more powerful way of cleaning up orphan files by leveraging S3 Inventory lists, but I don't think that belongs in the client itself. Similar to the Java implementation, which relies on the underlying filesystem, I think we can do something similar in PyIceberg by using the Arrow FileSystem to list the files.
Hello @Fokko, we might want to use all_files and all_metadata_files.
@kevinjqliu take a look at `pyiceberg/io/__init__.py` (line 340 at commit 5c68ad8) in iceberg-python.
Looks like the following will also work directly from a table object:

```python
from pyiceberg import catalog
from pyarrow.fs import FileSelector

CATALOG = catalog.load_catalog(**{"type": "glue"})
table = CATALOG.load_table("my_table_name")

scheme, netloc, path = table.io.parse_location(table.location())
fs = table.io.fs_by_scheme(scheme, netloc)
selector = FileSelector(path, recursive=True)
files = fs.get_file_info(selector)
print(files)
```

Edit: Actually not that bad. One of my Iceberg tables has ~1M files, and it took just around 4 minutes for this method to recursively capture everything in that directory. I believe this is platform agnostic? Basically we can just take the difference of that output against (all_manifests + files for every snapshot). Realistically it makes sense to make a new method on the
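The "difference" step described above can be sketched as a small pure function. Note that `find_orphans`, the 3-day grace-period default, and the shape of the inputs (pyarrow `FileInfo`-like objects with `.path` and `.mtime`) are all illustrative assumptions, not an existing PyIceberg API:

```python
import time


def find_orphans(listed, known, older_than_days=3):
    """Return paths present in the filesystem listing but not referenced by
    the table metadata, skipping files newer than the grace period (an
    in-flight commit may have written files not yet reachable from any
    snapshot)."""
    cutoff = time.time() - older_than_days * 86400
    orphans = []
    for info in listed:  # FileInfo-like: has .path and .mtime (datetime or None)
        if info.path in known:
            continue
        if info.mtime is not None and info.mtime.timestamp() > cutoff:
            continue  # too recent to safely call an orphan
        orphans.append(info.path)
    return orphans
```

The grace period mirrors the `olderThan` safeguard in the Java `deleteOrphanFiles` action, so concurrent writers are not broken by the cleanup.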
Introduce a new API to delete orphan files for a given table
Feature reference: https://iceberg.apache.org/docs/1.5.1/maintenance/#delete-orphan-files