[Spec] Linking Schema ID to Data & Delete Files #13855

@manirajv06

Description

Proposed Change

Schemas evolve over time, so data files can have different columns at different points in time. For example, data files created at T1 with schema S1 could have columns C1 to C5, while data files created at T2 with schema S2 could have columns C4 to C10, and so on.

Linking the schema ID to data files would make it easy to look up schema details for any file. For instance, files could be filtered on whether a column can exist in them by comparing the column's field ID with the file's max field ID. The max field ID of a file is simply the max field ID of its linked schema, which is already available and can be used directly. In the example above, C5 is the max field ID for all files linked to S1, and C10 is the max field ID for all files linked to S2.

As another instance, to find out whether Parquet files contain a column of UnknownType, every file currently has to be opened, because no statistics or other metadata expose this. Linking schemas to these files would make that information easy to retrieve. Other schema information could be used in the same way, depending on the requirement.
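To make the two use cases concrete, here is a minimal Python sketch. It is not Iceberg code: the `Schema` and `DataFile` classes, the `schema_id` attribute on a file, and the helper function names are all hypothetical and exist only to illustrate the idea, with S1 and S2 modelled after the example above and field IDs 1 to 10 standing in for C1 to C10.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical stand-in for a table schema: field id -> type name."""
    schema_id: int
    field_types: dict[int, str]

    @property
    def max_field_id(self) -> int:
        return max(self.field_types)


@dataclass(frozen=True)
class DataFile:
    """Hypothetical data file entry carrying the proposed schema id link."""
    path: str
    schema_id: int


def files_possibly_containing(files, schemas, field_id):
    """Keep files whose linked schema's max field id is >= field_id.

    A file is skipped only when the field id is larger than anything its
    schema ever assigned, so the column cannot be present in that file.
    """
    return [f for f in files if field_id <= schemas[f.schema_id].max_field_id]


def files_with_type(files, schemas, type_name):
    """Keep files whose linked schema has at least one field of type_name."""
    return [
        f for f in files
        if type_name in schemas[f.schema_id].field_types.values()
    ]


# S1 has fields 1..5; S2 has fields 4..10, with field 10 of unknown type.
S1 = Schema(1, {i: "string" for i in range(1, 6)})
S2 = Schema(2, {**{i: "string" for i in range(4, 10)}, 10: "unknown"})
schemas = {s.schema_id: s for s in (S1, S2)}

files = [DataFile("t1.parquet", schema_id=1), DataFile("t2.parquet", schema_id=2)]

# Field 7 was added after S1, so only the file linked to S2 survives.
candidates = files_possibly_containing(files, schemas, field_id=7)

# No Parquet file is opened to answer this; only schema metadata is read.
unknown_typed = files_with_type(files, schemas, "unknown")
```

In this sketch the max field ID comparison only rules files out. A field ID at or below the max may still be absent from a file (field 3 is below S2's max of 10 but is not in S2), so an exact membership check against the linked schema's fields would be needed for a definite answer.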

I propose linking the schema ID to data and delete files, which would be useful for file- and schema-related operations going forward.

Proposal document

No response

Specifications

  • Table
  • View
  • REST
  • Puffin
  • Encryption
  • Other


Labels: proposal (Iceberg Improvement Proposal: spec/major changes/etc)
