Add ability to specify schema of pandas columns #603

@ektar

Inspired by #600; see the further discussion there.

Problem:

  • DataFrames have dimension checking (number of rows/columns) and column-name checking, but no dtype checking.
  • Dtype checking would be particularly useful on deserialization: datetimes and numbers can be ambiguous in JSON, and they are currently loaded as ints or strings, depending on how they were serialized.

Proposal:

  • Add a "schema" argument to the DataFrame init function that accepts a dict of dtypes, in the same formats as pandas' astype.
  • For the columns that schema is set for, check the dtypes in _validate. This would likely involve casting the columns specified in schema and failing on error. The cast result could potentially also be saved to the dataframe so downstream code would not have to do the casting.
  • Allow schema to be used by the JSON deserializer: cast the specified columns after pandas.read_json.
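The validation step proposed above could look roughly like the following. This is only a sketch under assumptions: validate_schema is a hypothetical standalone helper, not an existing function, and it stands in for whatever the real _validate hook would do. It relies only on pandas' DataFrame.astype, which accepts a dict mapping column names to dtypes.

```python
import pandas as pd


def validate_schema(df, schema):
    """Cast the columns named in `schema`; raise ValueError if a cast fails.

    `validate_schema` and `schema` are hypothetical names used for illustration.
    """
    # astype raises KeyError for unknown columns, so report them explicitly first
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"schema names columns not in DataFrame: {sorted(missing)}")
    try:
        # With a dict, astype casts only the listed columns and returns a new frame
        return df.astype(schema)
    except (ValueError, TypeError) as exc:
        raise ValueError(f"DataFrame does not match schema: {exc}") from exc


df = pd.DataFrame({"a": ["1", "2"], "b": [1.5, 2.5], "c": ["x", "y"]})
# Columns not listed in the schema ("c" here) are left untouched
checked = validate_schema(df, {"a": "int64", "b": "float64"})
```

Returning the cast frame rather than only a pass/fail result matches the idea of saving the cast so downstream code does not have to repeat it.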

Example of the problem with dates. Full recovery is only possible when "iso" string output is used instead of epoch, and the column is then cast from str to datetime by pandas:

from io import StringIO

import pandas as pd
from IPython.display import display

# Column 'a' holds naive timestamps, column 'b' plain integers
df = pd.DataFrame({'a': [pd.Timestamp('20200309'),
                         pd.Timestamp('20200309')],
                   'b': [1, 2]})
# Also works with time-zone aware timestamps
# df = pd.DataFrame({'a': [pd.Timestamp('20200309T120000.000000-0000'),
#                          pd.Timestamp('20200309T130000.000000-0000')],
#                    'b': [1, 2]})
display(df)

# Case 1: the default to_json output writes dates as epoch milliseconds
df_json_1 = df.to_json()
display(df_json_1)
# Wrap the string in StringIO; newer pandas versions deprecate passing literal JSON
df_deser_1 = pd.read_json(StringIO(df_json_1))
display(df_deser_1)
display(df_deser_1.dtypes)
# Casting the integer column does not restore the original dates
df_deser_1 = df_deser_1.astype({'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')})
display(df_deser_1)
display(df_deser_1.dtypes)

# Case 2: ISO 8601 strings; 'a' is read back as str and can be cast to datetime
df_json_2 = df.to_json(date_format='iso')
display(df_json_2)
df_deser_2 = pd.read_json(StringIO(df_json_2))
display(df_deser_2)
display(df_deser_2.dtypes)
display(type(df_deser_2.loc[0, 'a']))
df_deser_2 = df_deser_2.astype({'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')})
display(df_deser_2)
display(df_deser_2.dtypes)
display(type(df_deser_2.loc[0, 'a']))
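
For comparison, a schema-aware deserializer along the lines of the proposal might be sketched as below. read_json_with_schema is a hypothetical name, and the choice to pass convert_dates=False (so that the schema, not pandas' guessing, decides the dtypes) is an assumption rather than part of the proposal.

```python
from io import StringIO

import pandas as pd


def read_json_with_schema(json_str, schema):
    """Hypothetical helper: load JSON, then cast the columns listed in `schema`."""
    # convert_dates=False stops pandas from guessing; the schema decides instead
    df = pd.read_json(StringIO(json_str), convert_dates=False)
    return df.astype(schema)


original = pd.DataFrame({"a": [pd.Timestamp("2020-03-09", tz="UTC"),
                               pd.Timestamp("2020-03-10", tz="UTC")],
                         "b": [1, 2]})
schema = {"a": pd.api.types.DatetimeTZDtype(unit="ns", tz="UTC"), "b": "int64"}

# ISO strings carry enough information for the cast to restore the dates
restored = read_json_with_schema(original.to_json(date_format="iso"), schema)
```

The same helper would not recover dates from the default epoch output, since the cast has no way to know the integers were written in milliseconds.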
