Inspired by #600, further discussion there
Problem:
- DataFrames are checked for dimensions (number of rows/columns) and column names, but not for dtypes
- Dtype checking would be particularly useful during schema deserialization: datetimes and numbers can be ambiguous in JSON, and are currently loaded as ints or strings, depending on how they were serialized
Proposal:
- add a "schema" argument to the df init func that accepts a dict of dtypes, in the same formats as pandas' astype
- for the columns that schema is set for, check in _validate. This would likely involve casting the columns specified in schema and failing on error. The cast result could potentially also be saved to the dataframe, so downstream code wouldn't have to do the casting.
- allow schema to be used by the json deserializer: cast the specified columns after pandas.read_json
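A minimal sketch of how the proposal could look, assuming the schema is a plain dict passed to `DataFrame.astype`. The helper names `apply_schema` and `read_json_with_schema` are hypothetical and only stand in for the `_validate` step and the json deserializer; they are not existing functions in this project.

```python
from io import StringIO

import pandas as pd


def apply_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Cast the columns named in `schema` (hypothetical helper).

    Raises ValueError if a column is missing or a cast fails, which is
    how a dtype check inside a validation step might report the problem.
    """
    missing = set(schema) - set(df.columns)
    if missing:
        raise ValueError(f"schema refers to unknown columns: {sorted(missing)}")
    try:
        # astype accepts a {column: dtype} mapping; other columns are untouched
        return df.astype(schema)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"DataFrame does not match schema: {exc}") from exc


def read_json_with_schema(json_str: str, schema: dict) -> pd.DataFrame:
    """Deserialize with pandas.read_json, then cast the schema columns."""
    df = pd.read_json(StringIO(json_str))
    return apply_schema(df, schema)
```

Returning the cast frame, rather than only checking it, matches the idea of saving the result so downstream code does not have to cast again.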
Example of the problem with dates: full recovery is only possible when "iso" string output is used instead of epoch, and the column is then cast from str to datetime by pandas:
from io import StringIO

import pandas as pd
from IPython.display import display

# Two tz-naive timestamps plus an integer column
df = pd.DataFrame({'a': [pd.Timestamp('20200309'),
                         pd.Timestamp('20200309')],
                   'b': [1, 2]})
# Also works with time-zone aware timestamps
# df = pd.DataFrame({'a': [pd.Timestamp('20200309T120000.000000-0000'),
#                          pd.Timestamp('20200309T130000.000000-0000')],
#                    'b': [1, 2]})
display(df)

# Case 1: default (epoch) output; column 'a' comes back as integers
df_json_1 = df.to_json()
display(df_json_1)
# Wrapped in StringIO: passing a literal JSON string is deprecated since pandas 2.1
df_deser_1 = pd.read_json(StringIO(df_json_1))
display(df_deser_1)
display(df_deser_1.dtypes)
# Casting the epoch integers directly does not give back the original timestamps
df_deser_1 = df_deser_1.astype({'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')})
display(df_deser_1)
display(df_deser_1.dtypes)

# Case 2: "iso" output; column 'a' comes back as strings
df_json_2 = df.to_json(date_format='iso')
display(df_json_2)
df_deser_2 = pd.read_json(StringIO(df_json_2))
display(df_deser_2)
display(df_deser_2.dtypes)
display(type(df_deser_2.loc[0, 'a']))
# Casting the ISO strings recovers the timestamps (as UTC)
df_deser_2 = df_deser_2.astype({'a': pd.api.types.DatetimeTZDtype(unit='ns', tz='UTC')})
display(df_deser_2)
display(df_deser_2.dtypes)
display(type(df_deser_2.loc[0, 'a']))
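The same comparison can be written as a self-checking sketch. It assumes `pd.to_datetime` is used for the conversion instead of `astype`, and `recover_utc` is a hypothetical helper name. It also illustrates why a dtype alone is ambiguous for epoch output: the integers can only be interpreted if the unit (milliseconds by default in `to_json`) is supplied separately, whereas the ISO strings carry enough information by themselves.

```python
from io import StringIO

import pandas as pd


def recover_utc(json_str: str, column: str, date_format: str) -> pd.Series:
    """Read `json_str` and rebuild `column` as UTC timestamps (hypothetical helper)."""
    raw = pd.read_json(StringIO(json_str))[column]
    if date_format == "epoch":
        # to_json writes epoch values in milliseconds by default (date_unit="ms"),
        # so the unit has to be known in addition to the target dtype
        return pd.to_datetime(raw, unit="ms", utc=True)
    # ISO 8601 strings can be parsed without extra information
    return pd.to_datetime(raw, utc=True)


df = pd.DataFrame({'a': [pd.Timestamp('20200309'), pd.Timestamp('20200310')],
                   'b': [1, 2]})
expected = list(df['a'].dt.tz_localize('UTC'))

from_epoch = recover_utc(df.to_json(), 'a', 'epoch')
from_iso = recover_utc(df.to_json(date_format='iso'), 'a', 'iso')
```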
