@PaleNeutron PaleNeutron commented Oct 20, 2023

Description

see #1671

Consider PIT data, and assume we have T trade days and N report_period records:

   date                 report_period  value
0  2011-10-18 00:00:00  201103         0.318919
1  2012-03-23 00:00:00  201104         0.4039
2  2012-04-11 00:00:00  201004         0.403925
3  2012-04-11 00:00:00  200904         0.403925

We access the PIT table in 3 ways:

1. observe latest data each trade day

Just loop through the table once and keep only the latest report_date value. This costs O(N).
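
The single pass described above can be sketched in plain Python. This is only an illustration under assumptions, not the code in this PR: the tuple layout and the names `RECORDS` and `latest_value_per_day` are hypothetical, and "latest" is read here as "most recently published". The rows are the sample table above.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (publish_date, report_period, value), sorted by publish_date.
Record = Tuple[str, int, float]

RECORDS: List[Record] = [
    ("2011-10-18", 201103, 0.318919),
    ("2012-03-23", 201104, 0.4039),
    ("2012-04-11", 201004, 0.403925),
    ("2012-04-11", 200904, 0.403925),
]


def latest_value_per_day(records: List[Record], trade_days: List[str]) -> List[Optional[float]]:
    """For each trade day, return the value of the most recently published record.

    Both inputs must be sorted ascending. ISO date strings compare correctly
    as plain strings, so no date parsing is needed in this sketch.
    """
    out: List[Optional[float]] = []
    i = 0                           # index of the next record not yet visible
    latest: Optional[float] = None  # None until the first record is published
    for day in trade_days:
        # Advance past every record published on or before this trade day.
        while i < len(records) and records[i][0] <= day:
            latest = records[i][2]
            i += 1
        out.append(latest)
    return out
```

Each record is visited exactly once, so the pass over the table stays linear in N, plus one step per trade day.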

2. observe the latest several report_period values, for expressions like P(Mean($$roewa_q, 2))

Read the data file once.

  • Loop through the trade days and slice data[:tradeday],
    • group by report_period and take the last item of each group.
    • return the last X items

The algorithm could be improved by looping backwards from the end until X different periods are found. But groupby uses a C-level loop, which should be faster.
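
The slice, group, and take-last-X steps can be sketched for a single trade day as follows. This is a plain-Python illustration under assumptions: it uses a dict instead of a pandas groupby, the names `RECORDS` and `last_periods_as_of` are hypothetical, and "last X items" is read as the X largest report periods. The rows are the sample table above.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical layout: (publish_date, report_period, value), sorted by publish_date.
Record = Tuple[str, int, float]

RECORDS: List[Record] = [
    ("2011-10-18", 201103, 0.318919),
    ("2012-03-23", 201104, 0.4039),
    ("2012-04-11", 201004, 0.403925),
    ("2012-04-11", 200904, 0.403925),
]


def last_periods_as_of(records: List[Record], trade_day: str, x: int) -> List[Tuple[int, float]]:
    """Return the latest value of the last `x` report periods visible on `trade_day`."""
    if x <= 0:
        return []
    # Slice: keep only records published on or before the trade day.
    dates = [r[0] for r in records]
    end = bisect_right(dates, trade_day)
    # Group by report_period and keep the last item: later rows overwrite earlier ones.
    last_by_period = {}
    for _, period, value in records[:end]:
        last_by_period[period] = value
    # Sort by period and return the last x entries.
    return sorted(last_by_period.items())[-x:]
```

For an expression such as P(Mean($$roewa_q, 2)), the values returned for x = 2 would then be averaged.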

3. observe specific period from each trade day

Get all data belonging to the given period.
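
One way to read this access pattern: filter the table down to a single report_period, then report what was known about that period on each trade day. The sketch below assumes that reading; the names `RECORDS` and `observe_period` are hypothetical and the rows are the sample table above.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (publish_date, report_period, value), sorted by publish_date.
Record = Tuple[str, int, float]

RECORDS: List[Record] = [
    ("2011-10-18", 201103, 0.318919),
    ("2012-03-23", 201104, 0.4039),
    ("2012-04-11", 201004, 0.403925),
    ("2012-04-11", 200904, 0.403925),
]


def observe_period(records: List[Record], period: int, trade_days: List[str]) -> List[Optional[float]]:
    """For each trade day, return the latest published value of one report period."""
    # Keep only the rows that belong to the requested period.
    revisions = [(date, value) for date, p, value in records if p == period]
    out: List[Optional[float]] = []
    i = 0
    latest: Optional[float] = None  # None while the period is still unpublished
    for day in trade_days:
        # Pick up every revision published on or before this trade day.
        while i < len(revisions) and revisions[i][0] <= day:
            latest = revisions[i][1]
            i += 1
        out.append(latest)
    return out
```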

How Has This Been Tested?

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

image

Types of changes

  • Fix bugs
  • Add new feature
  • Update documentation

@github-actions github-actions bot added the waiting for triage Cannot auto-triage, wait for triage. label Oct 20, 2023
@PaleNeutron
Author

Can anyone fix the main branch? CI fails because of a problem on main.

Contributor

@Fivele-Li Fivele-Li left a comment

It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

qlib/scripts/dump_pit.py

Lines 198 to 204 in 98f569e

if not overwrite and index_file.exists():
    with open(index_file, "rb") as fi:
        (first_year,) = struct.unpack(self.PERIOD_DTYPE, fi.read(self.PERIOD_DTYPE_SIZE))
        n_years = len(fi.read()) // self.INDEX_DTYPE_SIZE
        if interval == self.INTERVAL_quarterly:
            n_years //= 4
        start_year = first_year + n_years

@PaleNeutron
Author

The whole dump_pit.py should be rewritten, since we now implement FilePitStorage. The dump code should then look like this:

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)

@PaleNeutron
Author

@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using LocalFeatureStorage and LocalCalendarStorage.

@CharlieChi

The current online update tools seem to be incompatible with these modifications. Would you mind checking?

@PaleNeutron
Author

@CharlieChi, which command failed? It has been a long time since this PR was created, and I am not sure about the current workflow.

@CharlieChi

CharlieChi commented Sep 3, 2024

qlib/workflow/online/update.py
        start_time_buffer = get_date_by_shift(
            self.last_end, -hist_ref + 1, clip_shift=False, freq=self.freq  # pylint: disable=E1130
        )
        start_time = get_date_by_shift(self.last_end, 1, freq=self.freq)
        seg = {"test": (start_time, self.to_date)}
        return self.rmdl.get_dataset(
            start_time=start_time_buffer, end_time=self.to_date, segments=seg, unprepared_dataset=unprepared_dataset
        )

Here, when using a model with PIT features and updating predictions over a short time range, such as a single day, this dataset call returns an empty DataFrame. With a long time range (one year between start_time and end_time), it works fine.
