Improve pit performance #1673

PaleNeutron · 2023-10-20T03:15:46Z

Description

Consider PIT (point-in-time) data with T trade days and N `report_period` records:

	date	report_period	value
0	2011-10-18 00:00:00	201103	0.318919
1	2012-03-23 00:00:00	201104	0.4039
2	2012-04-11 00:00:00	201004	0.403925
3	2012-04-11 00:00:00	200904	0.403925

We access the PIT table in three ways:

1. Observe the latest data on each trade day

Loop through the table once and keep only the latest reported value. This costs O(N).
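A minimal sketch of this single-pass access pattern, not the code in this PR. It assumes the records are sorted by `date`, and it interprets "latest" as the value of the newest `report_period` published on or before the trade day. The names `latest_per_trade_day`, `records`, and `trade_days` are hypothetical.

```python
import math
from datetime import date


def latest_per_trade_day(records, trade_days):
    """Return one value per trade day.

    records: (publish_date, report_period, value) tuples sorted by publish_date.
    trade_days: sorted dates.
    Assumption: "latest" means the newest report_period seen so far.
    """
    out = []
    i = 0  # pointer into records, only ever moves forward
    best_period = None
    best_value = math.nan
    for day in trade_days:
        # Consume every record published on or before this trade day.
        while i < len(records) and records[i][0] <= day:
            _, period, value = records[i]
            if best_period is None or period >= best_period:
                best_period, best_value = period, value
            i += 1
        out.append(best_value)
    return out


# The four sample rows from the table above.
records = [
    (date(2011, 10, 18), 201103, 0.318919),
    (date(2012, 3, 23), 201104, 0.4039),
    (date(2012, 4, 11), 201004, 0.403925),
    (date(2012, 4, 11), 200904, 0.403925),
]
trade_days = [date(2011, 10, 17), date(2011, 10, 18), date(2012, 3, 23), date(2012, 4, 12)]
values = latest_per_trade_day(records, trade_days)
```

Each record is visited once, so the whole pass costs O(T + N) for T trade days. Trade days before the first publication get NaN.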

2. Observe the latest several `report_period` values, for expressions like `P(Mean($$roewa_q, 2))`

Read the data file once.

Loop through the trade days and slice `data[:tradeday]`, then:
- group by `report_period` and take the last item of each group;
- return the last X items.

The algorithm could be improved by scanning backward from the end until X different periods are found. However, groupby uses a C-level loop, which should be faster.
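The groupby variant can be sketched with pandas as follows. This is an illustration under assumptions, not the code in this PR: the function name `latest_periods` and the parameter `x` are hypothetical, the frame is assumed to be sorted by `date`, and the sketch relies on `groupby` sorting the group keys by default so that `tail(x)` picks the x newest periods.

```python
import pandas as pd


def latest_periods(df, trade_day, x):
    """Latest value of each of the x newest report periods visible on trade_day.

    Assumes df has columns date, report_period, value and is sorted by date.
    """
    # Keep only rows already published on this trade day.
    visible = df[df["date"] <= trade_day]
    # Last row per period is the most recent revision of that period.
    latest = visible.groupby("report_period")["value"].last()
    # Group keys are sorted ascending, so the tail holds the newest periods.
    return latest.tail(x)


# The four sample rows from the table above.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2011-10-18", "2012-03-23", "2012-04-11", "2012-04-11"]),
        "report_period": [201103, 201104, 201004, 200904],
        "value": [0.318919, 0.4039, 0.403925, 0.403925],
    }
)
result = latest_periods(df, pd.Timestamp("2012-04-12"), 2)
early = latest_periods(df, pd.Timestamp("2011-10-18"), 2)
```

When fewer than x periods are visible, as on the first publication date, the result simply holds fewer rows.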

3. Observe a specific period on each trade day

Get all data belonging to the given period.
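One way to picture this third access pattern is to filter the records down to the requested period and then, for each trade day, pick the last row published on or before that day. The sketch below is an assumption about how this could be done, not the implementation in this PR. The name `period_value_per_trade_day` is hypothetical.

```python
import math
from bisect import bisect_right
from datetime import date


def period_value_per_trade_day(records, trade_days, period):
    """Value of one report period as known on each trade day.

    records: (publish_date, report_period, value) tuples sorted by publish_date.
    Returns NaN for trade days before the period was first published.
    """
    # Filtering preserves publish order, so dates stays sorted.
    rows = [(d, v) for d, p, v in records if p == period]
    dates = [d for d, _ in rows]
    out = []
    for day in trade_days:
        k = bisect_right(dates, day)  # rows published on or before day
        out.append(rows[k - 1][1] if k else math.nan)
    return out


# The four sample rows from the table above.
records = [
    (date(2011, 10, 18), 201103, 0.318919),
    (date(2012, 3, 23), 201104, 0.4039),
    (date(2012, 4, 11), 201004, 0.403925),
    (date(2012, 4, 11), 200904, 0.403925),
]
trade_days = [date(2012, 3, 22), date(2012, 3, 23), date(2012, 4, 12)]
series = period_value_per_trade_day(records, trade_days, 201104)
```

The binary search keeps each lookup at O(log n) over the rows of the chosen period, which is usually a small subset of the table.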

How Has This Been Tested?

Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
If you are adding a new feature, test on your own test scripts.

Screenshots of Test Results (if appropriate):

Types of changes

Fix bugs
Add new feature
Update documentation

PaleNeutron · 2023-11-09T12:34:05Z

Can anyone fix the main branch? CI is failing because of a problem on main.

Fivele-Li

It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

qlib/scripts/dump_pit.py

Lines 198 to 204 in 98f569e

    
    if not overwrite and index_file.exists():
        with open(index_file, "rb") as fi:
            (first_year,) = struct.unpack(self.PERIOD_DTYPE, fi.read(self.PERIOD_DTYPE_SIZE))
            n_years = len(fi.read()) // self.INDEX_DTYPE_SIZE
            if interval == self.INTERVAL_quarterly:
                n_years //= 4
            start_year = first_year + n_years

qlib/data/pit.py

qlib/utils/__init__.py

PaleNeutron · 2023-11-28T10:33:39Z

> It seems that the index file mentioned here and the _next column in the data file will not be used in this PR. Are you going to delete them together?

The whole dump_pit.py should be rewritten now that FilePitStorage is implemented. The dump script would then look like:

s = FilePitStorage("000001.SZ", "ROE")
s.write(np_data)

PaleNeutron · 2023-12-07T06:32:19Z

@Fivele-Li, I think rewriting the dump scripts could be done in another PR, since the normal feature dump script should also be rewritten using LocalFeatureStorage and LocalCalendarStorage.

CharlieChi · 2024-09-02T13:58:59Z

The current online update tools seem to be incompatible with these modifications. Would you mind checking?


PaleNeutron · 2024-09-03T02:13:09Z

@CharlieChi, which command failed? It has been a long time since this PR was created, and I am not sure about the current workflow.

CharlieChi · 2024-09-03T03:52:06Z

> @CharlieChi, which command failed? It has been a long time since this PR was created, and I am not sure about the current workflow.

qlib/workflow/online/update.py
        start_time_buffer = get_date_by_shift(
            self.last_end, -hist_ref + 1, clip_shift=False, freq=self.freq  # pylint: disable=E1130
        )
        start_time = get_date_by_shift(self.last_end, 1, freq=self.freq)
        seg = {"test": (start_time, self.to_date)}
        return self.rmdl.get_dataset(
            start_time=start_time_buffer, end_time=self.to_date, segments=seg, unprepared_dataset=unprepared_dataset
        )

Here, when using a model with PIT features and updating predictions over a short time range, such as one day, the dataset returns an empty DataFrame. With a long time range (one year between start_time and end_time), it works fine.

John Lyu added 2 commits October 19, 2023 21:33

improve pit performance

192ddc8

improve pit cache

afff257

github-actions bot added the waiting for triage Cannot auto-triage, wait for triage. label Oct 20, 2023

John Lyu and others added 14 commits October 20, 2023 11:16

lint

6c214aa

deal with empty data

a144bc9

add pit backend: FilePITStorage

3ed3f17

improve docstring

61c31ca

remove index file check

d82ab8d

pit rewrite does not need index

8702049

fix typo

e07487d

make sure dir exist

8d96bd6

fix parents not exist

4213b68

fix pitstorage update

8a354ef

check dtype

dbfe153

fix empty data

20889ca

lint

5c16123

deal with empty data file

31c3747

Fivele-Li reviewed Nov 28, 2023


remove useless function

ef9242e

John Lyu added 8 commits January 12, 2024 16:07

improve pit performance

87d65e1

improve pit cache

b53bae6

lint

23f16b9

deal with empty data

1a349d0

add pit backend: FilePITStorage

f340776

improve docstring

38a04b6

remove index file check

07cff6b

pit rewrite does not need index

bdf8060

John Lyu and others added 9 commits January 12, 2024 16:07

fix typo

e3fff65

make sure dir exist

c754290

fix parents not exist

74fd9cb

fix pitstorage update

41648b9

check dtype

ca0d4bb

fix empty data

de9e6cf

lint

e093a83

deal with empty data file

52c5cba

remove useless function

8dfc393

SunsetWolf force-pushed the main branch from 702de78 to 194284b Compare May 7, 2024 06:20

John Lyu added 2 commits September 3, 2024 10:06

Merge branch 'pit_fix' of https://github.com/PaleNeutron/qlib into pi…

e42496a

…t_fix

Merge remote-tracking branch 'upstream/main' into pit_fix

958291e
