Commit d1134a6

[df] Add more docs to the Snapshot with variations section
1 parent fcf1cdc commit d1134a6

1 file changed, +22 -4 lines changed
tree/dataframe/src/RDataFrame.cxx

Lines changed: 22 additions & 4 deletions
@@ -1265,10 +1265,25 @@ In that case, RDataFrame will snapshot the filtered columns in a memory-efficien
 default-constructed object in case of classes. If none of the filters pass like in row 6, the entire event is omitted from the snapshot.
 
 To tell apart a genuine `0` (like `x` in row 0) from a variation that didn't pass the selection, RDataFrame writes a bitmask for each event, indicating which variations
-are valid (see last column). A mapping of column names to this bitmask is placed in the same file as the output dataset, and automatically loaded when
-RDataFrame opens a file that was snapshot with variations.
-Attempting to read such missing values with RDataFrame will produce an error, but RDataFrame can either skip these values or fill in defaults as
-described in the \ref missing-values "section on dealing with missing values".
+are valid (see last column). The bitmask is implemented as a 64-bit `std::bitset` in memory, written to the output dataset as a `std::uint64_t`.
+Thus, a new bitmask must be written for every 64 columns, to accommodate the bits for the next 64 columns.
+
+Each column stored in the output is connected to exactly one bit in one bitmask. A mapping of column names to the corresponding bitmask is placed in the
+same file as the output dataset, with a name that follows the pattern `"R_rdf_branchToBitmaskMapping_" + NAME_OF_THE_DATASET`. In this mapping, each
+column name is connected to one bitmask, and one particular bit in that bitmask. For example, in the same file as the dataset "Events" there would be
+an object named `R_rdf_branchToBitmaskMapping_Events`. This object would describe, for example, a connection such as:
+
+~~~
+muon_pt --> (R_rdf_mask_Events_0, 42)
+~~~
+
+which means that the validity of the column `muon_pt` is established by the bit `42` in the bitmask found in the column `R_rdf_mask_Events_0`.
+
+When RDataFrame opens a file, it checks for the existence of this mapping between columns and bitmasks, and loads it automatically if found. As such,
+RDataFrame makes the handling of the bitmasks completely transparent to the user.
+
+If a value is labeled invalid by its corresponding bit, reading it results in a missing value. The semantics of such a scenario follow the
+rules described in the \ref missing-values "section on dealing with missing values" and can be dealt with accordingly.
 
 \note Snapshot with variations is currently restricted to single-threaded TTree snapshots.
 
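The bitmask layout documented in the hunk above can be illustrated with a small standalone sketch. It does not use ROOT and is not RDataFrame's implementation: `MaskLocation`, `NumberOfMasks`, `LocateColumn` and `IsValid` are hypothetical names introduced here. The sketch also assumes that columns are assigned to masks in order, that mask columns are numbered with a trailing index as in `R_rdf_mask_Events_0`, and that a set bit means "valid".

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for one entry of the column-to-bitmask mapping:
// the name of the mask column and the bit index inside that mask.
struct MaskLocation {
   std::string maskColumn;
   unsigned bit;
};

// Number of 64-bit masks needed so that each of nColumns columns owns one bit.
inline std::size_t NumberOfMasks(std::size_t nColumns)
{
   return (nColumns + 63) / 64;
}

// Assumption: the column with 0-based position `index` uses mask number index / 64
// and bit index % 64, and mask columns are named "R_rdf_mask_<dataset>_<number>".
inline MaskLocation LocateColumn(const std::string &datasetName, std::size_t index)
{
   return {"R_rdf_mask_" + datasetName + "_" + std::to_string(index / 64),
           static_cast<unsigned>(index % 64)};
}

// Interpret the std::uint64_t read from a mask column as a 64-bit bitset and
// test one bit. Assumption: a set bit marks the value as valid.
inline bool IsValid(std::uint64_t maskValue, unsigned bit)
{
   assert(bit < 64);
   const std::bitset<64> mask(maskValue);
   return mask.test(bit);
}
```

Under these assumptions, `LocateColumn("Events", 42)` yields the pair `(R_rdf_mask_Events_0, 42)` from the documented example, and the 65th column would spill into a second mask.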
@@ -1780,6 +1795,9 @@ more of its entries. For example:
 - When joining different datasets horizontally according to some index value
   (e.g. the event number), if the index does not find a match in one or more
   other datasets for a certain entry.
+- If, for a certain event, the value of a certain column is invalid because
+  it results from a previous processing step that involved systematic variations
+  and that value was removed by a selection. For more details, see \ref snapshot-with-variations.
 
 For example, suppose that column "y" does not have a value for entry 42:
 
