Commit 97de4fd

mirecl authored and andrey-khropov committed
Pull request "Update model export as code limitations, add one more way to calculate hashes for categorical features" by @mirecl from #19
Pull Request resolved: #19 commit_hash:883676763ebf3074624eb7db9c0c2fa452461bce
1 parent 094b281 commit 97de4fd

2 files changed, +20 -29 lines changed
apply_model/model_export_as_cpp_code_tutorial.md

Lines changed: 3 additions & 15 deletions
@@ -5,25 +5,22 @@ Catboost model could be saved as standalone C++ code. This can ease an integrati
55

66
The exported model code contains complete data for the current trained model and *apply_catboost_model()* function which applies the model to a given dataset. The only current dependency for the code is [CityHash library](https://github.com/google/cityhash/tree/00b9287e8c1255b5922ef90e304d5287361b2c2a) (NOTE: The exact revision under the link is required).
77

8-
9-
### Exporting from Catboost application via command line interface:
8+
### Exporting from Catboost application via command line interface
109

1110
```bash
1211
catboost fit --model-format CPP <other_fit_parameters>
1312
```
1413

1514
By default model is saved into *model.cpp* file. One could alter the output name using *-m* key. If there is more that one model-format specified, then the *.cpp* extention will be added to the name provided after *-m* key.
1615

17-
18-
### Exporting from Catboost python library interface:
16+
### Exporting from Catboost python library interface
1917

2018
```python
2119
model = CatBoost(<train_params>)
2220
model.fit(train_pool)
2321
model.save_model(OUTPUT_CPP_MODEL_PATH, format="CPP")
2422
```
2523

26-
2724
## Models trained with only Float features
2825

2926
If the model was trained using only numerical features (no cat features), then the application function in generated code will have the following interface:
@@ -32,14 +29,12 @@ If the model was trained using only numerical features (no cat features), then t
3229
double ApplyCatboostModel(const std::vector<float>& features);
3330
```
3431
35-
3632
### Parameters
3733
3834
| parameter | description |
3935
|-----------|--------------------------------------------------|
4036
| features | features of a single document to make prediction |
4137
42-
4338
### Return value
4439
4540
Prediction of the model for the document with given features.
@@ -58,7 +53,6 @@ double ApplyCatboostModel(const std::vector<float>& features) {
5853

5954
C++11 support of non-static data member initializers and extended initializer lists
6055

61-
6256
## Models trained with Categorical features
6357

6458
If the model was trained with categorical features present, then the application function in output code will be generated with the following interface:
@@ -67,7 +61,6 @@ If the model was trained with categorical features present, then the application
6761
double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::vector<std::string>& catFeatures);
6862
```
6963
70-
7164
### Parameters
7265
7366
| parameter | description |
@@ -77,7 +70,6 @@ double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::ve
7770
7871
NOTE: You need to pass float and categorical features separately in the same order they appeared in the train dataset. For example if you had features f1,f2,f3,f4, where f2 and f4 were considered categorical, you need to pass here floatFeatures = {f1, f3}, catFeatures = {f2, f4}.
7972
80-
8173
### Return value
8274
8375
Prediction of the model for the document with given features.
@@ -92,21 +84,17 @@ double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::ve
9284
}
9385
```
9486

95-
9687
### Compiler requiremens
9788

9889
C++14 compiler with aggregate member initialization support. Tested compilers: g++ 5(5.4.1 20160904), clang++ 3.8.
9990

100-
10191
## Current limitations
10292

103-
- MultiClassification models are not supported.
10493
- applyCatboostModel() function has reference implementation and may lack of performance comparing to native applicator of CatBoost, especially on large models and multiple of documents.
105-
94+
- [Text](https://catboost.ai/en/docs/features/text-features) and [Embeddings](https://catboost.ai/en/docs/features/embeddings-features) features are not supported.
10695

10796
## Troubleshooting
10897

10998
Q: Generated model results differ from native model when categorical features present
11099
A: Please check that CityHash version 1 is used. Exact required revision of [C++ Google CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56%29). There is also proper CityHash implementation in [Catboost repository](https://github.com/catboost/catboost/blob/master/util/digest/city.h). This is due other versions of CityHash may produce different hash code for the same string.
111100

112-

apply_model/model_export_as_python_code_tutorial.md

Lines changed: 17 additions & 14 deletions
@@ -5,25 +5,22 @@ Catboost model could be saved as standalone Python code. This can ease an integr
55

66
The exported model code contains complete data for the current trained model and *apply_catboost_model()* function which applies the model to a given dataset. The only current dependency for the code is [CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56).
77

8-
9-
### Exporting from Catboost application via command line interface:
8+
### Exporting from Catboost application via command line interface
109

1110
```bash
1211
catboost fit --model-format Python <other_fit_parameters>
1312
```
1413

1514
By default model is saved into *model.py* file, one could alter the output name using *-m* key. If there is more that one model-format specified, then the *.py* extention will be added to the name provided after *-m* key.
1615

17-
18-
### Exporting from Catboost python library interface:
16+
### Exporting from Catboost python library interface
1917

2018
```python
2119
model = CatBoost(<train_params>)
2220
model.fit(train_pool)
2321
model.save_model(OUTPUT_PYTHON_MODEL_PATH, format="python")
2422
```
2523

26-
2724
## Models trained with only Float features
2825

2926
If the model was trained using only numerical features (no cat features), then the application function in generated code will have the following interface:
@@ -32,19 +29,16 @@ If the model was trained using only numerical features (no cat features), then t
3229
def apply_catboost_model(float_features):
3330
```
3431

35-
3632
### Parameters
3733

3834
| parameter | type | description |
3935
|----------------|----------------------------|--------------------------------------------------|
4036
| float_features | list of int or float values| features of a single document to make prediction |
4137

42-
4338
### Return value
4439

4540
Prediction of the model for the document with given features, equivalent to CatBoost().predict(prediction_type='RawFormulaVal').
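
Since the returned value is a raw formula value rather than a probability, a caller may need to convert it. The following is a minimal sketch, assuming a binary classification model trained with the Logloss objective, where the raw value is a log-odds score; `raw_to_probability` is a hypothetical helper and is not part of the exported code.

```python
import math


def raw_to_probability(raw_value: float) -> float:
    """Map a raw log-odds score to a probability with the logistic sigmoid.

    Hypothetical helper: assumes the raw value comes from a binary
    classifier trained with Logloss.
    """
    # Branch on the sign so that math.exp() never receives a large
    # positive argument, which would raise OverflowError.
    if raw_value >= 0:
        return 1.0 / (1.0 + math.exp(-raw_value))
    exp_value = math.exp(raw_value)
    return exp_value / (1.0 + exp_value)
```

For other objectives (for example regression losses) the raw value is typically used as-is, so this conversion would not apply.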
4641

47-
4842
## Models trained with Categorical features
4943

5044
If the model was trained with categorical features present, then the application function in output code will be generated with the following interface:
@@ -53,7 +47,6 @@ If the model was trained with categorical features present, then the application
5347
def apply_catboost_model(float_features, cat_features):
5448
```
5549

56-
5750
### Parameters
5851

5952
| parameter | type | description |
@@ -63,18 +56,28 @@ def apply_catboost_model(float_features, cat_features):
6356

6457
NOTE: You need to pass float and categorical features separately in the same order they appeared in the train dataset. For example if you had features f1,f2,f3,f4, where f2 and f4 were considered categorical, you need to pass here float_features=[f1,f3], cat_features=[f2,f4].
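
The ordering rule in this note can be sketched as a small helper that splits one row into the two lists. This is only an illustration: `split_features` and `cat_indices` are hypothetical names, not part of the exported model code, and the helper assumes the row holds the features in training-dataset order.

```python
from typing import Any, List, Sequence, Tuple


def split_features(
    row: Sequence[Any], cat_indices: Sequence[int]
) -> Tuple[List[Any], List[Any]]:
    """Split a row into (float_features, cat_features), preserving order.

    Hypothetical helper: cat_indices are the zero-based positions of the
    categorical columns in the training dataset.
    """
    cat_set = set(cat_indices)
    # Relative order inside each list follows the original column order.
    float_features = [v for i, v in enumerate(row) if i not in cat_set]
    cat_features = [v for i, v in enumerate(row) if i in cat_set]
    return float_features, cat_features


# Features f1..f4, where f2 and f4 (positions 1 and 3) are categorical.
float_features, cat_features = split_features([1.5, "red", 3, "large"], [1, 3])
```

The two resulting lists can then be passed to `apply_catboost_model(float_features, cat_features)`.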
6558

66-
6759
### Return value
6860

6961
Prediction of the model for the document with given features, equivalent to CatBoost().predict(prediction_type='RawFormulaVal').
7062

71-
7263
## Current limitations
73-
- MultiClassification models are not supported.
74-
- apply_catboost_model() function has reference implementation and may lack of performance comparing to native applicator of CatBoost, especially on large models and multiple of documents.
7564

65+
- apply_catboost_model() function has reference implementation and may lack of performance comparing to native applicator of CatBoost, especially on large models and multiple of documents.
66+
- [Text](https://catboost.ai/en/docs/features/text-features) and [Embeddings](https://catboost.ai/en/docs/features/embeddings-features) features are not supported.
7667

7768
## Troubleshooting
7869

7970
Q: Generated model results differ from native model when categorical features present
80-
A: Please check that the CityHash version 1 is used. Exact required revision of [Python CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also proper CityHash implementation in [Catboost repository](https://github.com/catboost/catboost/tree/master/library/python/cityhash). This is due other versions of CityHash may produce different hash code for the same string.
71+
A: Please check that the CityHash version 1 is used. Exact required revision of [Python CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also proper CityHash implementation in [Catboost repository](https://github.com/catboost/catboost/tree/master/library/python/cityhash). This is due other versions of CityHash may produce different hash code for the same string. One option is to use the library [clickhouse-cityhash](https://pypi.org/project/clickhouse-cityhash/):
72+
73+
```python
74+
from clickhouse_cityhash.cityhash import CityHash64
75+
76+
def calc_cat_feature_hash(value: str):
77+
    value_hash = CityHash64(value.encode('utf-8')) % (2 ** 32)
78+
79+
    if value_hash >= 2 ** 31:
80+
        value_hash -= 2 ** 32
81+
82+
    return value_hash
83+
```
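
The arithmetic in the added snippet keeps the low 32 bits of the 64-bit hash and reinterprets them as a signed 32-bit integer. A minimal sketch of that step in isolation follows; `to_signed_32` is a hypothetical name used only for illustration, and it takes a plain integer so that it does not depend on any CityHash package.

```python
def to_signed_32(unsigned_hash: int) -> int:
    """Reinterpret the low 32 bits of a non-negative integer as signed.

    Hypothetical helper mirroring the wrap-around step of the snippet above.
    """
    # Keep only the low 32 bits.
    value = unsigned_hash % (2 ** 32)
    # Values with the top bit set represent negative numbers
    # in two's complement.
    if value >= 2 ** 31:
        value -= 2 ** 32
    return value
```

The result always lies in the range from -2**31 to 2**31 - 1 inclusive, whatever the width of the input hash.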
