Pull request "Update model export as code limitations, add one more way to calculate hashes for categorical features" by @mirecl from #19

mirecl · andrey-khropov · commit 97de4fdf0074 · 2025-09-07T16:39:05.000+03:00
Pull Request resolved: #19 commit_hash:883676763ebf3074624eb7db9c0c2fa452461bce
diff --git a/apply_model/model_export_as_cpp_code_tutorial.md b/apply_model/model_export_as_cpp_code_tutorial.md
@@ -5,25 +5,22 @@ Catboost model could be saved as standalone C++ code. This can ease an integrati
 
 The exported model code contains complete data for the current trained model and *apply_catboost_model()* function which applies the model to a given dataset. The only current dependency for the code is [CityHash library](https://github.com/google/cityhash/tree/00b9287e8c1255b5922ef90e304d5287361b2c2a) (NOTE: The exact revision under the link is required).
 
-
-### Exporting from Catboost application via command line interface:
+### Exporting from Catboost application via command line interface
 
 ```bash
 catboost fit --model-format CPP <other_fit_parameters>
 ```
 
 By default, the model is saved to the *model.cpp* file. The output name can be changed with the *-m* key. If more than one model-format is specified, the *.cpp* extension is added to the name provided after the *-m* key.
 
-
-### Exporting from Catboost python library interface:
+### Exporting from Catboost python library interface
 
 ```python
 model = CatBoost(<train_params>)
 model.fit(train_pool)
 model.save_model(OUTPUT_CPP_MODEL_PATH, format="CPP")
 ```
 
-
 ## Models trained with only Float features
 
 If the model was trained using only numerical features (no cat features), then the application function in generated code will have the following interface:
@@ -32,14 +29,12 @@ If the model was trained using only numerical features (no cat features), then t
 double ApplyCatboostModel(const std::vector<float>& features);
 ```
 
-
 ### Parameters
 
 | parameter | description                                      |
 |-----------|--------------------------------------------------|
 | features  | features of a single document to make prediction |
 
-
 ### Return value
 
 Prediction of the model for the document with given features.
@@ -58,7 +53,6 @@ double ApplyCatboostModel(const std::vector<float>& features) {
 
 C++11 support of non-static data member initializers and extended initializer lists
 
-
 ## Models trained with Categorical features
 
 If the model was trained with categorical features present, then the application function in output code will be generated with the following interface:
@@ -67,7 +61,6 @@ If the model was trained with categorical features present, then the application
 double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::vector<std::string>& catFeatures);
 ```
 
-
 ### Parameters
 
 | parameter     | description                               |
@@ -77,7 +70,6 @@ double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::ve
 
 NOTE: You need to pass float and categorical features separately, in the same order they appeared in the train dataset. For example, if you had features f1,f2,f3,f4, where f2 and f4 were considered categorical, you need to pass floatFeatures = {f1, f3}, catFeatures = {f2, f4}.
 
-
 ### Return value
 
 Prediction of the model for the document with given features.
@@ -92,21 +84,17 @@ double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::ve
 }
 ```
 
-
 ### Compiler requirements
 
 C++14 compiler with aggregate member initialization support. Tested compilers: g++ 5(5.4.1 20160904), clang++ 3.8.
 
-
 ## Current limitations
 
-- MultiClassification models are not supported.
 - The ApplyCatboostModel() function is a reference implementation and may be slower than the native CatBoost applicator, especially on large models and when applied to many documents.
-
+- [Text](https://catboost.ai/en/docs/features/text-features) and [Embeddings](https://catboost.ai/en/docs/features/embeddings-features) features are not supported.
 
 ## Troubleshooting
 
 Q: Generated model results differ from the native model when categorical features are present
 A: Please check that CityHash version 1 is used. Use exactly this revision of the [C++ Google CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also a proper CityHash implementation in the [Catboost repository](https://github.com/catboost/catboost/blob/master/util/digest/city.h). Other versions of CityHash may produce a different hash code for the same string.
 
-
diff --git a/apply_model/model_export_as_python_code_tutorial.md b/apply_model/model_export_as_python_code_tutorial.md
@@ -5,25 +5,22 @@ Catboost model could be saved as standalone Python code. This can ease an integr
 
 The exported model code contains complete data for the current trained model and *apply_catboost_model()* function which applies the model to a given dataset. The only current dependency for the code is [CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56).
 
-
-### Exporting from Catboost application via command line interface:
+### Exporting from Catboost application via command line interface
 
 ```bash
 catboost fit --model-format Python <other_fit_parameters>
 ```
 
 By default, the model is saved to the *model.py* file. The output name can be changed with the *-m* key. If more than one model-format is specified, the *.py* extension is added to the name provided after the *-m* key.
 
-
-### Exporting from Catboost python library interface:
+### Exporting from Catboost python library interface
 
 ```python
 model = CatBoost(<train_params>)
 model.fit(train_pool)
 model.save_model(OUTPUT_PYTHON_MODEL_PATH, format="python")
 ```
 
-
 ## Models trained with only Float features
 
 If the model was trained using only numerical features (no cat features), then the application function in generated code will have the following interface:
@@ -32,19 +29,16 @@ If the model was trained using only numerical features (no cat features), then t
 def apply_catboost_model(float_features):
 ```
 
-
 ### Parameters
 
 | parameter      | type                       | description                                      |
 |----------------|----------------------------|--------------------------------------------------|
 | float_features | list of int or float values| features of a single document to make prediction |
 
-
 ### Return value
 
 Prediction of the model for the document with given features, equivalent to CatBoost().predict(prediction_type='RawFormulaVal').
 
-
 ## Models trained with Categorical features
 
 If the model was trained with categorical features present, then the application function in output code will be generated with the following interface:
@@ -53,7 +47,6 @@ If the model was trained with categorical features present, then the application
 def apply_catboost_model(float_features, cat_features):
 ```
 
-
 ### Parameters
 
 | parameter      | type                                 | description                               |
@@ -63,18 +56,28 @@ def apply_catboost_model(float_features, cat_features):
 
 NOTE: You need to pass float and categorical features separately, in the same order they appeared in the train dataset. For example, if you had features f1,f2,f3,f4, where f2 and f4 were considered categorical, you need to pass float_features=[f1,f3], cat_features=[f2,f4].
 
-
 ### Return value
 
 Prediction of the model for the document with given features, equivalent to CatBoost().predict(prediction_type='RawFormulaVal').
 
-
 ## Current limitations
-- MultiClassification models are not supported.
-- apply_catboost_model() function has reference implementation and may lack of performance comparing to native applicator of CatBoost, especially on large models and multiple of documents.
 
+- The apply_catboost_model() function is a reference implementation and may be slower than the native CatBoost applicator, especially on large models and when applied to many documents.
+- [Text](https://catboost.ai/en/docs/features/text-features) and [Embeddings](https://catboost.ai/en/docs/features/embeddings-features) features are not supported.
 
 ## Troubleshooting
 
 Q: Generated model results differ from the native model when categorical features are present
-A: Please check that the CityHash version 1 is used. Exact required revision of [Python CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also proper CityHash implementation in [Catboost repository](https://github.com/catboost/catboost/tree/master/library/python/cityhash). This is due other versions of CityHash may produce different hash code for the same string.
+A: Please check that CityHash version 1 is used. Use exactly this revision of the [Python CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also a proper CityHash implementation in the [Catboost repository](https://github.com/catboost/catboost/tree/master/library/python/cityhash). Other versions of CityHash may produce a different hash code for the same string. One option is to use the [clickhouse-cityhash](https://pypi.org/project/clickhouse-cityhash/) library:
+
+```python
+from clickhouse_cityhash.cityhash import CityHash64
+
+def calc_cat_feature_hash(value: str) -> int:
+    value_hash = CityHash64(value.encode('utf-8')) % (2 ** 32)  # keep the low 32 bits
+
+    if value_hash >= 2 ** 31:  # reinterpret as a signed 32-bit integer
+        value_hash -= 2 ** 32
+
+    return value_hash
+```
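
The wrap-around step in `calc_cat_feature_hash` above can be checked without installing any hashing library. The following is a minimal sketch, not part of the patch: the helper names `to_signed_int32` and `to_signed_int32_struct` are hypothetical, and the sample inputs are arbitrary boundary values rather than real CityHash outputs. It isolates the arithmetic that keeps the low 32 bits of an unsigned hash and reinterprets them as a signed 32-bit integer, then cross-checks it against a byte-level reinterpretation with the standard `struct` module.

```python
import struct


def to_signed_int32(hash64: int) -> int:
    """Keep the low 32 bits of an unsigned hash and reinterpret them as signed."""
    low32 = hash64 % (2 ** 32)
    # Values with the top bit set map to the negative half of the int32 range.
    return low32 - 2 ** 32 if low32 >= 2 ** 31 else low32


def to_signed_int32_struct(hash64: int) -> int:
    """Same conversion via byte reinterpretation, used only as a cross-check."""
    packed = struct.pack("<I", hash64 % (2 ** 32))  # unsigned 32-bit, little-endian
    return struct.unpack("<i", packed)[0]           # read back as signed 32-bit


# Arbitrary boundary values (assumption: not actual CityHash64 results).
for sample in (0, 2 ** 31 - 1, 2 ** 31, 2 ** 32 - 1, 2 ** 32 + 5, 2 ** 64 - 1):
    assert to_signed_int32(sample) == to_signed_int32_struct(sample)
```

If the two helpers agree on these boundary values, a mismatch with the native model is more likely to come from the CityHash version or from the input encoding than from the signed conversion.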