This repository contains a Python script that generates a custom FastText embedding model from recipe data. The script takes a dataset of recipes and reviews, extracts the relevant text, and trains a FastText model on it. The resulting model can then be used for natural language processing tasks within the recipe domain, such as semantic search or text classification.
Key features:
- Data Extraction: Extracts text from recipe data that is suitable for training a word embedding model.
- Custom Model Training: Trains a FastText model on the extracted recipe text, allowing for a domain-specific embedding model.
- Integration: The resulting model can be integrated with other systems such as SurrealDB for in-database vector search.
- Flexibility: The script is designed to be configurable and allow for experimentation with various training parameters.
This approach is valuable for creating embedding models that are tailored to the specific language and terminology used in recipe data, potentially improving the accuracy of tasks like semantic search and text classification.
Prerequisites:
- Python 3.6+
- FastText library: install with `pip install fasttext`
- Recipe data: a dataset of recipes and reviews, which can be downloaded from https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions
Usage:
- Clone the repository: `git clone <repository_url>`, then `cd generate_fasttext_model_from_recipe_data`.
- Prepare your recipe data: place the recipe and review data in the `data` directory in the root of this repository. The script expects the recipe data in a file called `RAW_recipes.csv` and the review data in a file called `RAW_interactions.csv` (see the preprocessing sketch after this list).
- Run the script: `python generate_fasttext_model.py`. The script will process your data, extract the text, and train a custom FastText model. The trained model is saved to a file specified in the script.
- Customize training parameters: you can modify the script to adjust FastText training parameters such as vector size, window size, minimum word count, and number of epochs (see the training sketch after this list).
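As a rough illustration of the preprocessing step, the sketch below reads the two CSV files with pandas and writes a plain-text training corpus. The column names (`name`, `description`, `review`) follow the Kaggle Food.com dataset, and the output file name and lowercasing are assumptions for illustration, not necessarily what the script does:

```python
import pandas as pd

# Illustrative preprocessing sketch. The column names below follow the
# Kaggle Food.com dataset ('name' and 'description' in RAW_recipes.csv,
# 'review' in RAW_interactions.csv); adjust them if your files differ.
recipes = pd.read_csv("data/RAW_recipes.csv")
interactions = pd.read_csv("data/RAW_interactions.csv")

texts = pd.concat([
    recipes["name"].dropna(),
    recipes["description"].dropna(),
    interactions["review"].dropna(),
])

# fasttext.train_unsupervised expects a plain-text file, roughly one
# sentence or document per line.
with open("data/corpus.txt", "w", encoding="utf-8") as f:
    for line in texts:
        f.write(str(line).lower().strip() + "\n")
```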
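Training itself goes through the official `fasttext` Python bindings. The call below is a minimal sketch of how the tunable hyperparameters map onto `fasttext.train_unsupervised`; the specific values and the output file name are illustrative, not the script's actual defaults:

```python
import fasttext

# Minimal training sketch; hyperparameter values are illustrative.
model = fasttext.train_unsupervised(
    "data/corpus.txt",
    model="skipgram",  # or "cbow"
    dim=100,           # vector size
    ws=5,              # context window size
    minCount=3,        # minimum word count
    epoch=10,          # number of training epochs
)
model.save_model("recipe_fasttext.bin")
```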
Using the model with SurrealDB:
The custom FastText embedding model generated by this script can be used within SurrealDB. Follow the patterns laid out in the `surrealDB_embedding_model` repository to upload and use your custom FastText model in the same way the GloVe model is used there.
- Upload the model: use a modified version of the `upload_model.py` script to upload your custom FastText model to SurrealDB.
- Create a SurQL function: define a custom SurQL function, such as `fn::sentence_to_vector`, that uses the uploaded FastText model to generate embeddings (see the sketch after this list).
- Implement semantic search: use your new `fn::sentence_to_vector` function together with the vector search function `fn::vector_search` to run semantic searches in your database.
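For orientation, the snippet below shows what such a sentence embedding looks like on the Python side, computed directly from the trained model. The file name is illustrative, and the suggestion that `fn::sentence_to_vector` mirrors this computation is an assumption based on the description above:

```python
import fasttext

model = fasttext.load_model("recipe_fasttext.bin")  # illustrative file name

# fasttext can compute a sentence-level embedding directly by averaging
# normalized word vectors; a SurQL function like fn::sentence_to_vector
# would presumably perform the equivalent computation in the database.
vector = model.get_sentence_vector("creamy garlic mushroom pasta")
print(len(vector))  # dimensionality matches the dim used at training time
```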
Notes:
- FastText for Out-of-Vocabulary Words: FastText is particularly well suited to recipe data because it handles out-of-vocabulary words, which matters when dealing with varied food-related terms. FastText represents words as bags of character n-grams, making it robust to unique or misspelled words (see the sketch after this list).
- Other Models: although this repository focuses on FastText, GloVe and Word2Vec are other word-level embedding models that could be used instead. GloVe is a strong performer with well-established results and is the example model used in the `surrealDB_embedding_model` repository. Word2Vec offers a balance of speed and accuracy and comes in two model variants (CBOW and skip-gram) for more flexible experimentation.
- Training Time: training time varies with data size and hardware. A 600 MB corpus of recipe data can take from a few hours to a few days on a CPU; a GPU can reduce this significantly.
- Experimentation: the best embedding model depends on the specifics of your data, so consider trying variations of the training parameters to tune your model.
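The out-of-vocabulary behavior is easy to see directly. In this sketch (model file name illustrative), a misspelled word still receives an embedding because FastText composes vectors from character n-grams:

```python
import fasttext

model = fasttext.load_model("recipe_fasttext.bin")  # illustrative file name

# Because FastText composes word vectors from character n-grams, even a
# misspelled or previously unseen word still receives a usable embedding.
print(model.get_word_vector("worcestershiree")[:5])  # misspelling, still embedded

# Nearest-neighbor lookups likewise benefit from subword information.
print(model.get_nearest_neighbors("sriracha", k=3))
```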
This repository complements other related projects:
- `surrealDB_embedding_model` (https://github.com/apireno/surrealDB_embedding_model): provides a way to upload an embedding model (such as GloVe) to SurrealDB and use a SurQL function to calculate embeddings within the database.
- `surrealDB_recipe_demo_dataset` (https://github.com/apireno/surrealDB_recipe_demo_dataset): contains scripts to upload a demo dataset scraped from food.com into a SurrealDB database. It builds on the `surrealDB_embedding_model` repository.
Contributions are welcome! Please feel free to submit pull requests or create issues to report bugs or suggest improvements.