This repository contains a Python script that generates a custom FastText embedding model from recipe data. The script takes a dataset of recipes and reviews, extracts the relevant text, and trains a FastText model on it. The resulting model can then be used for natural language processing tasks within the recipe domain, such as semantic search or text classification.
Key features:
- Data Extraction: Extracts text from recipe data that is suitable for training a word embedding model.
- Custom Model Training: Trains a FastText model on the extracted recipe text, allowing for a domain-specific embedding model.
- Integration: The resulting model can be integrated with other systems such as SurrealDB for in-database vector search.
- Flexibility: The script is designed to be configurable and allow for experimentation with various training parameters.
This approach is valuable for creating embedding models that are tailored to the specific language and terminology used in recipe data, potentially improving the accuracy of tasks like semantic search and text classification.
Prerequisites:
- Python 3.6+
- FastText library: install with `pip install fasttext`
- Recipe data: a dataset of recipes and reviews, which can be downloaded from https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions
Usage:
- Clone the repository: `git clone <repository_url>`, then `cd generate_fasttext_model_from_recipe_data`.
- Prepare your recipe data: place the recipe and review data in the `data` directory in the root of this repository. The script expects the recipe data in a file called `RAW_recipes.csv` and the review data in a file called `RAW_interactions.csv` (see the preprocessing sketch after this list).
- Run the script: `python generate_fasttext_model.py`. The script will process your data, extract the text, and train a custom FastText model. The trained model is saved to a file specified in the script.
- Customize training parameters: you can modify the script to adjust FastText training parameters such as vector size, window size, minimum word count, and number of epochs (see the training sketch after this list).
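As a rough illustration of the preprocessing step, the sketch below reads the two CSV files with pandas and writes a plain-text training corpus. The column names (`name`, `description`, `review`) follow the Kaggle Food.com dataset, and the output file name and lowercasing are assumptions for illustration, not necessarily what the script does:

```python
import pandas as pd

# Illustrative preprocessing sketch. The column names below follow the
# Kaggle Food.com dataset ('name' and 'description' in RAW_recipes.csv,
# 'review' in RAW_interactions.csv); adjust them if your files differ.
recipes = pd.read_csv("data/RAW_recipes.csv")
interactions = pd.read_csv("data/RAW_interactions.csv")

texts = pd.concat([
    recipes["name"].dropna(),
    recipes["description"].dropna(),
    interactions["review"].dropna(),
])

# fasttext.train_unsupervised expects a plain-text file, roughly one
# sentence or document per line.
with open("data/corpus.txt", "w", encoding="utf-8") as f:
    for line in texts:
        f.write(str(line).lower().strip() + "\n")
```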
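Training itself goes through the official `fasttext` Python bindings. The call below is a minimal sketch of how the tunable hyperparameters map onto `fasttext.train_unsupervised`; the specific values and the output file name are illustrative, not the script's actual defaults:

```python
import fasttext

# Minimal training sketch; hyperparameter values are illustrative.
model = fasttext.train_unsupervised(
    "data/corpus.txt",
    model="skipgram",  # or "cbow"
    dim=100,           # vector size
    ws=5,              # context window size
    minCount=3,        # minimum word count
    epoch=10,          # number of training epochs
)
model.save_model("recipe_fasttext.bin")
```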
Using the model with SurrealDB:
The custom FastText embedding model generated by this script can be used within SurrealDB. Follow the patterns laid out in the `surrealDB_embedding_model` repository to upload and use your custom FastText model in the same way the GloVe model is used there.
- Upload the model: use a modified version of the `upload_model.py` script to upload your custom FastText model to SurrealDB.
- Create a SurQL function: define a custom SurQL function, such as `fn::sentence_to_vector`, that uses the uploaded FastText model to generate embeddings (see the sketch after this list).
- Implement semantic search: use your new `fn::sentence_to_vector` function together with the vector search function `fn::vector_search` to run semantic searches in your database.
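For orientation, the snippet below shows what such a sentence embedding looks like on the Python side, computed directly from the trained model. The file name is illustrative, and the suggestion that `fn::sentence_to_vector` mirrors this computation is an assumption based on the description above:

```python
import fasttext

model = fasttext.load_model("recipe_fasttext.bin")  # illustrative file name

# fasttext can compute a sentence-level embedding directly by averaging
# normalized word vectors; a SurQL function like fn::sentence_to_vector
# would presumably perform the equivalent computation in the database.
vector = model.get_sentence_vector("creamy garlic mushroom pasta")
print(len(vector))  # dimensionality matches the dim used at training time
```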
Notes:
- FastText for Out-of-Vocabulary Words: FastText is particularly well suited to recipe data because it handles out-of-vocabulary words, which matters when dealing with varied food-related terms. FastText represents words as bags of character n-grams, making it robust to unique or misspelled words (see the sketch after this list).
- Other Models: although this repository focuses on FastText, GloVe and Word2Vec are other word-level embedding models that could be used instead. GloVe is a strong performer with well-established results and is the example model used in the `surrealDB_embedding_model` repository. Word2Vec offers a balance of speed and accuracy and comes in two model variants (CBOW and skip-gram) for more flexible experimentation.
- Training Time: training time varies with data size and hardware. A 600 MB corpus of recipe data can take from a few hours to a few days on a CPU; a GPU can reduce this significantly.
- Experimentation: the best embedding model depends on the specifics of your data, so consider trying variations of the training parameters to tune your model.
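The out-of-vocabulary behavior is easy to see directly. In this sketch (model file name illustrative), a misspelled word still receives an embedding because FastText composes vectors from character n-grams:

```python
import fasttext

model = fasttext.load_model("recipe_fasttext.bin")  # illustrative file name

# Because FastText composes word vectors from character n-grams, even a
# misspelled or previously unseen word still receives a usable embedding.
print(model.get_word_vector("worcestershiree")[:5])  # misspelling, still embedded

# Nearest-neighbor lookups likewise benefit from subword information.
print(model.get_nearest_neighbors("sriracha", k=3))
```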
This repository complements other related projects:
- `surrealDB_embedding_model` (https://github.com/apireno/surrealDB_embedding_model): provides a way to upload an embedding model (such as GloVe) to SurrealDB and use a SurQL function to calculate embeddings within the database.
- `surrealDB_recipe_demo_dataset` (https://github.com/apireno/surrealDB_recipe_demo_dataset): contains scripts to upload a demo dataset scraped from food.com into a SurrealDB database. It builds on the `surrealDB_embedding_model` repository.
Contributions are welcome! Please feel free to submit pull requests or create issues to report bugs or suggest improvements.