Skip to content

apireno/generate_fasttext_model_from_recipe_data

Okay, here's a readme.md file for the generatefasttext repository, based on your description and the information from the sources and our conversation history:

generate_fasttext_model_from_recipe_data

This repository contains a Python script to generate a custom FastText embedding model from recipe data. It processes recipe data, extracts relevant text, and uses it to train a FastText model. This model can then be used for various natural language processing tasks, such as semantic search or text classification within the recipe domain.

Overview

The script in this repository is designed to take a dataset of recipes and reviews, extract the necessary text data, and then use that data to train a custom FastText embedding model.

Key features:

  • Data Extraction: Extracts text from recipe data that is suitable for training a word embedding model.
  • Custom Model Training: Trains a FastText model on the extracted recipe text, allowing for a domain-specific embedding model.
  • Integration: The resulting model can be integrated with other systems such as SurrealDB for in-database vector search.
  • Flexibility: The script is designed to be configurable and allow for experimentation with various training parameters.

This approach is valuable for creating embedding models that are tailored to the specific language and terminology used in recipe data, potentially improving the accuracy of tasks like semantic search and text classification.

Usage

Prerequisites

Steps

  1. Clone the Repository:

    git clone <repository_url>
    cd generate_fasttext_model_from_recipe_data
  2. Prepare your recipe data: Place the recipe and review data into the data directory in the root of this repository. The script expects the recipe data to be in a file called RAW_recipes.csv and the review data to be in a file called RAW_interactions.csv

  3. Run the Script:

       python generate_fasttext_model.py
    • The script will process your data, extract text, and then train a custom FastText model. The trained model will be saved to a file specified in the script.
  4. Customize Training Parameters:

  • You can modify the script to adjust FastText training parameters, such as vector size, window size, minimum word count, and number of epochs [outside source].

Integration with SurrealDB

The custom FastText embedding model generated by this script can be used within SurrealDB. You can follow the patterns laid out in the surrealDB_embedding_model repository to upload and use your custom FastText model in a similar way as the GloVe model is used.

  • Upload the model: Use a modified version of the upload_model.py script to upload your custom FastText model to SurrealDB.
  • Create a SQL Function: Define a custom SurQL function that uses the uploaded FastText model to generate embeddings, such as fn::sentence_to_vector.
  • Implement semantic search: Use your new fn::sentence_to_vector function with the vector search function fn::vector_search for semantic searches in your database.

Model Choice and Considerations

  • FastText for Out-of-Vocabulary Words: FastText is particularly suitable for recipe data because it handles out-of-vocabulary words well, which is important when dealing with varied food-related terms. As previously discussed, FastText represents words as n-grams, making it robust for datasets with unique or misspelled words.
  • Other Models: Although this repository focuses on FastText, it’s worth noting that GloVe and Word2Vec are other word-level embedding models that could be used if desired. GloVe is a strong performer with well established results and is used as an example model in the surrealDB_embedding_model repository. Word2Vec provides a balance of speed and accuracy and offers two model variants for more flexible experimentation.
  • Training Time: Training time varies based on data size and hardware. A 600MB corpus of recipe data could take from a few hours to a few days on a CPU, and can be significantly reduced with a GPU.
  • Experimentation: The best embedding model can depend on the specifics of your data, so consider trying out variations of the training parameters to tune your model.

Related Repositories

This repository complements other related projects:

Contributing

Contributions are welcome! Please feel free to submit pull requests or create issues to report bugs or suggest improvements.

About

No description, website, or topics provided.

Resources

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages