Privacy policies are essential documents that outline how an organization collects, uses, and protects user data. Because these policies are long and written in complex legal language, analyzing them manually is time-consuming and error-prone. This project therefore leverages natural language processing, specifically Named Entity Recognition (NER), to automate the analysis: it extracts relevant information from privacy policies, including business-related topics, legal aspects, regulations, usability factors, educational aspects, technology, and multidisciplinary aspects.
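As a taste of what NER produces, here is a minimal sketch using Hugging Face Transformers with a publicly available general-purpose model (`dslim/bert-base-NER`); it tags generic entities like organizations and locations, whereas this project trains a model with privacy-specific labels:

```python
# Minimal NER demo with a general-purpose public model; the project's own
# model uses privacy-specific labels instead of PER/ORG/LOC/MISC.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for ent in ner("Google may share your data with advertisers in the EU."):
    print(f'{ent["entity_group"]:<6} {ent["score"]:.2f}  {ent["word"]}')
```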
- Install deps (Poetry)

```bash
cd <path_to_repo>
poetry install
```
- Set OpenAI key

```bash
export OPENAI_API_KEY=...
```
- Annotate 5k policies

```bash
poetry run python -m tnc annotate \
  --sqlite-path <path_to_sqlite> \
  --entities-path <path_to_entities> \
  --limit 5000 \
  --out <path_to_out> \
  --model gpt-4o-mini
```
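Under the hood, this step asks an OpenAI model to label entity spans in each policy. The sketch below is a hypothetical reconstruction: the prompt, the label set, and the JSON schema are illustrative assumptions, not the exact internals of `tnc annotate`:

```python
# Hypothetical sketch of LLM-assisted span annotation; the prompt, labels,
# and output schema are illustrative assumptions, not tnc's actual internals.
import json

from openai import OpenAI  # picks up OPENAI_API_KEY from the environment

client = OpenAI()
snippet = "We use cookies and third-party analytics to track usage."
prompt = (
    "Mark entities in the text below and respond with JSON of the form "
    '{"entities": [{"text": ..., "label": ..., "start": ..., "end": ...}]}. '
    "Allowed labels: DATA_TYPE, THIRD_PARTY, REGULATION, USER_RIGHT.\n\n"
    + snippet
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # ask for parseable JSON
)
print(json.dumps(json.loads(resp.choices[0].message.content), indent=2))
```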
- Prepare BIO dataset

```bash
poetry run python -m tnc prepare \
  --annotations <path_to_annotations> \
  --entities-path <path_to_entities> \
  --train-out <path_to_out>
```
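The prepare step converts annotations into token-level BIO tags (`B-` marks the first token of an entity, `I-` a continuation, `O` everything else). A minimal sketch of that conversion, assuming annotations arrive as `(start, end, label)` character spans and whitespace tokenization; the actual schema may differ:

```python
# Convert character-offset spans to token-level BIO tags (whitespace tokens).
def to_bio(text, spans):
    tokens, tags, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)          # char offset of this token
        end = pos = start + len(tok)
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:       # token falls inside the span
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags

text = "We share data with third-party advertisers."
spans = [(19, 43, "THIRD_PARTY")]  # text[19:43] == "third-party advertisers."
for tok, tag in zip(*to_bio(text, spans)):
    print(f"{tag:<14} {tok}")
```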
- Train

```bash
poetry run python -m tnc train \
  --prepared <path_to_train_json> \
  --output <path_to_output> \
  --base-model bert-base-cased \
  --epochs 3 \
  --batch-size 8
```
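Training fine-tunes `bert-base-cased` for token classification. The sketch below shows the standard Hugging Face recipe this presumably follows; the input file name and its `tokens`/`tags` fields are assumptions about the prepared JSON, not the project's documented schema:

```python
# Standard Hugging Face token-classification fine-tuning; the file name and
# the "tokens"/"tags" fields are assumptions about the prepared JSON.
import json

from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

records = json.load(open("train.json"))  # [{"tokens": [...], "tags": [...]}, ...]
labels = sorted({t for r in records for t in r["tags"]})
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode(batch):
    # Tokenize pre-split words and align each BIO tag with the first sub-word
    # piece; special tokens and continuation pieces get the ignore index -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["tags"]):
        prev, ids = None, []
        for w in enc.word_ids(batch_index=i):
            ids.append(-100 if w is None or w == prev else label2id[tags[w]])
            prev = w
        enc["labels"].append(ids)
    return enc

ds = Dataset.from_list(records).map(encode, batched=True,
                                    remove_columns=["tokens", "tags"])
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels), label2id=label2id,
    id2label={i: l for l, i in label2id.items()})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
trainer.save_model("ner-model")
tokenizer.save_pretrained("ner-model")
```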
- Predict

```bash
poetry run python -m tnc predict \
  --text "We use cookies and third-party analytics to track usage." \
  --model-dir <path_to_model>
```
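To sanity-check a trained model outside the CLI, you can load the output directory straight into a Transformers pipeline (assuming the model and tokenizer were both saved there, as in the training sketch above):

```python
# Load the fine-tuned model directory into a pipeline and inspect the
# aggregated entity spans it predicts.
from transformers import pipeline

ner = pipeline("token-classification", model="ner-model",
               aggregation_strategy="simple")
text = "We use cookies and third-party analytics to track usage."
for ent in ner(text):
    print(f'{ent["entity_group"]:<14} {ent["score"]:.2f}  {ent["word"]}')
```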
Jargon is licensed under the MIT License. See the LICENSE file for more details.
Special thanks to Ryan B. Amos for providing access to their dataset :)