kernelism/jargon

Jargon

Privacy policies are essential documents that describe how an organization collects, uses, and protects user data. Analyzing them manually is time-consuming and difficult because they are long and written in complex language. This project therefore uses natural language processing, specifically Named Entity Recognition (NER), to automate the analysis.

The goal is to automatically extract relevant information from privacy policies, including business-related topics, legal aspects, regulations, usability factors, educational aspects, technology, and multidisciplinary aspects.
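
To make the idea of entity extraction concrete, here is a minimal sketch of what labeled entities in a policy sentence might look like. It is not the project's method: a toy phrase lookup stands in for the trained model, and the `Entity` class, the `GAZETTEER` map, and the label names (`TECHNOLOGY`, `BUSINESS`, `REGULATION`) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # the matched surface string
    label: str  # hypothetical category name
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


# Hypothetical phrase-to-label map; a trained NER model would replace this lookup.
GAZETTEER = {
    "cookies": "TECHNOLOGY",
    "third-party analytics": "BUSINESS",
    "GDPR": "REGULATION",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return every gazetteer phrase found in text, ordered by position."""
    entities = []
    for phrase, label in gazetteer.items():
        # Match whole phrases only, ignoring case.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            entities.append(
                Entity(match.group(0), label, match.start(), match.end())
            )
    return sorted(entities, key=lambda e: e.start)


sentence = "We use cookies and third-party analytics to track usage."
for entity in find_entities(sentence):
    print(entity)
```

A lookup like this only finds phrases it already knows, which is the limitation a learned NER model is meant to address.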

Quickstart:

  1. Install dependencies with Poetry (replace the path with the location of your local checkout)
cd /Users/arjuns/Downloads/tnc
poetry install
  2. Set your OpenAI API key
export OPENAI_API_KEY=...
  3. Annotate 5k policies
poetry run python -m tnc annotate \
  --sqlite-path <path_to_sqlite> \
  --entities-path <path_to_entities> \
  --limit 5000 \
  --out <path_to_out> \
  --model gpt-4o-mini
  4. Prepare the BIO dataset
poetry run python -m tnc prepare \
  --annotations <path_to_annotations> \
  --entities-path <path_to_entities> \
  --train-out <path_to_out>
  5. Train
poetry run python -m tnc train \
  --prepared <path_to_train_json> \
  --output <path_to_output> \
  --base-model bert-base-cased \
  --epochs 3 \
  --batch-size 8
  6. Predict
poetry run python -m tnc predict \
  --text "We use cookies and third-party analytics to track usage." \
  --model-dir <path_to_model>
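
The "Prepare BIO dataset" step refers to the BIO tagging scheme, in which each token is marked as the Beginning of an entity, Inside one, or Outside any entity. The sketch below shows one way character-level annotations could be turned into BIO tags. This README does not show how `tnc prepare` works internally, so the function `spans_to_bio`, the regex tokenizer, and the label names are assumptions for illustration only.

```python
import re


def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans into per-token BIO tags.

    Offsets are character positions with an exclusive end. A token receives
    a tag only if it lies entirely inside a span; otherwise it is tagged "O".
    """
    tokens, tags = [], []
    # Simple tokenizer: words (hyphenated words stay whole) or single punctuation marks.
    for match in re.finditer(r"\w+(?:-\w+)*|[^\w\s]", text):
        tok_start, tok_end = match.start(), match.end()
        tag = "O"
        for start, end, label in spans:
            if tok_start >= start and tok_end <= end:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if tok_start == start else "I-") + label
                break
        tokens.append(match.group(0))
        tags.append(tag)
    return tokens, tags


text = "We use cookies and third-party analytics to track usage."

# Hypothetical annotations, located by searching for the phrases.
phrases = [("cookies", "TECHNOLOGY"), ("third-party analytics", "BUSINESS")]
spans = [(text.index(p), text.index(p) + len(p), label) for p, label in phrases]

tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A model such as `bert-base-cased` splits words into subword pieces, so a real pipeline would also need to align these word-level tags with the subword tokens before training.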

License

Jargon is licensed under the MIT License. See the LICENSE file for more details.

References

A Multidisciplinary Definition of Privacy Labels: The Story of Princess Privacy and the Seven Helpers - Johanna Johansen, Tore Pedersen, Simone Fischer-Hübner, Christian Johansen, Gerardo Schneider, Arnold Roosendaal, Harald Zwingelberg, Anders Jakob Sivesind, Josef Noll

Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset - Ryan Amos, Gunes Acar, Eli Lucherini, Mihir Kshirsagar, Arvind Narayanan, Jonathan Mayer

Special thanks to Ryan B. Amos for providing access to their dataset :)

About

Jargon is a Named-Entity-Recognition approach to automating the analysis of terms-and-conditions or privacy-policies.
