Privacy policies are essential documents that outline how an organization collects, uses, and protects user data. Because these policies are long and written in complex legal language, analyzing them manually is time-consuming and error-prone. This project therefore leverages natural language processing, specifically Named Entity Recognition (NER), to automate the analysis: it extracts relevant information from privacy policies, including business-related topics, legal aspects, regulations, usability factors, educational aspects, technology, and multidisciplinary aspects.
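As a taste of what NER produces, here is a minimal sketch using Hugging Face Transformers with a publicly available general-purpose model (`dslim/bert-base-NER`); it tags generic entities like organizations and locations, whereas this project trains a model with privacy-specific labels:

```python
# Minimal NER demo with a general-purpose public model; the project's own
# model uses privacy-specific labels instead of PER/ORG/LOC/MISC.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for ent in ner("Google may share your data with advertisers in the EU."):
    print(f'{ent["entity_group"]:<6} {ent["score"]:.2f}  {ent["word"]}')
```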
- Install deps (Poetry)

```bash
cd <path_to_repo>
poetry install
```
- Set OpenAI key

```bash
export OPENAI_API_KEY=...
```
- Annotate 5k policies

```bash
poetry run python -m tnc annotate \
  --sqlite-path <path_to_sqlite> \
  --entities-path <path_to_entities> \
  --limit 5000 \
  --out <path_to_out> \
  --model gpt-4o-mini
```
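Under the hood, this step asks an OpenAI model to label entity spans in each policy. The sketch below is a hypothetical reconstruction: the prompt, the label set, and the JSON schema are illustrative assumptions, not the exact internals of `tnc annotate`:

```python
# Hypothetical sketch of LLM-assisted span annotation; the prompt, labels,
# and output schema are illustrative assumptions, not tnc's actual internals.
import json

from openai import OpenAI  # picks up OPENAI_API_KEY from the environment

client = OpenAI()
snippet = "We use cookies and third-party analytics to track usage."
prompt = (
    "Mark entities in the text below and respond with JSON of the form "
    '{"entities": [{"text": ..., "label": ..., "start": ..., "end": ...}]}. '
    "Allowed labels: DATA_TYPE, THIRD_PARTY, REGULATION, USER_RIGHT.\n\n"
    + snippet
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # ask for parseable JSON
)
print(json.dumps(json.loads(resp.choices[0].message.content), indent=2))
```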
- Prepare BIO dataset

```bash
poetry run python -m tnc prepare \
  --annotations <path_to_annotations> \
  --entities-path <path_to_entities> \
  --train-out <path_to_out>
```
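The prepare step converts annotations into token-level BIO tags (`B-` marks the first token of an entity, `I-` a continuation, `O` everything else). A minimal sketch of that conversion, assuming annotations arrive as `(start, end, label)` character spans and whitespace tokenization; the actual schema may differ:

```python
# Convert character-offset spans to token-level BIO tags (whitespace tokens).
def to_bio(text, spans):
    tokens, tags, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)          # char offset of this token
        end = pos = start + len(tok)
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:       # token falls inside the span
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags

text = "We share data with third-party advertisers."
spans = [(19, 43, "THIRD_PARTY")]  # text[19:43] == "third-party advertisers."
for tok, tag in zip(*to_bio(text, spans)):
    print(f"{tag:<14} {tok}")
```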
- Train

```bash
poetry run python -m tnc train \
  --prepared <path_to_train_json> \
  --output <path_to_output> \
  --base-model bert-base-cased \
  --epochs 3 \
  --batch-size 8
```
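Training fine-tunes `bert-base-cased` for token classification. The sketch below shows the standard Hugging Face recipe this presumably follows; the input file name and its `tokens`/`tags` fields are assumptions about the prepared JSON, not the project's documented schema:

```python
# Standard Hugging Face token-classification fine-tuning; the file name and
# the "tokens"/"tags" fields are assumptions about the prepared JSON.
import json

from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

records = json.load(open("train.json"))  # [{"tokens": [...], "tags": [...]}, ...]
labels = sorted({t for r in records for t in r["tags"]})
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode(batch):
    # Tokenize pre-split words and align each BIO tag with the first sub-word
    # piece; special tokens and continuation pieces get the ignore index -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["tags"]):
        prev, ids = None, []
        for w in enc.word_ids(batch_index=i):
            ids.append(-100 if w is None or w == prev else label2id[tags[w]])
            prev = w
        enc["labels"].append(ids)
    return enc

ds = Dataset.from_list(records).map(encode, batched=True,
                                    remove_columns=["tokens", "tags"])
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels), label2id=label2id,
    id2label={i: l for l, i in label2id.items()})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-model", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
trainer.save_model("ner-model")
tokenizer.save_pretrained("ner-model")
```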
- Predict

```bash
poetry run python -m tnc predict \
  --text "We use cookies and third-party analytics to track usage." \
  --model-dir <path_to_model>
```
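To sanity-check a trained model outside the CLI, you can load the output directory straight into a Transformers pipeline (assuming the model and tokenizer were both saved there, as in the training sketch above):

```python
# Load the fine-tuned model directory into a pipeline and inspect the
# aggregated entity spans it predicts.
from transformers import pipeline

ner = pipeline("token-classification", model="ner-model",
               aggregation_strategy="simple")
text = "We use cookies and third-party analytics to track usage."
for ent in ner(text):
    print(f'{ent["entity_group"]:<14} {ent["score"]:.2f}  {ent["word"]}')
```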
Jargon is licensed under the MIT License. See the LICENSE file for more details.
Special thanks to Ryan B. Amos for providing access to their dataset :)