In-browser PDF labeling to enable NER (named entity recognition) model training. The page displays a PDF on the left and a two-column table of entities on the right. The list of entity types is defined by the user on a previous page (so for now, assume the entity type list is hard-coded). The entity values are filled in dynamically from what the user highlights. For each entity type in the table, the user can choose the color and the annotation style (highlight, underline, or squiggly), delete annotations from the table, and search by typing in the table boxes.
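
As a rough illustration of the page described above, the entity table can be modeled as a hard-coded list of entity types plus a flat list of annotations that is grouped into rows. This is only a sketch: every name here (`MarkStyle`, `EntityType`, `EntityAnnotation`, `buildTableRows`, `deleteAnnotation`) and the sample entity types are hypothetical and do not come from the repo or from EmbedPDF.

```typescript
// Hypothetical data model for the entity table; not taken from the repo.
type MarkStyle = "highlight" | "underline" | "squiggly";

interface EntityType {
  name: string;
  color: string; // any CSS color, chosen by the user per entity type
  style: MarkStyle; // how annotations of this type are drawn on the PDF
}

interface EntityAnnotation {
  id: string;
  entityType: string; // matches EntityType.name
  text: string; // the text the user highlighted in the PDF
}

// Hard-coded for now; later this list would come from the previous page.
const ENTITY_TYPES: EntityType[] = [
  { name: "Invoice Number", color: "#fde047", style: "highlight" },
  { name: "Vendor", color: "#38bdf8", style: "underline" },
];

interface TableRow {
  entityType: EntityType; // left column
  values: string[]; // right column, filled from the user's highlights
}

// Build the two-column table: one row per entity type, in list order.
function buildTableRows(
  types: EntityType[],
  annotations: EntityAnnotation[],
): TableRow[] {
  return types.map((entityType) => ({
    entityType,
    values: annotations
      .filter((a) => a.entityType === entityType.name)
      .map((a) => a.text),
  }));
}

// Deleting from the table removes the annotation with the given id.
function deleteAnnotation(
  annotations: EntityAnnotation[],
  id: string,
): EntityAnnotation[] {
  return annotations.filter((a) => a.id !== id);
}
```

Keeping annotations in one flat array and deriving the rows from it means a delete in the table and a delete on the PDF can share the same update path.
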
- Page 1: the user can decide whether each entity type is required (each PDF must have that entity), unique (each PDF has at most one of that entity), and single-word (whether the entity value can have spaces).
- highlighting that selects only full words
- a search button and input field using plugin-search
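
Two of the items above reduce to small pure functions: checking the required / unique / single-word rules for one PDF, and expanding a selection so it covers only full words. The sketch below is an assumption about how that could look, not the project's implementation. `EntityRule`, `LabeledValue`, `validateLabels`, and `snapToWords` are hypothetical names, and a "word" is assumed to be any run of non-whitespace characters, so attached punctuation is included.

```typescript
// Hypothetical rule shape mirroring the three per-entity-type options.
interface EntityRule {
  name: string;
  required: boolean; // each PDF must have at least one value
  unique: boolean; // each PDF has at most one value
  singleWord: boolean; // a value may not contain whitespace
}

interface LabeledValue {
  entityType: string; // matches EntityRule.name
  text: string;
}

// Return a human-readable message for every rule a PDF's labels violate.
function validateLabels(rules: EntityRule[], labels: LabeledValue[]): string[] {
  const errors: string[] = [];
  for (const rule of rules) {
    const values = labels
      .filter((l) => l.entityType === rule.name)
      .map((l) => l.text);
    if (rule.required && values.length === 0) {
      errors.push(`${rule.name}: required but missing`);
    }
    if (rule.unique && values.length > 1) {
      errors.push(`${rule.name}: expected at most one, found ${values.length}`);
    }
    if (rule.singleWord) {
      for (const value of values) {
        if (/\s/.test(value.trim())) {
          errors.push(`${rule.name}: "${value}" contains spaces`);
        }
      }
    }
  }
  return errors;
}

// Adjust the half-open range [start, end) so it covers only whole words.
// Assumption: words are delimited by whitespace only.
function snapToWords(text: string, start: number, end: number): [number, number] {
  const isWordChar = (i: number): boolean =>
    i >= 0 && i < text.length && !/\s/.test(text.charAt(i));
  let s = Math.max(0, Math.min(start, text.length));
  let e = Math.max(s, Math.min(end, text.length));
  // Drop whitespace at both edges of the selection.
  while (s < e && !isWordChar(s)) s++;
  while (e > s && !isWordChar(e - 1)) e--;
  if (s === e) return [s, e]; // the selection contained no word characters
  // Grow outwards to the nearest word boundaries.
  while (isWordChar(s - 1)) s--;
  while (isWordChar(e)) e++;
  return [s, e];
}
```

For example, with the text `"Total due: 42 USD"`, a selection over `"tal du"` (offsets 2 to 8) snaps to offsets 0 to 10, which is `"Total due:"`.
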
- Install Node.js v22, Git, and VS Code
- Clone repo and install dependencies:

      git clone https://github.com/optimalcharb/pdf-entity-labeling.git
      npm install

- Install the recommended VS Code Extensions
- To set up Playwright:

      npx playwright install

- To set up Bun on Windows, open Command Prompt or PowerShell with admin privileges and run:

      powershell -c "irm bun.sh/install.ps1 | iex"

- Frontend framework: Next.js 15 App Router + React
- Language: TypeScript with ts-reset, config by tsconfig.json
- Environment variable management: no environment variables, for now all variables should be hard-coded, loaded from an annotations file, or user provided
- Containerization: none, no Docker or Kubernetes allowed
- Styles: Tailwind CSS v4 with CVA (Class Variance Authority) for CSS integration and PostCSS for JavaScript integration
- Linting: ESLint 9, config by eslint.config.mjs
- Formatting: Prettier, config by .prettierignore, .prettierrc
- Testing: React Testing Library + Bun Test Runner, which has a Jest-compatible API, name files as ".{spec,test}.{ts,tsx}"
- End-to-End Testing: Playwright, name files as ".e2e.ts"
- Storage: must get PDF from local storage or URL
- Database and API: avoid creating database tables or API routes, except for plugin-annotation. Don't rely heavily on a database or API framework; keep it simple. Try to do everything else with React and in-memory state, or possibly Zustand. Do not add authentication, authorization, Lambda functions, HTTP, caching, observability, security, etc.
- None
| Script | Description |
|---|---|
| dev | run site locally |
| build | build for prod |
| start | start prod server |
| tsc | type-check without generating files |
| lint | check for linting errors |
| lint:fix | fix some linting errors automatically |
| prettier | check format |
| prettier:fix | fix format (.vscode/settings.json does this on every save) |
| prepare | automatically called by install |
| postinstall | automatically called by install |
| depcheck | check for unused dependencies |
| storybook | view storybook workshop |
| test | run tests using Bun Test Runner |
| e2e | run playwright end-to-end tests |
| madge | to be added to package.json to run madge |
| others | other scripts can be added to package.json |
- DevOps CI/CD: GitHub Actions with workflows for check and bundle analyzer - currently disabled
- Changelog generation: Semantic Release, configured by .releaserc and run by .github/workflows/semantic-release.yml. Conventional Commits are enforced by husky, configured by .commitlintrc.json. Commit messages must start with a prefix from the table below, and the workflow edits CHANGELOG.md on any version bump.
| commit prefix | version bump | definition |
|---|---|---|
| type!: | major (0.0.0 -> 1.0.0) | breaking changes (feat!:, perf!:, ...) |
| feat: | minor (0.0.0 -> 0.1.0) | new feature |
| perf: | patch (0.0.0 -> 0.0.1) | performance improvement |
| fix: | patch (0.0.0 -> 0.0.1) | bug fix |
| docs: | none | documentation changes |
| test: | none | adding or updating tests |
| ci: | none | CI/CD configuration changes |
| revert: | none | reverting previous commits |
| style: | none | formatting without code changes |
| refactor: | none | reorganizing code without behavior changes |
| chore: | none | maintenance tasks |
| build: | none | build system or dependencies |
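
The table above can be read as a mapping from commit header to version bump. The sketch below mirrors only that table; it is not how Semantic Release or commitlint work internally, and `Bump`, `KNOWN_TYPES`, and `bumpForCommit` are hypothetical names. It also accepts the optional `(scope)` allowed by Conventional Commits and ignores `BREAKING CHANGE:` footers, which is a simplification.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Commit types listed in the table; anything else is treated as invalid.
const KNOWN_TYPES: string[] = [
  "feat", "perf", "fix", "docs", "test", "ci",
  "revert", "style", "refactor", "chore", "build",
];

// Map a commit message to the version bump the table assigns to its prefix.
function bumpForCommit(message: string): Bump | "invalid" {
  const header = message.split("\n")[0] ?? "";
  // type, optional (scope), optional "!", then ": " and a description
  const match = /^([a-z]+)(\([^)]+\))?(!)?: .+/.exec(header);
  if (!match) return "invalid";
  const type = match[1] ?? "";
  const breaking = match[3] === "!";
  if (!KNOWN_TYPES.includes(type)) return "invalid";
  if (breaking) return "major"; // type!: is always a breaking change
  if (type === "feat") return "minor";
  if (type === "perf" || type === "fix") return "patch";
  return "none"; // docs, test, ci, revert, style, refactor, chore, build
}
```

Under these assumptions, `bumpForCommit("feat: add search")` returns `"minor"` and `bumpForCommit("update stuff")` returns `"invalid"`.
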
- Package manager: npm to ensure compatibility with all serverless hosting
- Package management: Corepack
- Package fixes: Patch-package
- Bundle management: Bundle analyzer - currently disabled
- Import management: Absolute imports so imports from same module are alphabetically ordered
- State management: currently React only
- Component workshop: Storybook using .stories.tsx files
- Component dependency grapher: Madge - not yet set up, can fix later. Draft command: npx madge --extensions=js,jsx,ts,tsx ./ --exclude ".*.config.(ts|js|mjs)|.next/|.storybook/|node_modules/|storybook-static/|reset.d.ts|next-env.d.ts" --image graph.svg (need to install gvpr graphviz)
- Current site uses shadcn/ui stored in components/shadcn-ui and config by components.json
- Avoid using other UI libraries as the PDF container functionality should be locally coded
- EmbedPDF: GitHub, docs for @embedpdf/pdfium the JS library to wrap the C++ engine, docs for @embedpdf/core which I have modified
- Currently PDFs are rendered by URL only, later I want to fix the BufferStrategy in plugin-loader to load PDFs from local storage
- Plugins are built in a consistent style defined by core (not using standard Redux style) and must have commented sections following plugin-template/
- Refer to GitHub Issues for ideas
- Defer to components/shadcn-ui/form.tsx based on react-hook-form
- Try to stick to Lucide Icons, icons are not necessary at first since functionality needs to be built before appearance
- You can pick TailwindCSS colors on tailcolors
- Maybe try x-spreadsheet or other packages on npm
- Check billout for a list of some praised React packages