@Harsh14901 Harsh14901 commented Sep 4, 2025

Summary of proposed feature implementations

  • The original database is modified to keep track of the formatting options that were used to upload content. For example, if a page was originally written with enable_location=False and the script is run a second time with enable_location=True, the old behaviour was to leave the content unchanged; the new behaviour is to detect the change and rewrite the page. Changes in the following options are detected. The new database can be found here
    • enable_location
    • enable_highlight_date
    • number of highlights
    • last highlighted date
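
A minimal sketch of how such a change check could look, assuming the stored options are read back from the database row before each upload. `UploadOptions`, `changed_fields` and `needs_rewrite` are hypothetical names used for illustration, not the PR's actual models (which use pydantic).

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UploadOptions:
    """Options recorded with a page at upload time (hypothetical model)."""

    enable_location: bool
    enable_highlight_date: bool
    highlight_count: int
    last_highlighted: Optional[str]  # e.g. an ISO date string


def changed_fields(stored: UploadOptions, current: UploadOptions) -> List[str]:
    """Return the names of the options that differ between two uploads."""
    old, new = asdict(stored), asdict(current)
    return [name for name in old if old[name] != new[name]]


def needs_rewrite(stored: Optional[UploadOptions], current: UploadOptions) -> bool:
    """Rewrite when the page was never uploaded or any tracked option changed."""
    return stored is None or bool(changed_fields(stored, current))
```

With this shape, flipping enable_location between two runs reports `["enable_location"]` and triggers a rewrite, while an identical second run leaves the page alone.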

Example new page:
image

  • Originally, uploading with separate_blocks=True was slower. I optimized it by using the full Notion API allowance of 100 blocks per request. My intention is for separate_blocks=True to be the preferred approach and the default, since the Notion API limits the length of a paragraph anyway (2000 characters per rich text object), as indicated by one of the original author's comments.
  • Add a feature to group highlights under their corresponding chapter titles. This organizes the Notion page so that it is more readable.
    Example page to help better understand: https://www.notion.so/Thinking-in-Bets-264cbb15a7c781f8971eed3b07a7a6e1?source=copy_link
    • This requires the Kindle to be connected to the computer and the user to know its mount point (kindle_root)
    • For each book to be processed, the script tries to find the corresponding .mobi file in the Kindle directory
    • It then extracts the .mobi file to HTML. The extraction also yields the table of contents (TOC) as an XML file (the library used is mobi)
    • We parse the TOC and, for each chapter title, find its location in the HTML page. Similarly, for each highlight, we try to find its location in the HTML page. This location information is then used to put together a Notion page like the following:
image
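
The chapter grouping can be sketched roughly as follows, assuming the character offsets of chapter headings and highlights in the extracted HTML are already known. `group_by_chapter` and the "Front matter" bucket are hypothetical names for illustration; the PR's own mapping code may differ.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def group_by_chapter(
    chapters: List[Tuple[str, int]],    # (chapter title, start offset in HTML)
    highlights: List[Tuple[str, int]],  # (highlight text, offset in HTML)
    front_matter: str = "Front matter",
) -> Dict[str, List[str]]:
    """Assign each highlight to the last chapter starting at or before it."""
    ordered = sorted(chapters, key=lambda chapter: chapter[1])
    starts = [start for _, start in ordered]
    grouped: Dict[str, List[str]] = {}
    # Process highlights in reading order so each chapter's list stays ordered.
    for text, offset in sorted(highlights, key=lambda highlight: highlight[1]):
        index = bisect_right(starts, offset) - 1
        # Highlights before the first heading go into a catch-all bucket.
        title = ordered[index][0] if index >= 0 else front_matter
        grouped.setdefault(title, []).append(text)
    return grouped
```

A binary search over the sorted heading offsets keeps the lookup cheap even for books with many chapters and highlights.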
  • Kindle highlights often overlap. For example, if you select two sentences and then extend the selection to four, Kindle makes two separate entries in My Clippings.txt: one for the two sentences and one for the four. This PR tries to coalesce them into one so that we don't upload highlights that are a subset of another highlight.
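
One simple way to prune such subsets is plain text containment, sketched below. This is an illustration under that assumption only: `prune_contained` is a hypothetical helper, and the PR may instead compare location ranges.

```python
from typing import List


def prune_contained(highlights: List[str]) -> List[str]:
    """Drop highlights whose text is fully contained in another highlight."""
    # Collapse whitespace so line breaks do not defeat the containment check.
    normalized = [" ".join(text.split()) for text in highlights]
    kept = []
    for i, text in enumerate(normalized):
        contained = any(
            j != i
            and text in other
            # Strict subsets are dropped; of exact duplicates, keep the first.
            and (len(other) > len(text) or j < i)
            for j, other in enumerate(normalized)
        )
        if not contained:
            kept.append(highlights[i])
    return kept
```

The check is quadratic in the number of highlights per book, which should be acceptable for typical clipping counts.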

Technical improvements

  • Used pydantic models for structuring data everywhere
  • Added a rich logging interface with colors and error handling
  • Integrated uv and direnv
  • Read the Notion auth token and database reference UUID from env vars instead of supplying them on the command line
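
As a rough illustration of the env-var approach, the sketch below reads both values and fails early when either is missing. The variable names `NOTION_TOKEN` and `NOTION_DATABASE_ID` are assumptions for this example and may not match the names used in the PR's .envrc.

```python
import os
from typing import Dict, Mapping


def load_notion_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read Notion credentials from the environment (hypothetical names)."""
    required = ("NOTION_TOKEN", "NOTION_DATABASE_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail before any API call so a bad setup is reported clearly.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"token": env["NOTION_TOKEN"], "database_id": env["NOTION_DATABASE_ID"]}
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.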

* The database has additional fields to store whether the last upload included BlockQuotes, Location, and Highlight Date. We now check these to decide whether the page content needs to be refreshed
* Optimize exporting as separate blocks by sending them in batches of 100, which is the per-request limit of the Notion API.
* Page management is completely automatic at the moment, so manual changes are not expected to persist.
* Add .envrc to work with direnv and uv.lock files for uv
* Add pydantic models for better validation
* Simplify code throughout the codebase
* Add support for pruning overlapping highlights to remove the clutter
* Locate the correct .mobi file on the Kindle device
* MobiHandler class executes the following flow:
  ** converts the .mobi file to HTML
  ** parses the TOC from the extraction
  ** fetches the starting character offset of each TOC heading in the HTML document
* Before exporting, we map every highlight to the heading that contains it, based on the position information provided by MobiHandler
* Improve error handling so that Notion is not left in an inconsistent state
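
The batching mentioned above can be sketched as follows. This is an illustration rather than the PR's code: `batched` and `paragraph_block` are hypothetical helpers, and the actual upload call is left as a comment because it depends on the client library in use.

```python
from typing import Iterator, List

MAX_BLOCKS_PER_REQUEST = 100  # Notion's limit on children per append request


def paragraph_block(text: str) -> dict:
    """Build a Notion paragraph block payload for one highlight."""
    return {
        "object": "block",
        "type": "paragraph",
        "paragraph": {"rich_text": [{"type": "text", "text": {"content": text}}]},
    }


def batched(blocks: List[dict], size: int = MAX_BLOCKS_PER_REQUEST) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` blocks."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(blocks), size):
        yield blocks[start:start + size]


# Each batch would then be sent in a single request, for example (assuming
# the notion-client package and a page ID in `page_id`):
#     client.blocks.children.append(block_id=page_id, children=batch)
```

For a book with 250 highlights this produces three requests instead of 250.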