codelibs/nekohtml

NekoHTML

A pure Java HTML parser with zero dependencies

NekoHTML is a lightweight, tolerant HTML parser for Java that generates well-formed XML/DOM output from legacy and malformed HTML. The project was originally forked from CyberNeko HTML Parser 1.9.22; version 3.0 is a complete rewrite that eliminates all external dependencies and uses only standard Java APIs.

✨ Key Features

  • Zero Dependencies - Pure Java 17+ with no transitive dependencies (JAR size ~50KB)
  • Standard APIs - Uses only the JDK's built-in SAX (org.xml.sax) and DOM (org.w3c.dom) APIs
  • Backward Compatible - Existing DOMParser and SAXParser code works unchanged
  • Flexible Parsing - DOM tree building and event-based SAX parsing
  • Tolerant - Handles malformed HTML gracefully
  • Modern Java - Requires Java 17+, uses modern language features
  • Well Tested - Comprehensive unit test coverage with JUnit 5

🚀 Quick Start

Installation

Add to your pom.xml:

<dependency>
  <groupId>org.codelibs</groupId>
  <artifactId>nekohtml</artifactId>
  <version>3.0.0-SNAPSHOT</version>
</dependency>

No other dependencies needed! ✅

Basic Usage - DOM Parser

import org.codelibs.nekohtml.parsers.DOMParser;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;
import java.io.StringReader;

// Parse HTML to DOM
DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader("<html><body><h1>Hello</h1></body></html>")));
Document doc = parser.getDocument();

// Query elements
System.out.println(doc.getElementsByTagName("h1").item(0).getTextContent());
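Because the parser hands back a standard org.w3c.dom.Document, the JDK's own javax.xml.transform API should be able to write it out as well-formed XML. The following is a minimal sketch under that assumption. The class name DomToXml and both helper methods are hypothetical, and parseWellFormed uses the JDK's DocumentBuilder only as a stand-in for parser.getDocument() so that the sketch runs without NekoHTML on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of NekoHTML.
class DomToXml {

    /** Serializes any DOM Document to an XML string with the JDK's identity transformer. */
    static String toXml(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Force XML output; some transformers switch to HTML output for an <html> root.
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Stand-in for parser.getDocument(): the JDK parser accepts well-formed input only. */
    static Document parseWellFormed(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWellFormed("<html><body><h1>Hello</h1></body></html>");
        System.out.println(toXml(doc));
    }
}
```

In real use, the Document returned by the DOMParser shown above would be passed to toXml in place of the stand-in.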

SAX-Based Parsing

import org.codelibs.nekohtml.parsers.SAXParser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;

String html = "<html><body><h1>Hello</h1></body></html>";

// Print the name of every element as the parser encounters it
SAXParser parser = new SAXParser();
parser.setContentHandler(new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        System.out.println("Element: " + qName);
    }
});

parser.parse(new InputSource(new StringReader(html)));

πŸ—οΈ Project Structure

src/main/java/org/codelibs/nekohtml/
├── parsers/              # Parser implementations
│   ├── DOMParser.java    # DOM-based HTML parser
│   ├── SAXParser.java    # SAX-based HTML parser
│   └── SAXToDOMHandler.java
├── sax/                  # Pure SAX implementation (v3.0)
│   ├── HTMLSAXParser.java          # New SAX parser
│   ├── HTMLSAXConfiguration.java   # Configuration/pipeline
│   ├── HTMLSAXScanner.java         # Scanner wrapper
│   ├── SimpleHTMLScanner.java      # Regex-based scanner
│   ├── HTMLTagBalancerFilter.java  # Tag balancing
│   ├── HTMLQName.java              # Qualified names
│   ├── HTMLAttributesImpl.java     # Attributes
│   └── ...                         # Support classes
├── HTMLElements.java     # HTML element definitions
├── HTMLEntities.java     # Entity references
└── HTMLErrorReporter.java # Error reporting

src/test/java/            # Comprehensive test suite
└── org/codelibs/nekohtml/
    ├── parsers/          # Parser tests
    ├── sax/              # SAX implementation tests
    └── ...               # Core functionality tests

🔧 Building & Development

Prerequisites

  • Java 17 or higher
  • Maven 3.6+

Build Commands

# Compile
mvn clean compile

# Run all tests
mvn test

# Run specific test
mvn test -Dtest=DOMParserTest

# Generate coverage report
mvn verify
# Report at: target/site/jacoco/index.html

# Build JAR
mvn package

# Format code
mvn formatter:format

# Apply license headers
mvn license:format

# Generate Javadoc
mvn javadoc:javadoc

Running Tests

The project uses JUnit 5 with Mockito for testing:

# All 21+ tests across the codebase
mvn test

# Test categories:
# - Parser tests (DOMParser, SAXParser)
# - SAX implementation tests
# - HTML elements and entities
# - Error handling
# - Configuration and features

🎯 Use Cases

Extract Links from HTML

import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(html)));
Document doc = parser.getDocument();

// Print the href attribute of every <a> element
NodeList links = doc.getElementsByTagName("a");
for (int i = 0; i < links.getLength(); i++) {
    Element link = (Element) links.item(i);
    System.out.println(link.getAttribute("href"));
}
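Extracted href values are often relative, and Element.getAttribute returns an empty string when the attribute is absent, so a crawler usually needs one more step before the links are usable. The sketch below shows one way to do that with java.net.URI from the JDK. The class LinkResolver and its resolveAll method are hypothetical helpers, not part of NekoHTML, and skipping invalid hrefs is an assumed policy.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of NekoHTML.
class LinkResolver {

    /** Resolves each href against baseUrl, skipping blank or syntactically invalid values. */
    static List<String> resolveAll(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> resolved = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null || href.isBlank()) {
                continue; // <a> without an href, or an empty one
            }
            try {
                resolved.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // href is not a valid URI (for example, it contains spaces); skip it
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of("/about", "guide.html", "", "https://other.org/x");
        resolveAll("https://example.com/docs/index.html", hrefs).forEach(System.out::println);
    }
}
```

The hrefs list would be filled from the loop above instead of printing each value.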

Parse HTML from URL

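import java.io.InputStream;
import java.net.URL;

DOMParser parser = new DOMParser();
URL url = new URL("https://example.com");
// Close the network stream once parsing has finished
try (InputStream in = url.openStream()) {
    parser.parse(new InputSource(in));
}
Document doc = parser.getDocument();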

Count HTML Elements

import java.util.HashMap;
import java.util.Map;

SAXParser parser = new SAXParser();
Map<String, Integer> counts = new HashMap<>();

// Increment the counter for each element name as it starts
parser.setContentHandler(new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        counts.merge(qName, 1, Integer::sum);
    }
});

parser.parse(new InputSource(new StringReader(html)));
counts.forEach((tag, count) -> System.out.println(tag + ": " + count));
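The same handler pattern can be pulled out into a named class when it needs state or reuse. As a sketch, the hypothetical TextExtractor below collects visible text and ignores the contents of script and style elements. It is a plain org.xml.sax DefaultHandler, so it should work with any SAX source. The extract method drives it with the JDK's own SAX parser on well-formed input purely so the example runs without NekoHTML; with NekoHTML, the handler would instead be passed to parser.setContentHandler as above. Element names are compared case-insensitively because this sketch makes no assumption about how the parser normalizes case.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of NekoHTML.
class TextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside <script> or <style>

    private static boolean isSkipped(String qName) {
        return "script".equalsIgnoreCase(qName) || "style".equalsIgnoreCase(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isSkipped(qName)) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isSkipped(qName) && skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    /** Returns the collected text with runs of whitespace collapsed to single spaces. */
    String getText() {
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Demo driver using the JDK SAX parser; accepts well-formed input only. */
    static String extract(String wellFormedXhtml) throws Exception {
        TextExtractor handler = new TextExtractor();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wellFormedXhtml)), handler);
        return handler.getText();
    }
}
```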

πŸ›οΈ Architecture

Core Components

  • HTMLSAXParser - Pure SAX interface for HTML parsing
  • HTMLSAXConfiguration - Pipeline orchestrator and feature management
  • SimpleHTMLScanner - Regex-based HTML tokenizer
  • HTMLTagBalancerFilter - SAX filter for tag balancing
  • DOMParser/SAXParser - Backward-compatible parser interfaces

Parsing Pipeline

HTML Input → SimpleHTMLScanner → HTMLTagBalancerFilter → SAX Events → DOM/Handler
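To make the scanner-then-balancer idea concrete, here is a deliberately tiny illustration of the general technique: a regex tokenizer feeds a stack that closes unclosed elements and drops stray end tags. This is not NekoHTML's implementation. The class ToyBalancer, its regex, its short list of void elements, and the string event format are all assumptions made for the sketch, and attributes, comments, entities, and HTML-specific nesting rules are ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only; unrelated to the actual SimpleHTMLScanner/HTMLTagBalancerFilter code.
class ToyBalancer {
    // Matches start and end tags; group 1 is "/" for end tags, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");
    // Elements that never have content, so they are closed immediately (partial list).
    private static final Set<String> VOID = Set.of("br", "hr", "img", "input", "meta", "link");

    /** Returns a balanced event list such as [start:p, text:hi, end:p]. */
    static List<String> balance(String html) {
        List<String> events = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                events.add("text:" + html.substring(pos, m.start()));
            }
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                events.add("start:" + name);
                if (VOID.contains(name)) {
                    events.add("end:" + name);
                } else {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Close everything opened after the matching start tag, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    events.add("end:" + top);
                } while (!top.equals(name));
            }
            // An end tag with no matching open element is dropped.
        }
        if (pos < html.length()) {
            events.add("text:" + html.substring(pos));
        }
        while (!open.isEmpty()) {
            events.add("end:" + open.pop()); // close whatever is still open at end of input
        }
        return events;
    }
}
```

For example, ToyBalancer.balance("<p><b>hi</p>") closes the unclosed b element before the p element, giving the events start:p, start:b, text:hi, end:b, end:p.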

📋 Requirements

  • Runtime: Java 17 or higher
  • Build: Maven 3.6+
  • Dependencies: None (pure Java)

📦 Releases

Download from Maven Central

🤝 Contributing

Contributions welcome! The pure Java codebase makes it easy to contribute.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run mvn verify to ensure tests pass
  5. Format code: mvn formatter:format
  6. Submit a pull request

Code Style

  • Follow existing code conventions
  • Use Eclipse formatter: src/config/eclipse/formatter/java.xml
  • Maintain test coverage
  • Add tests for new features

📄 License

Apache License 2.0 - See LICENSE.txt
