A pure Java HTML parser with zero dependencies
NekoHTML is a lightweight, tolerant HTML parser for Java that generates well-formed XML/DOM output from legacy and malformed HTML. Originally forked from CyberNeko HTML Parser 1.9.22, version 3.0 has been completely rewritten to eliminate all external dependencies and use only standard Java APIs.
- Zero Dependencies - Pure Java 17+ with no transitive dependencies (JAR size ~50KB)
- Standard APIs - Uses only javax.xml SAX and DOM APIs
- Backward Compatible - Existing DOMParser and SAXParser code works unchanged
- Flexible Parsing - DOM tree building and event-based SAX parsing
- Tolerant - Handles malformed HTML gracefully
- Modern Java - Requires Java 17+, uses modern language features
- Well Tested - Comprehensive unit test coverage with JUnit 5
Add to your pom.xml:
<dependency>
<groupId>org.codelibs</groupId>
<artifactId>nekohtml</artifactId>
<version>3.0.0-SNAPSHOT</version>
</dependency>
No other dependencies needed!
import org.codelibs.nekohtml.parsers.DOMParser;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;
import java.io.StringReader;
// Parse HTML to DOM
DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader("<html><body><h1>Hello</h1></body></html>")));
Document doc = parser.getDocument();
// Query elements
System.out.println(doc.getElementsByTagName("h1").item(0).getTextContent());
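Because the parser returns a standard org.w3c.dom.Document, the usual JAXP tooling should apply to the result. The following is a minimal sketch of serializing such a document to well-formed XML with javax.xml.transform. The class name DomToXmlSketch and the helper sampleDocument() are illustrative assumptions, not part of NekoHTML; the sample document is built by hand so the sketch runs without the library, and in practice you would pass the Document from parser.getDocument() instead.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class (not part of NekoHTML).
public class DomToXmlSketch {

    // Serialize any DOM Document to an XML string using the JDK's identity transformer.
    public static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    // Stand-in for parser.getDocument(): builds <html><body><h1>Hello</h1></body></html> by hand.
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element h1 = doc.createElement("h1");
            h1.setTextContent("Hello");
            body.appendChild(h1);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(sampleDocument()));
    }
}
```

With a real parse result, the call would simply be DomToXmlSketch.toXml(parser.getDocument()).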
import org.codelibs.nekohtml.parsers.SAXParser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;
// Parse HTML as a stream of SAX events
String html = "<html><body><h1>Hello</h1></body></html>";
SAXParser parser = new SAXParser();
parser.setContentHandler(new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        // Called once for each start tag
        System.out.println("Element: " + qName);
    }
});
parser.parse(new InputSource(new StringReader(html)));
src/main/java/org/codelibs/nekohtml/
├── parsers/                          # Parser implementations
│   ├── DOMParser.java                # DOM-based HTML parser
│   ├── SAXParser.java                # SAX-based HTML parser
│   └── SAXToDOMHandler.java
├── sax/                              # Pure SAX implementation (v3.0)
│   ├── HTMLSAXParser.java            # New SAX parser
│   ├── HTMLSAXConfiguration.java     # Configuration/pipeline
│   ├── HTMLSAXScanner.java           # Scanner wrapper
│   ├── SimpleHTMLScanner.java        # Regex-based scanner
│   ├── HTMLTagBalancerFilter.java    # Tag balancing
│   ├── HTMLQName.java                # Qualified names
│   ├── HTMLAttributesImpl.java       # Attributes
│   └── ...                           # Support classes
├── HTMLElements.java                 # HTML element definitions
├── HTMLEntities.java                 # Entity references
└── HTMLErrorReporter.java            # Error reporting
src/test/java/                        # Comprehensive test suite
└── org/codelibs/nekohtml/
    ├── parsers/                      # Parser tests
    ├── sax/                          # SAX implementation tests
    └── ...                           # Core functionality tests
- Java 17 or higher
- Maven 3.6+
# Compile
mvn clean compile
# Run all tests
mvn test
# Run specific test
mvn test -Dtest=DOMParserTest
# Generate coverage report
mvn verify
# Report at: target/site/jacoco/index.html
# Build JAR
mvn package
# Format code
mvn formatter:format
# Apply license headers
mvn license:format
# Generate Javadoc
mvn javadoc:javadoc
The project uses JUnit 5 with Mockito for testing:
# All 21+ tests across the codebase
mvn test
# Test categories:
# - Parser tests (DOMParser, SAXParser)
# - SAX implementation tests
# - HTML elements and entities
# - Error handling
# - Configuration and features
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
// Print the href of every link; html is a String holding the markup to parse
DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(html)));
Document doc = parser.getDocument();
NodeList links = doc.getElementsByTagName("a");
for (int i = 0; i < links.getLength(); i++) {
    Element link = (Element) links.item(i);
    System.out.println(link.getAttribute("href"));
}
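import java.io.InputStream;
import java.net.URL;
DOMParser parser = new DOMParser();
URL url = new URL("https://example.com");
// Close the network stream once parsing finishes
try (InputStream in = url.openStream()) {
    parser.parse(new InputSource(in));
}
Document doc = parser.getDocument();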
import java.util.HashMap;
import java.util.Map;
// Count how often each element name occurs
SAXParser parser = new SAXParser();
Map<String, Integer> counts = new HashMap<>();
parser.setContentHandler(new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        counts.merge(qName, 1, Integer::sum);
    }
});
parser.parse(new InputSource(new StringReader(html)));
counts.forEach((tag, count) -> System.out.println(tag + ": " + count));
- HTMLSAXParser - Pure SAX interface for HTML parsing
- HTMLSAXConfiguration - Pipeline orchestrator and feature management
- SimpleHTMLScanner - Regex-based HTML tokenizer
- HTMLTagBalancerFilter - SAX filter for tag balancing
- DOMParser/SAXParser - Backward-compatible parser interfaces
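To give a feel for what a regex-based tokenizer does, here is a deliberately tiny sketch that splits markup into start-tag, end-tag, and text tokens. It is an illustration only and makes no claim about how SimpleHTMLScanner is actually written; the class name RegexTokenizerSketch, the pattern, and the token format are all assumptions, and attributes, comments, and entities are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration (not NekoHTML code): a minimal regex tokenizer.
public class RegexTokenizerSketch {

    // Group 1 is "/" for end tags, group 2 is the tag name; attributes are skipped.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    public static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        int last = 0;
        while (matcher.find()) {
            // Anything between two tags is emitted as a text token.
            if (matcher.start() > last) {
                tokens.add("TEXT:" + html.substring(last, matcher.start()));
            }
            String kind = matcher.group(1).isEmpty() ? "START:" : "END:";
            // Normalize tag names to lower case, since HTML names are case-insensitive.
            tokens.add(kind + matcher.group(2).toLowerCase(Locale.ROOT));
            last = matcher.end();
        }
        // Trailing text after the final tag.
        if (last < html.length()) {
            tokens.add("TEXT:" + html.substring(last));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<P>Hi<br>there</p>"));
    }
}
```

Note that the token stream for this input contains an unclosed br start tag; repairing that kind of imbalance is the job of a later stage, not the tokenizer.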
HTML Input → SimpleHTMLScanner → HTMLTagBalancerFilter → SAX Events → DOM/Handler
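The filter stage of a pipeline like this can be sketched with the standard org.xml.sax.helpers.XMLFilterImpl, which forwards SAX events to a downstream handler and lets a subclass rewrite them on the way through. The example below is a simplified stand-in under stated assumptions, not the logic of HTMLTagBalancerFilter: it only tracks open elements on a stack, closes any that are still open when an outer end tag or the end of the document arrives, and drops stray end tags. The names TagBalancingSketch, BalancingFilter, and Recorder are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical illustration (not NekoHTML code): a stack-based balancing SAX filter.
public class TagBalancingSketch {

    static class BalancingFilter extends XMLFilterImpl {
        private final Deque<String> open = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            open.push(qName);
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (!open.contains(qName)) {
                return; // stray end tag with no matching start: drop it
            }
            // Close everything opened after the matching start tag, then the tag itself.
            while (!open.isEmpty()) {
                String top = open.pop();
                super.endElement("", top, top);
                if (top.equals(qName)) {
                    break;
                }
            }
        }

        @Override
        public void endDocument() throws SAXException {
            // Close whatever is still open before the document ends.
            while (!open.isEmpty()) {
                String top = open.pop();
                super.endElement("", top, top);
            }
            super.endDocument();
        }
    }

    // Downstream handler that records the events it receives as tag strings.
    static class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("<" + qName + ">");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("</" + qName + ">");
        }
    }

    // Feeds the unbalanced sequence <html><body><p><b></p> through the filter.
    public static List<String> demo() {
        Recorder recorder = new Recorder();
        BalancingFilter filter = new BalancingFilter();
        filter.setContentHandler(recorder);
        Attributes none = new AttributesImpl();
        try {
            filter.startDocument();
            for (String tag : new String[] {"html", "body", "p", "b"}) {
                filter.startElement("", tag, tag, none);
            }
            filter.endElement("", "p", "p");
            filter.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.events;
    }

    public static void main(String[] args) {
        System.out.println(String.join("", demo()));
    }
}
```

The design point is that the downstream handler only ever sees balanced start and end events, which is what makes it possible to build a DOM tree or run an ordinary ContentHandler on malformed input.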
- Runtime: Java 17 or higher
- Build: Maven 3.6+
- Dependencies: None (pure Java)
Download from Maven Central
Contributions welcome! The pure Java codebase makes it easy to contribute.
- Fork the repository
- Create a feature branch
- Make your changes
- Run mvn verify to ensure tests pass
- Format code: mvn formatter:format
- Submit a pull request
- Follow existing code conventions
- Use Eclipse formatter: src/config/eclipse/formatter/java.xml
- Maintain test coverage
- Add tests for new features
Apache License 2.0 - See LICENSE.txt