A smart log processing pipeline where logs, regardless of source, structure, or format, are:

- Automatically analyzed and understood
- Matched against known or discovered structures
- Converted into clean JSON for downstream use (RAG, dashboards, alerts)
- Continuously improved by learning from what the pipeline fails to parse
Status: Implemented
- Uses manually defined regex patterns for known formats (Apache, Syslog, SSH, etc.)
- Converts matching log lines into JSONL
- Logs that do not match are skipped and stored separately
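A minimal sketch of this regex stage is shown below. The pattern itself is illustrative (the project keeps its real patterns in `live_parser_patterns.json`), but the flow is the same: try each named pattern, and emit a JSON record on the first match.

```python
import json
import re

# Hypothetical pattern bank: format name -> compiled regex with named groups.
# The real project loads these from live_parser_patterns.json.
PATTERNS = {
    "syslog": re.compile(
        r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s"
        r"(?P<process>[\w\-/]+)(?:\[\d+\])?:\s(?P<message>.*)$"
    ),
}

def parse_line(line: str):
    """Return (format_name, fields) for the first matching pattern, else None."""
    for name, pattern in PATTERNS.items():
        match = pattern.match(line)
        if match:
            return name, match.groupdict()
    return None

result = parse_line("Jun 14 15:16:01 combo sshd[19939]: authentication failure")
if result:
    fmt, fields = result
    print(json.dumps({"format": fmt, **fields}))
```

Lines for which `parse_line` returns `None` are the ones routed to `SkippedLogs/`.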
Goal: Track all unmatched lines for improvement
Features:
- Saves unparsed lines to `SkippedLogs/`
- Records file name and line number for traceability
- Enables continuous learning and correction
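The bookkeeping can be as simple as appending JSONL records that carry each unmatched line together with its origin. The helper below is a hypothetical sketch (names are illustrative, not project code):

```python
import json
from pathlib import Path

def record_skipped(source_file: str, line_no: int, raw_line: str,
                   out_dir: str = "SkippedLogs") -> Path:
    """Append an unmatched line, with its source file and line number,
    to a JSONL file under SkippedLogs/ for later reprocessing."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    dest = out / (Path(source_file).stem + ".skipped.jsonl")
    entry = {"file": source_file, "line": line_no, "raw": raw_line.rstrip("\n")}
    with dest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return dest
```

Because every record keeps `file` and `line`, a corrected pattern can later be checked against exactly the lines it was meant to fix.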
Goal: Dynamically extract structure from unknown log formats using open-source LLMs like Mistral, Gemma, or LLaMA3.
Steps:
- Pass skipped lines to an LLM with a prompt like:

      You are a log analysis assistant. Given the following log line, extract:
      - timestamp
      - level
      - message
      Return the output as JSON.

- Cache and validate LLM outputs
- Add to training or deployable pattern bank
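The steps above could be wired to a local Ollama server roughly as follows. This is a sketch under assumptions: the `/api/generate` endpoint is Ollama's standard REST API, but the `validate` helper and prompt wiring are illustrative, not project code.

```python
import json
import urllib.request

PROMPT = (
    "You are a log analysis assistant. Given the following log line, extract:\n"
    "- timestamp\n- level\n- message\n"
    "Return the output as JSON.\n\nLog line: {line}"
)

def ask_llm(line: str, model: str = "mistral") -> str:
    """Send the extraction prompt to a local Ollama server (default port)."""
    body = json.dumps({"model": model,
                       "prompt": PROMPT.format(line=line),
                       "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def validate(raw: str):
    """Accept LLM output only if it is a JSON object with all required keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(parsed, dict) and {"timestamp", "level", "message"} <= parsed.keys():
        return parsed
    return None
```

Only outputs that pass `validate` would be cached and considered for the pattern bank; everything else stays in `SkippedLogs/` for the next pass.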
Benefits:
- Removes the need for new regexes
- Handles unstructured, unknown, or mixed-format logs
Goal: Automatically learn templates and clusters from logs
Features:
- Use Drain3 to:
  - Discover static and dynamic fields
  - Group logs into clusters
  - Mine templates like `User * logged in from *`
- Store mined templates for downstream use or learning
- Use clustering insights to guide new pattern or anomaly detection
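Drain3 does considerably more (prefix-tree clustering, similarity thresholds, persistence), but the core masking idea can be illustrated with a toy pure-Python miner. This is not the Drain3 API, just a sketch of how a template like the one above falls out of a cluster:

```python
def mine_template(lines):
    """Toy Drain-style miner: for a cluster of same-length lines, keep tokens
    shared by every line and mask varying positions with the '*' wildcard."""
    token_lists = [line.split() for line in lines]
    length = len(token_lists[0])
    assert all(len(t) == length for t in token_lists), "group by token count first"
    template = []
    for position in range(length):
        tokens = {t[position] for t in token_lists}
        # A single distinct token is a static field; variation means a dynamic one.
        template.append(tokens.pop() if len(tokens) == 1 else "*")
    return " ".join(template)
```

For example, the two lines `User alice logged in from 10.0.0.1` and `User bob logged in from 10.0.0.2` collapse to `User * logged in from *`.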
Goal: Build a self-improving parser system
How:
- Reprocess skipped lines periodically
- Generate new patterns from LLM or Drain3
- Validate outputs with scoring or confidence thresholds
- Add verified patterns to `live_parser_patterns.json`
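The promotion step might look like the hypothetical sketch below: score a candidate regex against the skipped lines it was generated from, and add it to the live bank only when it clears a threshold (the helper name and scoring rule are assumptions, not project code):

```python
import json
import re
from pathlib import Path

def promote_pattern(name, regex, sample_lines,
                    bank_path="Patterns/live_parser_patterns.json",
                    threshold=0.9):
    """Add a candidate pattern to the live bank only if it matches at least
    `threshold` of the sample lines it was generated from."""
    compiled = re.compile(regex)
    hits = sum(1 for line in sample_lines if compiled.match(line))
    score = hits / len(sample_lines)
    if score < threshold:
        return False, score
    bank_file = Path(bank_path)
    bank = json.loads(bank_file.read_text()) if bank_file.exists() else {}
    bank[name] = regex
    bank_file.parent.mkdir(parents=True, exist_ok=True)
    bank_file.write_text(json.dumps(bank, indent=2))
    return True, score
```

Rejected candidates keep their score, so borderline patterns can be queued for manual or LLM-assisted review instead of being silently dropped.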
| Feature | Description |
|---|---|
| Accuracy scoring | Manual or LLM-assisted evaluation |
| Confidence thresholds | Auto-accept LLM outputs above a threshold |
| Parsing dashboard | Visualize logs parsed, templates learned, and anomalies |
| Secure fine-tuning | Handle PII-sensitive logs privately |
| RAG-based querying | Ask questions of your logs via an embedded vector DB |
```mermaid
graph TD
    A[Raw Logs] --> B[Regex-based Parser]
    B -->|Parsed| C[JSONL Logs]
    B -->|Skipped| D[SkippedLogs/]
    D --> E[LLM Analysis & Labeling]
    D --> F[Drain3 Template Mining]
    E --> G[Auto-Generated Patterns]
    F --> G
    G --> H[Updated Parser Patterns]
    H --> B
    C --> I[RAG / Vector DB]
```
```text
log-parser-intelligent/
├── logs/                     # Raw input logs
├── ParsedLogs/               # Parsed JSONL files
├── SkippedLogs/              # Unmatched logs with trace info
├── Anomalies/                # Drain3-flagged anomalies
├── Patterns/
│   ├── live_parser_patterns.json
│   └── learned_templates.json
├── llm_prompts/
│   └── log_schema_extraction.txt
├── vectorstore/              # For RAG embeddings
├── drain3_snapshot.json      # Template cluster snapshot
└── README.md                 # This file
```
- Clone this repo
- Install dependencies:

  ```bash
  pip install drain3 openai chromadb
  ```

- Run the multi-parser:

  ```bash
  python parse_logs.py --input ./logs --output ./ParsedLogs
  ```

- Run the LLM assist step:

  ```bash
  python enrich_with_llm.py --input ./SkippedLogs --output ./ParsedLogs
  ```
Want to add new patterns, LLM prompt styles, or vector search capabilities?
Feel free to fork and raise a PR.
- Drain3
- ChromaDB
- Open-source LLMs: Mistral / Gemma / LLaMA3 via Ollama
- Inspired by real-world log intelligence & observability challenges
Feel free to connect for ideas, issues or collaborations:
- Maintainer: @mrsahiljaiswal
- Email: [email protected] (replace with your real contact)