TeX Live installation robustness: Network resilience and error handling improvements #1361

@coderabbitai

Problem Description

The TeX Live installation in install_pdf_deps.sh intermittently fails on network errors, which breaks Docker builds in CI.

Recent Failure Example:

TLPDB::_install_data: downloading did not succeed (download_file failed) for https://us.mirrors.cicku.me/ctan/systems/texlive/tlnet/archive/texlive-scripts.tar.xz
Installation failed.
Rerunning the installer will try to restart the installation.
Or you can restart by running the installer with:
  install-tl --profile installation.profile [YOUR-EXTRA-ARGS]
error: build error: building at STEP "RUN ./utils/install_pdf_deps.sh": while running runtime: exit status 1

CI Run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/opendatahub-io_notebooks/1357/pull-ci-opendatahub-io-notebooks-main-images/1942865679711473664

Root Cause Analysis

The current TeX Live installation process lacks:

  1. Network Resilience: No retry mechanisms for failed downloads
  2. Mirror Fallback: Single point of failure on specific CTAN mirrors
  3. Timeout Handling: No protection against hanging downloads
  4. Error Recovery: Limited ability to resume failed installations
  5. Dependency Caching: No mechanism to avoid repeated downloads

Current Implementation Issues

Based on the error message, the installation process:

  • Relies on external mirrors that may be temporarily unavailable
  • Lacks robust download retry mechanisms
  • Has no fallback strategies for mirror failures
  • Provides limited error diagnostics for network issues

Solution Options

Option 1: Enhanced Network Resilience (Recommended)

Implement comprehensive retry and fallback mechanisms:

# Enhanced installation with retry logic
install_texlive_with_retries() {
    local max_retries=3
    local retry_delay=30
    local mirrors=(
        "https://mirror.ctan.org/systems/texlive/tlnet"
        "https://ctan.math.utah.edu/ctan/tex-archive/systems/texlive/tlnet"
        "https://mirrors.rit.edu/CTAN/systems/texlive/tlnet"
        "https://us.mirrors.cicku.me/ctan/systems/texlive/tlnet"
    )
    
    local mirror attempt
    for mirror in "${mirrors[@]}"; do
        for attempt in $(seq 1 "$max_retries"); do
            echo "Attempting TeX Live installation from $mirror (attempt $attempt/$max_retries)"

            # Quote the mirror URL so it is always passed as a single argument
            if install-tl --location "$mirror" --profile installation.profile; then
                echo "TeX Live installation successful from $mirror"
                return 0
            fi

            if [ "$attempt" -lt "$max_retries" ]; then
                echo "Installation failed, retrying in ${retry_delay}s..."
                sleep "$retry_delay"
            fi
        done
        echo "All attempts failed for $mirror, trying next mirror..."
    done

    # Report the final failure on stderr so it stands out in build logs
    echo "ERROR: TeX Live installation failed from all mirrors" >&2
    return 1
}

Option 2: Container-Based TeX Live Installation

Use pre-built TeX Live containers or packages:

# Alternative: Use system TeX Live packages
RUN dnf install -y texlive-scheme-medium texlive-collection-latexextra \
    && dnf clean all

Option 3: Cached Installation Approach

Implement local caching and validation:

# Cache-aware installation
TEXLIVE_CACHE_DIR="/tmp/texlive-cache"
TEXLIVE_INSTALLER_URL="https://mirror.ctan.org/systems/texlive/tlnet/install-tl-unx.tar.gz"

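download_with_cache() {
    local url="$1"
    local cache_file="$2"
    local max_retries=3
    local attempt

    # A non-empty cached file is assumed to be a complete download
    if [ -s "$cache_file" ]; then
        echo "Using cached file: $cache_file"
        return 0
    fi

    mkdir -p "$(dirname "$cache_file")"

    for attempt in $(seq 1 "$max_retries"); do
        # Download to a temporary name so a failed attempt never leaves
        # a partial file that a later run would mistake for a cache hit
        if wget --timeout=300 --tries=1 "$url" -O "${cache_file}.part"; then
            mv "${cache_file}.part" "$cache_file"
            return 0
        fi
        rm -f "${cache_file}.part"
        if [ "$attempt" -lt "$max_retries" ]; then
            sleep 30
        fi
    done

    return 1
}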

Option 4: Hybrid Installation with Validation

Combine multiple approaches with comprehensive validation:

# Hybrid approach with validation
install_texlive_hybrid() {
    # Try system packages first
    if command -v dnf &> /dev/null; then
        if dnf install -y texlive-scheme-basic; then
            echo "System TeX Live packages installed successfully"
            return 0
        fi
    fi
    
    # Fallback to network installation with retries
    install_texlive_with_retries
}
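
Option 4 names validation, but the function above returns as soon as an installer exits successfully. A minimal sketch of a post-install check follows. The helper name `require_commands` and the tool list in the example call are assumptions, not taken from install_pdf_deps.sh.

```shell
# Hypothetical post-install check: succeed only if every named command is on PATH.
require_commands() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" > /dev/null 2>&1; then
            echo "ERROR: required command not found: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call after either installation path (tool list is an assumption):
# require_commands pdflatex xelatex || exit 1
```

Checking for the binaries the PDF export actually needs catches the case where a smaller system scheme installs cleanly but omits a required engine.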

Acceptance Criteria

Core Requirements

  • TeX Live installation succeeds consistently across different network conditions
  • Automatic retry mechanisms for failed downloads
  • Multiple mirror fallback support
  • Comprehensive error logging and diagnostics
  • Installation time optimization through caching

Robustness Features

  • Timeout protection for hanging downloads
  • Partial download recovery capabilities
  • Network connectivity validation before installation
  • Graceful degradation when mirrors are unavailable

Monitoring and Diagnostics

  • Detailed logging of installation attempts and failures
  • Mirror response time tracking
  • Installation success/failure metrics
  • Clear error messages for troubleshooting
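
One possible way to get per-attempt logging and rough timing is to wrap each step in a small timing helper. This is only a sketch: `timed_run` and the log format are hypothetical, and `date +%s` gives whole-second resolution only.

```shell
# Hypothetical helper: run a command, then log its label, exit status, and duration.
timed_run() {
    local label="$1"
    shift
    local start end status=0
    start="$(date +%s)"
    "$@" || status=$?
    end="$(date +%s)"
    # One key=value line per run keeps the output easy to grep in CI logs
    echo "[metrics] label=${label} status=${status} duration_s=$((end - start))" >&2
    return "$status"
}

# Example (the label is an arbitrary tag chosen by the caller):
# timed_run mirror.ctan.org install-tl --location "https://mirror.ctan.org/systems/texlive/tlnet" --profile installation.profile
```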

Implementation Guidance

Phase 1: Basic Resilience

  1. Add retry logic to existing installation process
  2. Implement timeout protection for downloads
  3. Add basic error handling and logging
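
The three Phase 1 steps could be combined in one generic wrapper. The sketch below assumes GNU coreutils `timeout` is present in the build image. The function name and the numbers in the example call are illustrative, not existing project code.

```shell
# Hypothetical helper: retry a command, bounding each attempt with a timeout.
# Usage: retry_with_timeout MAX_RETRIES TIMEOUT_SECONDS DELAY_SECONDS command [args...]
retry_with_timeout() {
    local max_retries="$1" timeout_s="$2" delay_s="$3"
    shift 3
    local attempt status
    for attempt in $(seq 1 "$max_retries"); do
        echo "[retry] attempt ${attempt}/${max_retries}: $*" >&2
        status=0
        # GNU timeout exits with 124 when the time limit is reached
        timeout "$timeout_s" "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        echo "[retry] attempt ${attempt} failed with exit status ${status}" >&2
        if [ "$attempt" -lt "$max_retries" ]; then
            sleep "$delay_s"
        fi
    done
    return 1
}

# Example (values are assumptions): three tries, 30 minutes each, 30 s apart
# retry_with_timeout 3 1800 30 install-tl --profile installation.profile
```

Because `timeout` executes an external program, the wrapped command has to be an executable on PATH rather than a shell function.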

Phase 2: Mirror Fallback

  1. Configure multiple CTAN mirrors
  2. Implement automatic failover between mirrors
  3. Add mirror health checking
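
Mirror health checking might look like the sketch below, which picks the first mirror that answers a lightweight probe before the full installer runs. It assumes `curl` is available and that each mirror serves `tlpkg/texlive.tlpdb` under its tlnet path. The function names and the `PROBE_CMD` override are hypothetical.

```shell
# Hypothetical probe: HEAD request for the package database under the tlnet root.
probe_mirror() {
    curl --silent --fail --head --location --max-time 10 \
        "$1/tlpkg/texlive.tlpdb" > /dev/null
}

# Print the first mirror whose probe succeeds; return 1 if none responds.
# PROBE_CMD can be overridden so the selection logic is testable offline.
select_healthy_mirror() {
    local probe="${PROBE_CMD:-probe_mirror}"
    local mirror
    for mirror in "$@"; do
        if "$probe" "$mirror"; then
            printf '%s\n' "$mirror"
            return 0
        fi
        echo "[mirror] unhealthy: $mirror" >&2
    done
    return 1
}

# Example:
# mirror="$(select_healthy_mirror "https://mirror.ctan.org/systems/texlive/tlnet" \
#     "https://mirrors.rit.edu/CTAN/systems/texlive/tlnet")" || exit 1
```

A passing probe only shows the mirror is reachable at that moment, so the retry logic is still needed for failures during the download itself.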

Phase 3: Caching and Optimization

  1. Implement local caching for downloaded packages
  2. Add checksum validation for cached files
  3. Optimize installation profile for required packages only
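
Checksum validation for a cached file could be sketched as follows. The helper name is hypothetical, and the caller is assumed to obtain the expected SHA-512 digest from a trusted source, for example a checksum file published next to the installer archive, which should be confirmed for the mirror in use.

```shell
# Hypothetical helper: accept a cached file only if its SHA-512 digest matches.
# A mismatching file is deleted so the next run downloads it again.
verify_cached_file() {
    local file="$1" expected="$2"
    local actual
    [ -f "$file" ] || return 1
    actual="$(sha512sum "$file" | awk '{print $1}')"
    if [ "$actual" = "$expected" ]; then
        return 0
    fi
    echo "[cache] checksum mismatch for $file, discarding" >&2
    rm -f "$file"
    return 1
}

# Example (both variables are placeholders supplied by the caller):
# verify_cached_file "$TEXLIVE_CACHE_DIR/install-tl-unx.tar.gz" "$expected_sha512" || exit 1
```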

Phase 4: Alternative Approaches

  1. Evaluate system package manager alternatives
  2. Consider container-based TeX Live distributions
  3. Implement hybrid installation strategies

Testing Approach

Network Resilience Testing

  • Test installation with simulated network failures
  • Verify retry mechanisms work correctly
  • Test timeout handling for slow mirrors
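
Network failures can be simulated without touching the network by putting a stub `install-tl` first on PATH. The sketch below generates a stub that fails a chosen number of times before succeeding. `make_flaky_installer` and the counter file are hypothetical test scaffolding, not part of TeX Live.

```shell
# Hypothetical test helper: create a fake `install-tl` in DIR that fails
# the first FAILURES invocations and succeeds afterwards.
make_flaky_installer() {
    local dir="$1" failures="$2"
    mkdir -p "$dir"
    echo 0 > "$dir/calls"
    cat > "$dir/install-tl" <<EOF
#!/bin/sh
# Stub installer: counts invocations and fails the first ${failures} of them.
calls=\$(( \$(cat "$dir/calls") + 1 ))
echo "\$calls" > "$dir/calls"
[ "\$calls" -gt "$failures" ]
EOF
    chmod +x "$dir/install-tl"
}

# Example: two simulated failures, then success on the third call
# stub_dir="$(mktemp -d)"
# make_flaky_installer "$stub_dir" 2
# PATH="$stub_dir:$PATH" install_texlive_with_retries
```

With the stub in place, the number recorded in the `calls` file shows how many attempts the retry logic made. For a fast test run, the retry delay in the function under test would need to be lowered.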

Mirror Fallback Testing

  • Test with individual mirrors disabled
  • Verify automatic failover functionality
  • Test with all mirrors temporarily unavailable

Performance Testing

  • Measure installation time improvements
  • Test caching effectiveness
  • Verify resource usage optimization

Impact

Network-related TeX Live installation failures cause:

  • CI pipeline interruptions and delays
  • Inconsistent Docker build success rates
  • Developer productivity loss due to flaky builds
  • Potential production deployment risks for PDF functionality
