Add scrape_fake_jobs.py script to scrape Real Python fake jobs demo page #1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overview
This PR adds a new Python script
scripts/scrape_fake_jobs.pythat scrapes job postings from the Real Python "fake jobs" demo page (https://realpython.github.io/fake-jobs/) and saves the results to a CSV file.What's New
Script:
scripts/scrape_fake_jobs.pyA comprehensive web scraping script that demonstrates:
requestsandBeautifulSoupcsvlibraryOutput Format
The script generates
fake_jobs.csvwith the following structure:Technical Details
HTML Structure & Selectors
The script uses class-based BeautifulSoup selectors to extract job data:
<div class="card-content"><h2 class="title"><h3 class="company"><p class="location"><time>element (prefersdatetimeattribute, falls back to text)Fallback Logic
To handle slight HTML variations, the script includes intelligent fallbacks:
h2.title), it falls back to the generic tag (e.g., anyh2)Usage
Additional Changes
.gitignoreto exclude Python artifacts (__pycache__/,*.pyc,*.pyo) and generated CSV filesTesting
The implementation has been thoroughly tested with:
Documentation
The script includes comprehensive documentation:
Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
realpython.github.ioimport requests
from bs4 import BeautifulSoup
url = 'REDACTED'
resp = requests.get(url)
resp.raise_for_status()
html = resp.text
Save to a temp file for inspection
with open('/tmp/fake_jobs_page.html', 'w') as f:
f.write(html)
Parse and show structure
soup = BeautifulSoup(html, 'html.parser')
Find job cards
cards = soup.find_all('div', class_='card')
print(f'Found {len(cards)} job cards')
if cards:
first_card = cards[0]
print('\nFirst card structure:')
print(first_card.prettify()[:1000])` (dns block)
Original prompt
Add a new script scripts/scrape_fake_jobs.py that scrapes the Real Python "fake jobs" demo page and writes the results to fake_jobs.csv. The CSV must include a header row with the exact columns: Job Title, Company, Location, Date Posted. Use requests + BeautifulSoup for scraping and Python's built-in csv library for writing the CSV. Include reasonable selectors using class-based find/find_all per https://blog.apify.com/beautifulsoup-find-by-class/ and add fallbacks for slight HTML variations. Provide clear docstring and instructions in the file header.
File path: scripts/scrape_fake_jobs.py
File contents:
#!/usr/bin/env python3
"""
Scrape fake job postings and save them to fake_jobs.csv
This script demonstrates using BeautifulSoup's class-based find/find_all
features (see https://blog.apify.com/beautifulsoup-find-by-class/) to extract
job postings from the Real Python "fake jobs" demo page and write them to a CSV.
Output CSV: fake_jobs.csv
Columns (header): Job Title, Company, Location, Date Posted
"""
from future import annotations
import csv
import sys
from typing import List, Dict
import requests
from bs4 import BeautifulSoup
SOURCE_URL = "https://realpython.github.io/fake-jobs/"
OUTPUT_CSV = "fake_jobs.csv"
CSV_HEADERS = ["Job Title", "Company", "Location", "Date Posted"]
def fetch_page(url: str) -> str:
resp = requests.get(url)
resp.raise_for_status()
return resp.text
def _get_text(elem) -> str:
return elem.get_text(strip=True) if elem else ""
def parse_jobs(html: str) -> List[Dict[str, str]]:
soup = BeautifulSoup(html, "html.parser")
def write_csv(path: str, rows: List[Dict[str, str]]) -> None:
with open(path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=CSV_HEADERS)
writer.writeheader()
for row in rows:
# Ensure all header keys exist
writer.writerow({k: row.get(k, "") for k in CSV_HEADERS})
def main() -> int:
try:
html = fetch_page(SOURCE_URL)
except Exception as e:
print(f"Error fetching {SOURCE_URL}: {e}", file=sys.stderr)
return 1
if name == "main":
raise SystemExit(main())
Additional notes for the PR description:
Please create a pull request with the new file added on a branch named add/scrape-fake-jobs-script and include a descriptive PR title: "Add script to scrape fake job postings and write ...
This pull request was created as a result of the following prompt from Copilot chat.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.