The Heise News Crawler is designed to automatically extract and store news articles from Heise's archive. The primary goals are:
- 📡 Data Collection: Gather historical news articles from Heise.de.
- 🏛 Structured Storage: Store articles in a PostgreSQL database for easy querying and analysis.
- 🔍 Metadata Extraction: Retrieve key information such as title, author, category, keywords, and word count.
- 🔄 Incremental Crawling: Detect duplicate articles and store only new articles from the current day (see the sketch below this list).
- 🔔 Notifications: Send an email if an error occurs during the crawling process.
- 🎨 Enhanced Terminal Output: Uses PyFiglet for improved readability.
- 📤 Data Export: Export articles as .csv, .json, or .xlsx files, or view the data on a stats.html page.
- 🖥 API: An API endpoint provides statistics and the complete data set.
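How the duplicate detection might look in practice: because article URLs are unique in the database (see the schema below), inserts can simply skip rows that already exist. A minimal sketch using psycopg2; the table name `articles` and the function `save_article` are assumptions for illustration, not the project's actual API.

```python
import psycopg2

def save_article(conn, article: dict) -> bool:
    """Insert one article; return True if it was new, False if a duplicate.

    Relies on the UNIQUE constraint on articles.url: ON CONFLICT DO NOTHING
    turns an insert of an already-crawled URL into a no-op.
    """
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO articles (title, url, date, author, category,
                                  keywords, word_count, editor_abbr, site_name)
            VALUES (%(title)s, %(url)s, %(date)s, %(author)s, %(category)s,
                    %(keywords)s, %(word_count)s, %(editor_abbr)s, %(site_name)s)
            ON CONFLICT (url) DO NOTHING
            """,
            article,
        )
        inserted = cur.rowcount == 1  # 0 rows affected means the URL already existed
    conn.commit()
    return inserted
```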
🔹 Python 3
🔹 PostgreSQL
🔹 Required Python Libraries (Dependencies in requirements.txt)
Install the required Python libraries:
pip3 install -r requirements.txt
Set up your database and email credentials by creating a .env file:
EMAIL_USER=...
EMAIL_PASSWORD=...
SMTP_SERVER=...
SMTP_PORT=...
ALERT_EMAIL=...
DB_NAME=...
DB_USER=...
DB_PASSWORD=...
DB_HOST=...
DB_PORT=...
DISCORD_TOKEN=...
CHANNEL_ID=...
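A minimal sketch of how the scripts could read these values, assuming the python-dotenv package (the actual loading code may differ):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

db_config = {
    "dbname": os.getenv("DB_NAME"),
    "user": os.getenv("DB_USER"),
    "password": os.getenv("DB_PASSWORD"),
    "host": os.getenv("DB_HOST"),
    "port": os.getenv("DB_PORT"),
}
```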
Run the main crawler:

python3 main.py
[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx
[INFO] Gefundene Artikel (insgesamt): 55
xxxx-xx-xx xx:xx:xx [INFO] Verarbeite 16 Artikel für den Tag xxxx-xx-xx
xxxx-xx-xx xx:xx:xx [INFO] 2025-03-01T20:00:00 - article-name
(⬆️ example output; the `xxxx` placeholders stand for the actual date and time)
If fewer than 10 articles are found for a day, an email alert is sent (a sketch of this check follows below).
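The threshold check could look like the following sketch; the `notify` callback stands in for the project's notification helper, whose actual name and signature are not shown here.

```python
MIN_ARTICLES_PER_DAY = 10  # below this, the day is treated as suspicious

def alert_if_sparse(day: str, article_count: int, notify) -> None:
    # Trigger a notification when a day yields too few articles,
    # which usually indicates a crawl or parsing problem.
    if article_count < MIN_ARTICLES_PER_DAY:
        notify(f"Only {article_count} articles found for {day}")
```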
Run the crawler for current articles:

python3 current_crawler.py
[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx
[INFO] Gefundene Artikel (insgesamt): 55
xxxx-xx-xx xx:xx:xx [INFO] Aktueller Crawl-Durchlauf abgeschlossen.
xxxx-xx-xx xx:xx:xx [INFO] Warte 300 Sekunden bis zum nächsten Crawl.
(⬆️ example output; the `xxxx` placeholders stand for the actual date and time)
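The periodic behaviour visible in the log (crawl, then wait 300 seconds) could be implemented as a plain sleep loop; `crawl_once` is a placeholder for the actual crawl function:

```python
import time

CRAWL_INTERVAL_SECONDS = 300  # matches the wait announced in the log above

def run_forever(crawl_once) -> None:
    # Run one crawl pass, then sleep until the next one, indefinitely.
    while True:
        crawl_once()
        print(f"[INFO] Warte {CRAWL_INTERVAL_SECONDS} Sekunden bis zum nächsten Crawl.")
        time.sleep(CRAWL_INTERVAL_SECONDS)
```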
The API server starts automatically. The statistics are available at:
http://127.0.0.1:6600/stats
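The internals of api.py are not shown here, but a minimal Flask sketch that serves templates/stats.html on port 6600, matching the URL above, could look like this (the real implementation may use a different framework and pass computed statistics into the template):

```python
from flask import Flask, render_template

app = Flask(__name__)  # Flask looks for templates/ next to this file

@app.route("/stats")
def stats():
    # The real api.py presumably computes statistics from PostgreSQL
    # and passes them into the template as variables.
    return render_template("stats.html")

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=6600)
```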
You can export the stored articles to a CSV, JSON, or XLSX file:
python3 export_articles.py
Exported articles are saved in the current directory.
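A sketch of what the export step might look like with pandas and SQLAlchemy (export_articles.py may implement it differently; the connection string and table name are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; in practice, build it from the .env values.
engine = create_engine("postgresql://user:password@localhost:5432/dbname")
df = pd.read_sql("SELECT * FROM articles", engine)

df.to_csv("articles_export.csv", index=False)
df.to_json("articles_export.json", orient="records", force_ascii=False)
df.to_excel("articles_export.xlsx", index=False)  # requires openpyxl
```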
| Column | Type | Description |
|---|---|---|
| id | SERIAL | Unique ID |
| title | TEXT | Article title |
| url | TEXT | Article URL (unique) |
| date | TEXT | Publication date |
| author | TEXT | Author(s) |
| category | TEXT | Category |
| keywords | TEXT | Keywords |
| word_count | INT | Word count |
| editor_abbr | TEXT | Editor abbreviation |
| site_name | TEXT | Website name |
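The table can be recreated from the schema above; a sketch using psycopg2 (the table name `articles` is an assumption, while the column types come directly from the schema):

```python
import os
import psycopg2

CREATE_ARTICLES_SQL = """
CREATE TABLE IF NOT EXISTS articles (
    id          SERIAL PRIMARY KEY,
    title       TEXT,
    url         TEXT UNIQUE,
    date        TEXT,
    author      TEXT,
    category    TEXT,
    keywords    TEXT,
    word_count  INT,
    editor_abbr TEXT,
    site_name   TEXT
);
"""

conn = psycopg2.connect(
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
)
with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    cur.execute(CREATE_ARTICLES_SQL)
conn.close()
```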
If any errors occur, an email notification will be sent.
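What the notification step might look like with Python's standard smtplib, using the SMTP settings from .env (notification.py's actual implementation may differ):

```python
import os
import smtplib
from email.message import EmailMessage

def send_error_mail(subject: str, body: str) -> None:
    # Build a plain-text alert and send it via the configured SMTP server.
    msg = EmailMessage()
    msg["From"] = os.getenv("EMAIL_USER")
    msg["To"] = os.getenv("ALERT_EMAIL")
    msg["Subject"] = subject
    msg.set_content(body)

    with smtplib.SMTP(os.getenv("SMTP_SERVER"), int(os.getenv("SMTP_PORT"))) as smtp:
        smtp.starttls()
        smtp.login(os.getenv("EMAIL_USER"), os.getenv("EMAIL_PASSWORD"))
        smtp.send_message(msg)
```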
📂 Heise-News-Crawler
├── 📄 .gitignore             # Git ignore file
├── 📄 .env                   # Environment variables (email & database config; create this file manually)
├── 📄 main.py                # Main crawler script
├── 📄 api.py                 # API functionalities
├── 📄 notification.py        # Email notification handler
├── 📄 test_notifications.py  # Testing email notifications
├── 📄 README.md
├── 📄 current_crawler.py     # Crawler for newer articles
├── 📄 export_articles.py     # Function to export the data
├── 📄 requirements.txt
├── 📂 templates/             # HTML templates
│   └── 📄 stats.html         # Statistics page template
├── 📂 data/                  # Export data (as of 03/03/2025)
│   ├── 📄 .gitattributes
│   ├── 📄 README.md
│   ├── 📄 api.py
│   ├── 📄 articles_export.csv
│   ├── 📄 articles_export.json
│   └── 📄 articles_export.xlsx
└── 📄 LICENCE
Start the API server manually:

python3 api.py
Test the email notifications:

python3 test_notification.py
Please create a pull request or contact us at server@schächner.de.
(with Tableau and Deepnote, as of March 2025)
We have also generated some graphs with Deepnote (❗ based on a random sample of 10,000 rows only ❗).
Also check out the data/Datamining_Heise web crawler-3.twb file, which contains an excerpt of the analyses.
This program is licensed under the GNU General Public License.
This project was built by the two of us within a few days and is under continuous development.
Feel free to reach out if you have any questions, feedback, or just want to say hi!
📧 Email: server@schächner.de
🌐 Website:
💖 Special Thanks
The idea for our Heise News Crawler comes from David Kriesel and his presentation “Spiegel Mining” at 33c3.
Happy Crawling! 🎉