The Heise News Crawler is designed to automatically extract and store news articles from Heise's archive. The primary goals are:
- Data Collection: Gathers historical news articles from Heise.de.
- Structured Storage: Stores articles in a PostgreSQL database for easy querying and analysis.
- Metadata Extraction: Retrieves key information such as title, author, category, keywords, and word count.
- Incremental Crawling: Detects duplicate articles and saves only new articles from the current day (see the sketch after this list).
- Notifications: Sends an email if an error occurs during the crawling process.
- Enhanced Terminal Output: Uses PyFiglet for improved readability.
- Data Export: Exports articles as .csv, .json, or .xlsx files, or displays them on a stats.html page.
- API: An API endpoint provides statistics and the complete data set.
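A minimal sketch of how the incremental duplicate check can work, assuming psycopg2 as the database driver and a UNIQUE constraint on the url column (see the schema below); the table name articles, the function save_article, and the column subset are illustrative, not the project's actual code:

```python
import psycopg2

def save_article(conn, article: dict) -> bool:
    """Insert an article; silently skip it if the URL is already stored."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO articles (title, url, date, author)
            VALUES (%(title)s, %(url)s, %(date)s, %(author)s)
            ON CONFLICT (url) DO NOTHING
            """,
            article,
        )
        inserted = cur.rowcount == 1  # 0 affected rows means duplicate
    conn.commit()
    return inserted
```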
🔹 Python 3
🔹 PostgreSQL
🔹 Required Python libraries (dependencies in requirements.txt)
Install the required Python libraries:

```
pip3 install -r requirements.txt
```
Set up your database and email credentials by creating a .env file:

```
EMAIL_USER=...
EMAIL_PASSWORD=...
SMTP_SERVER=...
SMTP_PORT=...
ALERT_EMAIL=...
DB_NAME=...
DB_USER=...
DB_PASSWORD=...
DB_HOST=...
DB_PORT=...
DISCORD_TOKEN=...
CHANNEL_ID=...
```
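A sketch of how the scripts could load these values, assuming the python-dotenv package (the variable names match the .env file above; the defaults are illustrative):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

DB_CONFIG = {
    "dbname": os.getenv("DB_NAME"),
    "user": os.getenv("DB_USER"),
    "password": os.getenv("DB_PASSWORD"),
    "host": os.getenv("DB_HOST", "localhost"),
    "port": int(os.getenv("DB_PORT", "5432")),
}
SMTP_SERVER = os.getenv("SMTP_SERVER")
SMTP_PORT = int(os.getenv("SMTP_PORT", "587"))
```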
Start the archive crawler:

```
python3 main.py
```

Example output (continues ⬇️ by date; the log messages are German: "Verarbeite 16 Artikel für den Tag" means "processing 16 articles for the day"):

```
[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx
[INFO] Gefundene Artikel (insgesamt): 55
xxxx-xx-xx xx:xx:xx [INFO] Verarbeite 16 Artikel für den Tag xxxx-xx-xx
xxxx-xx-xx xx:xx:xx [INFO] 2025-03-01T20:00:00 - article-name
```
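The archive URLs follow the year/month pattern shown in the log. A sketch of the fetch-and-parse step, assuming requests and BeautifulSoup (the CSS selector and the crawl_month helper are guesses, not the project's actual code):

```python
import requests
from bs4 import BeautifulSoup

def crawl_month(year: int, month: int) -> list[str]:
    """Fetch one archive page and return the article URLs found on it."""
    url = f"https://www.heise.de/newsticker/archiv/{year}/{month:02d}"
    print(f"[INFO] Crawle URL: {url}")
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # The real markup may require a more specific selector.
    links = [a["href"] for a in soup.select("article a[href]")]
    print(f"[INFO] Gefundene Artikel (insgesamt): {len(links)}")
    return links
```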
If fewer than 10 articles are found for a day, an email alert is sent.
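A sketch of what that alert could look like with Python's standard smtplib, using the credentials from .env (send_alert is illustrative; the real logic lives in notification.py):

```python
import os
import smtplib
from email.message import EmailMessage

def send_alert(day: str, count: int) -> None:
    """Email a warning that a day yielded suspiciously few articles."""
    msg = EmailMessage()
    msg["Subject"] = f"Heise crawler: only {count} articles on {day}"
    msg["From"] = os.getenv("EMAIL_USER")
    msg["To"] = os.getenv("ALERT_EMAIL")
    msg.set_content(f"Only {count} articles were found for {day}.")
    with smtplib.SMTP(os.getenv("SMTP_SERVER"), int(os.getenv("SMTP_PORT"))) as smtp:
        smtp.starttls()
        smtp.login(os.getenv("EMAIL_USER"), os.getenv("EMAIL_PASSWORD"))
        smtp.send_message(msg)
```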
Run the crawler for current articles:

```
python3 current_crawler.py
```

Example output (continues ⬇️ by date; "Aktueller Crawl-Durchlauf abgeschlossen" means "current crawl run finished", "Warte 300 Sekunden bis zum nächsten Crawl" means "waiting 300 seconds until the next crawl"):

```
[INFO] Crawle URL: https://www.heise.de/newsticker/archiv/xxxx/xx
[INFO] Gefundene Artikel (insgesamt): 55
xxxx-xx-xx xx:xx:xx [INFO] Aktueller Crawl-Durchlauf abgeschlossen.
xxxx-xx-xx xx:xx:xx [INFO] Warte 300 Sekunden bis zum nächsten Crawl.
```
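As the log suggests, current_crawler.py runs in a loop with a 300-second pause between runs. A sketch of such a loop (crawl_today is a placeholder, not the project's actual function):

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s [%(levelname)s] %(message)s")

def crawl_today() -> None:
    """Placeholder for fetching and storing today's new articles."""

def run_forever(interval: int = 300) -> None:
    while True:
        try:
            crawl_today()
            logging.info("Aktueller Crawl-Durchlauf abgeschlossen.")
        except Exception:
            logging.exception("Crawl failed")  # the real script would also email an alert
        logging.info("Warte %d Sekunden bis zum nächsten Crawl.", interval)
        time.sleep(interval)
```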
The API server starts automatically. You can view the statistics at:
http://127.0.0.1:6600/stats
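A sketch of what the /stats endpoint might look like; Flask and the template variables are assumptions, only the port and the templates/stats.html file come from this repository:

```python
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/stats")
def stats():
    # The real api.py presumably queries PostgreSQL; hard-coded
    # values keep this sketch self-contained.
    return render_template("stats.html", article_count=55, authors=12)

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=6600)
```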
You can export the data for every article to a CSV, JSON, or XLSX file:

```
python3 export_articles.py
```

The exported files are saved in the current directory.
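A sketch of the export step, assuming pandas with a SQLAlchemy engine (the connection string is a placeholder; the output file names match the ones in data/):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/dbname")
df = pd.read_sql("SELECT * FROM articles", engine)

df.to_csv("articles_export.csv", index=False)
df.to_json("articles_export.json", orient="records", force_ascii=False)
df.to_excel("articles_export.xlsx", index=False)  # requires openpyxl
```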
| Column | Type | Description |
|---|---|---|
| id | SERIAL | Unique ID |
| title | TEXT | Article title |
| url | TEXT | Article URL (unique) |
| date | TEXT | Publication date |
| author | TEXT | Author(s) |
| category | TEXT | Category |
| keywords | TEXT | Keywords |
| word_count | INT | Word count |
| editor_abbr | TEXT | Editor abbreviation |
| site_name | TEXT | Website name |
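The schema above, written out as a CREATE TABLE statement and executed with psycopg2, using the credentials from .env (the table name articles is an assumption):

```python
import os
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS articles (
    id          SERIAL PRIMARY KEY,
    title       TEXT,
    url         TEXT UNIQUE,
    date        TEXT,
    author      TEXT,
    category    TEXT,
    keywords    TEXT,
    word_count  INT,
    editor_abbr TEXT,
    site_name   TEXT
);
"""

with psycopg2.connect(
    dbname=os.getenv("DB_NAME"), user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"), host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
) as conn, conn.cursor() as cur:
    cur.execute(DDL)
```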
If any errors occur, an email notification will be sent.
Project structure (old):

```
Heise-News-Crawler
├── .gitignore              # Git ignore file
├── .env                    # Environment variables (email & database config); create this file manually
├── main.py                 # Main crawler script
├── api.py                  # API functionality
├── notification.py         # Email notification handler
├── test_notifications.py   # Tests for the email notifications
├── README.md
├── current_crawler.py      # Crawler for newer articles
├── export_articles.py      # Data export functions
├── requirements.txt
├── templates/              # HTML templates (email & stats page)
│   └── stats.html          # Statistics page template
├── data/                   # Exported data (as of 03/03/2025)
│   ├── .gitattributes
│   ├── README.md
│   ├── api.py
│   ├── articles_export.csv
│   ├── articles_export.json
│   └── articles_export.xlsx
└── LICENCE
```
You can also start the API server manually:

```
python3 api.py
```
To test the email notifications:

```
python3 test_notification.py
```
Please create a pull request or contact us via server@schächner.de.
(with Tableau and Deepnote, as of March 2025)
We have also generated some graphs with Deepnote (only with a random sample of 10,000 rows).
Also check out the data/Datamining_Heise web crawler-3.twb file for an excerpt of the analyses.
This program is licensed under the GNU General Public License.
This project was programmed by the two of us within a few days and is constantly being developed further.
Feel free to reach out if you have any questions, feedback, or just want to say hi!
📧 Email: server@schächner.de
Website:
Special Thanks
The idea for our Heise News Crawler comes from David Kriesel and his presentation "Spiegel Mining" at 33c3.
Happy Crawling!