Scraper for Joel Spolsky's Blog https://www.joelonsoftware.com/
Joel Spolsky is the former CEO of Stack Overflow and has his own blog. The blog has no index that lists just post titles with links: the archive page https://www.joelonsoftware.com/archives/ shows only the months (not titles) he posted in, and clicking a month shows the full content of every article from that month instead of only the titles. This scraper extracts the article titles and links and writes them to an HTML file for easier browsing. It's also written in several languages to serve as practice and code samples.
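As a rough illustration of the idea (not the code in this repo), here is a minimal Python sketch that pulls title/link pairs out of a page and renders them as an HTML list. It assumes each post title is an anchor inside an `h1` or `h2` heading whose class includes `entry-title`; that markup, the names `TitleLinkParser` and `render_html`, and the sample page are all hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (title, link) pairs from anchors inside entry-title headings.

    Assumption: every post title is an <a> nested in an <h1>/<h2> whose
    class list contains "entry-title". The real blog markup may differ.
    """

    def __init__(self):
        super().__init__()
        self.articles = []        # finished (title, link) pairs
        self._in_heading = False  # True while inside an entry-title heading
        self._href = None         # href of the anchor being read, if any
        self._text = []           # text fragments of the current anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag in ("h1", "h2") and "entry-title" in classes:
            self._in_heading = True
        elif tag == "a" and self._in_heading:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.articles.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag in ("h1", "h2"):
            self._in_heading = False


def render_html(articles):
    """Render (title, link) pairs as a plain HTML list, escaping both parts."""
    items = "\n".join(
        f'  <li><a href="{escape(link, quote=True)}">{escape(title)}</a></li>'
        for title, link in articles
    )
    return f"<ul>\n{items}\n</ul>\n"


# Hypothetical month page: one title heading plus body text to be skipped.
sample = """
<article>
  <h1 class="entry-title"><a href="https://example.com/first-post/">First Post</a></h1>
  <p>Full article text that the scraper ignores.</p>
</article>
"""

parser = TitleLinkParser()
parser.feed(sample)
parser.close()
print(render_html(parser.articles))
```

Only anchors inside a title heading are collected, so links in the article body are ignored.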
Original version, written in Go.
- `cd blog-scrape/golang/`
- `go run main.go scrape` to create a local text file of posts
- `go run main.go template` to generate the HTML
- Open `dist/generated.html` to see the list of articles
- Open `config.json` to set the options:
  - `MaxGoRoutines` sets the maximum number of goroutines used to scrape the blog
  - `BufferSize` sets the capacity of the channel and array used to store articles
  - `ScrapeDelay` sets the delay in milliseconds before a goroutine scrapes another link
  - `FastDebug`, if `true`, simplifies/reduces/speeds up certain portions
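To make the first three options concrete, here is a rough Python analogue that uses threads and a bounded queue in place of goroutines and a buffered channel. It is a sketch of the general pattern, not the Go implementation: the function `scrape_all`, the `fetch` callback, and the config values are hypothetical, and `FastDebug` is left out because the README doesn't say which portions it changes.

```python
import queue
import threading
import time


def scrape_all(links, fetch, config):
    """Fetch every link using a bounded pool of worker threads."""
    links = list(links)
    todo = queue.Queue()
    for link in links:
        todo.put(link)
    # Bounded queue, playing the role of a buffered channel of size BufferSize.
    results = queue.Queue(maxsize=config["BufferSize"])

    def worker():
        while True:
            try:
                link = todo.get_nowait()
            except queue.Empty:
                return  # no links left for this worker
            results.put(fetch(link))  # blocks while the buffer is full
            time.sleep(config["ScrapeDelay"] / 1000)  # milliseconds to seconds

    # MaxGoRoutines caps how many workers run at once.
    workers = [
        threading.Thread(target=worker) for _ in range(config["MaxGoRoutines"])
    ]
    for w in workers:
        w.start()
    collected = [results.get() for _ in links]  # drain one result per link
    for w in workers:
        w.join()
    return collected


# Made-up values; only the key names come from config.json.
config = {"MaxGoRoutines": 2, "BufferSize": 4, "ScrapeDelay": 10}
pages = scrape_all(["/a", "/b", "/c"], lambda link: link.upper(), config)
print(sorted(pages))
```

Results arrive in completion order, not input order, so the example sorts them before printing. The sketch assumes `fetch` does not raise; a real scraper would need error handling so a failed request cannot stall the collector.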
Python version, written using Python 3.9. It doesn't have the HTML file feature (yet?).
- `cd blog-scrape/python/`
- Run either:
  - Python 3.9 - `python scrape.py`
  - Docker - `docker-compose up`
- See output in `dist/scraped-links.json`
- See `config.py` for options and documentation