Scraper with versions written in several languages used to scrape Joel Spolsky's blog

jayvanhu/blog-scrape

Scraper for Joel Spolsky's Blog https://www.joelonsoftware.com/

Joel Spolsky is the former CEO of Stack Overflow and has his own blog. The blog has no index that lists just post titles with links: the archive page https://www.joelonsoftware.com/archives/ lists only the months he posted in (not titles), and clicking a month shows the full content of every article from that month instead of only the titles. This scraper extracts the article titles and links and writes them to an HTML file for easier browsing. It is also written in several languages, to serve as practice and as code samples.

Golang

Original version.

Running

  • cd blog-scrape/golang/
  • go run main.go scrape to create a local text file of posts
  • go run main.go template to generate the HTML
  • Open dist/generated.html to see the list of articles

Config

  • Open config.json
  • MaxGoRoutines sets the maximum number of goroutines used to scrape the blog
  • BufferSize sets the capacity of the channel and array used to store articles
  • ScrapeDelay sets the delay in milliseconds before a goroutine scrapes another link
  • FastDebug, if true, simplifies, reduces, or speeds up certain portions of the run

Python

Written using Python 3.9. Does not have the HTML file feature (yet?).

Running

  • cd blog-scrape/python/
  • Run either:
    • Python 3.9 - python scrape.py
    • Docker - docker-compose up
  • See output in dist/scraped-links.json

Config

  • See config.py for options and documentation
