Crawling titles and paragraphs of Vietnamese online articles using their URLs or category names
Currently supported websites:
- VNExpress
- DanTri
- VietNamNet
Create a virtual environment, then install the required packages:
python -m venv venv
venv\Scripts\activate       # Windows
source venv/bin/activate    # Linux/macOS
pip install -r requirements.txt
Modify the crawler configuration file (default: crawler_config.yml) to customize your crawling process.
# crawler_config.yml
# Name of the news website to crawl (vnexpress, dantri, vietnamnet)
webname: "vnexpress"
# task is one of: "url", "type"
task: "url"
# logger config file path
logger_fpath: "logger/logger_config.yml"
urls_fpath: "urls.txt"
output_dpath: "result"
num_workers: 1
# used only when task is "type";
# set article_type: "all" to crawl every supported category
article_type: "du-lich"
total_pages: 1
Then simply run:
python VNNewsCrawler.py --config crawler_config.yml
To perform URL-based crawling, set task: "url" in the configuration file. The program will then crawl every URL listed in the urls_fpath file. By default, the program ships with two VNExpress news URLs in the urls.txt file.
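For reference, urls.txt simply lists one article URL per line. A hypothetical example with placeholder slugs (not the repository's actual defaults):

```
https://vnexpress.net/<article-slug-1>.html
https://vnexpress.net/<article-slug-2>.html
```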
To crawl URLs by category, set task: "type" in the configuration file. The program will retrieve article URLs from the specified number of listing pages (total_pages) of the chosen category. Currently, the program supports only the following categories for these websites:
| VNExpress | DanTri | VietNamNet |
|---|---|---|
| 0. thoi-su<br>1. du-lich<br>2. the-gioi<br>3. kinh-doanh<br>4. khoa-hoc<br>5. giai-tri<br>6. the-thao<br>7. phap-luat<br>8. giao-duc<br>9. suc-khoe<br>10. doi-song | 0. xa-hoi<br>1. the-gioi<br>2. kinh-doanh<br>3. bat-dong-san<br>4. the-thao<br>5. lao-dong-viec-lam<br>6. tam-long-nhan-ai<br>7. suc-khoe<br>8. van-hoa<br>9. giai-tri<br>10. suc-manh-so<br>11. giao-duc<br>12. an-sinh<br>13. phap-luat | 0. thoi-su<br>1. kinh-doanh<br>2. the-thao<br>3. van-hoa<br>4. giai-tri<br>5. the-gioi<br>6. doi-song<br>7. giao-duc<br>8. suc-khoe<br>9. thong-tin-truyen-thong<br>10. phap-luat<br>11. oto-xe-may<br>12. bat-dong-san<br>13. du-lich |
For example, if you set the configuration file like this:
# if task == "type"
article_type: "khoa-hoc"
total_pages: 3
it will crawl articles from:
https://vnexpress.net/khoa-hoc-p1
https://vnexpress.net/khoa-hoc-p2
https://vnexpress.net/khoa-hoc-p3
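The listing URLs above follow a simple <category>-p<page> pattern. Below is a minimal sketch of how such URLs could be composed; category_page_urls is a hypothetical helper for illustration, not the crawler's actual code:

```python
# Illustrative only: build one VNExpress listing URL per page,
# e.g. https://vnexpress.net/khoa-hoc-p1 ... khoa-hoc-p3.
def category_page_urls(article_type: str, total_pages: int) -> list[str]:
    return [
        f"https://vnexpress.net/{article_type}-p{page}"
        for page in range(1, total_pages + 1)
    ]

print(category_page_urls("khoa-hoc", 3))
```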
🌟 To crawl articles in all available categories, simply set article_type: "all". The program will then crawl total_pages listing pages from every category shown in the table above.
# if task == "type"
article_type: "all"
total_pages: 3
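Conceptually, "all" just expands into every supported category before crawling. A rough sketch, assuming the VNExpress category list from the table above (abbreviated here; categories_to_crawl is a hypothetical helper):

```python
# Illustrative only: expand article_type into concrete categories.
VNEXPRESS_CATEGORIES = ["thoi-su", "du-lich", "the-gioi", "kinh-doanh", "khoa-hoc"]  # abbreviated

def categories_to_crawl(article_type: str) -> list[str]:
    if article_type == "all":
        return VNEXPRESS_CATEGORIES
    return [article_type]

print(categories_to_crawl("all"))  # ['thoi-su', 'du-lich', ...]
```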
By increasing the value of num_workers, you can speed up crawling by running multiple threads simultaneously. However, setting num_workers too high may trigger a "Too Many Requests" error from the news website, which blocks any further URL crawling.
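For illustration, here is a minimal sketch of how num_workers could map onto a thread pool using Python's concurrent.futures; crawl_article and the URL list are hypothetical stand-ins, not the crawler's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_article(url: str) -> str:
    # Fetch and parse one article here (placeholder implementation).
    return f"crawled {url}"

urls = [
    "https://vnexpress.net/khoa-hoc-p1",  # placeholder URLs
    "https://vnexpress.net/khoa-hoc-p2",
]
num_workers = 2  # plays the same role as num_workers in crawler_config.yml

# Each worker thread picks up URLs concurrently; a pool that is too large
# can trigger HTTP 429 "Too Many Requests" from the target site.
with ThreadPoolExecutor(max_workers=num_workers) as pool:
    for result in pool.map(crawl_article, urls):
        print(result)
```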
- Speed up crawling with multithreading
- Add logging module
- Use YAML config file instead of argparse
- Crawl other news websites