
📰 Vietnamese news crawler

Built with: Python 3.10.7 · BeautifulSoup 0.0.1 · Requests 2.28.1 · tqdm 4.64.1

Crawl titles and paragraphs of Vietnamese online articles by URL or by category name.

Currently supported websites: VNExpress, DanTri, VietNamNet.

🧰 Installation

Create a virtual environment, then install the required packages:

python -m venv venv
venv\Scripts\activate      # Windows
# source venv/bin/activate # Linux/macOS
pip install -r requirements.txt

👨‍💻 Usage

Modify the crawler configuration file (default: crawler_config.yml) to customize the crawling process.

# crawler_config.yml

# Name of the news website to crawl (vnexpress, dantri, vietnamnet)
webname: "vnexpress"

# available tasks: "url", "type"
task: "url"

# logger config file path
logger_fpath: "logger/logger_config.yml"
urls_fpath: "urls.txt"
output_dpath: "result"
num_workers: 1

# used when task == "type";
# set article_type: "all" to crawl every category
article_type: "du-lich"
total_pages: 1

Then simply run:

python VNNewsCrawler.py --config crawler_config.yml

Crawl by URL

To perform URL-based crawling, set task: "url" in the configuration file. The program will then crawl each URL listed in the urls_fpath file. By default, the program ships with two VNExpress news URLs included in the urls.txt file.
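The crawler itself parses pages with BeautifulSoup; as a rough illustration of what URL-based crawling does with each fetched page, here is a minimal standard-library sketch that pulls a title and paragraph text out of article HTML. The tag choices (`h1`, `p`) are an assumption for illustration, not the crawler's actual site-specific selectors:

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect the <h1> title and all <p> paragraphs from an HTML page.

    Illustrative only: the real crawler uses BeautifulSoup with
    selectors tailored to each supported news website.
    """
    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h1":
            self.title += data.strip()
        elif self._tag == "p" and data.strip():
            self.paragraphs.append(data.strip())

parser = ArticleParser()
parser.feed("<html><body><h1>Sample title</h1>"
            "<p>First paragraph.</p><p>Second paragraph.</p></body></html>")
print(parser.title)       # Sample title
print(parser.paragraphs)  # ['First paragraph.', 'Second paragraph.']
```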

Crawl by category name

To crawl URLs by category, set task: "type" in the configuration file. The program will retrieve article URLs from the specified number of listing pages (total_pages) of the given category. Currently, my program supports only the following categories for these websites:

| #  | VNExpress | DanTri            | VietNamNet             |
|----|-----------|-------------------|------------------------|
| 0  | thoi-su   | xa-hoi            | thoi-su                |
| 1  | du-lich   | the-gioi          | kinh-doanh             |
| 2  | the-gioi  | kinh-doanh        | the-thao               |
| 3  | kinh-doanh| bat-dong-san      | van-hoa                |
| 4  | khoa-hoc  | the-thao          | giai-tri               |
| 5  | giai-tri  | lao-dong-viec-lam | the-gioi               |
| 6  | the-thao  | tam-long-nhan-ai  | doi-song               |
| 7  | phap-luat | suc-khoe          | giao-duc               |
| 8  | giao-duc  | van-hoa           | suc-khoe               |
| 9  | suc-khoe  | giai-tri          | thong-tin-truyen-thong |
| 10 | doi-song  | suc-manh-so       | phap-luat              |
| 11 |           | giao-duc          | oto-xe-may             |
| 12 |           | an-sinh           | bat-dong-san           |
| 13 |           | phap-luat         | du-lich                |

For example, if you set the configuration file like this:

# if task == "type"
article_type: "khoa-hoc"
total_pages: 3

it will crawl articles from:

https://vnexpress.net/khoa-hoc-p1
https://vnexpress.net/khoa-hoc-p2
https://vnexpress.net/khoa-hoc-p3
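The page-URL pattern above is simple enough to sketch. This helper (a hypothetical name, not part of the project's code) builds the list of listing-page URLs for a category:

```python
def category_page_urls(article_type, total_pages, base="https://vnexpress.net"):
    """Build listing-page URLs for one category, e.g. .../khoa-hoc-p1."""
    return [f"{base}/{article_type}-p{page}" for page in range(1, total_pages + 1)]

print(category_page_urls("khoa-hoc", 3))
# ['https://vnexpress.net/khoa-hoc-p1',
#  'https://vnexpress.net/khoa-hoc-p2',
#  'https://vnexpress.net/khoa-hoc-p3']
```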

🌟 To crawl articles in all available categories, just set article_type: "all".

# if task == "type"
article_type: "all"
total_pages: 3

🚀 Crawling faster with MultiThreading

By increasing the value of num_workers, you can accelerate the crawling process by using multiple threads simultaneously. ⚠️ However, note that setting num_workers too high may trigger a "Too Many Requests" error from the news website, blocking any further URL crawling.
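The speed-up works roughly like this: a pool of num_workers threads fetches URLs concurrently instead of one at a time. A minimal standard-library sketch (the crawl_one function here is a stand-in for the project's real fetch-and-parse step):

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_one(url):
    # Stand-in for the real work: fetch the page (Requests) and
    # extract title/paragraphs (BeautifulSoup).
    return f"crawled {url}"

def crawl_all(urls, num_workers=1):
    # Threads run crawl_one concurrently; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(crawl_one, urls))

print(crawl_all(["https://vnexpress.net/a", "https://vnexpress.net/b"],
                num_workers=2))
# ['crawled https://vnexpress.net/a', 'crawled https://vnexpress.net/b']
```

Keeping num_workers modest is the practical guard against the rate-limiting error mentioned above.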

✔️ Todo

  • Speed up crawling with multithreading
  • Add logging module
  • Use yml config file instead of argparse
  • Crawl other news websites
