
📰 Vietnamese news crawler

Built with: Python 3.10.7 · BeautifulSoup 0.0.1 · Requests 2.28.1 · tqdm 4.64.1

Crawl titles and paragraphs of Vietnamese online articles by URL or by category name.

Currently supported websites: VNExpress, DanTri, VietNamNet.

🧰 Installation

Create a virtual environment, then install the required packages:

python -m venv venv
venv\Scripts\activate      # Windows
# source venv/bin/activate # Linux/macOS
pip install -r requirements.txt

👨‍💻 Usage

Modify the crawler configuration file (default: crawler_config.yml) to customize the crawling process.

# crawler_config.yml

# Name of the news website to crawl (vnexpress, dantri, vietnamnet)
webname: "vnexpress"

# available tasks: "url", "type"
task: "url"

# logger config file path
logger_fpath: "logger/logger_config.yml"
urls_fpath: "urls.txt"
output_dpath: "result"
num_workers: 1

# used when task == "type";
# set article_type: "all" to crawl every category
article_type: "du-lich"
total_pages: 1

Then simply run:

python VNNewsCrawler.py --config crawler_config.yml

Crawl by URL

To perform URL-based crawling, set task: "url" in the configuration file. The program will then crawl each URL listed in the urls_fpath file. By default, the program ships with two VNExpress news URLs included in the urls.txt file.
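The crawler itself parses pages with BeautifulSoup; as a rough illustration of what URL-based crawling does with each fetched page, here is a minimal standard-library sketch that pulls a title and paragraph text out of article HTML. The tag choices (`h1`, `p`) are an assumption for illustration, not the crawler's actual site-specific selectors:

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect the <h1> title and all <p> paragraphs from an HTML page.

    Illustrative only: the real crawler uses BeautifulSoup with
    selectors tailored to each supported news website.
    """
    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h1":
            self.title += data.strip()
        elif self._tag == "p" and data.strip():
            self.paragraphs.append(data.strip())

parser = ArticleParser()
parser.feed("<html><body><h1>Sample title</h1>"
            "<p>First paragraph.</p><p>Second paragraph.</p></body></html>")
print(parser.title)       # Sample title
print(parser.paragraphs)  # ['First paragraph.', 'Second paragraph.']
```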

Crawl by category name

To crawl URLs by category, set task: "type" in the configuration file. The program will retrieve article URLs from the specified number of listing pages (total_pages) of the given category. Currently, my program supports only the following categories for these websites:

| #  | VNExpress | DanTri            | VietNamNet             |
|----|-----------|-------------------|------------------------|
| 0  | thoi-su   | xa-hoi            | thoi-su                |
| 1  | du-lich   | the-gioi          | kinh-doanh             |
| 2  | the-gioi  | kinh-doanh        | the-thao               |
| 3  | kinh-doanh| bat-dong-san      | van-hoa                |
| 4  | khoa-hoc  | the-thao          | giai-tri               |
| 5  | giai-tri  | lao-dong-viec-lam | the-gioi               |
| 6  | the-thao  | tam-long-nhan-ai  | doi-song               |
| 7  | phap-luat | suc-khoe          | giao-duc               |
| 8  | giao-duc  | van-hoa           | suc-khoe               |
| 9  | suc-khoe  | giai-tri          | thong-tin-truyen-thong |
| 10 | doi-song  | suc-manh-so       | phap-luat              |
| 11 |           | giao-duc          | oto-xe-may             |
| 12 |           | an-sinh           | bat-dong-san           |
| 13 |           | phap-luat         | du-lich                |

For example, if you set the configuration file like this:

# if task == "type"
article_type: "khoa-hoc"
total_pages: 3

it will crawl articles from:

https://vnexpress.net/khoa-hoc-p1
https://vnexpress.net/khoa-hoc-p2
https://vnexpress.net/khoa-hoc-p3
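The page-URL pattern above is simple enough to sketch. This helper (a hypothetical name, not part of the project's code) builds the list of listing-page URLs for a category:

```python
def category_page_urls(article_type, total_pages, base="https://vnexpress.net"):
    """Build listing-page URLs for one category, e.g. .../khoa-hoc-p1."""
    return [f"{base}/{article_type}-p{page}" for page in range(1, total_pages + 1)]

print(category_page_urls("khoa-hoc", 3))
# ['https://vnexpress.net/khoa-hoc-p1',
#  'https://vnexpress.net/khoa-hoc-p2',
#  'https://vnexpress.net/khoa-hoc-p3']
```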

🌟 To crawl articles in all available categories, just set article_type: "all".

# if task == "type"
article_type: "all"
total_pages: 3

🚀 Crawling faster with MultiThreading

By increasing the value of num_workers, you can accelerate the crawling process by using multiple threads simultaneously. ⚠️ However, note that setting num_workers too high may trigger a "Too Many Requests" error from the news website, blocking any further URL crawling.
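The speed-up works roughly like this: a pool of num_workers threads fetches URLs concurrently instead of one at a time. A minimal standard-library sketch (the crawl_one function here is a stand-in for the project's real fetch-and-parse step):

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_one(url):
    # Stand-in for the real work: fetch the page (Requests) and
    # extract title/paragraphs (BeautifulSoup).
    return f"crawled {url}"

def crawl_all(urls, num_workers=1):
    # Threads run crawl_one concurrently; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(crawl_one, urls))

print(crawl_all(["https://vnexpress.net/a", "https://vnexpress.net/b"],
                num_workers=2))
# ['crawled https://vnexpress.net/a', 'crawled https://vnexpress.net/b']
```

Keeping num_workers modest is the practical guard against the rate-limiting error mentioned above.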

✔️ Todo

  • Speed up crawling with multithreading
  • Add logging module
  • Use yml config file instead of argparse
  • Crawl other news websites
