ipynb file to extract wiki articles generated in google colab #331

Status: Open, 4 commits to merge into base: master
106 changes: 106 additions & 0 deletions wiki_dump_extractor.ipynb
@@ -0,0 +1,106 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AIN1eV12yzVv",
"outputId": "d8543725-1023-4b42-8392-4434fe32adff"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Collecting wikiextractor\n",
" Downloading wikiextractor-3.0.6-py3-none-any.whl (46 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.4/46.4 kB\u001b[0m \u001b[31m478.1 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hInstalling collected packages: wikiextractor\n",
"Successfully installed wikiextractor-3.0.6\n"
]
}
],
"source": [
"pip install wikiextractor"
]
},
{
"cell_type": "code",
"source": [
"!wget https://dumps.wikimedia.org/bnwiki/latest/bnwiki-latest-pages-articles.xml.bz2\n",
"!bzip2 -d bnwiki-latest-pages-articles.xml.bz2\n",
"!python -m wikiextractor.WikiExtractor bnwiki-latest-pages-articles.xml"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "G1HIxA3u17o0",
"outputId": "c2997dd8-f849-4be6-83c1-0862f41b0fee"
},
"execution_count": 31,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2024-05-10 09:31:22-- https://dumps.wikimedia.org/bnwiki/latest/bnwiki-latest-pages-articles.xml.bz2\n",
"Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.71, 2620:0:861:3:208:80:154:71\n",
"Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 380404074 (363M) [application/octet-stream]\n",
"Saving to: ‘bnwiki-latest-pages-articles.xml.bz2’\n",
"\n",
"bnwiki-latest-pages 100%[===================>] 362.78M 4.84MB/s in 77s \n",
"\n",
"2024-05-10 09:32:39 (4.71 MB/s) - ‘bnwiki-latest-pages-articles.xml.bz2’ saved [380404074/380404074]\n",
"\n",
"INFO: Preprocessing 'bnwiki-latest-pages-articles.xml' to collect template definitions: this may take some time.\n",
"INFO: Preprocessed 100000 pages\n",
"INFO: Preprocessed 200000 pages\n",
"INFO: Preprocessed 300000 pages\n",
"INFO: Preprocessed 400000 pages\n",
"INFO: Preprocessed 500000 pages\n",
"INFO: Preprocessed 600000 pages\n",
"INFO: Preprocessed 700000 pages\n",
"INFO: Loaded 166186 templates in 54.8s\n",
"INFO: Starting page extraction from bnwiki-latest-pages-articles.xml.\n",
"INFO: Using 1 extract processes.\n",
"INFO: Extracted 100000 articles (1218.9 art/s)\n",
"INFO: Extracted 200000 articles (939.3 art/s)\n",
"INFO: Extracted 300000 articles (741.1 art/s)\n",
"INFO: Finished 1-process extraction of 383474 articles in 459.8s (833.9 art/s)\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!tar -cjf archive.tar.bz2 text\n"
],
"metadata": {
"id": "IG0pMSjBzBQI"
},
"execution_count": 40,
"outputs": []
}
]
}
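
The notebook stops after archiving the `text` directory. As a follow-up, here is a minimal sketch of reading the extracted articles back in Python. It assumes WikiExtractor's default (non-JSON) output, where files such as `text/AA/wiki_00` contain articles wrapped in `<doc id="..." url="..." title="...">` and `</doc>` lines. The helper name `iter_wiki_docs` is hypothetical and not part of this PR or of wikiextractor.

```python
import re
from pathlib import Path
from typing import Dict, Iterator, List, Optional

# Assumption: each article starts with a line like
#   <doc id="12" url="https://..." title="Some title">
# and ends with a line containing only </doc>.
DOC_OPEN = re.compile(r"<doc\s+(?P<attrs>.*)>\s*$")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def iter_wiki_docs(root: str = "text") -> Iterator[Dict[str, str]]:
    """Yield one dict per article found under `root` (hypothetical helper)."""
    for path in sorted(Path(root).glob("*/wiki_*")):
        meta: Optional[Dict[str, str]] = None
        body: List[str] = []
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                opened = DOC_OPEN.match(line)
                if opened:
                    # Start of a new article: collect its header attributes.
                    meta = dict(ATTR.findall(opened.group("attrs")))
                    body = []
                elif line.startswith("</doc>") and meta is not None:
                    # End of the article: emit metadata plus the joined body.
                    yield {**meta, "text": "".join(body).strip()}
                    meta = None
                elif meta is not None:
                    body.append(line)
```

With the output of the extraction cell in place, something like `for doc in iter_wiki_docs("text"): print(doc["title"])` would list the article titles. Parsing the header attributes generically, rather than in a fixed order, is a deliberate choice so the sketch tolerates extra attributes if the header format differs between versions.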