Commit 775ec9c

Clean up notebooks and README
1 parent c416cf6 commit 775ec9c

6 files changed

+41
-53
lines changed

1-Scraping.ipynb

+1-1
Original file line number | Diff line number | Diff line change
@@ -518,7 +518,7 @@
518518
],
519519
"metadata": {
520520
"kernelspec": {
521-
"display_name": "Python 3",
521+
"display_name": "Python 3 (ipykernel)",
522522
"language": "python",
523523
"name": "python3"
524524
},

2-Parsing-Storing.ipynb

+2-2
@@ -8,7 +8,7 @@
88
"\n",
99
"This is a demo of using BeautifulSoup to parse the page source code that was saved in step 1 with Selenium, and then save that data into a csv file for later use. For speed and efficiency, this notebook uses a truncated version of the source code - that first batch that is rendered before scrolling down, as demonstrated in the previous notebook.\n",
1010
"\n",
11-
"The script named '' contains everything demonstrated here, and can be run on the complete page source code that was obtained by running `selen.py`"
11+
"The script named `extract.py` contains everything demonstrated here, and can be run on the complete page source code that was obtained by running `selen.py`"
1212
]
1313
},
1414
{
@@ -1000,7 +1000,7 @@
10001000
],
10011001
"metadata": {
10021002
"kernelspec": {
1003-
"display_name": "Python 3",
1003+
"display_name": "Python 3 (ipykernel)",
10041004
"language": "python",
10051005
"name": "python3"
10061006
},

3-EDA.ipynb

+14-24
@@ -4,7 +4,7 @@
44
"cell_type": "markdown",
55
"metadata": {},
66
"source": [
7-
"# EDA"
7+
"# Exploratory data analysis"
88
]
99
},
1010
{
@@ -2405,7 +2405,9 @@
24052405
{
24062406
"cell_type": "code",
24072407
"execution_count": 57,
2408-
"metadata": {},
2408+
"metadata": {
2409+
"scrolled": true
2410+
},
24092411
"outputs": [
24102412
{
24112413
"data": {
@@ -2434,6 +2436,13 @@
24342436
"df_minus_ongoing['Day'].value_counts().plot.bar(ylim=[500, 600], rot=330)"
24352437
]
24362438
},
2439+
{
2440+
"cell_type": "markdown",
2441+
"metadata": {},
2442+
"source": [
2443+
"Above, we see that virtual meetings seem to be more numerous during the week than during the weekend."
2444+
]
2445+
},
24372446
{
24382447
"cell_type": "code",
24392448
"execution_count": 58,
@@ -2549,37 +2558,18 @@
25492558
}
25502559
],
25512560
"source": [
2552-
"# Pandas' hist does not register this as numeric data, interestingly\n",
2561+
"# Pandas' hist method does not register this as numeric data, interestingly\n",
25532562
"plt.figure(figsize=(30,10))\n",
25542563
"n, bins, edges = plt.hist(df['Time_dt'],bins=24,ec=\"red\",alpha=0.7)\n",
25552564
"plt.xticks(bins, rotation=300, fontsize=18)\n",
25562565
"plt.show()"
25572566
]
25582567
},
25592568
{
2560-
"cell_type": "code",
2561-
"execution_count": 62,
2569+
"cell_type": "markdown",
25622570
"metadata": {},
2563-
"outputs": [
2564-
{
2565-
"data": {
2566-
"image/png": "[base64-encoded PNG data omitted]",
2567-
"text/plain": [
2568-
"<Figure size 432x288 with 1 Axes>"
2569-
]
2570-
},
2571-
"metadata": {
2572-
"needs_background": "light"
2573-
},
2574-
"output_type": "display_data"
2575-
}
2576-
],
25772571
"source": [
2578-
"plt.figure()\n",
2579-
"a = [1,2,5,6,9,11,15,17,18]\n",
2580-
"plt.eventplot(a, orientation='horizontal', colors='b')\n",
2581-
"# plt.axis('off')\n",
2582-
"plt.show()"
2572+
"Each bin in the histogram above is about 1 hour long. We see that all the meeting times are (very roughly) normally distributed with a distinct skew. The distribution peaks in the early evening and steadily drops off after. There are 2 minor peaks that break up this pattern however, roughly corresponding to the wake-up and lunch-hours. **Important note: these times are Central USA, which is where this data was sourced (the site displays local time).** Luckily, the vast majority of meetings are held in the US, which is why this distribution makes so much sense in light of the typical 9-5 work day."
25832573
]
25842574
},
25852575
{
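
The markdown cell added in this hunk reads the 24-bin histogram as roughly one bin per hour. A minimal sketch of that binning, using only the standard library, is shown here. The `hourly_counts` helper and the sample times are hypothetical and are not taken from the notebook, which calls `plt.hist` on `df['Time_dt']` instead.

```python
from collections import Counter
from datetime import time


def hourly_counts(times):
    """Count how many meeting start times fall into each hour of the day (0-23)."""
    counts = Counter(t.hour for t in times)
    # Return a dense list so that hours with no meetings show up as 0.
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical sample of meeting start times (local time), for illustration only.
sample_times = [
    time(7, 30),
    time(12, 0),
    time(12, 45),
    time(18, 15),
    time(18, 50),
    time(19, 5),
]

counts = hourly_counts(sample_times)
for hour, n in enumerate(counts):
    if n:
        print(f"{hour:02d}:00  {'#' * n}")
```

Binning on the hour field keeps every bin exactly one hour wide, whereas `bins=24` over the observed range yields bins that are only approximately one hour long, which matches the "about 1 hour" wording in the cell.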

4-Analysis.ipynb

+7-7
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
"cell_type": "markdown",
55
"metadata": {},
66
"source": [
7-
"# Data analysis"
7+
"# Data analysis with inferential statistics"
88
]
99
},
1010
{
@@ -445,11 +445,11 @@
445445
"cell_type": "markdown",
446446
"metadata": {},
447447
"source": [
448-
"**Null hypothesis:** The average time of men's meetings = 14:10:38\n",
448+
"**Null hypothesis $H_{0}$:** The average time of men's meetings = 14:10:38\n",
449449
"\n",
450-
"**Alternative hypothesis:** The average time of men's meetings /= 14:10:38\n",
450+
"**Alternative hypothesis $H_{a}$:** The average time of men's meetings $\\neq$ 14:10:38\n",
451451
"\n",
452-
"**Confidence level:** alpha = 0.05"
452+
"**Confidence level:** $\\alpha$ = 0.05"
453453
]
454454
},
455455
{
@@ -602,11 +602,11 @@
602602
"source": [
603603
"The women's mean is greater than the men's mean, which is a nice preliminary support for doing a **one-tailed test:**\n",
604604
"\n",
605-
"**Null hypothesis:** The average length of description of women's meetings <= the average length of description of men's meetings \n",
605+
"**Null hypothesis $H_{0}$:** The average length of description of women's meetings $\\le$ the average length of description of men's meetings \n",
606606
"\n",
607-
"**Alternative hypothesis:** The average length of description of women's meetings > the average length of description of men's meetings \n",
607+
"**Alternative hypothesis $H_{a}$:** The average length of description of women's meetings > the average length of description of men's meetings \n",
608608
"\n",
609-
"**Confidence level:** alpha = 0.05"
609+
"**Confidence level:** $\\alpha$ = 0.05"
610610
]
611611
},
612612
{

readme.md

+17-2
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,20 @@
1-
Meetings Directory Scraper & Analytics
1+
AA Virtual Meetings Directory Scraper & Analytics
22
=====
33

4-
{This script} scrapes the online AA meetings directory, {aa-intergroup.org}, and saves the contents including category tags and user-generated descriptions into a local postgres database. It only pulls English language meetings.
4+
* scrapes the online AA meetings directory, and saves the contents including category tags and user-generated descriptions into a csv file
5+
* demos EDA and inferential statistical analysis on the data
56

7+
Background
8+
---
9+
[aa-intergroup.org](https://aa-intergroup.org/meetings) contains the official online directory of virtual/remote Alcoholics Anonymous meetings. The meetings are created and maintained by real people, featuring titles and descriptions written in unstructured plain language, sometimes long and colorful, sometimes short and reserved. These descriptions are meant to attract desired members and presumably keep away undesired bad actors. These characteristics make this dataset a good potential target for natural language analysis. *(coming soon)*
10+
11+
Another useful feature is the use of category labels that denote whether meetings are closed to non-AA-members, focused on particular study materials, only for men/women/LGBTQ etc. A meeting may have more than one category label, and the directory does seem to be successfully enforcing those categories that are exclusive (eg, records can not and are not both Open and Closed, Men Only and Women Only, etc). **The current version of this project focuses on statistical analysis that makes use of these plentiful categories in combination with complete weekday and time schedule information.**
12+
13+
Contents
14+
---
15+
The project is demonstrated in four jupyter notebooks:
16+
17+
1. `1-Scraping.ipynb` demonstrates obtaining the dataset with the use of Selenium, explains how to use the `selen.py` script
18+
2. `2-Parsing-Storing.ipynb` demonstrates extracting data from the scraped page source code, explains how to use the `extract.py` script
19+
3. `3-EDA.ipynb` demonstrates exploratory data analysis and visualizations
20+
4. `4-Analysis` demonstrates a two-tailed single sample t test, one-tailed two sample t test, and chi square test
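
As a rough illustration of the three tests named in item 4, the sketch below computes each test statistic with the standard library only. The function names are hypothetical and the notebook's own implementation is not shown in this diff. The two-sample version assumes Welch's unequal-variance form, which may differ from what the notebook uses. Turning a statistic into a p-value needs a t or chi-square distribution function, for example from `scipy.stats`, and is left out here.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t statistic for H0: population mean == mu0 (used for a two-tailed test)."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (mean - mu0) / (sd / math.sqrt(n))


def welch_t(a, b):
    """Welch's t statistic for H0: mean(a) <= mean(b) (used for a one-tailed test)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat
```

For a test on meeting times such as the 14:10:38 mean in the notebook, one option is to convert each time to seconds since midnight before passing it to `one_sample_t`.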

util.py

-17
This file was deleted.
