" * [Image captioning](https://cocodataset.org/#captions-2015) and [image2latex](https://openai.com/requests-for-research/#im2latex) (convolutional encoder, recurrent decoder)\n",
57
57
" * Generating [images by captions](https://arxiv.org/abs/1511.02793) (recurrent encoder, convolutional decoder)\n",
58
58
" * Grapheme2phoneme - convert words to transcripts\n",
59
-
"\n",
59
+
"\n",
60
60
"We chose simplified __Hebrew->English__ machine translation for words and short phrases (character-level), as it is relatively quick to train even without a gpu cluster."
61
61
]
62
62
},
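The character-level encoder-decoder setup the cell above describes can be sketched in a few lines of PyTorch. Everything here (class name, vocabulary sizes, hidden sizes) is illustrative, not the notebook's reference implementation:

```python
# A minimal character-level seq2seq sketch (all names and sizes are
# hypothetical; this is not the assignment's reference model).
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, inp_voc_size, out_voc_size, emb=32, hid=64):
        super().__init__()
        self.emb_inp = nn.Embedding(inp_voc_size, emb)
        self.emb_out = nn.Embedding(out_voc_size, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.logits = nn.Linear(hid, out_voc_size)

    def forward(self, inp_ix, out_ix):
        # encode source characters; h_last: [1, batch, hid]
        _, h_last = self.encoder(self.emb_inp(inp_ix))
        # decode the target prefix conditioned on the encoder's final state
        dec_seq, _ = self.decoder(self.emb_out(out_ix), h_last)
        return self.logits(dec_seq)  # [batch, out_len, out_voc_size]

model = TinySeq2Seq(inp_voc_size=30, out_voc_size=28)
logits = model(torch.randint(0, 30, (4, 10)), torch.randint(0, 28, (4, 12)))
```

In training, these logits would be fed to cross-entropy against the target characters shifted by one step; the point of the sketch is only the encoder-state handoff.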
@@ -88,10 +88,7 @@
     "\n",
     "This is mostly due to the fact that many words have several correct translations.\n",
     "\n",
-    "We have implemented this thing for you so that you can focus on more interesting parts.\n",
-    "\n",
-    "\n",
-    "__Attention python2 users!__ You may want to cast everything to unicode later during homework phase, just make sure you do it _everywhere_."
+    "We have implemented this thing for you so that you can focus on more interesting parts."
    ]
   },
   {
@@ -312,7 +309,7 @@
     "\n",
     "def translate(lines):\n",
     "    \"\"\"\n",
-    "    You are given a list of input lines.\n",
+    "    You are given a list of input lines.\n",
     "    Make your neural network translate them.\n",
     "    :return: a list of output lines\n",
    "    \"\"\"\n",
@@ -595,7 +592,7 @@
     "\n",
     "    Params:\n",
     "    - words_ix - a matrix of letter indices, shape=[batch_size,word_length]\n",
-    "    - words_mask - a matrix of zeros/ones,\n",
+    "    - words_mask - a matrix of zeros/ones,\n",
     "        1 means \"word is still not finished\"\n",
     "        0 means \"word has already finished and this is padding\"\n",
     "\n",
@@ -716,7 +713,7 @@
     "\n",
     "In this section you'll implement an algorithm called self-critical sequence training (here's an [article](https://arxiv.org/abs/1612.00563)).\n",
     "\n",
-    "The algorithm is a vanilla policy gradient with a special baseline.\n",
+    "The algorithm is a vanilla policy gradient with a special baseline.\n",
"* You will likely need to adjust pre-training time for such a network.\n",
894
891
"* Supervised pre-training may benefit from clipping gradients somehow.\n",
895
892
"* SCST may indulge a higher learning rate in some cases and changing entropy regularizer over time.\n",
896
-
"* It's often useful to save pre-trained model parameters to not re-train it every time you want new policy gradient parameters.\n",
893
+
"* It's often useful to save pre-trained model parameters to not re-train it every time you want new policy gradient parameters.\n",
897
894
"* When leaving training for nighttime, try setting REPORT_FREQ to a larger value (e.g. 500) not to waste time on it.\n",
898
895
"\n",
899
896
"__Formal criteria:__\n",
900
897
"To get 5 points, we want you to build an architecture that:\n",
901
898
"* _doesn't consist of single GRU_\n",
902
-
"* _works better_ than single GRU baseline.\n",
899
+
"* _works better_ than single GRU baseline.\n",
903
900
"* We also want you to provide either learning curve or trained model, preferably both\n",
904
901
"* ... and write a brief report or experiment log describing what you did and how it fared.\n",
905
902
"\n",
@@ -908,7 +905,7 @@
     " * __Vanilla:__ layer_i of encoder last state goes to layer_i of decoder initial state\n",
     " * __Every tick:__ feed encoder last state _on every iteration_ of decoder.\n",
     " * __Attention:__ allow decoder to \"peek\" at one (or several) positions of encoded sequence on every tick.\n",
-    "\n",
+    "\n",
     "The most effective (and cool) of those is, of course, attention.\n",
     "You can read more about attention [in this nice blog post](https://distill.pub/2016/augmented-rnns/). The easiest way to begin is to use \"soft\" attention with \"additive\" or \"dot-product\" intermediate layers.\n",
week07_seq2seq/practice_torch.ipynb  (+12 −15)
@@ -27,7 +27,7 @@
     " * [Image captioning](https://cocodataset.org/#captions-2015) and [image2latex](https://htmlpreview.github.io/?https://github.com/openai/requests-for-research/blob/master/_requests_for_research/im2latex.html) (convolutional encoder, recurrent decoder)\n",
     " * Generating [images by captions](https://arxiv.org/abs/1511.02793) (recurrent encoder, convolutional decoder)\n",
     " * Grapheme2phoneme - convert words to transcripts\n",
-    "\n",
+    "\n",
     "We chose simplified __Hebrew->English__ machine translation for words and short phrases (character-level), as it is relatively quick to train even without a GPU cluster."
    ]
   },
@@ -74,10 +74,7 @@
     "\n",
     "This is mostly due to the fact that many words have several correct translations.\n",
     "\n",
-    "We have implemented this thing for you so that you can focus on more interesting parts.\n",
-    "\n",
-    "\n",
-    "__Attention python2 users!__ You may want to cast everything to unicode later during homework phase, just make sure you do it _everywhere_."
+    "We have implemented this thing for you so that you can focus on more interesting parts."
"* __Train loss__ - that's your model's crossentropy over minibatches. It should go down steadily. Most importantly, it shouldn't be NaN :)\n",
547
544
"* __Val score distribution__ - distribution of translation edit distance (score) within batch. It should move to the left over time.\n",
548
-
"* __Val score / training time__ - it's your current mean edit distance. This plot is much whimsier than loss, but make sure it goes below 8 by 2500 steps.\n",
545
+
"* __Val score / training time__ - it's your current mean edit distance. This plot is much whimsier than loss, but make sure it goes below 8 by 2500 steps.\n",
549
546
"\n",
550
547
"If it doesn't, first try to re-create both model and opt. You may have changed its weight too much while debugging. If that doesn't help, it's debugging time."
551
548
]
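The validation score tracked in these plots is edit (Levenshtein) distance between the model's translation and the reference. A standard dynamic-programming implementation, shown here as a hedged stand-in for the notebook's own scoring helper:

```python
# Levenshtein edit distance via the classic two-row DP
# (a generic implementation, not the notebook's exact helper).
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]
```

"Mean edit distance below 8" therefore means translations are, on average, fewer than 8 character edits away from a reference.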
@@ -584,7 +581,7 @@
     "\n",
     "In this section you'll implement an algorithm called self-critical sequence training (here's an [article](https://arxiv.org/abs/1612.00563)).\n",
     "\n",
-    "The algorithm is a vanilla policy gradient with a special baseline.\n",
+    "The algorithm is a vanilla policy gradient with a special baseline.\n",
" * As usual, don't expect improvements right away, but in general the model should be able to show some positive changes by 5k steps.\n",
730
-
" * Entropy is a good indicator of many problems.\n",
727
+
" * Entropy is a good indicator of many problems.\n",
731
728
" * If it reaches zero, you may need greater entropy regularizer.\n",
732
729
" * If it has rapid changes time to time, you may need gradient clipping.\n",
733
730
" * If it oscillates up and down in an erratic manner... it's perfectly okay for entropy to do so. But it should decrease at the end.\n",
734
-
"\n",
731
+
"\n",
735
732
" * We don't show loss_history cuz it's uninformative for pseudo-losses in policy gradient. However, if something goes wrong you can check it to see if everything isn't a constant zero."
736
733
]
737
734
},
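The entropy being monitored here is the Shannon entropy of the model's per-step output distribution: near zero means the policy has collapsed onto one character and needs a stronger regularizer. A tiny self-contained check (function name and toy distributions are illustrative):

```python
# Shannon entropy of a per-step output distribution.
# Near zero => collapsed, overconfident policy.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

flat = entropy([0.25] * 4)            # uniform over 4 symbols: log(4), the max
collapsed = entropy([1.0, 0.0, 0.0, 0.0])  # degenerate policy: 0
```

In practice you'd average this over all non-padded decoding steps in the batch and add it (negated, scaled by a coefficient) to the pseudo-loss.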
@@ -800,13 +797,13 @@
     "* You will likely need to adjust pre-training time for such a network.\n",
     "* Supervised pre-training may benefit from clipping gradients somehow.\n",
     "* SCST may benefit from a higher learning rate in some cases, and from changing the entropy regularizer over time.\n",
-    "* It's often useful to save pre-trained model parameters so you don't have to re-train every time you want new policy gradient parameters.\n",
+    "* It's often useful to save pre-trained model parameters so you don't have to re-train every time you want new policy gradient parameters.\n",
     "* When leaving training for nighttime, try setting REPORT_FREQ to a larger value (e.g. 500) so as not to waste time on it.\n",
     "\n",
     "__Formal criteria:__\n",
     "To get 5 points, we want you to build an architecture that:\n",
     "* _doesn't consist of a single GRU_\n",
-    "* _works better_ than the single GRU baseline.\n",
+    "* _works better_ than the single GRU baseline.\n",
     "* We also want you to provide either a learning curve or a trained model, preferably both,\n",
     "* ... and write a brief report or experiment log describing what you did and how it fared.\n",
     "\n",
@@ -815,7 +812,7 @@
     " * __Vanilla:__ layer_i of encoder last state goes to layer_i of decoder initial state\n",
     " * __Every tick:__ feed encoder last state _on every iteration_ of decoder.\n",
     " * __Attention:__ allow decoder to \"peek\" at one (or several) positions of encoded sequence on every tick.\n",
-    "\n",
+    "\n",
     "The most effective (and cool) of those is, of course, attention.\n",
     "You can read more about attention [in this nice blog post](https://distill.pub/2016/augmented-rnns/). The easiest way to begin is to use \"soft\" attention with \"additive\" or \"dot-product\" intermediate layers.\n",