" * [Image captioning](https://cocodataset.org/#captions-2015) and [image2latex](https://openai.com/requests-for-research/#im2latex) (convolutional encoder, recurrent decoder)\n",
57
57
" * Generating [images by captions](https://arxiv.org/abs/1511.02793) (recurrent encoder, convolutional decoder)\n",
58
58
" * Grapheme2phoneme - convert words to transcripts\n",
59
-
"\n",
59
+
"\n",
60
60
"We chose simplified __Hebrew->English__ machine translation for words and short phrases (character-level), as it is relatively quick to train even without a gpu cluster."
61
61
]
62
62
},
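The character-level encoder-decoder setup the cell above describes can be sketched in a few lines of PyTorch. Everything here (class name, vocabulary sizes, hidden sizes) is illustrative, not the notebook's reference implementation:

```python
# A minimal character-level seq2seq sketch (all names and sizes are
# hypothetical; this is not the assignment's reference model).
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, inp_voc_size, out_voc_size, emb=32, hid=64):
        super().__init__()
        self.emb_inp = nn.Embedding(inp_voc_size, emb)
        self.emb_out = nn.Embedding(out_voc_size, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.logits = nn.Linear(hid, out_voc_size)

    def forward(self, inp_ix, out_ix):
        # encode source characters; h_last: [1, batch, hid]
        _, h_last = self.encoder(self.emb_inp(inp_ix))
        # decode the target prefix conditioned on the encoder's final state
        dec_seq, _ = self.decoder(self.emb_out(out_ix), h_last)
        return self.logits(dec_seq)  # [batch, out_len, out_voc_size]

model = TinySeq2Seq(inp_voc_size=30, out_voc_size=28)
logits = model(torch.randint(0, 30, (4, 10)), torch.randint(0, 28, (4, 12)))
```

In training, these logits would be fed to cross-entropy against the target characters shifted by one step; the point of the sketch is only the encoder-state handoff.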
@@ -88,10 +88,7 @@
     "\n",
     "This is mostly due to the fact that many words have several correct translations.\n",
     "\n",
-    "We have implemented this thing for you so that you can focus on more interesting parts.\n",
-    "\n",
-    "\n",
-    "__Attention python2 users!__ You may want to cast everything to unicode later during homework phase, just make sure you do it _everywhere_."
+    "We have implemented this thing for you so that you can focus on more interesting parts."
    ]
   },
   {
@@ -312,7 +309,7 @@
     "\n",
     "def translate(lines):\n",
     "    \"\"\"\n",
-    "    You are given a list of input lines.\n",
+    "    You are given a list of input lines.\n",
     "    Make your neural network translate them.\n",
     "    :return: a list of output lines\n",
    "    \"\"\"\n",
@@ -595,7 +592,7 @@
     "\n",
     "    Params:\n",
     "    - words_ix - a matrix of letter indices, shape=[batch_size,word_length]\n",
-    "    - words_mask - a matrix of zeros/ones,\n",
+    "    - words_mask - a matrix of zeros/ones,\n",
     "        1 means \"word is still not finished\"\n",
     "        0 means \"word has already finished and this is padding\"\n",
     "\n",
@@ -716,7 +713,7 @@
     "\n",
     "In this section you'll implement an algorithm called self-critical sequence training (here's an [article](https://arxiv.org/abs/1612.00563)).\n",
     "\n",
-    "The algorithm is a vanilla policy gradient with a special baseline.\n",
+    "The algorithm is a vanilla policy gradient with a special baseline.\n",
"* You will likely need to adjust pre-training time for such a network.\n",
894
891
"* Supervised pre-training may benefit from clipping gradients somehow.\n",
895
892
"* SCST may indulge a higher learning rate in some cases and changing entropy regularizer over time.\n",
896
-
"* It's often useful to save pre-trained model parameters to not re-train it every time you want new policy gradient parameters.\n",
893
+
"* It's often useful to save pre-trained model parameters to not re-train it every time you want new policy gradient parameters.\n",
897
894
"* When leaving training for nighttime, try setting REPORT_FREQ to a larger value (e.g. 500) not to waste time on it.\n",
898
895
"\n",
899
896
"__Formal criteria:__\n",
900
897
"To get 5 points, we want you to build an architecture that:\n",
901
898
"* _doesn't consist of single GRU_\n",
902
-
"* _works better_ than single GRU baseline.\n",
899
+
"* _works better_ than single GRU baseline.\n",
903
900
"* We also want you to provide either learning curve or trained model, preferably both\n",
904
901
"* ... and write a brief report or experiment log describing what you did and how it fared.\n",
905
902
"\n",
@@ -908,7 +905,7 @@
     " * __Vanilla:__ layer_i of encoder last state goes to layer_i of decoder initial state\n",
     " * __Every tick:__ feed encoder last state _on every iteration_ of decoder.\n",
     " * __Attention:__ allow decoder to \"peek\" at one (or several) positions of encoded sequence on every tick.\n",
-    "\n",
+    "\n",
     "The most effective (and cool) of those is, of course, attention.\n",
     "You can read more about attention [in this nice blog post](https://distill.pub/2016/augmented-rnns/). The easiest way to begin is to use \"soft\" attention with \"additive\" or \"dot-product\" intermediate layers.\n",
week07_seq2seq/practice_torch.ipynb  (+12 −15)
@@ -27,7 +27,7 @@
     " * [Image captioning](https://cocodataset.org/#captions-2015) and [image2latex](https://htmlpreview.github.io/?https://github.com/openai/requests-for-research/blob/master/_requests_for_research/im2latex.html) (convolutional encoder, recurrent decoder)\n",
     " * Generating [images by captions](https://arxiv.org/abs/1511.02793) (recurrent encoder, convolutional decoder)\n",
     " * Grapheme2phoneme - convert words to transcripts\n",
-    "\n",
+    "\n",
     "We chose simplified __Hebrew->English__ machine translation for words and short phrases (character-level), as it is relatively quick to train even without a GPU cluster."
    ]
   },
@@ -74,10 +74,7 @@
     "\n",
     "This is mostly due to the fact that many words have several correct translations.\n",
     "\n",
-    "We have implemented this thing for you so that you can focus on more interesting parts.\n",
-    "\n",
-    "\n",
-    "__Attention python2 users!__ You may want to cast everything to unicode later during homework phase, just make sure you do it _everywhere_."
+    "We have implemented this thing for you so that you can focus on more interesting parts."
"* __Train loss__ - that's your model's crossentropy over minibatches. It should go down steadily. Most importantly, it shouldn't be NaN :)\n",
547
544
"* __Val score distribution__ - distribution of translation edit distance (score) within batch. It should move to the left over time.\n",
548
-
"* __Val score / training time__ - it's your current mean edit distance. This plot is much whimsier than loss, but make sure it goes below 8 by 2500 steps.\n",
545
+
"* __Val score / training time__ - it's your current mean edit distance. This plot is much whimsier than loss, but make sure it goes below 8 by 2500 steps.\n",
549
546
"\n",
550
547
"If it doesn't, first try to re-create both model and opt. You may have changed its weight too much while debugging. If that doesn't help, it's debugging time."
551
548
]
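The validation score tracked in these plots is edit (Levenshtein) distance between the model's translation and the reference. A standard dynamic-programming implementation, shown here as a hedged stand-in for the notebook's own scoring helper:

```python
# Levenshtein edit distance via the classic two-row DP
# (a generic implementation, not the notebook's exact helper).
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]
```

"Mean edit distance below 8" therefore means translations are, on average, fewer than 8 character edits away from a reference.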
@@ -584,7 +581,7 @@
     "\n",
     "In this section you'll implement an algorithm called self-critical sequence training (here's an [article](https://arxiv.org/abs/1612.00563)).\n",
     "\n",
-    "The algorithm is a vanilla policy gradient with a special baseline.\n",
+    "The algorithm is a vanilla policy gradient with a special baseline.\n",
" * As usual, don't expect improvements right away, but in general the model should be able to show some positive changes by 5k steps.\n",
730
-
" * Entropy is a good indicator of many problems.\n",
727
+
" * Entropy is a good indicator of many problems.\n",
731
728
" * If it reaches zero, you may need greater entropy regularizer.\n",
732
729
" * If it has rapid changes time to time, you may need gradient clipping.\n",
733
730
" * If it oscillates up and down in an erratic manner... it's perfectly okay for entropy to do so. But it should decrease at the end.\n",
734
-
"\n",
731
+
"\n",
735
732
" * We don't show loss_history cuz it's uninformative for pseudo-losses in policy gradient. However, if something goes wrong you can check it to see if everything isn't a constant zero."
736
733
]
737
734
},
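The entropy being monitored here is the Shannon entropy of the model's per-step output distribution: near zero means the policy has collapsed onto one character and needs a stronger regularizer. A tiny self-contained check (function name and toy distributions are illustrative):

```python
# Shannon entropy of a per-step output distribution.
# Near zero => collapsed, overconfident policy.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

flat = entropy([0.25] * 4)            # uniform over 4 symbols: log(4), the max
collapsed = entropy([1.0, 0.0, 0.0, 0.0])  # degenerate policy: 0
```

In practice you'd average this over all non-padded decoding steps in the batch and add it (negated, scaled by a coefficient) to the pseudo-loss.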
@@ -800,13 +797,13 @@
     "* You will likely need to adjust pre-training time for such a network.\n",
     "* Supervised pre-training may benefit from clipping gradients somehow.\n",
     "* SCST may benefit from a higher learning rate in some cases, and from changing the entropy regularizer over time.\n",
-    "* It's often useful to save pre-trained model parameters so you don't have to re-train every time you want new policy gradient parameters.\n",
+    "* It's often useful to save pre-trained model parameters so you don't have to re-train every time you want new policy gradient parameters.\n",
     "* When leaving training for nighttime, try setting REPORT_FREQ to a larger value (e.g. 500) so as not to waste time on it.\n",
     "\n",
     "__Formal criteria:__\n",
     "To get 5 points, we want you to build an architecture that:\n",
     "* _doesn't consist of a single GRU_\n",
-    "* _works better_ than the single GRU baseline.\n",
+    "* _works better_ than the single GRU baseline.\n",
     "* We also want you to provide either a learning curve or a trained model, preferably both,\n",
     "* ... and write a brief report or experiment log describing what you did and how it fared.\n",
     "\n",
@@ -815,7 +812,7 @@
     " * __Vanilla:__ layer_i of encoder last state goes to layer_i of decoder initial state\n",
     " * __Every tick:__ feed encoder last state _on every iteration_ of decoder.\n",
     " * __Attention:__ allow decoder to \"peek\" at one (or several) positions of encoded sequence on every tick.\n",
-    "\n",
+    "\n",
     "The most effective (and cool) of those is, of course, attention.\n",
     "You can read more about attention [in this nice blog post](https://distill.pub/2016/augmented-rnns/). The easiest way to begin is to use \"soft\" attention with \"additive\" or \"dot-product\" intermediate layers.\n",