
Commit 1e4e395

fix glm4 typos (#2801)
1 parent 01d6330 commit 1e4e395

3 files changed, +5 -5 lines changed


notebooks/glm4-v/README.md (+1 -1)

@@ -1,4 +1,4 @@
-## Visual-language assistant with GLM-Edge-V and OpenVINO
+## Visual-language assistant with GLM4-V and OpenVINO

 [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b) is an open source multimodal version of the GLM-4 series launched by Zhipu AI. GLM-4V-9B has the ability to conduct multi-round conversations in Chinese and English at a high resolution of 1120 * 1120. In multimodal evaluations of comprehensive Chinese and English abilities, perceptual reasoning, text recognition, and chart understanding, GLM-4V-9B has shown superior performance many popular models.

notebooks/glm4-v/glm4-v.ipynb (+3 -3)

@@ -5,13 +5,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Visual-language assistant with GLM-Edge-V and OpenVINO\n",
+"# Visual-language assistant with GLM4-V and OpenVINO\n",
 "\n",
 "[GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b) is an open source multimodal version of the GLM-4 series launched by Zhipu AI. GLM-4V-9B has the ability to conduct multi-round conversations in Chinese and English at a high resolution of 1120 * 1120. In multimodal evaluations of comprehensive Chinese and English abilities, perceptual reasoning, text recognition, and chart understanding, GLM-4V-9B has shown superior performance many popular models.\n",
 "\n",
 "You can find more information in [model card](https://huggingface.co/THUDM/glm-4v-9b), [technical report](https://arxiv.org/pdf/2406.12793) and original [repository](https://github.com/THUDM/GLM-4).\n",
 "\n",
-"In this tutorial we consider how to launch multimodal model GLM-Edge-V using OpenVINO for creation multimodal chatbot. Additionally, we optimize model to low precision using [NNCF](https://github.com/openvinotoolkit/nncf).\n",
+"In this tutorial we consider how to launch multimodal model GLM4-V-9B using OpenVINO for creation multimodal chatbot. Additionally, we optimize model to low precision using [NNCF](https://github.com/openvinotoolkit/nncf).\n",
 "\n",
 ">**Note**: Model conversion process may require ~50GB RAM. We recommend to pay attention on light-weight [GLM-Edge-V models series](../glm-edge-v/glm-edge-v.ipynb) if your system does not satisfy to these requirements.\n",
 "#### Table of contents:\n",
@@ -105,7 +105,7 @@
 " <summary><b>Click here for more detailed explanation of conversion steps</b></summary>\n",
 "GLM4-V is autoregressive transformer generative model, it means that each next model step depends from model output from previous step. The generation approach is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next word distributions. In other words, model predicts the next token in the loop guided by previously generated tokens until the stop-condition will be not reached (generated sequence of maximum length or end of string token obtained). The way the next token will be selected over predicted probabilities is driven by the selected decoding methodology. You can find more information about the most popular decoding methods in this <a href=\"https://huggingface.co/blog/how-to-generate\">blog</a>. The entry point for the generation process for models from the Hugging Face Transformers library is the `generate` method. You can find more information about its parameters and configuration in the <a href=\"https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/text_generation#transformers.GenerationMixin.generate\">documentation</a>. To preserve flexibility in the selection decoding methodology, we will convert only model inference for one step.\n",
 "\n",
-"GLM-Edge-V model consists of 3 parts:\n",
+"GLM4-V model consists of 3 parts:\n",
 "\n",
 "* **Vision Model** for encoding input images into embedding space.\n",
 "* **Embedding Model** for conversion input text tokens into embedding space\n",

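The explanation quoted in the second hunk describes single-step inference driven by an external decoding loop. As an illustrative sketch only (not part of this commit), a greedy decoding loop over a converted single-step model could look like the following; compiled_model, the "logits" output name, and eos_token_id are assumed names, and past_key_values handling is omitted for brevity:

import numpy as np

def greedy_generate(compiled_model, input_ids, eos_token_id, max_new_tokens=32):
    # One model call per step; the loop, not the model, decides how the next token is chosen.
    tokens = list(input_ids)
    for _ in range(max_new_tokens):
        outputs = compiled_model({"input_ids": np.array([tokens])})
        logits = outputs["logits"]                  # assumed output name
        next_token = int(np.argmax(logits[0, -1]))  # greedy choice over the last position
        tokens.append(next_token)
        if next_token == eos_token_id:              # stop condition: end-of-string token
            break
    return tokens

Sampling or beam search can replace the argmax step, as described in the decoding-methods blog post linked in the notebook text.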
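The tutorial introduction also mentions low-precision optimization with NNCF. A minimal sketch of the weight-compression call commonly used for this, with illustrative paths and settings rather than the notebook's actual values:

import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model("language_model.xml")  # placeholder path for an exported model part

# 4-bit asymmetric weight compression; mode and ratio are example settings, not the notebook's.
compressed = nncf.compress_weights(ov_model, mode=nncf.CompressWeightsMode.INT4_ASYM, ratio=0.8)
ov.save_model(compressed, "language_model_int4.xml")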
notebooks/glm4-v/ov_glm4v_helper.py (+1 -1)

@@ -370,7 +370,7 @@ def _glm4_core_attention_forward(self, query_layer, key_layer, value_layer, atte
 "past_key_values": pkv,
 }

-ts_decoder = TorchScriptPythonDecoder(vision_embed_tokens, example_input=example_input, trace_kwargs={"check_trace": False})
+ts_decoder = TorchScriptPythonDecoder(model.transformer, example_input=example_input, trace_kwargs={"check_trace": False})

 ov_model = ov.convert_model(ts_decoder, example_input=example_input)

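For context, the changed line follows the usual OpenVINO PyTorch conversion pattern: wrap a torch module in TorchScriptPythonDecoder with an example input for tracing, then pass the decoder to ov.convert_model. A minimal self-contained sketch of that pattern; the toy module and tensor shape are placeholders, not the helper's real inputs:

import torch
import openvino as ov
from openvino.frontend.pytorch.ts_decoder import TorchScriptPythonDecoder

module = torch.nn.Linear(16, 4)      # stand-in for the real submodule, e.g. model.transformer
example_input = torch.randn(1, 16)   # tracing input; the helper passes a dict of tensors instead

# check_trace=False skips re-running the traced graph for verification (useful for large models)
ts_decoder = TorchScriptPythonDecoder(module, example_input=example_input, trace_kwargs={"check_trace": False})
ov_model = ov.convert_model(ts_decoder, example_input=example_input)
ov.save_model(ov_model, "module.xml")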
0 commit comments
