
Commit af2d7ca

eustlb and xenova authored and committed
Add Moonshine (#34784)
* config draft
* full encoder forward
* full decoder forward
* fix sdpa and FA2
* fix sdpa and FA2
* moonshine model
* moonshine model forward
* fix attention with past_key_values
* add MoonshineForConditionalGeneration
* fix cache handling and causality for cross attention
* no causal attention mask for the encoder
* model addition (imports etc)
* small nit
* nits
* Update src/transformers/models/moonshine/convert_usefulsensors_to_hf.py (Co-authored-by: Joshua Lochner <admin@xenova.com>)
* add rope_theta
* nits
* model doc
* Update src/transformers/models/auto/configuration_auto.py (Co-authored-by: Joshua Lochner <admin@xenova.com>)
* imports
* add MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES
* updates modular
* make
* make fix-copies
* ruff check examples fix
* fix check_modular_conversion
* nit
* nits
* nits
* copied from -> imports
* imports fix
* integrate attention refacto
* modular edge case
* remove encoder
* convolutions params in config
* run modular_model_converter
* make
* Update docs/source/en/model_doc/moonshine.md (Co-authored-by: Joshua Lochner <admin@xenova.com>)
* MoonshineModelTest
* correct typo
* make style
* integration tests
* make
* modular convert
* name conversion update (up_proj -> fc1 etc)
* update config
* update MLP
* update attention
* update encoder layer
* update decoder layer
* update convolutions parameters
* update encoder
* remove INPUTS_DOCSTRING
* update decoder
* update conditional generation
* update pretrained model
* imports
* modular converted
* update doc
* fix
* typo
* update doc
* update license
* update init
* split config in file
* two classes for MLP
* attention from GLM
* from GlmRotaryEmbedding
* split MLP
* apply arthur's review suggestions
* apply arthur's review suggestions
* apply arthur's review suggestions
* auto feature extractor
* convert modular
* fix + make
* convert modular
* make
* unsplit config
* use correct checkpoint
* wrap generate
* update tests
* typos
* make
* typo
* update doc

---------

Co-authored-by: Joshua Lochner <admin@xenova.com>
1 parent 42b8e79 commit af2d7ca

19 files changed (+3,852 −2 lines)

docs/source/en/_toctree.yml (+3 −1)

```diff
@@ -505,7 +505,9 @@
       - local: model_doc/mobilebert
         title: MobileBERT
       - local: model_doc/modernbert
-        title: ModernBERT
+        title: ModernBert
+      - local: model_doc/moonshine
+        title: moonshine
       - local: model_doc/mpnet
         title: MPNet
       - local: model_doc/mpt
```

docs/source/en/index.md (+1)

```diff
@@ -235,6 +235,7 @@ Flax), PyTorch, and/or TensorFlow.
 | [MobileViT](model_doc/mobilevit)     | ✅ | ✅ | ❌ |
 | [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ |
 | [ModernBERT](model_doc/modernbert)   | ✅ | ❌ | ❌ |
+| [Moonshine](model_doc/moonshine)     | ✅ | ❌ | ❌ |
 | [Moshi](model_doc/moshi)             | ✅ | ❌ | ❌ |
 | [MPNet](model_doc/mpnet)             | ✅ | ✅ | ❌ |
 | [MPT](model_doc/mpt)                 | ✅ | ❌ | ❌ |
```

docs/source/en/model_doc/moonshine.md (+56, new file)

```diff
@@ -0,0 +1,56 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Moonshine
+
+## Overview
+
+The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands
+](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden.
+
+The abstract from the paper is the following:
+
+*This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny-en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.*
+
+Tips:
+
+- Moonshine improves upon Whisper's architecture:
+  1. It uses SwiGLU activation instead of GELU in the decoder layers
+  2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows.
+
+This model was contributed by [Eustache Le Bihan (eustlb)](https://huggingface.co/eustlb).
+The original code can be found [here](https://github.com/usefulsensors/moonshine).
+
+## Resources
+
+- [Automatic speech recognition task guide](../tasks/asr)
+
+## MoonshineConfig
+
+[[autodoc]] MoonshineConfig
+
+## MoonshineModel
+
+[[autodoc]] MoonshineModel
+    - forward
+    - _mask_input_features
+
+## MoonshineForConditionalGeneration
+
+[[autodoc]] MoonshineForConditionalGeneration
+    - forward
+    - generate
```
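The `generate` entry documented above is the main user-facing API for transcription. A minimal usage sketch follows; the checkpoint name `UsefulSensors/moonshine-tiny` and the dummy LibriSpeech dataset are illustrative assumptions, not part of this commit:

```python
from datasets import load_dataset
from transformers import AutoProcessor, MoonshineForConditionalGeneration

# Assumed checkpoint name; substitute whichever Moonshine checkpoint you use.
checkpoint = "UsefulSensors/moonshine-tiny"
processor = AutoProcessor.from_pretrained(checkpoint)
model = MoonshineForConditionalGeneration.from_pretrained(checkpoint)

# Any 16 kHz mono waveform works; here we take one clip from a small test set.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

# Seq2seq generation: encode the audio once, then decode tokens autoregressively.
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Because RoPE replaces absolute position embeddings, the input clip is not padded to a fixed 30-second window the way Whisper inputs are.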

docs/source/en/perf_infer_gpu_one.md (+3 −1)

```diff
@@ -68,6 +68,7 @@ FlashAttention-2 is currently supported for the following architectures:
 * [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
 * [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video)
 * [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision)
+* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
 * [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)
 * [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)
 * [VideoLlava](https://huggingface.co/docs/transformers/model_doc/video_llava)
@@ -265,6 +266,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
 * [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video)
 * [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision)
 * [M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100#transformers.M2M100Model)
+* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
 * [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)
 * [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
 * [Mllama](https://huggingface.co/docs/transformers/model_doc/mllama#transformers.MllamaForConditionalGeneration)
@@ -283,8 +285,8 @@ For now, Transformers supports SDPA inference and training for the following architectures:
 * [Phi3](https://huggingface.co/docs/transformers/model_doc/phi3#transformers.Phi3Model)
 * [PhiMoE](https://huggingface.co/docs/transformers/model_doc/phimoe#transformers.PhimoeModel)
 * [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel)
-* [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel)
 * [mBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel)
+* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
 * [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
 * [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
 * [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel)
```

src/transformers/__init__.py (+14)

```diff
@@ -610,6 +610,7 @@
     "models.mobilevit": ["MobileViTConfig"],
     "models.mobilevitv2": ["MobileViTV2Config"],
     "models.modernbert": ["ModernBertConfig"],
+    "models.moonshine": ["MoonshineConfig"],
     "models.moshi": [
         "MoshiConfig",
         "MoshiDepthConfig",
@@ -2907,6 +2908,13 @@
             "ModernBertPreTrainedModel",
         ]
     )
+    _import_structure["models.moonshine"].extend(
+        [
+            "MoonshineForConditionalGeneration",
+            "MoonshineModel",
+            "MoonshinePreTrainedModel",
+        ]
+    )
     _import_structure["models.moshi"].extend(
         [
             "MoshiForCausalLM",
@@ -5633,6 +5641,7 @@
         MobileViTV2Config,
     )
     from .models.modernbert import ModernBertConfig
+    from .models.moonshine import MoonshineConfig
    from .models.moshi import (
        MoshiConfig,
        MoshiDepthConfig,
@@ -7652,6 +7661,11 @@
            ModernBertModel,
            ModernBertPreTrainedModel,
        )
+        from .models.moonshine import (
+            MoonshineForConditionalGeneration,
+            MoonshineModel,
+            MoonshinePreTrainedModel,
+        )
        from .models.moshi import (
            MoshiForCausalLM,
            MoshiForConditionalGeneration,
```
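These registrations are what make the new classes importable from the package root: lazily via `_import_structure` at runtime, and eagerly for type checkers under `TYPE_CHECKING`. For example:

```python
# After this commit, top-level imports resolve with no submodule path needed:
from transformers import MoonshineConfig, MoonshineForConditionalGeneration

config = MoonshineConfig()                          # default hyperparameters
model = MoonshineForConditionalGeneration(config)   # randomly initialized model
```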

src/transformers/models/__init__.py (+1)

```diff
@@ -170,6 +170,7 @@
     mobilevit,
     mobilevitv2,
     modernbert,
+    moonshine,
     moshi,
     mpnet,
     mpt,
```

src/transformers/models/auto/configuration_auto.py (+2)

```diff
@@ -190,6 +190,7 @@
         ("mobilevit", "MobileViTConfig"),
         ("mobilevitv2", "MobileViTV2Config"),
         ("modernbert", "ModernBertConfig"),
+        ("moonshine", "MoonshineConfig"),
         ("moshi", "MoshiConfig"),
         ("mpnet", "MPNetConfig"),
         ("mpt", "MptConfig"),
@@ -519,6 +520,7 @@
         ("mobilevit", "MobileViT"),
         ("mobilevitv2", "MobileViTV2"),
         ("modernbert", "ModernBERT"),
+        ("moonshine", "Moonshine"),
         ("moshi", "Moshi"),
         ("mpnet", "MPNet"),
         ("mpt", "MPT"),
```

src/transformers/models/auto/feature_extraction_auto.py (+1)

```diff
@@ -73,6 +73,7 @@
         ("mobilenet_v1", "MobileNetV1FeatureExtractor"),
         ("mobilenet_v2", "MobileNetV2FeatureExtractor"),
         ("mobilevit", "MobileViTFeatureExtractor"),
+        ("moonshine", "Wav2Vec2FeatureExtractor"),
         ("moshi", "EncodecFeatureExtractor"),
         ("nat", "ViTFeatureExtractor"),
         ("owlvit", "OwlViTFeatureExtractor"),
```

src/transformers/models/auto/modeling_auto.py (+3)

```diff
@@ -179,6 +179,7 @@
         ("mobilevit", "MobileViTModel"),
         ("mobilevitv2", "MobileViTV2Model"),
         ("modernbert", "ModernBertModel"),
+        ("moonshine", "MoonshineModel"),
         ("moshi", "MoshiModel"),
         ("mpnet", "MPNetModel"),
         ("mpt", "MptModel"),
@@ -436,6 +437,7 @@
         ("mega", "MegaForMaskedLM"),
         ("megatron-bert", "MegatronBertForCausalLM"),
         ("mobilebert", "MobileBertForMaskedLM"),
+        ("moonshine", "MoonshineForConditionalGeneration"),
         ("mpnet", "MPNetForMaskedLM"),
         ("mpt", "MptForCausalLM"),
         ("mra", "MraForMaskedLM"),
@@ -937,6 +939,7 @@
 
 MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
     [
+        ("moonshine", "MoonshineForConditionalGeneration"),
         ("pop2piano", "Pop2PianoForConditionalGeneration"),
         ("seamless_m4t", "SeamlessM4TForSpeechToText"),
         ("seamless_m4t_v2", "SeamlessM4Tv2ForSpeechToText"),
```

src/transformers/models/auto/processing_auto.py (+1)

```diff
@@ -81,6 +81,7 @@
         ("mctct", "MCTCTProcessor"),
         ("mgp-str", "MgpstrProcessor"),
         ("mllama", "MllamaProcessor"),
+        ("moonshine", "Wav2Vec2Processor"),
         ("oneformer", "OneFormerProcessor"),
         ("owlv2", "Owlv2Processor"),
         ("owlvit", "OwlViTProcessor"),
```

src/transformers/models/auto/tokenization_auto.py (+1)

```diff
@@ -321,6 +321,7 @@
         ("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
         ("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
         ("modernbert", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
+        ("moonshine", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
         ("moshi", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
         ("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
         ("mpt", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
```
src/transformers/models/moonshine/__init__.py (+27, new file)

```diff
@@ -0,0 +1,27 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import _LazyModule
+from ...utils.import_utils import define_import_structure
+
+
+if TYPE_CHECKING:
+    from .configuration_moonshine import *
+    from .modeling_moonshine import *
+else:
+    import sys
+
+    _file = globals()["__file__"]
+    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
```
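This is the lazy-import pattern newer transformers models use: importing the subpackage executes none of the torch-heavy modeling code; `_LazyModule` replaces the module in `sys.modules` and loads submodules on first attribute access. Illustratively:

```python
# Importing the subpackage is cheap; the real modeling module loads on demand.
import transformers.models.moonshine as moonshine

config_cls = moonshine.MoonshineConfig  # first attribute access triggers the import
```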
