Meituan Open-Source LongCat-Next: 3B Parameters for Unified Visual Understanding, Generation, and Speech

BlockBeatNews

According to 1M AI News monitoring, Meituan's LongCat team has open-sourced LongCat-Next, a native multimodal model built on an MoE architecture with 3 billion activated parameters. It unifies five capabilities—text, visual understanding, image generation, speech understanding, and speech synthesis—within a single autoregressive framework. The model and its accompanying tokenizer are released under the MIT license, with weights available on HuggingFace.

LongCat-Next’s core design is the DiNA (Discrete Native Autoregressive) paradigm: each modality is paired with its own tokenizer and decoder, so visual and audio signals are converted into discrete tokens that share the same embedding space as text, and every task is handled through unified next-token prediction. The key component on the visual side, dNaViT (Discrete Native Resolution Vision Transformer), compresses image features into “visual words” and supports dynamic-resolution tokenization and decoding. It maintains strong image-generation quality even at a 28× compression ratio, and is particularly strong at text rendering.
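The unified-vocabulary idea behind DiNA can be sketched in a few lines. Everything below—the vocabulary sizes, the tokenizer stand-ins, and the toy context vector—is an illustrative assumption, not LongCat-Next’s actual configuration:

```python
import numpy as np

# Hypothetical vocabulary partition: one shared id space for all modalities.
TEXT_VOCAB = 1000      # text tokens occupy ids [0, 1000)
VISUAL_VOCAB = 512     # "visual words" from the image tokenizer: [1000, 1512)
AUDIO_VOCAB = 256      # discrete audio tokens: [1512, 1768)
VOCAB_SIZE = TEXT_VOCAB + VISUAL_VOCAB + AUDIO_VOCAB
EMBED_DIM = 64

rng = np.random.default_rng(0)
# One shared embedding table: text, visual, and audio tokens all live in the
# same space, so a single autoregressive model can attend across modalities.
embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

def tokenize_image(patches: int) -> list[int]:
    """Stand-in for dNaViT: map image patches to discrete visual-word ids."""
    return (rng.integers(0, VISUAL_VOCAB, size=patches) + TEXT_VOCAB).tolist()

def tokenize_audio(frames: int) -> list[int]:
    """Stand-in for the audio tokenizer: map frames to discrete audio ids."""
    offset = TEXT_VOCAB + VISUAL_VOCAB
    return (rng.integers(0, AUDIO_VOCAB, size=frames) + offset).tolist()

# A mixed-modality sequence: a text prompt, then image tokens, then audio tokens.
sequence = [1, 2, 3] + tokenize_image(patches=16) + tokenize_audio(frames=8)

# Unified next-token prediction: embed the sequence and score every token in
# the shared vocabulary with one output head, regardless of modality.
hidden = embedding[sequence].mean(axis=0)          # toy "context" vector
head = rng.normal(size=(EMBED_DIM, VOCAB_SIZE))    # shared output projection
logits = hidden @ head
next_token = int(np.argmax(logits))                # could land in any modality
```

The point of the sketch is the shared id space: because the predicted token can fall in the text, visual, or audio range, the same next-token loop covers understanding and generation for every modality.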

In comparisons with models of similar activated-parameter size (A3B, i.e. roughly 3B activated parameters), LongCat-Next’s main benchmark results are:

  1. Visual understanding: MMMU-Pro 60.3 (Qwen3-Omni 57.0, GPT5-minimal 62.7), MathVista 83.1 (Qwen3-Omni 75.9, GPT5-minimal 50.9), MathVision 64.7 (outperforming all comparison models), DocVQA 94.2
  2. Image generation: GenEval 84.44, LongText-EN 93.15 (FLUX.1-dev 60.70, Emu-3.5 97.60)
  3. Programming: SWE-Bench 43.0 (Kimi-Linear-48B 32.8, Qwen3-Next-80B 37.6)
  4. Agent tool invocation: Tau2-Retail 73.68 (Qwen3-Next 57.3), Tau2-Telecom 62.06 (Qwen3-Next 13.2)

In cross-model comparisons among unified understanding-and-generation models, LongCat-Next’s MMMU score of 70.6 surpasses second-place NEO-unify (68.9) and significantly exceeds earlier unified-model approaches such as BAGEL (55.3) and Ovis-U1 (51.1). Its SWE-Bench score of 43.0 and its results on the Tau2 tool-invocation benchmarks also demonstrate that the multimodal unified architecture does not sacrifice pure-text or agent capabilities.
