LLaVA-OneVision-2: Multimodal Model Analyzes Compressed Video Stream Through a Codec Instead of Frame Sampling

28 May 2026

Researchers from Glint Lab, AIM for Health Lab, and MVP Lab published LLaVA-OneVision-2 (LLaVA-OV-2) — a next-generation multimodal model that rethinks how a neural network “watches” video. Instead of slicing…

LongLive-2.0 — 5B Model Generates Long Video at 720p in Real Time

20 May 2026

Researchers from NVIDIA have published LongLive-2.0 — an infrastructure for training and running long video generation models using NVFP4 4-bit precision quantization. Quantization is the compression of model weights by…

SenseNova-U1: NEO-unify multimodal architecture works directly with pixels without VAE

14 May 2026

SenseNova introduced a new multimodal architecture, SenseNova-U1, which combines image understanding, generation, and editing inside a single transformer without a separate visual encoder or variational autoencoder. This approach removes the…

OpenSeeker-v2: Best-in-Class Deep Research Agent Built by an Academic Team on Just 10,600 Samples

7 May 2026

Researchers from Shanghai Jiao Tong University have proven that building a best-in-class deep research agent doesn’t require hundreds of billions of pre-training tokens or expensive reinforcement learning. Just 10,600 carefully…

OpenGame: AI Agent Generates Full Browser Games from Text Description

22 April 2026

A team of researchers from CUHK MMLab published OpenGame — the first agentic framework for creating browser-based 2D games from natural language descriptions. The project is fully open: the framework…

InCoder-32B-Thinking: Open-Source Code Generation Model for Microcontrollers, GPU Kernel Optimization, and RTL Design

7 April 2026

A research team from Beihang University, Shanghai Jiao Tong University, the University of Manchester, and IQuest Research has published InCoder-32B-Thinking, a language model with extended chain-of-thought reasoning for…

Trinity-Large-Thinking 400B: an open model matching Claude Opus-4.6 on agentic benchmarks at 28x lower price

3 April 2026

Arcee AI has released Trinity-Large-Thinking — an open-weight reasoning model for complex multi-turn agentic tasks. On PinchBench — a comprehensive benchmark for AI agents — it ranks second among all…

PixelSmile: Open Model for Facial Expression Editing with Smooth Intensity Control

31 March 2026

Researchers from Fudan University and StepFun have published PixelSmile — a diffusion model for precise facial expression editing in portraits and anime images. Instead of training on discrete labels like…

RealRestorer: Open-Source Image Enhancement Model Outperforms Nano Banana Pro on Real-World Benchmark

30 March 2026

A team of researchers from StepFun, Southern University of Science and Technology, and the Chinese Academy of Sciences has published RealRestorer — an open-source image quality enhancement model that removes…

MinerU-Diffusion: A New Approach to OCR via Diffusion Decoding Speeds Up PDF Parsing 3× Without Accuracy Loss

27 March 2026

A team from Shanghai Artificial Intelligence Laboratory and Peking University published MinerU-Diffusion — a document OCR framework that abandons classical autoregressive generation in favor of diffusion-based decoding. The project is…

daVinci-MagiHuman: Open 15B Model Generates a 5-Second Lip Sync Video in 2 Seconds on a Single H100

24 March 2026

SII-GAIR and Sand.ai have published daVinci-MagiHuman — an open-source multimodal 15B model based on a single-stream transformer that simultaneously generates video with precise lip sync and synchronized audio, producing a…

Helios: 14B Model Generates Videos Longer Than 60 Seconds at 19.5 FPS on a Single H100

11 March 2026

A team of researchers from Peking University and ByteDance published Helios — an autoregressive diffusion transformer with 14 billion parameters that generates video at 19.5 frames per second on a…

Baichuan-M3: An Open Medical Model That Conducts Consultations Like a Real Doctor and Outperforms GPT-5.2 on Benchmarks

10 February 2026

A research team from the Chinese company Baichuan has introduced Baichuan-M3 — an open medical language model that, instead of the traditional question-and-answer mode, conducts a full clinical dialogue, actively…

Claude Sonnet 4.5 Leads on Comprehensive Backend Benchmark, Outperforming in Both Code and Environment Configuration

22 January 2026

A team of researchers from Fudan University and Shanghai Qiji Zhifeng Co. introduced ABC-Bench — the first benchmark that tests the ability of AI agents to solve full-fledged backend development…

Multiplex Thinking: Sampling 3 Tokens Instead of 1 Increases Olympiad Problem-Solving Accuracy from 40% to 55%

22 January 2026

Researchers from the University of Pennsylvania and Microsoft Research introduced Multiplex Thinking — a new reasoning method for large language models. The idea is to generate not one token at…

Yume1.5: An Open Model for Creating Interactive Virtual Worlds with Keyboard Control

5 January 2026

Researchers from Shanghai AI Laboratory and Fudan University published Yume1.5 — a model for generating interactive virtual worlds that can be controlled directly from the keyboard. Unlike regular video generation,…

AI Models Are 13% Worse Than Humans at Detecting Generated ASMR Videos

18 December 2025

Researchers from CUHK, NUS, University of Oxford, and Video Rebirth introduced Video Reality Test — the first benchmark that tests whether modern AI models can create videos indistinguishable from real…

Wan-Move: Open-Source Alternative to Kling 1.5 Pro for Motion-Controllable Video Generation

13 December 2025

A team of researchers from Tongyi Lab (Alibaba Group), Tsinghua University, and the University of Hong Kong presented Wan-Move — a new approach to precise motion control in generative video…

P1: First Open-Source Model to Win Gold at the International Physics Olympiad

30 November 2025

P1-235B-A22B from Shanghai AI Laboratory became the first open-source model to win a gold medal at the latest International Physics Olympiad IPhO 2025, scoring 21.2 out of 30 points and…

MiroThinker v1.0: Open-Source AI Research Agent Learns to Make Up to 600 Tool Calls Per Task

20 November 2025

The MiroMind team introduced MiroThinker v1.0 — an AI research agent capable of performing up to 600 tool calls per task with a 256K token context window. On four key…

DeepEyesV2: Multimodal Model Learns to Use Tools to Solve Complex Tasks

12 November 2025

Researchers from Xiaohongshu introduced DeepEyesV2 — an agentic multimodal model based on Qwen2.5-VL-7B that can not only understand text and images but also actively use external tools: execute Python code…