Google Launches Gemini 2.5 Flash Image with Text-Based Editing Capabilities

26 August 2025

Google introduced Gemini 2.5 Flash Image (internally codenamed nano-banana), a model for image generation and editing. The model supports combining multiple images into one, maintains character consistency between…

NVIDIA Nemotron Nano 2: Reasoning and Code Generation Model Outperforms Qwen3-8B on Benchmarks and Supports 128k Context

20 August 2025

A team of NVIDIA researchers presented Nemotron-Nano-9B-v2 — a hybrid Mamba-Transformer language model that generates responses 6 times faster than Qwen3-8B on reasoning tasks while exceeding it in accuracy. The…

Matrix-3D: Open Framework for Generating Fully Explorable 3D Worlds from a Single Image

14 August 2025

Researchers from Skywork AI and the Hong Kong University of Science and Technology have introduced Matrix-3D — a framework for creating fully explorable 3D worlds from a single image or…

3D-R1: Open Source Reasoning Model for 3D Scenes Outperforms State-of-the-Art Methods by 10% on 3D Benchmarks

6 August 2025

Researchers from Shanghai University of Engineering Science and Peking University presented 3D-R1 — a new foundation model that significantly improves the reasoning capabilities of three-dimensional vision-language models (VLMs). The model demonstrates an average performance…

Seed Diffusion: New State-of-the-Art in Speed-Quality Balance for Code Generation Models

6 August 2025

The research team from ByteDance Seed in collaboration with the AIR Institute of Tsinghua University introduced Seed Diffusion Preview — a language model based on discrete diffusion that demonstrates record-breaking…

Gemini 2.5 Pro Achieved Gold Medal Performance at IMO 2025, Solving 5 of 6 Problems

25 July 2025

Large language models perform well on mathematical benchmarks like AIME; however, International Mathematical Olympiad (IMO) problems require deep understanding, creativity, and formal reasoning. Chinese researchers used Google Gemini 2.5 Pro…

Show-o2: Open-source 7B multimodal model outperforms 14B models on benchmarks using significantly less training data

11 July 2025

Researchers from Show Lab at the National University of Singapore and ByteDance introduced Show-o2 — a second-generation multimodal model that demonstrates superior results in image and video understanding and generation…

MiniCPM4: Open Local Model Achieves Qwen3-8B Performance with 7x Inference Acceleration

15 June 2025

The OpenBMB research team presented MiniCPM4 — a highly efficient language model designed specifically for local devices. MiniCPM4-8B achieves performance comparable to Qwen3-8B (81.13 vs 80.55) while requiring 4.5 times…

Strict On-Policy Training with Optimal Baseline: Microsoft Introduces Simplified Algorithm for RLHF

4 June 2025

The Microsoft Research team introduced On-Policy RL with Optimal reward baseline (OPO) — a simplified reinforcement learning algorithm for aligning large language models. The new method addresses key problems of…

Visual-ARFT: Multimodal AI Agents Outperform GPT-4o by 18.6% in Complex Visual Tasks

22 May 2025

A research team from Shanghai Jiao Tong University and Shanghai Artificial Intelligence Laboratory has introduced Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) — a new approach to training large multimodal models with…

ZEROSEARCH: A Framework That Cuts LLM Search Training Costs by 88%

9 May 2025

Alibaba’s NLP research team has officially open-sourced ZEROSEARCH, a complete framework for training LLMs to search without using real search engines. ZEROSEARCH builds on a key insight: LLMs have already…

Phi-4-reasoning: Microsoft’s Breakthrough in AI Thinking

4 May 2025

Microsoft recently unveiled Phi-4-reasoning, a 14-billion-parameter model that achieves exceptional performance on complex reasoning tasks, outperforming models 5-47 times larger while requiring significantly fewer computational resources, with developers able…

MedSAM2: Open Source SOTA 3D Medical Image and Video Segmentation Model

13 April 2025

Medical image segmentation plays a critical role in precision medicine, enabling more accurate diagnosis, treatment planning, and quantitative analysis. While significant progress has been made in developing both specialized and…

Llama Nemotron: NVIDIA Launches Family of Open Reasoning AI Models Overtaking DeepSeek R1

19 March 2025

NVIDIA has announced the open Llama Nemotron family of models with reasoning capabilities, designed to provide a business-ready foundation for creating advanced AI agents. These models can work independently or…

Chain-of-Experts: Novel Approach Improving MoE Efficiency with up to 42% Memory Reduction

11 March 2025

Chain-of-Experts (CoE) is a novel approach that fundamentally changes how sparse language models process information, delivering better performance with significantly less memory. This breakthrough addresses key limitations in current Mixture-of-Experts (MoE)…

R1-Onevision: Open Source 7B-Parameter Model Outperforming GPT-4o in Maths and Reasoning Tasks

27 February 2025

Researchers from Zhejiang University have released R1-Onevision, a 7B-parameter multimodal reasoning model that processes and analyzes visual inputs with unprecedented logical precision, capable of understanding complex mathematical, scientific, and…

Step-Video-T2V: Text-to-Video Open-Source Model Achieves 16x Video Compression Breakthrough

20 February 2025

Researchers from Stepfun AI have developed Step-Video-T2V, a 30-billion-parameter text-to-video model that generates videos up to 204 frames long at 544×992 resolution and understands both Chinese and English prompts.…

EPFL Study: Language Models Don’t Translate Into English – They Operate Through Concepts

30 January 2025

New research from EPFL sheds light on the internal mechanisms of multilingual data processing in LLMs, which is critical for understanding how modern language models work and how to optimize…

ByteDance Unveils TA-TiTok Tokenizer Achieving SOTA in Text-to-Image Generation Using Only Public Data

19 January 2025

Researchers from ByteDance and POSTECH introduced TA-TiTok (Text-Aware Transformer-based 1-Dimensional Tokenizer), a novel approach to making text-to-image AI models more accessible and efficient. Their work demonstrates through MaskGen models how…

MiniMax-01: 4M Context Length Benchmark Leader Powered by Lightning Attention

15 January 2025

MiniMax has open-sourced its latest MiniMax-01 series, introducing two models that push the boundaries of context length and attention mechanisms: MiniMax-Text-01 for language processing and MiniMax-VL-01 for visual-language tasks. The…

ArtAug: Open Source Framework for Image Generation Models Enhancement

18 December 2024

Researchers from East China Normal University and Alibaba Group have introduced ArtAug, a framework that enhances text-to-image generation through synthesis-understanding interaction. This approach significantly improves image quality without requiring extensive manual…