Chain-of-Experts: Novel Approach Improving MoE Efficiency with up to 42% Memory Reduction


Chain-of-Experts (CoE) is a novel approach that fundamentally changes how sparse language models process information, delivering better performance with significantly less memory. It addresses key limitations of current Mixture-of-Experts (MoE) architectures while opening new pathways for efficient AI scaling. Applied to the DeepSeekV2-Lite architecture, CoE either matches the baseline's performance with fewer parameters, cutting memory requirements by up to 42%, or improves performance at the same parameter count, reducing validation loss from 1.20 to 1.12. The code has been open-sourced and is available on GitHub.

Traditional MoE models struggle with two critical limitations: experts process information independently with minimal communication, and the sparse activation patterns demand substantial GPU resources. CoE addresses both issues through sequential expert processing.

Key Achievements

  • Performance gains: two iteration cycles reduce Math validation loss from 1.20 to 1.12
  • Memory efficiency: 17.6-42% lower memory requirements with equivalent performance
  • Scaling superiority: CoE outperforms both layer scaling and expert selection count expansion
  • Combinatorial explosion: 823× increase in possible expert combinations

How it works

CoE implements an iterative mechanism in which experts process tokens that build on the outputs of other experts:

  • Instead of parallel processing, experts work sequentially, forming dependencies between experts
  • Each iteration’s expert selection is determined by the output of the previous iteration
  • Information accumulates during iterations, achieving explicit expert communication

The formal representation can be described as:

$$x^{(0)} = x$$

$$x^{(t)} = \sum_{i=1}^{N} g_{t,i} \cdot \text{E}_i\left(x^{(t-1)}\right) + \mathbb{I}_r \cdot x^{(t-1)}, \quad t = 1, 2, \dots, C$$

$$y = x^{(C)}$$

Experiments reveal that independent gating mechanisms and inner residual connections are critical for CoE’s success. The independent gating allows experts to process different types of information at different stages, while inner residuals effectively increase model depth.
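To make the mechanism and the formula above concrete, here is a minimal PyTorch sketch of a single CoE layer, written under simplifying assumptions (per-iteration top-k softmax gating, feed-forward experts, and an inner residual added on every iteration); the class and parameter names are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainOfExpertsLayer(nn.Module):
    """Illustrative Chain-of-Experts layer: C sequential iterations over N experts.

    Assumptions (not taken from the released code): top-k softmax gating that is
    re-computed at every iteration, and an inner residual added at each step.
    """

    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2, num_iterations: int = 2):
        super().__init__()
        self.num_iterations = num_iterations
        self.top_k = top_k
        # E_i: simple feed-forward experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        # Independent gating network for each iteration (g_{t,i} in the formula)
        self.gates = nn.ModuleList([nn.Linear(d_model, num_experts) for _ in range(num_iterations)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, d_model)
        h = x  # x^{(0)} = x
        for t in range(self.num_iterations):
            logits = self.gates[t](h)                      # routing depends on the previous iteration's output
            topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(topk_vals, dim=-1)         # g_{t,i}, non-zero only for selected experts
            expert_out = torch.zeros_like(h)
            for slot in range(self.top_k):
                idx = topk_idx[:, slot]                    # chosen expert per token for this slot
                w = weights[:, slot].unsqueeze(-1)
                for e, expert in enumerate(self.experts):  # loop form for clarity, not efficiency
                    mask = idx == e
                    if mask.any():
                        expert_out[mask] += w[mask] * expert(h[mask])
            h = expert_out + h                             # sum_i g_{t,i} E_i(x^{(t-1)}) + inner residual
        return h                                           # y = x^{(C)}
```

Setting num_iterations=1 recovers an ordinary MoE layer; with two or more iterations the same expert pool is reused sequentially, which is how CoE can match a deeper or wider MoE with fewer parameters.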

Testing

The Chain-of-Experts method was tested on a DeepSeekV2-Lite architecture with 544M total parameters (excluding embeddings), configured with 4 hidden layers, a hidden size of 1024, and 8 attention heads.

All experiments used the MetaMathQA dataset, which is augmented from GSM8K and MATH datasets.
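For reference, the reported setup can be summarized in a small configuration dictionary; the field names below are hypothetical and do not correspond to the released code.

```python
# Hypothetical summary of the reported experimental setup; field names are illustrative.
experiment_config = {
    "base_architecture": "DeepSeekV2-Lite",
    "total_params_excl_embedding": "544M",
    "num_hidden_layers": 4,
    "hidden_size": 1024,
    "num_attention_heads": 8,
    "dataset": "MetaMathQA",  # augmented from GSM8K and MATH
}
```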

Results

MoE vs CoE

  • CoE with 2 iteration cycles reduced Math validation loss from 1.20 to 1.12 compared to traditional MoE models
  • Memory requirements decreased by 17.6-42% while maintaining equivalent performance
  • CoE with 4 layers matched performance of traditional MoE with 8 layers
  • Tests confirmed that independent gating mechanisms and inner residual connections were essential to CoE’s effectiveness

Experiments were conducted on servers with a single H100 GPU (approximately 30 minutes per run) or with 4090 GPUs (approximately 2 hours per run).

Comparison with baselines on LPWP and ComplexOR

Researchers describe a “free lunch” effect where restructuring information flow delivers better results without additional computational overhead. This effect stems from:

  1. Dramatically increased freedom in expert choices
  2. Unification of sequential processing and expert communication
  3. Enhanced expert specialization through multiple processing opportunities
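As a rough illustration of the first point, the snippet below counts routing combinations for hypothetical expert counts; the numbers are chosen only to show why chaining multiplies the choices, and are not the configuration behind the reported 823× figure.

```python
from math import comb

# Hypothetical illustration: 64 experts, a budget of 8 expert calls per token.
N = 64

standard_moe = comb(N, 8)      # one parallel selection of 8 experts
chained_coe = comb(N, 4) ** 2  # two sequential selections of 4 experts each

print(f"standard MoE combinations: {standard_moe:,}")
print(f"chained CoE combinations:  {chained_coe:,}")
print(f"ratio: {chained_coe / standard_moe:.0f}x more routing paths")
```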

CoE represents a significant advance in efficient AI scaling, potentially making advanced language models more accessible and sustainable. Future work will explore scaling laws, extend evaluations beyond math datasets, and investigate architectural innovations like experts shared across all layers.
