The History of Mixture of Experts

April 26th, 2024

After OpenAI's ChatGPT, the tech world witnessed a modern-day gold rush. The AI giant ushered in a new era and set off an intense race across the industry.

However, against this backdrop of established AI players, a rising star emerged: Mixtral 8x7B from Mistral AI, challenging the status quo with its innovative Mixture of Experts (MoEs) architecture.

In this article, we will discuss the evolution of MoEs, including its emergence, struggles, and triumphs.

Rising Star or Returning Hero

The launch of OpenAI’s ChatGPT triggered a gold rush in Silicon Valley and other global tech hubs. Today, a handful of major players like OpenAI dominate the industry, and their head start and resources suggest they could keep that lead for years to come. In the fast-paced realm of AI, however, developments happen at an astonishing rate: it took the open-source community only about a year to produce a model that reaches the level of GPT-3.5. That model, Mixtral 8x7B, comes from Mistral AI – a company that was only eight months old when it was released.


Right off the bat, you will notice that Mixtral 8x7B has an unusual name. Instead of a single number representing the model’s parameter count, as is common, the name reads like a piece of arithmetic, and "Mistral" is spelled "Mixtral". Both point to the architectural paradigm introduced with this model: the Mixture of Experts, which works quite differently from how most large language models (LLMs) operate.

The birth of Mixtral 8x7B not only marked a milestone for open-source LLMs but also sparked a hot topic in the open AI community: the Mixture of Experts, or MoEs for short. The fact that Mistral AI made MoEs work and outperform Gemini Pro, Claude 2.1, and GPT-3.5 (at the time Mixtral 8x7B launched) left everyone wondering what exactly this approach is and how it achieves such performance. So, let’s travel back in time to where it all started.


Before the Deep Learning Era

Birth of a Concept (1991)

The story begins in 1991 with the seminal paper by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton titled "Adaptive Mixtures of Local Experts." They proposed a novel architecture that broke away from the traditional single-network approach. Their idea: a consortium of specialized networks, each an "expert" adept at handling specific subtasks within a broader problem. This specialization can lead to better overall accuracy and flexibility than a single, monolithic model.

But how does the system decide which expert is best suited for a particular input? This crucial role falls to the gating network. Acting as a conductor, the gating network directs the input data to the most suitable experts. It is typically a smaller neural network or statistical model trained to analyze the input and assign a weight to each expert, representing how much confidence it has in that expert's ability to handle the specific input. During training, both the individual experts and the gating network undergo a supervised learning process.
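
To make the routing idea concrete, below is a minimal NumPy sketch of a soft mixture of experts: a gating network scores the experts, a softmax turns the scores into confidence weights, and the output is the weighted combination of the expert outputs. The linear experts, dimensions, and random weights are illustrative choices, not the original 1991 setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy setup: 4 linear "experts" and one linear gating network acting on a
# 16-dimensional input. All sizes and weights are arbitrary, for illustration only.
n_experts, d_in, d_out = 4, 16, 8
expert_weights = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
gate_weights = rng.normal(size=(d_in, n_experts))

def mixture_of_experts(x):
    expert_outputs = np.stack([x @ W for W in expert_weights])  # (n_experts, d_out): every expert predicts
    gate = softmax(x @ gate_weights)                            # (n_experts,): confidence weight per expert
    return gate @ expert_outputs                                # gate-weighted combination of the experts

x = rng.normal(size=d_in)
print(mixture_of_experts(x).shape)  # (8,)
```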


This division of labor promised faster learning and improved performance compared to a monolithic network.

Beyond Neural Networks

While the rise of deep learning has led to a strong association between MoEs and deep neural networks as experts, the early days of MoEs weren't restricted to this specific model type. Researchers actively explored the potential of various expert models, showcasing the versatility of the MoEs framework:

  • Support Vector Machines (SVMs): These powerful classification algorithms were employed as experts in some MoEs architectures, demonstrating the ability to integrate different learning paradigms within the framework.
  • Hidden Markov Models (HMMs): These statistical models, particularly adept at handling sequential data, were utilized as experts in tasks involving speech recognition or natural language processing.

This exploration of diverse expert models highlighted the flexibility of MoEs and its potential to leverage the strengths of different machine learning approaches to achieve optimal performance.

Early Struggles

Despite its initial promise, MoEs faced challenges. The computational demands were significant, especially with limited hardware resources at the time. Additionally, training the gating network to effectively route data proved difficult. These factors led MoEs to be overshadowed by simpler algorithms during the early days of the AI resurgence.

However, the seeds of a future revival were sown. Researchers continued to explore the potential of MoEs and apply it to various tasks. Notably, the concept of "mixture models," where data is modeled as coming from several underlying sources, found fertile ground in statistics and machine learning, paving the way for a future comeback.

The Deep Learning Era

The rise of deep learning in the 2010s marked a turning point for MoEs. With the advent of powerful GPUs and vast datasets, the computational hurdles that once held MoEs back were significantly reduced. Moreover, researchers began to explore how MoEs could be integrated with deep learning architectures:

  • Experts as Building Blocks: While traditional MoEs formed a standalone system of gate plus experts, researchers such as Eigen, Ranzato, and Sutskever explored using MoEs as internal components within deeper networks. This opened the door to "MoEs layers" stacked within a larger architecture, enabling models to be both powerful and efficient at the same time. It also allows for hierarchical specialization: experts at lower layers can focus on specific features, while higher-level MoEs layers combine this information for more complex tasks (a minimal sketch of such stacking follows this list).
  • Conditional Computation and NLP: Traditional neural networks process all input data through every layer, regardless of its relevance. However, Yoshua Bengio's work on conditional computation paved the way for dynamically activating or deactivating network components based on the input. This concept perfectly aligned with the MoEs framework, leading to explorations in the Natural Language Processing (NLP) domain.
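
As a rough illustration of MoEs layers as building blocks, here is a hypothetical PyTorch sketch (with made-up dimensions, not the architecture from the Eigen, Ranzato, and Sutskever paper) in which two dense MoEs layers are simply stacked inside a larger network:

```python
import torch
import torch.nn as nn

class DenseMoELayer(nn.Module):
    """One MoEs layer: every expert runs and the gate mixes their outputs.
    All sizes here are made up for illustration."""
    def __init__(self, d_model, n_experts, d_hidden):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                                        # x: (batch, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)            # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (batch, n_experts, d_model)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # (batch, d_model)

# MoEs layers as building blocks: stacking them lets lower layers specialize
# on low-level features while higher layers recombine that information.
model = nn.Sequential(
    DenseMoELayer(d_model=32, n_experts=4, d_hidden=64),
    DenseMoELayer(d_model=32, n_experts=4, d_hidden=64),
    nn.Linear(32, 10),
)
print(model(torch.randn(2, 32)).shape)  # torch.Size([2, 10])
```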

Building upon these advancements, Shazeer et al. (including Geoffrey Hinton and Jeff Dean) pushed the boundaries in 2017 by scaling MoEs to models of up to 137 billion parameters built around Long Short-Term Memory (LSTM) networks – the dominant NLP architecture at the time. This groundbreaking work introduced sparsity, ensuring that only a small subset of experts is activated for each input, leading to significantly faster inference despite the model's immense size. Although the work focused primarily on machine translation, it also surfaced challenges such as high communication costs and training instabilities.
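
The core of that sparsity is top-k routing: score all experts, keep only the best k per token, and renormalize the kept scores. Below is a simplified PyTorch sketch of this idea; the noisy gating and load-balancing losses of the actual paper are omitted, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Top-k sparse MoEs layer: only k experts run per token. Simplified sketch;
    the actual Shazeer et al. layer adds noisy gating and load-balancing losses."""
    def __init__(self, d_model, n_experts, d_hidden, k=2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                                    # x: (n_tokens, d_model)
        scores = self.gate(x)                                 # (n_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)       # keep the best k experts per token
        weights = torch.softmax(top_vals, dim=-1)             # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                            # for each of the k routing slots...
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                  # ...find the tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoELayer(d_model=32, n_experts=8, d_hidden=64, k=2)
print(layer(torch.randn(5, 32)).shape)  # torch.Size([5, 32])
```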

Figure: the MoEs layer from the "Outrageously Large Neural Networks" paper

These advancements in the 2010s laid the foundation for the remarkable achievements we see today. MoEs has become a key enabler for training colossal models with trillions of parameters, such as the open-sourced Switch Transformers boasting 1.6 trillion parameters. And while this blog focuses on the NLP domain, MoEs has also been explored in other areas like Computer Vision, demonstrating its versatility across various AI applications.

The marriage of MoEs and LLMs

The 2020s witnessed a fascinating convergence: MoEs joining forces with the ever-evolving world of Large Language Models (LLMs). These powerful language models, trained on massive amounts of text data, offer exceptional capabilities in many NLP tasks.

  • MoEs Enhances LLM Efficiency - By integrating MoEs with LLMs, researchers aimed to improve efficiency and tackle the ever-growing computational demands of these models. MoEs allows only relevant parts of the LLM to be activated for a specific task, leading to significant resource savings. This is particularly crucial as LLM size continues to grow, requiring ever-increasing computational resources.
  • Mixtral 8x7B – Our hero uses an eight-expert MoEs layout in which only two experts are active for each token. In any forward pass, the number of parameters actually used per token is therefore much lower: roughly 12 billion out of a total of about 46 billion. This requires far less compute than activating all eight experts or running a similarly sized fully dense model. Because tokens are batched together during training, most if not all experts end up being used; so in this regime, a sparse MoEs model uses less compute per token while occupying the same amount of memory as a dense model of the same total size (a back-of-the-envelope calculation follows this list).
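
As a back-of-the-envelope calculation using the round numbers above (8 experts, top-2 routing, roughly 46 billion total and 12 billion active parameters; the exact Mixtral figures differ slightly), here is how the total splits into shared and per-expert parameters:

```python
# Rough split of a Mixtral-8x7B-like model into shared and per-expert parameters,
# using the round numbers quoted above. Illustrative only; exact figures differ slightly.
n_experts, top_k = 8, 2
total_params  = 46e9   # parameters stored in memory:        shared + n_experts * per_expert
active_params = 12e9   # parameters used for a single token: shared + top_k     * per_expert

per_expert = (total_params - active_params) / (n_experts - top_k)
shared = active_params - top_k * per_expert

print(f"per-expert parameters: {per_expert / 1e9:.1f}B")  # ~5.7B
print(f"shared parameters:     {shared / 1e9:.1f}B")      # ~0.7B
print(f"compute per token:     {active_params / total_params:.0%} of the full model")  # ~26%
```

The memory footprint still covers all of the roughly 46 billion parameters, which is exactly the compute-for-memory trade-off described above.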

Conclusion

In conclusion, the journey of Mixture of Experts unveils a narrative of resilience, innovation, and adaptation. Despite facing challenges and setbacks along the way, MoEs have proven their efficacy and unique creative potential amidst the competitive landscape of Large Language Models. As LLMs continue to scale to unprecedented sizes, MoEs stand out as a crucial component, offering efficiency and performance gains through selective activation and specialization.

Today's blog has provided a glimpse into the historical milestones of MoEs, offering a foundation for understanding their significance in contemporary AI research. However, this is merely the tip of the iceberg. For those intrigued by the intricacies of MoEs and eager to delve deeper into their mechanics, applications, and future prospects, stay tuned for upcoming blogs that will offer more comprehensive insights.

References:

LMSys Chatbot Arena Leaderboard – a Hugging Face Space by lmsys

Mixture of experts – Wikipedia

Mixture of Experts Explained – Hugging Face blog (huggingface.co)

"Twenty Years of Mixture of Experts" (2012) – hacettepe.edu.tr

Jacobs, Jordan, Nowlan, and Hinton, "Adaptive Mixtures of Local Experts" (1991) – toronto.edu

Eigen, Ranzato, and Sutskever, "Learning Factored Representations in a Deep Mixture of Experts" – arXiv:1312.4314

Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" – arXiv:1701.06538

Tags
Artificial Intelligence
Technology