Sparsely-gated Mixture of Experts (MoE)
In transformer models, the attention block is typically followed by a feed-forward layer (FF), which is a simple fully-connected NN with a hidden layer and nonlinearity. Here's the code for such a block that uses ReLU:
import numpy as np

def feed_forward_relu(x, W1, W2):
    """Feed-forward layer with ReLU activation.

    Args:
        x: Input tensor (B, N, D).
        W1: Weights for the hidden layer (D, DH).
        W2: Weights for the output layer (DH, D).

    Returns:
        Output tensor (B, N, D).
    """
    x = x @ W1            # hidden layer (B, N, DH)
    x = np.maximum(x, 0)  # ReLU nonlinearity (B, N, DH)
    x = x @ W2            # output layer (B, N, D)
    return x
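As a quick sanity check, here is a minimal usage sketch with randomly initialized weights; the batch size, sequence length, and dimensions below are arbitrary values chosen only to illustrate the shapes:

# Shape check with arbitrary sizes (B=2, N=4, D=8, DH=32).
B, N, D, DH = 2, 4, 8, 32
rng = np.random.default_rng(0)

x = rng.standard_normal((B, N, D))
W1 = rng.standard_normal((D, DH))
W2 = rng.standard_normal((DH, D))

out = feed_forward_relu(x, W1, W2)
print(out.shape)  # (2, 4, 8) -- same shape as the input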