single-page-annotation
{"meta": {"id": "zM3mlyflTt", "review_idx": 0, "title": "Title: Approximating Two-Layer Feedforward Networks for Efficient Transformers\nAbstract: How to reduce compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that *unifies* various methods to *approximate two-layer NNs* (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the *compute-equal* condition, our evaluation condition is *parameter-equal*, which is crucial to properly evaluate LMs. We show that our MoEs are competitive with the *dense* Transformer-XL on both the WikiText-103 and enwik8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs. 
Our code is public.", "claims": ["Claim1: Our resulting MoE Transformer variant outperforms our improved PKMs, and performs as well as or even outperforms the dense baseline, while using a fraction of its compute for both training and inference.", "Claim2: Importantly, unlike prior work, we compare our MoEs with dense baselines with the same number of total trainable parameters, which is crucial for proper evaluation in language modeling.", "Claim3: We demonstrate that MoEs are not limited to extremely-large LMs, but useful as a generic approach for resource-efficient NNs at any scale, and in line with the recent trend of improving 'smaller' models (Touvron et al., 2023; Taori et al., 2023; Chiang et al., 2023).", "Claim4: Finally, we release a CUDA kernel for our MoE layers which allows for achieving faster wall clock time and large memory reduction compared to the dense model.", "Claim5: We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwik8 datasets at two different scales, while being much more resource efficient.", "Claim6: This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs.", "Claim7: Fedus et al. (2022) integrate the MoE above into the Transformer to obtain their Switch Transformer. In terms of MoE details, one of Fedus et al. (2022)\u2019s key claims is that top-1 routing is enough.", "Claim8: Lewis et al. (2021) show that, while during training, the routing is enforced to be completely uniform, during the test time, the distribution looks exponential (in fact, this is similar to the Switch Transformer but more balanced for BASE).", "Claim9: Clark et al. (2022) have proposed to use the Sinkhorn algorithm (Sinkhorn, 1964; Sinkhorn and Knopp, 1967) instead (resulting in a model called Sinkhorn-BASE or S-BASE), to approximate the solution to this problem (note that similar routing is independently discussed by Kool et al. (2021)). 
They report that this works well, while being simpler to implement.", "Claim10: We experimentally confirm that this is indeed a good choice.", "Claim11: Our experiments in Sec. 6.3 show that this is sub-optimal.", "Claim12: We experimentally show that our regularization method (Eq. 21) and expert dropout (Eq. 22) are both effective despite their simplicity.", "Claim13: Our modified baseline model on Enwik8 still has 41M parameters and performs similarly to the original Transformer-XL (see Tab. 1).", "Claim14: Tab. 1 shows the results.", "Claim15: We observe that not only TopK in the MLP blocks preserves the performance of Transformers, it even improves performance.", "Claim16: Our experiments confirm the benefits of this choice (Tab. 2): the performance of the ReLU variants is much closer to the dense baseline (see also related findings in Shen et al. (2023)).", "Claim17: But even the best PKM models underperform the dense baselines indicating the fundamental limitation of PKMs.", "Claim18: Our \u03c3-MoE models match the performance of their parameter-equal dense baselines, while achieving significant memory and compute reduction.", "Claim19: The results above demonstrate that our \u03c3-MoEs can be configured to match the desired performance with fewer resources.", "Claim20: Our \u03c3-MoE outperforms Switch Transformer and S-BASE.", "Claim21: Our entropy-regularized models with expert dropout, especially \u03c3-MoE, are capable of matching the expert usage balancing of S-BASE without using the Sinkhorn activation function.", "Claim22: Importantly, our \u03c3-MoE with moderate sparsity matches the performance of parameter-equal dense baselines while being much more resource-efficient.", "Claim23: Our experiments show that if we naively increase the number of experts, the performance gap between MoE models and their dense counterparts increases.", "Claim24: However, even in its current form, it already yields significant performance boosts and memory 
reduction.", "Claim25: Our preliminary experiments suggest that such balancing entails a performance hit.", "Claim26: In contrast, MoEs allow increasing the number of parameters without such dramatic drawbacks.", "Claim27: We find that a higher K is beneficial.", "Claim28: The result can be seen in Fig. 6: the network combines experts in a rich way, further supporting the use of K > 1.", "Claim29: For execution time and memory usage, both the dense MLP and the MoE layers are linear in d_model (Fig. 9), the MLP is linear in d_ff, and MoE is linear in G (Fig. 8) and K.", "Claim30: However, both the memory usage and the execution time of the MoE are almost independent of N_E, except for a small linear factor due to the selection network (see Fig. 2).", "Claim31: Note that there is no significant difference in terms of speed and memory usage between different MoE variants given the same d_model, G, and K.", "Claim32: Since all methods are configured to have the same number of parameters as the dense baselines, and K experts are used in parallel, the factor of reduction in both FLOPs and memory usage is given by K/N_E. We show this factor for all models in Tab. 7.", "Claim33: Additional results of different MoE variants with more model details are shown in Tab. 10. We repeat the entries from Tab. 4 for easier comparison.", "Claim34: The models with poor performance can be distinguished easily (Switch Transformer and \u03c3-MoE with a softmax and renormalization, \u201csoftmax (renom.)\u201d).", "Claim35: Their poor performance may be partially explained by expert collapse."], "review": "Review: Reasons to reject: 1. The proposed $\\sigma$-MoE framework requires further justification. The specific design approach for $W_3$ and the uniform initialization of weight matrices are notable contributions. The paper should delve into the rationale behind these design choices and their benefits, particularly in terms of their impact on the performance of the MoE model.\n\n2. 
The empirical investigations provided appear somewhat limited in scope. Given that all experiments are exclusively conducted on WikiText-103 and Enwik8, which share similar data distributions, it would be prudent to expand the experimental scope to include other datasets. This would provide additional support for the performance claims of the $\\sigma$-MoE model.\n\n3. While the paper introduces novel elements through the MoE variant design, the novelty level might be constrained. To enhance the clarity of the novelty introduced beyond the MoE variant design, it's advisable to provide further elaboration and illustration.\n\n\nQuestions: 1. It would be valuable to include additional experiments on diverse text datasets such as PTB and C4. Expanding the experimental evaluation beyond Enwik8 and WikiText-103 can provide a more comprehensive understanding of the proposed approach's performance across various text domains and scales.\n2. Could you provide further clarification and elaboration on the design principles underlying the $\\sigma$-MoE model? \n\n"}, "response": {"Weakness associated with claims": [{"Associated claims": [], "Reasoning": "The review does not specifically point out any claims that lack sufficient support or evidence. The main concern is the limited scope of empirical investigations, which does not directly refute the claims but suggests that additional data would strengthen them. This criticism is more about the extensiveness of the evaluation rather than insufficiency of evidence for a particular claim.", "Label": "OTHER", "Weakness span": "3. While the paper introduces novel elements through the MoE variant design, the novelty level might be constrained. 
To enhance the clarity of the novelty introduced beyond the MoE variant design, it's advisable to provide further elaboration and illustration."}]}, "Clustered claims": {"DESCRIPTIVE": ["Claim4: Finally, we release a CUDA kernel for our MoE layers which allows for achieving faster wall clock time and large memory reduction compared to the dense model.", "Claim10: We experimentally confirm that this is indeed a good choice.", "Claim11: Our experiments in Sec. 6.3 show that this is sub-optimal.", "Claim13: Our modified baseline model on Enwik8 still has 41M parameters and performs similarly to the original Transformer-XL (see Tab. 1).", "Claim14: Tab. 1 shows the results.", "Claim16: Our experiments confirm the benefits of this choice (Tab. 2): the performance of the ReLU variants is much closer to the dense baseline (see also related findings in Shen et al. (2023)).", "Claim23: Our experiments show that if we naively increase the number of experts, the performance gap between MoE models and their dense counterparts increases.", "Claim25: Our preliminary experiments suggest that such balancing entails a performance hit.", "Claim27: We find that a higher K is beneficial.", "Claim28: The result can be seen in Fig. 6: the network combines experts in a rich way, further supporting the use of K > 1.", "Claim29: For execution time and memory usage, both the dense MLP and the MoE layers are linear in d_model (Fig. 9), the MLP is linear in d_ff, and MoE is linear in G (Fig. 8) and K.", "Claim30: However, both the memory usage and the execution time of the MoE are almost independent of N_E, except for a small linear factor due to the selection network (see Fig. 
2).", "Claim31: Note that there is no significant difference in terms of speed and memory usage between different MoE variants given the same d_model, G, and K.", "Claim32: Since all methods are configured to have the same number of parameters as the dense baselines, and K experts are used in parallel, the factor of reduction in both FLOPs and memory usage is given by K/N_E. We show this factor for all models in Tab. 7.", "Claim33: Additional results of different MoE variants with more model details are shown in Tab. 10. We repeat the entries from Tab. 4 for easier comparison.", "Claim34: The models with poor performance can be distinguished easily (Switch Transformer and \u03c3-MoE with a softmax and renormalization, \u201csoftmax (renom.)\u201d)."], "INTERPRETIVE": ["Claim1: Our resulting MoE Transformer variant outperforms our improved PKMs, and performs as well as or even outperforms the dense baseline, while using a fraction of its compute for both training and inference.", "Claim2: Importantly, unlike prior work, we compare our MoEs with dense baselines with the same number of total trainable parameters, which is crucial for proper evaluation in language modeling.", "Claim5: We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwik8 datasets at two different scales, while being much more resource efficient.", "Claim12: We experimentally show that our regularization method (Eq. 21) and expert dropout (Eq. 
22) are both effective despite their simplicity.", "Claim15: We observe that not only TopK in the MLP blocks preserves the performance of Transformers, it even improves performance.", "Claim17: But even the best PKM models underperform the dense baselines indicating the fundamental limitation of PKMs.", "Claim18: Our \u03c3-MoE models match the performance of their parameter-equal dense baselines, while achieving significant memory and compute reduction.", "Claim19: The results above demonstrate that our \u03c3-MoEs can be configured to match the desired performance with fewer resources.", "Claim20: Our \u03c3-MoE outperforms Switch Transformer and S-BASE.", "Claim21: Our entropy-regularized models with expert dropout, especially \u03c3-MoE, are capable of matching the expert usage balancing of S-BASE without using the Sinkhorn activation function.", "Claim22: Importantly, our \u03c3-MoE with moderate sparsity matches the performance of parameter-equal dense baselines while being much more resource-efficient.", "Claim24: However, even in its current form, it already yields significant performance boosts and memory reduction.", "Claim26: In contrast, MoEs allow increasing the number of parameters without such dramatic drawbacks.", "Claim35: Their poor performance may be partially explained by expert collapse."], "OVERARCHING": ["Claim3: We demonstrate that MoEs are not limited to extremely-large LMs, but useful as a generic approach for resource-efficient NNs at any scale, and in line with the recent trend of improving 'smaller' models (Touvron et al., 2023; Taori et al., 2023; Chiang et al., 2023).", "Claim6: This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs."], "RELATED_WORK": ["Claim7: Fedus et al. (2022) integrate the MoE above into the Transformer to obtain their Switch Transformer. In terms of MoE details, one of Fedus et al. 
(2022)\u2019s key claims is that top-1 routing is enough.", "Claim8: Lewis et al. (2021) show that, while during training, the routing is enforced to be completely uniform, during the test time, the distribution looks exponential (in fact, this is similar to the Switch Transformer but more balanced for BASE).", "Claim9: Clark et al. (2022) have proposed to use the Sinkhorn algorithm (Sinkhorn, 1964; Sinkhorn and Knopp, 1967) instead (resulting in a model called Sinkhorn-BASE or S-BASE), to approximate the solution to this problem (note that similar routing is independently discussed by Kool et al. (2021)). They report that this works well, while being simpler to implement."], "OTHER": []}, "id": "zM3mlyflTt0", "pdf": "openreview.net/pdf?id=zM3mlyflTt"}