## Experiments

The above figure shows the results of ToRA on 10 diverse mathematical reasoning datasets. We observe that:

- ToRA consistently outperforms state-of-the-art open-source models, achieving significant
improvements of 13%-19% on average across the 10 tasks. ToRA-70B significantly outperforms ChatGPT
on GSM8k (84.3% vs. 80.4%) and MATH (49.7% vs. 38.7%), while ToRA-Code-34B greatly surpasses
GPT-4 CoT on the competition-level MATH dataset (50.8% vs. 42.5%) and is comparable to GPT-4 PAL,
which uses code to solve problems (50.8% vs. 51.8%).
- ToRA-Code, trained on CodeLLaMA, achieves about 5% higher accuracy than ToRA trained on
LLaMA-2 at the same parameter scale, indicating that strengthening the base model's coding
capability further enhances ToRA's problem-solving ability.
- ToRA exhibits superior generalization ability, whereas CoT fine-tuning on language rationales
may hurt out-of-distribution (OOD) generalization. For example, ToRA-70B generalizes better than
WizardMath on TabMWP, a tabular reasoning task (74.0% vs. 57.5%).
- ToRA achieves fast zero-shot inference, averaging 1.02 tool-interaction rounds per problem;
most problems are solved with a single round of interaction, maintaining high efficiency.

Figure 4: Comparison of three formats: (1) Rationale-only: step-by-step natural language reasoning, as in CoT; (2) Program-only: solving problems with programs, as in PAL; (3) Tool-integrated Reasoning, used by ToRA: interweaving rationale and program execution to solve problems. We train LLaMA-2 models to reason in each of the three formats; GPT-4 is evaluated with few-shot prompting.

Figure 4 shows that compared to using only language reasoning (Rationale-only) or only program-based tool use (Program-only), Tool-integrated Reasoning has better performance in mathematical reasoning tasks.
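The tool-integrated format can be pictured as a loop that alternates model generation with program execution. The following is a minimal sketch, not ToRA's actual implementation: `toy_model` is a hypothetical stand-in for the language model, and the interaction format (fenced ```python and ```output blocks) is an assumption for illustration.

```python
import contextlib
import io
import re


def run_program(code: str) -> str:
    """Execute a generated program snippet and capture its printed output.
    (Sketch only: a real system would sandbox this execution.)"""
    buf = io.StringIO()
    namespace = {}
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)
    return buf.getvalue().strip()


def tool_integrated_reasoning(generate, problem: str, max_rounds: int = 3) -> str:
    """Interleave natural-language rationale with program execution.

    `generate` stands in for the language model: given the running trajectory,
    it returns the next chunk, which may contain a fenced ```python program.
    Each program is executed and its output appended to the trajectory before
    the next generation round; generation stops when no program is emitted.
    """
    trajectory = problem
    for _ in range(max_rounds):
        chunk = generate(trajectory)
        trajectory += "\n" + chunk
        match = re.search(r"```python\n(.*?)```", chunk, re.DOTALL)
        if match is None:  # no program: final rationale/answer reached
            break
        output = run_program(match.group(1))
        trajectory += f"\n```output\n{output}\n```"
    return trajectory


# Toy stand-in model: first emits a rationale plus a program, then reads
# the executed output from the trajectory and states the final answer.
def toy_model(trajectory: str) -> str:
    if "```output" not in trajectory:
        return ("Let x be the unknown; rearrange 3*x + 4 = 19.\n"
                "```python\nprint((19 - 4) / 3)\n```")
    return "The execution gives 5.0, so x = 5."


trace = tool_integrated_reasoning(toy_model, "Solve 3x + 4 = 19.")
```

The key difference from Program-only (PAL-style) reasoning is the feedback step: execution output re-enters the context, so subsequent rationale can react to intermediate results.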

Figure 5: Ablation on output space shaping strategies using CodeLLaMA.

An ablation of the proposed output space shaping strategies (Figure 5) demonstrates that output space shaping plays a crucial role in enhancing the model's ability to solve mathematical problems.