StackMathQA Dataset

Abstract

The development of sophisticated mathematical reasoning in large language models (LLMs) is often hindered by the scarcity of large-scale, high-quality, and domain-specific training data. To address this gap, we introduce StackMathQA, a comprehensive dataset containing nearly 2 million question-and-answer pairs sourced from the Stack Exchange network. This dataset aggregates expert-level and enthusiast discussions from premier platforms including Math Stack Exchange, MathOverflow, Statistics Stack Exchange, and Physics Stack Exchange. We provide the data in multiple formats and curated subsets created through importance resampling to cater to a wide range of research needs, from large-scale pre-training to targeted fine-tuning. This report details the dataset's construction methodology, structure, content, and potential applications, establishing StackMathQA as a valuable resource for advancing machine reasoning in quantitative domains.

Dataset Overview

StackMathQA is a new large-scale dataset designed to facilitate the training and evaluation of LLMs on mathematical tasks. It consists of approximately 2 million question-and-answer (Q&A) pairs meticulously extracted from several high-authority communities within the Stack Exchange network. These platforms are rich with nuanced questions, detailed explanations, and formal LaTeX mathematical notation, making them an ideal source for training sophisticated reasoning models.

Data Sources

The dataset is aggregated from four highly respected Stack Exchange sites:

Mathematics Stack Exchange: A Q&A site for people studying math at any level.
MathOverflow: A Q&A site for professional mathematicians.
Statistics Stack Exchange (Cross Validated): A Q&A site for people interested in statistics, machine learning, and data analysis.
Physics Stack Exchange: A Q&A site for active researchers, academics, and students of physics.

Dataset Structure and Subsets

To serve a variety of research needs, StackMathQA is provided in multiple formats and curated subsets. The data is available as one-question-to-many-answers (`qalist`) or as flattened one-question-to-one-answer (`1q1a`) pairs.

Furthermore, we offer several high-quality subsets generated using importance resampling. This method prioritizes Q&A pairs with higher community engagement (e.g., scores, views), ensuring that even smaller subsets are rich with valuable data. The available curated subsets are:

StackMathQA1600K (1.6 million pairs)
StackMathQA800K (800k pairs)
StackMathQA400K (400k pairs)
StackMathQA200K (200k pairs)
StackMathQA100K (100k pairs)

Potential Applications

StackMathQA is a versatile resource that can support a wide range of research directions in AI and machine learning:

Continual Pre-training: The large scale of the dataset makes it an excellent resource for continual pre-training of foundation models to enhance their understanding of mathematical language, symbols, and reasoning structures.
Supervised Fine-Tuning (SFT): The curated Q&A pairs are ideal for fine-tuning LLMs to improve their ability to follow instructions and generate accurate, step-by-step solutions to mathematical problems.
Domain-Specific Model Development: Researchers can use StackMathQA to train or specialize models for expert domains like theoretical physics, advanced mathematics, or econometrics.
Benchmark for Mathematical Reasoning: The dataset can serve as a challenging benchmark to evaluate the performance of LLMs on a diverse set of real-world mathematical queries.

Citation

Please cite our technical report if you use this dataset in your research. We appreciate your support!

@techreport{zhang2024stackmathqa, title={{StackMathQA: A Curated Collection of 2 Million Mathematical Questions and Answers Sourced from Stack Exchange}}, author={Zhang, Yifan}, year={2024}, institution={Math-AI}, howpublished={\url{https://stackmathqa.github.io/StackMathQA.pdf}}, }