The Cornell e-print arXiv contains one of the largest corpora of scientific literature in the world. Unfortunately, its contents are locked up in the TeX/LaTeX format, which makes them nearly useless for knowledge management techniques. We translate the corpus to XML and “HTML5 with MathML” via LaTeXML to have a basis for uncovering its structural semantics (see the LLaMaPuN project for details).
Because of arXiv licensing policies, we cannot redistribute the results of the transformation freely. We have therefore created the Special Interest Group for Math Linguistics (SIGMathLing), which can distribute the data sets to its members under a non-disclosure agreement (NDA).