Translating the arXiv to XML/HTML5

From: 2006

Funding: internal

Prof. Dr. Michael Kohlhase
M.Sc. Deyan Ginev

Dr. Bruce Miller (NIST)

various Jacobs University undergrads


The Cornell e-print arXiv contains one of the largest corpora of scientific literature in the world. Unfortunately, its contents are locked up in the TeX/LaTeX format, which makes them nearly inaccessible to knowledge management techniques. We translate the corpus to XML and “HTML5 with MathML” via LaTeXML to have a basis for uncovering its structural semantics (see the LLaMaPuN project for details).
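As a rough illustration of the translation step for a single paper, the following Python sketch wraps the `latexmlc` command-line tool that ships with LaTeXML. The helper names, the output directory, and the example file name are hypothetical, and this is not the project's actual pipeline, which runs through CorTeX; it only assumes that `latexmlc` is installed and on the PATH.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def latexml_command(tex_file: str, out_dir: str = "html") -> list[str]:
    """Build a latexmlc invocation that writes HTML5 (with MathML) for one source file.

    The function name and the default output directory are illustrative assumptions.
    """
    src = Path(tex_file)
    dest = Path(out_dir) / src.with_suffix(".html").name
    return ["latexmlc", f"--destination={dest}", "--format=html5", str(src)]


def convert(tex_file: str, out_dir: str = "html") -> bool:
    """Run the conversion and report whether latexmlc exited with status 0."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        latexml_command(tex_file, out_dir),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `convert("paper.tex")` would then attempt to produce `html/paper.html`. A zero exit status is only a coarse success signal; the tool's log output would need to be inspected to assess conversion quality.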

The actual corpus processing (and its distribution to hundreds of worker machines) is performed by the CorTeX system. The system state and results are available from two instances: the old but complete one, and the new system in Erlangen.

Applications include the mathematical search engine MathWebSearch (live demo on the arXMLiv data set).

Unfortunately, arXiv licensing policies do not allow us to redistribute the results of the transformation freely. We have therefore created the Special Interest Group for Math Linguistics (SIGMathLing), which can distribute the data sets to its members under an NDA.