Python tools for creating a suitable dataset for OpenAI's im2latex task: https://openai.com/requests-for-research/#im2latex.
You can download a prebuilt dataset from here. The data is split into train (~84k), validation (~9k) and test (~10k) sets, which
may not be enough for this task.
Note: This code is very ad hoc and requires tinkering with the source.
Contents
/src/latex2formulas.py
Script for parsing downloaded LaTeX sources for formulas. Stores the formulas in a single .txt file (one formula per line).
/src/stackexchange2formulas.py
Similar to latex2formulas.py, but for parsing StackExchange XMLs.
/src/arxiv2formulas.py
Similar to latex2formulas.py, but for parsing arXiv .tar/.tar.gz files (source downloads).
/src/formula2image.py
Creates images and a dataset from a file of formulas.
/src/im2latex_utils.py
Collection of miscellaneous functions for handling these formulas.
latex_urls.txt
Text file containing URLs to the LaTeX dataset from here. Use wget -i latex_urls.txt to download these files.
Dependencies
Python 2.x or 3.x (only tested on 2.x, but should work on 3.x too; not tested on Windows)
For running the script with current settings and generating full-page images: