voxmenthe/papers

papers

A Python toolkit for managing and analyzing arXiv research papers.

Features

  • arxiv-ocr: Download papers from arXiv and extract their text using Mistral OCR
    • Convert papers to markdown or HTML format
    • Extract and save images from papers
    • Process local PDF files or download directly from arXiv
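
The download step depends on turning an arXiv abstract link into a PDF link. A minimal sketch of that mapping follows. The helper name `arxiv_pdf_url` is hypothetical and is not taken from this repository; it only relies on arXiv's public `/abs/<id>` and `/pdf/<id>` URL layout.

```python
import re
from urllib.parse import urlparse


def arxiv_pdf_url(url: str) -> str:
    """Map an arXiv abstract (or PDF) URL to the canonical PDF URL.

    Hypothetical helper for illustration; the actual script may do this differently.
    """
    parsed = urlparse(url)
    if parsed.netloc not in {"arxiv.org", "www.arxiv.org"}:
        raise ValueError(f"not an arXiv URL: {url}")
    # Accept /abs/<id> and /pdf/<id>, with an optional trailing ".pdf".
    match = re.fullmatch(r"/(?:abs|pdf)/(.+?)(?:\.pdf)?", parsed.path)
    if match is None:
        raise ValueError(f"could not find a paper ID in: {url}")
    return f"https://arxiv.org/pdf/{match.group(1)}"


if __name__ == "__main__":
    print(arxiv_pdf_url("https://arxiv.org/abs/1706.03762"))
```

The resulting URL could then be fetched with any HTTP client before the PDF is passed to OCR.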

Usage

arxiv-ocr

# Process a paper from arXiv
python src/scripts/arxiv_ocr.py https://arxiv.org/abs/1706.03762

# Process a local PDF file
python src/scripts/arxiv_ocr.py --file-path path/to/paper.pdf

# Generate HTML output with the first 5 pages
python src/scripts/arxiv_ocr.py https://arxiv.org/abs/1706.03762 --pages 5 --html
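
For readers who want to see how the flags above fit together, here is a sketch of an `argparse` interface that accepts the same invocations. It is an assumption about the command-line shape, not the script's actual source: the function names, the help texts, and the rule that exactly one of the URL or `--file-path` must be given are all illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in the usage examples (illustrative)."""
    parser = argparse.ArgumentParser(
        prog="arxiv_ocr.py",
        description="Download an arXiv paper or read a local PDF and run OCR on it.",
    )
    parser.add_argument("url", nargs="?", help="arXiv abstract URL")
    parser.add_argument("--file-path", help="path to a local PDF file")
    parser.add_argument("--pages", type=int, default=None,
                        help="process only the first N pages")
    parser.add_argument("--html", action="store_true",
                        help="write HTML instead of markdown")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: require exactly one input source.
    if (args.url is None) == (args.file_path is None):
        parser.error("provide either an arXiv URL or --file-path")
    return args


if __name__ == "__main__":
    print(parse_args())
```

Calling `parse_args(["https://arxiv.org/abs/1706.03762", "--pages", "5", "--html"])` yields a namespace with `pages=5` and `html=True`, matching the third usage example.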

Installation

# Clone the repository
git clone https://github.com/voxmenthe/papers.git
cd papers

# Install dependencies using Poetry
./project_setup.sh

Requirements

  • Python 3.12+
  • Poetry
  • Mistral API key
