Converts PDFs to clean markdown. Actually preserves code blocks and tables.
→ Other converters add fake strikethrough (~~) to tables
→ Code blocks lose their language tags, so syntax highlighting breaks
→ Images get extracted 5 times from the same PDF
→ You get raw text dumps instead of structured data
I was preparing PDFs for RAG pipelines and kept hitting these issues. So I fixed them.
Uses Pygments to figure out if code blocks are Python, JavaScript, SQL, etc. Most converters don't do this.
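As a rough illustration of the idea, here is a minimal sketch built on Pygments' `guess_lexer`. The helper names `detect_language` and `fence` are hypothetical and not part of this API, and the service's real detection logic may differ. Note that `guess_lexer` is heuristic, so very short snippets can be misidentified.

```python
try:
    from pygments.lexers import guess_lexer
    from pygments.util import ClassNotFound
except ImportError:  # Pygments not installed: the sketch falls back to plain text
    guess_lexer = None

FENCE = "`" * 3  # markdown code fence, built this way to avoid nesting fences


def detect_language(code: str, default: str = "text") -> str:
    """Return a short language alias for a snippet, or `default` if unknown."""
    if guess_lexer is None or not code.strip():
        return default
    try:
        lexer = guess_lexer(code)
    except ClassNotFound:
        return default
    # Each lexer lists short aliases such as "python" or "sql"; use the first.
    return lexer.aliases[0] if lexer.aliases else default


def fence(code: str) -> str:
    """Wrap a snippet in a markdown fence tagged with the guessed language."""
    return f"{FENCE}{detect_language(code)}\n{code.rstrip()}\n{FENCE}"
```

Calling `fence(snippet)` on an extracted block gives a fenced block with a language tag, which is what markdown renderers use to pick a highlighter.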
Extracts each unique image once, not 5 times. Saves them with proper metadata.
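One common way to get this behaviour is to hash each image's bytes and skip anything already seen. The sketch below assumes that approach and uses a hypothetical helper, `unique_images`; it is not necessarily how the service does it.

```python
import hashlib


def unique_images(images):
    """Yield (digest, data) for each distinct image, in first-seen order.

    `images` is any iterable of raw image bytes, e.g. one entry per
    image reference found while walking the PDF's pages.
    """
    seen = set()
    for data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # same bytes already emitted from an earlier page
        seen.add(digest)
        yield digest, data
```

The digest doubles as a stable filename, so a logo repeated on every page ends up on disk once.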
Automatically removes the fake strikethrough that pymupdf4llm adds to tables.
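A cleanup pass like this can be approximated with a regular expression that only touches table rows. This is a sketch under that assumption; `strip_table_strikethrough` is a hypothetical name, and the service's own rules may be stricter.

```python
import re

# Matches ~~text~~ and captures the text between the markers.
STRIKE = re.compile(r"~~(.+?)~~")


def strip_table_strikethrough(markdown: str) -> str:
    """Remove ~~ markers from markdown table rows, leaving other lines alone."""
    cleaned = []
    for line in markdown.splitlines():
        # Table rows start with a pipe; strikethrough elsewhere may be intentional.
        if line.lstrip().startswith("|"):
            line = STRIKE.sub(r"\1", line)
        cleaned.append(line)
    return "\n".join(cleaned)
```

Limiting the substitution to lines that start with `|` keeps genuine strikethrough in body text intact.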
Get markdown plus metadata (page count, word count, code blocks found, etc.) in one call.
curl -X POST https://pdf-to-md-30rq.onrender.com/api/v1/convert \
-F "file=@document.pdf" \
-F "include_content=true"
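For scripts, the same request can be sent from Python using only the standard library. This is a sketch: `build_multipart` and `convert_pdf` are hypothetical helpers, the response is returned as raw text because its exact shape is not documented here, and the timeout value is an arbitrary choice.

```python
import os
import uuid
import urllib.request

API_URL = "https://pdf-to-md-30rq.onrender.com/api/v1/convert"


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode form fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/pdf\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_pdf(path):
    """POST a PDF to the convert endpoint and return the response body as text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"include_content": "true"}, "file", os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")
```

Usage would look like `print(convert_pdf("document.pdf"))`, which mirrors the two `-F` fields in the curl command.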
No API key needed (for now). Free tier: 100 conversions/month.
RAG pipelines: Convert technical docs to markdown for LLM context
Research tools: Extract text from academic papers with code/equations preserved
Document processing: Batch convert PDFs to searchable text
Built by Eswar Sethu in Melbourne
Questions? Email api@eswarsethu.dev
This is free because I needed it for my own projects. If it helps you, great.