document-converter-suite

Convert between 8 formats (PDF, DOCX, PPTX, XLSX, TXT, CSV, MD, HTML). Best-effort text extraction, batch processing, and document format transformation.

Updated 1/29/2026

Document Converter Suite

Overview

Provide a best-effort conversion workflow between 8 document formats:

Office Formats: PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX)
Text Formats: Plain Text (TXT), CSV, Markdown (MD), HTML

Uses pypdf, python-docx, python-pptx, openpyxl, reportlab, mistune, beautifulsoup4, and Pillow.

Prefer reliable extraction + rebuild (text, headings, bullets, basic tables) over pixel-perfect layout.

When to use

Use when the request involves:

Converting a file between .pdf / .docx / .pptx / .xlsx / .txt / .csv / .md / .html
Making a document more editable by moving its content into Office or text formats
Exporting slide text or spreadsheet cell grids to a different format
Converting Markdown/HTML documentation to Office formats or vice versa
Extracting tables from Office documents to CSV/XLSX
Batch-converting a folder of mixed documents

Supported conversion paths: 64 total (8×8 matrix, counting same-format pairs); see references/conversion_matrix.md

Avoid promising visual fidelity. Emphasize that output is clean and structured, not identical.

Workflow decision tree

1. Identify input and desired output (extensions matter).
2. Classify the user's goal:
   - Editable content → proceed with this suite.
   - Visually identical rendering → explain limitations; suggest external rendering tools.
3. Pick conversion mode:
   - Single file → run scripts/convert.py.
   - Folder/batch → run scripts/batch_convert.py.
4. Tune safety caps if needed:
   - PDF: --max-pages, --max-chars
   - XLSX: --max-rows, --max-cols
5. Run conversion, then sanity-check output size and structure.
6. Iterate (e.g., increase max rows/cols, split large docs, or choose a different target format).
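The mode and cap choices above can be sketched as a small helper that assembles the command line. This is only an illustration: `build_command` is a hypothetical name, and it assumes (unconfirmed by this document) that scripts/batch_convert.py accepts the same cap flags as scripts/convert.py.

```python
# Cap flags named in the workflow; the helper itself is hypothetical.
PDF_CAPS = ("max_pages", "max_chars")
XLSX_CAPS = ("max_rows", "max_cols")


def build_command(source: str, target: str, is_dir: bool = False, **caps: int) -> list[str]:
    """Return the argv list for a single-file or batch conversion."""
    # Folder input goes to the batch script, a single file to convert.py.
    script = "scripts/batch_convert.py" if is_dir else "scripts/convert.py"
    argv = ["python", script, source, "--to", target]
    # Append only the caps the caller actually set, e.g. max_rows=40.
    for name in PDF_CAPS + XLSX_CAPS:
        if name in caps:
            argv += ["--" + name.replace("_", "-"), str(caps[name])]
    return argv


cmd = build_command("data.xlsx", "pptx", max_rows=40, max_cols=12)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the caps look right.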

Quick start

Single-file conversion

Run:

python scripts/convert.py <input-file> --to <pdf|docx|pptx|xlsx|txt|csv|md|html>

Examples:

# Office format conversions
python scripts/convert.py report.pdf --to docx
python scripts/convert.py deck.pptx --to pdf --out deck_export.pdf
python scripts/convert.py data.xlsx --to pptx --max-rows 40 --max-cols 12

# Text format conversions
python scripts/convert.py documentation.md --to docx
python scripts/convert.py data.csv --to xlsx
python scripts/convert.py report.docx --to html
python scripts/convert.py notes.txt --to md

Batch conversion

Run:

python scripts/batch_convert.py <input-dir> --to <pdf|docx|pptx|xlsx|txt|csv|md|html>

Examples:

python scripts/batch_convert.py ./inbox --to docx --recursive
python scripts/batch_convert.py ./inbox --to pdf --outdir ./out --recursive --overwrite
python scripts/batch_convert.py ./markdown-docs --to html --pattern "*.md"
python scripts/batch_convert.py ./data --to xlsx --pattern "*.csv"
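To preview which files a given --pattern and --recursive combination would likely pick up, a rough sketch with pathlib follows. It assumes the batch script uses glob-style matching; `select_inputs` is a hypothetical helper, not part of the suite.

```python
import tempfile
from pathlib import Path


def select_inputs(root: Path, pattern: str = "*", recursive: bool = False) -> list[Path]:
    """List files under root matching a glob pattern, optionally descending into subfolders."""
    walker = root.rglob if recursive else root.glob
    return sorted(p for p in walker(pattern) if p.is_file())


# Small demo tree: two Markdown files (one nested) and one CSV.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.md").write_text("# a")
    (root / "b.csv").write_text("x,y")
    (root / "sub").mkdir()
    (root / "sub" / "c.md").write_text("# c")
    flat = [p.name for p in select_inputs(root, "*.md")]
    deep = [p.name for p in select_inputs(root, "*.md", recursive=True)]

print(flat, deep)
```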

Conversion behavior

Follow these defaults (and say them out loud if the user might be expecting magic):

Office Format Conversions

PDF → (DOCX/PPTX/XLSX/TXT/MD/HTML): extract text with pypdf; no OCR; each page becomes a section/slide block.
DOCX → (PDF/PPTX/XLSX/TXT/CSV/MD/HTML): export paragraphs, headings (with improved detection), and tables.
- Improved heading detection: uses font size + bold and ALL CAPS + bold heuristics in addition to style names.
PPTX → (DOCX/PDF/XLSX/TXT/CSV/MD/HTML): export slide titles + text frames; export tables.
- Multi-table support: PPTX output now gets one slide per table when the source contains multiple tables.
XLSX → (DOCX/PPTX/PDF/TXT/CSV/MD/HTML): export a bounded value grid per sheet (defaults: 200 rows × 50 columns; adjust with --max-rows / --max-cols).
- Truncation warnings: printed to stderr when data exceeds limits (e.g., "Sheet 'Data': Truncated 500 rows → 200 rows").
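The bounded-grid behavior for spreadsheets can be illustrated roughly as follows. This is a sketch on plain lists, not the suite's actual code: `bound_grid` is a hypothetical name, and the column warning text is an assumption modeled on the row message quoted above.

```python
import sys


def bound_grid(sheet_name, rows, max_rows=200, max_cols=50):
    """Clip a list-of-lists grid to max_rows x max_cols, warning on stderr."""
    n_rows = len(rows)
    n_cols = max((len(r) for r in rows), default=0)
    if n_rows > max_rows:
        print(f"Sheet '{sheet_name}': Truncated {n_rows} rows → {max_rows} rows", file=sys.stderr)
    if n_cols > max_cols:
        # Assumed wording; the document only quotes the row warning.
        print(f"Sheet '{sheet_name}': Truncated {n_cols} cols → {max_cols} cols", file=sys.stderr)
    return [list(r[:max_cols]) for r in rows[:max_rows]]


# A 500 x 60 grid gets clipped to the default 200 x 50.
grid = [[i * j for j in range(60)] for i in range(500)]
clipped = bound_grid("Data", grid)
print(len(clipped), len(clipped[0]))
```

Writing warnings to stderr keeps them out of any converted text sent to stdout.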

Text Format Conversions

TXT → (DOCX/PPTX/XLSX/PDF/CSV/MD/HTML): lines become paragraphs/bullets; simple structure preservation.
CSV → (XLSX/DOCX/PPTX/HTML): headers + rows mapped to tables/sheets; auto-delimiter detection.
MD → (DOCX/PPTX/XLSX/PDF/TXT/CSV/HTML): parsed with mistune; headings, lists, tables, code blocks preserved.
- High fidelity: Markdown ↔ HTML and Markdown ↔ DOCX maintain structure well.
HTML → (DOCX/PPTX/XLSX/PDF/TXT/CSV/MD): parsed with beautifulsoup4; semantic structure extracted.
- High fidelity: HTML ↔ Markdown and HTML ↔ DOCX maintain structure well.
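Auto-delimiter detection for CSV input could be done with the standard library's `csv.Sniffer`. Whether the suite uses it is an assumption; `read_table` and the candidate delimiter set are hypothetical.

```python
import csv
import io


def read_table(text: str, candidates: str = ",;\t|") -> list[list[str]]:
    """Guess the delimiter from a sample, then parse all rows."""
    try:
        dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
    except csv.Error:
        dialect = csv.excel  # fall back to plain comma-separated
    return list(csv.reader(io.StringIO(text), dialect))


rows = read_table("name;age;city\nAda;36;London\nBob;41;Paris\n")
header, body = rows[0], rows[1:]  # first row as headers, the rest as data
print(header, len(body))
```

Restricting the sniffer to a few candidate delimiters makes it less likely to pick a stray character from the data.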

Quality Improvements

Multi-table PPTX: Creates one slide per table (instead of dropping extra tables)
Smart heading detection: DOCX headings detected by style, font size+bold, or ALL CAPS+bold
Data truncation warnings: XLSX conversions warn when data is truncated
Image extraction foundation: image_handler.py provides hash-based deduplication for future image support
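The three heading rules listed above (style, font size + bold, ALL CAPS + bold) might look like this in code. The `Para` class is a hypothetical stand-in for whatever a DOCX reader exposes, and the 11 pt body size is an assumed threshold, not a documented default.

```python
from dataclasses import dataclass


@dataclass
class Para:
    """Hypothetical paragraph features; not the suite's real data model."""
    text: str
    style: str = "Normal"
    font_pt: float = 11.0
    bold: bool = False


def is_heading(p: Para, body_pt: float = 11.0) -> bool:
    text = p.text.strip()
    if not text:
        return False
    if p.style.lower().startswith("heading"):
        return True  # rule 1: style name
    if p.bold and p.font_pt > body_pt:
        return True  # rule 2: larger-than-body font + bold
    has_letters = any(c.isalpha() for c in text)
    return p.bold and has_letters and text == text.upper()  # rule 3: ALL CAPS + bold


samples = [Para("Intro", style="Heading 1"), Para("SUMMARY", bold=True), Para("Plain body text.")]
print([is_heading(p) for p in samples])
```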

Load extra detail from:

references/conversion_matrix.md - Full 8×8 conversion matrix
references/limitations.md - Format-specific limitations and edge cases

Guardrails and honesty rules

State "best-effort" explicitly for any conversion request.
Do not claim formatting fidelity (fonts, spacing, images, charts, animations).
Call out scanned PDFs as a likely failure mode (no OCR).
For giant spreadsheets, prefer increasing caps gradually and/or limiting to specific sheets (if the user says which sheets matter).
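One way to flag a likely scanned PDF before promising anything is to check how much text extraction returned per page. This is a heuristic sketch only: `looks_scanned` and both thresholds are assumptions. The page strings could come from pypdf, for example `[page.extract_text() for page in PdfReader(path).pages]`.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """True when most pages yield almost no extractable text."""
    if not page_texts:
        return True
    # Count pages whose extracted text is empty or near-empty.
    sparse = sum(1 for t in page_texts if len((t or "").strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5


print(looks_scanned(["", " ", "\n"]))
print(looks_scanned(["This page has plenty of extractable text.", ""]))
```

When the check returns True, say that the PDF is probably image-only and that no OCR is available.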

Bundled scripts

scripts/convert.py: single-file CLI converter
scripts/batch_convert.py: batch converter for directories
scripts/lib/*: internal readers/writers and conversion orchestration
