browser_utils.mdconvert

_CustomMarkdownify

class _CustomMarkdownify(markdownify.MarkdownConverter)

A custom version of markdownify's MarkdownConverter. Changes include:

  • Altering the default heading style to use '#', '##', etc.
  • Removing javascript hyperlinks.
  • Truncating images with large data:uri sources.
  • Ensuring URIs are properly escaped and do not conflict with Markdown syntax.

convert_hn

def convert_hn(n, el, text, convert_as_inline)

Same as the usual converter, but ensures the heading starts on a new line.
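As a rough illustration of the behavior described for convert_hn, the sketch below builds a '#'-style heading that always begins with a newline. The helper name render_heading is hypothetical and is not part of this module; the real method delegates to markdownify's own heading conversion.

```python
def render_heading(level: int, text: str) -> str:
    """Hypothetical helper: build a '#'-style heading that starts on a new line."""
    heading = "#" * level + " " + text.strip() + "\n\n"
    # Prepend a newline so the heading never runs on from preceding inline text
    return "\n" + heading


print(repr(render_heading(2, "Installation")))
```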

convert_a

def convert_a(el, text, convert_as_inline)

Same as the usual converter, but removes JavaScript links and escapes URIs.
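One way the link handling could work is sketched below using only the standard library. The helper name render_link and the list of allowed schemes are assumptions made for illustration; they are not taken from this module.

```python
from urllib.parse import quote, urlparse, urlunparse


def render_link(text: str, href: str) -> str:
    """Hypothetical helper: Markdown for a link, or plain text if the scheme is unsafe."""
    parsed = urlparse(href)
    # Assumption: only these schemes are kept; javascript: and others lose the link
    if parsed.scheme and parsed.scheme.lower() not in ("http", "https", "file"):
        return text
    # Percent-encode the path so spaces and parentheses cannot break Markdown syntax
    safe_href = urlunparse(parsed._replace(path=quote(parsed.path)))
    return f"[{text}]({safe_href})"


print(render_link("click me", "javascript:void(0)"))
print(render_link("A page", "https://example.com/a b(1)"))
```

Note that this sketch would double-encode a path that is already percent-encoded, so it suits raw paths only.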

convert_img

def convert_img(el, text, convert_as_inline)

Same as the usual converter, but truncates data URIs in image sources.
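The truncation of large data: URI sources listed in the class summary might look roughly like the sketch below. The helper name shorten_image_src is hypothetical, and the exact truncation rule (keep everything before the first comma) is an assumption.

```python
def shorten_image_src(src: str) -> str:
    """Hypothetical helper: drop the payload of a data: URI, keep other sources as-is."""
    if src.startswith("data:"):
        # Keep only the media-type header; the encoded payload is replaced by "..."
        return src.split(",")[0] + "..."
    return src


print(shorten_image_src("data:image/png;base64,iVBORw0KGgo="))
print(shorten_image_src("https://example.com/logo.png"))
```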

DocumentConverterResult

class DocumentConverterResult()

The result of converting a document to text.

DocumentConverter

class DocumentConverter()

Abstract superclass of all DocumentConverters.

PlainTextConverter

class PlainTextConverter(DocumentConverter)

Handles anything with content type text/plain.

HtmlConverter

class HtmlConverter(DocumentConverter)

Handles anything with content type text/html.

WikipediaConverter

class WikipediaConverter(DocumentConverter)

Handle Wikipedia pages separately, focusing only on the main document content.

YouTubeConverter

class YouTubeConverter(DocumentConverter)

Handle YouTube specially, focusing on the video title, description, and transcript.

BingSerpConverter

class BingSerpConverter(DocumentConverter)

Handle Bing results pages (only the organic search results). NOTE: It is better to use the Bing API.

PdfConverter

class PdfConverter(DocumentConverter)

Converts PDFs to Markdown. Most style information is ignored, so the results are essentially plain-text.

DocxConverter

class DocxConverter(HtmlConverter)

Converts DOCX files to Markdown. Style information (e.g., headings) and tables are preserved where possible.

XlsxConverter

class XlsxConverter(HtmlConverter)

Converts XLSX files to Markdown, with each sheet presented as a separate Markdown table.
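To give a feel for the output shape described above (one Markdown table per sheet), here is a standard-library sketch. It does not read XLSX files and is not how this class is implemented; the function names and the in-memory workbook (a dict of sheet name to rows) are assumptions for illustration.

```python
from typing import Dict, List


def sheet_to_markdown(rows: List[List[object]]) -> str:
    """Hypothetical helper: render rows as a Markdown table, first row as header."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(str(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)


def workbook_to_markdown(sheets: Dict[str, List[List[object]]]) -> str:
    """Emit one headed table per sheet, separated by blank lines."""
    sections = [f"## {name}\n" + sheet_to_markdown(rows) for name, rows in sheets.items()]
    return "\n\n".join(sections)


print(workbook_to_markdown({"Sales": [["item", "qty"], ["apple", 3]]}))
```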

PptxConverter

class PptxConverter(HtmlConverter)

Converts PPTX files to Markdown. Supports headings, tables, and images with alt text.

MediaConverter

class MediaConverter(DocumentConverter)

Abstract class for multi-modal media (e.g., images and audio).

WavConverter

class WavConverter(MediaConverter)

Converts WAV files to Markdown via extraction of metadata (if exiftool is installed) and speech transcription (if speech_recognition is installed).

Mp3Converter

class Mp3Converter(WavConverter)

Converts MP3 files to Markdown via extraction of metadata (if exiftool is installed) and speech transcription (if speech_recognition AND pydub are installed).

ImageConverter

class ImageConverter(MediaConverter)

Converts images to Markdown via extraction of metadata (if exiftool is installed), OCR (if easyocr is installed), and description via a multimodal LLM (if an mlm_client is configured).

MarkdownConverter

class MarkdownConverter()

(In preview) An extremely simple text-based document reader, suitable for LLM use. This reader converts common file types or webpages to Markdown.

convert

def convert(source, **kwargs)

Arguments:

  • source: a string representing a path or URL, or a requests.Response object.
  • extension: the file extension to use when interpreting the file, passed as a keyword argument. If None, it is inferred from the source (path, URI, content type, etc.).
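The inference of an extension from the source could be approximated as in the sketch below. The helper name infer_extension, its parameters, and the order of the fallbacks are assumptions; the module's actual logic may differ.

```python
import mimetypes
import os
from typing import Optional
from urllib.parse import urlparse


def infer_extension(source: str, content_type: Optional[str] = None,
                    extension: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: pick a file extension for a path or URL."""
    # An explicit extension always wins
    if extension is not None:
        return extension if extension.startswith(".") else "." + extension
    # Otherwise look at the path component, ignoring any query string
    path = urlparse(source).path
    ext = os.path.splitext(path)[1].lower()
    if ext:
        return ext
    # Fall back to the content type, ignoring parameters such as charset
    if content_type:
        return mimetypes.guess_extension(content_type.split(";")[0].strip())
    return None


print(infer_extension("notes/report.PDF"))
print(infer_extension("https://example.com/a/b.docx?x=1"))
```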

register_page_converter

def register_page_converter(converter: DocumentConverter) -> None

Register a page text converter.
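The registration pattern can be pictured with the self-contained sketch below. All class names here are hypothetical stand-ins, not the module's own classes, and two behaviors are assumptions: that a converter declines by returning None, and that the most recently registered converter is tried first.

```python
from typing import List, Optional


class Converter:
    """Hypothetical stand-in for DocumentConverter: return text, or None to decline."""

    def convert(self, source: str, extension: str) -> Optional[str]:
        raise NotImplementedError


class PlainText(Converter):
    def convert(self, source: str, extension: str) -> Optional[str]:
        return source if extension == ".txt" else None


class Shouting(Converter):
    def convert(self, source: str, extension: str) -> Optional[str]:
        return source.upper() if extension == ".txt" else None


class Registry:
    def __init__(self) -> None:
        self._converters: List[Converter] = []

    def register_page_converter(self, converter: Converter) -> None:
        # Assumption: newest converters take priority over older ones
        self._converters.insert(0, converter)

    def convert(self, source: str, extension: str) -> str:
        # Return the first result from a converter that accepts the input
        for converter in self._converters:
            result = converter.convert(source, extension)
            if result is not None:
                return result
        raise ValueError(f"no converter accepted extension {extension!r}")


registry = Registry()
registry.register_page_converter(PlainText())
registry.register_page_converter(Shouting())
print(registry.convert("hello", ".txt"))
```

Under these assumptions, a custom converter registered later overrides the built-in handling for the inputs it accepts, while everything else falls through to earlier converters.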