browser_utils.mdconvert
_CustomMarkdownify
class _CustomMarkdownify(markdownify.MarkdownConverter)
A custom version of markdownify's MarkdownConverter. Changes include:
- Altering the default heading style to use '#', '##', etc.
- Removing JavaScript hyperlinks.
- Truncating images with large data: URI sources.
- Ensuring URIs are properly escaped and do not conflict with Markdown syntax.
convert_hn
def convert_hn(n, el, text, convert_as_inline)
Same as the base converter, but ensures that the heading starts on a new line.
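The new-line behavior can be illustrated with a small standalone sketch. It does not call markdownify; the helper name heading_on_new_line and the exact trailing whitespace are assumptions made for illustration only.

```python
def heading_on_new_line(level: int, text: str, convert_as_inline: bool = False) -> str:
    """Render an ATX heading ('#', '##', ...) that always begins on its own line.

    Hypothetical helper: it only illustrates the idea of prefixing a newline.
    """
    text = text.strip()
    if convert_as_inline:
        # Inline contexts (e.g. inside a table cell) get the bare text.
        return text
    hashes = "#" * max(1, min(level, 6))  # clamp to the h1..h6 range
    return f"\n{hashes} {text}\n\n"


print(repr(heading_on_new_line(2, "Setup")))
```

The leading newline keeps a heading from being glued to the end of the preceding paragraph when converted fragments are concatenated.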
convert_a
def convert_a(el, text, convert_as_inline)
Same as the base converter, but removes JavaScript links and escapes URIs.
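A minimal sketch of how such link filtering and escaping might work, using only the standard library. The function name sanitize_href and the allowed-scheme list are assumptions, not the module's confirmed behavior.

```python
from typing import Optional
from urllib.parse import quote, unquote, urlparse, urlunparse

# Assumption: only these schemes are kept; anything else (javascript:, etc.) is dropped.
ALLOWED_SCHEMES = {"http", "https", "file"}


def sanitize_href(href: str) -> Optional[str]:
    """Return an escaped href, or None if the link target should be discarded."""
    try:
        parsed = urlparse(href)
    except ValueError:
        return None
    if parsed.scheme and parsed.scheme.lower() not in ALLOWED_SCHEMES:
        return None
    # Unquote first so that already-escaped paths are not escaped twice.
    safe_path = quote(unquote(parsed.path))
    return urlunparse(parsed._replace(path=safe_path))


print(sanitize_href("https://example.com/a b(c)"))
print(sanitize_href("javascript:void(0)"))
```

Percent-encoding spaces and parentheses in the path matters because an unescaped ")" would otherwise terminate a Markdown link of the form [text](url) early.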
convert_img
def convert_img(el, text, convert_as_inline)
Same as the base converter, but truncates data URIs.
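As the class summary notes, large data: URI image sources are truncated. One simple way to do that is to keep only the header before the first comma. The helper name truncate_data_uri and the "..." suffix are assumptions for illustration.

```python
def truncate_data_uri(src: str) -> str:
    """Keep only the media-type header of a data: URI; leave other URLs untouched."""
    if src.startswith("data:"):
        # A data URI looks like 'data:<mediatype>[;base64],<payload>'.
        return src.split(",", 1)[0] + "..."
    return src


print(truncate_data_uri("data:image/png;base64,iVBORw0KGgo="))
print(truncate_data_uri("https://example.com/logo.png"))
```

Dropping the payload keeps the Markdown output short, which is useful when the text is later passed to an LLM with a limited context window.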
DocumentConverterResult
class DocumentConverterResult()
The result of converting a document to text.
DocumentConverter
class DocumentConverter()
Abstract superclass of all DocumentConverters.
PlainTextConverter
class PlainTextConverter(DocumentConverter)
Handles anything with content type text/plain.
HtmlConverter
class HtmlConverter(DocumentConverter)
Handles anything with content type text/html.
WikipediaConverter
class WikipediaConverter(DocumentConverter)
Handle Wikipedia pages separately, focusing only on the main document content.
YouTubeConverter
class YouTubeConverter(DocumentConverter)
Handle YouTube specially, focusing on the video title, description, and transcript.
BingSerpConverter
class BingSerpConverter(DocumentConverter)
Handle Bing results pages (only the organic search results). NOTE: It is better to use the Bing API.
PdfConverter
class PdfConverter(DocumentConverter)
Converts PDFs to Markdown. Most style information is ignored, so the results are essentially plain-text.
DocxConverter
class DocxConverter(HtmlConverter)
Converts DOCX files to Markdown. Style information (e.g., headings) and tables are preserved where possible.
XlsxConverter
class XlsxConverter(HtmlConverter)
Converts XLSX files to Markdown, with each sheet presented as a separate Markdown table.
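The "one table per sheet" layout can be sketched without any spreadsheet library. The example below assumes each sheet has already been read into a list of rows, with the first row as the header. The function names and the "## sheet name" heading are assumptions, not the converter's confirmed output format.

```python
from typing import Dict, List, Sequence


def sheet_to_markdown(name: str, rows: List[Sequence[object]]) -> str:
    """Render one sheet as a heading followed by a pipe-delimited Markdown table."""
    lines = [f"## {name}", ""]
    if not rows:
        return "\n".join(lines).rstrip()

    def fmt(row: Sequence[object]) -> str:
        # Escape pipes so that cell text cannot break the table layout.
        cells = [str(c).replace("|", "\\|") for c in row]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines.append(fmt(header))
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


def workbook_to_markdown(sheets: Dict[str, List[Sequence[object]]]) -> str:
    """Join all sheets, in insertion order, separated by blank lines."""
    return "\n\n".join(sheet_to_markdown(n, r) for n, r in sheets.items())


print(workbook_to_markdown({"Q1": [["item", "qty"], ["apple", 3]]}))
```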
PptxConverter
class PptxConverter(HtmlConverter)
Converts PPTX files to Markdown. Supports headings, tables, and images with alt text.
MediaConverter
class MediaConverter(DocumentConverter)
Abstract class for multi-modal media (e.g., images and audio).
WavConverter
class WavConverter(MediaConverter)
Converts WAV files to Markdown via extraction of metadata (if exiftool is installed) and speech transcription (if speech_recognition is installed).
Mp3Converter
class Mp3Converter(WavConverter)
Converts MP3 files to Markdown via extraction of metadata (if exiftool is installed) and speech transcription (if both speech_recognition and pydub are installed).
ImageConverter
class ImageConverter(MediaConverter)
Converts images to Markdown via extraction of metadata (if exiftool is installed), OCR (if easyocr is installed), and description via a multimodal LLM (if an mlm_client is configured).
MarkdownConverter
class MarkdownConverter()
(In preview) An extremely simple text-based document reader, suitable for LLM use. This reader converts common file types and web pages to Markdown.
convert
def convert(source, **kwargs)
Arguments:
- source: a string representing a path or URL, or a requests.Response object
- extension: specifies the file extension to use when interpreting the file. If None, it is inferred from the source (path, URI, content type, etc.)
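The inference order described for the extension argument could look roughly like the sketch below. The function guess_extension and the small content-type table are hypothetical. The actual lookup order and the set of supported types are not stated in this reference.

```python
import os
from typing import Optional
from urllib.parse import urlparse

# Illustrative mapping only; a real implementation would cover many more types.
CONTENT_TYPE_TO_EXT = {
    "text/plain": ".txt",
    "text/html": ".html",
    "application/pdf": ".pdf",
}


def guess_extension(source: str,
                    content_type: Optional[str] = None,
                    extension: Optional[str] = None) -> Optional[str]:
    """Pick an extension: explicit override, then path/URL suffix, then content type."""
    if extension:
        return extension if extension.startswith(".") else "." + extension
    # For URLs, ignore the query string and fragment when looking for a suffix.
    path = urlparse(source).path if "://" in source else source
    suffix = os.path.splitext(path)[1]
    if suffix:
        return suffix.lower()
    if content_type:
        mime = content_type.split(";", 1)[0].strip().lower()
        return CONTENT_TYPE_TO_EXT.get(mime)
    return None


print(guess_extension("https://example.com/files/a.docx?dl=1"))
print(guess_extension("https://example.com/page", content_type="text/html; charset=utf-8"))
```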
register_page_converter
def register_page_converter(converter: DocumentConverter) -> None
Register a page text converter.
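The relationship between converters, results, and registration can be sketched as a small registry. Every name below (Result, Converter, Registry, and the two example converters) is hypothetical. The sketch also assumes that a converter returns None when it does not apply and that later registrations take priority, neither of which is confirmed by this reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """Stand-in for a conversion result: an optional title plus the text."""
    title: Optional[str]
    text_content: str


class Converter:
    """Stand-in for an abstract converter; subclasses return None if not applicable."""

    def convert(self, text: str, extension: str) -> Optional[Result]:
        raise NotImplementedError


class PlainText(Converter):
    def convert(self, text: str, extension: str) -> Optional[Result]:
        if extension != ".txt":
            return None
        return Result(title=None, text_content=text)


class Shouting(Converter):
    def convert(self, text: str, extension: str) -> Optional[Result]:
        if extension != ".txt":
            return None
        return Result(title=None, text_content=text.upper())


class Registry:
    def __init__(self) -> None:
        self._converters: List[Converter] = []

    def register(self, converter: Converter) -> None:
        # Assumption: newest converter is tried first, so it can override built-ins.
        self._converters.insert(0, converter)

    def convert(self, text: str, extension: str) -> Result:
        for converter in self._converters:
            result = converter.convert(text, extension)
            if result is not None:
                return result
        raise ValueError(f"no converter accepted extension {extension!r}")


registry = Registry()
registry.register(PlainText())
registry.register(Shouting())  # registered later, so tried first
print(registry.convert("hello", ".txt").text_content)
```

Trying converters in order and stopping at the first non-None result keeps each converter independent: specialized handlers can be added without modifying the general ones.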