Visual Control Detection (OmniParser)

Visual control detection uses OmniParser-v2, a vision-based grounding model that locates UI elements directly from a screenshot. It is particularly effective for custom controls, icons, images, and other visual elements that the standard UI Automation (UIA) API may not expose.

Use Cases

Custom Controls: Detects proprietary or non-standard UI elements
Visual Elements: Icons, images, and graphics-based controls
Web Content: Elements within browser windows or web views
Canvas-based UIs: Applications that render custom graphics

Deployment

1. Clone the OmniParser Repository

On your remote GPU server:

git clone https://github.com/microsoft/OmniParser.git
cd OmniParser/omnitool/omniparserserver

2. Start the OmniParser Service

python gradio_demo.py

This will generate output similar to:

* Running on local URL:  http://0.0.0.0:7861
* Running on public URL: https://xxxxxxxxxxxxxxxxxx.gradio.live

For detailed deployment instructions, refer to the OmniParser README.

Configuration

OmniParser Settings

Configure the OmniParser endpoint and parameters in config/ufo/system.yaml:

OMNIPARSER:
  ENDPOINT: "<YOUR_END_POINT>"  # The endpoint URL from deployment
  BOX_THRESHOLD: 0.05            # Bounding box confidence threshold
  IOU_THRESHOLD: 0.1             # IoU threshold for non-max suppression
  USE_PADDLEOCR: True            # Enable OCR for text detection
  IMGSZ: 640                     # Input image size for the model
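Before starting a session, it can help to sanity-check these values. The following is a minimal sketch, assuming the OMNIPARSER section has already been loaded into a plain Python dictionary; the function name `check_omniparser_settings` and the exact rules are illustrative assumptions, not part of UFO.

```python
from typing import Any, Dict, List


def check_omniparser_settings(settings: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found.

    Hypothetical helper for illustration only (not a UFO API).
    """
    problems: List[str] = []

    endpoint = settings.get("ENDPOINT", "")
    if not isinstance(endpoint, str) or not endpoint.startswith(("http://", "https://")):
        problems.append("ENDPOINT should be an http(s) URL")

    # Both thresholds are fractions, so they are expected to lie in [0, 1].
    for key in ("BOX_THRESHOLD", "IOU_THRESHOLD"):
        value = settings.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            problems.append(f"{key} should be a number between 0 and 1")

    if not isinstance(settings.get("USE_PADDLEOCR"), bool):
        problems.append("USE_PADDLEOCR should be a boolean")

    imgsz = settings.get("IMGSZ")
    if isinstance(imgsz, bool) or not isinstance(imgsz, int) or imgsz <= 0:
        problems.append("IMGSZ should be a positive integer")

    return problems
```

Leaving the `<YOUR_END_POINT>` placeholder in place is the most common mistake this kind of check would catch.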

Enable Visual Detection

Set CONTROL_BACKEND to use OmniParser:

# Use OmniParser only
CONTROL_BACKEND: ["omniparser"]

# Or use hybrid mode (recommended for maximum coverage)
CONTROL_BACKEND: ["uia", "omniparser"]

See Hybrid Detection for combining UIA and OmniParser, or System Configuration for detailed options.
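As a rough illustration of what hybrid mode implies, the sketch below keeps every UIA control and adds only those visual detections that do not substantially overlap a UIA control. It is an assumption about the general idea, not UFO's actual merge logic; `iou`, `merge_controls`, and the `max_overlap` parameter are hypothetical names.

```python
from typing import Dict, List

Box = Dict[str, int]  # keys: x0, y0, x1, y1 (absolute pixels)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    inter_h = min(a["y1"], b["y1"]) - max(a["y0"], b["y0"])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a["x1"] - a["x0"]) * (a["y1"] - a["y0"])
    area_b = (b["x1"] - b["x0"]) * (b["y1"] - b["y0"])
    return inter / (area_a + area_b - inter)


def merge_controls(uia: List[Box], visual: List[Box], max_overlap: float = 0.1) -> List[Box]:
    """Keep all UIA boxes; add visual boxes that overlap no UIA box by more than max_overlap."""
    merged = list(uia)
    for candidate in visual:
        if all(iou(candidate, known) <= max_overlap for known in uia):
            merged.append(candidate)
    return merged
```

With this approach, a button already reported by UIA is not duplicated by its visual detection, while an icon drawn on a canvas (invisible to UIA) is still added.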

Reference

Bases: BasicGrounding

OmniparserGrounding is a subclass of BasicGrounding that wraps the OmniParser grounding model.

`parse_results(results, application_window=None)`

Parse the grounding results into a list of control element information dictionaries.

Parameters:

results (List[Dict[str, Any]]) –

The list of grounding result dictionaries from the grounding model.
application_window (UIAWrapper, default: None ) –

The application window used to compute the absolute coordinates.

Returns:

List[Dict[str, Any]] –

The list of control element information dictionaries. Each dictionary should contain the following keys:

{
    "control_type": The control type of the element,
    "name": The name of the element,
    "x0": The absolute left coordinate of the bounding box, as an integer,
    "y0": The absolute top coordinate of the bounding box, as an integer,
    "x1": The absolute right coordinate of the bounding box, as an integer,
    "y1": The absolute bottom coordinate of the bounding box, as an integer,
}

Source code in automator/ui_control/grounding/omniparser.py

def parse_results(
    self, results: List[Dict[str, Any]], application_window: UIAWrapper = None
) -> List[Dict[str, Any]]:
    """
    Parse the grounding results into a list of control element information dictionaries.
    :param results: The list of grounding results dictionaries from the grounding model.
    :param application_window: The application window to get the absolute coordinates.
    :return: The list of control elements information dictionaries, the dictionary should contain the following keys:
    {
        "control_type": The control type of the element,
        "name": The name of the element,
        "x0": The absolute left coordinate of the bounding box in integer,
        "y0": The absolute top coordinate of the bounding box in integer,
        "x1": The absolute right coordinate of the bounding box in integer,
        "y1": The absolute bottom coordinate of the bounding box in integer,
    }
    """
    control_elements_info = []

    # Get application rectangle coordinates from UIAWrapper
    app_left, app_top, app_width, app_height = self._get_application_rect_from_uia(
        application_window
    )

    for control_info in results:
        control_element = self._calculate_absolute_coordinates(
            control_info, app_left, app_top, app_width, app_height
        )
        if control_element is not None:
            control_elements_info.append(control_element)

    return control_elements_info
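The helper `_calculate_absolute_coordinates` is not shown here. The sketch below illustrates the kind of conversion it is expected to perform, under the assumption that each result carries a `bbox` of four ratios in the range 0 to 1 relative to the screenshot, plus `type` and `content` fields. Those field names and the function `to_absolute` are assumptions for illustration, not the library's confirmed implementation.

```python
from typing import Any, Dict, Optional


def to_absolute(
    control_info: Dict[str, Any],
    app_left: int,
    app_top: int,
    app_width: int,
    app_height: int,
) -> Optional[Dict[str, Any]]:
    """Convert a ratio-based bounding box into absolute screen coordinates.

    Hypothetical helper; the input field names are assumed.
    """
    bbox = control_info.get("bbox")
    if not bbox or len(bbox) != 4:
        return None  # callers skip entries that cannot be converted

    rx0, ry0, rx1, ry1 = bbox
    return {
        "control_type": control_info.get("type", "Button"),
        "name": control_info.get("content", ""),
        # Scale each ratio by the window size, then offset by the window origin.
        "x0": int(app_left + rx0 * app_width),
        "y0": int(app_top + ry0 * app_height),
        "x1": int(app_left + rx1 * app_width),
        "y1": int(app_top + ry1 * app_height),
    }
```

For example, a box of `[0.25, 0.5, 0.5, 0.75]` in an 800 by 600 window whose top-left corner is at (100, 200) maps to x0=300, y0=500, x1=500, y1=650.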

`predict(image_path, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640, api_name='/process')`

Predict the grounding for the given image.

Parameters:

image_path (str) –

The path to the image.
box_threshold (float, default: 0.05 ) –

The confidence threshold for keeping a detected bounding box.
iou_threshold (float, default: 0.1 ) –

The intersection-over-union (IoU) threshold for non-max suppression.
use_paddleocr (bool, default: True ) –

Whether to use PaddleOCR for text detection.
imgsz (int, default: 640 ) –

The input image size for the model.
api_name (str, default: '/process' ) –

The name of the API to call on the OmniParser service.

Returns:	`List[Dict[str, Any]]` – The predicted grounding results, one dictionary per detected element. The list is empty if the image does not exist or the request fails.

Source code in automator/ui_control/grounding/omniparser.py

def predict(
    self,
    image_path: str,
    box_threshold: float = 0.05,
    iou_threshold: float = 0.1,
    use_paddleocr: bool = True,
    imgsz: int = 640,
    api_name: str = "/process",
) -> List[Dict[str, Any]]:
    """
    Predict the grounding for the given image.
    :param image_path: The path to the image.
    :param box_threshold: The threshold for the bounding box.
    :param iou_threshold: The threshold for the intersection over union.
    :param use_paddleocr: Whether to use paddleocr.
    :param imgsz: The image size.
    :param api_name: The name of the API.
    :return: The predicted grounding results, one dictionary per detected element.
    """

    list_of_grounding_results = []

    if not os.path.exists(image_path):
        logger.warning(f"The image path {image_path} does not exist.")
        return list_of_grounding_results

    try:
        results = self.service.chat_completion(
            image_path, box_threshold, iou_threshold, use_paddleocr, imgsz, api_name
        )
        grounding_results = results[1].splitlines()

    except Exception as e:
        logger.warning(
            f"Failed to get grounding results for Omniparser. Error: {e}"
        )

        return list_of_grounding_results

    for item in grounding_results:
        try:
            item = json.loads(item)
            list_of_grounding_results.append(item)
        except json.JSONDecodeError:
            try:
                # the item string is a string converted from python's dict
                item = ast.literal_eval(
                    item[item.index("{") : item.rindex("}") + 1]
                )
                list_of_grounding_results.append(item)
            except Exception:
                pass

    return list_of_grounding_results
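The two-stage parsing in `predict` exists because a line may be valid JSON or the printed form of a Python dictionary (single quotes, `True` instead of `true`), possibly with a prefix before the opening brace. The standalone sketch below isolates that fallback so it can be tried without a running service; the function name `parse_result_lines` and the sample lines are illustrative assumptions.

```python
import ast
import json
from typing import Any, Dict, List


def parse_result_lines(lines: List[str]) -> List[Dict[str, Any]]:
    """Parse each line as JSON, falling back to a Python-literal dictionary."""
    parsed: List[Dict[str, Any]] = []
    for line in lines:
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            try:
                # Keep only the outermost {...} span, then evaluate it safely
                # as a Python literal (no arbitrary code is executed).
                literal = line[line.index("{") : line.rindex("}") + 1]
                parsed.append(ast.literal_eval(literal))
            except (ValueError, SyntaxError):
                continue  # skip lines that contain no parseable record
    return parsed
```

A line such as `icon 3: {'type': 'icon', 'interactivity': True}` fails JSON parsing but is recovered by the fallback, while a line with no braces at all is silently skipped.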

`screen_parsing(screenshot_path, application_window_info=None, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640)`

Run predict on a screenshot and return the detected controls as TargetInfo objects, using a TargetInfo to describe the application window.

Parameters:

screenshot_path (str) –

The path to the screenshot image.
application_window_info (TargetInfo, default: None ) –

The application window TargetInfo.
box_threshold (float, default: 0.05 ) –

The confidence threshold for keeping a detected bounding box.
iou_threshold (float, default: 0.1 ) –

The intersection-over-union (IoU) threshold for non-max suppression.
use_paddleocr (bool, default: True ) –

Whether to use PaddleOCR for text detection.
imgsz (int, default: 640 ) –

The input image size for the model.

Returns:	`List[TargetInfo]` – The list of detected control elements as TargetInfo objects.

Source code in automator/ui_control/grounding/omniparser.py

def screen_parsing(
    self,
    screenshot_path: str,
    application_window_info: TargetInfo = None,
    box_threshold: float = 0.05,
    iou_threshold: float = 0.1,
    use_paddleocr: bool = True,
    imgsz: int = 640,
) -> List[TargetInfo]:
    """
    Run predict on a screenshot and return the detected controls as TargetInfo objects.
    :param screenshot_path: The path to the screenshot image.
    :param application_window_info: The application window TargetInfo.
    :param box_threshold: The threshold for the bounding box.
    :param iou_threshold: The threshold for the intersection over union.
    :param use_paddleocr: Whether to use PaddleOCR.
    :param imgsz: The image size.
    :return: The list of detected control elements as TargetInfo objects.
    """
    results = self.predict(
        screenshot_path,
        box_threshold=box_threshold,
        iou_threshold=iou_threshold,
        use_paddleocr=use_paddleocr,
        imgsz=imgsz,
    )

    control_elements_info = []

    # Get application rectangle coordinates from TargetInfo
    app_left, app_top, app_width, app_height = (
        self._get_application_rect_from_target_info(application_window_info)
    )

    for control_info in results:
        control_element = self._calculate_absolute_coordinates(
            control_info, app_left, app_top, app_width, app_height
        )
        if control_element is not None:
            control_elements_info.append(
                TargetInfo(
                    kind=TargetKind.CONTROL,
                    type=control_element.get("control_type", "Button"),
                    name=control_element.get("name", ""),
                    rect=(
                        control_element.get("x0", 0),
                        control_element.get("y0", 0),
                        control_element.get("x1", 0),
                        control_element.get("y1", 0),
                    ),
                )
            )

    return control_elements_info
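Note the defaults applied when building each TargetInfo: a missing control type becomes "Button", a missing name becomes an empty string, and missing coordinates become 0. The sketch below reproduces that mapping with a stand-in dataclass so it can be run without UFO installed; `SimpleTarget` and `to_target` are hypothetical names and do not reflect the real TargetInfo class.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class SimpleTarget:
    """Hypothetical stand-in for TargetInfo; not the real UFO class."""

    type: str
    name: str
    rect: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in absolute pixels


def to_target(control_element: Dict[str, Any]) -> SimpleTarget:
    """Map a control element dictionary to a target, applying the same defaults."""
    return SimpleTarget(
        type=control_element.get("control_type", "Button"),
        name=control_element.get("name", ""),
        rect=(
            control_element.get("x0", 0),
            control_element.get("y0", 0),
            control_element.get("x1", 0),
            control_element.get("y1", 0),
        ),
    )
```

Because of these defaults, callers that rely on the control type may want to treat "Button" as "unknown" when the detection carried no type of its own.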