Hybrid Control Detection

Hybrid control detection combines UIA (UI Automation) and OmniParser to provide broader UI coverage than either backend alone. It merges the standard Windows controls detected via UIA with the visual elements detected by OmniParser, and removes duplicates based on their Intersection over Union (IoU) overlap.

How It Works

The hybrid detection process follows these steps:

graph LR
    A[Screenshot] --> B[UIA Detection]
    A --> C[OmniParser Detection]
    B --> D[UIA Controls<br/>Standard UI Elements]
    C --> E[Visual Controls<br/>Icons, Images, Custom UI]
    D --> F[Merge & Deduplicate<br/>IoU Threshold: 0.1]
    E --> F
    F --> G["Final Control List<br/>Annotated [1] to [N]"]
    style D fill:#e3f2fd
    style E fill:#fff3e0
    style F fill:#e8f5e9
    style G fill:#f3e5f5

Deduplication Algorithm:

  1. Keep all UIA-detected controls (main list).
  2. For each OmniParser-detected control (additional list):
     a. Calculate its IoU with every UIA control.
     b. If any IoU exceeds the threshold (default 0.1), discard the control as a duplicate.
     c. Otherwise, add it to the merged list.
  3. Return the merged list: maximum coverage with minimal duplicates.
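
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind PhotographerFacade.merge_target_info_list(): the Box tuple and the iou and merge_controls names are hypothetical, and controls are reduced to bare bounding boxes.

```python
from typing import List, Tuple

# Hypothetical box representation: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_controls(
    main: List[Box], additional: List[Box], threshold: float = 0.1
) -> List[Box]:
    """Keep every box in `main`; add a box from `additional` only if its
    IoU with every `main` box is at or below `threshold`."""
    merged = list(main)
    for candidate in additional:
        if all(iou(candidate, kept) <= threshold for kept in main):
            merged.append(candidate)
    return merged


uia_boxes = [(0, 0, 100, 40)]
omni_boxes = [(5, 0, 100, 40), (200, 0, 240, 40)]
merged = merge_controls(uia_boxes, omni_boxes)
```

Here the first OmniParser box almost coincides with the UIA box (IoU 0.95) and is dropped, while the second does not overlap anything and is appended.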

Benefits

  • Maximum Coverage: Detects both standard and custom UI elements
  • Fewer Gaps: Visual detection fills in UIA blind spots
  • Efficiency: Deduplication prevents redundant annotations
  • Flexibility: Works across diverse application types

Configuration

Prerequisites

Before enabling hybrid detection, you must deploy and configure OmniParser. See Visual Detection - Deployment for instructions.

Enable Hybrid Mode

Configure both backends in config/ufo/system.yaml:

# Enable hybrid detection
CONTROL_BACKEND: ["uia", "omniparser"]

# IoU threshold for merging (controls with IoU > threshold are considered duplicates)
IOU_THRESHOLD_FOR_MERGE: 0.1  # Default: 0.1

# OmniParser configuration
OMNIPARSER:
  ENDPOINT: "<YOUR_END_POINT>"
  BOX_THRESHOLD: 0.05
  IOU_THRESHOLD: 0.1
  USE_PADDLEOCR: True
  IMGSZ: 640
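
Before starting a session it can help to sanity-check these settings. The sketch below assumes the YAML has already been loaded into a plain dictionary; validate_hybrid_config is a hypothetical helper written for this page and is not part of UFO.

```python
from typing import Any, Dict, List


def validate_hybrid_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the settings look usable."""
    problems: List[str] = []

    # Hybrid mode needs both backends to be listed.
    backends = config.get("CONTROL_BACKEND", ["uia"])
    if not {"uia", "omniparser"} <= set(backends):
        problems.append("CONTROL_BACKEND must list both 'uia' and 'omniparser'")

    # The merge threshold is an IoU value, so it must lie in [0.0, 1.0].
    threshold = config.get("IOU_THRESHOLD_FOR_MERGE", 0.1)
    if not 0.0 <= threshold <= 1.0:
        problems.append("IOU_THRESHOLD_FOR_MERGE must be between 0.0 and 1.0")

    # The endpoint must be replaced with a real OmniParser deployment.
    endpoint = config.get("OMNIPARSER", {}).get("ENDPOINT", "")
    if not endpoint or endpoint == "<YOUR_END_POINT>":
        problems.append("OMNIPARSER.ENDPOINT is missing or still a placeholder")

    return problems


config = {
    "CONTROL_BACKEND": ["uia", "omniparser"],
    "IOU_THRESHOLD_FOR_MERGE": 0.1,
    "OMNIPARSER": {"ENDPOINT": "<YOUR_END_POINT>"},
}
problems = validate_hybrid_config(config)
```

With the placeholder endpoint left in place, the only reported problem is the endpoint one.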

Configuration Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| CONTROL_BACKEND | List[str] | ["uia"] | List of detection backends to use |
| IOU_THRESHOLD_FOR_MERGE | float | 0.1 | IoU threshold for duplicate detection (0.0-1.0) |

Tuning Guidelines:

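  • Lower threshold (< 0.1): More aggressive deduplication; OmniParser controls that only slightly overlap a UIA control are dropped, so some distinct controls may be missed
  • Higher threshold (> 0.1): Keeps more overlapping controls, so some duplicates may remain
  • Recommended: Keep the default of 0.1 unless you observe missing controls or duplicate annotations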
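
To get a feel for the threshold, consider an icon that OmniParser finds inside a larger button reported by UIA. The coordinates below are made up for illustration, and iou is a hypothetical helper rather than a UFO function.

```python
def iou(a, b):
    """Intersection over Union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


uia_button = (0, 0, 100, 40)   # 100 x 40 button reported by UIA
omni_icon = (10, 8, 34, 32)    # 24 x 24 icon detected inside the button

overlap = iou(uia_button, omni_icon)  # 576 / 4000 = 0.144

# Decision for the OmniParser icon at several merge thresholds.
decisions = {
    threshold: "discard" if overlap > threshold else "keep"
    for threshold in (0.05, 0.1, 0.2)
}
```

With an IoU of 0.144, the icon is discarded at the default threshold of 0.1 and only survives once the threshold is raised above that value, for example to 0.2.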

See System Configuration for complete configuration details.

Implementation

The hybrid detection is implemented through:

  • AppControlInfoStrategy: Orchestrates control collection from multiple backends
  • PhotographerFacade.merge_target_info_list(): Performs IoU-based deduplication
  • OmniparserGrounding: Handles visual detection and parsing

Reference

Bases: BasicGrounding

The OmniparserGrounding class is a subclass of BasicGrounding, which is used to represent the Omniparser grounding model.

parse_results(results, application_window=None)

Parse the grounding results into a list of control element information dictionaries.

Parameters:
  • results (List[Dict[str, Any]]) –

    The list of grounding results dictionaries from the grounding model.

  • application_window (UIAWrapper, default: None ) –

    The application window to get the absolute coordinates.

Returns:
  • List[Dict[str, Any]]

    The list of control element information dictionaries. Each dictionary should contain the following keys:

      • "control_type": The control type of the element
      • "name": The name of the element
      • "x0": The absolute left coordinate of the bounding box, as an integer
      • "y0": The absolute top coordinate of the bounding box, as an integer
      • "x1": The absolute right coordinate of the bounding box, as an integer
      • "y1": The absolute bottom coordinate of the bounding box, as an integer

Source code in automator/ui_control/grounding/omniparser.py
def parse_results(
    self, results: List[Dict[str, Any]], application_window: UIAWrapper = None
) -> List[Dict[str, Any]]:
    """
    Parse the grounding results into a list of control element information dictionaries.
    :param results: The list of grounding results dictionaries from the grounding model.
    :param application_window: The application window to get the absolute coordinates.
    :return: The list of control elements information dictionaries, the dictionary should contain the following keys:
    {
        "control_type": The control type of the element,
        "name": The name of the element,
        "x0": The absolute left coordinate of the bounding box in integer,
        "y0": The absolute top coordinate of the bounding box in integer,
        "x1": The absolute right coordinate of the bounding box in integer,
        "y1": The absolute bottom coordinate of the bounding box in integer,
    }
    """
    control_elements_info = []

    # Get application rectangle coordinates from UIAWrapper
    app_left, app_top, app_width, app_height = self._get_application_rect_from_uia(
        application_window
    )

    for control_info in results:
        control_element = self._calculate_absolute_coordinates(
            control_info, app_left, app_top, app_width, app_height
        )
        if control_element is not None:
            control_elements_info.append(control_element)

    return control_elements_info
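
The helper _calculate_absolute_coordinates is not shown on this page. The sketch below illustrates one way such a conversion could work, under loudly stated assumptions: each result is assumed to carry a "bbox" of four ratios in the range 0 to 1 relative to the application window, plus "type" and "content" fields. The function name to_absolute and these field names are hypothetical and may differ from the real implementation.

```python
from typing import Any, Dict, Optional


def to_absolute(
    control_info: Dict[str, Any],
    app_left: int,
    app_top: int,
    app_width: int,
    app_height: int,
) -> Optional[Dict[str, Any]]:
    """Scale a relative bounding box to absolute screen coordinates.

    Assumes control_info["bbox"] holds (x0, y0, x1, y1) as ratios of the
    application window size. Returns None when no usable box is present.
    """
    bbox = control_info.get("bbox")
    if not bbox or len(bbox) != 4:
        return None

    rx0, ry0, rx1, ry1 = bbox
    return {
        "control_type": control_info.get("type", "Button"),
        "name": control_info.get("content", ""),
        "x0": int(app_left + rx0 * app_width),
        "y0": int(app_top + ry0 * app_height),
        "x1": int(app_left + rx1 * app_width),
        "y1": int(app_top + ry1 * app_height),
    }


element = to_absolute(
    {"type": "icon", "content": "Save", "bbox": [0.25, 0.5, 0.5, 0.75]},
    app_left=100,
    app_top=50,
    app_width=800,
    app_height=600,
)
```

For an 800 x 600 window whose top-left corner is at (100, 50), the relative box above maps to the absolute rectangle (300, 350, 500, 500).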

predict(image_path, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640, api_name='/process')

Predict the grounding for the given image.

Parameters:
  • image_path (str) –

    The path to the image.

  • box_threshold (float, default: 0.05 ) –

    The threshold for the bounding box.

  • iou_threshold (float, default: 0.1 ) –

    The threshold for the intersection over union.

  • use_paddleocr (bool, default: True ) –

    Whether to use paddleocr.

  • imgsz (int, default: 640 ) –

    The image size.

  • api_name (str, default: '/process' ) –

    The name of the API.

Returns:
  • List[Dict[str, Any]]

    The predicted grounding results, one dictionary per detected element.

Source code in automator/ui_control/grounding/omniparser.py
def predict(
    self,
    image_path: str,
    box_threshold: float = 0.05,
    iou_threshold: float = 0.1,
    use_paddleocr: bool = True,
    imgsz: int = 640,
    api_name: str = "/process",
) -> List[Dict[str, Any]]:
    """
    Predict the grounding for the given image.
    :param image_path: The path to the image.
    :param box_threshold: The threshold for the bounding box.
    :param iou_threshold: The threshold for the intersection over union.
    :param use_paddleocr: Whether to use paddleocr.
    :param imgsz: The image size.
    :param api_name: The name of the API.
    :return: The predicted grounding results, one dictionary per detected element.
    """

    list_of_grounding_results = []

    if not os.path.exists(image_path):
        logger.warning(f"The image path {image_path} does not exist.")
        return list_of_grounding_results

    try:
        results = self.service.chat_completion(
            image_path, box_threshold, iou_threshold, use_paddleocr, imgsz, api_name
        )
        grounding_results = results[1].splitlines()

    except Exception as e:
        logger.warning(
            f"Failed to get grounding results for Omniparser. Error: {e}"
        )

        return list_of_grounding_results

    for item in grounding_results:
        try:
            item = json.loads(item)
            list_of_grounding_results.append(item)
        except json.JSONDecodeError:
            try:
                # the item string is a string converted from python's dict
                item = ast.literal_eval(
                    item[item.index("{") : item.rindex("}") + 1]
                )
                list_of_grounding_results.append(item)
            except Exception:
                pass

    return list_of_grounding_results
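
The two-stage parsing in predict can be tried in isolation. The sketch below mirrors the fallback from json.loads to ast.literal_eval using only the standard library; parse_lines and the sample lines are hypothetical and do not reflect the exact output format of an OmniParser deployment.

```python
import ast
import json
from typing import Any, Dict, List


def parse_lines(lines: List[str]) -> List[Dict[str, Any]]:
    """Parse each line as JSON, falling back to a Python dict literal."""
    parsed: List[Dict[str, Any]] = []
    for line in lines:
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            try:
                # Extract the outermost {...} span and evaluate it safely.
                literal = line[line.index("{") : line.rindex("}") + 1]
                parsed.append(ast.literal_eval(literal))
            except (ValueError, SyntaxError):
                # No braces found, or the span is not a valid literal: skip.
                continue
    return parsed


sample = [
    '{"type": "text", "content": "File"}',          # valid JSON
    "icon 3: {'type': 'icon', 'content': 'Save'}",  # Python dict with a prefix
    "not a record",                                 # skipped
]
records = parse_lines(sample)
```

The first line parses as JSON, the second goes through the literal fallback, and the third is silently skipped, which matches the behavior of the loop above.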

screen_parsing(screenshot_path, application_window_info=None)

Parse the grounding results using TargetInfo for application window information.

Parameters:
  • screenshot_path (str) –

    The path to the screenshot to parse.

  • application_window_info (TargetInfo, default: None ) –

    The application window TargetInfo.

Returns:
  • List[TargetInfo]

    The list of control elements as TargetInfo objects.

Source code in automator/ui_control/grounding/omniparser.py
def screen_parsing(
    self,
    screenshot_path: str,
    application_window_info: TargetInfo = None,
) -> List[TargetInfo]:
    """
    Parse the grounding results using TargetInfo for application window information.
    :param screenshot_path: The path to the screenshot to parse.
    :param application_window_info: The application window TargetInfo.
    :return: The list of control elements as TargetInfo objects.
    """
    results = self.predict(screenshot_path)

    control_elements_info = []

    # Get application rectangle coordinates from TargetInfo
    app_left, app_top, app_width, app_height = (
        self._get_application_rect_from_target_info(application_window_info)
    )

    for control_info in results:
        control_element = self._calculate_absolute_coordinates(
            control_info, app_left, app_top, app_width, app_height
        )
        if control_element is not None:
            control_elements_info.append(
                TargetInfo(
                    kind=TargetKind.CONTROL,
                    type=control_element.get("control_type", "Button"),
                    name=control_element.get("name", ""),
                    rect=(
                        control_element.get("x0", 0),
                        control_element.get("y0", 0),
                        control_element.get("x1", 0),
                        control_element.get("y1", 0),
                    ),
                )
            )

    return control_elements_info