Visual Control Detection (OmniParser)
Visual control detection uses OmniParser-v2, a vision-based grounding model that locates UI elements directly from screenshots. It is particularly effective for custom controls, icons, images, and other visual elements that are not exposed through the standard UI Automation (UIA) API.
Use Cases
- Custom Controls: Detects proprietary or non-standard UI elements
- Visual Elements: Icons, images, and graphics-based controls
- Web Content: Elements within browser windows or web views
- Canvas-based UIs: Applications that render custom graphics
Deployment
1. Clone the OmniParser Repository
On your remote GPU server:
git clone https://github.com/microsoft/OmniParser.git
cd OmniParser/omnitool/omniparserserver
2. Start the OmniParser Service
python gradio_demo.py
This will generate output similar to:
* Running on local URL: http://0.0.0.0:7861
* Running on public URL: https://xxxxxxxxxxxxxxxxxx.gradio.live
For detailed deployment instructions, refer to the OmniParser README.
Configuration
OmniParser Settings
Configure the OmniParser endpoint and parameters in config/ufo/system.yaml:
OMNIPARSER:
  ENDPOINT: "<YOUR_END_POINT>"  # The endpoint URL from deployment
  BOX_THRESHOLD: 0.05           # Bounding box confidence threshold
  IOU_THRESHOLD: 0.1            # IoU threshold for non-max suppression
  USE_PADDLEOCR: True           # Enable OCR for text detection
  IMGSZ: 640                    # Input image size for the model
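To make the two thresholds concrete, here is a minimal sketch of how a box confidence threshold and an IoU threshold are typically applied in greedy non-max suppression. It is not OmniParser's actual implementation; the function names `iou` and `filter_boxes` and the sample detections are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(
    detections: List[Detection],
    box_threshold: float = 0.05,  # mirrors BOX_THRESHOLD
    iou_threshold: float = 0.1,   # mirrors IOU_THRESHOLD
) -> List[Detection]:
    """Drop low-confidence boxes, then greedily suppress overlapping ones."""
    candidates = sorted(
        (d for d in detections if d[1] >= box_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept: List[Detection] = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


# Hypothetical detections in normalized coordinates.
detections = [
    ((0.10, 0.10, 0.30, 0.30), 0.90),
    ((0.12, 0.12, 0.32, 0.32), 0.60),  # overlaps heavily with the first box
    ((0.50, 0.50, 0.70, 0.70), 0.40),
    ((0.80, 0.80, 0.90, 0.90), 0.01),  # below the box threshold
]
kept = filter_boxes(detections)
```

In this sketch, a lower BOX_THRESHOLD admits more (and weaker) candidate boxes, while a lower IOU_THRESHOLD suppresses overlapping boxes more aggressively.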
Enable Visual Detection
Set CONTROL_BACKEND to use OmniParser:
# Use OmniParser only
CONTROL_BACKEND: ["omniparser"]
# Or use hybrid mode (recommended for maximum coverage)
CONTROL_BACKEND: ["uia", "omniparser"]
See Hybrid Detection for combining UIA and OmniParser, or System Configuration for detailed options.
Reference
Bases: BasicGrounding
The OmniparserGrounding class is a subclass of BasicGrounding and represents the OmniParser grounding model.
parse_results(results, application_window=None)
Parse the grounding results string into a list of dictionaries, each describing one control element.
Source code in automator/ui_control/grounding/omniparser.py
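As a rough illustration of what parsing grounding results into control-element dictionaries can involve, the sketch below assumes a results string with one element per line, each line containing a Python dict literal with a normalized `bbox`, a `type`, and a `content` field. That input format, the function name `parse_results_sketch`, and the `window_rect` tuple are assumptions made for this example and are not taken from the actual `parse_results` source.

```python
import ast
from typing import Any, Dict, List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels


def parse_results_sketch(
    results: str, window_rect: Optional[Rect] = None
) -> List[Dict[str, Any]]:
    """Turn a results string into a list of control-element dictionaries.

    Assumed input: one element per line, e.g.
    "icon 0: {'type': 'icon', 'bbox': [x1, y1, x2, y2], 'content': 'Save'}"
    with bbox values normalized to the range 0..1.
    """
    elements: List[Dict[str, Any]] = []
    for line in results.splitlines():
        start = line.find("{")
        if start == -1:
            continue  # no dict literal on this line
        try:
            item = ast.literal_eval(line[start:])
        except (ValueError, SyntaxError):
            continue  # skip malformed lines
        if not isinstance(item, dict) or len(item.get("bbox", [])) != 4:
            continue
        x1, y1, x2, y2 = item["bbox"]
        if window_rect is not None:
            # Scale normalized coordinates to absolute pixels in the window.
            left, top, right, bottom = window_rect
            width, height = right - left, bottom - top
            x1, x2 = round(left + x1 * width), round(left + x2 * width)
            y1, y2 = round(top + y1 * height), round(top + y2 * height)
        elements.append(
            {
                "type": item.get("type"),
                "content": item.get("content"),
                "bbox": [x1, y1, x2, y2],
            }
        )
    return elements


sample = (
    "icon 0: {'type': 'icon', 'bbox': [0.25, 0.5, 0.75, 1.0], 'content': 'Save'}\n"
    "not a result line"
)
parsed = parse_results_sketch(sample, window_rect=(100, 200, 500, 400))
```

When no window rectangle is given, this sketch leaves the coordinates normalized; the real method may handle a missing application window differently.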
predict(image_path, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640, api_name='/process')
Predict the grounding for the given image.
Source code in automator/ui_control/grounding/omniparser.py
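Because the deployed service is a Gradio app and `predict` takes an `api_name` of `/process`, a request could plausibly be sent with the `gradio_client` package, as sketched below. The helper names `build_process_args` and `call_omniparser` are hypothetical, and the positional argument order is an assumption that simply follows the `predict` signature above; verify it against the API page of your deployed endpoint.

```python
from typing import Any, Tuple


def build_process_args(
    image_path: str,
    box_threshold: float = 0.05,
    iou_threshold: float = 0.1,
    use_paddleocr: bool = True,
    imgsz: int = 640,
) -> Tuple[str, float, float, bool, int]:
    """Collect request arguments in the order of the predict signature."""
    return (image_path, box_threshold, iou_threshold, use_paddleocr, imgsz)


def call_omniparser(
    endpoint: str, image_path: str, api_name: str = "/process", **settings: Any
) -> Any:
    """Send one screenshot to the OmniParser Gradio endpoint (sketch)."""
    # Imported lazily: gradio_client is a third-party package
    # (pip install gradio_client); handle_file needs a recent version.
    from gradio_client import Client, handle_file

    path, *rest = build_process_args(image_path, **settings)
    client = Client(endpoint)
    # Assumed input order: image, box threshold, IoU threshold, OCR flag, size.
    return client.predict(handle_file(path), *rest, api_name=api_name)


# Example call (requires a running endpoint, so it is left commented out):
# result = call_omniparser("<YOUR_END_POINT>", "screenshot.png", imgsz=640)
```

Keeping the argument assembly separate from the network call makes the defaults easy to check without a running server.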
screen_parsing(screenshot_path, application_window_info=None, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640)
Parse the grounding results using TargetInfo for application window information.
Source code in automator/ui_control/grounding/omniparser.py