Visual Control Detection (OmniParser)
We also support visual control detection using OmniParser-v2. This method is useful for detecting custom controls in the application that may not be recognized by standard UIA methods. The visual control detection uses computer vision techniques to identify and interact with the UI elements based on their visual appearance.
Deployment
On your remote GPU server, clone the OmniParser repository
git clone https://github.com/microsoft/OmniParser.git
Start omniparserserver
service
cd OmniParser/omnitool/omniparserserver
python gradio_demo.py
This will give you a short URL
* Running on local URL: http:
* Running on public URL: https:
Note: If you have any questions regarding the deployment of OmniParser, please take a look at the README from OmniParser repo.
Configuration
After deploying the OmniParser model, you need to configure the OmniParser settings in the config.yaml
file:
OMNIPARSER: {
ENDPOINT: "<YOUR_END_POINT>",
BOX_THRESHOLD: 0.05,
IOU_THRESHOLD: 0.1,
USE_PADDLEOCR: True,
IMGSZ: 640
}
To activate the icon control filtering, you need to set CONTROL_BACKEND
to ["omniparser"]
in the config_dev.yaml
file.
CONTROL_BACKEND: ["omniparser"]
Reference
The following classes are used for visual control detection in OmniParser:
Bases: BasicGrounding
The OmniparserGrounding class is a subclass of BasicGrounding, which is used to represent the Omniparser grounding model.
parse_results(results, application_window=None)
Parse the grounding results string into a list of control elements infomation dictionaries.
Parameters: |
-
results
(List[Dict[str, Any]] )
–
The list of grounding results dictionaries from the grounding model.
-
application_window
(UIAWrapper , default:
None
)
–
The application window to get the absolute coordinates.
|
Returns: |
-
List[Dict[str, Any]]
–
The list of control elements information dictionaries, the dictionary should contain the following keys: { "control_type": The control type of the element, "name": The name of the element, "x0": The absolute left coordinate of the bounding box in integer, "y0": The absolute top coordinate of the bounding box in integer, "x1": The absolute right coordinate of the bounding box in integer, "y1": The absolute bottom coordinate of the bounding box in integer, }
|
Source code in automator/ui_control/grounding/omniparser.py
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145 | def parse_results(
self, results: List[Dict[str, Any]], application_window: UIAWrapper = None
) -> List[Dict[str, Any]]:
"""
Parse the grounding results string into a list of control elements infomation dictionaries.
:param results: The list of grounding results dictionaries from the grounding model.
:param application_window: The application window to get the absolute coordinates.
:return: The list of control elements information dictionaries, the dictionary should contain the following keys:
{
"control_type": The control type of the element,
"name": The name of the element,
"x0": The absolute left coordinate of the bounding box in integer,
"y0": The absolute top coordinate of the bounding box in integer,
"x1": The absolute right coordinate of the bounding box in integer,
"y1": The absolute bottom coordinate of the bounding box in integer,
}
"""
control_elements_info = []
if application_window is None:
application_rect = RECT(0, 0, 0, 0)
else:
try:
application_rect = application_window.rectangle()
except Exception:
application_rect = RECT(0, 0, 0, 0)
for control_info in results:
if not self._filter_interactivity and control_info.get(
"interactivity", True
):
continue
application_left, application_top = (
application_rect.left,
application_rect.top,
)
control_box = control_info.get("bbox", [0, 0, 0, 0])
control_left = int(
application_left + control_box[0] * application_rect.width()
)
control_top = int(
application_top + control_box[1] * application_rect.height()
)
control_right = int(
application_left + control_box[2] * application_rect.width()
)
control_bottom = int(
application_top + control_box[3] * application_rect.height()
)
control_elements_info.append(
{
"control_type": control_info.get("type", "Button"),
"name": control_info.get("content", ""),
"x0": control_left,
"y0": control_top,
"x1": control_right,
"y1": control_bottom,
}
)
return control_elements_info
|
predict(image_path, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640, api_name='/process')
Predict the grounding for the given image.
Parameters: |
-
image_path
(str )
–
-
box_threshold
(float , default:
0.05
)
–
The threshold for the bounding box.
-
iou_threshold
(float , default:
0.1
)
–
The threshold for the intersection over union.
-
use_paddleocr
(bool , default:
True
)
–
Whether to use paddleocr.
-
imgsz
(int , default:
640
)
–
-
api_name
(str , default:
'/process'
)
–
|
Returns: |
-
List[Dict[str, Any]]
–
The predicted grounding results string.
|
Source code in automator/ui_control/grounding/omniparser.py
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77 | def predict(
self,
image_path: str,
box_threshold: float = 0.05,
iou_threshold: float = 0.1,
use_paddleocr: bool = True,
imgsz: int = 640,
api_name: str = "/process",
) -> List[Dict[str, Any]]:
"""
Predict the grounding for the given image.
:param image_path: The path to the image.
:param box_threshold: The threshold for the bounding box.
:param iou_threshold: The threshold for the intersection over union.
:param use_paddleocr: Whether to use paddleocr.
:param imgsz: The image size.
:param api_name: The name of the API.
:return: The predicted grounding results string.
"""
list_of_grounding_results = []
if not os.path.exists(image_path):
print_with_color(
f"Warning: The image path {image_path} does not exist.", "yellow"
)
return list_of_grounding_results
try:
results = self.service.chat_completion(
image_path, box_threshold, iou_threshold, use_paddleocr, imgsz, api_name
)
grounding_results = results[1].splitlines()
except Exception as e:
print_with_color(
f"Warning: Failed to get grounding results for Omniparser. Error: {e}",
"yellow",
)
return list_of_grounding_results
for item in grounding_results:
try:
item = json.loads(item)
list_of_grounding_results.append(item)
except json.JSONDecodeError:
try:
item = ast.literal_eval(item[item.index("{"):item.rindex("}") + 1])
list_of_grounding_results.append(item)
except Exception:
pass
return list_of_grounding_results
|