Have you ever written a script to perform a string transformation, only to have it crash or silently produce wrong results because the input data was in an unexpected format? Or do you want to figure out how many different cases you need to handle in your standardization procedure? Matching.Text to the rescue!
Matching.Text automatically identifies different formats and patterns in string data. Given a set of input strings, Matching.Text produces a small number of disjoint regular expressions that together match all the input strings, except possibly a small fraction of outliers. Additional documentation and usage can be found here.
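As a rough illustration of what "disjoint patterns plus outliers" means, the sketch below takes a hand-written set of regular expressions and reports the fraction of inputs each one matches, along with the strings that none of them match. It does not call Matching.Text: the `profile` helper, the pattern names, and the sample strings are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical hand-written patterns; Matching.Text would learn these from the data.
PATTERNS = {
    "word_word": re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+"),
    "word_word_hyphen_word": re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+-[A-Z][a-z]+"),
}

def profile(strings, patterns=PATTERNS):
    """Return (fraction of inputs per pattern, list of unmatched outliers)."""
    counts = Counter()
    outliers = []
    for s in strings:
        # fullmatch requires the whole string to fit the pattern, not just a prefix.
        hits = [name for name, rx in patterns.items() if rx.fullmatch(s)]
        if len(hits) > 1:
            raise ValueError(f"patterns are not disjoint on {s!r}: {hits}")
        if hits:
            counts[hits[0]] += 1
        else:
            outliers.append(s)
    total = len(strings) or 1  # avoid dividing by zero on empty input
    freqs = {name: counts[name] / total for name in patterns}
    return freqs, outliers

# Illustrative sample, not the data set used in this article.
sample = ["Laia Sanchis", "Gwilym Jones", "Tulga Bat-Erdene", "N/A"]
freqs, outliers = profile(sample)
print(freqs)     # {'word_word': 0.5, 'word_word_hyphen_word': 0.25}
print(outliers)  # ['N/A']
```

The disjointness check is the important part: if two patterns can match the same string, per-pattern frequencies no longer add up to a partition of the input.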
Consider the list of names below, from which you want to extract last names.
A simple-looking task, if ever there was one. The Python function below is a good attempt.
def extract_last_name(name):
    # Return everything after the first space.
    return name[name.find(' ') + 1:]
However, while the first 10 names look standard, running Matching.Text provides more insight into the different formats and also identifies outliers that do not fall into any of those formats.
| Pattern Name | Regex Pattern | Frequency | Examples |
| | | 0.84 | “Laia Sanchis”, “Gwilym Jones” |
| | | 0.06 | “Tulga Bat-Erdene”, “Dabir Al-Zuhairi” |
| | | 0.06 | “Yue Ying Jen”, “Rolf Van Eeuwijk” |
Given this new insight, it can be seen that extract_last_name may not always return the right answer, and you may want to handle the last name extraction task differently for each format.
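To see the problem concretely, the naive function can be run on example names from the table. The single-word input "Madonna" is a made-up example, and which part of each name counts as the last name is left to the reader, since the source does not say.

```python
def extract_last_name(name):
    # Naive version: everything after the first space.
    return name[name.find(' ') + 1:]

examples = ["Laia Sanchis", "Tulga Bat-Erdene", "Yue Ying Jen", "Rolf Van Eeuwijk"]
for name in examples:
    print(f"{name!r} -> {extract_last_name(name)!r}")
# 'Laia Sanchis' -> 'Sanchis'
# 'Tulga Bat-Erdene' -> 'Bat-Erdene'
# 'Yue Ying Jen' -> 'Ying Jen'
# 'Rolf Van Eeuwijk' -> 'Van Eeuwijk'

# With no space, find() returns -1, so the slice starts at index 0
# and the whole string comes back unchanged.
print(extract_last_name("Madonna"))  # Madonna
```

The two three-word names get the same treatment even though "Van Eeuwijk" and "Ying Jen" are unlikely to be the same kind of name part, and a single-word input is returned whole without any warning. Both cases are silent, which is exactly the kind of failure that profiling the formats first helps to catch.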
Further, to make writing the procedure easier, Matching.Text can also generate a switch-case-like template to match against the different patterns.
import re

regex_word_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+')
regex_word_word_hyphen_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+-[A-Z][a-z]+')
regex_word_word_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+ [A-Z][a-z]+')
regex_word = re.compile(r'[A-Z][a-z]+')

def extract_last_name(name):
    # fullmatch is used because match() only anchors at the start, so the
    # first branch would otherwise also accept hyphenated and three-word names.
    if regex_word_word.fullmatch(name):
        return "TitleWord & TitleWord"  # Modify
    elif regex_word_word_hyphen_word.fullmatch(name):
        return "TitleWord & TitleWord & Const[-] & TitleWord"  # Modify
    elif regex_word_word_word.fullmatch(name):
        return "TitleWord & TitleWord & TitleWord"  # Modify
    elif regex_word.fullmatch(name):
        return "TitleWord"  # Modify
    else:
        return "Others"  # Modify
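One way the `# Modify` placeholders might be filled in is sketched below. The per-format choices (taking the final word of a three-word name, returning `None` for single-word names and outliers) are assumptions made for illustration, not rules supplied by Matching.Text. The sketch uses `fullmatch` so that each name falls into exactly one branch.

```python
import re

regex_word_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+')
regex_word_word_hyphen_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+-[A-Z][a-z]+')
regex_word_word_word = re.compile(r'[A-Z][a-z]+ [A-Z][a-z]+ [A-Z][a-z]+')
regex_word = re.compile(r'[A-Z][a-z]+')

def extract_last_name(name):
    if regex_word_word.fullmatch(name):
        # "First Last": the second word is the last name.
        return name.split(' ')[1]
    elif regex_word_word_hyphen_word.fullmatch(name):
        # "First Part-Part": keep the hyphenated surname whole.
        return name.split(' ')[1]
    elif regex_word_word_word.fullmatch(name):
        # Assumption: the final word is the last name. This is a policy
        # choice and will not suit every three-word name.
        return name.split(' ')[2]
    elif regex_word.fullmatch(name):
        # Assumption: a single word carries no separable last name.
        return None
    else:
        # Outliers are flagged with None rather than guessed at.
        return None

for name in ["Laia Sanchis", "Dabir Al-Zuhairi", "Yue Ying Jen", "Madonna", "N/A"]:
    print(f"{name!r} -> {extract_last_name(name)!r}")
# 'Laia Sanchis' -> 'Sanchis'
# 'Dabir Al-Zuhairi' -> 'Al-Zuhairi'
# 'Yue Ying Jen' -> 'Jen'
# 'Madonna' -> None
# 'N/A' -> None
```

Returning `None` for the last two branches makes unhandled inputs visible to the caller instead of passing through a plausible-looking but wrong string.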