Text Extraction

Extraction.Text extracts data from semi-structured text files using examples. The Usage page and the Sample project illustrate the API usage.

Extraction.Text supports two kinds of extraction:

  1. Extraction.Text.Region extracts a substring from an input string.
  2. Extraction.Text.Sequence extracts a sequence of substrings from an input string.

Read more: “FlashExtract: A Framework for Data Extraction by Examples”

Substring Extraction

From input/output example(s) in the form of <input string, substring of input string>, Extraction.Text.Region learns a program to extract a substring from an input string. The program can be run on new input strings to obtain new output substrings.

For instance, given this example:

Input              Example output
-----              --------------
Carrie Dodson 100  Dodson

Extraction.Text.Region generates a program to extract the last name in similar strings such as the one below:

Input               Program output
-----               --------------
Leonard Robledo 75  Robledo
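
To make the idea concrete, here is a minimal Python sketch of the kind of program that could be learned from the example above. It is a hand-written stand-in, not the Extraction.Text.Region API and not the program the learner actually produces; the name `extract_last_name` and the regular expression are assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hand-written stand-in for a learned program (assumption, not the real API):
# in a "<first> <last> <number>" string, take the second token.
LAST_NAME = re.compile(r"^\s*\S+\s+(\S+)\s+\d+\s*$")


def extract_last_name(text: str) -> Optional[str]:
    """Return the last name, or None when the input has a different shape."""
    match = LAST_NAME.match(text)
    return match.group(1) if match else None


print(extract_last_name("Carrie Dodson 100"))   # the training input
print(extract_last_name("Leonard Robledo 75"))  # a new input
```

The point of the sketch is that the program is derived once from the example and then applied unchanged to new inputs of a similar shape.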

Sequence Extraction

From input/output example(s) in the form of <input string, subsequence of the intended sequence>, Extraction.Text.Sequence learns a program to extract a complete sequence of substrings from an input string. The program can be run on the same training input string to obtain the complete output sequence, or on new input strings to obtain new output sequences.

For instance, given this example that contains a subsequence of the intended sequence of first names:

Input                  Example output
-----                  --------------
United States
 Carrie Dodson 100     Carrie
 Leonard Robledo 75    Leonard
 Margaret Cook 320
Canada
 Concetta Beck 350
 Nicholas Sayers 90
 Francis Terrill 2430
Great Britain
 Nettie Pope 50
 Mack Beeson 1070

Extraction.Text.Sequence generates a program to extract the sequence of all first names:

Input                  Program output
-----                  --------------
United States
 Carrie Dodson 100     Carrie
 Leonard Robledo 75    Leonard
 Margaret Cook 320     Margaret
Canada
 Concetta Beck 350     Concetta
 Nicholas Sayers 90    Nicholas
 Francis Terrill 2430  Francis
Great Britain
 Nettie Pope 50        Nettie
 Mack Beeson 1070      Mack
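
As a rough illustration, the following Python sketch shows a hand-written equivalent of a sequence program for this document. It is not the Extraction.Text.Sequence API; `extract_first_names` and the pattern "first word on each indented line" are assumptions about what a program consistent with the two examples might look like.

```python
import re

DOCUMENT = """United States
 Carrie Dodson 100
 Leonard Robledo 75
 Margaret Cook 320
Canada
 Concetta Beck 350
 Nicholas Sayers 90
 Francis Terrill 2430
Great Britain
 Nettie Pope 50
 Mack Beeson 1070
"""

# Stand-in for a learned sequence program (assumption, not the real API):
# the first word of every indented line is a first name.
FIRST_NAME = re.compile(r"^[ \t]+(\S+)", re.MULTILINE)


def extract_first_names(text: str) -> list:
    """Return every first name in document order."""
    return FIRST_NAME.findall(text)


for name in extract_first_names(DOCUMENT):
    print(name)
```

Two examples (Carrie, Leonard) are enough to suggest the pattern here; the country lines are skipped because they are not indented.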

Applications

Clients may use these two APIs directly, or combine them to extract nested or hierarchical data (i.e., trees) from documents. Below are several real-world uses of Extraction.Text.

Operations Management Suite (OMS)

The Custom Fields feature in OMS allows users to create a new custom field based on an existing field in the log. Custom Fields uses the Extraction.Text.Region API because each cell in the new field is a substring of a cell in the existing field.

Users highlight, in one cell of the source field, the text from which they want to create the new field.

The UI calls the Extraction.Text.Region API to learn a substring extraction program, and populates other cells of the new field. Users may give more examples if necessary.

Read more: “Create your own fields in OMS with Custom Fields!”

The Custom Log feature in OMS allows users to split the input text stream into records, which can be split further into fields using Custom Fields.

PowerShell ConvertFrom-String

ConvertFrom-String allows users to extract hierarchical data from a document using an example template, which is a sample of the complete document.

Users mark the fields to extract in the template using pairs of curly brackets { }. The following template extracts three fields, each of which is given two examples.

{[string]Name*:Phoebe Cat}, {[string]phone:425-123-6789}, {[int]age:6}
{[string]Name*:Lucky Shot}, {[string]phone:(206) 987-4321}, {[int]age:12}

ConvertFrom-String learns the fields one by one using one of the two Extraction.Text APIs. The fields are learned in document order. That is, a field that appears first in the document is learned first.

While learning a field, ConvertFrom-String uses one of the already learned fields as a reference. Depending on the nature of the field, it learns a substring program or a sequence program.

In the above template, ConvertFrom-String learns a sequence of Name (also indicated by the * next to Name), a substring of phone based on Name, and a substring of age also based on Name. Note that age can also be learned with respect to phone.
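
The field-by-field composition described above can be pictured with a small Python sketch. It is only an analogy under stated assumptions: the regular expressions are hand-written rather than learned, `extract_records` is a hypothetical name, and nothing here reflects how ConvertFrom-String is implemented. Name is extracted as a sequence, and phone and age are then extracted as substrings located relative to each Name.

```python
import re

TEXT = """Phoebe Cat, 425-123-6789, 6
Lucky Shot, (206) 987-4321, 12
Elephant Wise, 425-888-7766, 87
Wild Shrimp, (111)  222-3333, 1
"""

# Hand-written stand-ins for learned programs (assumptions for illustration).
NAME = re.compile(r"^[^,\n]+", re.MULTILINE)          # sequence: one Name per line
REST = re.compile(r",\s*([^,]+),\s*(\d+)\s*$")        # phone and age, relative to Name


def extract_records(text: str) -> list:
    """Extract Name as a sequence, then phone and age relative to each Name."""
    records = []
    for name in NAME.finditer(text):
        # The region that follows this Name, up to the end of its line.
        line_end = text.find("\n", name.end())
        if line_end == -1:
            line_end = len(text)
        rest = REST.match(text[name.end():line_end])
        if rest is None:
            continue  # the line does not have the expected shape
        records.append({
            "Name": name.group(0),
            "phone": rest.group(1),
            "age": int(rest.group(2)),
        })
    return records


for record in extract_records(TEXT):
    print(record)
```

Using Name as the reference means phone and age only have to be located within the text that follows each Name, which is what makes the extraction hierarchical rather than flat.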

Now we can pass the template to ConvertFrom-String to extract nested data from the complete input document.

$template = @'
{[string]Name*:Phoebe Cat}, {[string]phone:425-123-6789}, {[int]age:6}
{[string]Name*:Lucky Shot}, {[string]phone:(206) 987-4321}, {[int]age:12}
'@

$testText = @'
Phoebe Cat, 425-123-6789, 6
Lucky Shot, (206) 987-4321, 12
Elephant Wise, 425-888-7766, 87
Wild Shrimp, (111)  222-3333, 1
'@

$testText  |
    ConvertFrom-String -TemplateContent $template -OutVariable PersonalData | Out-Null

Write-Output ("Pet items found: " + ($PersonalData.Count))
$PersonalData

Output:

Pet items found: 4

Name          phone           age
----          -----           ---
Phoebe Cat    425-123-6789      6
Lucky Shot    (206) 987-4321   12
Elephant Wise 425-888-7766     87
Wild Shrimp   (111)  222-3333   1

Read more: ConvertFrom-String: Example-based text parsing.

Prose Playground

In Prose Playground, users extract hierarchical data by highlighting various fields using colors.

Although learning in Playground is similar to that of ConvertFrom-String (fields are learned in document order and reference one another), it is more complicated due to its interactive nature.

Because users can give only one example at each step, most of the existing fields are not affected. This allows Playground to cache most of the learning results from the previous step. However, since users can give an example for any field, all fields that depend on that field are affected. Playground has to traverse the field dependency graph to relearn the affected fields, if necessary.
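
The traversal described above can be sketched as a breadth-first walk over the dependency graph. This is a minimal illustration, not Playground's implementation: the function name `affected_fields`, the `references` dictionary (each field mapped to the field it uses as a reference), and the choice to make age reference phone are all assumptions.

```python
from collections import deque


def affected_fields(references: dict, changed: str) -> list:
    """Return `changed` followed by every field that depends on it,
    directly or transitively, so each field comes after its reference."""
    # Invert "field -> its reference" into "reference -> fields that use it".
    dependents = {}
    for field, reference in references.items():
        if reference is not None:
            dependents.setdefault(reference, []).append(field)

    order = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        field = queue.popleft()
        order.append(field)
        for dependent in dependents.get(field, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return order


# Hypothetical layout: phone references Name, and age references phone.
references = {"Name": None, "phone": "Name", "age": "phone"}
print(affected_fields(references, "phone"))
print(affected_fields(references, "Name"))
```

Fields that are not returned keep their cached programs; only the returned fields would need to be relearned, in the order given.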

Read more: User Interaction Models for Disambiguation in Programming by Example