Extract Text Patterns
In this tutorial, you will learn how to use RATH to discover and extract text patterns from your data source.
Traditionally, text patterns are handled in one of two ways:
- Manually identify and extract the patterns based on experience and insight.
- Design a suitable algorithm or regular expression for the task, which can be time-consuming.
RATH offers a smart text pattern discovery and extraction feature that can accurately identify matching text patterns based on your intent, and automatically extract them.
Prerequisites
Text pattern discovery and extraction are available on the Data Source tab. Import data from your preferred data source to use the feature.
Discover and extract patterns from text
Case 1: Basic text extraction
In this case, we extract a substring (for example, 2011) from the date field by highlighting the text. RATH highlights every occurrence of 2011 and suggests related regular expressions on the right side of the screen.
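For comparison, the manual equivalent of this case can be sketched with Python's standard re module. This is only an illustration: the sample values, the helper extract_first, and both patterns are assumptions made for this sketch, not the expressions RATH actually suggests.

```python
import re

# Hypothetical sample values; the tutorial's actual date column is not shown.
dates = ["2011-03-15", "2012-07-01", "05/20/2011", "2010-12-31"]

def extract_first(pattern, values):
    """Return the first match in each value, or None when nothing matches."""
    results = []
    for value in values:
        match = pattern.search(value)
        results.append(match.group(0) if match else None)
    return results

# Literal pattern: matches only the highlighted text "2011".
literal = re.compile(r"2011")

# A more general candidate (assumed): any standalone four-digit number.
four_digits = re.compile(r"\b\d{4}\b")

only_2011 = extract_first(literal, dates)
any_year = extract_first(four_digits, dates)

print(only_2011)  # ['2011', None, '2011', None]
print(any_year)   # ['2011', '2012', '2011', '2010']
```

The two patterns show why a list of suggestions is useful: the same highlighted text is consistent with both a literal match and a broader "any year" match.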
Case 2: Extract texts based on intent
In this case, we attempt to extract every occurrence of the word University from the Name field.
- Select the word "University".
- RATH will infer that the last word of the text may be the desired result, and aggregate the extracted texts into a new column with distribution and statistics.
- To change this, select another occurrence of "University". RATH will understand your intention to match the word "University" exactly.
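The difference between the two interpretations in this case can be sketched with plain regular expressions. The sample names and both patterns below are assumptions chosen for illustration; they are not taken from RATH's output.

```python
import re

# Hypothetical values for the Name field.
names = [
    "Stanford University",
    "University of Oxford",
    "Boston College",
    "Harvard University",
]

def extract(pattern, values):
    """Return the first match in each value, or None when nothing matches."""
    return [m.group(0) if (m := pattern.search(v)) else None for v in values]

# Interpretation 1: "the last word of the text".
last_word = re.compile(r"\w+$")

# Interpretation 2: "the exact word University, wherever it appears".
exact_word = re.compile(r"\bUniversity\b")

by_position = extract(last_word, names)
by_word = extract(exact_word, names)

print(by_position)  # ['University', 'Oxford', 'College', 'University']
print(by_word)      # ['University', 'University', None, 'University']
```

A single selection at the end of "Stanford University" fits both patterns. A second selection at the start of "University of Oxford" fits only the exact-word pattern, which is why one more example is enough to disambiguate the intent.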
Case 3: Generalize intent
RATH can not only understand your intent for text extraction but also generalize it.
- In the "Titanic" dataset, which contains the names and other information of Titanic passengers, select the title and name (Mr. Owen Harris) of a passenger.
- Because some names are followed by additional information in brackets, such as "Mrs. John Bradley (Florence Briggs Thayer)", RATH cannot extract all the names from the first selection alone. Select one of the names that was missed, and RATH will generalize your intent and extract all the matching names.
- You can also extract the person's title (e.g. Mr., Miss., Mrs.). RATH will understand the intention, extract the information, and generate a new field displayed next to the original field.
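As a rough manual counterpart to this case, the sketch below writes the two extractions by hand with Python's re module. The sample rows follow the "Surname, Title. Given names (extra)" layout of the public Titanic dataset; the patterns and the helper first_group are assumptions for illustration, not what RATH generates internally.

```python
import re

# Sample rows in the layout used by the public Titanic dataset.
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]

def first_group(pattern, values):
    """Return capture group 1 for each value, or None when nothing matches."""
    return [m.group(1) if (m := pattern.search(v)) else None for v in values]

# Text after the comma, without any trailing bracketed note.
title_and_name = re.compile(r",\s*(.+?)(?:\s*\(.*\))?$")

# Only the title: the text between the comma and the first period.
title_only = re.compile(r",\s*([^.]+\.)")

people = first_group(title_and_name, names)
titles = first_group(title_only, names)

print(people)  # ['Mr. Owen Harris', 'Mrs. John Bradley', 'Miss. Laina']
print(titles)  # ['Mr.', 'Mrs.', 'Miss.']
```

The optional bracket group is the part that a single example such as "Mr. Owen Harris" cannot reveal, which matches the tutorial's point that a second selection is needed before the pattern generalizes.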
Best practices
- Text pattern discovery and extraction can be a useful alternative to SQL. SQL can identify and extract simple patterns with explicit expressions, but it cannot discover hidden text patterns on its own.