Extract Text Patterns

This tutorial shows how to use RATH to discover and extract text patterns from your data source.

Traditionally, working with text patterns means one of the following:

  • Manually identifying and extracting the patterns based on experience and insight.
  • Designing a suitable algorithm or regular expression for the operation, which can be time-consuming.

RATH offers a smart text pattern discovery and extraction feature that can accurately identify matching text patterns based on your intent, and automatically extract them.

Prerequisites

Text pattern discovery and extraction are available on the Data Source tab. Import data from your preferred data source to use this feature.

Discover and extract patterns from text

Case 1: Basic text extraction

In this case, we extract a substring (for example, 2011) from the date field by highlighting the text. RATH highlights every occurrence of "2011" and suggests related regular expressions on the right side of the screen.

Figure: Simple text extraction
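For comparison, a regular expression of the kind suggested in this case can also be written by hand. The following is a minimal sketch in Python, assuming a hypothetical date field in `YYYY-MM-DD` form. The sample values and the helper name `extract_year` are illustrative and are not taken from RATH.

```python
import re

# Hypothetical sample of a date field; the real dataset's format may differ.
dates = ["2011-01-05", "2011-03-17", "2012-07-30", "2010-12-31"]

# Match a four-digit year at the start of the string.
YEAR_PATTERN = re.compile(r"^(\d{4})")


def extract_year(value):
    """Return the leading four-digit year, or None if there is no match."""
    match = YEAR_PATTERN.match(value)
    return match.group(1) if match else None


# Extract the year from every row, then keep only the rows from 2011.
years = [extract_year(d) for d in dates]
rows_2011 = [d for d in dates if extract_year(d) == "2011"]

print(years)
print(rows_2011)
```

Writing and testing such an expression by hand is the time-consuming step that highlighting the text is meant to replace.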

Case 2: Extract texts based on intent

In this case, we extract every occurrence of the word "University" from the Name field.

  1. Select the word "University"

  2. RATH will infer that the last word of the text may be the desired result, and aggregate the extracted texts into a new column with distribution and statistics.

  3. To change this, select another occurrence of "University". RATH will understand that you intend to match the word "University" exactly.

Figure: Text pattern extraction - RATH can understand the intent
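The two interpretations in the steps above correspond to two different patterns. The sketch below, in Python, contrasts them on a few hypothetical Name values. The sample data, the pattern names, and the `extract` helper are assumptions for illustration and do not describe how RATH works internally.

```python
import re

# Hypothetical values for the Name field; not taken from the tutorial's dataset.
names = [
    "Stanford University",
    "University of Oxford",
    "Massachusetts Institute of Technology",
]

# First guess after one selection: the last word of the text.
LAST_WORD = re.compile(r"(\w+)$")
# Refined guess after a second selection: the literal word "University".
EXACT_WORD = re.compile(r"\bUniversity\b")


def extract(pattern, value):
    """Return the first match of pattern in value, or None."""
    match = pattern.search(value)
    return match.group(0) if match else None


last_words = [extract(LAST_WORD, n) for n in names]
exact_words = [extract(EXACT_WORD, n) for n in names]

print(last_words)   # the last word of each name
print(exact_words)  # "University" where present, otherwise None
```

A single selection at the end of a name is consistent with both patterns, which is why a second selection elsewhere in the text is needed to tell them apart.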

Case 3: Generalize intent

RATH can not only understand your intent for text extraction but also generalize it.

  1. In the "Titanic" dataset, which contains the names and other information of Titanic passengers, select the title and given names (Mr. Owen Harris) of a passenger.

  2. Because some names are followed by additional information in brackets, such as "Mrs. John Bradley (Florence Briggs Thayer)", RATH cannot extract all the names at first. Select one of the names that was not extracted, and RATH will generalize your intent and extract all the matching names.

  3. You can also select a passenger's title (for example, Mr., Miss., or Mrs.). RATH extracts the titles and generates a new field displayed next to the original field.

Figure: Text pattern extraction - Generalization of the intent
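The generalization in these steps can be approximated by hand with two regular expressions. The following Python sketch assumes the Name field uses the "Surname, Title. Given names (extra information)" layout. The patterns and the `first_group` helper are illustrative assumptions, not the expressions RATH generates.

```python
import re

# Sample values in the layout the Titanic Name field is assumed to use here.
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]

# Title only: the word ending in a period that follows the comma.
TITLE = re.compile(r",\s*([A-Za-z]+\.)")
# Title plus the name after it, stopping before any bracketed information.
TITLE_AND_NAME = re.compile(r",\s*([A-Za-z]+\.\s+[^()]+?)\s*(?:\(|$)")


def first_group(pattern, value):
    """Return the first capture group of pattern in value, or None."""
    match = pattern.search(value)
    return match.group(1) if match else None


titles = [first_group(TITLE, n) for n in names]
titles_and_names = [first_group(TITLE_AND_NAME, n) for n in names]

print(titles)
print(titles_and_names)
```

The second pattern has to account for the optional bracketed suffix explicitly, which is the case that a single selection does not cover.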

Best practices

  • Text pattern discovery and extraction can be a great alternative to SQL. SQL can identify and extract simple patterns with expressions, but it cannot discover hidden text patterns.