Transform Your Data
Data transformation prepares your raw data for analysis and modeling. It consists of four main steps that help keep your data accurate and reliable.
- Data Cleaning: This step involves fixing errors, inconsistencies, and missing values in your data.
- Data Filtering: This step lets you select only the data that is relevant to your analysis.
- Data Transformation: This step changes the format of your data so it's easier to work with.
- Data Sampling: This step involves selecting a smaller portion of your data to save time and resources.
By following these steps, you'll be working with higher-quality data, which in turn supports more accurate results from your analysis and modeling.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data from a dataset. Proper data cleaning can improve the quality of analysis.
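To make the definition concrete, here is a minimal sketch of two of the cleaning operations mentioned above: dropping incomplete rows and dropping duplicates. It uses only the Python standard library and is not RATH's implementation. The function name `clean_records`, the field names, and the sample rows are all hypothetical.

```python
def clean_records(records, required_fields):
    """Drop incomplete rows and duplicate rows, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in records:
        values = [row.get(field) for field in required_fields]
        # Treat None and empty or whitespace-only strings as missing values.
        if any(v is None or (isinstance(v, str) and not v.strip()) for v in values):
            continue
        # Rows that repeat the same required-field values count as duplicates.
        key = tuple(values)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sample data, for illustration only.
raw = [
    {"city": "Oslo", "temperature": 21.5},
    {"city": "Oslo", "temperature": 21.5},   # duplicate
    {"city": "Lima", "temperature": None},   # incomplete
    {"city": "Cairo", "temperature": 33.0},
]
cleaned = clean_records(raw, ["city", "temperature"])
```

In this sketch a row is a duplicate when its required fields match an earlier row. A stricter or looser rule may suit your data better.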
Before using RATH for data cleaning, make sure your datasets use standardized data formats, including:
- DateTime Data: must be standardized as
- Numerical Data: should be correct. For example, consider a dataset for supermarket sales records. The sales data should be standardized as
To use RATH for data cleaning, simply import your data from a data source. RATH can automatically clean your data.
You can also choose an option from the Clean Method drop-down menu on the Data Source tab.
Select the option that matches your requirements to proceed.
You can also filter your data with RATH. Switch to the Meta view and click the "Filter" button on the field you want to filter.
Enable the filter and select a range or a set of values. For example, you can select only the records whose temperature is between 20 and 30 degrees.
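Outside the UI, a range filter like the temperature one above can be sketched in a few lines of Python. This is an illustration rather than RATH's code. The helper `filter_by_range` and the sample readings are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def filter_by_range(records, field, low, high):
    """Keep rows whose `field` value lies in the closed interval [low, high]."""
    return [
        row for row in records
        if row.get(field) is not None and low <= row[field] <= high
    ]


# Hypothetical sample data, for illustration only.
readings = [
    {"station": "A", "temperature": 18.0},
    {"station": "B", "temperature": 20.0},
    {"station": "C", "temperature": 27.5},
    {"station": "D", "temperature": 31.2},
    {"station": "E", "temperature": None},
]
selected = filter_by_range(readings, "temperature", 20, 30)
```

Rows with a missing value in the filtered field are dropped rather than compared, which avoids a `TypeError` on `None`.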
If you only want to remove anomalies, click the Fast Selection button and use the fast filtering feature to keep the main body of the data. You can configure more details on the screen that follows.
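This page does not say which rule Fast Selection applies. As one common way to keep the main body of a numeric field, the sketch below uses the interquartile-range (IQR) rule. The rule itself, the helper name `keep_main_body`, the multiplier `k`, and the sample values are all assumptions made for illustration.

```python
import statistics


def keep_main_body(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical temperature readings with one obvious anomaly (95).
temperatures = [21, 22, 23, 24, 25, 22, 23, 95]
main_body = keep_main_body(temperatures)
```

A larger `k` keeps more of the tails. A smaller `k` trims more aggressively.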
On the Table or Meta view, select the Transforms option on a given field. RATH can automatically generate suggestions for data transformation.
For example, if you select a DateTime field, RATH will suggest grouping its values by units of time.
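Grouping DateTime values by a unit of time means extracting one component (such as the year or month) from each timestamp and aggregating on it. A minimal standard-library sketch follows. The helper `group_by_unit`, the supported units, and the ISO-style sample timestamps are illustrative assumptions, not RATH's behavior or its required format.

```python
from collections import Counter
from datetime import datetime


def group_by_unit(timestamps, unit):
    """Count timestamps per year, month, or weekday."""
    extractors = {
        "year": lambda ts: ts.year,
        "month": lambda ts: ts.month,
        "weekday": lambda ts: ts.weekday(),  # Monday is 0
    }
    extract = extractors[unit]
    return Counter(extract(ts) for ts in timestamps)


# Hypothetical timestamps, for illustration only.
stamps = [
    datetime.fromisoformat(s)
    for s in (
        "2023-01-15T08:30:00",
        "2023-01-20T12:00:00",
        "2023-02-01T09:45:00",
        "2024-02-10T17:10:00",
    )
]
by_month = group_by_unit(stamps, "month")
by_year = group_by_unit(stamps, "year")
```

Note that grouping by month alone merges February 2023 with February 2024. Group by year and month together if that distinction matters.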
For categorical variables, RATH will suggest using the One-hot Encoding algorithm.
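One-hot encoding replaces a categorical field with one 0/1 indicator column per category. The sketch below shows the idea in plain Python. The helper `one_hot_encode`, the `field_category` column naming, and the sample sales rows are hypothetical and do not reflect how RATH names its output fields.

```python
def one_hot_encode(records, field):
    """Replace `field` with one 0/1 indicator column per observed category."""
    categories = sorted({row[field] for row in records})
    encoded = []
    for row in records:
        # Copy every column except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != field}
        for category in categories:
            new_row[f"{field}_{category}"] = 1 if row[field] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data, for illustration only.
sales = [
    {"store": "north", "amount": 120},
    {"store": "south", "amount": 80},
    {"store": "north", "amount": 95},
]
encoded = one_hot_encode(sales, "store")
```

Each encoded row has exactly one indicator set to 1, so a field with many distinct categories produces many columns.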
If RATH detects potential anomalies in a field, it will suggest using the Isolation Forest algorithm.
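Isolation Forest scores a point by how few random splits are needed to separate it from the rest of the data: anomalies are isolated quickly, so they have short paths. The following is a deliberately simplified, one-dimensional sketch of that idea, not RATH's implementation and not a full Isolation Forest. It omits subsampling and the leaf-size correction used in the original algorithm, and the names `isolation_scores` and `_path_length` are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def _path_length(value, sample, rng, max_depth, depth=0):
    """Count the random splits needed to isolate `value` within `sample`."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    low, high = min(sample), max(sample)
    if low == high:
        return depth
    split = rng.uniform(low, high)
    # Follow the side of the split that contains the value being scored.
    if value < split:
        side = [v for v in sample if v < split]
    else:
        side = [v for v in sample if v >= split]
    return _path_length(value, side, rng, max_depth, depth + 1)


def isolation_scores(values, n_trees=100, seed=0):
    """Return one anomaly score per value; higher means more anomalous."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(n))
    # Average path length of an unsuccessful binary-search-tree lookup,
    # used to normalise depths as in the Isolation Forest paper.
    c = 2 * (math.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n
    scores = []
    for value in values:
        total = sum(_path_length(value, values, rng, max_depth) for _ in range(n_trees))
        scores.append(2 ** (-(total / n_trees) / c))
    return scores


# Hypothetical temperature readings with one obvious anomaly (95).
temperatures = [21, 22, 23, 24, 25, 22, 23, 95]
scores = isolation_scores(temperatures)
most_anomalous = temperatures[scores.index(max(scores))]
```

For real workloads, an established implementation such as scikit-learn's `IsolationForest` is a safer choice than this sketch.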
Data sampling is the process of selecting a representative portion of data from a larger dataset to draw inferences about the overall population. It enables efficient and effective exploration and analysis, reducing the amount of data to be processed while providing accurate insights.
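As a rough illustration of the idea, the sketch below draws a simple random sample without replacement using the Python standard library. It is only one of several sampling strategies and is not necessarily the one RATH applies. The helper `sample_rows`, the `fraction` and `seed` parameters, and the generated rows are hypothetical.

```python
import random


def sample_rows(records, fraction, seed=0):
    """Draw a simple random sample without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one row, but never more rows than exist.
    k = min(len(records), max(1, round(len(records) * fraction)))
    return random.Random(seed).sample(records, k)


# Hypothetical dataset of 1000 rows, for illustration only.
rows = [{"id": i, "temperature": 15 + i % 20} for i in range(1000)]
subset = sample_rows(rows, 0.1)
```

Fixing the seed makes the sample reproducible, which helps when comparing analyses run on the same subset.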
For more details about data sampling, refer to the related sections in the Connect your data chapter.