The Catalog
Data in RapidMiner
Data is critical for any data science project. The starting point is data, and the results may include enriched data.
RapidMiner provides the catalog as an easily accessible shared repository for both:
- uploaded data and
- generated data.
Depending on the need, access to a data file can be restricted to a limited number of users, or the file can be shared by multiple projects.
Global view / project view
The catalog provides global access to your data, but in fact there are two access points:
- Catalog
- Projects
Within the catalog, you can see any data that you have uploaded or to which you have access.
Note, however, that all work is done in the scope of a project, and the project has more limited data access. Each project provides a Data tab, where you can see the data accessible to that particular project.
Upload your data
Both in the catalog and in the Data tab of any project, you can upload new data files using the Add Data button. In either case, the data lands in the catalog.
Notice in the screenshot that the data set Churn-Go is owned by an individual user, whereas churn-sample is owned by a project. The reason is that churn-sample was generated from Churn-Go within a project called Churn9, and therefore Churn9 is the owner.
Supported formats
You can upload files of any data format to the catalog. Nevertheless, we distinguish between two different cases:
- HDF5: the native data file format of RapidMiner. You can find your HDF5 data files in RapidMiner Studio in the folder Documents/RapidMiner, with the extension rmhdf5table.
- Other: to use any other data format, such as CSV or Excel, you need to connect the input to the relevant operator in the workflow designer (e.g., Read CSV).
In practice, the difference is minor: a non-native format usually just means one extra step when developing your workflow.
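The distinction between the two cases can be expressed as a simple check on the file extension. The sketch below is illustrative Python, not part of RapidMiner: the function name needs_read_operator and the example file names are hypothetical, and comparing the extension case-insensitively is an assumption.

```python
from pathlib import Path

# Extension of RapidMiner's native HDF5 tables, as named in the text above.
NATIVE_EXTENSION = ".rmhdf5table"


def needs_read_operator(filename: str) -> bool:
    """Return True when the file is not in the native format and therefore
    needs a reading operator (e.g., Read CSV) in the workflow.

    Assumption: the extension is compared case-insensitively.
    """
    return Path(filename).suffix.lower() != NATIVE_EXTENSION


# Hypothetical file names, for illustration only.
for name in ("churn-sample.rmhdf5table", "Churn-Go.csv"):
    step = "extra read step" if needs_read_operator(name) else "use directly"
    print(f"{name}: {step}")
```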
Link to project
In order to do anything with the data, you must first link it to a project.
Does the project exist? If not, create the project.
Once you have uploaded the data, click on Link to Project and select a project.
If the data is generated inside a project, the data is automatically linked.
Data linked to a project is available to all project contributors and visible to all project viewers. You may link the same data file to multiple projects.
Organize your data
There are four elements that help you to organize and find your data.
Name: A search field allows you to filter by file name, by typing any substring. Search is the easiest way to locate a data file, if you know its name.
Tags: Each data file can have multiple tags, and you can use them to filter the set of files you want to see or exclude from the view.
Projects: Knowing which projects the data file is linked to helps you understand where it's used and what its potential dependencies are.
Filter type: The Filter type control narrows the view to specific file types, such as Excel or CSV.
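To make the combined effect of the four elements concrete, here is a minimal sketch of how such filters might compose. It is not RapidMiner code: the DataFile class, the filter_files function, and the sample entries (including their tags and file types) are made up for illustration. It also assumes that the name search ignores case and that a file must carry every included tag.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical catalog entry; not a RapidMiner type."""
    name: str
    file_type: str
    tags: set = field(default_factory=set)
    projects: set = field(default_factory=set)


def filter_files(files, name_part="", include_tags=(), exclude_tags=(),
                 project=None, file_type=None):
    """Keep the files that pass every filter that has been set."""
    kept = []
    for f in files:
        # Name: substring match (case-insensitivity is an assumption).
        if name_part.lower() not in f.name.lower():
            continue
        # Tags to see: the file must carry all of them (assumption).
        if not set(include_tags) <= f.tags:
            continue
        # Tags to exclude: any match hides the file.
        if set(exclude_tags) & f.tags:
            continue
        # Projects: keep only files linked to the given project.
        if project is not None and project not in f.projects:
            continue
        # Filter type: keep only files of the given type.
        if file_type is not None and f.file_type != file_type:
            continue
        kept.append(f)
    return kept


# Made-up entries: the file types and tags are illustrative only.
catalog = [
    DataFile("Churn-Go", "csv", {"raw"}, {"Churn9"}),
    DataFile("churn-sample", "hdf5", {"sample"}, {"Churn9"}),
    DataFile("sales-2020", "excel", {"raw"}, set()),
]

print([f.name for f in filter_files(catalog, name_part="churn")])
print([f.name for f in filter_files(catalog, exclude_tags={"raw"})])
```

Leaving a filter at its default means it is not applied, which mirrors a view in which only the filters you have set restrict the list.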
Details
Clicking the name of a data file takes you to its Details page.
Here you can:
- See the data table by selecting the Data tab.
- Plot the data by selecting the Chart tab.
- Add a Description to help other users understand the content of the data file.
- Add or remove Tags, for better organization.
- Link to projects that will have access to the data file.
- Set Permissions for users who will have access to the file, independent of projects.
Permissions
Note that if the data file is linked to a project and a user has access to the project, that user does not need additional permissions.
Comparing the screenshots above and below, you can see that any user who has access to the Churn9 project has access to both data sets, churn-sample and Churn-Go, but for different reasons:
- Churn-Go is explicitly linked to the Churn9 project, whereas
- the link to churn-sample is implicit, because Churn9 is the owner.
When explicit permission is required, because a user has no access to the project, data files can be shared as:
- Read (read only) or
- Write (read-write).
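The access rules described in this section can be summarized in a short sketch. This is a simplified model, not RapidMiner's implementation: the CatalogFile class, the access_reason function, and the user names alice, bob, and carol are hypothetical. It also assumes that user and project names share one namespace, and it ignores the distinction between project contributors and viewers.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogFile:
    """Hypothetical model of a data file in the catalog."""
    name: str
    owner: str                                         # a user or a project
    linked_projects: set = field(default_factory=set)  # explicit links
    permissions: dict = field(default_factory=dict)    # user -> "read" | "write"


def access_reason(file, user, user_projects):
    """Return why `user` can access `file`, or None if there is no access."""
    if file.owner == user:
        return "owner"
    if file.owner in user_projects:
        return "implicit link (project owns the file)"
    if file.linked_projects & user_projects:
        return "explicit link to project"
    if user in file.permissions:
        return f"permission: {file.permissions[user]}"
    return None


# Churn-Go is owned by a user and linked to Churn9; churn-sample is owned by
# Churn9 itself. The user names are made up for illustration.
churn_go = CatalogFile("Churn-Go", owner="alice", linked_projects={"Churn9"})
churn_sample = CatalogFile("churn-sample", owner="Churn9")

# bob has access to the Churn9 project, so no extra permission is needed.
print(access_reason(churn_go, "bob", {"Churn9"}))
print(access_reason(churn_sample, "bob", {"Churn9"}))

# carol has no access to the project, so the file is shared explicitly.
churn_go.permissions["carol"] = "read"
print(access_reason(churn_go, "carol", set()))
```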