API changes in RapidMiner 9.8
From ExampleSet to Belt Table
Forget about the ExampleSet class and start using com.rapidminer.belt.table.Table, RapidMiner's new representation of example sets. The corresponding framework is called Belt. It comes with several advantages compared to ExampleSet:
- Column-oriented design: a column-oriented data layout allows for using compact representations for the different column types.
- Immutability: all columns and tables are immutable. This not only guarantees data integrity but also allows for safely reusing components, e.g., multiple tables can safely reference the same column.
- Thread-safety: all public data structures are thread-safe and designed to perform well when used concurrently.
- Implicit parallelism: much of Belt's built-in functionality, such as the transformations shown in the examples below, automatically scales out to multiple cores.
To learn everything about the Belt framework, please refer to the official documentation of the Belt project.
This page focuses on the differences between the old example set and the new Belt framework and presents some examples of how to implement operators using the Belt framework and the Table class.
If you are new to extension development for RapidMiner Studio, then Create your own extension is a great starting point.
Sum operator example
Let's start with an example. We will create an operator that takes a table with only numeric columns, calculates the sum for each row and adds these row sums as a new column to the resulting table.
First of all, the doWork() method. You receive the input table by calling:
IOTable ioTable = tableInput.getData(IOTable.class);
Table table = ioTable.getTable();
You need not worry whether the actual data at the port is an IOTable or an ExampleSet, since RapidMiner will automatically convert it to the requested format. This makes the collaboration between new operators working on Tables and old operators working on ExampleSets easy.
Then, to make the code a little cleaner, we will outsource the actual work to the calculateSum method.
// read table, calculate sum and return new table
Table result = calculateSum(table);
Now deliver the resulting table to the output port.
IOTable newIOTable = new IOTable(result);
newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
tableOutput.deliver(newIOTable);
Since the Table class itself is not an IOObject, we need to wrap it with the IOTable class. It is also important to copy the annotations of the input IOTable to the new IOTable, because otherwise they will be lost.
Finally, it is good practice to also deliver the input table to an output port:
originalOutput.deliver(ioTable);
That's the doWork() method. Let's move on to implementing the calculateSum(Table table) method. First of all, check that the given table contains only numeric columns. The BeltErrorTools class holds some convenience methods for this kind of check.
BeltErrorTools.onlyNumeric(table, getName(), this);
Next, we will decide whether the result is of type real or integer. If any column is of type real, the result will also be of type real. The table provides a ColumnSelector that can be accessed via the select() method. A column selector can be used to filter the columns of a table via predicates. The default predicates filter by type, category, capability and meta data (e.g. roles). You can even define your own predicates for custom filter operations. The ofTypeId method does the trick:
boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();
Since the Column class is immutable, we need a column buffer to fill and instantiate a new column:
NumericBuffer buffer = resultIsReal ? Buffers.realBuffer(table.height()) : Buffers.integer53BitBuffer(table.height());
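The name integer53BitBuffer suggests why the row reader further below can return integer values as double: a double has a 53-bit significand, so consecutive integers remain distinguishable up to 2^53. The following stand-alone sketch uses only the Java standard library, not Belt; the class name Int53Demo and its members are made up for illustration, and the link to Belt's internal storage is an assumption rather than something this page states.

```java
final class Int53Demo {

    /** 2^53: beyond this magnitude, neighbouring integers start to share one double value. */
    static final long MAX_EXACT = 1L << 53;

    /** Returns true if both long values are converted to the very same double. */
    static boolean sameDouble(long a, long b) {
        return (double) a == (double) b;
    }

    public static void main(String[] args) {
        // below the limit, neighbouring integers stay distinct
        System.out.println(sameDouble(MAX_EXACT - 1, MAX_EXACT)); // false
        // above the limit, 2^53 and 2^53 + 1 collapse into one double
        System.out.println(sameDouble(MAX_EXACT, MAX_EXACT + 1)); // true
    }
}
```

In other words, summing integer columns as double values, as done in this example, is exact only as long as the sums stay within that range.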
Tables can be read column-wise or row-wise. In this case we want to read it row-wise so that we can calculate the sum for each row:
NumericRowReader reader = Readers.numericRowReader(table);
for (int i = 0; i < buffer.size(); i++) {
    // move must be called to advance the reader to the next row
    reader.move();
    double sum = 0;
    for (int j = 0; j < reader.width(); j++) {
        // reader.get(j) returns the value of the j-th column of the row
        sum += reader.get(j);
    }
    buffer.set(i, sum);
}
The move method advances the reader to the next row. Please note that it must be called before the first row is read.
We have calculated the row sums and filled them into the buffer. Next, copy the original table and add a new column to it. Since the Table class is immutable, we will use a table builder:
TableBuilder builder = Builders.newTableBuilder(table);
builder.add("Sum", buffer.toColumn());
Please note that the data stored in the buffer cannot be modified anymore after calling the toColumn method. Attempting to do so will lead to an exception.
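The fill-then-freeze behaviour of a buffer can be illustrated without Belt. The following is a minimal sketch using only the Java standard library. FreezableBuffer and FrozenColumn are hypothetical names, not Belt classes, and IllegalStateException is an assumed exception type, since the text above only says that an exception is thrown.

```java
final class FreezableBuffer {

    private final double[] data;
    private boolean frozen = false;

    FreezableBuffer(int size) {
        this.data = new double[size];
    }

    /** Writes a value; rejected once the buffer has been turned into a column. */
    void set(int index, double value) {
        if (frozen) {
            throw new IllegalStateException("buffer was already converted to a column");
        }
        data[index] = value;
    }

    /** Hands the array over to a read-only column without copying and blocks further writes. */
    FrozenColumn toColumn() {
        frozen = true;
        return new FrozenColumn(data);
    }

    public static void main(String[] args) {
        FreezableBuffer buffer = new FreezableBuffer(2);
        buffer.set(0, 1.5);
        buffer.set(1, 2.5);
        FrozenColumn column = buffer.toColumn();
        System.out.println(column.get(0) + column.get(1));
        try {
            buffer.set(0, 9.0);
        } catch (IllegalStateException e) {
            System.out.println("write rejected: " + e.getMessage());
        }
    }
}

/** Read-only view on the array filled by the buffer (hypothetical stand-in for a column). */
final class FrozenColumn {

    private final double[] data;

    FrozenColumn(double[] data) {
        this.data = data;
    }

    double get(int index) {
        return data[index];
    }

    int size() {
        return data.length;
    }
}
```

Because the column shares the array with the buffer instead of copying it, blocking later writes is what keeps the column immutable in this sketch.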
Nearly done! All that's left to do is to build and return the table. And this is where Belt's implicit parallelism comes into play. The build method takes the operator's context, which can be accessed via the BeltTools class, and runs the build process in parallel.
Table result = builder.build(BeltTools.getContext(this));
return result;
This concludes the example. Since, for now, ExampleSetMetaData will be used as the meta data class for Belt tables, we will not go through the meta data transformation in detail.
import com.rapidminer.adaption.belt.IOTable;
import com.rapidminer.belt.buffer.Buffers;
import com.rapidminer.belt.buffer.NumericBuffer;
import com.rapidminer.belt.column.Column;
import com.rapidminer.belt.reader.NumericRowReader;
import com.rapidminer.belt.reader.Readers;
import com.rapidminer.belt.table.Builders;
import com.rapidminer.belt.table.Table;
import com.rapidminer.belt.table.TableBuilder;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.UserError;
import com.rapidminer.operator.ports.InputPort;
import com.rapidminer.operator.ports.OutputPort;
import com.rapidminer.operator.ports.metadata.AttributeMetaData;
import com.rapidminer.operator.ports.metadata.ExampleSetMetaData;
import com.rapidminer.operator.ports.metadata.MetaData;
import com.rapidminer.operator.ports.metadata.MetaDataInfo;
import com.rapidminer.operator.ports.metadata.PassThroughRule;
import com.rapidminer.operator.ports.metadata.SimplePrecondition;
import com.rapidminer.tools.Ontology;
import com.rapidminer.tools.belt.BeltErrorTools;
import com.rapidminer.tools.belt.BeltTools;

/**
 * This operator takes a {@link Table} with only numeric columns, calculates the sum for each row
 * and adds it as a new column.
 *
 * @author Kevin Majchrzak
 * @since 9.8
 */
public class SumOperator extends Operator {

    private final InputPort tableInput = getInputPorts().createPort("example set input");
    private final OutputPort tableOutput = getOutputPorts().createPort("example set output");
    private final OutputPort originalOutput = getOutputPorts().createPort("original");

    public SumOperator(OperatorDescription description) {
        super(description);
        // we want example set meta data as input
        tableInput.addPrecondition(new SimplePrecondition(tableInput, new ExampleSetMetaData()));
        // pass through the original data
        getTransformer().addPassThroughRule(tableInput, originalOutput);
        // generate meta data for new table
        getTransformer().addRule(new PassThroughRule(tableInput, tableOutput, true) {

            @Override
            public MetaData modifyMetaData(MetaData metaData) {
                if (metaData instanceof ExampleSetMetaData) {
                    ExampleSetMetaData emd = (ExampleSetMetaData) metaData;
                    boolean resultIsReal = emd.containsAttributesWithValueType(Ontology.REAL, true) != MetaDataInfo.NO;
                    AttributeMetaData sumAttribute = resultIsReal
                            ? new AttributeMetaData("Sum", Ontology.REAL)
                            : new AttributeMetaData("Sum", Ontology.INTEGER);
                    emd.addAttribute(sumAttribute);
                }
                return metaData;
            }
        });
    }

    @Override
    public void doWork() throws OperatorException {
        // fetch table from input port
        IOTable ioTable = tableInput.getData(IOTable.class);
        Table table = ioTable.getTable();
        // read table, calculate sum and return new table
        Table result = calculateSum(table);
        // wrap the result into an IOTable
        IOTable newIOTable = new IOTable(result);
        // copy the annotations from the original IOTable
        newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
        // deliver the new IOTable to the port
        tableOutput.deliver(newIOTable);
        // deliver original table to corresponding port
        originalOutput.deliver(ioTable);
    }

    /**
     * Takes a {@link Table} with only numeric columns, calculates the sum for each row and adds it
     * as a new column.
     *
     * @param table
     *            the original table
     * @return a new table with the original columns and a sum column
     * @throws UserError
     *             if the table contains non-numeric columns
     */
    private Table calculateSum(Table table) throws UserError {
        // check that all columns are numeric
        BeltErrorTools.onlyNumeric(table, getName(), this);
        // If any column is of type real the result will be real. Otherwise, it will be integer.
        boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();
        // initialize numeric buffer needed to create sum column
        NumericBuffer buffer = resultIsReal
                ? Buffers.realBuffer(table.height())
                : Buffers.integer53BitBuffer(table.height());
        // read the table row-wise and store the sum of each row in the buffer
        NumericRowReader reader = Readers.numericRowReader(table);
        for (int i = 0; i < buffer.size(); i++) {
            // move must be called to advance the reader to the next row
            reader.move();
            double sum = 0;
            for (int j = 0; j < reader.width(); j++) {
                // reader.get(j) returns the value of the j-th column of the row
                sum += reader.get(j);
            }
            buffer.set(i, sum);
        }
        // copy original table using table builder
        TableBuilder builder = Builders.newTableBuilder(table);
        // add the new column to the builder
        builder.add("Sum", buffer.toColumn());
        // build the new table in parallel using the operator's context
        Table result = builder.build(BeltTools.getContext(this));
        return result;
    }
}
In this example you have seen how to fetch a table from a port and deliver it to one, how to read a table and process its data, how to create a new column using a buffer, and how to return a modified table using the TableBuilder class.
There are alternative ways to implement the operator, of course. Look, for example, at the following code:
private Table calculateSum(Table table) throws UserError {
    // check that all columns are numeric
    BeltErrorTools.onlyNumeric(table, getName(), this);
    // If any column is of type real the result will be real. Otherwise, it will be integer.
    boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();
    // this function will be applied in parallel to the table rows
    // (needs java.util.function.ToDoubleFunction and com.rapidminer.belt.reader.NumericRow)
    ToDoubleFunction<NumericRow> sumUpRow = row -> {
        double sum = 0;
        for (int j = 0; j < row.width(); j++) {
            sum += row.get(j);
        }
        return sum;
    };
    // the results will be collected in a numeric buffer
    NumericBuffer buffer;
    if (resultIsReal) {
        buffer = table.transform().applyNumericToReal(sumUpRow, BeltTools.getContext(this));
    } else {
        buffer = table.transform().applyNumericToInteger53Bit(sumUpRow, BeltTools.getContext(this));
    }
    // copy original table using table builder
    TableBuilder builder = Builders.newTableBuilder(table);
    // add the new column to the builder
    builder.add("Sum", buffer.toColumn());
    // build the new table in parallel using the operator's context
    Table result = builder.build(BeltTools.getContext(this));
    return result;
}
This code uses the Table's transform method and a row transformer to achieve the same result as the calculateSum method presented earlier. Details on the transform method can be found here. Using the transform method comes with the additional advantage that the summation potentially takes place in parallel. Belt once again makes use of the operator's context to automatically decide if and how to parallelize the computation.
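To make the idea of applying one function per row in parallel concrete, here is a stand-alone sketch that uses plain Java streams instead of Belt. It is not how Belt implements its transform method. ParallelRowSum, apply and sum are hypothetical names, and rows are modelled as a simple double[][] rather than a Table.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;
import java.util.stream.IntStream;

final class ParallelRowSum {

    /** Applies the row function to every row, possibly in parallel, keeping the row order. */
    static double[] apply(double[][] rows, ToDoubleFunction<double[]> rowFunction) {
        return IntStream.range(0, rows.length)
                .parallel()
                .mapToDouble(i -> rowFunction.applyAsDouble(rows[i]))
                .toArray();
    }

    /** Sums up the values of a single row. */
    static double sum(double[] row) {
        double sum = 0;
        for (double value : row) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] rows = { { 1, 2, 3 }, { 4.5, 0.5 } };
        System.out.println(Arrays.toString(apply(rows, ParallelRowSum::sum))); // [6.0, 5.0]
    }
}
```

The row function has no side effects and only looks at its own row, which is what makes it safe to run on several rows at once. The same holds for the sumUpRow function in the Belt code above.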
The next example shows how to use generators to fill columns and how to add meta data like, for example, roles to a table.
ID generator example
Next, let's implement an operator that takes a table and adds an ID column to it. Here is the code of its doWork() method:
@Override
public void doWork() throws OperatorException {
    // fetch table from input port and initialize builder
    IOTable ioTable = tableInput.getData(IOTable.class);
    Table table = ioTable.getTable();
    TableBuilder builder = Builders.newTableBuilder(table);
    // add id column via generator
    builder.addInt53Bit("ID", i -> i);
    // set column role
    builder.addMetaData("ID", ColumnRole.ID);
    // add annotations and deliver results
    Table result = builder.build(BeltTools.getContext(this));
    IOTable newIOTable = new IOTable(result);
    newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
    tableOutput.deliver(newIOTable);
    // deliver original table to corresponding port
    originalOutput.deliver(ioTable);
}
We fetch the input table and initialize the builder with it, just as we did before. Then we add the ID column via:
builder.addInt53Bit("ID", i -> i);
This line of code makes use of one of the table builder's convenience methods that takes a label and a generator and automatically fills the column. Furthermore, it does not fill the column straight away but does so later, when the build method is called. Thereby, the builder can fill all columns in parallel.
Let's take a closer look at the generator. For numeric column types it is represented by an IntToDoubleFunction. The generator consumes a row index and returns the value for that row. Our implementation returns the row index itself as the result and thereby generates IDs from 0 to the number of rows - 1. Similar generator methods for other column types are also available in the table builder.
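The generator contract can be tried out in isolation. IntToDoubleFunction is part of the Java standard library (java.util.function). The fill helper below is a hypothetical stand-in for what a builder might do at build time. It is a sequential sketch, not Belt code.

```java
import java.util.Arrays;
import java.util.function.IntToDoubleFunction;

final class GeneratorDemo {

    /** Calls the generator once per row index and stores the returned value for that row. */
    static double[] fill(int height, IntToDoubleFunction generator) {
        double[] column = new double[height];
        for (int row = 0; row < height; row++) {
            column[row] = generator.applyAsDouble(row);
        }
        return column;
    }

    public static void main(String[] args) {
        // same generator as in the ID example: the row index is the id
        IntToDoubleFunction idGenerator = i -> i;
        System.out.println(Arrays.toString(fill(4, idGenerator))); // [0.0, 1.0, 2.0, 3.0]
    }
}
```

Since the value for each row depends only on the row index, such a generator can be evaluated later and in any order, which fits the deferred, parallel filling described above.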
The next step is to set the column's role to ColumnRole.ID. The builder's addMetaData method takes a column label and the meta data to attach to the corresponding column. Since ColumnRole implements ColumnMetaData, it can be attached via this method.
Finally, the resulting table is wrapped into an IOTable, the annotations are copied and the table is delivered to the output port.
ColumnMetaData
ColumnMetaData represents additional information that can be attached to columns. The classes implementing the ColumnMetaData interface by default are:
- ColumnRole: represents the roles used in Studio to mark special columns like, for example, labels.
- ColumnAnnotation: a textual description of the column.
- ColumnReference: a reference to another column that is related to the column. An example would be a prediction column referencing the label column that it predicts.
Custom meta data can be added to the columns by implementing the ColumnMetaData interface.
Please note that column annotations and references are not visualized in RapidMiner Studio yet, but we plan on doing so in the near future.
Two important changes have been made to column roles. Firstly, roles need not be unique anymore. A table can have multiple label, prediction and even ID columns. This comes in handy, e.g., when working with learners that expect multiple labels. Secondly, in Belt the set of column roles is fixed to BATCH, CLUSTER, ID, LABEL, OUTLIER, PREDICTION, SCORE, WEIGHT, INTERPRETATION, ENCODING, SOURCE and METADATA. While the first eleven of them are the default roles, METADATA stands for anything other than the known roles. Columns marked as METADATA will usually be ignored by operators (e.g. when creating models). Legacy roles that do not exist in Belt will be mapped to METADATA.
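The fallback to METADATA can be pictured with a small stand-alone sketch. The enum Role below merely mirrors the twelve role names listed above. The helper fromLegacyName, including its case-insensitive matching, is a hypothetical illustration and not the conversion code RapidMiner actually uses.

```java
import java.util.Locale;

enum Role {
    BATCH, CLUSTER, ID, LABEL, OUTLIER, PREDICTION, SCORE, WEIGHT,
    INTERPRETATION, ENCODING, SOURCE, METADATA;

    /** Maps a non-null legacy role name to a fixed role; unknown names fall back to METADATA. */
    static Role fromLegacyName(String legacyName) {
        try {
            return valueOf(legacyName.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // not one of the known roles
            return METADATA;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromLegacyName("label"));          // LABEL
        System.out.println(fromLegacyName("my custom role")); // METADATA
    }
}
```

A fixed set like this lets operators handle every possible role explicitly, at the cost of losing the original name of a custom legacy role in this simplified sketch.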
Automatic conversion between Table and ExampleSet
A Table will be converted to an ExampleSet and vice versa, depending on the format in which the operator requests the data from a port. This conversion is done very efficiently, so that in most cases it will not impact the overall performance of a process.
Please note:
- Since ExampleSet expects roles to be unique, non-unique roles will have an index appended to their name when converting from Table to ExampleSet. When such a role is converted back at a later point in the process, the unnecessary index will automatically be removed.
- Attribute / column types will be mapped to the next best representation in the converted format. Some of the Belt column types do not have a representation in the old API. Therefore, attempting to deliver an IOTable holding column types not included in BeltConverter.STANDARD_TYPES will lead to an exception. This restriction may be removed in one of the future releases.
MetaData class for IOTables
For now, ExampleSetMetaData is the MetaData class used to describe IOTables at the operator ports. This works reasonably well because ExampleSet and Table both represent data tables and are conceptually similar. Nevertheless, in the near future we will release an IOTable-specific meta data class that can better represent the new Belt tables.