New Extension for Applied Onomastics (name recognition) on GitHub + help needed
Hi,
Last month we prototyped a RapidMiner integration with the NamSor GendRE API, to recognize the gender of names
http://namesorts.com/2014/04/23/rapidminer-to-enrich-gender-data/
using 'Enrich Data by Webservice'.
We've started building a custom extension to offer more functionality, but we're running into problems.
https://github.com/namsor/rapidminer-onomastics-extension
1) The firstName in the CSV output doesn't correspond to the input
2) The REAL value shows a rounded value instead of full precision (ignore the value itself; it is randomly generated)
3) We had to create a 'DummyOperator' with 'generate_extract', otherwise RM complains that the documentation is missing
Otherwise, the integration seems to work with RM 5.3.015; the operator appears under /Onomastics/Name2Gender
Any help welcome!
Thanks,
Elian
Input file:
firstName;lastName;countryIso2
Blas;PEREZ+HENRIQUEZ;
A.+Craig;COPETAS;
Abdel;AISSOU;
Abderrahman;BEDDI;
Achmad+Danny;GAZALI;
Ada;COLAU;
Adam;GREEN;
Adam+S.;POSEN;
Adeline;BRAESCU+KERLAN;
Aditya;GARG;
Adnan;BALI;
Adnane;EL+FASSI;
Adriaan;SMIT;
Adrian;MCGINN;
Adrián;MICHEL+ESPINO;
Adriana;VERDIER;
Adrien;REGNIER+LAURENT;fr
Adrien;SURU;
Илья;Ковальчук;ru
What we get in the output (genderScale is a random number):
"firstName";"lastName";"countryIso2";"genderScale";"gender"
"Blas";"PEREZ+HENRIQUEZ";;0.0;"Male"
"A.+Craig";"COPETAS";;1.0;"Female"
"Abdel";"AISSOU";;2.0;"Unknown"
"Blas";"BEDDI";;0.0;"Male"
"A.+Craig";"GAZALI";;1.0;"Female"
"Blas";"COLAU";;0.0;"Male"
"Abdel";"GREEN";;2.0;"Unknown"
"Blas";"POSEN";;0.0;"Male"
"Blas";"BRAESCU+KERLAN";;0.0;"Male"
"Blas";"GARG";;0.0;"Male"
"Abdel";"BALI";;2.0;"Unknown"
"A.+Craig";"EL+FASSI";;1.0;"Female"
"Blas";"SMIT";;0.0;"Male"
"A.+Craig";"MCGINN";;1.0;"Female"
"Abdel";"MICHEL+ESPINO";;2.0;"Unknown"
"Abdel";"VERDIER";;2.0;"Unknown"
"A.+Craig";"REGNIER+LAURENT";"fr";1.0;"Female"
"A.+Craig";"SURU";;1.0;"Female"
"Blas";"Ковальчук";"ru";0.0;"Male"
Answers
cool stuff 8)
1) I don't quite get the problem. What CSV output?
2) RapidMiner is by default rounding to 3 fraction digits when displaying data. You can change the default setting in the preferences under "General" -> "rapidminer.general.fractiondigits.numbers". When calculating, the actual numbers are used.
3) Not quite sure what that is about. Do you also get this warning in the console when you remove your extension? I don't think it has anything to do with it.
Regards,
Marco
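To illustrate point 2, here is a minimal sketch of the difference between the displayed value and the stored value. It uses plain Java only and does not touch RapidMiner's API; the class name FractionDigitsDemo and the helper display are hypothetical, and the sample number is made up.

```java
import java.util.Locale;

public class FractionDigitsDemo {
    // Formats a value for display with a fixed number of fraction digits.
    // The double itself is not modified; only its text form is shortened.
    static String display(double value, int fractionDigits) {
        return String.format(Locale.ROOT, "%." + fractionDigits + "f", value);
    }

    public static void main(String[] args) {
        double genderScale = -0.123456789;   // made-up sample value

        // What a viewer limited to 3 fraction digits would show.
        System.out.println("displayed: " + display(genderScale, 3)); // prints displayed: -0.123

        // The stored value keeps its full precision for calculations.
        System.out.println("stored:    " + genderScale);
    }
}
```

If a written file shows values such as 0.0, 1.0 and 2.0 rather than three-digit fractions, display rounding is probably not the cause.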
I've created a simple process loading data from an Excel file with
>firstName;lastName;countryIso2
>Blas;PEREZ+HENRIQUEZ;
>A.+Craig;COPETAS;
>Abdel;AISSOU;
Then I've connected this Import Excel operator with my custom Extension operator Name2Gender, and connected the output to a CSV file. Unfortunately, the output of my Extension operator seems completely mixed up, with the same firstName being repeated several times, incorrect numeric values, etc.
I think the problem comes from the way I pass parameters in and out in the doWork method:
@Override
public void doWork() throws OperatorException {
    ExampleSet exampleSet = inputSet.getData();
    Attributes attributes = exampleSet.getAttributes();
    Attribute fnAttribute = attributes.get(ATTRIBUTE_FN);
    Attribute lnAttribute = attributes.get(ATTRIBUTE_LN);
    Attribute iso2Attribute = attributes.get(ATTRIBUTE_ISO2);
    String mashapeAPIKey = getParameterAsString(MASHAPE_API_KEY);
    String defaultISO2 = getParameterAsString(DEFAULT_COUNTRY_ISO2);
    double threshold = getParameterAsDouble(ATTRIBUTE_THRESHOLD);
    Attribute genderScaleAttribute = AttributeFactory.createAttribute(
            ATTRIBUTE_GENDERSCALE, Ontology.REAL);
    genderScaleAttribute.setTableIndex(fnAttribute.getTableIndex());
    attributes.addRegular(genderScaleAttribute);
    Attribute genderAttribute = AttributeFactory.createAttribute(
            ATTRIBUTE_GENDER, Ontology.STRING);
    genderAttribute.setTableIndex(fnAttribute.getTableIndex());
    attributes.addRegular(genderAttribute);
    for (Example example : exampleSet) {
        String firstName = example.getValueAsString(fnAttribute);
        String lastName = example.getValueAsString(lnAttribute);
        String iso2 = example.getValueAsString(iso2Attribute);
        if (iso2 != null && iso2.trim().length() == 2) {
            // real value
        } else if (defaultISO2 != null && defaultISO2.trim().length() == 2) {
            iso2 = defaultISO2.trim();
        } else {
            // invalid value, set to null
            iso2 = null;
        }
        double genderScale = 0d;
        if (MOCKUP) {
            genderScale = RND.nextDouble() * 2 - 1;
        } else {
            // API stuff goes here
        }
        String gender = "Unknown";
        if (genderScale > threshold) {
            gender = "Female";
        } else if (genderScale < -threshold) {
            gender = "Male";
        }
        example.setValue(genderScaleAttribute, genderScale);
        example.setValue(genderAttribute, gender);
    }
    outputSet.deliver(exampleSet);
}
Any idea?
Thx,
Elian
The setTableIndex(fnAttribute.getTableIndex()) call seems dangerous. Generally speaking, you can only append new attribute columns on the right. Does removing said line fix your problem?
Regards,
Marco
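One plausible reading of the mixed-up output is that the new attributes share the firstName column, so writing a gender overwrites the stored firstName index. The sketch below models that with plain Java. It is a simplified stand-in, not RapidMiner code: the class NominalColumn and the idea that each nominal value is stored as a number pointing into a per-attribute list of strings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedColumnDemo {
    /** Hypothetical nominal attribute: strings are stored as indexes into a mapping. */
    static class NominalColumn {
        final int tableIndex;
        final List<String> mapping = new ArrayList<>();

        NominalColumn(int tableIndex) {
            this.tableIndex = tableIndex;
        }

        // Stores the mapping index of the string in this attribute's column.
        void set(double[] row, String value) {
            int idx = mapping.indexOf(value);
            if (idx < 0) {
                mapping.add(value);
                idx = mapping.size() - 1;
            }
            row[tableIndex] = idx;
        }

        // Reads the column and translates the number back through the mapping.
        String get(double[] row) {
            return mapping.get((int) row[tableIndex]);
        }
    }

    static List<String> run() {
        String[] names = {"Blas", "A.+Craig", "Abdel", "Abderrahman"};
        String[] genders = {"Male", "Female", "Unknown", "Male"};

        NominalColumn firstName = new NominalColumn(0);
        double[][] rows = new double[names.length][1];
        for (int i = 0; i < names.length; i++) {
            firstName.set(rows[i], names[i]);
        }

        // The new attribute reuses firstName's column instead of getting its own.
        NominalColumn gender = new NominalColumn(firstName.tableIndex);

        List<String> out = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            gender.set(rows[i], genders[i]); // overwrites the firstName index
            out.add(firstName.get(rows[i]) + ";" + gender.get(rows[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Fourth row comes back as "Blas;Male" although the input was "Abderrahman".
        run().forEach(System.out::println);
    }
}
```

In this model the fourth row reads "Blas" because "Male" maps to index 0, which firstName translates back to its first entry. That is the same pattern as the Blas/0.0/Male, A.+Craig/1.0/Female and Abdel/2.0/Unknown rows in the reported output.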
Without this call, I get an ArrayIndexOutOfBoundsException. I took this method from the "How-to-Extend-RapidMiner-5" documentation. Is there an updated document?
Thx in advance for your help,
Elian
SEVERE: java.lang.ArrayIndexOutOfBoundsException: -1
java.lang.ArrayIndexOutOfBoundsException: -1
    at com.rapidminer.example.table.DoubleArrayDataRow.set(DoubleArrayDataRow.java:61)
    at com.rapidminer.example.table.AbstractAttribute.setValue(AbstractAttribute.java:184)
    at com.rapidminer.example.table.DataRow.set(DataRow.java:85)
    at com.rapidminer.example.Example.setValue(Example.java:140)
    at com.namsor.api.rapidminer.Name2GenderOperator.doWork(Name2GenderOperator.java:160)
    at com.rapidminer.operator.Operator.execute(Operator.java:866)
    at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
    at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
    at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:375)
    at com.rapidminer.operator.Operator.execute(Operator.java:866)
The document will be updated; however, I cannot name a date yet.
Please use these calls to add new attributes to an existing ExampleSet.
Regards,
Marco
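The calls Marco refers to were not preserved in this thread. As a rough sketch of the underlying idea only, the plain-Java model below shows why a freshly created attribute fails with index -1 and how registering it with the table first appends a column on the right. The classes Column and Table are hypothetical stand-ins. In RapidMiner 5 the corresponding steps are, as far as I know, exampleSet.getExampleTable().addAttribute(attribute) followed by exampleSet.getAttributes().addRegular(attribute), but treat that as an assumption and check it against your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AppendColumnDemo {
    /** Hypothetical attribute: index -1 means "not yet attached to a table". */
    static class Column {
        final String name;
        int tableIndex = -1;

        Column(String name) {
            this.name = name;
        }
    }

    /** Hypothetical row-based table that stores every value as a double. */
    static class Table {
        final List<double[]> rows = new ArrayList<>();
        int width = 0;

        // Appends a column on the right, widens every row, assigns the index.
        void addColumn(Column c) {
            c.tableIndex = width++;
            for (int i = 0; i < rows.size(); i++) {
                rows.set(i, Arrays.copyOf(rows.get(i), width));
            }
        }

        void set(int row, Column c, double value) {
            rows.get(row)[c.tableIndex] = value;
        }
    }

    // Writing through an attribute that was never added to the table fails.
    static boolean failsWhenDetached() {
        Table table = new Table();
        table.rows.add(new double[1]);
        Column detached = new Column("genderScale");
        try {
            table.set(0, detached, 0.5);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true; // index -1, as in the stack trace above
        }
    }

    // Registers the new column first, then writes; the old column is untouched.
    static double[] appendAndWrite() {
        Table table = new Table();
        Column firstName = new Column("firstName");
        table.addColumn(firstName);
        table.rows.add(new double[] {0.0});

        Column genderScale = new Column("genderScale");
        table.addColumn(genderScale); // gets index 1, right of firstName
        table.set(0, genderScale, -0.42);
        return table.rows.get(0);
    }

    public static void main(String[] args) {
        System.out.println("detached write fails: " + failsWhenDetached());
        System.out.println("row after append: " + Arrays.toString(appendAndWrite()));
    }
}
```

The design point is that the table, not the caller, hands out the column index, so no two attributes end up pointing at the same column.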