"text mining Excel file"

rdmckinney Member Posts: 15 Maven
edited May 2019 in Help
I didn't find my topic with a search, so please redirect me if you have discussed this elsewhere. I have an Excel file with comments from members. I want to mine the comments as if each member/record is a document. I can get the Excel file into RapidMiner easily with ExcelExampleSource, but when I connect that to TextInput I get an error message: "Error in: TextInput (TextInput) The attribute 'text_source' does not exist. The example set does not contain an attribute with the given name." What should be my next step after the ExcelExampleSource?
Thanks!
Roger D. McKinney

Answers

  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hello Roger,

    use the [tt]StringTextInput[/tt] operator instead of the [tt]TextInput[/tt] operator.

    Kind regards,
    Tobias
  • Legacy User Member Posts: 0 Newbie
    I tried the StringTextInput operator, but the ExcelExampleSource operator doesn't allow me to designate a field as string, and the StringTextInput operator looks for a field designated as string. I finally just saved the Excel file as a tab-delimited file and imported it with the ExampleSource operator followed by the StringTextInput operator, and that works fine.
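For readers who want to see the tab-delimited workaround outside RapidMiner, here is a minimal Python sketch of the same idea: each record becomes one document, and the comment field is always read as a string. The column names `member_id` and `comment` and the sample rows are hypothetical, not taken from the original file.

```python
import csv
import io

# Hypothetical stand-in for the tab-delimited export of the Excel sheet.
SAMPLE_TSV = (
    "member_id\tcomment\n"
    "101\tGreat service, friendly staff\n"
    "102\t42\n"
)


def load_documents(handle, text_column="comment"):
    """Return one document (a plain string) per record."""
    reader = csv.DictReader(handle, delimiter="\t")
    # The csv module yields every field as str, so a numeric-looking
    # comment such as "42" is still treated as text.
    return [row[text_column].strip() for row in reader]


documents = load_documents(io.StringIO(SAMPLE_TSV))
print(documents)
```

With a real export, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding=...)`, where the encoding depends on how the file was saved.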
  • rdmckinney Member Posts: 15 Maven
    PS, For every document/example in my file, I get this error message: "[Warning] StringTextInput: Warning: Encoding unknown. Using default." Should I worry about this?
    Thanks!
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi,
    rdmckinney wrote:

    PS, For every document/example in my file, I get this error message: "[Warning] StringTextInput: Warning: Encoding unknown. Using default." Should I worry about this?
    Thanks!
    If the data is displayed correctly, you do not need to worry! ;)

    Btw.: the [tt]Nominal2String[/tt] operator converts nominal columns to string columns. That way, you could load the texts directly from the Excel file.

    Kind regards,
    Tobias
  • rdmckinney Member Posts: 15 Maven
    Thanks for the tip!

    I'm making progress :o I am running the following code and so far it has taken 43 minutes. Is that normal?
  • rdmckinney Member Posts: 15 Maven
    Sorry, I forgot to tell you that the input to the EM clustering operator has 600 examples and about 1,600 attributes. It's now up to 1 hr 9 minutes.
  • rdmckinney Member Posts: 15 Maven
    I stopped the EM clustering after about 2 hrs and substituted K-means. It ran in about 20 seconds, which is when I realized that it clusters examples and not attributes. I need to cluster attributes because each attribute is a word from a text mining problem. Is there an operator that will transpose a data set?
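To illustrate what transposing a data set means here, the following is a small Python sketch rather than a RapidMiner operator. The word list and counts are made up for illustration. After the transpose, each row is a word and each column is a comment, so a row-wise clustering method would group words instead of comments.

```python
# Hypothetical document-term counts: one row per comment, one column per word.
words = ["service", "price", "staff"]
doc_term = [
    [2, 0, 1],  # comment 1
    [0, 3, 0],  # comment 2
]

# zip(*rows) swaps rows and columns, giving one row per word.
term_doc = [list(column) for column in zip(*doc_term)]

for word, counts in zip(words, term_doc):
    print(word, counts)
```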
  • rdmckinney Member Posts: 15 Maven
    Never mind! I realized my mistake. I need to apply dimension reduction, such as principal components to the attributes, then cluster. Sorry!
  • rdmckinney Member Posts: 15 Maven
    Is there any way to re-run just one operator in a chain? The reason I ask is that I have a model that imports data from Excel, uses the stemmers, tokenizers and stopword filters to create a data set with StringTextInput, and then applies the GHA operator. Do I always need to have the program run through all operators each time, when all I really want to do is re-run the GHA operator with different settings? Thanks!
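One general way to avoid repeating a slow preprocessing stage is to store its output once and reload it on later runs. The Python sketch below shows that pattern only; it is not a description of how RapidMiner handles this. The function names `preprocess` and `cached_preprocess` are hypothetical, and the cache is deliberately naive: it is not refreshed when the input changes.

```python
import pickle
import tempfile
from pathlib import Path

calls = {"preprocess": 0}  # counts how often the slow step actually runs


def preprocess(raw_comments):
    """Stand-in for the slow import/tokenize/stem/stopword stage."""
    calls["preprocess"] += 1
    return [comment.lower().split() for comment in raw_comments]


def cached_preprocess(raw_comments, cache_path):
    """Run preprocess once; later calls reload the pickled result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = preprocess(raw_comments)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "tokens.pkl"
    first = cached_preprocess(["Great service", "Slow checkout"], cache)
    second = cached_preprocess(["Great service", "Slow checkout"], cache)

print(first == second, calls["preprocess"])
```

The second call returns the same tokens without running the slow step again, so only the downstream stage (here, the dimension reduction) would need to be repeated with new settings.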
  • rdmckinney Member Posts: 15 Maven
    I have a more serious issue now. I'm getting these messages: May 1, 2009 11:50:55 AM: [Fatal] ArrayIndexOutOfBoundsException occured in 1st application of GHA (GHA)
    May 1, 2009 11:50:55 AM: [Fatal] Process failed: operator cannot be executed (6). Check the log messages...

    Here's my code:
  • rdmckinney Member Posts: 15 Maven
    I need to add that in the problem above I am trying to reduce 1,600 variables to as few components as possible. If I choose -1 as my number of components, then the program works fine and creates 1,600 components. But if I try to limit the number of components to even 200, I get the error message.
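As a point of comparison for reducing many word attributes to a fixed number of components, here is a minimal Python sketch using plain SVD-based PCA with NumPy. It is not the GHA operator and does not diagnose the exception above. The function name `pca_reduce`, the random data, and the reduced matrix size are assumptions for illustration. One relevant fact: after centering, a matrix with n examples has at most n - 1 directions with nonzero variance, so the sketch rejects larger requests.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    # Centered data with n rows has rank at most n - 1.
    max_rank = min(X.shape[0] - 1, X.shape[1])
    if not 1 <= n_components <= max_rank:
        raise ValueError(f"n_components must be between 1 and {max_rank}")
    centered = X - X.mean(axis=0)
    # Thin SVD: rows of vt are the principal directions, ordered by variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
X = rng.random((60, 160))  # small random stand-in for the 600 x 1,600 matrix
scores, explained = pca_reduce(X, 20)
print(scores.shape)
```

The `explained` array holds the share of total variance captured by each kept component, which is one way to judge how few components are enough before clustering the reduced data.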