虽然训练示例集和真实数据集是相同的,但在应用模型时缺少属性

田纳西州田纳西州 成员职位:5新手
2019年11月编辑 帮助
你好,

我是新的rapidMiner,我试图分类飞YouTube评论创新产品到客户需求或不是。乐鱼官网手机版下载

这两个示例集应该与我使用来自培训数据的单词列表并将其应用到我想用来自数据操作符的流程文档进行分类的数据相同。在下面的图片中,你可以看到两个数据集的比较。



我使用RapidMiner自动模型创建一个支持向量机分类过程,然后我用这个过程存储模型。然后我使用以下过程:

<?xml version = " 1.0 " encoding = " utf - 8 " ?> <过程version = " 9.3.001”>
> <上下文
<输入/ >
<输出/ >
<宏>
宏观> <
<键>文本关键> < /
让我们试试非常简单的方法。我喜欢智能手表
宏观> < /
< /宏>
> < /上下文
<参数键= " logverbosity " value = " init " / >
<参数键= " random_seed " value = " 2001 " / >
<参数键= " send_mail " value = "永远" / >
<参数键= " notification_email“价值= " / >
<参数键= " process_duration_for_mail " value = " 30 " / >
<参数键=“编码”值= "系统" / >
<过程扩展= " true " >
<参数键= " repository_entry " value = " . . /数据/ PrepedYouTubeCommentData " / >
< /操作符>
<参数键= " repository_entry " value = " . . /结果/ SvmClassificationYComments " / >
< /操作符>
<列出关键= " application_parameters " / >
<参数键= " create_view " value = " false " / >
< /操作符>
<参数键= " attribute_filter_type " value = " numeric_value_filter " / >
<参数键= "属性" value = " / >
<参数键= "属性" value = " / >
<参数键= " use_except_expression " value = " false " / >
<参数键= " value_type " value = " attribute_value " / >
<参数键= " use_value_type_exception " value = " false " / >
<参数键= " except_value_type " value = "时间" / >
<参数键= " block_type " value = " attribute_block " / >
<参数键= " use_block_type_exception " value = " false " / >
<参数键= " except_block_type " value = " value_matrix_row_start " / >
<参数键= " numeric_condition " value = "祝辞0 " / >
<参数键= " invert_selection " value = " false " / >
<参数键= " include_special_attributes " value = " false " / >
< /操作符>
/> .
> < /过程
< /操作符>
> < /过程


然而,我总是得到这个错误:



我使用youTube上的教程和视频类VbNhvYQZ2v0以及rapidMiner学院的TextMining和机器学习课程来构建我的过程。

这个过程显示了我对训练数据的预处理:

<?xml version = " 1.0 " encoding = " utf - 8 " ?> <过程version = " 9.3.001”>
> <上下文
<输入/ >
<输出/ >
<宏/ >
> < /上下文
<参数键= " logverbosity " value = " init " / >
<参数键= " random_seed " value = " 2001 " / >
<参数键= " send_mail " value = "永远" / >
<参数键= " notification_email“价值= " / >
<参数键= " process_duration_for_mail " value = " 30 " / >
<参数键=“编码”值= "系统" / >
<过程扩展= " true " >
<参数键= " repository_entry " value = " . . /数据/ SampleDataYouTubeComments " / >
< /操作符>
<参数键= =“子集”/“attribute_filter_type”值>
<参数键= "属性" value = "评论" / >
<参数键= "属性" value = " |评论| product_name | product_type " / >
<参数键= " use_except_expression " value = " false " / >
<参数键= " value_type " value = "名义" / >
<参数键= " use_value_type_exception " value = " false " / >
<参数键= " except_value_type " value = " file_path " / >
<参数键= " block_type " value = " single_value " / >
<参数键= " use_block_type_exception " value = " false " / >
<参数键= " except_block_type " value = " single_value " / >
<参数键= " invert_selection " value = " false " / >
<参数键= " include_special_attributes " value = " false " / >
< /操作符>
<参数键= " create_word_vector " value = " true " / >
<参数键= " vector_creation " value = " TF-IDF " / >
<参数键= " add_meta_information " value = " true " / >
<参数键= " keep_text " value = " true " / >
<参数键= " prune_method " value = "绝对" / >
<参数键= " prune_below_percent " value = " 3.0 " / >
<参数键= " prune_above_percent " value = " 30.0 " / >
<参数键= " prune_below_absolute " value = " 2 " / >
<参数键= " prune_above_absolute " value = " 1000 " / >
<参数键= " prune_below_rank " value = " 0.05 " / >
<参数键= " prune_above_rank " value = " 0.95 " / >
<参数键= " datamanagement " value = " double_sparse_array " / >
<参数键= " data_management " value = "自动" / >
<参数键= " select_attributes_and_weights " value = " false " / >
<列出关键= " specify_weights " / >
<过程扩展= " true " >
.
<参数键= "字符" value = ": " / >
<参数键= "语言" value = "英语" / >
<参数键= " max_token_length " value = " 3 " / >
< /操作符>
< /操作符>
<参数键= " max_length " value = " 2 " / >
< /操作符>
<参数键= " min_chars " value = " 2 " / >
<参数键= " max_chars " value = " 25 " / >
< /操作符>
< portSpacing端口= " source_document”间隔= " 0 " / >
.
> < /过程
< /操作符>
<参数键= " repository_entry " value = " . . /结果/ WordlistForCR " / >
< /操作符>
<参数键= " attribute_filter_type " value = "单一" / >
.
<参数键= "属性" value = " / >
<参数键= " use_except_expression " value = " false " / >
<参数键= " value_type " value = "数字" / >
<参数键= " use_value_type_exception " value = " false " / >
<参数键=“except_value_type”值= "真正的" / >
<参数键= " block_type " value = " value_series " / >
<参数键= " use_block_type_exception " value = " false " / >
<参数键= " except_block_type " value = " value_series_end " / >
<参数键= " invert_selection " value = " false " / >
<参数键= " include_special_attributes " value = " false " / >
< /操作符>
<参数键= " attribute_filter_type " value = "单一" / >
.
<参数键= "属性" value = " / >
<参数键= " use_except_expression " value = " false " / >
<参数键= " value_type " value = " attribute_value " / >
<参数键= " use_value_type_exception " value = " false " / >
<参数键= " except_value_type " value = "时间" / >
<参数键= " block_type " value = " attribute_block " / >
<参数键= " use_block_type_exception " value = " false " / >
<参数键= " except_block_type " value = " value_matrix_row_start " / >
<参数键= " invert_selection " value = " false " / >
<参数键= " include_special_attributes " value = " false " / >
<列出关键= " value_mappings " >
<参数键= " 1 " value = " true " / >
<参数键= " 0 " value = " false " / >
< / >列表
<参数键= " consider_regular_expressions " value = " false " / >
<参数键= " add_default_mapping " value = " false " / >
< /操作符>
. class="set_role" compatibility="9.3.001
.
<参数键= " target_role " value = "标签" / >
<列出关键= " set_additional_roles " >
<参数键= " comment_id " value = " id " / >
< / >列表
< /操作符>
<参数键= " repository_entry " value = " . . /数据/ PrepedTrainingDataYouTubeComments " / >
< /操作符>
> < /过程
< /操作符>
> < /过程


这个过程显示了我想分类的数据的预处理:

<?xml version = " 1.0 " encoding = " utf - 8 " ?> <过程version = " 9.3.001”>
> <上下文
<输入/ >
<输出/ >
<宏/ >
> < /上下文
<参数键= " logverbosity " value = " init " / >
<参数键= " random_seed " value = " 2001 " / >
<参数键= " send_mail " value = "永远" / >
<参数键= " notification_email“价值= " / >
<参数键= " process_duration_for_mail " value = " 30 " / >
<参数键=“编码”值= "系统" / >
<过程扩展= " true " >
<参数键= " repository_entry " value = " . . /结果/ WordlistForCR " / >
< /操作符>
<参数键= " repository_entry " value = " . . /数据/ DataYouTubeComments " / >
< /操作符>
.
<参数键= "属性" value = " |评论| product_name | product_type " / >
<参数键= " use_except_expression " value = " false " / >
<参数键= " value_type " value = "名义" / >
<参数键= " use_value_type_exception " value = " false " / >
<参数键= " except_value_type " value = " file_path " / >
<参数键= " block_type " value = " single_value " / >
<参数键= " use_block_type_exception " value = " false " / >
<参数键= " except_block_type " value = " single_value " / >
<参数键= " invert_selection " value = " false " / >
<参数键= " include_special_attributes " value = " false " / >
< /操作符>
<参数键= " create_word_vector " value = " true " / >
<参数键= " vector_creation " value = " TF-IDF " / >
<参数键= " add_meta_information " value = " true " / >
<参数键= " keep_text " value = " true " / >
<参数键= " prune_method " value = "没有" / >
<参数键= " prune_below_percent " value = " 3.0 " / >
<参数键= " prune_above_percent " value = " 30.0 " / >
<参数键= " prune_below_rank " value = " 0.05 " / >
<参数键= " prune_above_rank " value = " 0.95 " / >
<参数键= " datamanagement " value = " double_sparse_array " / >
<参数键= " data_management " value = "自动" / >
<参数键= " select_attributes_and_weights " value = " false " / >
<列出关键= " specify_weights " / >
<过程扩展= " true " >
.
<参数键= "字符" value = ": " / >
<参数键= "语言" value = "英语" / >
<参数键= " max_token_length " value = " 3 " / >
< /操作符>
< /操作符>
<参数键= " max_length " value = " 2 " / >
< /操作符>
< portSpacing端口= " source_document”间隔= " 0 " / >
.
> < /过程
< /操作符>
. class="generate_empty_attribute" compatibility="9.3.001
.
<参数键= " value_type " value = "多项式" / >
< /操作符>
<参数键= " attribute_filter_type " value = "单一" / >
.
<参数键= "属性" value = " / >
<参数键= " use_except_expression " value = " false " / >
<参数键= " value_type " value = " attribute_value " / >
<参数键= " use_value_type_exception " value = " false " / >
<参数键= " except_value_type " value = "时间" / >
<参数键= " block_type " value = " attribute_block " / >
<参数键= " use_block_type_exception " value = " false " / >
<参数键= " except_block_type " value = " value_matrix_row_start " / >
<参数键= " invert_selection " value = " false " / >
<参数键= " include_special_attributes " value = " false " / >
<列出关键= " value_mappings " >
<参数键= " 1 " value = " true " / >
<参数键= " 0 " value = " false " / >
< / >列表
<参数键= " consider_regular_expressions " value = " false " / >
<参数键= " add_default_mapping " value = " false " / >
< /操作符>
.
<参数键= " target_role " value = "标签" / >
<列出关键= " set_additional_roles " >
<参数键= " comment_id " value = " id " / >
< / >列表
< /操作符>
<参数键= " repository_entry " value = " . . /数据/ PrepedYouTubeCommentData " / >
< /操作符>
/> .
> < /过程
< /操作符>
> < /过程

我还附上了100行样本数据。如果这个问题已经解决了(找不到对我的情况有用的东西),或者如果我犯了一些简单的错误,我道歉。

如果有人知道如何纠正英语拼写,(我已经尝试使用快速矿工社区发布的textblob python脚本。它会改变已经正确的单词,例如“Big”变成“Fig”)我也会非常感激。

提前谢谢,

田纳西州





标记:

最佳答案

  • 田纳西州田纳西州 成员职位:5新手
    解决方案接受
    我把支持向量机模型操作符从自动模型过程中复制到交叉验证中并创建了另一个模型。在这个模型中,我可以输入需要分类的数据,而不会得到错误消息。现在它工作得很顺利。但还是浪费了3天。

    因此,我建议如果您对automodler创建的模型有问题,将模型复制到交叉验证中。

答案

  • sgenzersgenzer 12管理员,版主,员工,RapidMiner认证分析师,社区管理员,会员,大学教授,PM版主职位:2959年社区经理
    @Tennessee好吧,我可以看看这个。你们的过程文件预处理与你们的培训和测试是不一样的,当然这是需要的。





    斯科特
  • Telcontar120Telcontar120 版主,RapidMiner认证分析师,RapidMiner认证专家,会员职位:1635年独角兽
    同时也要小心使用单词表。您确实需要存储来自最初模型构建过程的单词列表,然后确保在将来应用模型时使用相同的单词列表,否则您正在处理的文本中的差异可能导致不兼容的结果。

    布莱恩·T。
    Lindon合资企业
    乐鱼平台进入来自认证RapidMiner专家的数据科学咨询
    sgenzer Tghadially
  • 田纳西州田纳西州 成员职位:5新手
    预处理步骤不一样的原因是我在创建训练数据以创建模型时保存的单词列表。我使用这个词表作为操作员“从数据中处理文档”的输入,这允许我省去某些预处理步骤,根据YouTube上的快速矿工文本挖掘教程。另外,如果我没有对训练和实际数据使用单词列表和相同的预处理步骤,我将在准备好的表中得到不同的属性。你确定是这个问题吗?
  • 田纳西州田纳西州 成员职位:5新手
    如果您比较我的后两个进程,您可以看到我在第一个进程中存储了一个wordlist,并在第二个进程中使用它。另外,第一张图显示,我在两个例子中拥有相同数量的属性。这将是不可能的,如果我使用生成n克运算符在两个预处理步骤没有一个词表,不是吗?

    提前谢谢,

    田纳西州



  • 田纳西州田纳西州 成员职位:5新手
    我也尝试给模型提供训练数据,但我仍然得到相同的错误消息。

    我使用rapidminer automodell创建这个模型,它甚至不能除了自己的训练数据。好像出了什么问题。我将尝试手动创建模型。
    Tghadially
登录注册置评。