"Value series"

labrat · August 2009

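Hello,

I'm working on a project that uses RM, and I have a question about multi-value series. Let me explain.

Currently I'm trying to predict whether a string of amino acids is an epitope (E) or not (N). The strings all have a fixed length of 20, and there are several prediction models of varying accuracy. These models give a series of scores over a sliding window along the protein.

E.g.
Sequence (hexamer): ABCDEF
Model1: 1, 2, 3, 4, 5, 6, 7
Model2: 3, 5, 2, 4, 5, 6, 9
Model3: 6, 2, 2, 3, 1, 7, 6
Label: E

I'd like to know how to combine these scoring models in a support vector machine.

For example:

ID, [MODEL1], [MODEL2], [MODEL3], label
ID, [1, 2, 3, 4, 5, 6, 7], [3, 5, 2, 4, 5, 6, 9], [6, 2, 2, 3, 1, 7, 6], E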
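To make the layout above concrete, here is a minimal Python sketch of one way to flatten the per-model score series into a single row that an SVM could consume. The function name `flatten_example` and the dictionary layout are only illustrative assumptions, not anything RapidMiner provides.

```python
# Score series copied from the example above; the dict layout is an assumption.
model_scores = {
    "Model1": [1, 2, 3, 4, 5, 6, 7],
    "Model2": [3, 5, 2, 4, 5, 6, 9],
    "Model3": [6, 2, 2, 3, 1, 7, 6],
}


def flatten_example(example_id, scores_by_model, label):
    """Concatenate every model's scores into one row: id, features..., label."""
    row = [example_id]
    # Iterate in sorted order so the columns line up across all rows.
    for model_name in sorted(scores_by_model):
        row.extend(scores_by_model[model_name])
    row.append(label)
    return row


row = flatten_example(1, model_scores, "E")
print(row)
```

Each model contributes a contiguous run of columns, so the position of a column is the only thing that records which model a score came from.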

I hope that makes sense.

Stuart

PS:

The SVM I'll be using is the standard Xval-SVM from the wizard.

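land · August 2009

Hi Stuart,
If I understand you correctly, you want to learn from the predictions of three different learning algorithms? If so, you can use the meta-learning operator Stacking, with the SVM as its first inner operator, followed by an OperatorChain containing the three learning schemes that produce your current models. That should already do the trick.

Greetings,
Sebastian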

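labrat · August 2009

Hi Sebastian,

Thanks for the reply.

Basically it boils down to this...

There are several physicochemical properties that correlate with a string of amino acids being an epitope (E). To compute this, the protein is scanned with a sliding window of size 7 and a score is generated for each window. If the score is above an arbitrary threshold, an epitope is called.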
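As a rough illustration of the scan described above, here is a small Python sketch. The property values in `TOY_SCALE` are made up, and both averaging over the window and calling an epitope when any window exceeds the threshold are assumptions; the actual scoring methods may work differently.

```python
WINDOW = 7  # window size stated in the post

# Hypothetical per-residue property values; a real scale would cover
# all 20 amino acids.
TOY_SCALE = {"A": 0.5, "P": -0.2, "T": 0.1, "Q": 0.9, "G": 0.0}


def window_scores(sequence, scale, window=WINDOW):
    """Average the property value over each window along the sequence."""
    scores = []
    for start in range(len(sequence) - window + 1):
        residues = sequence[start:start + window]
        # Residues missing from the toy scale count as 0.0 (assumption).
        scores.append(sum(scale.get(r, 0.0) for r in residues) / window)
    return scores


def call_epitope(scores, threshold):
    """Return 'E' if any window score exceeds the threshold, else 'N'."""
    return "E" if max(scores) > threshold else "N"


scores = window_scores("APTQPPPAG", TOY_SCALE)
print(scores, call_epitope(scores, threshold=0.05))
```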

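Because I use several scoring metrics (which I've been calling models), I need to be able to load this data into the SVM and tell it where each score came from, whether model 1 or model 2.

What I have at the moment is...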
<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="all7.dat">
<id
name="id"
sourcecol="1"
valuetype="integer"/>

<attribute
name="ant1"
sourcecol="2"
valuetype="real"
blocktype="value_series_start"/>

<attribute
name="ant2"
sourcecol="3"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant3"
sourcecol="4"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant4"
sourcecol="5"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant5"
sourcecol="6"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant6"
sourcecol="7"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant7"
sourcecol="8"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant8"
sourcecol="9"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant9"
sourcecol="10"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant10"
sourcecol="11"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant11"
sourcecol="12"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant12"
sourcecol="13"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant13"
sourcecol="14"
valuetype="real"
blocktype="value_series"/>

<attribute
name="ant14"
sourcecol="15"
valuetype="real"
blocktype="value_series_end"/>

<attribute
name="asa1"
sourcecol="16"
valuetype="real"
blocktype="value_series_start"/>

<attribute
name="asa2"
sourcecol="17"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa3"
sourcecol="18"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa4"
sourcecol="19"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa5"
sourcecol="20"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa6"
sourcecol="21"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa7"
sourcecol="22"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa8"
sourcecol="23"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa9"
sourcecol="24"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa10"
sourcecol="25"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa11"
sourcecol="26"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa12"
sourcecol="27"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa13"
sourcecol="28"
valuetype="real"
blocktype="value_series"/>

<attribute
name="asa14"
sourcecol="29"
valuetype="real"
blocktype="value_series_end"/>

<!-- ... and so on for the remaining attributes ... -->

<attribute
name="paris"
sourcecol="72"
valuetype="integer"
blocktype="value_series"/>

<label
name="class"
sourcecol="73"
valuetype="nominal">
<value>E</value>
<value>N</value>
</label>

</attributeset>
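Writing dozens of near-identical attribute entries by hand is error-prone, so a description like the one above could be generated instead. The Python sketch below is only an illustration: the element and attribute names (`attributeset`, `sourcecol`, `blocktype` and so on) are copied from the post rather than checked against the RapidMiner documentation, and `build_attributeset` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_attributeset(source, series_prefixes, series_length=14):
    """Build an attribute description with one value series per prefix."""
    root = ET.Element("attributeset", default_source=source)
    ET.SubElement(root, "id", name="id", sourcecol="1", valuetype="integer")
    col = 2  # column 1 holds the id, as in the post
    for prefix in series_prefixes:
        for i in range(1, series_length + 1):
            # Mark the first and last attribute of each series explicitly.
            if i == 1:
                block = "value_series_start"
            elif i == series_length:
                block = "value_series_end"
            else:
                block = "value_series"
            ET.SubElement(root, "attribute", name=f"{prefix}{i}",
                          sourcecol=str(col), valuetype="real",
                          blocktype=block)
            col += 1
    # The label takes the first column after all series.
    label = ET.SubElement(root, "label", name="class",
                          sourcecol=str(col), valuetype="nominal")
    for value in ("E", "N"):
        ET.SubElement(label, "value").text = value
    return root


root = build_attributeset("all7.dat", ["ant", "asa"])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text[:120])
```

With only the two prefixes shown here the label lands in column 30; the full file in the post has more series, which is why its label sits in column 73.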

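land · August 2009

Hi,
Sorry, but I don't see the problem. Why can't you just load the data file (which you obviously already have in RapidMiner) and then apply the SVM? Am I missing something?

Greetings,
Sebastian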

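labrat · August 2009

I thought that would do the trick, but I've noticed that when I mark the columns as "series" in the data attribute screen, I still get the same results from my data as when I tell the SVM to treat all data points equally.

That seems odd.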

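land · August 2009

Hi Stuart,
Sorry, but I can't follow. I just don't see what is happening and where the problem lies. Everything I've seen so far seems to match the setup you need. Maybe you could paste your process? (Please use a code environment, available via the # icon!) Of course, you could send us both the process and the data and we would see to getting a correct result, but I think we would have to treat that as consulting.

Greetings,
Sebastian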

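labrat · August 2009

Hi Sebastian,

OK, now that I've had a little time to gather my thoughts (I was on a bit of a deadline there),

This is what I'm trying to do:

I'm trying to use an SVM to help classify whether a short peptide (in this case a 20-letter string) is likely to be an epitope (E) or not (N).

Several existing scoring methods have prediction accuracies of about 54-55%, and I'm trying to improve on them with a support vector machine.

What these scoring methods do is assign a score to a 7-letter window. The window slides along the short peptide (20 letters), producing 14 individual scores.

For example:
SEQ - APTQPPPAGTGDRLLNLVQG
Label - E
Scoring window index - scoring window sequence - score
1 - APTQPPP - 0.132
2 - PTQPPPA - 0.132
3 - TQPPPAG - -0.165
...
13 - RLLNLVQ - 1
14 - LLNLVQG - 1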
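The window positions in the example above follow directly from the arithmetic 20 - 7 + 1 = 14. A short Python sketch (the helper name `sliding_windows` is just illustrative) reproduces the index and subsequence columns; the scores themselves depend on the scoring method and are not computed here.

```python
def sliding_windows(peptide, window=7):
    """Return (1-based index, window sequence) pairs for every position."""
    return [(i + 1, peptide[i:i + window])
            for i in range(len(peptide) - window + 1)]


windows = sliding_windows("APTQPPPAGTGDRLLNLVQG")
for index, subsequence in windows:
    print(index, "-", subsequence)
```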

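Putting a single scoring method into the SVM is no problem, since I just dump it in as usual. However, I've started combining these scores to give the SVM more vectors, i.e. trying to classify with more scoring methods.

So I have an Excel sheet set up as follows:

ID, s1-1, s1-2, s1-3, ..., s1-13, s1-14, s2-1, s2-2, s2-3, ..., s2-13, s2-14, ..., sn-1, sn-2, ..., sn-13, sn-14, label

Where:

s denotes a score
s1 is scoring method 1
s1-12 is the score from scoring method 1 for window 12

At the moment n is about 7.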
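For reference, the naming scheme above can be generated and decoded mechanically. This Python sketch assumes exactly 7 methods and 14 windows per method; `column_names` and `parse_column` are hypothetical helpers, not part of RapidMiner or Excel.

```python
def column_names(n_methods, n_windows=14):
    """Build the header: ID, s1-1 ... s<n>-<n_windows>, label."""
    names = ["ID"]
    for method in range(1, n_methods + 1):
        names.extend(f"s{method}-{w}" for w in range(1, n_windows + 1))
    names.append("label")
    return names


def parse_column(name):
    """Split a name such as 's3-12' into (method, window) integers."""
    method, window = name[1:].split("-")
    return int(method), int(window)


header = column_names(7)
print(len(header), header[:4], header[-3:])
print(parse_column("s1-12"))
```

The column name is the only place where the method and window are recorded, which is why a learner that sees plain numeric columns has no built-in notion that s1-1 and s2-1 describe the same window.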

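So my original question is: how do I tell RapidMiner and the SVM that s1-1 and s2-1 are the same thing but slightly different? Or, to put it more clearly, how do I set up the data in RapidMiner to tell RM that these are independent scoring methods applied to the same data?

Right now I can get about 60% accuracy using consensus scoring without an SVM, and the best I've gotten out of the SVM is about 56%, which is why I think I'm doing something wrong here. In hindsight I probably should have looked at neural networks, but that's what a master's project is all about: the process rather than the result.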
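The post does not spell out how the consensus score is computed, so the following Python sketch shows just one possible reading: each method votes "E" if its best window score exceeds that method's threshold, and the majority wins. The function name, the thresholds and the example scores are all made up for illustration.

```python
def consensus_vote(method_scores, thresholds):
    """Majority vote over methods; a method votes 'E' if its best
    window score exceeds its own threshold (an assumed rule)."""
    votes = sum(
        1
        for scores, threshold in zip(method_scores, thresholds)
        if max(scores) > threshold
    )
    # Strict majority required; ties fall back to 'N'.
    return "E" if votes * 2 > len(method_scores) else "N"


# Hypothetical scores for three methods with two windows each.
example = [[0.1, 0.9], [0.2, 0.3], [0.8, 0.1]]
prediction = consensus_vote(example, thresholds=[0.5, 0.5, 0.5])
print(prediction)
```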

Thanks again.

Stuart
