Training data?

dianzi83

I have a question: in data mining, what is training data?

Thanks for your guidance.

TING

MSN: tingtinghuang868@hotmail.com

rtist

The data used to estimate the parameters.

dianzi83

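Hi, thanks for your help. I have another question I'd like to ask about a case assignment. The background: a supermarket is launching a new product line, organic food, and wants to know which customers are inclined to buy it. As an incentive for customers buying organic food for the first time, the supermarket gives out samples. We then work through a series of exercises in SAS. Here is the original English background: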

Predictive Data Mining

A supermarket is beginning to offer a line of organic products. The supermarket's management would like to determine which customers are likely to purchase these products. The supermarket has a customer loyalty program. As an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of their loyalty program participants and has now collected data that includes whether or not these customers have purchased any of the organic products. The data set ORGANICS contains over 22,000 observations and 18 variables. The variables in the data set are shown below.

CUSTID--Customer Loyalty Identification Number

GENDER--M=male, F=female, U=unknown

DOB--Date of birth

EDATE--Date extracted from the daily sales data base

AGE--Age, in years

AGEGRP1--Age group 1

AGEGRP2--Age group 2

TV_REG--Television Region

NGROUP--Neighborhood group

NEIGHBORHOOD--Type of residential neighborhood

LCDATE--Loyalty card application date

LTIME--Time as loyalty card member

ORGANICS--Number of organic products purchased

BILL--Total amount spent

REGION--Geographic region

CLASS--Customer loyalty status: tin, silver, gold, or platinum

ORGYN--Organics purchased? 1=Yes, 0=No

AFFL--Affluence grade on a scale from 1 to 30

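My question for everyone: the data set originally has two target variables, ORGANICS and ORGYN. This assignment should concentrate on the binary classification problem.

One of the assignment tasks is to "set the model role for the target variable and examine the distribution of the variable." What I'd like to ask is: there are two target variables here. Can one of them be dropped? Does anyone have any thoughts? I'd really appreciate any help. Many thanks.

P.S. This is the first major assignment in our data mining course at a university of technology in Australia, and I'm in the thick of it. Anyone interested can ask me for the original materials and the database.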

TING

tianwild

ORGANICS is a target variable too? So you need to predict its value?

In that case, I think you can do the binary classification first and then predict the quantity.

rtist

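Unless your instructor comes here and states that others may help you with this assignment, I won't answer homework questions, because doing so could amount to academic misconduct. Integrity always comes first.

On a matter of principle like this, I expect your university has explicit rules too? Better to get clear on the rules first.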

南田

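I was going to reply, but your post gave me pause. I don't know what the cheating policy there says. My understanding is that any group assignment should allow some latitude, as long as nothing is copied outright.

That being so, we can only discuss it at the margins. I read the whole post and still don't quite understand it. If the assignment is to "concentrate on the binary classification problem", why is there also a task to "set the model role for the target variable and examine the distribution of the variable"? If the task is the former, then of course one variable has to be dropped; but if it is the latter, you need to be careful. So a definite answer to your question requires knowing exactly which tasks your assignment includes.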
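To make the point about dropping a variable concrete, here is a minimal Python sketch. It rests on an assumption the thread does not confirm: that ORGYN is 1 exactly when ORGANICS is greater than 0, which is only what the variable descriptions suggest. The rows are invented for illustration and are not taken from the ORGANICS data set, and `add_binary_target` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical rows, invented for illustration only.
rows = [
    {"CUSTID": 1, "AFFL": 12, "ORGANICS": 0},
    {"CUSTID": 2, "AFFL": 25, "ORGANICS": 2},
    {"CUSTID": 3, "AFFL": 8, "ORGANICS": 0},
    {"CUSTID": 4, "AFFL": 19, "ORGANICS": 1},
    {"CUSTID": 5, "AFFL": 5, "ORGANICS": 0},
]


def add_binary_target(row):
    """Return a copy of the row with ORGYN derived from ORGANICS.

    Assumption: ORGYN is 1 exactly when at least one organic product was bought.
    """
    labelled = dict(row)
    labelled["ORGYN"] = 1 if row["ORGANICS"] > 0 else 0
    return labelled


labelled_rows = [add_binary_target(r) for r in rows]

# Examine the distribution of the binary target: counts and proportions.
counts = Counter(r["ORGYN"] for r in labelled_rows)
total = len(labelled_rows)
distribution = {label: counts[label] / total for label in sorted(counts)}
print("counts:", dict(counts))
print("proportions:", distribution)

# For the binary problem, keep ORGYN as the target and leave the ID and the
# count ORGANICS out of the inputs; under the assumption above, the count
# would give the answer away.
excluded = {"CUSTID", "ORGANICS", "ORGYN"}
inputs = [{k: v for k, v in r.items() if k not in excluded} for r in labelled_rows]
targets = [r["ORGYN"] for r in labelled_rows]
```

If the assumption holds, the count fully determines the binary flag, which is why one of the two would be set aside when the other is the target.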

rtist

Generally, the instructor's syllabus states this explicitly: what level of discussion is allowed and how much outside help you may get.

dianzi83

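Thank you all for your replies, help, and discussion. I probably gave the impression that I wanted you to do my homework for me. I'm sorry; the way I put things led to a misunderstanding. If my only purpose in coming here were the kind of cheating you are discussing, I don't think I would have the courage to keep coming to this forum or to study in this unfamiliar country. My aim is to understand the material, and also to share information such as the course materials used here. If I have left a bad impression, I am very sad about it, especially because of that line, "unless your instructor comes here...". But since you reacted this way, I should reflect on myself too. In short, I have things to learn from you: anyone who understands what I don't is my teacher!

I wish you all happiness!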

Ting

P.S. I'm working on a major assignment tonight, so I can't discuss DM. I don't know whether a post like this can survive on this forum.

yihui

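I didn't read the problem closely, but I get the feeling the original poster's wording really was the issue and led everyone to misunderstand.

So, back to the topic. In DM the data is usually split into training data and testing data. The first part is used to "train" the model, which, put plainly, is the "estimating the model parameters" answer you got in the first reply. The second part is used to check whether your model generalizes well. I wrote up some of my thoughts on this at http://cos.name/bbs/read.php?tid=7030, for reference.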
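As a rough illustration of the split described above, here is a minimal Python sketch using only the standard library. The data is synthetic (a noisy line), the 70/30 split ratio is an arbitrary choice, and the helper names `train_test_split`, `fit_line`, and `mean_squared_error` are made up for this example; none of it comes from the assignment.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into (training, testing)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Estimate slope and intercept by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(pairs, slope, intercept):
    """Average squared prediction error of the fitted line on the pairs."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pairs) / len(pairs)


# Synthetic data: y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
data = [(float(x), 2.0 * x + 1.0 + noise.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(data, test_fraction=0.3, seed=1)

# The parameters are estimated from the training data only.
slope, intercept = fit_line(train)

# The testing data was not used for fitting; it checks generalization.
train_mse = mean_squared_error(train, slope, intercept)
test_mse = mean_squared_error(test, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"train MSE={train_mse:.3f} test MSE={test_mse:.3f}")
```

A test error much larger than the training error would suggest the model fits the training data too closely and does not generalize well.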

rtist

That's good, if we misunderstood your purpose.