Data Mining Standards
Arati Kadav Jaya Kawale Pabitra Mitra aratik@c.iitk.ac.in jayak@c.iitk.ac.in pmitra@c.iitk.ac.in
Abstract
In this survey paper we consolidate the current data mining standards. We categorize them into process standards, XML standards, standard APIs, web standards and grid standards, and discuss them in considerable detail. We also design an application using these standards. We then analyze the standards and their influence on data mining application development, and point out areas of data mining application development that still need to be standardized. We also discuss the trend in the focus areas addressed by the standards.
Data Mining Standards
1 Introduction
2. Data Mining Standards
2.1 Process Standards
2.1.1 CRISP-DM
2.2 XML Standards
2.2.1 PMML
2.2.2 CWM-DM
2.3 Web Standards
2.3.1 XMLA
2.3.2 Semantic Web
2.3.3 Data Space
2.4 Application Programming Interfaces (APIs)
2.4.1 SQL/MM DM
2.4.2 Java APIs
2.4.3 Microsoft OLEDB-DM
2.5 Grid Services
2.5.1 OGSA and data mining
3. Developing Data Mining Application Using Data Mining Standards
3.1 Application Requirement Specification
3.2 Design and Deployment
4. Analysis
5. Conclusion
Appendix:
[A1] PMML example
[A2] XMLA example
[A3] OLEDB
[A4] OLEDB-DM example
[A5] SQL/MM Example
[A6] Java Data Mining Model Example
1 Introduction
Researchers in data mining and knowledge discovery are creating new, more automated methods for discovering knowledge to meet the needs of the 21st century. This need for analysis will keep growing, driven by the business trends of one-to-one marketing, customer-relationship management, enterprise resource planning, risk management, intrusion detection and Web personalization, all of which require customer-information analysis and customer-preferences prediction. [GrePia]
Deploying a data mining solution requires collecting the data to be mined, then cleaning and transforming its attributes to provide the inputs for data mining models. The models also need to be built, used and integrated with different applications. Moreover, currently deployed data management software must be able to interact with the data mining models through standard APIs. Scalability calls for collecting the data to be mined from distributed and remote locations. Employing common data mining standards greatly simplifies the integration, updating, and maintenance of the applications and systems containing the models. [stdHB]
Over the past several years, various data mining standards have matured and today are used by many of the data mining vendors, as well as by others building data mining applications. With the maturity of data mining standards, a variety of standards-based data mining services and platforms can now be developed and deployed much more easily. Related fields such as data grids, web services, and the semantic web have also developed standards-based infrastructures and services relevant to KDD. The new standards and standards-based services and platforms have the potential to change the way data mining is used. [kdd03]
The data mining standards are concerned with one or more of the following issues [stdHB]:
1. The overall process by which data mining models are produced, used, and deployed: This includes, for example, a description of the business interpretation of the output of a classification tree.
2. A standard representation for data mining and statistical models: This includes, for example, the parameters defining a classification tree.
3. A standard representation for cleaning, transforming, and aggregating attributes to provide the inputs for data mining models: This includes, for example, the parameters defining how zip codes are mapped to three-digit codes prior to their use as a categorical variable in a classification tree.
4. A standard representation for specifying the settings required to build models and to use the outputs of models in other systems: This includes, for example, specifying the name of the training set used to build a classification tree.
5. Interfaces and Application Programming Interfaces (APIs) to other languages and systems: There are standard data mining APIs for Java and SQL. This includes, for example, a description of the API so that a classification tree can be built on data in a SQL database.
6. Standards for viewing, analyzing, and mining remote and distributed data: This includes, for example, standards for the format of the data and metadata so that a classification tree can be built on distributed web-based data.
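The attribute transformation mentioned in item 3 can be illustrated with a minimal Python sketch. It is not taken from any of the standards; the function name `zip_to_prefix`, the `"UNKNOWN"` fallback category and the sample zip codes are all assumptions made for illustration. A standard representation would record the same parameters (how many leading digits to keep, how invalid values are handled) so that another system can reproduce the mapping.

```python
from collections import Counter


def zip_to_prefix(zip_code: str, digits: int = 3) -> str:
    """Map a zip code to its leading `digits` digits.

    Values that are too short or non-numeric fall into a single
    "UNKNOWN" category (an assumption of this sketch).
    """
    cleaned = zip_code.strip()
    if len(cleaned) < digits or not cleaned[:digits].isdigit():
        return "UNKNOWN"
    return cleaned[:digits]


# Hypothetical raw attribute values, including one dirty record.
records = ["60614", "60657", "94110", " 94103", "n/a"]

# The transformed attribute is what a classification tree would
# treat as a categorical variable.
prefixes = [zip_to_prefix(z) for z in records]
category_counts = Counter(prefixes)
```

Keeping the number of digits as a parameter rather than hard-coding it reflects the point of the list item: the parameters of the transformation, not just its output, are what a standard needs to capture.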
The currently established standards address different aspects or dimensions of data mining application development. They are summarized in Table 1.
Areas | Data Mining Standard | Description
Process Standards | Cross Industry Standard Process for Data Mining (CRISP-DM) | Captures the data mining process: begins with the business problem and ends with the deployment of the knowledge gained in the process.
XML Standards | Predictive Model Markup Language (PMML) | Model for representing data mining and statistical data.
XML Standards | Common Warehouse Model for Data Mining (CWM-DM) | Metadata model that specifies metadata for build settings, model representations, and results from model operations. Models are defined through the Unified Modeling Language.
Standard APIs | SQL/MM, Java API (JSR-73), Microsoft OLE-DB | APIs for data mining applications.
Protocol for transport of remote and distributed data | Data Space Transport Protocol (DSTP) | DSTP is used for distribution, enquiry and retrieval of data in a data space.
Model Scoring Standard | Predictive Scoring and Update Protocol (PSUP) | PSUP can be used for both online real-time scoring and updates as well as scoring in an offline batch environment. (Scoring is the process of using statistical models to make decisions.)
Web Standards | XML for Analysis (XMLA) | Standard web service interface designed specifically for online analytical processing and data mining functions (uses Simple Object Access Protocol (SOAP)).
Web Standards | Semantic Web | Provides a framework to represent information in machine-processable form and can be used to extract knowledge from data mining systems.
Web Standards | Data Space | Provides an infrastructure for creating a web of data. Built around standards like XML, DSTP and PSUP. Helps handle large data sets that are present at remote and distributed locations.
Grid Standards | Open Grid Services Architecture (OGSA) | Developed by Globus, this standard describes a service-based open architecture for distributed virtual organizations. It will provide the data mining engine with secure, reliable and scalable high-bandwidth access to the various distributed data sources and formats across various administrative domains.
Table 1: Summary of Data Mining Standards
Section 2 describes the above standards in detail. In Section 3 we design and develop a data mining application using the above standards. Section 4 analyzes the standards and their relationship with each other and proposes the areas where standards are needed.
2. Data Mining Standards
2.1 Process Standards
2.1.1 CRISP-DM
CRISP-DM stands for CRoss Industry Standard Process for Data Mining.
It is an industry-, tool- and application-neutral standard for defining and validating the data mining process.
It was conceived in late 1996 by DaimlerChrysler, SPSS and NCR. The latest version is CRISP-DM 1.0.
Motivation:
As market interest in data mining led to its widespread uptake, every new adopter was required to come up with its own approach to incorporating data mining into its existing setup. There was also a need to demonstrate that data mining was sufficiently mature to be adopted as a key part of any customer’s business process. CRISP-DM provides a standard process model for conceiving, developing and deploying a data mining project that is non-proprietary and freely distributed.
Standard Description:
CRISP-DM organizes the data mining process into a hierarchical process model.
At the top level the process is divided into phases. Each phase consists of several second-level generic tasks. The generic tasks are intended to be complete (covering the whole phase and all possible data mining applications) and stable (valid for yet unforeseen developments).
The generic tasks are mapped to specialized tasks. Finally, the specialized tasks contain several process instances, which are a record of the actions, decisions and results of an actual data mining engagement.
This is depicted in Figure 1.
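The four levels described above can be sketched as nested data structures. This is only an illustrative sketch in Python, not part of CRISP-DM itself: the class names, the helper `count_instances` and the recorded process instance are assumptions, while the "clean data" generic task and its numerical/categorical specializations follow the example used in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessInstance:
    """Record of what was actually done in one engagement."""
    description: str


@dataclass
class SpecializedTask:
    name: str
    instances: List[ProcessInstance] = field(default_factory=list)


@dataclass
class GenericTask:
    name: str
    specialized: List[SpecializedTask] = field(default_factory=list)


@dataclass
class Phase:
    name: str
    generic_tasks: List[GenericTask] = field(default_factory=list)


def count_instances(phase: Phase) -> int:
    """Count process instances recorded anywhere under a phase."""
    return sum(
        len(task.instances)
        for generic in phase.generic_tasks
        for task in generic.specialized
    )


# One phase, one generic task, two specialized tasks; the process
# instance text is hypothetical.
data_preparation = Phase(
    name="Data preparation",
    generic_tasks=[
        GenericTask(
            name="Clean data",
            specialized=[
                SpecializedTask(
                    "Clean numerical values",
                    [ProcessInstance("Replaced missing ages with the median")],
                ),
                SpecializedTask("Clean categorical values"),
            ],
        )
    ],
)
```

The upper two levels (phases and generic tasks) stay the same across projects, whereas the specialized tasks and process instances are filled in per project.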
The mapping of generic tasks (e.g. the task for cleaning data) to specialized tasks (e.g. cleaning numerical or categorical values) depends on the data mining context. CRISP-DM distinguishes between four different dimensions of data mining contexts. These are:
1. Application domain (e.g. response modeling)
2. Data mining problem type (e.g. clustering or segmentation problem)
3. Technical aspect (issues like outliers or missing values)
4. Tool and technique (e.g. Clementine or decision trees)
The more values are fixed for the different context dimensions, the more concrete the data mining context becomes. The mappings can be done for the single data mining project currently in hand or for future projects.
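One way to picture how a context becomes more concrete is to treat each of the four dimensions as an optional field and count how many are fixed. The Python sketch below is an assumption made for illustration: CRISP-DM does not define such a measure, and the class name `MiningContext` and the method `fixed_dimensions` are hypothetical. The example values are the ones given in the text.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class MiningContext:
    """The four CRISP-DM context dimensions; None means 'not fixed yet'."""
    application_domain: Optional[str] = None
    problem_type: Optional[str] = None
    technical_aspect: Optional[str] = None
    tool_and_technique: Optional[str] = None

    def fixed_dimensions(self) -> int:
        """Number of dimensions with a fixed value (0 = fully generic)."""
        return sum(getattr(self, f.name) is not None for f in fields(self))


generic_context = MiningContext()
partial_context = MiningContext(problem_type="clustering")
concrete_context = MiningContext(
    application_domain="response modeling",
    problem_type="clustering",
    technical_aspect="missing values",
    tool_and_technique="decision trees",
)
```

A fully generic context corresponds to the generic tasks, and each additional fixed dimension narrows the set of specialized tasks that apply.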
The process reference model consists of the phases shown in Figure 1 and summarized in Table 2. The sequence of the phases is not rigid: the outcome of each phase determines which phase, or which particular task of a phase, is performed next. [CRSP]
Figure 1: CRISP-DM process Model
Interoperability with other standards:
CRISP-DM provides a reference model which is completely neutral to other tools, vendors, applications or existing standards.
Phases Description
Business understanding Focuses on assessing and understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
Data -Starts with an initial data collection.