A data mining process may uncover thousands of rules from a given data set, most of which end up being unrelated or uninteresting to users. Often, users have a good sense of which “direction” of mining may lead to interesting patterns and the “form” of the patterns or rules they want to find. They may also have a sense of “conditions” for the rules, which would eliminate the discovery of certain rules that they know would not be of interest. Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint-based mining. The constraints can include the following:
■ Knowledge type constraints: These specify the type of knowledge to be mined, such as association, correlation, classification, or clustering.
■ Data constraints: These specify the set of task-relevant data.
■ Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, the abstraction levels, or the level of the concept hierarchies to be used in mining.
■ Interestingness constraints: These specify thresholds on statistical measures of rule interestingness such as support, confidence, and correlation.
■ Rule constraints: These specify the form of, or conditions on, the rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates. These constraints can be specified using a high-level declarative data mining query language and user interface.
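As a rough illustration of how interestingness and rule constraints confine the search, the sketch below enumerates single-consequent association rules over a toy data set and keeps only those that meet the user's thresholds. The transactions, the item names, and the `mine_rules` function with its parameters are all hypothetical; this is a brute-force sketch, not an efficient mining algorithm.

```python
from itertools import combinations

# Hypothetical toy transactions; the item names are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(min_support, min_confidence, max_antecedent, consequent_item=None):
    """Enumerate rules A -> c and keep those satisfying all constraints."""
    items = sorted(set().union(*transactions))
    rules = []
    # Rule constraint: at most max_antecedent predicates in the antecedent.
    for size in range(1, max_antecedent + 1):
        for antecedent in combinations(items, size):
            a = set(antecedent)
            sup_a = support(a)
            # Interestingness constraint: a rule can never be more frequent
            # than its antecedent, so infrequent antecedents are skipped.
            if sup_a == 0 or sup_a < min_support:
                continue
            for c in items:
                if c in a:
                    continue
                # Rule constraint: template that fixes the consequent.
                if consequent_item is not None and c != consequent_item:
                    continue
                sup_rule = support(a | {c})
                if sup_rule < min_support:
                    continue
                confidence = sup_rule / sup_a
                if confidence >= min_confidence:
                    rules.append((antecedent, c, sup_rule, confidence))
    return rules


for antecedent, c, sup, conf in mine_rules(0.4, 0.7, 2, consequent_item="milk"):
    print(f"{set(antecedent)} -> {c}  support={sup:.2f} confidence={conf:.2f}")
```

Tightening any one constraint (a higher support threshold, a smaller antecedent, a fixed consequent) shrinks the set of candidate rules that has to be examined at all, which is the point of specifying constraints before mining.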
Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels.
- Learning step: a classification algorithm analyzes the training data to construct the model (the classifier).
- Classification step: the model is used to predict class labels for given data. Test data are first used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
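The workflow described above (build a model from training data, estimate its accuracy on test data, then apply it to new tuples) can be sketched as follows. A nearest-centroid classifier is used here only because it is short; it is an assumption, not the method of the source text. The data, the function names, and the 0.8 acceptance threshold are likewise hypothetical.

```python
import math
from collections import defaultdict


def learn(training_data):
    """Build the model from labeled tuples: one mean vector per class."""
    groups = defaultdict(list)
    for features, label in training_data:
        groups[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in groups.items()
    }


def classify(model, features):
    """Predict the class whose mean vector is closest to the given tuple."""
    return min(model, key=lambda label: math.dist(model[label], features))


def accuracy(model, test_data):
    """Fraction of test tuples whose predicted label matches the true label."""
    correct = sum(classify(model, x) == y for x, y in test_data)
    return correct / len(test_data)


# Hypothetical labeled tuples: (features, class label).
training = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
            ((8.0, 8.0), "high"), ((9.0, 7.5), "high")]
test = [((2.0, 1.0), "low"), ((8.5, 9.0), "high"), ((5.5, 5.5), "high")]

model = learn(training)
acc = accuracy(model, test)
print("estimated accuracy:", acc)

# Apply the model to a new tuple only if the accuracy is acceptable
# (the 0.8 threshold is an arbitrary choice for this sketch).
if acc >= 0.8:
    print("prediction for (7.0, 7.0):", classify(model, (7.0, 7.0)))
```

Keeping the test tuples separate from the training tuples matters: accuracy measured on the training data itself would tend to be optimistic.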
Types of Data Sets
- Dimensionality is the number of attributes that the objects in the data set possess.
- Sparsity means that most attributes of an object have the value 0. The advantage is that only the non-zero values need to be stored and manipulated.
- Resolution: it is frequently possible to obtain data at different levels of resolution.
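To make the sparsity advantage concrete, here is a minimal sketch that stores only the non-zero attributes of an object and computes a dot product without visiting the zeros. The two vectors and the helper names `to_sparse` and `sparse_dot` are hypothetical, chosen only for illustration.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {attribute index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparse_dot(a, b):
    """Dot product that touches only indices stored in both vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the vector with fewer stored entries
    return sum(v * b[i] for i, v in a.items() if i in b)


# Two hypothetical objects with 10 attributes each (dimensionality = 10).
obj1 = [0, 3, 0, 0, 0, 1, 0, 0, 2, 0]
obj2 = [1, 0, 0, 0, 0, 4, 0, 0, 5, 0]

s1, s2 = to_sparse(obj1), to_sparse(obj2)
print("dimensionality:", len(obj1))
print("stored entries:", len(s1), "and", len(s2))
print("dot product:", sparse_dot(s1, s2))
```

The saving grows with dimensionality: the storage and the work in `sparse_dot` depend on the number of non-zero entries, not on the total number of attributes.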
Han, Jiawei; Kamber, Micheline; Pei, Jian (2011-06-22). Data Mining: Concepts and Techniques.
Machine Learning Repository