Detecting Outliers in High Dimensional Categorical Data through Feature Selection
Keywords:
Data Mining, Outlier detection, Categorical data, Entropy, Mutual information
Abstract
Extensive use of qualitative features to describe categorical data leads to a high-dimensional scenario in which outlier detection becomes challenging because of data sparseness. The curse of dimensionality has been well addressed for numerical data through various feature selection methods, whereas the categorical case is still being actively explored. Because outlier detection is generally unsupervised in nature, owing to the lack of knowledge about the various types of outliers, this paper proposes a novel unsupervised feature selection method for effective detection of outliers in categorical data. The proposed algorithm establishes the relevance and the redundancy of a feature by computing entropy and mutual information. By measuring the inherent redundancy of the features describing a data set, it applies a threshold on the maximum allowed redundancy of a candidate feature with respect to the already selected subset of features. Selecting features among the relevant ones in this way yields a feature subset with less redundancy. A comparison with information gain based feature selection shows the effectiveness of the proposed algorithm for outlier detection. Its efficacy is demonstrated on various high-dimensional benchmark data sets using two existing outlier detection methods.
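The selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details are not given here. It makes three assumptions: relevance is approximated by ranking features by Shannon entropy (highest first), redundancy is the largest mutual information between a candidate and any already selected feature, and the threshold `max_redundancy` is supplied by the caller rather than derived from the data set. The names `entropy`, `mutual_information`, `select_features` and `max_redundancy` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(column):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two categorical columns."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def select_features(columns, max_redundancy):
    """Greedy, unsupervised feature selection sketch.

    columns: dict mapping feature name -> list of categorical values.
    max_redundancy: assumed, caller-supplied cap on the mutual information
        between a candidate and any already selected feature.
    """
    # Assumption: higher entropy is treated as a proxy for higher relevance.
    ranked = sorted(columns, key=lambda name: entropy(columns[name]), reverse=True)
    selected = []
    for name in ranked:
        # Redundancy of the candidate with respect to the selected subset.
        redundancy = max(
            (mutual_information(columns[name], columns[s]) for s in selected),
            default=0.0,
        )
        if redundancy <= max_redundancy:
            selected.append(name)
    return selected


# Toy data: "colour_copy" duplicates "colour" and should be rejected.
data = {
    "colour": ["r", "g", "b", "r", "g", "b", "r", "g"],
    "colour_copy": ["r", "g", "b", "r", "g", "b", "r", "g"],
    "shape": ["sq", "sq", "ci", "ci", "sq", "sq", "ci", "ci"],
}
print(select_features(data, max_redundancy=0.5))  # ['colour', 'shape']
```

In this toy example the duplicated feature shares all of its information with `colour`, so its mutual information exceeds the threshold and it is skipped, while `shape` is nearly independent of `colour` and is kept. The reduced feature set would then be passed to an outlier detection method of choice.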
License
You are free to:
- Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
- The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.