ERIC Number: ED658555
Record Type: Non-Journal
Publication Date: 2024
Pages: 95
Abstractor: As Provided
ISBN: 979-8-3832-2700-8
ISSN: N/A
EISSN: N/A
Available Date: N/A
Oversampling Based Method for Statistical Learning of Imbalanced Mixed Data
Weihao Wang
ProQuest LLC, Ph.D. Dissertation, State University of New York at Stony Brook
In this work, we introduce a novel oversampling technique, the theory of inheritance and Gower distance-based oversampling (TIGO) method, designed to address class imbalance in data sets with mixed categorical and continuous variables. Drawing inspiration from genetic inheritance principles, TIGO synthesizes new minority class data by treating the minority samples as genetic chromosomes. The methodology adopts the Gower distance metric for its ability to handle mixed data types, a notable departure from oversampling methods based on the more commonly used Euclidean or Mahalanobis distances. We have shown that this approach is particularly advantageous for datasets dominated by binary variables, including the case of multi-level categorical variables represented by binary dummy variables. Simulation studies have been conducted to examine the proposed method and compare it with existing methods. To generate synthetic data, we introduce three novel data pairing strategies, each rooted in a distinct theoretical framework, to identify parent data pairs for new sample generation: parallel midpoint pairing (PMP), adjacent pairing (AP), and reverse pairing (RP). We compare TIGO with various techniques, including under-sampling methods such as random under-sampling (RUS) and Tomek Links; oversampling methods such as random over-sampling (ROS), SMOTE, borderline-SMOTE, and ADASYN; the mixed data-specific SMOTE variants SMOTE-NC and SMOTE-ENC; and cost-sensitive approaches such as cost-sensitive random forest (RF). Our findings indicate that TIGO outperforms most other methods for handling imbalanced mixed data with an imbalance ratio below 9. Notably, TIGO achieves substantially higher sensitivity (prediction accuracy for the minority class), specificity, and overall accuracy.
In particular, TIGO-RP and TIGO-PMP achieve the highest sensitivity and overall accuracy, whereas TIGO-AP secures the highest specificity. For datasets with an imbalance ratio above 9, TIGO, together with SMOTE-NC and SMOTE-ENC, shows markedly superior efficacy, far surpassing alternative methods. Compared with TIGO, RUS demonstrates noticeably lower specificity, while the other approaches show significantly lower sensitivity. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
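For readers unfamiliar with the Gower distance mentioned in the abstract, the following is a minimal sketch of its standard equal-weight form for one pair of mixed-type records: a categorical or binary feature contributes 0 on a match and 1 on a mismatch, a continuous feature contributes the absolute difference divided by that feature's range, and the contributions are averaged. This is not the dissertation's implementation. The function name gower_distance, the feature layout, and the example values are assumptions made purely for illustration, and the TIGO pairing strategies (PMP, AP, RP) are not reproduced here because the abstract does not specify them.

```python
import numpy as np

def gower_distance(x, y, is_categorical, ranges):
    """Equal-weight Gower distance between two mixed-type records.

    x, y            -- 1-D numeric arrays (categories already coded as numbers)
    is_categorical  -- boolean mask, True where the feature is categorical/binary
    ranges          -- per-feature range (max - min) of the continuous features;
                       entries at categorical positions are ignored
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    is_categorical = np.asarray(is_categorical, dtype=bool)
    ranges = np.asarray(ranges, dtype=float)

    per_feature = np.empty(x.shape[0], dtype=float)

    # Categorical features: simple mismatch (0 if equal, 1 otherwise).
    per_feature[is_categorical] = (x[is_categorical] != y[is_categorical]).astype(float)

    # Continuous features: absolute difference scaled by the feature range.
    # A zero range means the feature is constant, so the difference is 0 anyway.
    num = ~is_categorical
    safe_range = np.where(ranges[num] > 0, ranges[num], 1.0)
    per_feature[num] = np.abs(x[num] - y[num]) / safe_range

    return float(per_feature.mean())

# Hypothetical records: [age, income, smoker (0/1), dummy-coded category level]
mask = [False, False, True, True]
ranges = [20.0, 40000.0, 1.0, 1.0]   # assumed ranges for age and income
a = [30.0, 50000.0, 1.0, 0.0]
b = [40.0, 70000.0, 0.0, 0.0]
d = gower_distance(a, b, mask, ranges)   # (0.5 + 0.5 + 1 + 0) / 4 = 0.5
```

Because every feature contributes a value between 0 and 1, binary dummy variables and continuous variables are placed on a common scale, which is one plausible reading of why the abstract prefers this metric over Euclidean or Mahalanobis distances for data dominated by binary variables.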
Descriptors: Sampling, Statistics Education, Data Analysis, Prediction, Accuracy, Alternative Assessment
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A
Author Affiliations: N/A