Effect of datasets size on the machine learning performance of the bagworm, Metisa plana (Walker) infestation using UAV remote sensing

This study investigates the effect of dataset size and balance on machine learning (ML) performance in detecting bagworm (Metisa plana) infestations in Malaysian oil palm plantations using unmanned aerial vehicle (UAV) multispectral imagery. Bagworms are a major pest, capable of causing 10–13% leaf defoliation and up to 40% yield losses. Traditional manual census methods are labor-intensive and slow, prompting the adoption of UAV-based remote sensing for rapid, non-destructive monitoring.

The research focused on three vegetation indices derived from UAV imagery: the normalized difference vegetation index (NDVI), the normalized difference red edge index (NDRE), and the green normalized difference vegetation index (GNDVI). The imagery was captured with a DJI Inspire 2 equipped with a MicaSense Altum-PT multispectral camera. Data were collected from plantations in Selangor, Johor, and Perak, covering healthy, low, mild, and severe infestation levels. Because the datasets were inherently imbalanced (e.g., 1000 severe vs. 300 mild cases), resampling techniques were applied: random oversampling (ROS), synthetic minority oversampling technique (SMOTE), random undersampling (RUS), and interval-based undersampling (3-interval and 5-interval).
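As an illustration of how the three indices relate to the camera bands, the sketch below applies the standard normalized-difference formulas to a single pixel. The function names and the reflectance values are hypothetical and are not taken from the study; the study's own processing pipeline is not described here.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both bands are zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def vegetation_indices(nir, red, red_edge, green):
    """Compute NDVI, NDRE, and GNDVI from per-band reflectance values."""
    return {
        "NDVI": normalized_difference(nir, red),        # near-infrared vs. red
        "NDRE": normalized_difference(nir, red_edge),   # near-infrared vs. red edge
        "GNDVI": normalized_difference(nir, green),     # near-infrared vs. green
    }


# Hypothetical reflectance values for one canopy pixel (not measured data).
sample = vegetation_indices(nir=0.50, red=0.10, red_edge=0.30, green=0.15)
for name, value in sample.items():
    print(f"{name}: {value:.3f}")
```

All three indices share the same form and differ only in the band paired with near-infrared, which is why they can be combined as features (for example NDVI with NDRE) without extra scaling.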

Classification models were developed using MATLAB’s Classification Learner app, employing decision tree, discriminant analysis, naïve Bayes, support vector machine, and k-nearest neighbor (KNN) classifiers. Performance was evaluated using macro-F1 scores with fivefold cross-validation. Results demonstrated that undersampling, particularly 3-interval undersampling, yielded the best outcomes, achieving 86.84% accuracy and a perfect F1-score despite reducing dataset size by 66.67%. Fine KNN consistently outperformed the other classifiers across vegetation index combinations, especially NDVI-NDRE, and proved robust in classifying all infestation levels.
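The evaluation setup can be sketched in a few lines of Python. This is not the study's MATLAB workflow: it is a minimal stand-in that assumes a one-nearest-neighbor classifier (the setting MATLAB's Fine KNN preset uses), a simple shuffled five-fold split, and macro-F1 computed over the pooled held-out predictions. The toy feature values are invented for illustration.

```python
import math
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def predict_1nn(train_x, train_y, point):
    """Return the label of the closest training point (Euclidean distance)."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], point))
    return train_y[nearest]


def cross_validated_macro_f1(features, labels, folds=5, seed=0):
    """Shuffle once, hold out each fold in turn, and score pooled predictions."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    y_true, y_pred = [], []
    for k in range(folds):
        test_idx = order[k::folds]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_x = [features[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        for i in test_idx:
            y_true.append(labels[i])
            y_pred.append(predict_1nn(train_x, train_y, features[i]))
    return macro_f1(y_true, y_pred)


# Invented (NDVI, NDRE) pairs forming two well-separated classes.
features = [(0.80 + 0.01 * i, 0.40) for i in range(10)]
features += [(0.30 + 0.01 * i, 0.10) for i in range(10)]
labels = ["healthy"] * 10 + ["severe"] * 10
score = cross_validated_macro_f1(features, labels)
print(f"macro-F1: {score:.2f}")
```

Macro-F1 averages the per-class scores without weighting by class size, so a minority infestation level counts as much as the majority one. That makes it a more informative metric than plain accuracy when classes are imbalanced.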

The findings highlight that reducing dataset size can improve classification success, even when data remain imbalanced. This counters the assumption that larger datasets always enhance ML performance, emphasizing instead the importance of dataset quality and balance. The study contributes to precision agriculture by identifying effective resampling strategies and reliable vegetation indices for pest detection. It also underscores UAV remote sensing as a scalable, efficient alternative to manual census methods, enabling timely interventions against bagworm outbreaks.
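To make the resampling strategies concrete, the sketch below applies random oversampling, random undersampling, and an interval-based undersampling to the 1000 severe vs. 300 mild example mentioned earlier. The interval rule shown here (keep every n-th sample within each class) is an assumption chosen because it reproduces a reduction of roughly two thirds for n = 3; the study's exact procedure may differ. SMOTE is omitted, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_label(samples):
    """Split (features, label) pairs into one list per label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append((features, label))
    return groups


def random_oversample(samples, seed=0):
    """Duplicate random minority samples until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


def random_undersample(samples, seed=0):
    """Randomly drop samples until every class matches the smallest."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def interval_undersample(samples, interval):
    """Assumed rule: keep every `interval`-th sample within each class."""
    reduced = []
    for group in group_by_label(samples).values():
        reduced.extend(group[::interval])
    return reduced


def label_counts(samples):
    """Count samples per label as a plain dict."""
    return dict(Counter(label for _, label in samples))


# Placeholder features; only the class sizes matter for this illustration.
samples = [((0.0, 0.0), "severe")] * 1000 + [((0.0, 0.0), "mild")] * 300

print("ROS:", label_counts(random_oversample(samples)))
print("RUS:", label_counts(random_undersample(samples)))
for interval in (3, 5):
    reduced = interval_undersample(samples, interval)
    reduction = 1 - len(reduced) / len(samples)
    print(f"{interval}-interval:", label_counts(reduced), f"reduction {reduction:.1%}")
```

Under this assumed rule, interval undersampling shrinks the dataset but leaves the class ratio roughly unchanged, whereas ROS and RUS equalize the class sizes. This is one way a smaller dataset can remain imbalanced, as the findings describe.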

   

Figure 1: Illustration of oversampling method

   

Figure 2: Random oversampling (ROS)

   

Reference:

Johari, S. N. A. M., Khairunniza-Bejo, S., Mohamed Shariff, A. R., Husin, N. A., Masri, M. M. M., & Kamarudin, N. (2025). Effect of datasets size on the machine learning performance of the bagworm, Metisa plana (Walker) infestation using UAV remote sensing. Journal of Plant Diseases and Protection, 132, 52–68.

   
Link: https://doi.org/  

Date of Input: 29/01/2026 | Updated: 29/01/2026 | ainzubaidah

INSTITUTE OF PLANTATION STUDIES (IKP)
Universiti Putra Malaysia
43400 UPM Serdang
Selangor Darul Ehsan
+603-9769 1044
+603-9769 XXXX