Proposing an ensemble-based model using data clustering and machine learning algorithms for effective predictions

Azimlu Shanajani, Fateme

dc.contributor.advisor	Rahnamayan, Shahryar
dc.contributor.advisor	Makrehchi, Masoud
dc.contributor.author	Azimlu Shanajani, Fateme
dc.date.accessioned	2019-10-17T16:07:29Z
dc.date.accessioned	2022-03-29T16:49:15Z
dc.date.available	2019-10-17T16:07:29Z
dc.date.available	2022-03-29T16:49:15Z
dc.date.issued	2019-08-01
dc.identifier.uri	https://hdl.handle.net/10155/1078
dc.description.abstract	Prediction is one of the most important tasks in machine learning. Data scientists use various regression methods to find the most appropriate and accurate model for each type of dataset. This study proposes a meta-model to improve prediction accuracy. In common practice, different models are applied to the whole dataset and the one with the highest accuracy is selected; that is, a single global model is developed for the entire dataset. In the proposed approach, we first cluster the data using both algorithm-based and expert-based clustering. Algorithm-based clustering uses methods such as K-means, DBSCAN, and agglomerative hierarchical clustering. For expert-based clustering, we use expert knowledge to group the data based on important features selected by experts. Then, for each clustering method and each generated cluster, we apply different machine learning models, including linear and polynomial regression, SVR, neural networks, genetic programming, and other techniques, and select the most accurate prediction model per cluster. Each cluster contains fewer samples than the original dataset, and with fewer samples a model is prone to losing accuracy. On the other hand, customizing a model for each sub-dataset makes more effective prediction possible than fitting one model to the whole dataset. The proposed model can therefore be categorized as ensemble-based, because prediction is performed through the collaboration of various models over clusters of sub-datasets. Moreover, the granularity of the proposed method suits parallelization: it can be parallelized more efficiently.
As our main case study, we used real-estate data with more than 21,000 instances and 20 features to improve house price prediction. However, the approach is applicable to other large datasets. To examine its capability, we applied the proposed method to two other datasets: an agricultural dataset with 10 features and more than 7,000 instances, and the Facebook comments volume dataset, which contains roughly 41,000 samples with 54 features. For the first dataset, the new approach reduces the error value from 0.14 to 0.087 with K-means clustering and to 0.086 with grouping based on human knowledge. In our second case study, the water evaporation data, accuracy did not improve considerably overall; however, accuracy did improve in some sub-datasets.	en
dc.description.sponsorship	University of Ontario Institute of Technology	en
dc.language.iso	en	en
dc.subject	Data mining	en
dc.subject	Machine learning	en
dc.subject	Clustering	en
dc.subject	Regression	en
dc.subject	Prediction	en
dc.title	Proposing an ensemble-based model using data clustering and machine learning algorithms for effective predictions	en
dc.type	Thesis	en
dc.degree.level	Master of Applied Science (MASc)	en
dc.degree.discipline	Electrical and Computer Engineering	en
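
The per-cluster model selection described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: it assumes a single feature, a plain K-means, only two candidate models (a constant and a least-squares line), validation MSE as the selection criterion, and synthetic data with two regimes. All function names (kmeans_1d, best_model, fit_per_cluster, demo) are hypothetical and chosen for illustration only.

```python
import random
from statistics import mean

def fit_constant(xs, ys):
    """Candidate 1: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Candidate 2: ordinary least-squares line through the data."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # all x identical: fall back to the constant model
        return fit_constant(xs, ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: a * (x - mx) + my

# Assumed candidate set; the abstract lists many more model families.
CANDIDATES = {"constant": fit_constant, "linear": fit_linear}

def mse(model, points):
    return mean((model(x) - y) ** 2 for x, y in points)

def best_model(train, valid):
    """Fit every candidate on train; keep the one with lowest validation MSE."""
    xs, ys = zip(*train)
    fitted = [fit(xs, ys) for fit in CANDIDATES.values()]
    return min(fitted, key=lambda m: mse(m, valid))

def assign(x, centroids):
    """Index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))

def kmeans_1d(xs, k, iters=20):
    """Lloyd's K-means on one feature, with deterministic quantile initialization."""
    s = sorted(xs)
    centroids = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[assign(x, centroids)].append(x)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def fit_per_cluster(train, valid, k):
    """Cluster on the feature, then select the best candidate model per cluster.
    Assumes every cluster receives at least one training and one validation point."""
    centroids = kmeans_1d([x for x, _ in train], k)
    models = []
    for j in range(k):
        tr = [p for p in train if assign(p[0], centroids) == j]
        va = [p for p in valid if assign(p[0], centroids) == j]
        models.append(best_model(tr, va))
    # Route each query to the model of its nearest cluster.
    return lambda x: models[assign(x, centroids)](x)

def demo(seed=0):
    """Compare one global model with per-cluster models on synthetic data."""
    rng = random.Random(seed)
    # Two regimes with different trends (synthetic, for illustration only).
    pts = [(x, 2 * x + rng.gauss(0, 0.1))
           for x in (rng.uniform(0, 1) for _ in range(200))]
    pts += [(x, 10 - 3 * x + rng.gauss(0, 0.1))
            for x in (rng.uniform(2, 3) for _ in range(200))]
    rng.shuffle(pts)
    train, valid, test = pts[:240], pts[240:320], pts[320:]
    global_model = best_model(train, valid)
    ensemble = fit_per_cluster(train, valid, k=2)
    return mse(global_model, test), mse(ensemble, test)

if __name__ == "__main__":
    global_err, ensemble_err = demo()
    print(f"global MSE: {global_err:.4f}, per-cluster MSE: {ensemble_err:.4f}")
```

In this toy setting the two regimes follow different trends, so a single global line fits neither well, while one model per cluster can fit each regime separately. The trade-off noted in the abstract still applies: each per-cluster model is trained on fewer samples than the global one.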

Files in this item

Name: Azimlu_Shanajani_Fateme.pdf
Size: 13.20Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses and Dissertations [1478]
Electronic Theses and Dissertations
Master Theses & Projects [463]
Master Theses & Projects (FEAS)