Summary

Readability-model-results-barplot.png

Description	English: Summary of results for the supervised models trained on 70% of the SEW data and tested on the held-out 30% of SEW, on VW, and on Klexikon. For each dataset there are three feature sets: all 7 language-agnostic features; readability metrics not customized for the language (only FRE, so that the metric stays constant across all languages); and customized FRE. For each feature set, we train three types of supervised ML models: Logistic Regression (LR), Support Vector Machines (SVM, with a linear kernel), and Random Forests (RF). In the VW datasets for German, Russian, and Spanish, the language-agnostic features perform best, especially with the RF models. For English, in SEW, the readability features perform best. Notably, for languages other than French, the language-agnostic features always outperform the non-customized readability features.
Date	19 July 2022
Source	Own work
Author	MGerlach (WMF)
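
The description above contrasts a non-customized FRE with a language-customized FRE. As a rough illustration only, the sketch below assumes that FRE stands for Flesch Reading Ease and uses the standard English coefficients as defaults. The function name and parameters are hypothetical, and the German coefficients shown (Amstad's adaptation) are just one example of a customization; the description does not say which customized formulas were used for the figure.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables,
                        base=206.835, w_sentence=1.015, w_syllable=84.6):
    """Flesch Reading Ease computed from raw counts.

    Defaults are the standard English coefficients. Passing other
    coefficients gives a language-customized variant (an assumption
    here, not necessarily the variant behind the figure).
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("n_words and n_sentences must be positive")
    avg_sentence_length = n_words / n_sentences      # words per sentence
    avg_syllables_per_word = n_syllables / n_words   # syllables per word
    return (base
            - w_sentence * avg_sentence_length
            - w_syllable * avg_syllables_per_word)


# Same text statistics scored with two coefficient sets.
score_en = flesch_reading_ease(100, 5, 150)
# Amstad's German adaptation: 180 - ASL - 58.5 * ASW
score_de = flesch_reading_ease(100, 5, 150,
                               base=180.0, w_sentence=1.0, w_syllable=58.5)
print(round(score_en, 3), round(score_de, 3))
```

Higher scores indicate easier text. Because the coefficients differ, scores from the English and the customized formula are not directly comparable across languages, which is presumably why the description treats the two feature sets separately.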

I, the copyright holder of this work, hereby publish it under the following license:

You are free:

to share – to copy, distribute and transmit the work
to remix – to adapt the work

Under the following conditions:

attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.