New iRead4Skills Deliverable: Annotated Corpora by Level of Complexity for FR, PT, and SP
We are pleased to announce the release of Dataset 2: Annotated Corpora by Level of Complexity for French (FR), Portuguese (PT), and Spanish (SP). This dataset is a collection of texts categorized by complexity level and annotated for complexity features, presented in Excel format (.xlsx). The corpora were compiled and annotated within the scope of the iRead4Skills project.
Dataset 2 is derived from the previously released Dataset 1: Corpora by Level of Complexity for FR, PT, and SP (DOI: 10.5281/zenodo.10055909), which consists of written texts of various genres and complexity levels. A sample of texts from Dataset 1 was selected for classification and annotation, providing additional data and test sets for the complexity analysis systems in the three project languages.
Data Collection and Annotation Process
The classification and annotation tasks were carried out through a structured methodology:
- Texts were distributed to Adult Learning (AL) and Vocational Education and Training (VET) Centres, where trainers and students participated in classification tasks.
- The classification was conducted via the Qualtrics platform, ensuring a standardized approach.
- Participants assigned texts to one of four complexity levels:
  - Very Easy (140 texts) – Easily understood by all.
  - Easy (140 texts) – Understandable for those with less than 9 years of schooling.
  - Plain (140 texts) – Readable at a 9th-grade level.
  - More Complex (42 texts) – Challenging for individuals with a 9th-grade education.
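As a quick illustration of how the classified sample is distributed, the counts listed above can be tallied in a few lines of Python. This is a minimal sketch that only restates the figures from the list; the names `LEVEL_COUNTS`, `total_texts`, and `level_shares` are illustrative and are not fields or column names from the dataset itself.

```python
# Text counts per complexity level, copied from the list above.
# The dictionary and variable names are illustrative, not dataset fields.
LEVEL_COUNTS = {
    "Very Easy": 140,
    "Easy": 140,
    "Plain": 140,
    "More Complex": 42,
}

# Total number of texts across the four levels as listed.
total_texts = sum(LEVEL_COUNTS.values())

# Share of each level, as a fraction of the listed total.
level_shares = {
    level: count / total_texts for level, count in LEVEL_COUNTS.items()
}

for level, count in LEVEL_COUNTS.items():
    print(f"{level}: {count} texts ({level_shares[level]:.1%})")
print(f"Total: {total_texts} texts")
```

The tally makes the imbalance visible: the More Complex level holds far fewer texts than the other three, which may matter when the data is used as a test set.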
For full details on the annotation process, data descriptions, and inter-annotator agreement, refer to the documentation available on Zenodo.