Previous research into the compatibility of stakeholders’ perceptions with statistical estimates of item difficulty has produced inconsistent findings. Furthermore, most research shows that teachers’ estimates of item difficulty are unreliable, since teachers tend to overestimate the difficulty of easy items and underestimate the difficulty of difficult items. The present study therefore analyzes a high-stakes test in terms of heuristic difficulty (the test takers’ standpoint) and statistical difficulty (classical test theory, CTT, and item response theory, IRT) and investigates the extent to which the findings from the two perspectives converge. Results indicate that 1) the whole test, along with its subtests, is difficult, which might undermine the validity of the test; 2) the respondents’ ratings of the difficulty of the total test converge closely with the difficulty values indicated by IRT and CTT, except for two subtests where students underestimated the difficulty; and 3) CTT difficulty estimates converge with IRT difficulty estimates. It can therefore be concluded that students’ perceptions of item difficulty might be a better estimate of test difficulty, and that a combination of test takers’ perceptions and statistical difficulty might provide a better picture of item difficulty in assessment contexts.
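To make the comparison between perceived and statistical difficulty concrete, the following is a minimal sketch and not the authors' procedure. It computes the CTT difficulty index (the proportion of correct responses per item), applies a crude logit transform as a stand-in for an IRT-style difficulty scale, and correlates the result with perceived-difficulty ratings. The response matrix, the rating values, and all function names are hypothetical; the abstract does not state which IRT model or software the study used, and a fitted IRT model would replace the logit shortcut in practice.

```python
import math

def ctt_difficulty(responses):
    """CTT difficulty index per item: proportion correct (higher = easier)."""
    n_persons = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_persons for j in range(n_items)]

def logit_difficulty(p):
    """Rough logit-scale difficulty, ln((1 - p) / p); higher = harder.
    Assumption: a crude approximation only, not a fitted IRT estimate."""
    return math.log((1 - p) / p)

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scored responses: 6 test takers x 4 items (1 = correct).
responses = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]

# Hypothetical mean perceived difficulty per item (1 = very easy, 5 = very hard).
perceived = [1.5, 2.5, 4.0, 4.5]

p_values = ctt_difficulty(responses)
logits = [logit_difficulty(p) for p in p_values]
convergence = pearson(perceived, logits)

print("CTT p-values:", [round(p, 3) for p in p_values])
print("Logit difficulties:", [round(b, 3) for b in logits])
print("Correlation with perceived difficulty:", round(convergence, 3))
```

A high positive correlation here would correspond to what the abstract calls convergence between the heuristic and statistical perspectives. Note that the logit transform is undefined for items that every test taker or no test taker answers correctly.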
International Journal of Assessment Tools in Education