Unit-Testlet Dilemma: PISA Sample



PISA, testlet items, bifactor model, local dependency, marginal item parameters


Testlets offer advantages that are recognized in the literature, such as making it possible to measure higher-order thinking skills and saving testing time. For this reason, they are widely used in settings ranging from in-class assessments to large-scale assessments. Their growing use has raised several open questions: “Is sharing a common stem enough for a set of items to be treated as a testlet?” “Which estimation method should be preferred for tests containing this type of item?” “Is there an alternative estimation method for PISA, which consists of this type of item?” In addition, because testlet items violate the local independence assumption, the choice of statistical model for estimating them has become a widely discussed topic. In light of these discussions, this study aimed to clarify the unit-testlet ambiguity by applying various item response theory models to the PISA 2018 science and mathematics tests, in which testlets consist of mixed item types (dichotomous and polytomous). The findings showed that the bifactor model fit the data best, while the unidimensional model fit almost as well as the bifactor model for both data sets (science and mathematics). The multidimensional IRT model, by contrast, showed the weakest fit for both tests. In line with these findings, the methods used to identify testlet items are discussed, and estimation suggestions are offered for implementations that use testlets, especially PISA.




How to Cite

Ayan, C., & Barış Pekmezci, F. (2021). Unit-Testlet Dilemma: PISA Sample. International Journal of Assessment Tools in Education, 8(3), 613-632. Retrieved from https://ijate.net/index.php/ijate/article/view/29