Research article - (2026) 25, 235-261. DOI: https://doi.org/10.52082/jssm.2026.235
ChatGPT Outperforms Personal Trainers in Answering Common Exercise Training Questions
Brecht D’hoe1, Daniel Kirk2, Jan Boone3, Alessandro Colosio4
Key words: Artificial Intelligence, exercise, natural language processing, machine learning, training guidance
Key Points
|
|
|
Experimental Protocol
The study took place from March to July 2024. The chronological flow of the experimental protocol is depicted in the accompanying figure. Exclusion criteria for the questions included failure to refer to exercise training or to the implementation of lifestyle and health-related factors; answers that were not articulated in a manner suitable for direct communication with a potential client (e.g., “it is a boring question”, “it depends”); and answers that fell below the minimum word count of 80 or exceeded the maximum of 300 words. None of the PTs were made aware of the goal of the study or the involvement of ChatGPT prior to their contribution, to prevent influencing the questions or answers they provided. Following submission of their questions and answers, they were informed about the goal of the study and were given the opportunity to retract their input if they wished to; none of the participants chose to do so. After applying the exclusion criteria, the remaining input formed the final set of nine questions and answers from nine different personal trainers used in the study (see the accompanying table).

The answers to each of the questions were graded by other PTs and by scientific experts in the field of the question. Each set of answers was graded by a total of 27 graders. The order of the answers to each question was randomized in the grading document to reduce the chances of pattern recognition and unblinding of the graders. One of the three scoring components was scientific correctness, and therefore nine of the 27 graders (1/3) were scientific domain experts in the field of the question. The other 18 graders were personal trainers who graded all nine questions, irrespective of the topic. As with the PTs providing the questions and answers, graders were initially blinded to the actual goal of the study and thus to the involvement of ChatGPT. After submitting their evaluations of the answers, graders were informed about the goal of the study and were given the opportunity to retract their contribution if they disagreed with the procedure; none of the graders chose to do so.

The grading was based on three components deemed relevant for determining the quality of an answer to a hypothetical knowledge-seeker asking a question: “Scientific correctness”, reflecting how accurately an answer reflects the current state of knowledge in the scientific domain to which the question belongs; “Comprehensibility”, capturing how well the answer could be expected to be understood by the layman receiving it; and “Actionability”, the extent to which the answer contains information that is useful and can be acted upon by the hypothetical layman asking the question. The rubric sent to the graders, describing how each component should be scored, is presented in the accompanying table.
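As an illustration of the blinding and randomization step described above, the sketch below shows one way the PT and ChatGPT answers to each question could be shuffled before being compiled into a grading document, with a separate key retained for later unblinding. This is a minimal sketch under assumed data structures; the file names, example answers, and random seed are hypothetical and not taken from the study materials.

```python
import csv
import random

# Hypothetical input: one PT answer and one ChatGPT answer per question.
answers = {
    "Q1": {"PT": "Answer written by the personal trainer ...",
           "ChatGPT": "Answer generated by ChatGPT ..."},
    # ... remaining questions would be added here
}

random.seed(2024)  # arbitrary seed so the shuffle is reproducible

grading_rows, blinding_key = [], []
for question_id, sources in answers.items():
    order = list(sources.items())   # [("PT", text), ("ChatGPT", text)]
    random.shuffle(order)           # randomize which answer appears first
    for position, (source, text) in enumerate(order, start=1):
        item = f"{question_id} - Answer {position}"
        grading_rows.append({"item": item, "answer": text})    # what graders see
        blinding_key.append({"item": item, "source": source})  # kept by researchers

# Graders receive only the blinded grading document; the key is stored separately.
with open("grading_document.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["item", "answer"])
    writer.writeheader()
    writer.writerows(grading_rows)

with open("blinding_key.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["item", "source"])
    writer.writeheader()
    writer.writerows(blinding_key)
```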
Participants – contributors
Personal trainers (PTs) providing questions and answers were recruited through relations within Ghent University and the European Register of Exercise Professionals (EREPS). Contributing PTs were required to hold at least a European Qualification Framework (EQF) Level 4 certification or an equivalent or higher qualification. This criterion ensured inclusion of professionally qualified PTs while excluding self-proclaimed ones, and participants were further required to be actively engaged in personal training. All provided online written informed consent to participate in the study. |
Participants – graders
To grade the answers provided by the contributing PTs, nine scientific experts per question topic were recruited from universities or research institutions. They were required to meet the following inclusion criteria: having attained a minimum academic level of a PhD, being currently affiliated with a research institute (either academic or private), and having expertise in the topic of the question, as evidenced by topic-related peer-reviewed scientific output. Hence, the scientific experts (1/3 of all graders) rated only the answers to the question relating to their domain of expertise, across all three components (comprehensibility, actionability, and scientific correctness). The remaining 18 graders (2/3 of all graders) were personal trainers, in alignment with their professional responsibility of relaying information to trainees and thus their relevance to comprehensibility and actionability. Accordingly, the grading PTs rated all answers to all questions across the three components. Grading PTs were recruited through contacts of Ghent University and the Belgian commercial gym franchise ‘Jims’. The same inclusion criteria applied to the grading PTs as to those contributing questions and answers, i.e., having obtained a minimum of EQF Level 4 or an equivalent or higher qualification, and being actively engaged in personal training.
Ethical approval
The study was conducted according to the Declaration of Helsinki, and all procedures were approved by the ethics committee of the Faculty of Psychology and Educational Sciences, Ghent University (reference code: 2024-005). After receiving a description of the study process, all participants provided online written informed consent to participate in the current study and could withdraw at any time (including after the actual goal of the study and the involvement of ChatGPT were made clear). All participants were informed that their data would be pseudonymized.
Statistical analysis
Differences between the grades of the answers, both overall and for each grading component, were determined using permutation tests with the function perm.test from the package jmuOutlier in R software (R: The R Project for Statistical Computing). Permutation tests were used to test for group differences with the test statistic set to the mean. We further tested the robustness of our results with a sensitivity analysis using the median as the test statistic to account for the potential effects of any outliers. For both the overall score and the component scores for each question, p-values were approximated from 100,000 simulations to gauge the strength of evidence of a difference between the groups. Additional analysis of the comprehensibility of the answers was done using the Flesch Reading Ease (FRE) score and the Flesch–Kincaid Grade Level from the textstat library (version 0.7.10) in Python (version 3.11.7), using the flesch_reading_ease and flesch_kincaid_grade functions, respectively. These scores were calculated for each answer in both groups and compared using two-sided permutation tests with the permutation_test function (number of permutations = 9999, permutation_type = ‘independent’) from the scipy.stats library (version 1.11.4). To evaluate the consistency among graders, inter-rater reliability (IRR) was assessed using the intraclass correlation coefficient (ICC) function from the irr package in R software. ICC values could be calculated for all grading PTs and for the scientific expert graders within the topic “training for fat loss”, as these were the only instances with multiple raters per item - a prerequisite for ICC computation. Cut-off values for the interpretation of the ICC values were based on Koo and Li (2016).
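To make the pipeline above concrete, the following sketch mirrors the two comparison steps in Python only: a two-sided permutation test on the grades (mean as the main statistic, median as the sensitivity statistic) and the per-answer readability metrics. The grade values and answer texts are hypothetical placeholders, and scipy.stats.permutation_test is used here as a stand-in for the R perm.test (jmuOutlier) call that was used for the main grade comparisons in the study.

```python
import numpy as np
import textstat
from scipy.stats import permutation_test

# Hypothetical overall grades (0-10) for one question, one value per grader.
chatgpt_grades = np.array([8.5, 9.0, 7.5, 8.0, 9.5, 8.0])
pt_grades = np.array([6.0, 7.5, 5.5, 7.0, 6.5, 6.0])

def mean_diff(x, y, axis):
    # Main analysis: difference in group means.
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

def median_diff(x, y, axis):
    # Sensitivity analysis: difference in group medians.
    return np.median(x, axis=axis) - np.median(y, axis=axis)

for stat in (mean_diff, median_diff):
    res = permutation_test(
        (chatgpt_grades, pt_grades),
        stat,
        permutation_type="independent",  # two independent groups of scores
        n_resamples=9999,
        alternative="two-sided",
    )
    print(f"{stat.__name__}: statistic={res.statistic:.2f}, p={res.pvalue:.4f}")

# Readability: Flesch Reading Ease and Flesch-Kincaid Grade Level per answer.
chatgpt_answers = ["Example ChatGPT answer text ...", "Another ChatGPT answer ..."]
pt_answers = ["Example personal trainer answer ...", "Another PT answer ..."]

fre_chatgpt = [textstat.flesch_reading_ease(a) for a in chatgpt_answers]
fre_pt = [textstat.flesch_reading_ease(a) for a in pt_answers]
fkg_chatgpt = [textstat.flesch_kincaid_grade(a) for a in chatgpt_answers]
fkg_pt = [textstat.flesch_kincaid_grade(a) for a in pt_answers]
# Each readability score list would then be compared between groups with the
# same two-sided permutation test as above.
```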
|
Results
Study sample
After applying the exclusion criteria, nine questions and answers from nine PTs were eligible for inclusion out of the initial 47 inputs received. The age of the nine PTs providing the questions and answers ranged from 19 to 48 years (mean = 38 years). Their experience in personal training varied from 0.5 to 15 years, with a mean of 5.3 years. The nationalities of the PTs were Spanish, Swedish, Malaysian, Belgian, French, Danish, and Pakistani, and two were Dutch. All had obtained a European Qualification Framework Level 4 (EQF 4) personal training certificate and were currently active in personal training. Additional personal training-related qualifications held by the PTs are displayed in Table S1. Of the nine included questions, five related to training for fat/weight loss, one to training frequency, one to training motivation, one to training around pain, and one to the time of day for training. The answers from the personal trainers across the nine questions ranged from 80 to 280 words, with an average of 140 words, whereas those from ChatGPT ranged from 181 to 274 words, with an average of 241 words. The scientific experts who graded the topic-specific answers represented a diverse range of countries of employment. Among the nine reviewers assigned to each topic, at least five different countries were represented, with one topic including experts from up to eight countries. Furthermore, for three of the topics, seven out of nine reviewers were active professors in the respective field, while the remaining two topics included no fewer than five and six professors among the reviewers, respectively (Table S2). As for the grading PTs, five different nationalities were represented, with the majority being Belgian. All grading PTs held an EREPS Level 4 personal training qualification or an equivalent or higher certification (Table S3).
Grades of answers from PTs and ChatGPT
There was strong evidence of higher mean comprehensibility scores for ChatGPT in six instances, including Question 2 (8.63 for ChatGPT vs 6.15 for PTs). Finally, there was strong evidence of higher mean actionability scores for ChatGPT on five occasions, including Question 2 (7.22 for ChatGPT vs 5.11 for PTs).
Sensitivity analysis
Sensitivity analysis using the median instead of the mean as the test statistic for group differences was also conducted. These results can be seen in Table S10. In general, the
|
Discussion
ChatGPT outperforms human PTs in answering common exercise training questions
The results of the current study provide compelling evidence supporting our hypothesis that responses from ChatGPT outperform those from human PTs when answering common exercise training questions, as indicated by higher mean overall scores for ChatGPT in six of the nine questions. In contrast, overall scores for human PTs did not exceed those of ChatGPT in any of the questions. A similar pattern was also seen for each of the individual grading criteria of which the overall scores were composed (i.e., scientific correctness, comprehensibility, and actionability), suggesting that ChatGPT provided better responses across multiple distinct components relevant to providing high-quality answers. Recent studies have compared the generative performance of different chatbots (Havers et al.).
LLMs for fitness and health knowledge acquisition
Our findings support the use of ChatGPT for knowledge acquisition in the fitness and health domain. We propose that ChatGPT and other LLMs could support health professionals in promoting commitment to physical activity and exercise guidelines by providing accurate and comprehensible responses to inquiries in real time. Parallels can be seen in other branches of health, such as medicine and nutrition. In medicine, LLMs have been proposed for answering patient queries, educating medical students or patients, and managing chronic diseases (Thirunavukarasu et al.). In recent years, the internet has become a key source of information in the fitness and health space (Tan and Goonawardene).
AI-driven chatbots may support fitness professionals and trainees
Studies conducted so far in the field of exercise and health indicate that chatbots can improve the efficiency of resource allocation by relieving human operators of duties that can be automated (Fadhil and Gabrielli).
Strengths
We highlight some strengths of our work. First, all participants—including PTs submitting questions and answers, grading PTs, and scientific experts—were blinded to group allocation, reducing bias. Second, the questions used were directly representative of those commonly asked by trainees seeking support from a PT, enhancing real-world relevance. Third, evaluating answers across three components (scientific correctness, comprehensibility, and actionability) provided a multidimensional assessment of answer quality. The grading structure – one-third scientific experts and two-thirds PTs – reflected these components appropriately, and the resulting grades demonstrated good inter-rater reliability. |
Limitations
We note the following limitations. Firstly, although the sample size was smaller than initially planned (9 vs 17), the experimental design generated a total of 486 individual ratings, allowing stable comparative estimates. We therefore interpret the findings as exploratory and hypothesis-generating rather than definitive. Due to the limited sample size, we were unable to examine whether answer quality differed among PTs with varying levels of academic education. This is an important avenue for future research, particularly given the relatively low entry requirements for becoming a licensed PT. The inter-rater reliability (IRR) analysis among the scientific experts could only be conducted for those grading the five questions related to the “training for fat loss” topic, as for all other topics, each expert rated only one subset of answers. Given the rapid evolution of LLMs, it is reasonable to expect that newer versions of ChatGPT (e.g., GPT-5 and beyond) will demonstrate greater consistency and accuracy, particularly regarding factual and practical aspects of exercise prescription. Future research should therefore consider model versioning as a key methodological factor, systematically recording the model type, release date, and prompting protocol, enabling temporal benchmarking of AI performance. |
|
Conclusion
Our study found that ChatGPT (version 3.5) outperformed human PTs in answering commonly asked exercise training questions. The overall quality of responses from ChatGPT was higher in six of the nine questions investigated, while the answers of human PTs scored higher in none of the questions. These findings extended to each of the individual metrics used to assess the quality of the answers, showing a general superiority of ChatGPT in providing actionable, comprehensible, and scientifically correct answers to common exercise training questions. Our results provide evidence that AI-driven LLMs such as ChatGPT may be used by knowledge-seeking trainees to answer common training questions, which could enhance knowledge acquisition, encourage commitment to exercise guidelines, and reduce the workload for PTs. Future work should look to validate our findings using a larger sample of questions and answers to better identify the strengths and weaknesses of LLMs such as ChatGPT across a broad range of exercise training topics.
ACKNOWLEDGEMENTS
The author(s) reported there is no funding associated with the work featured in this article. No potential conflict of interest was reported by the authors. Experiments comply with the current laws of the country in which they were performed. The data that support the findings of this study are available on request from the corresponding author. |
AUTHOR BIOGRAPHY
|
REFERENCES
|