OSCRSJ
LLMs · 4 min read

ChatGPT-4o and DeepSeek align with AAOS clavicle fracture guidelines about 90 percent of the time, but neither delivers patient-actionable guidance

Source: BMC Medical Informatics and Decision Making · Published: 2025

Authors: Keçeci T, Karagöz B · DOI: 10.1186/s12911-025-03202-5 · Open Access

Key figure: Figure 4, a side-by-side bar chart of binary accuracy, weighted accuracy, and concordance across the 14 AAOS clavicle CPG questions for ChatGPT-4o versus DeepSeek. View in source

Bottom line: Binary accuracy against the 2022 AAOS clavicle CPG was 0.93 for ChatGPT-4o and 0.89 for DeepSeek (p > 0.05, not significant). Both models produced coherent, high-accuracy answers. Both also scored a median of 0 on PEMAT actionability, meaning patients could not readily act on what was written. Occasional hallucinations were observed.

What the study did

The authors rephrased each recommendation from the 2022 AAOS Clinical Practice Guideline on clavicle fractures into a standardized open-ended prompt, yielding 14 clinical questions. Each question was submitted once to ChatGPT-4o and once to DeepSeek. Two orthopedic surgeons independently rated the responses with DISCERN, PEMAT-P, and the CLEAR tool; readability was assessed with the Flesch-Kincaid Grade Level, Flesch Reading Ease, and Gunning Fog Index. Binary accuracy (complete concordance with the CPG) and weighted accuracy (crediting partial concordance) were computed against the source guideline, and Mann-Whitney U tests compared the two models.
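For readers who want to see the scoring arithmetic, the sketch below reproduces the logic in miniature: binary accuracy counts only complete concordance, weighted accuracy gives partial credit, and a Mann-Whitney U test compares the two models. This is not the authors' code; the concordance labels, credit weights, and per-question ratings are illustrative assumptions, not data from the paper.

    # Minimal sketch of the accuracy metrics and the between-model comparison.
    # Concordance labels, weights, and ratings below are illustrative
    # assumptions, not data from the paper.
    from scipy.stats import mannwhitneyu

    # Hypothetical per-question concordance ratings for 14 CPG questions:
    # "full" = complete concordance, "partial" = partial, "none" = discordant.
    chatgpt_4o = ["full"] * 12 + ["partial"] * 2
    deepseek = ["full"] * 11 + ["partial"] * 2 + ["none"]

    WEIGHTS = {"full": 1.0, "partial": 0.5, "none": 0.0}  # assumed credit scheme

    def binary_accuracy(ratings):
        # Only complete concordance with the guideline counts.
        return sum(r == "full" for r in ratings) / len(ratings)

    def weighted_accuracy(ratings):
        # Partial concordance earns partial credit under the assumed weights.
        return sum(WEIGHTS[r] for r in ratings) / len(ratings)

    for name, ratings in [("ChatGPT-4o", chatgpt_4o), ("DeepSeek", deepseek)]:
        print(name,
              round(binary_accuracy(ratings), 2),
              round(weighted_accuracy(ratings), 2))

    # Nonparametric comparison of per-question scores, mirroring the study's
    # Mann-Whitney U tests.
    stat, p = mannwhitneyu([WEIGHTS[r] for r in chatgpt_4o],
                           [WEIGHTS[r] for r in deepseek])
    print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")

The half-credit weighting here is only a stand-in; the paper's weighted accuracy depends on how the authors assigned partial credit.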

What they found

Binary accuracy was 0.93 for ChatGPT-4o and 0.89 for DeepSeek (p > 0.05). Weighted accuracy was 0.83 and 0.79, respectively (p > 0.05). DeepSeek responses were longer (median 572 vs 438.5 words, p = 0.016) and scored higher on CLEAR (18 vs 16, p < 0.001), but PEMAT understandability, PEMAT actionability, and total PEMAT scores were statistically indistinguishable between the two models. Both systems posted a median PEMAT actionability score of 0, meaning responses contained no patient-actionable elements. The reviewers flagged occasional inaccuracies and hallucinations in both models' output.
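A PEMAT score is conventionally reported as the percentage of applicable items rated "agree," which is why a response with no actionable elements bottoms out at 0. The sketch below illustrates that convention; the item ratings are made up for illustration, not taken from the study.

    # PEMAT-style scoring sketch: percentage of applicable items rated "agree".
    # Item ratings are hypothetical: 1 = agree, 0 = disagree, None = not applicable.
    def pemat_score(item_ratings):
        applicable = [r for r in item_ratings if r is not None]
        if not applicable:
            return None
        return 100 * sum(applicable) / len(applicable)

    # A response with no patient-actionable elements scores 0 on actionability.
    actionability_items = [0, 0, 0, 0, None, 0, 0]
    print(pemat_score(actionability_items))  # 0.0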

Why it matters for orthopedic practice

The study reinforces a pattern now consistent across LLM-versus-guideline evaluations in orthopedics: high surface accuracy, poor actionability, and episodic hallucination. For trainees, this means general-purpose LLMs are a reasonable starting point for studying a CPG, but the output cannot be handed to a patient as discharge instructions without rewriting. For educators, actionability is the design target LLMs have not yet hit on orthopedic content. The study also suggests that, within a single specialty question set, the choice of model (ChatGPT-4o vs DeepSeek) made no significant difference to accuracy.

Limitations

Only 14 questions from a single clinical practice guideline were evaluated, which limits statistical power and generalizability. Each prompt was submitted once per model, so intra-model variance was not captured and prompt engineering was not explored. Two raters scored all responses, and inter-rater reliability was reported as an average rather than per question. Hallucination rates were not formally quantified. The study is also a snapshot in time: LLM behavior changes with each model release, so the specific numbers should not be generalized to future versions.

Keçeci T, Karagöz B. Can large language models follow guidelines? A comparative study of ChatGPT-4o and DeepSeek AI in clavicle fracture management based on AAOS recommendations. BMC Med Inform Decis Mak. 2025;25:350. doi:10.1186/s12911-025-03202-5

Publishing AI research in orthopedics?

OSCRSJ accepts case reports and series on novel AI-assisted diagnoses and surgical planning. Free to publish in 2026.

Submit a manuscript

OSCRSJ News items are editorial summaries for educational purposes. They are not clinical recommendations, endorsements, or substitutes for the primary literature. Always consult the source paper and applicable specialty-society guidelines before changing practice.