Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center tested OpenAI's o1 and GPT-4o on 76 real emergency room cases, presenting the models with the same text-based EMR data available at triage. The o1 model produced accurate or near-accurate diagnoses in 67% of cases, versus 55% and 50% for the two attending physicians tested — a statistically meaningful gap at initial triage, the exact moment when information is most limited and urgency is highest.
The study is cautious: it used internal medicine physicians rather than ER specialists, and the researchers emphasized that AI is not yet ready for unsupervised clinical decisions. But this is the first peer-reviewed benchmark demonstrating an LLM diagnostic advantage over real clinicians in a clinical-record setting. The implications reach beyond healthcare: as AI accuracy data accumulates across regulated professions (medicine, law, finance), AI citations in those verticals will carry authority signals — a new GEO surface category. KwikGEO should begin tracking regulated-industry AI citation patterns as an emerging optimization target distinct from consumer ecommerce.
TechCrunch · Harvard Medical School · May 3