Today’s research highlights a significant trend toward optimizing human-agent and agent-agent collaboration frameworks across diverse domains. In human-computer interaction, the Computer-Use Agent (CUA) as Judge paradigm introduces a Coder-CUA collaboration framework that shifts interface design from human-centric aesthetics toward agent-native efficiency, achieving substantial improvements in task solvability and navigation success. Meanwhile, in language agent efficiency, DEPO (Dual-Efficiency Preference Optimization) addresses both step-level and trajectory-level efficiency, cutting token usage by up to 60.9% while maintaining performance. In clinical NLP, the OEMA (Ontology-Enhanced Multi-Agent Collaboration Framework) leverages SNOMED CT (Systematized Nomenclature of Medicine—Clinical Terms) to achieve near-supervised performance in zero-shot Named Entity Recognition (NER), demonstrating how multi-agent systems with structured knowledge can bridge the gap between zero-shot and supervised learning.

TL;DR

Total papers: 29 , Selected papers: 3

Here’s a TL;DR summary of the three papers on multi-agent collaboration and optimization:

Main Themes:

Multi-agent collaboration emerges as a powerful paradigm for complex tasks
Agent-environment co-design shifts from adapting agents to human interfaces toward optimizing environments for agents
Efficiency optimization becomes crucial as agent systems scale, with multiple dimensions of efficiency to consider

Key Insights:

Agent-Native UI Design (https://arxiv.org/abs/2511.15567v1): Proposes a Coder-CUA collaboration where coding agents design interfaces and computer-use agents evaluate them. Shows that designing UIs specifically for agents (rather than humans) improves task solvability from 67.9% to 81.5% and navigation success from 7.3% to 19.0%.
Dual-Efficiency Agent Optimization (https://arxiv.org/abs/2511.15392v1): Introduces DEPO for optimizing both step-level (tokens per step) and trajectory-level (number of steps) efficiency in LLM agents. Achieves 60.9% token reduction and 26.9% step reduction while improving performance by 29.3%.
Ontology-Guided Clinical NER (https://arxiv.org/abs/2511.15211v1): OEMA framework uses multi-agent collaboration with SNOMED CT ontology for zero-shot clinical entity recognition. Matches supervised model performance without labeled data, demonstrating how structured knowledge enhances agent reasoning.

Overall: These papers highlight a trend toward specialized agent collaboration frameworks that leverage domain knowledge and efficiency optimization, moving beyond single-agent approaches to tackle complex real-world problems.

Computer-Use Agents as Judges for Generative User Interface

Authors: Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou

Keywords: Computer-Use Agents, Automatic GUI Design, UI Benchmark, Agent-Environment Collaboration, Task Solvability, CUA Navigation, Human-Free Evaluation

Comments: Project: https://showlab.github.io/AUI Github: https://github.com/showlab/AUI

Paper link: http://arxiv.org/abs/2511.15567v1

Abstract

Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans–prioritizing aesthetics and usability–forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.

Summary

This paper introduces AUI-Gym, a benchmark for automatic GUI development, and proposes a Coder–CUA collaboration framework where coding-oriented language models (Coders) act as designers and Computer-Use Agents (CUAs) serve as judges to iteratively improve user interfaces (UIs). The key motivation is to shift UI design from human-centric aesthetics toward agent-native efficiency, enabling environments optimized for automated task execution rather than human usability.

The main contributions are threefold:

AUI-Gym Benchmark: A scalable testbed with 52 web applications across six domains (e.g., apps, games, tools), featuring 1,560 tasks proposed by GPT-5 and validated by humans. Each task includes a rule-based verifier that programmatically checks task feasibility, ensuring reliable, human-free evaluation.
Coder–CUA Collaboration Framework: Coders (e.g., GPT-5, Qwen3-Coder) generate and refine UIs, while CUAs (e.g., UI-TARS, Operator) navigate the interfaces and provide feedback via two signals: task solvability (identifying missing functionality) and CUA navigation success.
CUA Dashboard: A visual summary that compresses multi-step CUA trajectories into a single image, highlighting key interactive regions and reducing visual tokens by 76.2% on average. This dashboard offers interpretable feedback to guide UI redesigns, such as simplifying layouts or increasing contrast.

Experiments show that the framework significantly improves both function completeness (from 67.9% to 81.5% for GPT-5) and CUA success rates (e.g., from 7.3% to 19.0% for Qwen3-Coder). Task solvability feedback alone boosts functionality, while CUA navigation feedback drives agent-centric optimizations like de-stylization and clearer controls. The work demonstrates that designing UIs for agents, rather than adapting agents to human-centric interfaces, enhances both robustness and execution efficiency.

Critique

Of course. Here is a critique of the paper “Computer-Use Agents as Judges for Generative User Interface,” focusing on its strengths, weaknesses, novelty, significance, and clarity.

Overall Summary

This paper presents a compelling vision for “agent-native” UI design by introducing a novel collaborative framework where a Coder agent designs interfaces and a Computer-Use Agent (CUA) evaluates them. The work is ambitious, well-executed, and addresses a timely and important problem. The creation of the AUI-Gym benchmark is a significant contribution in itself.

Strengths

High Novelty and Conceptual Shift: The core idea is highly novel. Instead of the prevailing paradigm of adapting agents to human-centric UIs, the paper flips the problem: it adapts the UI to the agent. The “Coder-as-Designer” and “CUA-as-Judge” collaboration framework is a creative and powerful conceptual shift.
Substantial and Well-Designed Benchmark (AUI-Gym): The paper’s empirical foundation is strong. AUI-Gym is a comprehensive benchmark featuring:
- Scale and Diversity: 52 applications across 6 distinct domains (App, Game, Tool, etc.) and 1,560 tasks.
- Realism: Tasks are synthesized by a powerful LLM (GPT-5) and then human-validated, balancing scalability with quality.
- Reliable Evaluation: The use of a GPT-5-powered “Verifier” to generate programmatic, rule-based checkers for each task is a robust and clever solution to the notoriously difficult problem of evaluating interactive UI performance, avoiding the biases of VLM-based judges.
Innovative Technical Contribution (CUA Dashboard): The CUA Dashboard is an elegant solution to a critical problem: how to condense long, multi-step CUA interaction trajectories into a concise, interpretable format for the Coder. Reducing visual tokens by 76.2% while preserving essential cues is a practical and effective innovation that makes the iterative loop feasible.
Strong and Insightful Empirical Results: The experiments are thorough and yield clear, significant findings:
- Dual-Bottleneck Identification: The clear distinction between “Task Solvability” (functional completeness) and “CUA Navigation” (execution success) is a key insight.
- Framework Effectiveness: The integrated feedback loop consistently improves both metrics across multiple Coder models (GPT-5, Qwen, GPT-4o), demonstrating the generalizability of the approach.
- Actionable Qualitative Analysis: The qualitative examples (Figure 6) effectively illustrate the different types of revisions driven by the two feedback types (e.g., adding functionality vs. de-stylizing for clarity).

Weaknesses and Potential Improvements

Limited Exploration of the “Optimal Agent-UI”: The paper demonstrates that the framework improves UIs for a specific CUA, but it doesn’t deeply explore what an optimal agent-native UI actually looks like as a general principle. While the qualitative analysis shows trends (de-stylization, higher contrast), a more systematic set of design principles derived from the data would strengthen the contribution. Is there a risk of over-fitting the UI to the specific quirks of the CUA used for evaluation?
Human-Agent Trade-off is Acknowledged but Under-Explored: The paper briefly mentions that the resulting UIs are “optimized for agent-native efficiency and reliability,” not aesthetics. This is a critical limitation. A discussion or small experiment on how these agent-optimized UIs perform with human users would be valuable. Do they become unusable for people, or is there a beneficial middle ground? This is crucial for real-world applicability.
Cost and Scalability of the Framework: The framework relies heavily on iterative calls to powerful, expensive models (GPT-5 as Coder, Verifier, Commenter; Operator/UI-TARS as CUA). While the results are impressive, the computational cost and latency of this multi-step, iterative process are significant practical hurdles. A discussion of these limitations and potential avenues for efficiency (e.g., smaller, specialized models) would provide a more complete picture.
Ablation on the Verifier’s Role: The Verifier is a cornerstone of the benchmark’s reliability. However, its performance and potential failure modes are not deeply analyzed. How often does the GPT-5 Verifier incorrectly deem a task solvable or unsolvable? A small validation study on the Verifier’s accuracy would increase confidence in the primary results.

Significance of Results

The results are highly significant for the fields of HCI, AI, and software automation. They convincingly show that:

Automatic, iterative UI improvement is feasible. The framework leads to measurable and substantial gains in both functionality and usability from an agent’s perspective.
The primary bottleneck is not just functionality but navigability. This shifts the focus of automated UI design from merely generating working code to generating discoverable and actionable code for AI agents.
Agent-centric design requires different priorities. The qualitative results suggest that visual simplicity, clear affordances, and immediate state feedback are more critical for agents than aesthetic appeal.

This work paves the way for a future where software can self-improve its interface for automation, potentially revolutionizing areas like robotic process automation (RPA), software testing, and accessibility.

Clarity of Presentation

The paper is generally very well-written and clearly structured.

Strengths:
- The problem motivation is excellent, using a clear “Humans vs. Coder-CUA” illustration (Figure 1).
- The pipeline figures (Figures 3, 4) effectively guide the reader through the complex AUI-Gym creation and Coder-CUA collaboration processes.
- The results in Table 3 are comprehensive and clearly show the impact of different feedback types.
Minor Weaknesses:
- The paper is dense, and the number of components (Coder, CUA, Verifier, Proposer, Commenter, Dashboard) can be overwhelming on first read. A glossary or a more simplified high-level diagram before diving into details could help.
- The prompt appendix is useful but extensive. A summary of the key prompting strategies in the main text, rather than just relegating all details to the appendix, might help readers grasp the methodology more quickly.

Conclusion

This is a high-quality, impactful paper that introduces a novel paradigm for UI design. Its strengths—a creative concept, a robust new benchmark, and a clever technical solution (the Dashboard)—far outweigh its weaknesses. The work makes a significant contribution by demonstrating a viable path toward environments co-designed by and for AI agents, with results that are both statistically significant and intuitively compelling. It is likely to inspire considerable follow-up research in the community.

DEPO: Dual-Efficiency Preference Optimization for LLM Agents

Authors: Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu

Keywords: LLM agents, preference optimization, efficiency optimization, reinforcement learning, dual-efficiency, step-level efficiency, trajectory-level efficiency

Comments: Accepted to AAAI 2026

Paper link: http://arxiv.org/abs/2511.15392v1

Abstract

Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.

Summary

Here is a summary of the paper “DEPO: Dual-Efficiency Preference Optimization for LLM Agents”:

Key Contributions: The paper introduces the concept of “dual-efficiency” for LLM agents, comprising (1) step-level efficiency (minimizing tokens per interaction step) and (2) trajectory-level efficiency (minimizing number of steps to complete a task). This addresses limitations in current efficiency discussions that primarily focus on token reduction while ignoring the interaction dynamics of agents that require API calls or web services at each step. The authors propose DEPO, a Dual-Efficiency Preference Optimization method that explicitly targets both efficiency dimensions.

Method: DEPO extends the Kahneman-Tversky Optimization (KTO) framework by incorporating an efficiency-aware bonus into the reward function. The method uses a two-stage training process: first behavioral cloning on high-quality desirable trajectories generated via Monte Carlo Tree Search, followed by DEPO optimization that contrasts desirable and undesirable samples. The efficiency bonus rewards trajectories with fewer tokens per step and fewer total steps, applied only to desirable samples to avoid injecting unnecessary signals into low-quality data. The approach relies solely on offline preference data without requiring paired annotations, reward model training, or on-policy sampling.

Results: Experiments on WebShop and BabyAI benchmarks show that DEPO achieves significant efficiency improvements while maintaining or even enhancing performance. Compared to behavioral cloning, DEPO reduces token usage by up to 60.9% and step count by up to 26.9% over vanilla KTO, while achieving up to 29.3% improvement in success rates. The method demonstrates strong generalizability across three out-of-domain math benchmarks (GSM8K, MATH, SimulEq) and maintains efficiency gains even when trained with only 25% of the data, showing excellent sample efficiency. Ablation studies confirm that jointly optimizing both efficiency dimensions provides the best performance-efficiency trade-off.

Critique

Of course. Here is a critique of the paper “DEPO: Dual‑Efficiency Preference Optimization for LLM Agents,” focusing on its strengths and weaknesses.

Overall Assessment

This is a strong, well-executed paper that addresses a timely and practical problem: the inefficiency of Large Language Model (LLM) agents. The proposed method, DEPO, is elegant, effective, and demonstrates impressive results across multiple benchmarks. The paper is clearly written and structured.

Strengths

Novel and Well-Motivated Problem Formulation: The paper’s core strength lies in its clear definition of “dual-efficiency.” By distinguishing between step-level efficiency (tokens per step) and trajectory-level efficiency (number of steps), the authors move beyond the common, narrower focus on just token count. This dual perspective is crucial for real-world agent applications where API calls and environment interactions incur significant latency and cost per step. The problem framing is both novel and highly relevant.
Elegant and Practical Method: DEPO is a simple yet powerful extension of the existing KTO (Kahneman-Tversky Optimization) framework. Its key innovations are:
- Efficiency Bonus: The addition of a bonus term b(τ) that directly rewards desirable trajectories for being concise (low T_token) and direct (low T_step).
- Simplicity and Stability: By building on the stable, offline KTO method, DEPO avoids the complexities of training a separate reward model or performing on-policy sampling (like PPO). This makes it computationally efficient and accessible.
- Selective Application: The design choice to apply the bonus only to desirable trajectories (and not a penalty to undesirable ones) is empirically justified and simplifies the algorithm.
Comprehensive and Convincing Evaluation: The experimental section is a major strength.
- Diverse Benchmarks: The use of interactive environments (WebShop, BabyAI) and reasoning tasks (GSM8K, MATH, SimulEq) provides a robust test of the method’s capabilities.
- Significant Results: The reported improvements are substantial: up to 60.9% reduction in tokens and 26.9% reduction in steps, often coupled with an improvement in task success rate (up to 29.3%). This demonstrates that DEPO improves efficiency without sacrificing, and sometimes even enhancing, performance—a critical result.
- Strong Generalizability and Sample Efficiency: The positive results on out-of-domain math tasks and with only 25% of the training data show that DEPO is not just effective but also robust and data-efficient.
Thorough Ablation Studies: The authors go beyond just showing that DEPO works; they deconstruct why it works.
- The ablation on α1 and α2 convincingly shows that optimizing for both step and trajectory efficiency jointly yields the best overall trade-off.
- The experiment on adding an “undesirable penalty” provides a clear justification for their design choice, showing that it degrades performance.

Weaknesses

Limited Comparison to State-of-the-Art Agent-Tuning Methods: While the paper compares DEPO against a strong set of base models and basic training paradigms (BC, KTO), it lacks a direct comparison to recent, more advanced agent-specific alignment methods like DMPO, StarPO, or GiGPO (which are mentioned in the related work). A comparison showing that DEPO achieves superior or comparable efficiency gains would have strengthened the claim of its contribution relative to the current state-of-the-art.
Ablation on the Data Generation Pipeline: The data generation process is complex, involving MCTS and a separate “rephrasing model” (GPT-4). While MCTS is a standard and powerful technique, the contribution of the rephrasing step is not ablated. It is unclear how much of the final performance and efficiency is attributable to DEPO versus this sophisticated, and potentially expensive, data curation process. A simpler baseline (e.g., using MCTS data without rephrasing) would have helped isolate DEPO’s specific contribution.
Hyperparameter Sensitivity: The paper shows that the α1 and α2 parameters are important, but it does not discuss how sensitive the method is to their values or provide guidance on how to set them for new tasks. The chosen values (2, 2, 3, 3) seem to work well, but a deeper analysis of the hyperparameter landscape would be valuable for practitioners.
Clarity on “Rephrasing”: The description of the rephrasing step is somewhat brief. A more detailed explanation or an example of a “Thought” before and after rephrasing would help the reader understand precisely what transformation is being applied to the data and how it contributes to step-level efficiency.

Summary

DEPO is a novel, effective, and practically valuable method for optimizing the efficiency of LLM agents. Its key contribution is the well-justified dual-efficiency framework and its simple, stable integration into an existing preference optimization algorithm. The results are significant, demonstrating dramatic reductions in token and step counts while maintaining or improving task performance. The main weaknesses are the lack of direct comparison to the most recent agent-tuning baselines and some opacity regarding the contribution of the data generation pipeline. Despite these minor points, the paper presents a compelling solution to a critical problem and is likely to influence future work on efficient LLM agents.

OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Authors: Xinli Tao, Xin Dong, Xuezhong Zhou

Keywords: Clinical Named Entity Recognition, Multi-Agent Systems, Zero-Shot Learning, Medical Ontology, Self-Improvement Framework, Prompt Engineering, SNOMED CT

Comments: 12 pages, 4 figures, 4 tables

Paper link: http://arxiv.org/abs/2511.15211v1

Abstract

Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA’s three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.

Summary

Based on the provided paper, here is a summary focusing on its key contributions, methods, and results.

Key Contributions: The paper introduces OEMA (Ontology-Enhanced Multi-Agent Collaboration Framework), a novel framework designed to address two primary challenges in zero-shot Clinical Named Entity Recognition (NER): (1) the mismatch between coarse, sentence-level example selection and the fine-grained, token-level nature of the NER task, and (2) the lack of effective integration between advanced prompt engineering and self-improvement frameworks. OEMA is the first work to explicitly combine type priors with structured examples within a multi-agent system for this task.

Methods: OEMA employs a multi-agent architecture consisting of three collaborative agents:

Self-Annotator: Automatically labels an unlabeled clinical corpus to create a self-annotated dataset, using self-consistency and majority voting to ensure reliability.
Discriminator: Selects high-quality examples for a given target sentence. It first retrieves a candidate set via cosine similarity and then refines the selection using a fine-grained, token-level strategy. This strategy leverages the SNOMED CT ontology to extract and score clinical concept overlaps between the target and candidate examples, assigning a “helpfulness score” to each.
Predictor: Performs the final NER by using a prompt that fuses entity-type descriptions with the top-k self-annotated examples selected by the discriminator. This creates a dual prompting strategy of “type priors + structured examples.”

Results: Experimental results on the MTSamples and VAERS datasets demonstrate that OEMA achieves state-of-the-art performance in zero-shot clinical NER.

Under exact-match evaluation, OEMA significantly outperformed all zero-shot baselines (Vanilla, IILLM, SILLM) on both GPT-3.5 and Gemini backbones.
Under relaxed-match evaluation, OEMA’s performance was comparable to the supervised BioClinicalBERT model and substantially outperformed traditional supervised methods like CRF.
Ablation studies confirmed that both the entity-type descriptions and the self-annotated few-shot examples are crucial components, with the removal of either leading to a performance drop. Case studies further illustrated how OEMA’s ontology-guided example selection corrects errors made by baseline methods.

Critique

Of course. Here is a critique of the paper “OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition,” covering its strengths and weaknesses.

Overall Summary

This paper presents a well-structured and compelling study on a challenging and practically important problem: zero-shot clinical Named Entity Recognition (NER). The proposed OEMA framework is a sophisticated multi-agent system that effectively integrates ontology-guided reasoning with self-improvement mechanisms. The results are significant, demonstrating state-of-the-art performance in a zero-shot setting and approaching the performance of supervised models.

Strengths

Clear Problem Formulation and Motivation: The paper excellently identifies and articulates two specific, non-trivial challenges in zero-shot NER:
- Challenge 1: The granularity mismatch between sentence-level example selection and token-level NER tasks.
- Challenge 2: The lack of practical integration between advanced prompt design and self-improvement frameworks. This clear problem statement provides a strong foundation and justification for the proposed work.
Novelty of the Approach: The core idea of using a multi-agent framework to decompose the problem is innovative and well-executed.
- Token-Level Ontology Filtering: The use of SNOMED CT to score examples at a token/fragment level is a key contribution. This directly addresses Challenge 1 by moving beyond simplistic sentence-level similarity.
- Synergistic Prompting Strategy: The explicit combination of “type priors” (entity-type descriptions) with “structured examples” (selected self-annotated examples) is a powerful idea that effectively addresses Challenge 2. The ablation studies confirm that both components are crucial.
- Multi-Agent Workflow: The division of labor among the Self-Annotator, Discriminator, and Predictor creates a clean, modular, and interpretable pipeline.
Significance and Quality of Results: The experimental results are strong and convincingly support the authors’ claims.
- State-of-the-Art Performance: OEMA demonstrably outperforms strong baselines (Vanilla, IILLM, SILLM) on two distinct clinical datasets under exact-match criteria.
- Comparison to Supervised Models: The most impressive result is that under relaxed-match criteria, OEMA (with Gemini) performs comparably to the supervised BioClinicalBERT model. This is a remarkable achievement for a zero-shot method and highlights its practical utility in scenarios where manual annotation is prohibitive.
- Robustness: Testing the framework on two different LLM backbones (GPT-3.5 and Gemini) adds credibility and demonstrates that the approach is not overly dependent on a single model’s capabilities.
Rigorous and Comprehensive Evaluation: The paper goes beyond just main results.
- Ablation Studies: These are essential and clearly show the individual contribution of each core component (entity-type descriptions and self-annotated examples).
- Hyperparameter Analysis: The investigation into K and k provides practical insights for future implementations and shows a thoughtful design process.
- Case Studies: The qualitative analysis helps to interpret the quantitative results and illustrates how the framework corrects errors, enhancing the paper’s explainability.
Clarity of Presentation: The paper is generally well-written and easy to follow.
- The framework diagram (Figure 2) is clear and effectively illustrates the multi-agent workflow.
- The research questions (RQ1-RQ5) provide a logical structure for the results section.
- The supplementary material with example prompts is a valuable addition for reproducibility.

Weaknesses

Computational Cost and Latency: A significant weakness, common to many LLM-based agentic systems, is the lack of discussion on computational efficiency. The three-agent pipeline, each invoking an LLM (with multiple calls for self-consistency and scoring), is undoubtedly expensive and slow. For real-time clinical applications, this cost and latency could be a major barrier to deployment. An analysis of inference time or token usage compared to simpler baselines would have been useful.
Dependence on SNOMED CT: The framework’s performance is tightly coupled with the availability and quality of the SNOMED CT ontology. The paper acknowledges this limitation, but it is a substantive one. The method’s applicability to domains without such a comprehensive, structured ontology is unclear, and its performance might degrade with less perfect ontological resources.
Scale of Evaluation: While the use of two datasets is reasonable, they are described as relatively small. Broader validation on larger, more diverse clinical corpora (e.g., MIMIC-III notes) would strengthen the claims of generalizability and robustness.
Baseline Implementation Detail: The authors note that all baselines were re-implemented using gpt-3.5-turbo-0125 to address version discrepancies. While this ensures a fair comparison, it would be helpful to briefly discuss if any hyperparameters or prompt-tuning from the original papers were retained or had to be re-optimized, to assure the reader that the baseline performances are faithfully reproduced.
Clarity on the “Discriminator” Agent: The description of the discriminator, while clear in its goal, slightly blurs the line between a “prompting strategy” and a distinct “agent.” It is essentially another LLM call with a specialized prompt for scoring. The paper could more explicitly frame the “multi-agent” aspect as a conceptual and prompt-based decomposition of the task, rather than implying separate, trained models.

Conclusion

This is a high-quality paper that makes a meaningful contribution to the field of zero-shot clinical NLP. Its strengths in novelty, clear problem-solving, and significant empirical results far outweigh its weaknesses. The OEMA framework presents a sophisticated and effective solution to key challenges in zero-shot NER. The primary areas for future work, as also noted by the authors, involve scaling the evaluation, reducing computational costs, and extending the framework to be less dependent on pre-existing ontologies.