Article In Press : Article / Volume 4, Issue 1

Assessing the Accuracy of AI in Understanding Medical Complaints: An Evaluation of Expert Medical Mediator

Toshimi Nakanishi*1, Tatsuya Hayasaka2, Shinya Sato3, Kaneyuki Kawamae4, Yukimi Narita5, Eri Endo6, Mayumi Matsuoka7, Naoya Koshi8

1Associate Professor, General Medical Education Center, Faculty of Medicine, Yamagata University, Japan

2Assistant Instructor, Department of Anesthesiology, Faculty of Medicine, Yamagata University, Japan

3Professor, General Medical Education Center, Faculty of Medicine, Yamagata University, Japan

4Specially Appointed Director, Ohta-nishinouti Hospital; Professor Emeritus, Yamagata University, Japan

5Tsuchizaki Hospital, Japan

6Wakayama Rosai Hospital, Japan

7Tottori Nursing Association, Japan

8Yamagata University Medical School, Japan

Corresponding Author:

Toshimi Nakanishi, Associate Professor, General Medical Education Center, Faculty of Medicine, Yamagata University, Japan

Citation:

Toshimi Nakanishi, et al. (2025). Assessing the Accuracy of AI in Understanding Medical Complaints: An Evaluation of Expert Medical Mediator. Journal of Internal Medicine and Health Affairs, 4(1). DOI: 10.58489/2836-2411/042

Copyright:

© 2025 Toshimi Nakanishi. This is an open-access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Received Date: 20-01-2025   
  • Accepted Date: 27-01-2025   
  • Published Date: 28-01-2025
Keywords:

Patient complaints, artificial intelligence, evaluation by medical mediation experts, comprehensibility of interests.

Abstract

This study investigated the extent to which artificial intelligence (AI) models with natural language processing can understand and respond to medical complaints. The study was conducted using Cohere's model Command R in an environment isolated from the Internet. A dataset of 51 medical complaints, including 9 involving patient deaths, was analyzed. The AI model was tasked with generating a concise response to each complaint that specifically addressed the patient's underlying desires and intentions. Three expert medical mediators independently evaluated the AI-generated responses across four dimensions: facts, feelings, demands, and interests. Ratings used a 5-point Likert scale, with a rating of 4 or 5 indicating "content that can be used in the field." The agreement rate for the AI-generated responses was 0.60. A strong correlation was found between "demand" and "interest" (correlation coefficient: 0.8567), suggesting that the AI model effectively captured these aspects of the patient's communication. However, the assessment of "facts" and "feelings" revealed limitations in the AI's ability to accurately identify and interpret these important elements of patient communication. These results suggest that further development, particularly in terminology and emotional intelligence, is needed before AI can be effectively applied to medical complaint resolution.

Background

Complaints from patients and their families inevitably arise in daily medical practice. If complaints and dissatisfaction are not handled appropriately, they can lead to medical lawsuits; approximately 800 such lawsuits are filed in Japan each year1). Behind this lies a lack of communication, misunderstanding, or a mismatch of expectations between patients and healthcare providers2). The resolution of complaints is therefore particularly important in daily practice. It requires a mutual understanding between the patient/family and the healthcare provider of the interests behind the various complaints3). This mutual understanding helps maintain trust between patients/families and healthcare providers and promotes patient-centered medical care.

To this end, Japan has established an additional reimbursement item for enhanced patient support4). Under this item, medical insurance pays facilities that maintain dedicated personnel (e.g., medical mediators) trained in promoting medical dialogue and locations for that purpose. In addition, a medical accident investigation system has been established for unexpected deaths5). However, one issue in the medical field cannot be adequately resolved by these two initiatives alone: dealing with complaints promptly and resolving them at the level of interests between patients/families and medical care providers. The number of medical mediators who promote medical dialogue has not yet reached the point where it can meet the demands of the medical field.

The majority of complaints and conflicts are expressed verbally5). It is not easy to understand these complaints in terms of the interests at their core, because complaints contain emotional expressions and their content can have a variety of meanings6). For these reasons, experience and training are required of complaint handlers; such training, however, requires several years of practical experience.

Therefore, we examined the possibility of using AI to support those who currently handle complaints.

Although large-scale language models (LLMs) are still in their infancy7), their recent progress has been remarkable, and they have shown various potential applications in the medical field, including healthcare8), dementia diagnosis9), and education10).

In other words, we examined whether AI technology could capture the true meaning of complaints, that is, what patients/families really want to convey to medical professionals: their interests.

Objective

The purpose of this study is to verify whether artificial intelligence (AI) technology can accurately analyze and ascertain patients' true intents and demands from their words. Specifically, the study uses 51 cases that previously became claims or lawsuits to test how accurately AI can understand and evaluate interests from these patients' words. This study aims to improve the quality of medical communication and may ultimately contribute to strengthening trust between patients/families and medical institutions.

Methods

This study was approved by the Ethics Review Committee of Yamagata University School of Medicine on November 9, 2023 and registered with the consent of the subjects (Ethics Review Approval Number: No. 2023-219, UMIN000052838 Title: “Understanding Interest (True Interest and Desire) in Dialogue Analysis of Complaint Resolution Cases - Co-occurrence Network (Quantitative Text Analysis) and Artificial Intelligence (AI) generated in a local (data protection) environment"). The target population for this study is patients who have filed lawsuits nationwide in the past five years and have already settled their cases. Consent to participate in this study was obtained in writing as post-dispute settlement subjects, patient information was anonymized according to strict ethical standards, and the use of data followed a protocol approved by an ethics committee.

Subject

The results of the 10 evaluations provided by the AI for each case were examined by medical mediators to verify their accuracy. Furthermore, from a privacy-protection perspective, the AI analysis was conducted with the Internet blocked, verifying the effectiveness of the technology while maintaining the safety of patient information. Fifty-one cases that had previously become complaints were included in the study. These complaints were converted into written text.

AI used

The AI model "Command R," developed by Cohere, uses multilingual natural language processing technology. It supports 10 languages, including Japanese, and can be customized and optimized by research and educational institutions for noncommercial use only. The target complaint texts were written in Japanese. The following instructions were issued to this AI.

Prompt content

"You are an excellent medical mediator. Please extract what the patient wants done in the following text. Your response should be no more than 200 characters, and please do not use bullet points. Please state what is wanted first." The same instructions were repeated 10 times for each case.

Since a large-scale language model (LLM) produces different outputs for the same prompt depending on its characteristics, conducting multiple trials and selecting the optimal response improves the likelihood of obtaining more consistent results11).

In addition, the questions in this study were asked independently in separate chat sessions. To avoid in-context learning within the same chat, questions were asked one at a time12).

These instructions were given in an environment that was disconnected from the Internet. Responses were recorded as text and sent to the evaluators.
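The querying procedure above (one prompt per case, repeated in 10 fresh sessions so that no chat history carries over between trials) can be sketched as follows. This is an illustrative sketch only, not the authors' code: `query_local_model` is a hypothetical stand-in for the offline Command R endpoint, and the prompt text is paraphrased from the study.

```python
# Illustrative sketch: repeating the same prompt in independent sessions.
# `query_local_model` is a hypothetical placeholder for the offline model call.

PROMPT_TEMPLATE = (
    "You are an excellent medical mediator. "
    "Please extract what the patient wants done in the following text, "
    "in no more than 200 characters and without bullet points.\n\n{complaint}"
)

def query_local_model(prompt: str, session_id: int) -> str:
    # Placeholder: in the study this would call the Internet-isolated Command R model.
    return f"[response from session {session_id}]"

def collect_responses(complaint: str, n_trials: int = 10) -> list:
    """Run the same prompt in n_trials independent sessions and keep every output."""
    prompt = PROMPT_TEMPLATE.format(complaint=complaint)
    # A new session per trial: no shared chat history between trials.
    return [query_local_model(prompt, session_id=i) for i in range(n_trials)]

responses = collect_responses("Example complaint text")
print(len(responses))  # 10 responses per case, as in the study
```

Each returned list would then be recorded as text and forwarded to the evaluators.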

Evaluation Methods for AI Responses

Three professional medical mediators, each a qualified senior trainer with more than 20 years of experience, evaluated the AI responses independently, without interacting with one another. The evaluation was divided into four elements: facts, feelings, demands, and interests. Each element was rated on a 5-point Likert scale: 1, cannot be used for field response; 2, cannot be used much for field response; 3, neither; 4, can be used somewhat for field response; 5, can be used for field response. The AI responded 10 times for each case, so 30 evaluations (10 responses × 3 raters) were conducted per case. In addition, one of the evaluators rated the difficulty of analysis for the 51 cases in advance as easy, not easy, neither, difficult, or very difficult. The median length of the complaint texts was 890 Japanese characters (maximum 4724, minimum 387).

The analysis environment was a high-performance computer running Ubuntu 22.04 with an Intel Core i9 processor and an NVIDIA RTX A6000 GPU. Stata 18.0 was used for statistical analysis.

Results

The total scores in Table 1 were subjected to an analysis of variance, yielding F = 3.06 (df = 1529), p = 0.0471. Thus, a difference was observed in the total scores among the three raters.

As shown in Table 2, the number of AI responses rated 4 or 5 by the raters was 919 of 1530 total evaluations, an agreement rate of 0.60.

Table 1: Descriptive Statistics of 4-Factor Total Scores for 3 Raters

              N     Median   Mean   Standard deviation   95% confidence interval   Maximum   Minimum
Evaluator 1   510   16       16.0   3.2                  15.7-16.3                 20        4
Evaluator 2   510   16       15.6   2.7                  15.4-15.8                 20        4
Evaluator 3   510   16       16.0   3.2                  15.7-16.3                 20        4

Four elements: fact, emotion, demand, interest. The score of each element was rated on a 5-point Likert scale.

Table 2: Number of matches between Artificial Intelligence and evaluators

                                 Valuable
                              NO      YES     Total
Case difficulty      NO       563     517     1080
                     YES      48      492     450
Total                         611     919     1530

Difficulty YES: total number of "easy", "not easy", and "neither" cases.
Difficulty NO: total number of "difficult" and "very difficult" cases.
Valuable NO: total number of evaluations scored 1, 2, or 3.
Valuable YES: total number of evaluations scored 4 or 5.

Table 3: Number of 5-point Likert-scale ratings for the four elements

            1     2     3      4      5      Total
Fact        30    32    326    736    406    1530
Emotion     16    43    327    751    393    1530
Demand      16    49    308    723    434    1530
Interest    14    56    341    675    444    1530
Total       76    180   1302   2885   1677   6120

1: Cannot be used for on-site response.
2: Cannot be used much for on-site response.
3: Cannot say either.
4: Can be used somewhat for on-site response.
5: Can be used for on-site response.

For cases in which the difficulty had increased, the ratio was 492/919, or 0.54. Next, the percentage of ratings of 1, 2, or 3, i.e., those not usable in the field, was 1576/6120, or 0.26 (Table 3). The number of ratings of 1 for the "fact" element was 30, about twice as many as for any of the other three elements.

The correlations among the elements are shown in Table 4. The correlation coefficients among all elements were positive. The highest correlation coefficient was between demand and interest; conversely, the lowest, 0.5983, was between fact and interest.

Table 4: Spearman rank correlation coefficients among the four factors

            Fact         Emotion      Demand       Interest
Fact        1.0000
            1530
Emotion     0.7851***    1.0000
            1530         1530
Demand      0.6693***    0.7079***    1.0000
            1530         1530         1530
Interest    0.5983***    0.6784***    0.8567***    1.0000
            1530         1530         1530         1530

The Sidak test was used for multiple-comparison tests between elements.
The lower number in each cell is the number of cases.
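For illustration, a Spearman rank correlation like those in Table 4 can be computed from two rating vectors as the Pearson correlation of their ranks. The sketch below uses only the standard library; the 5-point scores shown are synthetic examples, not the study data.

```python
from statistics import mean

def rank(data):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    ranks = [0.0] * len(data)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(order) and data[order[j + 1]] == data[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Synthetic 5-point Likert scores for two elements (not the study data).
demand   = [4, 5, 3, 4, 5, 2, 4, 3, 5, 4]
interest = [4, 5, 3, 5, 5, 2, 4, 3, 4, 4]
print(f"rho = {spearman(demand, interest):.4f}")
```

With the study's 1530 paired ratings per cell, the same computation yields the coefficients in Table 4.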

Discussion

We examined the AI's responses to 51 examples of medical complaints from a narrative perspective, tasking the AI with concisely answering, on the complainant's behalf, the question "What do you want me to do?" The AI-generated responses were independently evaluated by three expert medical mediators on four components: facts, feelings, demands, and interests. The results showed that agreement between the evaluators and the AI was 0.60, rejecting the null hypothesis of 0.50. The correlation between fact and interest was the lowest among the four-factor correlations.

The result was an agreement rate of 0.60, as shown in Table 2, suggesting that narrative responses were created to some extent. In Table 3, the four-factor total of ratings of 4 and 5 was 4562, and 4562/6120 is 0.75. Table 3 thus supports the results in Table 2.
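The two ratios above follow directly from the table counts; a quick arithmetic check:

```python
# Agreement-rate arithmetic from Tables 2 and 3 (counts taken from the tables above).
usable = 919          # evaluations rated 4 or 5 (Table 2)
total_evals = 1530    # 51 cases x 10 responses x 3 raters
print(f"{usable / total_evals:.2f}")  # 0.60

high_ratings = 2885 + 1677  # ratings of 4 and 5 across all four elements (Table 3)
all_ratings = 6120          # 1530 ratings x 4 elements
print(f"{high_ratings / all_ratings:.2f}")  # 0.75
```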

Natural language processing (morphological, syntactic, and semantic analysis) is necessary for AI to understand language and meaning. This analysis is thought to be based on word frequencies and correlations between words. In everyday experience, complaints contain a high frequency of words related to demands, which made it easy for the AI to detect demands and interests. The strong association between demand and interest can be seen in their correlation coefficient of 0.8567 (Table 4).

The discrepancy rate of 0.40 shown in Table 2 is attributable to the type of AI analysis, which requires further study. If word frequency plays a significant role in the analysis, words expressing emotion and fact are considered to occur less frequently in complaints than those expressing demand and interest; the frequency of technical terms is also inferred to be low. Therefore, to increase the agreement rate, ways must be devised to elicit more emotions and facts as utterances in the complainant's narrative. When the linguistic difficulty of the complaint is low, the AI can distinguish narratives; when the difficulty is high, it cannot, and it tends to conflate interest and demand when composing responses.

This study reaffirmed the current challenges for AI: understanding context, understanding emotion, lack of specialized knowledge, and confusion between superficial demand assertions and interests. In addition, predicting others' minds and emotions from their words suggested a deep relationship between words and mind13).

As future challenges, medical terminology should be incorporated into the terminology training data for AI models, and, regarding emotional intelligence, the ability to recognize and interpret the complex human emotions expressed in complaint narratives should be improved.

Conclusion

The effective application of AI to resolving medical complaints reaffirmed the challenges of current AI: understanding context, understanding emotions, lack of expertise, and confusion between superficial demand assertions and interests. In predicting others' minds and emotions from words, there were also challenges in understanding the deeper interests behind the words.

Conflict of Interest

None to declare.

References

  1. Iwai, H. (2016). The present situation of medical malpractice in Japan and what we can learn to prevent medical disputes. Official Journal of the Japan Association of Endocrine Surgery, 33(1), 2-6.
  2. Nakanishi, T. (2014). Disclosing unavoidable causes of adverse events improves patients’ feelings towards doctors. The Tohoku journal of experimental medicine, 234(2), 161-168.
  3. Nakanishi, T. (2014). Effects of mediator skill training for facilitating disclosure process after adverse events. Asian Journal on Mediation, 1, 14-25.
  4. Ministry of Health, Labour and Welfare (2013). Minutes of the Central Social Insurance Medical Council (in Japanese).
  5. Nakanishi, T., Tokuda, A., Arao, T., Sugiura, Y., & Hayasaka, T. (2023). How does the language prediction model ChatGPT evaluate negative emotions? Journal of Clinical Research and Medicine, 6(3), 1-4.
  6. Wada, Y., & Nakanishi, T. (2011). Medical Mediation: A Narrative Approach to Conflict Management. Signe, Tokyo (in Japanese).
  7. Cerf, V. G. (2023). Large language models. Communications of the ACM, 66(8), 7-7.
  8. Cascella, M., Montomoli, J., Bellini, V., & Bignami, E. (2023). Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. Journal of medical systems, 47(1), 33.
  9. Wang, Z., Li, R., Dong, B., Wang, J., Li, X., Liu, N., ... & Wang, J. (2023). Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today. arXiv preprint arXiv:2306.01499.
  10. Liu, M., Okuhara, T., Dai, Z., Huang, W., Gu, L., Okada, H., ... & Kiuchi, T. (2025). Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination. International Journal of Medical Informatics, 193, 105673.
  11. Chen, B., Zhang, Z., Langrené, N., & Zhu, S. (2023). Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review. arXiv preprint arXiv:2310.14735.
  12. Monroe, C. L., Abdelhafez, Y. G., Atsina, K., Aman, E., Nardo, L., & Madani, M. H. (2024). Evaluation of responses to cardiac imaging questions by the artificial intelligence large language model ChatGPT. Clinical Imaging, 110193.
  13. Kosinski, M. (2024). Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 121(45), e2405460121.
