KAIST: "Automatic Detection of Temporal Errors... Addressing LLM Vulnerabilities"

by Jeong Ilwoong

Published 14 Apr. 2026 08:14 (KST)

Researchers in South Korea have developed a technology that automatically identifies "temporal errors" in large language models (LLMs), that is, answers that fail to reflect real-world information that changes over time. For example, when ChatGPT is asked “Who is the minister who took office last month?” it may give the name of a minister from a year ago rather than the current one, and the technology is designed to catch such cases. It is expected to help bring about an era of “trusted AI” by making artificial intelligence (AI) systems more reliable.

AI-generated image. KAIST

KAIST announced on April 14 that the research team led by Professor Eui-Jong Hwang of the School of Electrical Engineering and the Microsoft Research team have developed a system that automatically evaluates and diagnoses the temporal reasoning abilities of LLMs using temporal database technology.

For AI to earn users’ trust, it must be able to accurately understand and provide information that reflects the changing real world. However, traditional evaluation methods have been limited to simply checking whether answers match reference answers, or have failed to sufficiently account for complex temporal relationships, making it difficult to properly evaluate real-world question scenarios.

To address this, the joint research team has, for the first time, applied the design theory of “temporal databases” to AI evaluation.

Temporal databases have been validated for over 40 years. Using this approach, the research team enabled the automatic generation of 13 types of complex, time-based problems from the structure and temporal flow of data, eliminating the need for human-crafted evaluation questions.

The automatic evaluation and diagnosis system is considered innovative because it generates evaluation questions automatically from data, replacing the traditional approach in which humans had to write each question by hand.

In addition, because the entire process (question generation, answer derivation, and verification) is automated on top of the database, there is no need to revise each problem manually, which reduces the maintenance burden.

Most importantly, when real-world information changes, simply updating the database causes the evaluation questions, answers, and verification criteria to be updated automatically to reflect the change. This is a key strength of the system.

However, the input of the latest information itself is handled through external data sources or administrators. Once the data is updated, the automatic evaluation and diagnosis system carries out the entire evaluation process automatically.
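The idea of deriving questions and answers from the data itself can be illustrated with a minimal sketch. This is not the research team's system: the `Fact` row layout, the function names `holder_on` and `generate_questions`, and the two question templates are all assumptions made for illustration, and the actual system is reported to cover 13 question types.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """Hypothetical valid-time row: `holder` held `role` during [start, end)."""
    role: str
    holder: str
    start: date
    end: Optional[date]  # None means the row is still valid


def holder_on(facts: List[Fact], role: str, day: date) -> Optional[str]:
    """Return who held `role` on `day`, or None if no row covers that day."""
    for f in facts:
        if f.role == role and f.start <= day and (f.end is None or day < f.end):
            return f.holder
    return None


def generate_questions(facts: List[Fact], today: date) -> List[Tuple[str, Optional[str]]]:
    """Derive (question, reference answer) pairs directly from the rows.

    Two illustrative templates only: one "current holder" question per role,
    then one "holder on a past date" question per row.
    """
    pairs = []
    for role in sorted({f.role for f in facts}):
        question = f"Who is the {role} as of {today.isoformat()}?"
        pairs.append((question, holder_on(facts, role, today)))
    for f in facts:
        question = f"Who was the {f.role} on {f.start.isoformat()}?"
        pairs.append((question, holder_on(facts, f.role, f.start)))
    return pairs
```

In this sketch, closing the old row and appending a new `Fact` for the incoming minister is the only manual step. Calling `generate_questions` again then yields the new reference answer without anyone rewriting a question by hand.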

(From left) Soyeon Kim, PhD candidate at KAIST; Jindong Wang, researcher at Microsoft (currently at the College of William & Mary); Xing Xie, researcher at Microsoft; and Eui-Jong Hwang, professor at KAIST. KAIST

The joint research team also introduced a new metric that goes beyond the conventional approach: it verifies not only whether the final answer is correct, but also whether the dates and time periods given in the course of the answer are logically valid.

With this metric, the system was on average 21.7% better than existing methods at accurately detecting “temporal hallucinations” (instances where answers appear correct on the surface but are based on incorrect temporal information).
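A rough sketch of what checking the temporal content alongside the final answer could look like is given below. The article does not spell out the metric's definition, so the `evaluate` function, the dictionary keys, and the exact-match rule for dates are assumptions made for illustration, not the team's formula.

```python
from datetime import date
from typing import Dict, Optional


def evaluate(response: Dict[str, Optional[object]],
             reference: Dict[str, Optional[object]]) -> Dict[str, bool]:
    """Score a model response on both the final answer and its stated period.

    Assumed format: each dict has an "answer" string plus "start" and "end"
    dates (an "end" of None means the period is still open).
    """
    answer_ok = (
        str(response["answer"]).strip().lower()
        == str(reference["answer"]).strip().lower()
    )
    # The stated period must match the reference period exactly in this sketch.
    time_ok = (
        response.get("start") == reference["start"]
        and response.get("end") == reference["end"]
    )
    return {
        "answer_correct": answer_ok,
        "time_correct": time_ok,
        # Right name, wrong dates: correct on the surface only.
        "temporal_hallucination": answer_ok and not time_ok,
    }
```

An answer-only check would mark a response that names the right person but states the wrong term of office as correct. Scoring the period separately is what exposes that case.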

The joint research team emphasized that applying the automatic evaluation and diagnosis system can reduce maintenance costs since only the database needs to be updated when information changes, and it also reduces the amount of input data required by an average of 51% compared to previous approaches.

Professor Hwang stated, “This research demonstrates that classical database design theory can play a crucial role in solving the reliability problems of cutting-edge AI. If vast, specialized datasets are used as evaluation resources, we anticipate that the automatic evaluation and diagnosis system will also be utilized to verify AI performance in various fields such as healthcare and law in the future.”

Soyeon Kim, a doctoral candidate at KAIST, is the first author of the study, and Jindong Wang (currently affiliated with the College of William & Mary) and Xing Xie, researchers at Microsoft Research, are co-authors. The research results will be presented at the AI conference “ICLR 2026” later this month.