Jin Kim & Jeong-A Choi 'Introduction to Data Science'

[Lee Jong-gil's Autumn Return] 'Bayesian Inference' That Revealed Joanne K. Rowling's Pseudonym


'Harry Potter' author Joanne K. Rowling wanted to be evaluated purely on the merit of her work. In 2013 she published the detective novel The Cuckoo's Calling under the pseudonym 'Robert Galbraith.' The media and critics took notice of the newcomer, who seemed to appear out of nowhere. No one knew the author's true identity, but a reporter from The Sunday Times, a devoted fan of Rowling, was convinced it was her. He asked Patrick Juola, a computer science professor at Duquesne University in the United States, to compare the writing styles of the two authors.


Professor Juola focused on the usage of function words such as 'the,' 'in,' 'at,' 'on,' and 'to.' For example, in the sentence "Put it on the right," 'on,' 'to,' or 'at' could each be used. Even when writing under another name, authors fall back on habitual expressions. Professor Juola identified these habits in the works of Galbraith, Rowling, and other female authors and compared them, applying 'Bayesian inference' repeatedly across many such subtle expressions.
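As a rough illustration of what "comparing the usage of function words" can mean, the sketch below counts how large a share of a text each of the five words named above takes up. It is not Professor Juola's actual procedure, which the article does not describe in detail; the function name, the tokenizing rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# The five words named in the article.
FUNCTION_WORDS = ["the", "in", "at", "on", "to"]

def function_word_profile(text):
    """Return each function word's share of all word tokens in `text`."""
    # Simplistic tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {word: counts[word] / total for word in FUNCTION_WORDS}

# Hypothetical sample text, not taken from any of the novels.
sample = "Put it on the right, then walk to the door and wait at the gate."
profile = function_word_profile(sample)
for word, share in profile.items():
    print(f"{word}: {share:.1%}")
```

Profiles like this, computed for two bodies of text, give the kind of habitual-usage evidence that can then be weighed with Bayesian inference.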


Bayesian inference is a method of statistical reasoning. It starts from a prior probability for the thing in question and incorporates new data to arrive at a posterior probability. Stated in statistical terms, it can be hard to follow. In their book Introduction to Data Science, Jin Kim and Jeong-A Choi instead use the example of fruit boxes containing lemons and tomatoes. The book explains data analysis techniques through examples rather than complex statistical notation.


Box A contains nine lemons and one tomato, while Box B contains one lemon and nine tomatoes. A box marked with a question mark could be either A or B. Without any other information, each is assigned a probability of 50%, following the 'principle of insufficient reason.'


To update these prior probabilities precisely, the authors work with conditional probabilities. They set up a scenario in which one fruit is taken from the question mark box. What is certain is that the box is either A or B and that the fruit is either a lemon or a tomato. The probabilities of these four possible combinations sum to 100%.


The probability that the question mark box is A and a lemon is drawn is 45% (50% × 90%), and the probability that it is A and a tomato is drawn is 5% (50% × 10%). The probability that the box is B and a lemon is drawn is 5%, and the probability that it is B and a tomato is drawn is 45%. If the fruit drawn turns out to be a lemon, the two tomato cases drop out: once the event has occurred, outcomes that did not happen no longer need to be considered.
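The four joint probabilities can be reproduced in a few lines of Python. This is a minimal sketch, not code from the book; the variable names are illustrative, and exact fractions are used so the numbers come out without rounding error.

```python
from fractions import Fraction

# Prior: with no other information, boxes A and B are equally likely.
prior = {"A": Fraction(1, 2), "B": Fraction(1, 2)}

# Contents of each box as described in the article.
contents = {
    "A": {"lemon": 9, "tomato": 1},
    "B": {"lemon": 1, "tomato": 9},
}

# Joint probability P(box, fruit) = P(box) * P(fruit | box).
joint = {}
for box, fruits in contents.items():
    total = sum(fruits.values())
    for fruit, count in fruits.items():
        joint[(box, fruit)] = prior[box] * Fraction(count, total)

for (box, fruit), p in joint.items():
    print(f"box {box}, {fruit}: {float(p):.0%}")

# The four cases cover every possibility, so they sum to 100%.
print("total:", f"{float(sum(joint.values())):.0%}")
```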


That leaves two possibilities when a lemon is drawn: the question mark box is A, or it is B. Their probabilities stand at 45% versus 5%. Rescaled so that the two sum to 100%, they become 90% versus 10%. This is the posterior probability derived through Bayesian inference.
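The 90% figure can also be written out with Bayes' rule. The notation is introduced here for illustration and is not the book's: A and B stand for "the question mark box is A" and "the question mark box is B," and L stands for "a lemon is drawn."

```latex
\begin{aligned}
P(A \mid L)
  &= \frac{P(L \mid A)\,P(A)}{P(L \mid A)\,P(A) + P(L \mid B)\,P(B)} \\
  &= \frac{0.9 \times 0.5}{0.9 \times 0.5 + 0.1 \times 0.5}
   = \frac{0.45}{0.50} = 0.9, \\
P(B \mid L) &= 1 - P(A \mid L) = 0.1 .
\end{aligned}
```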


The authors explain, "If new data is added, another Bayesian estimation using the just-obtained result as the prior probability can be performed, gradually increasing the accuracy of the inference."
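The repeated updating described in the quote can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the sequence of draws is hypothetical, each fruit is assumed to be put back in the box before the next draw (so the lemon and tomato proportions stay fixed), and the names `LIKELIHOOD` and `update` are made up for this example.

```python
from fractions import Fraction

# P(fruit | box), from the box contents described in the article.
LIKELIHOOD = {
    "A": {"lemon": Fraction(9, 10), "tomato": Fraction(1, 10)},
    "B": {"lemon": Fraction(1, 10), "tomato": Fraction(9, 10)},
}

def update(belief, fruit):
    """One Bayesian update: weight each box by how well it explains the draw, then normalize."""
    weighted = {box: p * LIKELIHOOD[box][fruit] for box, p in belief.items()}
    total = sum(weighted.values())
    return {box: w / total for box, w in weighted.items()}

# Start from the 50/50 prior.
belief = {"A": Fraction(1, 2), "B": Fraction(1, 2)}

# Hypothetical draws, each returned to the box before the next one (assumption).
for fruit in ["lemon", "lemon", "tomato"]:
    # The posterior from the previous draw serves as the prior for this one.
    belief = update(belief, fruit)
    print(fruit, {box: round(float(p), 3) for box, p in belief.items()})
```

Each pass through the loop feeds the previous posterior back in as the new prior, so only the current belief has to be kept, not the full history of draws.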


Using posterior probabilities as prior probabilities for new inferences is called sequential rationality. This is similar to how humans analyze data and make judgments.


For example, we believe that our mother loves us. Many events between mother and child have gone into forming that belief. Accumulated experiences, big and small, lead to the conclusion that "Mom loves me." We cannot call up each piece of evidence in detail. The intermediate steps of the inference are poorly remembered, but the final conclusion remains clear.


Sequential rationality ultimately helps our brain process information efficiently and make judgments. Machine learning (a technology where computers analyze vast amounts of data to predict the future) also uses sequential rationality in the same way.


The authors explain, "When a machine learning system learns patterns from data and then corrects its errors and improves its performance as new data arrives, it is using Bayesian estimation to increase its prediction accuracy."



Professor Juola's analysis was precise enough to point to Galbraith and Rowling as the same person. Rowling no longer had any reason to hide her identity. She had to admit it just a few months after publishing The Cuckoo's Calling.


This content was produced with the assistance of AI translation services.

© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
