Kakao Brain Unveils 'Honeybee,' an AI That Recognizes Images and Responds with Text

by Lee Jungyun

Published 19 Jan. 2024 11:01 (KST)

Updated 19 Jan. 2024 15:07 (KST)


Open-Source 'Multimodal' Language Model Uploaded to GitHub
User Interaction Enabled, Attracting Attention as Next-Generation Learning Tool


Kakao Brain announced on the 19th that it has released the open-source multimodal language model Honeybee on GitHub.

Kakao Brain released Honeybee, a high-performance multimodal large language model (MLLM), as open source, with the aim of proposing a new module that connects images to large language models.

Research on multimodal language models is currently constrained: few models are publicly available, and training methods are rarely disclosed in detail, which makes development difficult. To help advance the field, Kakao Brain decided to release the source code of Honeybee, which it developed in-house.

An MLLM is a model that takes images and commands (prompts) as input and responds with text. It extends large language models, which take only text as input and produce only text as output. Because it accepts both images and text, it can describe the scene in an image or understand and answer questions about content that mixes images and text. For example, when Honeybee is given an image of "two basketball players in a game" together with the English question "How many times has the player on the left won?", the model interprets the image and the question together to generate an answer.

Honeybee achieved the highest performance among publicly available MLLMs on benchmarks such as MME, MMBench, and SEED-Bench. Notably, on the MME benchmark, which evaluates perceptual and cognitive abilities, it scored 1977 out of 2800 points.

The related paper, "Honeybee: Locality-enhanced Projector for Multimodal LLM," was published last year on the preprint site arXiv. The paper describes the technology as "a technique that helps deep learning models learn and understand more effectively by processing image data," explaining that "the visual projector plays a crucial role in connecting a pre-trained vision encoder and a large language model (LLM), enabling deeper visual understanding while leveraging the capabilities of the LLM."
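To make the role of a visual projector concrete, the following is a minimal sketch in plain Python. It is not Kakao Brain's implementation: the function names (pool_local, project), the grid size, and the choice of local average pooling followed by a linear map are all assumptions made for illustration. It only shows the general idea of turning a grid of vision-encoder features into a smaller set of tokens sized for an LLM's embedding space.

```python
from typing import List

Vector = List[float]


def pool_local(grid: List[List[Vector]], window: int) -> List[Vector]:
    """Average features over non-overlapping window x window patches.

    Hypothetical stand-in for a locality-preserving step: each output
    token summarizes one spatial neighborhood of the feature grid.
    """
    rows, cols, dim = len(grid), len(grid[0]), len(grid[0][0])
    tokens = []
    for r in range(0, rows, window):
        for c in range(0, cols, window):
            cells = [
                grid[i][j]
                for i in range(r, min(r + window, rows))
                for j in range(c, min(c + window, cols))
            ]
            tokens.append(
                [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]
            )
    return tokens


def project(tokens: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Linear map from the vision feature size to the LLM embedding size."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


# Toy input: a 4x4 grid of 2-dimensional features (assumed sizes).
grid = [[[float(i), float(j)] for j in range(4)] for i in range(4)]

# Toy projection weights: 2-dimensional features -> 3-dimensional embeddings.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

# 16 grid cells are reduced to 4 visual tokens, each of LLM embedding size 3.
visual_tokens = project(pool_local(grid, window=2), weight, bias)
print(len(visual_tokens), len(visual_tokens[0]))
```

In a real system the weights would be learned and the resulting visual tokens would be placed alongside the text prompt's embeddings as input to the LLM; the sketch stops before that step.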

Because Honeybee is an MLLM, users can input images, ask questions about them in text, and interact with the answers it generates. Kakao Brain expects this to make it a promising tool for education and learning assistance in the future.


Kim Il-du, Co-CEO of Kakao Brain, stated, "We have also released the code that enables inference of the Honeybee model on GitHub and are considering expanding various services utilizing Honeybee," adding, "We will continue relentless research and development to secure more advanced artificial intelligence (AI) models."


This content was produced with the assistance of AI translation services.