Gather Korean Data to Achieve AI Independence... Upstage Launches '1T Club'

Upstage is moving to address the shortage of Korean-language data. In doing so, the company aims to achieve independence in the large language model (LLM) market, which is currently dominated by foreign offerings such as OpenAI's ChatGPT.


On the 14th, Upstage announced the launch of the ‘1T Club’. The name is short for ‘1 Trillion Token Club’, and the club consists of partner companies that contribute more than 100 million words of Korean data in forms such as text, books, articles, reports, and papers.


The ‘1T Club’ aims to solve the shortage of Korean-language data and to achieve Korea’s independence in LLMs by developing high-performance models. LLM development requires vast amounts of language data as a foundation. Korean data is essential for the advancement of domestic LLMs, but it is severely lacking compared with English data because of copyright issues and other factors. LLMs developed by foreign companies, which are trained primarily on foreign languages, are weak not only in Korean-language proficiency but also in capturing Korean sentiment and regional information, which makes them difficult for domestic companies to use.


In fact, in the case of GPT-3, the Korean data used for training amounted to about 100 million tokens, just 0.01697% of the total, ranking 28th among all languages. By contrast, the English training data amounted to 4.5 trillion tokens, and LLM performance differs by language in proportion to the amount of data.


Upstage is currently in partnership discussions with more than 20 data providers, including media companies, corporations, and academic institutions. The company plans to work with partners across different fields to contribute to the development of Korean LLMs, and to discuss building private LLMs with a range of companies.


Through the ‘1T Club’, Upstage plans to further strengthen Korea’s AI capabilities and to help establish Korea as a leader in the global AI industry. In particular, it aims to resolve issues such as the copyright problems that arise when AI models are trained on crawled data, and to operate in a way that benefits both data providers and model creators.



Kim Seong-hoon, CEO of Upstage, said, “LLMs are the core technology of today’s generative AI, and it is important to create an ecosystem where companies across various domestic industries can freely utilize high-performance private LLMs. Through the ‘1T Club’, we will protect the rights of data providers and develop LLMs that can embody Korean culture and sentiment, doing our best to ensure that all domestic companies benefit from AI advancements.”


This content was produced with the assistance of AI translation services.

© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
