The rapid evolution of artificial intelligence has opened up a myriad of opportunities and challenges for businesses across multiple sectors. Recently, a pivotal discussion in this context took place during a media briefing held by Amazon Web Services (AWS). The focal point of the conversation was a simple yet essential assertion: without data, there is no model. This statement rings particularly true as organizations aim to harness the immense capabilities of large AI models in practical business scenarios.
After enduring the fierce competition surrounding large models, enterprises of all sizes are beginning to realize the vast potential and powerful abilities these models can offer. Yet the journey from a foundational large model to its effective use in real-world applications is fraught with difficulties. One of the most crucial components in this process is data: often overlooked, but fundamental to leveraging the power of generative AI.
Amazon Web Services emphasizes that organizations can capitalize on three significant data capabilities to stand out in the generative AI landscape: leveraging existing data to fine-tune or pre-train models, quickly combining current data with these models to generate unique value, and managing new data effectively to expedite the development of generative AI applications.
These abilities form the cornerstone upon which a robust data foundation can be built, enabling businesses to drive innovation in their respective fields.
As Chen Xiaojian, the General Manager of Product at AWS Greater China, rightly points out, “What companies need are generative AI applications that understand their business and customers, and building such applications starts with data.” This underscores the integral role that data plays in the success of generative AI efforts.
With the booming market for foundational large models, the threshold for users to access advanced base models is steadily decreasing. The creation of such models has always been anchored in vast, high-quality datasets. As these foundational large models begin to penetrate various industries, the conversation around data remains ever relevant.
Every company’s accumulated data constitutes a substantial differentiating factor in its digital journey.
In this new era of generative AI, the ability to leverage proprietary data in conjunction with foundational models is a key strategy for strengthening an organization’s unique capabilities. For instance, in real-world implementations such as Perplexity, the combination of traditional search engines, customer data, and the inference and text generation capabilities of large models has produced unparalleled value for users.
Data integration strategies between generative AI applications and foundational large models can be categorized into three primary approaches: Retrieval-Augmented Generation (RAG), fine-tuning, and continual pre-training. Each method caters to different situational requirements and imposes variable demands on data capabilities. For example, datasets used in continual pre-training often reach terabyte levels, mostly comprising raw-format data that require minimal preprocessing; what is essential is continual submission to the large model for training, adapting to ongoing business evolution.
AWS advocates that these three methodologies for integrating data with foundational models are critical in the drive toward successful generative AI applications.
An increasing number of organizations are adopting these methods through services like Amazon Bedrock, allowing them to systematically cultivate a powerful data capability geared for generative AI.
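Of the three approaches above, RAG is the one that most directly turns proprietary data into model context at query time. The following is a minimal illustrative sketch of the RAG pattern, not AWS's or Amazon Bedrock's actual API: it uses a deliberately toy character-frequency "embedding" (a real system would call an embedding model), ranks stored documents by cosine similarity, and assembles the retrieved context into a prompt for a large model.

```python
import math

def embed(text):
    # Toy embedding: character-frequency vector over a fixed alphabet.
    # A production system would use a model-produced embedding instead.
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=2):
    # Rank stored documents by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents):
    # Prepend retrieved context so the model answers from company data.
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days.",
    "Shipping is free for orders over 50 dollars.",
    "Support is available around the clock by chat.",
]
print(build_prompt("What is the refund policy?", docs))
```

The design point RAG illustrates is that the foundational model itself is never retrained; the company's differentiation lives entirely in the retrieval layer, which is why the quality of the underlying data store matters so much.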
Estimates from IDC suggest that the global generative AI market will experience a compound annual growth rate of 85.7%, projected to near 150 billion dollars by 2027. Consequently, many enterprises grapple with the question of how to leverage generative AI to produce more competitive products. It is undeniably becoming standard practice for businesses to bolster their data capabilities in this generative AI era. But what specific data competencies should they focus on? AWS identifies three core areas as pivotal to a company's success in generative AI: the ability to process data required for model fine-tuning and pre-training, the capability to fuse proprietary data with models quickly for unique value creation, and effective techniques to handle new data that aid the rapid development of generative AI applications.
To delve deeper, the first hurdle companies must overcome is the management, cleansing, processing, and governance of vast datasets.
In a world where multimodal models are increasingly dominant, generative AI applications typically require substantial and diverse data for training and inference. This necessitates comprehensive data processing capabilities. For instance, a publicly available English dataset that initially exceeds 2TB may demand substantial cleaning and deduplication, refining it to approximately 1.2TB before it is further processed into around 300 billion tokens.
Amazon provides a robust arsenal of data solutions—including Amazon S3, Amazon FSx for Lustre, Amazon EMR Serverless, AWS Glue, and Amazon DataZone—to empower organizations to tackle these intensive data processing challenges. For example, the data cleansing and deduplication processes involve extensive ETL (Extract, Transform, Load) work. Tools like Amazon EMR Serverless or AWS Glue simplify these operations through automation, allowing businesses to execute their data cleaning and processing without the burden of managing underlying resources, significantly enhancing efficiency.
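The cleansing-and-deduplication step described above can be sketched locally in a few lines. This is an assumption-laden illustration, not AWS code: it shows exact deduplication via normalization and content hashing, the same idea a Spark job on Amazon EMR Serverless or AWS Glue would apply at terabyte scale.

```python
import hashlib

def normalize(record):
    # Lowercase and collapse whitespace so near-identical records hash alike.
    return " ".join(record.lower().split())

def deduplicate(records):
    # Exact deduplication via content hashes. At scale, a distributed ETL
    # job (e.g. on Amazon EMR Serverless or AWS Glue) applies the same idea.
    seen, unique = set(), []
    for r in records:
        digest = hashlib.sha256(normalize(r).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(r)
    return unique

raw = ["Hello  World", "hello world", "Goodbye"]
print(deduplicate(raw))  # → ['Hello  World', 'Goodbye']
```

Hashing normalized content rather than comparing records pairwise keeps the job linear in the size of the dataset, which is what makes deduplication feasible on multi-terabyte corpora.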
Another critical capability is the rapid integration of existing data with models to create distinctive value.
While foundational large models are undoubtedly powerful, they are not without limitations, including a lack of niche industry knowledge, time-lag issues (unawareness of the latest developments), inaccuracies, and compliance risks concerning sensitive user data.
To address these concerns, proficiently merging existing data with the models is paramount. In RAG scenarios, vector embedding becomes vital; it is essential to integrate vector search with data storage so these operations do not require extra components and costs. For example, AWS has implemented vector search capabilities across eight different data storage solutions, offering greater flexibility for clients in developing generative AI applications. Using Amazon Neptune, organizations can store graph and vector data together, enabling built-in algorithms to analyze billions of connections in mere seconds.
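What "vector search inside the data store" means mechanically can be shown with a small in-memory sketch. The class below is a hypothetical stand-in for the capability that vector-enabled databases provide: vectors are stored alongside their keys, and a query returns the k nearest entries by Euclidean distance via brute force (real stores use approximate indexes such as HNSW for scale).

```python
import heapq
import math

class VectorIndex:
    # Minimal in-memory vector store: conceptually what vector-enabled
    # databases offer, here with brute-force nearest-neighbor search.
    def __init__(self):
        self.items = []  # list of (key, vector) pairs

    def add(self, key, vector):
        self.items.append((key, vector))

    def search(self, query, k=1):
        # Return the k keys whose vectors lie closest (Euclidean) to the query.
        def dist(v):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
        nearest = heapq.nsmallest(k, self.items, key=lambda item: dist(item[1]))
        return [key for key, _ in nearest]

index = VectorIndex()
index.add("faq:returns", [0.9, 0.1])
index.add("faq:shipping", [0.1, 0.9])
print(index.search([0.8, 0.2], k=1))  # → ['faq:returns']
```

Co-locating this index with the primary data store, as the article describes, removes the need to run and synchronize a separate vector database alongside the system of record.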
Moreover, effectively processing incoming data will further propel the swift advancement of generative AI applications
Currently, many businesses report that a significant portion of user inquiries are repetitive or similar, leading to soaring costs and delayed responses due to frequent model invocations. Consequently, when faced with similar inquiries, businesses can serve cached responses without invoking the model, reducing costs and increasing efficiency.
With products like Amazon MemoryDB and Amazon OpenSearch Serverless, which support vector search, AWS provides responses within a few milliseconds, capable of achieving a 99% recall rate at millions of queries per second.
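The caching pattern described above is often called semantic caching. The sketch below is a simplified, hypothetical illustration (the threshold value and toy embeddings are assumptions, and a real deployment would keep the entries in a vector-capable store such as Amazon MemoryDB): answers are cached keyed by query embedding, and a new query reuses a cached answer when its embedding is similar enough, skipping the model call entirely.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    # Cache (embedding, answer) pairs; serve a cached answer when a new
    # query's embedding is similar enough, avoiding a model invocation.
    def __init__(self, threshold=0.95):  # threshold is an assumed tuning knob
        self.threshold = threshold
        self.entries = []

    def lookup(self, embedding):
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]),
                   default=None)
        if best and cosine(embedding, best[0]) >= self.threshold:
            return best[1]  # cache hit: no model call needed
        return None         # cache miss: caller invokes the model

    def store(self, embedding, answer):
        self.entries.append((embedding, answer))

cache = SemanticCache()
cache.store([1.0, 0.0, 0.1], "Returns are accepted within 30 days.")
print(cache.lookup([0.98, 0.02, 0.1]))  # similar query: cache hit
print(cache.lookup([0.0, 1.0, 0.0]))    # unrelated query: None (miss)
```

Because the hit test is a vector similarity lookup rather than an exact string match, paraphrased versions of the same question still hit the cache, which is where the cost and latency savings come from.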
In Chen Xiaojian's view, the establishment of data capabilities in the generative AI era functions like a flywheel. Initially, organizations may encounter myriad challenges; however, once the data flywheel is properly set in motion, it can generate sustained value for companies navigating the generative AI landscape.
“Looking ahead, critical scenarios from foundational model training to the construction of generative AI applications will demand the efficient handling, management, and application of vast multimodal datasets.”