MAROKO133 Breaking ai: Adobe Research Unlocking Long-Term Memory in Video World Models wit


Video world models, which predict future frames conditioned on actions, hold immense promise for artificial intelligence, enabling agents to plan and reason in dynamic environments. Recent advancements, particularly with video diffusion models, have shown impressive capabilities in generating realistic future sequences. However, a significant bottleneck remains: maintaining long-term memory. Current models struggle to remember events and states from far in the past due to the high computational cost associated with processing extended sequences using traditional attention layers. This limits their ability to perform complex tasks requiring sustained understanding of a scene.

A new paper, “Long-Context State-Space Video World Models” by researchers from Stanford University, Princeton University, and Adobe Research, proposes an innovative solution to this challenge. They introduce a novel architecture that leverages State-Space Models (SSMs) to extend temporal memory without sacrificing computational efficiency.

The core problem lies in the quadratic computational complexity of attention mechanisms with respect to sequence length. As the video context grows, the resources required for attention layers explode, making long-term memory impractical for real-world applications. This means that after a certain number of frames, the model effectively “forgets” earlier events, hindering its performance on tasks that demand long-range coherence or reasoning over extended periods.
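To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the figure of 64 tokens per frame are illustrative assumptions, not numbers from the paper. It only counts interactions and ignores constants and hidden dimensions.

```python
def attention_pair_count(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs scored by full attention over every token in the context."""
    n = num_frames * tokens_per_frame
    return n * n


def recurrent_step_count(num_frames: int, tokens_per_frame: int) -> int:
    """State updates made by a recurrent (SSM-style) scan: one per token."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    # 64 tokens per frame is a made-up value chosen only for illustration.
    for frames in (16, 64, 256):
        print(frames,
              attention_pair_count(frames, 64),
              recurrent_step_count(frames, 64))
```

Doubling the number of frames quadruples the attention count but only doubles the recurrent count, which is the gap the paper sets out to exploit.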

The authors’ key insight is to leverage the inherent strengths of State-Space Models (SSMs) for causal sequence modeling. Unlike previous attempts that retrofitted SSMs for non-causal vision tasks, this work fully exploits their advantages in processing sequences efficiently.

The proposed Long-Context State-Space Video World Model (LSSVWM) incorporates several crucial design choices:

  1. Block-wise SSM Scanning Scheme: This is central to their design. Instead of processing the entire video sequence with a single SSM scan, they employ a block-wise scheme. This strategically trades off some spatial consistency (within a block) for significantly extended temporal memory. By breaking down the long sequence into manageable blocks, they can maintain a compressed “state” that carries information across blocks, effectively extending the model’s memory horizon.
  2. Dense Local Attention: To compensate for the potential loss of spatial coherence introduced by the block-wise SSM scanning, the model incorporates dense local attention. This ensures that consecutive frames within and across blocks maintain strong relationships, preserving the fine-grained details and consistency necessary for realistic video generation. This dual approach of global (SSM) and local (attention) processing allows them to achieve both long-term memory and local fidelity.
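The idea of a compressed state carried across blocks can be illustrated with a toy scalar recurrence. This is only a sketch under stated assumptions: the names `ssm_scan` and `blockwise_scan` and the fixed scalars `a`, `b`, `c` are hypothetical, and the code does not reproduce the paper's block layout, its learned parameters, or the local attention layers.

```python
from typing import List, Tuple


def ssm_scan(xs: List[float], h0: float = 0.0, a: float = 0.9,
             b: float = 1.0, c: float = 1.0) -> Tuple[List[float], float]:
    """Run h_t = a*h_{t-1} + b*x_t and y_t = c*h_t; return (outputs, final state)."""
    h = h0
    ys: List[float] = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys, h


def blockwise_scan(xs: List[float], block_size: int) -> List[float]:
    """Scan one block at a time, passing only the fixed-size state between blocks."""
    h = 0.0
    ys: List[float] = []
    for start in range(0, len(xs), block_size):
        block_ys, h = ssm_scan(xs[start:start + block_size], h0=h)
        ys.extend(block_ys)
    return ys
```

In this toy, the single number `h` is all that crosses a block boundary, yet an input seen in the first block still influences outputs in later blocks. Its cost grows with the number of steps, not with the square of the context length.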

The paper also introduces two key training strategies to further improve long-context performance:

  • Diffusion Forcing: During long-context training, the model generates frames conditioned on a prefix of the input, which forces it to learn to maintain consistency over longer durations. When no prefix is sampled and all tokens are kept noised, training becomes equivalent to diffusion forcing, which is highlighted as the special case of long-context training where the prefix length is zero. This pushes the model to generate coherent sequences even from minimal initial context.
  • Frame Local Attention: For faster training and sampling, the authors implemented a “frame local attention” mechanism. This utilizes FlexAttention to achieve significant speedups compared to a fully causal mask. By grouping frames into chunks (e.g., chunks of 5 with a frame window size of 10), frames within a chunk maintain bidirectionality while also attending to frames in the previous chunk. This allows for an effective receptive field while optimizing computational load.
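Both strategies can be sketched at the level of whole frames. The code below is an illustration in plain Python, not the authors' FlexAttention implementation. The function names are hypothetical, the uniform noise range is an assumption, and the mask hard-codes a window of two chunks (the current chunk plus the previous one), which matches the example of chunks of 5 with a window of 10.

```python
import random
from typing import List


def sample_noise_levels(num_frames: int, prefix_len: int,
                        rng: random.Random) -> List[float]:
    """Noise level 0.0 (clean) for prefix frames, an independent random level otherwise.

    With prefix_len == 0 every frame is noised, the diffusion-forcing case.
    """
    return [0.0 if t < prefix_len else rng.uniform(0.0, 1.0)
            for t in range(num_frames)]


def frame_local_mask(num_frames: int, chunk_size: int = 5) -> List[List[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k.

    A frame sees every frame in its own chunk (bidirectional) and every
    frame in the chunk immediately before it.
    """
    mask: List[List[bool]] = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            row.append(k_chunk == q_chunk or k_chunk == q_chunk - 1)
        mask.append(row)
    return mask
```

For example, with `chunk_size=5`, frame 0 can attend to frame 4 (same chunk) but not to frame 5 (a later chunk), while frame 10 can attend to frame 5 but no longer to frame 0.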

The researchers evaluated their LSSVWM on challenging datasets, including Memory Maze and Minecraft, which are specifically designed to test long-term memory capabilities through spatial retrieval and reasoning tasks.

The experiments demonstrate that their approach substantially surpasses baselines in preserving long-range memory. Qualitative results, as shown in supplementary figures (e.g., S1, S2, S3), illustrate that LSSVWM can generate more coherent and accurate sequences over extended periods compared to models relying solely on causal attention or even Mamba2 without frame local attention. For instance, on reasoning tasks for the maze dataset, their model maintains better consistency and accuracy over long horizons. Similarly, for retrieval tasks, LSSVWM shows improved ability to recall and utilize information from distant past frames. Crucially, these improvements are achieved while maintaining practical inference speeds, making the models suitable for interactive applications.

The paper “Long-Context State-Space Video World Models” is on arXiv.

The post Adobe Research Unlocking Long-Term Memory in Video World Models with State-Space Models first appeared on Synced.

🔗 Source: syncedreview.com


📌 MAROKO133 Breaking ai: Google’s ‘Nested Learning’ paradigm could solve AI's

Researchers at Google have developed a new AI paradigm aimed at solving one of the biggest limitations in today’s large language models: their inability to learn or update their knowledge after training. The paradigm, called Nested Learning, reframes a model and its training not as a single process, but as a system of nested, multi-level optimization problems. The researchers argue that this approach can unlock more expressive learning algorithms, leading to better in-context learning and memory.

To prove their concept, the researchers used Nested Learning to develop a new model, called Hope. Initial experiments show that it has superior performance on language modeling, continual learning, and long-context reasoning tasks, potentially paving the way for efficient AI systems that can adapt to real-world environments.

The memory problem of large language models

Deep learning algorithms helped obviate the need for the careful engineering and domain expertise required by traditional machine learning. By feeding models vast amounts of data, they could learn the necessary representations on their own. However, this approach presented its own set of challenges that couldn’t be solved by simply stacking more layers or creating larger networks, such as generalizing to new data, continually learning new tasks, and avoiding suboptimal solutions during training.

Efforts to overcome these challenges produced the innovations that led to Transformers, the foundation of today's large language models (LLMs). These models have ushered in "a paradigm shift from task-specific models to more general-purpose systems with various emergent capabilities as a result of scaling the 'right' architectures," the researchers write. Still, a fundamental limitation remains: LLMs are largely static after training and can't update their core knowledge or acquire new skills from new interactions.

The only adaptable component of an LLM is its in-context learning ability, which allows it to perform tasks based on information provided in its immediate prompt. This makes current LLMs analogous to a person who can't form new long-term memories. Their knowledge is limited to what they learned during pre-training (the distant past) and what's in their current context window (the immediate present). Once a conversation exceeds the context window, that information is lost forever.

The problem is that today’s transformer-based LLMs have no mechanism for “online” consolidation. Information in the context window never updates the model’s long-term parameters, the weights stored in its feed-forward layers. As a result, the model can’t permanently acquire new knowledge or skills from interactions; anything it learns disappears as soon as the context window rolls over.

A nested approach to learning

Nested Learning (NL) is designed to allow computational models to learn from data using different levels of abstraction and time-scales, much like the brain. It treats a single machine learning model not as one continuous process, but as a system of interconnected learning problems that are optimized simultaneously at different speeds. This is a departure from the classic view, which treats a model's architecture and its optimization algorithm as two separate components.

Under this paradigm, the training process is viewed as developing an "associative memory," the ability to connect and recall related pieces of information. The model learns to map a data point to its local error, which measures how "surprising" that data point was. Even key architectural components like the attention mechanism in transformers can be seen as simple associative memory modules that learn mappings between tokens. By defining an update frequency for each component, these nested optimization problems can be ordered into different "levels," forming the core of the NL paradigm.
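The associative-memory view can be made concrete with a small linear memory trained by a delta-rule update, where the prediction error plays the role of "surprise." This is a minimal sketch under stated assumptions: the names `recall` and `write`, the learning rate, and the squared-error objective are illustrative choices, not the formulation used in the Nested Learning paper.

```python
from typing import List

Matrix = List[List[float]]


def recall(memory: Matrix, key: List[float]) -> List[float]:
    """Read from the memory: a plain matrix-vector product."""
    return [sum(w * k for w, k in zip(row, key)) for row in memory]


def write(memory: Matrix, key: List[float], value: List[float],
          lr: float = 0.5) -> float:
    """One gradient step on 0.5 * ||value - memory @ key||^2, applied in place.

    Returns the squared error measured before the update, a simple stand-in
    for how surprising the (key, value) pair was.
    """
    error = [v - p for v, p in zip(value, recall(memory, key))]
    for i, e in enumerate(error):
        for j, k in enumerate(key):
            memory[i][j] += lr * e * k
    return sum(e * e for e in error)


if __name__ == "__main__":
    mem = [[0.0, 0.0], [0.0, 0.0]]
    for step in range(4):
        print(step, write(mem, [1.0, 0.0], [2.0, -1.0]))
```

Each repeated write of the same pair reports a smaller error: the memory is surprised less as the association is stored.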

Hope for continual learning

The researchers put these principles into practice with Hope, an architecture designed to embody Nested Learning. Hope is a modified version of Titans, another architecture Google introduced in January to address the transformer model's memory limitations. While Titans had a powerful memory system, its parameters were updated at only two different speeds: one for a long-term memory module and one for a short-term memory mechanism.

Hope is a self-modifying architecture augmented with a "Continuum Memory System" (CMS) that enables unbounded levels of in-context learning and scales to larger context windows. The CMS acts like a series of memory banks, each updating at a different frequency. Faster-updating banks handle immediate information, while slower ones consolidate more abstract knowledge over longer periods. This allows the model to optimize its own memory in a self-referential loop, creating an architecture with theoretically infinite learning levels.
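The notion of memory banks that update at different frequencies can be illustrated with a toy schedule. This is not Hope's CMS: the `MemoryBank` class, the averaging rule, and the periods 1, 4, and 16 are all hypothetical choices made only to show the scheduling idea.

```python
from typing import List


class MemoryBank:
    """Holds one number, refreshed only once every `period` observations."""

    def __init__(self, period: int) -> None:
        self.period = period
        self.state = 0.0
        self.updates = 0
        self._buffer: List[float] = []

    def observe(self, x: float) -> None:
        """Buffer the input; consolidate into the state when the buffer is full."""
        self._buffer.append(x)
        if len(self._buffer) == self.period:
            chunk_mean = sum(self._buffer) / self.period
            self.state = 0.5 * self.state + 0.5 * chunk_mean
            self.updates += 1
            self._buffer.clear()


def run(stream: List[float], periods: List[int]) -> List[MemoryBank]:
    """Feed the same stream to banks that update at different frequencies."""
    banks = [MemoryBank(p) for p in periods]
    for x in stream:
        for bank in banks:
            bank.observe(x)
    return banks
```

Running `run([1.0] * 16, [1, 4, 16])` updates the fast bank on every step, the middle bank every fourth step, and the slow bank only once, so the slow bank reflects a coarser summary of the whole stream.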

On a diverse set of language modeling and common-sense reasoning tasks, Hope demonstrated lower perplexity (a measure of how well a model predicts the next word in a sequence and maintains coherence in the text it generates) and higher accuracy compared to both standard transformers and other modern recurrent models. Hope also performed better on long-context "Needle-In-Haystack" tasks, where a model must find and use a specific piece of information hidden within a large volume of text. This suggests its CMS offers a more efficient way to handle long information sequences.
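For reference, perplexity is conventionally computed as the exponential of the average negative log-probability a model assigns to the observed tokens. The sketch below uses made-up probabilities and a hypothetical function name; it is not tied to the evaluation code behind the reported results.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns probability 0.25 to every observed token has a perplexity of about 4, as if it were choosing uniformly among four options at each step. Higher probabilities on the correct tokens give lower perplexity.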

This is one of several efforts to create AI systems that process information at different levels. The Hierarchical Reasoning Model (HRM) by Sapient Intelligence used a hierarchical architecture to make the model more efficient at learning reasoning tasks. The Tiny Reasoning Model (TRM), a model by Samsung, improves on HRM through architectural changes that raise its performance while making it more efficient.

While promising, Nested Learning faces some of the same challenges as these other paradigms in realizing its full potential. Current AI hardware and software stacks are heavily optimized for classic deep learning architectures and Transformer models in particular. Adopting Nested Learning at scale may require fundamental changes. However, if it gains traction, it could lead to far more efficient LLMs that can continually learn, a capability crucial for real-world enterprise applications where environments, data, and user needs are in constant flux.

🔗 Source: venturebeat.com


🤖 MAROKO133 Note

This article is an automated summary compiled from several trusted sources. We pick trending topics so you can stay up to date.


Author: timuna