📌 MAROKO133 Update ai: ByteDance Introduces Astra: A Dual-Model Architecture for A
The increasing integration of robots across various sectors, from industrial manufacturing to daily life, highlights a growing need for advanced navigation systems. However, contemporary robot navigation systems face significant challenges in diverse and complex indoor environments, exposing the limitations of traditional approaches. Addressing the fundamental questions of “Where am I?”, “Where am I going?”, and “How do I get there?”, ByteDance has developed Astra, an innovative dual-model architecture designed to overcome these traditional navigation bottlenecks and enable general-purpose mobile robots.
Traditional navigation systems typically consist of multiple, smaller, and often rule-based modules to handle the core challenges of target localization, self-localization, and path planning. Target localization involves understanding natural language or image cues to pinpoint a destination on a map. Self-localization requires a robot to determine its precise position within a map, especially challenging in repetitive environments like warehouses where traditional methods often rely on artificial landmarks (e.g., QR codes). Path planning further divides into global planning for rough route generation and local planning for real-time obstacle avoidance and reaching intermediate waypoints.
While foundation models have shown promise in integrating smaller models to tackle broader tasks, the optimal number of models and their effective integration for comprehensive navigation remained an open question.
ByteDance’s Astra, detailed in their paper “Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning” (website: https://astra-mobility.github.io/), addresses these limitations. Following the System 1/System 2 paradigm, Astra features two primary sub-models: Astra-Global and Astra-Local. Astra-Global handles low-frequency tasks like target and self-localization, while Astra-Local manages high-frequency tasks such as local path planning and odometry estimation. This architecture promises to revolutionize how robots navigate complex indoor spaces.
Astra-Global: The Intelligent Brain for Global Localization
Astra-Global serves as the intelligent core of the Astra architecture, responsible for critical low-frequency tasks: self-localization and target localization. It functions as a Multimodal Large Language Model (MLLM), adept at processing both visual and linguistic inputs to achieve precise global positioning within a map. Its strength lies in utilizing a hybrid topological-semantic graph as contextual input, allowing the model to accurately locate positions based on query images or text prompts.
The construction of this robust localization system begins with offline mapping. The research team developed an offline method to build a hybrid topological-semantic graph G=(V,E,L):
- V (Nodes): Keyframes are obtained by temporally downsampling the input video, and Structure-from-Motion (SfM) estimates a 6-Degrees-of-Freedom (DoF) camera pose for each one. These keyframes act as nodes that encode camera poses and landmark references.
- E (Edges): Undirected edges establish connectivity based on relative node poses, crucial for global path planning.
- L (Landmarks): Semantic landmark information is extracted by Astra-Global from visual data at each node, enriching the map’s semantic understanding. These landmarks store semantic attributes and are connected to multiple nodes via co-visibility relationships.
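The graph G=(V,E,L) described above can be pictured as a small data structure. The following is only a minimal sketch: the class names (`Node`, `Landmark`, `HybridGraph`), the field names, and the roll/pitch/yaw pose parameterization are illustrative assumptions, not taken from the Astra paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Keyframe node: a 6-DoF camera pose plus the landmarks visible from it."""
    node_id: int
    position: tuple[float, float, float]     # x, y, z in the map frame
    orientation: tuple[float, float, float]  # roll, pitch, yaw (assumed parameterization)
    landmark_ids: set[int] = field(default_factory=set)


@dataclass
class Landmark:
    """Semantic landmark with free-form attributes, linked to nodes by co-visibility."""
    landmark_id: int
    attributes: dict[str, str]
    node_ids: set[int] = field(default_factory=set)


@dataclass
class HybridGraph:
    """G = (V, E, L): keyframe nodes, undirected edges, semantic landmarks."""
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: set[frozenset[int]] = field(default_factory=set)
    landmarks: dict[int, Landmark] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_landmark(self, landmark: Landmark) -> None:
        self.landmarks[landmark.landmark_id] = landmark

    def add_edge(self, a: int, b: int) -> None:
        if a == b:
            raise ValueError("self-loops are not allowed")
        self.edges.add(frozenset((a, b)))  # a frozenset makes the edge undirected

    def observe(self, node_id: int, landmark_id: int) -> None:
        """Record that a landmark is visible from a node (co-visibility link)."""
        self.nodes[node_id].landmark_ids.add(landmark_id)
        self.landmarks[landmark_id].node_ids.add(node_id)

    def neighbors(self, node_id: int) -> set[int]:
        return {other for edge in self.edges if node_id in edge
                for other in edge if other != node_id}


# Tiny hypothetical map: three keyframes in a row, one landmark seen from two of them.
graph = HybridGraph()
for i in range(3):
    graph.add_node(Node(i, (float(i), 0.0, 0.0), (0.0, 0.0, 0.0)))
graph.add_edge(0, 1)
graph.add_edge(1, 2)
graph.add_landmark(Landmark(10, {"name": "sofa", "function": "seating"}))
graph.observe(0, 10)
graph.observe(1, 10)
print(graph.neighbors(1), graph.landmarks[10].node_ids)
```

Storing the node-to-landmark link in both directions keeps lookups cheap either way: from a pose to what is visible there, and from a landmark to every keyframe that sees it.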
In practical localization, Astra-Global’s self-localization and target localization capabilities leverage a coarse-to-fine two-stage process for visual-language localization. The coarse stage analyzes input images and localization prompts, detects landmarks, establishes correspondence with a pre-built landmark map, and filters candidates based on visual consistency. The fine stage then uses the query image and coarse output to sample reference map nodes from the offline map, comparing their visual and positional information to directly output the predicted pose.
For language-based target localization, the model interprets natural language instructions, identifies relevant landmarks using their functional descriptions within the map, and then leverages landmark-to-node association mechanisms to locate relevant nodes, retrieving target images and 6-DoF poses.
To empower Astra-Global with robust localization abilities, the team employed a meticulous training methodology. Using Qwen2.5-VL as the backbone, they combined Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). SFT involved diverse datasets for various tasks, including coarse and fine localization, co-visibility detection, and motion trend estimation. In the GRPO phase, a rule-based reward function (including format, landmark extraction, map matching, and extra landmark rewards) was used to train for visual-language localization. Experiments showed GRPO significantly improved Astra-Global’s zero-shot generalization, achieving 99.9% localization accuracy in unseen home environments, surpassing SFT-only methods.
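To make the GRPO step more concrete, here is a small sketch of how rule-based rewards could be turned into group-relative advantages. The four component names mirror the reward terms listed above, but the equal weighting, the sample scores, and the use of the population standard deviation are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev

# Reward terms named in the text; summing them with equal weight is an assumption.
COMPONENTS = ("format", "landmark_extraction", "map_matching", "extra_landmark")


def total_reward(scores: dict[str, float]) -> float:
    """Collapse per-rule scores into one scalar reward for a sampled response."""
    return sum(scores.get(name, 0.0) for name in COMPONENTS)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Three hypothetical responses sampled for one localization prompt.
group = [
    {"format": 1.0, "landmark_extraction": 0.0, "map_matching": 0.0},
    {"format": 1.0, "landmark_extraction": 1.0, "map_matching": 0.0},
    {"format": 1.0, "landmark_extraction": 1.0, "map_matching": 1.0},
]
rewards = [total_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group, a response is rewarded for being better than its siblings rather than for hitting an absolute score, which removes the need for a separate learned value model.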
Astra-Local: The Intelligent Assistant for Local Planning
Astra-Local handles Astra’s high-frequency tasks. It is a multi-task network that efficiently generates local paths and accurately estimates odometry from sensor data. Its architecture comprises three core components: a 4D spatio-temporal encoder, a planning head, and an odometry head.
The 4D spatio-temporal encoder replaces traditional mobile stack perception and prediction modules. It begins with a 3D spatial encoder that processes N omnidirectional images through a Vision Transformer (ViT) and Lift-Splat-Shoot to convert 2D image features into 3D voxel features. This 3D encoder is trained using self-supervised learning via 3D volumetric differentiable neural rendering. The 4D spatio-temporal encoder then builds upon the 3D encoder, taking past voxel features and future timestamps as input to predict future voxel features through ResNet and DiT modules, providing current and future environmental representations for planning and odometry.
The planning head, based on pre-trained 4D features, robot speed, and task information, generates executable trajectories using Transformer-based flow matching. To prevent collisions, the planning head incorporates a masked ESDF loss (Euclidean Signed Distance Field). This loss calculates the ESDF of a 3D occupancy map and applies a 2D ground truth trajectory mask, significantly reducing collision rates. Experiments demonstrate its superior performance in collision rate and overall score on out-of-distribution (OOD) datasets compared to other methods.
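The idea behind an ESDF-based collision penalty can be illustrated on a tiny 2D grid. This is a rough sketch under several assumptions: the brute-force distance computation, the hinge-style penalty, the `margin` value, and evaluating the penalty only at trajectory cells are all illustrative choices, and the text does not specify how the paper applies its trajectory mask.

```python
import math


def signed_distance_field(occupancy):
    """Brute-force ESDF on a 2D grid (1 = occupied, 0 = free), in cell units.

    Free cells get the positive distance to the nearest occupied cell;
    occupied cells get the negative distance to the nearest free cell.
    """
    rows, cols = len(occupancy), len(occupancy[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    occupied = [cell for cell in cells if occupancy[cell[0]][cell[1]]]
    free = [cell for cell in cells if not occupancy[cell[0]][cell[1]]]

    def nearest(cell, targets):
        return min(math.dist(cell, t) for t in targets) if targets else math.inf

    return [[-nearest((r, c), free) if occupancy[r][c] else nearest((r, c), occupied)
             for c in range(cols)] for r in range(rows)]


def collision_penalty(esdf, trajectory, margin=1.5):
    """Mean hinge penalty over waypoints that come closer than `margin` to an obstacle."""
    return sum(max(0.0, margin - esdf[r][c]) for r, c in trajectory) / len(trajectory)


grid = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
esdf = signed_distance_field(grid)
safe_path = [(0, 0), (1, 0), (2, 0)]   # stays at least 2 cells from the obstacle
risky_path = [(1, 0), (1, 1), (1, 2)]  # drives straight into the obstacle
print(collision_penalty(esdf, safe_path), collision_penalty(esdf, risky_path))
```

A signed field is useful here because the penalty keeps growing once a waypoint is inside an obstacle, so a planner trained against it still receives a signal pointing back toward free space.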
The odometry head predicts the robot’s relative pose using current and past 4D features and additional sensor data (e.g., IMU, wheel data). It trains a Transformer model to fuse information from different sensors. Each sensor modality is processed by a specific tokenizer, combined with modality embeddings and temporal positional embeddi…
Content shortened automatically.
🔗 Source: syncedreview.com
📌 MAROKO133 Exclusive ai: Google unveils Gemini 3 claiming the lead in math, scien
After more than a month of rumors and feverish speculation — including Polymarket wagering on the release date — Google today unveiled Gemini 3, its newest proprietary frontier model family and the company’s most comprehensive AI release since the Gemini line debuted in 2023.
The models are proprietary (closed-source), available exclusively through Google products, developer platforms, and paid APIs, including Google AI Studio, Vertex AI, the Gemini command line interface (CLI) for developers, and third-party integrations across the broader integrated development environment (IDE) ecosystem.
Gemini 3 arrives as a full portfolio, including:
- Gemini 3 Pro: the flagship frontier model
- Gemini 3 Deep Think: an enhanced reasoning mode
- Generative interface models powering Visual Layout and Dynamic View
- Gemini Agent for multi-step task execution
- Gemini 3 engine embedded in Google Antigravity, the company’s new agent-first development environment.
"This is the best model in the world, by a crazy wide margin!" wrote Google DeepMind Research Scientist Yi Tay on X.
Independent AI benchmarking and analysis organization Artificial Analysis has already crowned Gemini 3 Pro the "new leader in AI" globally. The model achieved the top score of 73 on the organization's index, lifting Google from 9th place overall, where the preceding Gemini 2.5 Pro scored 60, behind models from OpenAI, Moonshot AI, xAI, Anthropic and MiniMax. As Artificial Analysis wrote on X: "For the first time, Google has the most intelligent model."
Another independent leaderboard site, LMArena, reported that Gemini 3 Pro ranked first in the world across all of its major evaluation tracks, including text reasoning, vision, coding, and web development.
In a public post, the @arena account on X said the model surpassed even the newly released (hours old) Grok-4.1, as well as Claude 4.5, and GPT-5-class systems in categories such as math, long-form queries, creative writing, and several occupational benchmarks.
The post also highlighted the scale of gains over Gemini 2.5 Pro, including a 50-point jump in text Elo, a 70-point increase in vision, and a 280-point rise in web-development tasks.
While these results reflect live community voting and remain preliminary, they signal unusually broad performance improvements across domains where previous Gemini models trailed competitors.
What It Means For Google In the Hotly Competitive AI Race
The launch represents one of Google’s largest, most tightly coordinated model releases.
Gemini 3 is shipping simultaneously across Google Search, the Gemini app, Google AI Studio, Vertex AI, and a range of developer tools.
Executives emphasized that this integration reflects Google’s control of its own tensor processing unit (TPU) hardware, the company’s homegrown rival to Nvidia GPUs, as well as its data center infrastructure and consumer products.
According to the company, the Gemini app now has more than 650 million monthly active users, more than 13 million developers build with Google’s AI tools, and more than 2 billion monthly users engage with Gemini-powered AI Overviews in Search.
At the center of the release is a shift toward agentic AI — systems that plan, act, navigate interfaces, and coordinate tools, rather than just generating text.
Gemini 3 is designed to translate high-level instructions into multi-step workflows across devices and applications, with the ability to generate functional interfaces, run tools, and manage complex tasks.
Major Performance Gains Over Gemini 2.5 Pro
Gemini 3 Pro introduces large gains over Gemini 2.5 Pro across reasoning, mathematics, multimodality, tool use, coding, and long-horizon planning. Google’s benchmark disclosures show substantial improvements in many categories.
Gemini 3 Pro debuted at the top of the LMArena text-reasoning leaderboard, posting a preliminary Elo score of 1501 based on pre-release community voting — the first LLM to ever cross the 1500 threshold.
That places it above xAI’s newly announced Grok-4.1-thinking model (1484) and Grok-4.1 (1465), both of which were unveiled just hours earlier, as well as above Gemini 2.5 Pro (1451) and recent Claude Sonnet and Opus releases.
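To get a feel for what these rating gaps mean, the classic Elo expected-score formula can be applied to the numbers quoted above. This is only an interpretive sketch: it assumes the standard 400-point logistic Elo scale, whereas LMArena fits its published scores with its own statistical model.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Classic Elo estimate of how often A is preferred over B in a head-to-head vote."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


gemini_3_pro = 1501
rivals = {
    "Grok-4.1-thinking": 1484,
    "Grok-4.1": 1465,
    "Gemini 2.5 Pro": 1451,
}

for name, rating in rivals.items():
    share = expected_score(gemini_3_pro, rating)
    print(f"Gemini 3 Pro vs {name}: {share:.1%} expected preference")
```

Under this assumption, a 17-point lead corresponds to being preferred only slightly more than half the time (roughly 52 percent), and a 50-point lead to roughly 57 percent, which is why small Elo gaps near the top of a leaderboard are best read as narrow margins.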
While this LMArena leaderboard covers only text performance and the results are labeled preliminary, the ranking positions Gemini 3 Pro as the strongest publicly evaluated model on that benchmark as of its launch day, though not necessarily the top performer in the world across all modalities, tasks, or evaluation suites.
In mathematical and scientific reasoning, Gemini 3 Pro scored 95 percent on AIME 2025 without tools and 100 percent with code execution, compared to 88 percent for its predecessor.
On GPQA Diamond, it reached 91.9 percent, up from 86.4 percent. The model also recorded a major jump on MathArena Apex, reaching 23.4 percent versus 0.5 percent for Gemini 2.5 Pro, and delivered 31.1 percent on ARC-AGI-2 compared to 4.9 percent previously.
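The size of these jumps is easier to compare when expressed both as percentage-point gains and as ratios. The short sketch below only rearranges the figures quoted above; the AIME entry uses the 95 percent no-tools result.

```python
# (Gemini 3 Pro, Gemini 2.5 Pro) scores in percent, as quoted in the text.
scores = {
    "AIME 2025": (95.0, 88.0),
    "GPQA Diamond": (91.9, 86.4),
    "MathArena Apex": (23.4, 0.5),
    "ARC-AGI-2": (31.1, 4.9),
}

gains = {}
for benchmark, (new, old) in scores.items():
    point_gain = round(new - old, 1)  # absolute gain in percentage points
    ratio = round(new / old, 1)       # how many times the old score
    gains[benchmark] = (point_gain, ratio)
    print(f"{benchmark}: +{point_gain} points, {ratio}x the previous score")
```

Looking at both columns separates the saturated benchmarks (AIME, GPQA Diamond), where a few points is all that remains to be gained, from the ones where the previous model was near zero and the ratio is the more telling number.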
ARC-AGI-2 is the second-generation version of the Abstraction and Reasoning Corpus (ARC), a benchmark introduced by AI researcher François Chollet to measure generalization, not memorization.
Unlike typical multiple-choice or dataset-based evaluations, ARC-AGI-2 presents models with tiny grid-based puzzles that require discovering and applying abstract rules.
Each task provides a few input–output examples, and the model must infer the underlying transformation and apply it to a new test case. The problems span visual pattern recognition, symbolic manipulation, object transformations, spatial reasoning, and rule induction — all designed to test reasoning capabilities that do not depend on training-set familiarity.
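The task format can be illustrated with a toy rule-induction loop. This is a heavily simplified sketch and not the real ARC-AGI-2 data format or a competitive solver: it assumes the hidden rule is one of five hand-written grid transformations, and the names `CANDIDATES` and `consistent_rules` are made up for this example.

```python
def identity(grid):
    return [row[:] for row in grid]


def flip_horizontal(grid):
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    return [row[:] for row in grid[::-1]]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def rotate_clockwise(grid):
    return [list(col) for col in zip(*grid[::-1])]


CANDIDATES = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "rotate_clockwise": rotate_clockwise,
}


def consistent_rules(examples):
    """Names of all candidate rules that reproduce every input-output example."""
    return [name for name, rule in CANDIDATES.items()
            if all(rule(inp) == out for inp, out in examples)]


examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
]

print(consistent_rules(examples[:1]))  # one example leaves the rule ambiguous
rules = consistent_rules(examples)     # both examples together pin it down
print(rules, CANDIDATES[rules[0]]([[5, 6, 7]]))
```

With only the first example, both a horizontal flip and a clockwise rotation fit; the second example rules out the rotation. Real ARC-AGI-2 tasks have no fixed list of candidate rules to enumerate, which is what makes them hard.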
The new ARC-AGI-2 variant is deliberately constructed to be out-of-distribution and resistant to memorization, making it one of the most difficult benchmarks for large language models. Its tasks are engineered to stress-test whether a model can infer a previously unseen rule purely from examples, a proxy for early forms of generalized problem-solving.
The "Deep Think" version of Gemini 3, designed to take longer to solve problems and use more reasoning, scored 45.1 percent on ARC-AGI-2, a substantial jump over prior frontier models, which typically score in the mid-teens to low-twenties. It also far exceeds Gemini 3 Pro’s 31.1 percent and is roughly nine times the 4.9 percent posted by Gemini 2.5 Pro.
These results suggest that Deep Think’s architecture is particularly effective at multi-step hypothesis generation, checking, and revision — the specific capabilities ARC-AGI-2 is designed to measure.
Multimodal performance increased across the board. Gemini 3 Pro scored 81 percent on MMMU-Pro, up from 68 percent, and 87.6 percent on Video-MMMU, compared to 83.6 percent. Its result on ScreenSpot-Pro, a key benchmark for agentic computer use, rose from 11.4 percent to 72.7 percent. Document understanding and chart r…
Content shortened automatically.
🔗 Source: venturebeat.com
🤖 MAROKO133 Note
This article is an automated summary compiled from several trusted sources. We pick trending topics so you always stay up to date.
