📌 MAROKO133 Hot AI: ByteDance Introduces Astra: A Dual-Model Architecture for Auto…
The increasing integration of robots across various sectors, from industrial manufacturing to daily life, highlights a growing need for advanced navigation systems. However, contemporary robot navigation systems face significant challenges in diverse and complex indoor environments, exposing the limitations of traditional approaches. Addressing the fundamental questions of “Where am I?”, “Where am I going?”, and “How do I get there?”, ByteDance has developed Astra, an innovative dual-model architecture designed to overcome these traditional navigation bottlenecks and enable general-purpose mobile robots.
Traditional navigation systems typically consist of multiple, smaller, and often rule-based modules to handle the core challenges of target localization, self-localization, and path planning. Target localization involves understanding natural language or image cues to pinpoint a destination on a map. Self-localization requires a robot to determine its precise position within a map, especially challenging in repetitive environments like warehouses where traditional methods often rely on artificial landmarks (e.g., QR codes). Path planning further divides into global planning for rough route generation and local planning for real-time obstacle avoidance and reaching intermediate waypoints.
While foundation models have shown promise in integrating smaller models to tackle broader tasks, the optimal number of models and their effective integration for comprehensive navigation remained an open question.
ByteDance’s Astra, detailed in their paper “Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning” (website: https://astra-mobility.github.io/), addresses these limitations. Following the System 1/System 2 paradigm, Astra features two primary sub-models: Astra-Global and Astra-Local. Astra-Global handles low-frequency tasks like target and self-localization, while Astra-Local manages high-frequency tasks such as local path planning and odometry estimation. This architecture promises to revolutionize how robots navigate complex indoor spaces.
Astra-Global: The Intelligent Brain for Global Localization
Astra-Global serves as the intelligent core of the Astra architecture, responsible for critical low-frequency tasks: self-localization and target localization. It functions as a Multimodal Large Language Model (MLLM), adept at processing both visual and linguistic inputs to achieve precise global positioning within a map. Its strength lies in utilizing a hybrid topological-semantic graph as contextual input, allowing the model to accurately locate positions based on query images or text prompts.
The construction of this robust localization system begins with offline mapping: the research team developed an offline method to build a hybrid topological-semantic graph G = (V, E, L) with three components (a minimal data-structure sketch follows the list):
- V (Nodes): Keyframes, obtained by temporally downsampling the input video, act as nodes; each encodes an SfM-estimated 6-Degrees-of-Freedom (DoF) camera pose and references to the landmarks it observes.
- E (Edges): Undirected edges establish connectivity based on relative node poses, crucial for global path planning.
- L (Landmarks): Semantic landmark information is extracted by Astra-Global from visual data at each node, enriching the map’s semantic understanding. These landmarks store semantic attributes and are connected to multiple nodes via co-visibility relationships.
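The paper describes the map only at this level of detail; as a rough illustration, the graph could be represented along the following lines. All class and field names here (`TopoSemanticGraph`, `MapNode`, `Landmark`, and so on) are assumptions made for the sketch, not Astra's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Landmark:
    landmark_id: int
    attributes: dict                                 # semantic attributes extracted by Astra-Global
    node_ids: set = field(default_factory=set)       # co-visibility links to map nodes

@dataclass
class MapNode:
    node_id: int
    pose: tuple                                      # SfM-estimated 6-DoF pose (x, y, z, roll, pitch, yaw)
    landmark_ids: set = field(default_factory=set)   # landmarks observed from this keyframe

@dataclass
class TopoSemanticGraph:
    nodes: dict = field(default_factory=dict)        # V: node_id -> MapNode
    edges: set = field(default_factory=set)          # E: undirected (node_id, node_id) pairs
    landmarks: dict = field(default_factory=dict)    # L: landmark_id -> Landmark

    def connect(self, a: int, b: int) -> None:
        """Add an undirected edge between two nodes (the basis for global path planning)."""
        self.edges.add((min(a, b), max(a, b)))

    def observe(self, node_id: int, landmark_id: int) -> None:
        """Record a co-visibility relationship between a node and a landmark."""
        self.nodes[node_id].landmark_ids.add(landmark_id)
        self.landmarks[landmark_id].node_ids.add(node_id)
```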
At run time, Astra-Global performs both self-localization and target localization through a coarse-to-fine, two-stage visual-language process. The coarse stage analyzes the input image and localization prompt, detects landmarks, establishes correspondences with the pre-built landmark map, and filters candidates by visual consistency. The fine stage then uses the query image and the coarse output to sample reference nodes from the offline map, comparing their visual and positional information to directly output the predicted pose.
For language-based target localization, the model interprets natural language instructions, identifies relevant landmarks using their functional descriptions within the map, and then leverages landmark-to-node association mechanisms to locate relevant nodes, retrieving target images and 6-DoF poses.
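Taken together, the two stages form a retrieval-then-refinement pipeline. The sketch below captures that control flow only; every callable it accepts is a hypothetical stand-in for an MLLM call or map lookup, since Astra-Global's actual interfaces are not public.

```python
def localize(query_image, prompt, graph,
             detect_landmarks,        # MLLM: image + prompt -> landmark descriptions
             match_landmarks,         # descriptions -> candidate landmarks from the map
             is_visually_consistent,  # candidate vs. query-image consistency check
             sample_nodes,            # candidates + graph -> reference map nodes
             predict_pose):           # MLLM: query + reference nodes -> 6-DoF pose
    # Coarse stage: landmark detection, map correspondence, consistency filtering.
    detected = detect_landmarks(query_image, prompt)
    candidates = [c for c in match_landmarks(detected, graph.landmarks)
                  if is_visually_consistent(c, query_image)]
    # Fine stage: compare the query against sampled reference nodes and
    # directly output the predicted pose.
    references = sample_nodes(candidates, graph)
    return predict_pose(query_image, references)
```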
To empower Astra-Global with robust localization abilities, the team employed a meticulous training methodology. Using Qwen2.5-VL as the backbone, they combined Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). SFT involved diverse datasets for various tasks, including coarse and fine localization, co-visibility detection, and motion trend estimation. In the GRPO phase, a rule-based reward function (including format, landmark extraction, map matching, and extra landmark rewards) was used to train for visual-language localization. Experiments showed GRPO significantly improved Astra-Global’s zero-shot generalization, achieving 99.9% localization accuracy in unseen home environments, surpassing SFT-only methods.
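The article names four reward terms (format, landmark extraction, map matching, extra landmarks) without giving their exact rules. A rule-based reward of roughly this shape is sketched below; the answer-tag format, weights, and toy parser are all invented for illustration.

```python
def parse_landmarks(response: str) -> set:
    """Toy parser: landmarks listed comma-separated inside <answer>...</answer> tags."""
    if not (response.startswith("<answer>") and response.endswith("</answer>")):
        return set()
    inner = response[len("<answer>"):-len("</answer>")]
    return {tok.strip() for tok in inner.split(",") if tok.strip()}

def localization_reward(response: str, gt_landmarks: set, map_landmarks: set) -> float:
    reward = 0.0
    # Format term: reward well-formed structured output.
    if response.startswith("<answer>") and response.endswith("</answer>"):
        reward += 0.5
    predicted = parse_landmarks(response)
    # Landmark-extraction term: fraction of ground-truth landmarks recovered.
    if gt_landmarks:
        reward += len(predicted & gt_landmarks) / len(gt_landmarks)
    # Map-matching term: predicted landmarks should exist in the offline map.
    if predicted:
        reward += len(predicted & map_landmarks) / len(predicted)
    # Extra-landmark term: penalize landmarks that match nothing in the map.
    reward -= 0.1 * len(predicted - map_landmarks)
    return reward
```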
Astra-Local: The Intelligent Assistant for Local Planning
Astra-Local acts as the intelligent assistant for Astra’s high-frequency tasks, a multi-task network capable of efficiently generating local paths and accurately estimating odometry from sensor data. Its architecture comprises three core components: a 4D spatio-temporal encoder, a planning head, and an odometry head.
The 4D spatio-temporal encoder replaces traditional mobile stack perception and prediction modules. It begins with a 3D spatial encoder that processes N omnidirectional images through a Vision Transformer (ViT) and Lift-Splat-Shoot to convert 2D image features into 3D voxel features. This 3D encoder is trained using self-supervised learning via 3D volumetric differentiable neural rendering. The 4D spatio-temporal encoder then builds upon the 3D encoder, taking past voxel features and future timestamps as input to predict future voxel features through ResNet and DiT modules, providing current and future environmental representations for planning and odometry.
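As a rough picture of the lift-splat step alone, the following sketch weights per-camera image features by a predicted depth distribution and scatters them into a voxel grid. The shapes, the precomputed voxel indices, and the omission of the ViT backbone and the temporal ResNet/DiT stages are all simplifications for illustration.

```python
import torch

def lift_splat(feats, depth_logits, voxel_idx, num_voxels):
    """
    feats:        (N, C, H, W) per-camera image features (e.g., from a ViT)
    depth_logits: (N, D, H, W) per-pixel depth-distribution logits
    voxel_idx:    (N, D, H, W) precomputed flat voxel index (long) per sample point
    returns:      (C, num_voxels) voxel feature grid
    """
    c = feats.shape[1]
    depth_probs = depth_logits.softmax(dim=1)                  # "lift": depth distribution
    weighted = feats.unsqueeze(2) * depth_probs.unsqueeze(1)   # (N, C, D, H, W) outer product
    grid = torch.zeros(c, num_voxels)
    # "splat": accumulate each depth-weighted feature into its voxel cell
    grid.index_add_(1, voxel_idx.reshape(-1),
                    weighted.permute(1, 0, 2, 3, 4).reshape(c, -1))
    return grid
```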
The planning head, based on pre-trained 4D features, robot speed, and task information, generates executable trajectories using Transformer-based flow matching. To prevent collisions, the planning head incorporates a masked Euclidean Signed Distance Field (ESDF) loss. This loss calculates the ESDF of a 3D occupancy map and applies a 2D ground-truth trajectory mask, significantly reducing collision rates. Experiments demonstrate its superior performance in collision rate and overall score on out-of-distribution (OOD) datasets compared to other methods.
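A masked ESDF penalty of the kind described could look roughly like the following, assuming the ESDF and the trajectory mask are precomputed on a 2D grid; the hinge margin and indexing scheme are assumptions of this sketch, not the paper's formulation.

```python
import torch

def masked_esdf_loss(traj_xy, esdf, mask, margin=0.2):
    """
    traj_xy: (T, 2) predicted waypoints, already in grid (row, col) coordinates
    esdf:    (H, W) distance to the nearest obstacle, derived from the 3D occupancy map
    mask:    (H, W) binary mask built from the 2D ground-truth trajectory
    """
    r = traj_xy[:, 0].long().clamp(0, esdf.shape[0] - 1)
    c = traj_xy[:, 1].long().clamp(0, esdf.shape[1] - 1)
    clearance = esdf[r, c]          # obstacle clearance at each predicted waypoint
    in_mask = mask[r, c].bool()     # only penalize inside the masked region
    # Hinge penalty: waypoints closer than `margin` to an obstacle incur loss.
    return (torch.relu(margin - clearance) * in_mask).mean()
```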
The odometry head predicts the robot’s relative pose using current and past 4D features and additional sensor data (e.g., IMU, wheel data). It trains a Transformer model to fuse information from the different sensors: each sensor modality is processed by a modality-specific tokenizer, combined with modality embeddings and temporal positional embeddings…
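Up to the truncation point, the described fusion amounts to: tokenize each modality, add modality and temporal embeddings, and run a Transformer over the combined token sequence. A minimal sketch follows, with all dimensions and the pose head invented for illustration.

```python
import torch
import torch.nn as nn

class OdometryFusion(nn.Module):
    """Fuse per-modality token sequences into a relative 6-DoF pose estimate."""

    def __init__(self, dims=None, d_model=128, max_t=64):
        super().__init__()
        dims = dims or {"voxel": 256, "imu": 6, "wheel": 2}   # illustrative input dims
        # One tokenizer (here just a linear projection) per sensor modality.
        self.tokenizers = nn.ModuleDict({k: nn.Linear(d, d_model) for k, d in dims.items()})
        self.modality_emb = nn.ParameterDict(
            {k: nn.Parameter(torch.zeros(d_model)) for k in dims})
        self.temporal_emb = nn.Parameter(torch.zeros(1, max_t, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pose_head = nn.Linear(d_model, 6)                # relative 6-DoF pose

    def forward(self, inputs):
        # inputs: {modality_name: (B, T, dim)} aligned token sequences, T <= max_t
        tokens = [self.tokenizers[k](x) + self.modality_emb[k]
                  + self.temporal_emb[:, :x.shape[1]]
                  for k, x in inputs.items()]
        fused = self.encoder(torch.cat(tokens, dim=1))        # cross-modal, cross-time fusion
        return self.pose_head(fused.mean(dim=1))              # pool tokens, predict pose
```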
Content automatically truncated.
🔗 Source: syncedreview.com
📌 MAROKO133 Update AI: Tome's founders ditch viral presentation app with 20M users…
Lightfield, a customer relationship management platform built entirely around artificial intelligence, officially launched to the public this week after a year of quiet development — a bold pivot by a startup that once had 20 million users and $43 million in the bank building something completely different.
The San Francisco-based company is positioning itself as a fundamental reimagining of how businesses track and manage customer relationships, abandoning the manual data entry that has defined CRMs for decades in favor of a system that automatically captures, organizes, and acts on customer interactions. With more than 100 early customers already using the platform daily — over half spending more than an hour per day in the system — Lightfield is a direct challenge to the legacy business models of Salesforce and HubSpot, both of which generate billions in annual revenue.
"The CRM, categorically, is perhaps the most complex and lowest satisfaction piece of software on Earth," said Keith Peiris, Lightfield's co-founder and CEO, in an exclusive interview with VentureBeat. "CRM companies have tens of millions of users, and you'd be hard-pressed to find a single one who actually loves the product. That problem is our opportunity."
The general availability announcement marks an unusual inflection point in enterprise software: a company betting that large language models have advanced enough to replace structured databases as the foundation of business-critical systems. It's a wager that has attracted backing from Coatue Management, which led the company's Series A when it was still building presentation software under the name Tome.
How Tome's founders abandoned 20 million users to build a CRM from scratch
The story behind Lightfield's creation reflects both conviction and pragmatism. Tome had achieved significant viral success as an AI-powered presentation platform, gaining millions of users who appreciated its visual design and ease of use. But Peiris said the team concluded that building lasting differentiation in the general-purpose presentation market would prove difficult, even with a working product and real user traction.
"Tome went viral as an AI slides product, and it was visually delightful and easy to use—the first real generative AI-based presentation platform," Peiris explained. "But, the more people used it, the more I realized that to really help people communicate something—anything—we needed more context."
That realization led to a fundamental rethinking. The team observed that the most effective communication requires deep understanding of relationships, company dynamics, and ongoing conversations — context that exists most richly in sales and customer-facing roles. Rather than building a horizontal tool for everyone, they decided to build vertically for go-to-market teams.
"We chose this lane, 'sales,' because so many people in these roles used Tome, and it seemed like the most logical place to go vertical," Peiris said. The team reduced headcount to a core group of engineers and spent a year building in stealth.
Dan Rose, a senior advisor at Coatue who led the original investment in Tome, said the pivot validated his conviction in the founding team. "It takes real guts to pivot, and even more so when the original product is working," Rose said. "They shrunk the team down to a core group of engineers and got to work building Lightfield. This was not an easy product to build, it is extremely complex under the hood."
Why Lightfield stores complete conversations instead of forcing data into fields
What distinguishes Lightfield from traditional CRMs is architectural, not cosmetic. While Salesforce, HubSpot, and their competitors require users to define rigid data schemas upfront — dropdown menus, custom fields, checkbox categories — and then manually populate those fields after every interaction, Lightfield stores the complete, unstructured record of what customers actually say and do.
"Traditional CRMs force every interaction through predefined fields — they're compressing rich, nuanced customer conversations into structured database entries," Peiris said. "We store customer data in its raw, lossless form. That means we're capturing significantly more detail and context than a traditional CRM ever could."
In practice, this means the system automatically records and transcribes sales calls, ingests emails, monitors product usage, and maintains what the company calls a "relationship timeline" — a complete chronological record of every touchpoint between a company and its customers. AI models then extract structured information from this raw data on demand, allowing companies to reorganize their data model without manual rework.
"If you realize you need different fields or want to reorganize your schema entirely, the system can remap and refill itself automatically," Peiris explained. "You're not locked into decisions you made on day one when you barely understood your sales process."
The system also generates meeting preparation briefs, drafts follow-up emails based on conversation context, and can be queried in natural language — capabilities that represent a departure from the passive database model that has defined CRMs since the category's inception in the 1980s.
Sales teams report reviving dead deals and cutting response times from months to days
Customer testimonials suggest the automation delivers measurable impact, particularly for small teams without dedicated sales operations staff. Tyler Postle, co-founder of Voker.ai, said Lightfield's AI agent helped him revive more than 40 stalled opportunities in a single two-hour session — leads he had neglected for six months while using HubSpot.
"Within 2 days, 10 of those were revived and became active opps that moved to poc," Postle said. "The problem was, instead of being a tool of action and autotracking—HubSpot was a tool where I had to do the work to record customer convos. Using HubSpot I was a data hygienist. Using Lighfield, I’m a closer."
Postle reported that his response times to prospects improved from weeks or months to one or two days, a change noticeable enough that customers commented on it. "Our prospects and customers have even noticed it," he said.
Radu Spineanu, co-founder of Humble Ops, highlighted a specific feature that addresses what he views as the primary cause of lost deals: simple neglect. "The killer feature is asking 'who haven't I followed up with?'" Spineanu said. "Most deals die from neglect, not rejection. Lightfield catches these dropped threads and can draft and send the follow-up immediately. That's…
Content automatically truncated.
🔗 Source: venturebeat.com
