Data flywheel or repeated samples? Physical AI, it's time to say goodbye to "hour worship"
Robots are still clocking hours, but what they really need is new samples

TL;DR
• Roboticist Animesh Garg questions the industry's use of teleoperation hours as an indicator of model capability.
• Robot data is expensive to collect, deployment data often comes from narrow scenes, and repeated samples quickly become expensive relative to the information they add.
• Long-tail failures, task coverage and novel samples may be worth more than total running time.
Animesh Garg, a robotics researcher formerly at the University of Toronto and now at Georgia Tech, compares the embodied-AI data race to baseball's "Moneyball" moment in an article titled Moneyball for Physical AI.
What he wants to challenge is an increasingly common fundraising narrative: that robotics companies can build a data flywheel out of more teleoperation, more real-world deployments and more operating hours. For investors, this is not an academic quibble. The cost structure, pace of commercialization and model moat of embodied-AI companies are often wrapped up in the phrase "closed data loop." If cumulative hours do not translate into real model progress, the market needs to take a fresh look at these companies' data assets.
"Data hours" may be the robotics industry's superstition
Garg borrows the classic analogy from Moneyball. In 2002, the Oakland Athletics won 103 games with one of the lowest payrolls in the league, not by buying more expensive players but by recognizing that the market was mispricing players. Traditional scouts valued batting average, stolen bases and positions, but the statistic that better explained a team's ability to score runs was on-base percentage.
In his view, Physical AI may be at a similar stage. The industry agrees that data is essential to general-purpose robot models, but it tends to treat the most visible indicators as the most important ones: cumulative teleoperation hours, number of demonstration trajectories, number of robots deployed, and hours of operation in production settings.
Robot data and text data are not equally available. Large language models can draw on large amounts of low-cost text from the internet, code repositories, books and web pages, so their bottlenecks lie more in compute, data cleaning and training efficiency. Robot models need data that captures physical interaction, action feedback and changes in the environment. Every hour of usable data has to be created in the real world, with the corresponding costs of equipment, labor, sites, sensors, failure handling and safety.
Roboticist Ken Goldberg has used the term "100,000-year data gap" to describe the gulf between robot data and internet-scale AI data. More precisely, the text and image data consumed in training today's large vision-language models, if converted into human reading or viewing time, amounts to roughly 100,000 years, and robots have no real interaction data at that scale. This is not a precise threshold for robot models; it is a reminder to the industry that real-world interaction data cannot be collected as cheaply as web text.
This is why Garg objects to the "teleoperation sweatshop" narrative. Large-scale manual teleoperation can indeed produce action-rich training samples, but if a company values its data only by total hours, money may flow to duplicated, low-difficulty, low-information-density samples rather than to the scenes that would do the most to reduce failure rates.
Three types of data buy different things
In Garg's taxonomy, Physical AI data falls into three broad categories: observation data, intervention data and deployment data. All three can be useful, but their costs, constraints and information density differ widely.
The first is observation data, such as first-person or third-person video. Its advantage is that it is cheap and broad in coverage, helping models understand objects, space, the outcomes of actions and the distribution of environments. Its limitation is equally clear: a model can watch what happens to people or objects without necessarily learning what a robot should do in a given state.
The second is intervention data: state-to-action trajectories generated through teleoperation, demonstration and human intervention. This data is more directly useful for robot training because it contains the full chain of "what was seen, how the robot moved, and what happened next." The price is that every high-quality trajectory has to be paid for, and labor and equipment costs are unlikely to fall as fast as the cost of software data.
The third is deployment data: telemetry generated by robots operating in real commercial settings. It sounds like the closest thing to a commercial flywheel: robots work, earn revenue and produce training data along the way. But there is a statistical trap here.
The first scenes where robots get deployed today are usually also the ones with the least variation, the most fixed processes and the most controllable conditions, such as highly structured warehouses, factories or single-task environments. Such production data may be large in volume, but it tends to be narrow and repetitive. Once the model has learned the local patterns, each additional hour of operation contributes less new information.
That does not make deployment data worthless. What is truly valuable is usually not the mass of routine "successful task" records but the failures, jams, unusual objects, edge cases and rare disturbances. The problem is that these long-tail samples do not arrive steadily at the pace a company would like, and they must be found, filtered and reset at higher cost.
More data helps, but repeated samples get expensive fast
Garg is cautious in how he borrows from language-model scaling laws: more data usually lowers model loss, but with diminishing returns. If samples are duplicated, nearly duplicated, or drawn from the same narrow distribution, the benefit of additional data shrinks even faster.
A concrete case makes this more intuitive. When a robot learns to pick a fixed package from a fixed shelf, the first few thousand demonstrations, failures and corrections can be very valuable. But once the same motions, objects, lighting and paths have been collected over and over, additional data mostly replicates local experience the model has already learned.
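The saturation effect in this example can be sketched numerically. The snippet below is a minimal illustration, not something taken from Garg's article: it assumes each collected sample is an independent draw from a fixed set of scene types, and the two distributions (10 scene types versus 1,000) are hypothetical.

```python
def expected_distinct(probabilities, draws):
    """Expected number of distinct scene types seen after `draws` independent samples."""
    return sum(1.0 - (1.0 - p) ** draws for p in probabilities)


def marginal_new_scenes(probabilities, draws):
    """Expected number of never-before-seen scene types contributed by the next sample."""
    return sum(p * (1.0 - p) ** draws for p in probabilities)


# Hypothetical distributions (assumptions for illustration only):
# a narrow deployment with 10 equally likely scene types,
# and a broad one with 1,000 equally likely scene types.
narrow = [1 / 10] * 10
broad = [1 / 1000] * 1000

for n in (10, 100, 1000):
    print(
        n,
        round(expected_distinct(narrow, n), 2),
        round(expected_distinct(broad, n), 2),
        round(marginal_new_scenes(narrow, n), 4),
        round(marginal_new_scenes(broad, n), 4),
    )
```

Under these assumptions, the narrow deployment has seen almost every scene type after about a hundred samples, so each further sample adds close to nothing new, while the broad distribution keeps yielding unseen scene types for far longer.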
Language-model training offers similar experience: duplicated and near-duplicated data can waste training budget, and excessive repetition can also hurt generalization. Garg does not transfer these conclusions directly to robot training; he uses them to indicate a direction: the value of data cannot be measured by quantity alone, but also depends on how different the samples are.
For Physical AI, diversity means at least two things. The first is exposing the model to more objects, spaces, materials, lighting, occlusion and ways of manipulating. The second is preventing a model from performing well on an oversimplified task distribution and then failing as soon as the scene changes slightly.
Long-tail failure cases are therefore critical. The real physical world is not uniformly distributed, and low-frequency anomalies often determine commercial viability: an object is tilted, packaging is deformed, a surface is reflective, a grasp slips, a person suddenly steps in, a sensor drops out, or floor friction changes. Even if a model performs better on routine samples, deployment will still be held back by a handful of failures if these tail events are not handled.
The deployment flywheel comes with a precondition
What the article really challenges is a common commercialization route among embodied-AI companies: deploy robots in narrow scenes, keep them usable with human remote takeover, collect production data, train stronger models, and then open up more scenes.
Garg calls this kind of path the "neo-integrator" approach. It tries to sidestep the cost of pure data collection by putting robots into commercial production so that operating revenue offsets the cost of the data. On paper, this route sounds more efficient than building teleoperation factories.
But it rests on a premise: data from early commercial scenes must be novel and diverse enough to help models transfer to more tasks. If robots are deployed on narrow tasks with low variability, low entropy and heavy engineering, the data saturates quickly. What the company ends up with may not be a flywheel for general capability but a portfolio of custom projects that need ongoing integration, maintenance and exception handling.
This entails two kinds of cost. First, every move into a new scene requires adapting to the environment and the process and building failure-handling and safety mechanisms. Second, if the deployment itself has not yet broken even, scaling it up is not necessarily low-cost data collection; it may amount to trading losses for a large volume of low-information samples.
Early deployment is therefore not useless, but it needs closer scrutiny: whether it brings new task coverage, how many failure and anomaly samples it yields, whether those samples transfer to other scenes, and how much model improvement each dollar buys after deducting hardware, labor, maintenance and integration costs.
The valuation narrative can't just ask how many hours have been accumulated
Garg's advice is not to stop collecting data but to change the yardstick. Cumulative operating hours, teleoperation hours and trajectory counts can serve as operational metrics, but they should not be equated directly with model progress.
More revealing questions include: when does data for a single task saturate; how much engineering integration does each additional task cost; how many distinct scenes and action clusters does the data cover; how much of the production data is genuine distribution shift and anomalies; and how many routine success segments in the deployment stream should be filtered out rather than fed to the model.
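The first of these questions, when data for a single task saturates, can be made operational in a rough way by tracking an evaluation metric as more data is added. The sketch below is one hypothetical heuristic, not a method described by Garg: the function name, the `min_gain` and `patience` parameters, and the example scores are all assumptions chosen for illustration.

```python
def saturation_point(eval_scores, min_gain=0.01, patience=2):
    """Return the index where a low-gain streak begins, or None if no streak is found.

    eval_scores[i] is a task success rate measured after the i-th batch of
    additional data. A streak means `patience` consecutive steps whose gain
    over the previous score is below `min_gain`.
    """
    streak = 0
    for i in range(1, len(eval_scores)):
        gain = eval_scores[i] - eval_scores[i - 1]
        streak = streak + 1 if gain < min_gain else 0
        if streak >= patience:
            return i - patience + 1
    return None


# Hypothetical success rates after each extra batch of teleoperation data.
scores = [0.50, 0.62, 0.70, 0.74, 0.745, 0.748, 0.749]
print(saturation_point(scores))
```

A signal like this would only suggest that more data of the same kind has stopped helping on that task; it says nothing about whether data from other tasks or scenes would help.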
Capital allocation should differ across the three types of data. Observation data should prioritize low cost, diversity and broad coverage, and be used to extend the boundary of base capabilities. When expensive teleoperation and demonstration data saturates on a single task, budget should shift to more tasks rather than repeating the same motions. Deployment data should focus on selecting failures, edge cases and out-of-distribution samples, and discard the bulk of routine operating records with low information density.
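The selection rule for deployment data can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the `Episode` fields, the `novelty` score (imagined here as the output of some out-of-distribution detector) and the 0.8 threshold are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    episode_id: str
    succeeded: bool
    human_takeover: bool
    novelty: float  # hypothetical score in [0, 1]; higher means further from known data


def select_for_training(episodes, novelty_threshold=0.8):
    """Keep failures, human takeovers and high-novelty episodes; drop routine successes."""
    return [
        e
        for e in episodes
        if (not e.succeeded) or e.human_takeover or e.novelty >= novelty_threshold
    ]


# Hypothetical deployment log.
log = [
    Episode("a", succeeded=True, human_takeover=False, novelty=0.05),   # routine success
    Episode("b", succeeded=False, human_takeover=False, novelty=0.10),  # failure
    Episode("c", succeeded=True, human_takeover=True, novelty=0.20),    # needed takeover
    Episode("d", succeeded=True, human_takeover=False, novelty=0.90),   # unusual scene
]
kept = select_for_training(log)
print([e.episode_id for e in kept])
```

In practice the hard part is the novelty score itself; the filter only shows how routine successes would be set aside once such a signal exists.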
These views have real consequences for the Physical AI valuation narrative. A company with more robots, longer operating hours and a larger teleoperation team does not automatically have a stronger model moat. What may be harder to replicate is the ability to keep finding high-value long-tail data, to judge when a given type of data has saturated, and to cover more of the task distribution at lower cost.
Still, this is a perspective on allocation, not a settled industry conclusion. Whether robot models will show scaling benefits like those of language models, whether deployment data can keep yielding new information in certain high-dimensional scenes, and how efficiently skills transfer between tasks all await more empirical results.
Garg's reminder comes down to something more specific: the "Moneyball metric" for Physical AI may not be data hours but novel samples per dollar. For robotics companies that still tell their story through data, what the market ultimately looks at is not how long they have run in total, but how much new information they have generated.
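To see how a novel-samples-per-dollar view can rank data sources differently from an hours view, consider the small sketch below. Everything in it is hypothetical: the `DataSource` fields, the two example sources and their figures are made up for illustration, and how a sample gets counted as "novel" is left as an assumption.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    hours: float
    novel_samples: int  # count of samples judged new by some (assumed) criterion
    cost_usd: float

    def novel_samples_per_dollar(self) -> float:
        if self.cost_usd <= 0:
            raise ValueError("cost_usd must be positive")
        return self.novel_samples / self.cost_usd


# Hypothetical figures, not taken from any real company.
sources = [
    DataSource("narrow_warehouse_fleet", hours=50_000, novel_samples=2_000, cost_usd=1_000_000),
    DataSource("multi_task_pilot", hours=5_000, novel_samples=6_000, cost_usd=600_000),
]

by_hours = max(sources, key=lambda s: s.hours)
by_novelty = max(sources, key=lambda s: s.novel_samples_per_dollar())
print(by_hours.name, by_novelty.name)
```

With these made-up numbers, the source with ten times the hours ranks lower once the yardstick is new information per dollar, which is the kind of reversal the article argues investors should look for.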
