Robotics Field Notes

FIELD NOTE 001

From Jim Fan · NVIDIA GEAR

Three fuels for embodied intelligence

Robots do not get an internet of actions. They must borrow common sense from human data, rehearse inside synthetic worlds, and pay for experience in physical reality.

The thesis

A capable robot foundation model will need all three data sources. Each one supplies what the other two are missing.

Jim Fan describes robotics data as a three-part portfolio: internet-scale data, simulation, and real-robot demonstrations. The useful distinction is not simply synthetic versus real. It is whether the data offers breadth, action labels, or physical truth. Today, no single source offers all three.

01 BREADTH

Web data

Watch the world before acting in it.

Human videos, images, text, and instructions expose models to a huge range of objects, activities, and environments. They teach semantic context and common-sense priors: what a mug is, how a drawer usually opens, or what tends to happen after water spills.

Superpower: Diversity at internet scale
Blind spot: No native motor commands

Seeing a hand move is not the same as knowing the joint positions, forces, or control signals that produced the motion.

02 SCALE

Simulation

Practice faster than the clock.

A simulator provides the missing action-to-consequence loop. Policies can attempt a grasp, fail, recover, and repeat across thousands of parallel worlds. The marginal cost of another episode is low, and experience scales with compute.

Superpower: Abundant labeled interaction
Blind spot: The sim-to-real gap

Graphics, contact physics, friction, latency, sensor noise, and environmental diversity can all differ from reality.

03 TRUTH

Real-robot data

Learn where physics cannot be approximated.

Teleoperation and human demonstrations record what actually happens on a specific body: camera frames, robot state, actions, contact, timing, and failures. This is the strongest grounding data for imitation learning and deployment.

Superpower: Embodiment-specific reality
Blind spot: Human time and hardware cost

Collection is bound to wall-clock time, operators, resets, maintenance, safety procedures, and physical wear.
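As a rough illustration of what one logged demonstration might hold, here is a minimal sketch of a record type. Every name in it (`TrajectoryStep`, `Demonstration`, `joint_positions`, `failure_note`) is hypothetical and chosen for this example; it is not a schema from the source or from any robot stack.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrajectoryStep:
    """One logged timestep of a teleoperated demonstration (hypothetical schema)."""
    timestamp_s: float                  # timing: seconds since episode start
    camera_frame_id: str                # reference to a stored camera frame
    joint_positions: List[float]        # robot state
    action: List[float]                 # command sent to the robot
    in_contact: bool = False            # contact flag, e.g. from a force sensor
    failure_note: Optional[str] = None  # set when this step belongs to a failure


@dataclass
class Demonstration:
    """A full episode recorded on one specific body."""
    task: str
    embodiment: str
    steps: List[TrajectoryStep] = field(default_factory=list)

    def duration_s(self) -> float:
        """Wall-clock length of the episode."""
        if not self.steps:
            return 0.0
        return self.steps[-1].timestamp_s - self.steps[0].timestamp_s

    def failed(self) -> bool:
        """True if any step was marked as part of a failure."""
        return any(s.failure_note is not None for s in self.steps)
```

Keeping failures in the record, rather than discarding them, is what lets later stages turn them into targeted data.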

The trade space

Each source wins on a different axis.

Data source	Breadth	Action signal	Physical fidelity	Marginal cost
Web	High	Low	Indirect	Low
Simulation	Designed	Exact	Approx.	Low
Real robot	Limited	Exact	Highest	High

My synthesis

The moat is the mixing strategy.

The lesson is not to choose one source. It is to assign each source the job it does best, then create a loop in which real failures improve the simulator and targeted demonstrations fill the remaining gaps.

01 Pretrain: Broad priors from web data
02 Rehearse: Action-rich practice in simulation
03 Ground: Calibrate with real demonstrations
04 Deploy & learn: Turn failures into the next curriculum
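One way to picture a mixing strategy is as per-stage sampling weights over the three sources. The sketch below is only that: the stage names follow the steps above, but the weights in `STAGE_MIX` are made-up placeholders, and `sample_sources` is a hypothetical helper, not a method described by Fan.

```python
import random
from collections import Counter

# Hypothetical mixture weights per training stage. The numbers are
# illustrative assumptions, not values from the source.
STAGE_MIX = {
    "pretrain": {"web": 0.8, "sim": 0.2, "real": 0.0},
    "rehearse": {"web": 0.2, "sim": 0.7, "real": 0.1},
    "ground":   {"web": 0.1, "sim": 0.3, "real": 0.6},
}


def sample_sources(stage: str, batch_size: int, rng: random.Random) -> Counter:
    """Draw the data source for each example in a batch using the stage's weights."""
    mix = STAGE_MIX[stage]
    sources = list(mix)
    weights = [mix[s] for s in sources]
    return Counter(rng.choices(sources, weights=weights, k=batch_size))


rng = random.Random(0)
for stage in STAGE_MIX:
    print(stage, dict(sample_sources(stage, 1000, rng)))
```

The interesting work is in how those weights are chosen and updated after real failures, which this sketch leaves open.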

The NEURA lens

Manual teaching is expensive. That makes data operations strategic.

A hands-on training approach such as teleoperation can produce high-fidelity, task-relevant demonstrations with no sim-to-real gap. Its constraint is economics: every trajectory consumes operator and robot time.

The strategic question is therefore not only how to collect more demonstrations, but how to make each one compound: reuse across tasks, prioritize informative edge cases, automate quality checks, and combine demonstrations with simulation and pretrained visual-language representations.

Keep this

Web data supplies common sense. Simulation supplies practice. Real robots supply ground truth.

FIELD NOTE 002

From Jim Fan · NVIDIA GEAR

Robots that think fast and slow

Kahneman's two systems map cleanly onto a robot's brain: a slow, deliberate planner and a fast, intuitive body. How the two talk to each other is still open research.

The thesis

Plan at one hertz. Act at a thousand. A robot mind needs both tempos at once.

Borrowing from Daniel Kahneman's Thinking, Fast and Slow, Jim Fan splits embodied intelligence in two. System 2 deliberates: reasoning, planning, writing code — work that large models already do well. System 1 executes: fast, intuitive motor control that never reaches conscious thought. Grasping a cup, you do not decide where each fingertip goes at every millisecond.

01 SLOW

System 2

Deliberate: reason, plan, write code.

The slow mind runs on big models. Vision-language models and LLMs in a loop can already reason about a scene, decompose a task, and even generate code to orchestrate behavior — deciding what to do next at roughly one decision per second.

Superpower: Long-horizon reasoning
Blind spot: Seconds of latency

A model that thinks in seconds cannot close a control loop that physics demands in milliseconds.

02 FAST

System 1

Intuitive: act before you can explain it.

The fast mind is reflex. Whole-body balance, grasping, and contact-rich manipulation need decisions at something like a thousand hertz — which points to compact, fast sensorimotor policies rather than giant models.

Superpower: Reflex-speed control
Blind spot: No deliberation

A reflex cannot plan a multi-step task or reason its way through a situation it has never seen.

The trade space

Two tempos, one body.

System	Tempo	Mode	Runs on	Breaks when
System 2	~1 Hz	Deliberate	Large VLMs & LLMs in a loop	Contact needs millisecond reflexes
System 1	~1 kHz	Intuitive	Compact sensorimotor policies	The task needs long-horizon planning
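The two tempos in the table can be sketched as a single loop with a simulated clock: the planner is consulted once per second, the controller on every millisecond tick. This is a toy under stated assumptions. `toy_planner` and `toy_controller` are hypothetical stand-ins, and a real system would run the two on separate processes or hardware rather than in one loop.

```python
from typing import Callable, Tuple

PLAN_HZ = 1         # System 2 tempo from the table
CONTROL_HZ = 1000   # System 1 tempo from the table
TICKS_PER_PLAN = CONTROL_HZ // PLAN_HZ


def run_two_tempos(seconds: int,
                   plan: Callable[[int], float],
                   control: Callable[[float, float], float]) -> Tuple[float, int]:
    """Step a simulated clock; return the final position and the number of plan calls."""
    goal, position, plan_calls = 0.0, 0.0, 0
    for tick in range(seconds * CONTROL_HZ):
        if tick % TICKS_PER_PLAN == 0:
            goal = plan(tick // TICKS_PER_PLAN)  # slow path: once per second
            plan_calls += 1
        position = control(position, goal)       # fast path: every tick
    return position, plan_calls


def toy_planner(second: int) -> float:
    """Stand-in for System 2: pick a new target each second."""
    return float(second + 1)


def toy_controller(position: float, goal: float) -> float:
    """Stand-in for System 1: proportional step toward the current goal."""
    return position + 0.01 * (goal - position)


final_position, plan_calls = run_two_tempos(3, toy_planner, toy_controller)
print(plan_calls, round(final_position, 2))
```

Over three simulated seconds the planner runs three times and the controller three thousand times, which is the whole point of the split.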

The open problem

The interface is the research.

Fan frames the unsolved part as architecture: one monolithic model or a cascade of specialized ones — and if a cascade, do the systems communicate through text or through latent variables? Text is interpretable but slow and lossy; latents are rich but opaque. NVIDIA's later GR00T N1 made the framing concrete: a vision-language module for System 2 driving a diffusion-transformer action module for System 1.

01 Monolithic: One end-to-end model (cleaner, harder to control)
02 Cascaded: Separate models per system, wired together
03 The wire: Text or latent vectors between the systems?
04 The clock: Bridging 1 Hz deliberation and 1 kHz control

The NEURA lens

Own the reflexes. Rent the reasoning.

General-purpose System 2 reasoning is fast becoming a commodity available from foundation-model providers. The defensible layers for a robot maker are System 1 — embodiment-specific, safety-critical, kilohertz-rate control — and the interface contract between the two systems.

That contract is also a safety boundary: while System 2 deliberates for seconds, System 1 must stay competent and safe on its own. Designing the handshake — skill APIs, latent commands, interruption semantics — is product work, not just research.
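A minimal sketch of one possible handshake rule: each command from System 2 carries a time-to-live, and System 1 falls back to a safe default when the command is missing or stale. `SkillCommand`, `ttl_s`, `SAFE_SKILL`, and `active_skill` are all hypothetical names for this example, not an API from the source or from GR00T.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SkillCommand:
    """Hypothetical message from System 2 to System 1."""
    skill: str          # e.g. "grasp" or "place"
    issued_at_s: float  # time the planner issued the command
    ttl_s: float        # how long the command stays valid


SAFE_SKILL = "hold"  # assumed safe default when no valid command exists


def active_skill(cmd: Optional[SkillCommand], now_s: float) -> str:
    """Pick the skill System 1 should run on this control tick."""
    if cmd is None:
        return SAFE_SKILL
    if now_s - cmd.issued_at_s > cmd.ttl_s:
        return SAFE_SKILL  # stale command: the planner is late or has stalled
    return cmd.skill
```

The design choice here is that safety never waits on the slow system: if deliberation overruns, the fast system degrades to a known behavior instead of acting on an old goal.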

Keep this

System 2 sets the goal. System 1 does the touching. The handshake is still research.

FIELD NOTE 003

From Jim Fan · NVIDIA GEAR

Robotics is pre-Chinchilla

Language models come with a recipe: how much data to pair with how many parameters. Embodied AI has no such curve yet — and finding it is itself a research frontier.

The thesis

The scaling laws for embodied AI are yet to be studied. Mapping them is the work.

In language modeling, scaling laws made training predictable: loss falls smoothly with parameters, data, and compute. The Chinchilla result reset the field — for a fixed compute budget, scale data and parameters together, roughly twenty tokens per parameter. Robotics has no equivalent map. Jim Fan is explicit that the embodied scaling law is an open research goal, not a known input to planning.

01 KNOWN

The LLM recipe

Compute in, capability out — predictably.

Kaplan (2020) and Chinchilla (Hoffmann et al., 2022) showed that language-model loss follows smooth power laws. Chinchilla's headline: most models were undertrained — pair every parameter with roughly 20 tokens. Capability planning became engineering rather than gambling.
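The 20-tokens-per-parameter rule turns into simple arithmetic once it is paired with the common approximation that training costs about 6 FLOPs per parameter per token. That approximation is an assumption added here, not something the text states, and the 70-billion-parameter example is the size of the Chinchilla model itself.

```python
import math

TOKENS_PER_PARAM = 20      # rule of thumb quoted in the text
FLOPS_PER_PARAM_TOKEN = 6  # common approximation C ~ 6 * N * D (assumption, not from the text)


def tokens_for(params: float) -> float:
    """Training tokens suggested by the 20-tokens-per-parameter rule."""
    return TOKENS_PER_PARAM * params


def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def params_for_budget(flops: float) -> float:
    """Invert C = 6 * N * (20 * N) = 120 * N**2 to get N from a compute budget."""
    return math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))


n = 70e9
d = tokens_for(n)
c = train_flops(n, d)
print(f"{n:.2e} params -> {d:.2e} tokens, {c:.2e} FLOPs")
```

Nothing comparable can be written for robotics yet, because neither constant has an agreed embodied counterpart.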

Superpower: Predictable returns
Blind spot: Proven for text only

Every frontier lab sizes its training runs against these curves before committing compute.

02 UNKNOWN

The robotics tangle

Too many axes to scale at once.

An embodied scaling law must span model size, simulation hours, real-fleet hours, embodiments, and the mixture across all three data fuels. There is no canonical "token of action," and the metric that matters is task success rate — expensive to evaluate and not guaranteed to track loss.

Hard part: Heterogeneous data
Harder part: Success rate ≠ loss

Real-robot data is bound to wall-clock time: a robot can collect at most 24 hours per day, and usually far less.

03 EXPECTED

The wager

Emergence should arrive here too.

Fan's bet is that the LLM pattern repeats: tokenize actions well, compress them with a transformer like any other modality, and emergent properties appear as data and model size scale. The embodied Chinchilla is a result waiting to be measured.

Bet: Actions scale like text
Status: Active research

Until the curve exists, every data budget in robotics is a hypothesis.
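To make "tokenize actions" concrete, here is one simple scheme: uniform binning of each continuous action dimension into integer tokens. This is a sketch of a generic technique, not necessarily the tokenizer Fan has in mind. The 256-bin vocabulary and the [-1, 1] action range are assumptions for the example.

```python
from typing import List

NUM_BINS = 256  # assumed vocabulary size per action dimension


def tokenize(action: List[float], low: float = -1.0, high: float = 1.0) -> List[int]:
    """Map each continuous action value to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        frac = (clipped - low) / (high - low)
        # The top edge (frac == 1.0) falls into the last bin.
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return tokens


def detokenize(tokens: List[int], low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map each token back to the centre of its bin."""
    width = (high - low) / NUM_BINS
    return [low + (t + 0.5) * width for t in tokens]
```

The round trip loses at most half a bin width per dimension, which is the price of treating actions like any other discrete modality.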

The trade space

One field has a map. The other has a frontier.

Dimension	Language models	Embodied AI
Scaling recipe	Mapped	Unknown
Unit of data	Tokens	Trajectories & action tokens
Capability proxy	Validation loss	Task success rate
Data supply	Internet-scale	Robot-bound

My synthesis

Whoever maps the curve budgets the future.

Until an embodied scaling law exists, robotics data strategy is a portfolio of bets. The rational response is to run the experiment deliberately: small pilots across a grid of model sizes and data volumes, fitted curves, and only then large spend on what the curve rewards.

01 Tokenize: Make actions a first-class modality
02 Pilot: Small runs across model × data grids
03 Fit: Measure capability vs data and parameters
04 Spend: Scale only what the curve rewards
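The "Fit" step can be sketched as an ordinary least-squares fit in log-log space, assuming the pilots follow a power law at all. That assumption is exactly what is unproven for robotics. The data points below are synthetic, generated from a known curve so the fit can be checked; they are not measurements.

```python
import math
from typing import List, Tuple


def fit_power_law(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic pilot grid: failure rate generated from 2.0 * hours**-0.5.
hours = [10.0, 20.0, 40.0, 80.0]
failure_rate = [2.0 * h ** -0.5 for h in hours]
a, b = fit_power_law(hours, failure_rate)
print(f"a={a:.3f}, b={b:.3f}")
```

On real pilots the residuals matter as much as the exponent: a poor fit is itself evidence that the chosen metric does not scale smoothly.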

The NEURA lens

A data engine should discover its own scaling law.

For a company whose moat includes demonstration data, "how much data is enough?" has direct capital consequences. Instrumenting collection — marginal task success per additional demonstration-hour, per task family and embodiment — turns operations into the experiment that reveals the curve.

A training-gym program is exactly the place where those curves can be measured systematically: controlled tasks, repeatable evaluation, and a steady stream of demonstrations to plot against capability.
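The metric named above, marginal task success per additional demonstration-hour, reduces to a slope between consecutive evaluation checkpoints. A minimal sketch, with `marginal_gains` as a hypothetical helper and the input format (pairs of cumulative hours and success rate) as an assumption:

```python
from typing import List, Tuple


def marginal_gains(points: List[Tuple[float, float]]) -> List[float]:
    """Success-rate gain per extra demonstration-hour between consecutive checkpoints.

    Each point is (cumulative_demo_hours, task_success_rate).
    """
    ordered = sorted(points)
    gains = []
    for (h0, s0), (h1, s1) in zip(ordered, ordered[1:]):
        gains.append((s1 - s0) / (h1 - h0))
    return gains
```

Tracked per task family and embodiment, a falling slope signals where additional collection has stopped paying for itself.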

Keep this

LLMs have a recipe. Robotics still has a hypothesis.

FIELD NOTE 004

From Jim Fan · NVIDIA GEAR

Why robots shaped like us?

Not romance, not sci-fi. The humanoid form factor is an interface decision — a body compatible with a world we built for ourselves, and with the video we already filmed of it.

The thesis

The world is built around the human form. Tools, spaces, and workflows assume our body and our hands.

Jim Fan's case for humanoids rests on compatibility: restaurants, factories, and hospitals, along with all the equipment and tools inside them, are designed for the human form and human hands. A robot that shares the form factor inherits that world as-is. This is why Project GR00T focuses on humanoid robots.

01 ENVIRONMENT

A world pre-fitted

Zero retrofit required.

Door handles, stairs, shelves at human height, tools with human grips: the built environment is one giant assumption about the human body. A humanoid steps into it unchanged, while every other form factor ships with an implicit demand to modify the environment instead.

Superpower: Drop-in compatibility
Blind spot: Hardest hardware problem

Generality of the body is bought with difficulty of the control problem.

02 BODY

Two arms, two legs

Redundancy is capability.

Two arms beat one for manipulation: hold-and-act, handovers, large or awkward objects. Two legs with enough degrees of freedom let the robot balance in different ways, traverse complex terrain, and brace its whole body to move objects in ways a wheeled base cannot.

Superpower: Bimanual + whole-body
Blind spot: Balance is expensive

The same degrees of freedom that make control hard are what make the body general.

03 DATA

The video dividend

The internet already filmed your training set.

Most video online shows human bodies and five-fingered hands at work. That footage can be used — partially — for robot training only when the embodiment matches. The closer the body is to human, the more of the internet becomes usable pretraining signal.

Superpower: Web-scale priors
Blind spot: Transfer is partial

Per Field Note 001, web data supplies common sense, not motor commands. A matching embodiment makes more of that data usable.

The trade space

Every body is a trade-off.

Form factor	Built-world fit	Manipulation	Video transfer	Economics today
Humanoid	High	Bimanual	Highest	Immature
Wheeled + arm	Flat floors	One arm	Low	Proven
Quadruped	No hands	Minimal	Low	Niches
Aerial	Reach only	Minimal	Low	Niches

My synthesis

GR00T bets on the general body.

Wheels are more efficient on flat ground, quadrupeds own rough inspection, drones own the air — each wins a niche by trading away manipulation breadth or data leverage. The humanoid bet is that generality compounds: one body that fits everywhere, learning from the largest data pool that exists.

01 Fit: Inherit human spaces and tools as-is
02 Manipulate: Two arms, whole-body skills
03 Learn: Mine human video for priors
04 Compound: One platform amortizes R&D across tasks

The NEURA lens

The humanoid is the endgame, not the only game.

A robot maker does not have to choose once. Cobot arms and mobile platforms win structured environments today and can fund the harder humanoid bet; a humanoid program like NEURA's 4NE-1 sits at the general end of the same portfolio.

A shared model across embodiments lets data collected on one platform partially transfer to the others — which makes the portfolio a data strategy, not just a product line.

Keep this

The humanoid is an interface decision: match the body the world was designed for, and unlock the video it already filmed.

FIELD NOTE 005

From Jim Fan · NVIDIA GEAR

Generalist first, specialist later

NLP already ran this experiment: a zoo of task-specific models lost to one generalist. Jim Fan expects robotics to repeat the curve — that is the premise of Project GR00T.

The thesis

The specialized generalist is almost always stronger than the original specialist.

Before ChatGPT, NLP shipped separate models for translation, coding, math, and creative writing, each with its own training pipeline. Then one generalist unified everything. Fan expects robotics, which is still mostly in its specialist stage, to follow the same trajectory.

01 BEFORE

The specialist zoo

One task, one model, one pipeline.

Translation, sentiment, summarization, code: each NLP task had its own architecture, its own training data, its own maintenance burden. Progress in one silo barely moved the others. Robotics today largely looks like this — one policy per cell, per task, per machine.

Strength: Focused performance
Weakness: Nothing transfers

Every new task restarts the engineering from near zero.

02 AFTER

The generalist

One model unified everything.

ChatGPT collapsed the zoo into a single model. Emergent capabilities — skills nobody explicitly trained — transfer from one task to the next, and a single model with a single API is far easier to maintain than a fleet of bespoke pipelines.

Strength: Transfer + emergence
Weakness: Heavy to train and run

The maintenance economics matter as much as the accuracy: one model, one update path.

03 NEXT

The specialized generalist

Trim the giant back down.

Once the generalist exists, you prompt it, fine-tune it, and distill it back into specialists — and the specialized generalist is almost always stronger than the original specialist. Many specialists, one parent model. This is exactly how LLMs are deployed today.

Strength: Best of both
Weakness: Needs the generalist first

GR00T's premise for robots: build the generalist, then specialize per task and embodiment.
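The "distill" step has a standard form in the LLM world: train a small student to match a large teacher's softened output distribution. The sketch below shows that loss in plain Python. Applying it to a robot policy assumes the policy emits logits over discrete action tokens, which is an assumption made here, not something the source specifies.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions, scaled by T**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher exactly and grows as their distributions diverge, so minimizing it over task-specific data yields a specialist shaped by the generalist.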

The trade space

Robotics is rerunning the NLP timeline.

Stage	Language AI	Robotics
Specialists	One model per task (translation, NER, sentiment)	Today: One policy per cell
Generalist	GPT-3 → ChatGPT unified the field	In progress: Project GR00T
Specialized generalists	Prompted, fine-tuned, distilled per task	To come: Per task & embodiment

My synthesis

Bet on the curve, not the niche.

If the generalist arrives, value migrates from hand-built task stacks to whoever can specialize the generalist fastest. The cycle then feeds itself: every deployed specialist generates the demonstrations that improve the next generalist.

01 Generalize: One model across tasks and embodiments
02 Prompt: Steer it with instructions
03 Distill: Compress it to the deployment
04 Repeat: Specialist data feeds the next generalist

The NEURA lens

Specialization becomes configuration.

If the generalist wins, per-customer engineering becomes per-customer fine-tuning: the marginal cost of serving a new task collapses, and the differentiator shifts to proprietary demonstrations plus the pipeline that turns them into specialized models quickly.

A training-gym program is that pipeline — the machinery for producing specialized generalists on demand.

Keep this

Train one generalist. Distill every specialist from it.

Building intelligence that can touch the world.

A public notebook for learning robotics from first principles.