Learning log · Berlin · 2026

Building intelligence that can touch the world.

A growing set of notes on embodied AI, robot learning, and the machinery behind physical intelligence.

001 · Data engines · Embodied AI
Three fuels for embodied intelligence
12 Jun 2026 · 7 min

002 · Architectures · Embodied AI
Robots that think fast and slow
12 Jun 2026 · 5 min

003 · Scaling laws · Research frontier
Robotics is pre-Chinchilla
12 Jun 2026 · 5 min

004 · Form factors · Humanoids
Why robots shaped like us?
12 Jun 2026 · 5 min

005 · Strategy · Foundation models
Generalist first, specialist later
12 Jun 2026 · 5 min
FIELD NOTE 001

From Jim Fan · NVIDIA GEAR

Three fuels for embodied intelligence

Robots do not get an internet of actions. They must borrow common sense from human data, rehearse inside synthetic worlds, and pay for experience in physical reality.

A capable robot foundation model will need all three data sources. Each one supplies what the other two are missing.

Jim Fan describes robotics data as a three-part portfolio: internet-scale data, simulation, and real-robot demonstrations. The useful distinction is not simply synthetic versus real. It is whether the data offers breadth, action labels, or physical truth. Today, no single source offers all three.

01 BREADTH

Web data

Watch the world before acting in it.

Human videos, images, text, and instructions expose models to a huge range of objects, activities, and environments. They teach semantic context and common-sense priors: what a mug is, how a drawer usually opens, or what tends to happen after water spills.

Superpower: Diversity at internet scale
Blind spot: No native motor commands

Seeing a hand move is not the same as knowing the joint positions, forces, or control signals that produced the motion.

02 SCALE

Simulation

Practice faster than the clock.

A simulator provides the missing action-to-consequence loop. Policies can attempt a grasp, fail, recover, and repeat across thousands of parallel worlds. The marginal cost of another episode is low, and experience scales with compute.

Superpower: Abundant labeled interaction
Blind spot: The sim-to-real gap

Graphics, contact physics, friction, latency, sensor noise, and environmental diversity can all differ from reality.

03 TRUTH

Real-robot data

Learn where physics cannot be approximated.

Teleoperation and human demonstrations record what actually happens on a specific body: camera frames, robot state, actions, contact, timing, and failures. This is the strongest grounding data for imitation learning and deployment.

Superpower: Embodiment-specific reality
Blind spot: Human time and hardware cost

Collection is bound to wall-clock time, operators, resets, maintenance, safety procedures, and physical wear.

Each source wins on a different axis.

| Data source | Breadth | Action signal | Physical fidelity | Marginal cost |
| --- | --- | --- | --- | --- |
| Web | High | Low | Indirect | Low |
| Simulation | Designed | Exact | Approximate | Low |
| Real robot | Limited | Exact | Highest | High |

The moat is the mixing strategy.

The lesson is not to choose one source. It is to assign each source the job it does best, then create a loop in which real failures improve the simulator and targeted demonstrations fill the remaining gaps.

  1. Pretrain: broad priors from web data
  2. Rehearse: action-rich practice in simulation
  3. Ground: calibrate with real demonstrations
  4. Deploy & learn: turn failures into the next curriculum
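
One way to read the four steps is as a training mixture that shifts from stage to stage. The sketch below is a minimal illustration under stated assumptions: the stage names follow the list, but the `MIXTURES` weights and the `sample_sources` helper are hypothetical and do not come from the source.

```python
import random
from collections import Counter

# Hypothetical mixture weights per stage. The numbers are illustrative
# assumptions, not values reported by Jim Fan or NVIDIA.
MIXTURES = {
    "pretrain": {"web": 0.9, "simulation": 0.1, "real_robot": 0.0},
    "rehearse": {"web": 0.2, "simulation": 0.7, "real_robot": 0.1},
    "ground": {"web": 0.1, "simulation": 0.4, "real_robot": 0.5},
}


def validate(mixtures: dict) -> None:
    """Check that every stage's weights sum to 1 (within float tolerance)."""
    for stage, weights in mixtures.items():
        total = sum(weights.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"weights for {stage!r} sum to {total}, not 1")


def sample_sources(stage: str, batch_size: int, rng: random.Random) -> Counter:
    """Draw a data source for each example in one batch of the given stage."""
    weights = MIXTURES[stage]
    names = list(weights)
    draws = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return Counter(draws)


validate(MIXTURES)
counts = sample_sources("ground", batch_size=1000, rng=random.Random(0))
```

The "deploy and learn" step would then feed back into the table itself: observed failures become a reason to change the weights or add simulated scenarios, which is the loop the note describes.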

Manual teaching is expensive. That makes data operations strategic.

A hands-on training approach such as teleoperation can produce high-fidelity, task-relevant demonstrations with no sim-to-real gap. Its constraint is economics: every trajectory consumes operator and robot time.

The strategic question is therefore not only how to collect more demonstrations, but how to make each one compound: reuse across tasks, prioritize informative edge cases, automate quality checks, and combine demonstrations with simulation and pretrained visual-language representations.

Keep this

Web data supplies common sense. Simulation supplies practice. Real robots supply ground truth.

FIELD NOTE 002

From Jim Fan · NVIDIA GEAR

Robots that think fast and slow

Kahneman's two systems map cleanly onto a robot's brain: a slow, deliberate planner and a fast, intuitive body. How the two talk to each other is still open research.

Plan at one hertz. Act at a thousand. A robot mind needs both tempos at once.

Borrowing from Daniel Kahneman's Thinking, Fast and Slow, Jim Fan splits embodied intelligence in two. System 2 deliberates: reasoning, planning, writing code — work that large models already do well. System 1 executes: fast, intuitive motor control that never reaches conscious thought. Grasping a cup, you do not decide where each fingertip goes at every millisecond.

01 SLOW

System 2

Deliberate: reason, plan, write code.

The slow mind runs on big models. Vision-language models and LLMs in a loop can already reason about a scene, decompose a task, and even generate code to orchestrate behavior — deciding what to do next at roughly one decision per second.

Superpower: Long-horizon reasoning
Blind spot: Seconds of latency

A model that thinks in seconds cannot close a control loop that physics demands in milliseconds.

02 FAST

System 1

Intuitive: act before you can explain it.

The fast mind is reflex. Whole-body balance, grasping, and contact-rich manipulation need decisions on the order of a thousand hertz, which points to compact, fast sensorimotor policies rather than giant models.

Superpower: Reflex-speed control
Blind spot: No deliberation

A reflex cannot plan a multi-step task or reason its way through a situation it has never seen.

Two tempos, one body.

| System | Tempo | Mode | Runs on | Breaks when |
| --- | --- | --- | --- | --- |
| System 2 | ~1 Hz | Deliberate | Large VLMs & LLMs in a loop | Contact needs millisecond reflexes |
| System 1 | ~1 kHz | Intuitive | Compact sensorimotor policies | The task needs long-horizon planning |

The interface is the research.

Fan frames the unsolved part as architecture: one monolithic model or a cascade of specialized ones — and if a cascade, do the systems communicate through text or through latent variables? Text is interpretable but slow and lossy; latents are rich but opaque. NVIDIA's later GR00T N1 made the framing concrete: a vision-language module for System 2 driving a diffusion-transformer action module for System 1.

  1. Monolithic: one end-to-end model, cleaner but harder to control
  2. Cascaded: separate models per system, wired together
  3. The wire: text or latent vectors between the systems?
  4. The clock: bridging 1 Hz deliberation and 1 kHz control
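
The "clock" problem can be made concrete with a toy two-rate loop. This is a minimal sketch, not GR00T's architecture: `slow_planner` and `fast_controller` are hypothetical stand-ins (a counter and a proportional step), and only the 1 Hz and 1 kHz tempos are taken from the note.

```python
from dataclasses import dataclass

PLAN_HZ = 1        # System 2 tempo quoted in the note (~1 Hz)
CONTROL_HZ = 1000  # System 1 tempo quoted in the note (~1 kHz)


@dataclass
class Goal:
    target: float  # hypothetical 1-D setpoint, e.g. a joint angle


def slow_planner(plan_index: int) -> Goal:
    """Stand-in for System 2: emit a new setpoint on each planning tick."""
    return Goal(target=float(plan_index))


def fast_controller(position: float, goal: Goal, gain: float = 0.01) -> float:
    """Stand-in for System 1: one proportional step toward the latest goal."""
    return position + gain * (goal.target - position)


def run(seconds: int) -> tuple[float, int, int]:
    """Simulate both tempos on one clock; return final position and call counts."""
    ticks_per_plan = CONTROL_HZ // PLAN_HZ
    position, goal = 0.0, Goal(target=0.0)
    plan_calls = control_calls = 0
    for tick in range(seconds * CONTROL_HZ):
        if tick % ticks_per_plan == 0:  # the slow loop fires once per second
            goal = slow_planner(tick // ticks_per_plan + 1)
            plan_calls += 1
        position = fast_controller(position, goal)  # the fast loop fires every tick
        control_calls += 1
    return position, plan_calls, control_calls


position, plan_calls, control_calls = run(seconds=3)
```

Over three simulated seconds the planner is called 3 times and the controller 3000 times. The controller always acts on the most recent goal, which is the simplest possible answer to how the two tempos share one body.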

Own the reflexes. Rent the reasoning.

General-purpose System 2 reasoning is fast becoming a commodity available from foundation-model providers. The defensible layers for a robot maker are System 1 — embodiment-specific, safety-critical, kilohertz-rate control — and the interface contract between the two systems.

That contract is also a safety boundary: while System 2 deliberates for seconds, System 1 must stay competent and safe on its own. Designing the handshake — skill APIs, latent commands, interruption semantics — is product work, not just research.
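
A small piece of that handshake can be sketched as a staleness check on the fast side. Everything named here is an assumption for illustration: the `SkillCommand` fields, the two-second `ttl`, and the `hold_position` fallback are hypothetical, not an interface described by Fan or NVIDIA.

```python
from dataclasses import dataclass
from typing import Optional

SAFE_FALLBACK = "hold_position"  # hypothetical safe default skill


@dataclass(frozen=True)
class SkillCommand:
    skill: str        # hypothetical skill name, e.g. "grasp"
    issued_at: float  # seconds on the controller's clock
    ttl: float = 2.0  # assumed validity window in seconds


def select_skill(
    command: Optional[SkillCommand], now: float, interrupt: bool = False
) -> str:
    """Decide what System 1 runs this tick, without waiting for System 2."""
    if interrupt or command is None:
        return SAFE_FALLBACK  # interruption always wins
    if now - command.issued_at > command.ttl:
        return SAFE_FALLBACK  # a stale plan is treated as no plan
    return command.skill


cmd = SkillCommand(skill="grasp", issued_at=10.0)
active = select_skill(cmd, now=11.0)
```

The design choice is that the fast loop never blocks on the slow one: a missing, expired, or interrupted command degrades to a safe behavior instead of stalling control.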

Keep this

System 2 sets the goal. System 1 does the touching. The handshake is still research.

FIELD NOTE 003

From Jim Fan · NVIDIA GEAR

Robotics is pre-Chinchilla

Language models come with a recipe: how much data to pair with how many parameters. Embodied AI has no such curve yet — and finding it is itself a research frontier.

The scaling laws for embodied AI are yet to be studied. Mapping them is the work.

In language modeling, scaling laws made training predictable: loss falls smoothly with parameters, data, and compute. The Chinchilla result reset the field — for a fixed compute budget, scale data and parameters together, roughly twenty tokens per parameter. Robotics has no equivalent map. Jim Fan is explicit that the embodied scaling law is an open research goal, not a known input to planning.

01 KNOWN

The LLM recipe

Compute in, capability out — predictably.

Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) showed that language-model loss follows smooth power laws. Chinchilla's headline: most large models of the time were undertrained for their size, and compute-optimal training pairs every parameter with roughly 20 tokens. Capability planning became engineering rather than gambling.

Superpower: Predictable returns
Blind spot: Proven for text only

Frontier labs routinely size training runs against these curves before committing compute.
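
The 20-tokens-per-parameter rule turns a compute budget into a model size with a few lines of arithmetic. The sketch assumes the common rule of thumb that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D). That estimate is not stated in the note and is only an approximation.

```python
import math

TOKENS_PER_PARAM = 20      # rule of thumb quoted in the note
FLOPS_PER_PARAM_TOKEN = 6  # assumed approximation: C ≈ 6 * N * D


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a FLOPs budget into (parameters, tokens) with D = 20 * N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


params, tokens = compute_optimal(5.76e23)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

For a budget of 5.76e23 FLOPs this gives roughly 69 billion parameters and 1.4 trillion tokens, close to the published Chinchilla configuration of 70B parameters trained on 1.4T tokens. Robotics has no equivalent of the two constants at the top of the block.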

02 UNKNOWN

The robotics tangle

Too many axes to scale at once.

An embodied scaling law must span model size, simulation hours, real-fleet hours, embodiments, and the mixture across all three data fuels. There is no canonical "token of action," and the metric that matters is task success rate — expensive to evaluate and not guaranteed to track loss.

Hard part: Heterogeneous data
Harder part: Success rate ≠ loss

Real-robot data is bound to wall-clock time: a robot can collect at most 24 hours per day, and usually far less.
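
The wall-clock ceiling is easy to put in numbers. The fleet size and daily hours below are hypothetical inputs chosen for illustration; only the 24-hour upper bound comes from the note.

```python
def fleet_hours(robots: int, days: int, hours_per_day: float) -> float:
    """Total robot-hours collected; hours_per_day is capped by the clock at 24."""
    if not 0 <= hours_per_day <= 24:
        raise ValueError("a robot cannot collect more than 24 hours per day")
    return robots * days * hours_per_day


realistic = fleet_hours(robots=100, days=365, hours_per_day=8)   # 292,000 hours
ceiling = fleet_hours(robots=100, days=365, hours_per_day=24)    # 876,000 hours
```

Under these assumptions a 100-robot fleet running 8 hours a day yields 292,000 robot-hours per year, and even the theoretical ceiling is 876,000. The only levers are more robots or more time, which is why this supply is called robot-bound.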

03 EXPECTED

The wager

Emergence should arrive here too.

Fan's bet is that the LLM pattern repeats: tokenize actions well, compress them with a transformer like any other modality, and emergent properties appear as data and model size scale. The embodied Chinchilla is a result waiting to be measured.

Bet: Actions scale like text
Status: Active research

Until the curve exists, every data budget in robotics is a hypothesis.
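
To make "tokenize actions" concrete, here is a minimal sketch of one simple scheme: uniform binning of each continuous action dimension into a fixed vocabulary. The note does not say which tokenizer Fan has in mind, so the 256-bin choice and the function names are assumptions.

```python
import numpy as np


def tokenize(actions, low: float, high: float, bins: int = 256) -> np.ndarray:
    """Map continuous actions in [low, high] to integer tokens in [0, bins - 1]."""
    clipped = np.clip(np.asarray(actions, dtype=np.float64), low, high)
    scaled = (clipped - low) / (high - low)  # now in [0, 1]
    return np.minimum((scaled * bins).astype(np.int64), bins - 1)


def detokenize(tokens, low: float, high: float, bins: int = 256) -> np.ndarray:
    """Map tokens back to the center of their bin."""
    return low + (np.asarray(tokens) + 0.5) / bins * (high - low)


actions = np.array([-1.0, 0.0, 0.37, 1.0])  # hypothetical normalized joint commands
tokens = tokenize(actions, low=-1.0, high=1.0)
recovered = detokenize(tokens, low=-1.0, high=1.0)
```

The round trip loses at most half a bin width per dimension, here 1/256 of the action range. Whether that precision is enough for contact-rich control is part of what "tokenize actions well" leaves open.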

One field has a map. The other has a frontier.

| Dimension | Language models | Embodied AI |
| --- | --- | --- |
| Scaling recipe | Mapped | Unknown |
| Unit of data | Tokens | Trajectories & action tokens |
| Capability proxy | Validation loss | Task success rate |
| Data supply | Internet-scale | Robot-bound |

Whoever maps the curve, budgets the future.

Until an embodied scaling law exists, robotics data strategy is a portfolio of bets. The rational response is to run the experiment deliberately: small pilots across a grid of model sizes and data volumes, fitted curves, and only then large spend on what the curve rewards.

  1. Tokenize: make actions a first-class modality
  2. Pilot: small runs across model × data grids
  3. Fit: measure capability vs data and parameters
  4. Spend: scale only what the curve rewards
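
The "fit" step can be sketched as a power-law fit in log-log space. The pilot numbers below are synthetic and generated from a known curve purely to show the mechanics. Whether real task failure rates follow a power law in demonstration hours is exactly the open question the note describes.

```python
import numpy as np


def fit_power_law(x, y) -> tuple[float, float]:
    """Fit y ≈ a * x**b by least squares on log(x), log(y); return (a, b)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(intercept)), float(slope)


# Synthetic pilot grid: failure rate generated from 0.9 * hours**-0.3.
demo_hours = np.array([10.0, 30.0, 100.0, 300.0, 1000.0])
failure_rate = 0.9 * demo_hours ** -0.3

a, b = fit_power_law(demo_hours, failure_rate)
predicted_at_3000 = a * 3000.0 ** b  # extrapolation used to decide the next spend
```

On this noise-free synthetic data the fit recovers the generating constants (a ≈ 0.9, b ≈ -0.3). Real pilots would add noise, several model sizes, and a check that the extrapolation holds before any large spend.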

A data engine should measure its own scaling law.

For a company whose moat includes demonstration data, "how much data is enough?" has direct capital consequences. Instrumenting collection — marginal task success per additional demonstration-hour, per task family and embodiment — turns operations into the experiment that reveals the curve.

A training-gym program is exactly the place where those curves can be measured systematically: controlled tasks, repeatable evaluation, and a steady stream of demonstrations to plot against capability.
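
The metric named above, marginal task success per additional demonstration-hour, is a finite difference between evaluation checkpoints. The checkpoint values in this sketch are hypothetical.

```python
def marginal_gains(hours: list[float], success: list[float]) -> list[float]:
    """Success-rate gain per extra demonstration-hour between checkpoints."""
    points = list(zip(hours, success))
    return [
        (s1 - s0) / (h1 - h0)
        for (h0, s0), (h1, s1) in zip(points, points[1:])
    ]


# Hypothetical evaluation checkpoints for one task family on one embodiment.
hours = [0.0, 100.0, 300.0]
success = [0.2, 0.5, 0.7]
gains = marginal_gains(hours, success)  # about [0.003, 0.001]
```

In this made-up example the second batch of hours buys a third as much success per hour as the first. Tracked per task family and embodiment, that slope is the signal for where the next demonstration-hour should go.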

Keep this

LLMs have a recipe. Robotics still has a hypothesis.

Source: "Jim Fan on NVIDIA's Embodied AI Lab and Jensen Huang's Prediction that All Robots Will Be Autonomous." Context: Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models."

FIELD NOTE 004

From Jim Fan · NVIDIA GEAR

Why robots shaped like us?

Not romance, not sci-fi. The humanoid form factor is an interface decision — a body compatible with a world we built for ourselves, and with the video we already filmed of it.

The world is built around the human form. Tools, spaces, and workflows assume our body and our hands.

Jim Fan's case for humanoids rests on compatibility: restaurants, factories, and hospitals, along with all the equipment and tools inside them, are designed for the human form and human hands. A robot that shares the form factor inherits that world as-is. This is why Project GR00T focuses on humanoid robots.

01 ENVIRONMENT

A world pre-fitted

Zero retrofit required.

Door handles, stairs, shelves at human height, tools with human grips: the built environment is one giant assumption about the human body. A humanoid steps into it unchanged, while every other form factor ships with an implicit demand to modify the environment instead.

Superpower: Drop-in compatibility
Blind spot: Hardest hardware problem

Generality of the body is bought with difficulty of the control problem.

02 BODY

Two arms, two legs

Redundancy is capability.

Two arms beat one for manipulation: hold-and-act, handovers, large or awkward objects. Two legs with enough degrees of freedom let the robot balance in different ways, traverse complex terrain, and brace its whole body to move objects in ways a wheeled base cannot.

Superpower: Bimanual + whole-body
Blind spot: Balance is expensive

The same degrees of freedom that make control hard are what make the body general.

03 DATA

The video dividend

The internet already filmed your training set.

Most video online shows human bodies and five-fingered hands at work. That footage transfers to robot training only partially, and only to the extent that the embodiment matches. The closer the body is to human, the more of the internet becomes usable pretraining signal.

Superpower: Web-scale priors
Blind spot: Transfer is partial

Per note 001: web data supplies common sense, not motor commands. Matching embodiment widens that pipe.

Every body is a trade-off.

| Form factor | Built-world fit | Manipulation | Video transfer | Economics today |
| --- | --- | --- | --- | --- |
| Humanoid | High | Bimanual | Highest | Immature |
| Wheeled + arm | Flat floors | One arm | Low | Proven |
| Quadruped | No hands | Minimal | Low | Niches |
| Aerial | Reach only | Minimal | Low | Niches |

GR00T bets on the general body.

Wheels are more efficient on flat ground, quadrupeds own rough inspection, drones own the air — each wins a niche by trading away manipulation breadth or data leverage. The humanoid bet is that generality compounds: one body that fits everywhere, learning from the largest data pool that exists.

  1. Fit: inherit human spaces and tools as-is
  2. Manipulate: two arms, whole-body skills
  3. Learn: mine human video for priors
  4. Compound: one platform amortizes R&D across tasks

The humanoid is the endgame, not the only game.

A robot maker does not have to choose once. Cobot arms and mobile platforms win structured environments today and can fund the harder humanoid bet; a humanoid program like NEURA's 4NE-1 sits at the general end of the same portfolio.

A shared model across embodiments lets data collected on one platform partially transfer to the others — which makes the portfolio a data strategy, not just a product line.

Keep this

The humanoid is an interface decision: match the body the world was designed for, and unlock the video it already filmed.

FIELD NOTE 005

From Jim Fan · NVIDIA GEAR

Generalist first, specialist later

NLP already ran this experiment: a zoo of task-specific models lost to one generalist. Jim Fan expects robotics to repeat the curve — that is the premise of Project GR00T.

The specialized generalist is almost always stronger than the original specialist.

Before ChatGPT, NLP shipped separate models for translation, coding, math, and creative writing, each with its own training pipeline. Then one generalist unified everything. Fan expects robotics, which is still mostly in its specialist stage, to follow the same trajectory.

01 BEFORE

The specialist zoo

One task, one model, one pipeline.

Translation, sentiment, summarization, code: each NLP task had its own architecture, its own training data, its own maintenance burden. Progress in one silo barely moved the others. Robotics today largely looks like this — one policy per cell, per task, per machine.

Strength: Focused performance
Weakness: Nothing transfers

Every new task restarts the engineering from near zero.

02 AFTER

The generalist

One model unified everything.

ChatGPT collapsed the zoo into a single model. Emergent capabilities — skills nobody explicitly trained — transfer from one task to the next, and a single model with a single API is far easier to maintain than a fleet of bespoke pipelines.

Strength: Transfer + emergence
Weakness: Heavy to train and run

The maintenance economics matter as much as the accuracy: one model, one update path.

03 NEXT

The specialized generalist

Trim the giant back down.

Once the generalist exists, you prompt it, fine-tune it, and distill it back into specialists, and the specialized generalist is almost always stronger than the original specialist. Many specialists, one parent model. This is largely how LLMs are deployed today.

Strength: Best of both
Weakness: Needs the generalist first

GR00T's premise for robots: build the generalist, then specialize per task and embodiment.
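
The "distill" step can be illustrated with a generic knowledge-distillation loss: the specialist (student) is trained to match the generalist's (teacher's) softened output distribution. This is a sketch of the general technique over discrete outputs such as action tokens. The note does not describe how GR00T specializes its models, and the temperature of 2.0 is an arbitrary assumption.

```python
import numpy as np


def softmax(logits, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(teacher_logits, student_logits, temperature: float = 2.0) -> float:
    """Mean KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))


# Hypothetical logits over four action tokens for a batch of two states.
teacher = np.array([[2.0, 0.5, -1.0, 0.0], [0.1, 0.2, 3.0, -0.5]])
student = np.array([[1.0, 0.8, -0.5, 0.1], [0.0, 0.0, 1.5, 0.0]])
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it on task-specific data pulls a smaller specialist toward the generalist's behavior on that task.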

Robotics is rerunning the NLP timeline.

| Stage | Language AI | Robotics |
| --- | --- | --- |
| Specialists | One model per task: translation, NER, sentiment | Today: one policy per cell |
| Generalist | GPT-3 → ChatGPT unified the field | In progress: Project GR00T |
| Specialized generalists | Prompted, fine-tuned, distilled per task | To come: per task & embodiment |

Bet on the curve, not the niche.

If the generalist arrives, value migrates from hand-built task stacks to whoever can specialize the generalist fastest. The cycle then feeds itself: every deployed specialist generates the demonstrations that improve the next generalist.

  1. Generalize: one model across tasks and embodiments
  2. Prompt: steer it with instructions
  3. Distill: compress it to the deployment
  4. Repeat: specialist data feeds the next generalist

Specialization becomes configuration.

If the generalist wins, per-customer engineering becomes per-customer fine-tuning: the marginal cost of serving a new task collapses, and the differentiator shifts to proprietary demonstrations plus the pipeline that turns them into specialized models quickly.

A training-gym program is that pipeline — the machinery for producing specialized generalists on demand.

Keep this

Train one generalist. Distill every specialist from it.

A public notebook for learning robotics from first principles.

Each note turns one source into a compact mental model: what matters, where the trade-offs sit, and which questions are worth carrying into the next experiment.