Why joint-embedding predictive architectures require structured knowledge substrates to reach superhuman adaptable intelligence, and what that means for the field.
TL;DR
JEPA learns how things look and move. It cannot learn what things mean, where knowledge comes from, or how concepts relate across domains. Well-formed sentences can be false — and neither LLMs nor JEPA can tell the difference without provenance. The Semantic AI Model (SAM) solves this by extracting verified triples from authoritative sources and building unique concept signatures that eliminate ambiguity at the structural level. A new paper co-authored by LeCun himself argues that intelligence must embrace specialization and be measured by adaptation speed — validating the architectural thesis that perception and symbolic knowledge are distinct specializations that must be composed, not collapsed into a single system. SAM’s 25-year operational history, from its 1999 founding through CIFA counterintelligence deployment to tens of millions of disambiguated concepts today, demonstrates that the knowledge substrate problem has been solved in practice. Perception isn’t intelligence. Intelligence is perception anchored to structured knowledge. The substrate exists.
AICYC Team — LLM assisted March 11, 2026
I. The Premise
Yann LeCun’s Joint Embedding Predictive Architecture (JEPA) represents the most ambitious current attempt to move AI beyond large language models toward systems that learn the way biological organisms do — by observing the world and building internal predictive models of it. Backed by a billion dollars through Meta’s AMI Labs (announced 2025), JEPA proposes that intelligence emerges from predicting missing or future information in abstract representation space rather than pixel space, trained primarily on video and sensory data.
The architecture has genuine theoretical merit. It addresses known limitations of LLMs — their lack of physical intuition, their inability to plan, their confinement to token-sequence prediction. But JEPA also contains structural gaps that are not engineering problems awaiting solutions. They are architectural absences — categories of knowledge the system cannot acquire through its own learning paradigm. Identifying these gaps precisely, without overstating or understating them, is the purpose of this analysis.
A recent paper co-authored by LeCun himself — “AI Must Embrace Specialization via Superhuman Adaptable Intelligence” (Goldfeder, Wyder, LeCun & Shwartz-Ziv, 2026) — introduces a framework that sharpens this analysis considerably. The paper argues that the field should abandon Artificial General Intelligence (AGI) as its North Star and replace it with Superhuman Adaptable Intelligence (SAI): intelligence that can learn to exceed humans at any important task, measured by the speed of adaptation to new domains. The SAI paper explicitly advocates for self-supervised learning, world models, and — critically — the composition of specialized modules whose coordination is “engineered rather than assumed.”[4] This architectural vision, coming from the architect of JEPA himself, provides the theoretical grounding for what this analysis demonstrates empirically: JEPA is one essential specialist module, but without a structured knowledge substrate it cannot achieve the adaptation speed that SAI demands.
II. What JEPA Does Well
JEPA’s core innovation is sound. By predicting in embedding space rather than pixel space, it avoids the intractable problem of generating exact future video frames and instead learns abstract representations that capture the structure of physical processes. A JEPA model trained on video of objects falling, liquids flowing, and surfaces deforming can develop what might be called intuitive physics — implicit models of gravity, solidity, permanence, and containment that mirror the understanding a human child develops in the first years of life.
The hierarchical extension of this idea is equally compelling. LeCun proposes multiple JEPA modules operating at different temporal and spatial scales, with higher-level modules predicting abstract outcomes over longer time horizons while lower-level modules handle fine-grained sensory prediction. This mirrors the hierarchical organization of biological perception and, if realized, could support the kind of multi-scale planning that current AI systems lack entirely.
None of this should be dismissed. Perceptual grounding is a genuine component of intelligence, and no prior architecture has proposed as coherent a mechanism for acquiring it from raw sensory data.
The SAI paper reinforces this assessment with a precise framing. JEPA, viewed through the SAI lens, is a perceptual specialist — and specialization, the paper argues, is not a limitation but a structural advantage. The No Free Lunch theorem guarantees that “no single, general-purpose machine learning algorithm or optimization strategy works best for every problem.”[4] An architecture that concentrates its capacity on learning physical dynamics from sensory data will outperform a general-purpose system at that task precisely because it is specialized. JEPA’s commitment to perceptual prediction in latent space is, in SAI’s terminology, exactly the kind of domain-targeted design that produces superhuman capability within a domain. The question is not whether JEPA does what it does well. It is whether what JEPA does constitutes the whole of intelligence — or one indispensable module within a larger composed system.
III. The Five Structural Gaps
The SAI paper provides a rigorous theoretical foundation for understanding why these gaps exist and why they cannot be closed by scaling JEPA alone. Its central arguments — that human intelligence is specialized rather than general, that true adaptability requires composition of specialized systems, and that “the AI that folds our proteins should not be the AI that folds our laundry”[4] — map directly onto the architectural absences identified below.
1. Symbolic Knowledge Has No Perceptual Pathway
The most consequential limitation is straightforward: the vast majority of knowledge that distinguishes an educated adult from a perceptually competent toddler has no sensory signature. A chemical engineer’s understanding that NFPA 1994 governs protective clothing standards, that Bacillus anthracis is classified as a CDC Category A select agent, or that Le Chatelier’s principle predicts equilibrium shifts under pressure changes — none of this can be derived from video observation at any scale. These facts exist exclusively in textbooks, technical manuals, standards documents, and regulatory codes. They entered human civilization through language, and they can only enter an AI system through language.
JEPA has no mechanism for reading a textbook. This is not a temporary limitation awaiting a future module. It is a consequence of the architecture’s foundational commitment to learning from sensory prediction. Adding a text-processing module would require precisely the kind of symbolic knowledge acquisition system that JEPA was designed to transcend.
The SAI paper’s analysis of human cognition illuminates why this gap is fundamental rather than incidental. Goldfeder et al. argue that human intelligence is not general but specialized through adaptation — honed by evolution for a specific subset of tasks critical to survival.[4] Crucially, the paper identifies two distinct notions of human capability that are routinely conflated: innate perceptual-motor competence (what a child acquires in the first years of life) and specialized expertise acquired through years of formal education and practice. JEPA targets the first. It has no pathway to the second. The SAI framework makes the distinction explicit: perceptual learning and symbolic knowledge acquisition are different specializations requiring different architectural substrates. Asking JEPA to acquire symbolic knowledge from video is, in SAI’s terms, equivalent to asking a protein-folding specialist to fold laundry — a category error that no amount of scaling resolves.
2. Polysemy Cannot Be Resolved Perceptually
Natural language is inherently ambiguous. The word “anthrax” refers to a bioweapon, a veterinary disease, an occupational hazard, and a rock band. The word “mercury” refers to a planet, an element, an automobile brand, and a Roman deity. Resolving these ambiguities — determining which concept a word refers to in a given context — is essential for any system that must interact with human knowledge.
LLMs attempt disambiguation through contextual embeddings, adjusting the representation of an ambiguous word based on surrounding tokens. But this disambiguation is probabilistic and approximate. The embedding for “anthrax” in a bioterrorism context still carries residual activation from every other training context, because the representations were never formally separated. There are no hard concept boundaries — only statistical gradients that can and do produce well-formed false statements when different senses bleed into each other.
JEPA inherits a more severe version of this problem. Its representations are learned from sensory data in continuous embedding space with no explicit concept boundaries whatsoever. Even if JEPA could observe scenarios involving different senses of “anthrax,” it has no mechanism for linking visual observations to the regulatory, taxonomic, and procedural distinctions that determine which sense matters in a given operational context.
The Semantic AI Model (SAM) addresses this at the structural level. After provenance-verified triples are aggregated, the ontology assembly process identifies, for each concept, a bag of tokens that is unique to that concept — not in the sense that no other concept shares any individual token, but in the combinatorial sense that the specific collection forms a signature distinguishing it from every other concept in the ontology. The concept of anthrax-as-bioweapon carries tokens like weaponization, aerosolization, spore viability, LD50, and Category A agent. The concept of anthrax-as-veterinary-disease carries livestock, cutaneous, grazing, soil persistence, and herd management. Each bag is unique. Polysemy is bounded because each sense has been crystallized into a distinct concept with a verified identifying signature.[1]
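The token-bag idea can be made concrete with a minimal sketch. The concept IDs, signature tokens, and scoring rule below are illustrative assumptions for exposition, not SAM's actual schema: each sense of an ambiguous term is a distinct concept, and disambiguation is a matter of matching context against signatures rather than blending statistical gradients.

```python
# Sketch: unique token-bag signatures per concept. Concept IDs and token
# sets are illustrative, not SAM's actual ontology schema.
ONTOLOGY = {
    "anthrax_bioweapon": {"weaponization", "aerosolization", "spore viability",
                          "ld50", "category a agent"},
    "anthrax_veterinary": {"livestock", "cutaneous", "grazing",
                           "soil persistence", "herd management"},
    "anthrax_band": {"thrash metal", "album", "tour", "guitarist"},
}

def disambiguate(context_tokens):
    """Pick the concept whose signature overlaps the query context most."""
    scores = {cid: len(sig & context_tokens) for cid, sig in ONTOLOGY.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no sense supported

print(disambiguate({"aerosolization", "ld50", "laboratory"}))
# → anthrax_bioweapon
```

Note the structural property the sketch illustrates: a context with no signature overlap yields no concept at all, rather than a low-confidence blend of all three senses.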
The SAI paper’s emphasis on adaptation speed as the primary metric of intelligence[4] makes the disambiguation problem even more acute. If the measure of intelligence is how quickly a system can adapt to a new task, then a system that must re-derive the correct sense of every ambiguous term from context on every query is fundamentally slower than a system that can look up the disambiguated concept in a pre-verified ontology. Disambiguation through statistical context must be repeated on every query and scales with context length and ambiguity depth. Disambiguation against a pre-built ontology reduces to an indexed lookup — the concept is already crystallized. For SAI’s metric of adaptation speed, the architectural advantage of structural disambiguation is not marginal. It is categorical.
3. Provenance Is Absent by Design
When an LLM generates the statement “anthrax is primarily treated with radiation therapy,” the sentence is grammatically correct and statistically plausible. It is also false. The model cannot distinguish this from a true statement because it has no record of where it learned anything. Knowledge entered the model as undifferentiated training signal — a blend of CDC manuals, blog posts, Reddit threads, and fiction — and the provenance was discarded during training.
JEPA has the same structural absence. A JEPA model that has learned to predict physical dynamics from video cannot tell you which videos it learned from, whether those videos depicted typical or anomalous physical behavior, or how reliable the source was. Its predictions are statistical summaries with no provenance chain.
For any application where the cost of error is high — counterintelligence, medical diagnosis, industrial safety, regulatory compliance — the inability to trace a claim to an authoritative source is not merely inconvenient. It is disqualifying.
The SAI paper sharpens this point through its discussion of what constitutes useful intelligence. Goldfeder et al. deliberately constrain their definition of SAI to tasks that have utility — excluding “a potentially infinite set of useless tasks.”[4] In high-utility domains — precisely the domains where SAI matters most — provenance is not an optional feature. A counterintelligence analyst, a clinical oncologist, or a chemical safety engineer cannot act on a claim whose source is unknown. The SAI framework’s focus on utility therefore implicitly requires provenance: a system that adapts with superhuman speed to a high-utility task but cannot verify its outputs is not useful for that task. Adaptation without provenance produces confident, rapid error — the most dangerous failure mode of all.
4. The Scalability Arithmetic Is Unforgiving
LeCun has argued that a human child processes approximately 50 trillion bytes of visual data by age four — far exceeding the text used to train any LLM. His conclusion is that perception is the richer learning signal. But the implication for compute requirements is sobering. Video data is orders of magnitude larger and more computationally expensive to process than text. Current V-JEPA experiments operate at modest scales compared to what LeCun’s vision requires, and the gap between current V-JEPA capabilities and “understands the physical world the way a toddler does” remains immense.
Even granting optimistic compression — say the perceptual equivalent of a child’s first four years achieved in five months of training — the result is a system with the cognitive profile of a toddler. The subsequent 18 years of formal education that produce a competent professional have no compression mechanism within JEPA’s architecture. There is no way to accelerate the acquisition of symbolic knowledge that the system cannot acquire at all.
The SAI paper provides the formal machinery to express this limitation. Its core metric is adaptation speed — the time required to acquire a new competence.[4] For perceptual competence, JEPA offers a plausible acceleration pathway: self-supervised learning from video may eventually compress years of visual experience into months of training. But for symbolic competence — the knowledge that constitutes professional expertise — JEPA offers no acceleration pathway at all, because it offers no acquisition pathway. A knowledge substrate that can ingest the world’s reference corpora in hours, extracting and verifying structured knowledge through automated ETL pipelines[1], does not merely accelerate symbolic knowledge acquisition. It transforms the scaling curve from impossible (JEPA alone: cannot acquire symbolic knowledge at any speed) to solved (SAM: billions of pages processed with provenance verification across dozens of domains). In SAI’s terms, this is the difference between a system whose adaptation speed for symbolic tasks is undefined and one whose adaptation speed is measured in hours.
The SAI paper also introduces a relevant critique of autoregressive models: their errors “diverge exponentially with prediction length.”[4] This compounding-error problem applies not only to LLMs but to any system that attempts to chain inferences without structural verification. A JEPA model making multi-step predictions about physical dynamics faces the same exponential divergence. A knowledge substrate with provenance-weighted verification provides exactly the structural checkpoints that can bound this divergence — not by correcting predictions after the fact, but by anchoring each inference step to verified knowledge, preventing error propagation before it compounds.
5. Cross-Domain Inference Requires Explicit Structure
Many of the most valuable forms of intelligence involve reasoning across domains: recognizing that an industrial disinfectant effective against spore-forming bacteria is relevant to hospital HVAC decontamination requires traversing from chemistry through microbiology to facilities engineering. This kind of reasoning depends on explicit, typed relationships between concepts — taxonomic hierarchies, causal links, regulatory mappings — that cannot emerge from perceptual learning alone.
A system trained on video of chemistry labs, microbiology labs, and HVAC systems would learn three separate perceptual domains. It would not learn the conceptual bridges between them, because those bridges are defined by scientific theory, engineering practice, and regulatory frameworks, none of which have visual signatures.
The SAI paper’s discussion of negative transfer provides the formal explanation for why this gap cannot be closed through multi-domain perceptual training. Goldfeder et al. cite research showing that “multi-task learning can benefit performance when tasks share an underlying structure” but “can lead to ‘negative transfer’ when tasks compete for representational capacity or impose conflicting gradients.”[4] A JEPA model trained simultaneously on chemistry-lab video, microbiology-lab video, and HVAC-system video would face precisely this problem: the visual features that predict chemical reactions are different from those that predict microbial behavior, which are different from those that predict airflow dynamics. Shared perceptual training across these domains would likely degrade performance in each, even as it fails to create the conceptual bridges that cross-domain reasoning requires.
The paper’s proposed solution — “compose [specialists] into systems whose coordination is engineered rather than assumed”[4] — is exactly the integration architecture this analysis proposes: a JEPA perceptual specialist composed with a SAM knowledge specialist through an engineered integration layer that provides the cross-domain conceptual bridges neither system can learn alone.
IV. The Substrate Requirement
The gaps identified above share a common structure: JEPA can learn how things look and behave but cannot learn what things mean, how they relate, or why they matter within the frameworks of human knowledge. The missing layer is not more perception. It is structured, provenance-verified, disambiguated symbolic knowledge — an ontological substrate against which perceptual learning can be anchored.
The SAI paper’s framework clarifies what role this substrate plays in architectural terms. LeCun and his co-authors argue that “the brain is not a monolith, but a system of systems” and that “adaptation requires hierarchy and diversity of models and modalities.”[4] They advocate for moving beyond token-level prediction to “latent prediction architectures such as Dreamer 4, Genie 2, or Joint Embedding Prediction Architecture (JEPA)” — but they simultaneously insist that “no single system will be able to adapt in the way that humans do.”[4] The knowledge substrate is the missing system in this system of systems. JEPA provides latent perceptual prediction. LLMs provide statistical linguistic generation. The substrate provides what neither can: structured, verified, disambiguated symbolic knowledge with provenance chains and cross-domain typed relationships.
The requirements for such a substrate can be stated precisely.
Automated knowledge extraction from authoritative sources. The system must be able to ingest textbooks, technical manuals, standards documents, and reference corpora and extract structured knowledge — entities, relationships, and attributes — without human-in-the-loop encoding. Manual knowledge encoding was identified as the fundamental bottleneck of AI as early as the 1950s, and any substrate that depends on it will face the same scaling wall that defeated earlier knowledge-based systems.[1][5]
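A toy version of the extract step clarifies what "without human-in-the-loop encoding" means in practice. The patterns, relation names, and source string below are illustrative assumptions, not SAM's actual extraction rules: structured triples are pulled from reference text by pattern, and each triple is tagged with its source at the moment of extraction.

```python
import re

# Sketch of the extract step of an ETL knowledge pipeline. Patterns and
# relation names are illustrative, not SAM's actual extraction rules.
PATTERNS = [
    (re.compile(r"(?P<s>[A-Z][\w\s]+?) is classified as (?P<o>[\w\s]+)\."),
     "classified_as"),
    (re.compile(r"(?P<s>[A-Z][\w\s]+?) is treated with (?P<o>[\w\s]+)\."),
     "treated_with"),
]

def extract_triples(text, source):
    """Extract (subject, relation, object, source) tuples from raw text."""
    triples = []
    for pattern, relation in PATTERNS:
        for m in pattern.finditer(text):
            triples.append((m["s"].strip(), relation, m["o"].strip(), source))
    return triples

doc = "Bacillus anthracis is classified as a Category A select agent."
print(extract_triples(doc, "cdc.gov/select-agents"))
```

Production extraction is of course far richer than two regexes, but the shape of the output — a triple that never exists without its source attribution — is the point.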
Provenance-weighted verification. Every extracted fact must retain its source attribution, and the system must be able to weight knowledge by the authority and reliability of its provenance. This is what distinguishes a knowledge substrate from a language model’s training corpus: not just what is known, but on whose authority it is known, verified through convergence across independent authoritative sources.
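Provenance weighting can be sketched as a simple acceptance rule. The authority weights and threshold below are illustrative assumptions: a fact is accepted only when independent attestations from sufficiently authoritative sources converge.

```python
# Sketch: provenance-weighted acceptance of an extracted fact. Authority
# weights and the threshold are illustrative assumptions, not SAM's values.
AUTHORITY = {"cdc.gov": 1.0, "peer_reviewed": 0.9, "news": 0.4, "forum": 0.1}

def verify(fact, attestations, threshold=1.2):
    """Accept a fact only if independent sources' weights sum past threshold."""
    weight = sum(AUTHORITY.get(src, 0.0) for src in set(attestations))
    return weight >= threshold, weight

ok, w = verify("anthrax treated_with ciprofloxacin",
               ["cdc.gov", "peer_reviewed"])
print(ok, w)  # True 1.9
```

Two forum posts repeating the same claim (total weight 0.1, since duplicates collapse to one attestation) fail the same threshold that one CDC manual plus one peer-reviewed paper clears — which is exactly the asymmetry an undifferentiated training corpus cannot express.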
Bounded polysemy through unique concept signatures. The substrate must resolve lexical ambiguity at the structural level — not through probabilistic contextual adjustment, but by assigning each distinct concept a unique identifying signature derived from its verified attributes. When every concept carries a token bag that distinguishes it from every other concept in the ontology, disambiguation becomes constraint satisfaction against verified structure rather than statistical guessing.[2]
Multi-dimensional retrieval and inference. The substrate must support reasoning along multiple axes simultaneously — taxonomic hierarchy, cross-domain association, temporal currency, and authority provenance — to resolve queries that span domains and require implicit knowledge capture.[3]
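A minimal sketch of typed-edge traversal shows how cross-domain queries become graph walks. The edge types and concepts below are illustrative, not SAM's schema; the point is that retrieval can be restricted to specific relation axes (taxonomic, causal, procedural) rather than flat similarity.

```python
from collections import defaultdict

# Sketch: a concept graph with typed edges, so retrieval can follow chosen
# axes. Edge types and concept names are illustrative assumptions.
EDGES = defaultdict(list)

def link(src, edge_type, dst):
    EDGES[src].append((edge_type, dst))

link("sporicidal_disinfectant", "effective_against", "bacillus_spores")
link("bacillus_spores", "instance_of", "spore_forming_bacteria")
link("hvac_decontamination", "requires_agent_against", "spore_forming_bacteria")

def traverse(start, allowed_edge_types, depth=3):
    """Collect concepts reachable via the allowed relation types only."""
    seen, frontier = {start}, [start]
    for _ in range(depth):
        frontier = [dst for node in frontier
                    for etype, dst in EDGES[node]
                    if etype in allowed_edge_types and dst not in seen]
        seen.update(frontier)
    return seen

print(traverse("sporicidal_disinfectant", {"effective_against", "instance_of"}))
```

Starting from a chemistry concept and following only causal and taxonomic edges reaches a microbiology concept in two hops — the conceptual bridge that three separately trained perceptual domains would never contain.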
Adaptation acceleration. In light of the SAI framework, a fifth requirement emerges: the substrate must enable rapid adaptation to new domains by providing pre-structured knowledge that a learning system can assimilate without deriving it from scratch. This is the substrate’s contribution to SAI’s central metric. A system that must learn chemistry from video observation has an adaptation speed measured in years or decades. A system that can ingest a verified chemistry ontology has an adaptation speed measured in hours. The substrate is not merely a knowledge store; it is an adaptation accelerator.
V. The Semantic AI Model as Existence Proof
Intellisophic’s Semantic AI Model (SAM) is not a theoretical proposal for such a substrate. It is a deployed system with over two decades of operational history that implements each of the requirements above. Its foundational architecture — automated knowledge acquisition from reference corpora using Extract-Transform-Load pipelines, loaded into a logic-based system with semantic web 3.0 structure — was designed in 1999 specifically to overcome the cost barrier of human-in-the-loop knowledge encoding.[1][5]
The critical point is not that SAM is superior to JEPA. The two systems address fundamentally different aspects of intelligence. JEPA targets perceptual grounding — the intuitive physics and sensory prediction that enable an agent to navigate the physical world. SAM targets symbolic grounding — the structured, authoritative, disambiguated knowledge that enables an agent to reason about the world in the terms that human civilization has developed over centuries of scientific, regulatory, and technical work.
An AI system that possesses perceptual grounding without symbolic grounding is a toddler: capable of navigating physical space but unable to read, reason about abstractions, or apply domain expertise. An AI system that possesses symbolic grounding without perceptual grounding is an oracle: capable of answering questions about the world with precision and authority but unable to see, hear, or physically interact with it. Superhuman adaptable intelligence requires both.
The SAI paper’s own terminology provides a more precise formulation. In SAI’s framework, intelligence is not a fixed checklist of competencies but a measure of adaptation speed across important tasks.[4] SAM’s operational history demonstrates adaptation speed that no perceptual-only or statistical-only system can match in the symbolic domain: its processing of billions of pages across dozens of domains, its scaling to tens of millions of disambiguated concepts[1], and its ability to onboard entirely new knowledge domains through automated ingestion of authoritative sources represent precisely the kind of rapid, reliable domain adaptation that SAI defines as the hallmark of intelligence. The substrate problem is not merely solvable in principle. It has been solved in practice — and its solution directly addresses the metric that LeCun’s own co-authored paper identifies as the correct measure of AI progress.
VI. Integration Architecture
The SAI paper provides the theoretical mandate for what this section describes as an engineering architecture. Goldfeder, Wyder, LeCun, and Shwartz-Ziv argue explicitly that “for high-stakes applications (e.g., scientific discovery, medicine), the correct aspiration is not to preserve the romance of a single generalist mind, but to build the strongest available specialists—and, where needed, compose them into systems whose coordination is engineered rather than assumed.”[4] They further advocate for “hierarchy and diversity of models and modalities” and reject “the concept of a single model or architecture as the ‘one paradigm to rule them all.'”[4] The integration architecture described below is a concrete instantiation of this principle.
If JEPA were to adopt a structured knowledge substrate, the integration would address each of the five identified gaps.
Perceptual representations learned by JEPA would be anchored to disambiguated concepts in the ontology, giving visual and physical intuitions explicit semantic identity. A JEPA module that recognizes the visual pattern of white powder in a laboratory could be linked to the specific concept node for Bacillus anthracis spore preparation, inheriting the full graph of relationships — weaponization protocols, decontamination requirements, regulatory classification — that the perceptual observation alone cannot provide.
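The anchoring step can be sketched as a nearest-concept lookup. The embeddings, concept names, and relation lists below are toy illustrative values, not a real JEPA or SAM interface: a perceptual vector is matched to its closest concept node, and the observation thereby inherits that concept's symbolic relationships.

```python
import math

# Sketch: anchor a perceptual embedding to the nearest ontology concept so
# the observation inherits symbolic relations. All values are toy examples.
CONCEPT_EMBEDDINGS = {
    "bacillus_anthracis_spore_prep": [0.9, 0.1, 0.3],
    "powdered_sugar": [0.2, 0.8, 0.5],
}
CONCEPT_RELATIONS = {
    "bacillus_anthracis_spore_prep": ["decontamination_protocol",
                                      "category_a_classification"],
    "powdered_sugar": ["food_ingredient"],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def anchor(perceptual_embedding):
    """Link a JEPA-style embedding to its nearest concept and relations."""
    concept = max(CONCEPT_EMBEDDINGS,
                  key=lambda c: cosine(CONCEPT_EMBEDDINGS[c],
                                       perceptual_embedding))
    return concept, CONCEPT_RELATIONS[concept]

print(anchor([0.85, 0.15, 0.2]))
```

In a real composition the match would be learned jointly rather than a raw cosine lookup, but the flow is the same: a visual pattern alone carries no protocol or classification; the concept node it anchors to carries both.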
Cross-domain inference would become possible because the substrate provides the explicit typed relationships that bridge perceptual domains. The JEPA model sees chemistry and microbiology as separate visual environments; the substrate connects them through causal, taxonomic, and procedural relationships derived from authoritative sources. In SAI’s terms, this composition of a perceptual specialist with a knowledge specialist avoids the negative transfer that would degrade a single model trained on all domains simultaneously, while achieving the cross-domain capability that neither specialist possesses alone.
Provenance would flow from the substrate into every inference chain, enabling the integrated system to ground its claims in cited sources rather than statistical patterns. This alone would address the hallucination problem that currently limits every production AI deployment. It would also bound the exponential error divergence that the SAI paper identifies as a fundamental limitation of autoregressive prediction[4], by providing structural verification checkpoints at each inference step.
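The checkpoint mechanism can be sketched directly. The verified facts and the inference chains below are illustrative assumptions: each step of a chain must be backed by a provenance-tagged fact, and the chain halts at the first unverified step instead of compounding error past it.

```python
# Sketch: bound multi-step inference by verifying each step against a
# provenance-tagged knowledge base. Facts and chains are illustrative.
VERIFIED = {  # (subject, relation, object) -> source
    ("anthrax", "caused_by", "bacillus_anthracis"): "cdc.gov",
    ("bacillus_anthracis", "treated_with", "ciprofloxacin"): "fda.gov",
}

def checked_chain(steps):
    """Accept an inference chain only while every step is provenance-backed."""
    accepted = []
    for step in steps:
        source = VERIFIED.get(step)
        if source is None:
            return accepted, f"halted: unverified step {step}"
        accepted.append((step, source))
    return accepted, "complete"

good = [("anthrax", "caused_by", "bacillus_anthracis"),
        ("bacillus_anthracis", "treated_with", "ciprofloxacin")]
bad = good + [("ciprofloxacin", "replaced_by", "radiation_therapy")]
print(checked_chain(bad)[1])
```

The contrast with an unchecked autoregressive chain is the whole argument: the fabricated “radiation therapy” step is rejected structurally, before it can seed further inference, rather than being generated fluently and propagated.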
Scalability would improve dramatically because the symbolic knowledge that takes humans 18 years of formal education to acquire could be ingested from textbooks and reference corpora in hours or days — provided the extraction and verification pipeline exists. The perceptual learning that JEPA contributes could then be compressed into the months required for video-based training, rather than the decades that would be needed if the system had to derive symbolic knowledge from observation alone.
Adaptation speed — SAI’s central metric — would be maximized because the integrated system would inherit both JEPA’s perceptual adaptation capabilities and SAM’s symbolic adaptation capabilities. When confronted with a new domain — say, a novel biosecurity threat involving an unfamiliar pathogen — the perceptual module could rapidly learn to recognize visual indicators from limited video data, while the knowledge substrate could simultaneously ingest the relevant scientific literature, regulatory frameworks, and technical manuals, providing the full conceptual graph within hours. The composed system’s adaptation speed would be bounded by the faster of its two specialist pathways, not the slower. This is the architectural realization of SAI’s principle that specialization, properly composed, outperforms generality.
VII. Conclusion
JEPA represents a serious and well-reasoned approach to perceptual intelligence. Its limitations are not failures of imagination or engineering but consequences of scope: it was designed to solve the perception problem, and it may well solve it. But perception is not intelligence. The distance between a system that understands physical dynamics and a system that can function as a competent professional in any knowledge-intensive domain is measured in precisely the structured, provenance-verified, polysemy-bounded symbolic knowledge that JEPA cannot acquire and a semantic knowledge substrate can provide.
The SAI paper — co-authored by LeCun himself — now provides the theoretical framework that validates this architectural analysis. Its core arguments align precisely with what SAM’s operational history has demonstrated for 25 years: that intelligence is specialized, not general; that the strongest systems are composed of the strongest specialists; that adaptation speed is the correct metric; and that no single architecture can serve as “the one paradigm to rule them all.”[4]
The path to superhuman adaptable intelligence does not run exclusively through perception or exclusively through symbolic knowledge. It runs through their integration — through the engineered composition of a perceptual specialist that learns how the world looks and moves with a knowledge specialist that knows what the world means, on whose authority, and how its concepts connect across the full breadth of human expertise. The perceptual layer must exist — LeCun is right about that. But it must be built on a substrate of structured knowledge that gives perceptual representations their meaning, their context, and their connection to the accumulated expertise of human civilization. That substrate is not hypothetical. It exists, it scales, and it has been validated in production for a quarter century.
The SAI paper ends with a key insight: “The AI that folds our proteins should not be the AI that folds our laundry.”[4] We offer a complementary formulation: The AI that sees the world should not be the AI that knows the world — but the AI that understands the world must be both, composed.
References
[1] “The Fundamental AI Innovation Is Automating Knowledge Acquisition,” Intellisophic, February 12, 2026. https://intellisophic.net/2026/02/12/the-fundamental-innovation-orthogonal-corpus-indexing-oci/
[2] “Context Graph: SAM‑1 Product Mapping,” Intellisophic, 2026. https://intellisophic.net
[3] Orthogonal Corpus Indexing (OCI) and multi-dimensional retrieval architecture as described in [1], specifically the MOSAEC Chem-Bio portal implementation demonstrating 72,000 taxonomy sources and cross-domain concept linkage at scale.
[4] Goldfeder, J., Wyder, P., LeCun, Y., & Shwartz-Ziv, R. (2026). “AI Must Embrace Specialization via Superhuman Adaptable Intelligence.” arXiv preprint arXiv:2602.23643. https://arxiv.org/abs/2602.23643
[5] The foundational architecture — automated knowledge acquisition from reference corpora using ETL pipelines loaded into a logic-based semantic web 3.0 system — was designed by Intellisophic founders Burch, Kon, and Hoey in 1999. The system was validated through the S-Book project with publishers representing over 80% of English-language reference corpora, and subsequently selected as the foundational AI platform for the Counterintelligence Field Activity (CIFA) following 9/11, outperforming hand-coded and statistical methods on NIST evaluation criteria maintained for 25 years. See: “The Fundamental AI Innovation Is Automating Knowledge Acquisition,” Intellisophic, February 12, 2026. https://intellisophic.net/2026/02/12/the-fundamental-innovation-orthogonal-corpus-indexing-oci/
