Identity vs Similarity

Definition

Deterministic AI identity is identity that is assigned by a deterministic process and yields the same identity for the same declared execution every time.

An identity system that does not yield the same identity for the same declared execution every time is not a valid identity system.

Identity is a deterministic assignment. It maps a declared execution to a single, fixed identity value. The same declared execution always produces the same identity. Similarity is a measurement. It computes the distance or closeness between two values and produces a score. These are categorically different operations. Identity produces a value. Similarity produces a relationship between values. The distinction is not one of degree: identity is not merely very high similarity. The distinction is one of kind. Identity and similarity belong to different logical categories and serve different structural purposes within any system that handles them. Deterministic AI Identity: The Formal Definition makes this explicit: identity is assigned by a deterministic process. Similarity is computed by a measurement process. Assignment and measurement are not the same operation.
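The difference in kind can be made concrete with a minimal sketch. It rests on assumptions that are not part of the definition above: the declared execution is represented as a Python dict, and a SHA-256 hash over canonical JSON stands in for the deterministic process. The names assign_identity and cosine_similarity, and the example fields, are hypothetical.

```python
import hashlib
import json
import math


def assign_identity(declared_execution: dict) -> str:
    """Map a declared execution to one fixed identity value.

    Assumption: SHA-256 over canonical JSON stands in for the
    deterministic process; the source does not prescribe a mechanism.
    """
    canonical = json.dumps(declared_execution, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cosine_similarity(a, b) -> float:
    """Measure closeness between two vectors. Returns a score, not an identity."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical declared execution.
execution = {"model": "example-model", "prompt": "hello", "seed": 7}

# Assignment: one input, one fixed value, repeatable on every run.
identity_first = assign_identity(execution)
identity_second = assign_identity(execution)

# Measurement: two inputs, one score describing the relationship between them.
score = cosine_similarity([1.0, 0.0], [0.9, 0.1])
```

The two functions do not even share a shape. One takes a single declared execution and returns a value. The other takes two values and returns a number about the pair.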

Similarity measures how close two things are. Identity determines what something is. A similarity score of 0.99 between two declared executions does not mean they have the same identity. It means they are close according to a particular metric. Closeness is not sameness. Two fingerprints can be very similar and belong to different people. Two DNA samples can be 99.9% identical and belong to different organisms. The gap between similarity and identity is not a gap that closes with better measurement. It is a categorical boundary that no measurement can cross.

The Constraint

The constraint that similarity violates is determinism. A deterministic identity system produces the same identity for the same declared execution every time, regardless of who runs the process or when it is run. Similarity-based systems introduce at least three sources of non-determinism. First, the choice of metric (cosine similarity, Euclidean distance, Jaccard index) is one that different evaluators may make differently. Second, the threshold at which similarity becomes “close enough” to declare identity is evaluator-dependent. Third, the representation used to compute similarity (the embedding model, the feature extraction method) varies across implementations.

Each of these sources of variation means that two verifiers examining the same declared execution may reach different conclusions about identity. One verifier using cosine similarity with a threshold of 0.95 may declare two executions identical. Another verifier using Euclidean distance with a different threshold may declare them distinct. The identity changes with the measurement methodology. This is the opposite of what identity requires. Identity must be independent of the verifier, the metric, and the threshold. See Verification Requires Determinism for the formal argument.
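The disagreement between two verifiers can be sketched in a few lines. The vectors, thresholds, and function names here are illustrative assumptions, not values taken from any real system.

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical vector representations of two declared executions.
exec_a = [1.0, 0.0]
exec_b = [3.0, 0.5]


def verifier_one(a, b, threshold: float = 0.95) -> bool:
    """Declares 'identical' when cosine similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


def verifier_two(a, b, max_distance: float = 0.5) -> bool:
    """Declares 'identical' when Euclidean distance is within max_distance."""
    return math.dist(a, b) <= max_distance


# Same pair of inputs, two measurement methodologies, two verdicts.
verdict_one = verifier_one(exec_a, exec_b)
verdict_two = verifier_two(exec_a, exec_b)
```

Under these assumed values the first verifier reports a match and the second reports a mismatch. Neither verifier made an error. The inputs did not change. Only the methodology did.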

Verification Requirement

Independent verification requires that any party can re-run the identity process on the same declared execution and arrive at the same identity. Similarity-based identity fails this requirement structurally. For a verifier to reproduce a similarity-based identity decision, the verifier must use the same distance metric, the same representation model, the same threshold, and the same comparison target. If any of these differ, the result may differ. The identity is not a function of the declared execution alone — it is a function of the declared execution plus the measurement apparatus.

This makes similarity-based identity observer-dependent. The identity you get depends on how you measure. In a valid identity system, the identity you get depends only on what you measure — the declared execution itself. The measurement apparatus is invisible because the process is deterministic. Two verifiers with different tools arrive at the same identity because the identity is determined by the input, not the tool. Similarity-based systems cannot make this guarantee. The tool is part of the output. See Independent Verification for why this matters.
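A sketch of tool-independence, under stated assumptions: the identity scheme publishes one rule (here, SHA-256 over canonical JSON, chosen only for illustration), and each verifier implements that rule with different code. The names verifier_a and verifier_b are hypothetical.

```python
import hashlib
import json


def canonical_bytes(declared_execution: dict) -> bytes:
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    text = json.dumps(declared_execution, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def verifier_a(declared_execution: dict) -> str:
    """One-shot hashing of the whole input."""
    return hashlib.sha256(canonical_bytes(declared_execution)).hexdigest()


def verifier_b(declared_execution: dict) -> str:
    """Streaming hashing in 8-byte chunks: a different tool, the same function."""
    data = canonical_bytes(declared_execution)
    hasher = hashlib.new("sha256")
    for start in range(0, len(data), 8):
        hasher.update(data[start:start + 8])
    return hasher.hexdigest()


# The same declared execution, received with its keys in a different order.
as_declared = {"model": "example-model", "prompt": "hello", "seed": 7}
as_received = {"seed": 7, "prompt": "hello", "model": "example-model"}

identity_a = verifier_a(as_declared)
identity_b = verifier_b(as_received)
```

The two verifiers share no code beyond the published rule, yet they produce the same value. Nothing in either function is a tunable parameter that an evaluator could set differently.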

Failure Modes

Metric divergence: Two verifiers choose different distance metrics. Cosine similarity and Euclidean distance can rank the same pairs differently whenever the vectors are not normalized to unit length. The same pair of declared executions is judged identical under one metric and distinct under another. The identity is a function of the metric, not the declared execution.
Threshold disagreement: Two verifiers use the same metric but different thresholds. At a threshold of 0.95, the declared execution matches its comparison target. At 0.98, it does not. The identity changes with the threshold. The threshold is evaluator-chosen. Therefore, the identity is evaluator-chosen.
Representation instability: The embedding model that produces vector representations is updated. The same declared execution now produces a different vector. The similarity score changes. The identity assignment changes. Identity has become a function of model version.
Dimensionality sensitivity: Similarity in high-dimensional spaces behaves differently than in low-dimensional spaces. Under the curse of dimensionality, distances between points tend to concentrate, so the nearest and farthest neighbors of a point become almost equally distant. Similarity scores lose their power to discriminate. Identity assignments based on them become arbitrary.
Transitivity failure: A can be similar to B, and B similar to C, while A is not similar to C. Thresholded similarity is not transitive. Identity is transitive. If A has the same identity as B, and B has the same identity as C, then A has the same identity as C. Similarity cannot guarantee this property. Therefore, similarity cannot function as identity.
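The transitivity failure in the last item is easy to reproduce. The sketch below assumes the simplest possible setting: one-dimensional values and a hypothetical similar function with an illustrative threshold of 1.0.

```python
def similar(a: float, b: float, threshold: float = 1.0) -> bool:
    """Thresholded similarity: a match when the values differ by at most threshold."""
    return abs(a - b) <= threshold


a, b, c = 0.0, 1.0, 2.0

a_like_b = similar(a, b)  # distance 1.0, within the threshold
b_like_c = similar(b, c)  # distance 1.0, within the threshold
a_like_c = similar(a, c)  # distance 2.0, outside the threshold
```

Here a matches b and b matches c, yet a does not match c. Equality of assigned identity values cannot behave this way, because equality is transitive by definition.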

Each failure mode traces back to the same root cause: similarity is a measurement, and measurements depend on the measuring instrument. Identity is an assignment, and valid assignments depend only on the declared execution. See Non-Deterministic Identity Is Invalid and Why Approximate Identity Fails for how non-determinism and approximation invalidate identity.

Why Invalid Models Fail

Probabilistic identity assigns identity based on statistical likelihood. Similarity scoring is often read probabilistically: a 0.95 score is treated as a 95% chance of identity, even though a similarity score is not a calibrated probability. This is not identity. Probability does not produce the fixed, repeatable values that identity requires.
Approximate identity treats closeness as equivalence. Similarity-based identity is a specific instance of approximate identity. Both fail because “close enough” is not “the same.” Identity demands exactness. Approximation, by definition, lacks it.
Output-based identity derives identity from what a system produces. Similarity is often computed on outputs — comparing two outputs to decide if they share identity. But identity must be assigned to declared execution, not derived from output comparison. Outputs are consequences, not determinants.
Similarity-based identity is the subject of this page. It substitutes distance measurement for deterministic assignment. No distance metric, regardless of precision, produces identity. Distance is a relationship. Identity is an assignment. These are different categories of operation.
Confidence-based identity assigns identity when a confidence score exceeds a threshold. Similarity scores are frequently used as confidence proxies. A similarity of 0.97 becomes a “97% confidence” in identity. The relabeling does not change the underlying operation. Confidence derived from similarity inherits all of similarity's structural failures.
Post-hoc reconstruction infers identity after execution by examining what happened. Similarity-based matching is a common reconstruction technique — comparing execution outputs against known patterns to infer identity after the fact. Reconstruction is not assignment. See Post-Hoc Reconstruction Is Invalid.
Observer-dependent identity varies with who performs the evaluation. Similarity is inherently observer-dependent because the metric, threshold, and representation are evaluator choices. Different observers produce different similarity scores and therefore different identity conclusions.
Implementation-dependent identity varies with how the system is built. Different implementations of the same similarity algorithm may use different floating-point precision, different libraries, or different optimization strategies, producing different similarity scores for the same inputs.
Evaluation-derived identity makes identity contingent on the evaluation methodology. Similarity is an evaluation methodology. Choosing a different similarity method — different metric, different space, different normalization — produces different identity conclusions. Identity must be independent of evaluation methodology.
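The implementation-dependent case can be shown without any library differences at all. In this sketch, two hypothetical implementations compute the same dot product and differ only in summation order. The feature values and the threshold of 0.6 are chosen for illustration.

```python
def dot_forward(a, b) -> float:
    """Accumulate products from the first element to the last."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_reversed(a, b) -> float:
    """Accumulate the same products from the last element to the first."""
    total = 0.0
    for x, y in zip(reversed(a), reversed(b)):
        total += x * y
    return total


features = [0.1, 0.2, 0.3]
weights = [1.0, 1.0, 1.0]
threshold = 0.6

score_forward = dot_forward(features, weights)    # (0.1 + 0.2) + 0.3
score_reversed = dot_reversed(features, weights)  # (0.3 + 0.2) + 0.1

# Floating-point addition is not associative, so the two scores differ
# in the last bit and fall on opposite sides of the threshold.
match_forward = score_forward > threshold
match_reversed = score_reversed > threshold
```

Both implementations are correct. They still disagree on the match decision for the same inputs, because the score sits on the threshold and the rounding differs with the order of operations.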

Category Boundary

Similarity is not a weaker form of identity. It is a different category of operation. Identity assigns a deterministic value to a declared execution. Similarity measures the distance between two values. You cannot make similarity “strong enough” to become identity, just as you cannot make a measurement “precise enough” to become an assignment. The boundary between identity and similarity is not a continuum. It is a categorical divide. Systems that use similarity for identity are making a category error, regardless of the precision of their similarity measurements.

This categorical boundary has practical consequences for system design. A system that uses similarity scoring to assign identity cannot be verified independently because the verification depends on the measurement apparatus. A system that uses deterministic identity assignment can be verified independently because the verification depends only on the declared execution. See Deterministic vs Similarity-Based Identity for the complete structural comparison. Architects must ensure that the identity assignment step in any pipeline is a deterministic function, not a similarity computation.

Logical Inevitability

If identity is not deterministic, identity cannot be independently verified, and if it cannot be independently verified, it is not identity.

Apply this chain to similarity-based identity. If similarity-based identity is identity, then it must be independently verifiable. For it to be independently verifiable, two verifiers must arrive at the same identity for the same declared execution. But similarity-based systems depend on the choice of metric, threshold, and representation. Two verifiers using different metrics or thresholds may reach different conclusions. Therefore, similarity-based identity is not independently verifiable. Therefore, similarity-based identity is not identity. The conclusion is not a matter of implementation quality. It is a logical consequence of using measurement where assignment is required.

Implications

Systems that use similarity for identity must be reclassified as matching systems, scoring systems, or retrieval systems. These are valuable capabilities, but they are not identity. Treating similarity as identity creates false assurance for users and regulators who expect the guarantees that identity provides — stability, verifiability, evaluator-independence. When those guarantees are absent because the system relies on similarity, the system is misrepresenting its capabilities.

For system architects, the path is clear: use similarity for what it does well — search, classification, recommendation — and use deterministic identity assignment for identity. These can coexist in the same system. A system might use similarity to narrow candidates and then deterministic assignment to establish identity. The key is that the identity step must be deterministic. See Same Input, Same Identity for the formal requirement and Why Output-Based Identity Fails for a related structural failure.
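One way the two roles might coexist is sketched below. Everything in it is an assumption made for illustration: the registry layout, the vectors, the find_by_identity helper, and the use of SHA-256 over canonical JSON as the deterministic assignment.

```python
import hashlib
import json
import math


def assign_identity(declared_execution: dict) -> str:
    # Assumed deterministic assignment: SHA-256 over canonical JSON.
    canonical = json.dumps(declared_execution, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical registry: each record pairs a declared execution
# with a vector that is used only for search.
registry = [
    {"execution": {"task": "summarize", "seed": 1}, "vector": [1.0, 0.0]},
    {"execution": {"task": "summarize", "seed": 2}, "vector": [0.99, 0.05]},
    {"execution": {"task": "translate", "seed": 1}, "vector": [0.0, 1.0]},
]


def find_by_identity(query_execution: dict, query_vector, top_k: int = 2):
    # Step 1 (retrieval): similarity narrows the candidates. It decides nothing.
    ranked = sorted(
        registry,
        key=lambda record: cosine_similarity(query_vector, record["vector"]),
        reverse=True,
    )
    candidates = ranked[:top_k]
    # Step 2 (identity): exact comparison of deterministically assigned values.
    wanted = assign_identity(query_execution)
    return [r for r in candidates if assign_identity(r["execution"]) == wanted]


matches = find_by_identity({"seed": 2, "task": "summarize"}, [1.0, 0.02])
```

Both of the top two candidates score above 0.99 against the query vector, but only one of them shares the query's identity. If similarity fails to surface the right candidate, that is a retrieval miss. It never produces a wrong identity, because the identity step accepts only exact equality.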

Frequently Asked Questions

What is the difference between identity and similarity?

Identity is a deterministic assignment that maps a declared execution to a single, fixed value. Similarity is a measurement of closeness between two values using a distance metric. Identity produces the same value every time for the same input. Similarity produces a score that expresses how close two things are. These are fundamentally different operations. Identity answers "what is this?" Similarity answers "how close are these two things?" The answers are not interchangeable.

Can a high similarity score substitute for identity?

No. A similarity score of 0.999 means two things are very close. It does not mean they are the same. Identity requires exact, deterministic equivalence. Similarity requires a threshold to make a binary decision, and that threshold is evaluator-chosen. Different evaluators set different thresholds and reach different conclusions. Identity that depends on who is measuring is not identity.

Why do many AI systems use similarity as identity?

Many AI systems use embedding spaces and vector distances to match entities. These systems produce similarity scores, not identity. They are used because they are practical for classification and retrieval tasks. However, practical utility in classification does not make something an identity system. The confusion arises because the output — a match or no match — looks like an identity decision. It is not. It is a classification decision based on a distance threshold.

Does cosine similarity produce identity?

No. Cosine similarity measures the cosine of the angle between two vectors. A cosine similarity of 1.0 means the vectors point in the same direction but may have different magnitudes. Even at 1.0, cosine similarity is a measurement, not a deterministic identity assignment. The embedding model that produces the vectors is itself a source of variation. Different models produce different vectors for the same declared execution. Identity cannot depend on which model is used to measure it.
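A small sketch of the magnitude point, using two illustrative vectors chosen so the arithmetic is easy to follow. The function name is hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


v1 = [3.0, 4.0]  # length 5
v2 = [6.0, 8.0]  # same direction, length 10

# dot = 50 and the product of the lengths is 50, so the score is 1.0
# (up to floating-point rounding).
score = cosine_similarity(v1, v2)

# The highest possible score, yet the vectors are not the same vector.
vectors_equal = v1 == v2
```

A maximal score describes direction only. The two vectors remain distinct values, so even a perfect cosine score does not establish sameness.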

What would it take to convert similarity into identity?

Similarity cannot be converted into identity. They are categorically different operations. You cannot refine a distance metric until it becomes a deterministic assignment. You can make similarity more precise, but precision of measurement is not identity assignment. The conversion would require replacing the similarity computation entirely with a deterministic function that maps declared execution to a fixed identity value. At that point, you have abandoned similarity, not improved it.

Is similarity useful in systems that also have identity?

Yes. Similarity can be used for search, classification, and retrieval within a system that uses deterministic identity for identity assignment. The key requirement is that similarity is never used as the identity mechanism itself. A system might use similarity to find candidate matches and then use deterministic identity to confirm or assign identity. The roles must remain distinct.