surface_validate_trie

surface_validate_trie(df: polars.dataframe.frame.DataFrame, max_num_checks: int = 1000, max_violations: int = 0, progress_wrap: typing.Callable = <function <lambda>>, seed: int | None = None) -> int

Validate trie reconstruction output data.

Performs structural checks and pairwise leaf-node validation to confirm that the reconstructed trie correctly reflects common differentia among source hereditary stratigraphic surfaces.

Checks performed:

  1. Required dstream/downstream columns for surface deserialization from data_hex are present.

  2. The id and ancestor_id columns are present.

  3. Taxon ids are contiguous (i.e., match row indices 0, 1, …, n-1).

  4. Data is topologically sorted (each ancestor appears before all its descendants).

  5. Random leaf-node pairs are sampled, and each pair’s first retained disparity rank (computed from the deserialized surfaces) is compared to the MRCA node’s rank in the trie, taken as dstream_rank - dstream_S (which converts from raw dstream T space to hstrat rank space). A violation occurs when first_disparity_rank < mrca_rank: the surfaces prove divergence earlier than the trie records.
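Checks 3 to 5 can be illustrated on plain Python sequences. This is a minimal sketch, not the library's implementation: the helper names are hypothetical, and the sort check assumes that a root taxon is marked by ancestor_id equal to its own id.

```python
from typing import Sequence


def ids_are_contiguous(ids: Sequence[int]) -> bool:
    """Check 3: every id equals its row index 0, 1, ..., n-1."""
    return all(row == taxon_id for row, taxon_id in enumerate(ids))


def is_topologically_sorted(ancestor_ids: Sequence[int]) -> bool:
    """Check 4, given contiguous ids: no row refers to a later row.

    Assumption: a root is recorded with ancestor_id == id.
    """
    return all(anc <= row for row, anc in enumerate(ancestor_ids))


def is_violation(
    first_disparity_rank: int, dstream_rank: int, dstream_S: int
) -> bool:
    """Check 5 rule: the surfaces diverge before the trie's MRCA rank."""
    mrca_rank = dstream_rank - dstream_S  # raw T space -> hstrat rank space
    return first_disparity_rank < mrca_rank
```

With contiguous ids, comparing each ancestor_id to its row index is enough to establish that ancestors precede descendants.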

Parameters

df : pl.DataFrame

Trie reconstruction output, as produced by surface_unpack_reconstruct with --no-drop-dstream-metadata.

Required schema:
  • ‘id’ : integer

    Unique identifier for each taxon (RE alife standard).

  • ‘ancestor_id’ : integer

    Unique identifier for ancestor taxon (RE alife standard).

  • ‘dstream_rank’ : integer

    Rank stored at this node (generation count).

  • ‘data_hex’ : string

    Raw genome data as a hexadecimal string.

  • ‘dstream_algo’ : string or categorical

    Name of downstream curation algorithm (e.g., 'dstream.steady_algo').

  • ‘dstream_storage_bitoffset’ : integer

    Bit offset of the dstream buffer field in data_hex.

  • ‘dstream_storage_bitwidth’ : integer

    Bit width of the dstream buffer field in data_hex.

  • ‘dstream_T_bitoffset’ : integer

    Bit offset of the dstream counter (“rank”) field in data_hex.

  • ‘dstream_T_bitwidth’ : integer

    Bit width of the dstream counter field in data_hex.

  • ‘dstream_S’ : integer

    Capacity of the dstream buffer (number of differentia stored per annotation).
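To make the bitoffset and bitwidth columns concrete, the sketch below slices a field out of a hexadecimal string. It is illustrative only: extract_bitfield is a hypothetical helper, and it assumes that offsets count from the most significant bit of a big-endian value, which may not match the layout the library actually uses.

```python
def extract_bitfield(data_hex: str, bitoffset: int, bitwidth: int) -> int:
    """Return the unsigned integer stored in a bit range of data_hex.

    Assumption: bit 0 is the most significant bit of the first hex digit.
    """
    total_bits = 4 * len(data_hex)  # each hex digit carries 4 bits
    shift = total_bits - bitoffset - bitwidth
    if bitoffset < 0 or bitwidth < 0 or shift < 0:
        raise ValueError("bit range falls outside data_hex")
    mask = (1 << bitwidth) - 1
    return (int(data_hex, 16) >> shift) & mask


# Hypothetical layout: a 32-bit counter followed by an 8-bit buffer.
counter = extract_bitfield("0000000aff", bitoffset=0, bitwidth=32)
buffer_bits = extract_bitfield("0000000aff", bitoffset=32, bitwidth=8)
```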

max_num_checks : int (default 1_000)

Maximum number of leaf-pair comparisons to perform. Pairs are sampled randomly without replacement from all possible pairs.
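One way to sample pairs without replacement is shown below. This is a sketch under stated assumptions rather than the library's sampling code: sample_leaf_pairs is a hypothetical helper, and it enumerates every pair up front, which costs memory quadratic in the number of leaves.

```python
import itertools
import random
from typing import Iterable, List, Optional, Tuple


def sample_leaf_pairs(
    leaf_ids: Iterable[int],
    max_num_checks: int,
    seed: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Draw up to max_num_checks distinct unordered leaf pairs."""
    rng = random.Random(seed)  # local generator keeps sampling reproducible
    all_pairs = list(itertools.combinations(leaf_ids, 2))
    num_checks = min(max_num_checks, len(all_pairs))
    return rng.sample(all_pairs, num_checks)
```

When fewer than max_num_checks pairs exist, every pair is returned exactly once.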

max_violations : int (default 0)

Maximum number of MRCA-rank violations tolerated before returning early. Callers should treat a return value exceeding this threshold as a validation failure.

progress_wrap : callable, optional

Wrapper applied to the pair-check iterator, e.g., tqdm.tqdm for a progress bar. Must accept and return an iterable. Default is the identity function (no wrapping).
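Any callable that takes an iterable and returns an iterable over the same items satisfies this contract. The two wrappers below are hypothetical examples written for illustration; neither is part of the library.

```python
import sys
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def identity_wrap(iterable: Iterable[T]) -> Iterable[T]:
    """Return the iterable unchanged (equivalent to no wrapping)."""
    return iterable


def counting_wrap(iterable: Iterable[T], every: int = 100) -> Iterator[T]:
    """Yield items unchanged, logging a count to stderr periodically."""
    for count, item in enumerate(iterable, start=1):
        if count % every == 0:
            print(f"checked {count} pairs", file=sys.stderr)
        yield item
```

A wrapper such as tqdm.tqdm fits the same shape, since it also accepts an iterable and yields its items unchanged.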

seed : int, default None

Random seed used when sampling leaf pairs.

Returns

int

Number of leaf-pair violations detected. Returns early (possibly before all max_num_checks pairs have been checked) once max_violations is exceeded.
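The early-return behavior can be pictured as a counting loop that stops as soon as the tally passes the threshold. The sketch below is an assumption about the control flow, not the library's code; count_violations is a hypothetical helper that takes precomputed per-pair outcomes.

```python
from typing import Iterable


def count_violations(
    pair_outcomes: Iterable[bool], max_violations: int = 0
) -> int:
    """Count True outcomes, stopping once max_violations is exceeded."""
    num_violations = 0
    for violated in pair_outcomes:
        num_violations += bool(violated)
        if num_violations > max_violations:
            break  # remaining pairs are left unchecked
    return num_violations


# A caller treats a count above the threshold as a failed validation.
outcomes = [False, True, True, True]
failed = count_violations(outcomes, max_violations=0) > 0
```

Because of the early exit, the returned count is a lower bound on the violations among the sampled pairs, not a full tally.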

Raises

ValueError

If any required column is missing, ids are not contiguous, or data is not topologically sorted.

See Also

surface_unpack_reconstruct :

Produces trie reconstruction data to be validated here.