surface_validate_trie
- surface_validate_trie(df: polars.DataFrame, max_num_checks: int = 1000, max_violations: int = 0, progress_wrap: typing.Callable = <function <lambda>>, seed: int | None = None) -> int
Validate trie reconstruction output data.
Performs structural checks and pairwise leaf-node validation to confirm that the reconstructed trie correctly reflects common differentia among the source hereditary stratigraphic surfaces.
Checks performed:
Required dstream/downstream columns for surface deserialization from data_hex are present.
The id and ancestor_id columns are present.
Taxon ids are contiguous (i.e., match row indices 0, 1, …, n-1).
Data is topologically sorted (each ancestor appears before all its descendants).
Samples random leaf-node pairs and compares each pair’s first retained disparity rank (computed from deserialized surfaces) to the MRCA node’s dstream_rank - dstream_S in the trie (converting from raw dstream T space to hstrat rank space). A violation occurs when first_disparity_rank < mrca_rank: the surfaces prove divergence earlier than the trie records.
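The rank conversion and violation test in the last check can be sketched as follows. This is a minimal illustration only: the helper names `mrca_rank_from_row` and `is_violation` are hypothetical, and the numeric values are made up rather than taken from real data.

```python
def mrca_rank_from_row(dstream_rank: int, dstream_S: int) -> int:
    """Convert a node's raw dstream T value to hstrat rank space."""
    return dstream_rank - dstream_S


def is_violation(first_disparity_rank: int, mrca_rank: int) -> bool:
    """True when the surfaces diverge earlier than the trie's MRCA records."""
    return first_disparity_rank < mrca_rank


# Illustrative values only (hypothetical, not from a real reconstruction).
mrca_rank = mrca_rank_from_row(dstream_rank=80, dstream_S=64)  # 80 - 64 = 16
print(is_violation(first_disparity_rank=20, mrca_rank=mrca_rank))  # False
print(is_violation(first_disparity_rank=10, mrca_rank=mrca_rank))  # True
```

Note that equality (first_disparity_rank == mrca_rank) does not count as a violation under the stated condition.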
Parameters
- df : pl.DataFrame
Trie reconstruction output, as produced by surface_unpack_reconstruct with --no-drop-dstream-metadata.
- Required schema:
- ‘id’ : integer
Unique identifier for each taxon (RE alife standard).
- ‘ancestor_id’ : integer
Unique identifier for ancestor taxon (RE alife standard).
- ‘dstream_rank’ : integer
Rank stored at this node (generation count).
- ‘data_hex’ : string
Raw genome data as a hexadecimal string.
- ‘dstream_algo’ : string or categorical
Name of downstream curation algorithm (e.g., 'dstream.steady_algo').
- ‘dstream_storage_bitoffset’ : integer
Bit offset of the dstream buffer field in data_hex.
- ‘dstream_storage_bitwidth’ : integer
Bit width of the dstream buffer field in data_hex.
- ‘dstream_T_bitoffset’ : integer
Bit offset of the dstream counter (“rank”) field in data_hex.
- ‘dstream_T_bitwidth’ : integer
Bit width of the dstream counter field in data_hex.
- ‘dstream_S’ : integer
Capacity of the dstream buffer (number of differentia stored per annotation).
- max_num_checks : int (default 1_000)
Maximum number of leaf-pair comparisons to perform. Pairs are sampled randomly without replacement from all possible pairs.
- max_violations : int (default 0)
Maximum number of MRCA-rank violations tolerated before returning early. Callers should treat a return value exceeding this threshold as a validation failure.
- progress_wrap : callable, optional
Wrapper applied to the pair-check iterator, e.g., tqdm.tqdm for a progress bar. Must accept and return an iterable. Default is the identity function (no wrapping).
- seed : int, default None
Random seed used when sampling leaf pairs.
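As an illustration of how max_num_checks and seed might interact, the sketch below draws distinct unordered leaf pairs reproducibly. It is not the library's implementation: `sample_leaf_pairs` is a hypothetical helper, and it enumerates every pair up front, which is only practical for small leaf counts.

```python
import itertools
import random
import typing


def sample_leaf_pairs(
    leaf_ids: typing.Sequence[int],
    max_num_checks: int = 1000,
    seed: typing.Optional[int] = None,
) -> typing.List[typing.Tuple[int, int]]:
    """Draw up to max_num_checks distinct unordered leaf pairs."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    all_pairs = list(itertools.combinations(leaf_ids, 2))
    num_pairs = min(max_num_checks, len(all_pairs))
    # random.sample draws without replacement, so no pair repeats.
    return rng.sample(all_pairs, num_pairs)


pairs = sample_leaf_pairs([3, 4, 5, 6], max_num_checks=4, seed=1)
print(len(pairs))  # 4
```

When fewer pairs exist than max_num_checks, the sketch simply returns all of them in shuffled order.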
Returns
- int
Number of leaf-pair violations detected. Returns early (possibly before all max_num_checks pairs have been checked) once max_violations is exceeded.
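The early-return behavior can be sketched with a plain counting loop. This is an assumption-laden illustration, not the actual source: `count_violations` and the `check_pair` callback are hypothetical stand-ins for the real pair comparison.

```python
import typing


def count_violations(
    pairs: typing.Iterable[typing.Tuple[int, int]],
    check_pair: typing.Callable[[int, int], bool],
    max_violations: int = 0,
    progress_wrap: typing.Callable = lambda x: x,
) -> int:
    """Count violating pairs, stopping once max_violations is exceeded."""
    num_violations = 0
    for a, b in progress_wrap(pairs):
        if check_pair(a, b):  # True means this pair is a violation
            num_violations += 1
            if num_violations > max_violations:
                break  # threshold exceeded; skip the remaining pairs
    return num_violations


pairs = [(0, 1), (0, 2), (1, 2), (2, 3)]
result = count_violations(pairs, lambda a, b: True, max_violations=1)
print(result)  # 2
print("FAIL" if result > 1 else "PASS")  # FAIL
```

Because the loop stops as soon as the threshold is exceeded, the returned count is a lower bound on the total number of violations among the sampled pairs, and a caller would compare it against max_violations to decide pass or fail.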
Raises
- ValueError
If any required column is missing, ids are not contiguous, or data is not topologically sorted.
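The conditions that raise ValueError might look roughly like the sketch below. It works on plain Python sequences rather than a polars DataFrame to stay self-contained; `check_structure` is a hypothetical helper, and the topological-order test assumes that root taxa list their own id as ancestor_id.

```python
import typing

# Column names copied from the required schema listed above.
REQUIRED_COLUMNS = (
    "id", "ancestor_id", "dstream_rank", "data_hex", "dstream_algo",
    "dstream_storage_bitoffset", "dstream_storage_bitwidth",
    "dstream_T_bitoffset", "dstream_T_bitwidth", "dstream_S",
)


def check_structure(
    columns: typing.Iterable[str],
    ids: typing.Sequence[int],
    ancestor_ids: typing.Sequence[int],
) -> None:
    """Raise ValueError on missing columns, gaps in ids, or bad ordering."""
    present = set(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in present]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if list(ids) != list(range(len(ids))):
        raise ValueError("ids are not contiguous")
    # With contiguous ids, topological order means no ancestor id exceeds
    # its descendant's id (assumption: roots are their own ancestor).
    for id_, ancestor_id in zip(ids, ancestor_ids):
        if ancestor_id > id_:
            raise ValueError("data is not topologically sorted")


check_structure(REQUIRED_COLUMNS, [0, 1, 2], [0, 0, 1])  # passes silently
```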
See Also
- surface_unpack_reconstruct :
Produces trie reconstruction data to be validated here.