dataframe

End-to-end dataframe-based operations.

Modules

Functions

`surface_postprocess_trie`(df, *[, ...])	Postprocess raw phylogenetic tree reconstruction output data to create finalized estimate of phylogenetic history.
`surface_test_drive`(df, *, dstream_algo, ...)	Reads alife standard phylogeny dataframe to create a population of hstrat surface annotations corresponding to the phylogeny tips, "as-if" they had evolved according to the provided phylogeny history.
`surface_validate_trie`(df[, max_num_checks, ...])	Validate trie reconstruction output data.

surface_postprocess_trie(df: ~polars.dataframe.frame.DataFrame, *, drop_dstream_metadata: bool | None = None, trie_postprocessor: ~typing.Callable = <hstrat.phylogenetic_inference.tree.trie_postprocess._NopTriePostprocessor.NopTriePostprocessor object>, delete_trunk: bool = True) → DataFrame

Postprocess raw phylogenetic tree reconstruction output data to create finalized estimate of phylogenetic history.

Performs the following operations:

- Delete trunk nodes with rank less than dstream_S.
- Collapse unifurcations.
- Assign contiguous IDs to nodes.
- Apply supplied trie_postprocessor functor.
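To make the unifurcation-collapse and ID-reassignment steps concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the list-of-dicts representation and the helper name `collapse_unifurcations` are assumptions made for illustration, and roots are assumed to carry their own id as ancestor_id (per the alife standard convention noted elsewhere on this page).

```python
def collapse_unifurcations(rows):
    """Drop non-root nodes with exactly one child, then renumber ids 0..n-1.

    rows: list of dicts with 'id' and 'ancestor_id'; a root has
    ancestor_id == id. (Hypothetical helper, for illustration only.)
    """
    parent = {r["id"]: r["ancestor_id"] for r in rows}

    # Count children of each node, ignoring a root's self-reference.
    child_count = {}
    for node, anc in parent.items():
        if anc != node:
            child_count[anc] = child_count.get(anc, 0) + 1

    def is_unifurcation(node):
        return parent[node] != node and child_count.get(node, 0) == 1

    def resolve(node):
        # Walk upward past any unifurcation nodes being removed.
        anc = parent[node]
        while is_unifurcation(anc):
            anc = parent[anc]
        return anc

    kept = [n for n in parent if not is_unifurcation(n)]
    new_id = {old: new for new, old in enumerate(kept)}  # contiguous ids
    return [
        {"id": new_id[n], "ancestor_id": new_id[resolve(n)]} for n in kept
    ]


rows = [
    {"id": 0, "ancestor_id": 0},  # root
    {"id": 1, "ancestor_id": 0},  # unifurcation: only child is node 2
    {"id": 2, "ancestor_id": 1},
    {"id": 3, "ancestor_id": 2},
    {"id": 4, "ancestor_id": 2},
    {"id": 5, "ancestor_id": 0},
]
collapsed = collapse_unifurcations(rows)
```

In this toy tree, node 1 is removed, its child is reattached to the root, and the surviving five nodes are renumbered 0 through 4 in their original order.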

Parameters

df : pl.DataFrame

The input DataFrame containing packed data with required columns, one row per genome.

Required schema:

‘id’ : pl.UInt64
- Unique identifier for each taxon (RE alife standard format).
‘ancestor_id’ : pl.UInt64
- Unique identifier for ancestor taxon (RE alife standard format).
‘dstream_rank’ : pl.UInt64
- Num generations elapsed for ancestral differentia.
- Corresponds to dstream_Tbar for inner nodes.
- Corresponds to dstream_T - 1 for leaf nodes.
‘hstrat_differentia_bitwidth’ : pl.UInt32
- Size of annotation differentiae, in bits.
- Corresponds to dstream_value_bitwidth.
‘dstream_S’ : pl.UInt32
- Capacity of dstream buffer used for hstrat surface, in number of data items (i.e., differentia values).

Optional schema:

‘dstream_data_id’ : pl.UInt64
- Unique identifier for each genome in source genome dataframe.

delete_trunk : bool, default True

Should trunk nodes with rank less than dstream_S be deleted?

Trunk deletion accounts for “dummy” strata added to fill hstrat surface for founding ancestor(s), by segregating subtrees with distinct founding strata into independent trees.

trie_postprocessor : Callable, default hstrat.NopTriePostprocessor()

Tree postprocess functor.

Must accept the parameters trie (type pandas.DataFrame), p_differentia_collision (type float), mutate (type bool), and progress_wrap (type Callable). Must return postprocessed trie (type pl.DataFrame).

To apply multiple postprocessors, use hstrat.CompoundTriePostprocessor.

Returns

pl.DataFrame

The output DataFrame containing the estimated phylogenetic tree in alife standard format, with the following columns:

Required schema:

‘id’ : pl.UInt64
- Unique identifier for each taxon (RE alife standard format).
‘ancestor_id’ : pl.UInt64
- Unique identifier for ancestor taxon (RE alife standard format).
‘hstrat_rank’ : pl.Int64
- Num generations elapsed for ancestral differentia.
- Corresponds to dstream_Tbar - dstream_S for inner nodes.
- Corresponds to dstream_T - 1 - dstream_S for leaf nodes.

Optional schema:

‘origin_time’ : pl.Int64
- Estimated origin time for phylogeny nodes, in generations elapsed since founding ancestor.
- Value depends on the trie postprocessor used.

Additional user-defined columns will be forwarded from the input DataFrame. Any columns created by the trie postprocessor will also be included.

Note that the alife-standard ancestor_list column is not included in the output.

Notes

Collapsing trunk nodes with rank less than dstream_S assumes that S “dummy” strata were added to fill hstrat surface for founding ancestor(s).

Currently, data is converted to Pandas for processing, then back to Polars.
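The relationship between the input dstream_rank and the output hstrat_rank, together with the trunk-deletion threshold, can be sketched with plain arithmetic. This is an illustrative sketch only: the helper name and the list-of-dicts layout are assumptions, and the re-rooting of subtrees that accompanies trunk deletion is deliberately omitted.

```python
def drop_trunk_and_offset(nodes, dstream_S):
    """Keep nodes with dstream_rank >= dstream_S and shift ranks by dstream_S.

    Hypothetical helper: the first dstream_S ranks are treated as "dummy"
    strata that filled the founding ancestor's surface.
    """
    return [
        {"id": n["id"], "hstrat_rank": n["dstream_rank"] - dstream_S}
        for n in nodes
        if n["dstream_rank"] >= dstream_S  # rank < S counts as trunk
    ]


nodes = [
    {"id": 0, "dstream_rank": 0},
    {"id": 1, "dstream_rank": 5},
    {"id": 2, "dstream_rank": 8},
    {"id": 3, "dstream_rank": 12},
    {"id": 4, "dstream_rank": 20},
]
finalized = drop_trunk_and_offset(nodes, dstream_S=8)
```

With dstream_S = 8, the two nodes at ranks 0 and 5 are dropped as trunk, and the remaining ranks 8, 12, and 20 become hstrat_rank values 0, 4, and 12.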

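surface_test_drive(df, *, dstream_algo, ...)

Reads alife standard phylogeny dataframe to create a population of hstrat surface annotations corresponding to the phylogeny tips, "as-if" they had evolved according to the provided phylogeny history.

Parameters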

df : pl.DataFrame

The input DataFrame containing alife standard phylogeny with required columns, one row per taxon.

Note that the alife-standard ancestor_list column is not required.

Required schema:

‘id’ : pl.UInt64
- Taxon identifier.
‘ancestor_id’ : pl.UInt64
- Taxon identifier of ancestor.
- Own ‘id’ if root.

Optional schema:

‘origin_time’ : pl.UInt64
- Number of generations elapsed from ancestor.
- Determines branch lengths.
- If absent, all branches are assumed to have length 1.
‘extant’ : pl.Boolean
- Should an entry corresponding to this phylogeny taxon be included in the output population?
- If absent, all tips are considered extant and all inner nodes are not.

Additional user-defined columns will be forwarded to the output DataFrame.
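The defaults described for the optional columns can be illustrated on a toy phylogeny. The sketch below is a pure-Python approximation under stated assumptions: rows are topologically sorted, roots carry their own id as ancestor_id, and the helper name and the "depth" field are hypothetical. Under the unit-branch-length default, a taxon's depth below the root stands in for the number of generations elapsed.

```python
def apply_defaults(rows):
    """Annotate each taxon with the default 'extant' flag and its depth.

    rows: topologically sorted dicts with 'id' and 'ancestor_id'.
    (Hypothetical helper; not part of the documented API.)
    """
    # Any id that appears as a non-self ancestor is an inner node.
    has_child = {r["ancestor_id"] for r in rows if r["ancestor_id"] != r["id"]}

    depth = {}
    for r in rows:
        is_root = r["ancestor_id"] == r["id"]
        # Each branch counts as one generation when origin_time is absent.
        depth[r["id"]] = 0 if is_root else depth[r["ancestor_id"]] + 1

    return [
        {**r, "extant": r["id"] not in has_child, "depth": depth[r["id"]]}
        for r in rows
    ]


phylogeny = [
    {"id": 0, "ancestor_id": 0},  # root
    {"id": 1, "ancestor_id": 0},
    {"id": 2, "ancestor_id": 0},
    {"id": 3, "ancestor_id": 1},
]
annotated = apply_defaults(phylogeny)
```

Here taxa 2 and 3 are tips and would be treated as extant, while taxa 0 and 1 are inner nodes and would not.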

dstream_algo : str

Name of downstream curation algorithm to use.

dstream_S : int

Capacity of annotation dstream buffer, in number of data items.

progress_wrap : Callable, optional

Pass tqdm or equivalent to display a progress bar.

stratum_differentia_bit_width : int

The bit width of the generated differentia.

Returns

pl.DataFrame

The output DataFrame containing generated hstrat surface annotations.

Required schema:

‘data_hex’ : pl.String
- Raw genome data, with serialized dstream buffer and counter.
- Represented as a hexadecimal string.
‘downstream_version’ : pl.Categorical
- Version of downstream library used.
‘dstream_algo’ : pl.Categorical
- Name of downstream curation algorithm used.
- e.g., ‘dstream.steady_algo’
‘dstream_storage_bitoffset’ : pl.UInt32
- Position of dstream buffer field in ‘data_hex’.
‘dstream_storage_bitwidth’ : pl.UInt32
- Size of dstream buffer field in ‘data_hex’.
‘dstream_T_bitoffset’ : pl.UInt32
- Position of dstream counter field in ‘data_hex’.
‘dstream_T_bitwidth’ : pl.UInt32
- Size of dstream counter field in ‘data_hex’.
‘dstream_S’ : pl.UInt32
- Capacity of dstream buffer, in number of data items.
‘origin_time’ : pl.UInt64
- Number of generations elapsed since the founding ancestor.
‘td_source_id’ : pl.UInt64
- Corresponding taxon identifier in source phylogeny.

Additional user-defined columns will be forwarded from the input DataFrame.
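To show how a bit offset and bit width locate a field inside a hexadecimal string such as ‘data_hex’, here is a minimal sketch. The helper name, the example record, and the convention that offsets count from the most significant (leftmost) bit are all assumptions for illustration; consult the downstream library for the authoritative layout and byte order.

```python
def extract_bits(data_hex, bitoffset, bitwidth):
    """Return the unsigned integer stored in a bit field of a hex string.

    ASSUMPTION: bitoffset is measured from the leftmost bit of data_hex.
    """
    total_bits = len(data_hex) * 4  # each hex digit encodes 4 bits
    value = int(data_hex, 16)
    shift = total_bits - bitoffset - bitwidth
    return (value >> shift) & ((1 << bitwidth) - 1)


# Hypothetical record: a 32-bit counter followed by a 32-bit buffer.
data_hex = "0000002a" + "deadbeef"
counter = extract_bits(data_hex, bitoffset=0, bitwidth=32)
buffer_bits = extract_bits(data_hex, bitoffset=32, bitwidth=32)
```

In this made-up record the counter field decodes to 42 and the buffer field to 0xdeadbeef.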

Notes

Input columns “id”, “ancestor_id”, and “ancestor_list” are not forwarded to output, to avoid conflicts with the output schema for subsequent phylogeny reconstruction.

surface_validate_trie(df: ~polars.dataframe.frame.DataFrame, max_num_checks: int = 1000, max_violations: int = 0, progress_wrap: ~typing.Callable = <function <lambda>>, seed: int | None = None) → int

Validate trie reconstruction output data.

Performs structural checks and pairwise leaf-node validation to confirm that reconstructed trie correctly reflects common differentia among source hereditary stratigraphic surfaces.

Checks performed:

- Required dstream/downstream columns for surface deserialization from data_hex are present.
- The id and ancestor_id columns are present.
- Taxon ids are contiguous (i.e., match row indices 0, 1, …, n-1).
- Data is topologically sorted (each ancestor appears before all its descendants).
- Samples random leaf-node pairs and compares each pair’s first retained disparity rank (computed from deserialized surfaces) to the MRCA node’s dstream_rank - dstream_S in the trie (converting from raw dstream T space to hstrat rank space). A violation occurs when first_disparity_rank < mrca_rank: the surfaces prove divergence earlier than the trie records.
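The structural checks and the violation criterion listed above can be sketched in a few lines of plain Python. This is an approximation, not the library's code: the function names are hypothetical, and the topological-sort test relies on ids already matching row indices, so that an ancestor precedes its descendant exactly when ancestor_id <= id.

```python
def check_structure(ids, ancestor_ids):
    """Raise ValueError if ids are not 0..n-1 or rows are not sorted."""
    if list(ids) != list(range(len(ids))):
        raise ValueError("taxon ids are not contiguous")
    for id_, anc in zip(ids, ancestor_ids):
        # With contiguous ids, an ancestor listed later has a larger id.
        if anc > id_:
            raise ValueError("data is not topologically sorted")


def is_violation(first_disparity_rank, mrca_dstream_rank, dstream_S):
    """True if surfaces diverge earlier than the trie's MRCA records."""
    mrca_rank = mrca_dstream_rank - dstream_S  # dstream T space -> hstrat rank
    return first_disparity_rank < mrca_rank


check_structure([0, 1, 2], [0, 0, 1])  # passes silently

try:
    check_structure([0, 2, 1], [0, 0, 0])
    caught = False
except ValueError:
    caught = True
```

For example, with dstream_S = 8 and an MRCA dstream_rank of 12, the MRCA sits at hstrat rank 4, so a first retained disparity rank of 3 counts as a violation while a rank of 4 does not.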

Parameters

df : pl.DataFrame

Trie reconstruction output, as produced by surface_unpack_reconstruct with --no-drop-dstream-metadata.

Required schema:

‘id’ : integer
- Unique identifier for each taxon (RE alife standard).
‘ancestor_id’ : integer
- Unique identifier for ancestor taxon (RE alife standard).
‘dstream_rank’ : integer
- Rank stored at this node (generation count).
‘data_hex’ : string
- Raw genome data as a hexadecimal string.
‘dstream_algo’ : string or categorical
- Name of downstream curation algorithm (e.g., 'dstream.steady_algo').
‘dstream_storage_bitoffset’ : integer
- Bit offset of the dstream buffer field in data_hex.
‘dstream_storage_bitwidth’ : integer
- Bit width of the dstream buffer field in data_hex.
‘dstream_T_bitoffset’ : integer
- Bit offset of the dstream counter (“rank”) field in data_hex.
‘dstream_T_bitwidth’ : integer
- Bit width of the dstream counter field in data_hex.
‘dstream_S’ : integer
- Capacity of the dstream buffer (number of differentia stored per annotation).

max_num_checks : int, default 1_000

Maximum number of leaf-pair comparisons to perform. Pairs are sampled randomly without replacement from all possible pairs.

max_violations : int, default 0

Maximum number of MRCA-rank violations tolerated before returning early. Callers should treat a return value exceeding this threshold as a validation failure.

progress_wrap : callable, optional

Wrapper applied to the pair-check iterator, e.g., tqdm.tqdm for a progress bar. Must accept and return an iterable. Default is the identity function (no wrapping).

seed : int, default None

Random seed used when sampling leaf pairs.
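Seeded sampling of leaf pairs without replacement, as described for max_num_checks and seed, might look like the following sketch. The helper name is hypothetical, and materializing every pair is a simplification that only suits small populations; the library may sample differently.

```python
import itertools
import random


def sample_leaf_pairs(leaf_ids, max_num_checks=1_000, seed=None):
    """Draw up to max_num_checks distinct unordered leaf pairs."""
    rng = random.Random(seed)  # a fixed seed gives a reproducible draw
    all_pairs = list(itertools.combinations(leaf_ids, 2))
    k = min(max_num_checks, len(all_pairs))
    return rng.sample(all_pairs, k)  # sampling without replacement


pairs_a = sample_leaf_pairs([3, 4, 5, 6, 7], max_num_checks=4, seed=1)
pairs_b = sample_leaf_pairs([3, 4, 5, 6, 7], max_num_checks=4, seed=1)
all_of_them = sample_leaf_pairs([3, 4, 5], max_num_checks=100, seed=1)
```

When fewer pairs exist than max_num_checks, every pair is returned once; with the same seed, two calls return the same pairs in the same order.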

Returns

int: Number of leaf-pair violations detected. Returns early (possibly before all max_num_checks pairs have been checked) once max_violations is exceeded.
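The early-return behavior of the violation count can be sketched as a simple loop. This is an illustration under assumptions, not the library's code: the helper name is hypothetical, and each pair check is reduced to a precomputed boolean.

```python
def count_violations(check_results, max_violations=0):
    """Count True entries, stopping once the count exceeds max_violations."""
    num_violations = 0
    for violated in check_results:
        if violated:
            num_violations += 1
        if num_violations > max_violations:
            break  # threshold exceeded; remaining pairs are not checked
    return num_violations


results = [False, True, True, True]
strict = count_violations(results, max_violations=0)
lenient = count_violations(results, max_violations=1)
clean = count_violations([False, False], max_violations=0)
```

A caller would then treat a result greater than max_violations as a validation failure; here the strict run stops at the first violation and the lenient run stops at the second.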

Raises

ValueError: If any required column is missing, ids are not contiguous, or data is not topologically sorted.
