dataframe
End-to-end dataframe-based operations.
Modules
Functions
surface_postprocess_trie: Postprocess raw phylogenetic tree reconstruction output data to create finalized estimate of phylogenetic history.
surface_test_drive: Reads alife standard phylogeny dataframe to create a population of hstrat surface annotations corresponding to the phylogeny tips, "as-if" they had evolved according to the provided phylogeny history.
surface_validate_trie: Validate trie reconstruction output data.
- surface_postprocess_trie(df: ~polars.dataframe.frame.DataFrame, *, drop_dstream_metadata: bool | None = None, trie_postprocessor: ~typing.Callable = <hstrat.phylogenetic_inference.tree.trie_postprocess._NopTriePostprocessor.NopTriePostprocessor object>, delete_trunk: bool = True) DataFrame
Postprocess raw phylogenetic tree reconstruction output data to create finalized estimate of phylogenetic history.
Performs the following operations:
- Delete trunk nodes with rank less than dstream_S.
- Collapse unifurcations.
- Assign contiguous IDs to nodes.
- Apply supplied trie_postprocessor functor.
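To make the "collapse unifurcations" and "assign contiguous IDs" steps concrete, here is a minimal pure-Python sketch on a list of records. It is not the library's implementation: the helper name `collapse_unifurcations` is hypothetical, and the choice to keep roots that have a single child is an assumption of this sketch.

```python
def collapse_unifurcations(records):
    """Drop non-root nodes with exactly one child, then renumber ids 0..n-1.

    `records` is a list of dicts with "id" and "ancestor_id" keys; a root
    lists its own id as ancestor (alife standard convention).
    Hypothetical helper for illustration only.
    """
    parent = {r["id"]: r["ancestor_id"] for r in records}
    child_count = {i: 0 for i in parent}
    for i, p in parent.items():
        if p != i:  # a root's self-reference is not a child link
            child_count[p] += 1

    def is_unifurcation(i):
        # assumption of this sketch: roots are always kept
        return child_count[i] == 1 and parent[i] != i

    def surviving_ancestor(i):
        # walk rootward past every node that will be dropped
        p = parent[i]
        while is_unifurcation(p):
            p = parent[p]
        return p

    kept = [r for r in records if not is_unifurcation(r["id"])]
    new_id = {r["id"]: n for n, r in enumerate(kept)}  # contiguous ids
    return [
        {
            **r,
            "id": new_id[r["id"]],
            "ancestor_id": new_id[surviving_ancestor(r["id"])],
        }
        for r in kept
    ]


# chain 0 -> 1 -> 2, where node 2 has two children (3 and 4)
example = [
    {"id": 0, "ancestor_id": 0},
    {"id": 1, "ancestor_id": 0},
    {"id": 2, "ancestor_id": 1},
    {"id": 3, "ancestor_id": 2},
    {"id": 4, "ancestor_id": 2},
]
collapsed = collapse_unifurcations(example)
```

In this example node 1 is the only unifurcation, so it is removed and the remaining four nodes are renumbered 0 through 3 in their original order.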
Parameters
- df : pl.DataFrame
The input DataFrame containing packed data with required columns, one row per genome.
- Required schema:
- ‘id’ : pl.UInt64
Unique identifier for each taxon (RE alife standard format).
- ‘ancestor_id’ : pl.UInt64
Unique identifier for ancestor taxon (RE alife standard format).
- ‘dstream_rank’ : pl.UInt64
Number of generations elapsed for ancestral differentia.
Corresponds to dstream_Tbar for inner nodes.
Corresponds to dstream_T - 1 for leaf nodes.
- ‘hstrat_differentia_bitwidth’ : pl.UInt32
Size of annotation differentiae, in bits.
Corresponds to dstream_value_bitwidth.
- ‘dstream_S’ : pl.UInt32
Capacity of dstream buffer used for hstrat surface, in number of data items (i.e., differentia values).
- Optional schema:
- ‘dstream_data_id’ : pl.UInt64
Unique identifier for each genome in source genome dataframe.
- delete_trunk : bool, default True
Should trunk nodes with rank less than dstream_S be deleted?
Trunk deletion accounts for “dummy” strata added to fill hstrat surface for founding ancestor(s), by segregating subtrees with distinct founding strata into independent trees.
- trie_postprocessor : Callable, default hstrat.NopTriePostprocessor()
Tree postprocess functor.
Must accept the parameters trie (type pandas.DataFrame), p_differentia_collision (type float), mutate (type bool), and progress_wrap (type Callable). Must return postprocessed trie (type pl.DataFrame).
To apply multiple postprocessors, use hstrat.CompoundTriePostprocessor.
Returns
- pl.DataFrame
The output DataFrame containing the estimated phylogenetic tree in alife standard format, with the following columns:
- Required schema:
- ‘id’ : pl.UInt64
Unique identifier for each taxon (RE alife standard format).
- ‘ancestor_id’ : pl.UInt64
Unique identifier for ancestor taxon (RE alife standard format).
- ‘hstrat_rank’ : pl.Int64
Num generations elapsed for ancestral differentia.
Corresponds to dstream_Tbar - dstream_S for inner nodes.
Corresponds to dstream_T - 1 - dstream_S for leaf nodes.
- Optional schema:
- ‘origin_time’ : pl.Int64
Estimated origin time for phylogeny nodes, in generations elapsed since founding ancestor.
Value depends on the trie postprocessor used.
Additional user-defined columns will be forwarded from the input DataFrame. Any columns created by the trie postprocessor will also be included.
Note that the alife-standard ancestor_list column is not included in the output.
Notes
Collapsing trunk nodes with rank less than dstream_S assumes that S “dummy” strata were added to fill hstrat surface for founding ancestor(s).
Currently, data is converted to Pandas for processing, then back to Polars.
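The trunk deletion and the rank shift from dstream_rank to hstrat_rank can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: `delete_trunk_sketch` is a hypothetical name, and the sketch assumes ranks never decrease from ancestor to descendant, so that any surviving node whose ancestor was deleted becomes the root of its own tree.

```python
def delete_trunk_sketch(records, dstream_S):
    """Drop nodes with dstream_rank < dstream_S and shift remaining ranks.

    Hypothetical helper; `records` is a list of dicts with "id",
    "ancestor_id", and "dstream_rank" keys.
    """
    trunk = {r["id"] for r in records if r["dstream_rank"] < dstream_S}
    out = []
    for r in records:
        if r["id"] in trunk:
            continue  # trunk nodes stand in for the "dummy" founding strata
        # a survivor whose ancestor was trunk becomes an independent root
        is_new_root = r["ancestor_id"] in trunk
        out.append(
            {
                "id": r["id"],
                "ancestor_id": r["id"] if is_new_root else r["ancestor_id"],
                # hstrat rank space excludes the S dummy strata
                "hstrat_rank": r["dstream_rank"] - dstream_S,
            }
        )
    return out


raw = [
    {"id": 0, "ancestor_id": 0, "dstream_rank": 0},
    {"id": 1, "ancestor_id": 0, "dstream_rank": 3},
    {"id": 2, "ancestor_id": 1, "dstream_rank": 4},
    {"id": 3, "ancestor_id": 1, "dstream_rank": 4},
    {"id": 4, "ancestor_id": 2, "dstream_rank": 9},
]
trimmed = delete_trunk_sketch(raw, dstream_S=4)
```

With dstream_S = 4, nodes 0 and 1 fall in the trunk, so nodes 2 and 3 each become roots of separate trees. The ids are left non-contiguous here; renumbering is a separate step.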
See Also
- surface_unpack_reconstruct :
Creates raw reconstruction data postprocessed here.
- alifestd_try_add_ancestor_list_col :
Adds alife-standard ancestor_list column to phylogeny data.
- surface_test_drive(df: ~polars.lazyframe.frame.LazyFrame, *, dstream_algo: str, dstream_S: int, dstream_T_bitwidth: int = 32, progress_wrap: ~typing.Callable = <function <lambda>>, stratum_differentia_bit_width: int) DataFrame
Reads alife standard phylogeny dataframe to create a population of hstrat surface annotations corresponding to the phylogeny tips, “as-if” they had evolved according to the provided phylogeny history.
Parameters
- df : pl.DataFrame
The input DataFrame containing alife standard phylogeny with required columns, one row per taxon.
Note that the alife-standard ancestor_list column is not required.
- Required schema:
- ‘id’ : pl.UInt64
Taxon identifier.
- ‘ancestor_id’ : pl.UInt64
Taxon identifier of ancestor.
Own ‘id’ if root.
- Optional schema:
- ‘origin_time’ : pl.UInt64
Number of generations elapsed from ancestor.
Determines branch lengths.
Otherwise, all branches are assumed to be length 1.
- ‘extant’ : pl.Boolean
Should an entry corresponding to this phylogeny taxon be included in the output population?
Otherwise, all tips are considered extant and all inner nodes are not.
Additional user-defined columns will be forwarded to the output DataFrame.
- dstream_algo : str
Name of downstream curation algorithm to use.
- dstream_S : int
Capacity of annotation dstream buffer, in number of data items.
- progress_wrap : Callable, optional
Pass tqdm or equivalent to display a progress bar.
- stratum_differentia_bit_width : int
The bit width of the generated differentia.
Returns
- pl.DataFrame
The output DataFrame containing generated hstrat surface annotations.
- Required schema:
- ‘data_hex’ : pl.String
Raw genome data, with serialized dstream buffer and counter.
Represented as a hexadecimal string.
- ‘downstream_version’ : pl.Categorical
Version of downstream library used.
- ‘dstream_algo’ : pl.Categorical
Name of downstream curation algorithm used.
e.g., ‘dstream.steady_algo’
- ‘dstream_storage_bitoffset’ : pl.UInt32
Position of dstream buffer field in ‘data_hex’.
- ‘dstream_storage_bitwidth’ : pl.UInt32
Size of dstream buffer field in ‘data_hex’.
- ‘dstream_T_bitoffset’ : pl.UInt32
Position of dstream counter field in ‘data_hex’.
- ‘dstream_T_bitwidth’ : pl.UInt32
Size of dstream counter field in ‘data_hex’.
- ‘dstream_S’ : pl.UInt32
Capacity of dstream buffer, in number of data items.
- ‘origin_time’ : pl.UInt64
Number of generations elapsed since the founding ancestor.
- ‘td_source_id’ : pl.UInt64
Corresponding taxon identifier in source phylogeny.
Additional user-defined columns will be forwarded from the input DataFrame.
Notes
Input columns “id”, “ancestor_id”, and “ancestor_list” are not forwarded to output, to avoid conflicts with the output schema for subsequent phylogeny reconstruction.
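The fallback behavior described for the optional ‘origin_time’ and ‘extant’ input columns can be illustrated with a small pure-Python sketch. The helper `fill_test_drive_defaults` is hypothetical and not part of the library; it also assumes that roots sit at origin time 0, which the text above does not state.

```python
def fill_test_drive_defaults(records):
    """Fill 'origin_time' and 'extant' where absent (hypothetical helper).

    `records` is a list of dicts with "id" and "ancestor_id" keys; a root
    lists its own id as ancestor.
    """
    by_id = {r["id"]: r for r in records}
    # any id named as ancestor by a different taxon is an inner node
    inner = {r["ancestor_id"] for r in records if r["ancestor_id"] != r["id"]}

    def depth(i):
        # count unit-length branches back to the root
        steps = 0
        while by_id[i]["ancestor_id"] != i:
            i = by_id[i]["ancestor_id"]
            steps += 1
        return steps

    return [
        {
            **r,
            # assumption: root origin time is 0, every branch has length 1
            "origin_time": r.get("origin_time", depth(r["id"])),
            # tips are extant by default, inner nodes are not
            "extant": r.get("extant", r["id"] not in inner),
        }
        for r in records
    ]


phylogeny = [
    {"id": 0, "ancestor_id": 0},
    {"id": 1, "ancestor_id": 0},
    {"id": 2, "ancestor_id": 1},
    {"id": 3, "ancestor_id": 1},
]
filled = fill_test_drive_defaults(phylogeny)
```

Here taxa 2 and 3 are the tips, so only they would be marked extant and contribute annotations to the output population.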
- surface_validate_trie(df: ~polars.dataframe.frame.DataFrame, max_num_checks: int = 1000, max_violations: int = 0, progress_wrap: ~typing.Callable = <function <lambda>>, seed: int | None = None) int
Validate trie reconstruction output data.
Performs structural checks and pairwise leaf-node validation to confirm that reconstructed trie correctly reflects common differentia among source hereditary stratigraphic surfaces.
Checks performed:
- Required dstream/downstream columns for surface deserialization from data_hex are present.
- The id and ancestor_id columns are present.
- Taxon ids are contiguous (i.e., match row indices 0, 1, …, n-1).
- Data is topologically sorted (each ancestor appears before all its descendants).
- Samples random leaf-node pairs and compares each pair’s first retained disparity rank (computed from deserialized surfaces) to the MRCA node’s dstream_rank - dstream_S in the trie (converting from raw dstream T space to hstrat rank space). A violation occurs when first_disparity_rank < mrca_rank: the surfaces prove divergence earlier than the trie records.
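The two structural checks on ids are simple enough to sketch directly. The function below is a hypothetical illustration, not the library's code. It relies on one observation: once ids equal row indices, the data is topologically sorted exactly when no row's ancestor_id exceeds its own id (a root lists itself).

```python
def check_structure(ids, ancestor_ids):
    """Raise ValueError unless ids are contiguous and topologically sorted.

    Hypothetical helper; `ids` and `ancestor_ids` are equal-length sequences
    of ints in row order.
    """
    if list(ids) != list(range(len(ids))):
        raise ValueError("taxon ids are not contiguous")
    for row, ancestor in enumerate(ancestor_ids):
        # with id == row index, an ancestor on a later row breaks the sort
        if ancestor > row:
            raise ValueError("data is not topologically sorted")


check_structure([0, 1, 2, 3], [0, 0, 1, 1])  # passes silently
```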
Parameters
- df : pl.DataFrame
Trie reconstruction output, as produced by surface_unpack_reconstruct with --no-drop-dstream-metadata.
- Required schema:
- ‘id’ : integer
Unique identifier for each taxon (RE alife standard).
- ‘ancestor_id’ : integer
Unique identifier for ancestor taxon (RE alife standard).
- ‘dstream_rank’ : integer
Rank stored at this node (generation count).
- ‘data_hex’ : string
Raw genome data as a hexadecimal string.
- ‘dstream_algo’ : string or categorical
Name of downstream curation algorithm (e.g., 'dstream.steady_algo').
- ‘dstream_storage_bitoffset’ : integer
Bit offset of the dstream buffer field in data_hex.
- ‘dstream_storage_bitwidth’ : integer
Bit width of the dstream buffer field in data_hex.
- ‘dstream_T_bitoffset’ : integer
Bit offset of the dstream counter (“rank”) field in data_hex.
- ‘dstream_T_bitwidth’ : integer
Bit width of the dstream counter field in data_hex.
- ‘dstream_S’ : integer
Capacity of the dstream buffer (number of differentia stored per annotation).
- max_num_checks : int (default 1_000)
Maximum number of leaf-pair comparisons to perform. Pairs are sampled randomly without replacement from all possible pairs.
- max_violations : int (default 0)
Maximum number of MRCA-rank violations tolerated before returning early. Callers should treat a return value exceeding this threshold as a validation failure.
- progress_wrap : callable, optional
Wrapper applied to the pair-check iterator, e.g., tqdm.tqdm for a progress bar. Must accept and return an iterable. Default is the identity function (no wrapping).
- seed : int, default None
Random seed used when sampling leaf pairs.
Returns
- int
Number of leaf-pair violations detected. Returns early (possibly before all max_num_checks pairs have been checked) once max_violations is exceeded.
Raises
- ValueError
If any required column is missing, ids are not contiguous, or data is not topologically sorted.
See Also
- surface_unpack_reconstruct :
Produces trie reconstruction data to be validated here.
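The pairwise leaf check and its early-return behavior can be sketched as follows. This is a simplified illustration rather than the library's implementation. The names `find_mrca` and `count_violations` are hypothetical, `first_disparity_rank` is a hypothetical callback standing in for the comparison of deserialized surfaces, and `hstrat_ranks` is assumed to already hold dstream_rank - dstream_S per node. Enumerating every pair up front is only practical for small inputs.

```python
import itertools
import random


def find_mrca(a, b, ancestor_ids):
    """Return the MRCA of nodes a and b, or None if they share no root.

    Assumes contiguous, topologically sorted ids, so stepping the larger
    id rootward always makes progress.
    """
    while a != b:
        hi = max(a, b)
        if ancestor_ids[hi] == hi:
            return None  # reached a root without meeting the other lineage
        a, b = min(a, b), ancestor_ids[hi]
    return a


def count_violations(
    ancestor_ids,
    hstrat_ranks,
    first_disparity_rank,  # hypothetical callback: (leaf_a, leaf_b) -> int
    *,
    max_num_checks=1000,
    max_violations=0,
    seed=None,
):
    """Count sampled leaf pairs whose surfaces diverge before their MRCA."""
    inner = {a for i, a in enumerate(ancestor_ids) if a != i}
    leaves = [i for i in range(len(ancestor_ids)) if i not in inner]
    pairs = list(itertools.combinations(leaves, 2))
    rng = random.Random(seed)
    # sample distinct pairs, i.e., without replacement
    sampled = rng.sample(pairs, min(max_num_checks, len(pairs)))
    violations = 0
    for a, b in sampled:
        mrca = find_mrca(a, b, ancestor_ids)
        if mrca is None:
            continue
        if first_disparity_rank(a, b) < hstrat_ranks[mrca]:
            violations += 1
            if violations > max_violations:
                break  # early return once the tolerance is exceeded
    return violations


ancestor_ids = [0, 0, 0, 1, 1, 2]  # leaves are nodes 3, 4, and 5
hstrat_ranks = [0, 5, 3, 8, 8, 6]
num_bad = count_violations(ancestor_ids, hstrat_ranks, lambda a, b: 4, seed=1)
```

In this toy input only the pair (3, 4) has an MRCA rank (5) above the reported disparity rank (4), so a single violation is counted. A caller would compare the returned count against `max_violations` to decide pass or fail.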