{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Quickstart\n", "\n", "The purpose of *hstrat* is to enable approximate inference of the phylogenetic history of a distributed digital population solely through analysis of `HereditaryStratigraphicColumn` genome annotations.\n", "Each annotation object attaches to a particular genome, and is inherited by all offspring genomes.\n", "(In most use cases, each genome within a simulation will have exactly one annotation.)\n", "\n", "The hereditary stratigraphy methodology implemented by `HereditaryStratigraphicColumn` enables generations elapsed since the most recent common ancestor (MRCA) of two genomes to be bounded within an estimated range.\n", "This is done by means of data local to the `HereditaryStratigraphicColumn` object — no centralized tracking system required.\n", "\n", "This capability has direct applications in digital evolution research (e.g., artificial life, genetic programming, genetic algorithms), and also may prove useful for other distributed systems applications where it is desired to track the copy-tree provenance of digital artifacts.\n", "\n", "The following provides a practical, code-first introduction to creating, propagating, and analyzing *hstrat* instrumentation to perform distributed phylogenetic tracking.\n", "See [\"Under the Hood\"](./mechanism.html) for more technical discussion of the underlying hereditary stratigraphy methodology." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Installation\n", "\n", "```bash\n", "python3 -m pip install hstrat\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initializing a `HereditaryStratigraphicColumn`\n", "\n", "The *hstrat* library provides a genome annotation called a *hereditary stratigraphic column* to track phylogenetic relatedness within a simulation.\n", "Here, we initialize two columns --- representing two fully-independent evolutionary origins.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hstrat import hstrat\n", "from hstrat._auxiliary_lib import seed_random\n", "\n", "seed_random(1) # reproducibility\n", "\n", "founder1 = hstrat.HereditaryStratigraphicColumn(\n", " stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(1),\n", " stratum_differentia_bit_width=16,\n", ")\n", "founder2 = hstrat.HereditaryStratigraphicColumn(\n", " stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(3),\n", " stratum_differentia_bit_width=16,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the configuration of *stratum retention policy* for each annotation above.\n", "\n", "This policy governs the trade-off between space usage of `HereditaryStratigraphicColumn` objects and inferential power.\n", "Simply put, these are various methods use follow a specific algorithm to \"prune\", or systematically delete, data in the column as generations elapse.\n", "(Hereditary stratigraphy works by appending data to annotations each generation.)\n", "Pruning avoids linear space complexity as generations elapse, but comes at the expense of introducing uncertainty into estimates of relatedness between columns.\n", "\n", "Above, the algorithm `fixed_resolution_algo` was chosen.\n", "A *stratum retention algorithm* denotes a general class of pruning strategies.\n", "Creating a specific policy instance is supplied an integer that dictates the amount of pruning.\n", "Here, smaller values result in denser stratum retention.\n", "A policy using the fixed resolution algorithm supplied with the value 3 retains strata from every third generation, while 1 would retain every strata.\n", "\n", "Below is a quick overview of the five available algorithms.\n", "\n", "| Stratum Retention Algorithm | Space Complexity | MRCA Gen Uncertainty |\n", "| ----------------------------------------- | ---------------- | -------------------- |\n", "| Fixed Resolution Algorithm | `n/k` | `k` |\n", "| Recency-proportional Resolution Algorithm | `k * log(n)` | `m/k` |\n", "| Depth-proportional Resolution Algorithm | `k` | `n/k` |\n", "| Geometric Sequence Nth Root Algorithm | `k` | `m * n^(1/k)` |\n", "| Curbed Recency-proportional Resolution Algorithm | `k` | `m / k` -> `m * n^(1/k)` |\n", "\n", "For most purposes, the curbed recency-proportional algorithm represents a good choice.\n", "To find a more in depth overview, visit [Choosing a Retention Policy](./policies.html).\n", "\n", "*Differentia bit width* was also configured above.\n", "This specifies the number of bits used to store lineage-identification data within an annotation each generation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Identifying Independent Origins\n", "\n", "To check for any common ancestors between two columns (versus independent founder geneses):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"any common ancestry?\", hstrat.does_have_any_common_ancestor(\n", " founder1, founder2\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"any common ancestry?\", hstrat.does_have_any_common_ancestor(\n", " founder1, founder1\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replicating Annotations & Elapsing Generations\n", "\n", "Generational transmission of `HereditaryStratigraphicColumn`'s involves two operations:\n", " 1. cloning --- generation of a fully-independent annotation object (i.e., a \"deep\" copy),\n", " 2. deposition --- appending annotation data to register passage of a generation.\n", "\n", "These operations can be performed independently, via `.Clone()` and `.DepositStratum()`.\n", "(To elapse `n` generations at once, call `.DepositStrata(n)`.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "descendant2a = founder2.Clone()\n", "for __ in range(5):\n", " descendant2a.DepositStratum()\n", "\n", "descendant2a.DepositStrata(5)\n", "\n", "assert hstrat.does_have_any_common_ancestor(founder2, descendant2a)\n", "f\"{descendant2a.GetNumStrataDeposited()} generations since founder genesis\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternately, call `.CloneDescendant()` to perform both operations --- cloning followed by deposition." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "descendant2b = descendant2a.CloneDescendant()\n", "\n", "assert hstrat.does_have_any_common_ancestor(founder2, descendant2b)\n", "f\"{descendant2b.GetNumStrataDeposited()} generations since founder genesis\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The member function `.CloneNthDescendant(n)` operates equivalently to `.Clone()` followed by `DepositStrata(n)`, taking in an integer to specify the number of generations elapsed.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "descendant2c = descendant2a.CloneNthDescendant(20)\n", "\n", "assert hstrat.does_have_any_common_ancestor(founder2, descendant2a)\n", "f\"{descendant2c.GetNumStrataDeposited()} generations since founder genesis\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most use cases will primarily use `.CloneDescendant()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mesuring Relatedness\n", "\n", "If you want to find the number of generations elapsed since the MRCA (most recent common ancestor), `calc_ranks_since_mrca_bounds_with` will return a tuple with the estimated lower and upper bound.\n", "(In `hstrat` parlance, the term \"rank\" corresponds to generations elapsed.)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hstrat.calc_ranks_since_mrca_bounds_with(\n", " descendant2b,\n", " descendant2c,\n", " prior=\"arbitrary\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uncertainty in phylogenetic inference is introduced due to the stratum pruning procedure mentioned above.\n", "\n", "Note that generations since MRCA will not be symmetrical if uneven branch lengths are at play." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hstrat.calc_ranks_since_mrca_bounds_with(\n", " descendant2c,\n", " descendant2b,\n", " prior=\"arbitrary\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generation of MRCA (i.e., MRCA generational depth from genesis) can also be estimated." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hstrat.calc_rank_of_mrca_bounds_between(\n", " descendant2b, descendant2c, prior=\"arbitrary\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This measure is symmetric." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hstrat.calc_rank_of_mrca_bounds_between(\n", " descendant2c, descendant2b, prior=\"arbitrary\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generation of MRCA can be estimated over an entire population." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hstrat.calc_rank_of_mrca_bounds_among(\n", " [descendant2a, descendant2c, descendant2b], prior=\"arbitrary\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The argument `prior` used above denotes the prior probability density distribution over possible generations of the MRCA.\n", "For most cases, `\"arbitrary\"` will suffice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### User-defined Genomes\n", "\n", "In most cases, columns will need to be associated with genetic content relevant to the simulation at hand.\n", "Python's object-oriented programming model provides a convenient way to achieve this.\n", "\n", "Most user-defined `Genome`'s will look something like this.\n", "Note the coupling of content mutation and hstrat copy/deposit in `Genome.CloneDescendant`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import typing\n", "\n", "\n", "class Genome:\n", " column: hstrat.HereditaryStratigraphicColumn\n", " content: typing.Any\n", "\n", " def __init__(self, column=None, content=None):\n", " if column is None:\n", " column = hstrat.HereditaryStratigraphicColumn(\n", " # configure policy, differentia bit width, etc. here\n", " )\n", " if content is None:\n", " self.content = (\n", " \"placeholder\" # initialize functional genome content\n", " )\n", "\n", " self.column = column\n", " self.content = content\n", "\n", " def CloneDescendant(self):\n", " mutated_content = self.content # put mutation operator here\n", " return Genome(self.column.CloneDescendant(), mutated_content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Collections of `Genome` objects would then make up the simulated population.\n", "\n", "See [here](./demo-ping.ipynb) for a more detailed example of this pattern in action." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Phylogenetic Reconstruction\n", "\n", "A major use case of *hstrat* is reconstruction of phylogenetic history over entire populations.\n", "We'll walk through that now.\n", "\n", "Begin by retrieving a ground-truth phylogeny.\n", "We will use this as a point of comparison." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "random.seed(1)\n", "import dendropy as dp\n", "\n", "tree_url = \"https://raw.githubusercontent.com/mmore500/hstrat/5069db7c358ac6949ceda5fe8cc9989d5d7139f9/examples/assets/example.newick\"\n", "template_tree = dp.Tree.get(url=tree_url, schema=\"newick\")\n", "for node in template_tree:\n", " node.edge_length = random.randint(1, 10) # random branch lengths\n", "template_tree.is_rooted = True\n", "template_tree.ladderize()\n", "print(template_tree.as_ascii_plot(plot_metric=\"length\", width=50))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*hstrat* provides `descend_template_phylogeny_*` functions to generate end-state `HereditaryStratigraphicColumn`'s \"as if\" they had evolved from `seed_column` with a provided phylogenetic history.\n", "Here, we generate a population of `HereditaryStratigraphicColumn`'s corresponding to the leaves of `template_tree`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "extant_population = hstrat.descend_template_phylogeny_dendropy(\n", " template_tree,\n", " seed_column=hstrat.HereditaryStratigraphicColumn(\n", " hstrat.recency_proportional_resolution_algo.Policy(5)\n", " ),\n", " extant_nodes=template_tree.leaf_node_iter(),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most use cases will not involve a template phylogeny approach.\n", "Rather, `HereditaryStratigraphicColumn`'s will evolve according to the simulation dynamics.\n", "However, it is convenient for this demonstration of phylogenetic reconstruction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `build_tree` takes in a population of columns and returns a tree constructed in alife standard format.\n", "The `version_pin` parameter dictates how calls to the function should resolve in future releases, with `hstrat.__version__` automatically tracking new updates.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reconstructed_phylogeny = hstrat.build_tree(\n", " extant_population,\n", " version_pin=hstrat.__version__,\n", " taxon_labels=map(lambda n: n.taxon.label, template_tree.leaf_node_iter()),\n", ")\n", "reconstructed_phylogeny" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reconstruction would be performed the same way on a population of `HereditaryStratigraphicColumn`'s resulting from an actual evolutionary simulation rather than \"as if\" template descent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use *dendropy* to visualize the reconstruction, and compare to ground truth." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import alifedata_phyloinformatics_convert as apc\n", "\n", "dendropy_tree = apc.alife_dataframe_to_dendropy_tree(\n", " reconstructed_phylogeny,\n", " setup_edge_lengths=True,\n", ")\n", "dendropy_tree.is_rooted = True\n", "dendropy_tree.ladderize()\n", "\n", "print(\"=============== reconstructed ===============\")\n", "print(dendropy_tree.as_ascii_plot(plot_metric=\"length\", width=50))\n", "\n", "print(\"=================== true ====================\")\n", "print(template_tree.as_ascii_plot(plot_metric=\"length\", width=50))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Serialization\n", "\n", "Saving and restoring `HereditaryStratigraphicColumn`'s is important in order to be able to transmit *hstrat* annotations between processes during a simulation or save state after a simulation.\n", "The library provides mechanisms to support both binary and text formats.\n", "\n", "Begin by creating an example `HereditaryStratigraphicColumn` to play with." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "diffwidth = 8\n", "policy = hstrat.recency_proportional_resolution_curbed_algo.Policy(10)\n", "column = hstrat.HereditaryStratigraphicColumn(\n", " stratum_retention_policy=policy,\n", " stratum_differentia_bit_width=diffwidth,\n", " always_store_rank_in_stratum=False,\n", ")\n", "column.DepositStrata(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For binary formats, use `col_to_packet` and `col_from_packet`.\n", "(If the packet is contained within a larger buffer, you'll want to use `col_from_packet_buffer` instead.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "packet = hstrat.col_to_packet(column)\n", "reconst = hstrat.col_from_packet(\n", " packet,\n", " differentia_bit_width=diffwidth,\n", " stratum_retention_policy=policy,\n", ")\n", "assert reconst == column\n", "\n", "packet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `differentia_bit_width` and `stratum_retention_policy` must be passed --- this information is not stored within serialized packets.\n", "\n", "The first step for plain-text serialization is conversion between column objects and representation via builtin Python types." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "records = hstrat.col_to_records(column)\n", "reconst = hstrat.col_from_records(records)\n", "assert reconst == column\n", "\n", "records" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike packet serialization, policy information is stored in the generated records --- convenient for data persistence beyond simulation runtime." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conversion between records and plain text can then be handled by a number of tools.\n", "Python provides builtin support for JSON." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "as_json = json.dumps(records, indent=2) # indent kwarg pretty prints\n", "reconst = json.loads(as_json)\n", "assert records == reconst\n", "\n", "print(as_json)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps you prefer yaml?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import yaml\n", "\n", "as_yaml = yaml.safe_dump(records)\n", "reconst = yaml.safe_load(as_yaml)\n", "assert records == reconst\n", "\n", "print(as_yaml)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pop_to_records` and `pop_to_records` operate similarly, except on entire collections of `HereditaryStratigraphicColumns` rather than individual columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population = [column, column.CloneDescendant()]\n", "records = hstrat.pop_to_records(population)\n", "reconst = hstrat.pop_from_records(records)\n", "assert reconst == population\n", "\n", "records" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For very large populations, consider using `gzip` or `lzma`/`xz` to counteract file size overhead from plain-text format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Viewing Column Internals\n", "\n", "An ascii representation of a hereditary stratigraph column can be printed using the `hstrat.col_to_ascii()` function, passing in the specified column as a parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "col = hstrat.HereditaryStratigraphicColumn(\n", " stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(3),\n", ")\n", "col.DepositStrata(5)\n", "\n", "print(hstrat.col_to_ascii(col))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A column can also be exported to pandas dataframe.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hstrat.col_to_dataframe(col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Further Reading\n", "\n", "- Additional usage examples can be found within the [`examples/` directory](https://github.com/mmore500/hstrat/tree/master/examples).\n", "- All library components can be accessed directly from the convenience flat namespace `hstrat.hstrat` (e.g., `from hstrat import hstrat`).\n", " A full API listing is provided [here](./_autosummary/hstrat.hstrat.html).\n", "- Information and advice on selecting a stratum retention policy is [here](./policies.html).\n", "- Preliminary prototype work on improved methodology for fixed-size hstrat annotations is available [here](https://github.com/mmore500/hstrat-surface-concept/tree/master)\n", "- Preliminary prototype work on extensions for systems with sexual recombination is available [here](https://github.com/mmore500/hstrat-recomb-concept)\n" ] } ], "metadata": { "kernelspec": { "display_name": "env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }