Tutorial 1: Benchmark Generation
================================

This tutorial provides a detailed guide for creating customized benchmark datasets with the ``dyn-benchmark`` package, focusing on configuring all available parameters to control community evolution and network structure.

Prerequisites
-------------

.. code-block:: python

    # Install the package if you haven't already
    # pip install dyn-benchmark

    # Import required modules
    from dyn.benchmark.generator.groundtruth_generator import GroundtruthGenerator
    from dyn.benchmark.generator.communities_generator import CommunitiesGenerator, RelativeOverlap, Match, Overlap
    from dyn.benchmark.generator.edges_generator import SBM, BPAM, FastBPAM, PAM
    from dyn.benchmark.generator.nodes_generator import RandomMemberGenerator
    import networkx as nx
    import numpy as np
    import matplotlib.pyplot as plt

1. Community Generation Parameters
----------------------------------

The :class:`CommunitiesGenerator <dyn.benchmark.generator.communities_generator.CommunitiesGenerator>` creates evolving communities and is configured on two distinct levels:

1. **Global generation parameters**: These are set as attributes in the constructor and control the overall properties of the generation process.
2. **Evolving Communities shape parameters**: These are controlled by overriding specific methods that define probability distributions for community attributes like size, lifetime, and growth patterns.

Global generation parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following parameters can be set when instantiating a :class:`CommunitiesGenerator <dyn.benchmark.generator.communities_generator.CommunitiesGenerator>`:


.. code-block:: python

    # Create a communities generator with fully customized parameters
    community_generator = CommunitiesGenerator(
        # Basic parameters
        community_count=15,          # Number of evolving communities to generate
        snapshot_count=10,           # Number of snapshots in the temporal network
        community_size_min=5,        # Minimum size of any static community
        core_nodes_ratio=0.7,        # Ratio of members that stay in their community between snapshots
        matching_metric_type=RelativeOverlap,  # Algorithm for matching communities across snapshots
        seed=42                      # Seed for reproducibility
    )

Matching metrics help identify when a community at time t corresponds to a community at time t+1. As defined in :cite:t:`Aynaud2013`, matching metrics compare the intersection of nodes between communities to determine their continuity over time. The package implements three matching metrics:

.. code-block:: python

    # Match: min(|C0 ∩ C1| / |C0|, |C0 ∩ C1| / |C1|)
    # Compares the relative size of the intersection to both community sizes
    # Used when community sizes vary significantly
    match_generator = CommunitiesGenerator(matching_metric_type=Match)

    # RelativeOverlap: |C0 ∩ C1| / |C0 ∪ C1|
    # Compares the intersection to the union (Jaccard coefficient)
    # Balanced approach (default)
    relative_overlap_generator = CommunitiesGenerator(matching_metric_type=RelativeOverlap)

    # Overlap: |C0 ∩ C1|
    # Simply uses the raw intersection size
    # Used when community sizes are stable
    overlap_generator = CommunitiesGenerator(matching_metric_type=Overlap)

These matching metrics have different properties suitable for different community evolution patterns. For example, Match is appropriate when communities can significantly vary in size, as it takes the minimum of the relative intersections. RelativeOverlap (the default) uses the Jaccard coefficient to balance the comparison, while Overlap simply looks at the raw number of shared nodes.
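
To make the three formulas concrete, the following sketch evaluates them on two small node sets. It uses plain Python sets only; the function names ``match``, ``relative_overlap`` and ``overlap`` are hypothetical helpers written for this illustration and are not the package's ``Match``, ``RelativeOverlap`` and ``Overlap`` classes.

.. code-block:: python

    def match(c0, c1):
        """min(|C0 ∩ C1| / |C0|, |C0 ∩ C1| / |C1|)"""
        shared = len(c0 & c1)
        return min(shared / len(c0), shared / len(c1))

    def relative_overlap(c0, c1):
        """|C0 ∩ C1| / |C0 ∪ C1| (Jaccard coefficient)"""
        return len(c0 & c1) / len(c0 | c1)

    def overlap(c0, c1):
        """|C0 ∩ C1| (raw number of shared nodes)"""
        return len(c0 & c1)

    # A community of 6 nodes at time t and a larger one of 9 nodes at time t+1
    community_t = {1, 2, 3, 4, 5, 6}
    community_t1 = {4, 5, 6, 7, 8, 9, 10, 11, 12}

    print(match(community_t, community_t1))             # min(3/6, 3/9)
    print(relative_overlap(community_t, community_t1))  # 3/12
    print(overlap(community_t, community_t1))           # 3 shared nodes

The two normalized metrics penalize the size difference between the communities, while the raw overlap only counts the three shared nodes.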

Evolving Communities shape parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The generator creates communities based on several probability distributions that you can customize by subclassing :class:`CommunitiesGenerator <dyn.benchmark.generator.communities_generator.CommunitiesGenerator>` and overriding its methods:

.. code-block:: python

    class CustomCommunitiesGenerator(CommunitiesGenerator):
        """Custom generator with specific community evolution patterns"""
        
        def draw_community_size(self, *args, **kwargs):
            """Controls the initial size of communities
            Returns a float (will be rounded to integer)"""
            return self.rng.normal(loc=50, scale=20)  # Normal distribution centered at 50
        
        def draw_community_lifetime(self, *args, **kwargs):
            """Controls how long communities exist
            Returns a float (will be rounded to integer)"""
            # Normal draw clipped to the range of 3 to 7 snapshots
            return np.clip(self.rng.normal(loc=5, scale=2), 3, 7)
        
        def draw_community_start(self, *args, **kwargs):
            """Controls when communities are born
            Returns a float between 0-1 (scaled to valid snapshot range)"""
            return self.rng.random()  # Uniform distribution
        
        def draw_change_ratio(self, *args, **kwargs):
            """Controls community size changes over time
            Returns a float (negative = shrink, positive = grow)"""
            return self.rng.normal(loc=0, scale=0.2)  # Normal distribution centered at 0
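
Before plugging distributions into a subclass, it can help to preview them in isolation. The sketch below draws from the same four distributions with a standalone NumPy generator. It assumes that ``self.rng`` behaves like a ``numpy.random.Generator``, which the calls above suggest but which this tutorial does not state explicitly; the variable names are illustrative.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(42)  # stand-in for the generator's own rng (assumption)
    n_draws = 10_000

    sizes = rng.normal(loc=50, scale=20, size=n_draws)
    # Same clamping to [3, 7] as in draw_community_lifetime
    lifetimes = np.clip(rng.normal(loc=5, scale=2, size=n_draws), 3, 7)
    starts = rng.random(n_draws)
    change_ratios = rng.normal(loc=0, scale=0.2, size=n_draws)

    # Clamping piles probability mass onto the two bounds
    share_at_bounds = np.mean((lifetimes == 3) | (lifetimes == 7))

    print(f"mean size: {sizes.mean():.1f}")
    print(f"share of negative size draws: {np.mean(sizes < 0):.2%}")
    print(f"lifetimes sitting exactly on a bound: {share_at_bounds:.1%}")
    print(f"start range: [{starts.min():.3f}, {starts.max():.3f}]")
    print(f"change ratio std: {change_ratios.std():.3f}")

Clamping a draw to the range 3 to 7 is not the same as sampling a truncated normal distribution: every out-of-range draw lands exactly on a bound, so the bounds are over-represented. If that matters for your benchmark, consider resampling until the draw falls inside the range.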


2. Network Structure Parameters
-------------------------------

The package offers several graph generation models to create the underlying network structure at a given snapshot.

Stochastic Block Model (SBM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Stochastic Block Model (SBM) is a classic approach for generating networks with community structure. It creates edges between nodes based on their community membership, with higher probability for intra-community connections than inter-community connections.

.. code-block:: python

    # SBM parameters (classic community-based model)
    sbm_generator = SBM(
        p_in=0.7,      # Probability of edge between nodes in same community
        p_out=0.05,    # Probability of edge between nodes in different communities
        max_iter=10,   # Maximum number of attempts to generate a connected graph
        seed=42        # Seed for reproducibility
    )

The SBM was originally introduced by :cite:t:`holland1983stochastic` as a statistical model for social networks with block structures, making it particularly suitable for generating synthetic networks with well-defined community structures.

For more details check the :class:`SBM <dyn.benchmark.generator.edges_generator.SBM>` class.
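
To build intuition for how ``p_in`` and ``p_out`` shape a graph, the sketch below uses NetworkX's own ``stochastic_block_model`` generator rather than the package's ``SBM`` class, whose internals are not covered here. The block sizes are arbitrary values chosen for illustration.

.. code-block:: python

    import networkx as nx

    sizes = [30, 30, 30]          # three blocks of 30 nodes (illustrative)
    p_in, p_out = 0.7, 0.05

    # Edge probability matrix: p_in on the diagonal, p_out elsewhere
    probs = [
        [p_in if i == j else p_out for j in range(len(sizes))]
        for i in range(len(sizes))
    ]
    graph = nx.stochastic_block_model(sizes, probs, seed=42)

    # NetworkX stores each node's block index in the "block" attribute
    block = nx.get_node_attributes(graph, "block")
    intra = sum(1 for u, v in graph.edges if block[u] == block[v])
    inter = graph.number_of_edges() - intra

    n_nodes = sum(sizes)
    intra_pairs = sum(s * (s - 1) // 2 for s in sizes)
    inter_pairs = n_nodes * (n_nodes - 1) // 2 - intra_pairs

    intra_density = intra / intra_pairs
    inter_density = inter / inter_pairs
    print(f"intra-community edge density: {intra_density:.3f}")
    print(f"inter-community edge density: {inter_density:.3f}")

The observed densities should land close to ``p_in`` and ``p_out`` respectively, which is what makes the communities easy to detect when the gap between the two probabilities is large.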

Preferential Attachment Model (PAM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Preferential Attachment Model (PAM) generates networks with a power-law degree distribution, simulating the "rich get richer" phenomenon observed in many real-world networks.

.. code-block:: python

    # PAM parameters (scale-free networks)
    pam_generator = PAM(
        m=5,           # Number of edges to add for each new node
        self_loop=False,  # Whether self-loops are allowed
        seed=42        # Seed for reproducibility
    )

This implementation is based on the efficient algorithms described by :cite:t:`tonelli2010three` for implementing preferential attachment mechanisms in evolving networks.

For more details check the :class:`PAM <dyn.benchmark.generator.edges_generator.PAM>` class.
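
The "rich get richer" effect is easy to observe on a standard preferential attachment graph. The sketch below uses NetworkX's ``barabasi_albert_graph`` as a stand-in; it is not the package's ``PAM`` implementation and has no ``self_loop`` option, but it follows the same principle of attaching each new node to ``m`` existing nodes with probability proportional to their degree.

.. code-block:: python

    import networkx as nx

    n_nodes, m = 1000, 5
    graph = nx.barabasi_albert_graph(n_nodes, m, seed=42)

    degrees = [degree for _, degree in graph.degree()]
    mean_degree = sum(degrees) / n_nodes
    max_degree = max(degrees)

    print(f"mean degree: {mean_degree:.2f}")
    print(f"max degree:  {max_degree}")

The mean degree stays close to ``2 * m``, while a few early nodes accumulate a degree many times larger, which is the signature of a heavy-tailed degree distribution.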

Block Preferential Attachment Model (BPAM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Block Preferential Attachment Model (BPAM) combines the community-aware structure of SBM with the scale-free properties of PAM, creating networks with both community structure and realistic degree distributions.

.. code-block:: python

    # BPAM parameters (community-aware preferential attachment)
    bpam_generator = BPAM(
        gamma_in=0.8,  # Intra-community interaction strength
        gamma_out=0.1, # Inter-community interaction strength
        m=5,           # Number of edges to add for each new node
        self_loop=False,  # Whether self-loops are allowed 
        seed=42        # Seed for reproducibility
    )

BPAM was introduced by :cite:t:`tang2020buckley` as an extension of the Buckley-Osthus model, incorporating block structures to create networks with both community organization and power-law degree distributions.

For more details check the :class:`BPAM <dyn.benchmark.generator.edges_generator.BPAM>` class.
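
The toy sketch below illustrates one way in which two interaction strengths can bias preferential attachment toward a node's own community. It is a deliberately simplified model written for this tutorial, not the algorithm of the cited paper nor the package's ``BPAM`` implementation: the function ``toy_block_pa`` and its weighting rule (degree plus one, multiplied by ``gamma_in`` or ``gamma_out``) are assumptions made for illustration.

.. code-block:: python

    import numpy as np

    def toy_block_pa(labels, m=2, gamma_in=0.8, gamma_out=0.1, seed=42):
        """Toy model: each new node links to up to m earlier nodes, chosen with
        probability proportional to (degree + 1) * gamma, where gamma is
        gamma_in for same-community targets and gamma_out otherwise."""
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        degree = np.zeros(len(labels))
        edges = set()
        for new in range(1, len(labels)):
            existing = np.arange(new)
            same = labels[existing] == labels[new]
            weights = (degree[existing] + 1.0) * np.where(same, gamma_in, gamma_out)
            targets = rng.choice(
                existing, size=min(m, new), replace=False, p=weights / weights.sum()
            )
            for target in targets:
                edges.add((int(target), new))
                degree[target] += 1
                degree[new] += 1
        return edges

    # 120 nodes spread evenly over three communities
    labels = [i % 3 for i in range(120)]
    edges = toy_block_pa(labels)
    intra_share = sum(labels[u] == labels[v] for u, v in edges) / len(edges)
    print(f"{len(edges)} edges, {intra_share:.1%} inside communities")

Raising ``gamma_out`` toward ``gamma_in`` in this toy model blurs the community structure, while the degree term keeps favoring well-connected nodes.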


Fast Block Preferential Attachment Model (FastBPAM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Fast Block Preferential Attachment Model (FastBPAM) is a performance-optimized variant of BPAM. It takes the same parameters and generates networks with both community structure and realistic degree distributions, at a lower computational cost.

.. code-block:: python

    # FastBPAM - optimized version of BPAM with the same parameters
    fast_bpam = FastBPAM(
        gamma_in=0.8,
        gamma_out=0.1,
        m=5,
        self_loop=False,
        seed=42
    )

For more details check the :class:`FastBPAM <dyn.benchmark.generator.edges_generator.FastBPAM>` class.

3. Ensuring Reproducibility
---------------------------

To generate identical benchmarks, it's essential to use consistent seed values across all components of the generation process:

.. code-block:: python

    # First generator with seed 42
    generator1 = GroundtruthGenerator(seed=42)
    benchmark1 = generator1.generate()

    # Second generator with seed 42
    generator2 = GroundtruthGenerator(seed=42)
    benchmark2 = generator2.generate()

    # Check that both benchmarks contain the same nodes in every snapshot
    same_nodes = all(
        set(benchmark1.graphs[t].nodes) == set(benchmark2.graphs[t].nodes)
        for t in benchmark1.graphs.keys()
    )
    print(f"Benchmarks have identical nodes: {same_nodes}")
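
Comparing node sets alone does not show that the edges match as well. A stricter check might look like the sketch below, which assumes that ``graphs`` is a mapping from snapshot to a NetworkX graph, as the attribute access in this tutorial suggests. The helpers ``edge_set`` and ``same_snapshots`` are hypothetical names, demonstrated here on small stand-in graphs.

.. code-block:: python

    import networkx as nx

    def edge_set(graph):
        """Return edges as a set, ignoring orientation for undirected graphs."""
        if graph.is_directed():
            return set(graph.edges())
        return {frozenset(edge) for edge in graph.edges()}

    def same_snapshots(graphs_a, graphs_b):
        """True if both mappings have the same snapshots, nodes and edges."""
        if graphs_a.keys() != graphs_b.keys():
            return False
        return all(
            set(graphs_a[t].nodes) == set(graphs_b[t].nodes)
            and edge_set(graphs_a[t]) == edge_set(graphs_b[t])
            for t in graphs_a
        )

    # Stand-in data: two identical runs and one that differs in snapshot 1
    run_a = {0: nx.path_graph(4), 1: nx.cycle_graph(4)}
    run_b = {0: nx.path_graph(4), 1: nx.cycle_graph(4)}
    run_c = {0: nx.path_graph(4), 1: nx.path_graph(4)}

    print(same_snapshots(run_a, run_b))
    print(same_snapshots(run_a, run_c))

Under that assumption, the same helper could be called as ``same_snapshots(benchmark1.graphs, benchmark2.graphs)``.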

When creating a custom generator with multiple components, ensure that each component receives a consistent seed value:

.. code-block:: python

    # Create a reproducible custom generator
    custom_generator = GroundtruthGenerator(
        community_generator=CustomCommunitiesGenerator(seed=42),
        node_generator=RandomMemberGenerator(seed=42),
        edge_generator=FastBPAM(seed=42),
        seed=42  # Master seed for the entire generation process
    )

The :class:`GroundtruthGenerator <dyn.benchmark.generator.groundtruth_generator.GroundtruthGenerator>` distributes child seeds to its components, so a master seed is usually sufficient. Setting component seeds explicitly gives finer control over reproducibility.
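
The general idea of deriving child seeds from a master seed can be sketched with NumPy's ``SeedSequence``. This is one common mechanism, shown for illustration only; whether the package uses it internally is not stated in this tutorial, and ``child_rngs`` is a hypothetical helper.

.. code-block:: python

    import numpy as np

    def child_rngs(master_seed, count):
        """Derive independent, reproducible generators from one master seed."""
        children = np.random.SeedSequence(master_seed).spawn(count)
        return [np.random.default_rng(child) for child in children]

    # One generator per component: communities, nodes, edges
    first_run = [rng.random() for rng in child_rngs(42, 3)]
    second_run = [rng.random() for rng in child_rngs(42, 3)]

    print(first_run == second_run)    # same master seed, same streams
    print(len(set(first_run)) == 3)   # each component gets its own stream

With this pattern every component has its own random stream, yet the whole run is still determined by a single master seed.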

4. Exploring the Generated Benchmark
------------------------------------

Once you have generated a benchmark, you can explore its properties through various attributes and methods:

.. code-block:: python

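    # Generate a benchmark with the custom generator defined in the previous section
    groundtruth = custom_generator.generate()

    # Access different components of the benchmark
    print(f"Generated {len(groundtruth.tcommlist)} member assignments")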
    print(f"Across {len(groundtruth.graphs)} snapshots")
    print(f"With {len(groundtruth.events)} community events")

    # Inspect the first snapshot
    first_snapshot = min(groundtruth.graphs.keys())
    graph = groundtruth.graphs[first_snapshot]
    print(f"First snapshot has {len(graph.nodes)} nodes and {len(graph.edges)} edges")

    # List community events
    event_types = set(event.label for event in groundtruth.events)
    print(f"Event types: {event_types}")
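
The per-snapshot inspection can be extended to every snapshot at once. The sketch below assumes, as above, that ``graphs`` maps each snapshot to a NetworkX graph; ``summarize_snapshots`` is a hypothetical helper, demonstrated on small stand-in graphs.

.. code-block:: python

    import networkx as nx

    def summarize_snapshots(graphs):
        """Return node count, edge count and density for each snapshot, in order."""
        return {
            t: {
                "nodes": graph.number_of_nodes(),
                "edges": graph.number_of_edges(),
                "density": nx.density(graph),
            }
            for t, graph in sorted(graphs.items())
        }

    # Stand-in data in place of a generated benchmark
    toy_graphs = {1: nx.complete_graph(4), 0: nx.path_graph(5)}
    summary = summarize_snapshots(toy_graphs)

    for t, stats in summary.items():
        print(f"t={t}: {stats['nodes']} nodes, {stats['edges']} edges, "
              f"density {stats['density']:.2f}")

Under the stated assumption, calling ``summarize_snapshots(groundtruth.graphs)`` would give a quick overview of how the network grows or shrinks over time.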

For deeper analysis and visualization of your generated benchmark, refer to the following tutorials:

- :doc:`Tutorial 3: Comprehensive Metrics Computation and Analysis <3_metrics_computation_analysis>` explains how to compute various metrics and analyze events for evolving communities and temporal networks.
- :doc:`Tutorial 4: Visualizing Evolving Communities <4_visualizing_evolving_communities>` provides comprehensive methods for visualizing community evolution, including Sankey diagrams, network snapshots, and animated visualizations.


5. Complete Example
-------------------

Here's a complete example that demonstrates how to generate a customized benchmark. This implementation:

1. Creates a custom community generator with specific distributions
2. Sets up a network structure generator (FastBPAM)
3. Ensures reproducibility with consistent seeds
4. Explores the generated benchmark properties

.. code-block:: python

    """
    Complete example of benchmark generation using dyn-benchmark
    """
    import numpy as np

    from dyn.benchmark.generator.groundtruth_generator import GroundtruthGenerator
    from dyn.benchmark.generator.communities_generator import CommunitiesGenerator, RelativeOverlap
    from dyn.benchmark.generator.edges_generator import FastBPAM
    from dyn.benchmark.generator.nodes_generator import RandomMemberGenerator
    from dyn.core.communities import Membership
    from dyn.drawing.sankey_drawing import plot_sankey

    # 1. Define a custom communities generator with specific distributions
    class CustomCommunitiesGenerator(CommunitiesGenerator):
        """Custom generator with specific community evolution patterns"""

        def draw_community_size(self, *args, **kwargs):
            """Controls the initial size of communities"""
            return self.rng.normal(loc=50, scale=20)  # Normal distribution centered at 50

        def draw_community_lifetime(self, *args, **kwargs):
            """Controls how long communities exist"""
            # Normal draw clipped to the range of 3 to 7 snapshots
            return np.clip(self.rng.normal(loc=5, scale=2), 3, 7)

        def draw_community_start(self, *args, **kwargs):
            """Controls when communities are born"""
            return self.rng.random()  # Uniform distribution

        def draw_change_ratio(self, *args, **kwargs):
            """Controls community size changes over time"""
            return self.rng.normal(loc=0, scale=0.2)  # Normal distribution centered at 0

    # 2. Set up a custom generator with all components
    community_generator = CustomCommunitiesGenerator(
        community_count=15,          # Number of evolving communities
        snapshot_count=10,           # Number of snapshots in the temporal network
        community_size_min=5,        # Minimum size of any static community
        core_nodes_ratio=0.7,        # Ratio of members that stay in their community
        matching_metric_type=RelativeOverlap,  # Algorithm for matching communities
        seed=42                      # Seed for reproducibility
    )

    # 3. Create the network structure generator using the Fast Block Preferential Attachment Model (FastBPAM)
    fast_bpam_generator = FastBPAM(
        gamma_in=0.8,  # Intra-community interaction strength
        gamma_out=0.1, # Inter-community interaction strength
        m=5,           # Number of edges to add for each new node
        self_loop=False,  # Whether self-loops are allowed
        seed=42        # Seed for reproducibility
    )

    # 4. Create a node generator
    node_generator = RandomMemberGenerator(seed=42)

    # 5. Assemble the final generator with all components
    custom_generator = GroundtruthGenerator(
        community_generator=community_generator,
        node_generator=node_generator,
        edge_generator=fast_bpam_generator,  # Using FastBPAM for this example
        seed=42  # Master seed for the entire generation process
    )

    # 6. Generate the benchmark
    print("Generating benchmark...")
    groundtruth = custom_generator.generate()

    # 7. Explore the generated benchmark properties
    print(f"\nGenerated benchmark with:")
    #print(f"- {len(groundtruth.tcommlist)} member assignments")
    print(f"- {len(groundtruth.graphs)} snapshots")
    print(f"- {len(groundtruth.events)} community events")

    # 8. Inspect the first snapshot
    first_snapshot = min(groundtruth.graphs.keys())
    graph = groundtruth.graphs[first_snapshot]
    print(f"\nFirst snapshot (t={first_snapshot}):")
    print(f"- {len(graph.nodes)} nodes")
    print(f"- {len(graph.edges)} edges")
    print(f"- Density: {2*len(graph.edges)/(len(graph.nodes)*(len(graph.nodes)-1)):.4f}")

    # 9. Extract membership information
    membership = Membership.from_tcommlist(groundtruth.tcommlist)
    print(f"\nMembership details:")
    print(f"- {len(membership.members)} unique members")
    print(f"- {len(membership.evolving_communities)} evolving communities")
    print(f"- {len(membership.static_communities)} static communities")

    # 10. List community events
    event_types = {}
    for event in groundtruth.events:
        event_types[event.label] = event_types.get(event.label, 0) + 1

    print("\nCommunity events:")
    for event_type, count in event_types.items():
        print(f"- {event_type}: {count} occurrences")

    # 11. Create a visualization of community evolution using a Sankey diagram
    plot_sankey(membership.community_graph)

    print("\nBenchmark generation and analysis complete!")


This tutorial has demonstrated the complete process of generating customized benchmarks using ``dyn-benchmark``, explaining all parameters and their effects on community evolution and network structure. For more advanced analyses and visualizations, refer to the other tutorials in this documentation.

References
----------

.. bibliography::
    :filter: False
    
    holland1983stochastic
    tang2020buckley
    tonelli2010three
    Aynaud2013