K-Core Decomposition for Ultramarathon Data
The Sparsity Problem
Ultramarathon race results present a fascinating modeling challenge. With thousands of races and hundreds of thousands of runners, you might think we'd have plenty of data—but the reality is far more complex. Most runners complete only a handful of races, and most races see only a fraction of the broader running population. This creates an extremely sparse dataset where direct comparisons between runners or courses are often impossible.
Imagine trying to compare a runner who only does Western States with someone who only does Leadville. Without shared race participation, we have no common reference point. This sparsity becomes a critical problem when we want to build statistical models that estimate runner abilities or course difficulties.
The solution? Rather than wrestling with the full sparse dataset, we can identify a densely-connected subset where runners and courses have enough overlap for meaningful inference. This is where k-core decomposition comes in—a graph-theoretic technique that iteratively prunes nodes until every remaining node meets minimum connectivity requirements.
In this notebook, we'll:
- Explore the space of possible k-core configurations
- Visualize the tradeoffs between subset size and density
- Select and examine a specific k-core for downstream modeling
Setup
We begin by loading race results from UltraSignup, a comprehensive database of ultramarathon events. The preprocessing pipeline filters to running events (excluding cycling, swimming, etc.), extracts distance information, and identifies races that report DNFs (Did Not Finish)—a key signal for our later analysis.
We focus on marathon-distance and longer events (≥26.2 miles) since these are the races where DNF patterns become most meaningful and where the ultramarathon community truly begins.
```python
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import igraph as ig

from utils.data_processing import load_results, process_results, filter_races_with_dnfs
from utils.kcore_subsetting import evaluate_all_kcore_candidates, subset_kcore_data
```
```python
# preprocess data

results = load_results()
results = process_results(results)
results = filter_races_with_dnfs(results)
results = results[results['distance_miles'] >= 26.2]
```
Exploring the K-Core Parameter Space
What is an (α, β)-Core?
Think of ultramarathon data as a bipartite graph: one set of nodes represents runners, another represents courses (race-distance combinations), and edges connect runners to courses they've completed. An (α, β)-core is the largest subgraph where:
- Every runner has completed at least α different courses
- Every course has at least β different runners
The parameters α and β control the density-size tradeoff:
- Higher α means runners must have more race history → fewer runners qualify
- Higher β means courses need more participants → fewer courses qualify
Finding the right balance is crucial. Too loose, and we retain too much sparsity for effective modeling. Too strict, and we throw away valuable data.
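The iterative pruning just described can be sketched in a few lines of pandas. This is a toy illustration with made-up `runner`/`course` columns, not the project's `subset_kcore_data` utility:

```python
import pandas as pd

def alpha_beta_core(results, alpha, beta):
    """Iteratively drop runners with < alpha courses and courses with < beta runners."""
    core = results.drop_duplicates(['runner', 'course'])
    while True:
        runner_deg = core.groupby('runner')['course'].nunique()
        course_deg = core.groupby('course')['runner'].nunique()
        pruned = core[core['runner'].isin(runner_deg[runner_deg >= alpha].index)
                      & core['course'].isin(course_deg[course_deg >= beta].index)]
        if len(pruned) == len(core):  # fixed point: every node meets both thresholds
            return pruned
        core = pruned

# Toy data: course W has only one runner, so it is pruned; that drops runner D
# below α=2, so D is pruned on the next pass — pruning cascades.
toy = pd.DataFrame({
    'runner': ['A', 'A', 'B', 'B', 'D', 'D'],
    'course': ['X', 'Y', 'X', 'Y', 'X', 'W'],
})
core = alpha_beta_core(toy, alpha=2, beta=2)
print(sorted(core['runner'].unique()), sorted(core['course'].unique()))
```

The cascade is why the loop runs to a fixed point rather than making a single pass: removing an under-subscribed course can push a runner below α, and vice versa.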
Breadth-First Exploration
Rather than testing every possible (α, β) combination, we use an intelligent breadth-first search. Starting from minimal requirements, we incrementally tighten constraints and track how the subset evolves—measuring not just size, but also completeness metrics that indicate how representative the k-core remains of each runner's and course's full history.
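One property that makes this kind of sweep tractable is that (α, β)-cores are nested: tightening a constraint can start from the previous core instead of the full dataset and reach the same answer. A toy demonstration under assumed `runner`/`course` column names (whether `evaluate_all_kcore_candidates` exploits this internally is an assumption):

```python
import pandas as pd

def prune(df, alpha, beta):
    # Repeatedly drop under-connected runners/courses until a fixed point (illustrative pruner)
    while True:
        r = df.groupby('runner')['course'].nunique()
        c = df.groupby('course')['runner'].nunique()
        kept = df[df['runner'].isin(r[r >= alpha].index)
                  & df['course'].isin(c[c >= beta].index)]
        if len(kept) == len(df):
            return kept
        df = kept

toy = pd.DataFrame({
    'runner': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
    'course': ['X', 'Y', 'Z', 'X', 'Y', 'X', 'Y', 'Z'],
})

# Tightening beta from 2 to 3: pruning the previous core gives the same
# result as pruning the full dataset from scratch.
core_b2 = prune(toy, alpha=2, beta=2)
from_scratch = prune(toy, alpha=2, beta=3)
incremental = prune(core_b2, alpha=2, beta=3)
print(from_scratch.equals(incremental))
```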
```python
# search k-core space

candidates_df = evaluate_all_kcore_candidates(
    results=results,
    max_entities=None,  # explore full parameter space
    min_alpha=2,        # courses per runner
    min_beta=30,        # runners per course
    beta_step=2,
    verbose=True,
    use_cache=True
)
```
✅ Loading cached k-core candidates from candidates_maxentNone_alpha2_beta30_step2.pkl
Visualizing the Parameter Landscape
The 3D visualization below reveals the fundamental tradeoffs in k-core selection. Each point on the surface represents a different (α, β) configuration:
- X-axis (Course Completeness): On average, what percentage of each course's total runners are retained in the k-core?
- Y-axis (Runner Completeness): On average, what percentage of each runner's race history is retained?
- Z-axis: Configurable—toggle between course count, total entities, or total results
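The two completeness axes can be computed directly from the full dataset and a candidate core. A minimal sketch with assumed `runner`/`course` column names (the notebook's metrics come from `evaluate_all_kcore_candidates`; this is an illustrative reconstruction):

```python
import pandas as pd

def completeness(full, core):
    """Mean % of each retained runner's history (and course's field) kept in the core."""
    full_runner = full.groupby('runner')['course'].nunique()
    core_runner = core.groupby('runner')['course'].nunique()
    runner_pct = (core_runner / full_runner.loc[core_runner.index] * 100).mean()

    full_course = full.groupby('course')['runner'].nunique()
    core_course = core.groupby('course')['runner'].nunique()
    course_pct = (core_course / full_course.loc[core_course.index] * 100).mean()
    return course_pct, runner_pct

# Toy data: the core drops runner C and course Z
full = pd.DataFrame({'runner': list('AABBCC'), 'course': list('XYXYXZ')})
core = full[(full['runner'] != 'C') & (full['course'] != 'Z')]
course_pct, runner_pct = completeness(full, core)
print(round(course_pct, 1), runner_pct)  # course X keeps only 2 of its 3 runners
```

Note the averages are taken over entities *retained* in the core, which is why both axes can stay high even when many peripheral runners and courses are discarded.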
Key patterns to observe:
- The surface drops off sharply as we move toward higher completeness values—achieving both high course and runner completeness simultaneously requires sacrificing dataset size
- The "ridge" of the surface shows the Pareto frontier: configurations where you can't improve one metric without sacrificing another
- Hover over points to see the exact (α, β) values and detailed statistics
Use the toggle buttons to explore different perspectives on the data.
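The Pareto frontier mentioned above can also be extracted directly from the candidates table rather than read off the surface. A minimal sketch (the candidate values below are hypothetical; only the column names mirror the notebook's):

```python
import pandas as pd

def pareto_frontier(df, x, y):
    """Keep configurations not dominated on the (x, y) axes."""
    pts = df.sort_values([x, y], ascending=False).reset_index(drop=True)
    best_y = float('-inf')
    keep = []
    for i, row in pts.iterrows():
        # A point survives only if no point with higher x also had at least its y
        if row[y] > best_y:
            keep.append(i)
            best_y = row[y]
    return pts.loc[keep]

# Hypothetical candidate configurations
cands = pd.DataFrame({
    'alpha': [2, 2, 3, 3, 4],
    'beta': [30, 60, 60, 90, 90],
    'avg_course_completeness': [10.0, 20.0, 40.0, 30.0, 35.0],
    'avg_runner_completeness': [80.0, 70.0, 50.0, 55.0, 60.0],
})
front = pareto_frontier(cands, 'avg_course_completeness', 'avg_runner_completeness')
print(front[['alpha', 'beta']].values.tolist())  # (3, 90) is dominated by (4, 90)
```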
```python
def plot_3d_kcore_space(candidates_df: pd.DataFrame, full_results: pd.DataFrame = None):
    """
    Create interactive 3D wireframe visualization of k-core parameter space.

    Displays parameter space as a grid wireframe, connecting discrete (α, β) points
    from breadth-first search exploration.

    Args:
        candidates_df: All k-core candidates with metrics
        full_results: Full dataset for calculating baseline density reference planes
    """
    df = candidates_df.copy()

    # Calculate total entities
    df['n_entities'] = df['n_courses'] + df['n_participants']

    # Get unique sorted alpha and beta values
    alpha_values = sorted(df['alpha'].unique())
    beta_values = sorted(df['beta'].unique())

    # Create grid lookup with all z metrics
    grid_data = {}
    for _, row in df.iterrows():
        key = (row['alpha'], row['beta'])
        grid_data[key] = {
            'x': row['avg_course_completeness'],
            'y': row['avg_runner_completeness'],
            'z_courses': row['n_courses'],
            'z_entities': row['n_entities'],
            'z_total_results': row['total_results'],
            'n_courses': row['n_courses'],
            'n_entities': row['n_entities'],
            'total_results': row['total_results'],
            'hover': (
                f"α={int(row['alpha'])}, β={int(row['beta'])}<br>"
                f"Courses: {int(row['n_courses']):,}<br>"
                f"Runners: {int(row['n_participants']):,}<br>"
                f"Total Entities: {int(row['n_entities']):,}<br>"
                f"Total Results: {int(row['total_results']):,}<br>"
                f"Overall Density: {row['density']:.1f}<br>"
                f"Runner completeness: {row['avg_runner_completeness']:.1f}%<br>"
                f"Course completeness: {row['avg_course_completeness']:.1f}%"
            )
        }

    # Determine color scale ranges for each metric
    min_courses = df['n_courses'].min()
    max_courses = df['n_courses'].max()
    min_entities = df['n_entities'].min()
    max_entities = df['n_entities'].max()
    min_results = df['total_results'].min()
    max_results = df['total_results'].max()

    fig = go.Figure()

    # Helper function to create traces for a given z metric and color metric
    def create_traces(z_key, color_key):
        traces = []

        # Get color metric min/max
        if color_key == 'n_courses':
            color_min, color_max = min_courses, max_courses
        elif color_key == 'n_entities':
            color_min, color_max = min_entities, max_entities
        else:  # total_results
            color_min, color_max = min_results, max_results

        # Draw horizontal grid lines (constant alpha, varying beta)
        for alpha in alpha_values:
            x_line, y_line, z_line, color_line, hover_line = [], [], [], [], []
            for beta in beta_values:
                if (alpha, beta) in grid_data:
                    point = grid_data[(alpha, beta)]
                    x_line.append(point['x'])
                    y_line.append(point['y'])
                    z_line.append(point[z_key])
                    color_line.append(point[color_key])
                    hover_line.append(point['hover'])

            if len(x_line) > 1:  # Only draw if we have multiple points
                # Use average color for the line segment (very light to very dark blue)
                avg_color = sum(color_line) / len(color_line)
                normalized_color = (avg_color - color_min) / (color_max - color_min) if color_max > color_min else 0

                # Very light blue (224, 242, 255) to very dark blue (0, 30, 80)
                r = int(224 + normalized_color * (0 - 224))
                g = int(242 + normalized_color * (30 - 242))
                b = int(255 + normalized_color * (80 - 255))

                traces.append(go.Scatter3d(
                    x=x_line,
                    y=y_line,
                    z=z_line,
                    mode='lines',
                    line=dict(
                        color=f'rgb({r}, {g}, {b})',
                        width=2
                    ),
                    text=hover_line,
                    hoverinfo='text',
                    showlegend=False,
                    visible=False  # Will be set to True for initial z metric
                ))

        # Draw vertical grid lines (constant beta, varying alpha)
        for beta in beta_values:
            x_line, y_line, z_line, color_line, hover_line = [], [], [], [], []
            for alpha in alpha_values:
                if (alpha, beta) in grid_data:
                    point = grid_data[(alpha, beta)]
                    x_line.append(point['x'])
                    y_line.append(point['y'])
                    z_line.append(point[z_key])
                    color_line.append(point[color_key])
                    hover_line.append(point['hover'])

            if len(x_line) > 1:  # Only draw if we have multiple points
                # Use average color for the line segment (very light to very dark blue)
                avg_color = sum(color_line) / len(color_line)
                normalized_color = (avg_color - color_min) / (color_max - color_min) if color_max > color_min else 0

                # Very light blue (224, 242, 255) to very dark blue (0, 30, 80)
                r = int(224 + normalized_color * (0 - 224))
                g = int(242 + normalized_color * (30 - 242))
                b = int(255 + normalized_color * (80 - 255))

                traces.append(go.Scatter3d(
                    x=x_line,
                    y=y_line,
                    z=z_line,
                    mode='lines',
                    line=dict(
                        color=f'rgb({r}, {g}, {b})',
                        width=2
                    ),
                    text=hover_line,
                    hoverinfo='text',
                    showlegend=False,
                    visible=False  # Will be set to True for initial z metric
                ))

        return traces

    # Create traces for all 9 combinations (3 z-metrics × 3 color-metrics)
    trace_sets = {}
    for z_key in ['z_courses', 'z_entities', 'z_total_results']:
        for color_key in ['n_courses', 'n_entities', 'total_results']:
            trace_sets[(z_key, color_key)] = create_traces(z_key, color_key)

    # Add all traces (start with courses z-axis, total_results coloring)
    for (z_key, color_key), traces in trace_sets.items():
        for trace in traces:
            if z_key == 'z_courses' and color_key == 'total_results':
                trace.visible = True
            fig.add_trace(trace)

    # Add a single invisible trace for the colorbar
    fig.add_trace(go.Scatter3d(
        x=[df['avg_course_completeness'].iloc[0]],
        y=[df['avg_runner_completeness'].iloc[0]],
        z=[df['n_courses'].iloc[0]],
        mode='markers',
        marker=dict(
            size=0.1,
            color=[min_results],
            colorscale=[[0, 'rgb(224, 242, 255)'], [1, 'rgb(0, 30, 80)']],
            showscale=True,
            colorbar=dict(
                title="Total Results",
                x=1.1,
                tickformat=','
            ),
            cmin=min_results,
            cmax=max_results
        ),
        showlegend=False,
        hoverinfo='skip',
        visible=True
    ))

    # Calculate number of traces per combination
    n_traces_per_combo = len(trace_sets[('z_courses', 'n_courses')])

    # Create visibility arrays for toggling
    def make_visible_array(z_metric, color_metric):
        """Create visibility array for all traces."""
        visible = []
        for (z_key, color_key) in trace_sets.keys():
            z_match = (z_key == f'z_{z_metric}')
            color_match = (color_key == color_metric)
            visible.extend([z_match and color_match] * n_traces_per_combo)
        # Colorbar trace
        visible.append(True)
        return visible

    # Get current color metric name for colorbar updates
    def get_color_title(color_metric):
        if color_metric == 'n_courses':
            return "Course Count"
        elif color_metric == 'n_entities':
            return "Total Entities"
        else:
            return "Total Results"

    def get_color_range(color_metric):
        if color_metric == 'n_courses':
            return [min_courses, max_courses]
        elif color_metric == 'n_entities':
            return [min_entities, max_entities]
        else:
            return [min_results, max_results]

    # Update layout
    fig.update_layout(
        title=dict(
            text='Interactive 3D K-Core Parameter Space Wireframe',
            x=0.5,
            xanchor='center',
            font=dict(size=18)
        ),
        scene=dict(
            xaxis=dict(
                title='Course Completeness',
                backgroundcolor="rgb(230, 230, 230)",
                gridcolor="white",
                ticksuffix='%'
            ),
            yaxis=dict(
                title='Runner Completeness',
                backgroundcolor="rgb(230, 230, 230)",
                gridcolor="white",
                ticksuffix='%'
            ),
            zaxis=dict(
                title='Course Count',
                backgroundcolor="rgb(230, 230, 230)",
                gridcolor="white",
                separatethousands=True
            ),
            camera=dict(
                eye=dict(x=1.5, y=-1.5, z=1.3)  # Runner completeness on right, course on left
            )
        ),
        width=1000,
        height=850,
        hovermode='closest',
        updatemenus=[
            # Z-axis toggle buttons
            dict(
                type="buttons",
                direction="left",
                buttons=[
                    dict(
                        label="Z-axis: Courses",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'total_results')},
                            {
                                "scene.zaxis.title": "Course Count",
                                "scene.zaxis.range": [0, df['n_courses'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                    dict(
                        label="Z-axis: Total Entities",
                        method="update",
                        args=[
                            {"visible": make_visible_array('entities', 'total_results')},
                            {
                                "scene.zaxis.title": "Total Entities (Courses + Runners)",
                                "scene.zaxis.range": [0, df['n_entities'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                    dict(
                        label="Z-axis: Total Results",
                        method="update",
                        args=[
                            {"visible": make_visible_array('total_results', 'total_results')},
                            {
                                "scene.zaxis.title": "Total Results",
                                "scene.zaxis.range": [0, df['total_results'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                ],
                x=0.5,
                xanchor="center",
                y=1.08,
                yanchor="top",
                bgcolor="rgba(255, 255, 255, 0.8)",
                bordercolor="rgba(100, 100, 100, 0.5)",
                borderwidth=1
            ),
            # Color mapping toggle buttons
            dict(
                type="buttons",
                direction="left",
                buttons=[
                    dict(
                        label="Color: Courses",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'n_courses')},
                            {
                                "marker.colorbar.title": "Course Count",
                                "marker.cmin": min_courses,
                                "marker.cmax": max_courses
                            }
                        ]
                    ),
                    dict(
                        label="Color: Total Entities",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'n_entities')},
                            {
                                "marker.colorbar.title": "Total Entities",
                                "marker.cmin": min_entities,
                                "marker.cmax": max_entities
                            }
                        ]
                    ),
                    dict(
                        label="Color: Total Results",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'total_results')},
                            {
                                "marker.colorbar.title": "Total Results",
                                "marker.cmin": min_results,
                                "marker.cmax": max_results
                            }
                        ]
                    ),
                ],
                x=0.5,
                xanchor="center",
                y=1.03,
                yanchor="top",
                bgcolor="rgba(255, 255, 255, 0.8)",
                bordercolor="rgba(100, 100, 100, 0.5)",
                borderwidth=1
            )
        ]
    )

    fig.show()

# Visualize the parameter space with interactive 3D plot
plot_3d_kcore_space(candidates_df=candidates_df, full_results=results)
```
Examining the (3, 840) K-Core
Based on our exploration, we select the (α=3, β=840) configuration for downstream modeling. This choice balances several considerations:
- α=3: Runners must have completed at least 3 different course types. This ensures we have enough repeated observations per runner to estimate individual ability, while not being so restrictive that we lose casual participants.
- β=840: Courses must have at least 840 unique runners. This focuses on well-established events with substantial participation—races where we can reliably estimate course difficulty and where DNF reporting is likely to be consistent.
The resulting subset represents the "core" ultramarathon community: dedicated runners who return to the sport repeatedly, competing in popular events that form the backbone of the ultra calendar.
Visualizing the Bipartite Structure
With our k-core selected, we can now visualize the runner-course relationships. These visualizations reveal the community structure hidden within the data—clusters of runners who share similar race portfolios, courses that attract overlapping populations, and the overall connectivity patterns that make comparative inference possible.
Building the Subset
We extract all results matching our k-core criteria and compute basic network statistics. The degree of a runner is the number of unique courses they've completed; the degree of a course is its number of unique finishers. These degree distributions tell us about the heterogeneity within our subset—are there a few "super-connectors" or is participation relatively uniform?
```python
# Create (3, 840) k-core subset
alpha, beta = 3, 840

# Ensure fresh copy without any previous course_id columns
results_clean = results.drop(columns=['course_id'], errors='ignore')
kcore_subset = subset_kcore_data(results_clean, alpha=alpha, beta=beta)
kcore_data = kcore_subset[kcore_subset['in_kcore']].copy()

# Get degree statistics: unique courses per runner and unique runners per course
runner_degrees = kcore_data.groupby('participant_id')['event_distance_id'].nunique()
course_degrees = kcore_data.groupby('event_distance_id')['participant_id'].nunique()

# Use ALL runners in k-core (no sampling)
kcore_sample = kcore_data.copy()

# Count edges (participation frequency per runner-course pair)
edge_counts = (
    kcore_sample
    .groupby(['participant_id', 'event_distance_id'])
    .size()
    .reset_index(name='edge_weight')
)

print(f"K-core (α={alpha}, β={beta}) statistics:")
print(f"  Total courses: {kcore_data['event_distance_id'].nunique()}")
print(f"  Total runners: {kcore_data['participant_id'].nunique()}")
print(f"  Total results: {len(kcore_data):,}")
print(f"  Unique edges: {len(edge_counts)}")
print(f"  Max edge weight: {edge_counts['edge_weight'].max()}")
```
K-core (α=3, β=840) statistics:
Total courses: 367
Total runners: 3061
Total results: 31,490
Unique edges: 31490
Max edge weight: 1
Adjacency Matrix Heatmap
The adjacency matrix provides a bird's-eye view of runner-course relationships. Each row is a runner, each column is a course, and cells are colored by percentile finish time (darker = faster relative to that course's field).
We apply hierarchical clustering to both rows and columns, reordering them to reveal latent structure. Look for:
- Block patterns: Groups of runners who share similar race portfolios (they cluster together vertically) and tend to perform similarly across those races
- Color gradients: Runners who consistently finish in similar percentiles across different courses (horizontal color consistency indicates stable relative ability)
- Sparse regions: The gaps remind us that even within our dense k-core, not every runner-course combination is observed
This matrix shows only the top 100 runners by degree for readability, but the patterns extend throughout the full dataset.
```python
import plotly.express as px
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Sample top 100 runners for adjacency matrix visualization
top_runners = runner_degrees.nlargest(100).index
kcore_matrix_sample = kcore_sample[kcore_sample['participant_id'].isin(top_runners)]

# Create adjacency matrix with percentile finish times
matrix_data = kcore_matrix_sample.groupby(['participant_id', 'event_distance_id'])['time_ms'].mean().reset_index()

# Calculate percentile for each course
percentile_data = []
for event_id in matrix_data['event_distance_id'].unique():
    # Get all finish times for this course
    course_times = kcore_matrix_sample[kcore_matrix_sample['event_distance_id'] == event_id]['time_ms']

    # Get participant times for this course
    participant_times = matrix_data[matrix_data['event_distance_id'] == event_id].copy()

    # Calculate percentile rank (0-100)
    participant_times['percentile'] = participant_times['time_ms'].rank(pct=True) * 100

    percentile_data.append(participant_times)

# Combine all percentiles
percentile_df = pd.concat(percentile_data, ignore_index=True)

# Create matrix with percentiles
matrix = percentile_df.pivot(index='participant_id', columns='event_distance_id', values='percentile')

# Fill NaN values with median for clustering (represents missing races)
matrix_filled = matrix.fillna(matrix.median().median())

# Perform hierarchical clustering on rows (runners)
row_linkage = linkage(pdist(matrix_filled, metric='euclidean'), method='ward')
row_dendrogram = dendrogram(row_linkage, no_plot=True)
row_order = row_dendrogram['leaves']

# Perform hierarchical clustering on columns (courses)
col_linkage = linkage(pdist(matrix_filled.T, metric='euclidean'), method='ward')
col_dendrogram = dendrogram(col_linkage, no_plot=True)
col_order = col_dendrogram['leaves']

# Reorder matrix using clustering results
matrix_clustered = matrix.iloc[row_order, col_order]

# Get runner names for hover text
runner_names = kcore_matrix_sample[['participant_id', 'firstname', 'lastname']].drop_duplicates('participant_id').set_index('participant_id')
runner_names['full_name'] = runner_names['firstname'] + ' ' + runner_names['lastname']

# Create custom hover text with runner names
matrix_clustered_with_names = matrix_clustered.copy()
matrix_clustered_with_names.index = [
    runner_names.loc[pid, 'full_name'] if pid in runner_names.index else str(pid)
    for pid in matrix_clustered.index
]

# Get race names for column labels - keep event_distance_id to ensure uniqueness
race_info = kcore_matrix_sample[['event_distance_id', 'name']].drop_duplicates('event_distance_id').set_index('event_distance_id')
# Create unique labels by combining name with event_distance_id
matrix_clustered_with_names.columns = [
    f"{race_info.loc[col, 'name'][:25]}... (ID:{col})" if len(race_info.loc[col, 'name']) > 25
    else f"{race_info.loc[col, 'name']} (ID:{col})"
    for col in matrix_clustered_with_names.columns
]

fig = px.imshow(
    matrix_clustered_with_names,
    aspect='auto',
    color_continuous_scale='Viridis',
    labels=dict(color="Percentile", x="Course", y="Runner"),
    title=f"Runner × Course Adjacency Matrix (α={alpha}, β={beta}, Top 100 Runners)<br><sub>Color = Percentile Finish Time (higher = slower) | Hierarchically Clustered</sub>"
)

fig.update_layout(
    width=1200,
    height=800,
    xaxis=dict(tickangle=-45, tickfont=dict(size=8), showgrid=False, showline=False, showticklabels=False, title=''),
    yaxis=dict(tickfont=dict(size=6), showgrid=False, showline=False, showticklabels=False, title=''),
    plot_bgcolor='white',
    paper_bgcolor='white'
)

fig.show()
```
Force-Directed Network
The final visualization presents the k-core as a force-directed network graph. This physics-inspired layout treats edges as springs and nodes as mutually-repelling particles, letting the graph "relax" into a natural arrangement where connected nodes cluster together.
Super-Node Compression
With thousands of runners, a naive visualization would be overwhelming. We address this by creating super-nodes: runners with identical race participation patterns are merged into a single node. This is semantically meaningful—runners who've done exactly the same set of races are essentially interchangeable from a network topology perspective.
The resulting graph reveals:
- Central hubs: Courses that attract runners from across the community (high connectivity)
- Peripheral clusters: Regional or niche race circuits with dedicated but isolated participant pools
- Bridge nodes: Runners or courses that connect otherwise-separate sub-communities
Interactive features: Hover over any node to highlight its connections. Course nodes appear in coral; runner groups in the Viridis colorscale (colored by degree).
```python
# Create super-nodes: group runners with identical race participation patterns
# This reduces graph complexity while preserving structure

# Build race participation signatures for each runner
runner_signatures = {}
for runner_id in kcore_sample['participant_id'].unique():
    races = frozenset(kcore_sample[kcore_sample['participant_id'] == runner_id]['event_distance_id'])
    runner_signatures[runner_id] = races

# Invert to create super-nodes: signature -> set of runners
unique_signatures = {}
for runner_id, sig in runner_signatures.items():
    if sig not in unique_signatures:
        unique_signatures[sig] = set()
    unique_signatures[sig].add(runner_id)

# Get unique courses
unique_courses = sorted(kcore_sample['event_distance_id'].unique())

# Get runner and race names for hover text
runner_names = kcore_sample[['participant_id', 'firstname', 'lastname']].drop_duplicates('participant_id')
runner_names = {row['participant_id']: f"{row['firstname']} {row['lastname']}" for _, row in runner_names.iterrows()}

race_names = kcore_sample[['event_distance_id', 'name']].drop_duplicates('event_distance_id')
race_names = {row['event_distance_id']: row['name'] for _, row in race_names.iterrows()}

print(f"Created {len(unique_signatures):,} super-nodes from {len(runner_signatures):,} runners")
print(f"Average group size: {len(runner_signatures) / len(unique_signatures):.1f} runners per super-node")
print(f"Largest group: {max(len(runners) for runners in unique_signatures.values())} runners")
print(f"Total courses: {len(unique_courses)}")
```
Created 2,952 super-nodes from 3,061 runners
Average group size: 1.0 runners per super-node
Largest group: 9 runners
Total courses: 367
```python
import igraph as ig
import plotly.graph_objects as go
import numpy as np
import time
import json

start_time = time.time()

# Create bipartite graph from super-nodes
edges = []
edge_weights = {}

# Build runner index to super-node mapping
runner_to_super = {}
for super_idx, (sig, runners) in enumerate(unique_signatures.items()):
    for runner_id in runners:
        runner_to_super[runner_id] = super_idx

# Use dict for O(1) course lookups instead of list.index()
course_to_idx = {course_id: idx for idx, course_id in enumerate(unique_courses)}

# Build edges from edge counts
for _, row in edge_counts.iterrows():
    runner_id = row['participant_id']
    course_id = row['event_distance_id']

    if runner_id in runner_to_super:
        super_idx = runner_to_super[runner_id]
        # CRITICAL: Offset course index by number of super-nodes for bipartite graph
        course_idx = len(unique_signatures) + course_to_idx[course_id]
        edge_key = (super_idx, course_idx)
        edge_weights[edge_key] = edge_weights.get(edge_key, 0) + row['edge_weight']

# Create edges list
edges = list(edge_weights.keys())

# Create bipartite graph
layout_start = time.time()
g = ig.Graph.Bipartite(
    [0] * len(unique_signatures) + [1] * len(unique_courses),
    edges
)

# Compute layouts
layouts = {}

# Fruchterman-Reingold with maximum overlap prevention
# The 'grid' parameter is the key to preventing overlap
fr_start = time.time()
base_layout = g.layout_fruchterman_reingold(
    niter=10000,  # Many iterations for convergence
    grid='auto'   # Spatial grid prevents overlap - this is the critical parameter
)

layouts['Fruchterman-Reingold'] = base_layout

# Get node degrees
degrees = g.degree()

# Prepare visualization data
runner_names_list = []
for sig in unique_signatures.keys():
    runners = unique_signatures[sig]
    if len(runners) == 1:
        runner_id = list(runners)[0]
        runner_names_list.append(runner_names.get(runner_id, f"Runner {runner_id}"))
    else:
        runner_names_list.append(f"{len(runners)} runners")

course_names_list = [race_names.get(c, f"Course {c}") for c in unique_courses]

# Create figure with optimized dual-trace edge rendering
fig = go.Figure()

def add_optimized_layout(fig, layout):
    """Add layout with optimized edge rendering using dual traces."""
    pos_x = [coord[0] for coord in layout.coords]
    pos_y = [coord[1] for coord in layout.coords]

    # Build all edge coordinates and connectivity mapping
    edge_x = []
    edge_y = []
    node_to_segments = {}  # Maps node index -> list of segment indices
    node_neighbors = {}    # Maps node index -> list of connected node indices

    segment_idx = 0
    for edge_idx, edge in enumerate(g.es):
        source, target = edge.tuple

        # Track which segments connect to each node
        if source not in node_to_segments:
            node_to_segments[source] = []
            node_neighbors[source] = []
        if target not in node_to_segments:
            node_to_segments[target] = []
            node_neighbors[target] = []

        node_to_segments[source].append(segment_idx)
        node_to_segments[target].append(segment_idx)
        node_neighbors[source].append(target)
        node_neighbors[target].append(source)

        # Add edge coordinates (x1, x2, None for line break)
        edge_x.extend([pos_x[source], pos_x[target], None])
        edge_y.extend([pos_y[source], pos_y[target], None])

        segment_idx += 1

    # Add base edges trace (always visible, low opacity)
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=edge_x,
        y=edge_y,
        mode='lines',
        line=dict(width=0.5, color='rgba(200,200,200,0.3)'),
        hoverinfo='skip',  # Prevent hover on edges
        name='base_edges',
        showlegend=False
    ))

    # Add highlight edges trace (initially empty, updated by JavaScript)
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[],
        y=[],
        mode='lines',
        line=dict(width=2, color='rgba(255,50,50,0.9)'),
        hoverinfo='skip',
        name='highlight_edges',
        showlegend=False
    ))

    # Prepare node data
    super_node_indices = list(range(len(unique_signatures)))
    course_indices = list(range(len(unique_signatures), len(unique_signatures) + len(unique_courses)))

    # Calculate super-node sizes based on group size (even smaller)
    super_node_sizes = []
    super_node_degrees = []
    super_node_hover_text = []

    for idx, (sig, runners) in enumerate(unique_signatures.items()):
        group_size = len(runners)
        # Reduced base size and scaling factor for smaller nodes
        super_node_sizes.append(3 + np.log1p(group_size) * 2)
        super_node_degrees.append(degrees[idx])

        if group_size == 1:
            runner_id = list(runners)[0]
            name = runner_names.get(runner_id, f"Runner {runner_id}")
            super_node_hover_text.append(f"{name}<br>Degree: {degrees[idx]}")
        else:
            sample_names = [runner_names.get(r, f"Runner {r}") for r in list(runners)[:3]]
            hover = f"Group of {group_size} runners<br>" + "<br>".join(sample_names)
            if group_size > 3:
                hover += f"<br>... and {group_size - 3} more"
            hover += f"<br>Degree: {degrees[idx]}"
            super_node_hover_text.append(hover)

    # Course hover text
    course_hover_text = [
        f"{course_names_list[i]}<br>Degree: {degrees[len(unique_signatures) + i]}"
        for i in range(len(unique_courses))
    ]

    # Calculate course sizes with even more aggressive sub-linear scaling
    course_sizes = [2 + np.log1p(degrees[i]) * 1.5 for i in course_indices]

    # Add super-node trace
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[pos_x[i] for i in super_node_indices],
        y=[pos_y[i] for i in super_node_indices],
        mode='markers',
        marker=dict(
            size=super_node_sizes,
            color=super_node_degrees,
            colorscale='Viridis',
            showscale=False,
            symbol='circle',
            opacity=0.6,
            line=dict(width=0)
        ),
        text=super_node_hover_text,
        hoverinfo='text',
        name='Runner Groups',
        showlegend=False,
        customdata=[[i] for i in super_node_indices]
    ))

    # Add course node trace with sub-linear size scaling
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[pos_x[i] for i in course_indices],
        y=[pos_y[i] for i in course_indices],
        mode='markers',
        marker=dict(
            size=course_sizes,
            color='coral',
            symbol='circle',
            opacity=0.6,
            line=dict(width=0)
        ),
        text=course_hover_text,
        hoverinfo='text',
        name='Courses',
        showlegend=False,
        customdata=[[i] for i in course_indices]
    ))

    # Add highlight runner trace (initially empty, will show highlighted runners on top)
    fig.add_trace(go.Scattergl(
        x=[],
        y=[],
        mode='markers',
        marker=dict(
            size=[],
            color=[],
            colorscale='Viridis',
            symbol='circle',
            opacity=1.0,
            line=dict(width=2, color='white')
        ),
        text=[],
        hoverinfo='text',
        name='highlight_runners',
        showlegend=False
    ))

    # Add highlight course trace (initially empty, will show highlighted courses on top)
    fig.add_trace(go.Scattergl(
        x=[],
        y=[],
        mode='markers',
        marker=dict(
            size=[],
            color='coral',
            symbol='circle',
            opacity=1.0,
            line=dict(width=2, color='white')
        ),
        text=[],
        hoverinfo='text',
        name='highlight_courses',
        showlegend=False
    ))

    # Return connectivity mapping, edge coordinates, neighbor mapping, and all positions
    return node_to_segments, edge_x, edge_y, node_neighbors, pos_x, pos_y, super_node_sizes, super_node_degrees, super_node_hover_text, course_hover_text, course_sizes

# Add visualization
node_to_segs, base_edge_x, base_edge_y, node_neighbors, pos_x, pos_y, runner_sizes, runner_degrees, runner_hover, course_hover, course_sizes = add_optimized_layout(fig, layouts['Fruchterman-Reingold'])
connectivity_data = {
    'node_to_segments': {str(k): v for k, v in node_to_segs.items()},
    'node_neighbors': {str(k): v for k, v in node_neighbors.items()},
    'base_edge_x': base_edge_x,
    'base_edge_y': base_edge_y,
    'pos_x': pos_x,
    'pos_y': pos_y,
    'runner_sizes': runner_sizes,
    'runner_degrees': runner_degrees,
    'runner_hover': runner_hover,
    'course_hover': course_hover,
    'course_sizes': course_sizes
}

# Update layout
fig.update_layout(
    title=f"Bipartite Network: Runner Groups × Courses (α={alpha}, β={beta})<br><sub>10k iterations with grid='auto' for overlap prevention + smaller nodes</sub>",
    showlegend=False,
    hovermode='closest',
    width=1200,
    height=700,
    plot_bgcolor='white',
    uirevision='constant'
)

# Remove axes
fig.update_xaxes(showgrid=False, zeroline=False, showticklabels=False)
fig.update_yaxes(showgrid=False, zeroline=False, showticklabels=False)

# Generate HTML with Plotly.update() for notebook compatibility
import plotly.io as pio
from IPython.display import HTML, display

html_str = pio.to_html(fig, include_plotlyjs='cdn', full_html=False)

# Enhanced: Bring connected nodes to top with borders
optimized_js = f"""
<div id="connectivity-data" style="display:none;">{json.dumps(connectivity_data)}</div>
<script>
(function() {{
    function initOptimizedHover() {{
        var allPlotlyDivs = document.querySelectorAll('.plotly-graph-div');
        var gd = allPlotlyDivs[allPlotlyDivs.length - 1];

        if (!gd || !gd.data) {{
            setTimeout(initOptimizedHover, 100);
            return;
        }}

        // Load connectivity data
        var connectivityData = JSON.parse(document.getElementById('connectivity-data').textContent);
        var nodeToSegments = connectivityData.node_to_segments;
        var nodeNeighbors = connectivityData.node_neighbors;
        var cachedBaseX = connectivityData.base_edge_x;
        var cachedBaseY = connectivityData.base_edge_y;
        var posX = connectivityData.pos_x;
        var posY = connectivityData.pos_y;
        var runnerSizes = connectivityData.runner_sizes;
        var runnerDegrees = connectivityData.runner_degrees;
        var runnerHover = connectivityData.runner_hover;
        var courseHover = connectivityData.course_hover;
        var courseSizes = connectivityData.course_sizes;

        // Find trace indices
        var baseTraceIdx = -1;
        var highlightTraceIdx = -1;
        var runnerTraceIdx = -1;
        var courseTraceIdx = -1;
        var highlightRunnerTraceIdx = -1;
        var highlightCourseTraceIdx = -1;

        gd.data.forEach(function(trace, i) {{
            if (trace.name === 'base_edges') baseTraceIdx = i;
            else if (trace.name === 'highlight_edges') highlightTraceIdx = i;
            else if (trace.name === 'Runner Groups') runnerTraceIdx = i;
            else if (trace.name === 'Courses') courseTraceIdx = i;
            else if (trace.name === 'highlight_runners') highlightRunnerTraceIdx = i;
            else if (trace.name === 'highlight_courses') highlightCourseTraceIdx = i;
        }});

        if (baseTraceIdx === -1 || highlightTraceIdx === -1) return;

        // Store original node data
        var originalRunnerData = {{
            marker: {{
                opacity: gd.data[runnerTraceIdx].marker.opacity
            }}
        }};

        var originalCourseData = {{
            marker: {{
                opacity: gd.data[courseTraceIdx].marker.opacity
            }}
        }};

        // Recursive handler attachment function
        function attachHoverHandlers(gd, highlightTraceIdx) {{
            gd.removeAllListeners('plotly_hover');
            gd.removeAllListeners('plotly_unhover');

            // Hover handler
            gd.on('plotly_hover', function(data) {{
                if (!data.points || data.points.length === 0) return;

                var point
```
= data.points[0]; 354 if (point.data.mode !== 'markers' || !point.data.customdata) return; 355 356 var nodeIdx = point.data.customdata[point.pointIndex][0]; 357 var connectedSegments = nodeToSegments[nodeIdx.toString()] || []; 358 var neighbors = nodeNeighbors[nodeIdx.toString()] || []; 359 360 if (connectedSegments.length === 0) return; 361 362 // Build highlight coordinates 363 var highlightX = []; 364 var highlightY = []; 365 for (var i = 0; i < connectedSegments.length; i++) {{ 366 var segIdx = connectedSegments[i]; 367 var startIdx = segIdx * 3; 368 highlightX.push(cachedBaseX[startIdx], cachedBaseX[startIdx + 1], null); 369 highlightY.push(cachedBaseY[startIdx], cachedBaseY[startIdx + 1], null); 370 }} 371 372 // Dim original nodes 373 var runnerOpacity = new Array(gd.data[runnerTraceIdx].x.length).fill(0.2); 374 var courseOpacity = new Array(gd.data[courseTraceIdx].x.length).fill(0.2); 375 376 // Build highlighted node data (these will appear on top with borders) 377 var highlightRunnerX = []; 378 var highlightRunnerY = []; 379 var highlightRunnerSizes = []; 380 var highlightRunnerColors = []; 381 var highlightRunnerText = []; 382 383 var highlightCourseX = []; 384 var highlightCourseY = []; 385 var highlightCourseSizes = []; 386 var highlightCourseText = []; 387 388 neighbors.forEach(function(neighborIdx) {{ 389 if (neighborIdx < gd.data[runnerTraceIdx].x.length) {{ 390 // It's a runner - add to highlight trace 391 highlightRunnerX.push(posX[neighborIdx]); 392 highlightRunnerY.push(posY[neighborIdx]); 393 highlightRunnerSizes.push(runnerSizes[neighborIdx]); 394 highlightRunnerColors.push(runnerDegrees[neighborIdx]); 395 highlightRunnerText.push(runnerHover[neighborIdx]); 396 }} else {{ 397 // It's a course - add to highlight trace 398 var courseArrayIdx = neighborIdx - gd.data[runnerTraceIdx].x.length; 399 highlightCourseX.push(posX[neighborIdx]); 400 highlightCourseY.push(posY[neighborIdx]); 401 highlightCourseSizes.push(courseSizes[courseArrayIdx]); 402 
highlightCourseText.push(courseHover[courseArrayIdx]); 403 }} 404 }}); 405 406 // Update all traces 407 Plotly.update(gd, 408 {{ 409 x: [highlightX], 410 y: [highlightY] 411 }}, 412 {{}}, 413 [highlightTraceIdx] 414 ).then(function() {{ 415 return Plotly.restyle(gd, {{'marker.opacity': runnerOpacity}}, [runnerTraceIdx]); 416 }}).then(function() {{ 417 return Plotly.restyle(gd, {{'marker.opacity': courseOpacity}}, [courseTraceIdx]); 418 }}).then(function() {{ 419 return Plotly.restyle(gd, {{ 420 x: [highlightRunnerX], 421 y: [highlightRunnerY], 422 'marker.size': [highlightRunnerSizes], 423 'marker.color': [highlightRunnerColors], 424 text: [highlightRunnerText] 425 }}, [highlightRunnerTraceIdx]); 426 }}).then(function() {{ 427 return Plotly.restyle(gd, {{ 428 x: [highlightCourseX], 429 y: [highlightCourseY], 430 'marker.size': [highlightCourseSizes], 431 text: [highlightCourseText] 432 }}, [highlightCourseTraceIdx]); 433 }}).then(function() {{ 434 attachHoverHandlers(gd, highlightTraceIdx); 435 }}); 436 }}); 437 438 // Unhover handler 439 gd.on('plotly_unhover', function() {{ 440 // Reset everything 441 Plotly.update(gd, 442 {{ 443 x: [[]], 444 y: [[]] 445 }}, 446 {{}}, 447 [highlightTraceIdx] 448 ).then(function() {{ 449 return Plotly.restyle(gd, {{ 450 'marker.opacity': originalRunnerData.marker.opacity 451 }}, [runnerTraceIdx]); 452 }}).then(function() {{ 453 return Plotly.restyle(gd, {{ 454 'marker.opacity': originalCourseData.marker.opacity 455 }}, [courseTraceIdx]); 456 }}).then(function() {{ 457 return Plotly.restyle(gd, {{ 458 x: [[]], 459 y: [[]], 460 'marker.size': [[]], 461 'marker.color': [[]], 462 text: [[]] 463 }}, [highlightRunnerTraceIdx]); 464 }}).then(function() {{ 465 return Plotly.restyle(gd, {{ 466 x: [[]], 467 y: [[]], 468 'marker.size': [[]], 469 text: [[]] 470 }}, [highlightCourseTraceIdx]); 471 }}).then(function() {{ 472 attachHoverHandlers(gd, highlightTraceIdx); 473 }}); 474 }}); 475 }} 476 477 // Initial handler attachment 478 
attachHoverHandlers(gd, highlightTraceIdx); 479 }} 480 481 if (document.readyState === 'loading') {{ 482 document.addEventListener('DOMContentLoaded', initOptimizedHover); 483 }} else {{ 484 initOptimizedHover(); 485 }} 486 }})(); 487 </script> 488 """ 489 490 # Combine and display 491 full_html = f""" 492 <div> 493 {html_str} 494 {optimized_js} 495 </div> 496 """ 497 498 display(HTML(full_html))
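One detail in the hover handler is worth calling out: the `segIdx * 3` indexing. Plotly can render many disconnected edges as a single line trace when consecutive segments are separated by `None`, so each edge occupies exactly three slots (`x0`, `x1`, `None`) in the flattened coordinate arrays. A minimal sketch of that layout, with toy coordinates and a hypothetical helper name:

```python
# Each edge contributes (x0, x1, None) to the flattened arrays, so edge i
# starts at index i * 3 -- the same indexing the hover handler relies on.
edges = [((0.0, 0.0), (1.0, 1.0)), ((2.0, 0.0), (2.0, 3.0))]  # toy edge list

edge_x, edge_y = [], []
for (x0, y0), (x1, y1) in edges:
    edge_x.extend([x0, x1, None])  # None breaks the line between segments
    edge_y.extend([y0, y1, None])

def segment_coords(seg_idx, xs, ys):
    """Recover one edge's endpoints from the flattened coordinate arrays."""
    start = seg_idx * 3
    return (xs[start], ys[start]), (xs[start + 1], ys[start + 1])

print(segment_coords(1, edge_x, edge_y))  # ((2.0, 0.0), (2.0, 3.0))
```

Precomputing these arrays once and slicing them per hover is what keeps highlighting fast even with thousands of edges: no graph traversal happens in the browser, only array lookups.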
Summary & Next Steps
We've successfully identified a densely connected subset of the ultramarathon universe using k-core decomposition. The (3, 840)-core balances data retention with the connectivity needed for comparative inference, giving us a foundation for the modeling work ahead.
Key takeaways:
- The full ultramarathon dataset is extremely sparse—most runners and courses share few direct connections
- K-core decomposition provides a principled way to extract a dense, well-connected subset
- The selected subset represents the "core" ultramarathon community: repeat participants at established events
- Visualizations reveal meaningful structure: runner clusters, hub courses, and community topology
What's next? In the following notebooks, we'll use this k-core subset to:
- Model finish times: Estimate runner abilities and course difficulties using Bayesian hierarchical models
- Analyze DNF patterns: Understand which factors predict whether a runner finishes vs. drops out
- Identify reporting artifacts: Distinguish true DNFs from data collection issues
The dense connectivity of our k-core ensures that runner and course effects can be properly identified—the foundation for all downstream inference.
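That identifiability claim can be checked directly. In an additive model where log finish time is runner ability plus course difficulty, the design matrix of a connected runner-course graph loses exactly one rank (the arbitrary split of a constant between abilities and difficulties), and each additional disconnected component costs one more. A small numpy sketch with toy indices and a hypothetical helper:

```python
import numpy as np

def design_matrix(pairs, n_runners, n_courses):
    """One row per result: indicator for the runner plus indicator for the course."""
    X = np.zeros((len(pairs), n_runners + n_courses))
    for row, (r, c) in enumerate(pairs):
        X[row, r] = 1.0
        X[row, n_runners + c] = 1.0
    return X

connected = [(0, 0), (0, 1), (1, 1), (2, 0)]  # one connected component
split = [(0, 0), (1, 0), (2, 1)]              # runner 2 shares no course with runners 0, 1

Xc = design_matrix(connected, 3, 2)
Xs = design_matrix(split, 3, 2)

# One null direction (shift a constant from abilities to difficulties) is
# unavoidable; every extra component adds another.
print(np.linalg.matrix_rank(Xc))  # 4 = (3 + 2) - 1: identifiable up to one constant
print(np.linalg.matrix_rank(Xs))  # 3 = (3 + 2) - 2: an extra unidentifiable constant
```

This is why the k-core step precedes the modeling: a dense, connected subset keeps the rank deficiency to the single benign constant that a prior or a reference level can pin down.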