ShipSleuth: Public GitHub diligence

Methodology

How ShipSleuth turns messy public GitHub activity into an honest DD read.

The goal is not to pretend public GitHub is the whole truth. The goal is to make the visible surface more legible, comparable, and harder to misuse.

What ShipSleuth measures

  • Public, owned GitHub repositories for an org or user
  • Default-branch commits inside the selected date window
  • Public PRs opened or merged inside the selected date window
  • Visible releases published in-window
  • Author-versus-bot-account activity, repo breadth, collaboration, and concentration
  • PR merge velocity and bus-factor concentration from the same data

What it does not know (by default)

  • Private repositories or private deployment activity — unless the user opts in to the private activity supplement
  • Internal engineering velocity outside GitHub.com
  • Whether a hidden org membership maps to an employee or contractor
  • Product quality; it only measures public shipping signal

How the read works

  • Author commits, merged PRs, releases, active contributors, active repos, active days, and lines changed are the primary reads
  • Concentration and bot-heavy patterns are called out so the visible activity shape is harder to misread
  • Confidence and caveats are shown with every result so partial samples stay obvious
  • Trend windows let you compare adjacent periods without pretending the tool knows the whole company
  • “View the math” on any result shows the exact anchors, interpolation, and population estimates

Why caveats are first-class

  • Commit counts are easy to misuse when presented without context
  • Monorepos, squashes, mirrors, and bots distort the naive story
  • Some companies ship heavily in private and look quiet in public
  • This is a diligence signal, not a verdict

Optional private supplement

ShipSleuth defaults to public-only analysis, but users can optionally enter self-reported private activity (additional commits, PRs, repos, releases, active days, and lines changed) to produce a more complete picture.

When a supplement is included, its values are self-reported and not verified by ShipSleuth. They are meant to reduce the structural blind spot of private work, not to replace proper diligence. The analysis always indicates when private data has been included.

Calculator-first

Earlier drafts experimented with composite scoring and leaderboard-style presentation. ShipSleuth now prioritizes direct metrics and context to avoid false precision.

Author commits: Visible commit volume from real accounts (bot accounts excluded) inside the window.
Merged PRs: A cleaner signal of integrated public work than commit count alone.
Contributor breadth: How many visible contributors and repos share the public activity footprint.
Releases: Visible shipping artifacts that suggest something reached a public milestone.
Active days: Distinct days with at least one push event inside the window.
Lines changed: Weekly code adds + deletes via GitHub code frequency. Noisy: includes all authors and generated code.
Merge velocity: Median hours from PR open to merge for author-attributed PRs merged in-window.
Bus factor: Minimum contributors needed to account for half of author commits.
Concentration: Whether one repo or one actor dominates the visible public output.
Confidence: How trustworthy the visible sample looks after caps, failures, and truncation.
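
The bus factor and merge velocity definitions above can be sketched in a few lines. This is a minimal illustration, not ShipSleuth's implementation: the function names, the input shapes (a list of commit author logins, a list of opened/merged timestamp pairs), and the choice to treat "half" as inclusive are all assumptions.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def bus_factor(commit_authors):
    """Smallest number of contributors whose commits cover at least half the total."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Take the heaviest committers first until half of the commits are covered.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered * 2 >= total:  # assumption: exactly half counts as covered
            return rank
    return len(counts)


def merge_velocity_hours(prs):
    """Median hours from open to merge over (opened_at, merged_at) datetime pairs."""
    hours = [(merged - opened).total_seconds() / 3600 for opened, merged in prs]
    return median(hours) if hours else None


# Hypothetical sample data, for illustration only.
authors = ["ana"] * 6 + ["ben"] * 3 + ["cy"] * 1
prs = [
    (datetime(2026, 1, 1, 9), datetime(2026, 1, 1, 12)),  # 3 hours
    (datetime(2026, 1, 2, 9), datetime(2026, 1, 3, 9)),   # 24 hours
    (datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 19)),  # 10 hours
]
print(bus_factor(authors))        # 1
print(merge_velocity_hours(prs))  # 10.0
```

A bus factor of 1 in the sample reflects that one author alone accounts for 6 of the 10 commits, which is the concentration pattern the caveats are meant to surface.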

Percentile anchor calibration

Where the threshold numbers come from

When ShipSleuth says “Top ~0.1%” for a metric, it interpolates your value against a table of anchor thresholds derived from real GH Archive data. Here is how each one was derived.

Data source

Anchors are derived from GH Archive data queried via ClickHouse Playground (March 2026, covering the preceding 90 days). GH Archive captures all public GitHub events — pushes, PRs, releases, etc. — and ClickHouse provides free, zero-auth SQL access to the full dataset.

Population baseline

GitHub reports 100M+ total accounts (Octoverse 2023). Most are inactive. Querying GH Archive for distinct actors with at least 1 public PushEvent in the last 90 days yields ~8.21M accounts. Filtering out known bot/CI accounts (dependabot, renovate, github-actions, etc.) brings the human pool to ~8.19M — bots account for only ~0.3% of push-active accounts.
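
As a rough illustration of the bot filter described above, the sketch below drops bot logins from a set of push-active actors. The three names come from the text; the full list ShipSleuth uses is not given, and treating the GitHub App `[bot]` login suffix as a bot marker is an assumption, as are the function names.

```python
# Names quoted in the text; the complete filter list is not published here.
KNOWN_BOT_NAMES = {"dependabot", "renovate", "github-actions"}


def is_bot(login):
    """Return True if a login looks like a bot or CI account."""
    name = login.lower()
    if name.endswith("[bot]"):
        # Assumption: the "[bot]" suffix used by GitHub App accounts marks a bot.
        return True
    return name in KNOWN_BOT_NAMES


def human_pool(logins):
    """Distinct logins that survive the bot filter."""
    return {login for login in logins if not is_bot(login)}


# Hypothetical actor logins, for illustration only.
actors = ["octocat", "dependabot[bot]", "renovate", "Octocat2", "github-actions"]
print(sorted(human_pool(actors)))  # ['Octocat2', 'octocat']
```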

Last refreshed: March 2026.

Commits (humanCommits)

GH Archive counts PushEvents, not individual commits. Each push contains ~1-3 commits on average, so we apply a ×2 multiplier. Raw GH Archive percentiles (push events): P50=4, P90=31, P99=201, P99.99=9,954.

Threshold  Percentile   Derivation
8          Top ~50%     GH Archive P50=4 pushes ×2. Median dev pushes ~4 times in 90 days.
25         Top ~25%     P75=12 pushes ×2. About two commits per week.
60         Top ~10%     P90=31 pushes ×2. Roughly one push every three days; a consistent contributor.
115        Top ~5%      P95=57 pushes ×2. More than one commit per day on average; full-time open-source pace.
400        Top ~1%      P99=201 pushes ×2. Among the most active public contributors.
1,500      Top ~0.1%    P99.9=746 pushes ×2. Extremely prolific; often monorepo or multi-project workflows.
20,000     Top ~0.01%   P99.99=9,954 pushes ×2. Top handful globally; may include automated-but-human-attributed workflows.
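
The arithmetic behind the commit anchors can be checked with a short sketch. It applies the ×2 multiplier to the push percentiles quoted above and reports how far each published anchor sits from the raw product. The rounding rule that turns, say, 402 into 400 is not stated in the text, so the sketch only measures the drift; the names are illustrative.

```python
# Raw GH Archive push-event percentiles quoted above (90-day window).
PUSH_PERCENTILES = {50: 4, 75: 12, 90: 31, 95: 57, 99: 201, 99.9: 746, 99.99: 9954}
# Published commit anchors from the table above, keyed by the same percentile.
PUBLISHED_ANCHORS = {50: 8, 75: 25, 90: 60, 95: 115, 99: 400, 99.9: 1500, 99.99: 20000}
COMMITS_PER_PUSH = 2  # the approximation multiplier described in the text


def estimated_commits(push_count, multiplier=COMMITS_PER_PUSH):
    """Convert a push-event count into an estimated commit count."""
    return push_count * multiplier


for pct, pushes in PUSH_PERCENTILES.items():
    raw = estimated_commits(pushes)
    anchor = PUBLISHED_ANCHORS[pct]
    drift = (anchor - raw) / raw * 100
    print(f"P{pct}: {pushes} pushes -> {raw} commits, anchor {anchor} ({drift:+.1f}%)")
```

Every published anchor lands within a few percent of the raw product, which suggests the anchors were rounded to convenient values rather than derived differently.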

Merged PRs (mergedPullRequests)

GH Archive PullRequestEvent(action=closed). Only ~151k actors closed any PRs out of 8.19M pushers (~1.8%). Percentiles are among PR-active users: P50=1, P90=3, P99=12, P99.99=11,292.

Threshold  Percentile   Derivation
1          Top ~50%     GH Archive P50=1 closed PR. Most PR users close just 1 in 90 days.
2          Top ~25%     P75=2. Uses PR workflow semi-regularly.
3          Top ~10%     P90=3. Consistent PR contributor.
4          Top ~5%      P95=4. Active reviewer and contributor.
12         Top ~1%      P99=12. Heavy PR throughput; managing multiple repos.
90         Top ~0.1%    P99.9=90. Among the most active mergers on GitHub.
1,500      Top ~0.01%   Raw P99.99=11,292, so this anchor sits well below the raw value. Near the absolute ceiling for human-driven PR closes.

Lines changed (linesChanged)

Not available in GH Archive — retains estimated anchors based on GitHub's code frequency API (weekly adds + deletes). Includes all authors in weeks overlapping the window, so this metric is noisier than commit counts. Large refactors and generated code inflate it.

Threshold  Percentile   Derivation
10k        Top ~50%     ~110 lines/day. Light but steady code changes. (Estimated; no GH Archive data.)
30k        Top ~25%     ~333/day. Regular feature development.
80k        Top ~10%     ~889/day. Heavy development or multiple active projects.
200k       Top ~5%      ~2.2k/day. Major features, migrations, or multiple concurrent repos.
500k       Top ~1%      ~5.6k/day. Often includes generated code, large refactors, or monorepo changes.
1M         Top ~0.1%    ~11k/day. Almost certainly includes codegen, migrations, or vendor updates.
2M         Top ~0.01%   ~22k/day. Extreme outlier; major infrastructure or generated code.
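
The per-day figures in the lines-changed anchors are the 90-day thresholds divided by the window length. A small sketch of that conversion, with illustrative names:

```python
WINDOW_DAYS = 90
# Lines-changed thresholds from the anchor table above.
LINES_ANCHORS = [10_000, 30_000, 80_000, 200_000, 500_000, 1_000_000, 2_000_000]


def lines_per_day(total_lines, window_days=WINDOW_DAYS):
    """Average lines changed per day across the window."""
    return total_lines / window_days


for total in LINES_ANCHORS:
    print(f"{total:>9,} lines -> ~{lines_per_day(total):,.0f}/day")
```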

Active repos

GH Archive distinct repos per human actor. P50=1, P75=2, P90=4, P95=6, P99=12, P99.99=118. Most devs push to just 1 repo. Maintaining 12+ active repos in 90 days puts you in the Top ~1%.

Releases

GH Archive ReleaseEvents. Only ~210k actors publish any releases (~2.6% of active devs). Among publishers: P50=1, P75=3, P90=5, P95=9, P99=26, P99.99=1,691.

Active days

GH Archive distinct push dates per actor. P50=2, P75=5, P90=11, P95=19, P99=44, P99.99=91. The median developer pushes on just 2 distinct days per quarter. Active days are capped at window length (91 days = Top ~0.01% for a 90-day window — almost no weekends off).

Contributors

GH Archive distinct human pushers per repo owner. Over 90% of owners have just 1 contributor (themselves). P99=4, P99.9=10, P99.99=45. Having 5+ distinct contributors puts an owner in the Top ~1%.

How interpolation works

Your value is placed between the two nearest anchors. The percentile axis is interpolated in log-space (not linearly) because developer activity follows a power-law distribution — a small number of developers are orders of magnitude more active than the median. Log-interpolation respects this shape.
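
The interpolation described above can be sketched as follows, using the commit anchors. This is one plausible reading, not the confirmed implementation: it interpolates log10 of the "Top ~X%" figure while treating the value axis as linear between anchors, and it clamps values outside the anchor range. Both choices, and the names `COMMIT_ANCHORS` and `top_percent`, are assumptions.

```python
import math

# (threshold value, "Top ~X%") pairs for commits, copied from the anchor table.
COMMIT_ANCHORS = [(8, 50.0), (25, 25.0), (60, 10.0), (115, 5.0),
                  (400, 1.0), (1500, 0.1), (20000, 0.01)]


def top_percent(value, anchors=COMMIT_ANCHORS):
    """Estimate "Top ~X%" by interpolating log10(X) between the two nearest anchors."""
    if value <= anchors[0][0]:
        return anchors[0][1]   # assumption: clamp below the first anchor
    if value >= anchors[-1][0]:
        return anchors[-1][1]  # assumption: clamp above the last anchor
    for (v_lo, p_lo), (v_hi, p_hi) in zip(anchors, anchors[1:]):
        if v_lo <= value <= v_hi:
            t = (value - v_lo) / (v_hi - v_lo)  # assumption: value axis is linear
            log_p = math.log10(p_lo) + t * (math.log10(p_hi) - math.log10(p_lo))
            return 10 ** log_p


print(round(top_percent(400), 3))  # 1.0
print(round(top_percent(950), 3))  # 0.316
```

A value of 950 sits halfway between the 400 and 1,500 anchors. Linear interpolation of the percentile would give Top ~0.55%, while the log-space version gives Top ~0.32%, which is the behavior the power-law argument motivates.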

For non-90-day windows, the value thresholds are scaled proportionally (e.g., a 30-day window scales thresholds to 1/3). The percentile axis stays the same — “Top ~1%” always means Top ~1% regardless of window length.
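
Window scaling as described above amounts to multiplying each threshold by the ratio of window lengths and leaving the percentile column alone. A minimal sketch, with illustrative names:

```python
BASE_WINDOW_DAYS = 90
# 90-day commit anchors as (threshold value, "Top ~X%") pairs.
COMMIT_ANCHORS_90D = [(8, 50.0), (25, 25.0), (60, 10.0), (115, 5.0),
                      (400, 1.0), (1500, 0.1), (20000, 0.01)]


def scale_anchors(anchors, window_days, base_days=BASE_WINDOW_DAYS):
    """Scale threshold values proportionally; percentiles stay unchanged."""
    factor = window_days / base_days
    return [(value * factor, top_pct) for value, top_pct in anchors]


for value, top_pct in scale_anchors(COMMIT_ANCHORS_90D, window_days=30):
    print(f"{value:>8.1f} commits in 30 days -> Top ~{top_pct}%")
```

With a 30-day window the Top ~1% threshold drops from 400 commits to about 133.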

Limitations and honesty

  • Anchors are derived from real GH Archive data, but GH Archive only captures public GitHub events. Private repo activity is invisible.
  • GH Archive counts PushEvents, not individual commits. The ×2 multiplier for commits is an approximation — actual commit-per-push ratios vary by workflow.
  • PR “closed” events include both merges and rejections. The real merged-PR distribution may differ.
  • The real distribution shifts over time as GitHub grows. We plan to refresh anchors periodically via automated ClickHouse queries.
  • Monorepo teams, squash-merge policies, and CI bot patterns can inflate or deflate raw counts.
  • The “Top ~X%” claim means: “among ~8.2M human accounts active on public GitHub in the last 90 days, we estimate your visible activity places you roughly in the top X%.” It is not an exact ranking.
  • You can verify every step: click “View the math” on any analysis result to see your value, the scaled anchors, and the exact interpolation.
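
To make the "Top ~X%" claim concrete, a percentile can be turned into an approximate head count against the ~8.19M human pool quoted in the population baseline. The sketch below is only that multiplication; the function name is illustrative and the result inherits all the uncertainty listed above.

```python
HUMAN_POOL = 8_190_000  # push-active human accounts, from the population baseline


def accounts_in_tier(top_pct, pool=HUMAN_POOL):
    """Approximate number of accounts at or above a given "Top ~X%" tier."""
    return round(pool * top_pct / 100)


print(accounts_in_tier(1.0))   # 81900
print(accounts_in_tier(0.01))  # 819
```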

Overall score

The composite score is a weighted average of 7 log-scaled dimensions (volume 30%, breadth 19%, consistency 18%, releases 10%, recency 10%, collaboration 8%, concentration 5%). The score itself is then mapped to a percentile tier using a separate set of anchors calibrated against the expected score distribution. A score of 85+ maps to Top 1%; a score of 30 maps to Top 50%.
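
The weighted average can be sketched with the weights listed above. How each dimension is log-scaled is not described here, so the sketch assumes every dimension has already been reduced to a 0-100 score; that input shape and the function name are assumptions.

```python
# Dimension weights as listed in the text; they sum to 100%.
WEIGHTS = {
    "volume": 0.30,
    "breadth": 0.19,
    "consistency": 0.18,
    "releases": 0.10,
    "recency": 0.10,
    "collaboration": 0.08,
    "concentration": 0.05,
}


def composite_score(dimension_scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores (assumed to be on a 0-100 scale)."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(weights[name] * dimension_scores[name] for name in weights)


# Hypothetical dimension scores, for illustration only.
scores = {"volume": 80, "breadth": 60, "consistency": 70, "releases": 40,
          "recency": 90, "collaboration": 50, "concentration": 30}
print(round(composite_score(scores), 1))  # 66.5
```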