This spring, I managed a quantitative research pod for Paragon Global Investments, which is, I suppose, the premier intercollegiate quantitative finance group (not that we have much competition). I had my pod work on a model that forecasts a probability distribution of the NASDAQ-100 (NDX) closing level. In particular, our model runs on days with scheduled, market-moving events, e.g. Fed rate cut decisions and Mag 7 quarterly earnings. Our alpha is found on the Kalshi KXNASDAQ100 series, a stratified market with really high spreads (≈15c), which indicates a great deal of uncertainty on the part of market participants. We trade on the series by using our aforementioned distribution to generate per-bucket trading signals with bootstrap-derived confidence intervals. Special thanks to Tommy S. and David H. for their excellent technical work on this, as well as Forrest G., my boss/mentor.
Alpha Thesis
To motivate the project a bit, it’s worth noting that forecasting some point estimate for the NDX closing level and then trading on that estimate itself is very silly. Your decision-making space is quite constrained: you can short if your prediction is lower, long if higher, overleverage as much as you want, run 0DTE options, et cetera, but you are fundamentally just taking directional bets on the assumption that the market is mispriced. If you are really beating an index consistently, RenTech is always hiring! In practice, this is probably just coin-flipping with slight negative EV, because you are not nearly that good at modelling.
Okay, so if forecasting a point estimate is unrealistic, why is forecasting a distribution much better? It turns out that forecasting distributions captures a lot of additional information. Consider the following toy example. Suppose that the entire NDX depends on a single company $c$ (necessarily it can’t because of the whole 100 thing… but suppose the other 99 comprise some negligible $\epsilon$). Suppose, further, that the company has an earnings call today and we have information regarding that call which no one else has: if the call goes well, with $p = 0.75$, then $\Delta\text{price}(c) = +10$, and if it goes poorly, with $p = 0.25$, then $\Delta\text{price}(c) = -30$. Then $E[\Delta\text{price}(c)] = 0.75 \cdot 10 + 0.25 \cdot (-30) = 0$, and if our point estimate minimises e.g. MSE (very reasonable), it is the mean, so $\widehat{\text{price}}(c) = \text{price}(c)_{t=0}$. This is very unfortunate for us: our private information changes the shape of the distribution (by hypothesis, the market has not priced that in), yet our point estimate is exactly the current price, so a point forecast leaves us nothing to trade on. Notice how e.g. $\text{Var}[\Delta\text{price}(c)] = 0.75 \cdot 10^2 + 0.25 \cdot 30^2 - 0^2 = 300$ is unique information which trading on a point doesn’t incorporate.
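To make the toy example concrete, here is a minimal sketch that computes the moments of the two-outcome distribution directly. The skewness line is an addition for illustration; the example above only works out the mean and variance.

```python
# Toy two-outcome distribution from the example: (probability, price change).
outcomes = [(0.75, 10.0), (0.25, -30.0)]

# First moment: the point estimate an MSE-minimising forecaster would report.
mean = sum(p * x for p, x in outcomes)

# Second central moment: information a point forecast throws away.
variance = sum(p * x**2 for p, x in outcomes) - mean**2

# Third standardised moment (illustrative addition): negative means the
# downside outcome is the larger one.
skewness = sum(p * (x - mean) ** 3 for p, x in outcomes) / variance**1.5

print(mean, variance, round(skewness, 3))  # 0.0 300.0 -1.155
```

The mean is zero, so a point forecaster sees nothing, while the variance and the negative skew are exactly what a distributional forecast can trade on.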
Of course, in this case there are clever things we can do with options, but we would prefer less informed counterparties. That is why we turn to the KXNASDAQ100 series. The spreads on that series are generally quite high, owing to uncertainty. And, moreover, amateurish market makers often quote with an unsophisticated Gaussian distribution about the current price. Therefore, we aim to forecast the closing level distribution influenced by these market-moving events, hoping that the fat tails unaccounted for by makers can generate us profit.
Importantly, there are good reasons to think kurtosis is high enough for alpha. For starters, a few companies comprise much of the NDX; the index is ridiculously concentrated (Nvidia comprises 14%), with the only salient comparison being the dot-com bubble. Thus, if these highly weighted companies are capricious and volatile, they will have a meaningful impact on the index. And they very much are! Trailing P/E is at an all-time high, VVIX and SKEW are close to or at all-time highs, and so on.
Okay, so we have a strong pretext: certain events have outsized impacts on companies which have outsized impacts on our index price. What are these events?
Forecasting
We curate a JSON event file where we track events of interest starting in 2019—a year chosen because that’s when a lot of the aforementioned theses became particularly true. The events we care about include 444 macro events (FOMC, CPI, NFP, PCE, Jackson Hole, etc.), 309 mega-cap earnings prints, and 108 AI lab events (Nvidia GTC keynotes, OAI/Anthropic/GDM model releases, Apple WWDC, etc.).
We then have an audit document that maps each entry to primary sources, and we had subagents run multiple passes to be thorough. In particular, we had to be very careful with which market dates we attributed to which events. For instance, the 2025 government shutdown delayed several CPI and NFP releases (and of course Trump decided to just not do an October 2025 CPI), and Good Friday occasionally (but not always!) messes with the NFP; the list goes on. Then across the board, events occurring outside market hours had to be mapped to the next trading day. These overnight factors must additionally be accounted for because the KXNASDAQ100 series runs 24/7.
Then, for each (event_type, ticker) pair we computed marginal statistics—the mean, standard deviation, skewness, and excess kurtosis—over the historical reaction-day returns for that bucket. We end up with 1,530 ticker-bucket marginal records. We then generated a Pearson correlation matrix across tickers from the same panel in order to have high-dimensional correlation vectors. We discover some fascinating stuff through this matrix.
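As a rough illustration of the per-bucket marginals, the sketch below computes the four statistics for one return series. The function name `marginal_stats` and the sample data are hypothetical, and the use of population (biased) moment estimators for skewness and kurtosis is an assumption; the post does not say which estimator the pipeline uses.

```python
import math


def marginal_stats(returns):
    """Mean, sample std, skewness and excess kurtosis of one return series."""
    n = len(returns)
    mean = sum(returns) / n
    # Central moments with a 1/n normalisation (assumed convention).
    m2 = sum((r - mean) ** 2 for r in returns) / n
    m3 = sum((r - mean) ** 3 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2 * n / (n - 1)),  # sample (n - 1) standard deviation
        "skew": m3 / m2**1.5,
        "excess_kurtosis": m4 / m2**2 - 3.0,  # 0 for a Gaussian
    }


# Hypothetical panel keyed by (event_type, ticker); values are reaction-day returns.
panel = {("FOMC", "NVDA"): [-2.0, -1.0, 0.0, 1.0, 2.0]}
records = {key: marginal_stats(series) for key, series in panel.items()}
```

Positive excess kurtosis in a bucket is the fat-tail signal the alpha thesis relies on.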
Consider for instance AMGN (Amgen) and TMUS (T-Mobile). On COST (Costco) earnings days, they have a correlation of $\rho = +0.77$. Yet on META (Meta) earnings days, $\rho = -0.36$. For those curious about the statistics, $n > 30$ for both, and the values are still $+0.74$ and $-0.26$ under leave-one-out (LOO). Such behavior is intuitive. Costco prints are treated as a consumer-spending/staples signal, so defensive yield megacaps (which Amgen and T-Mobile are, as both pay solid dividends, have low beta w/r/t the cyclical tape, etc.) move in lockstep. However, Meta’s print is taken as a signal of how digital advertisements and consumer engagement are performing; advertising was literally 98% of their revenue in 2025. Think of how many T-Mobile and Amgen ads you have seen: for me, probably hundreds versus zero. So, the behavior aligns with our heuristics. Note how important the per-event Pearson correlation matrix is: if we had simply calculated a single unconditional correlation, hoping to run some mean-reversion strategy like every other unsophisticated algotrading project, we would lose this nuance entirely!
The universe for a given event type is then the intersection of tickers with cached returns for every reaction date, which keeps the series equal-length so we can do correlation estimation. Otherwise, correlations estimated pairwise from unequal-length series need not assemble into a positive semidefinite (PSD) matrix.
To forecast on dates with multiple events, we treat each event’s shock as an independent additive contribution. Obviously, these events are not actually independent, but we don’t have sufficient sample size to estimate their dependence, so it is what it is; this affects 23.1% of event-days. Adding them works out neatly because the per-event $\mu$ vectors sum without an issue, and the summed $\Sigma$ matrix is PSD (the sum of PSD matrices remains PSD). We care about the PSD property because it means that Cholesky decomposition stays well-defined (strictly, it needs positive definiteness, which holds as long as the summed matrix is not singular). We use Cholesky decomposition since it considerably speeds up our computation; it factorises in $\tfrac{1}{3}n^3 + O(n^2)$ operations, cf. roughly $9n^3$ for a symmetric eigendecomposition.
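The additive treatment of a multi-event day can be sketched in a few lines. Everything below is illustrative: the two-ticker $\mu$ vectors and $\Sigma$ matrices are made-up numbers, and the hand-rolled `cholesky` is a textbook implementation written out for clarity, not the project's code (in practice one would call a library routine).

```python
import math


def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    # A singular (merely PSD) matrix fails here; a small
                    # diagonal jitter is a common workaround.
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


# Hypothetical per-event moments for two tickers on a day with two events.
mu_fomc, mu_earnings = [0.001, -0.002], [0.004, 0.001]
sigma_fomc = [[4e-4, 1e-4], [1e-4, 9e-4]]
sigma_earnings = [[1e-3, -2e-4], [-2e-4, 5e-4]]

# Independent additive shocks: means add, covariances add.
mu_total = [a + b for a, b in zip(mu_fomc, mu_earnings)]
sigma_total = [
    [a + b for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(sigma_fomc, sigma_earnings)
]

chol = cholesky(sigma_total)
```

Multiplying `chol` by a vector of independent standard normals then yields a draw with covariance `sigma_total`, which is what the simulation step needs.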
Finally, for the actual Monte Carlo simulation, we draw n=100,000 joint component-return vectors, then aggregate each draw into the actual NDX using NNLS-estimated index weights for each company, then apply this to the current price. The result is a Monte Carlo distribution over the closing level.
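A minimal sketch of this simulation step follows, using NumPy. The three-name basket, its moments, the weights and the starting level are all hypothetical. The draws here are plain multivariate Gaussian and returns are treated as simple returns; both are simplifying assumptions, since the post does not spell out how skewness and kurtosis enter the sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs for a three-name basket (illustrative numbers only).
mu = np.array([0.002, -0.001, 0.0005])          # summed per-event mean returns
cov = np.array([
    [4.0e-4, 1.0e-4, 5.0e-5],
    [1.0e-4, 9.0e-4, 2.0e-4],
    [5.0e-5, 2.0e-4, 2.5e-4],
])                                               # summed covariance matrix
weights = np.array([0.5, 0.3, 0.2])              # stand-in for NNLS index weights
current_level = 20_000.0
n_draws = 100_000

# Correlate independent standard normals with the Cholesky factor.
chol = np.linalg.cholesky(cov)
z = rng.standard_normal((n_draws, len(mu)))
component_returns = mu + z @ chol.T

# Aggregate each draw into an index return, then into a closing level.
index_returns = component_returns @ weights
closing_levels = current_level * (1.0 + index_returns)
```

`closing_levels` is the Monte Carlo sample that later gets scored with CRPS and binned into Kalshi buckets.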
A brief aside: component companies about which we are agnostic we simulate with Geometric Brownian Motion (GBM), as is standard practice. The GBM is fine-tuned such that $\sigma$ is the annualised realised volatility of NDX log-returns over the 504 trailing trading days, again, as is standard practice. We are content with this simplistic model because they make up a sliver of market cap (13.15%). The equation we use is as follows, with $\sigma$ calculated based on window:
$$dS_t = \mu S_t\,dt + \sigma S_t\,dW_t \implies S_T = S_0 \exp\left(\left(\mu - \tfrac{1}{2}\sigma^2\right)T + \sigma\sqrt{T}\,Z\right), \quad Z \sim \mathcal{N}(0, 1).$$
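The closed-form GBM solution above translates directly into code. In this sketch the helper names, the 252-day annualisation convention and the demo parameters are assumptions for illustration.

```python
import math
import random
import statistics

TRADING_DAYS_PER_YEAR = 252  # assumed annualisation convention


def annualised_realised_vol(closes, window=504):
    """Annualised std of daily log-returns over the trailing `window` days."""
    recent = closes[-(window + 1):]
    log_returns = [math.log(b / a) for a, b in zip(recent, recent[1:])]
    return statistics.stdev(log_returns) * math.sqrt(TRADING_DAYS_PER_YEAR)


def gbm_terminal(s0, mu, sigma, horizon_years, z):
    """S_T = S_0 * exp((mu - sigma^2 / 2) * T + sigma * sqrt(T) * Z)."""
    drift = (mu - 0.5 * sigma**2) * horizon_years
    diffusion = sigma * math.sqrt(horizon_years) * z
    return s0 * math.exp(drift + diffusion)


# One-day-ahead draws for a hypothetical name with 20% annualised volatility.
rng = random.Random(0)
one_day = 1.0 / TRADING_DAYS_PER_YEAR
draws = [gbm_terminal(100.0, 0.0, 0.2, one_day, rng.gauss(0.0, 1.0))
         for _ in range(20_000)]
```

The $-\tfrac{1}{2}\sigma^2$ correction is what keeps the expected terminal price at $S_0 e^{\mu T}$ rather than drifting upward with volatility.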
Backtest
The model has two particularly relevant hyperparameters which we fine-tune in the backtest: mean_shrinkage (how aggressively to bias the means towards zero, which we do for bias-variance trade-off reasons) and stdev_shrink_weight (how much to shrink per-ticker σ’s). There are some more but they are relatively immaterial.
We score a forecast with the Continuous Ranked Probability Score (CRPS). The scoring rule penalises a predictive CDF $F$ for being far (in $L^2$) from the point mass at the true realised outcome $y$:
$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left(F(x) - \mathbf{1}\{x \ge y\}\right)^2 \, dx.$$
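In particular for a sample-based forecast $\{X^{(k)}\}_{k=1}^{N}$ (which is what our Monte Carlo simulation is), the empirical estimator becomes:
$$\mathrm{CRPS}\left(\{X^{(k)}\}, y\right) = \frac{1}{N}\sum_{k=1}^{N} \left|X^{(k)} - y\right| - \frac{1}{2N^2}\sum_{j=1}^{N}\sum_{k=1}^{N} \left|X^{(j)} - X^{(k)}\right|.$$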
A lower CRPS is better. The first term rewards accuracy (it is the average distance from the draws to $y$), while the second term subtracts half the average distance between pairs of draws, so that a forecast is not penalised for spread it honestly needs; taken together, the score prefers the sharpest distribution that is still calibrated. When $N = 1$, the second term vanishes and CRPS reduces to the absolute error, as one would expect. It is reasonably conceptualised as the generalisation of a point-forecast MAE to a full distributional forecast. We also use confidence intervals because they are easy to use and easy to understand.
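The sample estimator is short enough to write out. The sketch below is one way to compute it, under the assumption that a sort-based shortcut is acceptable: for sorted draws, the double sum of pairwise distances equals $2\sum_i (2i - N + 1)\,x_{(i)}$ with zero-based $i$, which avoids the $O(N^2)$ loop for 100,000 draws. The function name is hypothetical.

```python
def crps_sample(draws, y):
    """Empirical CRPS of a sample-based forecast against the realised value y."""
    n = len(draws)
    xs = sorted(draws)
    # Accuracy term: mean absolute distance from each draw to the outcome.
    accuracy = sum(abs(x - y) for x in xs) / n
    # Spread term: sum over all ordered pairs of |x_j - x_k|, computed from the
    # sorted sample in O(N log N) instead of O(N^2).
    pairwise_sum = 2.0 * sum((2 * i - n + 1) * x for i, x in enumerate(xs))
    return accuracy - pairwise_sum / (2.0 * n * n)


# With a single draw the score is just the absolute error.
single = crps_sample([3.0], 5.0)   # 2.0
# Two draws straddling the outcome: 1.0 accuracy minus 0.5 spread credit.
pair = crps_sample([0.0, 2.0], 1.0)  # 0.5
```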
For backtesting, we use three periods. The first is 2019-01-01 through 2021, the second is 2022-01-01 through 2023, and the third is 2024-01-01 to now. We use the first to calculate training statistics (the per-event marginals and the Pearson correlation matrices), so the model conditions on strictly historical data. The next window is validation, on which we tune those aforementioned hyperparameters: mean_shrinkage $\in [0, 1]$ and stdev_shrink_weight $\in [0, 0.5]$. We first sweep a 200-cell grid (20×10) to find a coarse optimum, then refine via Nelder-Mead, seeded from the best grid cell with tolerances xatol = 0.003 and fatol = 0.05. It’s worth noting (I didn’t notice this initially) that we cannot use a single window to calculate both the statistics and the hyperparameters: the hyperparameters exist to correct the overfitting of the training statistics, so if tuned on the same window the statistics were fitted to, both would tautologically go to 0. The refined configuration is mean_shrinkage = 0.9869, stdev_shrink_weight = 0.0009.
On the held-out test window, the tuned model beats the untuned default by −6.93% mean CRPS, with a 95% percentile bootstrap confidence interval of [−11.66%,−2.48%]. The CI excludes zero, so our tuning lift is statistically significant! Unfortunately, against the GBM baseline on the same window, it returns +1.46% CRPS with a 95% CI of [−0.83%,+3.69%]—a statistical tie, since zero is in the CI. To be fair, this isn’t any serious indictment; if we beat GBM, that would probably mean we can directly trade on the NDX. We are certainly good enough to trade on Kalshi! Moreover, upon some further analysis, our consideration of kurtosis means we predict days with fat tails incredibly well—significantly better than GBM, in fact.
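For readers unfamiliar with percentile bootstrap intervals, here is a minimal sketch of how such a CI on the relative CRPS difference could be computed. It assumes a paired bootstrap that resamples days with replacement and compares ratios of mean CRPS; the post does not state the exact resampling scheme, and the function names and demo data are hypothetical.

```python
import random


def relative_delta(ours, baseline):
    """Relative difference in mean CRPS; negative means `ours` scores better."""
    return sum(ours) / sum(baseline) - 1.0


def bootstrap_ci(ours, baseline, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for relative_delta, resampling days in pairs."""
    rng = random.Random(seed)
    n = len(ours)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same days for both models
        stats.append(relative_delta([ours[i] for i in idx],
                                    [baseline[i] for i in idx]))
    stats.sort()
    lo = stats[round((alpha / 2) * (n_boot - 1))]
    hi = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


# Hypothetical per-day CRPS values for two models over 60 event days.
demo_rng = random.Random(1)
baseline_scores = [demo_rng.uniform(100.0, 300.0) for _ in range(60)]
our_scores = [b * demo_rng.uniform(0.85, 1.05) for b in baseline_scores]
ci_low, ci_high = bootstrap_ci(our_scores, baseline_scores)
```

If the resulting interval lies entirely below zero, the improvement is significant at the chosen level; if it straddles zero, the comparison is a statistical tie.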
Tail Diagnostics
An aggregate CRPS tie could mean a lot of things. Perhaps both models are interchangeable at every part of the distribution (in which case we have reinvented the wheel), or, hopefully, we have fatter tails that only pay off on the days when the realised closing level lands in them. Since our entire alpha thesis hinges on the latter, it’d be really nice if the latter is true. And it is! If we restrict CRPS analysis to the days where the realised NDX move was large, we get some interesting data.
| ΔNDX at least | n | our CRPS | GBM CRPS | Δ CRPS | 95% Interval |
|---|---|---|---|---|---|
| 0.0% (all days) | 156 | 174.42 | 171.91 | +1.5% | [−0.8%, +3.7%] |
| 1.0% | 67 | 279.64 | 286.43 | −2.4% | [−4.3%, −0.6%] |
| 1.5% | 39 | 364.61 | 376.45 | −3.1% | [−5.5%, −0.9%] |
| 2.0% | 22 | 480.00 | 497.19 | −3.5% | [−6.3%, −0.6%] |
| 2.5% | 11 | 644.78 | 671.35 | −4.0% | [−6.9%, −0.7%] |
The improvement grows monotonically with how extreme the NDX move was, and the 95% CI excludes zero at every threshold from 1.0% upward (only the all-days row straddles zero), indicating statistical significance. In plain English: on the days we care about, those where NDX moved substantially in response to an event, our model beat GBM by a statistically significant margin, and that margin strictly increases as NDX movement increases.
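As a sanity check, the Δ CRPS column can be recomputed from the published mean scores. The sketch below assumes Δ CRPS is the ratio of mean CRPS values minus one, which reproduces every row of the table to the stated precision.

```python
# (threshold %, n days, our mean CRPS, GBM mean CRPS), copied from the table.
rows = [
    (0.0, 156, 174.42, 171.91),
    (1.0, 67, 279.64, 286.43),
    (1.5, 39, 364.61, 376.45),
    (2.0, 22, 480.00, 497.19),
    (2.5, 11, 644.78, 671.35),
]

# Relative difference in mean CRPS, in percent; negative favours our model.
deltas_pct = [round(100.0 * (ours / gbm - 1.0), 1) for _, _, ours, gbm in rows]
print(deltas_pct)  # [1.5, -2.4, -3.1, -3.5, -4.0]

# The edge widens at each successive tail threshold.
tail = deltas_pct[1:]
widening = all(later < earlier for earlier, later in zip(tail, tail[1:]))
```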
Implications
In summation, aggregate CRPS on event days is approximately a tie because GBM's narrower distribution beats us on quiet days, but we win substantially on days where realised moves are large—precisely the regime in which we hope to profit on Kalshi.
Materialising our Edge
Kalshi sells daily range contracts on the NDX close. Each pays a dollar if the index settles in its bucket and nothing otherwise. The full set of buckets (which includes catch-alls on both ends, e.g. “18999.99 or below”) partitions the price axis, so the buckets collectively define a market-implied probability distribution, though we need to strip the vig (subtract the half-spread from each ask) and normalise, because with spreads this wide the raw quotes do not sum to one. By bucketing the probability distribution implied by our Monte Carlo simulation, we can compare the two. If our model's bucket probability sits meaningfully above the ask on a YES contract, by more than fees (which are not particularly high, around a percent), buying YES has positive EV; symmetrically, if it sits meaningfully below the bid, buying NO does. Doing so is fairly straightforward. We bin our 100,000 Monte Carlo draws into the bucket boundaries pulled from Kalshi’s REST API, then we compare per-bucket model probability against the current top-of-book bids and asks. If our edge is sufficiently high, we trade (where the edge threshold depends on risk appetite).
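The comparison described above can be sketched as follows. All names here are hypothetical, prices are in dollars per one-dollar contract, and the flat per-contract `fee` is a simplifying assumption rather than Kalshi's actual fee schedule. No API calls are shown; the bucket edges and quotes are assumed to have been fetched already.

```python
from bisect import bisect_right


def bucket_probabilities(draws, edges):
    """Share of draws in each bucket; `edges` are the sorted interior boundaries,
    so there are len(edges) + 1 buckets including both catch-alls."""
    counts = [0] * (len(edges) + 1)
    for x in draws:
        counts[bisect_right(edges, x)] += 1
    return [c / len(draws) for c in counts]


def implied_probabilities(bids, asks):
    """Strip the vig by taking mids (ask minus half-spread), then normalise."""
    mids = [(b + a) / 2.0 for b, a in zip(bids, asks)]
    total = sum(mids)
    return [m / total for m in mids]


def signals(model_probs, bids, asks, threshold, fee=0.01):
    """Per-bucket action when expected value clears the edge threshold."""
    out = []
    for p, bid, ask in zip(model_probs, bids, asks):
        yes_edge = p - ask - fee   # pay `ask`, win 1 with probability p
        no_edge = bid - p - fee    # pay 1 - bid, win 1 with probability 1 - p
        if yes_edge > threshold:
            out.append("BUY_YES")
        elif no_edge > threshold:
            out.append("BUY_NO")
        else:
            out.append(None)
    return out


# Tiny illustration with three buckets split at 19000 and 19100.
probs = bucket_probabilities([18950.0, 19000.0, 19050.0, 19150.0], [19000.0, 19100.0])
market = implied_probabilities([0.10, 0.40, 0.20], [0.25, 0.55, 0.35])
actions = signals([0.05, 0.45, 0.50], [0.10, 0.40, 0.20], [0.25, 0.55, 0.35], 0.02)
```

Note that trades are evaluated against the actual bid and ask, not the vig-stripped mids; the normalised market distribution is only for comparing shapes.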
Execution and Infrastructure
Our codebase revolves around a few load-bearing pieces.
Our data layer caches 162 of the 192 NDX constituents from our time scope (about 84% of tickers and ≈95% of index value) as daily OHLCV CSVs. We verified historical membership per-date with the n100tickers package. Then their index weights are estimated by running a non-negative least squares regression of the NDX daily return on component daily returns, subject to $w_i \ge 0$ (weights are nonnegative) and normalised so they sum to 1. Our result has an $R^2 = 0.998$, so our basket faithfully reproduces NDX. This process was necessary because historical weights are paywalled, and our recreation is excellent.
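The weight estimation can be illustrated on synthetic data with SciPy's `scipy.optimize.nnls`. The three-component basket, the noise level and the choice to compute $R^2$ on the unnormalised fit are assumptions for this sketch; the real pipeline runs on cached constituent returns.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 days of returns for three components, and an
# index built from known weights plus a little noise.
true_weights = np.array([0.5, 0.3, 0.2])
component_returns = rng.normal(0.0, 0.01, size=(500, 3))
index_returns = component_returns @ true_weights + rng.normal(0.0, 1e-4, size=500)

# Non-negative least squares: minimise ||A w - b|| subject to w >= 0.
raw_weights, _residual_norm = nnls(component_returns, index_returns)

# Normalise so the weights sum to 1.
weights = raw_weights / raw_weights.sum()

# Goodness of fit of the reconstructed basket.
fitted = component_returns @ raw_weights
ss_res = np.sum((index_returns - fitted) ** 2)
ss_tot = np.sum((index_returns - index_returns.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
```

On this synthetic data the recovered weights land close to the true ones, which is the same check the $R^2$ figure provides for the real basket.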
Our event-statistics pipeline has three sequential scripts. The first, build_event_returns.py, computes the intersection of tickers with the cached returns for every reaction date (keeping series equal-length). Then build_event_marginals.py computes the marginal statistics (mean, standard deviation, skew, and kurtosis). Finally, correlation.py ingests the outputs of the prior steps, building a Pearson correlation matrix per event type.
Then live signal generation (main.py signals) pulls the current NDX from yfinance, runs our model on that price, then fetches the Kalshi orderbook to find per-bucket signals.