1 Introduction
Here you will assess trade flow as a means of generating profit opportunities in three cryptotoken markets. We stress the word “opportunity” because at data rates this high, and given the markets’ price-time priority, it is far easier to identify desirable trades in the data stream than it is to inject oneself profitably into the fray.
2 Data
We have preprocessed level 2 exchange messages from the Coinbase WebSocket API for you into a more digestible format.
2.1 Treatment
Load the 2021 data for all 3 pairs from the class website. For each one, split it into test and training sets, with your training set containing the first 20% of the data and the test set containing the remainder.
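The split described above is chronological, so the rows should not be shuffled. A minimal sketch, assuming each pair's data has already been loaded into a time-ordered list (the helper name `split_train_test` and the toy rows are illustrative, not part of the provided data):

```python
def split_train_test(rows, train_fraction=0.2):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)  # index of the first test row
    return rows[:cut], rows[cut:]


# Toy stand-in for one pair's rows; real rows would come from the class data.
rows = list(range(10))
train, test = split_train_test(rows)
```

Repeat the same split separately for each of the 3 pairs, so that every test set lies strictly after its own training set in time.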
2.2 Format
The data has the following structure:[1]
2.2.1 Trades
1618090137140737000  1618090137157544000  35690     1000000  -1
1618090137851379000  1618090137864544000  35700    29801980   2
1618270615253262000  1618270615358639000  35760  2926932560  -1
1618270616012160000  1618270616105583000  35760    16673940  -1
The Side (the last column) is actually a sum of trade sides at the same price and time.
2.2.2 Book
Ask1PriceMillionths        35700        35700        35770        35770
Bid1PriceMillionths        35690        35690        35760        35760
Ask1SizeBillionths   11872084060  11872084060   1255039420   1255039420
Bid1SizeBillionths   32957203990  32957203990  24752612680  24752612680
Ask2PriceMillionths        35710        35710        35780        35780
Bid2PriceMillionths        35680        35680        35750        35750
Ask2SizeBillionths   31032423370  30332423370  31011776970  31011776970
Bid2SizeBillionths   45284575470  45284575470  41785630850  41785630850
Each column is one book snapshot, keyed by received utc nanoseconds.
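The field names suggest that prices are stored as integer millionths and sizes as integer billionths. Under that reading (an assumption drawn from the names, not stated elsewhere in this handout), a top-of-book summary can be sketched as follows. The helper `top_of_book` is hypothetical; the inputs are the first snapshot from the sample above.

```python
PRICE_SCALE = 1_000_000      # assumed: PriceMillionths = price * 1e6
SIZE_SCALE = 1_000_000_000   # assumed: SizeBillionths = size * 1e9


def top_of_book(bid_px, ask_px, bid_sz, ask_sz):
    """Convert raw integer level-1 fields into a small summary dict."""
    return {
        "spread_millionths": ask_px - bid_px,        # stays an exact integer
        "mid": (bid_px + ask_px) / 2 / PRICE_SCALE,  # mid price in quote units
        "bid_size": bid_sz / SIZE_SCALE,
        "ask_size": ask_sz / SIZE_SCALE,
    }


# Level-1 values from the first book snapshot shown in the sample.
summary = top_of_book(35690, 35700, 32957203990, 11872084060)
```

Keeping the raw integers for sums and differences, and converting to floats only at the end, avoids accumulating rounding error.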
3 Exercise
Write code to find the τ-interval trade flow F_i just prior[2] to each trade data point[3][4] i. Compute the T-second forward returns r_i. Regress the returns against the flows in your training set to find a regression coefficient β.
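One way to organize the two computations is sketched below. It rests on several assumptions that you should revisit: flow is taken as signed size, with the sign given by the sign of Side; the flow window is [t_i − τ, t_i), which excludes trade i and every other trade sharing its timestamp; and the forward return is measured from trade i's own price to the last trade price at or before t_i + T. The function names and the toy trades are illustrative only.

```python
from bisect import bisect_left, bisect_right

NS = 1_000_000_000  # nanoseconds per second


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def trade_flow(ts, signed_size, tau_ns):
    """Signed size traded in [t_i - tau, t_i) for each i; ts sorted ascending."""
    cum = [0]
    for v in signed_size:
        cum.append(cum[-1] + v)          # prefix sums of signed size
    flows = []
    for t in ts:
        lo = bisect_left(ts, t - tau_ns)  # first trade inside the window
        hi = bisect_left(ts, t)           # first trade at time t (excluded)
        flows.append(cum[hi] - cum[lo])
    return flows


def forward_returns(ts, price, horizon_ns):
    """Return from trade i's price to the last trade price by t_i + T."""
    out = []
    for t, p in zip(ts, price):
        if t + horizon_ns > ts[-1]:
            out.append(None)              # horizon runs past the data
            continue
        k = bisect_right(ts, t + horizon_ns) - 1
        out.append(price[k] / p - 1.0)
    return out


# Made-up trades: (timestamp_ns, price, size, side), sorted by timestamp.
trades = sorted([
    (0 * NS, 100, 10, 1),
    (1 * NS, 101, 5, -1),
    (1 * NS, 101, 3, 2),
    (2 * NS, 102, 4, -1),
    (5 * NS, 103, 1, 1),
], key=lambda row: row[0])
ts = [row[0] for row in trades]
price = [row[1] for row in trades]
signed_size = [row[2] * sign(row[3]) for row in trades]

flows = trade_flow(ts, signed_size, tau_ns=2 * NS)    # [0, 10, 10, 8, 0]
rets = forward_returns(ts, price, horizon_ns=1 * NS)
```

The prefix sums plus binary search keep the cost near O(n log n), which matters at these data rates. Note that the two trades at 1 s see the same flow, since neither counts the other.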
For each data point in your test set you already have F_i, so your return prediction is r̂_i = β F_i. Define a threshold j for r̂_i and assume you might attempt to trade whenever j < |r̂_i|.
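The fit-then-threshold step might look like the sketch below. It assumes a regression through the origin, so that the prediction is simply β times the flow; including an intercept is a reasonable variation. The names `fit_beta` and `signals` and all of the numbers are made up for illustration.

```python
def fit_beta(flows, returns):
    """Least-squares slope through the origin: sum(F*r) / sum(F*F)."""
    pairs = [(f, r) for f, r in zip(flows, returns) if r is not None]
    sxx = sum(f * f for f, _ in pairs)
    if sxx == 0:
        raise ValueError("flow has no variation; beta is undefined")
    return sum(f * r for f, r in pairs) / sxx


def signals(flows, beta, j):
    """Return (index, predicted return) wherever j < |beta * F_i|."""
    return [(i, beta * f) for i, f in enumerate(flows) if abs(beta * f) > j]


# Toy training data lying exactly on a line of slope 0.001.
train_flows = [1.0, 2.0, -1.0, -2.0]
train_returns = [0.001, 0.002, -0.001, -0.002]
beta = fit_beta(train_flows, train_returns)

# Toy test flows; only predictions larger than j in magnitude are kept.
test_flows = [0.5, -3.0, 4.0]
picked = signals(test_flows, beta, j=0.002)
```

β is estimated once on the training set and then held fixed, so nothing from the test set leaks into the fit.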
4 Analysis
Assess the trading opportunities arising from using these return predictions in your test set. As part of this assessment, comment on the reliability of β, how you chose j, and what you might expect from using much longer training and test periods.
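As a starting point for the assessment, the sketch below summarizes the signals that clear the threshold: how many there are, how often the predicted direction was right, and the mean realized return when trading in the predicted direction. It ignores fees, spread, and queue position, so it measures opportunities rather than achievable profit. The function `assess` and its inputs are hypothetical.

```python
def assess(predicted, realized, j):
    """Summarize signals where j < |predicted| and a realized return exists."""
    picked = [(p, r) for p, r in zip(predicted, realized)
              if r is not None and abs(p) > j]
    if not picked:
        return {"n": 0, "hit_rate": None, "mean_signed_return": None}
    # Realized return earned by trading in the predicted direction.
    signed = [r if p > 0 else -r for p, r in picked]
    hits = sum(1 for s in signed if s > 0)
    return {
        "n": len(picked),
        "hit_rate": hits / len(picked),
        "mean_signed_return": sum(signed) / len(signed),
    }


# Made-up predictions and realized forward returns for four test points.
predicted = [0.003, -0.004, 0.0005, 0.005]
realized = [0.002, 0.001, 0.01, None]
report = assess(predicted, realized, j=0.002)
```

Running this over a grid of j values shows the trade-off between the number of signals and their quality, which is one way to motivate your choice of j.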
[1] Note that inaccuracies in clock settings, i.e. “clock skew”, can cause timestamps to appear later than the time at which they are recorded as having been received.
[2] We do not include the trade i data itself, because we are evaluating trade i in terms of the flow we would have been aware of just before it happened.
[3] NOTE: the trade data series does not necessarily have strictly increasing timestamps. Be sure not to include other trades at the same timestamp in your computation of Fi.
[4] It is not necessary to handle latency in your homework, but for your edification: a more careful implementation would account for lags. For a pessimistic approach we could choose a lag L as, say, twice the 99th percentile of computational and communications lag. The implementation would then use book data (not just trade data) to help compute the return from time t_i + L to t_i + L + T, and run the regressions on that. The idea here is that it takes approximately time L to “do anything” about trade information.
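The lag-aware variant in footnote 4 could be sketched as follows. The assumptions here are that L is estimated from a sample of observed lags with a nearest-rank percentile, and that the return is measured mid to mid using the last book snapshot received at or before each time. All function names, lag values, and snapshot times are illustrative; the bid and ask prices are taken from the book sample.

```python
from bisect import bisect_right


def pessimistic_lag(lags_ns, pct=99, multiple=2):
    """L = multiple x the pct-th percentile (nearest rank) of observed lags."""
    ordered = sorted(lags_ns)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct / 100 * n)
    return multiple * ordered[rank - 1]


def mid_at(book_ts, bids, asks, t):
    """Mid from the last snapshot received at or before t, else None."""
    k = bisect_right(book_ts, t) - 1
    if k < 0:
        return None
    return (bids[k] + asks[k]) / 2


def lagged_return(book_ts, bids, asks, t_i, lag_ns, horizon_ns):
    """Mid-to-mid return from t_i + L to t_i + L + T."""
    start = mid_at(book_ts, bids, asks, t_i + lag_ns)
    end = mid_at(book_ts, bids, asks, t_i + lag_ns + horizon_ns)
    if start is None or end is None:
        return None
    return end / start - 1.0


# Made-up lags of 1 ms to 100 ms, expressed in nanoseconds.
lags = [i * 1_000_000 for i in range(1, 101)]
L = pessimistic_lag(lags)

# Made-up snapshot times; prices are the level-1 values from the sample.
book_ts = [0, 100, 200, 300]
bids = [35690, 35690, 35760, 35760]
asks = [35700, 35700, 35770, 35770]
r = lagged_return(book_ts, bids, asks, t_i=0, lag_ns=100, horizon_ns=100)
```

A fuller version would also return None when t_i + L + T falls after the last snapshot, rather than reusing a stale quote.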