MarketNet: Computer Vision Applications in Financial Markets

Andrew Kreimer
Mar 22, 2024


We've got WordNet and ImageNet; it's time for a MarketNet. As traders we follow various asset classes across multiple time-frames. Human visual abilities are incredible, but we get tired, sloppy and biased. I wondered what would happen if traders had a decision support system, an unbiased second opinion to confirm their entries. Here comes MarketNet, inspired by some questions I've had for quite some time:

  1. Is the network able to identify the market? Is it a snapshot of E-Mini S&P 500 futures or a snapshot of GOOGL stock?
  2. Is the network able to predict future moves? Obviously, it is a money maker, right? But as traders, we know that what's obvious is obviously wrong [Druckenmiller].
  3. Is the network able to identify profitable trades?

Here is a brief summary of the experiments I did. Work on this project started on 2022-09-27.

Questions

The first question while messing around with the data was: is the model able to identify the market? Obviously, we know which market we are currently watching and trading, but are they really different?

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1./255),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_classes)  # num_classes is set when the image dataset is loaded
])

The setup is a simple vanilla model from the load and preprocess images guide by TensorFlow. Nothing fancy here: I've stretched the image input size to a square of 200 pixels and increased the number of epochs to spot overfitting points. Data was stored in separate directories for each market. The first experiment is within the equities asset class, including ES, NQ, YM and RTY.
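For reference, here is a minimal sketch of how such a flow might be wired up, assuming one sub-directory per market under a hypothetical images directory (the path, seed, image size and epoch count are illustrative, not my exact values):

import tensorflow as tf

# Hypothetical layout: one sub-directory per market, e.g. images/ES/, images/NQ/, ...
data_dir = 'images'

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir, validation_split=0.2, subset='training',
    seed=42, image_size=(200, 200))  # stretched to a 200x200 square
val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir, validation_split=0.2, subset='validation',
    seed=42, image_size=(200, 200))

num_classes = len(train_ds.class_names)  # consumed by the model definition above

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=30)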

Identifying equities: poor performance.

The model was poor and started overfitting as well, so I killed it. Then I decided to try the same idea within the interest-rate asset class.

Identifying interest rates: not spectacular.

Identifying interest rates (ZB, ZN or FGBL) was not spectacular either, and the model eventually started overfitting as well.

Going across various asset classes: trying to identify a market with targets ES, RTY, ZB and FGBL.

Identifying multiple markets across various asset classes: not spectacular.

All in all, there is nothing interesting in the asset identification domain; we have just been warming up.

Identify Side

Now let us separate the images into buying and selling days. Any positive change for the day (Close - Open) is considered a buy target, and vice versa.
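As a minimal sketch of the labeling rule, assuming a pandas DataFrame of daily bars with hypothetical 'open' and 'close' columns:

import pandas as pd

def day_side(daily: pd.DataFrame) -> pd.Series:
    # A positive daily change (Close - Open) is a 'Buy' day, anything else is a 'Sell' day.
    change = daily['close'] - daily['open']
    return change.apply(lambda x: 'Buy' if x > 0 else 'Sell')

# Each day's chart image is then saved into the matching 'Buy'/'Sell' directory.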

Buy side days (NQ).

It is important to note that the data being used is provided by the IB TWS API and has been collected since 2020. Generating images takes a lot of time due to many Matplotlib-related operations.

Sell side days (NQ).

The open price is defined as the CME opening price and the close price is defined as the NYSE closing price. Bars are 30-minute bars.

Can identify the side of the market for a day (NQ).

Look at that, the model is able to detect the side for a day. Early stopping was not crucial here: it does overfit, but keeps steady metrics, so all in all it is OK. Well, that's spectacular but obviously has no value. We don't need to state the obvious, and we know how to identify the trend for the day (particularly once the trading day is over); we need more.

Let’s Get Serious

OK, now let us try to build something more useful that can actually be used in day trading. Suppose you have an edge, a system of rules that receives the current market state as input and outputs the suggested side for action: buy, sell or stay aside. That is, in its most basic form, a trading strategy. Now let's boost the decision by providing an extra pair of eyes via a Machine Learning model.

4 targets of interest.

With some historical data analysis, it is possible to simulate the outcomes of your edges (trades). Essentially we have four class variables (targets). The first target is a clear buy signal where the STP (stop loss) was not hit. The complement class is the same buy signal where the STP was hit and we lost money. In a way it is like spam detection for both (possible) sides of the trade. The same goes for the sell side: true sell signals which would (maybe) make money vs. sell signals that would hit the STP and lose money (for sure).

It is important to note that a positive outcome does not mean we definitely made money, nor does it mean we didn't lose money. All it tries to predict is whether a trade avoided hitting the STP and losing money for sure. It is a very important distinction of the problem. Assuming you have an edge, on balance, once you sample enough (trade the trades, do the deeds, walk the walk) you should be fine (make money). All we are trying to do here is essentially to improve the edge itself as is. As we are obliged to take every trade, the modeling helps to filter only the good ones.

A word about the paradox introduced here. In order to become a consistently profitable trader, you must have an edge and trade it like a robot. An edge is a trading system that, given a large enough sample, makes money for sure. Mental stamina and following the system religiously are where the vast majority fails. Trading systems are hard to find, but it is much harder not to mess with or change your system once you have a losing streak (and you will, for sure). But then again, we are trying to build a model that essentially changes our strategy and allows us not to follow it from time to time, which we just mentioned is a big no-no for becoming a pro.

Let me elaborate. It is both true and false. As long as we don't change the system rules, it is OK to introduce systematic discretion into our execution routine. For instance, even if you have a signal to enter, but a CPI / GDP / FOMC event is coming in a couple of minutes, well, you should definitely skip this one. It is better to donate this money than to go in with no liquidity and chaotic market behavior before and after the event (just try it out once). The modeling we do is systematic discretion on when not to take a trade, thus it is acceptable and even encouraged. Don't you think large institutions have been working on something like this for years?

If you want to read more about what it takes to find an edge, you are welcome to check out my previous stories:

  1. Things You Learn After 1 Year of Day Trading for a Living — a very detailed description of the ups and downs of my first attempt.
  2. Trading for a Living — yet another detailed description of the ups and downs of my second attempt.

Identifying 4 targets in a single net.

The first trial was to learn all of the targets together. The model was clearly poor and impractical.

Now let us try to build a model for the buy side only: to buy or not to buy? This will require two models (one per side), but as we know for sure which side we are about to take, it is possible to differentiate between the models. Obviously, the problem is much simpler now. It is always a trade-off, but we can afford it.

To buy or not to buy?

The performance is acceptable, but early stopping is required, otherwise it starts to overfit.

Now let us try to identify a (maybe) profitable sell signal vs. a losing sell signal (for sure).

To sell or not to sell?

Again, the performance is acceptable, but early stopping is required.

So far, we have too much noise and unreliable models.

Improving Inputs

A word on synthetic data generation is important. Some market states have a small number of bars (a signal to buy appears right after the CME open). Other signals have almost all of the bars for the day (a sell signal 1 hour before the NYSE close). Obviously, generating images with varying widths requires some kind of adjustment (to keep them square). So we calculate the defined range for the particular market state and add extra columns to compensate for markets with a low number of bars. Markets with many bars and/or smaller ranges naturally generate square-like images and sometimes do not require any adjustments at all. Why bother? Well, because sometimes it is important to check things manually, and we prefer to keep the synthetic data in a unified, similar format: squares, black and white, etc.
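A rough sketch of the padding adjustment, assuming the synthetic image is a NumPy matrix with one row per price tick and one column per bar (the helper name is mine, not from the original code):

import numpy as np

def pad_to_square(pixels: np.ndarray) -> np.ndarray:
    # Append empty (black) columns on the right until there are at least as many columns as rows.
    rows, cols = pixels.shape
    if cols >= rows:
        return pixels
    padding = np.zeros((rows, rows - cols), dtype=pixels.dtype)
    return np.hstack([pixels, padding])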

Another word on statistics and trading: you are probably familiar with the two approaches / schools of thought / methodologies — frequentists vs. Bayesians. Well, trading requires both: you need classical statistical analysis for defining risk, and you need Bayesian analysis for assessing your edge. ATR is based on classical statistics. Trading in the Zone [Douglas] and the famous 20 trades experiment is classic Bayesian analysis. The yin and yang of trading is using those two approaches equally and appropriately.

Oh, yeah, and don’t forget the noise:

  1. Thinking, Fast and Slow [Kahneman]
  2. Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets [Taleb]
  3. The Black Swan: The Impact of the Highly Improbable [Taleb]
  4. Noise: A Flaw in Human Judgment [Sunstein, Kahneman, Sibony]
  5. Antifragile: Things That Gain from Disorder [Taleb]

OHLC

The next step is about introducing open and close prices as well. So far we used only ranges via high and low prices, but obviously the entire bar's metadata is important and might improve modeling. We introduce a modified version of the utility that creates images, this time with open, high, low and close prices. Note that we stick to black and white color maps to keep things simple. The pixel matrix sets 1 for a bar's tails and 2 for a bar's body. Currently there is no differentiation between green/red bars (but it might be the next improvement, thinking out loud).
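A sketch of the encoding for a single bar, assuming prices have already been mapped to integer row indices in the pixel matrix (the helper and its arguments are illustrative):

import numpy as np

def draw_bar(pixels: np.ndarray, col: int, o: int, h: int, l: int, c: int) -> None:
    # Rows are price levels ordered low to high; the full high-low range gets 1 (tails),
    # then the open-close range overwrites it with 2 (body).
    pixels[l:h + 1, col] = 1
    pixels[min(o, c):max(o, c) + 1, col] = 2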

Improved inputs: OHLC metadata.

Visually speaking, the data is much easier to understand and test with the open/close metadata, but we are not the network being fitted.

Identifying 4 targets in a single net.

Trying to classify all of the 4 targets together: overfitting and a poor model.

To buy or not to buy (ES)?

To buy or not to buy: not bad, starts to overfit though.

To sell or not to sell (ES)?

To sell or not to sell: again not bad, but starts to overfit.

Zoom In

Maybe taking a snapshot of the entire day is too much? Maybe all we need is the last couple of bars, but how many? I decided to try many options, from an impractical 4 bars (which seems too little) to the original idea (taking all of the bars so far in the day).

Data sets with 4 targets each.

Data sets starting with HL use only high and low prices to create images. Data sets starting with OHLC use open, high, low and close prices to create images. The difference has been described above. Now let's try various TPO counts. TPO stands for Time Price Opportunity in Market Profile terminology, and it is just another name for a single 30-minute bar. When no TPO count exists, we take all of the bars available when the signal was to buy or sell. If a TPO count exists, we take only that number of most recent bars: 4, 8, 12 or 16. Essentially the network now sees either the entire data for the day, or the last 2, 4, 6 or 8 hours of data, as sketched below.
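The slicing itself is simple; a sketch, assuming bars is the list of 30-minute bars available up to the signal:

def last_tpos(bars, tpo_count=None):
    # No TPO count: take everything so far in the day; otherwise only the last 4/8/12/16 bars.
    return bars if tpo_count is None else bars[-tpo_count:]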

Results for images created via HL.

Let us summarize the results for all of the possible models. We start with a simple (and not practical) identification of all of the 4 targets (buy, not to buy, sell and not to sell). We continue with more relevant classification for the buy side and the sell side separately. Both have two targets: to take the trade or not to take the trade.

Results for images created via OHLC.

Clearly, taking only the last few bars has no value, but giving the network all of the bars increases the risk of overfitting. Validation performance is not spectacular, but the models may be used. Adding open and close prices did not seem to help much either. All in all, these models are OK, but I'm reluctant to use them in my day trading so far. Even if the performance is acceptable, there is so much noise involved, and the metrics are fragile and volatile.

A word on Machine Learning best practices. I have not been part of Andrej Karpathy's team or anything, but his lectures make you feel like you were.

Anyway, I've done my fair share of modeling for various companies once upon a time as a Big Data Engineer / Data Scientist. A couple of suggestions and things I've learned the hard way:

  1. Less is much more. Simple models rock, fancy models suck.
  2. Feature Engineering is much more important than (simple) models.
  3. Data quality is never perfect, just relax, you will never have a log loss of 0 on train and validation.

Zoom Out

Let us now try to zoom out: provide the network with many more input bars than necessary. If a signal is created intraday, the previous day's history is easily accessible and can be added to the images.
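A sketch of the idea, assuming intraday bars indexed by a pandas DatetimeIndex (names are illustrative):

import pandas as pd

def with_previous_day(bars: pd.DataFrame, signal_time: pd.Timestamp) -> pd.DataFrame:
    # Prepend the previous day's bars to today's bars up to the signal time.
    today = signal_time.normalize()
    prev_day = bars[(bars.index >= today - pd.Timedelta(days=1)) & (bars.index < today)]
    intraday = bars[(bars.index >= today) & (bars.index <= signal_time)]
    return pd.concat([prev_day, intraday])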

OHLC data for yesterday and today in a single image.

Wider images are squished to the good old squares, just to be consistent. The results are significantly better, particularly for the OHLC data. It finally seems to improve modeling.

Results for images including yesterday.

Now these are models that I'm considering relying on in my intraday trading. Although they were all flagged as overfitting, their log loss was well below the coin-toss baseline (about 0.69 for binary cross-entropy), and the accuracy was almost unchanged through the iterations. In addition, we have much higher accuracy levels, which again have been observed on both the train and validation sets. All in all it seems acceptable.

Market Profile

Let us now go crazy: let's build images of Market Profiles instead of just charts. In a way, the images created so far are just bar charts. If you include only high and low ranges, it is a range chart. If you do include open, high, low and close, it is a bar chart. In our experiment, we don't care about red vs. green bar coloring, thus it is not a 100% bar chart. If you mess around with the coloring of the bars, it truly is a bar chart to the core.

Now bar charts are awesome, and to be clear, I do make money currently relying only on 30-minute bars, but my secret sauce is the Market Profile (which is originally built from 30-minute bars).

Side note: my Market Profile course is available on Udemy, just in case.

Basic Market Profile.

In a nutshell, a Market Profile is just applying a group-by on price over raw historical data of 30-minute bars. Technically, any bar size will do, but the original idea sticks to 30-minute bars. You are more than encouraged to try it out. I spent 3 months drawing Market Profiles by hand, and it changed me forever. First, I found my first edge by doing it. Second, I can build a Market Profile in my head for any market, just by looking at any chart (bars, line, etc.). Highly recommended exercise.
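A minimal sketch of that group-by, assuming a list of 30-minute bars with high/low prices and a fixed tick size (names and the tick value are illustrative):

from collections import defaultdict
import string

def market_profile(bars, tick=0.25):
    # Map every traded price level to the TPO letters (30-minute periods) that touched it.
    letters = string.ascii_lowercase + string.ascii_uppercase
    profile = defaultdict(str)
    for period, bar in enumerate(bars):  # one letter per 30-minute bar
        level = bar['low']
        while level <= bar['high']:
            profile[level] += letters[period]
            level += tick
    return profile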

There are many concepts and much terminology to the Market Profile (POC, Value Area, single prints, etc.), but this story is not about that. In a nutshell, all you need to care about (in my opinion) is the clock (current time) and single prints (tails). There are not many opportunities right after the CME open, or when it is about to close either. A market making a new high/low for the day is very interesting and provides opportunities. Ranging markets are very tricky and risky to trade.

Images of Market Profiles.

Now, instead of drawing charts, we build a simple Market Profile and turn it into an image. It is important to note that Market Profiles originally rely heavily on letters, where each letter represents a predefined TPO (Time Price Opportunity). In layman's terms, letters represent specific 30-minute slices of the day. It's a matter of taste and/or how the software draws and defines it for you. Let's make an example: a Market Profile starting with the letter 'a' means that any time you see 'a' in the Market Profile, you know this range was traded during the first 30 minutes after the CME open. Respectively, the letter 'R' marks trading activity during the last 30 minutes before the NYSE close. Needless to say, I strongly encourage you to stop reading this, grab a pen and paper and work out the logic behind this example. Trust me, it will bring you bucks. It is important to note that in this example our Market Profile starts with a lowercase 'a', but some applications/implementations could start with a capital 'A'; it doesn't really matter as long as you stick to it and stay consistent.

Results for Market Profile images.

Feeding in images of Market Profiles is not spectacular; there is a regression here in all respects. As seen previously, including the previous day's data might help. Let's try it.

After some tweaking, I was able to build an image for a multi-day Market Profile. The core principle of any trading strategy is that you rely on something and compare it with some previous activity. It is always about the context. Bringing in the past 3 months is probably exaggerated, but the chances of making a living from just the last 3 bars are low as well.

Images of Market Profiles spanning 2 day periods.

A word on blending models. Blending rocks, always. There is no way you reduce the modeling quality if you add new (good) models and average the results. The averaging methodology doesn't really matter either. A caveat on the subject is to keep it simple though. Blending too many models is all in all inefficient, because it takes much more time to orchestrate the modeling and it is much more error prone.
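A sketch of the simplest possible blend, assuming several already-trained models producing probabilities for the same images (names are illustrative):

import numpy as np

def blend(probabilities: list) -> np.ndarray:
    # Plain average of the predicted probabilities from several models (e.g. OHLC and MP).
    return np.mean(probabilities, axis=0)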

Results for Market Profile spanning 2 day periods.

Finally a decent model! Metrics were good all the way. Even though it is flagged as overfitting, metrics were very good, with slight jumps in and out of the local minimum but staying close all the time.

Extra Checks

So far we have been modeling only E-mini S&P 500 futures; let's try modeling additional markets: E-mini Nasdaq-100, E-mini Dow Jones Industrial Average and E-mini Russell 2000. In my opinion, those are the best markets to trade in terms of liquidity (spread/depth), opportunity and importance. Liquidity is defined by many parameters, but tight spreads and deep bids/asks are key components. Opportunity is a subjective measure, but the volume for those markets doesn't lie, and the intraday swings are spectacular. Lastly, importance is yet again a subjective measure, but those indexes cover any company you know. Bad/good news is truly reflected when someone has skin in the game. Remember that any financial professional / adviser / guru / stock picker is usually just a scared fragilista with no money/risk on the line, particularly if the individual makes it their life's work to share opinions with others. Could you imagine Paul Tudor Jones sharing his buy/sell signals on X? You get the point.

Unfortunately, running the simulations and creating images suddenly became very slow. Since we now need both OHLC and Market Profile images for 4 markets, batch time went from minutes to hours. Obviously, if the data is created initially for the entire data store, then we could incrementally add new images, say once a week, but then storage becomes an issue. As always in Computer Science, there is a trade-off between space and time. Something can be done faster with more memory or storage and vice versa. It is an art much more than people think or know.

Directory tree.

Now that there is much more data and there are many more directories, let's make some order. The top-level directory 'ML' contains a directory for each market: 'ES', 'NQ', etc. Each market directory, such as 'ES', contains two directories, one for each type of image: OHLC (open, high, low and close bar chart) vs. MP (Market Profile). Each image-type directory, such as 'OHLC', contains the possible entry sides for a trade: 'Buy' vs. 'Sell'. Finally, each entry-side directory, such as 'Buy', contains the data for the yes/no question we are so interested in: to trade or not to trade an edge?

Thanks to the tf.keras.utils.image_dataset_from_directory function, we can easily point at the desired target and build separate models. For instance, provide '…/ML/ES/MP/Buy/' to build a model for incoming trades in ES on the buy side, via Market Profile images. Additionally, we can easily check the data manually, just in case.
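Something along these lines, where base_dir is a hypothetical location of the 'ML' tree and the image size is illustrative:

import os
import tensorflow as tf

base_dir = '...'  # wherever the 'ML' tree lives

# Buy-side model for ES from Market Profile images; classes are the yes/no sub-directories.
ds = tf.keras.utils.image_dataset_from_directory(
    os.path.join(base_dir, 'ML', 'ES', 'MP', 'Buy'),
    image_size=(200, 200))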

A word about fancy data formats and storing images efficiently. Although MNIST uses some sophisticated encoding and a custom format for efficiency, we keep it naive and as is. Again, it is all for that one strange day (news / crazy spikes / FED announcements, etc.) that we will have to zoom into and check thoroughly, because it was either a spectacular day or a very bad one (in terms of P&L).

Grand Finale

Let us present the final results for the big 4 equity index futures listed above, using OHLC and MP images with the plain vanilla model we started with.

Results for big 4 equity index futures via OHLC and MP images.

Ta-da! Looks great and useful as well. Blending the models will definitely improve modeling quality even further.

One Minute Bars

So we've got some decent metrics and good models. The above examples all rely on 30-minute bars, but one of my edges is actually scalping on one-minute charts. I was wondering: is it possible to model smaller time-frames? Let's give it a shot.

Images of one minute bars.

I immediately realized that my local machine wouldn't do. I decided to use my backup AWS trading machine (t3a.medium). Generating images for 1-minute bars took 24 hours just to cover 1 year of simulation. I stopped it because it was taking too much time, and we already had over 150K images, which is more than enough for a proof of concept.

Training a model on OHLC buy side.

Epochs now take more than half an hour each. The machine was lagging and felt very heavy, so I built just two models for OHLC. The metrics sucked, and there was no reason to continue this endeavor.

Overload.

Obviously, if we were forced to make it work, we could have used much more powerful machines, GPUs, TPUs, etc. All in all it's impractical, and that's why I quit right at the beginning. As traders we know to cut our losses fast; same here.

Live Trading Experiment

Trading is a pure meritocracy. You eat what you hunt, and politics won’t help. Skill is mandatory, but courage and mental stamina are sometimes even more important.

There is no reason for doing research without skin in the game. I could do an infinite number of experiments and show you how good/bad the results are, but it has no value.

Trading is all about sample size. Do one fat trade, and you can call yourself a trader, right? Unfortunately, it takes much, much more. I can't recall the number of times I was holding a big profit for the day that went bust, ending up a loser for the day. I also can't recall the number of times a trade was flirting with my stop loss level, which I had already classified as a sure loss, ending up a profit and reaching my take profit level. You get the point. It is known that profits are like eels in this business.

Money time: enough with the algorithms and theory. That's all nice, but I'm not running a charity here. Bills and taxes must be paid, and something must be saved for a rainy day. It's a business I'm running, and everything is related to Game Theory. There is no reason to do thorough research unless it can provide something in return.

Inspired by Trading in the Zone [Douglas], the only way I've been assessing trading performance ever since is in groups of 20 trades. The more the better, but 20 is the absolute minimum; everything below 20 is just noise. Let us now trade 20 new incoming edges/trades and use our fancy decision support system. The question is: if I were to use this model to strictly filter incoming trades, would I improve my P&L or not?

I zoom in on my most profitable markets so far: ES and YM. The same idea is applied to both markets. Trades are taken via 30-minute bars; it usually takes a couple of minutes to enter/exit a trade, so adding a real-time Machine Learning prediction is not a big issue (no rush).

Interpretation

As we now have an incoming stream of trades with fancy Machine Learning predictions, let us decide how to interpret them. There are many correct ways; it is not deterministic. We have two predictions for every trade: a prediction based on the OHLC (open, high, low and close) chart and a prediction based on the MP (Market Profile) chart. Every prediction is composed of a class name (Y = yes / N = no) and a probability (in the range [50, 100]). I've decided to simplify predictions by dividing them into 3 groups.

The 3 groups (scenarios) are:

  1. Y — take the trade.
  2. N — do not take the trade.
  3. A — ambiguity, OHLC and MP have mixed predictions, thus it is in a way similar to not taking the trade.

Obviously we could have added some quantile analysis on the probabilities and multiple combinations of the two predictions, but as mentioned before, less is much more.
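A sketch of this interpretation, assuming each model outputs a 'Y'/'N' class name (names are illustrative):

def interpret(ohlc_class: str, mp_class: str) -> str:
    # Collapse the two predictions into Y (take the trade), N (skip) or A (ambiguous).
    if ohlc_class == mp_class:
        return ohlc_class  # both say Y or both say N
    return 'A'             # mixed predictions: treated roughly like not taking the trade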

Results

Unfortunately, as with most trading ideas and things in the markets, it is much harder than it looks. After decent sampling (trading) it seems that all of this fancy technology and setup has no value at all!

Out-of-sample performance of the fancy DSS.

The system is pretty much OK with any outcome of the DSS. We were expecting trades labeled 'N' to clearly be bad trades, but they are profitable. As expected, 'A' labels should be avoided, and indeed their P&L is poor.

Out-of-sample performance of OHLC chart images.

Trying various cutoff thresholds for the DSS clustering did not improve the results either. Separating OHLC vs. MP explicitly also did not help. They are all either good together or bad together (in terms of total P&L).

Out-of-sample performance of Market Profile images.

Another piece of research I was conducting during the same sample and period incorporated a simple tree-based approach. No spectacular results here either. All of the labels make money, so using the signal or not has no meaning at all.

Out-of-sample performance of a simple ML (tree-based approach).

All in all, the experiment is ultimately a failure on one hand, and a great journey of education, trial and error on the other.

Possible reasons for the failure:

  1. Curse of dimensionality — there is too much data and we can't process it all. If you are an institutional trader and have the resources, you may get better performance.
  2. Seasonality — markets have extreme seasonality effects and the data is varying with random volatility all the time (heteroscedasticity). There is a huge difference between classifying cats and snapshots of markets. Cats are cats; in markets, cats can essentially have horns.
  3. Human error — something could have been improperly aligned, optimized or written in a mediocre way. No matter how experienced you are, or how good the linters are, some bugs are hard to detect, particularly in ML.

Anyway this was a very interesting learning experience.

Take 2

I was thinking about finishing the story right here. The experiment is complete, the sample size is good enough and the results are bad. But I suddenly decided to give it another shot. It took me many months to develop and test this infrastructure, and even though the results were bad, I wanted to keep on testing it. I was reluctant to drop it.

So instead I decided to change my modelling setup. First, I reduced the minimum accuracy threshold for the custom callback. Requiring a threshold of 0.95 suddenly seemed too tight; 0.8 is better, since convergence is guaranteed to happen fast and essentially we call for early stopping so as not to overfit.
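The custom callback itself is not shown in this story; a minimal sketch of the idea, assuming it simply stops training once the training accuracy reaches the threshold:

import tensorflow as tf

class StopAtAccuracy(tf.keras.callbacks.Callback):
    # Hypothetical callback: halt training once accuracy reaches the given threshold.
    def __init__(self, threshold=0.8):
        super().__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get('accuracy', 0.0) >= self.threshold:
            self.model.stop_training = True

# model.fit(train_ds, epochs=50, callbacks=[StopAtAccuracy(0.8)])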

Then I reduced the image size to 32 instead of 200. Those changes led to lightning-fast flows and much, much smaller models (~500 MiB instead of ~1 GiB). Finally, I stopped providing the training data as validation data. At the beginning this was done to have some kind of feedback on the model's performance, but after reviewing the stats of a true out-of-sample experiment, it is now clear that in this business that kind of validation is useless. The only true validation is out-of-sample experimentation. Needless to say, this reduced running times even further.

Let us now trade (sample) with the new setup and see if we get better results.

Out-of-sample performance of the modified DSS.

The modified neural network did not improve performance either. As a matter of fact, it performed worse. 'Yes' classes have a negative P&L while 'No' classes have a positive P&L. This is awkward and should be the opposite.

Out-of-sample performance of a simple ML (tree-based approach, scalping trading idea).

And again the simple tree-based approach rocks. Note how all the false classes (regardless of the probability cutoff) have a negative total P&L, while the true classes have a positive total P&L. Also note that the vast majority of trades are losers (which is common in this business), but the simple ML is able to pick out the good ones (which is exactly what we are looking for).

KISS (Keep It Simple, Stupid) is very important in this business. Unfortunately, after all the fancy trials and errors, it is obvious that simple ML modeling is much better. Another consideration which has been neglected for the entire experiment is real-time prediction time. Python module imports and prediction times are much heavier for TensorFlow compared to LightGBM. In practice, making predictions for both modeling flows usually takes no less than 6 seconds. Reducing the modeling to the tree-based approach alone cuts those times in half. It is important to note that loading (importing) times are much longer than prediction times (which are less than one second for both approaches).

Insights

Along the way, we had some insights which are important to recall. Let us summarize the most important ones.

  1. More data is good: adding yesterday's prices was a major breakthrough.
  2. Using all four vectors of historical data (open, high, low and close) improves modeling.
  3. NQ prices have huge ranges and it usually takes much more time (iterations) to create images for it than for other markets, but its performance is slightly better as well (trade-off).
  4. The sample size of those experiments is essentially tiny. I've been in the business professionally since 2017, and started this Machine Learning modeling with a data store of just 3 years; it's not enough. I could have just been fooled by randomness.
  5. Trading is a hard business. It takes years to develop an edge. It takes years to improve it. The learning never stops.

Hope you have learned something new, feel free to reach out at kreimer dot andrew at gmail dot com.
