When in doubt, choose the simpler machine learning model over the more complex one, because it tends to generalize better to new data in a wide variety of cases.

Machine learning is the only field where you need to restrain yourself:

  • if you’re a runner, you want to run as fast as possible
  • a swimmer wants to swim as fast as possible
  • a chess player wants to win as often as possible
    • and find the best possible move

But only in machine learning do you need to do less than is possible.

Otherwise you’ll overfit: your model will memorize the noise in the training data and give terrible results on anything new.
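Here is a minimal sketch of what overfitting looks like in practice. Everything in it is my own illustration and not from a real project: the noisy sine curve is synthetic, the polynomial degrees (1, 3 and 10) are arbitrary, and the helper names are made up. It fits each polynomial on 15 training points and compares the error on those points with the error on 200 fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(n):
    # Synthetic data (an assumption for illustration): sine curve plus noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.3, n)
    return x, y

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on the points (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # fresh data the model never saw

results = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The training error never gets worse as you raise the degree, which is exactly why it is a trap. The number to watch is the error on the fresh data.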

How Far Can You Push?

This is based on experience but ultimately – it’s subjective.

Because you’re not pushing against a wall as hard as you can. Instead, you’re walking along a cliff’s edge and hoping you don’t go too far and break your model.

BUT

If you are going to make a mistake, it’s better to make a mistake towards your model being too simple.

Why?

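Because usually this will be a smaller mistake: a model that is too simple fails in plain sight on your training data, while an overfitted one looks great until it meets new data.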

So I suggest you develop a bias towards simplicity when dealing with machine learning models.
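One way to turn that bias into a habit is a rule of thumb like the sketch below: among the models whose validation error is close enough to the best one, take the simplest. This is only my assumption of how you might encode it. The function name, the 5% tolerance, the "complexity" numbers (think parameter counts) and the validation errors are all hypothetical.

```python
def pick_simplest_within_tolerance(candidates, tolerance=0.05):
    """Pick the least complex model that is 'good enough'.

    candidates: list of (name, complexity, validation_error) tuples.
    tolerance:  how much worse than the best error we accept (0.05 = 5%).
    """
    best_error = min(err for _, _, err in candidates)
    threshold = best_error * (1.0 + tolerance)
    good_enough = [c for c in candidates if c[2] <= threshold]
    # Among the acceptable models, the lowest complexity wins.
    return min(good_enough, key=lambda c: c[1])

# Hypothetical numbers, for illustration only.
candidates = [
    ("linear model", 2, 0.208),
    ("small tree", 15, 0.204),
    ("deep ensemble", 5000, 0.200),
]

chosen = pick_simplest_within_tolerance(candidates)
print(chosen[0])
```

How large the tolerance should be is the subjective part. With a tolerance of 1%, the same function would pick the most complex candidate here.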

Data: Gold Mine

Everybody skips this part – and you will too.

Why?

Because you assume you know what this means.

But you don’t – not even close.

As an example, I’ll share a story (as much of it as my NDA allows) about how understanding your data makes your model unbeatable.

And how it can give you a strategic market advantage that puts you ahead of any competitor.

ASSUMPTIONS

Everybody assumes the data they’re working with is not complete garbage. Let me be the first to tell you – it is.

All I can disclose is that I was working with weather data.

You can simply access it on the net (from official sources).

BUT

If you don’t ask yourself how this data is sourced, you will NEVER find gold inside it.

The official data gives you just the averages across several measurement locations. Which wouldn’t be a problem if those locations weren’t “tampered” with.

By “tampered” I mean that the measurement conditions changed.

The data set covered the past 50-100 years, and the surroundings of the measurement stations changed during that time.

Specifically, those measurement stations used to stand on grassy areas outside cities.

HOWEVER

50-100 years later they found themselves in the middle of concrete.

Have you ever walked on an asphalt road or concrete during the summer?

How about the grass?

Noticed any difference?
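To show the kind of check this calls for, here is a minimal sketch that is not the method from the actual project (that stays under NDA). It assumes you can get per-station readings and not just the published average. The data is synthetic: five hypothetical stations share one regional signal, and I deliberately add a slow warming bias of up to 2 degrees to station 2, as if concrete had grown around it. The check compares each station with the mean of the other stations, early years versus late years.

```python
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1950, 2020)
n_stations = 5

# Synthetic readings (illustration only): a shared regional signal
# plus independent measurement noise for every station.
regional = 15.0 + rng.normal(0.0, 0.5, years.size)
temps = regional[:, None] + rng.normal(0.0, 0.3, (years.size, n_stations))

# Station 2 slowly ends up surrounded by concrete: bias grows from 0 to +2.
temps[:, 2] += np.linspace(0.0, 2.0, years.size)

def drift_vs_neighbours(temps, station):
    # Difference between one station and the mean of all the others,
    # averaged over the late half of the record minus the early half.
    others = np.delete(temps, station, axis=1).mean(axis=1)
    diff = temps[:, station] - others
    half = diff.size // 2
    return float(diff[half:].mean() - diff[:half].mean())

drifts = [drift_vs_neighbours(temps, s) for s in range(n_stations)]
suspect = int(np.argmax(np.abs(drifts)))

for s, d in enumerate(drifts):
    print(f"station {s}: drift vs. neighbours {d:+.2f}")
print("station to investigate first:", suspect)
```

A plain average over all five stations would quietly absorb that bias. Looking at each station against its neighbours is what makes the odd one stand out, and it only tells you where to start asking questions, not what actually happened on the ground.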

DILIGENCE

If we had just assumed the data was correct, we would have come to completely incorrect conclusions.

And the entire project would have been a disaster and a failure.

Instead, I can tell you the stakeholders are pretty happy with the results (to say the least).

And the only difference between success and failure (between making and losing millions) was in questioning every piece of data we got our hands on.

No matter how stupid the question appeared to be – it had to be answered and verified.

Why?

Biggest Opportunities

You see, the biggest opportunities hide in plain sight, in places no one is even questioning.

Just remember how many lives were lost because people didn’t wash their hands (they didn’t think germs were real).

It seems so obvious today. Often we wonder how people could be so naive. This could never happen today.

BUT

Let me tell you a “little” secret… nothing has changed.

There are still more things than you could possibly imagine that are completely wrong, yet people think they are completely true.

And because they believe, they never bother to check.

Best Data Scientist

Data preparation and model selection are relatively easy compared to asking the right questions.

If you can (at least temporarily):

  • abandon everything you think you know
  • objectively and critically examine every piece of data
  • and especially take into consideration all the circumstances around how that data was collected
  • then take into consideration that this data was collected by people who carried biases of their own

OPTIONAL PRACTICE: every day, take one thing that is 100% true (or at least that’s what you think) and try to find some good counter-arguments for that thesis.

=> Then you will be able to:

  • find the best opportunities
  • create the best performing machine learning models
  • and work on the most interesting and highest paying projects
