Data science competitions are an exciting way to test your data science knowledge on real-world problems. They can enhance your existing knowledge, inspire creativity, and improve your problem-solving skills. With platforms like Topcoder and Kaggle, it has never been easier to start your data science journey. These platforms have become more than just organized competitions: they are communities in which data scientists can learn from each other and discuss ideas. There are even job opportunities that come out of these competitions.
But along with all these pros, there is a major con that most newcomers face: atychiphobia, an irrational and persistent fear of failure. Most of the time they believe that they are too late or cannot figure out where to start. Although winning is not everything, it is always thrilling to be a winner. In this article, I will share the most important practical skills needed to win any data science competition. More specifically, you will learn:
What skills you need to acquire
Why these skills are important
How to improve upon your skills
To be clear, this guide is meant for beginners. I will avoid technical jargon as much as possible, but if you are interested in going deeper, I will also post links that point you in the right direction. Excited? So am I. Let’s get started.
Exploratory Data Analysis, popularly known by its shorthand EDA, is not only important but also an interesting way to summarize the main characteristics of the competition data. It is also the best way to understand the problem more clearly. You will have a strong urge to try out that popular deep learning model and submit the results, but it is always advisable to gather as many insights about the data as possible before going further. That being said, I am not against deep learning; however, in real-world problems most models will not achieve good accuracy if they are not robust to outliers.
What do I mean by that? Your model must take into account the edge cases where some observations deviate significantly from the majority. How do you accomplish that efficiently? By using various data visualization techniques. Seaborn is a very popular statistical data visualization library in Python. EDA is not just about outlier detection: it is also used to find missing values, discover new patterns in the data, and test your initial hypotheses.
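For example, a quick way to spot outliers and missing values is a Seaborn box plot combined with a null-value count. This is a minimal sketch assuming a pandas DataFrame loaded from a hypothetical train.csv with a numeric column named price (both names are placeholders):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical training data; in a competition this would be the provided CSV
df = pd.read_csv("train.csv")

# A box plot makes extreme values stand out at a glance
sns.boxplot(x=df["price"])
plt.title("Distribution of price (points beyond the whiskers are potential outliers)")
plt.show()

# Missing values per column, sorted so the worst offenders appear first
print(df.isnull().sum().sort_values(ascending=False).head())
```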
Now that you know how important EDA is, I would also like to mention that it is easy for even a novice to get started with quickly. Here is a link to a very good micro-course on Data Visualization.
Now that you know how to handle outliers and make your data consistent by filling in missing values, you need to decide which part of the data (that is, which features) to feed to your machine learning model. Not every feature needs to be an input to your model. You will often encounter features that reduce your model's accuracy because they have little or no relationship with the output variable. Including them just makes your model inefficient, since each new feature significantly increases the number of trainable parameters.
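As a rough illustration of dropping weakly related features, here is a sketch that ranks numeric features by their absolute correlation with a hypothetical target column and keeps only those above a small threshold. Correlation only captures linear relationships, so treat this as a first-pass filter rather than a rule:

```python
import pandas as pd

# Hypothetical DataFrame with numeric features and a numeric "target" column
df = pd.read_csv("train.csv")

# Absolute correlation of every numeric feature with the target
correlations = df.select_dtypes("number").corr()["target"].abs().sort_values(ascending=False)
print(correlations)

# Keep only features whose correlation with the target exceeds a small threshold
selected = correlations[correlations > 0.05].index.drop("target")
X = df[selected]
y = df["target"]
```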
Feature engineering also involves breaking existing features into multiple simpler features and adding new ones, as in the sketch below. There are several heuristics available to help you with this process, but in my experience, having some domain knowledge gives you a little edge in drawing conclusions from those heuristics. Here is a link to one more micro-course on Feature Engineering.
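For instance, a single timestamp column can often be decomposed into several simpler features. A small sketch with a hypothetical purchase_time column:

```python
import pandas as pd

# Hypothetical raw data with a timestamp column
df = pd.DataFrame({"purchase_time": pd.to_datetime([
    "2020-01-03 09:15", "2020-06-21 18:40", "2020-12-25 07:05"
])})

# Break one timestamp feature into several simpler ones
df["hour"] = df["purchase_time"].dt.hour
df["dayofweek"] = df["purchase_time"].dt.dayofweek
df["month"] = df["purchase_time"].dt.month
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)

print(df.head())
```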
Have you ever achieved wonderful accuracy on the training data, only to get noticeably lower accuracy once you submit your model for evaluation on the hidden test data (which is often 3-4x the size of the training data)? This is probably because your model is overfitting the training data. Overfitting happens when your model starts to memorize the training examples instead of learning the general pattern of the problem you are trying to solve.
How do you detect overfitting at a very early stage of training? By building a stable cross-validation (popularly recognized by its shorthand CV) setup. The basic idea is to reserve some part of the training dataset for validating the generalization capability of the model (hence the name validation set).
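Here is a minimal sketch of k-fold cross-validation with scikit-learn, using synthetic data in place of real competition features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for competition features X and labels y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)

# 5-fold CV: train on 4 folds, validate on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
```

A large gap between training accuracy and the mean CV accuracy is an early warning sign of overfitting.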
How do you prevent overfitting, or remove it once detected? There are a few popular methods. The first and most intuitive is to obtain more data for the model. This works, but most of the time it is not feasible for many reasons. A more practical solution is regularization, such as an L1 (Lasso) or L2 (Ridge) penalty, which discourages the model from fitting overly large weights. In practice, though, it takes some experience to use these techniques well.
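As a rough illustration, L1- and L2-regularized linear models can be compared with scikit-learn on synthetic data; the alpha parameter controls how strongly large weights are penalized:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=50, noise=10.0, random_state=0)

# Larger alpha = stronger penalty on the weights = simpler model
for name, model in [("L1 (Lasso)", Lasso(alpha=1.0)), ("L2 (Ridge)", Ridge(alpha=1.0))]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {score:.3f}")
```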
One more effective and easy-to-use solution is dropout. The idea is to randomly disable a portion of a layer's outputs during training so that the network cannot rely too heavily on any single neuron. Preventing a model from overfitting is an entire subject in itself, and it would not be possible to cover everything here. Maybe I will write a separate article on this topic soon. Meanwhile, you can always dig deeper into the details with a little help from professor Google.
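Here is a minimal sketch of dropout in a small Keras network, assuming 20 input features (an arbitrary number chosen for illustration):

```python
import tensorflow as tf

# Dropout randomly zeroes 50% of the previous layer's activations during
# training, forcing the network not to depend on any single neuron
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```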
With all the cool deep learning frameworks available right now, it is often a matter of learning a few lines of code before you can create a model that identifies cats and dogs. Learning those frameworks is essential. But do you think that merely knowing how to build and train models with them gives you any competitive advantage? No, because everyone else knows how to do that too.
The secret lies in those tiny variables that you often leave at the default values you see in an online tutorial (I’ve done it too). According to Wikipedia:
A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.
The definition is quite intuitive. Some of the important hyperparameters are the learning rate, the number of hidden layers in a neural network, and the dropout rate if you are using dropout to prevent overfitting. As with any other skill, you will master setting the right hyperparameters over time, but it is important to give them enough attention right from the beginning.
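As an illustration, here is a simple grid search over a couple of hyperparameters with scikit-learn, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Try a small grid of learning rates and tree depths instead of the defaults
param_grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 3, 5]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```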
There are often repeatable patterns in winning solutions. This entire article is just a summary of what you will find in them. However, you will gain more insight if you actually go through the winning solutions of past competitions. Yes, this is a bit of hard work, but it is worth it. You can find some of the winning solutions of past Kaggle competitions here, although it seems the page is no longer updated.
Last but not least: get started as soon as possible. Do not hesitate to fail; experience is the best teacher. For rapid iteration, it is advisable to set up an automated data processing pipeline. With time you will notice patterns in the repetitive tasks and build up reusable code that you can apply to future competitions with minor changes. If you can beat your previous score, that is a win. Eventually, you will shine on the leaderboard as well.
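As one possible starting point, here is a sketch of a reusable preprocessing-plus-model pipeline in scikit-learn; the column names are placeholders you would swap for those of the competition at hand:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists; replace with the columns of your competition data
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Impute and scale numeric columns; impute and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

# One object bundles preprocessing and the model, so the whole thing can be
# re-fit on new data or reused in the next competition with minor changes
pipeline = Pipeline([("preprocess", preprocess),
                     ("model", RandomForestClassifier(random_state=42))])
```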