Data science, which overlaps heavily with machine learning, is becoming a very popular career choice. Data scientists are required in a number of fields, ranging from finance to mining to big technology companies like Facebook and Google. Machine learning, a core part of data science, aims to build methods that allow machines to learn and improve from experience without being explicitly programmed. These methods help machines to find potentially novel patterns in data and use them to make educated predictions. Machine learning has given us a number of key technologies, such as self-driving cars, face recognition at airports and efficient web search. In fact, machine learning is so pervasive these days that you probably use it all the time without even realizing it.
This guide will provide some useful tips to build your career in data science.
The first step in your data science journey should be study that builds a solid foundation of knowledge. There are a number of online and in-person courses to choose from.
It is a good idea to check out your local university, as most offer onsite courses on data science, usually as part of computer science or engineering degrees. Most courses begin with a refresher on prerequisite topics such as linear algebra, probability, calculus and programming. If you don’t have a background in these, it may help to study them separately.
If you prefer to study at home or you have limited time available, then online study could be a great option for you. There are many online data science courses, and the Coursera website has the best selection of these. Many of its courses are free to enroll in, but you may need to pay a small fee to get the course completion certificate:
One of the top courses in this area is “Machine Learning” by Stanford’s Professor Andrew Ng, which has been completed by over three million people. Andrew is a world-leading expert in the field, and he presents the material in a simple and understandable way. The course takes approximately fifty-four hours to complete.
IBM offers a short course titled “What is Data Science?” that will teach you the basics of data science in just ten hours of study.
Johns Hopkins University offers the “Data Science Specialization”, consisting of ten courses that introduce you to data science. This specialization covers the concepts and tools you’ll need throughout the entire data science pipeline, from asking the right questions to making inferences and publishing results. It takes about eleven months to complete at seven hours/week.
IBM offers the “Introduction to Data Science Specialization”, consisting of four courses that will help you to gain foundational data science skills. The courses will introduce you to what data science is and what data scientists do. You’ll discover the applicability of data science across fields, and learn how data analysis can help you make data-driven decisions. This specialization takes about four months to complete at four hours/week.
Python has become a very popular programming language for data science. The University of Michigan offers a great specialization that teaches you Python, called “Python for Everybody Specialization”. You will learn how to use Python’s data structures, scrape the web with Python and work with SQL databases. This specialization takes about eight months to complete at three hours/week.
Professor Andrew Ng also offers the “Deep Learning Specialization” for more advanced users. Deep learning has become a key factor in machine learning’s recent progress. This specialization takes about four months to complete at five hours/week.
There are also free courses offered by other online platforms. edX offers a popular course called “Machine Learning”, which takes twelve weeks to complete at eight to ten hours/week.
Fast.ai offers a course called “Introduction to Machine Learning for Coders”. This course is presented by Jeremy Howard, who was once Kaggle’s number one competitor. The course takes about twelve weeks to complete at eight hours/week.
I believe that every aspiring data scientist should have a few good books at their disposal. Books are a great way to learn the material in your own time and at your own pace. They also provide greater depth than you are likely to get from a course, including proofs, references and examples. Good books also come with exercises and solutions.
There are many books available. Here are some I recommend:
“An Introduction to Statistical Learning” by James, Witten, Hastie and Tibshirani. This book is very clear and easy to understand. It provides many examples and exercises in R. This book is available for free online.
“Hands-On Machine Learning with Scikit-Learn and TensorFlow” by Aurélien Géron. This book will reinforce your Python programming skills and show you how to apply machine learning in practice.
“Pattern Recognition and Machine Learning” by Christopher M. Bishop. This is one of the earlier books on machine learning and has become a true classic.
“The Elements of Statistical Learning: Data Mining, Inference and Prediction” by Hastie, Tibshirani and Friedman. Another great book that covers many aspects of the field.
“Dive into Deep Learning” by Zhang, Lipton, Li and Smola. This is a great book for learning deep learning. It is an interactive book, providing code in a number of popular deep learning frameworks: TensorFlow, PyTorch and MXNet.
When I first started in machine learning around fifteen years ago, there were no languages or packages specifically designed for it. I had to write my own code or borrow it from a friend. These days the situation is very different: there is a plethora of tools available that are both powerful and easy to use. On top of that, many researchers release their code and models on GitHub, ready to be used.
Python has become a very popular programming language for data scientists. Python has many nice features: it is concise and easy to learn, write and understand. It also comes with many great numerical packages (NumPy, SciPy and scikit-learn) and visualisation libraries (Matplotlib, Seaborn and Plotly). Python scripts can be run via the traditional command line or interactively via Jupyter notebooks. Almost all popular deep learning frameworks have been developed in Python or have interfaces to it: TensorFlow, Keras, PyTorch, Theano, Caffe. If you want to get started quickly then I highly recommend scikit-learn. It has great examples and many methods for classification, regression and clustering.
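As a rough illustration of how quickly scikit-learn gets you to a working model, the sketch below trains a classifier on the library’s bundled Iris dataset and measures its accuracy on held-out data. The dataset, the choice of logistic regression and the 75/25 split are assumptions made purely for illustration; they are not taken from any of the courses or books mentioned in this guide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data so the model is judged on unseen examples.
# (The split size and random seed are illustrative choices.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier; max_iter is raised so the solver converges.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare predictions on the held-out data with the true labels.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same fit/predict pattern, so trying a different method usually means changing only the line that creates the model.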
For many years, MATLAB was the de facto language for scientific computing and was only recently overtaken by Python. It has a nice visual interface and comes with many great packages for data science, deep learning and computer vision. Although MATLAB is not free, it has a free “cousin” called GNU Octave.
Another programming language worth trying is R. It provides a variety of statistical and graphical techniques, and has an active open source community.
As the famous saying goes, practice makes perfect – the more you practice something the better you get at it. This rule also applies to data science. One of the best ways to practice is to compete in online programming competitions. The competition format will make you try harder and experiment with different techniques. Once the competition ends, you will be able to read about the winning submissions, which is a great way to learn and improve. There are a number of programming competitions available. Here are the best ones around:
Topcoder runs regular data science competitions called Marathon Matches. Submissions to these contests are judged objectively by an automated scoring function that feeds a live leaderboard. Allowed programming languages are Java, C++, C# and Python. Not only are these fun, but you can also win money and help companies get better results. Marathon Matches cover a variety of problems such as optimization, prediction, computer vision, and bioinformatics. Typically these matches run for three to eight weeks. There are also fun-based matches that ask competitors to solve an abstract problem, such as a game or an optimization. These run for one week and are a great way to learn.
Kaggle is the most famous platform for data science competitions. It is an online community of over three million users ranging from novices to experts in the field. The site hosts a variety of competitions ranging from computer vision tasks to predicting user ratings for films. Some competitions offer large prize money, attracting a great deal of attention from researchers around the world. Kaggle provides access to over 19,000 datasets and 200,000 notebooks written by its users. I have found Kaggle to have a very collaborative atmosphere, with many competitors sharing their approaches and code at the end of a competition.
Other competition sites include CodaLab, DrivenData and AIcrowd. Some of their competitions offer great prizes, so they are definitely worth checking out.
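To make the idea of an automated scoring function feeding a leaderboard more concrete, here is a minimal sketch. It is not the scoring code of Topcoder, Kaggle or any other platform: the metric (root-mean-square error), the function names and the competitor data are all hypothetical and chosen only for illustration.

```python
import math


def rmse(predictions, targets):
    """Root-mean-square error between two equal-length lists (lower is better)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def build_leaderboard(submissions, targets):
    """Score every submission and rank competitors from best to worst."""
    scores = {name: rmse(preds, targets) for name, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hidden ground truth held by the organisers (made-up numbers).
hidden_targets = [3.0, 5.0, 2.0, 4.0]

# Hypothetical competitors and their predicted values.
submissions = {
    "alice": [3.0, 5.0, 2.0, 4.0],
    "bob": [2.0, 5.0, 2.0, 4.0],
    "carol": [1.0, 7.0, 4.0, 2.0],
}

leaderboard = build_leaderboard(submissions, hidden_targets)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: RMSE = {score:.3f}")
```

Because every submission is scored by the same function against the same hidden data, the ranking is objective. Real platforms use different metrics for different problems and typically keep part of the test data back until the final standings.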
These days many researchers release their code and models on GitHub. This provides a great opportunity to pick up their code and learn from it. Consider pairing this with a project idea of your own. Machine learning is a very fast-paced field, advancing by leaps and bounds every year, so it is important to read the latest research papers and stay up to date with developments. Perhaps one day you will create the next revolution in data science! Good luck on your journey and thank you for reading.