Introduction to Data Science Using Python and Statistics
Data science has become a popular topic for many individuals in the tech community. Data is everywhere, from smartphones to robots to automobiles. The massive amounts of data generated by these devices require tools and procedures for analysis and decision-making.
Python has become a popular programming language in schools, colleges, and universities across the world. It is quick to write and flexible, with libraries supporting tasks as varied as network automation and game development. The Python ecosystem also includes libraries for data analysis, and the language is taught frequently in data science programs.
The blog series I am writing will introduce the Topcoder community to both data science fundamentals and Python programming. By working in Python, readers will become familiar with techniques they can use to answer questions and make recommendations. The series will be presented in the following order, using sample programs to illustrate concepts:
The data science lifecycle: Data science has a specific lifecycle that is used for analysis by companies worldwide. This lifecycle is a way to create hypotheses and test those hypotheses using statistical techniques. In this blog, data, data science, and data analysis will be defined, giving the reader context for understanding the lifecycle. The scientific method will be compared and contrasted with the lifecycle, noting differences and similarities.
The terms data science and data analysis are often used interchangeably.
An introduction to Python using a small data set that is available online, with NumPy, SciPy, and Keras used to run basic statistical analysis on the data set. These analyses include measures of central tendency, probability distributions, and hypothesis testing. OpenStax has published a statistics book that explains these concepts and is free to use. OpenStax also offers a free business statistics book.
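As a small preview of measures of central tendency, here is a minimal sketch. The data values are made up for illustration, and I am assuming the statistics module from core Python here (rather than NumPy or SciPy) so the example runs without installing anything.

```python
import statistics

# Hypothetical sample: daily step counts (in thousands) for one week.
steps = [4, 8, 6, 5, 3, 8, 8]

mean_steps = statistics.mean(steps)      # arithmetic average
median_steps = statistics.median(steps)  # middle value of the sorted data
mode_steps = statistics.mode(steps)      # most frequent value

print("mean:", mean_steps)
print("median:", median_steps)
print("mode:", mode_steps)
```

The three numbers answer slightly different questions about the "typical" value, which is why all three are usually reported together.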
The second post dealing with Python looks at input/output operations and variables via another sample program. The program will also demonstrate how to name variables and will look at data types, which are crucial to understanding data analysis and data science. The math module will also be introduced, with an emphasis on the statistics module in core Python.
Next, the series continues by looking at decision statements and how they differ from those in other programming languages. Python is unusual in that it long had no switch/case statement; the match statement was only added in Python 3.10.
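To make the idea of decision statements concrete, here is a minimal sketch. The function names, grade cutoffs, and day codes are hypothetical and chosen only for illustration. It shows an if/elif/else chain, plus a dictionary lookup, which is one common stand-in when a choice depends on an exact value.

```python
def describe_grade(score):
    """Return a letter grade for a numeric score using if/elif/else."""
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    else:
        return "F"


# A dictionary lookup can replace a long chain of equality checks.
DAY_TYPES = {"sat": "weekend", "sun": "weekend"}


def day_type(day):
    """Classify a three-letter day code; anything not listed is a weekday."""
    return DAY_TYPES.get(day.lower(), "weekday")


print(describe_grade(85))
print(day_type("Sat"))
```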
Then, the series takes on loops, emphasizing the use of lists, sets, and dictionaries. Particular attention will be given to loops, which allow programmers to iterate through data. This is an important topic, as data scientists work with huge data sets daily in their work for companies and organizations. Concepts such as Big-O notation and algorithmic complexity will be explained. For data science and algorithm topics, look here.
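Here is a minimal sketch of looping over the three collection types. The readings are made-up values, and the Big-O remarks in the comments describe typical behavior of Python's built-in types, not a full complexity analysis.

```python
# Hypothetical temperature readings, with some repeated values.
readings = [21.5, 22.0, 21.5, 23.1, 22.0]

# List: iterate in order. One pass over n items is O(n).
total = 0.0
for value in readings:
    total += value
average = total / len(readings)

# Set: keeps only distinct values; membership tests are O(1) on average.
distinct = set(readings)

# Dictionary: count how often each value occurs in a single pass.
counts = {}
for value in readings:
    counts[value] = counts.get(value, 0) + 1

# Loop over key/value pairs, sorted by reading for stable output.
for value, count in sorted(counts.items()):
    print(value, "appears", count, "time(s)")
```

Counting with a dictionary takes one pass over the data, whereas calling readings.count(value) inside a loop would rescan the list every time, which matters once the data set is large.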
Even though it is not specific to data science, object-oriented analysis and design will be presented. Object-oriented analysis and design is a way to organize programs around objects that bundle data with the operations on that data. The vocabulary used to describe data will be explained using a sample class. Here is a Wikipedia article describing the concept.
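A sample class might look something like the sketch below. The class name, attributes, and method are hypothetical and exist only to show the vocabulary: a class is the blueprint, an object is one instance of it, attributes hold its data, and methods define its behavior.

```python
class Measurement:
    """A single labeled observation (hypothetical example class)."""

    def __init__(self, label, value, unit):
        # Attributes: the data each object carries.
        self.label = label
        self.value = value
        self.unit = unit

    def describe(self):
        # Method: behavior that works on the object's own data.
        return f"{self.label}: {self.value} {self.unit}"


# Create one object (an instance of the class) and call its method.
height = Measurement("height", 172.5, "cm")
print(height.describe())
```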
The blog series then looks at the following libraries, with example programs demonstrating each one: NumPy, SciPy, scikit-learn, Keras, and TensorFlow. These libraries form the basis of data science using Python.
The final post is a comprehensive case study using all the concepts presented in the series. An online data set of moderate to large size will be analyzed using the tools presented in the series. This case study discussion can serve as a helpful post for those individuals participating in the data science competitions.
For some Topcoder members, this will serve as a review of basic computer programming, but with a relatable context. For other readers, it is an accessible introduction to Python that uses simple but powerful case studies to drive home concepts. I did not include every aspect of data science in the outline because this is just a tutorial on basic data science concepts. Here are links to the libraries I will be discussing in future posts to get you started with Python:
Core Python (including the math and statistics modules plus a beginner’s tutorial):
Towards Data Science is a popular Medium blog that explains data science in detail. Use it to get further insights on topics discussed in this series. Here is the link: https://towardsdatascience.com
Data Science Central is a website on the Ning platform dedicated to the field. It has advertised an apprenticeship to members and offers ebooks on data science topics, along with forums for discussing concepts. You do need to sign up. Here is the link: https://datasciencecentral.com.
Another site useful for the series is Real Python: https://realpython.org. Real Python is the brainchild of Dan Bader, a popular YouTuber in the Python community.
Many of these libraries have online sandboxes for trying out features of the library and tutorials to get you started programming with these Python modules. In the next blog post, I will describe the data science lifecycle in detail with several examples in different sectors. Until then, happy coding!