{"id":203,"date":"2017-07-09T09:08:16","date_gmt":"2017-07-09T09:08:16","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=203"},"modified":"2017-07-10T06:49:19","modified_gmt":"2017-07-10T06:49:19","slug":"beginner-data-science-tutorial","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/beginner-data-science-tutorial\/","title":{"rendered":"Data Science Tutorial for Absolutely Python Beginners"},"content":{"rendered":"
<\/p>\n
My anaconda don’t,\u00a0My anaconda don’t,\u00a0My anaconda don’t want none, unless you’ve got….<\/p><\/blockquote>\n
Yes, you are still reading Code A Star<\/a> blog.<\/p>\n
In this post we are going to try our Data Science tutorial in Python. Since we are targeting Python beginners for this hands-on, I would like to introduce Anaconda<\/a> to all of you.<\/p>\n
<\/p>\n
All-in-one starting place<\/h3>\n
Our Data Science tutorial does not actually need many lines of code. But we do have to spend time understanding the basic concepts<\/a>, modules and functions used in our program. And we have to install several libraries for Python to do the science, such as:<\/p>\n
\n
- Matplotlib – a plotting library to make histograms, bar charts, scatter plots and other graphs<\/li>\n
- NumPy – a fast library for handling n-dimensional arrays<\/li>\n
- Pandas – a set of data structure and analysis tools<\/li>\n
- Scikit Learn – a machine learning library that we use to teach our computer and make predictions<\/li>\n<\/ul>\n
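To give you a feel for the first two libraries before we start, here is a minimal sketch (the sample values and column names are made up for illustration; this is not part of the tutorial's own code):

```python
import numpy as np
import pandas as pd

# NumPy: fast n-dimensional arrays with vectorized math
lengths = np.array([5.1, 4.9, 4.7, 4.6])
print(lengths.mean())  # average of the sample: 4.825

# Pandas: tabular data structures built on top of NumPy
df = pd.DataFrame({"Sepal Length": lengths,
                   "Class": ["setosa"] * 4})
print(df.describe())   # summary statistics of the numeric columns
```

Matplotlib and Scikit Learn will show up later, when we plot the data and train a model.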
You can install the above libraries using the friendly Python package manager, pip<\/a>. But do you remember what we have
sung<\/del> said in the first paragraph? Yes, Anaconda. It is a Python environment bundled with all the essential data science libraries. That means you can simply use Anaconda to start a data science project, instead of pip’ing those libraries one by one.<\/p>\nMy Anaconda does<\/h3>\n
Once you open Anaconda, you will see an interface like the one below:<\/p>\n
<\/p>\n
Click “Environment” on your left and there are tons of Python libraries installed in the environment, including those data science libraries we have mentioned:<\/p>\n
<\/p>\n
Next, we start our project by clicking the green arrow button and selecting the “Open with Jupyter Notebook” option:<\/p>\n
<\/p>\n
Jupyter Notebook is a web application for users to create and share (not only) Data Science projects in (not only, again) Python. We can click the “New” button in the upper right corner and select “Python 3”:<\/p>\n
<\/p>\n
A Python development UI is then launched. Okay, here we go, our science starts here:<\/p>\n
<\/p>\n
Do you remember the Data Science Life Cycle?<\/h3>\n
You can click here<\/a> to recall your memory. We are going to do the “Hello World” of Data Science: the Iris Classification. The first step of our project is to define a problem.<\/p>\n
The Iris data set contains 150 iris plants categorized into 3 classes. Our problem for this project is: when we have some iris plants, which class should they belong to?<\/p>\n
We move to step 2 of the Data Science Life Cycle: collect data. Since the Iris data set is a famous pattern\u00a0recognition resource, we can simply download it from\u00a0the web<\/a> (yeah, that is why it is the “Hello World” of Data Science).<\/p>\n
Now, let’s put our code into Jupyter Notebook. First, we import the required Data Science modules (the ones we mentioned above):<\/p>\n
import pandas as pd\r\nimport numpy as np\r\nimport matplotlib.pyplot as plt\r\nfrom sklearn.svm import SVC\r\nfrom sklearn.metrics import accuracy_score\r\nfrom sklearn.metrics import classification_report\r\nfrom sklearn.metrics import confusion_matrix\r\n<\/pre>\nSecond, we load our data set into a variable (df<\/em>, short for dataframe) using Pandas’<\/em> read_csv function:<\/p>\n
df = pd.read_csv(\"http:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/iris\/bezdekIris.data\",\r\nnames = [\"Sepal Length\", \"Sepal Width\", \"Petal Length\", \"Petal Width\", \"Class\"])<\/pre>\n
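Once the data is loaded, it is worth taking a quick look at it before doing anything else. As a minimal sketch, here is one way to inspect the same 150-sample Iris data; to keep it runnable without a network connection, this sketch loads the copy bundled with scikit-learn and renames its columns to match the ones above (the renaming is my own addition, not part of the tutorial):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled Iris data as a pandas DataFrame (offline alternative
# to downloading bezdekIris.data from the UCI repository)
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "Sepal Length",
    "sepal width (cm)": "Sepal Width",
    "petal length (cm)": "Petal Length",
    "petal width (cm)": "Petal Width",
    "target": "Class",
})
# Map the numeric class labels (0, 1, 2) to the species names
df["Class"] = df["Class"].map(dict(enumerate(iris.target_names)))

print(df.shape)                    # (150, 5): 150 plants, 4 features + class
print(df["Class"].value_counts())  # 50 plants in each of the 3 classes
print(df.head())                   # the first few rows of the data
```

A balanced data set like this (50 plants per class) makes the classification task a friendly one for beginners.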