Practical Data Mining with Python - DZone Refcardz (2024)

Usually, the first step of a data analysis consists of obtaining the data and loading it into our work environment. We can easily download data using Python's standard library:

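# Python 3: urllib.request replaces the Python 2 module urllib2
from urllib.request import urlopen

url = 'http://aima.cs.berkeley.edu/data/iris.csv'
u = urlopen(url)
localFile = open('iris.csv', 'wb')  # binary mode, because u.read() returns bytes
localFile.write(u.read())
localFile.close()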

In the snippet above we used the library urllib2 (urllib.request in Python 3) to access a file on the website of the University of California, Berkeley and saved it to disk using the methods of the file object provided by the standard library. The file contains the iris dataset, a multivariate dataset that consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Each sample has four features (or variables): the length and the width of the sepal and of the petal, in centimeters.
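
The download step can also be wrapped in a small reusable function. The following is a minimal sketch for Python 3, not the original author's code: the name download is hypothetical, and it assumes the remote file is small enough that no retry or timeout handling is needed. It streams the response to disk and uses context managers so that both handles are closed even if the transfer fails midway.

```python
import shutil
import urllib.request


def download(url, path):
    """Stream the resource at `url` into the local file `path` and return `path`.

    `download` is a hypothetical helper name used only for this sketch.
    """
    # Both context managers close their handle even if an error occurs midway.
    with urllib.request.urlopen(url) as response, open(path, 'wb') as local_file:
        # Copy in chunks instead of reading the whole body into memory.
        shutil.copyfileobj(response, local_file)
    return path


# Example call (requires network access):
# download('http://aima.cs.berkeley.edu/data/iris.csv', 'iris.csv')
```

Because urlopen also accepts file:// URLs, the same function can be exercised against a local file when no network is available.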

The dataset is stored in the CSV (comma-separated values) format. It is convenient to parse the CSV file and store the information it contains in a more appropriate data structure. Each row of the file has 5 columns. The first 4 columns contain the values of the features, while the last column holds the class of the sample. The CSV can be easily parsed using the function genfromtxt of the numpy library:

from numpy import genfromtxt, zeros

# read the first 4 columns
data = genfromtxt('iris.csv', delimiter=',', usecols=(0, 1, 2, 3))
# read the fifth column
target = genfromtxt('iris.csv', delimiter=',', usecols=4, dtype=str)

In the example above we created a matrix with the features and a vector that contains the classes. We can confirm the size of our dataset by looking at the shape of the data structures we loaded:

print(data.shape)    # prints (150, 4)
print(target.shape)  # prints (150,)

We can also find out how many classes we have and what they are called:

print(set(target))  # build a collection of unique elements
# prints the three class names: 'setosa', 'versicolor', 'virginica' (in arbitrary order)
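
As a dependency-free cross-check of the layout described above, the file can also be scanned with the standard csv module instead of genfromtxt. This is only a sketch under stated assumptions: the function name summarize is hypothetical, the file is assumed to have no header line, and the rows in SAMPLE are a handful of illustrative rows in the same five-column layout, not the full dataset.

```python
import csv
import io
from collections import Counter

# Illustrative rows in the same five-column layout as iris.csv (not the full file).
SAMPLE = """5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def summarize(csv_file):
    """Return (row count, feature count, samples per class) for an open CSV file.

    `summarize` is a hypothetical helper; it assumes every row is
    "feature, ..., feature, class" and that there is no header line.
    """
    n_rows = 0
    n_features = None
    per_class = Counter()
    for row in csv.reader(csv_file):
        if not row:  # skip blank lines
            continue
        *features, label = row
        for value in features:
            float(value)  # raises ValueError if a feature is not numeric
        if n_features is None:
            n_features = len(features)
        elif len(features) != n_features:
            raise ValueError('inconsistent number of columns')
        per_class[label] += 1
        n_rows += 1
    return n_rows, n_features, per_class


n_rows, n_features, per_class = summarize(io.StringIO(SAMPLE))
print(n_rows, n_features, dict(per_class))
```

On the downloaded file, calling summarize on open('iris.csv', newline='') should report 150 rows, 4 features and 50 samples per class if the file matches the description given earlier.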

An important task when working with new data is to understand what information the data contains and how it is structured. Visualization helps us explore this information graphically and gain insight into the data.

Using the plotting capabilities of the pylab library (which is an interface to matplotlib) we can build a two-dimensional scatter plot, which lets us analyze two dimensions of the dataset by plotting the values of one feature against the values of another:

from pylab import plot, show

plot(data[target=='setosa',0], data[target=='setosa',2], 'bo')
plot(data[target=='versicolor',0], data[target=='versicolor',2], 'ro')
plot(data[target=='virginica',0], data[target=='virginica',2], 'go')
show()

This snippet uses the first and the third dimension (sepal length and petal length, given the column order described above) and the result is shown in the following figure:

[Figure 1: scatter plot of the first feature against the third, one color per class]

In the graph we have 150 points and their color represents the class: the blue points are the samples that belong to the species setosa, the red ones are versicolor and the green ones are virginica.
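
Each plot call above selects its points with NumPy boolean indexing: comparing target with a class name yields an array of True/False values, and using that array as the row index keeps only the matching samples. The sketch below shows the selection on its own, without plotting. The four rows are illustrative values in the same layout as the dataset, and the dictionaries colors and points are names introduced here only for illustration.

```python
import numpy as np

# A few illustrative rows in the same layout as the loaded arrays.
data = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])
target = np.array(['setosa', 'setosa', 'versicolor', 'virginica'])

# Element-wise comparison: one boolean per sample.
mask = target == 'setosa'

# Rows where the mask is True, restricted to one column each.
setosa_x = data[mask, 0]  # first column of the setosa rows
setosa_y = data[mask, 2]  # third column of the setosa rows

# The same selection for every class, keyed by class name.
colors = {'setosa': 'b', 'versicolor': 'r', 'virginica': 'g'}
points = {name: (data[target == name, 0], data[target == name, 2])
          for name in colors}
```

With a mapping like points, the three plot calls could be replaced by a loop over the class names, passing colors[name] + 'o' as the marker style.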

Another common way to look at data is to plot a histogram of each individual feature. In this case, since the data is divided into three classes, we can compare the distributions of the feature we are examining across the classes. With the following code we can plot the distribution of the first feature of our data (sepal length) for each class:

from pylab import figure, subplot, hist, xlim, show

xmin = min(data[:,0])
xmax = max(data[:,0])
figure()
subplot(411)  # distribution of the setosa class (1st, on the top)
hist(data[target=='setosa',0], color='b', alpha=.7)
xlim(xmin, xmax)
subplot(412)  # distribution of the versicolor class (2nd)
hist(data[target=='versicolor',0], color='r', alpha=.7)
xlim(xmin, xmax)
subplot(413)  # distribution of the virginica class (3rd)
hist(data[target=='virginica',0], color='g', alpha=.7)
xlim(xmin, xmax)
subplot(414)  # global histogram (4th, on the bottom)
hist(data[:,0], color='y', alpha=.7)
xlim(xmin, xmax)
show()

The result should be as follows:

[Figure 2: histograms of sepal length for the setosa, versicolor and virginica classes, and for the whole dataset]

Looking at the histograms above we can identify some characteristics that could help us tell the classes apart. For example, we can observe that, on average, the Iris setosa flowers have a smaller sepal length than the Iris virginica flowers.
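
An observation like this can be checked numerically by computing the mean of the feature for each class. The sketch below does so in plain Python. The function name mean_by_class is hypothetical, and the six sepal lengths are illustrative values rather than the full dataset, so the printed means only demonstrate the computation.

```python
from collections import defaultdict


def mean_by_class(values, labels):
    """Return a dict mapping each label to the mean of its values.

    `mean_by_class` is a hypothetical helper used only for this sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(values, labels):
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Illustrative sepal lengths (cm) with their classes, not the full dataset.
sepal_length = [5.1, 4.9, 5.5, 6.5, 6.3, 7.1]
species = ['setosa', 'setosa', 'versicolor', 'versicolor',
           'virginica', 'virginica']

means = mean_by_class(sepal_length, species)
for name, value in means.items():
    print(name, round(value, 2))
```

With the arrays loaded earlier, the same quantity for one class is available directly as data[target=='setosa',0].mean().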
