Machine learning means creating algorithms and systems that can learn things automatically. Many of our everyday services and gadgets would not be possible without it.
These days the world is run by algorithms. Whether you shop online, google the name of that dog food you forgot, browse Facebook, or watch a film on Netflix, you interact with algorithms. Some highly advanced algorithms have become almost legendary, such as PageRank, the all-knowing algorithm that Google uses to rank websites in search engine results. Yet at its core an algorithm is just a step-by-step procedure that computer systems use to calculate things, process data and figure stuff out.
One of the most fascinating things related to algorithms is machine learning. Machine learning basically means creating algorithms that can learn from the data they process and analyse. In this way machine learning can be seen as one area of artificial intelligence (AI), and it is closely connected with data mining. Machine learning is not just some fancy theoretical stuff for mathematicians and programmers to play around with; it can also be a very powerful tool for business.
For instance, machine learning enables websites to offer users better-targeted ads based on their previous actions and site history. Some amazing implementations of machine learning include Apple’s Siri speech recognition and, even more amazingly, Google’s much-talked-about driverless car.
I have long been interested in machine learning, and I’d like to point out to all the Python lovers out there a Python package for machine learning called scikit-learn. I have lately been using this package in my home hobby projects. In the remaining part of this article, I will present the basic steps of data analysis and of solving a classification problem, such as:
- Data pre-processing and encoding
- Data normalization
- Model construction and evaluation of its accuracy
So, highly technical stuff ahead. The remaining part of this article will probably only interest people fascinated with Python or machine learning. If you have questions regarding the instructions presented here, feel free to leave comments below.
Machine learning with Python using scikit-learn
So, let’s begin. First of all, let’s say we have a CSV file in which every row contains the features of an item followed by its category name, like:
feature_1;feature_2;…;feature_n;category_name
Our task in this example is to determine which category an item belongs to, based on its features, using machine learning methods.
We can start by assigning indexes to textual data. This must be done because most machine learning models work with numerical data. The more we can encode, the better, and it is better still if the features can be reduced to Boolean values.
In our case we can assume that all of the features can be represented in dictionary form. For example:
tradegroup = {
    "1": "Clothing",
    "…": "…",
    "n": "Electronics"
}
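To make the idea of such a dictionary concrete, here is a minimal sketch in plain Python. The dictionary contents and the sample values are made up for illustration and are not taken from the real data; the sketch only shows how inverting the dictionary lets us replace textual values with their indexes.

```python
# Hypothetical index-to-name dictionary for one feature (values are illustrative).
tradegroup = {"1": "Clothing", "2": "Toys", "3": "Electronics"}

# Invert it so that textual values can be looked up and replaced by their index.
name_to_index = {name: int(index) for index, name in tradegroup.items()}

raw_values = ["Electronics", "Clothing", "Clothing"]
encoded = [name_to_index[value] for value in raw_values]
print(encoded)
```

Doing this by hand for every feature gets tedious quickly, which is why automating it is attractive.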
Because all of the features in this example can be represented as dictionaries, the csv file will look like:
1;6;…;5
4;3;…;2
In the file the numbers are indexes into the feature dictionaries. This routine can be automated with the tools that scikit-learn (aka sklearn) provides.
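import joblib  # standalone package; sklearn.externals.joblib was removed from scikit-learn
import pandas
from sklearn import preprocessing, svm
from sklearn.preprocessing import OneHotEncoder
# train_test_split lives in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import train_test_split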
For the automatic encoding of labels to numerical values, we will use the LabelEncoder helper class from the sklearn.preprocessing module. For convenience, I use the pandas package when working with CSV files. It makes working with tabular data very convenient, both by rows and by columns.
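# Load the semicolon-separated file into a DataFrame
list = pandas.read_csv('cat.csv', sep=';')
label_enc = preprocessing.LabelEncoder()
# Iterating over a DataFrame yields its column names
for i in list:
    label_enc.fit(list[i].drop_duplicates())
    list[i] = label_enc.transform(list[i])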
In the code above we load the CSV data, and then encode every textual field into numbers with the transform() function.
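To see what the encoding does, here is a small self-contained sketch that uses an in-memory table instead of the CSV file. The column COLOR and all of the values are hypothetical. Unlike the loop above, this variation keeps one encoder per column, which is an assumption on my part about what is useful: it lets us translate the numbers back into text later.

```python
import pandas
from sklearn import preprocessing

# Hypothetical stand-in for the contents of the CSV file.
frame = pandas.DataFrame({
    "COLOR": ["red", "blue", "red", "green"],
    "TYPENAME": ["Clothing", "Electronics", "Clothing", "Clothing"],
})

encoders = {}
for column in frame.columns:
    # Keep one encoder per column so that each mapping can be inverted later.
    encoders[column] = preprocessing.LabelEncoder()
    frame[column] = encoders[column].fit_transform(frame[column])

print(frame)
# classes_ holds the original values; the position of a value is its code.
print(list(encoders["COLOR"].classes_))
# inverse_transform turns codes back into the original text.
print(list(encoders["TYPENAME"].inverse_transform([0, 1])))
```

LabelEncoder assigns the codes in sorted order of the original values, so the numbers carry no meaning beyond identifying a value.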
After encoding, the original dataset should be divided into two parts: one with the features that will be used for the actual classification and one with the encoded category labels. This can be done in 2 lines of code.
labels = list.TYPENAME
training_set = list.drop(['TYPENAME'], axis=1)
The next step is to prepare our data for one of the machine learning algorithms. This can be done with the OneHotEncoder class, which is designed precisely to convert the numerical category indexes in the dataset into binary columns, one per category value:
encoder = OneHotEncoder()
fa = encoder.fit_transform(training_set)
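As an illustration of what the encoder produces, here is a minimal sketch on a tiny, made-up array of already label-encoded features (the values are hypothetical and not from the real dataset).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded features: two columns with 3 and 2 distinct values.
encoded_features = np.array([
    [0, 1],
    [1, 0],
    [2, 1],
])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(encoded_features)  # SciPy sparse matrix by default

# Each column expands into one binary column per distinct value: 3 + 2 = 5.
print(one_hot.shape)
print(one_hot.toarray())
```

The result is sparse because most entries are zero, and the classifiers in sklearn generally accept such matrices directly.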
After that we are ready to run one of the classification algorithms, but in order to validate the model we need to create the so-called “test” sample, i.e. a sample of items with known categories that the model does not see during training. To create such a set we must set aside a portion of the training set. The sklearn package has a specific function, train_test_split(), to accomplish this, and the code for it will look like this:
X_train, X_test, y_train, y_test = train_test_split(fa, labels, test_size=0.2, random_state=40)
Now that we have both the training dataset and the test sample, we can train one of the classification algorithms. In this case an SVM (support vector machine) with a linear kernel was chosen.
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
Now that the model is built we can evaluate its accuracy. The easiest way is to measure the mean accuracy on the test data, i.e. to compare the class that the model predicts for each test item with the known category and see how often they match:
print(clf.score(X_test, y_test))
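To show what that score means, here is a small sketch that computes the same number by hand. The data is randomly generated and purely hypothetical (the label simply copies the first feature), so the result says nothing about a real dataset; the point is only that score() is the fraction of correctly predicted test items.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Hypothetical, randomly generated binary features and labels.
rng = np.random.RandomState(40)
features = rng.randint(0, 2, size=(100, 6))
targets = features[:, 0]  # the label simply copies the first feature

X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=40)
clf = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)

# score() is the fraction of test items whose predicted class is correct.
manual_accuracy = np.mean(clf.predict(X_test) == y_test)
print(manual_accuracy, clf.score(X_test, y_test))
```

With test_size=0.2, 20 of the 100 items are held back for testing and the remaining 80 are used for training.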
After sufficient accuracy is achieved the model can be saved as a Python pickle (serialized object) with the joblib.dump() function:
joblib.dump(clf, 'model.pkl')
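A saved model is only useful if it can be loaded again, so here is a minimal sketch of the round trip with joblib.load(). The tiny training set and the temporary file location are assumptions made just for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import svm

# Hypothetical tiny model, trained only so that there is something to save.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
clf = svm.SVC(kernel="linear", C=1).fit(X, y)

# Save to a temporary directory and load the model back.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, model_path)
restored = joblib.load(model_path)

# The restored model gives the same predictions as the original.
print(restored.predict(X))
```

Keep in mind that pickled models should only be loaded from sources you trust, and that they are best loaded with the same scikit-learn version that saved them.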