Intro to scikit-learn

Machine Learning in Python

Vlad Niculae

[email protected] http://vene.ro

University of Bucharest

University of Wolverhampton


In a nutshell

The entries ranked 1-6 in the Kaggle competition Detecting Insults in Social Commentary used scikit-learn!

In [1]:
import sklearn
sklearn.__version__
Out [1]:
'0.12.1'

Philosophy

Let's generate some sample data

In [3]:
from sklearn.datasets import make_blobs
X, y = make_blobs(cluster_std=1, random_state=0)
print 'X is %s, y is %s' % (X.shape, y.shape)
X is (100, 2), y is (100,)
In [6]:
def show_dataset(X, y):
    # one scatter call per class, colored red/green/blue (pylab mode, as elsewhere in this notebook)
    for label in np.unique(y):
        scatter(X[y == label, 0], X[y == label, 1], c="rgb"[label])

show_dataset(X, y)

Estimators

In [7]:
from sklearn.neighbors import NearestCentroid

clf = NearestCentroid()
clf.fit(X, y)
Out [7]:
NearestCentroid(metric='euclidean', shrink_threshold=None)
In [8]:
scatter(clf.centroids_[:, 0], clf.centroids_[:, 1])
Out [8]:
<matplotlib.collections.PathCollection at 0x106a51990>
In [9]:
clf.predict([20, -10])
Out [9]:
array([1])

Estimators implement a score method for evaluation:

In [10]:
print clf.score(X, y)
0.91

Other evaluation metrics are available in their own module:

In [12]:
from sklearn.metrics import f1_score, zero_one_score, v_measure_score, silhouette_score

y_pred = clf.predict(X)
print "F_1 score: %f" % f1_score(y, y_pred)
print "0-1 score: %f" % zero_one_score(y, y_pred)
print "V-measure: %f" % v_measure_score(y, y_pred)
F_1 score: 0.909980
0-1 score: 0.910000
V-measure: 0.710265

For classifiers, the default score is usually the $0$-$1$ accuracy, whereas for regressors it is $R^2$.
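
A quick sanity check on the regression half of that claim (a sketch; the linear target below is made up for illustration):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

target = X[:, 0] + 2 * X[:, 1]  # made-up continuous target on the blob features
reg = LinearRegression().fit(X, target)
print "default score: %f" % reg.score(X, target)
print "r2_score:      %f" % r2_score(target, reg.predict(X))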

Estimators that matter a lot to us:

LinearSVC / LogisticRegression

As opposed to the regular SVC, LinearSVC uses liblinear and is thus very efficient on sparse data.
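
A minimal sketch of LinearSVC on the blob data from above (not part of the original demo):

from sklearn.svm import LinearSVC

lin_clf = LinearSVC(C=1.0)  # liblinear-backed, handles sparse input natively
lin_clf.fit(X, y)
print "LinearSVC accuracy: %.2f" % lin_clf.score(X, y)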

LabelPropagation

Graph-based learning is very efficient for many NLP tasks. The LabelPropagation implementation in sklearn only supports k-NN graph construction for now. Contributions are encouraged!

In [13]:
# Authors: Clay Woolam <[email protected]>
#          Andreas Mueller <[email protected]>
# Licence: BSD

from sklearn.semi_supervised import label_propagation
from sklearn.datasets import make_circles

# generate ring with inner box
n_samples = 200
X_c, y_c = make_circles(n_samples=n_samples, shuffle=False)
outer, inner = 0, 1
labels = -np.ones(n_samples)
labels[0] = outer
labels[-1] = inner

###############################################################################
# Learn with LabelSpreading
label_spread = label_propagation.LabelSpreading(kernel='knn', alpha=1.0)
label_spread.fit(X_c, labels)

###############################################################################
# Plot output labels
output_labels = label_spread.transduction_
figure(figsize=(8.5, 4))
subplot(1, 2, 1)
plot_outer_labeled, = plot(X_c[labels == outer, 0],
                           X_c[labels == outer, 1], 'rs')
plot_unlabeled, = plot(X_c[labels == -1, 0], X_c[labels == -1, 1], 'g.')
plot_inner_labeled, = plot(X_c[labels == inner, 0],
                           X_c[labels == inner, 1], 'bs')
legend((plot_outer_labeled, plot_inner_labeled, plot_unlabeled),
       ('Outer Labeled', 'Inner Labeled', 'Unlabeled'), 'upper left',
       numpoints=1, shadow=False)
title("Raw data (2 classes=red and blue)")

subplot(1, 2, 2)
output_label_array = np.asarray(output_labels)
outer_numbers = np.where(output_label_array == outer)[0]
inner_numbers = np.where(output_label_array == inner)[0]
plot_outer, = plot(X_c[outer_numbers, 0], X_c[outer_numbers, 1], 'rs')
plot_inner, = plot(X_c[inner_numbers, 0], X_c[inner_numbers, 1], 'bs')
legend((plot_outer, plot_inner), ('Outer Learned', 'Inner Learned'),
       'upper left', numpoints=1, shadow=False)
title("Labels learned with Label Spreading (KNN)")

subplots_adjust(left=0.07, bottom=0.07, right=0.93, top=0.92)

Meta-estimators, e.g. multi-label classification

In [14]:
# Authors: Vlad Niculae, Mathieu Blondel
# License: BSD

from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelBinarizer
from sklearn.decomposition import PCA


def plot_hyperplane(clf, min_x, max_x, linestyle, label):
    # get the separating hyperplane
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(min_x - 5, max_x + 5)  # make sure the line is long enough
    yy = a * xx - (clf.intercept_[0]) / w[1]
    plot(xx, yy, linestyle, label=label)


def plot_subfigure(X_m, Y_m, decision_boundaries):
    min_x = np.min(X_m[:, 0])
    max_x = np.max(X_m[:, 0])

    zero_class = np.where([0 in y for y in Y_m])
    one_class = np.where([1 in y for y in Y_m])
    scatter(X_m[:, 0], X_m[:, 1], s=40, c='gray')
    scatter(X_m[zero_class, 0], X_m[zero_class, 1], s=160, edgecolors='b',
            facecolors='none', linewidths=2, label='Class 1')
    scatter(X_m[one_class, 0], X_m[one_class, 1], s=80, edgecolors='orange',
            facecolors='none', linewidths=2, label='Class 2')
    axis('tight')

    # use the boundaries passed in, rather than the global classifier
    plot_hyperplane(decision_boundaries[0], min_x, max_x, 'k--',
                    'Boundary\nfor class 1')
    plot_hyperplane(decision_boundaries[1], min_x, max_x, 'k-.',
                    'Boundary\nfor class 2')
    legend()
    xticks(())
    yticks(())



figure(figsize=(8, 6))

X_m, Y_m = make_multilabel_classification(n_classes=2, n_labels=1,
                                          allow_unlabeled=True,
                                          random_state=1)

X_m = PCA(n_components=2).fit_transform(X_m)
classif = OneVsRestClassifier(SVC(kernel='linear'))
classif.fit(X_m, Y_m)

plot_subfigure(X_m, Y_m, classif.estimators_)

Transformers

API contract: a transformer can be fitted to data and can then transform new data according to the learned model. To do both in one step, use fit_transform.

In [16]:
from sklearn.preprocessing import Scaler

X_scaled = Scaler().fit_transform(X)
print "Before: %s %s" % (X.mean(axis=0), X.std(axis=0))
print "After: %s %s" % (X_scaled.mean(axis=0), X_scaled.std(axis=0))
Before: [ 0.47293671  2.83654807] [ 1.82787855  1.52943524]
After: [ -1.26565425e-16  -3.65263375e-16] [ 1.  1.]
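
The point of keeping fit and transform separate is that statistics learned on training data carry over to new data. A sketch, with an illustrative 80/20 split:

scaler = Scaler().fit(X[:80])            # learn mean and std on "training" rows only
X_new_scaled = scaler.transform(X[80:])  # reuse them on held-out rows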

Matrix factorization models are seen as transformers: PCA, NMF, DictionaryLearning. So are manifold learning algorithms.
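
For instance, a sketch (not from the talk) squeezing the scaled blobs down to a single component:

from sklearn.decomposition import PCA

X_1d = PCA(n_components=1).fit_transform(X_scaled)
print X_1d.shape  # one column per requested component: (100, 1)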

In [17]:
from sklearn.lda import LDA  # Fisher's Linear Discriminant Analysis
                             # not Latent Dirichlet Allocation ;)

X_mdim = LDA().fit_transform(X_scaled, y)
subplot(1, 2, 1)
show_dataset(X_scaled, y)
subplot(1, 2, 2)
show_dataset(X_mdim, y)

But the most interesting transformers to us are in the feature_extraction.text module:

In [18]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# There is also the TfidfVectorizer, which combines counting and tf-idf weighting on the fly

docs = [
    "reading group meet",
    "buy rolex buy",
    "club meet",
    "cheap viagra",
    "rolex club"
]

labels = [0, 1, 0, 1, 1]

vec = CountVectorizer(min_df=0)  # by default, rare words are ignored
X = vec.fit_transform(docs)
vec.vocabulary_
Out [18]:
{u'buy': 0,
 u'cheap': 1,
 u'club': 2,
 u'group': 3,
 u'meet': 4,
 u'reading': 5,
 u'rolex': 6,
 u'viagra': 7}
In [19]:
X
Out [19]:
<5x8 sparse matrix of type '<type 'numpy.int64'>'
	with 11 stored elements in COOrdinate format>
In [20]:
X.toarray()
Out [20]:
array([[0, 0, 0, 1, 1, 1, 0, 0],
       [2, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 1, 0]])
In [21]:
char_vec = CountVectorizer(analyzer='char', ngram_range=(1, 2)).fit(docs)
char_vec.vocabulary_.keys()[:10]
Out [21]:
[u'gr', u'cl', u'ee', u'ea', u' m', u'ex', u'et', u' ', u'le', u'lu']
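
The TfidfTransformer imported above reweights exactly these raw counts; a minimal sketch:

tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X)  # downweights terms that appear in many documents
print X_tfidf.toarray().round(2)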

You can find more about this in the docs.

Glue

Pipelines

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

steps = [('vectorizer', CountVectorizer()),
         ('clf', MultinomialNB())]

pipe = Pipeline(steps)
pipe.fit(docs, labels)
print "Accuracy: %.2f" % pipe.score(docs, labels)
Accuracy: 1.00
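
Because the pipeline wraps the vectorizer, it consumes raw strings end to end. A sketch with a made-up document:

print pipe.predict(["buy cheap rolex"])  # raw text in, predicted label out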

Cross-validation and Grid Search

Language detection in a few lines of code. Takes advantage of all your cores!
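
Before the full demo, note that cross-validating the toy pipeline from the previous section is a single call. A sketch, assuming cross_val_score will index into arrays of documents:

from sklearn.cross_validation import cross_val_score

scores = cross_val_score(pipe, np.array(docs), np.array(labels), cv=2)
print "CV accuracy: %s" % scores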

In [24]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
# Modified by Vlad Niculae
# License: Simplified BSD

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_files
from sklearn.cross_validation import train_test_split, StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn import metrics


dataset = load_files('code/sklearn-tutorial/data/languages/paragraphs')

# Split the dataset in training and test set:
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5, random_state=0)

vectorizer = TfidfVectorizer(analyzer='char', use_idf=False)  # the string 'false' would be truthy

pipeline = Pipeline([
    ('vec', vectorizer),
    ('clf', LinearSVC()),
])

grid = GridSearchCV(
    pipeline,
    {
        'clf__C': np.logspace(-2, 2, 5, base=10),
        'vec__ngram_range': [(1, 2), (1, 3), (1, 4)]
    },
    cv=StratifiedKFold(y_train, k=5),
    refit=True,
    n_jobs=-1)

grid.fit(docs_train, y_train)

print grid.best_params_

y_predicted = grid.best_estimator_.predict(docs_test)

print metrics.classification_report(y_test, y_predicted,
                                    target_names=dataset.target_names)

# Plot the confusion matrix
print metrics.confusion_matrix(y_test, y_predicted)

# Predict the result on some short new sentences:
sentences = [
    u'This is a language detection test.',
    u'Ceci est un test de d\xe9tection de la langue.',
    u'Dies ist ein Test, um die Sprache zu erkennen.',
]
predicted = grid.best_estimator_.predict(sentences)

for s, p in zip(sentences, predicted):
    print u'The language of "%s" is "%s"' % (s, dataset.target_names[p])
{'vec__ngram_range': (1, 2), 'clf__C': 1.0}
             precision    recall  f1-score   support

         ar       1.00      1.00      1.00        18
         de       1.00      1.00      1.00        43
         en       1.00      1.00      1.00        44
         es       1.00      1.00      1.00        46
         fr       1.00      1.00      1.00        44
         it       1.00      1.00      1.00        42
         ja       1.00      1.00      1.00        32
         nl       1.00      1.00      1.00        17
         pl       1.00      1.00      1.00        16
         pt       1.00      1.00      1.00        11
         ru       1.00      1.00      1.00        25

avg / total       1.00      1.00      1.00       338

[[18  0  0  0  0  0  0  0  0  0  0]
 [ 0 43  0  0  0  0  0  0  0  0  0]
 [ 0  0 44  0  0  0  0  0  0  0  0]
 [ 0  0  0 46  0  0  0  0  0  0  0]
 [ 0  0  0  0 44  0  0  0  0  0  0]
 [ 0  0  0  0  0 42  0  0  0  0  0]
 [ 0  0  0  0  0  0 32  0  0  0  0]
 [ 0  0  0  0  0  0  0 17  0  0  0]
 [ 0  0  0  0  0  0  0  0 16  0  0]
 [ 0  0  0  0  0  0  0  0  0 11  0]
 [ 0  0  0  0  0  0  0  0  0  0 25]]
The language of "This is a language detection test." is "en"
The language of "Ceci est un test de détection de la langue." is "fr"
The language of "Dies ist ein Test, um die Sprache zu erkennen." is "de"

Upcoming

FeatureUnion: concatenates the columns from several transformers. See Andy's blog post to see it in use.
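
A sketch of the intended usage, combining word and character counts on the toy documents from earlier (based on the development branch, so details may change):

from sklearn.pipeline import FeatureUnion

union = FeatureUnion([
    ('words', CountVectorizer(analyzer='word')),
    ('chars', CountVectorizer(analyzer='char', ngram_range=(2, 3))),
])
X_combined = union.fit_transform(docs)  # columns of both vectorizers, side by side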

Relevant works in progress: