# Intro to scikit-learn

## In a nutshell

The entries ranked 1-6 in the Kaggle competition "Detecting Insults in Social Commentary" all used scikit-learn!

• A library for machine learning in Python
• Shorthand / import path sklearn
• Pythonic, Zen
• Easy to use, good docs, very active development
In [1]:
import sklearn
sklearn.__version__

Out [1]:
'0.12.1'

## Philosophy

• Goals: ease of use, speed, accessible to new devs
• Data format: The Numpy array (or Scipy sparse)
• Pros: speed, standardisation
• Cons: missing values
• Flat is better than nested: avoid complex engineering & OO Overkill
• Only implement tried-and-true algorithms (could be the accompanying code for ESL/PRML)
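To illustrate the data-format point: estimators accept the same data as a dense NumPy array or a SciPy sparse matrix. A minimal sketch (LogisticRegression is an arbitrary choice here, and the tiny dataset is made up):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

X_dense = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.]])
y = np.array([1, 0, 1, 0])
X_sparse = sparse.csr_matrix(X_dense)  # same data, sparse container

clf = LogisticRegression().fit(X_dense, y)      # dense input
clf_sp = LogisticRegression().fit(X_sparse, y)  # sparse input, same API
```

The downside mentioned above is real: neither container has a notion of missing values, so imputation has to happen before the data reaches an estimator.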

## Let's generate some sample data

In [3]:
from sklearn.datasets import make_blobs
X, y = make_blobs(cluster_std=1, random_state=0)
print 'X is %s, y is %s' % (X.shape, y.shape)

X is (100, 2), y is (100,)

In [6]:
def show_dataset(X, y):
    for label in np.unique(y):
        scatter(X[y == label, 0], X[y == label, 1], c="rgb"[label])

show_dataset(X, y)


# Estimators

• API contract: can be fitted to data, can predict new outcomes
In [7]:
from sklearn.neighbors import NearestCentroid

clf = NearestCentroid()
clf.fit(X, y)

Out [7]:
NearestCentroid(metric='euclidean', shrink_threshold=None)
In [8]:
scatter(clf.centroids_[:, 0], clf.centroids_[:, 1])

Out [8]:
<matplotlib.collections.PathCollection at 0x106a51990>
In [9]:
clf.predict([20, -10])

Out [9]:
array([1])
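The fit/predict contract is small enough to sketch by hand. Here is a toy majority-class estimator, written purely for illustration (not part of sklearn):

```python
import numpy as np

class MajorityClassifier(object):
    """Toy estimator honouring the fit/predict contract:
    it always predicts the most frequent class seen in fit."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self  # fit returns self, so calls can be chained

    def predict(self, X):
        return np.full(len(X), self.majority_)

clf_toy = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
clf_toy.predict([[5]])  # array([1])
```

Note the sklearn convention: attributes learned during fit get a trailing underscore (`majority_`), just like `centroids_` above.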

Estimators implement a score method for evaluation:

In [10]:
print clf.score(X, y)

0.91


Other evaluation metrics are available in their own module:

In [12]:
from sklearn.metrics import f1_score, zero_one_score, v_measure_score, silhouette_score

y_pred = clf.predict(X)
print "F_1 score: %f" % f1_score(y, y_pred)
print "0-1 score: %f" % zero_one_score(y, y_pred)
print "V-measure: %f" % v_measure_score(y, y_pred)

F_1 score: 0.909980
0-1 score: 0.910000
V-measure: 0.710265


For classifiers, the default score is usually the $0-1$ accuracy, whereas for regressors it's $R^2$
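A quick check of that convention, sketched with LinearRegression on made-up numbers: a regressor's score method reports the same $R^2$ that metrics.r2_score computes from the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_r = np.array([[0.], [1.], [2.], [3.]])
y_r = np.array([0., 1.1, 1.9, 3.2])

reg = LinearRegression().fit(X_r, y_r)
# score() is the coefficient of determination R^2,
# i.e. the same number r2_score gives on the predictions
reg.score(X_r, y_r), r2_score(y_r, reg.predict(X_r))
```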

Estimators that matter a lot to us:

### LinearSVC / LogisticRegression

As opposed to the regular SVC, LinearSVC uses liblinear and is thus very efficient on sparse data.
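A sketch of what "efficient on sparse data" buys: LinearSVC trains directly on a scipy.sparse matrix without densifying it. The random bag-of-words-shaped data below is made up for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# a wide, mostly-zero matrix, shaped like bag-of-words vectorizer output
X_sp = sparse.csr_matrix(
    rng.binomial(1, 0.05, size=(200, 1000)).astype(np.float64))
y_sp = rng.randint(0, 2, size=200)

svm = LinearSVC().fit(X_sp, y_sp)  # liblinear consumes the sparse matrix as-is
```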

### LabelPropagation

Graph-based learning is very efficient for many NLP tasks. The LabelPropagation implementation in sklearn only supports knn graph construction for now. Contributions encouraged!

In [13]:
# Authors: Clay Woolam <[email protected]>
#          Andreas Mueller <[email protected]>
# Licence: BSD

from sklearn.semi_supervised import label_propagation
from sklearn.datasets import make_circles

# generate ring with inner box
n_samples = 200
X_c, y_c = make_circles(n_samples=n_samples, shuffle=False)
outer, inner = 0, 1
labels = -np.ones(n_samples)
labels[0] = outer
labels[-1] = inner

###############################################################################
# Learn with LabelSpreading
label_spread = label_propagation.LabelSpreading(kernel='knn', alpha=1.0)
label_spread.fit(X_c, labels)
output_labels = label_spread.transduction_

###############################################################################
# Plot output labels
figure(figsize=(8.5, 4))
subplot(1, 2, 1)
plot_outer_labeled, = plot(X_c[labels == outer, 0],
                           X_c[labels == outer, 1], 'rs')
plot_unlabeled, = plot(X_c[labels == -1, 0], X_c[labels == -1, 1], 'g.')
plot_inner_labeled, = plot(X_c[labels == inner, 0],
                           X_c[labels == inner, 1], 'bs')
legend((plot_outer_labeled, plot_inner_labeled, plot_unlabeled),
       ('Outer Labeled', 'Inner Labeled', 'Unlabeled'), 'upper left')
title("Raw data (2 classes=red and blue)")

subplot(1, 2, 2)
output_label_array = np.asarray(output_labels)
outer_numbers = np.where(output_label_array == outer)[0]
inner_numbers = np.where(output_label_array == inner)[0]
plot_outer, = plot(X_c[outer_numbers, 0], X_c[outer_numbers, 1], 'rs')
plot_inner, = plot(X_c[inner_numbers, 0], X_c[inner_numbers, 1], 'bs')
legend((plot_outer, plot_inner), ('Outer Learned', 'Inner Learned'))
title("Labels learned with Label Spreading (KNN)")



### Meta-estimators, e.g. multilabel classification

In [14]:
# Authors: Vlad Niculae, Mathieu Blondel

from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelBinarizer
from sklearn.decomposition import PCA

def plot_hyperplane(clf, min_x, max_x, linestyle, label):
    # get the separating hyperplane
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(min_x - 5, max_x + 5)  # make sure the line is long enough
    yy = a * xx - (clf.intercept_[0]) / w[1]
    plot(xx, yy, linestyle, label=label)

def plot_subfigure(X_m, Y_m, decision_boundaries):
    min_x = np.min(X_m[:, 0])
    max_x = np.max(X_m[:, 0])

    zero_class = np.where([0 in y for y in Y_m])
    one_class = np.where([1 in y for y in Y_m])
    scatter(X_m[:, 0], X_m[:, 1], s=40, c='gray')
    scatter(X_m[zero_class, 0], X_m[zero_class, 1], s=160, edgecolors='b',
            facecolors='none', linewidths=2, label='Class 1')
    scatter(X_m[one_class, 0], X_m[one_class, 1], s=80, edgecolors='orange',
            facecolors='none', linewidths=2, label='Class 2')
    axis('tight')

    plot_hyperplane(decision_boundaries[0], min_x, max_x, 'k--',
                    'Boundary\nfor class 1')
    plot_hyperplane(decision_boundaries[1], min_x, max_x, 'k-.',
                    'Boundary\nfor class 2')
    legend()
    xticks(())
    yticks(())

figure(figsize=(8, 6))

X_m, Y_m = make_multilabel_classification(n_classes=2, n_labels=1,
                                          allow_unlabeled=True,
                                          random_state=1)

X_m = PCA(n_components=2).fit_transform(X_m)
classif = OneVsRestClassifier(SVC(kernel='linear'))
classif.fit(X_m, Y_m)

plot_subfigure(X_m, Y_m, classif.estimators_)


## Transformers

API contract: can be fitted to data and can transform new data according to the learned model. To do both at once, use fit_transform.

In [16]:
from sklearn.preprocessing import Scaler

X_scaled = Scaler().fit_transform(X)
print "Before: %s %s" % (X.mean(axis=0), X.std(axis=0))
print "After: %s %s" % (X_scaled.mean(axis=0), X_scaled.std(axis=0))

Before: [ 0.47293671  2.83654807] [ 1.82787855  1.52943524]
After: [ -1.26565425e-16  -3.65263375e-16] [ 1.  1.]
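The contract is easy to sketch by hand. This toy mean-centering transformer (made up for illustration, not part of sklearn) mirrors what Scaler does, minus the variance scaling:

```python
import numpy as np

class MeanCenterer(object):
    """Toy transformer: fit learns the column means,
    transform subtracts them from new data."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

X_centered = MeanCenterer().fit_transform([[1., 2.], [3., 4.]])
# array([[-1., -1.],
#        [ 1.,  1.]])
```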


Matrix factorization models are seen as transformers: PCA, NMF, DictionaryLearning. So are manifold learning algorithms.
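For instance, a sketch with PCA on random made-up data: fit_transform learns the factorization and projects the samples in one call.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_rand = rng.randn(100, 5)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_rand)  # learn the components, then project
X_2d.shape  # (100, 2)
```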

In [17]:
from sklearn.lda import LDA  # Fisher's Linear Discriminant Analysis
# not Latent Dirichlet Allocation ;)

X_mdim = LDA().fit_transform(X_scaled, y)
subplot(1, 2, 1)
show_dataset(X_scaled, y)
subplot(1, 2, 2)
show_dataset(X_mdim, y)


But the most interesting transformers to us are in the feature_extraction.text module:

In [18]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# There also is the TfidfVectorizer which does it on the fly

docs = [
"club meet",
"cheap viagra",
"rolex club"
]

labels = [0, 1, 0, 1, 1]

vec = CountVectorizer(min_df=0)  # by default, rare words are ignored
X = vec.fit_transform(docs)
vec.vocabulary_

Out [18]:
{u'buy': 0,
u'cheap': 1,
u'club': 2,
u'group': 3,
u'meet': 4,
u'rolex': 6,
u'viagra': 7}
In [19]:
X

Out [19]:
<5x8 sparse matrix of type '<type 'numpy.int64'>'
with 11 stored elements in COOrdinate format>
In [20]:
X.toarray()

Out [20]:
array([[0, 0, 0, 1, 1, 1, 0, 0],
[2, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 1, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 0, 1, 0]])
In [21]:
char_vec = CountVectorizer(analyzer='char', ngram_range=(1, 2)).fit(docs)
char_vec.vocabulary_.keys()[:10]

Out [21]:
[u'gr', u'cl', u'ee', u'ea', u' m', u'ex', u'et', u' ', u'le', u'lu']

## Glue

### Pipelines

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

steps = [('vectorizer', CountVectorizer()),
('clf', MultinomialNB())]

pipe = Pipeline(steps)
pipe.fit(docs, labels)
print "Accuracy: %.2f" % pipe.score(docs, labels)

Accuracy: 1.00


### Cross-validation and Grid Search

Language detection in a few lines of code. Takes advantage of all your cores!

In [24]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import train_test_split, StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn import metrics

# Split the dataset in training and test set:
docs_train, docs_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5, random_state=0)

vectorizer = TfidfVectorizer(analyzer='char', use_idf=False)

pipeline = Pipeline([
    ('vec', vectorizer),
    ('clf', LinearSVC()),
])

grid = GridSearchCV(
    pipeline,
    {
        'clf__C': np.logspace(-2, 2, 5, base=10),
        'vec__ngram_range': [(1, 2), (1, 3), (1, 4)]
    },
    cv=StratifiedKFold(y_train, k=5),
    refit=True,
    n_jobs=-1
)

grid.fit(docs_train, y_train)

print grid.best_params_

y_predicted = grid.best_estimator_.predict(docs_test)

print metrics.classification_report(y_test, y_predicted,
                                    target_names=dataset.target_names)

# Plot the confusion matrix
print metrics.confusion_matrix(y_test, y_predicted)

# Predict the result on some short new sentences:
sentences = [
    u'This is a language detection test.',
    u'Ceci est un test de d\xe9tection de la langue.',
    u'Dies ist ein Test, um die Sprache zu erkennen.',
]
predicted = grid.best_estimator_.predict(sentences)

for s, p in zip(sentences, predicted):
    print u'The language of "%s" is "%s"' % (s, dataset.target_names[p])

{'vec__ngram_range': (1, 2), 'clf__C': 1.0}
precision    recall  f1-score   support

ar       1.00      1.00      1.00        18
de       1.00      1.00      1.00        43
en       1.00      1.00      1.00        44
es       1.00      1.00      1.00        46
fr       1.00      1.00      1.00        44
it       1.00      1.00      1.00        42
ja       1.00      1.00      1.00        32
nl       1.00      1.00      1.00        17
pl       1.00      1.00      1.00        16
pt       1.00      1.00      1.00        11
ru       1.00      1.00      1.00        25

avg / total       1.00      1.00      1.00       338

[[18  0  0  0  0  0  0  0  0  0  0]
[ 0 43  0  0  0  0  0  0  0  0  0]
[ 0  0 44  0  0  0  0  0  0  0  0]
[ 0  0  0 46  0  0  0  0  0  0  0]
[ 0  0  0  0 44  0  0  0  0  0  0]
[ 0  0  0  0  0 42  0  0  0  0  0]
[ 0  0  0  0  0  0 32  0  0  0  0]
[ 0  0  0  0  0  0  0 17  0  0  0]
[ 0  0  0  0  0  0  0  0 16  0  0]
[ 0  0  0  0  0  0  0  0  0 11  0]
[ 0  0  0  0  0  0  0  0  0  0 25]]
The language of "This is a language detection test." is "en"
The language of "Ceci est un test de détection de la langue." is "fr"
The language of "Dies ist ein Test, um die Sprache zu erkennen." is "de"


### Upcoming

FeatureUnion: concatenate the columns from several transformers. See Andy's blog post for it in use.
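A minimal sketch of the idea, assuming the FeatureUnion API as it ships in sklearn.pipeline (data and transformer choices made up for the example):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_u = rng.randn(50, 4)

union = FeatureUnion([('pca', PCA(n_components=2)),
                      ('scaled', StandardScaler())])
X_both = union.fit_transform(X_u)
X_both.shape  # (50, 6): two PCA columns next to the four scaled columns
```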

#### Relevant works in progress:

• Grid search support for other cost functions (PR)
• Replacements for grid search: random search, GP-based hyperparameter optimisation (PR by Andy Mueller based on James Bergstra's work)
• Random hashing-based vectorizer (Olivier Grisel and Lars Buitinck already laid out the foundation for this here)