Homework10

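Machine Learning

Data modeling is central to machine learning: it underpins training and evaluating models, making predictions, and optimizing decisions.

Model training and learning: a machine learning model is trained by learning patterns and relationships from data, so its performance and accuracy depend on the quality and quantity of that data. Good data modeling supplies high-quality training data and helps build more accurate and reliable models.

General framework of machine learning

Choose a model: pick a machine learning model suited to the nature of the problem. For classification, for example, candidates include support vector machines, decision trees, and random forests.
Split the dataset: divide the data into a training set and a test set so that the model's performance can be evaluated. A common choice is 80% for training and 20% for testing.
Train the model: fit the model on the training set.
Evaluate the model: measure the model's performance on the test set.
Tune the model: adjust the model based on its performance, for example by changing hyperparameters or using cross-validation.
Predict: use the trained model to make predictions on new data.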
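
The dataset-splitting step can be illustrated without any library. The sketch below is a plain-Python hold-out split, not scikit-learn's `train_test_split`; the helper name `split_indices` and its parameters are assumptions made for illustration.

```python
import random

def split_indices(n_samples, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx) for a shuffled hold-out split (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seeded so the split is reproducible
    n_test = round(n_samples * test_ratio)    # number of samples held out for testing
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(150)      # 150 samples, as in the Iris dataset
print(len(train_idx), len(test_idx))
```

Fixing the seed matters for the same reason the tasks below fix a random seed: the evaluation numbers are only comparable across runs if the split is the same.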

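Task 1: Learning SVM with the Iris dataset

Load the data and split the Iris dataset, with a test-set ratio of 0.2 and random seed 42
Create and train an SVM model with a linear kernel and random seed 42 (you may also experiment with other parameters to see their effect and pick better values, as long as the comments make this clear)
Evaluate the results with four metrics: Accuracy, Recall, F1 Score, and Confusion Matrix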

In [1]:

# 1. Load the data and split the Iris dataset (test-set ratio 0.2, random seed 42)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Training set size: 120
Test set size: 30

In [2]:

# 2. Create and train an SVM model with a linear kernel and random seed 42
#    (other parameters can be tried to see their effect; document any change in a comment)
from sklearn import svm

model = svm.SVC(kernel='linear', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [3]:

# 3. Evaluate with four metrics: Accuracy, Recall, F1 Score, and Confusion Matrix
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='macro')  # macro: unweighted mean over the classes
f1 = f1_score(y_test, y_pred, average='macro')
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 1.00
Recall: 1.00
F1 Score: 1.00
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
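
All three scalar metrics can be read off the confusion matrix. The sketch below recomputes them in plain Python, assuming rows are actual classes and columns are predicted classes. The function name `macro_metrics` and the second matrix, `hypothetical_cm`, are made up for illustration (the latter shows what happens when some predictions are wrong); this is not scikit-learn's implementation.

```python
def macro_metrics(cm):
    """Accuracy, macro recall and macro F1 from a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    recalls, f1s = [], []
    for i in range(n):
        tp = cm[i][i]                                  # correctly predicted as class i
        fn = sum(cm[i]) - tp                           # class i predicted as something else
        fp = sum(cm[r][i] for r in range(n)) - tp      # other classes predicted as class i
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(recalls) / n, sum(f1s) / n

iris_cm = [[10, 0, 0], [0, 9, 0], [0, 0, 11]]          # matrix printed by the cell above
hypothetical_cm = [[10, 0, 0], [0, 8, 1], [0, 2, 9]]   # made-up matrix with three errors

print(macro_metrics(iris_cm))
print(macro_metrics(hypothetical_cm))
```

Because macro averaging weights every class equally, a class with few samples influences the score as much as a large one.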

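SVM Basic Concepts

A support vector machine (SVM) is a binary classification model. It maps each instance's feature vector to a point in space (in two dimensions, for example, the solid and hollow points of the original figure, which belong to two different classes) and looks for the line that "best" separates the two classes, so that points arriving later are also classified well. In more than two dimensions the separating line becomes a hyperplane. SVMs are well suited to classification problems with small to medium sample sizes, nonlinear structure, and high-dimensional features.

SVM is a supervised learning model: the data must be labeled first, and the binary classification problem is then solved by finding the separator with the maximum margin. Multiclass problems can be handled by combining several SVM classifiers.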
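
The idea of a maximum margin can be made concrete with a toy case. The sketch below assumes a hand-picked hyperplane w·x + b = 0 for two hypothetical 2-D points. It does not train anything; it only evaluates the decision rule and the margin width 2/||w|| for that fixed choice.

```python
import math

# Hypothetical hyperplane w·x + b = 0, chosen by hand for the two toy points below
w = (1.0, 1.0)
b = -3.0

def decision_value(x):
    """Signed score w·x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    return 1 if decision_value(x) >= 0 else -1

points = {(1.0, 1.0): -1, (2.0, 2.0): 1}   # toy samples with labels -1 / +1
margin_width = 2 / math.hypot(*w)          # distance between the planes w·x + b = -1 and +1

for x, label in points.items():
    print(x, label, predict(x), decision_value(x))
print(margin_width)
```

Both toy points have a decision value of exactly -1 or +1, so they lie on the margin boundaries; points in that position are the support vectors that give the method its name.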

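Task 2: Learning Naive Bayes with news classification

Import the libraries and the dataset; the dataset is imported with: from sklearn.datasets import fetch_20newsgroups
Inspect the category labels, the dataset description, and some data samples
Convert the text data into a bag-of-words model
Split the dataset into training and test sets, with a test-set ratio of 0.2 and random seed 42
Create and train a Naive Bayes classifier
Evaluate the results with three metrics: Accuracy, Recall, and F1 Score
Plot the confusion matrix with Predicted on the x-axis and Actual on the y-axis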

In [4]:

# 1. Import the library and the dataset
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all')

In [5]:

# 2. Inspect the category labels, the dataset description, and some data samples
categories = newsgroups.target_names
print("Category labels:")
print(categories)

description = newsgroups.DESCR
print("\nDataset description:")
print(description)

Category labels:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Dataset description:
.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    =================   ==========
    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features                  text
    =================   ==========

|details-start|
**Usage**
|details-split|

The :func:`sklearn.datasets.fetch_20newsgroups` function is a data
fetching / caching functions that downloads the data archive from
the original `20 newsgroups website`_, extracts the archive contents
in the ``~/scikit_learn_data/20news_home`` folder and calls the
:func:`sklearn.datasets.load_files` on either the training or
testing set folder, or both of them::

  >>> from sklearn.datasets import fetch_20newsgroups
  >>> newsgroups_train = fetch_20newsgroups(subset='train')

  >>> from pprint import pprint
  >>> pprint(list(newsgroups_train.target_names))
  ['alt.atheism',
   'comp.graphics',
   'comp.os.ms-windows.misc',
   'comp.sys.ibm.pc.hardware',
   'comp.sys.mac.hardware',
   'comp.windows.x',
   'misc.forsale',
   'rec.autos',
   'rec.motorcycles',
   'rec.sport.baseball',
   'rec.sport.hockey',
   'sci.crypt',
   'sci.electronics',
   'sci.med',
   'sci.space',
   'soc.religion.christian',
   'talk.politics.guns',
   'talk.politics.mideast',
   'talk.politics.misc',
   'talk.religion.misc']

The real data lies in the ``filenames`` and ``target`` attributes. The target
attribute is the integer index of the category::

  >>> newsgroups_train.filenames.shape
  (11314,)
  >>> newsgroups_train.target.shape
  (11314,)
  >>> newsgroups_train.target[:10]
  array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

It is possible to load only a sub-selection of the categories by passing the
list of the categories to load to the
:func:`sklearn.datasets.fetch_20newsgroups` function::

  >>> cats = ['alt.atheism', 'sci.space']
  >>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

  >>> list(newsgroups_train.target_names)
  ['alt.atheism', 'sci.space']
  >>> newsgroups_train.filenames.shape
  (1073,)
  >>> newsgroups_train.target.shape
  (1073,)
  >>> newsgroups_train.target[:10]
  array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])

|details-end|

|details-start|
**Converting text to vectors**
|details-split|

In order to feed predictive or clustering models with the text data,
one first need to turn the text into vectors of numerical values suitable
for statistical analysis. This can be achieved with the utilities of the
``sklearn.feature_extraction.text`` as demonstrated in the following
example that extract `TF-IDF`_ vectors of unigram tokens
from a subset of 20news::

  >>> from sklearn.feature_extraction.text import TfidfVectorizer
  >>> categories = ['alt.atheism', 'talk.religion.misc',
  ...               'comp.graphics', 'sci.space']
  >>> newsgroups_train = fetch_20newsgroups(subset='train',
  ...                                       categories=categories)
  >>> vectorizer = TfidfVectorizer()
  >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
  >>> vectors.shape
  (2034, 34118)

The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero
components by sample in a more than 30000-dimensional space
(less than .5% non-zero features)::

  >>> vectors.nnz / float(vectors.shape[0])
  159.01327...

:func:`sklearn.datasets.fetch_20newsgroups_vectorized` is a function which
returns ready-to-use token counts features instead of file names.

.. _`20 newsgroups website`: http://people.csail.mit.edu/jrennie/20Newsgroups/
.. _`TF-IDF`: https://en.wikipedia.org/wiki/Tf-idf

|details-end|

|details-start|
**Filtering text for more realistic training**
|details-split|

It is easy for a classifier to overfit on particular things that appear in the
20 Newsgroups data, such as newsgroup headers. Many classifiers achieve very
high F-scores, but their results would not generalize to other documents that
aren't from this window of time.

For example, let's look at the results of a multinomial Naive Bayes classifier,
which is fast to train and achieves a decent F-score::

  >>> from sklearn.naive_bayes import MultinomialNB
  >>> from sklearn import metrics
  >>> newsgroups_test = fetch_20newsgroups(subset='test',
  ...                                      categories=categories)
  >>> vectors_test = vectorizer.transform(newsgroups_test.data)
  >>> clf = MultinomialNB(alpha=.01)
  >>> clf.fit(vectors, newsgroups_train.target)
  MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

  >>> pred = clf.predict(vectors_test)
  >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
  0.88213...

(The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles
the training and test data, instead of segmenting by time, and in that case
multinomial Naive Bayes gets a much higher F-score of 0.88. Are you suspicious
yet of what's going on inside this classifier?)

Let's take a look at what the most informative features are:

  >>> import numpy as np
  >>> def show_top10(classifier, vectorizer, categories):
  ...     feature_names = vectorizer.get_feature_names_out()
  ...     for i, category in enumerate(categories):
  ...         top10 = np.argsort(classifier.coef_[i])[-10:]
  ...         print("%s: %s" % (category, " ".join(feature_names[top10])))
  ...
  >>> show_top10(clf, vectorizer, newsgroups_train.target_names)
  alt.atheism: edu it and in you that is of to the
  comp.graphics: edu in graphics it is for and of to the
  sci.space: edu it that is in and space to of the
  talk.religion.misc: not it you in is that and to of the

You can now see many things that these features have overfit to:

- Almost every group is distinguished by whether headers such as
  ``NNTP-Posting-Host:`` and ``Distribution:`` appear more or less often.
- Another significant feature involves whether the sender is affiliated with
  a university, as indicated either by their headers or their signature.
- The word "article" is a significant feature, based on how often people quote
  previous posts like this: "In article [article ID], [name] <[e-mail address]>
  wrote:"
- Other features match the names and e-mail addresses of particular people who
  were posting at the time.

With such an abundance of clues that distinguish newsgroups, the classifiers
barely have to identify topics from text at all, and they all perform at the
same high level.

For this reason, the functions that load 20 Newsgroups data provide a
parameter called **remove**, telling it what kinds of information to strip out
of each file. **remove** should be a tuple containing any subset of
``('headers', 'footers', 'quotes')``, telling it to remove headers, signature
blocks, and quotation blocks respectively.

  >>> newsgroups_test = fetch_20newsgroups(subset='test',
  ...                                      remove=('headers', 'footers', 'quotes'),
  ...                                      categories=categories)
  >>> vectors_test = vectorizer.transform(newsgroups_test.data)
  >>> pred = clf.predict(vectors_test)
  >>> metrics.f1_score(pred, newsgroups_test.target, average='macro')
  0.77310...

This classifier lost over a lot of its F-score, just because we removed
metadata that has little to do with topic classification.
It loses even more if we also strip this metadata from the training data:

  >>> newsgroups_train = fetch_20newsgroups(subset='train',
  ...                                       remove=('headers', 'footers', 'quotes'),
  ...                                       categories=categories)
  >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
  >>> clf = MultinomialNB(alpha=.01)
  >>> clf.fit(vectors, newsgroups_train.target)
  MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

  >>> vectors_test = vectorizer.transform(newsgroups_test.data)
  >>> pred = clf.predict(vectors_test)
  >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
  0.76995...

Some other classifiers cope better with this harder version of the task. Try the
:ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`
example with and without the `remove` option to compare the results.
|details-end|

.. topic:: Data Considerations

  The Cleveland Indians is a major league baseball team based in Cleveland,
  Ohio, USA. In December 2020, it was reported that "After several months of
  discussion sparked by the death of George Floyd and a national reckoning over
  race and colonialism, the Cleveland Indians have decided to change their
  name." Team owner Paul Dolan "did make it clear that the team will not make
  its informal nickname -- the Tribe -- its new team name." "It's not going to
  be a half-step away from the Indians," Dolan said."We will not have a Native
  American-themed name."

  https://www.mlb.com/news/cleveland-indians-team-name-change

.. topic:: Recommendation

  - When evaluating text classifiers on the 20 Newsgroups data, you
    should strip newsgroup-related metadata. In scikit-learn, you can do this
    by setting ``remove=('headers', 'footers', 'quotes')``. The F-score will be
    lower because it is more realistic.
  - This text dataset contains data which may be inappropriate for certain NLP
    applications. An example is listed in the "Data Considerations" section
    above. The challenge with using current text datasets in NLP for tasks such
    as sentence completion, clustering, and other applications is that text
    that is culturally biased and inflammatory will propagate biases. This
    should be taken into consideration when using the dataset, reviewing the
    output, and the bias should be documented.

.. topic:: Examples

   * :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`

   * :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`

   * :ref:`sphx_glr_auto_examples_text_plot_hashing_vs_dict_vectorizer.py`

   * :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`

In [6]:

# Print the first three raw samples
sample_data = newsgroups.data[:3]
for i, sample in enumerate(sample_data, 1):
    print(f"\nSample {i}:")
    print(sample)

Sample 1:
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

Sample 2:
From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local Bus

  - Orchid Farenheit 1280

  - ATI Graphics Ultra Pro

  - Any other high-performance VLB card

Please post or email.  Thank you!

  - Matt

-- 
    |  Matthew B. Lawson <------------> (mblawson@essex.ecn.uoknor.edu)  |   
  --+-- "Now I, Nebuchadnezzar, praise and exalt and glorify the King  --+-- 
    |   of heaven, because everything he does is right and all his ways  |   
    |   are just." - Nebuchadnezzar, king of Babylon, 562 B.C.           |   

Sample 3:
From: hilmi-er@dsv.su.se (Hilmi Eren)
Subject: Re: ARMENIA SAYS IT COULD SHOOT DOWN TURKISH PLANES (Henrik)
Lines: 95
Nntp-Posting-Host: viktoria.dsv.su.se
Reply-To: hilmi-er@dsv.su.se (Hilmi Eren)
Organization: Dept. of Computer and Systems Sciences, Stockholm University

|>The student of "regional killings" alias Davidian (not the Davidian religios sect) writes:

|>Greater Armenia would stretch from Karabakh, to the Black Sea, to the
|>Mediterranean, so if you use the term "Greater Armenia" use it with care.

	Finally you said what you dream about. Mediterranean???? That was new....
	The area will be "greater" after some years, like your "holocaust" numbers......

|>It has always been up to the Azeris to end their announced winning of Karabakh 
|>by removing the Armenians! When the president of Azerbaijan, Elchibey, came to 
|>power last year, he announced he would be be "swimming in Lake Sevan [in 
|>Armeniaxn] by July".
		*****
	Is't July in USA now????? Here in Sweden it's April and still cold.
	Or have you changed your calendar???

|>Well, he was wrong! If Elchibey is going to shell the 
|>Armenians of Karabakh from Aghdam, his people will pay the price! If Elchibey 
						    ****************
|>is going to shell Karabakh from Fizuli his people will pay the price! If 
						    ******************
|>Elchibey thinks he can get away with bombing Armenia from the hills of 
|>Kelbajar, his people will pay the price. 
			    ***************

	NOTHING OF THE MENTIONED IS TRUE, BUT LET SAY IT's TRUE.

	SHALL THE AZERI WOMEN AND CHILDREN GOING TO PAY THE PRICE WITH
						    **************
	BEING RAPED, KILLED AND TORTURED BY THE ARMENIANS??????????

	HAVE YOU HEARDED SOMETHING CALLED: "GENEVA CONVENTION"???????
	YOU FACIST!!!!!

	Ohhh i forgot, this is how Armenians fight, nobody has forgot
	you killings, rapings and torture against the Kurds and Turks once
	upon a time!

|>And anyway, this "60 
|>Kurd refugee" story, as have other stories, are simple fabrications sourced in 
|>Baku, modified in Ankara. Other examples of this are Armenia has no border 
|>with Iran, and the ridiculous story of the "intercepting" of Armenian military 
|>conversations as appeared in the New York Times supposedly translated by 
|>somebody unknown, from Armenian into Azeri Turkish, submitted by an unnamed 
|>"special correspondent" to the NY Times from Baku. Real accurate!

Ohhhh so swedish RedCross workers do lie they too? What ever you say
"regional killer", if you don't like the person then shoot him that's your policy.....l

|>[HE]	Search Turkish planes? You don't know what you are talking about.<-------
|>[HE]	since it's content is announced to be weapons? 				i	 
										i
|>Well, big mouth Ozal said military weapons are being provided to Azerbaijan	i
|>from Turkey, yet Demirel and others say no. No wonder you are so confused!	i
										i
										i
	Confused?????								i
	You facist when you delete text don't change it, i wrote:		i
										i
        Search Turkish planes? You don't know what you are talking about.	i
        Turkey's government has announced that it's giving weapons  <-----------i
        to Azerbadjan since Armenia started to attack Azerbadjan		
        it self, not the Karabag province. So why search a plane for weapons	
        since it's content is announced to be weapons?   

	If there is one that's confused then that's you! We have the right (and we do)
	to give weapons to the Azeris, since Armenians started the fight in Azerbadjan!

|>You are correct, all Turkish planes should be simply shot down! Nice, slow
|>moving air transports!

	Shoot down with what? Armenian bread and butter? Or the arms and personel 
	of the Russian army?

Hilmi Eren
Stockholm University

In [7]:

# 3. Convert the text data into a bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
newsgroups = fetch_20newsgroups(subset='all')
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
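
To show what a bag-of-words representation contains, here is a minimal plain-Python sketch on two made-up documents. It is only an approximation of the idea: `CountVectorizer` also lowercases, applies its own tokenization rule, and returns a sparse matrix, none of which is reproduced here. The names `docs` and `to_counts` are assumptions for illustration.

```python
from collections import Counter

docs = ["the pens beat the devils", "the card supports local bus"]  # toy documents

# Vocabulary: every distinct token, sorted so that the column order is stable
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_counts(doc):
    """Map one document to its vector of token counts over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

bow = [to_counts(doc) for doc in docs]
print(vocabulary)
print(bow)
```

Each document becomes one row and each vocabulary word one column, so word order is discarded and only frequencies remain.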

In [8]:

# 4. Split the dataset into training and test sets (test-set ratio 0.2, random seed 42)
from sklearn.model_selection import train_test_split

newsgroups = fetch_20newsgroups(subset='all')
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:

# 5. Create and train a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

Out[9]:

MultinomialNB()

In [11]:

# 6. Evaluate with three metrics: Accuracy, Recall, and F1 Score
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_pred = nb_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f"Accuracy: {accuracy:.6f}")
print(f"Recall: {recall:.6f}")
print(f"F1 Score: {f1:.6f}")

Accuracy: 0.850398
Recall: 0.845460
F1 Score: 0.836677

In [12]:

# 7. Plot the confusion matrix with Predicted on the x-axis and Actual on the y-axis
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Font with CJK glyphs; only needed when labels contain Chinese text (assumes SimHei is installed)
plt.rcParams['font.sans-serif'] = ['SimHei']

y_pred = nb_classifier.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=newsgroups.target_names, yticklabels=newsgroups.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

[Figure: confusion matrix heatmap, Predicted on the x-axis and Actual on the y-axis]

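Naive Bayes Basic Concepts

Naive Bayes is a statistical classification method based on Bayes' theorem. It is widely used in machine learning and data mining, and performs particularly well on tasks such as text classification and spam filtering.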
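
To illustrate how Bayes' theorem turns word counts into a class decision, the sketch below implements a tiny multinomial Naive Bayes with Laplace smoothing in plain Python. The four training documents, the class names, and the helper names are all made up for illustration; this is a simplified sketch and not scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Toy labeled documents (hypothetical, for illustration only)
train = [
    ("hockey", "pens beat devils in playoffs"),
    ("hockey", "devils lose the game"),
    ("graphics", "video card with local bus"),
    ("graphics", "fast video card"),
]

class_docs = Counter(label for label, _ in train)   # number of documents per class
word_counts = defaultdict(Counter)                  # word frequencies per class
for label, text in train:
    word_counts[label].update(text.split())
vocabulary = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with Laplace smoothing alpha."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for word in text.split():
        if word not in vocabulary:
            continue  # words never seen in training carry no information here
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total_words + alpha * len(vocabulary)))
    return score

def classify(text):
    """Pick the class with the highest (unnormalized) log posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(classify("devils beat pens"))
print(classify("video card"))
```

The "naive" part is the sum inside `log_posterior`: each word contributes independently given the class, which is what keeps training and prediction cheap.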

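Cluster Analysis

Clustering is an unsupervised learning method that groups the samples of a dataset into clusters, so that samples within the same cluster are more similar to one another and samples in different clusters are less similar.

Clustering is a way to discover the intrinsic structure of data: it helps us understand how the data is organized, uncover hidden patterns, and extract useful information.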
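
One common way to form such groups is k-means, the algorithm used in the task that follows. The sketch below runs the basic assign-then-update loop (Lloyd's algorithm) on four made-up 1-D points. The hand-picked initial centers and the function name `kmeans_1d` are assumptions for illustration; scikit-learn's `KMeans` uses a different initialization and multiple restarts.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Lloyd's algorithm on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        labels = [min(range(len(centers)), key=lambda k: abs(p - centers[k])) for p in points]
        # Update step: each center moves to the mean of its assigned points
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return labels, centers

points = [1.0, 2.0, 10.0, 11.0]                           # toy data, two obvious groups
labels, centers = kmeans_1d(points, centers=[1.0, 2.0])   # deliberately poor starting centers
print(labels, centers)
```

Even from a poor start, the two steps pull the centers toward the two groups; on harder data the result can depend on the initialization, which is why a random seed is fixed in the task below.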

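Task 3: Learning k-means clustering with the Iris dataset

Import the dataset and the clustering library
Run k-means clustering to split the data into 3 clusters, with the random seed set to 0
Reduce to a 2-dimensional space with PCA, then visualize the result
Try reducing the dimensionality first and clustering afterwards, visualize the result again, and compare the two
Compare the clustering quality using the silhouette coefficient
Plot the silhouette coefficient against the number of clusters

The silhouette coefficient measures how similar a data point is to the other points in its own cluster compared with how dissimilar it is to the points in the nearest neighboring cluster.

For each sample, compute its mean distance to all other points in the same cluster (the mean intra-cluster distance, a).
For each sample, compute its mean distance to all points in the nearest other cluster (the mean nearest-cluster distance, b).
Compute the silhouette coefficient (S):

$$S = \frac{b - a}{\max(a, b)}$$

The silhouette coefficient takes values in [-1, 1]:

If S is close to 1, the sample is very similar to the other samples in its own cluster and dissimilar to samples in other clusters, so the clustering is good.
If S is close to -1, the sample is dissimilar to the other samples in its own cluster and similar to samples in another cluster, so the clustering is poor.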
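
The definition of a, b, and S can be checked by hand on a toy example. The sketch below computes the mean silhouette coefficient for four made-up 1-D points under a sensible labeling and a deliberately bad one. It assumes absolute difference as the distance and scores singleton clusters as 0. The function name `silhouette` is hypothetical, and this is not scikit-learn's `silhouette_score`.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points, with |p - q| as the distance."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:
            scores.append(0.0)   # assumed convention: a singleton cluster scores 0
            continue
        a = sum(same) / len(same)                       # mean intra-cluster distance
        b = min(                                        # mean distance to the nearest other cluster
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [1.0, 2.0, 10.0, 11.0]               # toy data, two obvious groups
good = silhouette(points, [0, 0, 1, 1])       # labels that follow the groups
bad = silhouette(points, [0, 1, 0, 1])        # labels that cut across the groups
print(good, bad)
```

The labeling that follows the two groups scores close to 1, while the one that cuts across them is negative, matching the interpretation given above.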

In [13]:

# 1. Import the dataset and the clustering library
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data

In [14]:

# 2. Run k-means clustering to split the data into 3 clusters, with the random seed set to 0
# n_init is set explicitly to the value used by this run's default, which avoids the FutureWarning
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Cluster labels:\n{kmeans.labels_}")

Cluster centers:
[[5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]]
Cluster labels:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
 2 0]

In [15]:

# 3. Reduce to a 2-dimensional space with PCA, then visualize the result
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
plt.rcParams['axes.unicode_minus'] = False

labels = kmeans.labels_

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', marker='o')
plt.title('K-means clustering result (PCA-reduced to 2 dimensions)')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.colorbar(label='Cluster label')
plt.show()

In [16]:

# 4. Reduce the dimensionality first, then cluster, visualize again, and compare the two results
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Variant A: cluster in the 2-D PCA space
# n_init is set explicitly to the value used by this run's default, which avoids the FutureWarning
kmeans_pca = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_pca.fit(X_pca)
labels_pca = kmeans_pca.labels_

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels_pca, cmap='viridis', marker='o')
plt.title('Reduce first, then cluster')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.colorbar(label='Cluster label')
plt.show()

# Variant B: cluster in the original 4-D space, then project to 2-D only for plotting
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_

X_pca_original = pca.transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca_original[:, 0], X_pca_original[:, 1], c=labels, cmap='viridis', marker='o')
plt.title('Cluster first, then reduce')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.colorbar(label='Cluster label')
plt.show()

In [17]:

# 5. Compare clustering quality using the silhouette score
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

iris = load_iris()
X = iris.data

# Project the four features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# n_init is set explicitly to 10, the default in the scikit-learn version used here,
# which keeps the results unchanged and avoids the n_init FutureWarning
kmeans_pca = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_pca.fit(X_pca)
labels_pca = kmeans_pca.labels_

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_

# Each score is computed in the feature space in which the clustering was fitted
silhouette_score_original = silhouette_score(X, labels)
silhouette_score_pca = silhouette_score(X_pca, labels_pca)

print(f"Silhouette score, clustering on original data: {silhouette_score_original:.6f}")
print(f"Silhouette score, clustering after PCA: {silhouette_score_pca:.6f}")

Silhouette score, clustering on original data: 0.552819
Silhouette score, clustering after PCA: 0.597676
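
The silhouette score reported above is the mean over all samples of s = (b - a) / max(a, b), where a is a sample's mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below recomputes that value directly from the definition and compares it with `silhouette_score`. The helper name `manual_silhouette` and the `*_demo` variables are hypothetical names introduced only for this illustration, and Euclidean distance is assumed (the scikit-learn default).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances, silhouette_score


def manual_silhouette(X, labels):
    """Mean silhouette coefficient from its definition (hypothetical helper)."""
    labels = np.asarray(labels)
    dist = pairwise_distances(X)  # Euclidean distances between all pairs
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a singleton cluster scores 0
        # The self-distance is 0, so dividing by n_own - 1 averages over the others only
        a = dist[i, own].sum() / (n_own - 1)
        # Mean distance to the closest cluster the sample does not belong to
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


X_demo = load_iris().data
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

manual = manual_silhouette(X_demo, labels_demo)
reference = silhouette_score(X_demo, labels_demo)
print(f"manual: {manual:.6f}, sklearn: {reference:.6f}")
```

Values close to 1 indicate compact, well-separated clusters; values near 0 indicate overlapping clusters. Because the distances are measured in whichever space is passed in, the two scores printed by the cell above are not measured on the same scale: the PCA score ignores the variance discarded by the projection.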


In [18]:

# 6. Plot the silhouette score against the number of clusters
silhouette_scores = []
cluster_range = range(2, 11)

for n_clusters in cluster_range:
    # n_init=10 matches the default of the scikit-learn version used here
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(X)
    labels = kmeans.labels_
    silhouette_avg = silhouette_score(X, labels)
    silhouette_scores.append(silhouette_avg)

# One point per candidate number of clusters
plt.figure(figsize=(10, 6))
plt.plot(cluster_range, silhouette_scores, marker='o')
plt.title('Silhouette score vs. number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.xticks(cluster_range)
plt.grid(True)
plt.show()
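
The curve plotted above can also be read off numerically. The following is a minimal sketch, assuming the same cluster range (2 to 10) and `random_state=0`, that tabulates the scores and marks the number of clusters with the highest silhouette score. The names `X_iris`, `scores_by_k` and `best_k` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X_iris = load_iris().data

# Silhouette score for each candidate number of clusters (same range as the plot)
scores_by_k = {}
for k in range(2, 11):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_iris)
    scores_by_k[k] = silhouette_score(X_iris, labels_k)

# The candidate with the highest mean silhouette score
best_k = max(scores_by_k, key=scores_by_k.get)

for k, score in scores_by_k.items():
    marker = "  <- highest" if k == best_k else ""
    print(f"k={k:2d}  silhouette={score:.4f}{marker}")
```

The highest silhouette score is only one criterion for choosing the number of clusters. Prior knowledge about the data (Iris contains three species) can justify a different choice than the one the score alone suggests.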
