11- Data Mining / Affinity Propagation Algorithm

Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)

Important

⚠️ Heads Up

Projects and deliverables may be made publicly available whenever possible.
The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
All activities comply with the academic and ethical guidelines of PUC-SP.
Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.

Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository

If you’d like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.

Table of Contents

Introduction
What is the Affinity Propagation Algorithm?
How does it differ from other clustering methods?
Unsupervised vs. Supervised Learning
Silhouette Score: Evaluating Clustering
Correlation Matrix
- Positive Correlation
- Negative Correlation
Affinity Propagation vs. KMeans: Use Case Table
Code Examples: Correlation Matrix with 5 Variables
Bibliography
Complementary Bibliography

Introduction

This repository reviews the Affinity Propagation algorithm as applied to data mining tasks, covering both theory and practical implementation. It is designed as a companion to professional and academic courses on unsupervised learning and practical clustering methods.

What is the Affinity Propagation Algorithm?

Affinity Propagation is a clustering algorithm that identifies exemplars (actual data points chosen as cluster representatives) and groups the remaining data points around them. Unlike algorithms such as KMeans, it does not require the number of clusters to be specified beforehand. Instead, data points exchange messages until a set of exemplars, and therefore a number of clusters, emerges from the pairwise similarities.

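Key steps:

- Computes a similarity for every pair of data points (commonly the negative squared Euclidean distance).

- Exchanges real-valued messages between data points: "responsibility" (how well suited point k is to serve as the exemplar for point i) and "availability" (how appropriate it would be for point i to choose point k as its exemplar).

- Determines exemplars once the messages stabilize.

- Forms clusters by assigning each point to its most similar exemplar.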
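The key steps can be sketched in plain NumPy. This is a minimal illustration, not the repository's implementation: the function name `affinity_propagation_sketch`, the damping value, the fixed iteration count, and the choice of the median similarity as the preference are all assumptions made for the example, and there is no convergence check.

```python
import numpy as np

def affinity_propagation_sketch(X, damping=0.7, n_iter=200):
    """Illustrative message passing; returns (exemplar indices, labels)."""
    n = X.shape[0]
    # Similarity s(i, k): negative squared Euclidean distance
    diff = X[:, None, :] - X[None, :, :]
    S = -np.sum(diff ** 2, axis=2)
    # Preference s(k, k): median of the off-diagonal similarities (assumed choice)
    S[np.diag_indices(n)] = np.median(S[~np.eye(n, dtype=bool)])

    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)

    for _ in range(n_iter):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first_max = AS[rows, best]
        AS[rows, best] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[rows, best] = S[rows, best] - second_max
        R = damping * R + (1 - damping) * R_new

        # a(i, k) = min(0, r(k, k) + sum over i' not in {i, k} of max(0, r(i', k)))
        # a(k, k) = sum over i' != k of max(0, r(i', k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))      # keep r(k, k) unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp  # drop each point's own contribution
        diag_A = np.diag(A_new).copy()        # a(k, k) is not clipped at zero
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag_A)
        A = damping * A + (1 - damping) * A_new

    # Point k is an exemplar when r(k, k) + a(k, k) > 0
    exemplars = np.flatnonzero(np.diag(A) + np.diag(R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    # Each point joins its most similar exemplar; exemplars label themselves
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Two small, well-separated synthetic groups (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])
exemplars, labels = affinity_propagation_sketch(X)
print("exemplar indices:", exemplars)
print("labels:", labels)
```

Note that the number of clusters is never passed in; it falls out of which points end up with a positive self-responsibility plus self-availability.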

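How does it differ from other clustering methods?

Affinity Propagation does not take the number of clusters as an input: the count emerges from the message passing, which searches for a set of exemplars with high total similarity to the points assigned to them. The count is still influenced by the "preference" value given to each point (higher preferences yield more clusters). KMeans, by contrast, needs the number of clusters to be specified in advance.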
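A short scikit-learn sketch of this difference, assuming scikit-learn is installed. The synthetic blobs and the specific `preference=-50` value are illustrative assumptions; the cluster count Affinity Propagation reports depends on the data and on that preference, so it will not necessarily match the number of generated blobs.

```python
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 150 points drawn around 3 centers (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Affinity Propagation: no cluster count is passed in
ap = AffinityPropagation(random_state=0).fit(X)
n_default = len(ap.cluster_centers_indices_)

# A lower (more negative) preference generally yields fewer clusters
ap_low = AffinityPropagation(preference=-50, random_state=0).fit(X)
n_low_pref = len(ap_low.cluster_centers_indices_)

# KMeans: the cluster count is a required modeling choice
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Affinity Propagation, default preference:", n_default, "clusters")
print("Affinity Propagation, preference=-50:", n_low_pref, "clusters")
print("KMeans, n_clusters=3:", len(set(km.labels_)), "clusters")
```

If Affinity Propagation fails to converge, scikit-learn emits a warning and reports no exemplars, so a count of zero signals that `damping` or `max_iter` may need adjusting.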

Unsupervised vs. Supervised Learning

Unsupervised learning algorithms identify patterns or groupings in datasets without labeled responses. The model explores the input data to find structure, for example through clustering or association, and receives no explicit feedback. Affinity Propagation, KMeans, and Hierarchical Clustering all belong to this family.

Supervised learning uses labeled data to train predictive models for regression or classification.

| Criteria | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Labeled data | Yes | No |
| Tasks | Classification, Regression | Clustering, Association |
| Example Algorithms | Decision Tree, SVM | KMeans, Affinity Propagation |
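The difference shows up directly in how the models are fitted. The sketch below assumes scikit-learn and uses its bundled Iris dataset purely as an illustration: the supervised model receives the labels `y`, while the clustering model only ever sees `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are part of training
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
predicted_classes = clf.predict(X[:5])

# Unsupervised: only the features X are used; y is never shown to the model
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_[:5]

print("predicted classes:", predicted_classes)
print("cluster ids:", cluster_ids)
```

The cluster ids are arbitrary integers: they identify groups, not the species names that the classifier learned from the labels.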

Silhouette Score: Evaluating Clustering

The Silhouette Score measures how well a sample fits within its assigned cluster compared with the nearest other cluster. It ranges from -1 to 1:

- Close to 1: sample is well matched to its own cluster.

- Close to 0: sample is on or very close to the decision boundary between two clusters.

- Close to -1: sample might have been assigned to the wrong cluster.

The Silhouette Score $S$ for a sample is:

$S = \frac{b - a}{\max(a, b)}$

Where:

- $a$ = mean intra-cluster distance for a sample (mean distance to all other samples in the same cluster).

- $b$ = mean nearest-cluster distance (lowest average distance to samples of another cluster).

The score for a whole clustering is the mean of $S$ over all samples.
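To make the formula concrete, the sketch below computes $S$ by hand for each sample of a tiny hand-made dataset and compares the result with scikit-learn's `silhouette_samples`. The helper name `silhouette_of` and the six points are assumptions made for illustration; the helper does not handle single-sample clusters.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two compact, well-separated groups of three points each (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i, X, labels):
    """S = (b - a) / max(a, b) for sample i, using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # a excludes the sample itself
    a = dists[same].mean()
    # b: lowest mean distance to the samples of any other cluster
    b = min(dists[labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i, X, labels) for i in range(len(X))])
reference = silhouette_samples(X, labels)

print("manual:   ", np.round(manual, 3))
print("reference:", np.round(reference, 3))
print("overall score:", round(silhouette_score(X, labels), 3))
```

Because the groups are tight and far apart, every per-sample value lands close to 1.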

Correlation Matrix

A Correlation Matrix shows the statistical relationship (correlation) between pairs of variables. It helps to identify whether variables move together (correlate) and how strongly.

- Diagonal values: always 1 (a variable's correlation with itself).
- Off-diagonal values: range from -1 to 1.
  - 1: Perfect positive correlation.
  - 0: No linear correlation.
  - -1: Perfect negative correlation.
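These properties can be checked directly. The sketch below assumes pandas and NumPy, builds a small dataframe of random numbers (illustrative data, with made-up column names), and also recomputes one entry by hand using the Pearson formula, which is the method `DataFrame.corr()` uses by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x", "y", "z"])

corr = df.corr()  # Pearson correlation by default
c = corr.to_numpy()

diag_is_one = np.allclose(np.diag(c), 1.0)             # each variable vs. itself
is_symmetric = np.allclose(c, c.T)                     # corr(x, y) == corr(y, x)
within_bounds = bool(np.all(np.abs(c) <= 1.0 + 1e-12)) # all values in [-1, 1]

# Pearson r for one pair, computed by hand:
# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
x, y = df["x"].to_numpy(), df["y"].to_numpy()
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(corr.round(2))
print("diagonal is 1:", diag_is_one, "| symmetric:", is_symmetric)
print("r(x, y) by hand:", round(r_manual, 4))
```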

- Positive Correlation

Two variables are positively correlated when they tend to move in the same direction: as one increases, the other tends to increase as well.

Tip

Example: Height and weight in humans typically show positive correlation.

- Negative Correlation

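Negative correlation occurs when the variables tend to move in opposite directions: as one increases, the other tends to decrease. The relationship measured is linear, which is not the same as inverse proportionality (y = k/x).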

Tip

Example: The amount of time spent watching TV and academic grades may have a negative correlation.
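Both cases can be reproduced with synthetic numbers. The values below are generated for illustration only and are not real measurements: the slopes, noise levels, and variable names are assumptions chosen so that one pair rises together and the other moves in opposite directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Positive case: synthetic "weight" rises with synthetic "height", plus noise
height_cm = rng.normal(170, 10, size=200)
weight_kg = 70 + 0.9 * (height_cm - 170) + rng.normal(0, 5, size=200)

# Negative case: synthetic "grade" falls as synthetic "TV hours" rise, plus noise
tv_hours = rng.uniform(0, 6, size=200)
grade = 9 - 0.8 * tv_hours + rng.normal(0, 1, size=200)

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the off-diagonal entry
r_pos = np.corrcoef(height_cm, weight_kg)[0, 1]
r_neg = np.corrcoef(tv_hours, grade)[0, 1]

print("height vs. weight:", round(r_pos, 2))
print("TV hours vs. grade:", round(r_neg, 2))
```

The sign of each coefficient gives the direction of the relationship, and its magnitude shrinks as the noise grows relative to the trend.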

Affinity Propagation vs. KMeans: Use Case Table

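| Feature/Aspect | Affinity Propagation | KMeans |
|---|---|---|
| Cluster count | Determined automatically (influenced by the preference value) | Must be specified |
| Speed / Scalability | Slower for large datasets (cost grows quadratically with the number of samples) | Fast for large datasets |
| Sensitive to initialization | No | Yes |
| Suitable for | Compact clusters around exemplars; accepts any pairwise similarity measure | Spherical clusters |
| Handles outliers | Often better (exemplars are actual data points) | Poorly (centroids are means, which outliers pull) |
| Core principle | Message passing | Minimizing distances to centroids |
| Memory requirement | Higher (stores pairwise similarity and message matrices) | Lower |
| Illustrative use cases | Small/medium data, unknown cluster count | Large data, known K |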
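When the cluster count is unknown, one way to compare the two approaches is to let the Silhouette Score arbitrate. The sketch below assumes scikit-learn and synthetic blobs; the range of K values scanned and the `damping=0.9` setting are illustrative choices, not recommendations from the course material.

```python
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (illustrative only)
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=42)

# KMeans needs K, so scan a few candidates and keep the best silhouette
kmeans_scores = {}
for k in range(2, 8):
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    kmeans_scores[k] = silhouette_score(X, km_labels)
best_k = max(kmeans_scores, key=kmeans_scores.get)

# Affinity Propagation picks its own count from the data and the preference
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
n_ap = len(ap.cluster_centers_indices_)
# The silhouette is only defined for 2 <= clusters < samples
ap_score = silhouette_score(X, ap.labels_) if 2 <= n_ap < len(X) else float("nan")

print("KMeans best K:", best_k, "silhouette:", round(kmeans_scores[best_k], 3))
print("Affinity Propagation clusters:", n_ap, "silhouette:", round(ap_score, 3))
```

On larger datasets the KMeans scan stays cheap, while Affinity Propagation's pairwise matrices become the limiting factor, which is the trade-off the table summarizes.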

Code Examples: Correlation Matrix with 5 Variables

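Below is a Python example that computes and plots a correlation matrix for each of two dataframes (df1 and df2), each with 5 variables. Because the columns are independent random numbers, the off-diagonal correlations will be close to 0; replace df1 and df2 with real data to see meaningful structure.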

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Fix the seed so the random data (and the plots) are reproducible
np.random.seed(42)

# df1: 100 rows of 5 independent uniform random variables
df1 = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])

# Pearson correlation matrix (the pandas default), drawn as an annotated heatmap
corr_matrix1 = df1.corr()
plt.figure()  # start a new figure so the two heatmaps never share axes
sns.heatmap(corr_matrix1, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix - df1')
plt.show()

# df2: a second, independent dataframe with its own column names
df2 = pd.DataFrame(np.random.rand(100, 5), columns=['V', 'W', 'X', 'Y', 'Z'])

corr_matrix2 = df2.corr()
plt.figure()
sns.heatmap(corr_matrix2, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix - df2')
plt.show()

Note: To plot the same dataframe under a different label (example: "df1 for another experiment"), keep the dataframe and change only the plot title. To plot a different dataset, swap in the new dataframe and update the title to match.

Bibliography

1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence – A Machine Learning Approach. 2nd Ed. LTC.

3. Larson & Farber (2015). Applied Statistics. Pearson.

Complementary Bibliography

THOMAS, C. Data Mining. IntechOpen, 2018.
HUTTER, F.; KOTTHOFF, L.; VANSCHOREN, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Nature, 2019.
NETTO, A.; MACIEL, F. Python para Data Science e Machine Learning Descomplicado. Alta Books, 2021.
RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. GEN LTC, 2022.
SUD, K.; ERDOGMUS, P.; KADRY, S. Introduction to Data Science and Machine Learning. IntechOpen, 2020.

💌 Let the data flow... Ping Me !

🛸๋ My Contacts Hub


Copyright 2025 Quantum Software Development. Code released under the MIT License.
