
πŸ‘©πŸ»β€πŸš€ 11-Data Miining -Affinity Propagation Algorithm - This repo shows how Affinity Propagation performs unsupervised clustering by finding exemplars automatically β€” no need to set the number of clusters like K-Means. It uses message passing between data points and includes code, K-Means comparison and Silhouette Score evaluation


Quantum-Software-Development/11-DataMining_Affinity_Propagation_Algorithm



[🇧🇷 Português] [🇺🇸 English]





Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics












Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

☞ Access Data Mining Main Repository

If you'd like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.







This repository provides an in-depth review of the Affinity Propagation Algorithm as applied in data mining tasks, emphasizing both theory and practical implementation. It is designed as a companion to professional and academic courses focusing on unsupervised learning and practical clustering methods.



Affinity Propagation is a clustering algorithm that identifies exemplars, actual data points that best represent their neighbors, and forms clusters around them. Unlike algorithms such as KMeans, it does not require the number of clusters to be specified beforehand. Instead, data points exchange "messages" until a set of exemplars, and with it a number of clusters, emerges from the structure of the data.


  • Key steps:

    - Exchanges real-valued messages between data points.

    - Determines exemplars based on similarities.

    - Forms clusters around exemplars.
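The key steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this repository: it assumes negative squared Euclidean distance as the similarity, the median similarity as every point's preference, and a damping factor of 0.5. The function name `affinity_propagation_sketch` and the toy data are hypothetical. The two messages are the responsibility r(i, k), which reflects how well suited point k is to serve as exemplar for point i, and the availability a(i, k), which reflects how appropriate it would be for point i to choose k.

```python
import numpy as np

def affinity_propagation_sketch(X, damping=0.5, n_iter=200):
    """Illustrative message passing; returns (exemplar indices, labels)."""
    n = X.shape[0]
    # Similarity: negative squared Euclidean distance (an assumption of this sketch).
    diff = X[:, None, :] - X[None, :, :]
    S = -np.sum(diff ** 2, axis=2)
    # Preference on the diagonal: median of the off-diagonal similarities.
    np.fill_diagonal(S, np.median(S[~np.eye(n, dtype=bool)]))

    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)

    for _ in range(n_iter):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k) for i' not in {i, k})
        # a(k, k) = sum of positive r(i', k) for i' != k
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # A point is an exemplar when r(k, k) + a(k, k) > 0.
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    if exemplars.size == 0:
        # No exemplar emerged (for example, no convergence): mark all as unassigned.
        return exemplars, np.full(n, -1)
    # Each point joins its most similar exemplar; exemplars label themselves.
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Hypothetical toy data: two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2)),
])
exemplars, labels = affinity_propagation_sketch(X)
print("exemplar indices:", exemplars)
print("clusters found:", exemplars.size)
```

For real work, a maintained implementation such as `sklearn.cluster.AffinityPropagation` is likely the better choice, since it adds convergence checks that this sketch omits.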



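Affinity Propagation does not take the number of clusters as an input. The count emerges from message passing, which seeks the set of exemplars that maximizes the total similarity between points and their assigned exemplars, and it is influenced by each point's "preference" value: higher preferences tend to produce more exemplars. KMeans, by contrast, needs a predefined number of clusters.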
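The contrast can be tried with scikit-learn. The sketch below is only an illustration: the synthetic blobs, the `preference=-50` value, and the variable names are assumptions, not settings taken from this repository. In scikit-learn, `preference` defaults to the median of the input similarities, and lowering it generally yields fewer exemplars, while `KMeans` must be told `n_clusters` up front.

```python
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumed for illustration): three blobs in 2D.
centers = [[1, 1], [-1, -1], [1, -1]]
X_blobs, _ = make_blobs(n_samples=300, centers=centers,
                        cluster_std=0.5, random_state=0)

# Affinity Propagation: no cluster count is passed in.
ap_default = AffinityPropagation(random_state=0).fit(X_blobs)
n_default = len(ap_default.cluster_centers_indices_)

# A lower preference (hypothetical value) makes points less willing to be exemplars.
ap_low = AffinityPropagation(preference=-50, random_state=0).fit(X_blobs)
n_low = len(ap_low.cluster_centers_indices_)

# KMeans: the cluster count is a required modeling decision.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)

print("AP clusters (default preference):", n_default)
print("AP clusters (preference=-50):", n_low)
print("KMeans clusters (fixed by n_clusters):", len(km.cluster_centers_))
```

Printing the counts rather than hard-coding them is deliberate: the number of exemplars depends on the data and on the preference, so it is worth inspecting for each dataset.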



Unsupervised learning algorithms are designed to identify patterns or groupings in datasets without labeled responses. The model explores the input data to find structure, such as clusters or associations.


  • Supervised learning uses labeled data to train predictive models for regression or classification.

  • Unsupervised learning (like Affinity Propagation, KMeans, Hierarchical Clustering) finds patterns without explicit feedback.



| Criteria | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Labeled data | Yes | No |
| Tasks | Classification, Regression | Clustering, Association |
| Example algorithms | Decision Tree, SVM | KMeans, Affinity Propagation |




The Silhouette Score evaluates how well an object lies within its cluster compared to other clusters. It ranges from -1 to 1:

- Close to 1: sample is well matched to its own cluster.

- Close to 0: sample is on or very close to the decision boundary between two clusters.

- Close to -1: sample might have been assigned to the wrong cluster.


The Silhouette Score $S$ for a sample is:


$S = \frac{b - a}{\max(a, b)}$



- $a$ = mean intra-cluster distance for a sample (mean distance to all other samples in the same cluster).

- $b$ = mean nearest-cluster distance (lowest average distance to samples of another cluster).
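The formula can be checked by hand on a tiny example. The sketch below assumes Euclidean distance and uses a hypothetical four-point dataset; `silhouette_for_sample` is an illustrative helper, not part of any library. Its output is compared against scikit-learn's `silhouette_samples` and `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_for_sample(X, labels, i):
    """Return S = (b - a) / max(a, b) for sample i, using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same_cluster = labels == labels[i]
    same_cluster[i] = False  # exclude the sample itself from its own cluster mean
    if not same_cluster.any():
        return 0.0  # convention for a cluster with a single member
    a = dists[same_cluster].mean()
    # b: lowest mean distance to the members of any other cluster
    b = min(dists[labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Hypothetical toy data: two tight pairs of points, far apart.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels_toy = np.array([0, 0, 1, 1])

manual = np.array([silhouette_for_sample(X_toy, labels_toy, i)
                   for i in range(len(X_toy))])
print("manual per-sample scores: ", manual)
print("sklearn per-sample scores:", silhouette_samples(X_toy, labels_toy))
print("mean score:", manual.mean(), silhouette_score(X_toy, labels_toy))
```

The single number usually reported for a clustering is the mean of the per-sample values, which is what `silhouette_score` returns.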



A Correlation Matrix shows the statistical relationship (correlation) between pairs of variables. It helps to identify whether variables move together (correlate) and how strongly.



Two variables have positive correlation if they increase in tandem.



Tip

Example: Height and weight in humans typically show positive correlation.



Negative correlation occurs when one variable tends to decrease as the other increases.



Tip

Example: The amount of time spent watching TV and academic grades may have a negative correlation.
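Both cases can be reproduced with synthetic data. The sketch below is an illustration only: the variables are generated, not measured, and the slopes and noise level are arbitrary assumptions. It uses `np.corrcoef` to compute the Pearson correlation coefficient, which is positive when two variables rise together and negative when one falls as the other rises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic variables (assumed for illustration).
x = rng.normal(size=500)
noise = rng.normal(scale=0.5, size=500)

y_pos = 2 * x + noise    # rises with x
y_neg = -2 * x + noise   # falls as x rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print("positive correlation:", round(r_pos, 3))
print("negative correlation:", round(r_neg, 3))
```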






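| Feature/Aspect | Affinity Propagation | KMeans |
|---|---|---|
| Cluster count | Determined automatically (influenced by the preference parameter) | Must be specified |
| Speed / Scalability | Slower for large datasets (cost grows quadratically with the number of samples) | Fast for large datasets |
| Sensitive to initialization | No | Yes |
| Suitable for | Clusters grouped around representative data points; works with any pairwise similarity measure | Roughly spherical clusters |
| Handles outliers | Often better (exemplars are actual data points) | Poorly (centroids shift toward outliers) |
| Core principle | Message passing | Minimizing distance to centroids |
| Memory requirement | Higher (stores pairwise similarity and message matrices) | Lower |
| Illustrative use cases | Small/medium data, unknown cluster count | Large data, known K |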




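Below is a Python example that generates a correlation matrix heatmap for each of two dataframes (df1, df2). Because both dataframes are filled with independent random values, the off-diagonal correlations will be close to zero; replace them with your own data to see meaningful structure.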



import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Example with 5 random variables for df1
np.random.seed(42)
df1 = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])

corr_matrix1 = df1.corr()
sns.heatmap(corr_matrix1, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - df1')
plt.show()

# To generate another correlation matrix for another dataframe df2:
df2 = pd.DataFrame(np.random.rand(100, 5), columns=['V', 'W', 'X', 'Y', 'Z'])

corr_matrix2 = df2.corr()
sns.heatmap(corr_matrix2, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - df2')
plt.show()



Note: To plot the same dataframe again under a different label (example: "df1 for another experiment"), reuse the heatmap call and change only the title. To plot a different dataframe, change both the dataframe and the title.











Copyright 2025 Quantum Software Development. Code released under the MIT License.
