Fake News Classification Machine Learning Model

by rudelabs.ai | Jan 20, 2023


Introduction

In this project we build a fake news classification model: a machine learning model trained to label news articles or statements as genuine or fake. The model is trained on a dataset of labeled examples of real and fake news and can then classify new, unseen articles or statements automatically. There are different approaches to building such a model; common techniques come from natural language processing, classical machine learning, and deep learning. Performance is evaluated by measuring accuracy, precision, recall, and other metrics on a separate test dataset.
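To make the evaluation metrics concrete, here is a minimal sketch that computes accuracy, precision, and recall from predictions. The labels are made up for illustration, and treating 1 as the "fake" class is an assumption of this sketch, not something taken from the dataset.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical labels: 1 = fake, 0 = real (an assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

scores = binary_metrics(y_true, y_pred)
print(scores)  # accuracy 0.625, precision 2/3, recall 0.5
```

Here the model finds only half of the fake items (recall 0.5) even though its accuracy is 0.625, which is why accuracy alone can be misleading.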

The model classifies news as fake or real based on the words in the text (here, the article title). We use scikit-learn's CountVectorizer to turn the text into word counts and NLTK's Porter stemmer to reduce each word to its stem.

Objectives

The main objectives of creating a fake news classification machine learning model are:

Identifying fake news by automatically classifying news articles or statements as genuine or fake based on patterns and characteristics learned from a labeled training dataset.
Improving the accuracy and performance of the classifier by experimenting with different machine learning algorithms, feature engineering techniques, and hyperparameter tuning.
Making the classifier more robust to different types of text and to issues such as imbalanced classes, missing data, and noisy data.
Incorporating additional information sources, such as social media data, to improve the classifier’s ability to identify fake news.
Improving the interpretability of the classifier by providing insights into the features and decision rules used by the model.
Continuously monitoring the classifier’s performance and updating it as new fake news detection techniques and data become available.
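One way to approach the tuning objective is a cross-validated grid search over a text pipeline. This is a minimal sketch under stated assumptions: the headlines are hand-written, the 1 = fake / 0 = real encoding is assumed, and the parameter grid is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical headlines; 1 = fake, 0 = real (assumed encoding).
texts = [
    "shocking miracle cure doctors hate",
    "you will not believe this secret trick",
    "aliens secretly control the government",
    "miracle pill melts fat overnight",
    "city council approves new budget",
    "central bank holds interest rates steady",
    "local school opens new library",
    "government publishes annual budget report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are tuned together, so the vocabulary is
# learned only from the training folds during cross-validation.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Choosing hyperparameters by cross-validation on the training data keeps the test set untouched for the final evaluation.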

Requirements

Building a fake news classification model in Python typically requires the following:

A labeled dataset of real and fake news articles or statements to train and evaluate the classifier.
Python programming language and a set of commonly used libraries such as NumPy, pandas, scikit-learn, and NLTK for data pre-processing, feature extraction, and machine learning.
A machine learning algorithm for building the classifier, such as logistic regression, Naive Bayes, decision trees, random forests, or deep learning models.
Knowledge of natural language processing techniques for text processing, such as tokenization, stemming, and lemmatization.
A development environment for coding and testing the classifier, such as Jupyter Notebook or PyCharm. We have used Jupyter Notebook.
Access to a computing platform with sufficient resources to train and test the classifier, such as a local machine or a cloud-based platform.
Familiarity with machine learning and data analysis fundamentals, such as feature engineering, model evaluation, and hyperparameter tuning.
Experience with visualization libraries such as Matplotlib and Seaborn to visualize the results and insights of the model.
Familiarity with web scraping and web crawling to extract data from different sources.

Source Code

import pandas as pd

# Load the labeled training data
df = pd.read_csv('fake-news/train.csv')
df.head()

## Get the Independent Features
X = df.drop('label', axis=1)
X.head()

## Get the Dependent features
y = df['label']
y.head()

df.shape

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

# Drop rows with missing values
df = df.dropna()
df.head(10)

# Work on a copy and rebuild a contiguous index after dropna()
messages = df.copy()
messages.reset_index(inplace=True)
messages.head(10)

messages['title'][6]

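import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# The stop-word list must be available locally; if it is not,
# run nltk.download('stopwords') once beforehand.
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(0, len(messages)):
    # Replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    review = review.lower()
    review = review.split()

    # Drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

corpus[3]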

## Applying Countvectorizer

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

# Keep the 5000 most frequent unigrams, bigrams and trigrams
cv = CountVectorizer(max_features=5000, ngram_range=(1, 3))
X = cv.fit_transform(corpus).toarray()
X.shape

y = messages['label']

## Divide the dataset into Train and Test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement (available since 1.0)
cv.get_feature_names_out()[:20]
cv.get_params()

count_df = pd.DataFrame(X_train, columns=cv.get_feature_names_out())
count_df.head()

import itertools

import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    See full source and example:
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        # Divide each row by its total so cells show per-class fractions
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    # Write each count in its cell, in white on dark cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import numpy as np
import itertools

# Train a multinomial Naive Bayes classifier and score it on the test split
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)

cm = metrics.confusion_matrix(y_test, pred)
# confusion_matrix orders rows and columns by sorted label value (0, then 1);
# check that the class names below match the dataset's label encoding
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])

score

y_train.shape

from sklearn.linear_model import PassiveAggressiveClassifier

# n_iter was replaced by max_iter in later scikit-learn releases
linear_clf = PassiveAggressiveClassifier(max_iter=50)
linear_clf.fit(X_train, y_train)
pred = linear_clf.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)

cm = metrics.confusion_matrix(y_test, pred)
plot_confusion_matrix(cm, classes=['FAKE Data', 'REAL Data'])

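classifier = MultinomialNB(alpha=0.1)
previous_score = 0

# Try several smoothing values and keep the best-scoring model.
# alpha=0 disables smoothing and may trigger warnings, depending on
# the scikit-learn version.
for alpha in np.arange(0, 1, 0.1):
    sub_classifier = MultinomialNB(alpha=alpha)
    sub_classifier.fit(X_train, y_train)
    y_pred = sub_classifier.predict(X_test)
    score = metrics.accuracy_score(y_test, y_pred)
    if score > previous_score:
        classifier = sub_classifier
        previous_score = score  # remember the best score seen so far
    print("Alpha: {}, Score : {}".format(alpha, score))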

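## Get Features names
feature_names = cv.get_feature_names_out()

# MultinomialNB.coef_ was removed in scikit-learn 1.1. For a two-class model,
# feature_log_prob_[1] holds the values that coef_[0] used to return: the
# log-probability of each feature given the class with the larger label.
classifier.feature_log_prob_[1]

### Most real
sorted(zip(classifier.feature_log_prob_[1], feature_names), reverse=True)[:20]

### Most fake
sorted(zip(classifier.feature_log_prob_[1], feature_names))[:5000]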
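
Once a model is trained, it can label a headline it has not seen before. The sketch below is a minimal, self-contained illustration of that step; the training headlines, the 1 = fake / 0 = real encoding, and the classify helper are all hypothetical. The key point is that new text must go through the already-fitted vectorizer with transform, not fit_transform, so that its columns match the ones the model was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training headlines; 1 = fake, 0 = real (assumed encoding).
train_texts = [
    "miracle cure doctors hate",
    "secret miracle trick revealed",
    "council approves city budget",
    "bank holds interest rates",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learn the vocabulary once
model = MultinomialNB().fit(train_matrix, train_labels)


def classify(headline):
    """Vectorize with the already-fitted vocabulary, then predict."""
    features = vectorizer.transform([headline])  # transform, not fit_transform
    return int(model.predict(features)[0])


print(classify("miracle cure found"))    # 1 (fake, under the assumed encoding)
print(classify("city budget approved"))  # 0 (real, under the assumed encoding)
```

Words the vectorizer never saw during training, such as "found" here, are simply ignored at prediction time.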

Output

Explanation of the Code

1. First, we imported the libraries required to build the model and loaded the dataset.

2. Then, we cleaned the dataset by dropping rows with null values using the dropna() function.

3. We inspected the dataset with the head() function.

4. Then, we replaced every non-letter character in the titles with a space and lowercased the text so that analysis becomes easier.

5. Using the Natural Language Toolkit (NLTK), we removed stop words and stemmed each word with the Porter stemmer. We then converted the cleaned titles into a bag-of-words matrix with scikit-learn's CountVectorizer and trained the classifiers with the fit function.

6. Algorithms used: CountVectorizer for feature extraction, and MultinomialNB and PassiveAggressiveClassifier for classification. HashingVectorizer and TfidfVectorizer are imported but not used in the code above.
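
The source code imports CountVectorizer, TfidfVectorizer, and HashingVectorizer. As a rough comparison, the sketch below runs all three on a tiny made-up corpus of already-stemmed text; the corpus and the n_features value are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Hypothetical cleaned headlines, invented for illustration.
corpus = [
    "senat vote new budget",
    "miracl cure shock doctor",
    "new budget shock senat",
]

# Raw counts per vocabulary term (8 distinct words in this corpus).
count_matrix = CountVectorizer().fit_transform(corpus)

# Same vocabulary, but counts are reweighted by inverse document frequency.
tfidf_matrix = TfidfVectorizer().fit_transform(corpus)

# HashingVectorizer keeps no vocabulary; n_features fixes the width up front.
hash_matrix = HashingVectorizer(n_features=16).transform(corpus)

print(count_matrix.shape)  # (3, 8)
print(tfidf_matrix.shape)  # (3, 8)
print(hash_matrix.shape)   # (3, 16)
```

CountVectorizer and TfidfVectorizer both learn a vocabulary, so their width depends on the data, while HashingVectorizer trades the ability to look up feature names for a fixed memory footprint.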

Conclusion

We have built a machine learning model that predicts whether a news item is fake or real, which can help flag disinformation and surface reliable information.
