EMAIL AND SPAM CLASSIFICATION USING PROBABILISTIC METHODS

Ankur Sharma1

M.Tech Scholar, Department of CEIT, Suresh Gyan Vihar University, Jaipur

ankur.18179@mygyanvihar.com

Sohit Agarwal2

Assistant Professor, Department of CEIT, Suresh Gyan Vihar University, Jaipur

sohit.agarwal@mygyanvihar.com

 

Abstract: This paper discusses a technique for classifying e-mail as ham or spam. The classification of ham and spam is an old problem, and a lot of work has been done on it. Here we use a probabilistic method for this classification. It is a simple mathematical tool based on Bayes' theorem of probability.

It is a very lightweight approach, and the importance of the work is that it does not depend on the dimension of the statement.

Keywords: Bayesian, Spam, Probability, Tokenization

 

I. INTRODUCTION

Nowadays there are several ways to separate legitimate mail from spam, and a variety of spam filters are in use. Bayes' theorem can be used for this application. The theorem involves three terms: the likelihood, the total probability, and the a posteriori probability.

We calculate the probability that a mail is ham or spam given the features of the e-mail. These features are collected from the data set, which we downloaded from Kaggle.

Naive Bayes is a probabilistic classification method, and a lot of research has been conducted to improve its performance. Paul Graham applied the Bayesian approach to spam filtering [1].

The question is whether a given e-mail, described by its features, is spam or ham. Since the features are what we observe, the quantity we need is the a posteriori probability that the message is spam given those features.

By Bayes' rule, the probability that an e-mail is spam given its features equals the prior probability of spam multiplied by the likelihood of the features given spam, divided by the total probability of the features. We also suppose that the individual words are conditionally independent given the class. Bayes' rule combined with this assumption of conditional independence of words is what is called naive Bayes.
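
Bayes' rule with the conditional-independence assumption for words can be written out as follows. The symbols are illustrative and do not appear in the original text: $S$ denotes the spam class, $H$ the ham class, and $w_1, \dots, w_n$ the words (features) of a message.

```latex
P(S \mid w_1,\dots,w_n)
  = \frac{P(S)\,\prod_{i=1}^{n} P(w_i \mid S)}{P(w_1,\dots,w_n)},
\qquad
P(w_1,\dots,w_n)
  = P(S)\prod_{i=1}^{n} P(w_i \mid S) + P(H)\prod_{i=1}^{n} P(w_i \mid H)
```

The denominator is the total probability of the features, so the posteriors for spam and ham sum to one.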

 

II. NAÏVE BAYES METHOD

 

The naive Bayesian classifier [2] [3] is a popular technique for the spam problem, alongside other techniques [4]. Paul Graham proposed using this idea [1], and the process has since been revised [5]. Bayesian filters differ mainly in how they treat individual words [6]. Many variations of the algorithm continue to be applied [7].

Given a message represented by a feature vector X, the posterior probability of each category is computed, and the category with the highest probability is taken as the target class of X.
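
A small worked example may make the choice of the target category concrete. The priors and word likelihoods below are invented numbers for illustration, not values from the dataset, and the function name `posteriors` is hypothetical. The sketch multiplies only the likelihoods of the words that occur in the message.

```python
# Toy illustration of picking the category with the highest posterior.
# All probabilities are invented for this example.
priors = {"ham": 0.5, "spam": 0.5}

# P(word occurs | class) for two hypothetical words
likelihoods = {
    "ham": {"free": 0.1, "meeting": 0.4},
    "spam": {"free": 0.6, "meeting": 0.05},
}

def posteriors(words):
    """Return P(class | words) for each class, assuming word independence."""
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for w in words:
            p *= likelihoods[cls][w]
        joint[cls] = p
    total = sum(joint.values())  # total probability of the observed words
    return {cls: p / total for cls, p in joint.items()}

post = posteriors(["free"])
target = max(post, key=post.get)  # category with the highest posterior
print(post, target)
```

With these numbers the joint values are 0.05 for ham and 0.30 for spam, so the message containing "free" is assigned to the spam category.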

 

 

 

 

III. DATA PROCESSING AND MODELLING

The dataset contains 5574 samples, of which 2850 are ham e-mails and the rest are spam. The complete statement is used for detection. The statement is taken from the subject of the e-mail and is converted into numerical values of 0 and 1.
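
One possible way to convert a statement into values of 0 and 1 is a binary bag-of-words vector, in which each position records whether a vocabulary word occurs in the statement. The sketch below assumes simple lowercase tokenization; the helper names `tokenize`, `build_vocabulary` and `binarize`, and the two sample statements, are hypothetical and are not taken from the original program.

```python
import re

def tokenize(statement):
    """Lowercase a statement and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", statement.lower())

def build_vocabulary(statements):
    """Map every distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for s in statements for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}

def binarize(statement, vocabulary):
    """Return a 0/1 list: 1 where the vocabulary word occurs in the statement."""
    present = set(tokenize(statement))
    return [1 if tok in present else 0 for tok in vocabulary]

# Made-up statements, for illustration only
statements = ["Win a free prize now", "Meeting at noon"]
vocab = build_vocabulary(statements)
vectors = [binarize(s, vocab) for s in statements]
print(list(vocab))
print(vectors)
```

Every statement is mapped to a vector of the same length regardless of how many words it contains, which is what the probability calculations need.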

The representation of the e-mail (Figure 1) is important, because the correct application of naive Bayes depends on it.

 

 

Figure 1 shows a sample of the data set used in the program. The whole work is done in Python.

 

The structure of the program is as follows:

Load the training data and import the libraries

Calculate the probability of features of the mail

Calculate the probability of feature when class of mail is given.

Calculate the probability of the class.

Do the same process for all the classes and features.

Multiply all the probabilities

Finally calculate the probability of class when feature is given

Calculate the maximum a posteriori probability (MAP)

Compare the MAP results for ham and spam

Take the decision
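
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' program: it assumes binary (0/1) features, it adds add-one (Laplace) smoothing so that no probability is exactly zero, and the function names `train` and `predict` and the tiny training set are hypothetical.

```python
from collections import Counter

def train(X, y, alpha=1.0):
    """Estimate class priors P(c) and per-feature P(x_j = 1 | c).

    alpha is an add-one (Laplace) smoothing constant; smoothing is an
    assumption of this sketch, not something stated in the text.
    """
    n_features = len(X[0])
    class_counts = Counter(y)
    priors = {c: class_counts[c] / len(y) for c in class_counts}
    cond = {}
    for c in class_counts:
        rows = [x for x, label in zip(X, y) if label == c]
        cond[c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return priors, cond

def predict(x, priors, cond):
    """Multiply prior and feature probabilities per class, normalise, take the maximum."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for j, value in enumerate(x):
            p *= cond[c][j] if value == 1 else 1.0 - cond[c][j]
        joint[c] = p
    evidence = sum(joint.values())  # total probability of the features
    posterior = {c: p / evidence for c, p in joint.items()}
    return max(posterior, key=posterior.get), posterior

# Tiny made-up training set: the two columns stand for two hypothetical words.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = ["spam", "spam", "ham", "ham"]
priors, cond = train(X, y)
label, posterior = predict([1, 0], priors, cond)
print(label, posterior)
```

For long feature vectors the product of many small probabilities can underflow, so a common alternative is to sum logarithms of the probabilities and compare the sums instead.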

 

IV. RESULT & CONCLUSION

We have designed an adaptive e-mail spam classification model. The model achieves 97% accuracy and can be used with Twitter, e-mail, or Facebook accounts.

 

 

References

 

[1] J. Clark, I. Koprinska and J. Poon, “Linger – A Smart Personal Assistant for E-Mail Classification”, in International Conference on Artificial Neural Networks, 2003, pp. 274–277.

[2] S. Wasi, S. Jami and Z. Shaikh, “Context-based email classification model”, Expert Systems, vol. 33, no. 2, pp. 129–144, 2015.

[3] I. Alsmadi and I. Alhami, “Clustering and classification of email contents”, Journal of King Saud University – Computer and Information Sciences, vol. 27, no. 1, pp. 46–57, 2015.

[4] J. Rennie, “ifile: An Application of Machine Learning to E-Mail Filtering”, in Proceedings of the KDD (Knowledge Discovery in Databases) Workshop on Text Mining, 2000.

[5] S. Sayed, “Three-Phase Tournament-Based Method for Better Email Classification”, International Journal of Artificial Intelligence & Applications, vol. 3, no. 6, pp. 49–56, 2012.

[6] M. Fuad, D. Deb and M. Hossain, “A trainable fuzzy spam detection system”, in 7th International Conference on Computer and Information Technology, 2004.

[7] S. Youn and D. McLeod, “Spam Email Classification using an Adaptive Ontology”, JSW, vol. 2, no. 3, 2007.

[8] M. Aery and S. Chakravarthy, “eMailSift: Email Classification Based on Structure and Content”, Data Mining, Fifth IEEE Int. Conf., pp. 18–25, 2005.

[9] S. Chakravarthy, A. Venkatachalam, and A. Telang, “A graph-based approach for multi-folder email classification”, Proc. – IEEE Int. Conf. Data Mining, ICDM, pp. 78–87, 2010.

[10] T. Ayodele, S. Zhou, and R. Khusainov, “Email Classification Using Back Propagation Technique”, Int. J., vol. 1, no. 1, pp. 3–9, 2010.