The Spark
Everyone hates spam (except for the people who send it), and it’s a problem on any popular platform. Like the problem of scale, it’s something everyone has to deal with, and it is almost never really solved. Maybe there’s a way I can help Twitter reduce the amount of spam on its platform.
The Theory
I may be able to reduce the amount of spam in the Twittersphere by listening to the public timeline, computationally analyzing tweets that could be spam, and estimating the likelihood that each one comes from a spam account. I would then automatically report high-likelihood users to Twitter through its REST API. The two biggest challenges I see with this project are accuracy in identifying actual spammers, which needs to be near perfect, and doing that at a very large scale. The good news is that even a small improvement over Twitter’s spam engineers can have a significant effect. Finding a spammer just 30 minutes before Twitter does can produce a decent-sized decline in spam, because a spammer may send out 5 or 10 more spam tweets between when I find them and when Twitter does.
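To make the idea concrete, here is a minimal sketch of the scoring step. Everything in it is an assumption for illustration: the `Tweet` fields, the three signals, their weights, and the `REPORT_THRESHOLD` value are hypothetical placeholders, not a tested spam model. It deliberately leaves out the Twitter integration and only shows how a per-tweet likelihood could gate an automatic report.

```python
import re
from dataclasses import dataclass

URL_PATTERN = re.compile(r"https?://\S+")

# Hypothetical cutoff: only report when the score is very high,
# since false reports are worse than missed spammers.
REPORT_THRESHOLD = 0.9


@dataclass
class Tweet:
    """Illustrative stand-in for the fields pulled from the public timeline."""
    user: str
    text: str
    followers: int
    following: int


def spam_score(tweet: Tweet) -> float:
    """Return a rough spam likelihood in [0, 1] from a few hand-picked signals."""
    points = 0  # integer points out of 100; the weights are guesses
    if URL_PATTERN.search(tweet.text):
        points += 40  # contains a link
    if tweet.text.count("@") >= 3:
        points += 30  # mentions many users in one tweet
    if tweet.following > 10 * max(tweet.followers, 1):
        points += 30  # follows far more accounts than follow back
    return points / 100


def should_report(tweet: Tweet) -> bool:
    """Decide whether the account is a candidate for an automatic report."""
    return spam_score(tweet) >= REPORT_THRESHOLD
```

With these made-up weights, a tweet has to trip all three signals before it is reported, which leans toward the near-perfect accuracy the project needs at the cost of missing some spammers.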
The Lab
For my testing and development environment, I’m choosing Python because it’s easy to code in (not a lot of lines of code) and I can run it easily on any server. To help with the Twitter integration, I am using Tweepy. Once testing and development are nearly complete, I plan to run my program continuously on a homegrown, scalable Amazon EC2 cluster, with a webpage displaying live metrics of its performance, including how many spammers it has found, how accurate my reports are, etc.
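The live metrics could start out as a couple of counters. The sketch below is one possible shape, not a finished design: the `ReportMetrics` class and its method names are hypothetical, and it assumes a report counts as "confirmed" once there is some outside signal that the account really was a spammer (for example, the account later being suspended).

```python
from dataclasses import dataclass


@dataclass
class ReportMetrics:
    """Running totals that a metrics page could display."""
    reported: int = 0   # accounts reported as likely spammers
    confirmed: int = 0  # reports later confirmed as real spammers (assumed signal)

    def record_report(self) -> None:
        self.reported += 1

    def record_confirmation(self) -> None:
        self.confirmed += 1

    @property
    def accuracy(self) -> float:
        """Fraction of reports that were confirmed; 0.0 before any report."""
        return self.confirmed / self.reported if self.reported else 0.0
```

A webpage would only need to read `reported`, `confirmed`, and `accuracy` from one shared instance to show how the program is doing.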
