# Twitter Community Analysis
This repo was created as part of a school project at LTU focusing on Social Network Data Mining under the direction of Dr. Mazin Al-Hamando.
The goal of this repo is to take a rather small number of well-connected Twitter accounts and do the following:
- Scrape all accounts that are following the seed accounts
- Scrape all follower and following accounts from each of these new "follower of seed" accounts
- Collect the profile page data of each of these "follower of seed" accounts
- Score each of the "follower of seed" on how relevant they are to the network using a novel method
- Create a single-command script to collect and process all of the data in an easily reproducible way.
This project will use "Official" LTU accounts as an example, including many
sport team accounts. See `seed.txt`.
## What is in this repo
Below are the important files in this repo:
- `seed.txt` is a file containing a few Twitter account names. These will be the starting point of our account finding. It contains 23 "Official" LTU Twitter accounts. It's mostly sports team pages, etc.
- `build_user_list.sh` is a script that takes a file full of Twitter account names and generates a list of their follower or following accounts. It saves them in `user_followers/<account>.txt` or `user_following/<account>.txt` respectively, where `<account>` is a single line from the input list. This script takes three arguments: the input list of accounts, the number of parallel jobs to run, and either "following" or "followers" to specify which kind of list to generate.
- `find_relevant_accounts.py` is a Python 3 script that reads in the scraped data and scores each account, saving various results to the `results/` folder. These include the sorted and ranked account relevancies, a graph of the score distribution, an edge list of the network, and the in/out-degrees of each node in the 2-distance follower neighborhood.
- `build_all_data.sh` is a script that runs `build_user_list.sh` and `find_relevant_accounts.py`. See below. This is the main script.
- `Results.tar.gz` and `Data.tar.gz` are the packaged outputs of this repo. Extract `Data.tar.gz` into this repo's root to be able to run the Python script.
## How to Use

### Environment Requirements
I started with a Debian 10 container. Only Python 3 packages are needed:

```
apt-get install python3-pip python3-distutils
pip3 install -r requirements.txt
```
### Build Script
One of the goals of this repo is to allow the entire process to be completed with a single command that can be left running for a day or two. Expect this command to take more than 24 hours to complete. Here, 15 is the number of parallel data-scrape processes, a reasonably safe number that should not cause any issues:

```
./build_all_data.sh seed.txt 15
```
### Output

In the end you should have:

- `data/followers_of_seed.txt`, which contains ~13,000 accounts that all follow one of the seed accounts.
- `data/user_followers` and `data/user_following`, which contain, for each follower-of-seed account, a list of all of its followers and a list of all accounts it follows. If the account is private, these lists will be empty.
- `data/profile_data.csv`, which contains all the information from each follower-of-seed account's profile page. This includes the number of followers/following of the account, even if it is private.
- `Data.tar.gz`, which, at around 350 MB, contains all of the above items.
- `results/results.csv`, which is the main output file of this repo. It contains the sorted scores, usernames, and account names of the public accounts following the seed file.
- `results/edges.txt`, which is the network graph of all the public accounts following the seed accounts.
- `results/degrees.txt`, which allows looking up the in/out-degree of each node in the graph.
- `results/enu.txt`, which allows looking up the associated username for each node number in the above two files.
- `results/distribution.png`, which graphs the scores of the followers to help get a feel for the results.
- `Results.tar.gz`, containing the above results files.
## Explanation

### Scraping Tool
twint is a Twitter scraper written in Python. It does not use the Twitter API, which limits follower requests to an average of about one per minute. It is not especially well written, and can only be run reliably in parallel as separate processes. It is also inefficient with its resources, and does things like querying Twitter's DNS record many times per request (I made ~8,000,000 DNS requests to mobile.twitter.com within a couple of hours while collecting data, which required that I set up a local DNS cache to speed things up). Among other things, twint can find "Followers" and "Following" accounts, and pull the entire profile page of any public account. Overall, it works very well for this project.
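For reference, a minimal twint invocation for one account might look like the sketch below. This is illustrative only; exact config options vary between twint versions, and the output path simply mirrors the layout `build_user_list.sh` uses:

```python
import twint  # pip3 install twint

# Scrape the follower list of a single account into a text file.
c = twint.Config()
c.Username = "LawrenceTechU"                   # one of the seed accounts
c.Output = "user_followers/LawrenceTechU.txt"  # mirrors build_user_list.sh's layout
twint.run.Followers(c)

# twint.run.Following(c) and twint.run.Lookup(c) cover the "following"
# lists and the profile-page scrape, respectively.
```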
### Network Composition
Since this project starts with only a small set of "seed" accounts, we must first discover a much larger set of accounts to use as nodes in the graph model. Only the followers of the seed accounts are used because: First, we want to avoid using very large and popular accounts like Lady Gaga or Donald Trump. Twitter has a phenomenon where the "importance" of accounts generally decreases going from an account to its followers, which naturally omits these huge accounts. Second, it is generally easy to search for good seed accounts whose followers will cover the entire network. To do so, think of the general concepts or large organizations that pertain to the network you want to search, and type those into the Twitter search. Sports teams, fraternities, and community leaders make for good seed accounts. The node accounts are known as "followers of seed" accounts.
Once we have our node accounts, we want to scrape the attributes and edges between them. To do so, we find both the follower and following accounts of each node account. At this point, we don't care about finding the popular accounts from the following search, since that data will quickly be discarded (we only care about the connections between the nodes, and not to non-network accounts). We also scrape the profile page of each node account. Note that the following/follower accounts cannot be scraped for private accounts, and so these nodes will end up being thrown out of the results.
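For illustration, the edge-filtering step could look something like the sketch below. It assumes each file in `data/user_followers/` is named after its account and holds one follower username per line; the real script's internals may differ:

```python
from pathlib import Path

# The node accounts are the "followers of seed".
nodes = set(Path("data/followers_of_seed.txt").read_text().split())

# Keep only edges whose endpoints are both node accounts; connections
# to accounts outside the network are discarded.
edges = []
for path in Path("data/user_followers").glob("*.txt"):
    account = path.stem
    if account not in nodes:
        continue
    for follower in path.read_text().split():
        if follower in nodes:
            edges.append((follower, account))  # follower -> followed
```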
### Scoring
The score is calculated with the following code:
```python
# magic score
# Important accounts are a combination of having a high percentage of their
# followers and following inside the network, and a large number of total
# connections in the network. Without any of these three things, the node is
# significantly less important. For example, a node with zero followers but a
# very large following should get an extremely low score, since it is most
# likely a spam account. For this reason, the three factors are multiplied.
# They are then cube-rooted to keep the score linear.
score = ((out_degree / followers) * (in_degree / following)
         * (out_degree + in_degree)) ** (1/3)
results.append((user, score, name))
```
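To make the arithmetic concrete, here is a self-contained toy example with made-up numbers, where `followers`/`following` are the account's profile totals and the degrees are its in-network connections:

```python
def score(out_degree, in_degree, followers, following):
    # Same formula as above.
    return ((out_degree / followers) * (in_degree / following)
            * (out_degree + in_degree)) ** (1 / 3)

# A well-connected member: high in-network fractions on both sides and
# 100 total in-network connections.
print(score(out_degree=50, in_degree=50, followers=200, following=100))
# (0.25 * 0.5 * 100) ** (1/3) = 12.5 ** (1/3) ~= 2.32

# An account with zero connections on one side scores exactly zero,
# matching the spam-account reasoning in the comments above.
print(score(out_degree=0, in_degree=40, followers=1, following=5000))
# -> 0.0
```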
This method of scoring does a good job in the following ways:
- Not including spam accounts who follow many accounts
- Sorting the larger accounts first
- Separating very small accounts with only a couple dozen followers/following from the noise.
This method of scoring does a bad job in the following ways:
- Finding accounts that are followed by a large percentage of the community but are not an integral part of it (they don't follow back).
- Accounts who don't follow the seed accounts will be missed entirely. This could theoretically be solved by including an extra layer of follower accounts as nodes, but this would jump the run time from days to weeks or months.
- Separating out accounts that have become inactive, or were only relevant for a very short period of time. Perhaps a way to remedy this would be to add another term to the score: the percentage of time since its creation that an account has stayed active (a rough sketch of this idea follows the list).
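A rough sketch of that extra term, purely speculative and not part of the current script (`created_at` and `last_tweet` would have to come from the profile data):

```python
from datetime import datetime, timezone

def activity_factor(created_at, last_tweet):
    # Fraction of the account's lifetime during which it stayed active.
    now = datetime.now(timezone.utc)
    lifetime = (now - created_at).total_seconds()
    active = (last_tweet - created_at).total_seconds()
    return max(0.0, min(1.0, active / lifetime))

# The factor could then be multiplied into the existing score, e.g.:
# score *= activity_factor(created_at, last_tweet)
```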
### Results

The example seed is good, since most students at LTU will follow at least one of the official LTU accounts. Further, many of the returned results are easy to verify as having an association with LTU, since they have it in their name. Below are the top 50 results from the example sports-team seed:
```
Username,Score,Name
LawrenceTechU,12.80092137827989,Lawrence Tech
LTU_FB,11.68747337317872,LTU Football
LTUAthletics,11.00516449990972,LTU Athletics
LTUMensSoccer,10.816285793960501,LTU Men's Soccer
LTU_WSOC,8.913779366651873,LTU Women's Soccer
LTUWBasketball,8.245973351413221,LTU Women's Basketball
LTUSigEp,7.820121344407897,LTU SigEp
LTUSOFTBALL,7.728649983759125,LTU Softball
LTU_WVolleyball,7.723786698314488,LTU Women's Volleyball
LTUHockey,7.674854479004646,LTU Hockey
BlueFromLTU,7.475230623143809,Blue Lawrence
LtuBarstool,7.362964796471958,LTU_Barstool
LTUSAAC,7.340163364105895,LTU SAAC
LTUOCS,7.165724564595381,LTU Career Services
LTUTechRec,6.76104327473853,LTUTECHREC
LTUWLacrosse,6.759945483417965,LTU Women's Lacrosse
LTUBASEBALL,6.532411085492512,LTU Baseball
LTULib,6.442541018116271,Lawrence Tech Lib
LTUGamedayLive,6.420720152532724,LTU Gameday Live
LTUalumni,6.346827977915432,Lawrence Tech Alumni
LTUOnline,6.273505841738704,Lawrence Tech Online
LTUAdmissions,6.2476778423464525,LTU Admissions
ltustugov,6.010563402090628,LTUStudentGov
LTUHousing,5.871958479251638,LTUHousing
JeffDuvendeck,5.774158087803759,Jeff Duvendeck
LTUMWGolf,5.64103721519239,LTU Golf
BDBNetwork,5.638408388202654,BlueDevilNetwork
LTUProblems,5.384493910046987,LTU Problems
ltumusicfest,5.270948251523028,LTU Music Fest
LTUmensvball,5.227839069137144,LTU Men's Volleyball
LTUgameops,5.224000236285381,LTU Game Operations
Delta_Tau_Sigma,5.126307865672807,Delta Tau Sigma LTU
KBG_LTU,5.0879406731075285,Kappa Beta Gamma
Coach_NWilliams,5.056587914740197,Nate Williams
EwbLtu,4.854667177747688,LTU EWB
BowlingLtu,4.748524635794039,LTU Men's Bowling
LTU_CoAD,4.70762575068569,LTU_CoAD
CoachBeckham,4.644640980666312,Coach Beckham 🔱
LarryTechTalk,4.544555684645781,Larry Tech Talk Show
LTUAAC,4.526454409922682,LTU AAC
WillDyer16,4.466947364267957,Will Dyer
LTUDANCETEAM,4.447512475027792,Lawrence Tech Dance
LTU_Tennis,4.431819896688798,LTU Tennis
ChiOmegaRho,4.4185564562361535,Chi Omega Rho
LTUMBasketball,4.393633553244091,LTU Men's Basketball
LTU_Goose,4.359176646047465,LTU Goose
erikvh98,4.071062313157371,Erik VerHoef
LTUHookups,3.9780361840948504,LTUHookups
BlueDevilDialer,3.939228965507794,LTU Blue Devil Dialers
ian_cudney,3.9200441388605114,ian
```
![Score distribution](results/distribution.png)

Here you can see the distribution of scores. The `find_relevant_accounts.py`
script finds the maximum slope at about account number 1,760, which makes for a
good cutoff for separating the connected and non-connected accounts.
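The cutoff itself amounts to finding the steepest drop in the ranked-score curve. Below is a minimal sketch of that idea on synthetic scores; the actual implementation in `find_relevant_accounts.py` may smooth the curve or differ in detail:

```python
import numpy as np

# Synthetic ranked scores: a connected core followed by a long noise tail.
rng = np.random.default_rng(0)
scores = np.sort(np.concatenate([
    rng.uniform(2.0, 13.0, 1760),   # connected accounts
    rng.uniform(0.0, 0.5, 11000),   # everyone else
]))[::-1]

# The most negative difference between consecutive ranked scores is the
# steepest drop, i.e. the elbow separating the two populations.
cutoff = int(np.argmin(np.diff(scores)))
print(f"cutoff near rank {cutoff}")  # ~1760 for this synthetic data
```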
### Data Quality

The twint tool does a very good job of finding every single following account. However, there are a few accounts where the tool left the lists incomplete due to segmentation faults. This was rare, but I did see it happen a good number of times while collecting all the data. I would guess it affected around 0.1% of the data, but this is far from scientific. To get a real number, the scrape could be run again and the two outputs checked for matching files. To complete the data, the larger file of each pair that did not match should be used in the data analysis.
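One way to do that comparison, assuming a second run was saved to a hypothetical `data2/` directory:

```python
from pathlib import Path

run_a = Path("data/user_followers")
run_b = Path("data2/user_followers")  # hypothetical second scrape

for f in sorted(run_a.glob("*.txt")):
    g = run_b / f.name
    if not g.exists():
        continue
    a, b = set(f.read_text().split()), set(g.read_text().split())
    if a != b:
        # The run that captured more followers is the more complete one.
        keep = f if len(a) >= len(b) else g
        print(f"{f.name}: runs disagree, keeping {keep}")
```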
