A Big Data project to let you know more about university life in China

Which is the happiest university in China?
Which is the most athletic university in China?


Mining China's most popular campus social network Renren.com


What is UniLife?

UniLife is a big data project that acquires data by crawling the Renren website, stores it in a database, applies data mining methods, and visualizes the results in a friendly way.

It provides data analysis results for universities in China in terms of campus life, student background, employment prospects, etc. The data source is the Renren website, the most popular social network among college students in China. The data was acquired by web crawling and stored in a MySQL database, then analyzed with data mining methods. The results are visualized as charts.

What we do

Web Crawling

Make HTTP requests from a Java program. Imitate the log-in request, then crawl the data in web pages using Apache HttpClient.

Big Data Store

Collect the crawled data and insert it into the database efficiently

Data Mining

Process and analyze the data using methods such as natural language processing to compare differences in campus experience among universities in China

Data Visualization

Visualize the data mining results in a comprehensible way, such as rankings and graphs

See what analytic results we have now!

Campus Mood Rank
of universities in China

The happiest university

keyword "happy" mentioned most frequently


First, we search the statuses posted by students for keywords representing moods and match each status with the school its author attends. We can then analyze the data and rank schools by how frequently these mood keywords appear in the timelines of their students.
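The counting step described above can be sketched roughly as follows. This is only a minimal illustration, not the code in RenrenStats.java: the class MoodRank, the Status type and the sample school names are hypothetical, and the ranking uses raw counts without adjusting for how many students each school has.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the RenrenAnalyze package. */
class MoodRank {

    /** One crawled status: the poster's school and the status text. */
    static final class Status {
        final String school;
        final String text;

        Status(String school, String text) {
            this.school = school;
            this.text = text;
        }
    }

    /** Counts, per school, how many statuses contain the keyword. */
    static Map<String, Integer> countBySchool(List<Status> statuses, String keyword) {
        Map<String, Integer> counts = new HashMap<>();
        for (Status s : statuses) {
            if (s.text.contains(keyword)) {
                counts.merge(s.school, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns school names ordered from most to fewest matches (ties by name). */
    static List<String> rank(Map<String, Integer> counts) {
        List<String> schools = new ArrayList<>(counts.keySet());
        schools.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return schools;
    }
}
```

Calling MoodRank.rank(MoodRank.countBySchool(statuses, "happy")) would then give the schools in order of how often the keyword was mentioned.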


Related Status

Campus Lifestyle Rank
of universities in China

We also want to know what kind of activities are mentioned most frequently!


More Lifestyle Keywords

go to the gym, study, stay up late, party

Related Status

Tech Facts

Renren website log-in mechanism

To view other people's profiles, the first step is to log in, so we need to imitate the log-in with a Java program.

To learn the log-in mechanism of Renren, I used HTTP Analyzer to monitor the HTTP requests made during log-in.
The analyzer shows which name-value pairs are sent by the POST method and to what address.
With this information, we can imitate the log-in request to Renren.
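As a rough sketch of what imitating such a request looks like, the example below builds a form-encoded POST request from name-value pairs. It uses the JDK's built-in java.net.http API (Java 11+) instead of Apache HttpClient so that it runs without extra libraries. The URL and the field names "email" and "password" are placeholders, not the real Renren log-in address or parameters; those have to be taken from the analyzer output.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch; class and method names are not from the project. */
class LoginRequestSketch {

    /** Encodes name-value pairs as an application/x-www-form-urlencoded body. */
    static String formEncode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Builds (but does not send) a POST request carrying the form fields. */
    static HttpRequest buildLogin(String loginUrl, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
    }
}
```

To stay logged in for later page requests, the request would be sent with a client that keeps cookies, for example HttpClient.newBuilder().cookieHandler(new java.net.CookieManager()).build().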

Some main classes and methods of the RenrenAnalyze package

RenrenAnalyze package        crawls data from Renren and analyzes it

CrawlSearchedStatus.java        contains the configuration variables and the main method of the program

RenrenSpider.java        makes HTTP requests, imitates log-in behavior, crawls pages

dbhelper.java        connects to the MySQL database, inserts and updates data efficiently

RenrenStats.java        processes and analyzes the acquired data and generates statistical results

SwingGUI package        adds a graphical user interface to the program to make it more user friendly
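One common way to make the inserts done by a class like dbhelper.java efficient is JDBC batching inside a single transaction. The sketch below shows that technique under assumptions: the class name, the column names school and status_text, and the row layout are hypothetical, and the actual dbhelper.java may work differently.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Hypothetical sketch of batched inserts; not the project's dbhelper.java. */
class BatchInsertSketch {

    /** Builds the INSERT statement. The table name is checked because it cannot be a bound parameter. */
    static String insertSql(String tableName) {
        if (!tableName.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("unsafe table name: " + tableName);
        }
        // Column names are assumptions made for this example.
        return "INSERT INTO " + tableName + " (school, status_text) VALUES (?, ?)";
    }

    /** Inserts all rows as one batch in one transaction. Each row is {school, text}. */
    static void insertAll(Connection conn, String tableName, List<String[]> rows)
            throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(insertSql(tableName))) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

Sending many rows per round trip and committing once usually costs far less than one auto-committed INSERT per crawled status.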

Configuration before crawling

Variable setup for the main method

  /** Path of the txt file holding the Renren account username and password used to log in */
  String accountsFilePath = "D:/Renren/account.txt";
  /** Page number at which crawling starts */
  int offset = 9420;
  /** English keyword of the timeline statuses to crawl */
  String tableName = "party";
  /** Chinese keyword of the timeline statuses to crawl ("聚会" means "party") */
  String keyword = "聚会";
  /** Index of the first proxy host to use */
  int proxyIndex = 9;
  /** Whether a proxy host is needed from the start of the crawl */
  boolean needProxy = false;
  /** Pause time (in seconds) after every crawl */
  double sleepSec = 0.5;
      

Avoid being blocked

To avoid being blocked by anti-crawler measures, the following methods are used:

Take a rest before the next crawl!

Let the thread "sleep" briefly before the next crawl (sleepSec seconds, 0.5 in the configuration above)
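A minimal sketch of such a pause, assuming the sleepSec value from the configuration is passed in; the class and method names are made up for this example.

```java
/** Hypothetical helper that pauses the crawling thread between requests. */
class CrawlPause {

    /** Converts a pause given in seconds (e.g. sleepSec = 0.5) to milliseconds. */
    static long toMillis(double sleepSec) {
        return Math.round(sleepSec * 1000);
    }

    /** Blocks the current thread for the configured pause. */
    static void pause(double sleepSec) throws InterruptedException {
        Thread.sleep(toMillis(sleepSec));
    }
}
```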

Change host and port to make HTTP requests

Every once in a while, switch to a different free proxy host and port to continue crawling, so that no single IP gets blocked
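Rotating through a list of proxies can be sketched as below. This is an illustration only: the class name and the example host names are hypothetical, and startIndex plays the role that proxyIndex has in the configuration.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;

/** Hypothetical round-robin proxy selector; assumes a non-empty list and startIndex >= 0. */
class ProxyRotation {
    private final List<InetSocketAddress> proxies;
    private int index;

    ProxyRotation(List<InetSocketAddress> proxies, int startIndex) {
        this.proxies = proxies;
        this.index = startIndex % proxies.size();
    }

    /** Returns the proxy currently in use. */
    Proxy current() {
        return new Proxy(Proxy.Type.HTTP, proxies.get(index));
    }

    /** Advances to the next proxy, wrapping around at the end of the list. */
    Proxy next() {
        index = (index + 1) % proxies.size();
        return current();
    }
}
```

The crawler would call next() after a fixed number of requests or when a request starts failing, then route the following requests through the returned proxy.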

Look like a browser

Change the HTTP request headers, especially the User-Agent, to look more like a browser
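Setting browser-like headers might look like the sketch below. It uses the JDK's java.net.http API rather than Apache HttpClient to stay dependency-free, and the User-Agent string and the other header values are just example values, not the ones the project sends.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch of a GET request with browser-like headers. */
class BrowserLikeRequest {

    // Example desktop browser User-Agent; any current browser string could be used.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    /** Builds (but does not send) a GET request that carries browser-like headers. */
    static HttpRequest get(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html,application/xhtml+xml")
                .header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8")
                .GET()
                .build();
    }
}
```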