Make HTTP requests from a Java program: imitate the log-in request, then crawl data from web pages using Apache HttpClient.
Collect the large volume of crawled data and insert it into a database efficiently.
Process and analyze the data with methods such as natural language processing to compare differences in campus experience among universities in China.
Visualize the data-mining results in a comprehensible way, such as rankings and graphs.
To learn the log-in mechanism of Renren, I used HTTP Analyzer to monitor the HTTP requests sent during log-in.
The analyzer shows which name-value pairs are sent by the POST method and to which address.
With this information, we can imitate the log-in request to Renren.
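As a minimal sketch of what imitating the request involves, the snippet below turns a set of name-value pairs into the `application/x-www-form-urlencoded` body of a POST request, using only the Java standard library. The class name `LoginForm` and the field names `email` and `password` are hypothetical placeholders; the real names are whatever the analyzer shows for Renren's log-in form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the project sources.
class LoginForm {
    /** Percent-encodes one name or value for a form-urlencoded body. */
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be supported on every JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Joins the name-value pairs into a POST body, keeping insertion order. */
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> f : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(f.getKey())).append('=').append(enc(f.getValue()));
        }
        return body.toString();
    }

    /** Example pairs; the field names here are assumptions. */
    static Map<String, String> exampleFields() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("email", "user@example.com");
        fields.put("password", "secret");
        return fields;
    }
}
```

With Apache HttpClient 4.x, the same encoding is normally handled by wrapping a list of `BasicNameValuePair` objects in a `UrlEncodedFormEntity` and attaching it to an `HttpPost`, so a helper like this is only needed to see what goes over the wire.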
CrawlSearchedStatus.java: contains the configuration variables and the main method of the program
RenrenSpider.java: makes HTTP requests, imitates log-in behavior, and crawls pages
dbhelper.java: connects to the MySQL database and inserts and updates data efficiently
RenrenStats.java: processes and analyzes the acquired data and generates statistical results
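The file list does not say how dbhelper.java achieves efficient inserts. One common approach with MySQL, shown here only as an illustrative sketch, is to send many rows in a single multi-row `INSERT` so that each round trip to the server carries a whole batch. The class `BatchSql` and the column names `user_id` and `content` are hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not taken from dbhelper.java.
class BatchSql {
    /**
     * Builds a parameterized multi-row INSERT statement, for example
     * INSERT INTO t (a, b) VALUES (?, ?), (?, ?)
     * The table and column names are concatenated into the SQL text, so they
     * must come from trusted configuration, never from crawled data.
     */
    static String multiRowInsert(String table, List<String> columns, int rows) {
        if (rows < 1 || columns.isEmpty()) {
            throw new IllegalArgumentException("need at least one row and one column");
        }
        // One "(?, ?, ...)" group with a placeholder per column.
        String rowPlaceholders =
            "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        return "INSERT INTO " + table
            + " (" + String.join(", ", columns) + ") VALUES "
            + String.join(", ", Collections.nCopies(rows, rowPlaceholders));
    }
}
```

The resulting string can be passed to `Connection.prepareStatement`, with the values bound row by row. An alternative that keeps a single-row statement is JDBC's `PreparedStatement.addBatch()` followed by `executeBatch()`.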
/** path of the txt file holding the Renren account username and password used to log in */
String accountsFilePath = "D:/Renren/account.txt";
/** page number at which crawling starts */
int offset = 9420;
/** keyword of the timeline statuses to crawl, in English */
String tableName = "party";
/** keyword of the timeline statuses to crawl, in Chinese */
String keyword = "聚会";
/** index of the first proxy host to use */
int proxyIndex = 9;
/** whether a proxy host is needed from the start of crawling */
boolean needProxy = false;
/** pause time (in seconds) after every crawl */
double sleepSec = 0.5;
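To show how `offset` and `sleepSec` might drive the crawl, here is a minimal sketch of a paging loop that starts at `offset` and pauses after every page. It is not the project's actual loop: the class `CrawlLoop`, the `pageCount` parameter, and the assumption that page numbers increase by one are all hypothetical.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch, not taken from CrawlSearchedStatus.java.
class CrawlLoop {
    /** Converts the configured pause in seconds to whole milliseconds. */
    static long pauseMillis(double sleepSec) {
        return Math.round(sleepSec * 1000);
    }

    /**
     * Calls crawlPage for pageCount consecutive pages starting at offset,
     * sleeping for sleepSec seconds after each page.
     */
    static void run(int offset, int pageCount, double sleepSec, IntConsumer crawlPage) {
        for (int page = offset; page < offset + pageCount; page++) {
            crawlPage.accept(page);
            try {
                Thread.sleep(pauseMillis(sleepSec));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop crawling early.
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

For example, `CrawlLoop.run(9420, 3, 0.5, page -> System.out.println("crawling page " + page))` would visit pages 9420, 9421, and 9422 with a half-second pause after each.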