This analysis of global video game sales data was done as part of various discussions on public data sets on Kaggle.com. The data set can be downloaded from the Kaggle public data repository. It provides Global, North America, Europe, Japan and Other-country sales revenue in USD for different video game publishers, and also contains the Platform, Year of sale and Genre of each video game sold.

Click here to view the Shiny application: VG SALES
Download the Github source code: VG Sales Github Analysis

The analysis can be divided into two sections: 1) analysis of the data in general, and 2) analysis of the data for each publisher.

General Data Analysis
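The post does not show how the data set is loaded; a minimal sketch, assuming the Kaggle CSV has been saved locally as "vgsales.csv" (a hypothetical path), would be:

```r
# Load the Kaggle video game sales data into the vgsales data frame
# used throughout this analysis ("vgsales.csv" is an assumed local path).
library(ggplot2)  # used for all the bar plots below

vgsales <- read.csv("vgsales.csv", stringsAsFactors = FALSE)
str(vgsales)  # columns include Name, Platform, Year, Genre, Publisher and sales
```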
1. Top Selling Publishers

Top selling publishers are identified by creating a pivot table of publishers and sorting it in descending order to find the top 20.

R Code:
sales_publisher<-as.data.frame(table(vgsales$Publisher))
colnames(sales_publisher)<-c("publisher","numbers")
sales_publisher<-sales_publisher[order(-sales_publisher$numbers),]
top_20_sales_publisher<-head(sales_publisher,n=20)
ggplot(top_20_sales_publisher,aes(x=reorder(publisher,numbers),y=numbers))+
  geom_bar(stat="identity",fill="orange")+theme_minimal()+coord_flip()+
  geom_text(aes(label=numbers),vjust=0.5,color="black",size=4.0)+
  ylab("Total Number of Sales")+xlab("Publisher")+ggtitle("Top Selling Publishers")

2. Video Game Releases per Year

The total number of video game releases per year is identified by creating a pivot table of releases by year and plotting the result as a bar chart.

R Code:
sales_year<-as.data.frame(table(vgsales$Year))
colnames(sales_year)<-c("Year","Numbers")
sales_year<-sales_year[-nrow(sales_year),]
ggplot(sales_year,aes(x=Year,y=Numbers))+
  geom_bar(stat="identity",fill="lightgreen")+theme(axis.text=element_text(size=8))+
  geom_text(aes(label=Numbers),vjust=0.5,color="black",size=4.0)+
  ylab("Total Number of Sales")+xlab("Year")+ggtitle("Video Game Sales by Year")

3. Video Game Revenue per Year

Total video game sales revenue in a year is calculated by aggregating global sales by year.

R Code:
sales_year_revenue<-as.data.frame(aggregate(vgsales$Global_Sales,by=list(Year=vgsales$Year),FUN=sum))
colnames(sales_year_revenue)<-c("Year","Sales")
sales_year_revenue<-sales_year_revenue[-nrow(sales_year_revenue),]
ggplot(sales_year_revenue,aes(x=Year,y=Sales))+
  geom_bar(stat="identity",fill="magenta")+theme(axis.text=element_text(size=8))+
  geom_text(aes(label=Sales),vjust=0.5,color="black",size=4.0)+
  ylab("Total Sales Revenue")+xlab("Year")+ggtitle("Video Game Sales Revenue by Year")

4. Top Selling Platforms

Top selling platforms are identified by creating a pivot table of gaming platforms and sorting it in descending order to find the top 20.
R Code:
sales_platform<-as.data.frame(table(vgsales$Platform))
colnames(sales_platform)<-c("platform","Numbers")
sales_platform<-sales_platform[order(-sales_platform$Numbers),]
top_20_sales_platform<-head(sales_platform,n=20)
ggplot(top_20_sales_platform,aes(x=reorder(platform,Numbers),y=Numbers))+
  geom_bar(stat="identity",fill="steelblue")+theme_minimal()+coord_flip()+
  geom_text(aes(label=Numbers),vjust=0.5,color="black",size=4.0)+
  ylab("Total Number of Sales")+xlab("Platform")+ggtitle("Top Selling Video Game Platforms")

Analysis by Publisher

The data is filtered by publisher using the Shiny dashboard and subset based on the publisher name selected from the drop-down menu.
1. Top Global Selling Games

R Code:
ggplot(head(vgsales_publisher,n=20),aes(x=reorder(Name,Global_Sales),y=Global_Sales))+
  geom_bar(stat="identity",fill="steelblue")+theme_minimal()+coord_flip()+
  geom_text(aes(label=Global_Sales),vjust=0.5,color="black",size=4.0)+
  ylab("Global Sales in Millions of Dollars")+xlab("Video Game")+ggtitle("Top Global Selling Games")

2. Top Selling Platforms

Sales by platform is identified by aggregating sales revenue by platform and creating pie charts to understand the distribution. This is then repeated for different countries.

R Code:
sales_platform_global<-as.data.frame(aggregate(vgsales_publisher$Global_Sales,by=list(Platform=vgsales_publisher$Platform),FUN=sum))
colnames(sales_platform_global)<-c("platform","total_Sales")
Pie1<-gvisPieChart(sales_platform_global,labelvar = "platform",options = list(title="Global Sales by Platform",width=1000,height=500))

3. Top Selling Genre

Sales by genre is identified by aggregating sales revenue by genre and creating pie charts to understand the distribution. This is then repeated for different countries.

R Code:
sales_genre_global<-as.data.frame(aggregate(vgsales_publisher$Global_Sales,by=list(Genre=vgsales_publisher$Genre),FUN=sum))
colnames(sales_genre_global)<-c("genre","total_Sales")
Pie1<-gvisPieChart(sales_genre_global,labelvar = "genre",options = list(title="Global Sales by Genre",width=1000,height=500))

4. Sales by Year

Sales by year is calculated by aggregating sales with respect to every year. This is then visualized using a line chart and repeated for different countries.
R Code:
sales_year_global<-as.data.frame(aggregate(vgsales_publisher$Global_Sales,by=list(Year=vgsales_publisher$Year),FUN=sum))
colnames(sales_year_global)<-c("Year","total_sales")
sales_year_global<-sales_year_global[-nrow(sales_year_global),]
line1<-gvisLineChart(sales_year_global ,options = list(title="Global Sales by Year",width=1000,height=500))

Forecasting

Forecasting of the time series data on video game sales per year for different publishers is based on the following two forecasting models:
1) ARIMA model
2) ETS model

The code and step-by-step procedure followed for building the models are as described in the blogs You Canalytics, Analytics Vidhya and Dataiku.
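The forecasting code itself is not reproduced in this post; a minimal sketch of the two models, assuming the yearly totals in sales_year_global from the previous section and the forecast package, might look like:

```r
# Hedged sketch of the forecasting step (assumes sales_year_global from above
# and the 'forecast' package; the actual model-building follows the blogs cited).
library(forecast)

# Build an annual time series from the aggregated yearly revenue
sales_ts <- ts(sales_year_global$total_sales,
               start = min(as.numeric(as.character(sales_year_global$Year))),
               frequency = 1)

# ARIMA model: auto.arima() selects the (p,d,q) orders by AICc
fit_arima <- auto.arima(sales_ts)
fc_arima  <- forecast(fit_arima, h = 5)  # forecast 5 years ahead

# ETS (error-trend-seasonal) exponential smoothing model
fit_ets <- ets(sales_ts)
fc_ets  <- forecast(fit_ets, h = 5)

plot(fc_arima)
plot(fc_ets)
```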
Cricket is a very popular sport in many countries, and the International Cricket Council (ICC) conducts World Cup tournaments for two formats of the game, usually 20 overs and 50 overs. These tournaments are followed by millions of viewers across the globe and normally create a buzz on social media about various popular players.
For the purpose of this study, I have considered tweets about various players in different teams during the period of the T20 World Cup, from 2016-03-08 until 2016-04-03. There were 10 major teams participating, and from each team 1-3 popular players were considered. Tweets about these players during this time frame were extracted, and sentiment analysis and wordcloud visualization were performed on each player's tweets using Shiny. For most of the players a maximum of 5,000 tweets were extracted, and for a few popular players a maximum of 10,000 tweets were extracted.

Click here to view the App - Cricket T20 Shiny APP

The teams from which players were considered are:

INDIA
ENGLAND
WEST INDIES
AUSTRALIA
BANGLADESH
PAKISTAN
SOUTH AFRICA
NEW ZEALAND
SRI LANKA
AFGHANISTAN
Extracting the tweets from Twitter
A detailed tutorial about using the twitteR package in R for extracting tweets can be found here - Extract tweets in R. Details of the tweet extraction are not provided in this blog.
A sample code to fetch tweets of the player Virat Kohli is provided below.

kohli_tweets<-searchTwitter('Virat Kohli',since = '2016-03-08',until = '2016-04-03',n=10000,lang = "en")
kohli_tweets<-sapply(kohli_tweets,function(x) x$getText())

Detailed code can be obtained from my Github.
The cleaning of the tweets requires several steps.
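The cleaning code itself lives on GitHub; a minimal sketch of the kind of cleaning applied, with a hypothetical helper name clean_tweets, might be:

```r
# Hypothetical cleaning helper (the actual functions are on GitHub);
# a typical pipeline for raw tweet text in base R.
clean_tweets <- function(tweets) {
  tweets <- gsub("http\\S+", "", tweets)                # remove URLs
  tweets <- gsub("(RT|via)\\s*@\\w+", "", tweets)       # remove retweet tags
  tweets <- gsub("@\\w+", "", tweets)                   # remove @mentions
  tweets <- gsub("[^[:alnum:][:space:]]", "", tweets)   # strip punctuation/emoji
  tweets <- tolower(tweets)                             # normalize case
  trimws(tweets)                                        # trim stray whitespace
}
```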
Three separate functions were created for the entire cleaning of the tweets; the code can be obtained from Github.

Sentiment Classification
Classification of sentiments can be done using the 'sentiment' package in R. First, convert the tweets into a data frame.
A sample code is given below:

library(RCurl)
require(sentiment)

### Tweets classification
# convert the character vector of tweets into a data frame
kohli_tweets = data.frame(tweets=kohli_tweets, stringsAsFactors=FALSE)
# classify emotion
class_emo = classify_emotion(kohli_tweets$tweets, algorithm="bayes", prior=1.0)
# get emotion best fit
emotion = class_emo[,7]
# classify polarity
class_pol = classify_polarity(kohli_tweets$tweets, algorithm="bayes")
# get polarity best fit
polarity = class_pol[,4]

Repeat this procedure for all the players to classify the emotions of their tweets.

Sentiment Score Classification
We can generate a sentiment score by comparing the words of each tweet against lexicons of positive and negative words.
# Scan positive words
opinion.lexicon.pos<-scan("positive-words.txt",what = 'character',comment.char = ';')
# Scan negative words
opinion.lexicon.neg<-scan("negative-words.txt",what = 'character',comment.char = ';')

pos.words = c(opinion.lexicon.pos,'upgrade')
neg.words = c(opinion.lexicon.neg,'wait','waiting', 'wtf', 'cancellation')

getSentimentScore = function(sentences, words.positive, words.negative, .progress='none')
{
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, words.positive, words.negative) {
    # First remove digits, punctuation and control characters
    sentence = gsub('[[:cntrl:]]', '', gsub('[[:punct:]]', '', gsub('\\d+', '', sentence)))
    # Then convert the sentence to lower case
    sentence = tolower(sentence)
    # Split the sentence on the space delimiter
    words = unlist(str_split(sentence, '\\s+'))
    # Get the boolean match of each word against the positive and negative lexicons
    pos.matches = !is.na(match(words, words.positive))
    neg.matches = !is.na(match(words, words.negative))
    # Score = total positive matches minus total negative matches
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, words.positive, words.negative, .progress=.progress)
  # Return a data frame with the score for each sentence
  return(data.frame(score=scores))
}

score<-getSentimentScore(kohli_tweets$tweets,pos.words,neg.words)
kohli_tweets<-cbind(kohli_tweets, data.frame(emotion,polarity,score))

Shiny Code

global.R
ui.R
server.r
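The contents of these three files are only on GitHub; a minimal skeleton of the app structure, with hypothetical input and output names, might look like:

```r
# Hedged sketch of the Shiny app layout (the real global.R / ui.R / server.r
# are on GitHub; player choices and plot contents here are placeholders).
library(shiny)

# global.R would load the pre-processed tweet data shared by ui and server.

ui <- fluidPage(
  titlePanel("Cricket T20 Tweet Sentiment"),
  sidebarLayout(
    sidebarPanel(selectInput("player", "Player", choices = c("Virat Kohli"))),
    mainPanel(plotOutput("sentimentPlot"))
  )
)

server <- function(input, output) {
  output$sentimentPlot <- renderPlot({
    # subset the tweets for input$player, then plot emotion/polarity counts
  })
}

shinyApp(ui, server)
```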