DATA SCIENCE ZING

How to create Scatterplots in R using simple, 3D, ggplot2 and googleVis Methods

8/4/2017


Load the cars dataset

library(datasets)
cars <- cars
head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Simple Scatter Plot

plot(cars$speed, cars$dist,
     main = "Relation between Speed and Stopping distance of cars",
     xlab = "Car Speed", ylab = "Car Stopping distance", pch = 20)


Add some color and change the shape of the points

plot(cars$speed, cars$dist,
     main = "Relation between Speed and Stopping distance of cars",
     xlab = "Car Speed", ylab = "Car Stopping distance", pch = 2, col = "red")


Add a fit line to the scatter plot

plot(cars$speed, cars$dist,
     main = "Relation between Speed and Stopping distance of cars",
     xlab = "Car Speed", ylab = "Car Stopping distance", pch = 20)
# Regression line
abline(lm(cars$dist ~ cars$speed), col = "red")
# Lowess line
lines(lowess(cars$speed, cars$dist), col = "yellow")


3D Scatter Plots

library(scatterplot3d)
scatterplot3d(cars$speed, cars$dist,
              main = "Relation between Speed and Stopping distance of cars in 3D",
              xlab = "Car Speed", ylab = "Car Stopping Distance",
              color = "red", pch = 20)


3D Scatter plots with colors and vertical drop lines

scatterplot3d(cars$speed, cars$dist, pch = 20, highlight.3d = TRUE, type = "h",
              main = "Relation between Speed and Stopping distance of cars in 3D",
              xlab = "Car Speed", ylab = "Car Stopping Distance")


Using ggplot2

library(ggplot2)
ggplot(cars, aes(x = speed, y = dist, fill = dist)) +
  geom_point(shape = 21, color = "black", fill = "blue") +
  ggtitle("Relation between Speed and Stopping distance of cars") +
  labs(x = "Car Speed", y = "Car Stopping Distance")


Add regression lines

ggplot(cars, aes(x = speed, y = dist, fill = dist)) +
  geom_point(shape = 21, color = "black", fill = "blue") +
  ggtitle("Relation between Speed and Stopping distance of cars") +
  labs(x = "Car Speed", y = "Car Stopping Distance") +
  geom_smooth(method = lm, color = "darkred")


Using googleVis

library(googleVis)
op <- options(gvis.plot.tag = 'chart')
scatter_ggvis <- gvisScatterChart(cars,
                                  options = list(legend = "none", pointSize = 5,
                                                 title = "Relation between Speed and Stopping distance of cars",
                                                 vAxis = "{title:'Car Stopping Distance'}",
                                                 hAxis = "{title:'Car Speed'}",
                                                 width = 800, height = 500))
plot(scatter_ggvis)


Shiny Application to Analyse and Forecast Global Video Game Sales

12/14/2016

This analysis of global video game sales data was done as part of various discussions on public data sets at Kaggle.com. The data set can be downloaded from the Kaggle public data repository.

This data set provides the Global, North America, Europe, Japan and other-country sales revenue in USD for different video game publishers. The data also contains details about the platform, year of sale and genre of the video games sold.
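For reference, a minimal sketch of loading the downloaded file into the vgsales data frame used in the code below; the file name vgsales.csv is an assumption about how the Kaggle download was saved, not part of the original post.

# Load the Kaggle video game sales data (file name and path are assumed)
vgsales <- read.csv("vgsales.csv", stringsAsFactors = FALSE)
# Columns used in the analysis: Name, Platform, Year, Genre, Publisher and the sales columns
str(vgsales)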


Click here to view the Shiny application: VG SALES

Download the GitHub source code: VG Sales Github

Analysis

The analysis can be divided into two sections:

1) Analysis of the data in general
2) Analysis of the data for each publisher

General Data Analysis
  1. Top publishers
  Top publishers can be identified by taking a pivot table of publishers and sorting it to identify the top 20.

The R code is:

sales_publisher <- as.data.frame(table(vgsales$Publisher))
colnames(sales_publisher) <- c("publisher", "numbers")
sales_publisher <- sales_publisher[order(-sales_publisher$numbers), ]
top_20_sales_publisher <- head(sales_publisher, n = 20)
ggplot(top_20_sales_publisher, aes(x = reorder(publisher, numbers), y = numbers)) +
  geom_bar(stat = "identity", fill = "orange") +
  theme_minimal() + coord_flip() +
  geom_text(aes(label = numbers), vjust = 0.5, color = "black", size = 4.0) +
  ylab("Total Number of Sales") + xlab("Publisher") +
  ggtitle("Top Selling Publishers")


  2. Video game releases per year
  The total number of video games released per year is identified by creating a pivot table of releases by year and plotting them as a bar chart.

R Code:

sales_year <- as.data.frame(table(vgsales$Year))
colnames(sales_year) <- c("Year", "Numbers")
sales_year <- sales_year[-nrow(sales_year), ]
ggplot(sales_year, aes(x = Year, y = Numbers)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme(axis.text = element_text(size = 8)) +
  geom_text(aes(label = Numbers), vjust = 0.5, color = "black", size = 4.0) +
  ylab("Total Number of Sales") + xlab("Year") +
  ggtitle("Video Game Sales by Year")


  3. Video game revenue per year
  The total revenue from video game sales in a year is calculated by aggregating global video game sales by year.

R Code:

sales_year_revenue <- as.data.frame(aggregate(vgsales$Global_Sales, by = list(Year = vgsales$Year), FUN = sum))
colnames(sales_year_revenue) <- c("Year", "Sales")
sales_year_revenue <- sales_year_revenue[-nrow(sales_year_revenue), ]
ggplot(sales_year_revenue, aes(x = Year, y = Sales)) +
  geom_bar(stat = "identity", fill = "magenta") +
  theme(axis.text = element_text(size = 8)) +
  geom_text(aes(label = Sales), vjust = 0.5, color = "black", size = 4.0) +
  ylab("Total Sales Revenue") + xlab("Year") +
  ggtitle("Video Game Sales Revenue by Year")


  4. Top selling platforms
  Top selling platforms are identified by creating a pivot table of gaming platforms and sorting it in descending order to find the top 20.

R Code:

sales_platform <- as.data.frame(table(vgsales$Platform))
colnames(sales_platform) <- c("platform", "Numbers")
sales_platform <- sales_platform[order(-sales_platform$Numbers), ]
top_20_sales_platform <- head(sales_platform, n = 20)
ggplot(top_20_sales_platform, aes(x = reorder(platform, Numbers), y = Numbers)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() + coord_flip() +
  geom_text(aes(label = Numbers), vjust = 0.5, color = "black", size = 4.0) +
  ylab("Total Number of Sales") + xlab("Platform") +
  ggtitle("Top Selling Video Game Platforms")

    
Analysis by Publisher

The data is filtered by publisher in the Shiny dashboard and subset based on the publisher name selected from the drop-down menu, as sketched below.
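A minimal sketch of that subsetting step, assuming the drop-down input is called input$publisher; the input id is illustrative, and in the actual app this would sit inside a reactive expression.

# Keep only the rows for the publisher chosen in the drop-down menu
vgsales_publisher <- subset(vgsales, Publisher == input$publisher)
# Order by global sales so head(vgsales_publisher, n = 20) returns the top sellers
vgsales_publisher <- vgsales_publisher[order(-vgsales_publisher$Global_Sales), ]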
  1. Top selling games
  The top selling games of a publisher are identified by sorting total sales and taking the top 20. This analysis is repeated for each country and globally.

R Code:

ggplot(head(vgsales_publisher, n = 20), aes(x = reorder(Name, Global_Sales), y = Global_Sales)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() + coord_flip() +
  geom_text(aes(label = Global_Sales), vjust = 0.5, color = "black", size = 4.0) +
  ylab("Global Sales in Millions of Dollars") + xlab("Video Game") +
  ggtitle("Top Global Selling Games")

  2. Top selling platforms
  Sales by platform are identified by aggregating sales revenue by platform and creating pie charts to understand the distribution. This is then repeated for each country.

R Code:

sales_platform_global <- as.data.frame(aggregate(vgsales_publisher$Global_Sales, by = list(Platform = vgsales_publisher$Platform), FUN = sum))
colnames(sales_platform_global) <- c("platform", "total_Sales")
Pie1 <- gvisPieChart(sales_platform_global, labelvar = "platform",
                     options = list(title = "Global Sales by Platform", width = 1000, height = 500))


       
  3. Top selling genres
  Sales by genre are identified by aggregating sales revenue by genre and creating pie charts to understand the distribution. This is then repeated for each country.

R Code:

sales_genre_global <- as.data.frame(aggregate(vgsales_publisher$Global_Sales, by = list(Genre = vgsales_publisher$Genre), FUN = sum))
colnames(sales_genre_global) <- c("genre", "total_Sales")
Pie1 <- gvisPieChart(sales_genre_global, labelvar = "genre",
                     options = list(title = "Global Sales by Genre", width = 1000, height = 500))

     
    
  4. Sales by year
  Sales by year are calculated by aggregating sales with respect to every year. This is then visualized with a line chart and repeated for each country.

R Code:

sales_year_global <- as.data.frame(aggregate(vgsales_publisher$Global_Sales, by = list(Year = vgsales_publisher$Year), FUN = sum))
colnames(sales_year_global) <- c("Year", "total_sales")
sales_year_global <- sales_year_global[-nrow(sales_year_global), ]
line1 <- gvisLineChart(sales_year_global, options = list(title = "Global Sales by Year", width = 1000, height = 500))




Forecasting

Forecasting of the time series of video game sales per year for each publisher is based on the following two forecasting models:

1) ARIMA model
2) ETS model

The code and step-by-step procedure for building the models follow the approaches described in the blogs You Canalytics, Analytics Vidhya and Dataiku.
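A minimal sketch of how such forecasts could be produced with the forecast package, assuming sales_year_global from the previous section holds one publisher's yearly revenue; the 5-year horizon is illustrative and not the app's actual setting.

library(forecast)

# Build an annual time series from the aggregated revenue (assumes roughly consecutive years)
start_year <- min(as.numeric(as.character(sales_year_global$Year)))
sales_ts <- ts(sales_year_global$total_sales, start = start_year, frequency = 1)

# ARIMA model with the order chosen automatically
fit_arima <- auto.arima(sales_ts)
fc_arima <- forecast(fit_arima, h = 5)

# Exponential smoothing state-space (ETS) model
fit_ets <- ets(sales_ts)
fc_ets <- forecast(fit_ets, h = 5)

plot(fc_arima)
plot(fc_ets)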

​





Twitter Sentiment Analysis of T20 Cricket World Cup Players using R and Shiny

4/14/2016

Cricket is a very popular sport in many countries, and the International Cricket Council (ICC) conducts World Cup tournaments for two formats of the game, usually 20 overs and 50 overs. These tournaments are followed by millions of viewers across the globe and normally create a buzz on social media about various popular players.

For the purpose of this study, I have considered tweets about various players from different teams during the T20 World Cup, from 2016-03-08 until 2016-04-03. There were 10 major teams participating; from each team, 1-3 popular players were considered, their tweets during that time frame were extracted, and sentiment analysis and word cloud visualization were performed on each player's tweets using Shiny. For most players a maximum of 5,000 tweets were extracted, and for a few popular players a maximum of 10,000 tweets were extracted.

Click here to view the App - Cricket T20 Shiny APP


The list of players considered from the various teams is:

      INDIA 
  1. Virat Kohli
  2. MS Dhoni
  3. Jasprit Bumrah

        ENGLAND
  1. Joe Root
  2. Jos Buttler
  3. Ben Stokes

        WEST INDIES
  1. Chris Gayle
  2. Dwayne Bravo
  3. Carlos Brathwaite

        AUSTRALIA
  1. David Warner
  2. Shane Watson
  3. Glenn Maxwell

       BANGLADESH
  1. Mushfiqur Rahim
  2. Tamim Iqbal
  3. Mustafizur Rahman

       PAKISTAN
  1. Shahid Afridi
  2. Mohammad Amir      

       SOUTH AFRICA
  1. Quinton de Kock
  2. AB de Villiers
  3. Hashim Amla

      NEW ZEALAND
  1. Martin Guptill
  2. Mitchell Santner
  3. Ross Taylor

       SRI LANKA
  1. Angelo Mathews
  2. Tillakaratne Dilshan
  3. Lasith Malinga

       AFGHANISTAN
  1. Mohammad Shahzad

Extracting the tweets from Twitter

A detailed tutorial about using the twitteR package in R to extract tweets can be found here - Extract tweets in R. Details of the tweet extraction are not provided in this blog.

Sample code to fetch tweets of the player Virat Kohli is provided below.

library(twitteR)
# Assumes Twitter API authentication has already been set up, e.g. via setup_twitter_oauth()
kohli_tweets <- searchTwitter('Virat Kohli', since = '2016-03-08', until = '2016-04-03',
                              n = 10000, lang = "en")
kohli_tweets <- sapply(kohli_tweets, function(x) x$getText())


Detailed code can be obtained from my GitHub.

Clean the tweets

​

The cleaning of the tweets requires the following steps:
  1. Remove html links from the tweets
  2. Remove retweet entities
  3. Remove all hashtags
  4. Remove all @people
  5. Remove all punctuation
  6. Remove all numbers
  7. Remove all unnecessary white spaces 
  8. Convert all text into lowercase and
  9. Remove duplicates

  Three separate functions are created for the entire cleaning of the tweets; the code can be obtained from GitHub.

Sentiment classification

Classification of sentiments can be done using the package 'sentiment' in R. First convert the tweets into a data frame.
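A minimal sketch of that conversion, assuming kohli_tweets is the character vector of cleaned tweets; the tweets column name matches what the classification code expects.

# Turn the character vector of tweets into a data frame with a 'tweets' column
kohli_tweets <- data.frame(tweets = kohli_tweets, stringsAsFactors = FALSE)

Sample classification code is given below: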

library(RCurl)
require(sentiment)

###Tweets Classification

# classify emotion
class_emo = classify_emotion(kohli_tweets$tweets, algorithm="bayes", prior=1.0)

# get emotion best fit
emotion = class_emo[,7]

# classify polarity
class_pol = classify_polarity(kohli_tweets$tweets, algorithm="bayes")

# get polarity best fit
polarity = class_pol[,4]



Repeat this procedure for all the players to classify the emotions of their tweets; a loop over players is sketched below.
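A minimal sketch of such a loop, using a hypothetical player_queries vector that maps a short player id to a Twitter search string; the cleaning steps above and the lexicon score from the next section would be applied to each data frame in the same way.

# Illustrative subset of players: short id -> search string (ids and subset are hypothetical)
player_queries <- c(kohli = "Virat Kohli", dhoni = "MS Dhoni", root = "Joe Root")

player_sentiments <- lapply(names(player_queries), function(id) {
  raw <- searchTwitter(player_queries[[id]], since = '2016-03-08',
                       until = '2016-04-03', n = 5000, lang = "en")
  tweets <- data.frame(tweets = sapply(raw, function(x) x$getText()),
                       stringsAsFactors = FALSE)
  tweets$emotion <- classify_emotion(tweets$tweets, algorithm = "bayes", prior = 1.0)[, 7]
  tweets$polarity <- classify_polarity(tweets$tweets, algorithm = "bayes")[, 4]
  tweets
})
names(player_sentiments) <- names(player_queries)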

Sentiment Score Classification

We can also generate a sentiment score by comparing the words in each tweet with lexicons of positive and negative words.

#Scan positive words
opinion.lexicon.pos<-scan("positive-words.txt",what = 'character',comment.char = ';')

#Scan negative words
opinion.lexicon.neg<-scan("negative-words.txt",what = 'character',comment.char = ';')

pos.words = c(opinion.lexicon.pos,'upgrade')
neg.words = c(opinion.lexicon.neg,'wait','waiting', 'wtf', 'cancellation')

getSentimentScore = function(sentences, words.positive,
                             words.negative, .progress='none')
{
  require(plyr)
  require(stringr)
  scores = laply(sentences,
                 function(sentence, words.positive, words.negative) {
                   # First remove digits, punctuation and control characters:
                   sentence = gsub('[[:cntrl:]]', '', gsub('[[:punct:]]', '',
                                                           gsub('\\d+', '', sentence)))
                   # Then convert everything to lower case:
                   sentence = tolower(sentence)
                   # Now split each sentence on whitespace
                   words = unlist(str_split(sentence, '\\s+'))
                   # Get the boolean match of each words with the positive & negative opinion-lexicon
                   pos.matches = !is.na(match(words, words.positive))
                   neg.matches = !is.na(match(words, words.negative))
                   # Now get the score as total positive sentiment minus the total negatives
                   score = sum(pos.matches) - sum(neg.matches)
                   return(score)
                 }, words.positive, words.negative, .progress=.progress )
  # Return a data frame with respective sentence and the score
  return(data.frame(score=scores))
}


score<-getSentimentScore(kohli_tweets$tweets,pos.words,neg.words)
kohli_tweets<-cbind(kohli_tweets, data.frame(emotion,polarity,score))

Shiny Code

global.R

library(tm)
library(wordcloud)
library(memoise)
library(googleVis)
library(ggplot2)


#Create a list of players

players<-list("Virat Kohli"="kohli",
              "MS Dhoni"= "dhoni",
              "Jasprit Bumrah" ="bumrah",
              "Joe Root"="root",
              "Jos Buttler"="butler",
              "Ben Stokes"="Ben Stokes",
              "Chris Gayle"="gayle",
              "Dwayne Bravo"="bravo",
              "Carlos Brathwaite"="brathwaite",
              "David Warner"="warner",
              "Shane Watson"="watson",
              "Glenn Maxwell"="maxwell",
              "Mushfiqur Rahim"="mushfiqur",
              "Tamim Iqbal"="tamim",
              "Mustafizur Rahman"="mustafizur",
              "Shahid Afridi"="afridi",
              "Mohammad Amir"="amir",
              "Quinton de Kock"="dekock",
              "AB de Villiers"="devillers",
              "Hashim Amla"="amla",
              "Martin Guptill"="guptill",
              "Mitchell Santner"="santner",
              "Ross Taylor"="taylor",
              "Angelo Mathews"="mathews",
              "Tillakaratne Dilshan"="dilshan",
              "Lasith Malinga"="malinga",
              "Mohammad Shahzad"="shahzad"

              )

catch.error = function(x)
{
  # Create a missing value for test purpose
  y = NA

  # Try to catch that error (NA) we just created

  catch_error = tryCatch(tolower(x), error=function(e) e)

  # if not an error, convert y to lowercase
  if (!inherits(catch_error, "error"))

    y = tolower(x)

  # check result if error exists, otherwise the function works fine.
  return(y)
}

cleanTweets<- function(tweet){

  # Clean the tweet for sentiment analysis
  # remove html links, which are not required for sentiment analysis

  tweet = gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", " ", tweet)

  # First we will remove retweet entities from

  tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", tweet)

  # Then remove all "#Hashtag"

  tweet = gsub("#\\w+", " ", tweet)

  # Then remove all "@people"

  tweet = gsub("@\\w+", " ", tweet)

  # Then remove all the punctuation

  tweet = gsub("[[:punct:]]", " ", tweet)

  # Then remove numbers, we need only text for analytics

  tweet = gsub("[[:digit:]]", " ", tweet)

  # finally, we remove unnecessary spaces (white spaces, tabs etc)

  tweet = gsub("[ \t]{2,}", " ", tweet)
  tweet = gsub("^\\s+|\\s+$", "", tweet)

  tweet = catch.error(tweet)

  tweet
}

cleanTweetsAndRemoveNAs<- function(Tweets) {

  TweetsCleaned = sapply(Tweets, cleanTweets)

  # Remove the "NA" tweets from this tweet list

  TweetsCleaned = TweetsCleaned[!is.na(TweetsCleaned)]

  names(TweetsCleaned) = NULL

  # Remove the repetitive tweets from this tweet list

  TweetsCleaned = unique(TweetsCleaned)

  TweetsCleaned
}

#Get the tweets cleaned

getCleanTweets <- memoise(function(player) {


  if (!(player %in% players))
    stop("Unknown player")

  tweets <-readLines(sprintf("./Data/%s.txt",player)) 

  tweetsCleaned<-cleanTweetsAndRemoveNAs(tweets)

  tweetsCleaned

})

#Generate a term matrix for word cloud

getTermMatrix <- memoise(function(player) {


  if (!(player %in% players))
    stop("Unknown Player")

  text <- readLines(sprintf("./Data/%s.txt", player),
                    encoding="latin1",warn=FALSE)
  #Create a corpus   
  myCorpus = Corpus(VectorSource(text))
  #Convert text to lowercase
  myCorpus = tm_map(myCorpus, content_transformer(tolower))
  #Remove Punctuations
  myCorpus = tm_map(myCorpus, removePunctuation)

  # remove URLs
  removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
  myCorpus <- tm_map(myCorpus, content_transformer(removeURL))

  # remove anything other than English letters or space
  removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
  myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))

  #remove numbers
  myCorpus = tm_map(myCorpus, removeNumbers)
  #remove stopwords in english
  myCorpus = tm_map(myCorpus, removeWords,stopwords("en"))

  myCorpus = tm_map(myCorpus, removeWords,
                    c(stopwords("SMART"), "thy", "thou", "thee", "the", "and", "but"))
  # remove extra whitespace
  myCorpus <- tm_map(myCorpus, stripWhitespace)


  myDTM = TermDocumentMatrix(myCorpus,
                             control = list(minWordLength = 1))



  m<-as.matrix(myDTM)

  sort(rowSums(m), decreasing = TRUE)
})

#Get the data for emotions

getEmotions <- memoise(function(player) {


  if (!(player %in% players))
    stop("Unknown player")

  data <-read.csv(sprintf("./Emotions/%s.csv",player)) 

  data
})

ui.R

library(shinydashboard)
library(shiny)

dashboardPage(

  dashboardHeader(title = "T20 Cricket Players "),
  dashboardSidebar(
    h3("Choose the Player"),
    selectInput("selection", "",
                choices = players),
    actionButton("update", "Change"),
    hr(),

    sidebarMenu(
      menuItem("Wordcloud",tabName = "wordcloud",icon = icon("cloud")),
      menuItem("Top Words",tabName = "barchart",icon = icon("bar-chart")),
      menuItem("Emotions",tabName = "emotions",icon = icon("smile-o"))


    )
  ),

  dashboardBody(

    tabItems(

      tabItem(tabName ="wordcloud",



              fluidRow(

                tabBox(title = "",width = 12,

                       tabPanel(title = tagList(shiny::icon("comments"),"Tweets"),


                                box(plotOutput("wordcloud1",height=500,width = 300)),

                                box(title = "Controls",
                                    sliderInput("freq","Minimum Frequency:",
                                                min = 1,  max = 50, value = 15),
                                    sliderInput("max","Maximum Number of Words:",
                                                min = 1,  max = 300,  value = 100)
                                )
                       )






                )



              )
      ),

      tabItem(tabName = "barchart",

              fluidRow(

                tabBox(title = "",width = 12,

                       tabPanel(title = tagList(shiny::icon("heart"),"Reviews"),

                                plotOutput("bar1")
                       )



                )


              )


      ),

      tabItem(tabName = "emotions",

              fluidRow(

                tabBox(title = "",width = 12,

                       tabPanel(title = tagList(shiny::icon("smile-o"),"Polarity"),

                                htmlOutput("pie2")
                       ),

                       tabPanel(title = tagList(shiny::icon("pie-chart"),"Emotions"),

                                htmlOutput("pie1")
                       ),


                       tabPanel(title = tagList(shiny::icon("thumbs-o-up"),"Emotion Score"),

                                plotOutput("score1")

                       )


                )


              )


      )








    )


  )
)

server.R

function(input, output, session) {
  # Define a reactive expression for the document term matrix
  terms <- reactive({
    # Change when the "update" button is pressed...
    input$update
    # ...but not for anything else
    isolate({
      withProgress({
        setProgress(message = "Processing corpus...")
        getTermMatrix(input$selection)
      })
    })
  })

  data_emotion<-reactive({

    getEmotions(input$selection)

  })



  # Make the wordcloud drawing predictable during a session
  wordcloud_rep <- repeatable(wordcloud)

  #Create the wordcloud
  output$wordcloud1 <- renderPlot({
    v <- terms()
    wordcloud_rep(names(v), v, scale=c(5,0.5),
                  min.freq = input$freq, max.words=input$max,
                  colors=brewer.pal(8, "Dark2"))
  })

  #Create a barchart for high frequency terms

  output$bar1<-renderPlot({

    plot1<-head(data.frame(Freq=terms()),n=20)
    plot1$word<-row.names(plot1)
    ggplot(plot1,aes(x=reorder(word,Freq),y=Freq))+geom_bar(stat="identity",fill="steelblue")+theme_minimal()+coord_flip()+geom_text(aes(label=Freq),vjust=0.5,color="black",size=4.0)+ylab("Frequency of words")+xlab("Top Words")+ggtitle("Top frequency words")
  }) 

  #Create a pie chart for the emotions
  output$pie1<-renderGvis({

    data<-data_emotion()
    emotion1<-as.data.frame(table(data$emotion))
    Pie1<-gvisPieChart(emotion1,options = list(width=1200,height=600))
    return(Pie1)

  })

  #Create a pie chart for the polarity

  output$pie2<-renderGvis({

    data<-data_emotion()
    emotion2<-as.data.frame(table(data$polarity))
    Pie2<-gvisPieChart(emotion2,options = list(width=1200,height=600))
    return(Pie2)

  })

  #create a histogram for the emotion score
  output$score1<-renderPlot({

    data<-data_emotion()
    ggplot(data,aes(x=score))+geom_histogram(bins=50,color="black",fill="blue")+theme_minimal()+xlab("Sentiment Score")+ylab("count")+ggtitle("Sentiment Scores of Tweets")
  })
}
