Sunday, 29 December 2019

Hypothesis Testing

> setwd('D:/Training_Material/Book/Files')
> getwd()
[1] "D:/Training_Material/Book/Files"
> Agedf<-read.csv("PlayersAge.csv",header = TRUE)
> AgePlr <- as.numeric(Agedf$Age)
> library(MASS)
> fitdistr(AgePlr,"Normal")
      mean          sd   
  28.9014025    4.6278804
 ( 0.1733155) ( 0.1225526)
> a.teo<-rnorm(n=713,mean=29,sd=4.6)
> qqplot(AgePlr,a.teo,main="QQ-plot for Normal distribution")

> abline(0,1)
The Q-Q plot indicates that the ages of the cricketers approximately follow a normal distribution.
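A Q-Q plot against a simulated sample changes from run to run, so it can help to compare against theoretical normal quantiles and to add a numeric check. The following is only a minimal sketch: the vector `ages` is simulated from the fitted mean and standard deviation as a hypothetical stand-in for AgePlr, because the data file is not reproduced here.

```r
# Hypothetical stand-in for AgePlr: simulated, not the real data.
set.seed(1)
ages <- rnorm(713, mean = 28.9, sd = 4.63)

# Q-Q points against theoretical normal quantiles.
# plot.it = FALSE only returns the points; drop it to draw the plot,
# then call qqline(ages) to add the reference line.
qq <- qqnorm(ages, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)   # close to 1 when the points lie near a line

# Shapiro-Wilk test as a numeric complement to the plot.
sw <- shapiro.test(ages)
sw$p.value
```

With the real data, replace `ages` by AgePlr. A small p-value from shapiro.test() would speak against normality even when the plot looks roughly straight.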



Lower Tail Test of Population Mean with Known Variance
Problem:  Suppose it is claimed that the mean age of a player is at least 32. In a sample of 713 players, the sample mean age is 29. Assume the population standard deviation is 5. At the .05 significance level, can we reject the claim that the mean age is at least 32?
Solution
# The null hypothesis is that mu >= 32.
> xbar = 29 # sample mean
> mu0  = 32              # hypothesized value
> sigma= 5               # population standard deviation
> n = 713                 # sample size
> z = (xbar-mu0)/(sigma/sqrt(n))
> z                      # test statistic
[1] -16.02124
Compute the critical value at the .05 significance level.
> alpha = .05
> z.alpha = qnorm(1-alpha)
> -z.alpha               # critical value
[1] -1.644854
The test statistic -16.02124 is less than the critical value of -1.6449. Hence, at the .05 significance level, we reject the claim that the mean age of a player is at least 32. The lower-tail p-value, which is far below .05, leads to the same conclusion.
> pval = pnorm(z)
> pval                   # lower tail p-value
[1] 4.541344e-58

Two-Tailed Test of Population Mean with Known Variance
Problem:  Suppose it is claimed that the mean age of a player is 30. In a sample of 713 players, the sample mean age is 28.9. Assume the population standard deviation is 5. At the .05 significance level, can we reject the claim that the mean age is 30?

Solution: # The null hypothesis is that mu = 30.
> xbar = 28.9 # sample mean
> mu0  = 30              # hypothesized value
> sigma= 5               # population standard deviation
> n = 713                 # sample size
> z = (xbar-mu0)/(sigma/sqrt(n))
> z                      # test statistic
[1] -5.874453
Compute the critical values at the .05 significance level.
> alpha = .05
> z.half.alpha = qnorm(1-alpha/2)
> c(-z.half.alpha, z.half.alpha)
[1] -1.959964  1.959964
The test statistic -5.874453 is not between the critical values -1.9600 and 1.9600. Hence, at the .05 significance level, we reject the null hypothesis that the mean age of a player is equal to 30.
> pval = 2 * pnorm(z)    # lower tail
> pval                   # two-tailed p-value
[1] 4.242414e-09
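The steps of the two z-tests above can be wrapped in one small function. This is a sketch only: `z_test` is a hypothetical helper written for this illustration and is not part of base R.

```r
# Hypothetical helper (not part of base R): one-sample z-test with known sigma.
z_test <- function(xbar, mu0, sigma, n,
                   alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z <- (xbar - mu0) / (sigma / sqrt(n))          # test statistic
  p <- switch(alternative,
              two.sided = 2 * pnorm(-abs(z)),    # both tails
              less      = pnorm(z),              # lower tail
              greater   = pnorm(z, lower.tail = FALSE))
  list(statistic = z, p.value = p)
}

lower <- z_test(29,   32, 5, 713, alternative = "less")       # lower-tail example
two   <- z_test(28.9, 30, 5, 713, alternative = "two.sided")  # two-tailed example
```

Using `-abs(z)` for the two-sided case gives the correct p-value whether the sample mean lies below or above the hypothesized value.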

Two-Tailed Test of Population Mean with Unknown Variance
Problem:  Suppose it is claimed that the mean age of a player is 30, and the population standard deviation is unknown. Using the sample of 713 players, can we reject the claim that the mean age is 30 at the .05 significance level?
Solution: # The null hypothesis is that mu = 30.
> getwd()
[1] "D:/Training_Material/Book/Files"
> Agedf<-read.csv("PlayersAge.csv",header = TRUE)
> AgePlr <- as.numeric(Agedf$Age)
> library(fBasics)
> basicStats(AgePlr)
 AgePlr
nobs          713.000000
NAs             0.000000
Minimum        17.200000
Maximum        45.400000
1. Quartile    25.500000
3. Quartile    31.800000
Mean           28.901403
Median         28.600000
Sum         20606.700000
SE Mean         0.173437
LCL Mean       28.560893
UCL Mean       29.241912
Variance       21.447358
Stdev           4.631129
Skewness        0.389330
Kurtosis        0.173905
> xbar= 28.901403       # sample mean
> mu0 = 30              # hypothesized value
> s   = 4.631129        # sample standard deviation
> n   = 713             # sample size
> t   = (xbar-mu0)/(s/sqrt(n))
> t                     # test statistic
[1] -6.334266
> alpha = .05
> t.half.alpha = qt(1-alpha/2, df=n-1)
> c(-t.half.alpha, t.half.alpha)
[1] -1.963301  1.963301
The test statistic -6.334266 is not between the critical values -1.9633 and 1.9633. Hence, at the .05 significance level, we reject the null hypothesis that the mean age of a player is equal to 30.
> pval = 2 * pt(t, df=n-1)  # lower tail
> pval                      # two-tailed p-value
[1] 4.225286e-10
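When the raw observations are available, the built-in t.test() performs the same calculation in one call. The sketch below uses simulated ages as a hypothetical stand-in for AgePlr, so its numbers will differ from the transcript above.

```r
# Hypothetical stand-in for AgePlr: simulated, not the real data.
set.seed(42)
ages <- rnorm(713, mean = 28.9, sd = 4.63)

# Built-in two-sided one-sample t-test of H0: mu = 30.
res <- t.test(ages, mu = 30, alternative = "two.sided", conf.level = 0.95)

# The same statistic computed by hand, following the steps shown earlier.
t_manual <- (mean(ages) - 30) / (sd(ages) / sqrt(length(ages)))

res$statistic   # t value, equal to t_manual
res$p.value     # two-tailed p-value
```

The result also carries the degrees of freedom (`res$parameter`) and a confidence interval for the mean (`res$conf.int`).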


Sunday, 10 February 2013

R® for Fitting univariate Parametric Distributions to Cricketers Oneday

M.R.L.N. Panchanana (M.Sc Statistics from University of Hyderabad, India)
M.Tech (Computer Science & Technology from Jawaharlal Nehru University, Delhi)
Abstract

Fitting a parametric distribution to a data set is a key step in analyzing the data. Many statistical procedures, such as ANOVA and several hypothesis tests, assume that the data are normal. This presentation discusses how to use R to fit a statistical distribution to the one-day cricket scores made by Sachin Ramesh Tendulkar. DNB (Did Not Bat) cases are not considered.

Introduction

R® provides several functions that help decide which univariate distribution, discrete or continuous, suits a given variable in a data set. Before performing a goodness-of-fit test, it is advisable to study the characteristics of the data with descriptive methods such as summary and stem, and with density estimation tools such as hist, boxplot, ecdf, and density. Candidate distributions can then be fitted with fitdistr and tested with ks.test. The graphical methods qqplot and qqline complement these.
Oneday cricket Runs

 The data for the illustrative examples were collected from http://www.howstat.com.au/ . Only the one-day runs made by a few famous batsmen are used for the analysis. The data are saved in playername.sas7bdat files. Primarily, Sachin Tendulkar’s one-day match runs are analyzed.
Open R and load the package sas7bdat to import SAS data files.

> library(sas7bdat)
Read the data file as follows:
> SACHIN <- read.sas7bdat("D:/Modeling/sachin.sas7bdat")

The following command computes the minimum, maximum, mean, median, and quartiles.
> summary(SACHIN)
      RUNS
 Min.   :  0.00
 1st Qu.:  8.00
 Median : 28.50
 Mean   : 40.77
 3rd Qu.: 63.00
 Max.   :200.00

Sachin's mean score is around 40 runs per innings. Here, every innings is counted, whether he was out or not out.
With the quantile function, the distribution of runs at each decile can be observed.

> quantile(SACHIN$RUNS, seq(0, 1, by=.1))

   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
  0.0   2.0   5.0  11.0  18.4  28.5  39.0  53.0  72.8 100.0 200.0

To look at the shape of the distribution of the runs, a stem-and-leaf plot can be generated with the following statements.
> sachin <- as.numeric(SACHIN$RUNS)
> stem(sachin)

   0 | 00000000000000000000111111111111111111112222222222222222223333333333+46
   1 | 00000011111111112222233344444445555555666677777788888889999
   2 | 00011111111222233344445555667777778888888999
   3 | 00000011112222223444555556666666777788889999999
   4 | 0000011112334444555556777888899
   5 | 0012222333334444557777
   6 | 0112222233334555555677788999
   7 | 0012347789
   8 | 011122223455678899
   9 | 011333334556778999
  10 | 000000112455
  11 | 001234457788
  12 | 002234778
  13 | 4789
  14 | 0111366
  15 | 2
  16 | 3
  17 | 5
  18 | 6
  19 |
  20 | 0
The display shows the runs made by Sachin in a compact format. For example, the score of 200 appears in the last row, and the empty row for 19 shows that he made no score in the 190s.

Histograms and Nonparametric density estimation methods

 A histogram is a density estimation method for a continuous variable. The range of the observations is partitioned into intervals called bins, and the height of each bar shows how many observations fall in that bin. The following is the input.
> hist(sachin, col="darkblue", border = "yellow",
+ xlab="Oneday match Runs",
+ ylab="count",
+ main="Oneday Runs made by Sachin Ramesh Tendulkar")

The output is:

Since RUNS is greater than or equal to 0, the first bar is the tallest, and the height of the bars declines gradually from the second bar onwards. Keep in mind, however, that a histogram depends on the width and the number of its bars. If the number of bars is increased, the histogram is drawn over smaller intervals and becomes less smooth, which makes it hard to decide what type of function it represents. If the number of intervals is decreased to 1, the histogram looks uniform. R provides the option breaks= to change the number of bars.

> bins=c(seq(0,150,by=10),200)
> hist(sachin,breaks=bins,col="darkblue", border = "yellow", xlab="Oneday match Runs", ylab="density", main="Oneday Runs made by Sachin Ramesh Tendulkar")
> lines(density(sachin), col= "red")
> rug(sachin)
The Output is as follows:

You can observe that the shape of the histogram changes when the bins are given as 0-10, 11-20, …, 141-150, 151-200. With the rug() function you can see the individual data points along the X-axis. You can clearly identify the runs made in each interval, particularly the scores of 100 and above.
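The effect of the bin width can also be inspected numerically, without drawing anything. The sketch below is illustrative only: `runs` is simulated from the Weibull fit reported later in this post as a hypothetical stand-in for the real scores.

```r
# Hypothetical stand-in for the runs: simulated from the Weibull fit
# reported later in this post, rounded to whole runs and capped at 200.
set.seed(7)
runs <- pmin(round(rweibull(452, shape = 0.839, scale = 37.63)), 200)

# plot = FALSE returns the bin counts without drawing the histogram.
h10 <- hist(runs, breaks = seq(0, 200, by = 10), plot = FALSE)
h50 <- hist(runs, breaks = seq(0, 200, by = 50), plot = FALSE)

h10$counts   # 20 narrow bins: more detail, more noise
h50$counts   # 4 wide bins: smoother, less detail
```

Note that when the breaks are not equally spaced, as in the `bins` vector above, hist() draws densities rather than counts by default, so that bars of different widths remain comparable.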

 Before deciding on the shape of the density function, generate a box plot. Box plots are useful for judging whether the distribution of the data is symmetric or asymmetric.
> boxplot(sachin, horizontal = TRUE, col = "orange", xlab="Oneday match Runs",main="Oneday Runs made by Sachin Ramesh Tendulkar")

In the box plot, the scores beyond roughly 149 runs appear as outliers.

Taken together, the statistics and graphs above show that the distribution is not symmetric. The summary statistics can be examined once again.

Load the library "fBasics" and look at the statistics:
> library(fBasics)

> basicStats(sachin)
                  sachin
nobs          452.000000
NAs             0.000000
Minimum         0.000000
Maximum       200.000000
1. Quartile     8.000000
3. Quartile    63.000000
Mean           40.765487
Median         28.500000
Sum         18426.000000
SE Mean         1.883299
LCL Mean       37.064357
UCL Mean       44.466617
Variance     1603.159959
Stdev          40.039480
Skewness        1.152907
Kurtosis        0.753945


The variance is high, and the standard deviation is almost equal to the mean, so many observations lie far from the mean. Since skewness > 0, the distribution is right-skewed, as the histogram also shows: most values are concentrated to the left of the mean, with extreme values to the right. Given all of the above, a symmetric distribution will not fit the data.
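To make the skewness and kurtosis figures less of a black box, they can be computed from sample moments in base R. This is a sketch of one common moment-based definition; packages differ slightly in the variance divisor they use, so the values need not match basicStats() to every digit. The function names are hypothetical.

```r
# Moment-based sample skewness and excess kurtosis (one common definition).
moment_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
moment_excess_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Hypothetical data: exponential draws are right-skewed (theoretical skewness 2).
set.seed(3)
x <- rexp(5000, rate = 1 / 40)
sk <- moment_skewness(x)
ku <- moment_excess_kurtosis(x)

# A symmetric sample has skewness 0.
sym <- moment_skewness(c(-2, -1, 0, 1, 2))
```

A positive skewness, as for the runs above, means the long tail is on the right.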

You can estimate the density function and the empirical cumulative distribution function as follows:

> plot(density(sachin),main="Density estimate of Sachin’s Oneday Runs")

> plot(ecdf(sachin),main="Empirical cumulative distribution function of Sachin’s Oneday Runs")

From all of the above, it can be observed that the data are asymmetric and positively skewed. So, you can look at distributions such as the exponential and the Weibull.

Quantile-Quantile Plots


Before deciding on the distribution, examine the data with further graphical procedures: quantile-quantile (Q-Q) plots and probability plots. A quantile of a sample is the data point below which a given fraction of the data lies. A one-sample quantile plot looks like a cumulative sample distribution function. qqnorm() is used to check the fit of a normal distribution, and qqplot() can be used for any other distribution. To draw a Q-Q plot for the variable RUNS against a Weibull distribution, the input is as follows (the shape and scale values are the maximum likelihood estimates obtained with fitdistr in the next section):


>  x.teo<-rweibull(n=452,shape=0.83908325, scale=37.62877983)
>  qqplot(sachin,x.teo,main="QQ-plot for Weibull distribution")
> abline(0,1) 

 The output is:
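A Q-Q plot against a random Weibull sample, as in the input above, looks different on every run. A possible alternative is to compare the data with the theoretical Weibull quantiles at the plotting positions ppoints(n). The sketch below uses simulated `runs` as a hypothetical stand-in for the real scores.

```r
# Hypothetical stand-in for the runs (simulated, not the real data).
set.seed(11)
runs <- rweibull(452, shape = 0.839, scale = 37.63)

# Theoretical Weibull quantiles at the plotting positions ppoints(n).
n <- length(runs)
theo <- qweibull(ppoints(n), shape = 0.839, scale = 37.63)

# plot.it = FALSE returns the sorted pairs; drop it to draw the plot,
# then add abline(0, 1) as the reference line.
qq <- qqplot(theo, runs, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)   # near 1 when the points follow a straight line
```

Because the theoretical quantiles are fixed, repeated runs with the same data give the same plot.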



fitdistr
The 0 values of RUNS were converted to 0.1 so that all values are strictly positive, as the Weibull fit requires. From all the above methods, this positive variable appears to follow an asymmetric distribution, with more observations in the first intervals and fewer observations in the later intervals.
> SACHIN1 <-read.sas7bdat("D:/Modeling/SYSTAT/Cluster/GRAPHS/Rdatastore/sachin1.sas7bdat")
> sachin1 <- as.numeric(SACHIN1$RUNS)                                                            

With the help of fitdistr, we can check whether the Weibull or the exponential distribution fits the data. fitdistr estimates the parameters of the distribution by maximum likelihood. The following statements fit both distributions to the variable RUNS.
> library(MASS) 
> fitdistr(sachin1,"Exponential")

The Output is:

      rate    
  0.024527892 
 (0.001153695)

> fitdistr(sachin1,"weibull")
The Output is:

      shape         scale   
   0.83908325   37.62877983 
 ( 0.03253613) ( 2.20929924)
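For the exponential distribution the maximum likelihood estimate has a closed form, rate = 1/mean, with approximate standard error rate/sqrt(n), so the first fit can be checked by hand from the summary statistics. The sketch below assumes that 20 innings with 0 runs (a count read off the first row of the stem-and-leaf display, not stated in the text) were each replaced by 0.1.

```r
n     <- 452      # nobs from basicStats(sachin)
total <- 18426    # Sum from basicStats(sachin)
zeros <- 20       # assumption: number of 0s read from the stem display

total1   <- total + zeros * 0.1   # sum after replacing each 0 by 0.1
rate_hat <- n / total1            # MLE of the exponential rate = 1 / mean
se_hat   <- rate_hat / sqrt(n)    # asymptotic standard error

c(rate = rate_hat, se = se_hat)
```

Under that assumption the two values agree closely with the rate and the standard error printed by fitdistr above. The Weibull estimates have no closed form and are found numerically.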

Goodness of fit tests
The Kolmogorov-Smirnov (KS) test can be used as a goodness-of-fit test.
> ks.test(sachin1,"pexp",rate=0.024527892)
The output is as follows:

        One-sample Kolmogorov-Smirnov test
data:  sachin1

D = 0.0917, p-value = 0.0009926
alternative hypothesis: two-sided

> ks.test(sachin1,"pweibull",shape=0.83908325 ,scale=37.62877983)

The output is as follows:
        One-sample Kolmogorov-Smirnov test
data:  sachin1
D = 0.0596, p-value = 0.08023
alternative hypothesis: two-sided

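For the exponential distribution, the p-value of 0.0009926 is below the usual significance levels, so that fit is rejected. For the Weibull distribution, the p-value of 0.08023 is above the .05 level, so we fail to reject the null hypothesis that the data follow a Weibull distribution. Since the shape parameter of the Weibull is less than 1, the hazard decreases with the score: the higher Sachin's score, the harder it is to get him out.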
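The remark about the shape parameter can be made concrete with the Weibull hazard function, h(x) = f(x) / (1 - F(x)) = (shape/scale) * (x/scale)^(shape - 1), which is decreasing in x whenever shape < 1. A minimal sketch using the fitted values (`weibull_hazard` is a hypothetical helper name):

```r
# Weibull hazard: density divided by the survival function.
weibull_hazard <- function(x, shape, scale) {
  dweibull(x, shape, scale) / pweibull(x, shape, scale, lower.tail = FALSE)
}

x <- c(1, 10, 25, 50, 100)   # scores at which to evaluate the hazard
h <- weibull_hazard(x, shape = 0.83908325, scale = 37.62877983)
round(h, 4)                  # values decrease as the score grows
```

Here the hazard can be read as the instantaneous rate of getting out at a given score, under the fitted model.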

In unpublished work done by Prof Traimbakam Krishnan and me during 2004 to 2007, censored data for several famous players were analyzed. Not-outs were treated as right-censored.
The results will be presented in a separate article.

Conclusions
Using the functions available in R, univariate data can be analyzed very well.
A similar analysis can be done for all famous one-day players.

Thursday, 24 January 2013

Graphical Representation of Oneday and Test Cricket Runs

When comparing the runs made by batsmen, graphs offer an efficient way of looking at the data. For example, the runs made by a batsman can be divided into different categories as follows:


The one-day runs made by three batsmen, Sachin Tendulkar, Ricky Ponting, and Brian Lara, were counted in each category. The bar chart for the three is represented pictorially as:

All three batsmen fall a substantial number of times in the "Above Average" (51-69 runs) and "Above High Score" (100-129 runs) categories. Among the three, Ponting has the most "Above Average" scores. Lara has more "Below High" (70-89 runs) scores than the other two batsmen. As expected, all three batsmen have higher frequencies in the low-run categories. Since Sachin played more one-day matches, he has more scores in the "Single Digit Score" and "Best Score" categories.
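Category counts of this kind can be produced with cut() and table(). The sketch below is illustrative only: the scores are made up, and only the ranges 51-69, 70-89 and 100-129 come from the text, while the remaining cut points and all labels are assumptions.

```r
# Hypothetical scores for illustration only (not real match data).
runs <- c(0, 4, 9, 18, 37, 51, 62, 70, 85, 93, 100, 117, 129, 143, 200)

# Right-open intervals [a, b): 51-69, 70-89 and 100-129 follow the text;
# the remaining cut points are assumptions.
breaks <- c(0, 10, 51, 70, 90, 100, 130, Inf)
labels <- c("0-9", "10-50", "51-69", "70-89", "90-99", "100-129", "130+")

category <- cut(runs, breaks = breaks, labels = labels, right = FALSE)
counts <- table(category)
counts
# barplot(counts, las = 2, ylab = "innings")   # uncomment to draw the bar chart
```

Computing the counts for each batsman in the same way and combining them with rbind() gives a matrix that barplot(..., beside = TRUE) can draw as a grouped bar chart.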

Instead of bar charts as above, a dot chart, line chart, pie chart, or pyramid chart can be drawn. For example, the following are the charts for Sachin's one-day run categories.


Similarly, for Test runs, various graphs can be generated to compare the 1st and 2nd innings.
The categories for the runs made in Test cricket are:

Bar chart in three dimensions for the runs made by Sachin Tendulkar in Test cricket matches.


Instead of bar charts as above, a dot chart, line chart, or pyramid chart can be drawn. For example, the following are the charts for Sachin's Test match run categories.

Three-dimensional pictures do not give a clear view, so a two-dimensional picture can be used instead.

For Sachin, the frequencies in the second innings are lower from the Average category onward: he did not make as many runs in the second innings as in the first. This alone does not allow a conclusion about his performance in the second innings, however. We also need to check the number of times he batted and whether he had sufficient time to make runs.

[ MORE TO BE UPDATED]






Wednesday, 23 January 2013

Data Analysis based on significant digits of runs made by one-day cricket batsmen


Abstract
          The first significant digits of naturally collected numbers follow Benford’s law. The first significant digit of the runs made in one-day cricket by several famous batsmen from various countries is analyzed, and the characteristics of the batsmen’s game plans are discussed on that basis. Duck-outs are excluded for all batsmen.
1. Introduction
          The first significant digit of collected numbers does not follow a uniform distribution, as one might expect. The astronomer and mathematician Simon Newcomb observed that “the first pages of tables of logarithms wear out much faster than the last ones”. The law followed by the first significant digit is
                        P[first significant digit = d] = log10(1 + 1/d), d = 1, 2, …, 9.
            Dr. Frank Benford (1938), a physicist working for General Electric in the 1930s, worked independently on the significant digits of naturally collected numbers. He collected several data sets, such as the areas of rivers, American League baseball statistics, numbers appearing in Reader’s Digest, death rates, and atomic weights of elements, and found that the law, now known as Benford’s law, fits these data sets well.
            Interestingly, M. Nigrini (1996) applied Benford’s law to tax return data to detect fraudulent data. Ley (1996) extended the applications of Benford’s law to stock market indexes. Theodore P. Hill (1995) provided a statistical derivation of Benford’s law.
            We collected the one-day cricket runs made by famous one-day batsmen in world cricket and applied Benford’s law to the first significant digit, excluding duck-outs. Whether the digit in the units place follows a uniform distribution is also tested.
            Section 2 explains Benford’s law in detail. Section 3 describes one-day cricket. Section 4 applies the significant-digit analysis to the runs made by several famous one-day batsmen, and Section 5 gives the conclusions.
            The data sets are available from http://www.howstat.com.au/.
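Extracting the digits described above takes only a few lines of R. The sketch below is illustrative: the scores are made up, and the variable names are hypothetical.

```r
# Hypothetical scores for illustration only (not real match data).
runs <- c(0, 7, 12, 15, 23, 41, 100, 118, 9, 0, 64, 186)

scored <- runs[runs > 0]                                   # drop duck-outs
first  <- as.integer(substr(as.character(scored), 1, 1))   # first significant digit
last   <- scored %% 10                                     # digit in the units place

# Observed counts for digits 1..9 (first) and 0..9 (units place).
first_counts <- table(factor(first, levels = 1:9))
last_counts  <- table(factor(last,  levels = 0:9))
first_counts
```

Using factor() with explicit levels keeps digits that never occur in the table with a count of 0, which matters when the counts are later compared with expected frequencies.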

2. Benford’s law
The first significant digit is distributed on the set {1,2,…,9} as P[first significant digit = d] = log10((d+1)/d), d = 1,2,…,9. That is, in any collection of such numbers, the digit 1 occurs 30.1% of the time [log10(2) = 0.301], but the digit 9 occurs only 4.6% of the time [log10(10/9) = 0.046].
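The nine probabilities follow directly from the formula and can be computed in one line of R, as in this small sketch:

```r
d <- 1:9
benford <- log10(1 + 1 / d)    # P[first significant digit = d]
names(benford) <- d

round(100 * benford, 3)        # percentages for digits 1 to 9
sum(benford)                   # the nine probabilities sum to 1
```

The probabilities sum to 1 because the sum of log10((d+1)/d) over d = 1, …, 9 telescopes to log10(10).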
            The first significant digits of the areas of 40 European countries, measured in square kilometers (P.M. Lee, 1989), follow Benford’s law. The following table gives the details, in percent.

Digit                1        2        3        4       5       6       7       8       9
True data (%)        25       17.5     15       15      7.5     5       5       2.5     7.5
Benford’s law (%)    30.103   17.609   12.494   9.691   7.918   6.695   5.799   5.115   4.576
Table 2.1
            The units in which the quantities are measured may vary. Instead of square kilometers, square miles may be considered. Because of its scale-invariance property, Benford’s law is still applicable to the rescaled data. If the data set is converted from one base (say, base 10) to another (say, base 100), Benford’s law remains applicable, i.e., it also satisfies the base-invariance property. For a detailed discussion of the scale- and base-invariance properties, the reader can refer to the publication of Theodore P. Hill (1995).

3. One-day Cricket

            Cricket was born in England. Because of Great Britain’s rule over many countries in the 18th and 19th centuries, cricket also spread to the colonies. Traditionally, a match is played over 5 days, which is called Test cricket. To increase the enthusiasm of viewers, a limited form of the game played within a single day was introduced. It is called one-day cricket. Cricket is very popular in Australia, New Zealand, England, the Indian subcontinent, South Africa, and the West Indies.
To learn more about this game, the URL http://encyclopedia.thefreedictionary.com/One-day%20cricket may be helpful.

4. Application

            Runs made by several batsmen from various countries were collected. The test statistics of the chi-square goodness-of-fit test against Benford’s law, and the corresponding p-values, for the first significant digit are given in Table 4.1.
Chi-square test for Benford’s law on first significant digit

Player                    Test statistic   P-value
Michael G Bevan           14.90506693      0.061017873
Andrew Flower             12.41756102      0.133523039
Jacques H Kallis          7.746278365      0.458638789
Desmond L Haynes          6.475067741      0.594174459
Rahul S Dravid            6.446699507      0.597325635
Stephen R Waugh           6.223625986      0.622198182
Adam C Gilchrist          5.901606244      0.658252569
Ricky T Ponting           5.858942527      0.663028872
Inzamam-Ul-Haq            5.779427034      0.671923685
Saurav C Ganguly          4.257584281      0.833167603
Pinnaduwage A De Silva    4.209530898      0.837741214
Mohammad Azharuddin       3.534059895      0.896530246
Mark E Waugh              2.993049155      0.934792991
Brian S Lara              2.885163313      0.941356018
Sachin R Tendulkar        2.285404993      0.97098821
Allan R Border            1.823578373      0.985950882
Herschelle H Gibbs        1.477624181      0.993073213
Sanath T Jayasurya        1.318275438      0.995330207
                                                     Table 4.1

Table 4.1 shows that, at the 5% significance level, the null hypothesis that the first significant digits follow Benford’s law is not rejected for any batsman in the list: every test statistic is below the critical value of 15.507 for 8 degrees of freedom.


For Michael G Bevan, values higher than the test statistic 14.91 would be expected to occur about 6.1% of the time, whereas for Sanath T Jayasurya, values higher than the test statistic 1.318 would be expected to occur about 99.5% of the time. The test statistic is the highest for Bevan and the lowest for Jayasurya. The observed data for these two players are in Table 4.2.



Digit                        1        2        3        4        5        6       7       8       9
Bevan’s true data (%)        23.037   13.089   19.895   13.613   11.518   5.759   7.853   4.712   0.524
Jayasurya's true data (%)    32.5     15.833   10.833   10.833   7.5      7.5     3.75    5.833   5.417
Benford’s law data (%)       30.103   17.609   12.494   9.691    7.918    6.695   5.799   5.115   4.576
                                                                  Table 4.2
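As a check, the statistic reported for Bevan in Table 4.1 can be recomputed from the percentages in Table 4.2. The sketch below assumes that the reported value is a Pearson chi-square statistic formed directly from percentages, with 9 - 1 = 8 degrees of freedom; this is an inference from the numbers, not something the text states.

```r
benford_pct <- 100 * log10(1 + 1 / (1:9))          # expected percentages
bevan_pct   <- c(23.037, 13.089, 19.895, 13.613, 11.518,
                 5.759, 7.853, 4.712, 0.524)       # Bevan, Table 4.2

# Pearson statistic formed from percentages, with 8 degrees of freedom.
stat <- sum((bevan_pct - benford_pct)^2 / benford_pct)
pval <- pchisq(stat, df = 8, lower.tail = FALSE)
crit <- qchisq(0.95, df = 8)                       # 5% critical value

c(statistic = stat, p.value = pval, critical = crit)
```

Because the statistic is formed from percentages, it corresponds to a nominal sample size of 100. Applied to raw counts from n innings, for example with chisq.test(counts, p = benford), the statistic would be n/100 times as large, so the conclusion can depend on which form is used.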

The first significant digits of the one-day runs of Andrew Flower and Michael G Bevan deviate the most from Benford’s law, although not significantly. Both players played well in most games and made fewer scores with 1 as the first significant digit than the law predicts, which suggests that they tended to score around their average in every match. The p-values for these two players are 13.4% and 6.1% respectively, the lowest among all players. For Sanath T Jayasurya the p-value is the highest, 99.5%. For Sachin R Tendulkar the p-value is 97%. He made more than 30 centuries, each of which has 1 as its first significant digit, and this may have some effect.



5. References:

[1] Benford, F. (1938), “The Law of Anomalous Numbers,” Proceedings of the American Philosophical Society, 78, 551-572.
[2] Hill, T. P. (1995), “A Statistical Derivation of the Significant-Digit Law,” Statistical Science, 10, 4, 354-363.
[3] Ley, E. (1996), “On the Peculiar Distribution of the U.S. Stock Indices Digits,” The American Statistician, 50, 311-313.
[4] Nigrini, M. (1996), “A Taxpayer Compliance Application of Benford’s Law,” Journal of the American Taxation Association, 18, 72-91.