Importing the datasets

library(readr)
trueData <- read_csv("C:/Users/ayoun/Downloads/trueData.csv")
## Rows: 500 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): word
## dbl (3): freq, article_count, rank
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
fakeData <- read_csv("C:/Users/ayoun/Downloads/fakeData.csv")
## Rows: 500 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): word
## dbl (3): freq, article_count, rank
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
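
As the read_csv() message suggests, the column types can be declared up front to silence it; a minimal sketch, assuming the four columns shown above:

trueData <- read_csv("C:/Users/ayoun/Downloads/trueData.csv",
                     col_types = cols(word = col_character(),
                                      freq = col_double(),
                                      article_count = col_double(),
                                      rank = col_double()))
# Or simply quiet the message without declaring a spec:
fakeData <- read_csv("C:/Users/ayoun/Downloads/fakeData.csv", show_col_types = FALSE)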

Setting the hypothesis

My hypothesis is that strong words occupy a higher proportion of the most common phrases in fake articles than in true articles. I define “strong words” as words that include politicians’ names, political party names, country names, and names of popular social events or groups.

\[ H_0: p_{fake} \le p_{true} \\ H_a: p_{fake} > p_{true} \]

where \(p_{fake}\) is the proportion of strong words among the most commonly used phrases in fake articles and \(p_{true}\) is the corresponding proportion in true articles.

Subsetting strong words

politicians <- c("Donald", "Trump", "Hillary", "Clinton", "Obama", "Barak", "Bill", "Bernie", "Sanders")
countries <- c("United", "States", "America", "US", "USA", "Russia", "China", "north", "Korea")
groups <- c("black", "lives", "african", "american", "islamic", "state", "republican", "democrat")
others <- c("fake", "said", "statement", "fact")

strongwords <- c(politicians, countries, groups, others)
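
Because the matching below uses ignore.case = TRUE, lower-casing the list does not change any match, and de-duplicating it is a cheap safeguard. A minimal sketch (optional; not needed for the counts below):

# Normalize and de-duplicate the strong-word list; matching is
# case-insensitive later, so this leaves the results unchanged.
strongwords <- unique(tolower(strongwords))
length(strongwords)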

Counting the proportion of strong words among the most commonly used phrases

# Sum the frequencies of fake-article words that match a strong word.
# Note: grepl(word, strongwords) checks whether the word occurs as a
# substring of any strong word, so e.g. "us" would also match "USA".
fake.strongwords <- 0
for (i in seq_along(fakeData$word)){
  if (any(grepl(fakeData$word[i], strongwords, ignore.case = TRUE))){
    fake.strongwords <- fake.strongwords + fakeData$freq[i]
  }
}
fake.strongwords
## [1] 88600
# Repeat the same count for the true-article words.
true.strongwords <- 0
for (i in seq_along(trueData$word)){
  if (any(grepl(trueData$word[i], strongwords, ignore.case = TRUE))){
    true.strongwords <- true.strongwords + trueData$freq[i]
  }
}
true.strongwords
## [1] 62935
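
For reference, the same totals can be computed without explicit loops. A minimal sketch of a vectorized equivalent (is_strong() is a helper defined here for illustration; it uses the same substring matching, so it should reproduce the totals above):

# TRUE for each word that matches at least one strong word
is_strong <- function(words) {
  sapply(words, function(w) any(grepl(w, strongwords, ignore.case = TRUE)))
}
sum(fakeData$freq[is_strong(fakeData$word)])
sum(trueData$freq[is_strong(trueData$word)])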

Proportion test

To test whether the proportion of strong words is higher in fake articles than in true articles, I ran a two-sample test for equality of proportions at the 5% significance level. Note that prop.test() reports a two-sided test by default, which is conservative for the one-sided hypothesis above.
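
Without continuity correction, the confidence interval reported by prop.test() is the standard normal-approximation (Wald) interval for a difference of proportions, where \(n_{fake}\) and \(n_{true}\) are the total word frequencies:

\[ (\hat{p}_{fake} - \hat{p}_{true}) \pm z_{0.975}\sqrt{\frac{\hat{p}_{fake}(1-\hat{p}_{fake})}{n_{fake}} + \frac{\hat{p}_{true}(1-\hat{p}_{true})}{n_{true}}} \]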

total.fakewords <- sum(fakeData$freq)
total.truewords <- sum(trueData$freq)

prop.test(c(fake.strongwords, true.strongwords), c(total.fakewords, total.truewords), correct=FALSE)
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  c(fake.strongwords, true.strongwords) out of c(total.fakewords, total.truewords)
## X-squared = 5444.7, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.01656688 0.01747280
## sample estimates:
##     prop 1     prop 2 
## 0.05460170 0.03758186
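
Since the alternative hypothesis is directional, the same test can also be run one-sided. A sketch (output omitted; the two-sided result above already rejects at any conventional level):

# One-sided version matching H_a: p_fake > p_true
prop.test(c(fake.strongwords, true.strongwords),
          c(total.fakewords, total.truewords),
          alternative = "greater", correct = FALSE)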

Since the p-value is far below 0.05, I reject the null hypothesis and conclude that the proportion of strong words among the most commonly used phrases in fake articles is not the same as in true articles. Looking at the confidence interval, \(0.01656688 \le p_{fake} - p_{true} \le 0.01747280\), which contains the point estimate of the difference, \(0.05460170 - 0.03758186 \approx 0.0170\). Because the entire interval lies above zero, I conclude that at the population level the proportion of strong words in fake articles is statistically higher than in true articles.

Proportion visualization

prop.fakestrong <- fake.strongwords/total.fakewords
prop.truestrong <- true.strongwords/total.truewords

proportions <- c(prop.fakestrong, prop.truestrong)

barplot(proportions, main="Proportion of strong words out of most popular words",
        xlab="Article type", ylab="Proportion", col=c("wheat", "khaki"),
        ylim=c(0,0.06),
        names.arg = c("Fake Article", "True Article"))
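
To annotate each bar with its exact value, barplot() returns the bar midpoints, which can be passed to text(). A minimal sketch:

# barplot() invisibly returns the x-coordinates of the bar centers
mids <- barplot(proportions, main="Proportion of strong words out of most popular words",
                xlab="Article type", ylab="Proportion", col=c("wheat", "khaki"),
                ylim=c(0,0.06),
                names.arg = c("Fake Article", "True Article"))
# pos = 3 places the rounded proportion just above each bar
text(mids, proportions, labels = round(proportions, 4), pos = 3)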