library(readr)
trueData <- read_csv("C:/Users/ayoun/Downloads/trueData.csv")
## Rows: 500 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): word
## dbl (3): freq, article_count, rank
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
fakeData <- read_csv("C:/Users/ayoun/Downloads/fakeData.csv")
## Rows: 500 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): word
## dbl (3): freq, article_count, rank
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
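The messages above are informational only. They can be silenced by declaring the column types up front; a minimal sketch, assuming readr 2.x (which provides `show_col_types`) and the four columns reported above:

trueData <- read_csv("C:/Users/ayoun/Downloads/trueData.csv",
                     col_types = cols(word = col_character(),
                                      freq = col_double(),
                                      article_count = col_double(),
                                      rank = col_double()))
fakeData <- read_csv("C:/Users/ayoun/Downloads/fakeData.csv", show_col_types = FALSE)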
My hypothesis is that strong words occupy a higher proportion of the most commonly used phrases in fake articles than in true articles. I defined “strong words” as words that include politicians’ names, political party names, country names, and names of popular social events or groups.
\[ H_0: p_{fake} \le p_{true} \\ H_a: p_{fake} > p_{true} \]
where \(p_{fake}\) is the proportion of strong words among the most commonly used phrases in fake articles and \(p_{true}\) is the corresponding proportion in true articles.
# Categories of "strong words": politicians, countries, social groups, and
# other charged terms. Matching below is case-insensitive.
politicians <- c("Donald", "Trump", "Hillary", "Clinton", "Obama", "Barack", "Bill", "Bernie", "Sanders")
countries <- c("United", "States", "America", "US", "USA", "Russia", "China", "north", "Korea")
groups <- c("black", "lives", "african", "american", "islamic", "state", "republican", "democrat")
others <- c("fake", "said", "statement", "fact")
strongwords <- c(politicians, countries, groups, others)
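Before counting, it is worth seeing how the matching rule used below behaves: each data word is passed to `grepl()` as the pattern, so a word counts as "strong" if it occurs as a case-insensitive substring of any entry in `strongwords`. A quick check, with the expected results noted in comments:

any(grepl("trump", strongwords, ignore.case = TRUE))  # TRUE (matches "Trump")
any(grepl("state", strongwords, ignore.case = TRUE))  # TRUE (matches "States", "state", "statement")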
# Sum the frequencies of the top fake-article words that match a strong word.
fake.strongwords <- 0
for (i in seq_along(fakeData$word)){
  if (any(grepl(fakeData$word[i], strongwords, ignore.case=TRUE))){
    fake.strongwords <- fake.strongwords + fakeData$freq[i]
  }
}
fake.strongwords
## [1] 88600
# Same tally for the top words in true articles.
true.strongwords <- 0
for (i in seq_along(trueData$word)){
  if (any(grepl(trueData$word[i], strongwords, ignore.case=TRUE))){
    true.strongwords <- true.strongwords + trueData$freq[i]
  }
}
true.strongwords
## [1] 62935
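The same totals can be computed without explicit loops; a sketch using `sapply`, applying the identical substring rule as the loops above (the helper `is_strong` is my own, not part of the original analysis):

is_strong <- function(words) {
  sapply(words, function(w) any(grepl(w, strongwords, ignore.case = TRUE)))
}
sum(fakeData$freq[is_strong(fakeData$word)])  # should reproduce fake.strongwords
sum(trueData$freq[is_strong(trueData$word)])  # should reproduce true.strongwords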
To test whether the proportion of strong words is higher in fake articles than in true articles, I ran a two-sample test for equality of proportions at the 5% significance level (95% confidence).
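Without the continuity correction, this test is equivalent to the pooled two-proportion z-test; the X-squared statistic reported below is the square of

\[ z = \frac{\hat{p}_{fake} - \hat{p}_{true}}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_{fake}} + \frac{1}{n_{true}}\right)}} \]

where \(\hat{p}\) is the pooled proportion of strong words across both samples and \(n_{fake}\), \(n_{true}\) are the total word counts.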
total.fakewords <- sum(fakeData$freq)
total.truewords <- sum(trueData$freq)
prop.test(c(fake.strongwords, true.strongwords), c(total.fakewords, total.truewords), correct=FALSE)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(fake.strongwords, true.strongwords) out of c(total.fakewords, total.truewords)
## X-squared = 5444.7, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.01656688 0.01747280
## sample estimates:
## prop 1 prop 2
## 0.05460170 0.03758186
Since the p-value is less than 0.05, I reject the null hypothesis and conclude that the proportion of strong words among the most commonly used phrases in fake articles is not the same as the corresponding proportion in true articles. The confidence interval gives \(0.01656688 \le p_{fake} - p_{true} \le 0.01747280\); since the entire interval lies above zero, I can conclude that, at the population level, the proportion of strong words in fake articles is statistically higher than in true articles.
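Note that `prop.test` above was run two-sided, while the alternative hypothesis is directional. A one-sided version of the same test can be requested through the `alternative` argument (a sketch; output not shown here):

prop.test(c(fake.strongwords, true.strongwords),
          c(total.fakewords, total.truewords),
          alternative = "greater", correct = FALSE)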
prop.fakestrong <- fake.strongwords/total.fakewords
prop.truestrong <- true.strongwords/total.truewords
proportions <- c(prop.fakestrong, prop.truestrong)
barplot(proportions, main="Proportion of strong words out of most popular words",
        xlab="Article type", ylab="Proportion", col=c("wheat", "khaki"),
        ylim=c(0,0.06),
        names.arg = c("Fake Article", "True Article"))
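The same figure can be drawn with ggplot2 for readers who prefer it (a sketch, assuming ggplot2 is installed; `plot.df` is a helper of my own):

library(ggplot2)
plot.df <- data.frame(article = c("Fake Article", "True Article"),
                      proportion = proportions)
ggplot(plot.df, aes(x = article, y = proportion)) +
  geom_col(fill = c("wheat", "khaki")) +
  ylim(0, 0.06) +
  labs(title = "Proportion of strong words out of most popular words",
       x = "Article type", y = "Proportion")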