Football and Web: Lexical Analysis of a Genre through Time

These scripts are also available at: 

Github

Rpubs 

The aim of this document is to discuss the methodological procedures followed in my article Football and Web: Lexical Analysis of a Genre through Time. The article is available at:

Lima-Lopes, Rodrigo Esteves de. 2020. ‘Football and Web: Lexical Analysis of a Genre through Time’. Papéis: Revista do Programa de Pós-Graduação em Estudos de Linguagem – UFMS 24 (47): 150–78. Available at link

I would like it to be a contribution to the replicability of studies in the field of Applied Linguistics.

If you have any comments, you can reach me at: rll307@unicamp.br

Objectives

  • To study lexical choices in a set of sports news articles through a diachronic corpus

Hypothesis:

  • The growing importance of internet news changes lexical choices within sports news.

Motivation

  • There have been a number of studies discussing the visual impact of technology on layout and the image/text relationship, but very few comparing lexical changes
  • Most of the lexical discussion is centred on platforms that were created for digital interaction

Data collection and processing

The data for this study were sports news articles from the British newspaper The Guardian. The following criteria were followed:

  • Only sports articles published in the months of the World Cup were considered;
  • Genres that did not fit the classification of sports articles were discarded;
    • Amongst the discarded genres are quizzes, letters from readers and chronicles;
  • No photo or video galleries were considered;
    • The focus of this work was the written material.

Since the Guardian’s API allows scraping the full set of articles in a given period, all articles referring to the World Cups studied were collected.

Data was collected through web scraping using R:

  • Package GuardianR
  • Data processing using the following packages:
    1. tm
    • Used for cleaning the corpus and producing the hierarchical clustering dendrograms
    2. tidytext
    • Used for producing weighted wordlists
    3. quanteda
    • Used for producing network graphs

Lexical items were analysed using three measurements. The first is the product of TF × IDF (see the formula after this list):

  • Term Frequency (TF): measures the frequency of a term in a document
  • Inverse Document Frequency (IDF): weights down the importance of more frequent words and scales up the rarer ones
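
For reference, this is the standard TF-IDF formulation, the convention followed by tidytext’s bind_tf_idf() used below (with a natural logarithm), where N is the number of documents (here, World Cup editions) and n_t the number of documents containing the term t:

$$\mathrm{tf\_idf}(t, d) = \mathrm{tf}(t, d) \times \ln\left(\frac{N}{n_t}\right)$$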

Network Graph:

  • A matrix of words was calculated through co-occurrence, as in the example below (a code sketch follows Table 1)

        I   like   traveling   to   Argentina
  D1   10     20          11   22           0
  D2    1      2          33    3           1

Table 1: Example of a matrix
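
As a minimal sketch of how such matrices can be built (toy sentences only, not the actual corpus; it assumes the quanteda package loaded further below):

library(quanteda)
# Two toy documents, for illustration only
toy <- c(D1 = "I like traveling to Argentina",
         D2 = "I like Argentina a lot")
# Document x word count matrix, as in Table 1
toy.dfm <- dfm(tokens(toy))
# Word x word co-occurrence matrix, the input for the network graphs
toy.fcm <- fcm(tokens(toy), context = "document")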

The scripts

The packages

These are the necessary packages for the commands to run:

library(readtext)
library(dplyr)
library(tidytext)
library(ggplot2)
library(scales)
library(stringr)
library(tidyr)
library(GuardianR)
library(tm)
library(quanteda)

Collecting the articles

# The command below will interact with the newspaper's API and scrape the data from the 2002 World Cup.
# Try one for each Cup you are interested in. The Guardian makes it possible to collect data back to 1999.

cup2002 <- get_guardian("world+cup",
                        section="football",
                        from.date="2002-05-31",
                        to.date="2002-06-30",
                        api.key="my API key")

Just repeat this same command for the World Cups from 2006 to 2018 (an example follows below). My API key stands for a unique code that the Guardian’s API makes available to each researcher.
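
For instance, a possible call for the 2018 World Cup might look like the one below; the dates are my assumption, covering the tournament window (14 June to 15 July 2018), and should be adjusted to your own criteria:

cup2018 <- get_guardian("world+cup",
                        section="football",
                        from.date="2018-06-14",
                        to.date="2018-07-15",
                        api.key="my API key")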

Save as a character

The command below takes only the body text of the resulting data frame and saves it as a character object for later use.

your.text.string <- paste(as.character(cup2002$body))
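
Note that paste() without a collapse argument keeps one element per article. If a single concatenated string is preferred, a collapse argument can be added, for example:

#Optional: collapse all articles into one single string
your.text.string <- paste(as.character(cup2002$body), collapse = " ")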

Cleaning the texts and making dendrograms

This is a set of functions one might use for cleaning the texts before making the dendrograms.

limpar_texto.html <- function(x) {
  return(gsub("<.*?>", "", x))
} #Removes HTML tags
limpar.pontuacao <- function(x){
  return(gsub(pattern = "\\W"," ",x))
} #Removes punctuation (non-word characters)
limpar_texto.espacos <- function(x) {
  return(str_squish(x))
} #Removes unnecessary spaces
limpar_texto.paragrafos <- function(x) {
  return(gsub("[\r\n]+", " ", x))
} #Removes paragraph/line breaks (definition assumed from the function name)
limpar.numeros <- function(x) {
  return(gsub("[0-9]+", " ", x))
} #Removes numbers (definition assumed from the function name)
limpar.letras.soltas <- function(x){
  return(gsub("\\b[A-Za-z]\\b", " ", x))
} #Removes single stray letters
Clean_String <- function(string){
  # Lowercase
  temp <- tolower(string)
  # Remove everything that is not a number or letter (you may want to keep more 
  # stuff in your actual analysis). 
  temp <- stringr::str_replace_all(temp,"[^a-zA-Z\\s]", " ")
  # Shrink down to just one white space
  temp <- stringr::str_replace_all(temp,"[\\s]+", " ")
  # Split it (kept commented out; tokenisation is done later by unnest_tokens)
  #temp <- stringr::str_split(temp, " ")[[1]]
  # Get rid of empty elements if necessary
  indexes <- which(temp == "")
  if(length(indexes) > 0){
    temp <- temp[-indexes]
  } 
  return(temp)
}
## In order to process the text cleaning, just run the lines below, substituting "your.text"
your.text <- limpar_texto.html(your.text.string)
your.text <- Clean_String(your.text)
your.text <- limpar_texto.espacos(your.text)
your.text <- limpar_texto.paragrafos(your.text)
your.text <- limpar.pontuacao(your.text)
your.text <- limpar.numeros(your.text)
your.text <- limpar.letras.soltas(your.text)
your.text <- gsub("getty images", " ", your.text)

Now let us apply a stoplist to the data. I used an ordinary English stoplist, so grammatical words would be ignored in the word lists.

#Make a data frame out of your character vector
your.text.df <- data.frame(text = your.text, stringsAsFactors = F)

#Apply the stopwords
your.text.df  <- your.text.df  %>%
  unnest_tokens(word, text)%>%
  anti_join(stop_words)

#Repeat once for each World Cup
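
The frequency comparison below expects one tidied data frame per cup, named cup2002.tidy to cup2018.tidy. A minimal sketch of that naming step, assuming the pipeline above was run on the 2002 data, would be:

#Store each cup's tidied words under its own name (assumed naming step)
cup2002.tidy <- your.text.df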

Now let us compare some frequencies. I will guide you step by step.

#This will produce a list of the words in each cup with their raw counts (proportions are computed later as term frequencies)

frequency.cups <- bind_rows(mutate(cup2002.tidy, cup = "2002"),
                            mutate(cup2006.tidy, cup = "2006"),
                            mutate(cup2010.tidy, cup = "2010"),
                            mutate(cup2014.tidy, cup = "2014"),
                            mutate(cup2018.tidy, cup = "2018")) %>%
  mutate(word = str_extract(word, "[:alpha:]+")) %>%
  count(cup, word)
#This will add a total for each World Cup
frequency.cups.total <- frequency.cups %>%
  group_by(cup) %>%
  summarize(total = sum(n))
#Now we join them together
frequency.cups.total <- left_join(frequency.cups, frequency.cups.total)

The result might look like this:

Head of List

  cup   word        n    total
  2002  aaah        1   243568
  2002  aaarrgh     1   243568
  2002  aahed       1   243568
  2002  aand        1   243568
  2002  aardvark    1   243568
  2002  aarhus      1   243568
  2002  aaron       4   243568
  2002  aback       2   243568
  2002  abacus      1   243568
  2002  abandon     5   243568

This table does not tell us much. This is because words are not organised by their frequency or their importance.

#These commands will process the table in terms of TF-IDF values
freq_by_rank <- frequency.cups.total %>%
  group_by(cup) %>%
  mutate(rank = row_number(),
         `term frequency` = n/total)
freq_by_rank %>%
  ggplot(aes(rank, `term frequency`, color = cup)) +
  geom_line(size = 1, alpha = 0.8, show.legend = TRUE) +
  scale_x_log10() +
  scale_y_log10()

#Now we are producing the table for plotting
palavras.imp <- freq_by_rank %>%
  bind_tf_idf(word, cup, n)

This table should look something like this:

library(knitr)
kable(palavras.imp[1:15, ], caption = "Words and their importance")

Words and their importance

  cup   word         n    total   rank   term frequency         tf        idf    tf_idf
  2002  aaah         1   243568      1         4.10e-06   4.10e-06  1.6094379   6.6e-06
  2002  aaarrgh      1   243568      2         4.10e-06   4.10e-06  1.6094379   6.6e-06
  2002  aahed        1   243568      3         4.10e-06   4.10e-06  1.6094379   6.6e-06
  2002  aand         1   243568      4         4.10e-06   4.10e-06  1.6094379   6.6e-06
  2002  aardvark     1   243568      5         4.10e-06   4.10e-06  1.6094379   6.6e-06
  2002  aarhus       1   243568      6         4.10e-06   4.10e-06  0.9162907   3.8e-06
  2002  aaron        4   243568      7         1.64e-05   1.64e-05  0.0000000   0.0e+00
  2002  aback        2   243568      8         8.20e-06   8.20e-06  0.2231436   1.8e-06
  2002  abacus       1   243568      9         4.10e-06   4.10e-06  1.6094379   6.6e-06
  2002  abandon      5   243568     10         2.05e-05   2.05e-05  0.0000000   0.0e+00
  2002  abandoned   11   243568     11         4.52e-05   4.52e-05  0.0000000   0.0e+00
  2002  abandoning   1   243568     12         4.10e-06   4.10e-06  0.0000000   0.0e+00
  2002  abate        1   243568     13         4.10e-06   4.10e-06  0.5108256   2.1e-06
  2002  abatement    1   243568     14         4.10e-06   4.10e-06  1.6094379   6.6e-06
  2002  abbot        1   243568     15         4.10e-06   4.10e-06  1.6094379   6.6e-06

This is the command to plot the lists as they appear in the original paper, printing the top ten words for each cup:

palavras.imp  %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(cup) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = cup)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~cup, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
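
If you want to save the resulting figure, ggplot2's ggsave() can be used right after the plot is drawn; the file name and dimensions below are only an example:

#Saving the last plot to disk (example file name and size)
ggsave("tfidf_top10_by_cup.png", width = 20, height = 25, units = "cm", dpi = 300)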

Dendrogram hierarchical clustering

In order to do the clustering, we will reprocess the corpus. Much of what we will be doing now relies on the tm package. Each year is processed individually.

# Creating a corpus out of a set of strings. Here we process 2018; c.2018 is assumed to be the 2018 text collected and cleaned above
corpus.cluster.2018 <- Corpus(VectorSource(c.2018))
# The text is cleaned by removing URLs and other unwanted features. You may have to remove some words by hand, as I did in the last tm_map() call
corpus.cluster.2018 <- tm_map(corpus.cluster.2018, content_transformer(tolower))
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x) 
remove.users <- function(x) gsub("@[[:alnum:][:punct:]]*", "", x)
corpus.cluster.2018 <- tm_map(corpus.cluster.2018, content_transformer(removeURL))
corpus.cluster.2018 <- tm_map(corpus.cluster.2018, content_transformer(remove.users))
corpus.cluster.2018 <- tm_map(corpus.cluster.2018, stripWhitespace)
corpus.cluster.2018 <- tm_map(corpus.cluster.2018, removePunctuation)
corpus.cluster.2018 <- tm_map(corpus.cluster.2018, 
                              function(x) removeWords(x, c(stopwords("en"), "bst", "min", "getty")))
#Now we are going to create a term-document matrix (see the example matrix above)
corpus.cluster.2018.tdm <- TermDocumentMatrix(corpus.cluster.2018)
#Deleting sparse terms
corpus.cluster.2018.tdm <- removeSparseTerms(corpus.cluster.2018.tdm, sparse = 0.999)
cluster.2018.df <- as.data.frame(as.matrix(corpus.cluster.2018.tdm))
#Calculating distances amongst words
## Using the Euclidean measure
cluster.df.scale.2018 <- scale(cluster.2018.df)
cluster.d.2018 <- dist(cluster.2018.df, method = "euclidean")
#Plotting using the Ward method
fit.ward2 <- hclust(cluster.d.2018, method = "ward.D2")
plot(fit.ward2)
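
If you wish to highlight groups on the dendrogram, base R's rect.hclust() can be called right after plot(); the number of clusters below is hypothetical and only for illustration:

#Draw rectangles around k clusters on the current dendrogram (k is illustrative)
rect.hclust(fit.ward2, k = 5, border = "darkgray")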

Network of words

The idea behind this network is to give a clearer view of the main collocates within a text. Its calculation relies on the quanteda package. The example here also uses the 2018 World Cup, but you can apply it to all the data.

#Creating a quanteda corpus; clean.2018 is assumed to be the cleaned 2018 text
library(quanteda)
corpus.2018 <- corpus(clean.2018)
#Creating a document-feature matrix for processing the data
#and cleaning some custom words (those more related to the technical internet routine)
corpus.2018.df <- dfm(corpus.2018,
                      remove_numbers = TRUE, 
                      remove_punct = TRUE,
                      remove_symbols = TRUE,
                      verbose = T) %>% 
  dfm_remove(c(stopwords("english"), "bst", "min", "getty", "t", "s", "m", "one", "re", "ve", "pm"))
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 832 documents, 34,413 features
##    ... created a 832 x 34,413 sparse dfm
##    ... complete. 
## Elapsed time: 1.46 seconds.
#Deleting words which occur fewer than 20 times
corpus.2018.select.df <- dfm_trim(corpus.2018.df, min_termfreq = 20)
#Creating a feature co-occurrence matrix
corpus.2018.select.fcm <- fcm(corpus.2018.select.df)
#Selecting the 50 most frequent words
corpus.2018.topfeats <- names(topfeatures(corpus.2018.df, 50))
select.2018 <- fcm_select(corpus.2018.select.fcm, pattern = corpus.2018.topfeats)
#Calculating node sizes for the network
size <- log(colSums(dfm_select(corpus.2018.select.fcm, corpus.2018.topfeats)))
#Plotting the network in dark grey
set.seed(100)
textplot_network(fcm_select(corpus.2018.select.fcm, corpus.2018.topfeats), min_freq = 0.8,
                 vertex_size = size/max(size)*3, edge_color = "darkgray")
