Welcome to my journey as I prepare for my PhD comprehensive exam! Since starting my PhD in 2019, I’ve used text analysis to help me with various aspects of my research, including gearing up for comps. In this notebook, I analyze over a hundred research papers that explore two critical questions: how new ideas gain acceptance and how social networks influence the careers of men and women differently. These research streams, which I refer to as “Novelty Reception” and “Network and Gender”, form the bedrock of my work on the systemic barriers women face in creative fields.

We start by extracting detailed metadata for each paper from its academic citation. Then, we delve into the textual content of the abstracts, using NLP tools like TF-IDF and weighted log odds to unearth key terms and phrases unique to each research stream. We also create visualizations to further clarify the thematic trends we uncover.

Join me on this journey as we explore the thematic trends in the research on the forces shaping acceptance of new ideas and the gendered effects of social networks.

Generating metadata from citations

I’ve set up a Google Sheets document that contains all the papers I plan to explore. Let’s pull this data from Google Sheets and see what we’ve got.
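One housekeeping note first: the chunks in this notebook assume a handful of packages are already loaded. Here is a minimal setup chunk covering the functions used below; treat it as a reconstruction of my setup rather than an exact copy.

# minimal setup: packages assumed throughout this notebook
library(tidyverse)   # dplyr, ggplot2, stringr, tidyr
library(tidytext)    # unnest_tokens(), get_stopwords(), bind_tf_idf(), reorder_within()
library(kableExtra)  # kbl(), kable_styling(), scroll_box()
# gsheet, janitor, skimr, tidylo, and ggrepel are called with :: below and only need to be installed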

# Importing a Google Sheets document containing the paper citations and abstracts
gsheet::gsheet2tbl('https://docs.google.com/spreadsheets/d/1ic1Zc3CpXZyiYfD5whrHLizMN83WXyxjipN5moUtVG8/edit#gid=0') -> reading

# Sample 10 random rows to preview the data structure 
reading %>% 
  sample_n(10) %>% 
  # format the table for better visualization in HTML
  kbl(caption = "Random Sample from the Reading List") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"), fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "700px")
Random Sample from the Reading List
Citation Abstract Topic
Sternberg, R. J. (1985). Implicit theories of intelligence, creativity, and wisdom. Journal of personality and social psychology, 49(3), 607. In a prestudy, a questionnaire was sent to 97 professors in the fields of art, business, philosophy, and physics; it was also given to 17 laypersons. Subjects were asked to list behaviors characteristic of an ideally intelligent, creative, or wise person in one’s field of endeavor, or in general (for laypersons). In experiment I, 285 professors in the same fields and 30 laypersons rated the extent to which each of the behaviors listed at least twice in the prestudy was characteristic of an ideally intelligent, creative, or wise individual. In experiment II, a subset of the behaviors from the prestudy was sorted by 40 undergraduates to yield a multidimensional space characterizing the subjects’ implicit theories for intelligence, creativity, and wisdom. In experiment III, 30 adults rated themselves on a subset of the behaviors from the prestudy, and these ratings were correlated with “ideal prototype” ratings to yield a measure of resemblance to the prototype. Resemblance scores were then correlated with scores on standardized ability tests. In experiment IV, 30 adults rated hypothetical individuals described in simulated letters of recommendation in terms of their intelligence, creativity, and wisdom. Results reveal that people have systematic implicit theories of intelligence, creativity, and wisdom, which are used accurately both in evaluating themselves and in evaluating hypothetical others. Moreover, the implicit theories for each of the constructs show at least some convergent–discriminant validity with respect to each other. Novelty reception
Boudreau, K. J., Guinan, E. C., Lakhani, K. R., & Riedl, C. (2016). Looking across and looking beyond the knowledge frontier: Intellectual distance, novelty, and resource allocation in science. Management science, 62(10), 2765-2783. Selecting among alternative projects is a core management task in all innovating organizations. In this paper, we focus on the evaluation of frontier scientific research projects. We argue that the “intellectual distance” between the knowledge embodied in research proposals and an evaluator’s own expertise systematically relates to the evaluations given. To estimate relationships, we designed and executed a grant proposal process at a leading research university in which we randomized the assignment of evaluators and proposals to generate 2,130 evaluator–proposal pairs. We find that evaluators systematically give lower scores to research proposals that are closer to their own areas of expertise and to those that are highly novel. The patterns are consistent with biases associated with boundedly rational evaluation of new ideas. The patterns are inconsistent with intellectual distance simply contributing “noise” or being associated with private interests of evaluators. We discuss implications for policy, managerial intervention, and allocation of resources in the ongoing accumulation of scientific knowledge. Novelty reception
Smith, E. B., Brands, R. A., Brashears, M. E., & Kleinbaum, A. M. (2020). Social networks and cognition. Annual Review of Sociology, 46, 159-174. Social network analysis, now often thought of simply as network science, has penetrated nearly every scientific and many scholarly fields and has become an indispensable resource. Yet, social networks are special by virtue of being specifically social, and our growing understanding of the brain is affecting our understanding of how social networks form, mature, and are exploited by their members. We discuss the expanding research on how the brain manages social information, how this information is heuristically processed, and how network cognitions are affected by situation and circumstance. In the process, we argue that the cognitive turn in social networks exemplifies the modern conception of the brain as fundamentally reprogrammable by experience and circumstance. Far from social networks being dependent upon the brain, we anticipate a modern view in which cognition and social networks coconstitute each other. Network and gender
Zhao, E. Y., Fisher, G., Lounsbury, M., & Miller, D. (2017). Optimal distinctiveness: Broadening the interface between institutional theory and strategic management. Strategic Management Journal, 38(1), 93-113. Attaining optimal distinctiveness—positive stakeholder perceptions about a firm’s strategic position that reconciles competing demands for differentiation and conformity—has been an important focal point for scholarship at the interface of strategic management and institutional theory. We provide a comprehensive review of this literature and situate studies on optimal distinctiveness in the broader scholarly effort to integrate institutional theory into strategic management. Our review finds that much extant research on firm-level optimal distinctiveness is grounded in the strategic balance perspective that conceptualizes conformity and competitive differentiation as a trade-off along a single organizational attribute. We argue for a renewed research agenda that draws on recent developments in institutional theory to conceptualize organizational environments as more multiplex, fragmented, and dynamic, and discuss its implications for core strategic management topics. Novelty reception
Chan, C. R., & Parhankangas, A. (2017). Crowdfunding innovative ideas: How incremental and radical innovativeness influence funding outcomes. Entrepreneurship Theory and Practice, 41(2), 237-263. We investigate the effect of innovativeness on crowdfunding outcomes. Because crowdfunding campaigns characterized by greater incremental innovativeness are more comprehensible and generate more user value for typical crowdfunders, incremental innovativeness may result in more favorable funding outcomes. By comparison, campaigns that feature greater radical innovativeness are riskier to develop, harder for crowdfunders to understand and result in less favorable funding outcomes. This negative effect of radical innovativeness may be mitigated by incremental innovativeness, which may help crowdfunders to understand and appreciate radical innovativeness more. A sample of 334 Kickstarter campaigns provides support for our hypotheses. Novelty reception
Hofstra, B., Kulkarni, V. V., Galvez, S. M. N., He, B., Jurafsky, D., & McFarland, D. A. (2020). The diversity–innovation paradox in science. Proceedings of the National Academy of Sciences, 117(17), 9284-9291. Prior work finds a diversity paradox: Diversity breeds innovation, yet underrepresented groups that diversify organizations have less successful careers within them. Does the diversity paradox hold for scientists as well? We study this by utilizing a near-complete population of ∼1.2 million US doctoral recipients from 1977 to 2015 and following their careers into publishing and faculty positions. We use text analysis and machine learning to answer a series of questions: How do we detect scientific innovations? Are underrepresented groups more likely to generate scientific innovations? And are the innovations of underrepresented groups adopted and rewarded? Our analyses show that underrepresented groups produce higher rates of scientific novelty. However, their novel contributions are devalued and discounted: For example, novel contributions by gender and racial minorities are taken up by other scholars at lower rates than novel contributions by gender and racial majorities, and equally impactful contributions of gender and racial minorities are less likely to result in successful scientific careers than for majority groups. These results suggest there may be unwarranted reproduction of stratification in academic careers that discounts diversity’s role in innovation and partly explains the underrepresentation of some groups in academia. Novelty reception
Brashears, M. E. (2008). Gender and homophily: Differences in male and female association in Blau space. Social Science Research, 37(2), 400-415. Homophily, the tendency for similar individuals to associate, is one of the most robust findings in social science. Despite this robustness, we have less information about how personal characteristics relate to differences in the strength of homophily. Nor do we know much about the impact of personal characteristics on judgments of relative dissimilarity. The present study compares the strength of age, religious, and educational homophily for male and female non-kin ties using network data from the 1985 General Social Survey. It also compares the patterning of ties among dissimilar alters for both sexes. The results of this exploratory effort indicate that males and females are almost equally homophilous, although religious homophily exerts a stronger influence on females than males. Males and females do, however, differ in their tendency to associate with certain types of dissimilar alters. Education is essentially uniform for both sexes, religious difference is more important for females than males, and those over sixty or under thirty are less different from the middle categories of age for females than for males. The results suggest that males are able to bridge larger areas of social space in their non-kin interpersonal networks and likely accumulate greater social capital as a consequence. Network and gender
Sgourev, S. V. (2013). How Paris gave rise to Cubism (and Picasso): Ambiguity and fragmentation in radical innovation. Organization Science, 24(6), 1601-1617. In structural analyses of innovation, one substantive question looms large: What makes radical innovation possible if peripheral actors are more likely to originate radical ideas but are poorly positioned to promote them? An inductive study of the rise of Cubism, a revolutionary paradigm that overthrew classic principles of representation in art, results in a model where not only the periphery moves toward the core through collective action, as typically asserted, but the core also moves toward the periphery, becoming more receptive to radical ideas. The fragmentation of the art market in early 20th-century Paris served as the trigger. The proliferation of market niches and growing ambiguity over evaluation standards dramatically reduced the costs of experimentation in the periphery and the ability of the core to suppress radical ideas. A multilevel analysis linking individual creativity, peer networks, and the art field reveals how market developments fostered Spanish Cubist Pablo Picasso’s experiments and facilitated their diffusion in the absence of public support, a coherent movement, and even his active involvement. If past research attests to the importance of framing innovations and mobilizing resources in their support, this study brings attention to shifts in the structure of opportunities to do so. Novelty reception
Wu, L., & Kane, G. C. (2021). Network-Biased Technical Change: How Modern Digital Collaboration Tools Overcome Some Biases but Exacerbate Others. Organization Science, 32(2), 273-292. Using three years’ data from more than 1,000 employees at a large professional services firm, we find that adopting an expertise search tool improves employee work performance in billable revenue, which results from improvements in network connections and information diversity. More importantly, we also find that adoption does not benefit all employees equally. Two types of employees benefit more from adoption of digital collaboration tools than others. First, junior employees and women benefit more from the adoption of digital collaboration tools than do senior employees and men, respectively. These tools help employees overcome the institutional barriers to resource access faced by these employees in their searches for expertise. Second, employees with greater social capital at the time of adoption also benefit more than others. The tools eliminate natural barriers associated with traditional offline interpersonal networks, enabling employees to network even more strategically than before. We explore the mechanisms for these differential benefits. Digital collaboration tools increase the volume of communication more for junior employees and women, indicating greater access to knowledge and expertise than they had before adoption. The tools also decrease the volume of communication for people with greater social capital, indicating more efficient access to knowledge and expertise. An important implication of our findings is that digital collaboration tools have the potential to overcome some of the demographic institutional biases that organizations have long sought to change. It does so, however, at the expense of potentially creating new biases toward network-based features—a characteristic we call “network-biased technical change.” Network and gender
Vásárhelyi, O., Zakhlebin, I., Milojević, S., & Horvát, E. Á. (2021). Gender inequities in the online dissemination of scholars’ work. Proceedings of the National Academy of Sciences, 118(39). Unbiased science dissemination has the potential to alleviate some of the known gender disparities in academia by exposing female scholars’ work to other scientists and the public. And yet, we lack comprehensive understanding of the relationship between gender and science dissemination online. Our large-scale analyses, encompassing half a million scholars, revealed that female scholars’ work is mentioned less frequently than male scholars’ work in all research areas. When exploring the characteristics associated with online success, we found that the impact of prior work, social capital, and gendered tie formation in coauthorship networks are linked with online success for men, but not for women—even in the areas with the highest female representation. These results suggest that while men’s scientific impact and collaboration networks are associated with higher visibility online, there are no universally identifiable facets associated with success for women. Our comprehensive empirical evidence indicates that the gender gap in online science dissemination is coupled with a lack of understanding the characteristics that are linked with female scholars’ success, which might hinder efforts to close the gender gap in visibility. Network and gender

This sample gives us a quick snapshot of our data: each paper’s citation, abstract, and its assigned research stream—either ‘Novelty reception’ or ‘Network and gender’. Next, we’re going to add layers of detail to our dataset. We’ll extract and create metadata such as publication year, author names, and more, thus setting the stage for deeper analysis.

# extract and create useful metadata for each paper
reading %>% 
  rename(research_stream = Topic) %>% 
  mutate(
  # create id for each paper based on the row number
  id = row_number(),
  
  # create publication year by extracting the first 4-digit number in the citation that starts with 20 (i.e., articles published in the 21st century) or with 19 (i.e., articles published in the 20th century)
  year = str_extract_all(Citation, "[2][0][0-9]{2}|[1][9][0-9]{2}", simplify = T)[,1] %>% as.numeric(), 
  
  # create decade based on year
  decade = (year %/% 10) * 10, 
  decade = paste0(decade, "s"),
  
  # create authors by extracting all characters from citation that appear before the closing bracket )
  authors = str_replace(Citation, "\\).*$", ""), 
  # remove the opening bracket ( from the above extracted text 
  authors = str_replace(authors, "\\(", ""),
  
  # create journal by detecting journal names in citation
  journal = case_when(  
      str_detect(Citation, regex("American Sociological Review", ignore_case = T)) ~ "ASR",
      str_detect(Citation, regex("Academy of Management journal", ignore_case = T)) ~ "AMJ",
      str_detect(Citation, regex("Academy of Management review", ignore_case = T)) ~ "AMR",
      str_detect(Citation, regex("Academy of Management discoveries", ignore_case = T)) ~ "Discoveries",
      str_detect(Citation, regex("Academy of Management Annals", ignore_case = T)) ~ "Annals",
      str_detect(Citation, regex("Academy of Management learning", ignore_case = T)) ~ "Learning",
      str_detect(Citation, regex("Administrative Science Quarterly", ignore_case = T)) ~ "ASQ",
      str_detect(Citation, "Management science") ~ "Management sci",
      str_detect(Citation, regex("American journal of sociology", ignore_case = T)) ~ "AJS",
      str_detect(Citation, "Scientific reports") ~ "Nature",
      str_detect(Citation, regex("nature", ignore_case = T)) ~ "Nature",      
      str_detect(Citation, regex("Social forces", ignore_case = T)) ~ "Soc forces",
      str_detect(Citation, "Entrepreneurship Theory and Practice") ~ "ETP",
      
      # Soc Net has to come after PSPB, Annual Review, JoM, and Org Science because a few papers in those journals have the phrase "social networks" in the title. We therefore match those four venues first, then assign the remaining papers with "Social Networks" in the citation to the journal Soc Net
      str_detect(Citation, "Journal of Management") ~ "JoM",
      str_detect(Citation, regex("Organization Science", ignore_case = T)) ~ "Org Sci",
      str_detect(Citation, regex("Personality and Social Psychology Bulletin", ignore_case = T)) ~ "PSPB",
      
      str_detect(Citation, regex("annual", ignore_case = T)) ~ "Annual review",
      str_detect(Citation, regex("Social Networks", ignore_case = T)) ~ "Soc Net", 
      str_detect(Citation, regex("Social Science Research", ignore_case = T)) ~ "Social science research",
      
      str_detect(Citation, regex("Journal of experimental social psychology", ignore_case = T)) ~ "JESP",
      str_detect(Citation, "Proceedings of the National Academy of Sciences") ~ "PNAS",
      str_detect(Citation, "Psychological review") ~ "Psyc review",
      str_detect(Citation, "Psychological science") ~ "Psyc science",
      str_detect(Citation, "Strategic Management Journal") ~ "SMJ",
      str_detect(Citation, "Journal of personality and social psychology") ~ "JPSP",
      str_detect(Citation, "Journal of Applied Psychology") ~ "JAP",

      str_detect(Citation, regex("Research Policy", ignore_case = T)) ~ "Research policy",
      str_detect(Citation, regex("\\bScience\\b", ignore_case = T)) ~ "Science",
      str_detect(Citation, regex("venturing", ignore_case = T)) ~ "JBV",
      str_detect(Citation, regex("consumer", ignore_case = T)) ~ "Consumer research",
      str_detect(Citation, regex("rationality", ignore_case = T)) ~ "Rationality",
      str_detect(Citation, regex("Social psychology quarterly", ignore_case = T)) ~ "Social psyc quarterly",
      str_detect(Citation, regex("Journal of Political Economy", ignore_case = T)) ~ "Journal of political economy",

      TRUE ~ "others"),
  
  # manually recode book chapters by assigning papers with the words ugly, spreading, or Routledge in the citation to "book chapter"
  journal = case_when( 
    str_detect(Citation, paste(c("ugly", "spreading", "Routledge"), collapse = '|')) ~ "book chapter",
    TRUE ~ as.character(journal)),
  
  # create fields based on journals
  field = case_when(
      journal == "AMJ" | journal == "Discoveries" | journal == "Annals" | 
      journal == "AMR" | journal == "Learning" | journal == "ASQ" | journal == "Management sci" | 
      journal == "JoM" | journal == "Org Sci" | journal == "SMJ" ~ "management",
      
      journal == "ASR" | journal == "AJS" | journal == "Soc forces" | journal == "Soc Net" ~ "sociology",
      
      journal == "Nature" | journal == "PNAS" | journal == "Science" ~ "science",
      
      journal == "ETP" | journal == "JBV" ~ "entrepreneurship",
      
      journal == "PSPB" | journal == "JESP" | journal == "Psyc review" | journal == "Psyc science" |
      journal == "JPSP" | journal == "JAP" ~ "psychology",
      
      journal == "Annual review" | journal == "book chapter" ~ "reviews and books",
      
      journal == "Research policy" | journal == "Consumer research" | journal == "Rationality" | 
      journal == "Social psyc quarterly" | journal == "Social science research" | 
      journal == "Journal of political economy" ~ "others"
  ),
  
  # get the title 
  paper_title = str_extract(Citation, "\\).*$") %>% # extract all strings in citation that appears after the closing bracket  
    str_replace("^\\S* ", "") %>% # remove everything from the beginning until the first white space
    str_replace("\\..*$", ""), # remove everything starting from the first period .
  
  # get the name of the first author by extracting the first word of the string variable authors
  first_author = word(authors, 1),
  
  # get the type of papers based on the journal
  type = case_when(journal == "Annals" | journal == "book chapter" | journal == "Annual review" | 
                     journal == "JoM" | str_detect(paper_title, 'distinctiveness') ~ "review",
                   
                   journal == "AMR" ~ "theory",
                   
                   TRUE ~ "empirical"),
  
  method = case_when(str_detect(Abstract, regex("experiment", ignore_case = T)) | 
                       str_detect(Citation, regex("experiment", ignore_case = T)) ~ "experiment",
                     
                     TRUE ~ "others")
) %>% 
  
  # merge the first author's name with the publication year to create an abbreviated id for each paper. Note that unite() names its new column first_author, which overwrites the original first_author column even with remove = F; only the original year column survives alongside the new combined one
  unite(first_author, c("first_author", "year"), sep = " ", remove = F) %>% 
  
  # reorder columns so that id and year (the numeric variables) are the first two columns
  relocate(where(is.numeric), .before = where(is.character)) %>% 
  relocate(paper_title, .after = id) %>% 
  janitor::clean_names() -> reading

saveRDS(reading, "reading.rds")
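As a quick spot check that the year regex behaves as intended, here is how it handles two made-up citations (illustrative strings, not rows from the actual sheet):

# spot check: the regex grabs the first 19xx/20xx number in each citation
str_extract_all(c("Smith, J. (2019). A title. Journal, 1(2), 3-4.",
                  "Doe, A. (1985). Another title. Journal, 5(6), 7-8."),
                "[2][0][0-9]{2}|[1][9][0-9]{2}", simplify = T)[,1] %>% as.numeric()
# returns: 2019 1985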

Let’s take a look at our new dataset now.

# generate a concise summary of the data
reading %>% skimr::skim() 
Data summary
Name Piped data
Number of rows 107
Number of columns 14
_______________________
Column type frequency:
character 12
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
category 15 0.86 20 78 0 23 0
paper_title 0 1.00 14 160 0 107 0
citation 0 1.00 87 271 0 107 0
abstract 0 1.00 609 1765 0 107 0
research_stream 0 1.00 17 18 0 2 0
first_author 0 1.00 8 17 0 105 0
decade 0 1.00 5 5 0 6 0
authors 0 1.00 13 93 0 106 0
journal 0 1.00 3 28 0 33 0
field 0 1.00 6 17 0 7 0
type 0 1.00 6 9 0 3 0
method 0 1.00 6 10 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 54.93 31.90 1 27.5 54 82.5 109 ▇▇▇▇▇
year 0 1 2014.23 8.53 1958 2012.0 2017 2019.0 2021 ▁▁▁▁▇

We now have more details for each paper such as the authors, publication year, and journal, among others.

Now that our dataset is shaped up, let’s dive into some specific analyses. We’ll start by visualizing the number of theoretical papers, empirical papers, and review papers I have on my reading list.

# Analyzing the distribution of paper types across research streams
reading %>% 
  # Grouping data by research stream for aggregated counts by paper type
  group_by(research_stream) %>% 
  # Counting occurrences of each paper type and sorting by count
  count(type, sort = T) %>% 
  ungroup() %>% 
  # Reordering 'type' based on descending count for better visualization
  mutate(type = reorder(type, -n)) %>%
  # specifying type as the x axis, count as the y axis, each type of paper will have a different color
  ggplot(aes(type, n, fill = type)) +
  # create bar chart
  geom_col(show.legend = FALSE) +
  # graph for each research stream
  facet_wrap(~ research_stream) +
  # customizing color for each type of paper
  scale_fill_manual(values = c("#F9A12EFF", "#FC766AFF", "#9B4A97FF")) +
  # setting up labels and title for the plot
  labs(x = NULL, 
       y = "Number of papers",
       title = "How many papers are theoretical, empirical, and reviews?") 


From the charts, it’s clear most of my readings are empirical papers, while review and theoretical papers are fewer. This makes sense given my focus on data-driven research.

Next, let’s examine the diversity of publication venues, sorted by fields within each research stream. This will help us understand where most research in my reading list is being published.

# Examining the publication venues by field within each research stream
reading %>% 
  group_by(research_stream) %>% 
  # Counting papers by field and sorting by count
  count(field, sort = T) %>% 
  ungroup()  %>% 
  # Reordering fields within each research stream based on count for coherent visual ordering
  mutate(field_1 = field,
         field = reorder_within(field, n, research_stream)
         ) %>% 
  # Creating a bar chart with flipped coordinates for better label visibility
  ggplot(aes(field, n, fill = field_1)) +
  geom_col(show.legend = FALSE) +
  # Using reordered scale for the x-axis and adjusting y-axis scale
  scale_x_reordered() +
  facet_wrap(~ research_stream, scales = "free_y") +
  coord_flip() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Number of papers",
       x = NULL,
       title = "How many papers were published in field-specific journals?") +
  theme(plot.title = element_text(size = 15))

It turns out a significant number of papers come from management journals, which makes sense since I’m doing a PhD in management. I also read quite a lot from sociology and psychology journals, although the sociology papers here are mostly about network and gender, whereas the psychology papers are mostly about novelty reception.

What about the number of papers across time?

# Analyzing the temporal distribution of papers
reading %>% 
  # Counting papers by publication year and topic, sorting by count
  count(year, research_stream, sort= T) %>% 
  ggplot(aes(year, n)) + 
  geom_col(show.legend = FALSE, fill = "cyan4") +
  # Faceting by research stream to see trends within each area of interest
  facet_grid(rows = vars(research_stream)) +
  labs(y = "Number of papers", x = NULL, title = "Temporal Distribution of Papers by Year")

It appears that most of the papers on my list were published within the last 10 years. The few papers published before 2000 must be the seminal papers on the topics I study.

I have also created a variable for the decade a paper was published in, which gives us a broader grouping than year. Let’s see how many papers in my list each decade has.

# Exploring the distribution of papers across decades
reading %>% 
  # Counting papers by decade and research stream, sorting by count
  count(decade, research_stream, sort = T) %>% 
  # Note: ggplot orders the character variable decade alphabetically ("1950s" through "2020s"), which happens to be chronological here; this arrange only sorts the underlying rows
  arrange(desc(readr::parse_number(decade)))  %>% 
  # Creating a column chart to visualize the number of papers per decade
  ggplot(aes(decade, n, group = 1)) +
  geom_col(show.legend = FALSE, fill = "cyan4") +
  # graph for each research stream
  facet_grid(rows = vars(research_stream)) +
  labs(x = NULL,
       y = "Number of papers",
       title = "How many papers are published in each decade?")

Cleaning and tokenizing paper abstracts

Now that our dataset is properly formatted, let’s dive into some exploratory text analysis of the abstracts from our collection of papers. First, let’s transform the text into a tidy format, clean it up by removing common stop words and numbers, and ensure consistency by converting plural nouns into their singular forms.

# Preliminary Text Analysis

# Preparing the abstracts for text analysis by tokenizing the text and cleaning it.
reading_unnest <- reading %>%
  # Selecting only the ID and abstract text columns for analysis
  select(id, abstract) %>% 
  # Tokenizing abstracts into individual words
  unnest_tokens(word, abstract) %>% 
  # Removing common English stop words to focus on relevant terms
  anti_join(get_stopwords()) %>% 
  # Filtering out tokens that are numeric as they likely do not contribute to thematic analysis
  filter(!str_detect(word, "[0-9]+")) %>% 
  # Normalizing plural words to their singular form to consolidate word forms
  mutate(word = case_when(str_detect(word, "[^e|aies$]ies$") ~ str_replace(word, "ies$", "y"),
                          str_detect(word, "[^e|a|oes$]es$") ~ str_replace(word, "es$", "e"),
                          str_detect(word, "[^ss$|us$]s$") ~ str_remove(word, "s$"),
                          TRUE ~ word))
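These three regexes are admittedly crude heuristics rather than proper lemmatization. As a quick sanity check, here is how they behave on a few illustrative words (my own examples, not drawn from the corpus):

# sanity check of the plural-to-singular heuristic on illustrative words
singularize <- function(word) {
  case_when(str_detect(word, "[^e|aies$]ies$") ~ str_replace(word, "ies$", "y"),
            str_detect(word, "[^e|a|oes$]es$") ~ str_replace(word, "es$", "e"),
            str_detect(word, "[^ss$|us$]s$") ~ str_remove(word, "s$"),
            TRUE ~ word)
}
singularize(c("studies", "networks", "outcomes", "women"))
# returns: "study" "network" "outcome" "women"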

Finding most frequent words and phrases

Now that our abstracts are tidied up, let’s find out which words pop up most frequently. This helps us understand the dominant terms that shape the discussions in our collected papers.

## Analysis of Most Frequent Words
reading_unnest %>% 
  # Count occurrences of each word and sort in descending order
  count(word, sort = T) %>% 
  # Focus on the top 20 most frequent words
  slice_max(n, n = 20) %>%
  # Reorder words for plotting based on their count
  mutate(word = reorder(word, n)) %>%
  # Create a bar plot of word frequencies
  ggplot(aes(word, n, fill = word)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 250)) +
  labs(x = NULL, y = "Number of times a word appears",
       title = "What are the most frequent words in paper abstracts?",
       subtitle = "After removing stop words and plurals")

Next, we turn our attention to bigrams to uncover common word pairs that give us deeper insights into the themes prevalent in the abstracts. This analysis reveals how certain concepts are typically discussed together in paper abstracts.

## Analysis of Most Frequent Bigrams
reading_unnest %>% 
  # Converting long-format data to a wide format to facilitate bigram creation
  group_by(id) %>% 
  mutate(ind = row_number()) %>% 
  ungroup() %>% 
  pivot_wider(id_cols = id, 
              names_from = ind, 
              values_from = word) %>% 
  # Replace missing values with an empty space for text cohesion
  mutate_all(replace_na, " ") %>% 
  # merge one-word columns together to form a continuous text for each abstract 
  unite(col = abstract_ns, 
        -id,
        sep = " ",
        remove = T) %>% 
  # break the text into bigrams
  unnest_tokens(bigram, abstract_ns, token = "ngrams", n = 2) -> reading_unnest_bigrams
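As an aside, the pivot-and-unite dance above is only one way to stitch the cleaned words back together. A simpler alternative that should yield the same bigrams (a sketch, assuming the same reading_unnest tibble) is to collapse the words per paper with a grouped summarise:

# alternative: collapse the cleaned words back into one string per abstract, then tokenize bigrams
reading_unnest %>% 
  group_by(id) %>% 
  summarise(abstract_ns = paste(word, collapse = " ")) %>% 
  unnest_tokens(bigram, abstract_ns, token = "ngrams", n = 2)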

# visualizing the most frequent bigrams
reading_unnest_bigrams %>% 
  # count the number of times a bigram appears in the corpus
  count(bigram, sort = TRUE) %>% 
  # get the 20 most frequent bigrams
  slice_max(n, n = 20) %>% 
  # arrange bigrams based on frequency count
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n, fill = bigram)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = NULL, 
       y = "Number of times a bigram appears",
       title = "What are the most frequent birgrams in paper abstracts?",
       subtitle = "After removing stop words and plurals") +
  theme(plot.title = element_text(size = 13))

The most frequently appearing words are “network”, “social”, “idea”, “study”, “women” and “gender”. The most common bigrams are “social network”, “social capital”, “creative ideas”, “men women”. This aligns perfectly with my research focus on the impact of social networks and the evaluation of creative ideas, especially in how these dynamics differ between men and women in professional settings.

Let’s drill down further and analyze the most frequent unigrams and bigrams within each specific research stream. This will help us pinpoint the key topics and discussions prevalent in each stream, providing a clearer picture of their thematic directions.

# Analyzing most frequent unigrams by topic
reading_unnest %>%
  # Joining research stream information to the unigrams
  left_join(reading %>% select(id, research_stream)) %>% 
  count(research_stream, word, sort = TRUE) %>% 
  group_by(research_stream) %>%
  # Selecting top 10 words in each research stream
  slice_max(n, n = 10) %>%
  ungroup() %>%
  # need reorder_within() because ggplot would otherwise order the words (a character variable) alphabetically. reorder_within() orders the words by frequency. Here "word" is what we want to order, "n" is what we want to order by, and "research_stream" is the group we want to reorder within
  ggplot(aes(reorder_within(word, n, research_stream), 
             n,
             fill = research_stream)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~research_stream, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c("#317773", "#E2D1F9")) +
  labs(y = "Word count", x = NULL, title = "What are the most frequent words in each research stream?")

# Repeating the process for bigrams
reading_unnest_bigrams %>% 
  mutate(id = as.numeric(id)) %>%  # id was coerced to character earlier (during the replace_na step), so convert it back before joining
  left_join(reading %>% select(id, research_stream) ) %>% 
  count(research_stream, bigram, sort = TRUE) %>%
  group_by(research_stream) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>% 
  # as above, reorder_within() orders the bigrams by frequency within each research_stream rather than alphabetically. Here "bigram" is what we want to order, "n" is what we want to order by, and "research_stream" is the group we want to reorder within
  ggplot(aes(reorder_within(bigram, n, research_stream), 
             n,
             fill = research_stream)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~research_stream, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = c("#317773", "#E2D1F9")) +
  labs(y = "Word count", x = NULL, title = "What are the most frequent bigrams in each research stream?") 

Highlighting distinctive terms with TF-IDF and weighted log odds

TF-IDF

While frequent word counts can tell us a lot, they don’t always give the full picture. That’s where Term Frequency-Inverse Document Frequency (TF-IDF) comes into play. TF-IDF helps us identify words that are not only common in a specific document but also rare across the entire document corpus, enhancing their significance. This method decreases the weight for words that occur frequently across documents, and increases the weight for words that are rare, highlighting unique aspects of each text.
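Concretely, a term’s tf-idf multiplies its frequency within a document by the log of (number of documents / number of documents containing the term). A toy illustration with bind_tf_idf on a made-up two-document corpus (purely for intuition, not real data):

# toy example: "apple" appears in only one of the two documents, so its idf is log(2/1);
# "tree" appears in both documents, so its idf (and hence tf-idf) is exactly 0
tibble(doc = c("d1", "d1", "d2", "d2"),
       word = c("apple", "tree", "tree", "house"),
       n = c(2, 1, 1, 2)) %>% 
  bind_tf_idf(word, doc, n)

In our analysis, each research stream plays the role of a “document,” so tf-idf flags words concentrated in one stream and rare or absent in the other.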

# Calculating TF-IDF to identify uniquely important words
reading_unnest %>% 
  # Joining research stream data to each word 
  left_join(reading %>% select(id, research_stream)) %>% 
  # Counting occurrences of each word within each research stream
  count(research_stream, word, sort = T) %>%
   # Calculating tf-idf for each word within research streams
  bind_tf_idf(word, research_stream, n) -> reading_tfidf

# Plotting the top 20 words with the highest tf-idf scores to identify uniquely important words
reading_tfidf %>% 
  # Select the top 20 words with the highest tf-idf scores
  slice_max(tf_idf, n = 20) %>% 
  # Reordering words based on tf-idf score for better visual representation
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = research_stream)) +
  # create bar chart
  geom_col() +
  # flip the x and y axis to better display word labels
  coord_flip() +
  # Adjusting text size for clarity
  theme(strip.text=element_text(size=11)) +
  # Set custom colors for each research stream
  scale_fill_manual(values = c("#0063B2FF", "#9CC3D5FF")) +
  # Adding labels and title to the chart
  labs(x = NULL, fill = NULL, y = "tf-idf",
       title = "What are the highest tf-idf words in the corpus?")

# Plotting the highest tf-idf words within each research stream to explore their specific vocabulary
reading_tfidf %>% 
  group_by(research_stream) %>%
  slice_max(tf_idf, n = 20) %>%
  ungroup() %>% 
  # reorder words based on tf-idf within each topic
  mutate(word = reorder_within(word, tf_idf, research_stream)) %>%
  # create graph of words and their tf-idf 
  ggplot(aes(word, tf_idf, fill = research_stream)) +
  # create bar chart
  geom_col(show.legend = FALSE) +
  # create graph for each topic
  facet_wrap(~ research_stream, scales = "free") +
  # reorder axis within facets
  scale_x_reordered() +
  # flip the x and y axis for better readability
  coord_flip() +
  theme(strip.text=element_text(size=11)) +
  scale_fill_manual(values = c("#0063B2FF", "#9CC3D5FF")) +
  labs(x = NULL, y = "tf-idf",
       title = "What are the highest tf-idf words in each research stream?")

# Remove the tf-idf dataframe from memory to free up resources
rm(reading_tfidf)

For papers focused on network and gender, words like “job,” “inequality,” “access,” “leadership,” “student,” and “board” stand out. These terms underline the study’s focus on career access, leadership roles, and the dynamics of inequality within professional settings. On the other hand, novelty reception papers frequently mention words such as “creativity,” “creative,” “novelty,” “contagion,” “selection,” and “product,” which reflect the discourse on how creative products are recognized, selected, and spread within the field.

This TF-IDF analysis helps us identify key themes and terms that define each research stream more effectively than mere frequency counts. It showcases the distinct language tied to different areas of study, providing us with a deeper understanding of the thematic underpinnings of the research.

Weighted log odds

Beyond frequency counts and TF-IDF, another compelling method to examine how words differ across groups is the weighted log odds ratio. Unlike TF-IDF, whose idf component depends only on whether a word appears in a group at all (so a word used in every group gets an idf of zero no matter how lopsided its usage), the weighted log odds ratio compares usage rates between groups and shrinks the estimates for rare words, making it particularly effective for pinpointing the words that uniquely characterize each group.
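To make this concrete, here is the same toy two-group corpus from the tf-idf section run through tidylo (again, made-up counts purely for intuition):

# toy example: even "tree", which both groups use, gets a signed weighted log odds
# reflecting which group over-uses it, whereas its tf-idf above was exactly 0
tibble(group = c("A", "A", "B", "B"),
       word = c("apple", "tree", "tree", "house"),
       n = c(10, 5, 4, 12)) %>% 
  tidylo::bind_log_odds(group, word, n)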

Let’s start by identifying words in our corpus that have the highest weighted log odds, which will help us understand which terms are especially prominent in one research stream compared to others.

# Calculating Weighted Log Odds for Word Frequencies Across Research Streams
reading_unnest %>% 
  left_join(reading %>% select(id, research_stream)) %>% # Merging research stream information with the word data
  count(research_stream, word, sort = T) %>% # Counting the frequency of each word per research stream
  tidylo::bind_log_odds(research_stream, word, n) -> reading_log_odds # Calculating weighted log odds

# Plotting the Words with the Highest Weighted Log Odds
reading_log_odds %>% 
  slice_max(log_odds_weighted, n = 20) %>% # Selecting the top 20 words with highest weighted log odds
  ggplot(aes(log_odds_weighted, reorder(word, log_odds_weighted), fill = research_stream)) +
  geom_col() + # Creating a bar chart
  scale_fill_manual(values = c("#1868ae", "#d9a5b3")) + # Custom color palette
  scale_x_continuous(expand = c(0, 0)) + # Adjusting the x-axis scale for better display
  labs(x = "Weighted log odds", y = NULL) # Labeling the axes

Now, let’s break down the highest weighted log odds words by each research stream to better capture their specific vocabulary.

reading_log_odds %>%
  group_by(research_stream) %>%  
  slice_max(log_odds_weighted, n = 20) %>% # Filtering the top 20 distinct words in each research stream
  ungroup() %>% 
  ggplot(aes(log_odds_weighted, 
             reorder_within(word, log_odds_weighted, research_stream), 
             fill = research_stream)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~research_stream, scales = "free_y") + # Creating a separate plot for each research stream
  scale_y_reordered() + # Creating a separate plot for each research stream
  scale_fill_manual(values = c("#1868ae", "#d9a5b3")) +
  scale_x_continuous(expand = c(0, 0)) +
  labs(x = "Weighted log odds", y = NULL)

To delve deeper, let’s graph the weighted log odds of each word against its frequency in the context of ‘Network and Gender’ research.

# Specific Word Analysis in the Context of Gender and Network Topics
reading_log_odds %>% 
  filter(str_detect(research_stream, "gender"), n > 10) %>% # Filtering for words in the 'gender' stream with sufficient frequency
  ggplot(aes(n, log_odds_weighted, label = word)) +
  geom_hline(yintercept = 0, color = "gray50", lty = 2, size = 1.5) + # Adding a reference line at 0 for neutral log odds
  geom_point(color = "midnightblue") + # Plotting points
  ggrepel::geom_text_repel(family = "IBMPlexSans") + # Adding labels to points
  scale_x_log10() + # Logarithmic scale for x-axis to handle wide data range
  labs(x = "Number of times a word appears",
       y = "Weighted log odds",
       title = "Which words are most specific to the network and gender research stream?") +
  theme(plot.title = element_text(size = 15))

From this graph, it seems that terms like “network” and “social” are particularly prevalent in ‘Network and Gender’ papers, highlighting their focus on social structures. Conversely, words like “study” are common but less distinctive since they appear frequently across both research streams.

Similarly, we can analyze the ‘Novelty Reception’ stream to identify which terms are uniquely significant there.

reading_log_odds %>% 
  filter(str_detect(research_stream, "reception"), n > 10) %>% 
  ggplot(aes(n, log_odds_weighted, label = word)) +
  geom_hline(yintercept = 0, color = "gray50", lty = 2, size = 1.5) +
  geom_point(color = "midnightblue") +
  ggrepel::geom_text_repel(family = "IBMPlexSans") +
  scale_x_log10() +
  labs(x = "Number of times a word appears",
       y = "Weighted log odds",
       title = "Which words are most specific to the novelty reception research stream?") +
  theme(plot.title = element_text(size = 15))

# Clean up the created data frame to free up memory
rm(reading_log_odds)

In the ‘Novelty Reception’ papers, “idea” is the most frequently occurring word, but it is not as distinctive of this stream as “creativity” because it also appears frequently in the ‘Network and Gender’ stream. Likewise, “social” and “network” are used quite often in this stream, but they are used much more often in the ‘Network and Gender’ stream and consequently have negative weighted log odds.

Conclusion and next steps

In this notebook, we’ve analyzed over a hundred research papers within two key areas: “Novelty Reception” and “Network and Gender.” We started by extracting metadata from each paper’s citation, such as the authors, publication year, and journal. Then, using TF-IDF and weighted log odds, we analyzed the abstracts to uncover the most significant words and phrases in discussions about how innovative ideas are accepted and how social networks influence the careers of men and women differently. We’ve presented these findings visually to make the trends and themes easier to understand.

Looking ahead, our next step is to apply topic modeling to these paper abstracts. This will help us identify and understand the main themes discussed across the papers in more detail, enhancing our understanding of the scholarly conversations on novelty reception and network and gender.
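As a preview, here is a minimal sketch of what that next step might look like with the stm package (the choice of stm and the number of topics, K = 6, are placeholders to be settled in the next notebook):

# sketch of the planned topic model: cast the tidy word counts into a
# document-feature matrix, then fit a structural topic model on the abstracts
library(stm)
reading_unnest %>% 
  count(id, word) %>% 
  cast_dfm(id, word, n) -> reading_dfm   # cast_dfm() comes from tidytext
topic_model <- stm(reading_dfm, K = 6, verbose = FALSE)
summary(topic_model)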