PHD Discussions

Ask, Learn and Accelerate in your PhD Research



What are the standard data cleaning and preprocessing steps required before performing a bibliometric analysis?

After exporting my dataset, I'm faced with thousands of records with inconsistent author names, duplicate entries, and strange journal abbreviations. I know this "janitor work" is vital, but could you outline the standard preprocessing pipeline to ensure my analysis isn't built on flawed data?

All Answers (1 Answer In All)

By Priya Answered 1 year ago

This is the unglamorous but most crucial phase. I always treat raw export data as messy and untrustworthy. The standard pipeline involves:

1) Deduplication: merging records for the same paper imported from different searches or databases.

2) Author & affiliation disambiguation: standardizing variants such as "Smith, J." and "Smith, John" and handling institutional name changes.

3) Source & keyword normalization: correcting journal name abbreviations and merging synonymous keywords.

I use a combination of software features (such as those in Bibliometrix or VOSviewer) and manual checking. Skipping this step will produce elegant but misleading maps, so invest significant time here.
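For anyone who prefers to script this outside of Bibliometrix or VOSviewer, here is a minimal sketch of those three steps in pandas. The column names ("title", "year", "author", "keywords") and the lookup tables are illustrative assumptions, not the field labels of any real database export; Scopus and Web of Science each use their own.

```python
# Sketch of the three-step cleaning pipeline: deduplication, author
# disambiguation, keyword normalization. Column names and the mapping
# dictionaries are assumed for illustration only.
import pandas as pd

def clean_bibliometric_data(df, author_map, keyword_map):
    """Deduplicate records, disambiguate authors, and normalize keywords."""
    df = df.copy()

    # 1) Deduplication: treat records with the same normalized title and
    #    year as the same paper, keeping the first occurrence.
    df["title_key"] = (df["title"].str.lower()
                       .str.replace(r"[^a-z0-9]", "", regex=True))
    df = df.drop_duplicates(subset=["title_key", "year"]).drop(columns="title_key")

    # 2) Author disambiguation: map name variants ("Smith, J.") to one
    #    canonical form ("Smith, John") via a hand-built lookup table.
    df["author"] = df["author"].map(lambda a: author_map.get(a, a))

    # 3) Keyword normalization: merge synonymous keywords the same way,
    #    collapsing duplicates that the mapping creates.
    df["keywords"] = df["keywords"].map(
        lambda kws: sorted({keyword_map.get(k, k) for k in kws})
    )
    return df.reset_index(drop=True)

# Toy records with one duplicate paper and inconsistent names/keywords.
raw = pd.DataFrame({
    "title": ["Citation Networks", "Citation networks!", "Co-word Analysis"],
    "year": [2020, 2020, 2021],
    "author": ["Smith, J.", "Smith, John", "Doe, A."],
    "keywords": [["bibliometrics"], ["Bibliometrics"],
                 ["co-word", "bibliometrics"]],
})
author_map = {"Smith, J.": "Smith, John"}
keyword_map = {"Bibliometrics": "bibliometrics", "co-word": "co-word analysis"}

clean = clean_bibliometric_data(raw, author_map, keyword_map)
print(len(clean))              # 2 (the duplicate paper was merged)
print(clean["author"].tolist())  # ['Smith, John', 'Doe, A.']
```

In practice the `author_map` and `keyword_map` tables are where the manual checking happens: you build them iteratively by eyeballing frequency lists of names and keywords, which no tool can fully automate.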
