PubMedBridge User Guide

A Preprocessor for Auditable Country-Level Affiliation Resolution in Bibliometric Research

1. Overview

PubMedBridge is a suite of web-based tools that streamlines the preprocessing of PubMed data for bibliometric analysis. It provides a user-friendly interface to parse raw PubMed exports, convert them into structured formats, and prepare them for visualization and analysis software.

Key Features

  • User-Selected Metadata Fields – Choose from 34 metadata options grouped into core citation, author information, affiliations, country, publication content, publication details, and links
  • Defensive Resolution – Prioritizes precision over coverage, flagging ambiguous cases rather than forcing uncertain assignments
  • Transparent Output – All country resolutions are exported with their resolution method for verification and manual correction
  • Spreadsheet-Centric Workflow – Intermediate results exported as .xlsx for easy inspection and correction
  • Client-Side Processing – All operations execute locally in your browser, ensuring complete data privacy
  • Open Source – Fully transparent codebase available for inspection

2. System Architecture and Workflow

PubMedBridge consists of two modules, PubMed2XLSX and XLSX2PubMed, that work together in a human-in-the-loop curation workflow in which users verify and, when necessary, manually correct country assignments.

Step 1: Input & Automated Resolution
Upload PubMed .txt file → PubMed2XLSX processes → Export .xlsx spreadsheet

Step 2: Human Validation
Review in Excel/Sheets → Verify countries → Filter & curate data

Step 3: Output Generation
Upload curated .xlsx → XLSX2PubMed converts → Export PubMed .txt

Workflow Details

1. Input & Automated Resolution

Users upload a raw PubMed data file (.txt format) to the PubMed2XLSX tool and select metadata fields to be included in the output. The tool automatically parses the file, applies the country resolution algorithm to affiliation strings, and exports the results as a structured spreadsheet (.xlsx).

Key outputs:

  • Structured tabular data with user-selected metadata fields
  • Automated country assignments with resolution method labels
  • Flagged ambiguous or unresolved cases for manual review

2. Human-in-the-Loop Validation & Curation

The exported spreadsheet serves as an auditable dataset that users can review and refine using standard spreadsheet software. This critical step integrates domain expertise where algorithmic resolution is uncertain.

Two types of curation:

  1. Country Assignment Verification: Review records and manually correct or confirm country assignments based on contextual knowledge
  2. Dataset Filtering: Apply filters based on metadata fields to construct tailored datasets:
    • Remove ineligible records based on content of other fields (e.g., Abstract)
    • Exclude publications outside specified date ranges
    • Filter by author criteria or publication types
    • Apply custom inclusion/exclusion criteria

After validation, users can perform preliminary analyses directly on the spreadsheet or proceed to Step 3.

3. Output Generation

The XLSX2PubMed tool converts the curated spreadsheet back into PubMed format (.txt), ensuring compatibility with bibliometric analysis and visualization software such as VOSviewer, CiteSpace, and Bibliometrix.

Benefits:

  • Seamless integration with existing bibliometric workflows
  • Use of specialized visualization tools that require PubMed format
  • Sharing of curated datasets in standardized format
  • Closed-loop workflow maintaining original structure

3. Getting Started

3.1 Accessing PubMedBridge

No installation is required. Access PubMedBridge at pubmedbridge.drmyo.com:

  1. Click "Launch the Tool" for either PubMed2XLSX or XLSX2PubMed
  2. Begin processing your files

3.2 System Requirements

  • Modern web browser (Chrome, Firefox, Safari, or Edge recommended)
  • JavaScript enabled
  • Spreadsheet software for Step 2 curation (Excel, Google Sheets, LibreOffice Calc)
💡 Privacy Note: All processing occurs locally in your browser. Your data never leaves your computer, ensuring complete privacy and security.

4. Using PubMed2XLSX

4.1 Purpose

PubMed2XLSX parses the metadata in a PubMed export, resolves country names from the affiliation strings, and converts the result into structured XLSX and JSON files. It is well suited to data validation, filtering, and performance analysis.

4.2 Step-by-Step Instructions

Step 1: Prepare Your PubMed Data

  1. Conduct your search on PubMed (pubmed.ncbi.nlm.nih.gov)
  2. Click "Save" → "Save citations to file"
  3. Save the file in "PubMed" format with .txt extension
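
For readers who want to inspect an export before uploading it, the following is a minimal Python sketch of how a PubMed-format file is laid out: each field starts with a tag padded to four characters followed by "- ", wrapped values continue on lines indented by six spaces, and a blank line separates records. The function name parse_pubmed_text and the sample record are illustrative assumptions; this is not the parser PubMedBridge itself uses.

```python
from typing import Dict, List


def parse_pubmed_text(text: str) -> List[Dict[str, List[str]]]:
    """Parse PubMed-format text into records mapping each tag to its values."""
    records: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
            current, last_tag = {}, None
        elif line.startswith("      ") and last_tag is not None:
            # Continuation line: append to the most recent value.
            current[last_tag][-1] += " " + line.strip()
        elif len(line) > 5 and line[4] == "-":
            # Tag line such as "PMID- 123" or "TI  - Some title".
            last_tag = line[:4].strip()
            current.setdefault(last_tag, []).append(line[6:].strip())
    if current:
        records.append(current)
    return records


# Hypothetical sample data, not a real PubMed record.
sample = """PMID- 12345678
TI  - An example title that wraps
      onto a second line.
FAU - Doe, Jane
AD  - Department of Medicine, Example University, Yangon, Myanmar.
FAU - Roe, Richard
AD  - Example Institute, Boston, MA, USA.

PMID- 87654321
TI  - Another record.
"""

records = parse_pubmed_text(sample)
print(len(records), records[0]["TI"][0])
```

Repeated tags such as AD (affiliation) are kept as lists, which is why a single record can yield several affiliation strings.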

Step 2: Upload File

  1. Navigate to PubMed2XLSX tool
  2. Click "Choose File" or drag and drop your .txt file
  3. The tool will automatically detect the file format

Step 3: Select Metadata Fields

Choose which metadata fields to include in your output. Fields are organized into categories:

Category | Fields
Core Citation | PMID, Title, Journal, Journal Abbreviation, Publication Year, Volume, Pages
Author Information | First Author, Last Author, Co-Authors
Country | All Countries, First Author Country, Last Author Country, Co-Author Countries, Non-first Author Countries
Affiliations | All Affiliations, First Author Affiliation, Last Author Affiliation, Co-Author Affiliations, Non-first Author Affiliations
Publication Content | Abstract, Keywords, MeSH Terms, Major MeSH Terms, Publication Type, Country of Publication, Language
Publication Details | ISSN, PMCID, Secondary Source ID, Grant Numbers
Links | DOI, PubMed Link
💡 Tip: Select all fields initially. You can always filter or hide columns later in Excel.

Step 4: Process the File

  1. Click "Process"
  2. Wait for the processing to complete (progress bar will show status)
  3. Large files may take several minutes

Step 5: Download Results

Two files will be generated:

  • .xlsx file – Structured spreadsheet for review and curation
  • .json file – Machine-readable format for advanced users

5. Data Curation and Validation

5.1 Opening the Spreadsheet

Open the generated .xlsx file in your preferred spreadsheet software:

  • Microsoft Excel
  • Google Sheets
  • LibreOffice Calc
  • Apple Numbers
💡 Best Practice - Improving Readability: To make manual review easier in Excel:
  1. Select your data range (including headers)
  2. Go to Insert → Table (or press Ctrl+T / Cmd+T)
  3. Confirm the range and check "My table has headers"
  4. Select the entire table
  5. Go to Format Cells → Alignment
  6. Enable "Wrap Text"
  7. Adjust row heights as needed
This converts your data to a Table format (with filter dropdowns) and wraps text in cells. Fields like Country, Affiliations, and Authors will now display line-by-line, making it much easier to review multi-value entries and identify issues.
⚠️ Important: Do not modify the column headers in the XLSX file. These are required for the XLSX2PubMed tool to function correctly.

5.2 Understanding Country Resolution Methods

Each record includes a "Country Resolution Method" field that indicates how the country was determined:

Method | Description | Confidence
Direct Match | Country name identified. | High
alpha3 | Country alpha3 code identified. | High
US State Name | US state name identified. | High
US State Abbreviation | US state abbreviation identified. | High
Institution Name | Institution name matched in reference institution database. | Low
Institution City | Institution city matched in reference institution database. | Low
USGeorgiaToCheck | Cannot disambiguate between Georgia (US State) and Georgia (Country). Country could not be determined. | No
Institution Name Confusion | Identical institution name in more than one country. Country could not be determined. | No
Institution City Confusion | Identical institution city in more than one country. Country could not be determined. | No
UNRESOLVED | Country could not be determined. | No
Contribution Note | Not an affiliation string. | -
Filtered String | Not an affiliation string. | -
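
To make the method labels concrete, here is a rough Python sketch of a rule cascade that returns a country together with the label of the rule that fired. The lookup tables are tiny placeholders, the order of the rules is an assumption, and the function resolve_country is hypothetical; the actual PubMedBridge algorithm is more extensive and should be inspected in its own repository.

```python
import re
from typing import Optional, Tuple

# Tiny illustrative tables (assumptions); a real resolver needs complete lists.
COUNTRIES = {"myanmar": "Myanmar", "japan": "Japan", "germany": "Germany"}
ALPHA3 = {"MMR": "Myanmar", "JPN": "Japan", "DEU": "Germany", "USA": "United States"}
US_STATES = {"texas", "ohio"}
US_STATE_ABBR = {"TX", "OH", "MA"}


def resolve_country(affiliation: str) -> Tuple[Optional[str], str]:
    """Return (country, method label); None means the case needs manual review."""
    tokens = re.findall(r"[A-Za-z]+", affiliation)
    lowered = [t.lower() for t in tokens]
    for t in lowered:
        if t in COUNTRIES:
            return COUNTRIES[t], "Direct Match"
    for t in tokens:  # alpha3 codes are matched case-sensitively
        if t in ALPHA3:
            return ALPHA3[t], "alpha3"
    if "georgia" in lowered:
        # US state or country: flag it rather than guess.
        return None, "USGeorgiaToCheck"
    for t in lowered:
        if t in US_STATES:
            return "United States", "US State Name"
    for t in tokens:
        if t in US_STATE_ABBR:
            return "United States", "US State Abbreviation"
    return None, "UNRESOLVED"


print(resolve_country("Example Clinic, Houston, TX"))
print(resolve_country("Example State University, Georgia"))
```

The sketch only matches single-word tokens, so multi-word names such as "New York" would need extra handling. The point it illustrates is the defensive design: an ambiguous string yields a flag, not a forced assignment.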

5.3 Verifying Country Assignments

Priority Order

  1. UNRESOLVED
  2. Institution City Confusion and Institution Name Confusion
  3. USGeorgiaToCheck
  4. Institution City and Institution Name
  5. US State Abbreviation and Name
  6. alpha3
  7. Direct Match
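
If you prefer to order the review queue with a script instead of spreadsheet sorting, the priority list above can be expressed as a sort key. This Python sketch assumes the rows have already been loaded as dictionaries keyed by column header, and that a cell may hold several method labels on separate lines (one per affiliation); both are assumptions about your data, and placing the two non-affiliation labels last is also a choice made here, not something the guide specifies.

```python
REVIEW_PRIORITY = {
    "UNRESOLVED": 1,
    "Institution City Confusion": 2,
    "Institution Name Confusion": 2,
    "USGeorgiaToCheck": 3,
    "Institution City": 4,
    "Institution Name": 4,
    "US State Abbreviation": 5,
    "US State Name": 5,
    "alpha3": 6,
    "Direct Match": 7,
    # Assumption: non-affiliation labels are reviewed last.
    "Contribution Note": 8,
    "Filtered String": 8,
}


def row_priority(row):
    """Most urgent (lowest) priority among the labels in a row; unknown labels sort first."""
    cell = str(row.get("Country Resolution Method", "") or "")
    labels = [part.strip() for part in cell.splitlines() if part.strip()]
    return min((REVIEW_PRIORITY.get(label, 0) for label in labels), default=0)


def review_order(rows):
    """Return rows sorted so the cases needing the most attention come first."""
    return sorted(rows, key=row_priority)


rows = [
    {"PMID": "1", "Country Resolution Method": "Direct Match"},
    {"PMID": "2", "Country Resolution Method": "Direct Match\nUNRESOLVED"},
    {"PMID": "3", "Country Resolution Method": "Institution Name"},
    {"PMID": "4", "Country Resolution Method": "USGeorgiaToCheck"},
]
print([r["PMID"] for r in review_order(rows)])
```

Unknown or empty labels are deliberately sorted to the front so that nothing unexpected is skipped during review.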

Manual Correction Process

  1. Review the affiliation string
  2. Identify the correct country using contextual information
  3. Enter the country name in the Country column
  4. Optionally update the Country Resolution Method to "Manual Correction"
💡 Best Practice: Use Excel's filter and sort features to group similar cases together for efficient batch review.

5.4 Dataset Filtering and Refinement

Common Filtering Scenarios

  • Date Range: Filter by Publication Year to include only publications within your study period
  • Publication Type: Filter by Publication Type to include only original research articles
  • Language: Filter by Language if needed for your analysis
  • Abstract Content: Search within abstracts to identify relevant studies
  • Author Criteria: Filter by author names or affiliations
💡 Best Practice: Use PubMed's native filters for broad criteria (date, language, type) to reduce initial dataset size. Use PubMedBridge filtering for nuanced refinements based on country assignments, affiliation details, or complex content analysis after you've reviewed the data.
⚠️ Remember: Save your curated file before proceeding to XLSX2PubMed. Keep the original column headers unchanged.
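
The filtering scenarios in this section can also be scripted. The Python sketch below assumes rows loaded as dictionaries that use the column headers "Publication Year", "Language", and "Abstract" from the field list in Section 4.2; the function filter_rows and the sample rows are hypothetical.

```python
def filter_rows(rows, start_year, end_year, languages=None, abstract_term=None):
    """Keep rows inside the year range that match the optional language and abstract filters."""
    kept = []
    for row in rows:
        try:
            # float() first so spreadsheet values such as 2019.0 are accepted.
            year = int(float(str(row.get("Publication Year", "")).strip()))
        except ValueError:
            continue  # no usable year: excluded here; you may prefer manual review
        if not start_year <= year <= end_year:
            continue
        if languages and str(row.get("Language", "")).strip().lower() not in languages:
            continue
        if abstract_term and abstract_term.lower() not in str(row.get("Abstract", "")).lower():
            continue
        kept.append(row)
    return kept


# Hypothetical rows for illustration only.
rows = [
    {"PMID": "1", "Publication Year": "2019", "Language": "eng", "Abstract": "A cohort study of malaria."},
    {"PMID": "2", "Publication Year": "2012", "Language": "eng", "Abstract": "Malaria vaccine trial."},
    {"PMID": "3", "Publication Year": "2021", "Language": "fre", "Abstract": "Malaria surveillance."},
    {"PMID": "4", "Publication Year": "2020", "Language": "eng", "Abstract": "Dengue outbreak report."},
]
selected = filter_rows(rows, 2015, 2023, languages={"eng"}, abstract_term="malaria")
print([r["PMID"] for r in selected])
```

Whatever tool you filter with, write the result back with the original column headers so that XLSX2PubMed can read it.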

6. Using XLSX2PubMed

6.1 Purpose

XLSX2PubMed takes a structured .xlsx file that was generated by PubMed2XLSX and then curated, and converts it back into standard PubMed format. This enables use with bibliometric analysis and visualization software.
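
To illustrate what converting back into PubMed format involves, the following Python sketch writes one record as tagged lines, with long values wrapped onto continuation lines indented by six spaces. The 80-character width and the functions format_field and record_to_pubmed are assumptions made for this example; they do not describe how XLSX2PubMed is implemented.

```python
import textwrap


def format_field(tag: str, value: str, width: int = 80) -> str:
    """Format one field as 'TAG - value', wrapping onto six-space continuation lines."""
    prefix = f"{tag:<4}- "  # tag padded to four characters, then "- "
    wrapped = textwrap.wrap(value, width=width - len(prefix)) or [""]
    lines = [prefix + wrapped[0]] + ["      " + part for part in wrapped[1:]]
    return "\n".join(lines)


def record_to_pubmed(record) -> str:
    """Serialize a record given as a mapping from tag to list of values."""
    lines = []
    for tag, values in record.items():
        for value in values:
            lines.append(format_field(tag, value))
    return "\n".join(lines) + "\n"


# Hypothetical record for illustration only.
record = {"PMID": ["12345678"], "TI": ["A short title."]}
print(record_to_pubmed(record), end="")
```

When several records are written to one file, separate them with a blank line.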

6.2 Step-by-Step Instructions

Step 1: Prepare Your Curated File

  • Ensure all manual corrections are complete
  • Verify column headers remain unchanged
  • Save your .xlsx file

Step 2: Upload to XLSX2PubMed

  1. Navigate to the XLSX2PubMed tool
  2. Click "Choose File" or drag and drop your .xlsx file
  3. The tool will validate the file format and convert to PubMed .txt format
  4. Wait for processing to complete
  5. Progress indicator will show conversion status

Step 3: Download Outputs

Two files will be generated:

  • .txt file – Standard PubMed format ready for analysis software
  • .json file – Structured data format for advanced applications

6.3 Using Your Output Files

The PubMed .txt file is ready to use with any bibliometric analysis software that accepts standard PubMed format, including VOSviewer, CiteSpace, Bibliometrix, and others. Refer to your analysis software's documentation for specific import instructions.

7. Troubleshooting

File Upload Fails

Problem: File won't upload or shows error message

Solutions:

  • Verify file is in correct format (.txt for PubMed2XLSX, .xlsx for XLSX2PubMed)
  • Check file isn't corrupted or empty
  • Try a different browser
  • Ensure JavaScript is enabled

Processing Takes Too Long

Problem: Tool appears frozen or processing doesn't complete

Solutions:

  • Large files (10,000+ records) may take several minutes
  • Check browser console for errors (F12 key)
  • Try splitting large files into smaller batches
  • Ensure sufficient RAM is available

Security Check Error with PMID

Problem: Processing fails with error referencing specific PMID

Solutions:

  1. Locate the PMID mentioned in the error in your .txt file
  2. Check that record for HTML tags, special characters, or unusual patterns
  3. Common issues: <script> tags, excessive character repetition, HTML-like syntax
  4. Edit the problematic text (typically in abstract or title)
  5. Remove or replace the triggering content
  6. Re-upload the modified file

Note: Security checks prevent harmful content injection. Technical abstracts (e.g., bioinformatics, computer science) may occasionally trigger false positives and require manual editing.
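
If the error message does not make the cause obvious, a small script can help locate candidate records. The Python sketch below scans a PubMed export for HTML-like tags and long runs of one repeated character, and reports the PMIDs involved. The two patterns and the 50-character threshold are guesses for illustration; the rules PubMedBridge actually applies are not documented here and may differ.

```python
import re

TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")  # e.g. <script>, </div>
REPEAT_PATTERN = re.compile(r"(.)\1{49,}")       # one character 50+ times in a row


def suspicious_pmids(text):
    """Return (pmid, reasons) for each record that matches a pattern."""
    hits = []
    for block in re.split(r"\n\s*\n", text.strip()):
        match = re.search(r"^PMID-\s*(\d+)", block, flags=re.MULTILINE)
        pmid = match.group(1) if match else "unknown"
        reasons = []
        if TAG_PATTERN.search(block):
            reasons.append("HTML-like tag")
        if REPEAT_PATTERN.search(block):
            reasons.append("repeated character")
        if reasons:
            hits.append((pmid, reasons))
    return hits


# Hypothetical sample data.
sample = (
    "PMID- 111\nAB  - Payload <script>alert(1)</script> in abstract.\n\n"
    "PMID- 222\nAB  - Difference was significant (p < 0.05).\n"
)
print(suspicious_pmids(sample))
```

A comparison such as "p < 0.05" is not flagged by these patterns, but text like "a <b and c> d" would be, which mirrors the kind of false positive described in the note above.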

Long Text Truncated in XLSX

Problem: Very long abstracts or affiliation lists appear cut off in Excel

Cause: Excel caps each cell at 32,767 characters, and PubMedBridge applies a stricter limit of 25,000 characters per cell for cross-platform compatibility.

Solutions:

  • Use the .json output for complete, untruncated text
  • Look up the PMID directly on PubMed for full content
  • Most abstracts are under this limit; truncation is rare

Note: For text analysis or NLP work, always use the JSON file which contains complete content.
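
To find which records to look up in the JSON output, you can flag cells whose length has reached the limit. This Python sketch assumes rows loaded as dictionaries and assumes that a truncated cell is at least 25,000 characters long; if the tool marks truncation differently, adjust the check accordingly.

```python
CELL_LIMIT = 25_000  # limit stated in this guide


def possibly_truncated(rows, limit=CELL_LIMIT):
    """Return (PMID, column) pairs for text cells that reach the character limit."""
    flagged = []
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and len(value) >= limit:
                flagged.append((row.get("PMID"), column))
    return flagged


# Hypothetical rows for illustration only.
rows = [
    {"PMID": "1", "Abstract": "x" * 25_000},
    {"PMID": "2", "Abstract": "A short abstract."},
]
print(possibly_truncated(rows))
```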

Column Headers Modified Error

Problem: XLSX2PubMed rejects your file

Solutions:

  • Ensure you haven't renamed any column headers
  • Check for extra spaces in header names
  • Verify you're using the original file from PubMed2XLSX
  • If headers were modified, regenerate from original PubMed file
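
A quick way to pinpoint the offending header is to compare the header row of the curated file against the header row of the untouched PubMed2XLSX output. The Python sketch below works on two lists of header strings, however you obtained them; the function header_problems is hypothetical.

```python
def header_problems(expected, actual):
    """List expected headers that are missing, renamed, or padded with spaces."""
    problems = []
    present = set(actual)
    stripped = {header.strip(): header for header in actual}
    for header in expected:
        if header in present:
            continue
        if header in stripped:
            problems.append(f"extra spaces around {header!r}: found {stripped[header]!r}")
        else:
            problems.append(f"missing or renamed: {header!r}")
    return problems


# Hypothetical header rows for illustration only.
original = ["PMID", "Title", "All Countries"]
curated = ["PMID", "Title ", "Countries"]
for problem in header_problems(original, curated):
    print(problem)
```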

Downloaded File Won't Open

Problem: XLSX file appears corrupted or won't open

Solutions:

  • Ensure download completed fully
  • Try opening with different spreadsheet software
  • Clear browser cache and regenerate file
  • Check file size isn't zero bytes

8. Frequently Asked Questions

General Questions

Q: Is my data secure when using PubMedBridge?

A: Yes. All processing occurs entirely within your web browser. No data is uploaded to external servers. Your files remain on your computer throughout the entire workflow.

Q: Do I need to install any software?

A: No installation is required for PubMedBridge itself. You only need a modern web browser. However, you will need spreadsheet software (Excel, Google Sheets, etc.) for the curation step.

Q: What file size limits exist?

A: PubMedBridge has a 100MB file size limit to prevent browser freezing and ensure stable processing. In practice, this is rarely an issue—PubMed's maximum export of 10,000 records per file typically results in files well under this limit. The file size depends on abstract length and metadata completeness.
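
As a rough pre-upload check, the size and record count of an export can be measured with a few lines of Python. The sketch below reads "100MB" as 100 × 1024 × 1024 bytes and counts lines beginning with "PMID- " as records; both are assumptions, and the temporary file only stands in for a real export.

```python
import os
import tempfile

MAX_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB
MAX_RECORDS = 10_000           # PubMed export limit mentioned in this guide


def check_export(path):
    """Report file size and record count against the limits above."""
    size = os.path.getsize(path)
    with open(path, encoding="utf-8", errors="replace") as handle:
        count = sum(1 for line in handle if line.startswith("PMID- "))
    return {
        "bytes": size,
        "records": count,
        "size_ok": size <= MAX_BYTES,
        "records_ok": count <= MAX_RECORDS,
    }


# Demo on a small temporary file standing in for a real export.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("PMID- 1\nTI  - First.\n\nPMID- 2\nTI  - Second.\n")
    demo_path = tmp.name

result = check_export(demo_path)
os.remove(demo_path)
print(result["records"], result["size_ok"], result["records_ok"])
```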

Q: Can I use this with databases other than PubMed?

A: Currently, PubMedBridge is optimized for PubMed format only. Support for other databases may be added in future versions.

Technical Questions

Q: Why are some affiliations marked as "UNRESOLVED"?

A: The algorithm prioritizes accuracy over completeness. If the country cannot be determined with reasonable confidence, it's flagged for manual review rather than guessed.

Q: Can I add custom country resolution rules?

A: The current version doesn't support custom rules through the interface, but the open-source code can be modified by advanced users.

Q: Where can I inspect the country resolution algorithm code?

A: The country resolution algorithm is open-source and available at github.com/drmyo/pmbalgorithm. This repository provides an interactive, browser-based showcase of the exact country-resolution logic used in PubMedBridge.

What's included:

  • Interactive interface for testing the algorithm with custom affiliation strings
  • Complete source code showing the hierarchical, rule-based approach
  • Validation documentation (VALIDATION.md) with study design and accuracy results
  • Development guide (DEVELOPMENT.md) for contributors and maintainers

Note: This repository is specifically designed for validation, inspection, and methodological transparency. It is not the full PubMedBridge system and is not intended for production use - for actual data processing, use the main PubMedBridge tool at pubmedbridge.drmyo.com.

Q: What database does PubMedBridge use to match institution names and cities?

A: PubMedBridge uses a unified reference dataset combining two major sources:

  • ROR (Research Organization Registry) v1.74: A community-led registry of 120,196 research organizations with curated institution names, aliases, cities, and countries
  • OpenAlex: An open catalog of scholarly entities, contributing 115,781 institutions (107,709 with country data) for broader coverage

These datasets are merged and deduplicated to create a comprehensive reference of 120,428 unique institutions worldwide. When the algorithm encounters an institution name or city in an affiliation string, it searches this database to identify the corresponding country.

Why combine both sources? ROR provides high-quality curated data, while OpenAlex offers broader coverage. Together, they maximize the algorithm's ability to correctly resolve affiliations from diverse institutions globally.
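
The idea of a merged, deduplicated reference set, and of the "Institution Name Confusion" label, can be sketched in Python as follows. The record layout (name, country), the normalization, and the functions build_index and lookup are simplifying assumptions; the sample entries are invented and are not taken from ROR or OpenAlex.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def build_index(*sources):
    """Merge (name, country) pairs from several sources into name -> set of countries."""
    index = defaultdict(set)
    for source in sources:
        for name, country in source:
            if country:  # entries without country data cannot resolve anything
                index[normalize(name)].add(country)
    return index


def lookup(index, name):
    """Resolve a name only when it maps to exactly one country."""
    countries = index.get(normalize(name), set())
    if len(countries) == 1:
        return next(iter(countries)), "Institution Name"
    if len(countries) > 1:
        return None, "Institution Name Confusion"
    return None, "UNRESOLVED"


# Invented entries for illustration only.
source_a = [("Example University", "Japan"), ("Sample Institute", "Kenya")]
source_b = [("example  university", "Japan"), ("Sample Institute", "Canada"), ("No Country Lab", None)]
index = build_index(source_a, source_b)
print(lookup(index, "Example University"))
print(lookup(index, "Sample Institute"))
```

Storing a set of countries per name makes deduplication and ambiguity detection fall out of the same structure: identical entries collapse, while conflicting ones become visible.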

Q: How accurate is the automated country resolution?

A: Precision is very high, while recall is deliberately conservative. The algorithm was validated using a stratified random sample of 430 affiliation strings drawn from 9,931 PubMed articles (108,557 affiliation strings in total).

Validation results:

  • High-confidence methods (direct country matches, alpha-3 codes, US state names and abbreviations, covering 96.2% of resolved affiliations): Manual verification of 100 randomly sampled records from each category confirmed all assignments were correct
  • Overall algorithm performance:
    • 99.6% precision: Almost all country assignments made are correct
    • 98.7% specificity: Correctly identifies truly ambiguous cases
    • 66.4% recall: Successfully resolves about two-thirds of affiliations; the remaining third are flagged for manual review
    • F1-score: 0.780: Reflects the balance between high precision and conservative recall

The algorithm uses a conservative, defensive approach—it flags uncertain cases for manual review rather than risk incorrect assignments. This design prioritizes data quality and ensures users can verify all country resolutions. Complete validation methodology, including stratified sampling strategy and detailed results, is available in the algorithm repository.

Workflow Questions

Q: Can I skip the manual curation step?

A: Technically yes, but this defeats the primary purpose of PubMedBridge. Human-in-the-loop validation with auditable output is the essence of the tool.

PubMedBridge is specifically designed around the principle that automated algorithms should be verified by domain experts, not blindly trusted. The tool provides:

  • Transparent resolution methods for every country assignment
  • Flagged uncertain cases that require expert judgment
  • Auditable spreadsheet format enabling verification and correction
  • Domain expertise integration where algorithmic confidence is low

While the algorithm achieves high accuracy (99.6% precision), the ~34% of cases flagged for review often represent complex, ambiguous, or critical affiliations that benefit from human verification. Skipping manual curation means accepting potentially incorrect assignments in these cases and missing opportunities to refine your dataset based on research-specific criteria.

⚠️ Recommended Practice: Always perform manual curation, even if brief. At minimum, review all "UNRESOLVED" and "Confusion" cases, and verify that your dataset meets your specific inclusion/exclusion criteria.

Q: What if I need to make changes after converting back to PubMed format?

A: Keep your curated XLSX file. You can make changes there and re-run XLSX2PubMed to generate a new PubMed file.

Q: Can I merge multiple PubMed files?

A: Yes, as long as the merged total doesn't exceed 10,000 records (matching PubMed's download limit).

Process each file through PubMed2XLSX, then merge by copying/pasting rows in Excel. Remove duplicates by PMID, perform manual curation, then convert through XLSX2PubMed. If your total exceeds 10,000 records, keep files separate or split into batches.
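
For larger merges, the copy, paste, and deduplicate steps can be scripted. This Python sketch assumes each file has been loaded as a list of row dictionaries with a "PMID" column; the function merge_by_pmid is hypothetical, and the first occurrence of each PMID is the one kept.

```python
def merge_by_pmid(*batches, limit=10_000):
    """Concatenate batches, keep the first row per PMID, and enforce the record limit."""
    seen = set()
    merged = []
    for batch in batches:
        for row in batch:
            pmid = str(row["PMID"]).strip()
            if pmid in seen:
                continue  # duplicate record from an overlapping search
            seen.add(pmid)
            merged.append(row)
    if len(merged) > limit:
        raise ValueError(f"{len(merged)} records exceed the {limit}-record limit; split into batches")
    return merged


# Hypothetical rows for illustration only.
file_a = [{"PMID": "1", "Title": "First"}, {"PMID": "2", "Title": "Second"}]
file_b = [{"PMID": "2", "Title": "Second"}, {"PMID": "3", "Title": "Third"}]
merged = merge_by_pmid(file_a, file_b)
print([row["PMID"] for row in merged])
```

If you have already corrected countries in one of the files, load that file first so its curated rows are the ones retained.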

Q: How do I cite PubMedBridge in my research?

A: Please cite PubMedBridge using the following format:

Citation:

Tha, M., & Khin, N. (2026). PubMedBridge: A Preprocessor for Auditable Country-Level Affiliation Resolution in Bibliometric Research (Version 1.0) [Computer software]. https://pubmedbridge.drmyo.com

For the country resolution algorithm specifically:

Tha, M., & Khin, N. (2026). PubMedBridge Country Resolution Algorithm [Computer software]. https://doi.org/10.5281/zenodo.18212014

BibTeX:

@software{tha2026pubmedbridge,
  author = {Tha, Myo and Khin, Nilar},
  title = {PubMedBridge: A Preprocessor for Auditable Country-Level Affiliation Resolution in Bibliometric Research},
  year = {2026},
  version = {1.0},
  url = {https://pubmedbridge.drmyo.com},
  note = {Web application}
}

If discussing the methodology or validation of country resolution specifically, also cite the open-source algorithm repository.