Can scientific literature be made genuinely navigable — not just searchable?

A ranked list answers one question: which documents match these keywords. It says nothing about how a field is shaped, which results build on which, where two communities are quietly disagreeing, or which corner nobody has looked at yet. The work at SciX has been about putting that structure on top of the ADS corpus, using embeddings for meaning, citation graphs for how the ideas connect, and controlled vocabularies to keep the grounding honest. What sits below is the current state of that, and the papers I read to push on it.

retrieval · embeddings · knowledge-graphs · scientific-search

The work

SciX Agent · Project
Literature Explorers · Project
Code Intelligence Digest · Project
Experimenting with Large Language Models and vector embeddings in NASA SciX · Paper (mine)
Making Scientific Knowledge Navigable for Agents · Talk

Reading path

A generated path through 5 papers, assembled using my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).

SPECTER: Document-level Representation Learning using Citation-informed Transformers ↗

2020 · 125 cites

Start with citation-informed document embeddings: how papers cite each other turns out to teach a model what they mean.
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks ↗

2020 · 463 cites

Then the case for domain adaptation: generic models leave signal on the table in specialized corpora.
Building astroBERT, a Language Model for Astronomy & Astrophysics ↗

2024 · 17 cites

astroBERT applies that to astronomy: a domain model trained on the same ADS corpus I work in.
Experimenting with Large Language Models and vector embeddings in NASA SciX ↗

2023 · 2 cites

Our SciX experiments: embeddings + vector search over the live literature, and what broke.
Knowledge Graphs ↗

2020 · 118 cites

Close on knowledge graphs, the structural layer ranked retrieval can never give you.

Open literature

SPECTER: Document-level Representation Learning using Citation-informed Transformers ↗ 2020 · 125 cites
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks ↗ 2020 · 463 cites
Building astroBERT, a Language Model for Astronomy & Astrophysics ↗ 2024 · 17 cites
Improving astroBERT using Semantic Textual Similarity ↗ 2022 · 5 cites
Knowledge Graphs ↗ 2020 · 118 cites

Improving Text Embeddings with Large Language Models ↗ Wang, Liang et al. · 2023 · 370 cites
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models ↗ Baek, Jinheon et al. · 2024 · 157 cites
A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery ↗ Zhang, Yu et al. · 2024 · 56 cites
The citation advantage of linking publications to research data ↗ Colavizza, Giovanni et al. · 2020 · 53 cites
AstroLLaMA: Towards Specialized Foundation Models in Astronomy ↗ Dung Nguyen, Tuan et al. · 2023 · 44 cites
Building astroBERT, a Language Model for Astronomy & Astrophysics ↗ Grezes, F. et al. · 2024 · 25 cites

Pulled from my curated ADS library.
