Research library
Scientific Search & SciX
Navigating scientific literature: NASA ADS / SciX information systems, scientific language models, and fine-grained classification of research text.
84 documents
-
Abstract
The NASA Astrophysics Data System (ADS) is the primary Digital Library portal for researchers in astronomy and astrophysics. Over the past 30 years, the ADS has gone from being an astronomy-focused bibliographic database to an open digital library system supporting research in space and (soon) earth sciences. This paper describes the evolution of the ADS system, its capabilities, and the technological infrastructure underpinning it. We give an overview of the ADS's original architecture, constructed primarily around simple database models. This bespoke system allowed for the efficient indexing of metadata and citations, the digitization and archival of full-text articles, and the rapid development of discipline-specific capabilities running on commodity hardware. The move towards a cloud-based microservices architecture and an open-source search engine in the late 2010s marked a significant shift, bringing full-text search capabilities, a modern API, higher uptime, more reliable data retrieval, and integration of advanced visualizations and analytics. Another crucial evolution came with the gradual and ongoing incorporation of Machine Learning and Natural Language Processing algorithms in our data pipelines. Originally used for information extraction and classification tasks, NLP and ML techniques are now being developed to improve metadata enrichment, search, notifications, and recommendations. We describe how these computational techniques are being embedded into our software infrastructure, the challenges faced, and the benefits reaped. Finally, we conclude by describing the future prospects of ADS and its ongoing expansion, discussing the challenges of managing an interdisciplinary information system in the era of AI and Open Science, where information is abundant, technology is transformative, but their trustworthiness can be elusive.
-
Abstract
The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we announce the first public release of the astroBERT language model, show how astroBERT improves over existing public language models on astrophysics-specific tasks, and detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.
-
Abstract
Observatories need to measure and evaluate the scientific output and overall impact of their facilities. An observatory bibliography consists of the papers published using that observatory's data, typically gathered by searching the major journals for relevant keywords. Recently, the volume of literature and the methods by which the publications pool is evaluated have increased. Efficient and standardized procedures are necessary to assign meaningful metadata; enable user-friendly retrieval; and provide the opportunity to derive reports, statistics, and visualizations to impart a deeper understanding of the research output. In 2021, a group of observatory bibliographers from around the world convened online to continue the discussions presented in Lagerstrom (2015). We worked to extract general guidelines from our experiences, techniques, and lessons learnt. The paper explores the development, application, and current status of telescope bibliographies and future trends. This paper briefly describes the methodologies employed in constructing databases, along with the various bibliometric techniques used to analyze and interpret them. We explain reasons for non-standardization and why it is essential for each observatory to identify metadata and metrics that are meaningful for them; caution against the (over-)use of comparisons among facilities that are, ultimately, not comparable through bibliometrics; and highlight the benefits of telescope bibliographies, both for researchers within the astronomical community and for stakeholders beyond the specific observatories. There is tremendous diversity in the ways bibliographers track publications and maintain databases, due to parameters such as resources, type of observatory, historical practices, and reporting requirements to funders and outside agencies. However, there are also common sets of Best Practices.
-
Abstract
This review examines complexity science in the context of Heliophysics, describing it not as a discipline, but as a paradigm. In the context of Heliophysics, complexity science is the study of a star, interplanetary environment, magnetosphere, upper and terrestrial atmospheres, and planetary surface as interacting subsystems. Complexity science studies entities in a system (e.g., electrons in an atom, planets in a solar system, individuals in a society), their interactions, and the nature of what emerges from these interactions. It is a paradigm that employs systems approaches and is inherently multi- and cross-scale. Heliophysics processes span at least 15 orders of magnitude in space and another 15 in time, and its reaches go well beyond our own solar system and Earth's space environment to touch planetary, exoplanetary, and astrophysical domains. It is an uncommon domain within which to explore complexity science. After first outlining the dimensions of complexity science, the review proceeds in three epochal parts: 1) A pivotal year in the Complexity Heliophysics paradigm: 1996; 2) The transitional years that established foundations of the paradigm (1996-2010); and 3) The emergent literature largely beyond 2010. This review article excavates the lived and living history of complexity science in Heliophysics. It identifies five dimensions of complexity science, some enjoying much scholarship in Heliophysics, others that represent relative gaps in the existing research. The history reveals a grand challenge that confronts Heliophysics, as with most physical sciences, to understand the research intersection between fundamental science (e.g., complexity science) and applied science (e.g., artificial intelligence and machine learning (AI/ML)). A risk science framework is suggested as a way of formulating the grand scientific and societal challenges in a way that AI/ML and complexity science converge. 
The intention is to provide inspiration, help researchers think more coherently about ideas of complexity science in Heliophysics, and guide future research. It will be instructive to Heliophysics researchers, but also to any reader interested in or hoping to advance the frontier of systems and complexity science.
-
Abstract
Astronomical knowledge entities, such as celestial object identifiers, are crucial for literature retrieval and knowledge graph construction, and other research and applications in the field of astronomy. Traditional methods of extracting knowledge entities from texts face numerous challenging obstacles that are difficult to overcome. Consequently, there is a pressing need for improved methods to efficiently extract them. This study explores the potential of pre-trained Large Language Models (LLMs) to perform the astronomical knowledge entity extraction (KEE) task on astrophysical journal articles using prompts. We propose a prompting strategy called Prompt-KEE, which includes five prompt elements, and design eight combination prompts based on them. We select four representative LLMs (Llama-2-70B, GPT-3.5, GPT-4, and Claude 2) and attempt to extract the most typical astronomical knowledge entities, celestial object identifiers and telescope names, from astronomical journal articles using these eight combination prompts. To accommodate their token limitations, we construct two data sets: the full texts and paragraph collections of 30 articles. Leveraging the eight prompts, we test on full texts with GPT-4 and Claude 2, and on paragraph collections with all LLMs. The experimental results demonstrate that pre-trained LLMs show significant potential in performing KEE tasks, but their performance varies on the two data sets. Furthermore, we analyze some important factors that influence the performance of LLMs in entity extraction and provide insights for future KEE tasks in astrophysical articles using LLMs. Finally, compared to other methods of KEE, LLMs exhibit strong competitiveness in multiple aspects.
-
Abstract
In many practical applications, coarse-grained labels are readily available compared to fine-grained labels that reflect subtle differences between classes. However, existing methods cannot leverage coarse labels to infer fine-grained labels in an unsupervised manner. To bridge this gap, we propose FALCON, a method that discovers fine-grained classes from coarsely labeled data without any supervision at the fine-grained level. FALCON simultaneously infers unknown fine-grained classes and underlying relationships between coarse and fine-grained classes. Moreover, FALCON is a modular method that can effectively learn from multiple datasets labeled with different strategies. We evaluate FALCON on eight image classification tasks and a single-cell classification task. FALCON outperforms baselines by a large margin, achieving 22% improvement over the best baseline on the tieredImageNet dataset with over 600 fine-grained classes.
-
Abstract
Fine-grained category discovery using only coarse-grained supervision is a cost-effective yet challenging task. Previous training methods focus on aligning query samples with positive samples and distancing them from negatives. They often neglect intra-category and inter-category semantic similarities of fine-grained categories when navigating sample distributions in the embedding space. Furthermore, some evaluation techniques that rely on pre-collected test samples are inadequate for real-time applications. To address these shortcomings, we introduce a method that successfully detects fine-grained clusters of semantically similar texts guided by a novel objective function. The method uses semantic similarities in a logarithmic space to guide sample distributions in the Euclidean space and to form distinct clusters that represent fine-grained categories. We also propose a centroid inference mechanism to support real-time applications. The efficacy of the method is both theoretically justified and empirically confirmed on three benchmark tasks. The proposed objective function is integrated in multiple contrastive learning based neural models. Its results surpass existing state-of-the-art approaches in terms of Accuracy, Adjusted Rand Index and Normalized Mutual Information of the detected fine-grained categories. Code and data are publicly available at https://github.com/changtianluckyforever/F-grained-STAR.
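The abstract above mentions a centroid inference mechanism but does not describe it. As a rough illustration only, the sketch below shows a generic nearest-centroid rule: each discovered cluster is summarized by the mean of its embeddings, and a new sample is assigned to the closest centroid by Euclidean distance. The function names, the two-dimensional toy vectors, and the cluster labels are all assumptions made for this example and are not taken from the paper or its repository.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_centroid(x, centroids):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


# Hypothetical 2-D embeddings for two fine-grained clusters (toy data).
clusters = {
    "refund_request": [[0.9, 0.1], [0.8, 0.2]],
    "card_lost": [[0.1, 0.9], [0.2, 0.8]],
}
centroids = {label: centroid(vs) for label, vs in clusters.items()}

# A new sample is labeled without re-clustering any stored test set.
prediction = nearest_centroid([0.7, 0.3], centroids)
print(prediction)
```

Because only the centroids are stored, classifying one incoming sample costs one distance computation per cluster, which is what makes this kind of rule attractive for real-time use.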
-
Abstract
In many scientific fields, large language models (LLMs) have revolutionized the way text and other modalities of data (e.g., molecules and proteins) are handled, achieving superior performance in various applications and augmenting the scientific discovery process. Nevertheless, previous surveys on scientific LLMs often concentrate on one or two fields or a single modality. In this paper, we aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs regarding their architectures and pre-training techniques. To this end, we comprehensively survey over 260 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality. Moreover, we investigate how LLMs have been deployed to benefit scientific discovery. Resources related to this survey are available at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models.
-
Abstract
The existing search tools for exploring the NASA Astrophysics Data System (ADS) can be quite rich and empowering (e.g., similar and trending operators), but researchers are not yet allowed to fully leverage semantic search. For example, a query for "results from the Planck mission" should be able to distinguish between all the various meanings of Planck (person, mission, constant, institutions and more) without further clarification from the user. At ADS, we are applying modern machine learning and natural language processing techniques to our dataset of recent astronomy publications to train astroBERT, a deeply contextual language model based on research at Google. Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool. We present here our preliminary results and lessons learned.
-
Abstract
Ultra-fine-grained visual categorization (Ultra-FGVC) aims at distinguishing highly similar sub-categories within fine-grained objects, such as different soybean cultivars. Compared to traditional fine-grained visual categorization, Ultra-FGVC encounters more hurdles due to the small inter-class and large intra-class variation. Given these challenges, relying on human annotation for Ultra-FGVC is impractical. To this end, our work introduces a novel task termed Ultra-Fine-Grained Novel Class Discovery (UFG-NCD), which leverages partially annotated data to identify new categories of unlabeled images for Ultra-FGVC. To tackle this problem, we devise a Region-Aligned Proxy Learning (RAPL) framework, which comprises a Channel-wise Region Alignment (CRA) module and a Semi-Supervised Proxy Learning (SemiPL) strategy. The CRA module is designed to extract and utilize discriminative features from local regions, facilitating knowledge transfer from labeled to unlabeled classes. Furthermore, SemiPL strengthens representation learning and knowledge transfer with proxy-guided supervised learning and proxy-guided contrastive learning. Such techniques leverage class distribution information in the embedding space, improving the mining of subtle differences between labeled and unlabeled ultra-fine-grained classes. Extensive experiments demonstrate that RAPL significantly outperforms baselines across various datasets, indicating its effectiveness in handling the challenges of UFG-NCD. Code is available at https://github.com/SSDUT-Caiyq/UFG-NCD.
-
Abstract
The pace of scientific research, vital for improving human life, is complex, slow, and needs specialized expertise. Meanwhile, novel, impactful research often stems from both a deep understanding of prior work, and a cross-pollination of ideas across domains and fields. To enhance the productivity of researchers, we propose ResearchAgent, which leverages the encyclopedic knowledge and linguistic reasoning capabilities of Large Language Models (LLMs) to assist them in their work. This system automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them based on the feedback from collaborative LLM-powered reviewing agents. Specifically, starting with a core scientific paper, ResearchAgent is augmented not only with relevant publications by connecting information over an academic graph but also entities retrieved from a knowledge store derived from shared underlying concepts mined across numerous papers. Then, mimicking a scientific approach to improving ideas with peer discussions, we leverage multiple LLM-based ReviewingAgents that provide reviews and feedback via iterative revision processes. These reviewing agents are instantiated with human preference-aligned LLMs whose criteria for evaluation are elicited from actual human judgments via LLM prompting. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines, showing its effectiveness in generating novel, clear, and valid ideas based on both human and model-based evaluation results. Our initial foray into AI-mediated scientific research has important implications for the development of future systems aimed at supporting researchers in their ideation and operationalization of novel work.
-
Abstract
In Colombia, astronomical research is experiencing accelerated growth. To better understand its evolution and current state, we conducted a bibliometric study using data from the Astrophysics Data System (ADS) and the Web of Science (WoS). In the ADS, we identified 422 peer-reviewed publications from 1980, the year of the first publication, until 2023, the cut-off year of the study. Of the 25 Colombian institutions participating in at least one publication, 14 are private and 11 are state institutions. More than half of these institutions are concentrated in two main cities: Bogotá with 11 institutions, followed by Medellín with 3 institutions. The number of contributions from four universities stands out: Universidad de los Andes, Universidad Nacional de Colombia, Universidad Industrial de Santander, and Universidad de Antioquia with 104, 78, 68, and 67 publications, respectively. By cross-referencing the information from the ADS and the WoS, we found that the areas in which publications with the highest impact are found are three: high energies and fundamental physics, stars and stellar physics, and galaxies and cosmology. At the global level, according to the WoS, Colombia ranks 52nd in the number of peer-reviewed publications between 2019 and 2023 and fifth in Latin America. Additionally, we identified three highly cited publications (top 1% worldwide) belonging to the field of observational cosmology.
-
Abstract
This study explores the use of Large Language Models (LLMs) to analyze text comments from Reddit users, aiming to achieve two primary objectives: firstly, to pinpoint critical excerpts that support a predefined psychological assessment of suicidal risk; and secondly, to summarize the material to substantiate the preassigned suicidal risk level. The work is circumscribed to the use of "open-source" LLMs that can be run locally, thereby enhancing data privacy. Furthermore, it prioritizes models with low computational requirements, making it accessible to both individuals and institutions operating on limited computing budgets. The implemented strategy only relies on a carefully crafted prompt and a grammar to guide the LLM's text completion. Despite its simplicity, the evaluation metrics show outstanding results, making it a valuable privacy-focused and cost-effective approach. This work is part of the Computational Linguistics and Clinical Psychology (CLPsych) 2024 shared task.
-
Abstract
We examine over 68,000 refereed publications based on data from 25 missions in the ESA Science Programme and 11 additional missions in which ESA is involved as a junior partner. The publications cover the fields of astronomy, planetary science, and heliophysics and are spread over almost 50 years, spanning the period between the year a mission was launched and the end of 2021. We study the number of papers as a function of time and the evolution of several metrics, including citations and other indices. We also investigate the geographical distribution of the authors, and for ESA Member States we correlate the various indices with the level of financial contribution of the individual countries to the ESA Science Programme. We find that in general the involvement of the scientific communities in the various Member States follows the distribution expected from the countries' gross domestic products, with communities in some fields and countries, both large and small, being particularly effective at turning data into scientific discoveries. We also analyse the differences between papers written by investigators directly involved in the provision of the payloads or in the definition of the scientific projects and those written by other scientists not directly involved in the process. We find that the latter, the so-called "archival papers", represent more than 50% of the literature based on data from ESA Space Science missions, and have a similar impact on the literature in the respective fields, as judged by the number of citations. This highlights the importance of sharing and preserving the scientific data produced by the missions.
-
Abstract
Entity set expansion, taxonomy expansion, and seed-guided taxonomy construction are three representative tasks that can be applied to automatically populate an existing taxonomy with emerging concepts. Previous studies view them as three separate tasks. Therefore, their proposed techniques usually work for one specific task only, lacking generalizability and a holistic perspective. In this paper, we aim at a unified solution to the three tasks. To be specific, we identify two common skills needed for entity set expansion, taxonomy expansion, and seed-guided taxonomy construction: finding "siblings" and finding "parents". We propose a taxonomy-guided instruction tuning framework to teach a large language model to generate siblings and parents for query entities, where the joint pre-training process facilitates the mutual enhancement of the two skills. Extensive experiments on multiple benchmark datasets demonstrate the efficacy of our proposed TaxoInstruct framework, which outperforms task-specific baselines across all three tasks.
-
Abstract
Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy, which is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing. Most earlier works focus on fully or semi-supervised methods that require a large amount of human annotated data which is costly and time-consuming to acquire. To alleviate human efforts, in this paper, we work on hierarchical text classification with a minimal amount of supervision: using the sole class name of each node as the only supervision. Recently, large language models (LLM) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which combines the general knowledge of LLMs and task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label space understanding and utilizes novel LLM-based data annotation and generation methods specifically tailored for the hierarchical setting. Experiments show that TELEClass can significantly outperform previous baselines while achieving comparable performance to zero-shot prompting of LLMs with drastically less inference cost.
-
Abstract
The Open Science paradigm and the FAIR principles (Findable, Accessible, Interoperable, Reusable) aim at fostering scientific return and reinforcing trust in science production. The MASER (Measuring, Analysing and Simulating Emissions in the Radio range) services implement Open Science through a series of existing solutions that have been put together, only adding new pieces where needed. It is a "science ready" toolbox dedicated to time-domain low frequency radioastronomy, whose data products mostly cover solar and planetary observations. MASER solutions are based on IVOA protocols for data discovery, on IHDEA tools for data exploration, and on a dedicated format developed by MASER for the temporal-spectral annotations. The service also proposes a data repository for sharing data collections, catalogues and associated documentation, as well as supplementary materials associated with papers. Each collection is managed through a Data Management Plan, whose purpose is two-fold: supporting the provider in managing the collection content; and supporting the data centre in resource management. Each product of the repository is citable with a DOI, and the landing page contains web semantics annotations (using schema.org).
-
Abstract
We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training. By employing a compact 7B-parameter LLaMA-2 model and focusing exclusively on a curated set of astronomy corpora (comprising abstracts, introductions, and conclusions), we achieve notable improvements in specialized topic comprehension. While general LLMs like GPT-4 excel in broader question-answering scenarios due to superior reasoning capabilities, our findings suggest that continual pre-training with limited resources can still enhance model performance on specialized topics. Additionally, we present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational data set, culminating in the release of the chat-enabled AstroLLaMA for community use. Comprehensive quantitative benchmarking is currently in progress and will be detailed in an upcoming full paper. The model, AstroLLaMA-Chat, is now available at https://huggingface.co/universeTBD, providing the first open-source conversational AI tool tailored for the astronomy community.
-
Abstract
In the evolving landscape of scientific publishing, it is important to understand the drivers of high-impact research, to equip scientists with actionable strategies to enhance the reach of their work, and to understand trends in the use of modern scientific publishing tools to inform their further development. Here, we study trends in the use of early preprint publications and revisions on ArXiv and the use of X (formerly Twitter) for promotion of such papers in computer science and physics. We find that early submissions to ArXiv and promotion on X have soared in recent years. Estimating the effect that the use of each of these modern affordances has on the number of citations of scientific publications, we find that peer-reviewed conference papers in computer science that are submitted early to ArXiv gain on average $21.1 \pm 17.4$ more citations, revised on ArXiv gain $18.4 \pm 17.6$ more citations, and promoted on X gain $44.4 \pm 8$ more citations in the first 5 years from an initial publication. In contrast, journal articles in physics experience comparatively lower boosts in citation counts, with increases of $3.9 \pm 1.1$, $4.3 \pm 0.9$, and $6.9 \pm 3.5$ citations respectively for the same interventions. Our results show that promoting one's work on ArXiv or X has a large impact on the number of citations, as well as the number of influential citations computed by Semantic Scholar, and thereby on the career of researchers. These effects are present also for publications in physics, but they are relatively smaller. The larger relative effect sizes, effects of promotion accumulating over time, and elevated unpredictability of the number of citations in computer science than in physics suggest a greater role of word-of-mouth spreading in computer science than in physics.
-
Abstract
Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, without mentioning the domain gaps between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines.
-
Abstract
It's been four years since the formation of the Python in Heliophysics Community (PyHC). In that time, the community has made great strides towards embodying and implementing the ideals of a "Heliophysics Framework" put forth by Burrell et al. (2018). Specifically, the components of such a framework include: 1) centralization of current Python packages, 2) increasing accessibility and connectivity of these projects, 3) consideration of software attribution issues, and 4) the establishment and implementation of best practices and standards for code development. We describe the manner in which, and to what extent, PyHC has realized these four tenets. We then set forth suggestions for advancing PyHC's efforts, including ways in which we can improve our information architecture, how we can grow our community, both in terms of project sustainability and usage, as well as the social component of the community itself, how we can improve PyHC package integration, and finally, non-Python library considerations. The suggested improvements and additions therein advance PyHC's mission and strategic goals, while helping better integrate PyHC into the broader Heliophysics and Space Weather community efforts.
-
Abstract
The automatic identification of planetary feature names in astronomy publications presents numerous challenges. These features include craters, defined as roughly circular depressions resulting from impact or volcanic activity; dorsas, which are elongate raised structures or wrinkle ridges; and lacus, small irregular patches of dark, smooth material on the Moon, referred to as "lake" (Planetary Names Working Group, n.d.). Many feature names overlap with places or people's names that they are named after, for example, Syria, Tempe, Einstein, and Sagan, to name a few (U.S. Geological Survey, n.d.). Some feature names have been used in many contexts, for instance, Apollo, which can refer to mission, program, sample, astronaut, seismic, seismometers, core, era, data, collection, instrument, and station, in addition to the crater on the Moon. Some feature names can appear in the text as adjectives, like the lunar craters Black, Green, and White. Some feature names in other contexts serve as directions, like craters West and South on the Moon. Additionally, some features share identical names across different celestial bodies, requiring disambiguation, such as the Adams crater, which exists on both the Moon and Mars. We present a multi-step pipeline combining rule-based filtering, statistical relevance analysis, part-of-speech (POS) tagging, named entity recognition (NER) model, hybrid keyword harvesting, knowledge graph (KG) matching, and inference with a locally installed large language model (LLM) to reliably identify planetary names despite these challenges. When evaluated on a dataset of astronomy papers from the Astrophysics Data System (ADS), this methodology achieves an F1-score over 0.97 in disambiguating planetary feature names.
-
Abstract
The Astrophysics Source Code Library (ASCL) is a free online registry for source codes of interest to astronomers, astrophysicists, and planetary scientists. It lists, and in some cases houses, software that has been used in research appearing in or submitted to peer-reviewed publications. As of December 2023, it has over 3300 software entries and is indexed by NASA's Astrophysics Data System (ADS) and Clarivate's Web of Science. In 2020, NASA created the Exoplanet Modeling and Analysis Center (EMAC). Housed at the Goddard Space Flight Center, EMAC serves, in part, as a catalog and repository for exoplanet research resources. EMAC has 240 entries (as of December 2023), 78% of which are for downloadable software. This oral presentation covered the collaborative work the ASCL, EMAC, and ADS are doing to increase the discoverability and citability of EMAC's software entries and to strengthen the ASCL's ability to serve the planetary science community. It also introduced two new projects, Virtual Astronomy Software Talks (VAST) and Exoplanet Virtual Astronomy Software Talks (exoVAST), that provide additional opportunities for discoverability of EMAC software resources.
-
Abstract
Open-source Large Language Models enable projects such as NASA SciX (i.e., NASA ADS) to think outside the box and try alternative approaches for information retrieval and data augmentation, while respecting data copyright and users' privacy. However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed an experiment where we created semantic vectors for our large collection of abstracts and full-text content, and we designed a prompt system to ask questions using contextual chunks from our system. Based on a non-systematic human evaluation, the experiment shows a lower degree of hallucination and better responses when using Retrieval Augmented Generation. Further exploration is required to design new features and data augmentation processes at NASA SciX that leverage this technology while respecting the high level of trust and quality that the project holds.
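The retrieval step described above can be illustrated with a minimal sketch. This is not the SciX implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `CHUNKS`, `retrieve`, and `build_prompt` are hypothetical names chosen for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'semantic vector': word counts (assumption; a real system would use a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, chunks, k=2):
    """Assemble a prompt that asks the model to answer from retrieved context only."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below. If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

# Hypothetical text chunks standing in for indexed abstracts and full text.
CHUNKS = [
    "The ADS indexes bibliographic records in astronomy and astrophysics.",
    "Wrinkle ridges are elongate raised structures on planetary surfaces.",
    "Retrieval Augmented Generation supplies retrieved text to a language model.",
]

if __name__ == "__main__":
    print(build_prompt("What does the ADS index?", CHUNKS, k=1))
```

The resulting prompt would then be passed to a locally hosted model; that call is omitted because it depends on the serving stack in use.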
-
Abstract
In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.
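For readers unfamiliar with the "standard contrastive loss" mentioned above, the following is a minimal sketch assuming the common InfoNCE form with in-batch negatives. The abstract does not spell out the exact loss, so the function name `info_nce`, the temperature value, and the toy vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(queries, docs, temperature=0.05):
    """Mean InfoNCE loss: docs[i] is the positive for queries[i];
    every other document in the batch acts as a negative."""
    q = [normalize(x) for x in queries]
    d = [normalize(x) for x in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, dj) / temperature for dj in d]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / len(q)

if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]     # each query matches its own document
    misaligned = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped
    print(info_nce(queries, aligned), info_nce(queries, misaligned))
```

The loss is small when each query is closest to its paired document and large when the pairing is wrong, which is the signal used to pull matching texts together in embedding space.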
-
Abstract
As part of NASA's Open Source Science Initiative, the NASA Astrophysics Data System (ADS) is extending its database to cover all research funded by the NASA Science Mission Directorate (SMD) divisions. The ADS was selected to lead this effort based on its decade-long, groundbreaking efforts in supporting the goals of Open Science for the Astronomy and Astrophysics community. The expansion plan, which has now begun in earnest, involves the creation of a literature-based, open digital information system covering and unifying the fields of Astrophysics, Planetary Science, Heliophysics, and Earth Science. It will also cover NASA-funded research in Biological and Physical Sciences. Codenamed NASA Science Explorer, or SciX for short, it will extend ADS to become a permanent major component in the infrastructure of scientific research, providing important contributions towards the goal of open science. In this talk, I will discuss the features of the new system, highlighting the ones which will enhance discovery and access to Planetary Science research.
-
Abstract
Fine-grained entity typing (FET) is the task of identifying specific entity types at a fine-grained level for entity mentions based on their contextual information. Conventional methods for FET require extensive human annotation, which is time-consuming and costly. Recent studies have been developing weakly supervised or zero-shot approaches. We study the setting of zero-shot FET where only an ontology is provided. However, most existing ontology structures lack rich supporting information and even contain ambiguous relations, making them ineffective in guiding FET. Recently developed language models, though promising in various few-shot and zero-shot NLP tasks, may face challenges in zero-shot FET due to their lack of interaction with task-specific ontology. In this study, we propose OnEFET, where we (1) enrich each node in the ontology structure with two types of extra information: instance information for training sample augmentation and topic information to relate types to contexts, and (2) develop a coarse-to-fine typing algorithm that exploits the enriched information by training an entailment model with contrasting topics and instance-based augmented training samples. Our experiments show that OnEFET achieves high-quality fine-grained entity typing without human annotation, outperforming existing zero-shot methods by a large margin and rivaling supervised methods.
-
Abstract
With the rapid increase in paper submissions to academic conferences, the need for automated and accurate paper-reviewer matching is more critical than ever. Previous efforts in this area have considered various factors to assess the relevance of a reviewer's expertise to a paper, such as the semantic similarity, shared topics, and citation connections between the paper and the reviewer's previous works. However, most of these studies focus on only one factor, resulting in an incomplete evaluation of the paper-reviewer relevance. To address this issue, we propose a unified model for paper-reviewer matching that jointly considers semantic, topic, and citation factors. To be specific, during training, we instruction-tune a contextualized language model shared across all factors to capture their commonalities and characteristics; during inference, we chain the three factors to enable step-by-step, coarse-to-fine search for qualified reviewers given a submission. Experiments on four datasets (one of which is newly contributed by us) spanning various fields such as machine learning, computer vision, information retrieval, and data mining consistently demonstrate the effectiveness of our proposed Chain-of-Factors model in comparison with state-of-the-art paper-reviewer matching methods and scientific pre-trained language models.
-
Abstract
Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arXiv we create, to the authors' knowledge, the largest scientific synthetic ground truth/OCR post correction dataset of 203,354,393 character pairs. We provide baseline models trained with this dataset and find the mean improvement in character and word error rates of 7.71% and 18.82% for historical OCR text, respectively. When used to classify parts of sentences as inline math, we find a classification F1 score of 77.82%. Interactive dashboards to explore the dataset are available online: https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and data and code, within the limitations of our agreement with the arXiv, are hosted on GitHub: https://github.com/ReadingTimeMachine/ocr_post_correction.
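The character and word error rates quoted above are conventionally computed from the Levenshtein edit distance between the ground truth and the OCR output. The sketch below uses that standard definition; the paper's exact normalization is not stated in the abstract, and the function names are chosen for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if the items match)
            ))
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

if __name__ == "__main__":
    truth = "the red star"
    ocr = "the rod star"
    print(cer(truth, ocr), wer(truth, ocr))
```

A single wrong character costs little in CER but invalidates a whole word in WER, which is why the two metrics can improve by different amounts after correction.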
-
Abstract
Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than LLaMA-2, showing marked domain adaptation. Our model produces more insightful and scientifically relevant text completions and embeddings than state-of-the-art foundation models despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
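Perplexity, the metric quoted above, is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition and how a relative reduction would be computed; the log-probabilities and perplexity values in the demo are made-up numbers, not results from the paper.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative natural-log probability) over the tokens of a text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_reduction(baseline, adapted):
    """Fractional decrease in perplexity of the adapted model versus the baseline."""
    return (baseline - adapted) / baseline

if __name__ == "__main__":
    # Hypothetical per-token log-probabilities from some causal language model.
    log_probs = [math.log(0.25)] * 4
    print(perplexity(log_probs))
    # Hypothetical perplexities for a base and a domain-adapted model.
    print(relative_reduction(10.0, 7.0))
```

A model that assigns probability 0.25 to every token has perplexity 4, which can be read as being as uncertain as a uniform choice among four options.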
-
Abstract
Classifying research output into context-specific label taxonomies is a challenging and relevant downstream task, given the volume of existing and newly published articles. We propose a method to enhance the performance of article classification by enriching simple Graph Neural Network (GNN) pipelines with multi-graph representations that simultaneously encode multiple signals of article relatedness, e.g. references, co-authorship, shared publication source, shared subject headings, as distinct edge types. Fully supervised transductive node classification experiments are conducted on the Open Graph Benchmark OGBN-arXiv dataset and the PubMed diabetes dataset, augmented with additional metadata from Microsoft Academic Graph and PubMed Central, respectively. The results demonstrate that multi-graphs consistently improve the performance of a variety of GNN models compared to the default graphs. When deployed with SOTA textual node embedding methods, the transformed multi-graphs enable simple and shallow 2-layer GNN pipelines to achieve results on par with more complex architectures.
-
Abstract
This study investigates the application of Large Language Models (LLMs), specifically GPT-4, within Astronomy. We employ in-context prompting, supplying the model with up to 1000 papers from the NASA Astrophysics Data System, to explore the extent to which performance can be improved by immersing the model in domain-specific literature. Our findings point towards a substantial boost in hypothesis generation when using in-context prompting, a benefit that is further accentuated by adversarial prompting. We illustrate how adversarial prompting empowers GPT-4 to extract essential details from a vast knowledge base to produce meaningful hypotheses, signaling an innovative step towards employing LLMs for scientific research in Astronomy.
-
Abstract
We present a novel natural language processing (NLP) approach to deriving plain English descriptors for science cases otherwise restricted by obfuscating technical terminology. We address the limitations of common radio galaxy morphology classifications by applying this approach. We experimentally derive a set of semantic tags for the Radio Galaxy Zoo EMU (Evolutionary Map of the Universe) project and the wider astronomical community. We collect 8486 plain English annotations of radio galaxy morphology, from which we derive a taxonomy of tags. The tags are plain English. The result is an extensible framework, which is more flexible, more easily communicated, and more sensitive to rare feature combinations, which are indescribable using the current framework of radio astronomy classifications.
-
Abstract
Specialists offer seven tips for effectively sharing your data.
-
Abstract
Instead of relying on human-annotated training samples to build a classifier, weakly supervised scientific paper classification aims to classify papers only using category descriptions (e.g., category names, category-indicative keywords). Existing studies on weakly supervised paper classification are less concerned with two challenges: (1) Papers should be classified into not only coarse-grained research topics but also fine-grained themes, and potentially into multiple themes, given a large and fine-grained label space; and (2) full text should be utilized to complement the paper title and abstract for classification. Moreover, instead of viewing the entire paper as a long linear sequence, one should exploit the structural information such as citation links across papers and the hierarchy of sections and paragraphs in each paper. To tackle these challenges, in this study, we propose FUTEX, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision. A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals, respectively. Experiments on two benchmark datasets demonstrate that FUTEX significantly outperforms competitive baselines and is on par with fully supervised classifiers that use 1,000 to 60,000 ground-truth training samples.
-
Abstract
The NASA Astrophysics Data System (ADS) is the primary Digital Library portal for Space Science Researchers. In addition to the scientific literature, the ADS has for a long time included in its database non-traditional scholarly resources such as research proposals, software packages, and high-level data products, making them discoverable and easily citable. Over the next three years, in response to NASA's efforts supporting interdisciplinary research and Open Science initiatives, the ADS will greatly expand its coverage of the literature, and will develop a new portal unifying access to the fields of Astrophysics, Planetary Science, Heliophysics, and Earth Science. It will also cover NASA-funded research in Biological and Physical Sciences. The planned system will combine a scalable, discipline-agnostic core with a set of discipline-specific knowledge centers which will curate and enrich its content using deep subject matter expertise from the NASA Science divisions. In this talk I will provide an overview of the ADS system and its distinguishing features, and then focus on our efforts to support and promote the FAIR principles as part of NASA's Year of Open Science initiatives.
-
Abstract
With the start of a new Great Observatories era, there is renewed concern that the demand for these forefront facilities, through proposal pressure, will exceed conventional peer-review management's capacity for ensuring an unbiased and efficient selection. There is need for new methods, strategies, and tools to facilitate those reviews. Here, we describe PACMan2, an updated tool for proposal review management that utilizes machine-learning models and techniques to topically categorize proposals and reviewers, to match proposals to reviewers, and to facilitate proposal assignments, mitigating some conflicts of interest. We find that the classifier has cross-validation accuracy of 80.0% ± 2.2% on proposals for time on the Hubble Space Telescope and the James Webb Space Telescope.
-
Abstract
Scientific literature understanding tasks have gained significant attention due to their potential to accelerate scientific discovery. Pre-trained language models (LMs) have shown effectiveness in these tasks, especially when tuned via contrastive learning. However, jointly utilizing pre-training data across multiple heterogeneous tasks (e.g., extreme multi-label paper classification, citation prediction, and literature search) remains largely unexplored. To bridge this gap, we propose a multi-task contrastive learning framework, SciMult, with a focus on facilitating common knowledge sharing across different scientific literature understanding tasks while preventing task-specific skills from interfering with each other. To be specific, we explore two techniques -- task-aware specialization and instruction tuning. The former adopts a Mixture-of-Experts Transformer architecture with task-aware sub-layers; the latter prepends task-specific instructions to the input text so as to produce task-aware outputs. Extensive experiments on a comprehensive collection of benchmark datasets verify the effectiveness of our task-aware specialization strategy, where we outperform state-of-the-art scientific pre-trained LMs. Code, datasets, and pre-trained models can be found at https://scimult.github.io/.
-
Abstract
Hierarchical text classification (HTC) is a challenging subtask of multi-label classification as the labels form a complex hierarchical structure. Existing dual-encoder methods in HTC achieve weak performance gains with huge memory overheads, and their structure encoders rely heavily on domain knowledge. Motivated by this observation, we investigate the feasibility of a memory-friendly model with strong generalization capability that could boost the performance of HTC without prior statistics or label semantics. In this paper, we propose Hierarchy-aware Tree Isomorphism Network (HiTIN) to enhance the text representations with only syntactic information of the label hierarchy. Specifically, we convert the label hierarchy into an unweighted tree structure, termed coding tree, with the guidance of structural entropy. Then we design a structure encoder to incorporate hierarchy-aware information in the coding tree into text representations. Besides the text encoder, HiTIN only contains a few multi-layer perceptrons and linear transformations, which greatly saves memory. We conduct experiments on three commonly used datasets and the results demonstrate that HiTIN could achieve better test performance and less memory consumption than state-of-the-art (SOTA) methods.
-
Abstract
Weakly-supervised text classification trains a classifier using the label name of each target class as the only supervision, which largely reduces human annotation efforts. Most existing methods first use the label names as static keyword-based features to generate pseudo labels, which are then used for final classifier training. While reasonable, such a commonly adopted framework suffers from two limitations: (1) keywords can have different meanings in different contexts and some text may not have any keyword, so keyword matching can induce noisy and inadequate pseudo labels; (2) the errors made in the pseudo label generation stage will directly propagate to the classifier training stage without a chance of being corrected. In this paper, we propose a new method, PIEClass, consisting of two modules: (1) a pseudo label acquisition module that uses zero-shot prompting of pre-trained language models (PLM) to get pseudo labels based on contextualized text understanding beyond static keyword matching, and (2) a noise-robust iterative ensemble training module that iteratively trains classifiers and updates pseudo labels by utilizing two PLM fine-tuning methods that regularize each other. Extensive experiments show that PIEClass achieves overall better performance than existing strong baselines on seven benchmark datasets and even achieves similar performance to fully-supervised classifiers on sentiment classification tasks.
-
Abstract
In 2019, while launching a multidisciplinary research project aimed at developing the Puna de Atacama region as a natural laboratory, investigators at the University of Atacama (Chile) conducted a bibliographic search identifying previously studied geographic points of the region and of potential interest for planetary science and astrobiology research. This preliminary work highlighted a significant absence of local institutional involvement in international publications. In light of this, a follow-up study was conducted to confirm or refute these first impressions, by comparing the search in two bibliographic databases: Web of Science and Scopus. The results show that almost 60% of the publications based directly on data from the Puna, the Altiplano, or the Atacama Desert with objectives related to planetary science or astrobiology do not include any local institutional partner (Argentina, Bolivia, Chile, and Peru). Indeed, and beyond the ethical questioning of international collaborations, Latin-American planetary science deserves a strategic structuring, networking, as well as a road map at national and continental scales, not only to enhance research, development, and innovation, but also to protect an exceptional natural heritage sampling extreme environmental niches on Earth. Examples of successful international collaborations such as the field of meteorites, terrestrial analogs, and space exploration in Chile or astrobiology in Mexico are given as illustrations and possible directions to follow to develop planetary science in South America. To promote appropriate scientific practices involving local researchers, possible responses at academic and institutional levels will eventually be discussed.
-
Abstract
The volume of data held by the National Aeronautics and Space Administration (NASA) Science Mission Directorate (SMD) is growing exponentially, allowing researchers to make discoveries. However, making discoveries is challenging and time-consuming because of the size of the data catalogs and because many concepts and data are indirectly connected. This paper proposes a pipeline to generate knowledge graphs (KGs) representing different NASA SMD domains. These KGs can be used as the basis for dataset search engines, saving researchers time and supporting them in finding new connections. We collected textual data and used several modern natural language processing (NLP) methods to create the nodes and the edges of the KGs. We explore the cross-domain connections, discuss our challenges, and provide future directions to inspire researchers working on similar challenges.
-
Abstract
Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, after they have been processed with Optical Character Recognition (OCR), which uses both grayscale and OCR-features. We focus our efforts on translating the intersection-over-union (IOU) metric from the field of object detection to document layout analysis and quantify "high localization" levels as an IOU of 0.9. When applied to the astrophysics literature holdings of the NASA Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) with the IOU cut-off of 0.9 which is a significant improvement over other state-of-the-art methods.
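The intersection-over-union (IOU) criterion referred to above can be written down in a few lines. This sketch assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` in pixels; the box convention and the helper name `is_match` are assumptions for illustration, while the 0.9 cut-off comes from the abstract.

```python
def area(box):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def is_match(predicted, truth, threshold=0.9):
    """A detection counts as correct only if it overlaps the true box tightly enough."""
    return iou(predicted, truth) >= threshold

if __name__ == "__main__":
    figure_truth = (0, 0, 100, 100)
    print(is_match((0, 0, 100, 95), figure_truth))  # IOU 0.95
    print(is_match((0, 0, 100, 80), figure_truth))  # IOU 0.80
```

With a cut-off of 0.9, a predicted box that misses even a fifth of a figure is counted as a failure, which is what makes this a demanding localization standard.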
-
Abstract
Due to the exponential growth of scientific publications on the Web, there is a pressing need to tag each paper with fine-grained topics so that researchers can track their fields of interest rather than drowning in the whole literature. Scientific literature tagging goes beyond a pure multi-label text classification task because papers on the Web are prevalently accompanied by metadata information such as venues, authors, and references, which may serve as additional signals to infer relevant tags. Although there have been studies making use of metadata in academic paper classification, their focus is often restricted to one or two scientific fields (e.g., computer science and biomedicine) and to one specific model. In this work, we systematically study the effect of metadata on scientific literature tagging across 19 fields. We select three representative multi-label classifiers (i.e., a bag-of-words model, a sequence-based model, and a pre-trained language model) and explore their performance change in scientific literature tagging when metadata are fed to the classifiers as additional features. We observe some ubiquitous patterns of metadata's effects across all fields (e.g., venues are consistently beneficial to paper tagging in almost all cases), as well as some unique patterns in fields other than computer science and biomedicine, which are not explored in previous studies.
-
Abstract
The NASA Astrophysics Data System (ADS) has been developing Natural Language Processing tools and datasets to further enhance its data holdings and services. To support broad community participation in these efforts, we have recently released the astroBERT astronomy-specific language model, and the first version of the Detecting Entities in the Astrophysics Literature dataset. Other ongoing machine learning efforts at ADS that build upon these tools include leveraging well-defined taxonomies, such as the Unified Astrophysical Thesaurus and the USGS/IAU Gazetteer of Planetary Nomenclature, to enhance the search and discovery of astrophysical concepts and planetary features. These new tools, and more that will build upon them, will both provide a richer user experience and allow internal processes to be scaled up. One mode of disseminating these tools to the public was through the first Workshop on Information Extraction from Scientific Publications, held in November 2022. Along with submitted papers and talks, several teams competed to build their own Named Entity Recognition systems to compare against astroBERT. With our experience in conducting challenges such as these, we foresee using further challenges to grow awareness of our efforts and to gain valuable feedback from the larger astronomy and machine learning communities.
-
Abstract
Scientific software has been playing an increasingly large role in astronomy and astrophysics. It is important for continued innovation and research that these software tools are discoverable by the community and that researchers are properly acknowledged and credited for their contributions. The Astrophysics Data System (ADS) is working on multiple projects to promote software discoverability and properly include software in the larger corpus of scientific literature. As part of these efforts, ADS has constructed an internal pipeline for ingesting software records from the Zenodo repository. Recent enhancements of this pipeline include the creation of associations with related software versions already in the database and the inclusion of their metrics calculations with other ADS records. Additionally, the Citation Capture pipeline forwards cited GitHub and Zenodo repositories to the Asclepias Project (a joint collaboration of AAS, Zenodo, and ADS), which promotes software discoverability by maintaining a publicly searchable database linking software tools to the citing literature.
-
Abstract
Citation graphs can be helpful in generating high-quality summaries of scientific papers, where references of a scientific paper and their correlations can provide additional knowledge for contextualising its background and main contributions. Despite the promising contributions of citation graphs, it is still challenging to incorporate them into summarization tasks. This is due to the difficulty of accurately identifying and leveraging relevant content in references for a source paper, as well as capturing their correlations of different intensities. Existing methods either ignore references or utilize only abstracts indiscriminately from them, failing to tackle the challenge mentioned above. To fill that gap, we propose a novel citation-aware scientific paper summarization framework based on citation graphs, able to accurately locate and incorporate the salient contents from references, as well as capture varying relevance between source papers and their references. Specifically, we first build a domain-specific dataset PubMedCite with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient contents extracted from full texts of references, and the weighted correlation between the salient contents of references and the source paper. Based on it, we design a self-supervised citation-aware summarization framework (CitationSum) with graph contrastive learning, which boosts the summarization generation by efficiently fusing the salient information in references with source paper contents under the guidance of their correlations. Experimental results show that our model outperforms the state-of-the-art methods, due to efficiently leveraging the information of references and citation correlations.
-
Abstract
We estimate the carbon footprint of astronomical research infrastructures, including space telescopes and probes and ground-based observatories. Our analysis suggests annual greenhouse gas emissions of $1.2\pm0.2$ MtCO$_2$e yr$^{-1}$ due to construction and operation of the world-fleet of astronomical observatories, corresponding to a carbon footprint of $36.6\pm14.0$ tCO$_2$e per year and average astronomer. We show that decarbonising astronomical facilities is compromised by the continuous deployment of new facilities, suggesting that a significant reduction in the deployment pace of new facilities is needed to reduce the carbon footprint of astronomy. We propose measures that would bring astronomical activities more in line with the imperative to reduce the carbon footprint of all human activities.
-
Abstract
Information Extraction from scientific literature can be challenging due to the highly specialised nature of such text. We describe our entity recognition methods developed as part of the DEAL (Detecting Entities in the Astrophysics Literature) shared task. The aim of the task is to build a system that can identify Named Entities in a dataset composed of scholarly articles from astrophysics literature. We planned our participation such that it enables us to conduct an empirical comparison between word-based tagging and span-based classification methods. When evaluated on two hidden test sets provided by the organizer, our best-performing submission achieved $F_1$ scores of 0.8307 (validation phase) and 0.7990 (testing phase).
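For context on the $F_1$ scores above, entity recognition is commonly scored at the entity level, where a prediction counts only if its span and type both match the gold annotation. The sketch below shows that common convention; the shared task's exact scoring script is not described in the abstract, and the entity labels used here are hypothetical.

```python
def span_f1(gold, predicted):
    """Entity-level F1 over (start, end, label) triples with exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Hypothetical token spans and labels for one sentence.
    gold = [(0, 2, "Mission"), (5, 6, "Telescope")]
    predicted = [(0, 2, "Mission"), (5, 7, "Telescope")]  # second span has a boundary error
    print(span_f1(gold, predicted))
```

Under exact matching, a boundary error counts both as a missed gold entity and as a spurious prediction, so it lowers precision and recall together.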
-
Abstract
In keeping with NASA's long term commitment to open science, the ADS services have always been open and freely accessible to everyone on Earth. By integrating from and linking to hundreds of data sources, such as publishers, archives, data centers, and libraries, the ADS has created a powerful international, multi-partner research discovery engine, unified by the ADS digital library system. Having done this for Astrophysics for three decades, the ADS has included Heliophysics in its efforts to support open science. This poster highlights these efforts and progress that has already been made.
-
Abstract
Datasets, disciplines, people, projects, and institutions are all siloed, resulting in a lack of awareness and usability across silos that prevents reuse. Yet the challenges confronting Earth and Space Science are increasingly complex, requiring wider collaboration and data integration. We offer a framework to address the challenge of knowledge integration that precludes transdisciplinary progress: a knowledge commons [McGranaghan et al., 2021]. We will detail three projects working in parallel to build the knowledge commons: (1) Enriching Heliophysics data in the NASA Astrophysics Data System (ADS): create a set of concepts and their semantic representation that improves the discovery of Heliophysics literature by NASA ADS; (2) The Heliophysics KNOWledge Network (Helio-KNOW): a collection of software and systems for improved information representation in Heliophysics, focusing on the magnetosphere-ionosphere system; and (3) The NASA Center for HelioAnalytics: a cross-community effort focused on improving knowledge and application of data science to Heliophysics. This talk will provide a clear discussion of what knowledge representation is and its importance for all scientific domains and collaborations. It will make that discussion actionable by relating it to three active projects. Finally, it will reveal how the community can participate, outlining the commons component.
McGranaghan, R., Klein, S. J., Cameron, A., Young, E., Schonfeld, S., Higginson, A., … Thompson, B. (2021). The need for a Space Data Knowledge Commons. Structuring Collective Knowledge. Retrieved from https://knowledgestructure.pubpub.org/pub/space-knowledge-commons
-
Abstract
Recent news reports claim that China is overtaking the United States and all other countries in scientific productivity and scientific impact. A straightforward analysis of high-impact papers in astronomy reveals that this is not true in our field. In fact, the United States continues to host, by a large margin, the authors that lead high-impact papers. Moreover, this analysis shows that 90% of all high-impact papers in astronomy are led by authors based in North America and Europe. That is, only about 10% of countries in the world host astronomers that publish "astronomy's greatest hits".
-
Abstract
This paper presents results of a survey of authors of journal articles published over several decades in astronomy. The study focuses on determining the characteristics and accessibility of data behind papers, referring to the spectrum of raw and derived data that would be needed to validate the results of a particular published article as a capsule of scientific knowledge. Curating the data behind papers can arguably lead to new discoveries through reuse. However, as shown through related research and confirmed by the results of the present study, a fully accessible portrait of the data behind papers is often unavailable. These findings have implications for reusability efforts and are presented alongside a discussion of open science.
-
Abstract
In keeping with NASA's long term commitment to open science, the ADS services have always been open and freely accessible to everyone worldwide. By integrating from and linking to hundreds of data sources, such as publishers, archives, data centers, and libraries, the ADS has created a powerful international, multi-partner research discovery engine, unified by the ADS digital library system. Having done this for Astrophysics for three decades, the ADS has recently included Heliophysics in its efforts to support open science. This poster highlights these efforts and progress that has already been made.
-
Abstract
In September of 2016, the NASA Astrophysics Data System (ADS) started to work on the implementation of first-class support of software. This work was started as a result of the Asclepias project, funded through a grant from the Alfred P. Sloan Foundation to the American Astronomical Society. The main goal of the Asclepias project is to promote scientific software into an identifiable, citable, and preservable object. It highlighted the fact that no single stakeholder can solve the software citation problem. It requires close collaboration between a publisher (the American Astronomical Society), a repository (Zenodo) and an indexing service (ADS). This paper focuses on the contribution the ADS has made to this project. Five years later, the ADS has indexed just over 10k Zenodo software records, representing almost 19k citations. How did we get here? We describe the underlying infrastructure developed at ADS which implements a software citation detection, metadata capture and ingest, and event-driven notification system to a broker used by ADS collaborators. We include a discussion of the challenges that we have encountered in the implementation and operation of the system.
-
Abstract
What publications mention a given observatory? How many refer to a concrete telescope or instrument? Quickly answering these questions can boost the efficiency of astronomers, librarians, and administrative personnel when writing reports, applying for new funds, or collecting institutional statistics. The NASA Astrophysics Data System (ADS) has more than 16 million bibliographic records for which it has indexed the full-text content of more than 6 million scientific articles (more than 1.3 million of which correspond to astronomical publications). In this ocean of data, the NASA ADS allows users to quickly string match words and sentences against the indexed full-text content but this does not solve the word sense disambiguation problem (e.g., "Planck" can refer to the person, the mission, the constant or several institutions). The NASA ADS is preparing a dataset of manually tagged entities which is being used to train Deep Learning models to automatically recognize and disambiguate facilities. This poster details our efforts and lessons learned in our adventure to improve the NASA ADS search capabilities using Deep Learning techniques.
-
Abstract
The NASA Astrophysics Data System's API, or application programming interface, is designed to be easily accessible, allowing users to create their own tools and scripts. However, while our documentation for the API covered the most widely used features, there were a number of useful components that we had not properly documented. We have now remedied this shortcoming by using the OpenAPI specification, a language-agnostic, widely adopted industry standard for documenting APIs in a machine-readable way. This OpenAPI document underlies our new beautiful, easy to use documentation, powered by RapiDoc. In addition to providing a listing of all available API endpoints, the new documentation defines all available inputs and possible API responses, offers examples, and provides further description and explanation as necessary. The most unique feature of the new documentation is the built-in try-me functionality, which allows you to run API queries directly from your browser. This poster spotlights the capabilities of the new documentation and highlights other recent and forthcoming changes to our API documentation.
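As an illustration of the kind of user script the API is meant to support, the following is a minimal Python sketch that builds, but does not send, a search request. The endpoint path, the `q`, `fl`, and `rows` parameters, and the bearer-token header reflect the public ADS search API as we understand it and should be checked against the OpenAPI documentation. The helper name `build_search_request` and the token value are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed search endpoint; verify against the current API documentation.
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_search_request(query, token, fields=("bibcode", "title"), rows=5):
    """Build (but do not send) an authenticated search request.

    query  -- search string in ADS query syntax
    token  -- personal API token (placeholder in this sketch)
    fields -- record fields to return
    rows   -- maximum number of records to return
    """
    params = urlencode({"q": query, "fl": ",".join(fields), "rows": rows})
    return Request(
        f"{ADS_SEARCH_URL}?{params}",
        headers={"Authorization": f"Bearer {token}"},
    )


req = build_search_request('author:"Accomazzi, A."', token="YOUR_API_TOKEN")
print(req.full_url)
```

Passing the request to `urllib.request.urlopen` with a valid token would perform the query. The sketch stops short of that so it runs without network access.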
-
Abstract
We present statistics on the number of refereed astronomy journal articles that used data from NASA's Spitzer Space Telescope through the end of the calendar year 2020. We discuss the various types of science programs and science categories that were used to collect data during the mission and discuss how operational changes brought on by the depletion of cryogen in 2009 May, including the resulting budget cuts, impacted the publication rate. The post-cryogenic (warm) mission produced fewer papers than the cryogenic mission, but the percentage of the exposure time published did not appreciably change between the warm and cryogenic missions. This was mostly because in the warm mission the length of observations increased, so that each warm paper on average uses more data than the cryogenic papers. We also discuss the speed of publication, archival usage, and the tremendous efficacy of the Legacy and Exploration Science programs (large, coherent investigations), including the value of having well-advertised enhanced data products hosted in centralized archives. We also identify the observations that have been published the largest number of times, and sort them by a variety of metrics (including program type, instrument used, and observation length). Data that have the highest reuse rates in publications were taken early in the Spitzer mission, or belong to one of the large surveys (large either in number of objects, in number of hours observed, or in area covered on the sky). We also assess how often authors have cited the Spitzer fundamental papers or have correctly referenced the Spitzer data they used, finding that as many as 40% of papers have failed to cite the papers, and 15% have made it impossible to identify the data they used.
-
Abstract
We present an overview of best practices for publishing data in astronomy and astrophysics journals. These recommendations are intended as a reference for authors to help prepare and publish data in a way that will better represent and support science results, enable better data sharing, improve reproducibility, and enhance the reusability of data. Observance of these guidelines will also help to streamline the extraction, preservation, integration and cross-linking of valuable data from astrophysics literature into major astronomical databases, and consequently facilitate new modes of science discovery that will better exploit the vast quantities of panchromatic and multidimensional data associated with the literature. We encourage authors, journal editors, referees, and publishers to implement the best practices reviewed here, as well as related recommendations from international astronomical organizations such as the International Astronomical Union for publication of nomenclature, data, and metadata. A convenient Checklist of Recommendations for Publishing Data in the Literature (Appendix A) is included for authors to consult before the submission of the final version of their journal articles and associated data files. We recommend that publishers of journals in astronomy and astrophysics incorporate a link to this document in their Instructions to Authors.
-
Abstract
The Astrophysics Data System is well established as one of the most prominent discipline-specific citation databases. As such, it is a rich source of astronomy-related metrics. It has been described as "designed to be useful to astronomers, not bibliometricians". There is general agreement that the coverage, metadata and classifications found in more general citation databases such as Web of Science and Scopus do not serve the needs of the astronomy and astrophysics community equally well.
-
Abstract
The Unified Astronomy Thesaurus (UAT) is a living resource, with regular updates and opportunities for integration that ensure its place as a critical tool that meets the needs of the astronomy community, including researchers, authors, publishers, observatories, data centers, and librarians. The UAT has been integrated into multiple systems and institutions in recent years. We share the latest integrations with LISA participants and engage in a discussion on how we can work together to continue to build and leverage the UAT as a community. Future integrations and areas of exploration, such as ADS integration, applying the UAT as a controlled vocabulary for your institution's needs, and the UAT's potential as a testbed for Natural Language Processing, are also discussed.
-
Abstract
Today's scholarly, born-digital articles are no longer best represented by single documents but rather consist of a narrative connecting a collection of research components. They contain references to other papers, data products, people, institutions, and funding sources, among others (Accomazzi, 2015). Since the whole of the products referenced by a scholarly article forms the best representation of the science discussed in the article, having a proper coverage of these sources will help in properly representing different aspects and stages of research life cycles. Just capturing citations to other scholarly publications would not only leave out important information, it would also omit proper attribution to researchers who contribute as authors of non-traditional research products. The ADS has implemented a workflow for capturing software citations, the concept of which was presented at LISA VIII (Muench, Accomazzi, & Holm Nielsen, 2017); this workflow allows the detection and ingest of citations to software products used in scholarly publications (Henneken, Muench, Holm Nielsen, Blanco-Cuaresma, & Accomazzi, 2019). Since both data and software citations are crucial for the transparency of research results and for the transmission of credit (van de Sandt et al., 2019), the ADS will implement indexing of high-level data products, in particular those published by NASA Archives, and track their citations. We will conclude by speculating how additional text mining and curation efforts can be used to further link the literature to additional resources mentioned in the papers.
-
Abstract
Software has been a crucial contributor to scientific progress in astronomy for decades, but practices that enable machine-actionable citations have not been consistently applied to software itself. Instead, software citation behaviors developed independently from standard publication mechanisms and policies, resulting in human-readable software citations that cannot effectively represent the influence software has had in the field. These historical software citation behaviors need to be understood in order to improve software citation guidance and develop relevant publishing practices that fully support the astronomy community. To this end, a twenty-three year retrospective analysis of software citation practices in astronomy was developed. Astronomy publications were mined for 410 aliases associated with nine software packages and analyzed to identify practices and trends that negatively impact software citation implementation.
-
Abstract
The NASA Astrophysics Data System (ADS), a critical research service for the astrophysics community, strives to provide the most accessible and inclusive environment for the discovery and exploration of the astronomical literature. Part of this goal involves creating a digital platform that can accommodate everybody, including those with disabilities that would benefit from alternative ways to present the information provided by the website. NASA ADS follows the official Web Content Accessibility Guidelines (WCAG) standard for ensuring accessibility of all its applications, striving to exceed this standard where possible. Through the use of both internal audits and external expert review based on these guidelines, we have identified many areas for improving accessibility in our current web application, and have implemented a number of updates to the UI as a result of this. We present an overview of some current web accessibility trends, discuss our experience incorporating these trends in our web application, and discuss the lessons learned and recommendations for future projects.
-
Abstract
Researchers are more likely to read and cite papers to which they have access than those that they cannot obtain. Thus, the objective of this work is to analyze the contribution of the Open Access (OA) modality to the impact of hybrid journals. For this, we analyzed the research articles published in 2017 in 200 hybrid journals in four subject areas, together with the citations those articles received in the period 2017-2020 in the Scopus database. The journals were randomly selected from those with a share of OA papers higher than some minimal value. The sample comprises more than 60 thousand research articles, of which 24% were published under the OA modality. Our results show that citations per article in the two hybrid modalities correlate strongly. However, there is no correlation between the OA prevalence and citations per article in either hybrid modality. There is an OA citation advantage in 80% of hybrid journals. Moreover, the OA citation advantage is consistent across fields and holds over time. We obtain an OA citation advantage of 50% on average, and higher than 37% in half of the hybrid journals. Finally, the OA citation advantage is higher in Humanities than in Science and Social Science.
-
Abstract
The NASA Astrophysics Data System (ADS) is the primary Digital Library portal for researchers in Astronomy and Astrophysics. It is also used extensively by the broader community of Space Science Researchers. In addition to the scientific literature, the ADS has for a long time included in its database non-traditional scholarly resources such as research proposals, software packages, and high-level data products, making them discoverable and easily citable. Over the next three years, in response to NASA's efforts supporting interdisciplinary research across Science Mission Directorate Divisions, the ADS will expand its coverage of Planetary Science and Heliophysics content. During this time, an in-depth analysis of this disciplinary content will provide us with opportunities to improve our coverage of the literature as well as linked research objects such as datasets and software, improving their discoverability and citability. In this talk I will provide an overview of the ADS system, its distinguishing features, and then focus on our efforts to support and promote the goals of Open Science.
-
Abstract
Software citation has accelerated in astrophysics in the past decade, resulting in the field now having multiple trackable ways to cite computational methods. Yet most software authors do not specify how they would like their code to be cited, while others specify a citation method that is not easily tracked (or tracked at all) by most indexers. Two metadata file formats, codemeta.json and CITATION.cff, developed in 2016 and 2017 respectively, are useful for specifying how software should be cited. In 2020, the Astrophysics Source Code Library (ASCL, ascl.net) undertook a year-long effort to generate and send these software metadata files, specific to each computational method, to code authors for editing and inclusion on their code sites. We wanted to answer the question, "Would sending these files to software authors increase adoption of one, the other, or both of these metadata files?" The answer in this case was no. Furthermore, only 41% of the 135 code sites examined for use of these files had citation information in any form available. The lack of such information creates an obstacle for article authors to provide credit to software creators, thus hindering citation of and recognition for computational contributions to research and the scientists who develop and maintain software.
-
Abstract
In response to NASA's efforts supporting interdisciplinary research across Science Mission Directorate Divisions, the NASA Astrophysics Data System has been asked to expand its coverage of the literature in Planetary Sciences. This expansion project, which will be carried out over the next three years, involves the indexing and curation of refereed and non-refereed journal articles, conference proceedings, PhD theses, meeting abstracts, preprints, high-level datasets and software. During this time, ongoing citation and topic analysis on this disciplinary content will provide us with opportunities to improve content coverage and search relevance. Text mining efforts will be used to enrich these records through automated detection of concepts found in papers and the extraction of links to online datasets and software used in them. The ultimate goal of this effort is to provide the same level of support for Planetary Science as ADS currently provides for Astrophysics.
-
Abstract
We explore how astronomers take observational data from telescopes, process them into usable scientific data products, curate them for later use, and reuse data for further inquiry. Astronomers have invested heavily in knowledge infrastructures - robust networks of people, artifacts, and institutions that generate, share, and maintain specific knowledge about the human and natural worlds. Drawing upon a decade of interviews and ethnography, this article compares how three astronomy groups capture, process, and archive data, and for whom. The Sloan Digital Sky Survey is a mission with a dedicated telescope and instruments, while the Black Hole Group and Integrative Astronomy Group (both pseudonyms) are university-based, investigator-led collaborations. Findings are organized into four themes: how these projects develop and maintain their workflows; how they capture and archive their data; how they maintain and repair knowledge infrastructures; and how they use and reuse data products over time. We found that astronomers encode their research methods in software known as pipelines. Algorithms help to point telescopes at targets, remove artifacts, calibrate instruments, and accomplish myriad validation tasks. Observations may be reprocessed many times to become new data products that serve new scientific purposes. Knowledge production in the form of scientific publications is the primary goal of these projects. They vary in incentives and resources to sustain access to their data products. We conclude that software pipelines are essential components of astronomical knowledge infrastructures, but are fragile, difficult to maintain and repair, and often invisible. Reusing data products is fundamental to the science of astronomy, whether or not those resources are made publicly available. We make recommendations for sustaining access to data products in scientific fields such as astronomy.
-
Abstract
We study the problem of training named entity recognition (NER) models using only distantly-labeled data, which can be automatically obtained by matching entity mentions in the raw text with entity types in a knowledge base. The biggest challenge of distantly-supervised NER is that the distant supervision may induce incomplete and noisy labels, rendering the straightforward application of supervised learning ineffective. In this paper, we propose (1) a noise-robust learning scheme comprised of a new loss function and a noisy label removal step, for training NER models on distantly-labeled data, and (2) a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model. On three benchmark datasets, our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
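The distant-labeling step described above, matching mentions in raw text against a knowledge base, can be sketched in Python as follows. This illustrates only how distantly-labeled data might be produced, not the noise-robust training scheme the abstract proposes. The function name `distant_label`, the BIO tagging convention, the longest-match strategy, and the toy knowledge base are all assumptions made for this example.

```python
def distant_label(tokens, knowledge_base):
    """Assign BIO tags by longest exact match against a knowledge base.

    knowledge_base maps a tuple of lowercase tokens to an entity type.
    Unmatched tokens are tagged "O".
    """
    labels = ["O"] * len(tokens)
    max_len = max((len(key) for key in knowledge_base), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            entity_type = knowledge_base.get(span)
            if entity_type:
                labels[i] = f"B-{entity_type}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{entity_type}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Toy knowledge base (hypothetical entries, for illustration only).
kb = {("hubble", "space", "telescope"): "FACILITY", ("planck",): "MISSION"}
tokens = "We used Hubble Space Telescope and Planck data".split()
labels = distant_label(tokens, kb)
print(list(zip(tokens, labels)))
```

A matcher like this tags every occurrence of "Planck" as a mission, including sentences about the person or the constant, and leaves untagged any entity missing from the knowledge base. These are the noisy and incomplete labels that motivate a noise-robust learning scheme.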
-
Abstract
Existing text classification methods mainly focus on a fixed label set, whereas many real-world applications require extending to new fine-grained classes as the number of samples per label increases. To accommodate such requirements, we introduce a new problem called coarse-to-fine grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we opt to leverage label surface names as the only human guidance and weave in rich pre-trained generative language models into the iterative weak supervision strategy. Specifically, we first propose a label-conditioned finetuning formulation to attune these generators for our task. Furthermore, we devise a regularization objective based on the coarse-fine label constraints derived from our problem setting, giving us even further improvements over the prior formulation. Our framework uses the fine-tuned generative models to sample pseudo-training data for training the classifier, and bootstraps on real unlabeled data for model refinement. Extensive experiments and case studies on two real-world datasets demonstrate superior performance over SOTA zero-shot classification baselines.
-
Abstract
In response to NASA's efforts supporting interdisciplinary research across Science Mission Directorate Divisions, the NASA Astrophysics Data System has been tasked with expanding its coverage of the literature in Planetary Sciences and Heliophysics. While both disciplines are currently represented in the content that ADS indexes, their coverage is not as comprehensive and authoritative as ADS provides for its core Astronomy and Astrophysics collection. During the next year, we will develop a census to ensure research areas such as Space Science, Astrobiology, Aeronomy and Solar Physics are properly accounted for and represented in our database. The ultimate goal of this effort is to provide the same level of support for these disciplines as ADS currently provides for Astrophysics: current and accurate coverage of both refereed and gray literature, preprints, data and software. We expect that enhanced search capabilities will be developed in due time through collaborations with partners and stakeholders.
-
Abstract
We advocate that the Planetary Science (PS) community build a discipline-specific digital library, in collaboration with the existing astronomy digital library, ADS. We suggest that PS archives index and curate information on features and objects in our solar system and enable linking between datasets and the derived journal articles.
-
Abstract
Software citation contributes to achieving software sustainability in two ways: It provides an impact metric to incentivize stakeholders to make software sustainable. It also provides references to software used in research, which can be reused and adapted to become sustainable. While software citation faces a host of technical and social challenges, community initiatives have defined the principles of software citation and are working on implementing solutions.
-
Abstract
In this study we analyse the key driving factors of preprints in enhancing scholarly communication. To this end we use four groups of metrics: one refers to scholarly communication and is based on bibliometric indicators (Web of Science and Scopus citations), while the others reflect usage (usage counts in Web of Science), capture (Mendeley readers) and social media attention (Tweets). With these we measure two effects associated with preprint publishing: publication delay and impact. We define and use several indicators to assess the impact of journal articles with previous preprint versions in arXiv. In particular, the indicators measure several time intervals characterizing the process of publishing arXiv preprints and the reviewing process of the journal versions, and the ageing patterns of citations to preprints. In addition, we compare the observed patterns between preprints and non-OA articles without any previous preprint versions in arXiv. We observe that the "early-view" and "open-access" effects of preprints contribute to a measurable citation and readership advantage of preprints. Articles with preprint versions are more likely to be mentioned in social media and have a shorter Altmetric attention delay. Usage and capture correlate only moderately with citations, but more strongly than Tweets do. The different slopes of the regression lines between the different indicators reflect the different orders of magnitude of usage, capture and citation data.
-
Abstract
Second Order Operators (SOOs) are database functions which form secondary queries based on attributes of the objects returned in an initial query; they can provide powerful methods to investigate complex, multipartite information graphs. The NASA Astrophysics Data System (ADS) has implemented four SOOs (reviews, useful, trending, and similar), which use the citations, references, downloads, and abstract text. This tutorial describes these operators in detail, both alone and in conjunction with other functions. It is intended for scientists and others who wish to make fuller use of the ADS database. Basic knowledge of the ADS is assumed.
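As a rough illustration of how a second-order operator wraps an initial query, the Python sketch below composes query strings for the four operators named above. The functional form `operator(query)` and the `abs:` field prefix follow ADS search syntax as we understand it. The helper `apply_operator` is hypothetical and only builds strings; it does not contact the ADS.

```python
# The four second-order operators named in the abstract.
SECOND_ORDER_OPERATORS = ("reviews", "useful", "trending", "similar")


def apply_operator(operator, query):
    """Wrap an initial query string in a second-order operator."""
    if operator not in SECOND_ORDER_OPERATORS:
        raise ValueError(f"unknown second-order operator: {operator!r}")
    return f"{operator}({query})"


initial_query = 'abs:"weak lensing"'
for op in SECOND_ORDER_OPERATORS:
    # Each line is a secondary query built from the same initial query.
    print(apply_operator(op, initial_query))
```

Each resulting string could then be submitted as the `q` parameter of a search, so the initial query defines the set of records and the operator determines which related records are returned.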
-
Abstract
During the last 15 years the number of astronomy-related papers published by scientists in Venezuela has been continuously decreasing, mainly due to emigration. If rapid corrective actions are not implemented, professional astronomy in Venezuela could disappear.
-
Abstract
Software has been a crucial contributor to scientific progress in astronomy for decades, but practices that enable machine-actionable citations have not been consistently applied to software itself. Instead, software citation behaviors developed independently from standard publication mechanisms and policies, resulting in human-readable citations that remain hidden over time and that cannot represent the influence software has had in the field. These historical software citation behaviors need to be understood in order to improve software citation guidance and develop relevant publishing practices that fully support the astronomy community. To this end, a 23 year retrospective analysis of software citation practices in astronomy was developed. Astronomy publications were mined for 410 aliases associated with nine software packages and analyzed to identify past practices and trends that prevent software citations from benefiting software authors.
-
Abstract
The application described has been designed to create bibliographic entries in large databases with diverse sources automatically, which reduces both the frequency of mistakes and the workload for the administrators. This new system uniquely identifies each reference from its digital object identifier (DOI) and retrieves the corresponding bibliographic information from any of several online services, including the SAO/NASA Astrophysics Data System (ADS) and CrossRef APIs. Once the information is parsed into a relational database, the software is able to produce bibliographies in any of several formats, including HTML and BibTeX, for use on websites or in printed articles. The application is provided free of charge for general use by any scientific database. The power of this application is demonstrated by using it to populate reference data for the HITRAN and AMBDAS databases as test cases. HITRAN contains data that is provided by researchers and collaborators throughout the spectroscopic community. These contributors are credited for their contributions through the bibliography produced alongside the data returned by an online search in HITRAN. Prior to the work presented here, HITRAN and AMBDAS created these bibliographies manually, which is a tedious, time-consuming and error-prone process. The complete code for the new referencing system can be found at https://github.com/hitranonline/refs.
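The final step of such a pipeline, turning an already-parsed record into a BibTeX entry, might look like the following Python sketch. This is not the code from the repository above. The function `to_bibtex`, the record's field names, the citation-key scheme, and the sample record (which is placeholder data, not a real reference) are all assumptions for illustration. Retrieval of the metadata by DOI is omitted.

```python
def to_bibtex(record):
    """Format a parsed bibliographic record (a dict) as a BibTeX @article entry."""
    # Citation key: first author's surname plus year (an assumed convention).
    surname = record["authors"][0].split(",")[0].replace(" ", "")
    key = f"{surname}{record['year']}"
    fields = {
        "author": " and ".join(record["authors"]),
        # Extra braces preserve the title's capitalization in BibTeX.
        "title": "{" + record["title"] + "}",
        "journal": record["journal"],
        "year": str(record["year"]),
        "doi": record["doi"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


# Placeholder record for illustration only; not a real publication.
record = {
    "authors": ["Doe, J.", "Roe, R."],
    "title": "An example title",
    "journal": "Example Journal",
    "year": 2020,
    "doi": "10.1234/example",
}
entry = to_bibtex(record)
print(entry)
```

Keeping the formatter separate from the retrieval and storage steps means the same stored record can feed other output formats, such as HTML, without another lookup.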
-
Abstract
Efforts to make research results open and reproducible are increasingly reflected by journal policies encouraging or mandating authors to provide data availability statements. As a consequence of this, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and if there is an added value in providing such links. We consider 531,889 journal articles published by PLOS and BMC, develop an automatic system for labelling their data availability statements according to four categories based on their content and the type of data availability they display, and finally analyze the citation advantage of different statement categories via regression. We find that, following mandated publisher policies, data availability statements become very common. In 2018, 93.7% of 21,793 PLOS articles and 88.2% of 31,956 BMC articles had data availability statements. Data availability statements containing a link to data in a repository -- rather than being available on request or included as supporting information files -- are a fraction of the total. In 2017 and 2018, 20.8% of PLOS publications and 12.2% of BMC publications provided DAS containing a link to data in a repository. We also find an association between articles that include statements that link to data in a repository and up to 25.36% (± 1.07%) higher citation impact on average, using a citation prediction model. We discuss the potential implications of these results for authors (researchers) and journal publishers who make the effort of sharing their data in repositories. All our data and code are made available in order to reproduce and extend our results.
-
Abstract
The NASA Astrophysics Data System has now completed the transition from its 25-year-old legacy interface ("ADS Classic") to a new cloud-based platform ("the new ADS"). This transition has represented a significant change in the user experience as well as in the technology underpinning its search engine. Taken together, these changes represent a major challenge for its users but also provide an opportunity for them to discover capabilities previously unavailable or unexplored. The original paradigm of iterative searches championed by ADS Classic has now been expanded to allow workflows which include search, refine, and explore. In this poster we will provide a quick overview of the typical uses of ADS: finding a paper, looking up an author, and exploring a topic. In addition to showing how each one of these activities can be carried out with ease in the new system, we will highlight the additional exploration features available to the curious user who is willing to further explore the literature.
-
Abstract
The Agile manifesto encourages us to value individuals and interactions over processes and tools, while Scrum, the most widely adopted Agile development methodology, is essentially based on roles, events, artifacts, and the rules that bind them together (i.e., processes). Moreover, it is generally proclaimed that whenever a Scrum project does not succeed, the reason is that Scrum was not implemented correctly, not that Scrum may have its own flaws. This grants irrefutability to the methodology, discouraging deviations to fit the actual needs and peculiarities of the developers. In particular, the members of the NASA ADS team are highly creative and autonomous individuals whose motivation can be affected if their freedom is too strongly constrained. We present our experience following Agile principles, reusing certain Scrum elements and seeking the satisfaction of the team members, while rapidly reacting to change and keeping the project in line with our stakeholders' expectations.