Accepted Papers

Usability of Static Application Security Testing Workflows

Bhagya Chembakottu and Martin P. Robillard

Usable static application security testing (SAST) tools can facilitate the development of secure code within GitHub workflows. We report on our experience applying these tools with the Spring and Django web development frameworks, analyzing aspects such as setup complexity, build integration, and the utility of the generated vulnerability reports. A key observation is that Django projects require less effort to integrate, whereas Spring projects involve significant setup challenges, particularly with SonarCloud, due to build environment dependencies. Furthermore, we observed usability issues such as ambiguous error messages and inconsistent warnings. By examining setup time and error incidence, we provide insights for improving SAST usability and recommendations for easier installation and clearer notifications.

Malicious and Unintentional Disclosure Risks in Large Language Models for Code Generation

Rafiqul Rabin, Sean McGregor and Nick Judd

This paper explores the risk that a large language model (LLM) trained for code generation on data mined from software repositories will generate content that discloses sensitive information included in its training data. We decompose this risk, known in the literature as "unintended memorization," into two components: unintentional disclosure (where an LLM presents secrets to users who are not seeking them out) and malicious disclosure (where an LLM presents secrets to an attacker equipped with partial knowledge of the training data). We observe that while existing work mostly anticipates malicious disclosure, unintentional disclosure is also a concern. Next, we describe methods to assess unintentional and malicious disclosure risks side by side across different releases of training datasets and models. We demonstrate these methods through an independent assessment of the Open Language Model (OLMo) family of models and its Dolma training datasets. Our results show, first, that changes in data source and processing are associated with substantial changes in unintended memorization risk; second, that the same set of operational changes may increase one risk while mitigating another; and, third, that the risk of disclosing sensitive information varies not only by prompt strategies or test datasets but also by the relevant types of sensitive information. These contributions rely on data mining to enable the greater privacy and security testing required for the LLM training data supply chain.
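
As a rough illustration of the side-by-side probing described above, the sketch below contrasts an ordinary coding prompt (unintentional disclosure) with an attacker-supplied prefix (malicious disclosure) and scans each completion for one class of secret. The model checkpoint, prompts, and secret pattern are illustrative assumptions, not the paper's actual protocol.

```python
# Hypothetical sketch: probing a code-generating LLM for unintentional vs.
# malicious disclosure. Checkpoint name, prompts, and the secret pattern are
# assumptions for illustration only.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "allenai/OLMo-1B-hf"  # assumed OLMo checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# A crude detector for one class of sensitive strings (AWS-style access key IDs).
SECRET_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")

def complete(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Unintentional disclosure: an ordinary coding prompt with no attacker knowledge.
benign = complete("def connect_to_s3():\n    ")

# Malicious disclosure: the attacker supplies a prefix they believe precedes a
# secret in the training data (a made-up placeholder prefix here).
adversarial = complete('AWS_ACCESS_KEY_ID = "')

for label, text in [("unintentional", benign), ("malicious", adversarial)]:
    hits = SECRET_PATTERN.findall(text)
    print(label, "leaked" if hits else "clean", hits)
```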

Impact of Identifier Normalization on Vulnerability Detection Techniques

Torge Hinrichs, Tim Diercks and Riccardo Scandariato

This study examines the impact of identifier normalization on software vulnerability detection using three approaches: static analysis tools, specialized Machine Learning (ML) models, and Large Language Models (LLMs). Using the BigVul dataset of vulnerabilities in C/C++ projects, the research evaluates the performance of these methods under normalized (generalized variable/function names) and original conditions. Static analysis tools such as Flawfinder and CppCheck exhibit limited effectiveness (F1 scores ~0.1) and are unaffected by normalization. Specialized ML models, such as LineVul, achieve high F1 scores on non-normalized data (F1 ~0.9) but suffer significant performance drops when tested on normalized inputs, highlighting their lack of generalizability. In contrast, LLMs such as Llama3, although underperforming in their pre-trained state, show substantial improvement after fine-tuning, achieving robust and consistent results across both normalized and non-normalized datasets. The findings suggest that while static analysis tools are less effective, fine-tuned LLMs hold strong potential for scalable and generalized vulnerability detection. The study recommends further exploration of hybrid approaches that combine ML models, LLMs, and traditional tools to enhance accuracy and adaptability in diverse scenarios.
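
To make the normalization step concrete, the following is a minimal sketch of one way to replace user-defined identifiers in C code with generic tokens; the regex-based renaming and the short keyword list are simplifying assumptions and may differ from the normalization procedure used in the paper.

```python
# Minimal sketch of identifier normalization: user-defined variable and function
# names are replaced with generic tokens (VAR1, FUNC1, ...). Keyword list and
# heuristics are illustrative assumptions.
import re

C_KEYWORDS = {
    "int", "char", "void", "return", "if", "else", "for", "while", "sizeof",
    "struct", "const", "unsigned", "long", "short", "float", "double",
}

def normalize(code: str) -> str:
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in C_KEYWORDS or name in mapping:
            return mapping.get(name, name)
        # Treat identifiers followed by '(' as functions, everything else as variables.
        is_call = code[match.end():match.end() + 1] == "("
        prefix = "FUNC" if is_call else "VAR"
        count = sum(1 for v in mapping.values() if v.startswith(prefix)) + 1
        mapping[name] = f"{prefix}{count}"
        return mapping[name]

    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", rename, code)

# Prints: int FUNC1(char *VAR1) { FUNC2(VAR1, VAR2); return 0; }
print(normalize("int copy_name(char *dst) { strcpy(dst, user_input); return 0; }"))
```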

SBOM Generation Tools and Formats Affect Compliance with US Standard

Redempta Manzi Muneza, Aidan Keefe, Eric O'Donoghue, Clemente Izurieta and Ann Marie Reinhold

Software Bills of Materials (SBOMs) provide transparency in the software supply chain (SSC), allowing organizations to assess and manage software security risks. To enhance the security of the SSC, the National Telecommunications and Information Administration (NTIA), an agency under the U.S. Department of Commerce, released a set of minimum elements that an SBOM must contain for software provided to the U.S. government. Despite this requirement, the impact of SBOM generation on compliance with these minimum elements remains largely unknown. To address this concern, we conducted a systematic analysis to assess how SBOM generation tools and formats influence compliance with NTIA’s standards. We mined Docker Hub for 2,225 Docker images and generated four distinct sets of SBOMs by varying the generation tools (Trivy and Syft) and formats (CycloneDX and SPDX). We then evaluated the NTIA compliance of each SBOM using SBOMQS, an open-source SBOM quality assessment tool. We found numerous disparities in SBOM compliance with the NTIA minimum elements that are attributable to how the SBOMs were generated; notably, compliance scores vary more with the choice of generation tool than with the format. These disparities, even for identical software artifacts, emphasize the critical need to refine and validate SBOM technologies to strengthen SSC security. This research informs SBOM generation from a government compliance perspective, filling a critical gap for software vendors and providing insights that help improve both privacy and security in software systems through better SBOM generation practices.
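
The generation-and-scoring pipeline can be pictured roughly as follows. The sketch shells out to Trivy, Syft, and sbomqs; the exact CLI flags, the example image, and the output file names are assumptions based on those tools' documented interfaces and may need adjusting for the versions installed.

```python
# Illustrative sketch of generating the four SBOM variants (tool x format) for
# one image and scoring each with sbomqs. Commands are assumed, not verified
# against the paper's setup.
import subprocess

IMAGE = "python:3.12-slim"  # any image mined from Docker Hub would do

COMMANDS = {
    ("trivy", "cyclonedx"): ["trivy", "image", "--format", "cyclonedx",
                             "--output", "trivy-cdx.json", IMAGE],
    ("trivy", "spdx"): ["trivy", "image", "--format", "spdx-json",
                        "--output", "trivy-spdx.json", IMAGE],
    ("syft", "cyclonedx"): ["syft", IMAGE, "-o", "cyclonedx-json=syft-cdx.json"],
    ("syft", "spdx"): ["syft", IMAGE, "-o", "spdx-json=syft-spdx.json"],
}

# Generate one SBOM per tool/format combination.
for cmd in COMMANDS.values():
    subprocess.run(cmd, check=True)

# Score each SBOM against the NTIA minimum elements with sbomqs.
for path in ["trivy-cdx.json", "trivy-spdx.json", "syft-cdx.json", "syft-spdx.json"]:
    result = subprocess.run(["sbomqs", "score", path],
                            check=True, capture_output=True, text=True)
    print(f"--- {path} ---")
    print(result.stdout)
```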

Links Between Package Popularity, Criticality, and Security in Software Ecosystems

Alexis Butler, Dan O'Keeffe and Santanu Kumar Dash

With the continued growth of Open Source Software (OSS), maintenance workloads have also continued to expand; this, along with additional stressors, results in maintainer burnout and churn. Given that the pool of those within a software ecosystem with the expertise and willingness to maintain a project is limited, maintenance efforts should be focused on minimizing the security risks with the greatest potential impact. One would expect a well-maintained ecosystem to have strong security across all packages, or at the very least, strong security in packages that are core to the ecosystem. As such, dependency graphs for two ecosystems (Python and JavaScript/TypeScript) were captured to obtain criticality and popularity scores for each package. Security was measured at multiple points across the range of these metrics to establish the relationships between popularity, criticality, and security. In doing so, a statistically significant, moderate positive correlation between security and popularity was found for both ecosystems, along with inconclusive results for the correlation between security and criticality. These results can assist both in feature selection for machine learning-based dependency risk measurement and in dataset sampling for future evaluations of security tooling.
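
For readers wanting to reproduce this style of analysis, a minimal sketch of the correlation step is given below, assuming per-package popularity, criticality, and security scores are already available; the column names, toy values, and the choice of Spearman rank correlation are illustrative assumptions rather than the paper's exact method.

```python
# Minimal sketch: rank correlation between per-package security scores and
# popularity/criticality scores. Data values are placeholders.
import pandas as pd
from scipy.stats import spearmanr

packages = pd.DataFrame({
    "popularity":  [120_000, 5_300, 87, 41_000, 950],   # e.g., downloads or stars
    "criticality": [0.81, 0.42, 0.05, 0.67, 0.21],       # e.g., criticality score
    "security":    [8.2, 6.5, 3.1, 7.9, 4.4],            # e.g., Scorecard-style score
})

for metric in ("popularity", "criticality"):
    rho, p = spearmanr(packages[metric], packages["security"])
    print(f"security vs {metric}: rho={rho:.2f}, p={p:.3f}")
```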