We’re thrilled to share that the third annual INTERSECT Research Software Engineering Bootcamp, held July 14-18, 2025 at Princeton University, concluded with great success! This immersive 4.5-day event brought together a vibrant cohort of intermediate research software developers from diverse domains, many of whom lacked formal computer science training.
Funded by a National Science Foundation (NSF) grant and organized in collaboration with Dr. Jeff Carver from the University of Alabama, the bootcamp focused on core Research Software Engineering (RSE) practices. Led by volunteer instructors from the broader RSE community, participants engaged in hands-on sessions covering:
Software Design
Collaborative Git & Pull Requests
Code Review
Licensing & Documentation
Testing & CI/CD
Packaging & Distribution
The energy and enthusiasm throughout the week were inspiring. Attendees not only sharpened their technical skills but also built lasting connections across institutions and disciplines. We’re proud to support the growth of the RSE community and grateful to everyone who made this event possible.
We’re excited to announce the publication of a new paper, Ten Simple Rules for Catalyzing Collaborations and Building Bridges between Research Software Engineers (RSEs) and Software Engineering Researchers (SERs), authored by Nasir Eisty, Jeffrey Carver, Johanna Cohoon, Ian Cosden, Carole Goble, and Samuel Grayson.
Published in IEEE Computing in Science & Engineering (CiSE), this work emerged from discussions at a Dagstuhl Seminar and addresses a critical but often overlooked opportunity in the research software ecosystem: fostering collaboration between RSEs and SERs.
While both communities share a passion for improving software in research, they often operate in distinct environments, with different vocabularies, incentives, and expectations. This paper offers ten actionable rules designed to bridge those gaps, encouraging meaningful, sustained partnerships that combine practical experience with theoretical insight.
By working together, RSEs and SERs can drive innovation in tools, practices, and infrastructure, ultimately advancing the quality and impact of scientific research.
ASPIRE is an open-source Python package for 3D molecular structure determination from cryo-EM data, with many submodules solving complex equations that could be accelerated on GPUs. ASPIRE developers have participated in GPU hackathons for the last four years, but new group members Josh and Chris attended for the first time this year. The problem Josh and Chris selected was spectral decomposition using the power method to determine the leading eigenvalue/eigenvector, a limiting step in ASPIRE’s computation pipeline. The problem was ideal: well scoped, already tested, and available in a range of problem sizes for scaling experiments on a GPU. The team primarily worked by pair programming, frequently meeting to share paired results and determine next steps.
Resolving Python Anti-Patterns
The initial implementation had a few “fast Python” anti-patterns. On the first day, we replaced for-loops and Python list comprehensions with NumPy operations for large speedups, even before moving to a GPU. We used Python’s standard library profiler, cProfile, to get an idea of which functions were slowing down the code, and visualized the profiler output using SnakeViz.
Figure 1: Python cProfile of the original implementation visualized with SnakeViz.
Moving to CuPy
We initially used drop-in replacements of NumPy calls with CuPy. This actually worsened performance compared to a single CPU core, likely because optimizations for the CPU do not necessarily translate to GPU speedups. We used NVIDIA’s Nsight Systems to generate a profile of the kernel launches generated by CuPy and to understand the memory transfers from host to device. We noticed that CuPy was generating hundreds of launches of a kernel called “gemmSN_NN_kernel”. Although each call performed a small operation, the launch overhead was as costly as the operation itself. In aggregate, these launches added up significantly, and, worse, the issue would scale with problem size. Further analysis showed that the large number of kernel launches generated by CuPy was due primarily to two operations. First, a 4D matrix multiply was decomposed into a series of 3D multiplications whose last dimensions were small. Flattening the matrix into 3D greatly reduced the number of kernel launches and increased the size of each operation. Second, a matrix conjugation was expressed as a direct translation of the underlying mathematical theory, as two matrix multiply operations. Careful inspection of the final result revealed that the costly multiplications could be replaced with a single element-wise multiplication, further reducing the number of kernel launches performed by CuPy.
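Both fixes can be sketched in NumPy with illustrative shapes and names (the same calls work under CuPy by swapping numpy for cupy; these are not ASPIRE's actual matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fix 1: a 4D multiply done as a loop of 3D multiplies with small last
# dimensions launches many tiny kernels under CuPy; flattening the batch
# dimensions yields one large batched multiply instead.
B1, B2, n, k = 8, 16, 32, 4
A = rng.random((B1, B2, n, k))
X = rng.random((B1, B2, k, k))
out_loop = np.stack([A[i] @ X[i] for i in range(B1)])
out_flat = (A.reshape(B1 * B2, n, k) @ X.reshape(B1 * B2, k, k)).reshape(B1, B2, n, k)
assert np.allclose(out_loop, out_flat)

# Fix 2: a conjugation by a diagonal matrix, J @ M @ J with J = diag(s),
# equals an element-wise multiply by the outer product s s^T, replacing
# two matrix multiplies with one element-wise product.
s = np.array([1.0, 1.0, -1.0])
M = rng.random((3, 3))
assert np.allclose(np.diag(s) @ M @ np.diag(s), np.outer(s, s) * M)
```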
Figure 2: Nvidia Nsight Systems breakdown of naive drop-in CuPy replacement showing costly matrix multiply kernel launches.
Randomizing Batch Operation
As part of refactoring, we had modified the index generator to produce subproblems in batches instead of individually. The batch size was derived from the problem size, and the indices were generated in sorted order. This caused a memory bottleneck during a final sum-reduce operation that was expressed as a weighted histogram function. The sorted indices were causing collisions in bins, leading to inefficient memory access due to the atomicAdd required when accumulating into one bin. We addressed this slowdown by randomizing the indices that are batched together to decrease the frequency of stalls during the reduce operation. We also identified that we could cache the indices for each iteration, further accelerating the code.
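The effect can be sketched with NumPy's np.add.at, the CPU analogue of the GPU's atomic accumulation (sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_items = 64, 100_000

# Sorted indices: each batch hammers a few neighboring bins, so on the GPU
# the atomicAdd updates collide and serialize.
indices = np.sort(rng.integers(0, n_bins, n_items))
weights = rng.random(n_items)

hist_sorted = np.zeros(n_bins)
np.add.at(hist_sorted, indices, weights)  # weighted histogram reduce

# Randomizing which indices are batched together spreads updates across
# bins, reducing collisions without changing the result.
perm = rng.permutation(n_items)
hist_random = np.zeros(n_bins)
np.add.at(hist_random, indices[perm], weights[perm])

assert np.allclose(hist_sorted, hist_random)
```

On the CPU the two orderings take similar time; the payoff of the permutation shows up in the GPU reduce, where fewer threads contend for the same bin.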
Figure 3: Histogram of hit counts for each iteration in the loop. (a) Indices generated in sorted order show each loop iteration updating a small number of positions, causing collisions. (b) Randomizing the indices that are batched together decreases the frequency of stalls during the reduce operation.
Host Memory Bug
As we ramped up the problem size, we noticed that we were running out of memory on the host machine after a few iterations. Since the code was performing all operations in place, the accumulating host memory usage was initially mysterious.
Figure 4: Increasing memory utilization in the host machine suggesting a memory leak.
We tracked the issue down to a temporary variable overwriting a variable used outside the loop scope in the outer iteration of the power method. In the code below, “vec” is initialized before the loop. On each iteration, “vec” is replaced with “vec_new”, and the old value of “vec” can be safely discarded. However, since “vec” is used after the loop, CuPy retained a copy of “vec” in host memory, causing copies of “vec” to accumulate.
# Power method iterations
for itr in range(max_iters):
    vec_new = signs_times_v(vijs, vec, conjugate, edge_signs)
    vec_new /= norm(vec_new)
    residual = norm(vec_new - vec)
    vec = vec_new
    if residual < epsilon:
        print(f'Converged after {itr} iterations of the power method.')
        break
else:
    print('max iterations')

# We need only the signs of the eigenvector
J_sync = np.sign(vec)
The simple solution was to initialize vec_new before the loop and use its value after the loop exits so that CuPy knows “vec” can be safely garbage collected in each iteration.
# Initialize vec_new to prevent blocking garbage collection of vec
vec_new = vec

# Power method iterations
for itr in range(max_iters):
    triplets_iter = all_triplets_batch(triplets, batch_size)
    vec_new = signs_times_v(vijs, vec, conjugate, edge_signs, triplets_iter)
    vec_new /= norm(vec_new)
    residual = norm(vec_new - vec)
    vec = vec_new
    if residual < epsilon:
        print(f'Converged after {itr} iterations of the power method.')
        break
else:
    print('max iterations')

# We need only the signs of the eigenvector
J_sync = np.sign(vec_new)
Final Speedup
We achieved significant speedup on both the CPU and the GPU. Our initial implementation would time out after 4 hours while processing a problem size of N=400. After all optimizations, we process a realistic problem size of N=1000 in 2 hours on the CPU and in just over 1 minute on the GPU. Problem sizes that previously needed to be partitioned can now be executed directly, and the implementation is expected to scale to larger problem sizes as well.
Figure 5: Performance scaling chart showing improvement in both the serial and GPU versions. The serial version is estimated to have a 30x speedup over the original version, and the GPU-accelerated version has a further 120x speedup over the serial version.
Takeaways
There were some key points that we feel were essential to achieving these improvements within the constraints of a 4-day hackathon.
Preparation
It was crucial that the target algorithm could be run and tested in isolation, away from the rest of the codebase. Our algorithm is an intermediate step within a large computational pipeline. It would be difficult to quickly test modifications and pinpoint optimizations in the full problem context, so the algorithm was broken out of the main code and put into a standalone script. We also wrote a script to generate sample inputs of any problem size, which could be created and saved to disk outside of the code to be optimized. The key here is to identify what external information your code needs to run in a realistic setting, and try to eliminate everything else. In addition, as a sanity check, we saved *outputs* of the original code for comparison, in case our modifications unintentionally broke the code.
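A sketch of that harness, where the input generator and the algorithm are stand-ins for ASPIRE's real ones:

```python
import numpy as np

def generate_inputs(n, seed=0):
    # Stand-in for the script that builds a problem of size n
    rng = np.random.default_rng(seed)
    return rng.random((n, n))

def algorithm(x):
    # Stand-in for the isolated step being optimized
    return x @ x.T

# Before the hackathon: save the original code's output as a reference.
x = generate_inputs(100)
np.save("reference_output.npy", algorithm(x))

# After each modification: verify nothing broke.
assert np.allclose(algorithm(x), np.load("reference_output.npy"))
```

The saved inputs let any team member reproduce a run of any size, and the saved reference output turns every optimization attempt into a quick regression check.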
Don’t prematurely optimize
A lot of code is not initially written to be maximally efficient; there is often a trade-off between readability and efficiency. Correctness and communicating the algorithm are vital when code is first written. Look for common improvements before attempting to profile the code on the GPU. For example, using cProfile we found bloated Python code that was easy to streamline without resorting to fancy solutions. Removing unnecessary list comprehensions and directly calculating indices instead of searching for them led to large speedups without sacrificing readability. Next, we optimized the code to run on just the CPU by translating verbose, multi-step calculations into terser, single calls to NumPy routines. As a bonus, code that primarily uses NumPy is easy to translate into CuPy. Finally, we moved on to the GPU, confident that we were entering this stage of the hackathon with robust, vetted code. There are options for when code starts to become cryptic. Commenting code that has become too opaque with the original, verbose version is an easy way to retain documentation. Moving the old version into a unit test is a more active way to ensure new versions continue to match legacy code while providing some documentation.
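A toy version of the "calculate indices instead of searching for them" fix (the names and data layout here are hypothetical, not ASPIRE's):

```python
import numpy as np

# Values laid out with a known, regular structure
items = np.arange(1000, 2000)
wanted = [1005, 1500, 1999]

# Verbose original: search the array for each value's position.
found = [int(np.where(items == w)[0][0]) for w in wanted]

# Because items is a contiguous range, each index can be computed
# directly in O(1) instead of scanned for in O(n).
direct = [w - 1000 for w in wanted]

assert found == direct
```

When the data has exploitable structure, replacing a search with arithmetic removes whole loops without making the code any harder to read.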
Working as a Group
Having two pair programming teams tackling a complicated piece of numerical code requires coordination. In addition to isolating all the code and data we needed to experiment, we put all of this into a git repository. With each pair working on an individual feature branch, along with asynchronously communicating via Slack, collisions and confusion were avoided. As mentioned above, optimization can make code more difficult to read, so verbal briefings of how and why changes were made kept everyone on the same page.
Intel Advisor is an optimization tool that helps developers identify hot spots and performance issues, and provides recommendations for performance improvement. It is installed on most of the Princeton Research Computing systems. Intel Advisor was previously part of the licensed Parallel Studio XE (PSXE) releases; it is now included in the Intel oneAPI Base Toolkit, which is free to download. In this article, we walk you through the process of collecting performance data remotely on the Princeton Research Computing clusters using the Intel Advisor command line interface (CLI) and displaying the results on a local macOS system using the Intel Advisor graphical user interface (GUI).
Preparing Applications for Performance Analysis
For C/C++ and Fortran code (on Linux), it is recommended to set up the following compiler flags before running the performance analysis:
Request full debug information (compiler and linker): -g
Request moderate optimization: -O2 or higher
Disable interprocedural optimization, which may prevent the profiler from collecting performance data: -no-ipo
Using Intel Advisor on Princeton Research Computing Clusters
Previously, we recommended using the CLI to collect data via batch jobs on compute nodes and then viewing the results with the GUI on a login node. Now that the Intel Advisor GUI is freely available on macOS, we recommend copying the collected data from the remote system to your local macOS machine for viewing. Note that Intel Advisor does not support data collection on macOS; you can only use macOS to display data collected on a Windows or Linux system.
Collecting Data at Remote System
Once on a remote system (e.g., TigerCPU, Adroit, etc.), start by loading the module, e.g., module load intel-advisor. Then you can collect the data using the Intel Advisor CLI, which is launched with the advisor command. You can use advisor --help to search for the command for a specific action. For example, after issuing the advisor --help command, you will see
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2020 Intel Corporation. All rights reserved.
Usage: advisor <--action> [--action-option] [--global-option] [[--] <target>
[target options]]
<action> is one of the following:
collect Run the specified analysis and collect data. Tip: Specify the search-dir when collecting data.
command Issue a command to a running collection.
create-project Create an empty project, if it does not already exist.
help Explain command line actions with corresponding options.
import-dir Import and finalize data collected on an MPI cluster.
mark-up-loops After running a Survey analysis and identifying loops of interest, select loops (by file and line number or criteria) for deeper analysis.
report Generate a report from data collected during a previous analysis.
snapshot Create a result snapshot.
version Display product version information.
workflow Explain typical Intel Advisor user scenarios, with corresponding command lines.
For help on a specific action, type: advisor --help <action>
Examples:
Perform a Survey analysis.
advisor --collect=survey --project-dir=./advi --search-dir src:r=./src -- ./bin/myApplication
Generate a Survey report.
advisor --report=survey --project-dir=./advi --search-dir src:r=./src
Display help for the collect action.
advisor --help collect
advisor --help collect shows you the command to perform a specific analysis. For example, to perform a Survey analysis to determine hotspots, we use a command like:
advisor --collect=survey --project-dir=./advi --search-dir src:r=./src -- ./bin/myApplication
To collect the roofline, you can run a Trip Counts analysis on top of the Survey analysis above, for example:
advisor --collect=tripcounts --flop --project-dir=./advi -- ./bin/myApplication
Note that the project directory needs to be the same for both analyses.
It is also helpful to use the GUI to find out the command. For example, you can:
Log in to a remote head node with ssh -Y username@adroit.princeton.edu
Load the module with module load intel-advisor
Launch the Intel Advisor GUI with advisor-gui
Create a project
Set up the project properties
Choose the appropriate analysis type
Click the Get Command Line button on the Workflow tab under the desired analysis
Copy the command line to clipboard to paste to the script for remote runs
To view the results, you can copy the whole project directory to your local macOS machine. It is also recommended to first pack the analysis results into a snapshot and then copy the packed *.advixeexpz file. For example:
advisor --snapshot --project-dir=./advi --pack --cache-sources --cache-binaries -- ./advi_snapshot
You can download Intel Advisor for macOS from the oneAPI Base Toolkit. After launching the Intel Advisor GUI, go to File > Open > Project/Result and navigate to the copied project directory or snapshot.
Intel Advisor GUI
This article covers the following new features in Intel Advisor version 2021.1:
Intel Advisor is included as part of the Intel oneAPI Base Toolkit.
The executables are renamed: advixe-cl is now advisor, and advixe-gui is now advisor-gui.
The roofline analysis is provided as a single command. In earlier versions, a roofline analysis was done by first running a Survey analysis followed by a Trip Counts analysis; now the roofline can be run in a single step using the --collect=roofline option.
As a researcher, inquiries about previously published research probably evoke one of two feelings: panic-filled regret or calm authority. Often the difference is time; it’s easier to talk about the project you worked on last week than last decade. Talking about old protocols or software is a lot like someone critically examining the finger painting you did as a child. You know it’s not perfect and you would do several things differently in hindsight, but it is the method in the public record. The rate of change seems faster with software development, where new technologies redefine best practices and standards at dizzying rates.
Perhaps the most challenging problem is when researchers outside of your institution fail to reproduce results. How can you troubleshoot the software on every system or determine what missing piece is required to get things working? What was that magic bash command you wrote 5 years ago?
Blocklint is a simple command line utility for finding non-inclusive wording with an emphasis on source code. If you’ve used a modern IDE, you know the importance of immediate feedback for compilation errors or even stylistic slip-ups. Knowing all variables should be declared or that lines must be less than 80 characters long is good, but adhering to those rules takes a back seat when in the flow of writing code. A linter brings these issues back into your consciousness by highlighting the problematic lines of code. Over time, the enforced style becomes more intuitive but the linter is always there to nudge you if you slip.
I started working as a research software engineer for the Princeton Neuroscience Institute (PNI) in May 2017. At the end of my first week I received an email from Professor Mala Murthy and post-doc David Deutsch of the MurthyLab, asking if I would be interested in working on a project involving a “virtual reality environment for neural recording experiments”. The kid in me got very excited at the prospect of making video games. At the time I did not know the project was to develop a virtual reality simulation for flies!
The FlyVR setup. Projection arena for visual stimuli not shown as it surrounds parts of the setup and occludes components. Projector for visual stimuli is directed at a curved mirror to project onto half dome surrounding fly.
Virtual reality experiments have a long history in neuroscience. They allow researchers to restrict the movement of animal subjects so that they can use advanced microscopy to image their brains in “naturalistic” environments. In the Murthy lab’s VR setup, the fly is fixed to the objective of a two-photon microscope and suspended above a small sphere floating on a column of air. This small sphere is used as a sort of omni-directional treadmill. While the fly cannot actually move, it can move its legs, which in turn move the freely rotating sphere. The movement of the sphere is tracked with computer vision algorithms, and thus a fictive path for the fly in a virtual world can be reconstructed. This setup allows a “moving” fly’s brain to be imaged with techniques that require it to be stationary. The two-photon imaging system then provides a very flexible and powerful tool for studying changes in the fly’s brain activity over the course of the experiment. Different spatial and temporal resolutions are available depending on the needs of the experimenter.
As I started using Snakemake, I had hundreds of jobs that I wanted performance information about. seff gives the efficiency information I wanted, but only for a single job at a time. sacct handles multiple jobs, but can’t give the efficiency. With the current Python implementation of reportseff, all job information is obtained from a single sacct call, and with click the output is colored so you can quickly see how things are running.
The foundational histogramming package for Python, boost-histogram, was just updated to version 0.6! This is a major update to the new Boost.Histogram bindings. Version 0.6.1 is based on the Histogram package from the recently released Boost C++ Libraries version 1.72.
This Python library is part of a larger picture in the Scikit-HEP ecosystem of tools for Particle Physics and is funded by DIANA/HEP and IRIS-HEP. It is the core library for making and manipulating histograms. Other packages are under development to provide a complete set of tools to work with and visualize histograms. The Aghast package is designed to convert between popular histogram formats, and the Hist package will be designed to make common analysis tasks simple, like plotting via tools such as the mplhep package. Hist and Aghast will initially be driven by HEP (High Energy and Particle Physics) needs, but outside issues and contributions are welcome and encouraged.
As with any substantial application package, the ASPIRE project needed a convenient way to specify configuration settings pertaining to different parts of the computational pipeline.
What follows below are some outlines from our attempts to tackle this configuration issue. Where a supplementary (and hopefully useful) nugget is provided, or a caveat discussed, I shall append a linked numeral, like so: (n)
A brief background of ASPIRE
ASPIRE is a Python (3.6) package under development which ingests micrographs, the output of cryo-electron microscopy (images that closely resemble television static), and produces a 3D reconstruction of the molecule. Read the excellent writeup on the ASPIRE page for a more comprehensive review of the package.