Past Projects

Projects from 2023

Project 1: Deep Learning to Enable Multi-messenger Astrophysics Discovery in the LSST Era

Co-mentors

Dr. Eliu Huerta
Argonne National Laboratory

Dr. Zhizhen Jane Zhao
Electrical and Computer Engineering

Project Description

This project will focus on the design of deep learning (DL) algorithms for the detection and parameter estimation of gravitational wave sources, and on the identification of their electromagnetic and gravitational emission. We will build upon our work in these research areas and will leverage NCSA's role as the main data hub for the Dark Energy Survey and the Legacy Survey of Space and Time to create DL tools for processing and identifying astrophysical sources that may be observed concurrently in the gravitational and electromagnetic spectra.

Student contributions

The REU student will learn several skills to contribute to this project, including the use of supercomputers to curate the datasets used to train, validate, and test neural networks on GPU-based systems at NCSA, Argonne National Laboratory, and Oak Ridge National Laboratory. The student will become acquainted with open-source platforms for DL research, including TensorFlow and PyTorch, as well as distributed learning solutions such as Horovod. The student will learn to process and correctly interpret the predictions of these deep learning models, and to use scientific visualizations to understand how the models abstract knowledge from data and which parts of the model are involved in producing a prediction.
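
As a taste of the distributed training mentioned above, here is a minimal sketch that wires a toy PyTorch model into Horovod; the model, data, and hyperparameters are placeholders rather than anything from the project.

```python
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across workers and start everyone from the same weights.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    x = torch.randn(32, 128).cuda()          # stand-in for a curated batch
    y = torch.randint(0, 2, (32,)).cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```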

Project 2: Machine Learning and Geospatial Approach to Targeting Humanitarian Assistance Among Syrian Refugees in Lebanon

Co-mentors

Dr. Angela Lyons
Agricultural & Consumer Economics

Dr. Aiman Soliman
NCSA

Project Description

An estimated 84 million persons are forcibly displaced worldwide, and at least 70% of them live in extreme poverty. More efficient targeting mechanisms are needed to identify the vulnerable families most in need of humanitarian assistance. Traditional targeting models rely on a proxy means testing (PMT) approach, in which support programs target refugee families whose estimated consumption falls below a certain threshold. Despite its practicality, the method provides limited insight, its predictions are not very accurate, and these weaknesses can undermine targeting effectiveness and fairness. Alternatively, multidimensional approaches to assessing poverty are now being applied in the refugee context, yet they require extensive information that is often unavailable or costly to collect. This project applies machine learning and geospatial methods to novel data collected from Syrian refugees in Lebanon to develop more effective and operationalizable targeting strategies that reliably complement current PMT and multidimensional methods. The insights from this project will have important implications for humanitarian organizations seeking to improve current targeting mechanisms, especially given increasing poverty and displacement and limited humanitarian funding.

Student Contributions:

We are looking for a student with experience in basic programming in Python and/or R and basic knowledge of machine learning; experience with GIS and geospatial analysis is a plus. Anticipated tasks include assisting the team with: (1) data preprocessing, (2) data modeling, analysis, and prediction, and (3) the creation of maps and other data visualizations. The student will develop and review code and create documentation for it. They will also assist in developing machine learning algorithms and then training, validating, and testing the models. The student will also create a GitHub repository for the team, where they will prepare and upload scripts and other documentation for the project.
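
As a rough illustration of the modeling task, the sketch below fits a scikit-learn classifier that flags households below a vulnerability threshold; the file name and feature columns are invented for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("household_survey.csv")            # hypothetical input file
features = ["household_size", "dependency_ratio",   # hypothetical columns
            "dist_to_market_km", "rainfall_anomaly"]
X, y = df[features], df["below_threshold"]          # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```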

Project 3: Estimation of Crop Productivity from Multi-sensor Fused Satellite Data

Co-mentors

Dr. Kaiyu Guan
Natural Resources & Environmental Sciences

Dr. Jian Peng
Computer Science

Project Description

Reliable forecasting systems with sufficient lead time for crop type and crop yield are of critical value to farming communities and government agencies. Field-level estimation of crop yield is particularly useful for understanding how crop productivity responds to various management practices and environmental factors. This project aims to develop scalable ML methods that integrate satellite remote sensing data with other auxiliary information to make accurate yet cost-effective predictions of crop type. We are currently generating a 30-meter, daily, cloud-free data stack (2000-present) for the three major Corn Belt states (Illinois, Iowa, and Indiana) by integrating three major satellite datasets: Landsat, MODIS, and the new Sentinel-2.

Student contributions

The REU student will contribute to the implementation of a computational pipeline that extracts real-time satellite images, deposits the data into a database, and develops data fusion and ML components for predictive tasks. The student will use the storage systems of, and run computation-intensive codes on, the Blue Waters supercomputer. The student will meet with the mentors regularly, participate in group meetings, and interact with graduate students.
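
A minimal sketch of the predictive step such a pipeline might end in, assuming a fused pixel table with invented column names:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pixels = pd.read_parquet("fused_stack.parquet")     # hypothetical fused table
bands = ["red", "nir", "swir1", "ndvi_peak", "ndvi_mean"]  # assumed features
X, y = pixels[bands], pixels["crop_type"]           # e.g., corn vs. soybean

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())      # rough accuracy estimate
```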

Project 4: Gr-ResQ: Data-driven Approaches for Accelerating Synthesis of 2D Materials

Co-mentors

Dr. Elif Ertekin
Mechanical Science & Engineering

Dr. Sameh Tawfick
Mechanical Science & Engineering

Project Description

The major goal of this project is to advance the state of the art in manufacturing and synthesis by combining machine learning with experimentally validated data. While there exists a tremendous amount of academic and industrial research on the synthesis of nanomaterials such as graphene, and on 3D printing, advances in these fields currently rely on expensive or tedious trial-and-error experimentation. Using a combination of crowd-sourced and locally derived process parameter data, we will search for better procedures for more controllable and repeatable manufacturing.

Student contributions

The REU students will apply ML to experimental data from Gr-ResQ, a database of chemical vapor deposition synthesis recipes, and from 3D printing projects to develop models that accelerate the discovery of large-scale graphene production. Students will identify and extract relevant features from the dataset of images, Raman spectra, and associated recipes, and then develop a model using traditional ML or DL techniques to predict potentially successful graphene recipes. Students will work closely with experimental collaborators to test their predictions and provide data to feed back into the model.
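
A minimal sketch of the feature-extraction idea: reduce each Raman spectrum to peak-intensity ratios (D/G and 2D/G are standard graphene quality indicators) and combine them with recipe parameters. The CSV layout and column names are assumptions, not the actual Gr-ResQ schema.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def peak_height(wavenumber, intensity, center, window=50.0):
    """Max intensity within +/- window cm^-1 of a nominal peak center."""
    mask = np.abs(wavenumber - center) < window
    return intensity[mask].max()

recipes = pd.read_csv("gr_resq_recipes.csv")     # hypothetical export
rows = []
for _, r in recipes.iterrows():
    spec = np.loadtxt(r["raman_file"])           # columns: cm^-1, counts
    wn, counts = spec[:, 0], spec[:, 1]
    g = peak_height(wn, counts, 1580.0)          # G band
    rows.append({"d_over_g": peak_height(wn, counts, 1350.0) / g,
                 "td_over_g": peak_height(wn, counts, 2700.0) / g,
                 "temp_C": r["temp_C"], "ch4_sccm": r["ch4_sccm"],
                 "success": r["success"]})

df = pd.DataFrame(rows)
clf = RandomForestClassifier(random_state=0)
clf.fit(df.drop(columns="success"), df["success"])
```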

Project 5: An Integrated Sensing, Machine Learning, and High-performance Computing Framework for Real-time Decision-making in Smart Manufacturing

Co-mentors

Dr. Chenhui Shao
The Grainger College of Engineering

Dr. Seid Koric
Mechanical Science & Engineering

Project Description

Recent developments in sensing, communication, and computing technologies and infrastructure are leading to a global data revolution in manufacturing, providing an unprecedented opportunity for the manufacturing industry to move toward a new generation of digitalization and intelligence. In this project, we will develop an integrated sensing, machine learning, and HPC framework for next-generation manufacturing process control. New deep learning algorithms such as convolutional neural networks (CNNs) and residual neural networks (ResNets) will be developed for decision-making tasks such as machine health monitoring and quality prediction. These algorithms will be applied to ultrasonic metal welding, an important solid-state joining technology with widespread industrial applications. We will implement the algorithms on graphics processing units (GPUs) and HPC systems to pursue real-time decision-making.

Student contributions

The REU student will use TensorFlow or PyTorch to develop machine learning algorithms (e.g., CNN, ResNet) and will train, validate, and test them using real-world sensing data collected from ultrasonic metal welding. The student will then conduct benchmark analyses to evaluate the performance of the developed algorithms and their inference speed for real-time production use.
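
A minimal PyTorch sketch of this kind of model and benchmark: a small 1D CNN over a welding sensor trace, timed for inference latency. The architecture and signal length are illustrative only.

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=7, stride=2), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, 2),                      # e.g., good weld vs. defective
)
model.eval()

x = torch.randn(1, 1, 2048)                # one stand-in sensor trace
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        _ = model(x)
    mean_ms = (time.perf_counter() - start) / 100 * 1e3
print(f"mean inference latency: {mean_ms:.2f} ms")
```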

Project 6: Spatial Analysis of Tumor Heterogeneity using Machine Learning Techniques

Co-mentors

Dr. Zeynep Madak-Erdogan
Food Science & Human Nutrition

Dr. Aiman Soliman
NCSA

Project Description

Tumor heterogeneity is an inherent feature of all tumors that drives resistance to therapies. With the advent of new approaches integrating single-cell sequencing with spatial tumor data, interdisciplinary teams can better understand local changes in tumor metabolism and biology, as well as the different cell populations that affect immune responses, drug resistance, and metastasis. In this project, we will leverage spatial data analysis tools from geospatial science to quantify the spatial heterogeneity of the tumor microenvironment. Machine learning techniques will be used to prepare image data for the ensuing integration of location and sequencing data.

Student contributions

We are looking for a student with basic Python programming experience, experience with R or the scikit-learn library, and familiarity with classical machine learning methods (e.g., classification and clustering); experience with TensorFlow/Keras is a plus. The student will work on implementing spatial indices to quantify the tumors' 2D heterogeneity and on training and evaluating classical machine learning and deep learning models that connect the spatial indices with higher-dimensional genetic marker data.
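
One classical spatial index that could serve this purpose is Moran's I; the sketch below computes it with NumPy over synthetic cell centroids and marker intensities. Values near +1 indicate spatial clustering, values near 0 spatial randomness.

```python
import numpy as np

def morans_i(values, coords, cutoff):
    """Moran's I with binary contiguity weights (0 < distance < cutoff)."""
    x = values - values.mean()
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((d > 0) & (d < cutoff)).astype(float)
    return (len(x) / w.sum()) * (x @ w @ x) / (x @ x)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(500, 2))   # synthetic cell centroids
marker = rng.normal(size=500)                 # synthetic marker intensities
print(morans_i(marker, coords, cutoff=10.0))
```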

Project 7: Physics-informed Machine Learning: a pathway for explainable and efficient AI

Co-mentors

Bruno Abreu
NCSA

Matthew Krafczyk
NCSA

Project Description:

As we move further into the Big Data era, machine learning and artificial intelligence methods are becoming increasingly critical tools for advancing scientific knowledge and fostering groundbreaking technological applications. Processing, modeling, and understanding large amounts of data require enormous computational power and, fundamentally, new algorithms and theoretical approaches that can flexibly incorporate domain science knowledge. In this project, we are investigating how to more effectively incorporate such domain knowledge when developing and training novel machine learning models.

Student Contributions:

In this project, the student will work on a proof of concept showing that incorporating domain knowledge in this way can lead to more efficient, more accurate, and/or more explainable models. The successful applicant will use DRYML, a new framework that encapsulates infrastructure and enables machine learning practitioners to focus on their models. Students will also contribute to improving DRYML's usability and flexibility.

Project 8: Deep Sea Video Classification for Ecological Conservation

Co-mentors

Matthew Krafczyk
NCSA

Aiman Soliman
NCSA

Project Description:

Help make an impact on ecological conservation by improving the performance of image and video classification models for conservation data. Four-fifths of the vast underwater realm remains unexplored, and only about 5% of the seafloor has been mapped at high resolution. One method we are currently using to explore these mysterious habitats is deep-sea camera traps. However, the amount of footage these cameras generate far exceeds our ability to review it effectively, and while scientists can spot many animals and their behaviors, computer vision methods based on machine learning can likely detect more.

Student Contributions:

The successful applicant will learn about and improve existing image/video classification models and use the DRYML framework to ease hyperparameter search and distributed training. DRYML is a new open-source machine learning meta-library that enables practitioners to focus more on their models and less on infrastructure. Students will likely contribute to DRYML to improve its usability and flexibility.
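
DRYML's own API is not shown here; as a generic baseline of the kind such a framework would wrap, the sketch below classifies a single extracted video frame with a pretrained torchvision model. The frame file is a hypothetical stand-in for the project's footage.

```python
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()            # matching input normalization

frame = Image.open("frame_000123.jpg")       # hypothetical extracted frame
batch = preprocess(frame).unsqueeze(0)
with torch.no_grad():
    probs = model(batch).softmax(dim=1)
print(weights.meta["categories"][probs.argmax().item()])
```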

Project 9: Use of Neural Networks for Induced Earthquake Modeling

Co-mentors

Roman Y. Makhnenko
Civil & Environmental Engineering

Alex Tartakovsky
Civil & Environmental Engineering

Project Description:

Many industrial activities (e.g., wastewater disposal, operation of enhanced geothermal systems, geologic carbon storage, and hydraulic fracturing) involve the injection of fluid into the subsurface. Multiple physical processes, such as heat and fluid transport as well as mechanical deformation of rock, can create conditions favorable for earthquakes. These processes usually affect each other and are therefore called coupled processes. Analytical solutions describing these complex phenomena do not exist, so advanced numerical models are used to assess the risk of inducing an earthquake during fluid injection into the subsurface. Most existing models remain quasistatic (all changes in the system are assumed to be slow) and do not consider the dynamic rupturing process (fast changes in the system during fracture propagation); they therefore predict the evolution of the system toward a state favorable for failure rather than the failure process itself. High-resolution models are computationally intensive, forcing researchers to simplify them to obtain results within a reasonable time. This simplification is usually guided by the expected dominant effect, so a bias toward a particular triggering mechanism may be introduced. Moreover, manual analysis of numerical modeling results focuses on particular locations and mechanisms and may miss the big picture. The proposed project will apply machine learning approaches to efficiently recognize hidden patterns in large seismic datasets collected in the Illinois Basin.

Student Contributions:

We are looking for students with Matlab and Python programming skills; experience in machine learning and data science is highly desirable, and knowledge of geomechanics and geophysics will be helpful. Neural networks are preferred over ad hoc data interpretation for determining, without bias, the physical mechanisms responsible for the earthquakes observed during subsurface fluid injection. This will enable the application of high-resolution numerical models that can properly describe the physical processes behind rock failure at different scales. Efficient interpretation of large seismic datasets and prediction of induced-earthquake preparatory processes will promote the safe use of the subsurface for renewable energy extraction and storage.
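
A minimal sketch of unsupervised pattern discovery of this kind, using synthetic stand-ins for the seismic waveform windows; real work would start from cataloged Illinois Basin records and careful preprocessing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
windows = rng.normal(size=(1000, 512))        # stand-in waveform windows
windows /= np.abs(windows).max(axis=1, keepdims=True)  # amplitude-normalize

features = PCA(n_components=10).fit_transform(windows)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))                    # events per candidate pattern
```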

Project 10: Workflow tradeoffs in the context of cancer phylogeny

Co-mentors

Daniel S. Katz
NCSA

Matthew Berry
NCSA

Description:

The NCSA PhyloFlow project is building workflows to perform phylogenetic tree computations to understand cancerous tumor evolution. This work is intended to help researchers understand and track tumor evolution and to enable doctors to develop more personalized cancer treatment plans. These workflows involve the execution of multiple dependent tasks, stored in Docker containers. There are many ways to build such workflows, including workflow definition languages like WDL and CWL, which then require runners, or programming systems such as Parsl, which use their host language's runtime (Python, in the case of Parsl).

Student Work:

Students will implement one or more of the same workflows in WDL, CWL, and Parsl and compare them in terms of performance, usability, programmability, reproducibility, etc. They will also explore mixed implementations, such as using Parsl to execute workflows whose components are defined in WDL or CWL. This work will lead to a report and poster, and could lead to contributions to the open-source Parsl code and/or papers and presentations at conferences.
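
For a sense of the Parsl side of this comparison, the sketch below shows Parsl's dependent-task pattern: each @python_app call returns a future, and passing futures as arguments expresses the dependency graph. The task bodies are placeholders for the containerized phylogenetic steps.

```python
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)               # run tasks in local threads for this demo

@python_app
def align(sequences):
    return f"aligned({sequences})"

@python_app
def build_tree(alignment):
    return f"tree({alignment})"

alignment = align("tumor_variants.fasta")   # runs first
tree = build_tree(alignment)                # waits on alignment's future
print(tree.result())
```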


Projects from 2022

Project 1: Deep Learning to Enable Multi-messenger Astrophysics Discovery in the LSST Era

Co-mentors

Dr. Eliu Huerta
Argonne National Laboratory

Dr. Zhizhen Jane Zhao
Department of Electrical and Computer Engineering

Project Description

This project will focus on the design of deep learning (DL) algorithms for the detection and parameter estimation of gravitational wave sources, and on the identification of their electromagnetic and gravitational emission. We will build upon our work in these research areas and will leverage NCSA's role as the main data hub for the Dark Energy Survey and the Legacy Survey of Space and Time to create DL tools for processing and identifying astrophysical sources that may be observed concurrently in the gravitational and electromagnetic spectra.

Student contributions

The REU student will learn several skills to contribute to this project, including the use of supercomputers to curate the datasets used to train, validate, and test neural networks on GPU-based systems at NCSA, Argonne National Laboratory, and Oak Ridge National Laboratory. The student will become acquainted with open-source platforms for DL research, including TensorFlow and PyTorch, as well as distributed learning solutions such as Horovod. The student will learn to process and correctly interpret the predictions of these deep learning models, and to use scientific visualizations to understand how the models abstract knowledge from data and which parts of the model are involved in producing a prediction.

Project 2: Comparing Deep Learning and Expert Knowledge for Sequential Pattern Mining

Co-mentors

Dr. Nigel Bosch
School of Information Sciences

Dr. Luc Paquette
Education

Project Description

This project will explore recent methods for learning sequential patterns of behavior. Specifically, we will train convolutional neural networks to learn predictive sequences of students’ behaviors over time as the students interact with educational software. We will then compare the patterns that neural networks learn to patterns defined by experts to identify similarities and differences between what the convolutional filters learn and what experts believe is important. Finally, we will train machine learning models using these patterns as features to predict whether or not students are going to engage in “gaming the system” behaviors, where they attempt to skip through assignments as quickly as the software will allow without effortful learning.

Student contributions

The REU students will take advantage of recent advances in neural network interpretability to enable comparisons between expert hypotheses and data-driven discovery of important sequential behavior patterns. These findings will also contribute to understanding the tradeoffs between maximizing model accuracy and providing semantically meaningful insights. Students will learn how to use deep learning models to implement these analyses, as well as the data processing and visualization methods needed to evaluate the results.
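
A minimal PyTorch sketch of the approach: one-hot encode a sequence of student actions, learn 1D convolutional filters over it, and read the filter weights back out for comparison with expert-defined patterns. The action vocabulary, window length, and label are illustrative.

```python
import torch
import torch.nn as nn

n_actions, seq_len = 12, 50                 # assumed action vocabulary/window
model = nn.Sequential(
    nn.Conv1d(n_actions, 8, kernel_size=5), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1), nn.Flatten(),
    nn.Linear(8, 1),                        # logit for "gaming the system"
)

x = torch.zeros(4, n_actions, seq_len)      # batch of one-hot sequences
x[:, 0, :] = 1.0                            # stand-in: all "action 0"
logits = model(x)

# Each filter is a learnable 5-step pattern over the action alphabet; after
# training, these weights are what gets compared with expert-defined patterns.
filters = model[0].weight.detach()          # shape: (8, n_actions, 5)
print(filters.shape)
```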

Project 3: Machine Learning for Genomics

Co-mentors

Dr. Liudmila S. Mainzer
University of Wyoming’s Advanced Research Computing Center (ARCC)

Dr. Christopher Fields
UIUC HPC Biological Computing

Project Description

The NCSA Genomics team participates in the Consortium of Human Health and Heredity in Africa (H3A) as one of the bioinformatics nodes in H3ABionet2.0, building advanced computational genomics analyses that serve the health interests of people in Africa. ML is one of the projects in the Tools and WebServices Work Package being developed within that program. We are surveying current ML/DL approaches, adapting them to analysis needs in Africa, and developing new approaches appropriate for use cases driven by predominant local biomedical needs, such as tackling infectious diseases, psychiatric conditions, and AIDS.

Student contributions

The REU student will work closely, in a long-distance collaboration with H3A scientists, to port analyses to the advanced ML infrastructure at NCSA; help identify computational performance considerations that stem from the nature and size of the data being analyzed; and run those analyses on data from African collaborators. Applications include, but are not limited to: (1) determining the specificity of protein binding, (2) predicting protein function, (3) analyzing gene expression patterns, and (4) predicting transcription factor binding sites and DNA methylation states.
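
As a toy version of application (4), the sketch below featurizes DNA windows as k-mer counts and fits a linear classifier for binding-site prediction; real work would use curated labels and likely deep models, and the sequences here are synthetic.

```python
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 3-mers

def kmer_counts(seq):
    # Simple non-overlap-aware counts; adequate for a toy featurization.
    return [seq.count(k) for k in KMERS]

rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), 100)) for _ in range(200)]
labels = rng.integers(0, 2, 200)            # stand-in bound/unbound labels

X = np.array([kmer_counts(s) for s in seqs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```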

Project 4: Estimation of Crop Productivity from Multi-sensor Fused Satellite Data

Co-mentors

Dr. Kaiyu Guan
Natural Resources & Environmental Sciences

Dr. Jian Peng
Computer Science

Project Description

Reliable forecasting systems with sufficient lead time for crop type and crop yield are of critical value to farming communities and government agencies. Field-level estimation of crop yield is particularly useful for understanding how crop productivity responds to various management practices and environmental factors. This project aims to develop scalable ML methods that integrate satellite remote sensing data with other auxiliary information to make accurate yet cost-effective predictions of crop type. We are currently generating a 30-meter, daily, cloud-free data stack (2000-present) for the three major Corn Belt states (Illinois, Iowa, and Indiana) by integrating three major satellite datasets: Landsat, MODIS, and the new Sentinel-2.

Student contributions

The REU student will contribute to the implementation of a computational pipeline that extracts real-time satellite images, deposits the data into a database, and develops data fusion and ML components for predictive tasks. The student will use the storage systems of, and run computation-intensive codes on, the Blue Waters supercomputer. The student will meet with the mentors regularly, participate in group meetings, and interact with graduate students.

Project 5: Development of Data-driven Machine Learning-Based Food Crises Prediction Model

Co-mentors

Dr. Hope Michelson
Agricultural & Consumer Economics

Dr. Aiman Soliman
National Center for Supercomputing Applications

Project Description

Methods currently in use to predict food crises have limitations that delay and impede humanitarian response: they are not model-driven, and they do not engage the full scope of available data. An effective early warning system is urgent, given the expectation that climate shocks disrupting agricultural production and market functioning will increase in frequency and severity in coming decades. This project develops and deploys a new model-driven method for predicting food crises across the world. We are working towards developing automated, real-time, sub-national food security prediction in developing countries.

Student contributions

The REU student will help develop and analyze new data sources for predicting food crises and will apply DL techniques to the prediction problem. We will employ publicly available data at high spatial granularity and high frequency, allowing rapid, real-time assessment of sub-national food security. The student will help develop code to integrate multiple data sources and will work on the DL-based prediction model that utilizes these data.
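
A minimal sketch of the integration step, with invented file and column names standing in for the real sources:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

prices = pd.read_csv("market_prices.csv")      # region, month, staple_price
rain = pd.read_csv("rainfall.csv")             # region, month, rain_anomaly
conflict = pd.read_csv("conflict_events.csv")  # region, month, n_events
ipc = pd.read_csv("ipc_phase.csv")             # region, month, crisis flag

df = (prices.merge(rain, on=["region", "month"])
            .merge(conflict, on=["region", "month"])
            .merge(ipc, on=["region", "month"]))
X = df[["staple_price", "rain_anomaly", "n_events"]]
y = df["crisis"]                               # 1 if a crisis was declared

model = GradientBoostingClassifier().fit(X, y)
```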

Project 6: Gr-ResQ: Data-driven Approaches for Accelerating Synthesis of 2D Materials

Co-mentors

Dr. Elif Ertekin
Mechanical Science & Engineering

Dr. Sameh Tawfick
Mechanical Science & Engineering

Project Description

The major goal of this project is to advance the state of the art in manufacturing and synthesis by combining machine learning with experimentally validated data. While there exists a tremendous amount of academic and industrial research on the synthesis of nanomaterials such as graphene, and on 3D printing, advances in these fields currently rely on expensive or tedious trial-and-error experimentation. Using a combination of crowd-sourced and locally derived process parameter data, we will search for better procedures for more controllable and repeatable manufacturing.

Student contributions

The REU students will apply ML to experimental data from Gr-ResQ, a database of chemical vapor deposition synthesis recipes, and from 3D printing projects to develop models that accelerate the discovery of large-scale graphene production. Students will identify and extract relevant features from the dataset of images, Raman spectra, and associated recipes, and then develop a model using traditional ML or DL techniques to predict potentially successful graphene recipes. Students will work closely with experimental collaborators to test their predictions and provide data to feed back into the model.

Project 7: An Integrated Sensing, Machine Learning, and High-performance Computing Framework for Real-time Decision-making in Smart Manufacturing

Co-mentors

Dr. Chenhui Shao
The Grainger College of Engineering

Dr. Seid Koric
Mechanical Science & Engineering

Project Description

Recent developments in sensing, communication, and computing technologies and infrastructure are leading to a global data revolution in manufacturing, providing an unprecedented opportunity for the manufacturing industry to move toward a new generation of digitalization and intelligence. In this project, we will develop an integrated sensing, machine learning, and HPC framework for next-generation manufacturing process control. New deep learning algorithms such as convolutional neural networks (CNNs) and residual neural networks (ResNets) will be developed for decision-making tasks such as machine health monitoring and quality prediction. These algorithms will be applied to ultrasonic metal welding, an important solid-state joining technology with widespread industrial applications. We will implement the algorithms on graphics processing units (GPUs) and HPC systems to pursue real-time decision-making.

Student contributions

The REU student will use TensorFlow or PyTorch to develop machine learning algorithms (e.g., CNN, ResNet) and will train, validate, and test them using real-world sensing data collected from ultrasonic metal welding. The student will then conduct benchmark analyses to evaluate the performance of the developed algorithms and their inference speed for real-time production use.

Project 8: Design of New Materials through Machine Learning

Co-mentors

Dr. Andre Schleife
Materials Science & Engineering

Dr. Michael Ondrejcek
NCSA

Project Description

The accurate description of excited electronic states is a very promising goal in computational materials science, with significant impact on applications including photovoltaics, bioimaging, and optical materials. However, achieving accurate results requires heavy computation, limiting materials design. At the same time, computational materials science is benefiting from the data revolution, with both experimental and computational databases becoming more and more prevalent. In this project, we aim to mitigate the high computational cost of studying the electronic excitations involved in optical properties by exploring the use of ML and the incorporation of materials databases.

Student contributions

The REU student will use atomistic simulations and Maxwell modeling techniques to accurately describe nano- and meso-structured materials. Building on preliminary work, we will further combine these simulations with experimental data for semiconductor nanocrystals that we gather using a previously established web-based framework. Newly and previously produced data will be used to train ML models on the ML-optimized HAL computer at NCSA to predict either optical spectra for materials or the energy and width of prominent spectral features. The ML model will then be inverted to facilitate the design of materials with desirable optical properties.
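
A minimal sketch of the forward-then-invert idea: fit a regressor from nanocrystal descriptors to a spectral feature, then scan candidate descriptors for ones predicted to hit a target value. The descriptors and target are placeholders, not the project's actual parameterization.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))        # e.g., size, shape, composition
peak_eV = 2.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.05, 400)  # synthetic

forward = RandomForestRegressor(random_state=0).fit(X, peak_eV)

# "Invert" by searching descriptor space for a desired peak energy.
candidates = rng.uniform(0, 1, size=(10000, 3))
pred = forward.predict(candidates)
best = candidates[np.argmin(np.abs(pred - 2.8))]
print("descriptors predicted to give a 2.8 eV peak:", best)
```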

Project 9: Underpinnings of Racial Health Disparities

Co-mentors

Dr. Zeynep Madak-Erdogan
Food Science & Human Nutrition

Dr. Liudmila Mainzer
NCSA

Project Description

Health disparities, be they racial, economic, rural-urban, gender-based, or age-based, have come to the forefront across the world. To elucidate the biological, social, economic, and psychological mechanisms of health disparities, and to develop interventions that engage communities in targeting these mechanisms, it is necessary to work with complex multidimensional datasets containing molecular, genetic, and biometric information from individuals, plus their socioeconomic status, local environment and safety, degree of segregation, access to medical care and education, and levels of pollution. We are developing novel statistical and ML approaches to harmonize these heterogeneous data and detect important contributors to health disparities. We aim to develop predictive tools that identify populations at risk for poor health outcomes, in order to help community services reach out and bring those individuals in for earlier treatment.

Student contributions

The REU student will work with NCSA computational scientists and faculty collaborators in the areas of women's health and infectious diseases, as well as representatives of the public health district, to gather, prepare, and analyze health-related data and to develop novel statistical and ML approaches.

Project 10: Spatial Analysis of Tumor Heterogeneity using Machine Learning Techniques

Co-mentors

Dr. Zeynep Madak-Erdogan
Food Science & Human Nutrition

Dr. Aiman Soliman
National Center for Supercomputing Applications

Project Description

Tumor heterogeneity is an inherent feature of all tumors that drives resistance to therapies. With the advent of new approaches integrating single-cell sequencing with spatial tumor data, interdisciplinary teams can better understand local changes in tumor metabolism and biology, as well as the different cell populations that affect immune responses, drug resistance, and metastasis. In this project, we will leverage spatial data analysis tools from geospatial science to quantify the spatial heterogeneity of the tumor microenvironment. Machine learning techniques will be used to prepare image data for the ensuing integration of location and sequencing data.

Student contributions

We are looking for a student with basic Python programming experience, experience with R or the scikit-learn library, and familiarity with classical machine learning methods (e.g., classification and clustering); experience with TensorFlow/Keras is a plus. The student will work on implementing spatial indices to quantify the tumors' 2D heterogeneity and on training and evaluating classical machine learning and deep learning models that connect the spatial indices with higher-dimensional genetic marker data.


Projects from 2021

Project 1: Mitigation of COVID-19 in the Era of Vaccination

Mentor

Ahmed Elbanna
Civil and Environmental Engineering

Project description

The successful development of various COVID-19 vaccines has brought hope to the world that control of this virus is within reach. However, at least 70-80% of the population must be vaccinated in order to achieve the target levels of herd immunity and safely contain the virus. In that respect, several challenges still exist at both the national and international levels. These include: (1) anti-vaccine groups who spread misinformation about vaccine technology, pushing some people away from taking part in the vaccination campaign; (2) the emergence of new COVID-19 strains with higher transmissibility, some of which may escape vaccination; and (3) logistical problems in producing, distributing, and administering vaccine doses on a large scale. While it is expected that there will be enough doses for 300 million Americans by fall 2021, the situation worldwide is very different. Some countries are expected to need 2-4 years to vaccinate enough of their populations to achieve herd immunity. In a world as highly connected as ours, this disparity in vaccine distribution may have damaging effects, including giving the virus enough time to mutate and evolve into more resistant strains, as well as side effects on the global economy.

The objective of this project is threefold: (1) to analyze COVID-19 vaccine misinformation on social media and explore its correlation with vaccine uptake rates in different states; (2) to build data-calibrated epidemiological models that assess the effect of vaccination on the virus's spread and severity; and (3) to model the effect of vaccine disparity on the future of COVID-19 transmission worldwide and to assess the need for continuing various non-pharmaceutical interventions (e.g., mask wearing and social distancing).

Interested students should have a good mathematical background and be able to code in Matlab, Python, or C++.
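
For objective (2), a minimal sketch of the kind of compartment model involved: an SIR model extended with a constant vaccination rate, integrated with SciPy. Parameter values are illustrative, not fitted to data.

```python
import numpy as np
from scipy.integrate import odeint

def sirv(y, t, beta, gamma, nu):
    S, I, R, V = y
    dS = -beta * S * I - nu * S       # infection plus vaccination outflow
    dI = beta * S * I - gamma * I
    dR = gamma * I
    dV = nu * S                       # vaccinated (assumed fully immune)
    return [dS, dI, dR, dV]

t = np.linspace(0, 365, 366)          # one year, daily resolution
y0 = [0.99, 0.01, 0.0, 0.0]           # fractions of the population
beta, gamma, nu = 0.3, 0.1, 0.005     # R0 = 3, ~10-day infectious period
S, I, R, V = odeint(sirv, y0, t, args=(beta, gamma, nu)).T
print(f"peak infected fraction: {I.max():.3f}")
```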

Project 2: Resolving Racial Health Disparities by using Advanced Statistics and Machine Learning on Complex Multidimensional Datasets

Mentor

Zeynep Madak-Erdogan
Food Science & Human Nutrition

Social impact

Advanced, big-data computational methods developed in this project will provide tools for a true systems approach to health disparities, allowing a multi-scale, multi-dimensional analysis of all aspects of this problem. This will arm citizen scientists with the data necessary to argue their case to legislators and to identify the right complex of factors to be targeted as part of a societal intervention strategy.

Project description

Health disparities, be they racial, economic, rural-urban, gender-based, or age-based, have come to the forefront across the world. Two kinds of research have emerged: (1) elucidating the biological, social, economic, and psychological mechanisms of health disparities, and (2) developing interventions that engage communities in targeting these mechanisms to reduce health disparities. The first category is based on the analysis of complex multidimensional datasets containing molecular, genetic, and biometric information from individuals, plus their socioeconomic status, local environment and safety, degree of segregation, access to medical care and education, and levels of pollution. We will develop novel statistical and machine learning approaches to harmonize these heterogeneous data and detect important contributors to health disparities. The second category targets practical solutions for health disparities, led by community members working with policy makers. Our idea is to arm community members with tools that document their situation in scientifically rigorous ways, empowering these citizen scientists to pinpoint actionable items, problems, and solutions, and to work with policy makers to address them. One approach is to crowdsource data collection and analysis by providing citizen scientists with mobile apps to collect, upload, share, and analyze relevant data.

Students will (1) develop novel methods for multi-scale data integration and analytics, using advanced statistical and machine learning techniques, and (2) prototype software and mobile apps based on those novel computational approaches, for the purpose of elucidating racial health disparities.


Projects from 2020

Project 1: Riding the Epidemic Wave: Nowcasting and Forecasting of COVID-19 in Illinois and Beyond with Various Intervention Protocols

Mentor

Ahmed Elbanna
Civil and Environmental Engineering

Project 2: Resolving Racial Health Disparities by using Advanced Statistics and Machine Learning on Complex Multidimensional Datasets

Mentor

Zeynep Madak-Erdogan
Food Science & Human Nutrition

Project 3: Human Fall Detection

Mentor

Volodymyr Kindratenko
Electrical and Computer Engineering

Project 4: Weighing Black Holes with Deep Learning

Mentor

Xin Liu
Astronomy


Projects from 2019

Project 1: Computational Materials Science: Multi-Scale Simulations and Machine Learning

Co-mentors

Andre Schleife
Materials Science and Engineering

Andrew Ferguson
Materials Science and Engineering

Social impact

Provide sophisticated, yet intuitive and user-friendly, visualization for effective materials data analysis and data dissemination to a broad scientific audience and the general public.

Project description

Modern computational materials science uses sophisticated simulation techniques to study properties of advanced and complex materials, including biomolecules, condensed-matter crystals, and polymers. At the same time, bridging length and time scales from atomistic resolution to actual samples is an important challenge. In this project, we aim to combine atomistic simulations and Maxwell modeling techniques, to accurately describe nano- and meso-structured materials. These simulations are computationally challenging and, while they yield accurate results, their high computational cost renders it difficult to apply them for high-throughput materials design.

By extracting data from these simulations, collecting it in well-structured databases using modern materials schemas, and establishing connections to underlying structural descriptors, we aim to leverage supervised machine-learning techniques to significantly accelerate the materials design process. Working toward this goal, students will use existing data and generate new data for complex structures using Maxwell modeling. They will develop an open-source tool that interfaces an external Maxwell solver with the scikit-learn Python-based machine-learning library to perform supervised machine learning and guided materials design and discovery. In order to disseminate results to a broad scientific audience and the general public using accurate yet intuitive visualization, students will have the opportunity to develop codes based on the open-source ray tracers Blender/LuxRender and the open-source yt framework to produce image files and movies compatible with virtual and mixed reality viewers such as Google Cardboard or Windows Mixed Reality.
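
A minimal sketch of the interfacing pattern described above, treating the external Maxwell solver as a black-box function of structural descriptors and fitting a scikit-learn surrogate to its outputs; run_maxwell_solver is a hypothetical stand-in for the real external code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_maxwell_solver(params):
    """Stand-in for invoking the external solver on one structure."""
    return np.sin(3 * params[0]) + 0.5 * params[1]   # fake optical response

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 2))          # sampled structural descriptors
y = np.array([run_maxwell_solver(p) for p in X])

surrogate = GaussianProcessRegressor().fit(X, y)
pred, std = surrogate.predict(rng.uniform(0, 1, size=(5, 2)), return_std=True)
print(pred, std)                             # cheap predictions + uncertainty
```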

Project 2: Data Storage and Analysis Framework for Semiconductor Nanocrystals Used in Bioimaging

Co-mentors

Andre Schleife
Materials Science and Engineering

Michal Ondrejcek
NCSA

Social impact

By exploring systematic exchange of data and workflows, this project provides insights and best practices for the future of collaborative research. Providing access to sample specific data and analysis to the international community will accelerate deployment of novel semiconductor nanocrystals for bioimaging.

Project description

Light-emitting molecules are a central technology in biology and medicine that provide the ability to optically tag proteins and nucleic acids that mediate human disease. In particular, fluorescent dyes are a key part of molecular diagnostics and optical imaging reagents. We recently made major breakthroughs in engineering fluorescent semiconductor nanocrystals to increase the number of distinct molecules that can be accurately measured, far beyond what is possible with such organic dye molecules. We aim to develop nanocrystals that are able to distinguish diseased from healthy tissue and determine how the complex genetics underlying cancer respond to therapy, using measurement techniques and microscopes that are already widely accessible.

In order to achieve this goal, we need to understand a complex design space that includes the size, shape, composition, and internal structure of the different nanocrystals. To this end, we have started implementing a database that stores and catalogs optical properties and other relevant data describing semiconductor nanocrystals. Students on this team will work with computational and experimental researchers in several departments to turn this data into descriptors that are useful and efficient in the context of machine learning. Schemas will be extended accordingly, and the web interface will be improved so that data and analysis workflows can be efficiently shared among multiple researchers.

Students will first test the current descriptors and then implement improvements based on these tests. The framework will be interfaced with Globus and the Materials Data Facility and their underlying workflows. Students will also develop code that automatically analyzes data stored in the facility, e.g., to verify and validate experimental and computational results against each other. This project is highly interdisciplinary, and students will work with a team of researchers in bioengineering, materials science, mechanical engineering, and NCSA.

Project 3: Modeling and Detection of Black Hole Collisions with the Blue Waters Supercomputer

Co-mentors

Gabrielle Allen
Astronomy/Education

Roland Haas
NCSA

Eliu Huerta
NCSA

Social impact

By developing open source software to further scientific community efforts to detect gravitational waves, students will learn skills that they can use to tackle grand computational challenges across science domains, including those with broad social benefits.

Project description

The Laser Interferometer Gravitational-Wave Observatory's (LIGO) detection of gravitational waves from merging black holes in September 2015 inaugurated a new era in astronomy and astrophysics, opening a window to observe the Universe through gravitational radiation. Occurring 100 years after Einstein's announcement of his theory of general relativity, the detection spurred worldwide interest in physics and science in general, making headline news around the world. The recent Nobel Prize awarded for this detection and LIGO/Virgo's announcement of the detection of a binary neutron star system underline the importance of these efforts and the interest that wider society has in them.

In this project, a pair of REU-INCLUSION students will write Python/C libraries to extract information from numerical relativity simulations that describe mergers of black holes and neutron stars. Through this work the students will become familiar with one of the most exciting research topics in contemporary astronomy, and the work will provide them with new tools to study phenomena across science domains that require high-performance environments. These simulations will also be used to create scientific visualizations for outreach purposes.

Project 4: Visualizing and Preserving Environmental Data for Improved Governance

Co-mentors

Anita Say Chan
Media and Cinema Studies

Ben Grosser
School of Art & Design

Social impact

Supporting current environmental data justice initiatives that hold governments and companies accountable for the environmental damage their policies and actions cause, and attending to how these oversights disproportionately impact marginalized communities.

Project description

As more and more revisions are made to data and scientific analyses available on government websites concerning environmental and climate protection, there has been a growing need for researchers and coders to preserve environmental data and keep citizens informed of such changes. This position will assist the Environmental Data and Governance Initiative (EDGI), a network of scholars and researchers that archives federal environmental data to safeguard it against potential reductions in access by the current administration, develops online tools to support monitoring changes to federal environmental websites, and tracks cuts in funding, research, and regulation at environmentally oriented agencies. These agencies and departments include, but are not limited to, the EPA (Environmental Protection Agency), NOAA (National Oceanic and Atmospheric Administration), NASA (National Aeronautics and Space Administration), USGS (United States Geological Survey), OSHA (Occupational Safety and Health Administration), DOE (Department of Energy), and BLM (Bureau of Land Management).

This position will support collaborations under EDGI's public data working group that include projects for indexing millions of government web pages on a weekly basis, tracking changes on them, and producing regular reports. Additional ongoing efforts include distributed protocol development for data storage, machine learning work that can isolate the most important website changes for enhanced tracking efforts, and security advancements for privacy protection of EDGI volunteers and workshop participants engaging in data preservation and website monitoring. Potential project work could also extend developments made under EDGI's Google Summer of Code partnership, where recent collaborations utilized machine learning algorithms to identify and monitor changes on government agency websites using data from multiple sources: Versionista, PageFreezer, and the Internet Archive; another recent collaboration used D3 to develop DataRescue Maps as impactful, publicly meaningful models that allow users to easily visualize changes to government websites archived by EDGI. The data being archived is vital for environmental research and protection, but it can be meaningless or overwhelming in the hands of users without clear graphs or interactive models that help provide context and a general overview of the data.
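
At its core, the change-tracking task reduces to diffing successive snapshots of a page; here is a minimal sketch with Python's standard difflib, using hypothetical snapshot files in place of EDGI's actual archiving pipeline.

```python
import difflib

# Hypothetical local snapshots of the same page at two points in time.
old = open("snapshot_2018_01.html", encoding="utf-8").read().splitlines()
new = open("snapshot_2018_02.html", encoding="utf-8").read().splitlines()

# Print only changed lines with one line of surrounding context.
for line in difflib.unified_diff(old, new, fromfile="old", tofile="new",
                                 lineterm="", n=1):
    print(line)
```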

This project welcomes applicants with an interest in environmental data analysis and preservation and other interdisciplinary skills, including Spanish translation; experience in data visualization and coding (Python, Ruby on Rails, or JavaScript in particular); and interest in or experience working with data and databases, web crawling, APIs, machine learning, open science, and community organizing.

Project 5: Intelligent Synthesis: Statistical Learning to Optimize Graphene Synthesis Parameters for Nanomanufacturing

Co-mentors

Elif Ertekin
Mechanical Science and Engineering

Sameh Tawfick
Mechanical Science and Engineering

Placid Ferreira
Mechanical Science and Engineering

Social impact

Provide sophisticated data-driven approaches to enable high-quality, reproducible synthesis of nanomaterials for use as manufactured components in nanoelectronics.

Project description

Material synthesis is a primary bottleneck in emerging nanoelectronic devices. The promise of nanoelectronics will not become a reality unless the synthesis process is scalable and leads to high-quality, reproducible materials. Although there exists a tremendous amount of academic and industry research in synthesis, most advancements in synthesis science are achieved by an expensive and tedious trial-and-error approach.

In this project we will go beyond the traditional trial-and-error approach by adopting a data-driven methodology to rapidly optimize the chemical vapor deposition synthesis of graphene and other emerging 2D nanomaterials. The key aspects include building and populating a large database of synthesis parameters and results, and implementing a system for automated data capture and extraction from actual growth experiments in real time. The database will be populated both by experiments carried out at Illinois and via crowd-sourcing from research groups around the world.

Students will develop and populate the 2D materials synthesis database and will implement tools that allow users to access and analyze its contents. They will also explore the optimization of growth parameters by implementing Python-based libraries for supervised machine learning. Students will also develop a configurable system for automated data collection from nanofabrication tools during growth experiments in real time. Experimental parameters will be pushed to a cloud server so that they can be curated and served to computational models. Educational video tutorials on the synthesis database and the machine learning approach will be developed.

Project 6: Optimization of Open-Source Software for Deep Learning

Co-mentors

Volodymyr Kindratenko
Electrical and Computer Engineering

William Gropp
Computer Science

Social impact

Contributing to the advancement of machine learning, which is at the core of many modern approaches to solving real-world problems in fields ranging from education to healthcare to engineering to the core sciences.

Project description

Deep neural networks are at the core of artificial intelligence, machine learning, computer vision, and other advanced applications across many disciplines. Such networks allow computers to "learn and infer" rather than "compute," which is essential for many problems in which the models that describe the data are multi-dimensional, non-linear, and generally too complex for traditional mathematical techniques. Many deep learning frameworks have been developed over the past decade, providing advanced neural network construction, training, and inference functionality. However, the vast majority of these codes have been developed for single-node execution, which precludes them from training complex network models on large datasets in acceptable time. The challenge is to redesign existing frameworks, or develop new ones, that can take advantage of heterogeneous computing platforms to speed up network training while providing easy-to-use programming abstractions for domain scientists.

In this project, students will analyze open-source deep learning software frameworks and will work on optimizing them and removing bottlenecks in order to improve the performance of applications relying on these codes. Students will be expected to contribute their changes back to these codes, as well as to open-source any new code. The project will contribute to the development of an NSF-funded computer system for deep learning and will result in open-source software deployed on this system. Students will learn about parallel programming systems such as MPI, OpenMP, and OpenCL, how to study their performance, and techniques for improving it.
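
As an example of the kind of measurement involved, the mpi4py sketch below times an allreduce that averages a gradient-sized buffer across ranks, the communication pattern that typically dominates distributed training; the buffer size is an arbitrary stand-in for a large layer.

```python
# Run with, e.g.: mpiexec -n 4 python bench_allreduce.py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
grad = np.random.rand(25_000_000)          # ~200 MB of float64 "gradients"
avg = np.empty_like(grad)

comm.Barrier()                             # synchronize before timing
start = time.perf_counter()
comm.Allreduce(grad, avg, op=MPI.SUM)
elapsed = time.perf_counter() - start
avg /= comm.Get_size()                     # turn the sum into an average

if comm.Get_rank() == 0:
    print(f"allreduce of {grad.nbytes / 1e6:.0f} MB took {elapsed:.3f} s")
```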

Project 7: Resolving Racial Health Disparities by using Advanced Statistics and Machine Learning on Complex Multidimensional Datasets

Co-mentors

Liudmila Mainzer
Institute for Genomic Biology

Zeynep Madak-Erdogan
Women's Health, Hormones and Nutrition Lab

Social impact

Advanced, big-data computational methods developed in this project will provide tools for a true systems approach to health disparities, allowing a multi-scale, multi-dimensional analysis of all aspects of this problem. This will arm citizen scientists with the data necessary to argue their case to legislators and to identify the right complex of factors to be targeted as part of a societal intervention strategy.

Project description

Health disparities, be they racial, economic, rural-urban, gender-based, or age-based, have come to the forefront across the world. Two kinds of research have emerged: (1) elucidating the biological, social, economic, and psychological mechanisms of health disparities, and (2) developing interventions that engage communities in targeting these mechanisms to reduce health disparities. The first category is based on the analysis of complex multidimensional datasets containing molecular, genetic, and biometric information from individuals, plus their socioeconomic status, local environment and safety, degree of segregation, access to medical care and education, and levels of pollution. We will develop novel statistical and machine learning approaches to harmonize these heterogeneous data and detect important contributors to health disparities. The second category targets practical solutions for health disparities, led by community members working with policy makers. Our idea is to arm community members with tools that document their situation in scientifically rigorous ways, empowering these citizen scientists to pinpoint actionable items, problems, and solutions, and to work with policy makers to address them. One approach is to crowdsource data collection and analysis by providing citizen scientists with mobile apps to collect, upload, share, and analyze relevant data.

Students will (1) develop novel methods for multi-scale data integration and analytics, using advanced statistical and machine learning techniques, and (2) prototype software and mobile apps based on those novel computational approaches, for the purpose of elucidating racial health disparities.

Project 8: PixSure Image Annotation System

Co-mentors

Colleen Bushell
Visual Intelligence for Biology

Peter Groves
Visual Intelligence for Biology

Social impact

Addressing a critical gap in the field of large-scale image analysis will facilitate new science research in multiple domains, accelerate advances in deep learning and computation, and greatly increase the value of image data.

Project description

The Visual Intelligence for Biology (VI-Bio) group at NCSA, in close collaboration with two NCSA faculty affiliates, Steve Boppart and Larry Di Girolamo, is pursuing the creation of PixSure, an image annotation system incorporating artificial intelligence and a novel user experience to address a critical gap in the field of large-scale image analysis. Image analysis has grown over the last several years, largely due to advancements in deep learning (DL), including mature computational packages like TensorFlow and Keras. A critical requirement for image analysis is the development of an annotated image set to use as training data so that algorithms can correctly learn image features. Some tools are available for simple image annotation, but there is a glaring gap in tools for creating the sophisticated "ground truth" image annotations required for today's scientific research. Addressing this gap will facilitate new science research in multiple domains, accelerate advances in DL and computation, and greatly increase the value of image data. VI-Bio is looking for REU students to participate in this project. Depending on skills, activities can include: (1) testing annotation software by marking tumor cells in pathology imagery and clouds in satellite imagery, to assess user experience and quality and to provide ground-truth examples; (2) refining and exploring the use of a game controller as a method for interacting with and marking images (software coding and testing); (3) running user studies to assess user experience (surveys, discussion, documentation); and (4) testing and evaluating the performance of machine learning and DL models using tumor imaging data.


Projects from 2018

Project 1: Visualizing and Preserving Environmental Data for Improved Governance

Co-mentors

Anita Say Chan
Media and Cinema Studies

Ben Grosser
School of Art & Design

Social impact

Supporting current environmental data justice initiatives that hold governments and companies accountable for the environmental damage their policies and actions cause, and attending to how these oversights disproportionately impact marginalized communities.

Project description

As more and more revisions are made to data and scientific analyses available on government websites concerning environmental and climate protection, there has been a growing need for researchers and coders to preserve environmental data and keep citizens informed of such changes. This position will assist the Environmental Data and Governance Initiative (EDGI), a network of scholars and researchers that archives federal environmental data to safeguard it against potential reductions in access by the current administration, develops online tools to support monitoring changes to federal environmental websites, and tracks cuts in funding, research, and regulation at environmentally oriented agencies. These agencies and departments include, but are not limited to, the EPA (Environmental Protection Agency), NOAA (National Oceanic and Atmospheric Administration), NASA (National Aeronautics and Space Administration), USGS (United States Geological Survey), OSHA (Occupational Safety and Health Administration), DOE (Department of Energy), and BLM (Bureau of Land Management).

This position will support collaborations under EDGI's public data working group that include projects for indexing millions of government web pages on a weekly basis, tracking changes on them, and producing regular reports. Additional ongoing efforts include distributed protocol development for data storage, machine learning work that can isolate the most important website changes for enhanced tracking efforts, and security advancements for privacy protection of EDGI volunteers and workshop participants engaging in data preservation and website monitoring. Potential project work could also extend developments made under EDGI's Google Summer of Code partnership, where recent collaborations utilized machine learning algorithms to identify and monitor changes on government agency websites using data from multiple sources: Versionista, PageFreezer, and Internet Archive; another recent collaboration used D3 to develop DataRescue Maps as impactful, publically-meaningful models to allow users to easily visualize changes to government websites archived by EDGI. The data being archived is vital for environmental research and protection, but it can be meaningless or overwhelming in the hands of users without clear graphs or interactive models that help provide context and a general overview of the data.
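To make the website-monitoring idea concrete, here is a minimal Python sketch of one way page changes can be detected: fetch a page, diff it against a saved snapshot, and report the differences. The URL and snapshot path are hypothetical, and this is not EDGI's actual tooling (which builds on Versionista, PageFreezer, and the Internet Archive).

```python
# Illustrative sketch only (not EDGI's actual tooling): fetch a page,
# compare it with a previously saved snapshot, and print a unified diff.
# The URL and snapshot path are hypothetical.
import difflib
import pathlib
import urllib.request

URL = "https://www.example.gov/climate-page"   # hypothetical
SNAPSHOT = pathlib.Path("snapshot.html")

with urllib.request.urlopen(URL) as resp:
    current = resp.read().decode("utf-8", errors="replace")

if SNAPSHOT.exists():
    previous = SNAPSHOT.read_text(encoding="utf-8")
    diff = difflib.unified_diff(
        previous.splitlines(), current.splitlines(),
        fromfile="previous", tofile="current", lineterm="")
    changes = list(diff)
    print("\n".join(changes) if changes else "No changes detected.")

SNAPSHOT.write_text(current, encoding="utf-8")  # store for the next run
```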

This project welcomes applicants with an interest in environmental data analysis and preservation, as well as other interdisciplinary skills, including Spanish translation; experience in data visualization and coding (Python, Ruby on Rails, or JavaScript in particular); and interest in or experience working with data and databases, web crawling, APIs, machine learning, open science, and community organizing.

Project 2: Optimization of Open-Source Software for Deep Learning

Co-mentors

Volodymyr Kindratenko

Volodymyr Kindratenko
Electrical and Computer Engineering

William Gropp

William Gropp
Computer Science

Social impact

Contributing to the advancement of machine learning, which is at the core of many modern approaches to solving real-world problems in fields ranging from education to healthcare to engineering to the core sciences.

Project description

Deep neural networks are at the core of artificial intelligence, machine learning, computer vision, and other advanced applications across many disciplines. Such networks allow computers to "learn and infer" rather than "compute," which is essential for many problems in which the models that describe the data are multi-dimensional, non-linear, and generally too complex for traditional mathematical techniques. Many deep learning frameworks have been developed over the past decade, providing advanced neural network construction, training, and inference functionality. However, the vast majority of these codes were developed for execution on a single compute node, which precludes them from training complex network models on large datasets in an acceptable time. The challenge is to redesign existing frameworks, or develop new ones, that can take advantage of heterogeneous computing platforms to speed up network training while providing easy-to-use programming abstractions for domain scientists.

In this project, students will analyze open-source deep learning software frameworks and work on optimizing them and removing bottlenecks in order to improve the performance of applications that rely on these codes. Students will be expected to contribute their changes back to these codes, as well as to release new code as open source. The project will contribute to the development of an NSF-funded computer system for deep learning and will result in open-source software deployed on that system. Students will learn about parallel programming systems such as MPI, OpenMP, and OpenCL, how to study their performance, and techniques for improving that performance.
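As a rough illustration of the distributed-training bottleneck discussed above, the sketch below shows the core communication step of data-parallel training: averaging per-rank gradients with an MPI allreduce via mpi4py. The synthetic gradients are an assumption; this is not code from any of the frameworks the project targets.

```python
# Minimal sketch of the core communication step in data-parallel training:
# each rank computes gradients on its data shard, then gradients are
# averaged with MPI_Allreduce. Run with e.g. `mpiexec -n 4 python demo.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# stand-in for gradients computed on this rank's shard of the data
local_grad = np.random.default_rng(seed=rank).normal(size=1_000_000)

avg_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)   # sum across all ranks
avg_grad /= size                                   # turn the sum into a mean

if rank == 0:
    print(f"averaged gradients across {size} ranks")
```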

Project 3: Computational Materials Science: Visualization and Machine Learning

Co-mentors

Andre Schleife

Andre Schleife
Materials Science and Engineering

Andrew Ferguson

Andrew Ferguson
Materials Science and Engineering

Social impact

Provide sophisticated yet intuitive and user-friendly visualization for effective materials data analysis and data dissemination to a broad scientific audience and the general public.

Project description

Modern computational materials science produces large amounts of static and time-dependent data that is rich in information. Examples include atomic geometries of complex biomolecules, condensed-matter crystals, and electron-density probability distributions. Extracting the relevant information from these data to determine the important processes and mechanisms constitutes an important scientific challenge. The availability of sophisticated yet intuitive visualization is a crucial component of effective data analysis, and is vital in disseminating results to a broad scientific audience and the general public.

In this project we use and develop physics-based ray-tracing and stereoscopic rendering techniques to visualize the structure of existing and novel materials, e.g., for solar-energy harvesting, optoelectronic applications, and focused-ion beam technology. We will couple these visualization tools with Maxwell solvers and supervised machine learning algorithms to perform targeted discovery and rational design of new materials with tailored optical properties. The team will establish a powerful and intuitive platform for visualization of atomic geometries, optical reflection and transmission spectra, and time-dependent electronic excitations. This platform will allow for guided design of next-generation optical materials for use in novel lenses or energy-saving window coatings.

Working towards this goal, students will analyze and visualize atomic geometries and electron densities from first-principles simulations of excited electronic states using density functional theory (DFT) and time-dependent DFT. They will develop an open-source tool that interfaces an external Maxwell solver with the scikit-learn Python-based machine-learning library to perform supervised machine learning and guided materials design and discovery. Students will also develop codes based on the open-source ray-tracer Blender/LuxRender and the open-source yt framework to produce image files and movies from these data. Stereoscopic images will be produced that can be visualized using, e.g., Google Cardboard or other virtual-reality viewers. Examples of possible outcomes can be seen here.
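As a hedged illustration of the supervised-learning step, the following sketch fits a scikit-learn regressor that maps simple material descriptors to an optical property. The descriptors and target here are synthetic stand-ins; real inputs would come from the DFT calculations and Maxwell solver described above.

```python
# Hedged sketch of the supervised-learning step: mapping simple (synthetic,
# made-up) material descriptors to an optical property with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))          # e.g., lattice constant, band gap, density
y = 2.0 * X[:, 1] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.05, 200)  # toy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"held-out R^2 = {model.score(X_test, y_test):.2f}")
```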

Project 4: Data Storage and Analysis Framework for Semiconductor Nanocrystals used in Bioimaging

Co-mentors

Andre Schleife

Andre Schleife
Materials Science and Engineering

Michal Ondrejcek

Michal Ondrejcek
NCSA

Social impact

By exploring systematic exchange of data and workflows, this project provides insights and best practices for the future of collaborative research. Providing access to sample-specific data and analysis to the international community will accelerate deployment of novel semiconductor nanocrystals for bioimaging.

Project description

Light-emitting molecules are a central technology in biology and medicine that provide the ability to optically tag proteins and nucleic acids that mediate human disease. In particular, fluorescent dyes are a key part of molecular diagnostics and optical imaging reagents. We recently made major breakthroughs in engineering fluorescent semiconductor nanocrystals to increase the number of distinct molecules that can be accurately measured, far beyond what is possible with such organic dye molecules. We aim to develop nanocrystals that are able to distinguish diseased from healthy tissue and determine how the complex genetics underlying cancer respond to therapy, using measurement techniques and microscopes that are already widely accessible.

In order to achieve this goal, we need to understand a complex design space that includes the size, shape, composition, and internal structure of the different nanocrystals. Students in this team will work with computational and experimental researchers in several departments in order to establish a database to store, share, and catalog optical properties and other relevant data describing semiconductor nanocrystals. This requires developing schemas and analysis workflows that can be efficiently shared between multiple researchers. Eventually, both the data and the workflows will be made available to the general public.

Students will first identify all of the information that needs to be included in this catalog. Students will then write JSON and Python code and interface with Globus and the Materials Data Facility. They will create well-documented IPython notebooks that operate directly on the Globus file structure and run in the web browser. Students will also develop code that automatically analyzes data stored in the facility, e.g., to verify and validate experimental and computational results against each other. This project is highly interdisciplinary, and students will work with a team of researchers in bioengineering, materials science, mechanical engineering, and NCSA.
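To illustrate what a catalog entry might look like, here is a hedged sketch of a JSON-style record and a minimal validity check in Python. The field names are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical sketch of a catalog record and a minimal validity check.
# The field names are illustrative assumptions, not the project's schema.
import json

record = {
    "sample_id": "NC-0001",
    "composition": "CdSe/CdS",
    "diameter_nm": 6.2,
    "shape": "spherical",
    "emission_peak_nm": 620,
    "source": "experiment",          # or "computation"
}

REQUIRED = {"sample_id", "composition", "diameter_nm", "emission_peak_nm"}

def validate(rec: dict) -> list:
    """Return a sorted list of missing required fields (empty if valid)."""
    return sorted(REQUIRED - rec.keys())

missing = validate(record)
print("valid" if not missing else f"missing fields: {missing}")
print(json.dumps(record, indent=2))
```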

Project 5: Eye Tracking Race and Cultural Difference in Video Games

Co-mentors

Ben Grosser

Ben Grosser
School of Art & Design

Jodi Byrd

Jodi Byrd
English

Social impact

Identifying the effects of video game design on perceptions of race and cultural difference, critical thinking, and split-second decision-making.

Project description

How do players of contemporary video games perceive and process race-based information as part of their gaming experience? What role does that perception play in split-second decision-making processes in terms of character combat, map navigation, and attitudes towards cultural information throughout game environments? This project at the NCSA Critical Technology Studies Laboratory will combine eye tracking of visual attention during game play, data analysis of that attention, and visualization of the results towards a critical understanding of race and cultural difference in video games. We will use Irrational Games' 2013 AAA title, BioShock Infinite, to examine the effects of both active (e.g., visible race of characters and combatants in the game) and passive (e.g., racial background materials embedded in environments, level designs, and other artistic content) processing of race-based visual data. The game Never Alone (a collaboratively designed game with input from Alaskan Native elders that includes activities and artifacts intended to foreground indigenous cultural difference) will be used to explore the effectiveness of such an approach for creating cultural awareness and learning among gamers. The results of this research will lead to new insights about how the designs of video games affect gamer attitudes towards race and difference, and will suggest new approaches for future game designers aiming to positively affect such attitudes.

We will employ commodity eye tracking hardware and software to capture gamer attention during gameplay. The data produced by these systems requires significant post-processing to understand. Furthermore, such systems aren't designed to temporally synchronize that data with active gaming experiences. Therefore, our students will develop software to process the eye tracking data, synchronize it over time with a video capture of the gameplay, and visualize aspects of the data on top of that video capture. Such visualizations, especially when comparing data from multiple subjects, will illustrate the relationships between race-based content in video games and player perceptions and actions. The software developed will be hosted on the Critical Technology Laboratory's GitHub account, made available via the NCSA Open Source License, and offered to the critical gaming academic community for use and comment. The students will work with Grosser and Byrd on all aspects of the project from critical understandings of race and race-based imagery in video games to consideration of how design of gaming interfaces affects user attention. Outcomes will include data visualization and software development, and the application of this research will include the analysis software itself, visualizations of results, publications of findings, and art exhibitions derived from the research.
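As a hedged sketch of the synchronize-and-overlay step, the Python fragment below draws timestamped gaze samples onto gameplay video frames with OpenCV. The file names, the gaze-sample format, and the simple nearest-timestamp synchronization are all assumptions, not the project's software.

```python
# Illustrative sketch (not the project's software): overlay timestamped gaze
# samples onto a gameplay video with OpenCV. File names, gaze format, and
# the naive nearest-timestamp sync are assumptions.
import cv2

gaze = [(0.00, 640, 360), (0.04, 650, 352), (0.08, 700, 340)]  # (sec, x, y)

cap = cv2.VideoCapture("gameplay.mp4")   # hypothetical capture file
fps = cap.get(cv2.CAP_PROP_FPS)
out = cv2.VideoWriter("overlay.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps,
                      (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                       int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))))

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t = frame_idx / fps
    # naive sync: pick the gaze sample whose timestamp is closest to the frame
    _, x, y = min(gaze, key=lambda s: abs(s[0] - t))
    cv2.circle(frame, (x, y), 12, (0, 0, 255), 2)   # red gaze marker
    out.write(frame)
    frame_idx += 1

cap.release(); out.release()
```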

Project 6: Machine Learning to Estimate Crop Productivity from Multi-Sensor Fused Satellite Data

Co-mentors

Kaiyu Guan

Kaiyu Guan
Natural Resources & Environmental Sciences

Jian Peng

Jian Peng
Computer Science

Social impact

Developing reliable forecasting systems with useful lead times for crop type and crop yield, for use by farming communities and government agencies.

Project description

Reliable forecasting systems with meaningful lead times for crop type and crop yield have critical value for farming communities and government agencies. Field-level estimation of crop yield is particularly useful for understanding how crop productivity responds to various management practices and environmental factors. With continued climate variability (e.g., the 2012 Midwest drought) and ongoing climate change, the farming community and government require better information to monitor crop growth and its near-term prospects. As the most important staple-food production area, the U.S. Corn Belt produces half of the world's corn and soybean combined, and has significant importance for the regional, national, and global economy and for food security. However, despite the great need for such forecasting information, no forecasting system for the U.S. Corn Belt is available for public use. This project will develop scalable machine-learning methods that integrate data from satellite remote sensing and other auxiliary information to make accurate yet cost-effective predictions of crop type. First, we will generate a 30-meter (2000-present; 10-meter for the post-2014 period), daily, cloud-free data stack for the three major Corn Belt states, Illinois, Iowa, and Indiana, by integrating three major satellite datasets (Landsat, MODIS, and the new Sentinel-2). We will build upon this data stack to develop a machine-learning forecasting system for crop type, with a lead time of up to one to two months before harvest and a particular focus on rain-fed corn and soybean. We will fully leverage satellite information, including spectral, phenological, and field-level texture information, to achieve field-level predictions with advanced deep learning approaches.

In this project, the REU students will implement a computational pipeline that extracts real-time satellite images, deposits data into a database, and develops data fusion and machine learning components for predictive tasks. Due to the enormous size of the satellite data and the highly demanding computational requirements, the students will need to distribute storage- and computation-intensive modules on the Blue Waters supercomputer. The students will meet with both Guan and Peng on a regular basis, participate in the group meetings and reading groups organized in their research labs, and interact with graduate students. The students will gain advanced knowledge of remote sensing and machine learning and improve their scientific research skills.
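For a toy illustration of the crop-type prediction step, the sketch below trains a scikit-learn classifier to separate corn from soybean using made-up spectral and phenological features; the real pipeline would draw these features from the fused Landsat/MODIS/Sentinel-2 data stack.

```python
# Toy, hedged illustration of the crop-type classification step. The
# features and their distributions are synthetic stand-ins, not real
# satellite data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500
# columns: peak NDVI, green-up day-of-year, field texture index (all made up)
X_corn = np.column_stack([rng.normal(0.85, 0.05, n),
                          rng.normal(150, 8, n),
                          rng.normal(0.3, 0.1, n)])
X_soy = np.column_stack([rng.normal(0.75, 0.05, n),
                         rng.normal(165, 8, n),
                         rng.normal(0.5, 0.1, n)])
X = np.vstack([X_corn, X_soy])
y = np.array([0] * n + [1] * n)   # 0 = corn, 1 = soybean

clf = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```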

Project 7: Intelligent Synthesis: Statistical Learning to Optimize Graphene Synthesis Parameters for Nanomanufacturing

Co-mentors

Elif Ertekin

Elif Ertekin
Mechanical Science and Engineering

Placid Ferreira

Placid Ferreira
Mechanical Science and Engineering

Social impact

Provide sophisticated data driven approaches to enable high quality, reproducible synthesis of nanomaterials for use as manufactured components in nanoelectronics.

Project description

Material synthesis is a primary bottleneck in emerging nanoelectronic devices. The promise of nanoelectronics will not become a reality unless the synthesis process is scalable and leads to high-quality, reproducible materials. Although there exists a tremendous amount of academic and industry research in synthesis, most advancements in synthesis science are achieved by an expensive and tedious trial-and-error approach.

In this project, we will go beyond the traditional trial-and-error approach by adopting a data-driven methodology to rapidly optimize the chemical vapor deposition synthesis of graphene and other emerging 2D nanomaterials. The key aspects include building and populating a large database of synthesis parameters and results, and implementing a system for automated data capture and extraction from actual growth experiments in real time. The database will be populated both by experiments carried out at Illinois and via crowd-sourcing from research groups around the world.

Students will develop and populate the 2D materials synthesis database, and will implement a tool that interfaces the database with Python-based libraries for supervised machine learning to perform targeted growth-parameter optimization. Students will also develop a configurable system for automated data collection from nanofabrication tools during growth experiments in real time. Experimental parameters will be pushed to a cloud server so that they can be curated and served to computational models. Educational video tutorials on the synthesis database and the machine learning approach will be developed. These videos will be placed on the nanoHUB and aimed at high school and undergraduate students.
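As a hedged sketch of the data-driven optimization idea, the fragment below fits a surrogate model to logged synthesis runs and ranks candidate growth conditions. The parameter names, ranges, and quality scores are illustrative assumptions, not measured graphene data.

```python
# Hedged sketch: fit a surrogate model to logged synthesis runs
# (parameters -> quality score) and rank candidate growth conditions.
# All parameter names, ranges, and scores are made up for illustration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
# columns: temperature (C), CH4 flow (sccm), growth time (min), toy data
X_runs = rng.uniform([900, 5, 10], [1100, 50, 60], size=(40, 3))
quality = np.exp(-((X_runs[:, 0] - 1030) / 40) ** 2) + rng.normal(0, 0.05, 40)

gp = GaussianProcessRegressor(normalize_y=True).fit(X_runs, quality)

candidates = rng.uniform([900, 5, 10], [1100, 50, 60], size=(1000, 3))
mean, std = gp.predict(candidates, return_std=True)
best = candidates[np.argmax(mean + std)]    # simple upper-confidence pick
print(f"suggested next run: T={best[0]:.0f} C, "
      f"flow={best[1]:.1f} sccm, t={best[2]:.0f} min")
```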

Project 8: Energy Workflows

Co-mentors

Daniel Katz

Daniel Katz
NCSA

Kathryn Huff

Kathryn Huff
Nuclear, Plasma, & Radiological Engineering

Social impact

Making progress towards safer reactor designs, leading to more plentiful energy for societal needs.

Project description

The students will automate an existing high performance computing simulation workflow and its corresponding data pipeline using the Parsl Python parallel scripting library. This collaboration will develop reproducible workflows for conducting simulation and analysis of phenomena in advanced nuclear reactor designs. The simulation and analysis workflow to be automated involves demonstration of the UIUC-developed Moltres application within the Multiphysics Object-Oriented Simulation Environment (MOOSE) Finite Element Modeling (FEM) ecosystem.

This work will be conducted in a transparent and open manner, with an emphasis on maximizing reproducibility and reuse potential. Our work will additionally leverage literate programming tools (e.g., Jupyter notebooks) as a platform for communicating analysis methods and results. The undergraduates will work as part of a team ensuring the efficiency, transparency, and reproducibility of the simulations being conducted and the underlying software (which is under constant development). Their tasks will familiarize them with scientific software development best practices such as pair programming, unit testing, automated documentation, and reproducible workflows.

Specifically, the students will pair program alongside the faculty mentors and one another to implement a Parsl-based data pipeline for molten salt nuclear reactor multiphysics simulations. As their familiarity with the project grows, they will contribute enhancements to the Parsl codebase and help researchers in nuclear engineering deploy this workflow to conduct repeatable validation and verification demonstrations of the simulation capabilities. Meanwhile, they will be guided in enriching both the Parsl and Moltres documentation for these methods as needed. This work will make progress towards safer reactor designs, leading to more plentiful energy for societal needs.
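For readers unfamiliar with Parsl, here is a minimal sketch of how a two-stage pipeline can be expressed with python_app decorators. The local-threads configuration and placeholder function bodies are for illustration only; the actual workflow would invoke Moltres/MOOSE runs on HPC resources.

```python
# Minimal Parsl sketch: two dependent python_apps forming a tiny pipeline.
# A local-threads config is used for illustration; the real workflow would
# target an HPC scheduler, and the function bodies here are placeholders.
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

@python_app
def simulate(case_id):
    # placeholder for launching a Moltres/MOOSE simulation
    return f"results-{case_id}"

@python_app
def analyze(result):
    # placeholder for post-processing a simulation's output
    return f"analysis of {result}"

# Parsl tracks the dependency: each analyze() waits for its simulate().
futures = [analyze(simulate(i)) for i in range(3)]
print([f.result() for f in futures])
```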

Project 9: Modeling and Detection of Black Hole Collisions with the Blue Waters Supercomputer

Co-mentors

Gabrielle Allen

Gabrielle Allen
Astronomy/Education

Roland Haas

Roland Haas
NCSA

Eliu Huerta Escudero

Eliu Huerta Escudero
NCSA

Social impact

By developing open source software to further scientific community efforts to detect gravitational waves, students will learn skills that they can use to tackle grand computational challenges across science domains, including those with broad social benefits.

Project description

The Laser Interferometer Gravitational-Wave Observatory's (LIGO) detection of gravitational waves from merging black holes in September 2015 inaugurated a new era in astronomy and astrophysics, opening a window to observe the Universe through gravitational radiation. Occurring 100 years after Einstein's announcement of his theory of general relativity, the detection spurred world-wide interest in physics and science in general, making headline news around the world. The recent Nobel Prize awarded for this detection and the announcement of the detection of a binary neutron star system by LIGO/Virgo underline the importance of these efforts and the interest that the wider society has in them.

In this project, a pair of REU-INCLUSION students will be involved in research within the larger LIGO community and will work on two key components of this project. One component will be porting an existing C library into LIGO's Algorithm Library. The code describes the gravitational waves emitted by the merger of black holes. The goal is to use this software in upcoming LIGO searches for gravitational wave sources. The second component requires development of a numerical routine to post-process numerical relativity simulations that describe the merger of black holes and neutron stars. The students will write Python/C libraries to extract information from these simulations through an optimization procedure. Through this work the students will become familiar with one of the most exciting research topics in contemporary astronomy, and this work will provide them with new tools to study phenomena across science domains that require high-performance environments. Having open source software to work with LIGO data makes it possible for interested members of the public to contribute to the science, and also for LIGO science to be incorporated in, for example, high-school syllabi to train future scientists.
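As a hedged toy example of the kind of post-processing involved, the sketch below computes a normalized overlap ("match") between a template waveform and noisy data, maximized over time shift via FFT cross-correlation. Real LIGO analyses use noise-weighted inner products; this white-noise version is only illustrative.

```python
# Hedged toy sketch: normalized overlap between a template waveform and
# data, maximized over time shift via FFT cross-correlation. Real LIGO
# analyses use noise-weighted inner products; this is only illustrative.
import numpy as np

fs = 4096                                  # sample rate (Hz)
t = np.arange(0, 1, 1 / fs)
template = np.sin(2 * np.pi * 100 * t * (1 + t))          # toy chirp
data = (np.roll(template, 800)
        + 0.5 * np.random.default_rng(0).normal(size=t.size))

# cross-correlate in the frequency domain, then normalize
corr = np.fft.irfft(np.fft.rfft(data) * np.conj(np.fft.rfft(template)))
match = corr.max() / (np.linalg.norm(data) * np.linalg.norm(template))
print(f"best time shift: {corr.argmax() / fs:.3f} s, "
      f"normalized match: {match:.2f}")
```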


Projects from 2017

Project 1: Reproducible Best Practices for Development of Scientific Simulation Software

Co-mentors

Kathryn Huff

Kathryn Huff
Nuclear, Plasma, & Radiological Engineering

Matthew Turk

Matthew Turk
Astronomy

Social impact

Developing safe and sustainable nuclear energy as a fundamental part of a clean, carbon-free, worldwide energy future.

Project description

Safe, sustainable nuclear energy will be a fundamental part of a clean, carbon-free, worldwide energy future. The Advanced Reactors and Fuel Cycles group (Huff) and the Data Exploration Laboratory (Turk) are collaborating to develop reproducible, open-source software (OSS) for simulation and analysis of phenomena in advanced nuclear reactor designs. Open source physics kernels and applications will be developed within the Multiphysics Object-Oriented Simulation Environment (MOOSE) Finite Element Modeling ecosystem. These kernels and simulations will extend current modeling capabilities to include physics appropriate for the unique phenomena encountered in advanced nuclear reactors. As much as possible, this work will be conducted in a transparent and open manner, with an emphasis on maximizing reproducibility and reuse potential. Accordingly, this work will leverage literate programming tools (e.g., Jupyter notebooks) as a platform for communicating analysis methods and results.

The undergraduates will work as part of a team ensuring the reproducibility of the kernels under development. Their tasks will familiarize them with scientific software development best practices such as pair programming, unit testing, automated documentation, and reproducible workflows. Specifically, the students will pair program alongside the faculty mentors, postdoctoral scholars, and one another to implement C++ unit tests using the MOOSE testing framework. As their familiarity with the project grows, they will engage in iterating upon repeatable validation and verification demonstrations of the simulation capabilities by using Jupyter notebooks to package and communicate simulation workflows. Meanwhile, they will be guided in enriching documentation for the methods as needed, using the Doxygen automated documentation framework.

Project 2: Simulating Momentary Time Sampling for Classroom Observations of Students' Affective States

Co-mentors

Luc Paquette

Luc Paquette
Curriculum & Instruction

Jinming Zhang

Jinming Zhang
Educational Psychology

Social impact

Education that better meets student needs, via experiment-driven data.

Project description

In this project, students will develop a tool to study momentary time sampling (MTS) of affective (emotional) states in a simulated classroom and to test the effect of different parameters on the accuracy of such a sampling approach. MTS is one of the preferred methods for discontinuous sampling in social sciences, but it is known to sometimes produce a high measurement error based on several factors related to the design of the observation session (e.g., the sampling interval, length of the observation session) and the properties of the behavior or construct being studied (e.g., interactions between the duration of individual behavioral events, as well as the number of instances and the overall prevalence of each behavior during the observation session). The goal of this project is to develop a tool allowing educational researchers to better plan their classroom data collection in order to improve the accuracy of the collected data and increase the statistical power of their experiments.
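A minimal simulation of this idea, with all parameters (session length, sampling interval, event durations) chosen purely for illustration, might look like the following sketch.

```python
# Hedged sketch of the core simulation: generate a student's true
# engaged/disengaged stream, apply momentary time sampling (MTS) at a
# fixed interval, and compare the estimated prevalence to the truth.
# Session length, interval, and event durations are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
session_s = 1800                       # 30-minute observation session
state = np.zeros(session_s, dtype=bool)
t = 0
while t < session_s:                   # alternate off/on events of random length
    off = rng.integers(20, 120)        # seconds of disengagement
    on = rng.integers(10, 90)          # seconds of engagement
    state[t + off : t + off + on] = True
    t += off + on

interval = 20                          # MTS: one observation every 20 s
samples = state[::interval]
true_prev = state.mean()
mts_prev = samples.mean()
print(f"true prevalence {true_prev:.2%}, MTS estimate {mts_prev:.2%}, "
      f"error {abs(mts_prev - true_prev):.2%}")
```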

Students participating in this project will be responsible for the design and implementation of the MTS tool. Their design will be informed by a previous prototype that was developed with limited functionality. As the goal of this project is to develop a tool that can be used by educational researchers, the resulting product will need to be multi-platform, have a user-friendly interface, and run efficiently on a regular laptop computer. Paquette will oversee the development of the tool and will help the students specify its design based on the needs of the research community. Zhang will contribute to the project by offering insights on how to generate random distributions when simulating students in a classroom and on how to evaluate the accuracy of the simulated observations.

Project 3: Quantification of the Potential for Epistatic Genomic Loci to Improve Maize Yield

Co-mentors

Liudmila Mainzer

Liudmila Mainzer
Institute for Genomic Biology

Alexander Lipka

Alexander Lipka
Crop Sciences

Social impact

Improved crop yields, including corn, through genetic assessment.

Project description

The biological objective of this project is to determine the extent to which epistasis between pairs of genomic loci contributes to variation in yield in maize. The motivation for this research is that it is anticipated that there will not be enough food for the world population by 2050. Thus, there is a critical need to improve the yields of crops, including maize. Indeed, critically assessing the genetic sources contributing to maize yield is an essential first step toward improving it. Assessing these genetic sources will enable breeders to focus their resources on specific genes (and/or combinations of genes) to select for in order to maximize yield. This project will assess three research questions: 1) How much of yield is explained by epistasis? 2) What are the effect sizes of epistatic loci? And 3) Will considering the identified epistatic loci result in substantially increased yields?

A pair of students, one with a more computational background and the other with a more biological background, will conduct this analysis. These students will conduct stepwise epistatic model selection in the maize nested association mapping panel. The traits to be analyzed will be yield and other related traits. The stepwise epistatic model selection program (which is part of our local version of TASSEL5) was developed by a team led by Mainzer and Lipka. Since high core-count servers are best for this work, we will use the 46-core Dell system available at NCSA's Innovative Systems Lab. The dual-threaded cores provide the ability to parallelize computation across up to 96 threads, which would significantly speed up this analysis. Mainzer and Lipka will supervise this project and ensure that the analysis is completed in a timely manner. To facilitate communication between the students and supervisors, the students will work and have weekly meetings at NCSA. The students will also attend the biweekly Lipka Lab meetings and the HPCBio group meetings. These meetings will give the students the opportunity to ask "big-picture" questions, identify any bottlenecks in conducting the analysis, and allow Mainzer and Lipka to monitor the students' progress. This project should provide insight into the contribution of epistasis to variation in yield in maize, and should identify, and propose solutions for, computational bottlenecks in running stepwise model selection. In addition, the computational student will learn more about biology, and the biology student will learn more about computational aspects. Both students will learn about conducting statistical analyses and participating in interdisciplinary collaborations.
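As a hedged, toy illustration of the underlying statistics (not the TASSEL5 implementation), the sketch below tests whether an interaction term between two loci explains yield beyond their additive effects, using an F-test on synthetic genotypes.

```python
# Hedged toy sketch (not TASSEL5): test whether an interaction (epistasis)
# term between two loci explains yield beyond their additive effects,
# via ordinary least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(7)
n = 500
snp_a = rng.integers(0, 3, n)          # genotypes coded 0/1/2
snp_b = rng.integers(0, 3, n)
yield_ = 1.0 * snp_a + 0.5 * snp_b + 0.8 * snp_a * snp_b + rng.normal(0, 1, n)

def rss(X, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

ones = np.ones(n)
additive = np.column_stack([ones, snp_a, snp_b])
full = np.column_stack([ones, snp_a, snp_b, snp_a * snp_b])

# F-test (1 numerator df) comparing the additive model to the epistatic model
f_stat = (rss(additive, yield_) - rss(full, yield_)) / (rss(full, yield_) / (n - 4))
print(f"F statistic for the interaction term: {f_stat:.1f}")
```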

Project 4: Evaluating Application Suitability for a Novel Migratory Near Memory Processing Architecture

Co-mentors

Volodymyr Kindratenko

Volodymyr Kindratenko
Electrical and Computer Engineering

William Gropp

William Gropp
Computer Science

Social impact

Solving larger Big Data problems more quickly, through a new computing platform.

Project description

In this project, students will gain first-hand experience with a cutting-edge computer architecture, the Migratory Near Memory Processing Architecture, currently being developed by Emu Technology. This is a new technology that could revolutionize several application areas involving big data analytics, many of which have potential social benefits for broad communities. Conventional HPC-scale distributed memory computers are designed with the assumption that the large majority of memory access operations will be to local memory. However, for truly large datasets with complex data access patterns, this is not the case, and accessing data across many memory systems becomes necessary to carry out the required computations. Emu's solution to this problem, referred to as Migratory Memory-Side Processing, is to move the execution context to the data rather than moving large amounts of data to the computational thread. This approach is promising for a number of applications, including both numerical and data-intensive codes.

Emu Technology will provide remote access to a prototype system programmable in Cilk, a parallel language based on C. Students will receive appropriate training in both the Cilk programming language and the software development tools and methodologies for Emu's system. Working with mentors, they will identify 2 to 4 kernels with distinctly different computational workloads and data access patterns and will re-implement them in Cilk for execution on both traditional shared memory systems and on Emu's novel architecture. Students will carry out the code implementation, performance measurements, and analysis of the results, culminating in a white paper that will compare and contrast applications running on the traditional shared memory architecture and the Migratory Near Memory Processing architecture. The students will work closely with other students at the Innovative Systems Lab (ISL) at NCSA, where they will be exposed to other research projects involving novel computer architectures. They will meet with both PIs on a weekly basis to review results and receive feedback on their work.

Project 5: Computational Materials Science: Visualization and Machine Learning

Co-mentors

Andre Schleife

Andre Schleife
Materials Science and Engineering

Andrew Ferguson

Andrew Ferguson
Materials Science and Engineering

Social impact

Provide sophisticated yet intuitive and user-friendly visualization for effective materials data analysis and data dissemination to a broad scientific audience and the general public.

Project description

Modern computational materials science produces large amounts of static and time-dependent data that is rich in information. Examples include atomic geometries of complex biomolecules, condensed-matter crystals, and electron-density probability distributions. Extracting the relevant information from these data to determine the important processes and mechanisms constitutes an important scientific challenge. The availability of sophisticated yet intuitive visualization is a crucial component of effective data analysis, and is vital in disseminating results to a broad scientific audience and the general public.

In this project we use and develop physics-based ray-tracing and stereoscopic rendering techniques to visualize the structure of existing and novel materials, e.g., for solar-energy harvesting, optoelectronic applications, and focused-ion beam technology. We will couple these visualization tools with Maxwell solvers and supervised machine learning algorithms to perform targeted discovery and rational design of new materials with tailored optical properties. The team will establish a powerful and intuitive platform for visualization of atomic geometries, optical reflection and transmission spectra, and time-dependent electronic excitations. This platform will allow for guided design of next-generation optical materials for use in novel lenses or energy-saving window coatings.

Working towards this goal, students will analyze and visualize atomic geometries and electron densities from first-principles simulations of excited electronic states using density functional theory (DFT) and time-dependent DFT. They will develop an open-source tool that interfaces an external Maxwell solver with the scikit-learn Python-based machine-learning library to perform supervised machine learning and guided materials design and discovery. Students will also develop codes based on the open-source ray-tracer Blender/LuxRender and the open-source yt framework to produce image files and movies from these data. Stereoscopic images will be produced that can be visualized using, e.g., Google Cardboard or other virtual-reality viewers. Examples of possible outcomes can be seen here.

Project 6: Machine Learning to Estimate Crop Productivity from Multi-Sensor Fused Satellite Data

Co-mentors

Kaiyu Guan

Kaiyu Guan
Natural Resources & Environmental Sciences

Jian Peng

Jian Peng
Computer Science

Social impact

Developing reliable forecasting systems with useful lead times for crop type and crop yield, for use by farming communities and government agencies.

Project description

Reliable forecasting systems with meaningful lead times for crop type and crop yield have critical value for farming communities and government agencies. Field-level estimation of crop yield is particularly useful for understanding how crop productivity responds to various management practices and environmental factors. With continued climate variability (e.g., the 2012 Midwest drought) and ongoing climate change, the farming community and government require better information to monitor crop growth and its near-term prospects. As the most important staple-food production area, the U.S. Corn Belt produces half of the world's corn and soybean combined, and has significant importance for the regional, national, and global economy and for food security. However, despite the great need for such forecasting information, no forecasting system for the U.S. Corn Belt is available for public use. This project will develop scalable machine-learning methods that integrate data from satellite remote sensing and other auxiliary information to make accurate yet cost-effective predictions of crop type. First, we will generate a 30-meter (2000-present; 10-meter for the post-2014 period), daily, cloud-free data stack for the three major Corn Belt states, Illinois, Iowa, and Indiana, by integrating three major satellite datasets (Landsat, MODIS, and the new Sentinel-2). We will build upon this data stack to develop a machine-learning forecasting system for crop type, with a lead time of up to one to two months before harvest and a particular focus on rain-fed corn and soybean. We will fully leverage satellite information, including spectral, phenological, and field-level texture information, to achieve field-level predictions with advanced deep learning approaches.

In this project, the REU students will implement a computational pipeline that extracts real-time satellite images, deposits data into a database, and develops data fusion and machine learning components for predictive tasks. Due to the enormous size of the satellite data and the highly demanding computational requirements, the students will need to distribute storage- and computation-intensive modules on the Blue Waters supercomputer. The students will meet with both Guan and Peng on a regular basis, participate in the group meetings and reading groups organized in their research labs, and interact with graduate students. The students will gain advanced knowledge of remote sensing and machine learning and improve their scientific research skills.

Project 7: Understanding Types of Research Problems from a Replication Perspective

Co-mentors

Daniel Katz

Daniel Katz
NCSA

Victoria Stodden

Victoria Stodden
iSchool

Social impact

Identifying the software capabilities needed to enhance the reproducibility of computational findings and to restore public confidence in the research community and in scientific discoveries.

Project description

Replication is a growing concern in modern research, as an increasing number of papers and results each year are shown to be irreproducible or are withdrawn. When these cases are publicized within the research community, the community can respond by fixing the specific problems (reasonably easily) or the systematic problems (much harder, but being attempted). However, when the publicity stretches into the general press, it leads the public towards skepticism of all science and all research, which is harder to recover from. If we can move towards science being more generally and more automatically reproducible, we can avoid this loss of public confidence. We can also work towards the democratization of science: when more people can start with existing findings and ask new questions, we get more and better results. And we can increase general human well-being by recognizing that a small amount of current research, in health and all other fields, reaches incorrect conclusions due to bugs, errors, and the like; by checking all research, we could find these errors more easily, giving us more faith in the research that has been reproduced.

In this project, we will investigate the role of software in the scientific discovery process by exposing gaps in the verifiability of published computational results, based on Stodden's work on reproducibility. We plan to understand the software capabilities needed to enhance the reproducibility of computational findings, including workflow and automated process capture, the role of software citation in addressing incentives, and best practices in documentation and development. Students will attempt to reproduce published computational studies to understand these gaps, understand why reproduction is or is not possible, and try to design solutions needed to facilitate reproducibility in computational science. The faculty and students will use these experiences to classify different types of workflows with respect to their reproducibility, and the students will create a website to demonstrate this to the public. This project will also expose the students to a wide variety of computational tools and develop a broad base of skills in computational science research.
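As one small, hedged example of what "automated process capture" can mean in practice, the sketch below records run metadata (Python version, platform, installed package versions, and input-file hashes) to a JSON file; the chosen fields and file names are assumptions.

```python
# Hedged sketch of "automated process capture": record enough run metadata
# (platform, package versions, input-file hashes) to re-check a computation
# later. The file names and metadata fields chosen here are assumptions.
import hashlib
import json
import platform
import sys
from importlib import metadata

def sha256(path: str) -> str:
    """Hash an input file so a later rerun can verify it is unchanged."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

record = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {d.metadata["Name"]: d.version
                 for d in metadata.distributions()},
    "inputs": {p: sha256(p) for p in ["data.csv"]},   # hypothetical input file
}

with open("provenance.json", "w") as f:
    json.dump(record, f, indent=2)
```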

Project 8: Eye Tracking Race and Cultural Difference in Videogames

Co-mentors

Ben Grosser

Ben Grosser
School of Art & Design

Jodi Byrd

Jodi Byrd
English

Social impact

Identifying the effects of video game design on perceptions of race and cultural difference, critical thinking, and split-second decision-making.

Project description

How do players of contemporary video games perceive and process race-based information as part of their gaming experience? What role does that perception play in split-second decision-making processes in terms of character combat, map navigation, and attitudes towards cultural information throughout game environments? This project at the NCSA Critical Technology Studies Laboratory will combine eye tracking of visual attention during game play, data analysis of that attention, and visualization of the results towards a critical understanding of race and cultural difference in video games. We will use Irrational Games' 2013 AAA title, BioShock Infinite, to examine the effects of both active (e.g., visible race of characters and combatants in the game) and passive (e.g., racial background materials embedded in environments, level designs, and other artistic content) processing of race-based visual data. The game Never Alone (a collaboratively designed game with input from Alaskan Native elders that includes activities and artifacts intended to foreground indigenous cultural difference) will be used to explore the effectiveness of such an approach for creating cultural awareness and learning among gamers. The results of this research will lead to new insights about how the designs of video games affect gamer attitudes towards race and difference, and will suggest new approaches for future game designers aiming to positively affect such attitudes.

We will employ commodity eye tracking hardware and software to capture gamer attention during gameplay. The data produced by these systems requires significant post-processing to understand. Furthermore, such systems aren't designed to temporally synchronize that data with active gaming experiences. Therefore, our students will develop software to process the eye tracking data, synchronize it over time with a video capture of the gameplay, and visualize aspects of the data on top of that video capture. Such visualizations, especially when comparing data from multiple subjects, will illustrate the relationships between race-based content in video games and player perceptions and actions. The software developed will be hosted on the Critical Technology Laboratory's GitHub account, made available via the NCSA Open Source License, and offered to the critical gaming academic community for use and comment. The students will work with Grosser and Byrd on all aspects of the project from critical understandings of race and race-based imagery in video games to consideration of how design of gaming interfaces affects user attention. Outcomes will include data visualization and software development, and the application of this research will include the analysis software itself, visualizations of results, publications of findings, and art exhibitions derived from the research.

Project 9: Practical Process Topology Assignment

Co-mentors

William Gropp

William Gropp
Computer Science

Marc Snir

Marc Snir
Computer Science

Social impact

Improve performance of critical scientific codes running on the nation's largest HPC systems.

Project description

Some of the most challenging problems facing society today require the fastest available computers. Such problems include everything from understanding climate change to modeling the actions of diseases. A critical barrier to successful simulations is the scalability of massively parallel simulation codes, and one element of that is the effective placement of computing processes on a massively parallel machine. As several studies have shown, the mapping of processes to cores and nodes in a massively parallel computer can have a significant impact on the performance of tightly-coupled applications. In addition, the dominant programming model for these applications, MPI, provides an interface that allows applications to describe how their processes communicate; this interface was intended to allow MPI implementations to provide an appropriate mapping of processes to nodes and cores. However, no implementation of MPI provides a useful implementation of this feature, in part because a perfect solution is NP-complete. The challenge is to develop practical approximate solutions that match common application needs.

In this project, students will develop both benchmarks to measure the impact of different process assignments and tools to provide good (if not optimal) process assignments for common communication patterns. Several approaches will be considered, including bandwidth-reduction permutations for the matrix that represents the communication graph, as proposed by Snir and Hoefler, as well as hierarchical approaches that consider the assignment to nodes, chips, and cores separately. The project will result in better implementations of the MPI process topology routines that can be integrated into existing open source implementations such as MPICH and Open MPI, as well as research papers detailing the approach and results. The work will take advantage of the Blue Waters system for tests measuring the impact of process topology assignment on scalability, and will collaborate with the Center for the Extreme Scale Simulation of Plasma Coupled Combustion to apply these methods in a large-scale, complex application.
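For concreteness, here is a minimal mpi4py sketch of the standard process-topology interface the project builds on: it creates a 2D Cartesian communicator (with reorder=True, which permits the MPI library to remap ranks onto the hardware) and queries each rank's neighbors. The example run command is an assumption about the local environment.

```python
# Minimal mpi4py sketch of the MPI process-topology interface: build a 2D
# Cartesian communicator (reorder=True lets the MPI library remap ranks
# onto the hardware) and query each rank's neighbors.
# Run with e.g. `mpiexec -n 4 python demo.py`.
from mpi4py import MPI

world = MPI.COMM_WORLD
dims = MPI.Compute_dims(world.Get_size(), 2)   # factor size into a 2D grid
cart = world.Create_cart(dims, periods=[False, False], reorder=True)

left, right = cart.Shift(0, 1)    # neighbor ranks along the first dimension
down, up = cart.Shift(1, 1)       # neighbor ranks along the second dimension
rank = cart.Get_rank()
print(f"rank {rank} at {cart.Get_coords(rank)}: "
      f"neighbors L={left} R={right} D={down} U={up}")
```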

Project 10: Data Storage and Analysis Framework for Semiconductor Nanocrystals Used in Bioimaging

Mentor

Andre Schleife

Andre Schleife
Materials Science and Engineering

Social impact

By exploring systematic exchange of data and workflows, this project provides insights and best practices for the future of collaborative research. Providing access to sample-specific data and analysis to the international community will accelerate deployment of novel semiconductor nanocrystals for bioimaging.

Project description

Light-emitting molecules are a central technology in biology and medicine that provide the ability to optically tag proteins and nucleic acids that mediate human disease. In particular, fluorescent dyes are a key part of molecular diagnostics and optical imaging reagents. We recently made major breakthroughs in engineering fluorescent semiconductor nanocrystals to increase the number of distinct molecules that can be accurately measured, far beyond what is possible with such organic dye molecules. We aim to develop nanocrystals that are able to distinguish diseased from healthy tissue and determine how the complex genetics underlying cancer respond to therapy, using measurement techniques and microscopes that are already widely accessible.

In order to achieve this goal, we need to understand a complex design space that includes the size, shape, composition, and internal structure of the different nanocrystals. Students in this team will work with computational and experimental researchers in several departments in order to establish a database to store, share, and catalog optical properties and other relevant data describing semiconductor nanocrystals. This requires developing schemas and analysis workflows that can be efficiently shared between multiple researchers. Eventually, both the data and the workflows will be made available to the general public.

Students will first identify all of the information that needs to be included in this catalog. Students will then write JSON and Python code and interface with Globus and the Materials Data Facility. They will create well-documented IPython notebooks that operate directly on the Globus file structure and run in the web browser. Students will also develop code that automatically analyzes data stored in the facility, e.g., to verify and validate experimental and computational results against each other. This project is highly interdisciplinary, and students will work with a team of researchers in bioengineering, materials science, mechanical engineering, and NCSA.