How ProPublica Used Genomic Sequencing Data to Track an Ongoing Salmonella Outbreak

Last week, ProPublica published an investigation documenting the failures in the U.S. food safety system that allowed the spread of a type of salmonella known as multidrug-resistant infantis. The bacteria have sickened tens of thousands of people, but outdated government policies and pushback from trade groups have left federal agencies with little power to stop infantis from spreading through the poultry industry.

Our reporting relied on public records requests and dozens of interviews, the bread and butter of journalism. But I also made use of a type of data ProPublica has never before tapped into: publicly available genomic sequencing data.

Before I became a data journalist, I was a doctoral student in electrical engineering. Most of my research was in bioinformatics — the analysis and interpretation of genetic data — and for seven years, culturing bacteria, purifying nucleic acids and writing code to analyze sequencing data were my bread and butter.

So when my ProPublica colleagues Michael Grabell and Bernice Yeung approached me with questions about genomic sequencing data, I was all ears. They explained that they were digging into a salmonella outbreak investigation that the Centers for Disease Control and Prevention had closed in 2019, albeit with a warning that “illnesses could continue because this salmonella strain appears to be widespread in the chicken industry.”

Health officials hadn’t said anything to consumers about the outbreak since then. Bernice and Michael wanted to know if there was a way to track the strain to see if the outbreak was still going on. If the ultimate goal of foodborne illness investigations is to identify and address the source of an outbreak, why had the CDC closed this investigation without finding the cause and while people were still getting sick?

Several experts explained the investigation hadn’t found clear links to any particular product, brand or food processing plant. But they also mentioned something that would turn out to provide a crucial map for tracking the outbreak’s path. The federal government was compiling data about foodborne illnesses through a system called the Pathogen Detection project. It’s run by the National Center for Biotechnology Information, or NCBI, part of the National Institutes of Health, and it integrates data about bacterial pathogens taken from samples of food, the environment and human patients. The NCBI analyzes data in real time, and the results are monitored by public health agencies, including the CDC and food safety regulators.

We thought we could use the same information to figure out whether multidrug-resistant salmonella infantis is still out there — and if so, why.

The Infantis Chronicles

So Michael and Bernice started poking around the NCBI database. It was, as promised, a bonanza of bacterial detail. After talking to sources and digging into the scientific literature, they found a unique identifier code that yielded a “cluster” of more than 8,000 salmonella infantis samples, including nearly all of the 300 samples that were designated as part of the outbreak. But some terms were hard for them to decipher, like “phylogenetic trees” or “single-nucleotide polymorphisms.”

Turns out, these are attributes derived from genomic sequencing data, the type of data I worked with during my doctoral studies. Though my background isn’t specifically in bacterial genetics (much less this particular strain of salmonella), I knew enough to let me start digging into these details.

From reviewing the NCBI database’s documentation, reading academic papers and talking to scientists, I learned that samples are assigned to the cluster based on genetic similarity. But if this is true, what could the other bacterial samples in the cluster (about 7,700 samples that weren’t officially documented as part of the outbreak) tell us about infantis? Was it possible that the outbreak never ended, and remained just as rampant as before the CDC closed its investigation? Or was this outbreak challenging the very definition of what an outbreak is?

The NCBI data alone wouldn’t tell us. The database is stripped of key details, like the poultry plant where a salmonella sample was taken or when and where a patient got sick, a shortcoming that industry scientists and consumer advocates have complained about. This is where public records proved vital. Michael and Bernice, along with Mollie Simon on our research team, filed dozens of public records requests with the CDC, the U.S. Department of Agriculture and state public health agencies that had worked on the infantis outbreak. Through those requests, we obtained records from the USDA’s microbiological sampling program that revealed how often different types of salmonella were being found at which poultry plants. We also obtained epidemiological information about patients who had been part of the outbreak, including the date they’d been tested for salmonella, details about their illness and recent food consumption. The records didn’t include patients’ names, of course, but we could match both of the datasets we’d obtained to the sequencing data available on the NCBI database.

The USDA sampling data also allowed ProPublica to create an online tool that consumers could use to check the salmonella records of the plants that process their chicken and turkey.

As we pored over the data and public records, we learned about how the CDC has analyzed DNA to connect food poisoning cases. From the 1990s to just around the time of the infantis outbreak, investigators used a technique called pulsed-field gel electrophoresis, or PFGE.

The difference between PFGE and sequencing data was crucial to this outbreak and our investigation.

PFGE Is Dead. Long Live WGS.

When a patient shows up at the hospital with symptoms of foodborne illness, a stool or urine sample may be taken. Then, DNA from bacteria found in the sample can be extracted in a lab.

A DNA sequence can be thought of as a huge compound word spelled with only four possible molecular “letters,” or nucleotide bases. PFGE uses a special protein to cut up DNA into smaller sections — imagine breaking up a giant compound word into chunks of words. Then, an electric field is applied, and the segments of cut-up DNA will rearrange based on their weight, resulting in a visible barcode-like pattern.

Scientists can compare PFGE patterns to make informed guesses about how closely related pathogens are to one another. The more similar their PFGE patterns, the more similar their underlying DNA must have been. For years, including the time covered by the infantis investigation, PFGE patterns were used to define outbreak strains.

But in the early 2000s, a new technology called next-generation sequencing made it possible to get a readout of the full sequence of nucleotide bases in a DNA sample relatively quickly, a process called whole-genome sequencing, or WGS. Individuals are distinguished by the tiniest differences in the genes we share, but detecting those differences is beyond the abilities of PFGE. Whole-genome sequencing, though, can reveal the unique “spellings” of our DNA that differentiate you from me, or one strain of a pathogen from another.
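A toy sketch can make the gap between the two methods concrete. Everything in it is invented for illustration: the sequences are a few dozen letters long rather than millions, the recognition site is arbitrary, and the function names are hypothetical. It is not the CDC's or NCBI's method. It simply shows that two samples can differ by one letter and still produce the same fragment-length "barcode."

```python
def fragment_pattern(dna, site="GAATTC"):
    """Cut the sequence in front of every copy of the recognition site
    and return the fragment lengths, longest first (a crude "barcode")."""
    cuts = []
    start = dna.find(site)
    while start != -1:
        cuts.append(start)
        start = dna.find(site, start + 1)
    bounds = [0] + cuts + [len(dna)]
    lengths = [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
    return sorted(lengths, reverse=True)


def base_differences(a, b):
    """Positions where two equal-length sequences have different letters."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Two made-up samples that differ by a single letter in the middle fragment.
sample_a = "ATGCCGTA" + "GAATTC" + "GGCTAAGCTTACG" + "GAATTC" + "TTAGC"
sample_b = "ATGCCGTA" + "GAATTC" + "GGCTAAGCATACG" + "GAATTC" + "TTAGC"

print(fragment_pattern(sample_a))            # fragment lengths for sample A
print(fragment_pattern(sample_b))            # same lengths for sample B
print(base_differences(sample_a, sample_b))  # the one position that differs
```

Because the changed letter neither creates nor destroys a cut site, the fragment lengths are identical, while the letter-by-letter comparison picks up the difference.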

Sequencing data is the backbone of the NCBI Pathogen Detection project. NCBI groups genetically similar samples into clusters and then compares each sequence, nucleotide base by nucleotide base, to the other sequences in the cluster.
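In spirit, a base-by-base comparison amounts to counting the positions where two sequences disagree. The sketch below assumes the sequences are already aligned and the same length, which hides most of the hard work in a real pipeline. The sample names and sequences are made up, and this is not NCBI's actual code.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count the positions where two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def pairwise_distances(samples):
    """Compare every sample with every other sample exactly once."""
    return {
        (m, n): snp_distance(samples[m], samples[n])
        for m, n in combinations(sorted(samples), 2)
    }


# Hypothetical, very short "genomes" for three samples.
samples = {
    "chicken_1": "ACGTACGTAC",
    "patient_1": "ACGTACGTAT",
    "patient_2": "ACCTACGTAT",
}

distances = pairwise_distances(samples)
for pair, n_differences in distances.items():
    print(pair, n_differences)
```

The smaller the count for a pair, the more closely related the two samples are presumed to be.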

For each cluster of samples, NCBI also creates a phylogenetic tree, an evolutionary biologist’s version of your Aunt Sue’s hand-drawn family tree. This models how a group of organisms might be related to possible common ancestors and to one another.
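One simple way to turn pairwise differences into a tree is to keep merging the two closest groups until only one remains. The sketch below uses a textbook method called UPGMA, chosen here only because it is short. It is an assumption for illustration, not the algorithm NCBI uses, and the sample names and distances are invented.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree (as nested tuples) by repeatedly merging the two
    closest clusters. `dist` maps frozenset({a, b}) to a distance."""
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair of clusters currently in play.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance from the merged cluster to each remaining cluster is
        # the size-weighted average of the two old distances.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical counts of differing bases between four samples.
names = ["chicken_1", "patient_1", "patient_2", "turkey_1"]
dist = {
    frozenset(("chicken_1", "patient_1")): 1,
    frozenset(("chicken_1", "patient_2")): 4,
    frozenset(("chicken_1", "turkey_1")): 9,
    frozenset(("patient_1", "patient_2")): 3,
    frozenset(("patient_1", "turkey_1")): 8,
    frozenset(("patient_2", "turkey_1")): 9,
}

tree = upgma(names, dist)
print(tree)
```

In the result, samples nested together in the innermost parentheses are the ones the method treats as sharing the most recent hypothetical ancestor.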

But phylogenetic trees that are drawn based on hypothetical common ancestors, like NCBI’s, are interpreted differently than known family trees. Genetic changes occur by evolution, but also by chance. In the case of humans, millions of unrelated strangers might have a particular gene that gives rise to a particular disease, but that’s different from knowing that I inherited that gene directly from my parents. It’s largely the same for bacteria like salmonella.

So I wondered: What could the tree for this infantis cluster tell us about how closely related the outbreak samples were to the thousands of more recent food and patient samples in the same cluster?

No Silver Bullet

To find out, I freed up 100 gigabytes on my work laptop and asked my editors for 50 euros. The hard-drive space was for comparing approximately 32 million pairs of samples from the NCBI data, and the euros were for phylogenetic visualization software created by researchers in Germany.
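The figure of roughly 32 million pairs is what an all-against-all comparison of a cluster this size produces. A quick check, assuming a round 8,000 samples (the cluster held "more than 8,000," so the real number is somewhat higher):

```python
import math

n_samples = 8_000  # assumption: a round number standing in for "more than 8,000"

# Each sample is compared once with every other sample: n * (n - 1) / 2 pairs.
n_pairs = math.comb(n_samples, 2)

print(f"{n_pairs:,}")  # 31,996,000
```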

By comparing the bacteria samples found in USDA tests to the outbreak samples, I found that this year the agency has been finding drug-resistant infantis that’s genetically similar to the outbreak strain more than twice a day, on average, in chickens destined for supermarkets and restaurants. We also confirmed that the CDC is still receiving reports of infantis infections, some as recently as last week.

This finding highlights the power of WGS databases like NCBI’s to help investigators draw connections between human illness and foods they may have eaten. Thanks to WGS, public health officials have discovered that certain foods, like raw flour and peaches, were vectors for outbreaks of foodborne illnesses, even though they had rarely been linked to a particular bug. Sequencing data has even helped solve cases that had long gone cold, like a sprinkling of food poisoning cases linked to ice cream that were finally connected after half a decade.

But WGS is no silver bullet. Even a seemingly “perfect” DNA match in NCBI cannot conclusively identify the specific culprit behind a foodborne illness. Bacteria accumulate changes in DNA relatively rapidly and have an annoying habit of swapping genes like Pokémon cards. Bacterial samples might share the same set of genes and mutations because they came from the same source, or they might have acquired them independently under completely different circumstances. So a genetic match between a food sample and a human sample must be corroborated with epidemiological proof to make sure it fits in the outbreak timeline and matches a theorized source of the outbreak.

I’d hoped that visualizing the outbreak samples on a phylogenetic tree would reveal insights about the more recent infantis samples versus the ones collected during the outbreak. Perhaps there would be patterns in the tree showing that newer samples shared more genetic similarities than the outbreak samples did. Or that certain outbreak samples had spawned mini-outbreaks of their own. Instead, the visualization software showed that, evolutionarily speaking, the outbreak samples were all over the place: They couldn’t be tracked back to one particular source.

The lack of obvious patterns in the tree that could be tied to geography, time or food product supported the CDC’s theory that infantis contamination was likely originating not at particular slaughterhouses or processing plants, but rather upstream in the poultry supply chain, perhaps in feed or breeding flocks. (The two major breeding companies, Aviagen and Cobb-Vantress, the latter a subsidiary of Tyson Foods, declined to comment.) Comparing the DNA sequences yielded no further clues — the number of genetic differences between two samples from during the outbreak was, on average, about the same as that between an outbreak isolate and a more recent multidrug-resistant infantis isolate. To put it simply: The infantis samples before, during and after the outbreak were, in the end, all pretty similar.
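The comparison described above boils down to two averages: the typical number of differences between pairs of outbreak samples, and the typical number between an outbreak sample and a more recent one. The sketch below shows the shape of that calculation on invented, pre-aligned toy sequences. The sample names and data are hypothetical, and the real analysis covered millions of pairs.

```python
from itertools import combinations, product
from statistics import mean


def snp_distance(a, b):
    """Count the positions where two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Made-up sequences standing in for outbreak-era and more recent isolates.
outbreak = {"o1": "AAAAAAAA", "o2": "AAAAAAAT", "o3": "AAAAAATT"}
recent = {"r1": "AAAAAAAT", "r2": "AAAAATTT"}

# Average over every pair of outbreak samples.
within_outbreak = mean(
    snp_distance(a, b) for a, b in combinations(outbreak.values(), 2)
)

# Average over every outbreak sample paired with every recent sample.
outbreak_vs_recent = mean(
    snp_distance(a, b) for a, b in product(outbreak.values(), recent.values())
)

print(round(within_outbreak, 2), round(outbreak_vs_recent, 2))
```

When the two averages come out about the same, as they do in this contrived example, the recent samples are no more distant from the outbreak than the outbreak samples are from one another.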

We shared our findings with numerous experts, including former and current CDC researchers and food safety scientists. They agreed that our analysis indicated something very different from the traditional foodborne illness outbreak that can be traced back to a definitive single source. What we’ve been looking at, they said, is indicative of a bug that’s so deeply entrenched in the poultry supply chain that it’s hard to figure out where it came from.

A New Landscape of Bacterial Foodborne Illness

The closure of the infantis investigation without any conclusions about the outbreak’s origins is, it appears, a harbinger. Similarities in genetic data are linking seemingly unrelated cases of people getting sick in different states and consuming different products. At a USDA meeting on salmonella last year, Robert Tauxe, director of the CDC’s Division of Foodborne, Waterborne and Environmental Diseases, described a “new landscape” of foodborne illnesses revealed by WGS: strains that cause recurring outbreaks, that are newly emerging and that persist in a population from year to year.

The very definition of what constitutes an outbreak is in question, experts told us.

Scientists are beginning to answer that question with sequencing data and are piecing together how bacteria are taking advantage of our interconnected food supply chains. A 2015 study on salmonella in fish products destined for sushi in restaurants and grocery stores identified certain countries in the global tuna supply chain where salmonella contamination is more likely to occur.

“Public health agencies,” wrote the authors, “could use this information to determine most effective intervention points to minimize or eliminate outbreak risk.”

Up to now, the USDA hasn’t fully used the information at its disposal to prevent the most dangerous strains of salmonella from spreading in our food supply.

It’s possible that will change, though. Last month, after years of public pressure (and weeks of inquiries from ProPublica), the USDA’s top food safety official, Sandra Eskin, said the agency was rethinking its approach to salmonella. The agency will set up pilot projects and hold meetings to develop a new plan, but its announcement was short on specifics.

Michael Grabell and Bernice Yeung contributed reporting. Illustrated explainer by Laila Milevski.