Wired - Groups like DataRefuge and the Environmental Data and Governance Initiative, which organized [a] Berkeley hackathon to collect data from NASA’s earth sciences programs and the Department of Energy, are doing more than archiving. Diehard coders are building robust systems to monitor ongoing changes to government websites. And they’re keeping track of what’s been removed—to learn exactly when the pruning began.

Tag It, Bag It
The data collection is methodical, mostly. About half the group immediately sets web crawlers on easily copied government pages, sending their text to the Internet Archive, a digital library made up of hundreds of billions of snapshots of webpages. They tag more data-intensive projects—pages with lots of links, databases, and interactive graphics—for the other group. Called “baggers,” these coders write custom scripts to scrape complicated data sets from the sprawling, patched-together federal websites.
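Sending a page to the Internet Archive can be as simple as requesting the Wayback Machine’s public “Save Page Now” endpoint (`https://web.archive.org/save/`). A minimal sketch, assuming only the Python standard library; the helper names and the example target URL are illustrative, not the volunteers’ actual code:

```python
# Sketch: ask the Wayback Machine to snapshot a page via "Save Page Now".
# The endpoint is real; archive_page() and the User-Agent string are
# illustrative placeholders, and requests may be rate-limited in practice.
from urllib.parse import quote
from urllib.request import Request, urlopen

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_url(target_url: str) -> str:
    """Build the Save Page Now URL for a target page."""
    # Keep ':' and '/' unescaped so the target URL stays readable.
    return SAVE_ENDPOINT + quote(target_url, safe=":/")

def archive_page(target_url: str) -> int:
    """Request a snapshot; returns the HTTP status code (network call)."""
    req = Request(save_page_url(target_url),
                  headers={"User-Agent": "archiving-sketch"})
    with urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    # Build (but don't send) a save request for a hypothetical page.
    print(save_page_url("https://www.energy.gov/data"))
```

This covers the simple half of the split: pages whose text a crawler can grab in one pass. The “bagger” work on databases and interactive graphics needs per-site custom scraping and doesn’t reduce to a single endpoint call.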
It’s not easy. “All these systems were written piecemeal over the course of 30 years. There’s no coherent philosophy to providing data on these websites,” says Daniel Roesler, chief technology officer at UtilityAPI and one of the volunteer guides for the Berkeley bagger group.