This past summer I took a data analysis in Python course from Dataquest. At the end of it I felt pretty uncertain about my grasp of some of the statistical techniques it introduced, but I loved writing Python and SQL code to massage and analyze data sets, and I wanted to do more of it. But what to analyze?
Last year we completely replaced our connection to the city sewer. It was expensive, but we’d already spent a good chunk of change on two previous spot repairs and saw that it was better to bite the bullet. Our sewer was about 90 years old. Houses like ours had sewers made with first-generation concrete pipe, and we’d reached the end of its serviceable life, along with many others. In the time before and since we had our line replaced, we’ve noticed dozens of other houses in the neighborhood with major sewer work.
I thought it would be interesting to see what the data showed about sewer repairs, and what predictions one could make.
I didn’t know what data was available, but it seemed like I should be able to find the construction dates of all houses in Seattle. Permitting data should also be available for at least the past 5 years. Combining the two should cast light on how common sewer repairs are for various cohorts of housing.
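To make the idea concrete, here is a minimal pandas sketch of the kind of cohort comparison I have in mind. Everything in it is made up for illustration: the tables, the column names (`parcel_id`, `year_built`, `description`) and the values are assumptions, not the real Seattle schema or data.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions, not the real schema.
houses = pd.DataFrame({
    "parcel_id": [1, 2, 3, 4, 5, 6],
    "year_built": [1925, 1928, 1931, 1955, 1958, 1992],
})
permits = pd.DataFrame({
    "parcel_id": [1, 3, 3, 4],
    "description": ["side sewer repair", "side sewer replacement", "deck", "reroof"],
})

# Keep only permits whose description mentions sewer work.
sewer = permits[permits["description"].str.contains("sewer", case=False)]

# Mark each house that has at least one sewer permit.
houses["had_sewer_work"] = houses["parcel_id"].isin(sewer["parcel_id"])

# Group houses into cohorts by decade built and compute the share with sewer work.
houses["decade"] = houses["year_built"] // 10 * 10
repair_rate = houses.groupby("decade")["had_sewer_work"].mean()
print(repair_rate)
```

With real data the text match on the permit description would probably need to be smarter, and the join key might not be a tidy parcel ID at all.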
Initial scrutiny of the permit data suggested I should be able to identify sewer work without much trouble, but determining the extent of that work could be more difficult. Then I found more detailed information: a dataset that cataloged all the “side sewers” in the city, down to the level of individual stretches of pipe. There were fields for installation and inspection dates. The majority of records left these blank, but it seemed that recent repairs and replacements might have dates populated. My hypothesis is that I can
Now that I’ve identified the data I can use, I have to figure out how to combine it. In some cases, there may be primary and foreign key relationships I can use. In others, though, it looks like I’m going to have to match records by geographic coincidence, something I’ve never done. It looks like I can use GeoPandas to do the necessary operations. Unfortunately, there are some issues with the data that may complicate things.
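As a rough sketch of what matching by geographic coincidence could look like, the snippet below uses the GeoPandas spatial join, `geopandas.sjoin`, to attach a parcel to each pipe segment that touches it. The geometries, IDs and column names are invented for the example, and the `predicate` argument assumes a reasonably recent GeoPandas release. Real layers would also need to share a coordinate reference system before joining.

```python
import geopandas as gpd
from shapely.geometry import LineString, box

# Hypothetical parcels: two square lots side by side (made-up coordinates).
parcels = gpd.GeoDataFrame(
    {"parcel_id": [101, 102]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
)

# Hypothetical side sewer segments (made-up coordinates).
pipes = gpd.GeoDataFrame(
    {"pipe_id": ["a", "b", "c"]},
    geometry=[
        LineString([(2, 2), (8, 2)]),      # lies inside parcel 101
        LineString([(12, 5), (18, 5)]),    # lies inside parcel 102
        LineString([(30, 30), (40, 30)]),  # touches neither parcel
    ],
)

# Left spatial join: keep every pipe, and attach the parcel it intersects, if any.
matched = gpd.sjoin(pipes, parcels, how="left", predicate="intersects")
print(matched[["pipe_id", "parcel_id"]])
```

A left join keeps the unmatched pipe with a missing `parcel_id`, which seems useful for spotting records that fail to line up. A pipe that crosses a lot line would match both parcels and show up twice, so the real data may need a rule for picking one.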
More to come…