5 solvable tech problems in science
By Andy Chase, Aug 27, 2016.
Story of these interviews
For my senior project at Oregon State University, I was assigned to work with NASA JPL to build a science tool for researchers. The problems below are the result of interviews I conducted as part of customer validation.
Problems
(All interviews referenced are dated 2015-11-19 through 2015-11-24)
(P.1) Software packaging is inadequate
Documentation isn’t always there
- (Hutchings Q.40) [Talking about derived data] “can’t figure out what’s going on without documentation on how the product was gridded”
- (O’Neill Q.16) “I take somebody’s stuff and sometimes it takes a little bit of time– to see how it’s supposed to be used?”
Outside code doesn’t work / Isn’t fully tested
- (O’Neill Q.15) “Yeah, occasionally, sometimes I get code from other people that [laughs] doesn’t work? It’s because it works on their stuff and not mine.”
- (Hutchings Q.27) [About software bugs:] “So you know how endemic they are then– there are mistakes everywhere!” (see the sketch below)
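
Both complaints point at the same gap: shared code that ships without a usage note or a test that runs anywhere but the author’s machine. As a purely hypothetical sketch (none of the interviewees’ code is shown here), a shared routine with a short docstring and one portable test might look like this in Python:

```python
# Hypothetical example of shared code that carries its own documentation
# and a test; the function and the numbers are invented for illustration.
import numpy as np

def grid_mean(field, weights=None):
    """Weighted mean of a 2-D gridded field.

    field   : array-like, shape (nlat, nlon)
    weights : optional array of the same shape; defaults to equal weights.
    """
    field = np.asarray(field, dtype=float)
    weights = np.ones_like(field) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(field * weights) / np.sum(weights))

def test_grid_mean_unweighted():
    # A test that runs on anyone's machine says more than "it works on my data".
    assert grid_mean([[1.0, 2.0], [3.0, 4.0]]) == 2.5
```

Even this much lets a stranger see how the routine is supposed to be used and check that it still works on their own inputs.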
Code not available in all languages
- (Hutchings Q.25) “It all comes down to where you find your code, so I’ve used R– because there was code available.”
Outside code isn’t trusted
- (O’Neill Q.18) “Like, I mean I have had people give me code that I didn’t trust them so I didn’t use it”
Researchers are open to using software packages
- (Scientists via Kuuipo Q.13) [Are scientists open to using packages?] “Yes, yeah. Especially Open Source tools and libraries,” – “[For example] students will start off by learning R then they will quickly start using all the libraries that can manipulate the statistics geographically.”
- (O’Neill Q.13) “Sometimes [I look for] utilities like [ellipse routines]. And yeah, I download it and try it once and if it works like it’s supposed to, then ‘that’s cool.’”
- (Hutchings Q.26) “It helps to have access to people’s code when they have solved problems” (Q.23) “I think we’re now in a world where free sharing information and algorithms is a good thing to do.”
(P.2) Work is often re-done / Wasted work
Researchers write software that isn’t saved or reused
- (Chelton, not noted) – Feel free to re-implement the algorithms I listed in my paper for finding eddies
- (O’Neill, Q.14) Shares only some code and only with certain people – “Yeah. I share it fairly freely. I share my stuff– at least the stuff I know– I’m pretty sure it’s not buggy [laughing]”
Derived Products aren’t trusted
- (Shell) A lot of work goes into derived products but many researchers don’t use them
(P.3) Version control is inadequate
- (O’Neill Q.28) “I think I ended up having to ask the computer guy to get the backup because they do backups every night– yeah version control would be very good.”
(P.4) Knowledge is not shared / Researchers have to learn about things outside their domain
- (O’Neill Q.25) Had to learn how things were encoded for visualization – “you end up getting into the details of like how these things get encoded and as a researcher it’s not–”
- (Hutchings Q.47) “Well we don’t even realize that it’s that easy to get the data haha, that’s funny.”
- (Kennedy, Q.3) “that’s a lot to manage and it requires a certain level of expertise and interest in doing the computer management and all that stuff which not everyone has”
- (Scientists via Kuuipo Q.16) “researchers are still creating their own data” – “and don’t even know that other researchers exist or that other data exists”
- (O’Neill Q.17) Researchers don’t always use existing formats – “but sometimes people have like their own binary format or something”
(P.5) Data can be hard to work with
Not indexed in the right way (temporally)
- (O’Neill, Q.4) Going through to find time series is a pain – “so you have to look through [millions of files to find one point in each one] and it’s kind of a pain” (see the sketch below)
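
The interviews don’t name a toolchain, but one common workaround is to treat a directory of granules as a single logical dataset instead of looping over the files by hand. The sketch below assumes Python with xarray and NetCDF granules holding an sst variable; the file pattern, variable name, and coordinates are illustrative guesses, not details from the interviews.

```python
# Hedged sketch: pull a single-point time series out of many NetCDF granules.
# The glob pattern, variable name ("sst"), and coordinates are assumptions.
import xarray as xr

# Lazily combine every granule along its time coordinate; data is only
# loaded when values are actually requested.
ds = xr.open_mfdataset("granules/*.nc", combine="by_coords")

# Take the series at the grid point nearest to 45°N, 125°W.
series = ds["sst"].sel(lat=45.0, lon=-125.0, method="nearest")

print(series.to_series().head())
```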
Data not being in the right format / poorly documented formats
- (O’Neill Q.17) “but sometimes people have like their own binary format or something– or it’s just put into an unformatted binary file”
- (Hutchings Q.16) “if the data is not provided with a way of reading it– no one else can use it so– as you said it’s useless.” (see the sketch below)
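
To make the contrast concrete, here is a hypothetical sketch that reads a gridded field from two kinds of files: a raw binary dump, where the dtype, byte order, and grid shape have to be known out of band, and a self-describing NetCDF file, where that information travels with the data. The file names, the sst variable, and the 720×1440 grid are assumptions for illustration only.

```python
# Hedged sketch: undocumented binary dump vs. self-describing NetCDF.
# Every concrete detail (file names, dtype, shape, variable name) is assumed.
import numpy as np
from netCDF4 import Dataset

# Raw "unformatted" binary: nothing in the file says it is big-endian
# float32 on a 720 x 1440 grid, so the reader has to be told separately.
field_raw = np.fromfile("field.dat", dtype=">f4").reshape(720, 1440)

# NetCDF: variable names, shapes, units, and fill values live in the file
# itself, so someone else's data can be read without guesswork.
with Dataset("field.nc") as nc:
    sst = nc.variables["sst"][:]
    print(nc.variables["sst"].units)
```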
Researchers have to deal with a lot of data
- (Kennedy) Landsat: 100 TB
- (Jamon) Landsat: 100 GB currently, 100 TB ideally
- (O’Neill) Various datasets: 70 TB
- (Shell) Various datasets: 10 TB
Data goes away (or at least used to)
- (O’Neill, Q.30) “now you don’t have to worry about: is it going to be there in two years? or something..”
- (Shell) You don’t want to lose access right before the deadline
Transcript sources
Here are the edited transcripts from the interviews: