Jan 13, 2012
Understanding your data is important

Hello everyone! It is January again. There seem to be a lot of IAP activities going on, from dancing to cooking to physics to sports. One interesting event which my friend and I have been going to is about data processing and visualization using Python. The class is being organized by Adam and Eugene from CSAIL (Computer Science and AI Laboratory), both of whom are grad researchers in databases and systems. Needless to say, I am benefiting greatly from the class!

So far, we have learned how to understand data by visualizing it with various graphing tools (never underestimate histograms, scatter plots etc.), how to set forth hypotheses, and how to reject or fail to reject those hypotheses using suitable tests (classical statistics). We also learned how to clean data, especially textual data, so as to improve our own subsequent analyses. The importance of amassing such useful tools for applied mathematicians and model engineers cannot be overstated. A recent poll on the kdnuggets website shows that people think ‘big data’ and data analytics are two important developments in related industries.

But while using such tools in practice, I feel one must treat large-scale data delicately. The reason is this: a lot of domain wisdom may be involved in related activities like preprocessing the data, making hypotheses, planning subsequent data collection and so on, and that wisdom can make or break the entire analysis. This applies not only to easily available web data but also to engineering and natural science datasets, which have grown explosively thanks to better systems and faster, higher-quality measurements (e.g., images sent by satellites, surveillance footage, data from physical experiments like those at CERN etc.). Quite a few small companies have come up around the web data industry that exclusively maintain and sell datasets.
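For readers who want something concrete, here is a minimal sketch of the clean, look, then test workflow mentioned earlier, using only the Python standard library. It is not the class material: the function names (`clean_text`, `histogram`, `permutation_test`) and all the numbers are made up for illustration, and a permutation test is used as one simple stand-in for the classical tests.

```python
import random
import re
from collections import Counter
from statistics import mean


def clean_text(raw):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = raw.lower()
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


def histogram(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


def permutation_test(a, b, n_rounds=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabelings whose absolute mean
    difference is at least the observed one (an approximate p-value).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(mean(left) - mean(right)) >= observed:
            hits += 1
    return hits / n_rounds


# Made-up data: word counts for two hypothetical groups of documents.
group_a = [12, 15, 14, 10, 13, 16, 11, 14]
group_b = [22, 25, 19, 24, 21, 23, 26, 20]

print(clean_text("  Hello,   World!! It's DATA-cleaning time.  "))
print(sorted(histogram(group_a + group_b, 5).items()))
print(permutation_test(group_a, group_b))
```

Even in a toy like this, the domain decisions are the hard part: which characters count as noise, how wide the bins should be, and which hypothesis is worth testing are all choices the code cannot make for you.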
I wonder which areas (that until now haven’t seen a data deluge) could benefit immensely from aggressive large-scale data analysis?