Removing big data bias
Posted on: May 5, 2013
In a recent post, Haowen Chan and Robin Morris warn “the last thing you want to do is implement a [big data] system that develops and propagates data, only to learn it’s hopelessly biased.” All research and analysis has bias built in by the very nature of human involvement. However, Chan and Morris provide four useful bias-quelling tactics that can be used to improve the big data science process:
- Employ domain experts. Rely on them to help select relevant data and explore which features, inputs and outputs produce the best results. If heuristics are used to gain insights into smaller data sets, the data scientist should work with the domain expert to test the heuristics and ensure they actually produce better results. Like a pitcher and catcher in a baseball game, they are on the same team, with the same goal, but each brings a different skill set to a complementary role.
- Look for white spaces. Data scientists who work with one data set for long periods of time risk complacency, which makes it easier to introduce bias that reinforces preconceived notions. Don’t settle for what you have; instead, look for the “white spaces” in your data sets and search for alternate sources to supplement “sparse data.”
- Open a feedback loop. This will help data scientists react to changing business requirements with modified models that can be accurately applied to the new business conditions. Applying Lean Startup-style continuous delivery methodologies to your big data approach will help you keep your model fresh.
- Encourage your data scientists to explore. If you can afford your own team of data scientists, be sure they have the space and autonomy to explore freely. Some equate big data to the solar system, so get out there and explore this uncharted universe!
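As a rough illustration of the “white spaces” tactic above, here is a minimal Python sketch that flags categories which are missing or thinly represented in a data set. It is not taken from Chan and Morris; the function name `find_white_spaces`, the `min_share` threshold of 5% and the sample `region` records are all hypothetical, chosen only to make the idea concrete.

```python
from collections import Counter


def find_white_spaces(records, key, expected_values, min_share=0.05):
    """Return the expected values of `key` whose share of records is below min_share.

    `expected_values` is the list of categories a domain expert says should
    appear; `min_share` is an arbitrary illustrative threshold.
    """
    counts = Counter(record.get(key) for record in records)
    total = len(records)
    sparse = {}
    for value in expected_values:
        # Share of all records that carry this value (0.0 if there are no records).
        share = counts.get(value, 0) / total if total else 0.0
        if share < min_share:
            sparse[value] = share
    return sparse


# Hypothetical sample: 100 records, heavily skewed towards two regions.
records = (
    [{"region": "north"}] * 60
    + [{"region": "south"}] * 38
    + [{"region": "east"}] * 2
)

gaps = find_white_spaces(records, "region", ["north", "south", "east", "west"])
print(gaps)  # {'east': 0.02, 'west': 0.0}
```

Here “east” is sparse and “west” is absent altogether, so both would be candidates for supplementing with alternate sources. The list of expected categories is exactly the kind of input a domain expert is well placed to supply.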
We can also consider what bias we are encouraging when we develop systems, from social media plugins to smart objects, which collect ‘big data,’ or data which could be aggregated into big data analysis. Might we be painting an unfair picture of our data subjects, either by what we represent or by what we omit? Collection, processing and analysis all need to be considered in the quest for useful and accurate big data outcomes.