Companies are beginning to learn that analytics algorithms are only as good as the data they run against. Here are some ways to improve data quality to get the best insights possible.
In her book, Weapons of Math Destruction, Cathy O'Neil explained how big data algorithms can yield incorrect results if the data they are run against isn't top quality.
O'Neil described a school district that ran an algorithm to identify its 200 lowest-performing teachers, who were then let go. One of the teachers who was released was actually a top performer, but had many students in her classes who had moved from poorly performing schools. As a consequence, the teacher's students didn't perform well on tests, and the teacher was blamed for the results.
O'Neil argued that other forms of input, such as the teacher's stellar reviews from administrators, students and peers, should have been factored into the data run against the algorithm, and perhaps could have prevented this unfortunate firing. It is a reminder to every big data practitioner that an analytics algorithm is only as good as the data it is being run against.
How do you ensure that the quality of your data will optimize the performance of your algorithms, and ultimately, the intelligence that you derive from them?
The key lies in preparing your data carefully and in matching your algorithms to the business use cases you want to apply them to.
Here are six best practices for developing quality data and algorithms:
1. "True up" your algorithms
Last year, we put in a new French door, and the carpenter said he had to "true up" the framing. The framework in the doorway looked straight to me, but then the carpenter showed me how the door wouldn't perfectly fit into the opening because the original doorway hadn't been framed at true right angles. He corrected the framing and inserted the door.
Data algorithms are no different. You have to carefully construct the algorithm to "right fit" your business case. If you are a healthcare provider and you want to identify individuals in your service area who are at high risk for heart problems, you might want to construct an algorithm that asks, "Who above the age of 65 has already had a heart procedure?" instead of just, "Who is over the age of 65?"
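To make the difference between the two questions concrete, here is a minimal sketch in Python. The `Patient` record, its field names and the sample values are assumptions made up for illustration; a real provider's data model would look different.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record layout, for illustration only.
    name: str
    age: int
    had_heart_procedure: bool


def broad_query(patients):
    """Who is over the age of 65?"""
    return [p for p in patients if p.age > 65]


def trued_up_query(patients):
    """Who above the age of 65 has already had a heart procedure?"""
    return [p for p in patients if p.age > 65 and p.had_heart_procedure]


# Made-up sample data.
patients = [
    Patient("A", 72, True),
    Patient("B", 80, False),
    Patient("C", 58, True),
    Patient("D", 66, True),
]

broad = broad_query(patients)
narrow = trued_up_query(patients)
print("over 65:", [p.name for p in broad])
print("over 65 with a prior heart procedure:", [p.name for p in narrow])
```

The second query returns a smaller group, but one that fits the business case (high risk for heart problems) more closely than age alone.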
2. Standardize your data
Thomas Gibson, Tom Gibson and T Gibson are likely to be the same person if they all live at the same address. To avoid duplicate data that could skew your analytics results, Mr. Gibson's records should be consolidated into a single standardized record.
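One simple way to approach this is sketched below in Python. The matching rule (same first initial, same surname, same normalized address) and the field names are assumptions chosen for illustration. A rule this loose would also merge two different people with the same initial at one address, so production matching usually needs more evidence than this.

```python
import re
from collections import defaultdict


def normalize_address(address):
    # Lowercase, strip punctuation and collapse repeated whitespace.
    cleaned = re.sub(r"[^\w\s]", "", address.lower())
    return " ".join(cleaned.split())


def match_key(record):
    # Assumed rule: first initial + surname + normalized address.
    parts = record["name"].split()
    return (parts[0][0].lower(), parts[-1].lower(),
            normalize_address(record["address"]))


def standardize(records):
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)

    merged = []
    for group in groups.values():
        # Keep the longest name form as the canonical one.
        canonical = max(group, key=lambda r: len(r["name"]))
        merged.append({
            "name": canonical["name"],
            "address": canonical["address"],
            "source_count": len(group),  # how many raw records were merged
        })
    return merged


# Made-up sample data.
records = [
    {"name": "Thomas Gibson", "address": "12 Elm St."},
    {"name": "Tom Gibson", "address": "12 elm st"},
    {"name": "T Gibson", "address": "12 Elm  St"},
    {"name": "Ann Lee", "address": "4 Oak Ave"},
]

customers = standardize(records)
for customer in customers:
    print(customer)
```

Keeping `source_count` on the merged record leaves a trace of how many raw entries were folded together, which helps when someone later questions a match.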
3. Fix broken data
In some cases, humans need to get involved to hand-correct broken data before the data is examined by an algorithm. Broken data might consist of a mistyped value (e.g., MN instead of ME for someone who lives in Maine), or it might be a misspelling of someone's surname that creates an extra record that shouldn't be in a dataset. The better your data accuracy, the more accurate your analytics results will be.
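Software can at least point a human reviewer at the records that look wrong. The Python sketch below cross-checks the state code against the ZIP code and flags mismatches for hand correction rather than fixing them automatically. The tiny `ZIP_TO_STATE` table and the record fields are assumptions for illustration; a real check would use a complete postal reference.

```python
# Illustrative sample only; a real lookup would cover every ZIP code.
ZIP_TO_STATE = {
    "04401": "ME",
    "55401": "MN",
}


def flag_for_review(records):
    """Return records whose state code disagrees with the ZIP lookup."""
    flagged = []
    for record in records:
        expected = ZIP_TO_STATE.get(record["zip"])
        # Unknown ZIP codes are left alone; only clear conflicts are flagged.
        if expected is not None and expected != record["state"]:
            flagged.append({**record, "suggested_state": expected})
    return flagged


# Made-up sample data. ZIP codes are strings so leading zeros survive.
records = [
    {"name": "J. Smith", "state": "MN", "zip": "04401"},
    {"name": "R. Jones", "state": "ME", "zip": "04401"},
    {"name": "K. Olson", "state": "MN", "zip": "55401"},
]

needs_review = flag_for_review(records)
for record in needs_review:
    print(record)
```

Note that MN is itself a valid state code, so checking the value against a list of valid codes would not catch this error. It only shows up when the value is compared with another field in the same record.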
4. Eliminate extraneous data
If your goal is selling a catcher's mitt to professional ballplayers between the ages of 18 and 35, you might not be interested in what a player's favorite soft drink is, or in a player who is in a weekend amateur softball league. The more you can narrow down your data to the boundaries of the specific use case you are examining, the faster your algorithm will be able to process the data, and the likelier it is to deliver the insights you are seeking.
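In practice, narrowing the data usually means dropping both rows and columns. Here is a minimal Python sketch of the catcher's mitt example; the field names, the `league` values and the sample players are all assumptions for illustration.

```python
# Fields assumed to matter for this use case; everything else is dropped.
RELEVANT_FIELDS = ("name", "age", "league")


def narrow_to_use_case(players):
    """Keep professional players aged 18 to 35, with relevant fields only."""
    return [
        {field: player[field] for field in RELEVANT_FIELDS}
        for player in players
        if player["league"] == "professional" and 18 <= player["age"] <= 35
    ]


# Made-up sample data.
players = [
    {"name": "P1", "age": 24, "league": "professional", "favorite_drink": "cola"},
    {"name": "P2", "age": 29, "league": "amateur softball", "favorite_drink": "tea"},
    {"name": "P3", "age": 41, "league": "professional", "favorite_drink": "water"},
    {"name": "P4", "age": 18, "league": "professional", "favorite_drink": "juice"},
]

prospects = narrow_to_use_case(players)
for prospect in prospects:
    print(prospect)
```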
5. Get consensus from users
Never make unilateral decisions about the data you're going to exclude without first checking with users—because they might know something you don't. You might think it only matters to include data for parents of children under the age of five when it comes to selling a certain toy—but what if single aunts with nieces and nephews also are buyers?
6. Check results
The tendency with big data algorithms and queries is to revise and rerun them as needed, but not necessarily to record results. Instead, a baseline for results should always be set and measured against. For example, if your first data algorithm yields only a 3% response rate from potential purchasers of a product, with 1% ultimately buying, you want to know whether a revised query outperforms that.
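Recording a baseline can be as simple as storing the counts from each run and comparing rates. The Python sketch below uses the 3% response and 1% purchase figures from the example as the baseline; the revised run's numbers are hypothetical, and both rates are assumed to be measured against the number of people contacted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    label: str
    contacted: int
    responded: int
    purchased: int

    @property
    def response_rate(self):
        return self.responded / self.contacted

    @property
    def purchase_rate(self):
        # Assumption: purchases are measured against everyone contacted.
        return self.purchased / self.contacted


def outperforms(candidate, baseline):
    """True only if the candidate beats the baseline on both rates."""
    return (candidate.response_rate > baseline.response_rate
            and candidate.purchase_rate > baseline.purchase_rate)


# Baseline from the example: 3% respond, 1% buy.
baseline = RunResult("first query", contacted=10_000, responded=300, purchased=100)
# Hypothetical revised run, for illustration only.
revised = RunResult("revised query", contacted=10_000, responded=450, purchased=120)

for run in (baseline, revised):
    print(f"{run.label}: response {run.response_rate:.1%}, "
          f"purchase {run.purchase_rate:.1%}")
print("revised outperforms baseline:", outperforms(revised, baseline))
```

Requiring an improvement on both rates is a deliberately strict choice. Depending on the campaign, you might decide that the purchase rate alone is what counts.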