Need help with data cleaning
Posted: Tue Apr 09, 2013 2:57 pm
Over the past few months, I've been rewriting the data cleaning process for the number crunching while continuing to work on and use the old process, which has kept me very busy. The new workflow is considerably simpler and faster, and based on the speed differences I'm seeing, it will be worth the time spent. I'm now at the point where I need to group the differences between the results of the old and the new systems into common cases and resolve each group.
My goal for this stage of the project is to cut over to the new process as soon as it produces results as good as or better than the existing process. Currently, there are about 29,445 records with differing results (roughly 2.8%) out of a total of 1,049,285 records (with thousands more coming every month), split across eight or so different data sources. I am expecting the number of differences to change after doing one last pass of the old system. Hopefully it drops dramatically. Once all of the differences are accounted for and addressed, I can flip over to the new system, which would save me a ton of time every month on the number crunching.
What I need is help digging through the records with differing results, identifying the shortcomings in the new system, and, in some cases, defining what "right" looks like. Don't worry about the programming side of things, as I'll be doing all of the coding. I'm not looking for a huge time investment on this, just someone with attention to detail who is willing to help me break things down into common cases and concrete actions for dealing with them.
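To make "break things down into common cases" a bit more concrete, here is a minimal sketch of one way the differing records could be bucketed. It is for illustration only and is not the actual cleaning code: the function names (`diff_signature`, `group_differences`), the record keys, and the field names are all made up, and it assumes each system's results can be loaded as a dictionary of records keyed by some record id.

```python
from collections import defaultdict


def diff_signature(old, new):
    """Return a tuple of the field names whose values differ between two records."""
    fields = sorted(set(old) | set(new))
    return tuple(f for f in fields if old.get(f) != new.get(f))


def group_differences(old_results, new_results):
    """Group record keys by which fields differ between the old and new results.

    Both arguments are hypothetical dicts of {record_key: {field: value}}.
    Records that match exactly are left out of the result.
    """
    groups = defaultdict(list)
    for key in sorted(set(old_results) | set(new_results)):
        old = old_results.get(key)
        new = new_results.get(key)
        if old is None:
            groups[("<missing in old>",)].append(key)
        elif new is None:
            groups[("<missing in new>",)].append(key)
        else:
            signature = diff_signature(old, new)
            if signature:
                groups[signature].append(key)
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the output.
    old = {
        "a": {"title": "X", "price": 1},
        "b": {"title": "Y", "price": 2},
        "c": {"title": "Z", "price": 3},
    }
    new = {
        "a": {"title": "X", "price": 1},
        "b": {"title": "y", "price": 2},
        "d": {"title": "W", "price": 4},
    }
    for signature, keys in group_differences(old, new).items():
        print(signature, len(keys), keys)
```

Each bucket would then be one "common case" to look at: a reviewer can sample a few records from it, decide which system is right, and write down the concrete action for that case.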
I'd really like to be on the new system before the April 2013 sales data gets released in about four weeks, but I can't do that until the results of the new system are at least as good as those of the current system.
Is anybody willing to help me on this project?