In seminar class we have to write a script that takes in a gigantic (3.2MB) text file data dump, parse the data out, and insert it into a MySQL database for an application we're working on. The first pass was done by a classmate, and although it got the job done quickly (averaged around 13 seconds), the other requirement was that the script be easy to modify by non-programmers, and this was not even remotely easy to modify (it took me a half an hour to add one line). So I took it upon myself to rewrite the whole thing, and so far the result has been an interesting exercise.
Problem 1: The data is delineated by XML-style tags, but is not in an XML structure.
Problem 2: Some of the records (collections of data that represent one art piece) are invalid, as they just describe different image file names for a single art piece.
Problem 3: Some of the data is unique (such as the style, technique, etc), and the values are often in a list separated by semicolons.
Problem 4: The current version of my rewrite takes well over 400 seconds to run.
Problem 5: I had pretty much a weekend to write this.
The data dump and the way the records are structured is unavoidable. I approached the problem by reading in one record at a time and passing it through a series of functions to pull out the appropriate values, then inserting them into the MySQL database. It does this one query at a time, however, which I suspect is part of the problem.
The first step in improving is exploring the REPLACE function. I'm currently running a query that checks a table if the current values exists, otherwise it needs to be added in. Making these required entries unique should remove the need for these extra queries.
The result? Down to around 330 seconds, not as bad. The primary keys are a little screwed up, as expected, but since it's an auto-incremented number, it isn't a huge deal.
At this point the primary bottleneck is in the bridge tables. Here's how this works: all the bridge tables simply connect an art piece with its corresponding style, technique, etc. So there's a style table, which is only a list of styles, but we need to take the artID (one select query), then select the corresponding styleID, and put them in one table. This wouldn't be so bad except it's 2 queries in a row for each of the tables; that's a lot of individual queries.
**Note: at this point I realized I made an extremely stupid error and kept adding onto the records array rather than clearing it after each record was processed *facepalm!*
Fixing that major leak got the script down to 131 seconds.
After making a huge difference with the array I managed to cut it down even more by fixing the art table creation. This function was using two different queries to build the table, which was unnecessary. It's now running at around 16 seconds!
Right now I'm pretty happy with where the script is at, so I'll save the optimization of the bridge tables for later.
A New Home
12 years ago
No comments:
Post a Comment