Pastie now auto-senses if line-wrap is a bad or good idea. Feedback?
## mark a section (Learn more)
import java.util.regex._ import scalax.data.Implicits._ import scalax.io.Implicits._ import scalax.control.ManagedSequence /** * I've made this immutable, representing some fixed accumulation of times for some * fixed number of instances. Having this class immutable will make our lives a lot * easier if we ever decide to make this program concurrent. Of course, this is * probably IO-bound, so concurrency probably won't help unless we're reading from * multiple log files at the same time. * I've also added a + method to the class, so I can add two Times together and get * a new Time that represents the accumulation of both of the previous time. Yes, * this means I'll be creating and gargabe collecting lots of objects, but the JVM * GC is actually really good at disposing of short-lived objects, so the performance * penalty is near-zero. */ case class Time(instances: Int, totalTime: Int, viewTime: Int, dbTime: Int) { def this() = this(0, 0, 0, 0) def +(other: Time) = Time(instances + other.instances, totalTime + other.totalTime, viewTime + other.viewTime, dbTime + other.dbTime) def avg(time: Int) = time.toFloat/instances override def toString = "Total Time: " + totalTime + "ms (View: " + viewTime +"ms DB:" + dbTime + "ms )" def avgToString = "Average Time: " + avg(totalTime) + "ms (View: " + avg(viewTime) +"ms DB:" + avg(dbTime) + "ms )" } object LogParser { def main(args: Array[String]) { val results = parseLogFile(args(0)) println("Total URIs: " + results.size) for ((uri, time) <- results) { println(uri + " => " + time.instances) println("\t" + time) println("\t" + time.avgToString) } } def parseLogFile(filename: String): Map[String, Time] = { // Pattern is this: "Completed in 100ms (View: 25, DB: 75) | 200 OK [http://app.domain.com?params=here]" val p = Pattern.compile("Completed in (\\d+)ms \\(View: (\\d+), DB: (\\d+)\\) \\| (\\d+) OK \\[http://app.domain.com(.*)\\?") /** * The next two lines have some deep Scalax magic. Well, not really, but it might * look like deep magic if you've never used Scalax. The rest of this expression * is fairly straight-forward. * * First, "toFile" is a method that Scalax adds to any String to turn it into * a java.io.File. Finally, "lines" is a method that Scalax adds to any * java.io.File to turn it into a ManagedSequence[String]. ManagedSequence does * some Automatic Resource Management for us. It will take care of opening and * closing the file properly (even if an exception is thrown while processing it!) * and making sure that only one line is processed at a time (that is, the entire * file is processed in O(1) space). * * In order to make this work, ManagedSequence is lazy. That is, defining my "times" * variable doesn't actually read anything from, or even open, the file. All of the * IO happens only once I do something with the stuff inside the ManagedSequence. * In this case, that happens when I call foldLeft on my ManagedSequence later on. */ val times: ManagedSequence[(String, Time)] = for { line <- filename.toFile.lines val m = p.matcher(line) // This line filters only those lines which match our pattern. // Non-matching lines get skipped. if m.find val uri = m.group(5) val time = Time(1, m.group(1).toInt, m.group(2).toInt, m.group(3).toInt) } yield (uri, time) /** * This is just a standard empty Map. I've made it immutable instead of mutable, * in part because I prefer immutable data structures, but also because we might * want to add concurrency later and if we do immutability will turn out to be a * big bonus. I'm adding a default value to my map. That is, if I try to access * a key that is not in the map, I'll get an empty (zeroed out) Time instead * of an error. */ val emptyMap: Map[String, Time] = Map.empty.withDefaultValue(new Time) /** * If you're not familiar with foldLeft, this can look a little scary at first. * Don't worry, folds are easy. Here is your most basic fold, which adds up all * the numbers in a List: * * List(1, 2, 3).foldLeft(0)(_ + _) === (((0 + 1) + 2) + 3) === 6 * * The second argument to foldLeft is a function of two arguments. If we wanted * to name the arguments, we could expand the above code to: * * List(1, 2, 3).foldLeft(0)((sum, n) => sum + n) * * As you can see, the second argument to foldLeft takes a running sum, which starts * off at zero, and the next number (n = 1, 2, or 3) and returns the new running sum * * We're going to take the same approach, but instead of keeping a running sum we're * going to keep a running map of Uri -> Time. We start off with the emptyMap, * then we provide a function which takes a running map and a pair of Uri -> Time * (from our ManagedSequence) and returns a new running map. * * So how do we return the new running map? Our running map is immutable, so how come * it looks like we're changing it? Well, let's look a closer look at the line that * returns our new running map. * * map(uri) += time * * First, map(uri) would typically return a Time. However, Time has no += method defined * on it. Instead, Scala transforms += into a sequence of = and +, like so: * * map(uri) = map(uri) + time * * Next, Scala turns expressions of the form a(b) = x into method calls of the form * a.update(b, x). It's another piece of syntactic convenience, just like a(b) is turned * into a.apply(b). * * map.update(uri, map.apply(uri) + time) * * Now this starts to make sense. Map's apply method returns the Time for that uri, if no Time * is found we get our default value, the empty Time. Then Time + Time returns the sum * of both the older Times. Finally, map's update method returns a new map with a modified * value for the uri key. * * (To be fair, that last line was a little too sugary even for me. I probably would have * written it as map(uri) = map(uri) + time. You still have to know how that translates * into apply and update, but that's a little more common and less prone to bugs like += is.) * * Since we're keeping a running map, every (uri, time) pair in our ManagedSequence will * get added to the running map we're keeping. The eventual return result will be an * immutable Map[String, Time]. Neat! * * Note that this is where the laziness of ManagedSequence breaks down and it's forced to * do all the work it's been putting off. As soon as our foldLeft starts, the file gets * opened and each line gets regex matched and turned into a pair of (uri, time) and then * incorporated into a running map of Uris -> Times. All of this is done one line at * a time, to make sure we don't blow up our memory if the file is really big. Once the * foldLeft is over, ManagedSequence takes care to close the file properly. * * Because everything is immutable, we'll be creating a lot of intermediate data structures * like small Times and partial Maps. However, they're so short lived (all thanks to the * laziness of ManagedSequence!) that they'll hopefully never leave the GC's nursery and * get collected fairly efficiently. If performance is a concern I'd profile the program, * but I would be very surprised if didn't run almost as fast as the optimal imperative version. */ times.foldLeft(emptyMap) { case (map, (uri, time)) => map(uri) += time } } }
This paste will be private.
From the Design Piracy series on my blog: