/**
# (Chapter 0) Intro: Abstraction Without Regret
Outline:
<div id="tableofcontents"></div>
LMS is a dynamic multi-stage programming approach: We have the full Scala
language at our disposal to compose fragments of object code. In fact, DSL
programs are program _generators_ that produce an object program IR when run.
DSL or library authors and application programmers can exploit this
multi-level nature to perform computations explicitly at staging time, so that the
generated program does not pay a runtime cost. Multi-stage programming shares
some similarities with partial evaluation [(*)](jones93partial), but instead
of an automatic binding-time analysis, the programmer makes binding times
explicit in the program. We have seen how LMS uses `Rep` types for this
purpose:
val s: Int = ... // a static value: computed at staging time
val d: Rep[Int] = ... // a dynamic value: computed when generated program is run
Unlike with automatic partial evaluation, the programmer obtains a guarantee
about which expressions will be evaluated at staging time.
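For example, a staged power function whose exponent is a static `Int` unfolds
completely at staging time; only the multiplications on the dynamic base remain
in the generated code (a minimal sketch, assuming the usual LMS constant
lifting `unit` and numeric operations on `Rep[Double]` are in scope):
def power(b: Rep[Double], n: Int): Rep[Double] =
  // n is static: the recursion runs at staging time
  // b is dynamic: only the multiplications appear in the generated code
  if (n == 0) unit(1.0)
  else b * power(b, n - 1)
// power(x, 3) generates code equivalent to x * (x * (x * 1.0))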
While moving _computations_ from run time to staging time is an interesting
possibility, many computations actually depend on dynamic input and cannot be
done before the input is available. Nonetheless, explicit staging can be used
to _combine_ dynamic computations more efficiently. Modern programming
languages provide indispensable constructs for abstracting and combining
program functionality. Without higher-order features such as first-class
functions or object and module systems, software development at scale would
not be possible. However, these abstraction mechanisms have a cost and make
it much harder for the compiler to generate efficient code.
Using explicit staging, we can use abstraction in the generator stage to
remove abstraction in the generated program. This holds both for control
(e.g. functions, continuations) and data abstractions (e.g. objects,
boxing). Some of the material in this chapter is taken from
[(*)](DBLP:journals/corr/abs-1109-0778).
# Common Compiler Optimizations
We have seen in [Part 3](03_compiler.html) how many classic compiler
optimizations can be applied to the IR generated from embedded programs in a
straightforward way. Among those generic optimizations are common
subexpression elimination, dead code elimination, constant folding and code
motion. Due to the structure of the IR, these optimizations all operate in an
essentially global way, at the level of domain operations. An important
difference from regular general-purpose compilers is that IR nodes carry
information about the effects they incur (see [here](#sec:321)). This permits
quite precise dependency tracking, which provides the code generator with a
lot of freedom to group and rearrange operations. Consequently, optimizations
like common subexpression elimination and dead code elimination will easily
remove complex DSL operations that contain internal control-flow and may span
many lines of source code.
Common subexpression elimination (CSE) / global value numbering (GVN) for pure
nodes is handled by `toAtom`: whenever the `Def` in question has been
encountered before, its existing symbol is returned instead of a new one (see
[here](#sec:320cse)). Since the operation is pure, we do not need to check via
data flow analysis whether its result is available on the current path.
Instead we just insert a dependency and let the later code motion pass (see
[here](#sec:320codemotion)) schedule the operation in a correct order. Thus,
we achieve a similar effect as partial redundancy elimination
(PRE [(*)](DBLP:journals/toplas/KennedyCLLTC99)), but in a simpler way.
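The following self-contained sketch illustrates the hash-consing idea behind
`toAtom`; the names and signatures are simplified and do not match the actual
LMS implementation, which also tracks effects and full definition information:
trait Expressions {
  abstract class Exp[+T]
  case class Sym[T](id: Int) extends Exp[T]
  abstract class Def[+T]

  private var nextId = 0
  private val defCache = scala.collection.mutable.HashMap[Def[Any], Sym[Any]]()

  // pure Defs are hash-consed: an identical Def reuses its existing symbol (CSE/GVN)
  def toAtom[T](d: Def[T]): Exp[T] = {
    val sym = defCache.getOrElseUpdate(d, { nextId += 1; Sym[Any](nextId) })
    sym.asInstanceOf[Sym[T]]
  }
}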
Based on frequency information for block expressions, code motion will hoist
computation out of loops and push computation into conditional branches. Dead
code elimination is trivially included. Both optimizations are coarse
grained and work on the level of domain operations. For example, whole data
parallel loops will happily be hoisted out of other loops.
Consider the following user-written code:
v1 map { x =>
  val s = sum(v2.length) { i => v2(i) }
  x/s
}
This snippet scales elements in a vector `v1` relative to the sum of `v2`'s
elements. Without any extra work, the generic code motion transform places the
calculation of `s` (which is itself a loop) outside the loop over `v1` because
it does not depend on the loop variable `x`.
val s = sum(v2.length) { i => v2(i) }
v1 map { x =>
  x/s
}
# Delite: An End-to-End System for Embedded Parallel DSLs
This section gives an overview of our approach to developing and executing
embedded DSLs in parallel and on heterogeneous devices. A more thorough
description of Delite can be found [here](http://ppl.stanford.edu/delite).
Delite seeks to alleviate the burden of building a high performance DSL by
providing reusable infrastructure. Delite DSLs are embedded in Scala using
LMS. On top of this layer, Delite is structured into a _framework_ and a
_runtime_ component. The framework provides primitives for parallel
operations such as `map` or `reduce` that DSL authors can use to define
higher-level operations. Once a DSL author uses Delite operations, Delite
handles generating code for multiple platforms (e.g. Scala and CUDA) and
handles difficult but common issues such as device communication and
synchronization. These capabilities are enabled by exploiting the
domain-specific knowledge and restricted semantics of the DSL compiler.
## Building a Simple DSL
On the surface, DSLs implemented on top of Delite appear very similar to
purely-embedded (i.e. library only) Scala-based DSLs. However, a key aspect
of LMS and hence Delite is that DSLs are split into two parts, _interface_ and
_implementation_. Both parts can be assembled from components in the form of
Scala traits. DSL programs are written in terms of the DSL interface, agnostic
of the implementation. Part of each DSL interface is an abstract type
constructor `Rep[_]` that is used to wrap types in DSL programs. For example,
DSL programs use `Rep[Int]` wherever a regular program would use `Int`. The
DSL operations defined in the DSL interface (most of them are abstract
methods) are all expressed in terms of `Rep` types.
The DSL _implementation_ provides a concrete instantiation of `Rep` as
expression trees (or graphs). The DSL operations left abstract in the
interface are implemented to create an expression representation of the
operation. Thus, as a result of executing the DSL program, we obtain an
analyzable representation of the DSL program itself, which we will refer to
as the IR (intermediate representation).
To substantiate the description, let us consider an example step by step. A
simple (and rather pointless) program that calculates the average of 100
random numbers, written in a prototypical DSL `MyDSL` that includes numeric
vectors and basic console IO, could look like this:
object HelloWorldRunner extends MyDSLApplicationRunner with HelloWorld
trait HelloWorld extends MyDSLApplication {
  def main() = {
    val v = Vector.rand(100)
    println("today's lucky number is: ")
    println(v.avg)
  }
}
Programs in our sample DSL live within traits that inherit from
`MyDSLApplication`, with method `main` as the entry point.
`MyDSLApplication` is a trait provided by the DSL that defines the DSL
interface. In addition to the actual DSL program, there is a singleton object
that inherits from `MyDSLApplicationRunner` and mixes in the trait that
contains the program. As the name implies, this object will be responsible for
directing the staged execution of the DSL application.
Here is the definition of `MyDSL`'s components encountered so far:
trait MyDSLApplication extends DeliteApplication with MyDSL
trait MyDSLApplicationRunner extends DeliteApplicationRunner with MyDSLExp
trait MyDSL extends ScalaOpsPkg with VectorOps
trait MyDSLExp extends ScalaOpsPkgExp with VectorOpsExp with MyDSL
`MyDSLApplicationRunner` inherits the mechanics for invoking code generation
from `DeliteApplication`. We discuss how Delite provides these facilities
[here](#subsec:delite). We observe a structural split in the inheritance
hierarchy that is rather fundamental: `MyDSL` defines the DSL _interface_,
`MyDSLExp` the _implementation_. A DSL program is written with respect to the
interface but it knows nothing about the implementation. The main reason for
this separation is safety. If a DSL program could observe its own structure,
optimizing rewrites that maintain semantic but not structural equality of DSL
expressions could no longer be applied safely. (In fact, this is the
main reason why MSP languages do not allow inspection of staged code at all
[(*)](DBLP:conf/pepm/Taha00).) Our sample DSL includes a set of common Scala
operations that are provided by the core LMS library as trait `ScalaOpsPkg`.
These operations include conditionals, loops, variables and also `println`. On
top of this set of generic things that are inherited from Scala, the DSL
contains vectors and associated operations. The corresponding interface is
defined as follows:
trait VectorOps extends Base {
  abstract class Vector[T] // placeholder ("phantom") type
  object Vector {
    def rand(n: Rep[Int]) = vector_rand(n) // invoked as: Vector.rand(n)
  }
  def vector_rand(n: Rep[Int]): Rep[Vector[Double]]
  def infix_length[T](v: Rep[Vector[T]]): Rep[Int] // invoked as: v.length
  def infix_sum[T:Numeric](v: Rep[Vector[T]]): Rep[T] // invoked as: v.sum
  def infix_avg[T:Numeric](v: Rep[Vector[T]]): Rep[T] // invoked as: v.avg
  ...
}
There is an abstract class `Vector[T]` for vectors with element type `T`. The
notation `T:Numeric` means that `T` may only range over numeric types such as
`Int` or `Double`. Operations on vectors are not declared as instance methods
of `Vector[T]` but as external functions over values of type `Rep[Vector[T]]`.
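Invoking these `infix_` functions with method syntax such as `v.length` relies
on additional compiler support (historically the Scala-Virtualized compiler
used with LMS). In plain Scala, roughly the same surface syntax could be
recovered with an implicit wrapper class; the following is only an
illustrative sketch and not part of the DSL definition above:
// hypothetical wrapper providing v.length / v.sum / v.avg method syntax
implicit class VectorSyntax[T](v: Rep[Vector[T]]) {
  def length: Rep[Int] = infix_length(v)
  def sum(implicit num: Numeric[T]): Rep[T] = infix_sum(v)
  def avg(implicit num: Numeric[T]): Rep[T] = infix_avg(v)
}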
Returning to our sample DSL, this is the definition of `VectorOpsExp`, the
implementation counterpart to the interface defined above in `VectorOps`:
trait VectorOpsExp extends DeliteOpsExp with VectorOps {
  case class VectorRand[T](n: Exp[Int]) extends Def[Vector[Double]]
  case class VectorLength[T](v: Exp[Vector[T]]) extends Def[Int]
  case class VectorSum[T:Numeric](v: Exp[Vector[T]]) extends DeliteOpLoop[Exp[T]] {
    val range = v.length
    val body = DeliteReduceElem[T](v)(_ + _) // scalar addition (impl not shown)
  }
  def vector_rand(n: Rep[Int]) = VectorRand(n)
  def infix_length[T](v: Rep[Vector[T]]) = VectorLength(v)
  def infix_sum[T:Numeric](v: Rep[Vector[T]]) = VectorSum(v)
  def infix_avg[T:Numeric](v: Rep[Vector[T]]) = v.sum / v.length
  ...
}
The constructor `rand` and the function `length` are implemented as new plain
IR nodes (extending `Def`). Operation `avg` is implemented directly in terms
of `sum` and `length` whereas `sum` is implemented as a `DeliteOpLoop` with a
`DeliteReduceElem` body. These special classes of structured IR nodes are
provided by the Delite framework and are inherited via `DeliteOpsExp`.
## Code Generation
The LMS framework provides a code generation infrastructure that includes a
program scheduler and a set of base code generators. The program scheduler
uses the data and control dependencies encoded by IR nodes to determine the
sequence of nodes that should be generated to produce the result of a block.
After the scheduler has determined a schedule, it invokes the code generator
on each node in turn. There is one _code generator_ object per target platform
(e.g. Scala, CUDA, C++) that mixes together traits that describe how to
generate platform-specific code for each IR node. This organization makes it
easy for DSL authors to modularly extend the base code generators; they only
have to define additional traits to be mixed in with the base generator.
Therefore, DSL designers only have to add code generators for their own
domain-specific types. They inherit the common functionality of scheduling and
callbacks to the generation methods, and can also build on top of code
generator traits that have already been defined. In many cases, though, DSL
authors do not have to write code generators at all; the next section
describes how Delite takes over this responsibility for most operations.
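As an illustration, a Scala code generator for the vector operations defined
earlier might look roughly as follows. This is a sketch modeled on the
generator traits shown later in this chapter; the exact base traits and the
array-based runtime representation are assumptions, not part of the DSL above:
trait ScalaGenVectorOps extends ScalaGenBase {
  val IR: VectorOpsExp
  import IR._
  override def emitNode(sym: Sym[Any], rhs: Def[Any])(
      implicit stream: PrintWriter) = rhs match {
    case VectorRand(n) =>
      // assumes vectors are represented as plain arrays in the generated code
      emitValDef(sym, "Array.fill(" + quote(n) + ")(scala.util.Random.nextDouble())")
    case VectorLength(v) =>
      emitValDef(sym, quote(v) + ".length")
    case _ => super.emitNode(sym, rhs)
  }
}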
## The Delite Compiler Framework and Runtime
On top of the LMS framework that provides the basic means to construct IR
nodes for DSL operations, the Delite Compiler Framework provides high-level
representations of execution patterns through `DeliteOp` IR, which includes a
set of common parallel execution patterns (e.g. map, zipWith, reduce).
`DeliteOp` extends `Def`, and DSL operations may extend one of the `DeliteOps`
that best describes the operation. For example, since `VectorSum` has the
semantics of iterating over the elements of the input Vector and adding them
to reduce to a single value, it can be implemented by extending `DeliteOpLoop`
with a reduction operation as its body. This significantly reduces the amount
of work in implementing a DSL operation since the DSL developers only need to
specify the necessary fields of the `DeliteOp` (`range` and `body` in the case
of `DeliteOpLoop`) instead of fully implementing the operation.
`DeliteOpLoop`s are intended as parallel for-loops. Given an integer index
range, the runtime guarantees to execute the loop body exactly once for each
index but does not guarantee any execution order. Mutating global state from
within a loop is only safe at disjoint indexes. There are specialized
constructs to define loop bodies for map and reduce patterns
(`DeliteCollectElem`, `DeliteReduceElem`) that transform a collection of
elements point-wise or perform aggregation. An optional predicate can be added
to perform filter-style operations, i.e. select or aggregate only those
elements for which the predicate is true. All loop constructs can be fused
into `DeliteOpLoops` that do several operations at once.
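In the same schematic style as the `VectorSum` listing above, a point-wise
operation could be expressed as a loop with a collect body. The
`DeliteCollectElem` arguments are simplified here, just as the
`DeliteReduceElem` arguments were in the `VectorSum` example:
case class VectorMap[A,B](v: Exp[Vector[A]], f: Exp[A] => Exp[B])
    extends DeliteOpLoop[Exp[Vector[B]]] {
  val range = v.length
  val body = DeliteCollectElem[B](v)(f) // point-wise transform (impl not shown)
}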
Given the relaxed ordering guarantees, the framework can automatically
generate efficient parallel code for `DeliteOps`, targeting heterogeneous
parallel hardware. Therefore, DSL developers can easily implement parallel
DSL operations by extending one of the parallel `DeliteOps`, and focus only on
the language design without knowing the low-level details of the target
hardware.
The Delite Compiler Framework currently supports Scala, C++, and CUDA targets.
The framework provides code generators for each target in addition to a main
generator (_Delite generator_) that controls them. The _Delite generator_
iterates over the list of available target generators to emit the
target-specific kernels. By generating multiple target implementations of the kernels
and deferring the decision of which one to use, the framework provides the
runtime with enough flexibility in scheduling the kernels based on dynamic
information such as resource availability and input size. In addition to the
kernels, the Delite generator also generates the _Delite Execution Graph_
(DEG) of the application. The DEG is a high-level representation of the
program that encodes all necessary information for its execution, including
the list of inputs, outputs, and interdependencies of all kernels. After all
the kernels are generated, the Delite Runtime starts analyzing the DEG and
emits execution plans for each target hardware.
# (Chapter 1) Control Abstraction
Among the most useful control abstractions are higher order functions. We can
implement support for higher order functions in DSLs while keeping the
generated IR strictly first order. This vastly simplifies the compiler
implementation and makes optimizations much more effective since the compiler
does not have to reason about higher order control flow. We can implement a
higher order function `foreach` over Vectors as follows:
def infix_foreach[A](v: Rep[Vector[A]])(f: Rep[A] => Rep[Unit]) = {
  var i = 0; while (i < v.length) { f(v(i)); i += 1 }
}
// example:
Vector.rand(100) foreach { i => println(i) }
The generated code from the example will be strictly first order, consisting
of the unfolded definition of `foreach` with the application of `f`
substituted with the `println` statement:
while (i < v.length) { println(v(i)); i += 1 }
The unfolding is guaranteed by the Scala type system since `f` has type
`Rep[A]=>Rep[Unit]`, meaning it will be executed statically but it operates on
staged values. In addition to simplifying the compiler, the generated code
does not pay any extra overhead. There are no closure allocations and no
inlining problems [(*)](cliffinlining).
Other higher order functions like `map` or `filter` could be expressed on top
of foreach. Actual Delite DSLs implement these operations as data parallel
loops. The rest of this chapter shows how other control structures such as
continuations can be supported in the same way.
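To illustrate the first point, a staged `map` can be written on top of
`foreach` in the same style (a sketch only; `vector_new` and `vector_update`
are hypothetical DSL operations that are not defined in this chapter):
def infix_map[A,B](v: Rep[Vector[A]])(f: Rep[A] => Rep[B]): Rep[Vector[B]] = {
  val out = vector_new[B](v.length)   // hypothetical: allocate the result vector
  var i = 0
  v foreach { x =>
    vector_update(out, i, f(x))       // hypothetical: out(i) = f(x)
    i += 1
  }
  out
}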
# Leveraging Higher-Order Functions in the Generator
Higher-order functions are extremely useful to structure programs but also
pose a significant obstacle for compilers, recent advances on higher-order
control-flow analysis notwithstanding
[(*)](DBLP:conf/esop/VardoulakisS10,DBLP:journals/corr/abs-1007-4268). While
we would like to retain the structuring aspect for DSL programs, we would like
to avoid higher-order control flow in generated code. Fortunately, we can use
higher-order functions in the generator stage to compose first-order DSL
programs.
Consider the following program that prints the number of elements greater than
7 in some vector:
val xs: Rep[Vector[Int]] = ...
println(xs.count(x => x > 7))
The program makes essential use of a higher-order function `count` to count
the number of elements in a vector that fulfill a predicate given as a
function object. Ignoring for the time being that we would likely want to use
a `DeliteOp` for parallelism, a good and natural way to implement `count`
would be to first define a higher-order function `foreach` to iterate over
vectors, as shown at the beginning of the chapter:
def infix_foreach[A](v: Rep[Vector[A]])(f: Rep[A] => Rep[Unit]) = {
  var i: Rep[Int] = 0
  while (i < v.length) {
    f(v(i))
    i += 1
  }
}
The actual counting can then be implemented in terms of the traversal:
def infix_count[A](v: Rep[Vector[A]])(f: Rep[A] => Rep[Boolean]) = {
  var c: Rep[Int] = 0
  v foreach { x => if (f(x)) c += 1 }
  c
}
It is important to note that `infix_foreach` and `infix_count` are static
methods, i.e. calls will happen at staging time and result in inserting the
computed DSL code in the place of the call. Likewise, while the argument
vector `v` is a dynamic value, the function argument `f` is again static.
However, `f` operates on dynamic values, as made explicit by its type
`Rep[A] => Rep[Boolean]`. By contrast, a dynamic function value would have
type `Rep[A => B]`.
This means that the code generated for the example program will look roughly
like this, assuming that vectors are represented as arrays in the generated
code:
val v: Array[Int] = ...
var c = 0
var i = 0
while (i < v.length) {
  val x = v(i)
  if (x > 7)
    c += 1
  i += 1
}
println(c)
All traces of higher-order control flow have been removed and the program is
strictly first-order. Moreover, the programmer can be sure that the DSL
program is composed in the desired way.
This issue of higher-order functions is a real problem for regular Scala
programs executed on the JVM. The Scala collection library uses essentially
the same `foreach` and `count` abstractions as above, and the JVM, which
applies optimizations based on per-call-site profiling, will identify the
call site _within_ `foreach` as a hot spot. However, since the number of
distinct functions called from `foreach` is usually large, inlining or other
optimizations cannot be applied, and every iteration step pays the overhead of
a virtual method call [(*)](cliffinlining).
# Using Continuations in the Generator to Implement Backtracking
Apart from pure performance improvements, we can use functionality of the
generator stage to enrich the functionality of DSLs without any work on the
DSL-compiler side. As an example we consider adding backtracking
nondeterministic computation to a DSL using a simple variant of McCarthy's
`amb` operator [(*)](McCarthy63abasis). Here is a nondeterministic program
that uses `amb` to find Pythagorean triples from three given vectors:
val u,v,w: Rep[Vector[Int]] = ...
nondet {
  val a = amb(u)
  val b = amb(v)
  val c = amb(w)
  require(a*a + b*b == c*c)
  println("found:")
  println(a,b,c)
}
We can use Scala's support for delimited continuations
[(*)](DBLP:conf/icfp/RompfMO09) and the associated control operators `shift`
and `reset` [(*)](DBLP:journals/mscs/DanvyF92,DBLP:conf/lfp/DanvyF90) to
implement the necessary primitives. The scope delimiter `nondet` is just the
regular `reset`. The other operators are defined as follows:
def amb[T](xs: Rep[Vector[T]]): Rep[T] @cps[Rep[Unit]] = shift { k =>
  xs foreach k
}
def require(x: Rep[Boolean]): Rep[Unit] @cps[Rep[Unit]] = shift { k =>
  if (x) k() else ()
}
Since the implementation of `amb` just calls the previously defined method
`foreach`, the generated code will be first-order and consist of three nested
`while` loops:
val u,v,w: Rep[Vector[Int]] = ...
var i = 0
while (i < u.length) {
  val a = u(i)
  val a2 = a*a
  var j = 0
  while (j < v.length) {
    val b = v(j)
    val b2 = b*b
    val a2b2 = a2+b2
    var k = 0
    while (k < w.length) {
      val c = w(k)
      val c2 = c*c
      if (a2b2 == c2) {
        println("found:")
        println(a,b,c)
      }
      k += 1
    }
    j += 1
  }
  i += 1
}
Besides the advantage of not having to implement `amb` as part of the DSL
compiler, all common optimizations that apply to plain `while` loops are
automatically applied to the unfolded backtracking implementation. For
example, the loop invariant hoisting performed by code motion has moved the
computation of `a*a` and `b*b` out of the innermost loop.
The given implementation of `amb` is not the only possibility, though. For
situations where we know the number of choices (but not necessarily the actual
values) for a particular invocation of `amb` at staging time, we can
implement an alternative operator that takes a (static) list of dynamic
values and unfolds into specialized code paths for each option at compile
time:
def bam[T](xs: List[Rep[T]]): Rep[T] @cps[Rep[Unit]] = shift { k =>
  xs foreach k
}
Here, `foreach` is not a DSL operation but a plain traversal of the static
argument list `xs`. The `bam` operator must be employed with some care because
it incurs the risk of code explosion. However, static specialization of
nondeterministic code paths can be beneficial if it allows aborting many paths
early based on static criteria or merging computation between paths.
val u,v: Rep[Vector[Int]] = ...
nondet {
  val a = amb(u)
  val b = bam(List(x1, x2))
  val c = amb(v)
  require(a + c == f(b)) // assume f(b) is expensive to compute
  println("found:")
  println(a,b,c)
}
If this example were implemented as three nested loops, `f(x1)` and `f(x2)`
would need to be computed repeatedly in each iteration of the second loop as
they depend on the loop-bound variable `b`. However, the use of `bam` will
remove the loop over `x1,x2` and expose the expensive computations as
redundant so that code motion can extract them from the loop:
val fx1 = f(x1)
val fx2 = f(x2)
while (...) { // iterate over u
  while (...) { // iterate over v
    if (a + c == fx1) // found
  }
  while (...) { // iterate over v
    if (a + c == fx2) // found
  }
}
In principle, the two adjacent inner loops could be subjected to the loop
fusion optimization discussed in [here](#sec:360fusionComp). This would remove
the duplicate traversal of `v`. In this particular case fusion is currently
not applied since it would change the order of the side-effecting `println`
statements.
# Using Continuations in the Generator to Generate Async Code Patterns
The previous section used continuations that were completely translated away
during generation. In this section, we will use a CPS-transformed program
generator to generate code that is in CPS. While the previous section
generated only loops, we will generate actual functions in this section, using
the mechanisms described [here](#sec:220functions). The example is taken
from [(*)](DBLP:conf/ecoop/KossakowskiARO12) and concerned with generating
JavaScript but the techniques apply to any target.
A callback-driven programming style is pervasive in JavaScript programs.
Because of the lack of thread support, callbacks are used for I/O, scheduling and
event-handling. For example, in an Ajax call (Asynchronous JavaScript and
XML), one has to provide a callback that will be called once the requested
data arrives from the server. This style of programming is known to be
unwieldy in more complicated scenarios. To give a specific example, let us
consider a scenario where we have an array of Twitter account names and we
want to ask Twitter for the latest tweets of each account. Once we obtain the
tweets of all users, we would like to log that fact in a console.
We implement this program both directly in JavaScript and in the embedded
JavaScript DSL [(*)](DBLP:conf/ecoop/KossakowskiARO12). Let us start by
implementing logic that fetches tweets for a single user by using the jQuery
library for Ajax calls:
function fetchTweets(user, callback) {
  jQuery.ajax({
    url: "http://api.twitter.com/1/"
      + "statuses/user_timeline.json/",
    type: "GET",
    dataType: "jsonp",
    data: {
      screen_name: user,
      include_rts: true,
      count: 5,
      include_entities: true
    },
    success: callback
  })
}
def fetchTweets(user: Rep[String]) =
  (ajax.get {
    new JSLiteral {
      val url = "http://api.twitter.com/1/" +
        "statuses/user_timeline.json"
      val `type` = "GET"
      val dataType = "jsonp"
      val data = new JSLiteral {
        val screen_name = user
        val include_rts = true
        val count = 5
        val include_entities = true
      }
    }
  }).as[TwitterResponse]
type TwitterResponse =
  Array[JSLiteral { val text: String }]
Note that the JavaScript version takes a callback as second argument that will
be used to process the fetched tweets. We provide the rest of the logic that
iterates over the array of users and makes Ajax requests:
var processed = 0
var users = ["gkossakowski",
  "odersky", "adriaanm"]
users.forEach(function (user) {
  console.log("fetching " + user)
  fetchTweets(user, function(data) {
    console.log("finished " + user)
    data.forEach(function (t) {
      console.log("fetched " + t.text)
    })
    processed += 1
    if (processed == users.length) {
      console.log("done")
    }
  })
})
val users = array("gkossakowski",
"odersky", "adriaanm")
for (user <- users.parSuspendable) {
console.log("fetching " + user)
val tweets = fetchTweets(user)
console.log("finished " + user)
for (t <- tweets)
console.log("fetched " + t.text)
}
console.log("done")
Because of the inverted control flow of callbacks, synchronization between
callbacks has to be handled manually. Also, the inverted control flow leads to
a code structure that is distant from the programmer's intent. Notice that in
the JavaScript version, the call to console that prints "done" is put inside
the forEach loop. If it were put after the loop, we would get "done"
printed _before_ any Ajax call has been made, leading to counterintuitive
behavior.
As an alternative to the callback-driven programming style, one can use an
explicit monadic style, possibly sugared by a Haskell-like "do"-notation.
However, this requires rewriting possibly large parts of a program into
monadic style when a single async operation is added. Another possibility is
to automatically transform the program into continuation passing style (CPS),
enabling the programmer to express the algorithm in a straightforward,
sequential way and creating all the necessary callbacks and book-keeping code
automatically. Links [(*)](links) uses this approach. However, a whole-program
CPS transformation can cause performance degradation, code size blow-up, and
stack overflows. In addition, it is not clear how to interact with existing
non-CPS libraries as the whole program needs to adhere to the CPS style. Here
we use a _selective_ CPS transformation, which precisely identifies
expressions that need to be CPS transformed.
In fact, the Scala compiler already does selective, `@suspendable`
type-annotation-driven CPS transformations of Scala programs
[(*)](DBLP:conf/icfp/RompfMO09,DBLP:conf/lfp/DanvyF90,DBLP:journals/mscs/DanvyF92).
We show how this mechanism can be used for transforming our DSL code before
staging and stripping out most CPS abstractions at staging time. The generated
JavaScript code does not contain any CPS-specific code but is written in CPS
style through the use of JavaScript anonymous functions.
## CPS and Staging
As an example, we will consider a seemingly blocking `sleep` method
implemented in a non-blocking, asynchronous style. In JavaScript, there are no
threads and there is no notion of blocking. However, the technique is useful in
other circumstances as well, for example when using thread pools, as no thread
is being blocked during the sleep period. Let us see how our CPS transformed
`sleep` method can be used:
def foo() = {
  sleep(1000)
  println("Called foo")
}
reset {
  println("look, Ma ...")
  foo()
  sleep(1000)
  println(" no callbacks!")
}
We define `sleep` to use JavaScript's asynchronous `setTimeout` function,
which takes an explicit callback:
def sleep(delay: Rep[Int]) = shift { k: (Rep[Unit]=>Rep[Unit]) =>
  window.setTimeout(lambda(k), delay)
}
The `setTimeout` method expects an argument of type `Rep[Unit=>Unit]`, i.e. a
staged function of type `Unit=>Unit`. The `shift` method offers us a function
of type `Rep[Unit]=>Rep[Unit]`, so we need to reify it to obtain the desired
representation. The reification is achieved by the `lambda` function provided
by LMS and performed at staging time.
It is important to note that reification preserves function composition.
Specifically, let `f: Rep[A] => Rep[B]` and `g: Rep[B] => Rep[C]`; then
`lambda(g compose f) == (lambda(g) compose lambda(f))`, where we
consider two reified functions to be equal if they yield the same observable
effects at runtime. That property of function reification is at the core of
reusing the continuation monad in our DSL. Thanks to the fact that the
continuation monad composes functions, we can just insert reification at some
places (like in a `sleep`) and make sure that we reify _effects_ of the
continuation monad without the need to reify the monad itself.
## CPS for Interruptible Traversals
We need to be able to interrupt our execution while traversing an array in
order to implement the functionality shown above. Let us consider a
simplified example where we want to sleep during each iteration. Both
programs, when executed, will print the following output:
//pause for 1s
1
//pause for 2s
2
//pause for 3s
3
done
We present both JavaScript and DSL code that achieves that:
var xs = [1, 2, 3]
var i = 0
var msg = null
function f1() {
  if (i < xs.length) {
    window.setTimeout(f2, xs[i]*1000)
    msg = xs[i]
    i++
  }
}
function f2() {
  console.log(msg)
  f1()
}
f1()
val xs = array(1, 2, 3)
// shorthand for xs.suspendable.foreach
for (x <- xs.suspendable) {
  sleep(x * 1000)
  console.log(String.valueOf(x))
}
log("done")
In the DSL code, we use a `suspendable` variant of arrays, which is achieved
through an implicit conversion from regular arrays:
implicit def richArray[A](xs: Rep[Array[A]]) = new {
  def suspendable: SArrayOps[A] = new SArrayOps[A](xs)
}
The idea behind `suspendable` arrays is that iteration over them can be
interrupted. We will have a closer look at how to achieve that with the help
of CPS. The `suspendable` method returns a new instance of the `SArrayOps`
class defined here:
class SArrayOps[A](xs: Rep[Array[A]]) {
  def foreach(yld: Rep[A] => Rep[Unit] @suspendable):
      Rep[Unit] @suspendable = {
    var i = 0
    suspendableWhile(i < xs.length) { yld(xs(i)); i += 1 }
  }
}
Note that one cannot use while loops in CPS-transformed code, but we can
simulate them with recursive functions. Let us first see how a regular while
loop can be simulated with a recursive function:
def recursiveWhile(cond: => Boolean)(body: => Unit): Unit = {
  def rec: () => Unit = () => if (cond) { body; rec() } else {}
  rec()
}
By adding CPS-related declarations and control abstractions, we implement
`suspendableWhile`:
def suspendableWhile(cond: => Rep[Boolean])(
    body: => Rep[Unit] @suspendable): Rep[Unit] @suspendable =
  shift { k =>
    def rec = fun { () =>
      if (cond) reset { body; rec() } else { k() }
    }
    rec()
  }
## Defining the Ajax API
With the abstractions for interruptible loops and traversals at hand, what
remains to complete the Twitter example from the beginning of the section is
the actual Ajax request/response cycle.
The Ajax interface component provides a type `Request` that captures the
request structure expected by the underlying JavaScript/jQuery implementation
and the necessary object and method definitions to enable the use of
`ajax.get` in user code:
trait Ajax extends JS with CPS {
  type Request = JSLiteral {
    val url: String
    val `type`: String
    val dataType: String
    val data: JSLiteral
  }
  type Response = Any
  object ajax {
    def get(request: Rep[Request]) = ajax_get(request)
  }
  def ajax_get(request: Rep[Request]): Rep[Response] @suspendable
}
Notice that the `Request` type is flexible enough to support an arbitrary
object literal type for the `data` field through subtyping. The `Response`
type alias points at `Any` which means that it is the user's responsibility
to either use `dynamic` or perform an explicit cast to the expected data
type.
The corresponding implementation component implements `ajax_get` to capture a
continuation, reify it as a staged function using `lambda`, and store it in an
`AjaxGet` IR node:
trait AjaxExp extends JSExp with Ajax {
  case class AjaxGet(request: Rep[Request],
      success: Rep[Response => Unit]) extends Def[Unit]
  def ajax_get(request: Rep[Request]): Rep[Response] @suspendable =
    shift { k =>
      reflectEffect(AjaxGet(request, lambda(k)))
    }
}
During code generation, we emit code to attach the captured continuation as a
callback function in the `success` field of the request object:
trait GenAjax extends JSGenBase {
  val IR: AjaxExp
  import IR._
  override def emitNode(sym: Sym[Any], rhs: Def[Any])(
      implicit stream: PrintWriter) = rhs match {
    case AjaxGet(req, succ) =>
      stream.println(quote(req) + ".success = " + quote(succ))
      emitValDef(sym, "jQuery.ajax(" + quote(req) + ")")
    case _ => super.emitNode(sym, rhs)
  }
}
It is interesting to note that, since the request already has the right
structure for the `jQuery.ajax` function, we can simply pass it to the
framework-provided `quote` method, which knows how to generate JavaScript
representations of any `JSLiteral`.
The Ajax component completes the functionality required to run the Twitter
example with one caveat: the outer loop in the Twitter example above
uses `parSuspendable` to traverse arrays instead of the `suspendable`
traversal variant we defined above in `SArrayOps`.
In fact, if we change the code to use `suspendable` instead of
`parSuspendable` and run the generated JavaScript program, we will get the
following output printed to the JavaScript console:
fetching gkossakowski
finished fetching gkossakowski
fetched [...]
fetched [...]
fetching odersky
finished fetching odersky
fetched [...]
fetched [...]
fetching adriaanm
finished fetching adriaanm
fetched [...]
fetched [...]
done
Notice that all Ajax requests were done sequentially. Specifically, there was
just one Ajax request active at a given time; when the callback to process one
request is called, it would resume the continuation to start another request,
and so on. In many cases this is exactly the desired behavior; here, however,
we will most likely want to perform our Ajax requests in parallel.
## CPS for Parallelism
The goal of this section is to implement a parallel variant of the `foreach`
method defined above in `SArrayOps`. We start by defining a few primitives
such as futures and dataflow cells. We define cells in JavaScript, as another
example of integrating external libraries with our DSL:
function Cell() {
  this.value = undefined
  this.isDefined = false
  this.queue = []
  this.get = function (k) {
    if (this.isDefined) {
      k(this.value)
    } else {
      this.queue.push(k)
    }
  }
  this.set = function (v) {
    if (this.isDefined) {
      throw "can't set value twice"
    } else {
      this.value = v
      this.isDefined = true
      this.queue.forEach(function (f) {
        f(v) //non-trivial spawn could be used here
      })
    }
  }
}
A cell object allows us to track dependencies between values. Whenever the
`get` method is called and the value is not in the cell yet, the continuation
will be added to a queue so it can be suspended until the value arrives. The
`set` method takes care of resuming queued continuations. We expose `Cell` as
an external library using our typed API mechanism and use it to implement an
abstraction for futures:
def createCell[A](): Rep[Cell[A]]
trait Cell[A]
trait CellOps[A] {
  def get(k: Rep[A => Unit]): Rep[Unit]
  def set(v: Rep[A]): Rep[Unit]
}
implicit def repToCellOps[A](x: Rep[Cell[A]]): CellOps[A] =
  repProxy[Cell[A],CellOps[A]](x)
def spawn(body: => Rep[Unit] @suspendable): Rep[Unit] = {
  reset(body) //non-trivial implementation uses
              //trampolining to prevent stack overflows
}
def future[A](body: => Rep[A] @suspendable) = {
  val cell = createCell[A]()
  spawn { cell.set(body) }
  cell
}
The last bit of general functionality we need is `RichCellOps`, which ties
`Cell`s and continuations together inside our DSL:
class RichCellOps[A](cell: Rep[Cell[A]]) {
  def apply() = shift { k: (Rep[A] => Rep[Unit]) =>
    cell.get(lambda(k))
  }
}
implicit def richCellOps[A](x: Rep[Cell[A]]): RichCellOps[A] =
  new RichCellOps(x)
It is worth noting that `RichCellOps` is not reified, so it will be dropped at
staging time and its method will be inlined wherever it is used. Also, it
contains CPS-specific code that allows us to capture the continuation. The
`lambda` function reifies the captured continuation.
We are now ready to present the parallel version of the `foreach` method
defined above in `SArrayOps`:
def foreach(yld: Rep[A] => Rep[Unit] @suspendable):
    Rep[Unit] @suspendable = {
  val futures = xs.map(x => future(yld(x)))
  futures.suspendable.foreach(_.apply())
}
We instantiate each future separately so they can be executed in parallel. As
a second step we make sure that all futures are evaluated before we leave the
`foreach` method by forcing evaluation of each future and "waiting" for its
completion. Thanks to CPS transformations, all of that will be implemented in
a non-blocking style.
The only difference between the parallel and serial versions of the Twitter
example is the use of `parSuspendable` instead of `suspendable`, so that the
parallel implementation of the `foreach` method is used.
The rest of the code stays the same. It is easy to switch between both
versions, and users are free to make their choice according to their needs
and performance requirements.
# Guarantees by Construction