Final Exam - Computer Architecture

Question 1 Please check all answers that apply to the Branch Target Buffer (BTB).

The BTB is indexed by the current PC.
The BTB allows the fetch stage of the pipeline to predict the address of the next instruction.
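The two checked answers can be illustrated together. Below is a minimal, hedged sketch of a direct-mapped BTB (the table size and 4-byte instruction width are illustrative assumptions, not from the exam): it is indexed by low-order PC bits and returns a predicted next PC, which lets the fetch stage redirect before the instruction is even decoded.

```python
# Illustrative direct-mapped BTB sketch; SIZE and the 4-byte
# instruction width are assumptions for the example.
SIZE = 8
btb = {}  # index -> (tag, predicted target)

def btb_lookup(pc):
    idx, tag = (pc // 4) % SIZE, pc // (4 * SIZE)
    entry = btb.get(idx)
    if entry is not None and entry[0] == tag:
        return entry[1]   # hit: predicted taken-branch target
    return pc + 4         # miss: predict fall-through

def btb_update(pc, target):
    # Record the taken-branch target once the branch resolves.
    btb[(pc // 4) % SIZE] = (pc // (4 * SIZE), target)
```

For example, before any update `btb_lookup(16)` predicts fall-through (20); after `btb_update(16, 100)` the fetch stage would be steered to 100.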

Question 2 For Problems 2 through 8, use the following Code Sequence: [image] The above program sums the odd and even numbers in an array and outputs them to two registers. Conceptually, what values are kept in R2 and R3? What do R5 and R6 contain when the loop exits?

R2 contains the pointer that points to the different elements in the array
R3 contains the loop iteration variable
R5 contains the sum of the odd numbers in the array
R6 contains the sum of the even numbers in the array

Question 3 For Problems 2 through 8, use the following Code Sequence: [image] For this problem, consider a processor with a 2048-entry Branch History Table (BHT), which is indexed by the low-order bits of the program counter (PC). The BHT implements a two-bit predictor with the state transition diagram shown below. The edges in the state transition diagram represent the branch outcomes that cause a transition. The predictor begins in the “00” state. The predictor predicts taken when it is in states “01”, “10”, and “11”. The predictor predicts not taken when it is in state “00”. The processor is a simple 5-stage pipeline with no branch delay slots. Assume that the result of the code is in registers R5 and R6, and that R5 and R6 contain zero at the beginning of execution. Assume that the initial value of R2 is 100. At address 100, there is an array of 32-bit integers with the following values in order of increasing memory location: [2, 1, 3, 4, 6, 6, 5, 7, 3, 2, 1]. Assume that at the beginning of execution R3 contains 8. Fill in the following table by simulating the execution of the above loop. [image] [image] In the table above, how many actual outcomes are NT when the predicted outcome is T? (Including rows 1 and 2.)

5
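The table itself is in an image, but the predictor's behavior can be simulated. The sketch below assumes the transition diagram is a standard two-bit saturating counter (taken saturates the state up, not-taken saturates it down); the exam's actual diagram may differ, so treat the transitions as an assumption. Per the problem statement, states 1-3 ("01", "10", "11") predict taken and state 0 ("00") predicts not taken.

```python
def predict(state):
    # Per the problem: states "01", "10", "11" predict taken; "00" predicts not taken.
    return "T" if state >= 1 else "NT"

def update(state, outcome):
    # Assumed saturating-counter transitions (the real diagram is in an image).
    return min(state + 1, 3) if outcome == "T" else max(state - 1, 0)

def run(outcomes, state=0):
    """Return the prediction made before each actual outcome."""
    preds = []
    for outcome in outcomes:
        preds.append(predict(state))
        state = update(state, outcome)
    return preds
```

Under these assumed transitions, the outcome sequence T, T, NT, T starting in "00" yields predictions NT, T, T, T: a single not-taken outcome is not enough to move a saturated predictor back to predicting not taken.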

Question 4 (Same BHT setup and code sequence as Question 3.) How many not taken (NT) entries are in the predicted-outcome column? (Including rows 1 and 2.)

3

Question 5 (Same BHT setup and code sequence as Question 3.) Choose the correct answer for rows 15 and 16.

Question 6 (Same BHT setup and code sequence as Question 3.) What is the branch prediction accuracy of the above code execution for branch b_1? Give your answer as a decimal rounded to the nearest 0.01.

0.25

Question 7 (Same BHT setup and code sequence as Question 3.) What is the branch prediction accuracy of the above code execution for branch b_2? Give your answer as a decimal rounded to the nearest 0.01.

0.75

Question 8 For Problems 2 through 8, use the following Code Sequence: [image] The architecture used in the previous questions has a 1-cycle branch mispredict penalty and no penalty for correctly predicted branches. Jumps are always predicted correctly. Assuming that the above architecture is modified to support partial predication (conditional moves with both move-if-zero and move-if-not-zero), the above code can be changed to: [image] Would it be fruitful to transform branch b_1 with partial predication?

No
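The exam's assembly is in an image, but the transformation itself can be sketched in Python using the program description from Question 2 (R5 holds the odd sum, R6 the even sum; function names here are hypothetical). Partial predication computes both candidate updates and commits one with a conditional-move-style select, removing the data-dependent branch at the cost of executing extra operations every iteration.

```python
def sums_branching(arr):
    # Branching form: the odd/even test is a hard-to-predict branch.
    odd = even = 0
    for v in arr:
        if v & 1:
            odd += v
        else:
            even += v
    return odd, even

def sums_predicated(arr):
    # Partially predicated form: compute both candidates, then a
    # conditional-move-style select commits exactly one of them.
    odd = even = 0
    for v in arr:
        is_odd = v & 1
        odd += v if is_odd else 0     # move-if-not-zero flavor
        even += 0 if is_odd else v    # move-if-zero flavor
    return odd, even
```

On the first 8 array elements [2, 1, 3, 4, 6, 6, 5, 7] both forms give (16, 18); whether the transformation pays off depends on whether the extra predicated operations cost more than the mispredict penalty they remove.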

Question 9 For questions 9 through 12, consider the following VLIW architecture: The VLIW processor has 6 total functional units. The VLIW processor has two integer (ALU) units (X, Y), two multiply units (M, N), one load unit (L) and one store unit (S). Assume that branches and jumps must execute in the Y pipeline and that branches have no delay slots. The processor has full bypassing but no scoreboard (the VLIW processor does not stall on data dependencies). ALU instructions have a latency of 1 (result can be used next cycle), multiply instructions have a latency of 2 cycles, loads have a latency of 2 cycles, and stores have a latency of 2 cycles. Questions 9 through 12 will ask about the resulting schedule of the following task: Optimally schedule and bundle the following (sequential) code for this processor using the Equals (EQ) scheduling model. Unroll the code where appropriate and show prologue and epilogue code. Perform code motion and register renaming where appropriate. Note that the loop has a fixed number of iterations. Also, achieve peak performance while minimizing instruction storage. As a secondary constraint, minimize register usage. Code sequence for questions 9 through 12: [image] What is the loop unrolling factor for the optimal scheduling of this code?

1

Question 10 (Same VLIW architecture and code sequence as Question 9.) Which of the following is a valid loop body for the optimally scheduled code?

Question 11 (Same VLIW architecture and code sequence as Question 9.) Which of the following is a valid prologue to the optimally scheduled code?

Question 12 (Same VLIW architecture and code sequence as Question 9.) Which of the following is a valid epilogue to the optimally scheduled code? Assume that the prologue is the one chosen in Question 11.

Question 13 For questions 13 through 15, consider the following vector processor architecture: The 3-lane VMIPS vector processor is shown below. The vector processor fetches instructions in-order, issues instructions in-order, and writes back results out-of-order. Assume the processor can fetch one instruction per cycle and decode one instruction per cycle. Assume that the processor has two read ports per lane and one write port per lane, for a total of 6 read ports and 3 write ports. Assume that the processor has a scoreboard to detect hazards. Assume that the processor can only bypass values through the register file and that values written to the register file in one cycle are not available until the next cycle. Assume that pipeline X can execute branches and ALU operations, pipeline L can execute loads, pipeline S can execute stores, and pipeline Y can execute all multiply operations. Loads have a latency of two cycles and ALU operations have a latency of one cycle. Branches are resolved in X0 and the machine has no branch delay slots and always predicts the fallthrough path. Multiply instructions have a latency of four cycles. Use the named pipeline stages in the figure for your pipeline diagram. The processor has a dedicated register read stage, denoted as “R” in the figure. [image] [Figure showing Three Lane VMIPS Processor Pipeline] Questions 13 through 15 will ask about the resulting pipeline diagram of the following task: Draw the optimal pipeline diagram for the vector portion of the following code executing on this vector processor. Problem 13 through 15 Code Sequence: [image] [image] Which instructions stall due to a data hazard? Check all that apply.

MULVV.D V3, V1, V2

Question 14 (Same vector processor and code sequence as Question 13.) Which instructions stall in the fetch stage due to a structural hazard? Check all that apply.

MULVV.D V3, V1, V2
ADDVV.D V4, V1, V2
SV R7, V3
SV R6, V4

Question 15 (Same vector processor and code sequence as Question 13.) How many cycles does it take to execute the code above?

22

Question 16 What problem does vector stripmining solve? How is the Vector Length Register (VLR) involved in stripmining? Please check all that apply.

VLR is set to the maximum for most of the iterations of the loop.
Vector stripmining allows loops larger than the Vector Length Register (VLR) to execute.
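Both checked answers describe the same mechanism, sketched below. The maximum vector length (MVL = 64) is an assumed value for illustration: stripmining splits an n-element loop into strips of at most MVL elements, handling the odd-size remainder in the first strip so that every remaining strip runs with the VLR set to the maximum.

```python
MVL = 64  # assumed maximum vector length of the hardware

def strip_lengths(n):
    """Return the sequence of VLR settings used to process n elements."""
    lengths = []
    vlr = n % MVL or MVL   # first strip handles the odd-size remainder
    done = 0
    while done < n:
        lengths.append(vlr)
        done += vlr
        vlr = MVL          # every later strip runs at the maximum
    return lengths
```

For n = 230, this yields strips [38, 64, 64, 64]: the VLR equals the maximum for all but the first iteration, matching the first answer above.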

Question 17 For questions 17 through 19, consider the following instruction sequence table: There are four processors executing the code as interleaved below. Assume a 128-byte cache line (block size). Assume all cores contain a direct-mapped data cache of size 2KB. [image] Questions 17 through 19 will ask about the resulting state table of the following task: Show, for each cache line and per cache, what state it is in on every cycle for the code above. You do not need to show lines which are invalid in the cache, but you should remove them from the table when the lines transition to the Invalid state. Assume that the caches implement a snoopy, bus-based cache coherence protocol implementing the MESI protocol. MESI stands for the four states: Modified, Exclusive, Shared, and Invalid. Assume all lines are Invalid in all caches at the beginning of execution. Be sure to duplicate the information from time step to following time step if the data stays valid in the cache, and show the state that the data is in. Which of the following tables shows the correct state at time 9?

Question 18 (Same MESI setup and instruction sequence as Question 17.) Which of the following tables shows the correct state at time 11?

Question 19 (Same MESI setup and instruction sequence as Question 17.) At which of the following times do cache conflict misses happen? Check all that apply.

9
14

Question 20 For questions 20 through 27, consider the following processor spec: You are designing a processor with a 64KB data cache that is four-way set associative with a 128-byte block size and a not-most-recently-used (NMRU) replacement policy. This processor has a virtual memory system with a software-managed TLB. Virtual addresses on the machine are 32 bits and physical addresses are 32 bits. The TLB contains 16 entries and is 8-way set associative. Assume that the architecture flushes the TLB and cache on process swap. Assume that the page size of the machine is 16KB. Assume that each page table entry requires a valid bit and a dirty bit. Likewise, each TLB entry contains a valid bit and a dirty bit. Assume that the architecture is byte addressable. Assume that the cache is virtually indexed and physically tagged. Assume that the TLB uses 3 bits to implement an NMRU replacement policy. For the processor described above, what is the bit width of the tag for the data cache?

18

Question 21 (Same processor spec as Question 20.) What is the bit width of the index for the data cache?

7

Question 22 (Same processor spec as Question 20.) What is the size in bytes of the data field for the data cache?

128
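The three cache answers above (tag 18, index 7, data field 128 bytes) follow directly from the spec and can be checked with a short calculation: a 64KB, four-way cache with 128-byte blocks has 64K / (4 x 128) = 128 sets, so 7 index bits, 7 block-offset bits, and the remaining 32 - 7 - 7 = 18 bits of the address form the tag.

```python
import math

CACHE_BYTES = 64 * 1024   # 64KB data cache
WAYS = 4                  # four-way set associative
BLOCK = 128               # 128-byte block (the per-line data field)
ADDR_BITS = 32            # physical addresses are 32 bits

sets = CACHE_BYTES // (WAYS * BLOCK)            # number of sets
offset_bits = int(math.log2(BLOCK))             # block offset bits
index_bits = int(math.log2(sets))               # set index bits
tag_bits = ADDR_BITS - index_bits - offset_bits # everything above the index
```

This gives sets = 128, offset_bits = 7, index_bits = 7, and tag_bits = 18, matching the answers.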

Question 23 (Same processor spec as Question 20.) What is the bit width of the page offset for the TLB?

14

Question 24 (Same processor spec as Question 20.) What is the bit width of the physical page number for the TLB?

18

Question 25 (Same processor spec as Question 20.) What is the total TLB data payload width in bits (not including the tag, valid bit, or NMRU bits)?

19
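The three TLB answers (page offset 14, PPN 18, payload 19) also follow arithmetically from the spec: 16KB pages give a 14-bit page offset, the 32-bit physical address leaves an 18-bit physical page number, and since the question excludes the tag, valid bit, and NMRU bits, the payload is the PPN plus the dirty bit.

```python
import math

PAGE = 16 * 1024          # 16KB pages
PADDR_BITS = 32           # physical addresses are 32 bits

page_offset_bits = int(math.log2(PAGE))   # bits untouched by translation
ppn_bits = PADDR_BITS - page_offset_bits  # physical page number width
# Payload excludes tag, valid bit, and NMRU bits per the question,
# leaving the PPN plus the dirty bit.
payload_bits = ppn_bits + 1
```

This yields page_offset_bits = 14, ppn_bits = 18, and payload_bits = 19.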

Question 26 (Same processor spec as Question 20.) Even though the above processor has a software-managed TLB, it is desirable for the operating system to use a multi-level page table in order to save space. Design a multi-level page table structure which fits page table entries naturally into a page and favors having the leaf page table level full of page table entries. How many levels does this page table structure have?

2

Question 27 (Same processor spec as Question 20 and the page table design from Question 26.) How many bits of the virtual address are used at each level? Check all that apply.

Level 1: 6
Level 2: 12
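The page table answers (2 levels; 6 bits then 12 bits) can be reconstructed from the spec: a PTE holds an 18-bit PPN plus valid and dirty bits (20 bits), which fits naturally in 4 bytes, so a 16KB page holds 4096 PTEs and a full leaf level consumes 12 of the 18 VPN bits, leaving 6 for the root.

```python
import math

PAGE = 16 * 1024          # 16KB pages
VADDR_BITS = 32
page_offset_bits = int(math.log2(PAGE))   # 14-bit page offset
vpn_bits = VADDR_BITS - page_offset_bits  # virtual page number bits

# A PTE (18-bit PPN + valid + dirty = 20 bits) fits naturally in 4 bytes.
pte_bytes = 4
ptes_per_page = PAGE // pte_bytes
leaf_bits = int(math.log2(ptes_per_page)) # leaf level full of PTEs
root_bits = vpn_bits - leaf_bits          # remainder goes to the root
levels = 2
```

This yields vpn_bits = 18, leaf_bits = 12, and root_bits = 6, matching the checked answers.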

Question 28 Which of the following code sequences shows false sharing? Assume that the code is executing on a shared-memory multiprocessor system where the cache block size is 32 bytes and each load/store handles 4 bytes. Assume the cache is direct mapped with 32 indices.
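The answer options are in an image, but the condition being tested can be sketched. False sharing occurs when two threads update logically disjoint 4-byte variables that happen to fall in the same 32-byte cache block, so coherence traffic ping-pongs the line between caches; the addresses below are hypothetical.

```python
BLOCK = 32   # cache block size in bytes

def cache_line(addr):
    return addr // BLOCK

# Two 4-byte per-thread counters packed into the same 32-byte block:
# writes from different threads invalidate each other's copy of the
# line even though the data itself is disjoint.
counter_a_addr, counter_b_addr = 0, 4
falsely_shared = cache_line(counter_a_addr) == cache_line(counter_b_addr)

# Padding each counter out to the block size places them on distinct
# lines, which removes the false sharing.
padded_b_addr = 32
fixed = cache_line(counter_a_addr) != cache_line(padded_b_addr)
```

A code option exhibits false sharing exactly when distinct threads write different offsets within the same block, as `falsely_shared` detects here.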

Question 29 Two threads are executing on two independent processors. The data stored at address ‘p’ is initialized to the value 5 and the data stored at address ‘q’ is initialized to the value 6 before the threads begin executing. Note that SW R0, 0(q) stores the value zero at address ‘q’. After both threads complete execution, is the state where the data stored at address ‘p’ contains the value 31 and the data stored at address ‘q’ contains the value 10 a sequentially consistent execution? [image]

No, it is not a sequentially consistent execution.

Question 30 For questions 30 through 32, consider the following code: The following code is designed to implement a producer-consumer relationship between two independent threads which execute on different processors. These threads share memory and the memory system is sequentially consistent. Assume that the variables ring_buffer, head_index, tail_index, and number_in_buffer are shared between the threads. Assume that statements in the code provided below execute in order, but any given statement may not be atomic. SendData puts one value into the communication buffer and ReceiveData removes one value from the buffer. Assume that the first thread only ever calls SendData and the second thread only ever calls ReceiveData. Assume that there is a library which implements a mutual exclusion lock (mutex) with two functions, AcquireLock(int * lock) and ReleaseLock(int * lock). Both AcquireLock and ReleaseLock take a parameter which is the address of a lock variable. Also, the lock library has a function InitializeLock(int * lock) which initializes a lock in the unlocked state when given a parameter which is the address of a lock. [image] For proper operation, does the code above need locking?

Yes

Question 31 (Same producer-consumer code as Question 30.) If locking is needed, where should we insert the function AcquireLock(&lock)? Check all that apply.

2
6

Question 32 (Same producer-consumer code as Question 30.) If locking is needed, where should we insert the function ReleaseLock(&lock)? Check all that apply.

3
7
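The exam's code and its line numbers are in an image, so the positions checked above cannot be reproduced here; instead, the sketch below shows the shape a correct locking placement produces, using Python's `threading.Lock` in place of the exam's AcquireLock/ReleaseLock and the shared variable names from the problem statement. The key point is that the lock must be held across the entire read-modify-write of head_index, tail_index, and number_in_buffer, not just around individual statements.

```python
import threading

BUF_SIZE = 8
ring_buffer = [0] * BUF_SIZE
head_index = 0
tail_index = 0
number_in_buffer = 0
lock = threading.Lock()   # stands in for InitializeLock(&lock)

def SendData(value):
    global head_index, number_in_buffer
    while True:
        with lock:                     # AcquireLock(&lock)
            if number_in_buffer < BUF_SIZE:
                ring_buffer[head_index] = value
                head_index = (head_index + 1) % BUF_SIZE
                number_in_buffer += 1
                return                 # lock released on exit
        # buffer full: release the lock and retry so the consumer can run

def ReceiveData():
    global tail_index, number_in_buffer
    while True:
        with lock:                     # AcquireLock(&lock)
            if number_in_buffer > 0:
                value = ring_buffer[tail_index]
                tail_index = (tail_index + 1) % BUF_SIZE
                number_in_buffer -= 1
                return value
        # buffer empty: release the lock and retry
```

Releasing the lock before retrying in the full/empty cases is what keeps the two threads from deadlocking against each other while waiting.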

Question 33 What is the bisection bandwidth of the following network? Give your answer rounded to the nearest 0.1 Gbps. A 4-node shared multi-drop bus which is 64 bits wide and clocks at 200 MHz.

12.8

Question 34 What is the bisection bandwidth of the following network? Give your answer rounded to the nearest 0.1 Gbps. A 2-ary 3-cube where each link is 16 bits wide, with 8 bits in each direction, and the links are clocked at 100 MHz.

6.4
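Both bisection-bandwidth answers reduce to one-line calculations. Bisecting a shared bus cuts the single shared medium, so the bisection bandwidth equals the bus bandwidth; a 2-ary 3-cube is a 3-dimensional hypercube of 8 nodes, and any bisection cuts 2^(3-1) = 4 links, each carrying its full 16-bit width per cycle.

```python
# Question 33: 64-bit bus at 200 MHz; the bisection is the bus itself.
bus_gbps = 64 * 200e6 / 1e9

# Question 34: 2-ary 3-cube bisection cuts 2^(3-1) = 4 links,
# each 16 bits total (8 each direction) at 100 MHz.
cube_gbps = 2 ** (3 - 1) * 16 * 100e6 / 1e9
```

This gives 12.8 Gbps and 6.4 Gbps respectively, matching the answers above.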
