A time‐series perspective on executive functioning: The benefits of a dynamic approach to random number generation

Abstract

Objectives: Executive functioning (EF) is a key topic in neuropsychology. A multitude of underlying processes and constructs have been suggested to explain EF, which are measured by at least as many different neuropsychological tests. However, these tests often rely on summary statistics to quantify the construct under study, failing to capture the dynamic nature of EF. An alternative to these summary statistics is a time-series approach that quantifies all the available temporal information.

Methods: We used recurrence quantification analysis (RQA) to quantify the characteristics of any temporal pattern in random number generation data, and we compared RQA to the traditional, static analysis of random number sequences.

Results: The traditional measures yield inconsistent results with increasing sequence length, both for computer-generated and human-generated sequences, whereas the RQA measures do not.

Conclusion: The results suggest that a time-series approach does a better job of modelling what is happening on different time-scales and, therefore, is better at explaining how EF changes over the course of the random number generation task. We argue that these findings likely also apply to other neuropsychological EF tests, and that a time-series approach is an important addition to the study of EF.

behaviors, cognition, and emotions (Baggetta & Alexander, 2016). On the other hand, factor analyses on test results mostly yield a limited number of distinguishable latent or core constructs (Karr et al., 2018; Miyake et al., 2000).
This discrepancy between the theoretical definition of EFs (suggesting a multitude of different processes) and the empirical outcomes (pointing to a relatively limited set of constructs) lies in the methods used to identify EF constructs (Delis et al., 2003) and in the psychometric properties of test results. For example, data reduction techniques point to a clear distinction between the information gained from order-dependent and order-independent statistics (Giuliani et al., 2001), and interpretation of test results becomes problematic when trial-by-trial information is removed from the data (Rouder & Haaf, 2019). In other words, an accurate interpretation of test results is only feasible when researchers and clinicians account for behavioral changes during the task, for example, the transition from perseverations to correctly sorted cards in a sorting task. This order-dependent information is often accounted for by 'subjective' observations, while the actual test results are order-independent; the amount of perseveration in our sorting task stays the same regardless of where and when the perseverations occur. Since the early 19th century, psychological tests have primarily been used to assess differences in cognitive skills between (clinical) groups or to place an individual's cognitive performance relative to a norm. EFs, however, transpire as a change process, as a sequence of cascading and interacting events, suggesting the presence of essential order-dependent information (Gazzaniga et al., 2019). In our card sorting example, each card sort is likely to depend on previous card sorts.
This suggests that the underlying dynamics of executive functioning, as measured during EF assessment, likely change from start to finish.
In order to observe and adequately describe the properties of this change process, measurement, analysis, and interpretation of EF data should focus on quantifying its dynamics. For example, what is the significance of a varying number of errors over the course of an entire test? The number of errors might fluctuate irregularly throughout the test or decline gradually, but what this translates to is how predictable it is that the participant will make an error at any point during the test. Test results that can actually capture this predictability would be very informative. The static measures used in EF assessment, such as the sum and mean of errors aggregated over the entire measurement process, do not adequately capture this predictability at the intra-individual level (Fisher et al., 2018; Molenaar & Campbell, 2009). Complexity science, by contrast, studies how systems change over time (Bar-Yam, 1997) and provides many methods and measures to describe these changes.
In recent years, these methods have been found applicable and relevant to the interpretation of performance on the Random Number Generation (RNG) task (Multani et al., 2016; Oomens et al., 2015). The RNG task is one of many EF assessment tools used in clinical practice and scientific research, and has a strong sequential dependency (Montare, 1999; Schulz et al., 2012; Shteingart & Loewenstein, 2016). Participants are instructed to generate a random sequence of digits, and EFs are assessed by several randomization measures that quantify departures from mathematical randomness (Towse & Neil, 1998). For example, the amount of redundancy (R) is calculated, which is the deviation from the mathematical random number distribution expressed as a percentage. That is, if the numbers 1 through 9 are used equally often during the task, R approaches 0. Other randomization measures are RNG, which is comparable to R but for digram usage (combinations of two numbers) at lag 1 instead of single numbers, Phi 2 for digrams that are repetitions of the same number, and Coupon, which equals the mean number of responses produced before all numbers 1 through 9 have been cycled. For a full explanation of randomization measures, see Oomens et al. (2021) and Towse and Neil (1998).
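To make these definitions concrete, the following is a minimal base R sketch of two of the randomization measures described above, redundancy and Coupon. The randseqR package (Oomens et al., 2021) provides the reference implementations; details such as whether an incomplete final cycle counts toward Coupon may differ there.

```r
# A minimal base R sketch (not the randseqR implementation) of two
# randomization measures described above; x is a vector of digits 1-9.

redundancy <- function(x, alternatives = 9) {
  p <- table(factor(x, levels = 1:alternatives)) / length(x)
  p <- p[p > 0]                        # treat 0 * log(0) as 0
  h <- -sum(p * log2(p))               # observed single-number entropy
  100 * (1 - h / log2(alternatives))   # 0 = perfectly even number usage
}

coupon <- function(x, alternatives = 9) {
  cycles <- integer(0); seen <- integer(0); steps <- 0
  for (d in x) {
    steps <- steps + 1
    seen  <- union(seen, d)
    if (length(seen) == alternatives) {   # all alternatives used: cycle done
      cycles <- c(cycles, steps)
      seen <- integer(0); steps <- 0
    }
  }
  mean(cycles)   # mean responses per completed cycle; an incomplete final
}                # cycle is ignored here, which may differ from randseqR

set.seed(1)
x <- sample(1:9, 550, replace = TRUE)
redundancy(x)    # close to 0 for a uniform pseudo-random sequence
coupon(x)        # coupon-collector expectation for 9 alternatives is ~25.5
```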

| Quantifying dynamic patterns in RNG series
One of the methods used in complexity science is Recurrence Quantification Analysis (RQA). RQA is a nonlinear time-series analysis that quantifies patterns of recurring values across all possible time lags given the length of the time series. The analysis is based on a recurrence matrix R(i,j), an n × n binary matrix, where n is the length of the time-series and a value of 1 at (i,j) indicates that the value observed at time i was observed again at time j: a recurrence. A graphical representation of the recurrence matrix is the Recurrence Plot (RP), in which recurring values appear as recurrent points. Multiple consecutive recurrences are represented as either straight (vertical/horizontal) lines or as diagonal lines in the RP.
Straight lines signify a recurring sequence of the same value, while diagonal lines signify a recurring sequence of any combination of values. Quantification of these patterns of recurrence is called RQA.
There are many RQA measures, the most important of which are the Recurrence Rate (RR), Determinism (DET), Laminarity (LAM) and Entropy (ENT). RR is the proportion of recurrent observations (dots in the RP) relative to the maximum possible number of recurrent points (usually the size of the matrix minus its diagonal). DET is the proportion of dots in the RP that form diagonal lines, while LAM is the proportion of dots that form straight lines. These diagonal and straight lines can vary in length, which is quantified by the maximum line length and mean line length for both diagonal and straight lines (the mean vertical line length is called trapping time). ENT is the entropy of the distribution of diagonal line lengths, which is indicative of the variability of pattern recurrence. For a full explanation of RPs and RQA see Marwan et al. (2002, 2007), Webber and Zbilut (2005), Coco and Dale (2014) and Wallot and Leonardi (2018).
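As an illustration of how these measures follow from the recurrence matrix, here is a minimal base R sketch of categorical RQA for a digit sequence. The crqa package (Coco & Dale, 2014) and randseqR (Oomens et al., 2021) are the reference implementations; conventions such as the treatment of the line of identity may differ in detail.

```r
# A minimal base R sketch of categorical RQA on a digit sequence x,
# intended only to make the measures above concrete.

rqa_sketch <- function(x, minline = 2) {
  n <- length(x)
  R <- outer(x, x, `==`)          # recurrence matrix: TRUE if x[i] == x[j]
  diag(R) <- FALSE                # exclude the main diagonal (line of identity)

  run_lengths <- function(v) { r <- rle(v); r$lengths[r$values] }
  # line structures: runs of recurrences along diagonals and columns
  diags <- unlist(lapply(c(-(n - 1):-1, 1:(n - 1)),
                         function(k) run_lengths(R[row(R) - col(R) == k])))
  verts <- unlist(apply(R, 2, run_lengths))
  dl <- diags[diags >= minline]   # diagonal lines of sufficient length
  vl <- verts[verts >= minline]   # vertical lines of sufficient length
  pl <- table(dl) / length(dl)    # distribution of diagonal line lengths

  list(RR   = sum(R) / (n^2 - n), # proportion of recurrent points
       DET  = sum(dl) / sum(R),   # points that form diagonal lines
       LAM  = sum(vl) / sum(R),   # points that form vertical lines
       TT   = mean(vl),           # trapping time: mean vertical line length
       ENT  = -sum(pl * log(pl)), # entropy of diagonal line lengths
       Lmax = max(dl))            # longest diagonal line
}

set.seed(1)
rqa_sketch(sample(1:9, 275, replace = TRUE))
```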
By comparing the description of the RNG randomization measures to the RQA measures, it is clear that the RQA measures provide a better vocabulary to explain what occurs during the RNG task. For example, an observed human-generated sequence that has a high determinism and a low entropy, which translates to predictable and rigid behavior, is easier to picture than a sequence that has a high RNG and a low Coupon. These randomization measures only gain meaning through factor-analytic studies (Maes et al., 2011; Peters et al., 2007; Towse, 1998), and it is not, at first glance, evident that a high RNG and a low Coupon also translate to predictable and rigid behavior. In 2015, we showed, by means of exploratory factor analysis, that RQA captures the same variance in RNG performance, but in a more parsimonious way (Oomens et al., 2015). At the time, we dubbed the latent variables, in accordance with EF theory (Miyake et al., 2000), inhibition of prepotent responses and updating (of working memory content). In hindsight, the variable updating in our two-factor model might be better understood as output inhibition, which is the tendency to refrain from generating the same number in short order.
This is exactly what LAM and trapping time quantify in random number sequences, and LAM and trapping time were two of three manifest variables that loaded on updating in our model (for the rationale behind our interpretation see Oomens et al. (2015)).
However, output inhibition is an automatic process and, therefore, only an artifact of human-generated number sequences (Baddeley et al., 1998), rather than an emergent property of EF. Hence, we are left with inhibition of prepotent responses, which might also be referred to as smoothness of EF, predictability of behavior, or determinism. A more hierarchical model of EF with only a single higher-order construct is also what was proposed by Friedman and Miyake (2017), and such a model better fits how, or whether, a person goes about doing something.

| RNG series as system history
The fact that RQA offers a better vocabulary to comprehend EF processes lies in its sensitivity to time-based information. RQA leaves the time-series (i.e., the sequence of observations) intact, while the randomization measures break the time-series into (mostly) digrams, without reference to their location in the time-series, and use this collection of digrams as input for further analysis. However, breaking up a time-series like this is only valid under the assumptions of stationarity and homogeneity, a requirement commonly referred to as ergodicity (Molenaar & Campbell, 2009). The ergodicity assumption is convenient, because the space-averaged behavior of an ergodic system is equivalent to its time-averaged behavior (e.g., throwing 100 dice all at once should give the same distribution of numbers as throwing one die 100 times in a row). This property does not apply to a complex system in which all behavior strongly depends on its unique history of interactions with its internal and external environment. Consequently, the assumption that ergodicity would apply to the RNG task, or to neuropsychological tests in general, might be questionable (Hasselman & Bosman, 2020; Molenaar & Campbell, 2009; Van Orden et al., 2003). For the RNG task, the ergodicity assumption implies that the statistical characteristics of every response (i.e., digram usage) are stable in time (stationarity), and that the statistical characteristics used to describe each response of an individual sequence apply to all other sequences (homogeneity). In the current study we explored, in particular, the assumption of stationarity of the RNG task. To this end, we created eight computer-generated random sequences of increasing length (from as low as 50 numbers up to a sequence of 8800 numbers) and calculated the corresponding randomization values and RQA values.
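The dice illustration of ergodicity can be made concrete in a few lines of base R: for an ergodic process, the ensemble average and the time average estimate the same distribution, which is exactly what breaking a sequence into an unordered collection of digrams presupposes.

```r
# For an ergodic process, the ensemble average (100 dice thrown at once)
# and the time average (one die thrown 100 times in a row) estimate the
# same distribution; base R makes the comparison explicit.

set.seed(42)
ensemble <- sample(1:6, 100, replace = TRUE)   # 100 dice, one throw each
temporal <- sample(1:6, 100, replace = TRUE)   # one die, 100 throws in a row

table(ensemble) / 100   # both tables approximate the uniform 1/6; for a
table(temporal) / 100   # history-dependent (non-ergodic) system they need not
```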
Being computer-generated random sequences, output values of both techniques should be approximately the same for each number sequence. However, since many randomization measures are proportioned to an expected value or distribution, their values tend to approach a mathematical maximum with increasing length (Oomens et al., 2021). This is an undesirable feature of the randomization measures, for it captures something that has nothing to do with EF. To test this presumed behavior of randomization measures and RQA measures, we collected five successive human-generated random number sequences of increasing length (from 50 up to 1100 numbers) to explore the variability of randomization and RQA measures in real-world data. Contrary to the computer-generated sequences, human behavior lacks the property of memorylessness (Ramachandran, 1979) and, therefore, this asymptote-like behavior might be absent in human-generated sequences. Furthermore, windowed analysis (Wallot, 2017) was performed on the longest human-generated number sequence and the corresponding computer-generated sequence of length 1100, to test the cascading nature of RNG performance. If the RNG task abides by the assumption of stationarity, output values should stay approximately the same for each window.

| Participants
The participant who volunteered to complete the RNG task five consecutive times was a 23-year-old healthy woman who provided written informed consent and was not compensated for partaking in the study. The study was approved by the Vincent van Gogh Institutional Review Board (#15.05562).

| Material
Number sequences were generated by clicking with a standard input device on the cells of a 3 × 3 grid shown on a laptop computer. The top row displayed the numbers 1, 2 and 3, whereas the middle and bottom rows showed the numbers 4, 5 and 6, and 7, 8 and 9, respectively. Hovering over the numbers on the grid with the cursor would illuminate them in a blue hue, to facilitate responding. The start of each trial was marked by a bell sound and the task stopped automatically when the last number was selected. The next trial would start by pressing the 'start' button after the setting (length of the required sequence) was adjusted by the researcher. The number sequences were automatically stored after each trial.

| Procedure
Before the task, the participant was asked to define random in her own words to assess her knowledge of the concept of randomness. Hereafter, she received the following instruction: 'Imagine a bowl containing 9 pieces of paper, each with one of the numbers 1 through 9 written on it. By iterating the following steps: take a piece of paper out of the bowl, write down the number and return the paper to the bowl, you will get a sequence of numbers that is perfectly random. Your task is to make a random sequence, like in the bowl metaphor, by selecting numbers on the screen. Keep in mind that it is possible to take the same piece of paper out of the bowl multiple times in a row'.
No explicit reference was made to the length of the number sequence during the instructions. On each consecutive trial, the participant was prompted to be as random as possible, referring to the bowl metaphor, but without repeating any of the instructions. In this way, five consecutive trials were administered. The participant was allowed to take a short rest between trials and started the next trial by pushing the start button. The order of the five trials was randomized using the base R sample function; running this code yielded the following order of conditions: 50, 1100, 550, 100, and 275 numbers. The time it took the participant to complete each sequence was approximately 0m23, 12m20, 5m24, 1m09 and 2m27, respectively, for a total duration of 21m43. No fatigue was observed during the session.
Furthermore, eight computer-generated sequences of increasing length were created using the base R sample function. Figures 1 and 2 show the time-series of the human-generated numbers and computer-generated numbers, respectively. The human-generated and the computer-generated sequences of length 1100 were partitioned into 11 non-overlapping windows of 100 numbers, and randomization measures and RQA measures were calculated within each window.
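The original code referenced in the footnotes is not reproduced here, but the sampling and windowing steps can be sketched in base R as follows. The seed, and the lengths of the computer-generated sequences other than the stated endpoints of 50 and 8800, are assumptions for illustration.

```r
# A plausible reconstruction, in base R, of the sampling and windowing
# steps described above; seed and intermediate lengths are assumed.

set.seed(2015)                                  # assumed seed

# Randomize the order of the five trial lengths (the session reported
# above happened to yield 50, 1100, 550, 100, 275):
trial_order <- sample(c(50, 100, 275, 550, 1100))

# Eight computer-generated random sequences of increasing length:
lengths   <- c(50, 100, 275, 550, 1100, 2200, 4400, 8800)  # assumed lengths
sequences <- lapply(lengths, function(n) sample(1:9, n, replace = TRUE))

# Partition a length-1100 sequence into 11 non-overlapping windows of 100;
# any measure can then be computed per window:
windows <- split(sequences[[5]], rep(1:11, each = 100))
sapply(windows, function(w) length(unique(w)))  # e.g., alternatives per window
```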

| Data analysis
Randomization measures and recurrence measures were obtained using the randseqR package for R (Oomens et al., 2021). Parameters for RQA were kept at their default settings as prescribed for nominal time-series by Coco and Dale (2014), which means that both the embedding dimension (m) and the delay (τ) were set to 1, the minimal line length was set to 2, and the radius to less than 1.
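Under these settings, a single RQA run might look as follows, assuming the interface of the crqa package (Coco & Dale, 2014). Argument names follow that package (recent versions accept further arguments, e.g., datatype), and radius = 0.5 is one arbitrary choice of 'less than 1' so that only identical digits count as recurrent.

```r
# A sketch of one RQA run under the settings reported above, assuming the
# crqa package's interface; this is not the randseqR call itself.

library(crqa)

set.seed(1)
x <- sample(1:9, 275, replace = TRUE)

res <- crqa(ts1 = x, ts2 = x,
            delay = 1, embed = 1,    # no embedding for a nominal series
            rescale = 0, normalize = 0,
            radius = 0.5,            # < 1: only exact matches recur
            mindiagline = 2,         # minimal diagonal line length
            minvertline = 2,         # minimal vertical line length
            tw = 1)                  # exclude the line of identity

res[c("RR", "DET", "LAM", "ENTR", "L", "maxL", "TT")]
```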
The randomization measures we used to describe performance on the RNG task were redundancy, RNG, RNG2, coupon, null score quotient (NSQ), adjacency (ascending, descending and total adjacency), turning point index (TPI), runs, repetition gap (mean, median and mode), and phi index (phi 2 through phi 7). Of these measures, redundancy, RNG, phi 2, and coupon were already described in the introduction.
RNG2 is similar to RNG but for number combinations at lag 2 (interleaved digrams), NSQ is the proportion of possible digrams that are not used in the sequence, and adjacency is the proportion of digrams with an ordinal sequence of response alternatives. TPI is a value for the number of turning points in the sequence, points that mark a change in numerical direction, and runs is the variance of the sequence lengths between turning points. Repetition gap is the sequence length between two identical numbers, and the collection of all repetition gaps can be expressed as the mean gap, the median gap or the mode gap. The phi index is a value for the number of repetitions at a certain lag. We already described phi 2, which is the number of repetitions at lag 1; phi 3 is the number of repetitions at lag 2 (interleaved digrams), and phi n+1 is the number of repetitions at lag n.
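A minimal base R sketch of three of these measures follows, implementing the verbal definitions given here rather than the exact randseqR formulas (which, for example, express TPI relative to the number of turning points expected by chance).

```r
# Base R sketches of adjacency, repetition gap, and turning points,
# following the verbal definitions above; x is a vector of digits 1-9.

adjacency <- function(x) {
  d <- diff(x)
  c(ascending  = mean(d ==  1),    # proportion of digrams like 3-4
    descending = mean(d == -1),    # proportion of digrams like 4-3
    total      = mean(abs(d) == 1))
}

repetition_gap <- function(x) {
  gaps <- unlist(lapply(1:9, function(k) diff(which(x == k))))
  c(mean   = mean(gaps),
    median = median(gaps),
    mode   = as.integer(names(which.max(table(gaps)))))
}

turning_points <- function(x) {
  d <- sign(diff(x))
  d <- d[d != 0]        # ignore immediate repetitions
  sum(diff(d) != 0)     # raw count of changes in numerical direction;
}                       # TPI relates this count to its chance expectation

set.seed(1)
x <- sample(1:9, 550, replace = TRUE)
adjacency(x); repetition_gap(x); turning_points(x)
```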
The recurrence measures we used to describe the sequential properties of the number sequences were the recurrence rate (RR), determinism (DET), laminarity (LAM), entropy (ENT), and the maximum and mean line lengths for both diagonal and vertical lines (including trapping time).

FIGURE 1 Human-generated time-series data
As mentioned, Figures 1 and 2 show the time-series of the human-generated numbers and computer-generated numbers, respectively.
Since these number sequences are nominal data (i.e., all numbers carry the same amount of information), it is difficult to see structure or patterns in these time-series. A 9 followed by a 1 is marked by a steep decline in the plot, whereas a 9 followed by an 8 is marked by a shallow decline. There is, however, no difference in information between the digrams 91 and 98 and, therefore, between a steep and a shallow decline in the plot. This changes if we take the entire time-series into account. Multiple consecutive shallow declines or increases are associated with more white space in the plot, which can be indicative of more counting-like behavior as measured by adjacency. The human-generated sequences in Figure 1, especially the longer time-series, tend to be more 'open' compared to the computer-generated sequences in Figure 2. Comparing the values for combined adjacency for computer-generated and human-generated sequences (Table 3) confirms the higher counting behavior for human-generated sequences.
Figure 3 shows the randomization values (y-axis) for the eleven non-overlapping length-100 windows (x-axis) of the computer-generated sequence, whereas Figure 4 shows the RQA values for the same windowed analysis. Figures 5 and 6 show the randomization values and RQA values for the length-1100 human-generated sequence, respectively.
In each graph, the blue horizontal line represents the mean of the windowed values and the red horizontal line the value calculated over the entire sequence. The maximum line lengths are the obvious exception to this, because the red horizontal line for these measures is always related to the highest peak (i.e., maximum value) in the windowed analysis; the red horizontal line for L_max is even higher than the highest peak, probably because the windowed analysis breaks up the longest line.
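That deflation of L_max under windowing can be demonstrated with a toy example: a recurring motif whose two occurrences fall in different windows produces a long diagonal line in the full recurrence matrix, but the within-window matrices never see that cross-window recurrence. A base R sketch:

```r
# Toy illustration of why windowed analysis deflates L_max: the motif 1-6
# recurs across the two halves of the series, producing a long diagonal
# line only in the full recurrence matrix.

max_diag <- function(x) {                  # longest diagonal line, identity excluded
  n <- length(x)
  R <- outer(x, x, `==`); diag(R) <- FALSE
  max(0, unlist(lapply(c(-(n - 1):-1, 1:(n - 1)), function(k) {
    r <- rle(R[row(R) - col(R) == k])
    c(0, r$lengths[r$values])
  })))
}

x <- c(1, 2, 3, 4, 5, 6, 1, 9,             # window 1 contains the motif once
       1, 2, 3, 4, 5, 6, 7, 8)             # window 2 repeats it
max_diag(x)                                # full series: L_max = 6
max(max_diag(x[1:8]), max_diag(x[9:16]))   # per window: L_max = 1
```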

| DISCUSSION
In the current study, we explored the time-dependency of performance on the random number generation task and how this time-dependency corresponds to our understanding of executive functioning. Performance on the RNG task has traditionally been expressed solely in terms of departures from mathematical randomness, disregarding all contextual information on timescales of more than two numbers.
Interpretations based on these static randomization measures are only feasible if the RNG task abides by the assumption of stationarity, which means that these randomization measures are a fitting model for each moment in time. To determine whether randomization and RQA measures can qualify the sequences as stationary, we used multiple computer-generated and human-generated random number sequences of increasing length and subjected these sequences to both classical randomization analyses and recurrence quantification analyses. By partitioning the length-1100 computer-generated and human-generated sequences into smaller non-overlapping windows of length 100, we could gauge the time-dependency (stationarity) of the RNG task.
Generated numbers in the RNG task have a nominal level of information and, therefore, only limited information is gained from a single number, a digram, or a sequence of multiple numbers. Information on RNG performance and, therefore, on the predictability of executive functioning comes from the repetitions of numbers.
Randomization measures, however, quantify redundancy, but do not directly quantify repetitive behavior. Adjacency, for example, is the proportion of digrams with an ordinal sequence of response alternatives, but this tally does not necessarily contain repetitions of the same digram, especially for shorter sequences. Another example is Phi 2, which is a measure of the amount of identical number pairs.
Like adjacency, an increase in Phi 2 does not necessarily indicate repetitions of the same identical number pair. The RQA measures determinism and laminarity, on the other hand, do quantify repetitions of number sequences, including digrams. We already saw that human-generated sequences seem to have a higher combined adjacency score than computer-generated sequences. Determinism, however, seems more comparable between human- and computer-generated sequences. This suggests that the behavior of our participant is not overly repetitive, although she does seem to have a tendency to count during number generation. While counting can be considered stereotyped behavior, it is not necessarily repetitive behavior.
Next, the randomization measures have the tendency to approach a theoretical maximum with increased series length, and this is seen in computer-generated as well as human-generated sequences.
The current RNG data were produced by a healthy volunteer with, presumably, adequate EFs, or at least with an adequate notion of the concept of randomness. Especially in the field of neuropsychology, this is often not the case. Compromised EFs are a common phenomenon in many psychopathological conditions (Snyder et al., 2015; Zelazo, 2020). Putatively, ineffective EFs will yield different number sequences (Delis et al., 2003) and it would be interesting to see how the randomization measures and RQA measures behave for these sequences. Further research is needed in this regard. This brings us to the limitations of the current data. To explore the time-dependency of performance on the random number generation task, we limited ourselves to the analysis of number sequences, since these are the input for all randomization measures. RQA, on the other hand, can easily be adapted to take other information as input for analysis, such as reaction times between numbers or mouse coordinates during random generation. Contrary to random number sequences, these types of information are not measured at a nominal level and might therefore add a different perspective on the dynamics of EF during random number generation. For example, the variability in reaction time patterns might give even more insight into the decision process of our participant.
The present findings might also have implications for the interpretation of many other EF tests, and it is worthwhile to see how the neuropsychological toolbox withstands the assumption of ergodicity.
For example, the Stroop task is among the most commonly used EF tests in the scientific literature in recent years (Baggetta & Alexander, 2016), and this test also happens to be one of the backbones of the factor-analytic studies that reveal executive latent or core constructs (Miyake et al., 2000). The classic Stroop task consists of reading aloud color-words printed in black ink, followed by naming aloud the color of ink in which color-words are written, whilst the color of ink does not match the color-word (incongruent trials). There are many variants of this paradigm, but the rationale is always that an automatic or learned response has to be inhibited for the purpose of the required response (e.g., reading is the automatic response and naming the color of ink is the required response). This inhibition is quantified by the amount of time needed to complete the entire set of incongruent color-words, whereby a faster pace equals better inhibition. In other words, the interpretation of inhibition is based on a model of the average time that is needed to inhibit an incongruent color-word, disregarding the trial-by-trial dynamics of performance (Stephen et al., 2009; Wijnants et al., 2012). This illustration of the Stroop task is also applicable to many other common neuropsychological EF tests like the trail making, Wisconsin Card Sorting, or Tower tasks. These tests need to be adapted to record a critical amount of temporal data (i.e., reaction times or hand/eye movements) to unravel the dynamics of EFs. The neuropsychological toolbox will greatly benefit from a time-series approach to assessing executive functioning, to actually make inferences about how a person goes about doing a task instead of only explaining how good a person is at doing the task.