| | Account Number (range) | PIN number (integer) | Account Balance (decimal) | Account Type (string) |
|---|---|---|---|---|
| Acceptable values | (S) 0812 0000 0000 to 0812 9999 9999, (C) 0829 0000 0000 to 0829 9999 9999, (X) 0799 0000 0000 to 0799 9999 9999 | 0000 - 9999 | -999,999.99 to 999,999.99 | S, C, X |
| record 1 | 0812 0837 0293 | 8493 | -3,123.84 | S |
| record 2 | 0812 6493 8355 | 3558 | 8,438.53 | S |
| record 3 | 0829 7483 0462 | 0352 | 673.00 | C |
| record 4 | 0799 4896 1893 | 4896 | 493,498.49 | X |
The above matrix contains the minimum number of records that would physically represent the acceptable data values. For the Account Number, there is one record for each of the three ranges; all the PIN numbers are within the range specified; there are several different Account Balances, including one that is negative; and there are records for each of the different Account Types. The matrix above is the minimum data; best practice would be to have data values at the limits of each range as well as inside the range (see Guidelines: Test Case).
The advantage of physical representation is that the Test Data is limited in size and manageable, focused on and targeting the acceptable values. The disadvantage, however, is that actual, real-world data is not completely random. Real data tends to have statistical profiles that may affect performance, and these profiles would not be observed when using physical representation.
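As an illustration, the following sketch builds the minimal, physically representative data set from the matrix above and extends it with boundary records at the limits of each range, as the best practice above suggests. The column names and ranges come from the matrix; the record layout and helper names are hypothetical.

```python
# Hypothetical sketch: a minimal, physically representative Test Data set.
from decimal import Decimal

ACCOUNT_RANGES = {
    "S": ("0812 0000 0000", "0812 9999 9999"),
    "C": ("0829 0000 0000", "0829 9999 9999"),
    "X": ("0799 0000 0000", "0799 9999 9999"),
}
PIN_RANGE = ("0000", "9999")
BALANCE_RANGE = (Decimal("-999999.99"), Decimal("999999.99"))

# One record per account-number range (the minimum physical representation) ...
minimal_records = [
    {"account": "0812 0837 0293", "pin": "8493", "balance": Decimal("-3123.84"), "type": "S"},
    {"account": "0812 6493 8355", "pin": "3558", "balance": Decimal("8438.53"), "type": "S"},
    {"account": "0829 7483 0462", "pin": "0352", "balance": Decimal("673.00"), "type": "C"},
    {"account": "0799 4896 1893", "pin": "4896", "balance": Decimal("493498.49"), "type": "X"},
]

# ... plus boundary records at the limits of each range (best practice).
boundary_records = [
    {"account": low, "pin": PIN_RANGE[0], "balance": BALANCE_RANGE[0], "type": acct_type}
    for acct_type, (low, _high) in ACCOUNT_RANGES.items()
] + [
    {"account": high, "pin": PIN_RANGE[1], "balance": BALANCE_RANGE[1], "type": acct_type}
    for acct_type, (_low, high) in ACCOUNT_RANGES.items()
]

test_data = minimal_records + boundary_records
for record in test_data:
    print(record)
```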
Statistical Test Data representation is Test Data that reflects a statistical sampling (at the same percentages) of the production data. For example, using the same data elements as above, if we analyzed the production database and found the account types distributed as shown below, our Test Data, based upon statistical sampling, would include 294 records (as compared to the four we noted above):
| Test Data (at 0.1 percent of production) | Number of Records | Percent |
|---|---|---|
| Total number of records | 294 | 100 |
| Account numbers (S) 0812 0000 0000 to 0812 9999 9999 | 141 | 48 |
| Account numbers (C) 0829 0000 0000 to 0829 9999 9999 | 144 | 49 |
| Account numbers (X) 0799 0000 0000 to 0799 9999 9999 | 9 | 3 |
The above matrix addresses only the account types. In developing the best Test Data based upon statistical representation, you would include all of the significant data elements; in the above example, that would also mean reflecting the actual distribution of account balances.
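A minimal sketch of how such a sample might be drawn, assuming the production data is available as a list of records and that the sampling is stratified by account type. The 0.1 percent rate and the percentages come from the example above; the function and field names are hypothetical.

```python
import random
from collections import defaultdict

SAMPLE_RATE = 0.001  # 0.1 percent of production, as in the example above

def stratified_sample(production_records, key="type", rate=SAMPLE_RATE, seed=42):
    """Draw a sample that preserves the production percentages for `key`."""
    by_stratum = defaultdict(list)
    for record in production_records:
        by_stratum[record[key]].append(record)

    rng = random.Random(seed)  # fixed seed so the Test Data is reproducible
    sample = []
    for stratum, records in by_stratum.items():
        count = round(len(records) * rate)
        sample.extend(rng.sample(records, count))
    return sample

# Hypothetical production profile matching the percentages above: 48% S, 49% C, 3% X.
production = (
    [{"type": "S", "balance": 100.00}] * 141_000
    + [{"type": "C", "balance": 200.00}] * 144_000
    + [{"type": "X", "balance": 300.00}] * 9_000
)
test_data = stratified_sample(production)
print(len(test_data))  # roughly 294 records, matching the matrix above
```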
A disadvantage of the statistical representation is that it may not reflect the full range of acceptable values.
Typically, both methods of identifying Test Data are used to ensure that the Test Data addresses all values and performance / population issues.
Test Data breadth is relevant to the Test Data used as input as well as the Test Data used to support testing (in pre-existing data).
Scope is the relevancy of the Test Data to the test objective, and it is related to depth and breadth. Having a lot of data does not mean it is the right data. As with the breadth of Test Data, we must ensure that the Test Data is relevant to the test objective, that is, that there is Test Data to support our specific test objective.
For example, in the matrix below, the first four Test Data records address the acceptable values for each data element. However, there are no records to evaluate negative balances for account types C and X. Therefore, although this Test Data correctly includes a negative balance (valid breadth), it would be insufficient in scope to support any testing that uses negative account balances for each account type. Expanding this data to include additional records, with negative balances for each of the different account types, would be necessary to address this oversight (a simple coverage check, like the sketch following the matrix, can reveal such a gap).
| | Account Number (range) | PIN number (integer) | Account Balance (decimal) | Account Type (string) |
|---|---|---|---|---|
| Acceptable values | (S) 0812 0000 0000 to 0812 9999 9999, (C) 0829 0000 0000 to 0829 9999 9999, (X) 0799 0000 0000 to 0799 9999 9999 | 0000 - 9999 | -999,999.99 to 999,999.99 | S, C, X |
| record 1 | 0812 0837 0293 | 8493 | -3,123.84 | S |
| record 2 | 0812 6493 8355 | 3558 | 8,438.53 | S |
| record 3 | 0829 7483 0462 | 0352 | 673.00 | C |
| record 4 | 0799 4896 1893 | 4896 | 493,498.49 | X |
| New Record 1 | 0829 3491 4927 | 0352 | -995,498.34 | C |
| New Record 2 | 0799 6578 9436 | 4896 | -64,913.87 | X |
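As a sanity check on scope, a short sketch along these lines could confirm that every account type has at least one negative-balance record before testing begins. The field names follow the matrix above; the check itself is a hypothetical illustration.

```python
# Hypothetical scope check: is there a negative balance for every account type?
from decimal import Decimal

test_data = [
    {"account": "0812 0837 0293", "balance": Decimal("-3123.84"), "type": "S"},
    {"account": "0812 6493 8355", "balance": Decimal("8438.53"), "type": "S"},
    {"account": "0829 7483 0462", "balance": Decimal("673.00"), "type": "C"},
    {"account": "0799 4896 1893", "balance": Decimal("493498.49"), "type": "X"},
    {"account": "0829 3491 4927", "balance": Decimal("-995498.34"), "type": "C"},
    {"account": "0799 6578 9436", "balance": Decimal("-64913.87"), "type": "X"},
]

required_types = {"S", "C", "X"}
types_with_negative = {r["type"] for r in test_data if r["balance"] < 0}
missing = required_types - types_with_negative
if missing:
    print(f"Scope gap: no negative-balance record for account type(s) {missing}")
else:
    print("Every account type has at least one negative-balance record.")
```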
Test Data scope is relevant to the Test Data used as input as well as the Test Data used to support testing (in pre-existing data).
The physical structure of Test Data is relevant only to any pre-existing data used by the target-of-test to support testing, such as an application's database or rules table.
Testing is not executed once and finished. Testing is repeated within and between iterations. In order to consistently, confidently, and efficiently execute testing, the Test Data should be returned to its initial state prior to the execution of test. This is especially true when the testing is to be automated.
Therefore, to ensure the integrity, confidence, and efficiency of testing, it is critical that the Test Data be free of all external influences, and that its state be known at the start of, during, and at the end of test execution. There are two issues that must be addressed in order to achieve this test objective: keeping the Test Data stable and isolated from outside influences, and establishing the initial state of the Test Data at the start of test execution.
Each of these issues will affect how you manage your test database, design your test model, and interact with other roles.
Test Data may become unstable when it is exposed to influences outside the test effort, for example when other testers, developers, or systems can access and modify the same data. To maintain the confidence and integrity of testing, the Test Data should be highly controlled and isolated from these influences; strategies to ensure the Test Data is isolated include dedicating a separate test database or test environment to the test effort.
The other Test Data architecture issue that must be addressed is the initial state of the Test Data at the beginning of test execution. This is especially true when test automation is being used. Just as the target-of-test must begin test execution in a known, desired state, so too must the Test Data. This contributes to repeatability and to the confidence that each test execution starts from the same state as the previous one.
Four strategies are commonly used to address this issue: data refresh, data re-initialization, reversing the changes, and data roll forward.
Each is described in greater detail below.
The method used will depend upon several factors, including the physical characteristics of the database, the technical competence of the testers, the availability of external (non-test) roles, and the target-of-test.
The most desirable method of returning Test Data to its initial state is Data Refresh. This method involves creating a copy of the database in its initial state and storing it. Upon the completion of test execution (or prior to the execution of the test), the archived copy of the test database is copied into the test environment for use. This ensures that the initial state of the Test Data is the same at the start of each test execution.
An advantage of this method is that data can be archived in several different initial states. For example, Test Data may be archived at its end-of-day state, end-of-week state, end-of-month state, and so on. This gives the tester a method of quickly refreshing the data to a given state to support a test, such as testing of the end-of-month use case(s).
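As a rough sketch, if the test database were a single file (a SQLite database, say), a data refresh could be as simple as copying an archived baseline over the working copy before each test run. The file names, paths, and state names below are hypothetical.

```python
import shutil
from pathlib import Path

# Hypothetical archive of baselines, one per desired initial state
# (end-of-day, end-of-week, end-of-month, ...).
ARCHIVE_DIR = Path("test_data_archive")
WORKING_DB = Path("test_env/accounts.db")

def refresh_test_data(initial_state: str = "end_of_day") -> None:
    """Overwrite the working test database with an archived baseline copy."""
    baseline = ARCHIVE_DIR / f"{initial_state}.db"
    if not baseline.exists():
        raise FileNotFoundError(f"No archived baseline for state {initial_state!r}")
    shutil.copyfile(baseline, WORKING_DB)

# Before (or after) each test run:
# refresh_test_data("end_of_month")
```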
If data cannot be refreshed, the next best method is to restore the data to its initial state through some programmatic means. Data re-initialization relies on special use cases and tools to return the Test Data to its initial values.
Care must be taken to ensure all data, relationships, and key values are returned to their appropriate initial value to ensure that no errors are introduced into the data.
One advantage of this method is that it can support the testing of invalid values in the database. Under normal conditions, invalid data values would be trapped and not allowed entry into the data (for example, by a validation rule in the client). However, another actor may affect the data (for example, an electronic update from another system). Testing needs to verify that invalid data is identified and handled appropriately, independent of how it occurs.
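A minimal sketch of programmatic re-initialization, assuming a SQLite test database and an `accounts` table (both hypothetical, as are the initial values): the script empties the affected table and reloads the known initial values, which is also how a deliberately invalid value could be planted for the kind of testing described above.

```python
import sqlite3

# Hypothetical initial values; note the deliberately invalid account type "Z",
# which a client-side validation rule would normally have trapped.
INITIAL_ROWS = [
    ("0812 0837 0293", "8493", "-3123.84", "S"),
    ("0829 7483 0462", "0352", "673.00", "C"),
    ("0799 4896 1893", "4896", "493498.49", "X"),
    ("0812 1111 2222", "0000", "10.00", "Z"),  # invalid on purpose
]

def reinitialize(db_path: str = "accounts.db") -> None:
    """Return the accounts table to its known initial values."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS accounts "
            "(account TEXT PRIMARY KEY, pin TEXT, balance TEXT, type TEXT)"
        )
        conn.execute("DELETE FROM accounts")  # drop whatever the last test left behind
        conn.executemany("INSERT INTO accounts VALUES (?, ?, ?, ?)", INITIAL_ROWS)

# Before each test execution:
# reinitialize()
```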
A simple method of returning data to its initial state is to "reverse the changes" made to the data during the test. This method relies upon using the target-of-test to enter reversing entries, that is, re-adding records or values that were deleted, restoring modified records or values to their original values, and deleting data that was added.
There are risks associated with this method, however. In particular, reversing entries may not restore system generated values, such as database keys, indices, and pointers, to their original values.
If this is the only method available in your test environment, avoid using database keys, indices and pointers as the primary targets for verification. That is, for example, use the Patient Name field to determine if the patient was added to the database instead of using a system generated Patient ID number.
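One way to picture the "reverse the changes" approach: the test records each change it makes, and afterwards a compensating entry is applied for each one in reverse order. The sketch below is hypothetical and, following the advice above, keys the records on the account number (a natural field) rather than on a system generated identifier.

```python
# Hypothetical sketch of "reversing the changes" against an in-memory stand-in
# for the test database, keyed by account number.
accounts = {
    "0829 7483 0462": {"pin": "0352", "balance": "673.00", "type": "C"},
    "0799 4896 1893": {"pin": "4896", "balance": "493498.49", "type": "X"},
}

change_log = []  # filled in by the test as it runs

# --- during the test --------------------------------------------------------
accounts["0812 5555 6666"] = {"pin": "1234", "balance": "0.00", "type": "S"}
change_log.append({"op": "add", "account": "0812 5555 6666"})

change_log.append({"op": "modify", "account": "0829 7483 0462",
                   "old": dict(accounts["0829 7483 0462"])})
accounts["0829 7483 0462"]["balance"] = "0.00"

change_log.append({"op": "delete", "account": "0799 4896 1893",
                   "old": accounts.pop("0799 4896 1893")})

# --- after the test: apply compensating entries in reverse order ------------
for entry in reversed(change_log):
    if entry["op"] == "add":        # delete what the test added
        del accounts[entry["account"]]
    elif entry["op"] == "modify":   # restore the modified record
        accounts[entry["account"]] = entry["old"]
    elif entry["op"] == "delete":   # re-add what the test deleted
        accounts[entry["account"]] = entry["old"]

print(accounts)  # back to the initial two records with their original values
```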
Data roll forward is the least desirable method of addressing the initial state of the Test Data. In fact, it doesn't really address the issue. Instead, the state of the data at the completion of test execution becomes the new initial state of the Test Data. Typically, this requires modifying the Test Data used for input and / or the Test Cases and Test Data used for the evaluation of the results.
There are some instances when this is necessary, for example at month-end. If there is no archive of the data just prior to month-end, then the Test Data and Test Scripts from each day and week must be executed to "roll forward" the data to the state needed for the test of the month-end processing.
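A rough sketch of a roll-forward, assuming the daily processing performed by the Test Scripts can be replayed programmatically (all names and the interest calculation below are hypothetical): the previously executed processing is repeated in order until the data reaches the state needed for the month-end test.

```python
# Hypothetical roll-forward: replay each day's processing so the data ends up
# in the state needed for the month-end test.

def run_daily_processing(accounts):
    """Stand-in for the real daily Test Script; here it just posts daily interest."""
    for account in accounts.values():
        account["balance"] = round(account["balance"] * 1.0001, 2)
    return accounts

def roll_forward(accounts, days_in_month=30):
    # The resulting end-of-month state becomes the new "initial" Test Data.
    for _ in range(days_in_month):
        accounts = run_daily_processing(accounts)
    return accounts

test_data = {
    "0812 0837 0293": {"balance": -3123.84},
    "0829 7483 0462": {"balance": 673.00},
}
month_end_data = roll_forward(test_data)
print(month_end_data)
```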
Risks associated with this method include the ongoing effort of modifying the Test Data, Test Cases, and Test Scripts to match each new initial state, and the loss of a known, repeatable starting point for test execution.