Data Collection Part Deux

More of the back story – technology

SQL Server, Windows Server, AS/400s, DB2, desktops, Oracle, and Teradata; for anything beyond that I would have to dig through archived resumes.  The security of the data was always critical across all of these systems.  Data portability is problematic when the data being ported is unencrypted.  WebMD applies a high level of analysis to making sure the data output is fully de-identified.  HIPAA regulations require this analysis as well as ongoing HIPAA training.

[Chart: 2015 data breach causes, from the HIPAA Journal]

From the operations perspective, technical layers are employed to secure information internally.  As with any structure, every opening presents a weak point in the wall.  Often overlooked is the additional door created by the technology used to make those openings; a fancy way of saying backdoors into security.  Hackers attack both: brute force aimed at the front door, and direct attacks against the security layer itself.

Prior to the additional security the cloud now provides, one project required an MS SQL Server to be placed outside in the DMZ.  This was a new requirement placed on the company due to a change in regulations.  At the time, I was the programming manager, SharePoint administrator, and DBA.  At small companies, you wear a few hats.

Well, the server was placed in the DMZ with an encrypted connection over a VPN, custom ports, and a private identity.  Even using the best practices of the time, the server still took abuse.  Monitoring of the exposed areas was included as part of the disaster recovery plan, set up with both weekly reports and threshold alarms.

The structure of the VPN required an approved interface that ran on an Apache server running Java.  SQL Server 2012 was the state of the environment, with 2014 on the upgrade path.  The connection between the Java interface and SQL Server was a dedicated user account through the VPN.  The process also involved a combination of listeners scheduled through server jobs and SSIS packages that fired when content arrived.
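
As a rough illustration only, here is the shape such a listener might take: a procedure a SQL Server Agent job runs on a schedule, polling a landing table and handing new rows off for processing.  The object names (Staging.InboundContent, dbo.usp_ProcessInboundBatch) are invented for the sketch, not the actual project objects.

-- Hypothetical listener, run on a schedule by a SQL Server Agent job.
-- Object names are illustrative only.
CREATE PROCEDURE dbo.usp_PollInboundContent
AS
BEGIN
    SET NOCOUNT ON;

    -- Only do work when new content has actually arrived.
    IF EXISTS (SELECT 1 FROM Staging.InboundContent WHERE ProcessedFlag = 0)
    BEGIN
        -- Hand the new batch to the processing routine
        -- (in the original setup this step kicked off an SSIS package).
        EXEC dbo.usp_ProcessInboundBatch;

        -- Mark the batch as handled so the next poll skips it.
        UPDATE Staging.InboundContent
        SET    ProcessedFlag = 1
        WHERE  ProcessedFlag = 0;
    END
END;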

The structure of the VPN exposed a vulnerability: the SQL instance was required to expose a public login, and that port was a main point of attack.  It was interesting to review the security log, especially the login attempts.  The variety of username and password combinations was impressive and, at times, amusing.  Every pass of an attack started by hitting the manufacturer or provider default administrator accounts and passwords.
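
For anyone curious, something along these lines is one way to pull the failed login attempts out of the SQL Server error log, assuming failed-login auditing is enabled (it is by default); the exact entries vary by version and configuration.

-- Read the current SQL Server error log (log 0, type 1 = SQL Server log)
-- and filter the entries down to failed login attempts.
EXEC master.dbo.xp_readerrorlog 0, 1, N'Login failed';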

Every device exposed to the net gets hit with default attacks.  I shudder to think about all of the personal wireless routers out there still using default passwords, open to both data and bandwidth theft.

Securing Data before it leaves the barn…

We apply a variety of layers to the data before we send it through the pipe.  The data is encrypted, and the connection is encrypted and routed through a VPN.  This sounds like a secure arrangement, and it is.  Once the data is in the pipe, it's out of reach.  The additional weak point in this scenario is the endpoint.  What data is sent, and how it will be used, can put that data in a less secure environment.

The endpoint is where the information gets used, input, and processed.  It may start out within a secure application and then get downloaded as a CSV to be used in an Excel report.  As the data moves further from the original secure source, it becomes more vulnerable and more likely to be mishandled or lost.  One copy?  That’s easy to track.  Multiple copies of multiple time frames?  Much harder.

So if the data gets out, can the impact be minimized before the information is even exposed?  Leaving behind a USB drive happens.  They are small and unlabeled.  Who needs a label?  You just plug it in, right?  A report gets left on a bench, or a USB drive at a convention.  Data gets exposed.  Protect it before it gets out.

De-Identification of Data

This is the first area to inspect when releasing data.  What is the minimum information needed to satisfy the request?  If it is not needed, do not send it.  Too often the “client record” is exposed with all of its fields.  What happens when the output requires specific census details?  This is where de-identification plays an important role.  What about when the endpoint needs all the information?  Using de-identification and minimizing core data transmission also help with endpoint responsiveness.

The minimum information needed to satisfy the request

The business need drives what data must be presented.  Here is a scenario that presents some of the points to consider:

You need to take home customer records to prepare a report. You have access to the following data structure:

Tables: Person joined to ContactDetails on PersonId, and ContactDetails joined to AddressToContacts on AddressId
[Diagram: The customer record]

These tables were chosen because they contained all the information needed.  The query for the report ended up as:

[Image: sample query]
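
The original query was captured as an image; a rough reconstruction of the shape it took, using the table names above and assumed column names, would look something like this:

-- Illustrative reconstruction of the report query; column names are assumed.
SELECT  p.PersonId,
        p.FirstName,
        p.LastName,
        cd.PhoneNumber,
        cd.EmailAddress,
        atc.AddressId
FROM    Person AS p
        INNER JOIN ContactDetails AS cd
            ON cd.PersonId = p.PersonId
        INNER JOIN AddressToContacts AS atc
            ON atc.AddressId = cd.AddressId;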

Pretty straightforward set of data.  The “PersonDetails” table was left out because that information was not needed.

[Image: top segment, Person data]

The first question to ask about this segment: what do we need for the report?  Is granular information like the name needed, or would the PersonId be sufficient?  What is the big deal?  Name and address are in the public domain, so why not include them anyway?  In this data set, both the address and the person are in the public domain.  What isn’t in the public domain is how much rent the person is paying, or that they are associated directly with that address.

Using the PersonId instead of the census details preserves the anonymity of the relationship.  Basically, de-identification boils down to removing data points that can lead to discovery through reverse engineering.  A medical data report can end up violating HIPAA using only generic types like medical condition, employee type, and location.  That seems fine; no names are used.  It is, until there is only one employee of that type at a particular small location.  By cross-referencing the public company directory with location and employee type, you discover that person’s condition.  The identifying power of granular data is inversely related to the size of the population it describes.
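
A quick way to spot that risk before a data set goes out is to count how many people share each combination of the “generic” fields; any combination with only a handful of members is a re-identification candidate.  A sketch, with invented table and column names:

-- Flag combinations of generic attributes that are too small to stay anonymous.
-- Table and column names are illustrative.
SELECT  Location,
        EmployeeType,
        MedicalCondition,
        COUNT(*) AS GroupSize
FROM    dbo.MedicalReportExtract
GROUP BY Location, EmployeeType, MedicalCondition
HAVING  COUNT(*) < 5;   -- any group this small can likely be reverse engineered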

While de-identifying data is used to obscure specific identifying markers, limiting what is sent down to the minimum needed also benefits performance.

Minimizing re-transmission of core data

Even on a secure private intranet, minimizing the data that gets sent, cached, and used in subsequent transactions is important from a performance perspective, especially in a wireless environment.  Wireless is a requirement today, which makes this an even higher priority.

It is a given that a certain set of data is needed to complete the transaction.  This data set may be relatively dense, such as an application profile.  More often, however, the flow follows an initial large input followed by incremental updates that affect the state of the object.

Transactions impact the state of some objects continually, and others little or not at all.  These slowly changing facts and dimensions should be cached when they are first accessed.  For example, an address detail record is unlikely to change during the session, but it is referenced several times throughout the workflow.  Cache it at the beginning of the workflow and release it at the end.

What is needed between workflow steps?  This slowly changing information is needed for the workflow but is not impacted by it.  Unless a workflow step needs the details for its transaction, the only data point needed is the clustered index or surrogate key that points back to the detail.  Using data pointers instead of the full record reduces the surface area that can be exposed.
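
In practice that means a workflow step carries and records only the key, not the whole detail record.  A minimal sketch, with hypothetical table and column names:

-- Between workflow steps, only the surrogate key travels with the transaction.
-- The full address detail stays cached at the endpoint; names are illustrative.
DECLARE @SessionId UNIQUEIDENTIFIER = NEWID();
DECLARE @AddressId INT = 42;   -- the pointer carried by the workflow, not the record itself

INSERT INTO dbo.WorkflowTransaction (SessionId, StepNumber, AddressId, RecordedAt)
VALUES (@SessionId, 3, @AddressId, SYSDATETIME());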

The minimized record results in a very small packet that improves overall responsiveness.  Large, frequently accessed BLOB objects such as images should all be cached at the beginning, even if the page that needs them is several steps into the process.  This keeps workflow performance responsive.  It is also key to effective database interaction.

Data…data, but who gets the tables?

I’ve worked my way through development and operations throughout my career.  Developers want to touch everything, and DBAs want nothing touched unless they do the touching.  Where there is conflict, solutions emerge.  Today, it is an open market for choice.  The focus here is on a relational database model and object-oriented programming using SOLID.

The ability to extend feature support dynamically and to add new core objects were requirements.  What is the minimum data necessary to represent the core object?  Of those data points, which items are static, which may change, and how often?

Take associating a phone number with an individual.  Then add an email address and perhaps a physical address.

We know a person may have one or more phone numbers and email addresses.  These items change through updates, deletes, and adds.  When you modify a database table to extend support for a feature, the effort is costly in both expense and time.  Mainframe systems used wide, flat tables with fields for each attribute.  Relational systems remove the duplication.

Back to minimizing the structure.  It needs to be extensible, and the core objects also need to be flexible enough to support new classifications.  Where else are the phone, email, and address attributes used?  The address attribute could easily be associated with other core objects.  The address object gets centralized, and mapping tables are created to record the relationships.  A mapping table built on the clustered indexes of both tables provides an optimized filter focused on the relationship between the record sets.
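
A minimal sketch of that kind of mapping table, assuming Person and Address both carry integer surrogate keys (the names are illustrative):

-- Mapping table relating people to addresses; the composite clustered key
-- keeps lookups from the Person side efficient.
CREATE TABLE dbo.PersonAddressMap
(
    PersonId   INT NOT NULL,   -- surrogate key from the Person table
    AddressId  INT NOT NULL,   -- surrogate key from the centralized Address table
    CONSTRAINT PK_PersonAddressMap PRIMARY KEY CLUSTERED (PersonId, AddressId)
);

-- A second index supports filtering from the Address side.
CREATE NONCLUSTERED INDEX IX_PersonAddressMap_Address
    ON dbo.PersonAddressMap (AddressId, PersonId);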

The final step is to define a dynamic object model to support the extensibility requirement.  We have to store values of specific data types along with metadata identifying what each value represents and its state.

Stepping through the underlying dynamic model: entities are composed of attributes storing data of a specific type.  These attributes may change or be added to.  The stored information can be represented as virtual tables through views.

Object names (entities and fields) are stored with additional metadata in the label table.  Each stored label is qualified by a label type.  In addition to a name and description, a label state replaces delete: once a record is instantiated, it can only be deactivated, and that state applies forward from the time it was changed.  The state change is recorded in the transaction warehouse.  Attribute values that change throughout the life of the object also need to be reflected in the transaction warehouse.  This transactional storage model applies to all structures in the design.
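
As a sketch of what the label table might look like (column names and types are assumptions, not the production schema):

-- Labels name the entities and fields in the dynamic model.
-- IsActive replaces delete: records are only ever deactivated.
CREATE TABLE dbo.Label
(
    LabelId      INT IDENTITY(1,1) NOT NULL,
    LabelTypeId  INT            NOT NULL,  -- qualifies what kind of label this is
    Name         NVARCHAR(100)  NOT NULL,
    Description  NVARCHAR(400)  NULL,
    IsActive     BIT            NOT NULL CONSTRAINT DF_Label_IsActive DEFAULT (1),
    ModifiedAt   DATETIME2(0)   NOT NULL CONSTRAINT DF_Label_ModifiedAt DEFAULT (SYSDATETIME()),
    CONSTRAINT PK_Label PRIMARY KEY CLUSTERED (LabelId)
);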

The label type table plays a more significant role and begins to build up the model.  The type defines what kind of data the label represents.  The type table also supports hierarchical grouping; this second piece is part of the organizational structure defining where the label fits in the data model.

These two elements are related through a mapping table as well; using the mapping model preserves extensibility.  Next, a definition of the value the label represents is needed.  This is separate from how the value is stored.  The definition tells the request how the stored value should be parsed.  Additional filters can be applied to the definition, such as data ranges, value type, and access restrictions.  This helps enforce data integrity.
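
A rough sketch of that definition layer, with invented names, showing how parsing rules and integrity filters might hang off it:

-- The definition describes how a stored value should be parsed and validated.
-- Column names and the set of filters shown are illustrative.
CREATE TABLE dbo.LabelDefinition
(
    LabelDefinitionId INT IDENTITY(1,1) NOT NULL,
    LabelId           INT            NOT NULL,  -- which label this definition applies to
    ValueType         NVARCHAR(30)   NOT NULL,  -- e.g. 'int', 'decimal', 'string', 'date'
    MinValue          DECIMAL(18,4)  NULL,      -- optional range filter
    MaxValue          DECIMAL(18,4)  NULL,
    AccessLevel       TINYINT        NOT NULL CONSTRAINT DF_LabelDefinition_Access DEFAULT (0),
    IsActive          BIT            NOT NULL CONSTRAINT DF_LabelDefinition_IsActive DEFAULT (1),
    CONSTRAINT PK_LabelDefinition PRIMARY KEY CLUSTERED (LabelDefinitionId)
);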

Finally, the information is stored in a transaction log that reflects current state.  The transaction log stores a reference to the label, type, and definition records.  It also contains fields that act as buckets for specific value types.  All types used by the application are covered except for MAX fields; those are stored in a separate structure to keep the transaction log performant.
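
Something along these lines, with assumed names, for the typed value buckets (the MAX types deliberately live elsewhere):

-- Transaction log holding current state; one bucket per value type.
-- NVARCHAR(MAX)/VARBINARY(MAX) values are kept in a separate structure
-- so this table stays narrow and fast.  Names are illustrative.
CREATE TABLE dbo.LabelTransaction
(
    LabelTransactionId BIGINT IDENTITY(1,1) NOT NULL,
    ObjectId           INT            NOT NULL,  -- the entity instance the value belongs to
    LabelId            INT            NOT NULL,
    LabelTypeId        INT            NOT NULL,
    LabelDefinitionId  INT            NOT NULL,
    IntValue           INT            NULL,
    DecimalValue       DECIMAL(18,4)  NULL,
    DateValue          DATETIME2(0)   NULL,
    StringValue        NVARCHAR(400)  NULL,
    RecordedAt         DATETIME2(0)   NOT NULL CONSTRAINT DF_LabelTransaction_RecordedAt DEFAULT (SYSDATETIME()),
    CONSTRAINT PK_LabelTransaction PRIMARY KEY CLUSTERED (LabelTransactionId)
);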

Using this structure, object views are created to represent the entities defined in the structure tables.  When new features or updates require changing the structure, it is accomplished by adding or updating records in the definition tables and adjusting or adding views.  All labels are exposed for easy translation to support regionalization.
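
A hedged example of the kind of object view this produces, building on the illustrative tables sketched above and pivoting the label rows back into a familiar shape (label names are hypothetical, and the view assumes one current row per object and label):

-- Present the dynamic storage as a conventional "entity" for consumers.
CREATE VIEW dbo.vw_Employee
AS
SELECT  t.ObjectId,
        MAX(CASE WHEN l.Name = 'FirstName' THEN t.StringValue  END) AS FirstName,
        MAX(CASE WHEN l.Name = 'HireDate'  THEN t.DateValue    END) AS HireDate,
        MAX(CASE WHEN l.Name = 'Salary'    THEN t.DecimalValue END) AS Salary
FROM    dbo.LabelTransaction AS t
        INNER JOIN dbo.Label AS l
            ON l.LabelId = t.LabelId
WHERE   l.IsActive = 1
GROUP BY t.ObjectId;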

Separating the application from the tables

Databases are built to store and maintain data; their native user interface is a query window and SQL script.  The application presents the data transformed for the workflow and user requirements.  This transformation occurs at the database API layer, comprised of stored procedures that support atomic transactions.  Optimally, the process is designed to support asynchronous transactions.

At the interface layer, the procedures can be single purpose or designed to support overloading.  Typically it is a mix, where single-purpose procedures are reserved for specific critical tasks.  Cached data is used to maintain any static detail at the endpoint.  The message packet is comprised of the key map that identifies the values representing the update.  Based on the key map submitted, the procedure processes the data according to the key signature.

The workflow determines whether the application waits for a return message from the procedure indicating the transaction has been recorded.  This return message is typically the updated state of the information, to support the workflow’s confirmation process or to supply data required for the next step.  The parameters for the procedures, along with their names and descriptions, support discovery.  The object definition model also carries down to the supporting procedures, allowing them to be generalized.
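
A minimal sketch of one such API-layer procedure, assuming the invented tables above: the key map arrives as parameters, the body stays atomic, and the recorded state is returned for confirmation.

-- API-layer procedure: the application sends only the key map and the new value.
-- Names build on the illustrative tables sketched earlier.
CREATE PROCEDURE dbo.usp_RecordLabelValue
    @ObjectId          INT,
    @LabelId           INT,
    @LabelTypeId       INT,
    @LabelDefinitionId INT,
    @StringValue       NVARCHAR(400) = NULL,
    @DecimalValue      DECIMAL(18,4) = NULL,
    @DateValue         DATETIME2(0)  = NULL
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;   -- keep the transaction atomic on any error

    BEGIN TRANSACTION;

    INSERT INTO dbo.LabelTransaction
        (ObjectId, LabelId, LabelTypeId, LabelDefinitionId,
         StringValue, DecimalValue, DateValue)
    VALUES
        (@ObjectId, @LabelId, @LabelTypeId, @LabelDefinitionId,
         @StringValue, @DecimalValue, @DateValue);

    COMMIT TRANSACTION;

    -- Return the recorded state so the workflow can confirm the update.
    SELECT ObjectId, LabelId, StringValue, DecimalValue, DateValue, RecordedAt
    FROM   dbo.LabelTransaction
    WHERE  LabelTransactionId = SCOPE_IDENTITY();
END;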

Now, the rest of the story…

We have devices that store gigabytes and terabytes of information.  We try to send large objects around without worry, only to find that our email program rejected them.  Security doesn’t mean sacrificing performance.  Reducing your data footprint is a good thing: the less you send, the faster it goes.

Wireless is the way we mostly connect, and wireless networks are more limited.  Sending less in each message and caching large objects keeps traffic off the network.  The data can be stored and retrieved in a de-identified format to support security and to comply with regulations around PII and PHI.

 

 
