Quick and dirty data wiping

How to wipe a disk with pre-determined bit patterns

Many of you who know me may have heard of my ‘mythical’ data wiping script, which I have maintained can be done in just a few lines of shell script on pretty much *any* UNIX box. Well, here it is:-

for pattern in 85 170 ; do
  awk -v p=$pattern 'END {while (1) printf("%c",p);};' \
    < /dev/null > /path/to/device/or/file
done

The patterns 85 and 170 represent 01010101 and 10101010 in 8-bit binary. These patterns can be replaced with any sequence which can be generated or pre-defined prior to the wiping run.

The awk command converts the decimal value of $pattern into the corresponding byte and writes it repeatedly. Output continues until the target device reaches the end of its media or the output file fills the containing filesystem.

Please use with caution: my overly simplified version does not check what it is writing to, and I accept no responsibility for damages arising as a result of using this script or any derivative works.
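If you want a marginally safer variant, a small wrapper can at least force the target to be named explicitly and confirmed before anything is written. Here is a minimal sketch; the wrapper, its prompt, and the variable names are my own additions rather than part of the original one-liner:-

#!/bin/sh
# Hypothetical wrapper around the wipe loop - refuses to run without explicit confirmation.
wipe_target="$1"                       # block device or file to overwrite
[ -n "$wipe_target" ] || { echo "usage: $0 /path/to/device/or/file" >&2; exit 1; }
[ -e "$wipe_target" ] || { echo "$wipe_target does not exist" >&2; exit 1; }
printf "About to overwrite %s - type YES to continue: " "$wipe_target"
read answer
[ "$answer" = "YES" ] || { echo "aborted" >&2; exit 1; }
for pattern in 85 170 ; do
  # The END block runs after the (empty) input; the loop stops when the target is full.
  awk -v p=$pattern 'END {while (1) printf("%c",p);}' \
    < /dev/null > "$wipe_target"
done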

Search Engines and Privacy

Have you ever wondered how search engines actually make their money from advertising?

Have you ever had privacy concerns over the search terms you use?

Have you ever been freaked out by how well targeted modern advertising is?

If you can say yes to any of these – then read on! Otherwise, still read on for an eye-opener.

Search engines only exist because there is a financial model behind them which is, naturally, there to generate profit. So how do they make that profit? Since search and its associated services are free to end-users, search engines primarily make their money from advertising, and they achieve this through three primary techniques:-

  1. Provide space for adverts on the site and rent them out.
  2. Extend the scope of adverts through syndication schemes, embedded content, and ‘like’ buttons.
  3. Sell your data (search terms and the IP addresses they come from, along with unique browser IDs held in cookies, etc.) to 3rd parties.

Through the many different tracking technologies available, it is easy to identify an Internet user and build up a profile of them. This data is collected in the form of search engine logs (on their servers), which can then be analysed either in real time or at a later date. The statistical analysis provides deep insight into what other products and services might be of interest to you, enabling targeted marketing; however, there can be a far more sinister use for this data too.

This data can be used to profile a person in order to find out things such as:-

  • Your name and any aliases (such as ‘internet’ names and previous names)
  • Birthday
  • Location
  • Address
  • Telephone Numbers
  • Email Addresses
  • Your interests
  • Tastes in music, clothes, and also food and drink
  • Your faith and beliefs
  • Your friends and associations
  • Your spending habits
  • Where you like to go
  • Times when you are not at home
  • Your car make, model, and registration
  • Where you work and what you do for a living
  • Your thoughts and feelings
  • Pictures of you in places and with people

All of these are used to profile you and place you within a demographic classification, which can then be identified and targeted for a number of uses, including advertising.

How are these stats collected? The user normally gives them away willingly, without a second thought, through sources such as your favourite search engine along with Facebook, Twitter, Flickr, and other social media; the data is then tied together into continuous sessions using tracking cookies.

Many social media features, such as facial recognition on Facebook, enable people to be accurately associated with others; this is another social engineering danger, which usually starts with “do you know ‘so-and-so’?”

Many web pages register your presence with syndication partners merely because you viewed the page. For example, Google and Facebook are informed of your visit every time you view a page carrying some (but not all) of their ‘like’ buttons, so in many cases you may be tracked even if you never click the button. The same goes for YouTube videos embedded on sites other than YouTube itself: YouTube (and thus Google) will know you visited even if you don’t watch any videos, because the page links back to YouTube in order to provide the ‘player’ for that shared content. In another example, Amazon’s affiliate advertising syndication could be used to track users across pages which show Amazon affiliate adverts.

An interesting aspect of social tracking is that if you use your friend’s wireless network with your own equipment, your tracking session can be seen moving to an IP address already associated with other known tracking sessions; this makes it possible to track your physical movements and correlate you with others simply because you shared the same IP at the same time. Your iPhone or Android goes with you everywhere, and wherever there is wireless configured (with the correct password) it will use it and tell on you. To be fair, the telephone companies can do this far more easily through 3G phone networks, but mere proximity does not positively prove a relationship, unlike sharing an IP.

It is also possible for ISPs to proxy your web traffic in order to cache content and deliver a faster network; this too can be used as a source of data.

Many people simply ‘like’ products, creating an association which allows others to gauge their persona, because when you ‘like’ a product on Facebook it tells all of your friends (or at least your ‘friends’ on Facebook). This could then be used in social engineering attacks against you. Google tracks you simply for viewing a page with a Google+ button on it.

Google is so hard to avoid that you need their “opt-out” plugin to escape them, which, as discussed before, comes with an auto-update program that still ‘phones home’.

So, by using social, open-source, and purchased data from many different sources such as Facebook, Twitter, MySpace, Google, Bing, etc., it is easily possible to build an accurate profile of a person, their relationships, and their surfing habits, revealing insight which can be used to target them for one purpose or another. This is exactly what advertisers are after, and it is why this data has value.

While this data has value, it is also private data about individuals, and that includes your data too; a few traces will reveal little about you, but long-term traces can reveal far more than you realise. How often do you clear your cookie cache? And do you have ‘Do Not Track’ set on your web browser?

A recent row has broken out between online advertisers and Microsoft, who have taken the bold step of enabling ‘Do Not Track’ by default in their latest web browser, which, if you’re not aware, tells websites and advertisers that you do not want to be tracked. The advertisers are up in arms, and that tells you a lot about the value of the data.

To protect yourself from this form of personal data leakage, you should choose your web browser and search engines wisely. I am currently using SRWare Iron as a web browser because it has all the power and prowess of Google Chrome with the Google phone-home features taken out and a few safety features enabled by default. I use duckduckgo.com as a search engine because it supports encrypted connections over https, it does not record your IP address in its logs, and it discards your results after 2 days. Duckduckgo.com also has a search portal within Tor, and its operators are strong advocates of internet privacy.

See duckduckgo’s privacy statement here:-

https://www.duckduckgo.com/privacy.html

Another search engine worth considering is ixquick, whose privacy policy can be seen here:-

https://ixquick.com/eng/privacy-policy.html

For comparison, here’s Google’s privacy statement:-

https://www.google.co.uk/intl/en/policies/privacy/

Wow! Need I say more?

It is worth noting that all searches done using standard http can be recorded ‘on-the-wire’ by anyone who is monitoring the traffic. This includes all searches, the returned content, and modifications to those searches, often character by character where auto-fill offers search suggestions. For this reason, it is always worth using a search engine which supports https, as this will stop a degree of (but not all) snooping on the wire.

Another point worth noting is that you often lose protection the second you leave the encrypted search engine page, because you then give away the site which the search engine led you to. Many search engines track this information as well, adding to their knowledge base not only what you searched for but also which links you clicked on. In short, an encrypted search does not protect you beyond the initial search itself: once you click through to your intended site, you reveal your next destination.

So, now knowing the extent of browser tracking, I encourage you to consider this when surfing the Internet, and to take measures to protect your privacy.

Backups: Part 5 – Process Dependencies – Databases Example

It is common for the backup of a database system to be de-coupled from the tape backup, creating a risk that an over-run or failure of the database backup schedule will go undetected by the subsequently scheduled tape backup.

I recommend that the database and tape backups always be coupled such that they can be considered one composite job where the backup is dependent upon the application being in a suitable state for a backup.

If scheduled from the UNIX cron scheduler or from the application’s own scheduler, the job should call a process which initiates the backup once the system is known to be in a suitable state, as defined within the backup script.

If scheduled from the backup software, the database dump, quiesce, or shutdown should be scheduled as a backup pre/post command.
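For the cron-scheduled case, that might look something like the crontab entry below; the script path and timing are, of course, illustrative:-

# Illustrative crontab entry: one composite job owns the freeze, the backup,
# and the thaw, rather than hoping two separate schedules happen to line up.
30 1 * * * /usr/local/bin/db_composite_backup.sh >> /var/log/backup/db_backup.log 2>&1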

  • Verify that the backup destination is available: in the case of disk, make sure it is mounted and writable; in the case of tape, make sure it is loaded and writable (see the sketch after this list).
  • The backup-pre command should not be run if the backup media cannot be verified as available.
  • The backup-pre command should bring the system to a safe state.
  • The backup-post command should bring the system to an open state.
  • The backup job should not be initiated if the backup-pre command fails.
  • The backup-post command should be run whether or not the backup-pre command or the backup itself fails.
  • The backup-post command should return success if the application is already running satisfactorily.
  • Any media movement should be checked and performed prior to entering the backup-pre process.
  • In the event of a media movement error, neither the backup-pre process nor the backup itself should be run.
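For the disk case, a minimal destination check might look like the sketch below; the mount point and scratch file are illustrative, and a tape destination would need the equivalent load and write-enable checks from your tape subsystem:-

#!/bin/sh
# Hypothetical check that a disk backup destination is mounted and writable.
BACKUP_MNT="/backup"                  # illustrative mount point

# Mounted? df -P reports the mount point of the filesystem containing the path.
mounted_on=`df -P "$BACKUP_MNT" | awk 'NR==2 {print $6}'`
if [ "$mounted_on" != "$BACKUP_MNT" ]; then
  echo "backup destination $BACKUP_MNT is not mounted" >&2
  exit 1
fi

# Writable? Create and remove a scratch file.
if ! touch "$BACKUP_MNT/.backup_write_test" 2>/dev/null; then
  echo "backup destination $BACKUP_MNT is not writable" >&2
  exit 1
fi
rm -f "$BACKUP_MNT/.backup_write_test"

# Safe to proceed to the backup-pre command and the backup itself.
exit 0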

The pre and post commands should be attempted multiple times to mask over transient errors. Something like the following code fragment is sufficient for providing a 3-strike attempt:-

SRC="/some/list/of/app/dirs"       # application directories to back up (space-separated list)
DST="/path/to/backup/device"       # backup destination (disk file or tape device)
RETVAL=1

# Run a command up to three times to mask transient errors.
try_thrice() {
  "$@" || "$@" || "$@"
}

# Bring the application to a safe (quiesced) state.
backup_pre() {
  /path/to/some/app/command_to_freeze
}

# Take the backup itself.  Note: the pipeline's exit status is bzip2's,
# so a tar error is not caught here.
do_backup() {
  tar cf - $SRC | bzip2 -9c - > $DST
}

# Return the application to an open state.
backup_post() {
  /path/to/some/app/command_to_thaw
}

try_thrice backup_pre || exit $?   # do not back up if the pre-command fails
try_thrice do_backup
RETVAL=$?                          # remember the backup result
try_thrice backup_post             # always attempt to thaw the application
exit `expr $? + $RETVAL`           # non-zero if either the backup or the post-command failed

So…in a nutshell:-

To improve data security and backup capacity management, a database backup should be linked to the tape backup such that the newly created database backup is copied to tape at the earliest opportunity, and the tape backup should be configured so that it does not run if the database backup-pre command fails.

Backups: Part 4 – Dropped Files

In this little brain-splat I bring you another backup woe – “Dropped Files”. These are an area of backup which is frequently overlooked. Many people are concerned only with whether or not the backup job finished and succeeded. Many times I have seen backup reports where thousands of files are dropped daily as a matter of course due to a lack of content exclusion base-lining.

All backups should be iteratively bedded-in at initial configuration until they run at least 99% of the time with 0 dropped files.

The danger of dropped files is that if you accept them as the norm, you will miss the critical files when they are dropped. Only by striving to maintain 0 dropped files through appropriate exclusions can you meet an absolute criterion of a good backup and see the real backup exceptions when they happen.

Dropped files are a critical part of assessing whether a backup is good, so it is mandatory to eliminate any hot files and directories which are not required for a DR, such as temp files and spool files. Eliminating these sources also reduces the backup payload, which reduces your backup times and also your RTO, as there is less data to restore.
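How the exclusions are expressed depends entirely on your backup software; as a simple illustration using plain GNU tar, a baseline might be kept in an exclusion file along these lines (the patterns are only examples of the kind of temp and spool data worth dropping):-

#!/bin/sh
# Illustrative exclusion baseline - hot or re-constructable data not needed for DR.
cat > /tmp/backup_excludes.txt <<'EOF'
*/tmp/*
*/spool/*
*.tmp
*.lock
EOF

# GNU tar reads the exclusion patterns from the file; other backup tools have equivalents.
tar --exclude-from=/tmp/backup_excludes.txt -cf - /some/list/of/app/dirs \
  | bzip2 -9c > /path/to/backup/device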

Backup Basics (Don’t throw the baby out with the bath-water!)

In part 2 of my backup basics snippets, I introduce you to the concept of “not throwing the baby out with the bath-water” and what this means in backup-terms is that you do not delete your last-good backups before you have secured your next backup.

It seems so tempting for many to just wrench the retention period down to 0 and obliterate old backups, when in some cases a non-zero retention is the only valid recovery plan after a previous backup failure.

The mitigation of this risk has a capacity consequence, which should be calculated at the time of system design. The mistake illustrated here is most frequently encountered on database systems, where there is often insufficient space to back up the database once it breaches 50% of its data LUN.

If the data set is expected to be up to s Gb and c cycles are required, then the backup volume or device needs a capacity of at least (c * s) + s Gb in order to hold sufficient copies of the backup, preferably on cheaper, albeit slower, storage.
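A trivial worked example of that rule, with illustrative numbers:-

#!/bin/sh
# Capacity rule of thumb: (c * s) + s Gb for c retained cycles of an s Gb backup.
s=200   # expected size of one backup in Gb (illustrative)
c=7     # number of cycles to retain (illustrative)
required=`expr $c \* $s + $s`
echo "provision at least ${required} Gb for the backup volume"   # 1600 Gb in this example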

Do not skimp on backup storage if you want your DBAs to love you: they hate application-based backup agents and will give their left sock for a flat-file on-disk backup any day!

Most importantly, it will also improve your RTO, and if archive logs are available it will improve your RPO too.

The thing is to think ahead and always provision at least (c * s) + s Gb of storage for the backup volume.

Keep this golden rule and your backup solution should remain fit for purpose given the finite storage resource available to the production data itself.

Good Backup (The basics)

Backup Blues

 

Tonight, I extol the virtue of good backups. It is not for me to say how frequently you should take a backup – that is your risk to assess. What I can say is that I believe a backup should meet certain criteria to be considered valid. I find it desperately disappointing how often the holistic view is missed by backup admins, who fail to consider the sum of the following as the minimum criteria for success:-

  1. Was the data in a suitable state for backup prior to the backup taking place?
  2. Did the backup start on-time as scheduled?
  3. Did the backup complete without error within its defined backup window, i.e. did it finish on time?
  4. Did the backup drop any files? (See Dropped Files, below.)
  5. Did the post-backup processes (if required) succeed?

Any failure of this 5-point plan constitutes a backup failure.

Only a backup which fulfils these criteria in full can be considered a good backup.
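As a rough illustration of how those five checks might be wired around a backup job, here is a minimal sketch; the freeze/thaw commands, the four-hour window, and the report parsing are illustrative placeholders rather than anything prescribed above, and criterion 2 (starting on time) is left to the scheduler:-

#!/bin/sh
# Hypothetical wrapper applying the 5-point criteria to a single backup run.
backup_pre()  { /path/to/some/app/command_to_freeze ; }   # 1. suitable state
do_backup()   { /path/to/backup/command ; }
backup_post() { /path/to/some/app/command_to_thaw ; }     # 5. post-backup step

start=`date +%s`                      # record the start time for the window check
backup_pre || { echo "FAIL: system not in a backup-ready state" ; exit 1 ; }
do_backup ; backup_rc=$?              # 3. did the backup complete without error?
backup_post || { echo "FAIL: post-backup step failed" ; backup_rc=1 ; }
end=`date +%s`
elapsed=`expr $end - $start`
[ $elapsed -le 14400 ] || { echo "FAIL: backup overran its 4-hour window" ; backup_rc=1 ; }
# 4. dropped files: parse the backup report here - any dropped file is a failure.
exit $backup_rc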

 

Dropped Files

 

All dropped files should be investigated and, if not required for a successful DR, excluded, so as not to encourage a culture of accepting dropped files as normal. Excluding files not required for DR also improves backup times, because excluding temporary or re-constructable data means there is less data to back up.

This is an iterative process which should tend to zero dropped files as sources of temporary files are identified.
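One way to drive that iteration is to summarise where the dropped files keep coming from. A small sketch follows, assuming (hypothetically) that your backup software writes lines containing ‘DROPPED:’ followed by the file path into a report log; adjust the parsing to whatever yours actually produces:-

#!/bin/sh
# Hypothetical dropped-file summary: list the directories dropping the most files
# so that they can be reviewed as exclusion candidates.
REPORT="/var/log/backup/last_run.log"      # illustrative report location
grep 'DROPPED:' "$REPORT" | awk '{print $2}' \
  | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -20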

Some files may require that the backup be run at a different scheduled time, or that an application-based scheduled task be moved outside of the backup window, if the data cannot be quiesced.