This is a barely working project. The release number is 0.1, but 0.0.0.1 would be more realistic. Last update: 2/4/2007

The Stegoclick project shows how to do a kind of steganography over clicks. To understand the idea, we first need to review some basic concepts.

Steganography is the art and science of writing hidden messages in such a way that no one apart from the intended recipient knows of the existence of the message.
Wikipedia has a clear page about steganography, with a list of useful related links (http://en.wikipedia.org/wiki/Steganography).
In steganography we have containers. As an example, an image can be used as a container for other data, available only to those who have the key to reveal it.
Data containers can be of different kinds: images, music files, texts, communication flows.
Over the years a number of applications have demonstrated the use of different kinds of containers: steganography of mail hidden among spam, network steganography that sends data unrelated to the connection, and others.
Steganalysis is the process of finding a way to attack a steganographic system.
Such an attack is successful when an observer who is excluded from the secret communication is able to tell that some data is hidden.
Steganographic systems are vulnerable to statistical analysis, which can reveal that something unusual is happening.

Let us see an example of steganalysis on “apparent spam” (a real application that does this is http://www.spammimic.com):
A user receives in his mailbox some mail messages that look like spam. These messages are no different from normal spam, yet they are crafted by a stego system that takes the words most frequently used across the whole body of spam messages and builds from them a 20k-word dictionary, mixing three different languages (say Italian, English and Spanish).
Using this dictionary it is possible to communicate covertly with anyone else who uses the same dictionary, by hiding the messages among the vast amount of spam usually circulating on the Internet.
In this example, the spam flow is the container and the 20k dictionary is the key.
An automatic analysis tool can’t see any difference between mail traffic carrying hidden communication and ordinary mail traffic. A human analyst, however, could flag as suspect the spam messages containing meaningless sentences composed of words from three different languages.
Once the anomaly has been seen a certain number of times, the spam-based steganographic system is broken: it is evident that some sort of communication is happening between two parties. (Even so, because it is so widespread, spam is an efficient container for steganographic purposes. To conceal that such a spam-based system is in use between two people, it is sufficient to address the same sort of spam messages to a great number of recipients, just as ordinary spam is spread to mailboxes around the world with little or no regard for the main language of the recipients.)

A similar reasoning applies to network traffic. The connections a PC makes while online are spread, apparently without pattern, across different services such as IRC, web, FTP and mail. But if a seemingly meaningless packet flow is established towards one specific host (and it cannot be attributed to misconfiguration or to a malfunctioning component), an expert analyst can recognize the anomaly.

Steganalysis is based on the principle of spotting anomalies: one anomaly is considered ok, two are considered a coincidence, but three or more are taken as a sign that something hidden is probably happening, which calls for deeper analysis.

Now a brief digression on communication and its codes. Communication is based on codes: alphabets are communication codes that use letters as symbols (the Italian alphabet has 21 symbols, the English one has 26, and Chinese uses characters as symbols, tens of thousands of them). Binary languages use two symbols, which can form infinitely many combinations. But what if we use some other symbols to communicate? All we have to do is decide which symbols to use and what their combinations mean.
As an example, let us use clothes and their colors as symbols for a communication.

yellow=0, black=1, white=2, green=3

This means that we have 4 possible states, that is 2 bits, for each piece of clothing we decide to wear.
So, if someone is wearing yellow socks, green trousers, a black shirt and a white hat (ahahah :) ) we can translate this into binary (assuming a bottom-up rule, socks first) as 00 11 01 10, that is 00110110. Wow! This is a byte!
By setting appropriate correspondences and rules, everything under human control can carry meaning. This approach has already been used.
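The clothes code can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes that "bottom-up" means the socks give the first (most significant) pair of bits.

```python
# Colour-to-symbol table taken from the text: four colours, so 2 bits each.
COLOR_CODE = {"yellow": 0, "black": 1, "white": 2, "green": 3}

def outfit_to_bits(colors_bottom_up):
    """Concatenate the 2-bit code of each garment, first garment first."""
    return "".join(format(COLOR_CODE[color], "02b") for color in colors_bottom_up)

# Socks, trousers, shirt, hat: the outfit from the example, read bottom-up.
outfit = ["yellow", "green", "black", "white"]
bits = outfit_to_bits(outfit)
value = int(bits, 2)   # the same byte as an integer
print(bits, value)
```

Four garments with four colours each give exactly 8 bits, so one outfit carries one byte.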
Now it is time to find a vehicle for this communication.

Think of a Big Brother able to sniff the Internet. What does he see? HTTP GETs, for the most part.
Clicks: users clicking in a web browser.
So our vehicle could be HTTP GET, with the order of the requested pages as our data segment!
The web administrator then has to parse his web server logs, find our sequence, interpret it and extract our data.

Suppose a web server with these pages:
index.html with links to a.html and b.html
a.html with links to 1.html, 2.html, 3.html, 4.html
b.html with links to 1.html, 2.html, 3.html, 5.html

An ordinary client issues HTTP GETs like these:
GET index.html
GET a.html (with referrer index.html)
GET 1.html (with referrer a.html)

The possible referrer+link pairs are our dictionary! It looks like this:

code 0 referrer: a.html link:1.html
code 1 referrer: a.html link:2.html
code 2 referrer: a.html link:3.html
code 3 referrer: a.html link:4.html
code 4 referrer: b.html link:1.html
code 5 referrer: b.html link:2.html
code 6 referrer: b.html link:3.html
code 7 referrer: b.html link:5.html
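As a rough sketch, the table above can be generated from the link structure of the toy site. The names `SITE` and `build_dictionary` are invented for this illustration; the only real requirement is that both parties enumerate the pairs in the same order.

```python
# Link structure of the toy site described above
# (index.html is only the entry page, so it is left out).
SITE = {
    "a.html": ["1.html", "2.html", "3.html", "4.html"],
    "b.html": ["1.html", "2.html", "3.html", "5.html"],
}

def build_dictionary(site):
    """Number every (referrer, link) pair in a fixed, reproducible order."""
    pairs = [(ref, link) for ref in sorted(site) for link in sorted(site[ref])]
    return dict(enumerate(pairs))

codebook = build_dictionary(SITE)
for code, (ref, link) in codebook.items():
    print(f"code {code} referrer: {ref} link: {link}")
```

Eight pairs give eight codes, so every click carries 3 bits.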

Our stego-client has to send the bits:

100 101 001 001

Encoded according to our dictionary, 100 = 4, 101 = 5, 001 = 1.

And our stego-client makes these GETs:

GET index.html
GET a.html (with referrer index.html)
GET b.html (with referrer index.html)
GET 1.html (with referrer b.html)
GET 2.html (with referrer b.html)
GET 2.html (with referrer a.html)
GET 2.html (with referrer a.html)
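The encoding step can be sketched like this, assuming the 8-entry dictionary above. The function name is hypothetical, and the sketch only produces the data-carrying requests, not the initial GETs of index.html, a.html and b.html.

```python
# The 8-entry dictionary from the text; the list index is the code.
CODEBOOK = [
    ("a.html", "1.html"), ("a.html", "2.html"), ("a.html", "3.html"), ("a.html", "4.html"),
    ("b.html", "1.html"), ("b.html", "2.html"), ("b.html", "3.html"), ("b.html", "5.html"),
]
BITS_PER_CODE = 3  # 8 entries -> 3 bits per click

def bits_to_clicks(bits):
    """Split a bit string into 3-bit groups and look each one up in the codebook."""
    bits = bits.replace(" ", "")
    if len(bits) % BITS_PER_CODE:
        raise ValueError("bit string length must be a multiple of 3")
    groups = [bits[i:i + BITS_PER_CODE] for i in range(0, len(bits), BITS_PER_CODE)]
    return [CODEBOOK[int(group, 2)] for group in groups]

clicks = bits_to_clicks("100 101 001 001")
for referrer, link in clicks:
    print(f"GET {link} (with referrer {referrer})")
```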

On the server side, a web admin who can parse his logs and shares the same dictionary can interpret each GET+referrer pair and rebuild the data:

100 101 001 001 :)
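The reverse direction can be sketched the same way. Here the logged requests are written by hand as (referrer, link) pairs; a real log would have to be parsed first. Pairs that are not in the dictionary (the navigation to index.html, a.html and b.html) are simply skipped, which is an assumption of this sketch.

```python
CODEBOOK = [
    ("a.html", "1.html"), ("a.html", "2.html"), ("a.html", "3.html"), ("a.html", "4.html"),
    ("b.html", "1.html"), ("b.html", "2.html"), ("b.html", "3.html"), ("b.html", "5.html"),
]
LOOKUP = {pair: code for code, pair in enumerate(CODEBOOK)}

def clicks_to_bits(requests):
    """Turn logged (referrer, link) pairs back into 3-bit groups, skipping unknown pairs."""
    groups = []
    for referrer, link in requests:
        code = LOOKUP.get((referrer, link))
        if code is not None:
            groups.append(format(code, "03b"))
    return " ".join(groups)

# The stego-client session from the text, as it would appear in the log.
logged = [
    ("", "index.html"),
    ("index.html", "a.html"),
    ("index.html", "b.html"),
    ("b.html", "1.html"),
    ("b.html", "2.html"),
    ("a.html", "2.html"),
    ("a.html", "2.html"),
]
recovered = clicks_to_bits(logged)
print(recovered)
```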

STEGOCLICK EXAMPLE:

This is a mirror of the Italian e-privacy mailing list archive (https://lists.firenze.linux.it/mailman/listinfo/e-privacy/), used locally:

debian:~/mirror-ep/lists.firenze.linux.it/pipermail/e-privacy# ls
1999-March 2002-January 2003-March 2004-October 2006-August
2000-July 2002-July 2003-May 2004-September 2006-December
2000-September 2002-June 2003-November 2005-April 2006-February
2001-April 2002-March 2003-October 2005-August 2006-January
2001-August 2002-May 2003-September 2005-December 2006-July
2001-December 2002-November 2004-April 2005-February 2006-June
2001-July 2002-October 2004-August 2005-January 2006-March
2001-June 2002-September 2004-December 2005-July 2006-May
2001-May 2003-April 2004-February 2005-June 2006-November
2001-November 2003-August 2004-January 2005-March 2006-October
2001-September 2003-December 2004-July 2005-May 2006-September
2002-April 2003-February 2004-June 2005-November 2007-February
2002-August 2003-January 2004-March 2005-October 2007-January
2002-December 2003-July 2004-May 2005-September index.html
2002-February 2003-June 2004-November 2006-April
debian:~/mirror-ep/lists.firenze.linux.it/pipermail/e-privacy#

The index.html file contains:

[...]
February 2007: [ Thread ] [ Subject ] [ Author ] [ Date ]
January 2007: [ Thread ] [ Subject ] [ Author ] [ Date ]
December 2006: [ Thread ] [ Subject ] [ Author ] [ Date ]
[…]

That is, for each month there are links to $YEAR-$Month/author.html (messages sorted by author), subject.html, date.html and thread.html: 4 possible referrers for each post in a month.

normalizer.sh is the script that generates the dictionary:

debian:~/mirror-ep/lists.firenze.linux.it/pipermail/e-privacy# ./normalizer.sh index.html
…………………………..
building stegocode.list with 55636 lines before sort | uniq
sorted and uniqed number of stegocode is 13900
debian:~/mirror-ep/lists.firenze.linux.it/pipermail/e-privacy#

The normalizer.sh script builds a dictionary by parsing the HTML pages in the local mirror (both communicating parties share the same knowledge of the pages in the mirror). The resulting dictionary has the fixed name ‘stegocode.list’:

debian:~/mirror-ep/lists.firenze.linux.it/pipermail/e-privacy# head -5 stegocode.list
1 1999-March/author.html 1999-March/000646.html
2 1999-March/date.html 1999-March/000646.html
3 1999-March/subject.html 1999-March/000646.html
4 1999-March/thread.html 1999-March/000646.html
5 2000-July/author.html 2000-July/000310.html
debian:~/mirror-ep/lists.firenze.linux.it/pipermail/e-privacy#
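The source of normalizer.sh is not shown here, so what follows is only a guess at the idea, written in Python. It assumes that message pages are named with digits only (like 000646.html), that the index pages link to them with relative hrefs, and that duplicates are dropped as the "sort | uniq" line suggests. `build_stegocode` and the tiny sample pages are invented for the illustration.

```python
from html.parser import HTMLParser
import posixpath
import re

MESSAGE_RE = re.compile(r"^\d+\.html$")  # assumption: message pages look like 000646.html

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def build_stegocode(pages):
    """pages maps an index page path (e.g. '1999-March/author.html') to its HTML.

    Returns numbered 'code referrer link' lines, de-duplicated and sorted.
    """
    pairs = set()
    for referrer, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        folder = posixpath.dirname(referrer)
        for href in collector.hrefs:
            if MESSAGE_RE.match(href):
                pairs.add((referrer, posixpath.join(folder, href)))
    return [f"{n} {ref} {link}" for n, (ref, link) in enumerate(sorted(pairs), start=1)]

# Tiny hand-written stand-in for two index pages of the mirror.
sample = {
    "1999-March/author.html": '<a href="000646.html">msg</a> <a href="000646.html">again</a>',
    "1999-March/date.html": '<a href="000646.html">msg</a> <a href="index.html">up</a>',
}
stegocode = build_stegocode(sample)
print("\n".join(stegocode))
```

Sorting before numbering matters: sender and receiver must end up with exactly the same code for each pair.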

The resulting dictionary has 13900 elements. Now the client has to send ‘file’ using steganography over clicks:

debian:/tmp# ls
file file2click stegoclicker.sh stegocode.list
debian:/tmp# ./stegoclicker.sh file http://www.delirandom.net/ep-mirror
[+] checking first argument (I’m looking for data file to send)… OK file file
[+] checking baseurl as 2nd argument, http://www.delirandom.net/ep-mirror/ as example
[+] using http://www.delirandom.net/ep-mirror as base url
[+] checking stegocode.list… OK
[+] checking curl executable… OK: /usr/bin/curl
[+] checking number of stegano code available… OK: 13900
[+] compression of file, with gzip -9
[+] checking length of file.click_gz… OK: 37
[+] executing ./file2click file.click_gz 37 13900 $RANDOM
############################################################### 100.0%
[+] calling curl with referral & code
############################################################### 100.0%
############################################################### 100.0%
[…]

How does it work?

1) Gzip the file, to reduce its size.
2) Split file.click_gz into elements of the dictionary and build a ‘referrer+link’ array.
3) GET index.html.
4) Call curl to send the HTTP GETs, using the ‘referrer+link’ pairs contained in the array.
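The four steps can be sketched in Python as follows. How file2click really splits the data is not described, so this sketch uses one plausible scheme: read the gzipped bytes as one big integer and write it in base N, where N is the dictionary size. The dictionary entries, the base URL and all function names are placeholders. The curl options used (-s, -o, --referer) are standard ones, and the commands are only built here, not executed.

```python
import gzip

def bytes_to_codes(data, dict_size):
    """Write data, read as one big-endian integer, as digits in base dict_size."""
    number = int.from_bytes(data, "big")
    codes = []
    while number:
        number, digit = divmod(number, dict_size)
        codes.append(digit)
    return codes[::-1] or [0]

def codes_to_bytes(codes, dict_size, length):
    """Inverse of bytes_to_codes; length is the byte length of the original data."""
    number = 0
    for digit in codes:
        number = number * dict_size + digit
    return number.to_bytes(length, "big")

def curl_commands(codes, dictionary, base_url):
    """Build one curl command per code; dictionary[i] is a (referrer, link) pair."""
    # Step 3: the GET of index.html marks the start of a transmission.
    commands = [["curl", "-s", "-o", "/dev/null", f"{base_url}/index.html"]]
    # Step 4: one GET per code, with the referrer taken from the dictionary.
    for code in codes:
        referrer, link = dictionary[code]
        commands.append(["curl", "-s", "-o", "/dev/null",
                         "--referer", f"{base_url}/{referrer}", f"{base_url}/{link}"])
    return commands

# Step 1: compress. Step 2: split into dictionary codes.
payload = gzip.compress(b"hello stegoclick", compresslevel=9)
# Hypothetical 13900-entry dictionary with placeholder page names.
dictionary = [(f"ref{i}.html", f"page{i}.html") for i in range(13900)]
codes = bytes_to_codes(payload, len(dictionary))
commands = curl_commands(codes, dictionary, "http://example.org/mirror")
print(len(payload), "bytes ->", len(codes), "clicks")
```

With this scheme the receiver needs the byte length to rebuild the file exactly, which may be why file2click takes the length as an argument, but that is a guess. Each command list could be passed to subprocess.run to actually send the request.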

And server side?

1) The server collects httpd logs, as it does every day.
2) Run stegoreader.sh on the collected log and try to extract files.
3) Every visiting IP appears in the log, so any IP could be a sender.
4) The script tries to interpret the link order for each IP address, because it does not have enough information to tell who is a sender and who is not.
5) index.html is used as the marker for “start of communication”: a new array is extracted after each ‘IP+GET index.html’ and each array becomes a file, but only a few files are meaningful :)
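The server-side steps can be sketched in Python like this. It is not the stegoreader.sh code: it assumes the Apache "combined" log format, and the paths, addresses and the three-entry lookup table are made up for the example.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: Apache "combined" log format, where the referrer follows status and size.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "GET (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)

def sessions_by_ip(log_lines, index_path):
    """Group requests per IP and start a new session at every GET of index_path."""
    sessions = defaultdict(list)  # ip -> list of sessions, each a list of (referrer, path)
    for line in log_lines:
        match = LOG_RE.match(line)
        if match is None:
            continue
        ip, path = match["ip"], match["path"]
        if path == index_path:
            sessions[ip].append([])  # marker: a new candidate message starts here
        elif ip in sessions:
            referrer_path = urlsplit(match["referrer"]).path
            sessions[ip][-1].append((referrer_path, path))
    return dict(sessions)

def decode(sessions, lookup):
    """Map every known (referrer, path) pair to its code, for each IP and session."""
    return {ip: [[lookup[pair] for pair in run if pair in lookup] for run in runs]
            for ip, runs in sessions.items()}

# Made-up lookup table and log lines, only to exercise the functions.
LOOKUP = {("/m/b.html", "/m/1.html"): 4, ("/m/b.html", "/m/2.html"): 5,
          ("/m/a.html", "/m/2.html"): 1}
SAMPLE_LOG = [
    '192.0.2.7 - - [06/Mar/2007:05:46:12 +0100] "GET /m/index.html HTTP/1.1" 200 900 "-" "curl"',
    '192.0.2.7 - - [06/Mar/2007:05:46:13 +0100] "GET /m/1.html HTTP/1.1" 200 900 "http://example.org/m/b.html" "curl"',
    '192.0.2.7 - - [06/Mar/2007:05:46:14 +0100] "GET /m/2.html HTTP/1.1" 200 900 "http://example.org/m/b.html" "curl"',
    '192.0.2.9 - - [06/Mar/2007:05:46:15 +0100] "GET /m/index.html HTTP/1.1" 200 900 "-" "Mozilla"',
    '192.0.2.7 - - [06/Mar/2007:05:46:16 +0100] "GET /m/2.html HTTP/1.1" 200 900 "http://example.org/m/a.html" "curl"',
]
decoded = decode(sessions_by_ip(SAMPLE_LOG, "/m/index.html"), LOOKUP)
print(decoded)
```

Every IP that touched index.html produces at least one candidate, which is why most of the extracted files are noise.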

debian:/tmp# ./stegoreader.sh access.log ep-mirror/index.html
[+] stegoreader.sh, part of stegoclick project: http://www.delirandom.net
[+] cleaning extractclick.* files
[+] checking for stegocode.list required file
[+] checking baseurl and index file…
[+] basename ep-mirror and index file index.html, parsing access.log …
[+] for each unique IP asking for index.html, get him history
….
[+] built 4 files to analyze (tempfile like extractclick.*)
[+] analysis of extractclick.113.155.114.186.1… OK
[+] analysis of extractclick.113.155.114.186.2… OK
[+] analysis of extractclick.71.208.60.232.3… OK
[+] analysis of extractclick.71.208.60.232.4… OK
[+] cleaning temp extractclick.* files…
[+] generated file has this file magic number:

clicked_data.1: data
clicked_data.2: very short file (no magic)
clicked_data.3: very short file (no magic)
clicked_data.4: gzip compressed data, was “file”, from Unix, last modified: Tue Mar 6 05:46:12 2007, max compression

wow! clicked_data.4 is our file!
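Telling the real message from the noise comes down to the file magic. A small Python sketch of that last check follows, assuming the sender always gzips the payload. The candidate names and contents are made up for the example.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def looks_like_gzip(blob):
    """Cheap test on the magic number only."""
    return blob[:2] == GZIP_MAGIC

def pick_messages(candidates):
    """Keep only the candidates that start with the gzip magic and really decompress."""
    found = {}
    for name, blob in candidates.items():
        if not looks_like_gzip(blob):
            continue
        try:
            found[name] = gzip.decompress(blob)
        except (OSError, EOFError, zlib.error):
            continue  # right magic by chance, but not a valid gzip stream
    return found

# Made-up candidates: random-looking noise, a very short one, and a real gzip stream.
candidates = {
    "noise": b"\x8a\x03\x51\x77\x20\x99",
    "short": b"\x01",
    "real": gzip.compress(b"secret", compresslevel=9),
}
messages = pick_messages(candidates)
print(messages)
```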

This is the stegoclick 0.1 software package; it is not developed for end users.
