Posts Tagged ‘python’

A Simple N-gram Calculator: pyngram

Thursday, May 20th, 2010
Updated v1.0.1 5/21/2010 – Improved the exception handling, and changed xrange(len(inputstring)) to xrange(len(inputstring)-nlen+1)). Thanks to colleague Arik Baratz!

Recently, as I was trying to solve a cryptogram, I wrote tool to parse the bigrams and trigrams from the ciphertext, tally the frequency, and then display the results sorted from most to least frequently occuring bigram and trigram.

First, a quick history of why I did this and how this was handy.

One of the ways to solve a substitution cipher is to do a frequency analysis. Here’s a typical distribution of letters in the English language. Just as it is obvious that the alphabet ‘e’ is by far the most popular in the English language, you can also calculate the most frequently occurring bigram (2 consecutive characters) and trigram (3 consecutive characters). In English, the top most frequently occurring bigrams are ‘th’ (1.52%), ‘he’ (1.28%), ‘in’ (0.94%) (full list from Wikipedia here). For trigrams, the most popular are ‘ th’ (note the leading whitespace), ‘he ‘ (trailing whitespace), followed by ‘the’ (full list here). The biggest assumption here is that the plaintext is in English. If it’s in say, German, then you’ll have to find the corresponding statistical distribution (Wikipedia has the 1-gram frequency distribution for other languages here).

Whatever the plaintext’s (human) language is, you’d have to find the top n-grams occurring the ciphertext first—and that’s what this calculator will do for you. You can import the python module and call the function calc_ngram, or just write it from your *nix command line.

Example usage from python shell:

>>> from pyngram import calc_ngram
>>> results = calc_ngram('bubble bobble, bubble bobble, bubble bobble', 3) # (inputstring, n-gram size)
>>> for l in results: print l[0] + ' occured ' + str(l[1]) + ' times'
...
bbl occured 6 times
ble occured 6 times
le  occured 3 times
obb occured 3 times
 bo occured 3 times
e b occured 3 times
ubb occured 3 times
bub occured 3 times
bob occured 3 times
, b occured 2 times
 bu occured 2 times
le, occured 2 times
e,  occured 2 times

You can install pyngram from cheeseshop, Python’s package index with

sudo pip install pyngram

For some strange reason, Perl’s CPAN had a few such utilities (just search for ngram, bigram, digram), but there wasn’t any for Python that I could find. Although CPAN’s offering on average looked more feature-rich, pyngram by comparison is more light-weight. It does one thing and one thing only, and it does it efficiently well.

Writing the calculator was actually the easiest part. Putting it together in a nice package for the pypi repository and making sure it works with pip was the most time consuming part! But it’s worth it, because now that I’ve been through the process once (whole topic on its own), I can easily do it again. Contributing a small module to open source gives me a small jolt of happiness :)

Here’s the source code. Enjoy!

#!/usr/bin/env python
"""
 A simple Python n-gram calculator.

 Given an arbitrary string, and the value of n as the size of the n-gram (int), this module
 will show you the results, sorted from most to least frequently occuring n-gram.

 The 'sort by value' operation for the dict follows the PEP 265 recommendation.

 Quick start:

 >>> from pyngram import calc_ngram

 method expects inputstring as 1st arg, size of n-gram as 2nd arg

 >>> calc_ngram('bubble bobble, bubble bobble, bubble bobble', 3)

 Or just run it from the command line prompt:
 user@host:~$ ./pyngram.py

 Enjoy!

 Jay Liew
 @jaysern

"""

__version__ = '1.0'
__author__ = 'Jay Liew' # @jaysern from @websenselabs
__license__ = 'MIT'

from operator import itemgetter

def calc_ngram(inputstring, nlen):
    if nlen < 1:
        raise ValueError, "Uh, n-grams have to be of size 1 or greater. Makes no sense to have a 0 length n-gram."

    if len(inputstring) < 1:
        raise ValueError, "umm yeah, ... the inputstring has to be longer than 1 char"

    # now, fish out the n-grams from the input string
    ngram_list = [inputstring[x:x+nlen] for x in xrange(len(inputstring)-nlen+1)]

    ngram_freq = {} # dict for storing results

    for n in ngram_list: # collect the distinct n-grams and count
        if n in ngram_freq:
            ngram_freq[n] += 1
        else:
            ngram_freq[n] = 1 # human counting numbers start at 1

    # set reverse = False to change order of sort (ascending/descending)
    return sorted(ngram_freq.iteritems(), key=itemgetter(1), reverse=True)

if __name__ == '__main__':
    inputstring = raw_input('Enter input string: ')
    nlen_str = raw_input('Enter size of n-gram (int): ')
    nlen = int(nlen_str) # cast string to int

    for t in calc_ngram(inputstring, nlen):
        print t[0] + ' occured ' + str(t[1]) + ' times'

This is a cross-posting from my company’s blog post here: http://community.websense.com/blogs/securitylabs/archive/2010/05/20/a-simple-n-gram-calculator-pyngram.aspx.

locale.Error: unsupported locale setting?

Thursday, April 22nd, 2010

Is your Python code crashing over some kind of locale setting?

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/tweepy/binder.py", line 178, in _call
    return method.execute()
  File "/usr/local/lib/python2.6/dist-packages/tweepy/binder.py", line 164, in execute
    result = self.api.parser.parse(self, resp.read())
  File "/usr/local/lib/python2.6/dist-packages/tweepy/parsers.py", line 72, in parse
    result = model.parse_list(method.api, json)
  File "/usr/local/lib/python2.6/dist-packages/tweepy/models.py", line 35, in parse_list
    results.append(cls.parse(api, obj))
  File "/usr/local/lib/python2.6/dist-packages/tweepy/models.py", line 50, in parse
    setattr(status, k, parse_datetime(v))
  File "/usr/local/lib/python2.6/dist-packages/tweepy/utils.py", line 20, in parse_datetime
    locale.setlocale(locale.LC_TIME, '')
  File "/usr/lib/python2.6/locale.py", line 513, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting

Or when using aptitude or apt-get or dpkg-reconfigure, you also get some LANGUAGE or LC_ALL unset error?

jaysern@jaysern:/home/jaysern# dpkg-reconfigure localeconf
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
      LANGUAGE = (unset),
      LC_ALL = (unset),
      LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
Package `localeconf' is not installed and no info is available.
Use dpkg --info (= dpkg-deb --info) to examine archive files,
and dpkg --contents (= dpkg-deb --contents) to list their contents.

If you tried this in your python code,

locale.setlocale(locale.LC_ALL, '')

or

locale.setlocale(locale.LC_TIME, '')

does it fail too?

If so, try

aptitude install language-pack-en

and see if that does the trick for you (credit). I spent a few hours trying to hunt down a bug in python – turns out the solution was just a one-liner. I hope this saves someone else the headache!

What do a philologist and a lollipop have in common?

Wednesday, February 24th, 2010

Question: What do a philologist and a lollipop have in common?

Answer: LOL (if you don’t get it, you will LOL when you see it below)

The generalized problem statement

Given a few strings:

  ewf3hardyharharoiew
  p90weuhardyharhar
  hardyharharoie78wjf
  ahardyharhar787834

Determine the longest substring that they all have in common:

 p90weuhardyharhar
       |||||||||||
      ahardyharhar787834
       |||||||||||
   ewf3hardyharharoiew
       |||||||||||
     d1hardyharharoie78wjf

Which in this example above, is ‘hardyharhar’. The question is, how do you code this?

The solution

Today on Hug-An-Algorithm day, we’re giving the hat-tip to an algorithm called longest common substring (LCS).

The solution can be implemented using a generalized suffix tree, or by dynamic programming
(Wikipedia).

Specifically, I’ll walk you through a Python implementation—but focusing on the approach behind it, so that you can implement this in any language you want. I found a few LCS implementations on Wikibooks.org, which is where I got this simple Python implementation from (and adapted it slightly for this demo).


Technical nitty-gritty details warning! (If you’re interested in how this may be useful to you, skip to real-world applications section after it):

Select full screen + high quality plz:

Code from the video above for plugging into your python interpreter:

def LCS(S, T):
    m = len(S); n = len(T)
    L = [[0] * (n+1) for i in xrange(m+1)]
    LCS_set = set()
    longest = 0
    for i in xrange(m):
        for j in xrange(n):
            if S[i] == T[j]:
                v = L[i][j] + 1
                L[i+1][j+1] = v
                if v > longest:
                    longest = v
                    LCS_set = set()
                if v == longest:
                    LCS_set.add(S[i-v+1:i+1])
    print LCS_set

Real world application

As researchers, we often find ourselves swimming in a sea of seemingly random data, and we’re always looking for ways to make sense of this wealth of data, how we can obtain insights and the act accordingly.

Here’s how I use the LCS algorithm to find patterns in malicious links. Obviously, it can be applied to any kind of data outside the boundaries of security research. Hope you’ll find this tool handy in your problem-solving toolkit!

Security Researcher: Jay Liew

p.s. This blog post is dedicated to the folks in #python on Freenode.

This is a cross-post from my company’s blog.

Allow your visitors to sign in via Twitter, Facebook, FriendFeed, OpenID, and OAuth with django-socialregistration

Wednesday, October 28th, 2009


The title says it all. Allow your visitors to sign in via Twitter, Facebook, FriendFeed, OpenID, and OAuth with django-socialregistration. Hence the name, “social registration”. I’ve been tinkering with this module and I’ve just got the Twitter OAuth login/registration to work. It’s pretty neat!

You know how much you loathe having to register/sign-up for yet another new service, just to try it? Well, as a site owner, it’s really in your best interest to lower the bar for the public to figure out if they want to use your service. Make it as frictionless as possible.

The README provides all the instructions you need, except one thing: you must explicitly tell Django’s authentication process to also check the socialregistration module’s authentication backend

In other words, in your settings.py:

AUTHENTICATION_BACKENDS = (

    # this is the default backend, don't forget to include it!
    'django.contrib.auth.backends.ModelBackend', 

    # this is what you're adding for using Twitter
    'socialregistration.auth.TwitterAuth', 

    'socialregistration.auth.FacebookAuth', # Facebook

    'socialregistration.auth.OpenIDAuth', # OpenID

    )

You can either add it in your settings.py or global_settings.py (mine’s in /usr/local/lib/python2.6/dist-packages/django/conf/ )
(more…)

mod_wsgi compile error: missing Python development package?

Tuesday, September 22nd, 2009

If you’re seeing an error like this when trying to compile mod_wsgi, it’s likely that you don’t have the Python development package installed. Solution: apt-get install python-dev

Turns out, a lot of things can go wrong when you try to compile mod_wsgi, and I spent a good chunk of time before finding it out here. I hope that by posting this here, I would save others the headache. The last few lines of the error was quite misleading, imho. But like they all say, in hindsight, everything’s 20/20

root@jaysern:/usr/local/mod_wsgi-2.5# make > error
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
/usr/local/apache2/build/libtool --silent --mode=compile gcc -prefer-pic -DLINUX=2 -D_REENTRANT -D_GNU_SOURCE -g -O2 -pthread -I/usr/local/apache2/include -I/usr/local/apache2/include -I/usr/local/apache2/include -I/usr/include/python2.6 -DNDEBUG -c -o mod_wsgi.lo mod_wsgi.c && touch mod_wsgi.slo
mod_wsgi.c:113:20: error: Python.h: No such file or directory
mod_wsgi.c:114:21: error: compile.h: No such file or directory
mod_wsgi.c:115:18: error: node.h: No such file or directory

*** snip ***

mod_wsgi.c:11684: error: expected expression before ')' token
mod_wsgi.c:11691: error: expected ';' before 'ap_log_rerror'
mod_wsgi.c:11696: error: expected ';' before '}' token
mod_wsgi.c:11710: error: expected expression before 'module'
apxs:Error: Command failed with rc=65536
.
make: *** [mod_wsgi.la] Error 1
root@jaysern:/usr/local/mod_wsgi-2.5#