Logging into SAML/Shibboleth authenticated server using Python

I'm trying to log into my university's server via Python, but I'm entirely unsure of how to generate the appropriate HTTP POSTs, create the keys and certificates, and handle the other parts of the process I may be unfamiliar with that are required to comply with the SAML spec. I can log in with my browser just fine, but I'd like to be able to log in and access other content on the server using Python.

For reference, here is the site

I've tried logging in with mechanize (selecting the form, populating the fields, clicking the submit button control via mechanize.Browser.submit(), etc.) to no avail; the login page gets spat back each time.

At this point, I'm open to implementing a solution in whichever language is most suitable to the task. Basically, I want to programmatically log in to a SAML-authenticated server.



Solution 1:[1]

Basically, what you have to understand is the workflow behind a SAML authentication process. Unfortunately, there is no PDF out there that really helps in figuring out what the browser does when accessing a SAML-protected website.

Maybe you should take a look at something like this: http://www.docstoc.com/docs/33849977/Workflow-to-Use-Shibboleth-Authentication-to-Sign and, obviously, at this: http://en.wikipedia.org/wiki/Security_Assertion_Markup_Language. In particular, focus your attention on this scheme:

(Diagram: the SAML web-browser SSO flow, as shown in the Wikipedia article linked above.)

What I did when I was trying to understand how SAML works, since the documentation was so poor, was write down (yes! on paper) every step the browser performed, from first to last. I used Opera, configured not to follow automatic redirects (300, 301, 302 response codes, and so on) and with JavaScript disabled. Then I wrote down all the cookies the servers sent me, what did what, and for what reason.

Maybe it was way too much effort, but this way I was able to write a library, in Java, which is suited to the job, and incredibly fast and efficient too. Maybe someday I'll release it publicly...
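That pen-and-paper trace can also be scripted. Here is a minimal stdlib-only sketch that mimics the "no automatic redirects" browser setup: it follows the chain one hop at a time and records each hop's status code and cookies. The starting URL is a placeholder for whatever protected link you would normally click.

```python
import urllib.error
import urllib.parse
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise the 3xx as an HTTPError
    # instead of following it, so we can log every hop ourselves.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def trace_redirects(url, max_hops=10):
    """Walk a redirect chain manually, recording
    (status code, URL, Set-Cookie header) per hop."""
    opener = urllib.request.build_opener(NoRedirect)
    hops = []
    for _ in range(max_hops):
        try:
            response = opener.open(url)
        except urllib.error.HTTPError as err:
            response = err  # 3xx responses arrive here once following is off
        hops.append((response.code, url, response.headers.get('Set-Cookie')))
        location = response.headers.get('Location')
        if location is None:
            break  # no further redirect: chain is finished
        url = urllib.parse.urljoin(url, location)
    return hops
```

Calling `trace_redirects` on the protected link prints nothing by itself; iterate over the returned list to see the same hops you would have written down by hand.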

What you should understand is that in a SAML login there are two actors playing: the IdP (identity provider) and the SP (service provider).

A. FIRST STEP: the user agent requests the resource from the SP

I'm quite sure you reached the link you reference in your question from another page, by clicking something like "Access the protected website". If you pay closer attention, you'll notice that the link you followed is not the one on which the authentication form is displayed. That's because clicking the link from the SP to the IdP is a SAML step. The first step, actually. It allows the IdP to define who you are and why you are trying to access its resource. So, basically, what you'll need to do is make a request to the link you followed in order to reach the web form, and store the cookies it sets. What you won't see is a SAMLRequest string, encoded into the 302 redirect behind the link, sent to the IdP when making the connection.

I think that's the reason why you can't mechanize the whole process: you simply connected to the form, with no identity identification done!

B. SECOND STEP: filling the form, and submitting it

This one is easy. But be careful! The cookies that are set now are not the same as the cookies above. You're now connecting to an utterly different website. That's the reason SAML is used: different websites, same credentials. So you may want to store these authentication cookies, provided by a successful login, in a different variable. The IdP is now going to send you back a response (to the SAMLRequest): the SAMLResponse. You have to detect it in the source code of the page the login ends on. In fact, that page is one big form containing the response, with some JS code that automatically submits it when the page loads. You have to get the source code of the page, parse it to get rid of all the useless HTML, and extract the (encoded) SAMLResponse.
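The extraction described here can be done with the standard library's html.parser. The HTML below is a simplified stand-in for the auto-submitting form an IdP actually returns (real pages usually also carry a RelayState field, as shown; the URL and values are made up):

```python
from html.parser import HTMLParser

class SAMLFormParser(HTMLParser):
    """Extracts the action URL and input fields of the auto-submit form."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.action = attrs.get('action')
        elif tag == 'input' and 'name' in attrs:
            self.fields[attrs['name']] = attrs.get('value', '')

# Simplified stand-in for the page the IdP returns after login:
page = '''
<form method="post" action="https://sp.example.edu/Shibboleth.sso/SAML2/POST">
  <input type="hidden" name="RelayState" value="cookie:1234abcd"/>
  <input type="hidden" name="SAMLResponse" value="PHNhbWxwOlJlc3BvbnNlIC4uLj4="/>
</form>
'''

parser = SAMLFormParser()
parser.feed(page)
print(parser.action)           # where to POST the response back (step C)
print(sorted(parser.fields))   # ['RelayState', 'SAMLResponse']
```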

C. THIRD STEP: sending back the response to the SP

Now you're ready to end the procedure. You have to send (via POST, since you're emulating a form) the SAMLResponse obtained in the previous step to the SP. In this way, it will provide the cookies needed to access the protected content you want.

Aaaaand, you're done!

Again, I think the most valuable thing you can do is use Opera and analyze ALL the redirects SAML performs, then replicate them in your code. It's not that difficult; just keep in mind that the IdP is utterly different from the SP.
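The three steps above can be sketched with the requests library. All URLs and form-field names here are placeholders; inspect your own IdP's login form to find the real ones:

```python
import requests
from html.parser import HTMLParser
from urllib.parse import urljoin

class FormParser(HTMLParser):
    """Collects the first form's action URL and its named inputs."""
    def __init__(self):
        super().__init__()
        self.action, self.fields = None, {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and self.action is None:
            self.action = attrs.get('action')
        elif tag == 'input' and 'name' in attrs:
            self.fields.setdefault(attrs['name'], attrs.get('value', ''))

def parse_form(base_url, html):
    parser = FormParser()
    parser.feed(html)
    return urljoin(base_url, parser.action), parser.fields

def saml_login(protected_url, username, password):
    session = requests.Session()  # keeps IdP and SP cookies across hops

    # A. Request the protected resource; requests follows the 302
    #    (carrying the SAMLRequest) to the IdP's login form.
    resp = session.get(protected_url)

    # B. Fill in and submit the credentials. 'username'/'password'
    #    are assumed field names - check the real form's source.
    action, fields = parse_form(resp.url, resp.text)
    fields.update({'username': username, 'password': password})
    resp = session.post(action, data=fields)

    # C. The reply is the auto-submitting form holding the SAMLResponse;
    #    replay its POST to the SP to obtain the session cookies.
    action, fields = parse_form(resp.url, resp.text)
    session.post(action, data=fields)

    return session  # now authenticated against the SP
```

This is a sketch, not a drop-in solution: extra hops (consent pages, JavaScript redirects) vary per deployment, which is exactly why tracing the redirects first matters.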

Solution 2:[2]

Selenium with the headless PhantomJS webkit will be your best bet to log into Shibboleth, because it handles cookies and even JavaScript for you.

Installation:

$ pip install selenium
$ brew install phantomjs

from selenium import webdriver
from selenium.webdriver.support.ui import Select # for <SELECT> HTML form

driver = webdriver.PhantomJS()
# On Windows, use: webdriver.PhantomJS(r'C:\phantomjs-1.9.7-windows\phantomjs.exe')

# Service selection
# Here I had to select my school among others 
driver.get("http://ent.unr-runn.fr/uPortal/")
select = Select(driver.find_element_by_name('user_idp'))
select.select_by_visible_text('ENSICAEN')
driver.find_element_by_id('IdPList').submit()

# Login page (https://cas.ensicaen.fr/cas/login?service=https%3A%2F%2Fshibboleth.ensicaen.fr%2Fidp%2FAuthn%2FRemoteUser)
# Fill the login form and submit it
driver.find_element_by_id('username').send_keys("myusername")
driver.find_element_by_id('password').send_keys("mypassword")
driver.find_element_by_id('fm1').submit()

# Now connected to the home page
# Click on 3 links in order to reach the page I want to scrape
driver.find_element_by_id('tabLink_u1240l1s214').click()
driver.find_element_by_id('formMenu:linknotes1').click()
driver.find_element_by_id('_id137Pluto_108_u1240l1n228_50520_:tabledip:0:_id158Pluto_108_u1240l1n228_50520_').click()

# Select and print an interesting element by its ID
page = driver.find_element_by_id('_id111Pluto_108_u1240l1n228_50520_:tableel:tbody_element')
print(page.text)

Note:

  • during development, use Firefox to preview what you are doing: driver = webdriver.Firefox()
  • this script is provided as-is, along with the corresponding links, so you can compare each line of code with the actual source code of the pages (until login, at least).

Solution 3:[3]

Extending the answer from Stéphane Bruckert above: once you have used Selenium to get the auth cookies, you can still switch to requests if you want to:

import requests
cook = {i['name']: i['value'] for i in driver.get_cookies()}
driver.quit()
r = requests.get("https://protected.ac.uk", cookies=cook)

Solution 4:[4]

Here you can find a more detailed description of the Shibboleth authentication process.

Solution 5:[5]

I wrote a simple Python script capable of logging into a Shibbolized page.

First, I used Live HTTP Headers in Firefox to watch the redirects for the particular Shibbolized page I was targeting.

Then I wrote a simple script using urllib.request (in Python 3.4, though urllib2 in Python 2.x seems to have the same functionality). I found that the default redirect following of urllib.request worked for my purposes; however, I found it useful to subclass urllib.request.HTTPRedirectHandler and, in this subclass (class ShibRedirectHandler), add a handler for all http_error_302 events.

In this subclass I just printed out the parameter values (for debugging purposes); please note that in order to keep the default redirect following, you need to end the handler with return HTTPRedirectHandler.http_error_302(self, args...) (i.e., a call to the base class http_error_302 handler).

The most important component for making urllib work with Shibbolized authentication is to create an OpenerDirector with cookie handling added. You build the OpenerDirector with the following:

cookieprocessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(ShibRedirectHandler, cookieprocessor)
response = opener.open("https://shib.page.org")

Here is a full script that may get you started (you will need to change a few mock URLs I provided and also enter a valid username and password). This uses Python 3 classes; to make this work in Python 2, replace urllib.request with urllib2 and urllib.parse with urlparse:

import urllib.request
import urllib.parse

#Subclass of HTTPRedirectHandler. Does not do much, but is very
#verbose: prints out all the redirects. Compare with what you see
#from looking at your browser's redirects (using Live HTTP Headers or similar)
class ShibRedirectHandler (urllib.request.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print (req)
        print (fp.geturl())
        print (code)
        print (msg)
        print (headers)
        #without this return (passing parameters onto baseclass) 
        #redirect following will not happen automatically for you.
        return urllib.request.HTTPRedirectHandler.http_error_302(self,
                                                          req,
                                                          fp,
                                                          code,
                                                          msg,
                                                          headers)

cookieprocessor = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(ShibRedirectHandler, cookieprocessor)

#Edit: should be the URL of the site/page you want to load that is protected with Shibboleth
opener.open("https://shibbolized.site.example").read()

#Inspect the page source of the Shibboleth login form; find the input names for the username
#and password, and edit according to the dictionary keys here to match your input names
loginData = urllib.parse.urlencode({'username':'<your-username>', 'password':'<your-password>'})
bLoginData = loginData.encode('ascii')

#By looking at the source of your Shib login form, find the URL the form action posts back to
#hard code this URL in the mock URL presented below.
#Make sure you include the URL, port number and path
response = opener.open("https://test-idp.server.example", bLoginData)
#See what you got.
print (response.read())

Solution 6:[6]

Though this is already answered, hopefully it helps someone. I had the task of downloading files from a SAML website and got help from Stéphane Bruckert's answer.

If headless mode is used, the wait time needs to be specified at the required intervals of redirection for login. Once the browser had logged in, I took its cookies and used them with the requests module to download the file - got help from this.

This is what my code looks like:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options  # imports
import requests

things_to_download = ['a', 'b', 'c', 'd', 'e', 'f']   # The values changing in the URL
options = Options()
options.headless = False
driver = webdriver.Chrome('D:/chromedriver.exe', options=options)
driver.get('https://website.to.downloadfrom.com/')
driver.find_element_by_id('username').send_keys("Your_username")  # the IDs will differ for different websites/forms
driver.find_element_by_id('password').send_keys("Your_password")
driver.find_element_by_id('logOnForm').submit()
session = requests.Session()
cookies = driver.get_cookies()
for cookie in cookies:
    session.cookies.set(cookie['name'], cookie['value'])
for thing in things_to_download:
    response = session.get('https://website.to.downloadfrom.com/bla/blabla/' + str(thing))
    with open('Downloaded_stuff/' + str(thing) + '.pdf', 'wb') as f:
        f.write(response.content)            # saving the file
driver.close()

Solution 7:[7]

I wrote this code following the accepted answer. It worked for me in two separate projects.

import mechanize
from bs4 import BeautifulSoup
import urllib2
import cookielib


cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_cookiejar(cj)

br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_refresh(False)
br.set_handle_referer(True)

br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]


br.open("The URL goes here")

br.select_form(nr=0)

br.form['username'] = 'Login Username'
br.form['password'] = 'Login Password'
br.submit()

br.select_form(nr=0)
br.submit()

response = br.response().read()
print response

Solution 8:[8]

Mechanize can do the work as well, except it doesn't handle JavaScript. Authentication worked successfully, but once on the homepage, I couldn't follow this link:

<a href="#" id="formMenu:linknotes1"
   onclick="return oamSubmitForm('formMenu','formMenu:linknotes1');">

In case you need JavaScript, you'd better use Selenium with PhantomJS. Otherwise, I hope you'll find inspiration in this script:

#!/usr/bin/env python
#coding: utf8
import sys, logging
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text

br = mechanize.Browser() # Browser
cj = cookielib.LWPCookieJar() # Cookie Jar
br.set_cookiejar(cj) 

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36')]

br.open('https://ent.unr-runn.fr/uPortal/')
br.select_form(nr=0)
br.submit()

br.select_form(nr=0)
br.form['username'] = 'myusername'
br.form['password'] = 'mypassword'
br.submit()

br.select_form(nr=0)
br.submit()

rs = br.open('https://ent.unr-runn.fr/uPortal/f/u1240l1s214/p/esup-mondossierweb.u1240l1n228/max/render.uP?pP_org.apache.myfaces.portlet.MyFacesGenericPortlet.VIEW_ID=%2Fstylesheets%2Fetu%2Fdetailnotes.xhtml')

# Optionally, compare the cookies with those shown in Live HTTP Headers:
print "Cookies:"
for cookie in cj:
    print cookie

# Displaying page information
print rs.read()
print rs.geturl()
print rs.info();

# And that last line didn't work
rs = br.follow_link(id="formMenu:linknotes1", nr=0)

Solution 9:[9]

I faced a similar problem with my university page SAML authentication as well.

The basic idea is to use a requests.Session object to automatically handle most of the HTTP redirects and cookie storing. However, there were also many redirects driven by JavaScript, and this caused multiple problems for the simple requests solution.

I ended up using Fiddler to keep track of every request my browser made to the university server, to fill in the redirects I'd missed. It really made the process easier.

My solution is far from ideal, but seems to work.
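A minimal sketch of this approach (the URL and credential field names are placeholders for your own IdP; the printed history is what you compare against your Fiddler trace):

```python
import requests

def login_and_trace(protected_url, credentials):
    """Fetch a protected page with a cookie-keeping Session and print
    every redirect requests followed, for comparison with a Fiddler trace."""
    session = requests.Session()
    response = session.get(protected_url)
    for hop in response.history:          # each automatically-followed hop
        print(hop.status_code, hop.url, '->', hop.headers.get('Location'))
    # The chain usually ends on the IdP's login form; POST the credentials
    # there. JavaScript-driven redirects will NOT appear in history and
    # must be replayed manually (see Solution 1).
    response = session.post(response.url, data=credentials)
    return session, response
```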

Solution 10:[10]

If all else fails, I'd suggest using Selenium's webdriver in 'headful' mode (i.e., a browser window will open, allowing you to input the username, password, and any other necessary login info). This gives you easy access to the target website even if your form is more complex than the standard 'username' and 'password' duo and you're unsure how to fill in the br.form sections mentioned in the other answers.

from selenium import webdriver
import time

DRIVER_PATH = r'C:/INSERT_YOUR_PATH_HERE/chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://moodle.tau.ac.il/login/index.php') # This is the login screen

Once you do so, you can create a loop which checks whether you've reached your destination URL - if so, you're in! This snippet of code worked for me; my goal was to access my university's coursework website, Moodle, and download all of the PDFs automatically.

targetUrl = False
timeElapsed = 0

def downloadAllPDFs():         # Or any other function you'd like, the point is that 
    print("Access Granted!")   # you now have access to the HTML. 

while not targetUrl and timeElapsed < 60:
    time.sleep(1)
    timeElapsed += 1
    if driver.current_url == r"https://moodle.tau.ac.il/my/": # The site you're trying to login to.
        downloadAllPDFs()
        targetUrl = True 

Solution 11:[11]

With the help of @Gian-Segato's answer, I was able to write the function below to log in with my university's Shibboleth.

It works by using a requests.Session to keep track of the login cookies, and BeautifulSoup to access form elements etc. No Selenium needed.

The usage is as follows:

import requests

ILIAS_URL = 'https://ovidius.uni-tuebingen.de/ilias3/ilias.php'


def main():
    with requests.Session() as session:
        shibboleth_auth(session, ILIAS_URL, {
            'j_username': 'USERNAME',
            'j_password': 'PASSWORD',
            '_shib_idp_revokeConsent': '1',
        })

        # We can now access shibbolized resources:
        session.get(ILIAS_URL)

and this is the actual authentication code:

from bs4 import BeautifulSoup
from urllib.parse import urljoin


def shibboleth_auth(session, url, credentials):
    print("Shibboleth Auth…")
    print("→ request target resource")
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if link := soup.find('a', id='shib_login_link'):
        print("→ landing page")
        response = session.get(urljoin(response.url, link['href']))
        soup = BeautifulSoup(response.content, 'html.parser')

    if soup.find('input', id='_shib_idp_revokeConsent'):
        print("→ login credentials")
        form = soup.find('form')
        data = [
            (name, value)
            for name, value in get_form_data(form)
            if name not in credentials
        ]
        data.extend(credentials.items())
        data.append(('_eventId_proceed', ''))
        response = session.post(
            urljoin(response.url, form['action']), data=dict(data))
        soup = BeautifulSoup(response.content, 'html.parser')

    if soup.find('input', attrs={'name': '_shib_idp_consentIds'}):
        print("→ grant permissions")
        form = soup.find('form')
        data = get_form_data(form)
        response = session.post(
            urljoin(response.url, form['action']), data=data)
        soup = BeautifulSoup(response.content, 'html.parser')

    if soup.find('input', attrs={'name': 'SAMLResponse'}):
        print("→ forward login token")
        form = soup.find('form')
        data = get_form_data(form)
        response = session.post(
            urljoin(response.url, form['action']), data=data)

    print("→ done")
    return response


def get_form_data(form):
    return [
        (elem['name'], elem['value'])
        for elem in form.find_all('input', attrs={
            'name': True,
            'value': True,
        })
        if elem.get('type') != 'submit' or elem['value'].lower() != 'reject'
    ]

Individual steps are guarded by if blocks because some or all of the steps can be skipped depending on prior login state.

It should be noted that form fields, element IDs, and other specifics may differ between sites, and there are probably better-suited indicators for checking which page was loaded.

If a step doesn't work, I recommend saving the response.content to a local .html file and looking at that file to adapt the login process.